Data science is an incredibly important field that is gaining popularity now that technology has made it cheaper and easier to store gigabytes, terabytes, or even petabytes of data. At its core, data science isn’t necessarily a technical field, but more of an academic one. Wikipedia says, “Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.” Nate Silver once stated, “I think data-scientist is a sexed up term for a statistician.”
Nowadays (2018, at the time of this post), data scientists are often much more than statisticians who hang around the office whiteboards. The tools to ingest and process data have evolved greatly in the last decade, forcing the people who use them to evolve as well. Depending on the team and the goal, a data scientist may be working on spreadsheet formulas or programming directly in the AWS cloud.
In fact, the goals of a data science team can vary widely from company to company. For example, data scientists can…
They can do these things by…
Because of this spread of specialties, it can be difficult to find the right data scientist for your team. It can be even harder to build teams around the data scientists that you already have.
You’ll want to start with the output: what are you trying to optimize or predict? Next, decide on the timeline. Are you going to build a system to process incoming data in real time, or are you just trying to make some one-time decisions?
Let’s talk about Airbnb’s data science team, which has been trying to improve the company’s dynamic pricing by poring over all of its booking data to predict surges in booking for specific dates and locations. Their goals involved not only aggregating data, but also creating mathematical formulae to feed into machine learning models. Those models, once trained, needed to be integrated into the Airbnb product itself. Clearly, there is more than just statistics at work. This team can only function with a combination of statistics, infrastructure management, machine learning, and application development.