What Data Science Isn’t

Is it possible to hammer in a nail with a jackhammer? Sure, but obviously it would be better to have a hammer. A few months ago we paid a visit to a company that was very excited to tell us they were doing data science. As they explained the project, it became painfully obvious that they had spent a copious amount of time, at a premium cost, building basic data integration and conformity logic to construct what was essentially a data warehouse. In other words, they used data science tooling for something their in-house experts and tools were already capable of doing. However, because the tools they purchased were for data science, they could check the data science box on their to-do list. The complaint from the client was, "We're paying for a lot of compute resources, and we're looking for optimization." We're starting to see this in a lot of places. It's a case of buzzwords leading the investment while ground-level needs drive the implementation. No organization wants to say it's not doing data science, but at the same time most organizations haven't really dealt with how data science will fit into their broader data architecture. So in this video I want to lay out where data science tooling lives and where the grey lines of demarcation might be.

Imagine you're coming up to a stop light. How much data is available when you stop your car? There's the stop light, of course, but there's data everywhere around it, enough to fill a massive data set: the eye color of the people crossing the street, how many steps they take, the length of their stride, the brand of the cars around you, their color, their trim, the shoes the pedestrians are wearing, how many pebbles are in the tarmac, the luminosity of the traffic lights, and on and on. However, the data that matters for the task of driving your car is much more limited: the light, and the people and cars crossing. That's it. The rest would just be noise. Limiting that noise is the function of the data warehouse, and it's the reason the warehouse can feed such large audiences. However, because it deliberately narrows data ingestion to known information assets, it can at times limit the scope available to data scientists. Data scientists are, in effect, conducting experiments on the data to uncover the unseen. To do that, data science teams often need access to a broader spectrum of data inputs, which the organization might not be focused on delivering in its data warehouse. Thus we often see that data science teams prefer raw data lake deployments, which avoid predefined dimensions and measures. This doesn't, however, preclude their use of the data warehouse: much of a data scientist's time ends up being spent on data munging, and that time can be saved by standing on the shoulders of giants (or in this case, the data warehouse deployment).
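To make that trade-off concrete, here's a minimal sketch in Python, using simple in-memory stand-ins for a data lake and a data warehouse; every field name here is hypothetical.

```python
# Raw, schema-on-read "lake" records: everything observed is kept,
# signal and noise alike. Field names are hypothetical.
raw_lake_events = [
    {"light": "red", "pedestrians": 3, "car_brand": "Ford",
     "shoe_type": "sneaker", "pebble_count": 10482},
    {"light": "green", "pedestrians": 0, "car_brand": "Kia",
     "shoe_type": "boot", "pebble_count": 9911},
]

# The warehouse conforms the data to the attributes the business asked
# for -- the "signal" -- and drops the rest as noise.
WAREHOUSE_COLUMNS = ("light", "pedestrians")
warehouse_rows = [
    {k: event[k] for k in WAREHOUSE_COLUMNS} for event in raw_lake_events
]

print(warehouse_rows)       # predefined dimensions only
print(raw_lake_events[0])   # the full raw record, noise and all
```

The warehouse rows are clean and focused, but a data scientist testing a new hypothesis (say, whether car brand correlates with stopping behavior) would need the raw events, because the conformed rows no longer carry that attribute.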

One way to locate that grey line of demarcation is to ask: am I building something with mostly deterministic data attributes, or something that will be probabilistically driven? The reason for asking is that the tools and skills for the two are drastically different. If your implementation is deterministic, then you have likely already invested in the tooling and skills to do your deployment. Now, this isn't a black-and-white argument; sometimes you do have probabilistic requirements mixed into a deterministic integration process, and this is where the balance needs to be understood. If the vast majority of your data functions are joins, aggregations, conditional splits, lookups, merges, sorts, and the like, then you likely already have the tooling, and probably the expertise, you need to get started. However, if you're working with k-means clusters, factorials, complex rankings, logarithmic expressions, or anything machine-learning oriented, then you definitely need to be living in a data-science-oriented landscape.
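As a rough illustration of that divide, here's a minimal sketch in Python, assuming pandas and scikit-learn are available; the tables and column names are hypothetical. The first half is deterministic integration work (a join and an aggregation), and the second half is probabilistic (a k-means segmentation).

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical source tables.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [120.0, 80.0, 300.0, 45.0, 60.0, 55.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# Deterministic: a lookup/join plus an aggregation. Given the same
# inputs, the output is always the same -- classic warehouse/ETL work.
summary = (
    orders.merge(customers, on="customer_id")
          .groupby(["customer_id", "region"], as_index=False)
          .agg(total=("amount", "sum"), order_count=("amount", "count"))
)

# Probabilistic: k-means clustering. The segments depend on random
# initialization and a distance-based model of the data -- data
# science territory, with different tooling and skills.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
summary["segment"] = model.fit_predict(summary[["total", "order_count"]])
print(summary)
```

If most of your pipeline looks like the first half, your existing integration stack probably covers it; if it looks like the second half, you're in data science territory.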

Intricity has a solution architecture that encapsulates both the deterministic and the probabilistic requirements organizations deal with, without resorting to blanket probabilistic architectures. We've published a video on this topic titled "Data Lake vs. Data Warehouse." Additionally, I've written a whitepaper titled "What Data Science Isn't." If you're making decisions about your data landscape, I recommend you reach out to Intricity to talk with a specialist.

Whitepaper Link: https://view.attach.io/H1IoqhYcS

YouTube Video Link: https://youtu.be/IBLbcAnIaTc

Talk with a Specialist: https://www.intricity.com/intricity101/