What is a Data Lake?

One of the problems that we encounter in managing data is that it lives literally all over the place. This is especially true in large corporations. In fact, we’ve seen companies that have thousands of disparate applications all gathering their own data. You can imagine how frustrating it is to navigate the corporate politics just to report on organizational activity when the data lives in so many different places. Now if you’ve seen my video titled “Real Data Warehouse?”, you’ll remember I talk about the data warehouse helping you solve two problems: a data locality problem and a query logic problem. To fix the locality problem we use a methodology under the covers called a Staging Area, or an Operational Data Store. Either one gives the raw data a place to land and gives consumers a single place to draw from. Usually an Operational Data Store has an expanded mission of serving up raw data to more consumers than just the Data Warehouse. Now let’s box that concept up for a moment, and we’ll unpack it later.

If you’ve seen my videos about Hadoop, you’ll recall that it will accept and save any data. This is because it grew out of an open source web search project, and its design was modeled on the file system and MapReduce papers Google published for indexing the entire world wide web. If you’re going to do something like that, you can’t be picky about what structure or format the data is in. This flexibility made Hadoop an interesting candidate for dealing with the data locality problem inside of corporations. After all, companies are generating a lot more than just structured data. Companies have web logs, PDFs, machine data, and all sorts of other sources and formats.

This is where the concept of a Data Lake came from. Rather than solving the locality problem with an Operational Data Store, which uses a structured relational database, corporations can use Hadoop, which provides an open door for storing the organization's data no matter what it is. That way the organization can mix unstructured, semi-structured, and structured data in a single place.

Now, it’s important to remember that there is a second problem to solve when it comes to consuming data. This is the query logic problem that I mentioned earlier. Just because the data lives in one place doesn’t mean the data is ready to make decisions on. There is still a significant integration and correlation effort required to build the logic that correctly interprets the data. This is where Hadoop makes us pay the piper. Because Hadoop doesn’t require any logical data structuring when we write the data, we have to deal with it when we read the data. On top of that, Hadoop is designed to take the queries to where the data lives, so under the covers these queries are quite a bit more complicated. So, while you can query Hadoop using a SQL interpretation layer, it’s more common to see a persistent logical data store, in either a relational or NoSQL database, which draws from the Data Lake. Sound familiar? This is where we unpack our discussion about the Operational Data Store. In essence, a Data Lake is an Operational Data Store with a much greater mission: storing all the categories of corporate data for a wide variety of data distribution use cases.
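To make that write-versus-read trade-off a little more concrete, here is a minimal sketch in Python. It is only an illustration and not Hadoop code: the in-memory dictionary standing in for the lake, the function names land and read_orders, the file paths, and the field names are all made up for this example. What it tries to show is that landing data takes no structural effort at all, while every source needs its own interpretation logic before the data can answer a question.

```python
import csv
import io
import json

# Hypothetical stand-in for the lake: raw payloads stored as-is, keyed by path.
lake = {}

def land(path, raw_bytes):
    """Write side: accept any bytes without checking structure or format."""
    lake[path] = raw_bytes

def read_orders(path):
    """Read side: structure is imposed here, with separate logic per source."""
    text = lake[path].decode("utf-8")
    if path.endswith(".json"):
        # One JSON object per line; this source calls its fields id and total.
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
        return [{"order_id": int(r["id"]), "amount": float(r["total"])} for r in rows]
    if path.endswith(".csv"):
        # This source calls the same concepts order_no and amt.
        rows = csv.DictReader(io.StringIO(text))
        return [{"order_id": int(r["order_no"]), "amount": float(r["amt"])} for r in rows]
    raise ValueError(f"no read logic defined for {path}")

# Landing is trivial, whatever the format.
land("web/orders.json", b'{"id": 1, "total": "19.99"}\n{"id": 2, "total": "5.00"}\n')
land("erp/orders.csv", b"order_no,amt\n3,42.50\n")
land("scans/invoice.pdf", b"%PDF-1.4 ...")  # stored fine, but nothing can query it yet

# Consuming requires the integration logic written above.
orders = read_orders("web/orders.json") + read_orders("erp/orders.csv")
total = sum(o["amount"] for o in orders)
```

Notice that the PDF lands just as easily as the other two files, but asking read_orders for it raises an error until someone writes the logic to interpret it. That per-source effort is the query logic problem, and it is why a persistent, already-structured store that draws from the lake is so common.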

In dealing with the cloud of buzzwords, it’s important to remember that there’s no cheating the locality and query logic problems. There will always be a need to bring data together and to query that data. I recommend you reach out to Intricity and talk with a specialist. We can help you cut through the vendor hype to determine the value justification for your landscape.