I’m convinced that out of every IT discipline, the world of data has the most buzzwords. I’ve always wondered why that is. I’m still not sure I know the answer, but I believe it’s because true innovations in data are so difficult to come by. So when something even remotely interesting emerges, it almost always needs a new word to try to get money to move. The latest buzzword being evangelized is the Data Lakehouse. Cue the holographic interface…
So what is a Data Lakehouse? In a sentence, it’s a single data backbone for both data lake and data warehousing logic and activities. Basically, it’s the architecture we covered in a series we called The Next Generation Architecture, which you can find here.
The ability to centralize all the organization's data is the "remotely interesting" part. Legacy Hadoop aficionados would say centralization capacity has been around forever, but the truth is that was never really the case. The limitations of the hardware and software of the past made the prospect of centralization so slow and challenging to engineer that it was essentially impossible. But now, we are no longer dealing with endless meetings to solve the engineering complexities around scale. Stepping back, consider how much of our day was consumed by scale in the past: planning hardware, scaling for user communities, scaling for competing queries, scaling for data loads, query performance, data loading speeds, onboarding more data, and on and on. Many of those conversations that required multiple meetings and executive sponsorship are now just a Slack message and a few menu options. Today, if we're dealing with issues related to scale, it's truly at the extremes. So it's only recently that the ability to centralize has even been a thing, because it's only recently that the capacity to scale has been a thing.
Now don't think that just because a new buzzword has been coined, the dream of complete centralization is easy. That dream is still a challenge today, as the sheer complexity of conforming that much data and marrying people to information assets remains the hurdle. However, at least the solutions to these problems ARE attainable, and they're attainable in a single data platform.
Here's the meat and potatoes that decision-makers need to understand: the logic for how data comes together must live somewhere, and it must take some form. The decision of where it lives, when it is executed, and what form it takes is ultimately what you pay solution architects the big bucks for (we do have a video on this topic, which you can find here). But today that logic can live in a single location, and that's what's different. Ultimately, what you will be setting up in your organization is an automated process for curating data from low structure and conformity to high structure and conformity. This is done so that the data can be shared with a growing number of users and roles as it moves through each stage.
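To make that graduated pipeline a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the three stage names (raw, conformed, curated), the role names, and the sample field names are hypothetical and do not come from any particular Lakehouse product. It only shows the idea that each stage adds structure and opens the data to a wider audience.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

Row = dict  # one record, represented as a plain dictionary


@dataclass(frozen=True)
class Stage:
    name: str                                    # hypothetical stage name
    transform: Callable[[list[Row]], list[Row]]  # logic applied at this stage
    allowed_roles: frozenset[str]                # who may read this stage


def conform(rows: list[Row]) -> list[Row]:
    """Standardize key names and types; drop rows without a customer id."""
    out = []
    for r in rows:
        cid = r.get("cust_id", r.get("CustomerID"))
        if cid is None:
            continue
        out.append({"customer_id": int(cid), "amount": float(r.get("amt", 0))})
    return out


def curate(rows: list[Row]) -> list[Row]:
    """Aggregate conformed rows into one total per customer."""
    totals: dict[int, float] = {}
    for r in rows:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["amount"]
    return [{"customer_id": k, "total_amount": v} for k, v in sorted(totals.items())]


# Structure increases down the list, and so does the audience.
PIPELINE = [
    Stage("raw", lambda rows: list(rows), frozenset({"data_engineer"})),
    Stage("conformed", conform, frozenset({"data_engineer", "analyst"})),
    Stage("curated", curate,
          frozenset({"data_engineer", "analyst", "business_user"})),
]


def run_pipeline(rows: list[Row]) -> dict[str, list[Row]]:
    """Run every stage in order and keep each stage's output."""
    outputs, current = {}, rows
    for stage in PIPELINE:
        current = stage.transform(current)
        outputs[stage.name] = current
    return outputs


def read(outputs: dict[str, list[Row]], stage_name: str, role: str) -> list[Row]:
    """Return a stage's data only if the role is allowed to see it."""
    stage = next(s for s in PIPELINE if s.name == stage_name)
    if role not in stage.allowed_roles:
        raise PermissionError(f"{role} may not read the {stage_name} stage")
    return outputs[stage_name]
```

In this sketch a business user can read the curated totals but gets a PermissionError on the raw stage, which is the "graduating" access described above. On a real platform the stages would be tables or views and the roles would be enforced by the platform's own security model rather than by application code.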
“There are no solutions. There are only trade-offs.”
― Thomas Sowell
Having a single platform to handle your graduated data-to-information pipeline has big advantages. But can you think of what you might be trading off? The trade-off might be totally worth the centralization (in fact, it probably is), but recognize that something will be traded. The more centralization you do, the more tightly coupled your solution stack will become, so the opportunity to independently upgrade parts of your solution stack down the road diminishes. Having a loosely coupled architecture has often been the hallmark of good data management for this reason (we have a video that speaks to loose coupling here). If all the aspects of a solution architecture live under a single umbrella, organizations begin to tightly couple themselves to a single vendor, which gives the vendor a lot of power to monetize your lock-in through higher pricing, slowed innovation, and degraded support. This is just a future risk, meaning there's no guarantee something like this WILL happen, but it is familiar behavior as vendors reach maturity. Loose coupling made the upgrade process less painful, as the organization didn't have all its eggs in one basket. However, the trade-off to loose coupling is persistent friction between solutions, so as Thomas Sowell said, "There are only trade-offs."
Here's the good news: converting code from one solution to another has never been easier. Over the last 3 years, solutions have popped up for converting code between platforms. Intricity uses BladeBridge for this purpose today, which makes the trade-off of centralization a less painful risk. However, this is where discipline in your code really makes a difference. The more standardized your development efforts are, the cheaper a future code migration will be. This is because code conversion tooling is built to take advantage of patterns in your code. So the more pattern-oriented your code is, the more agile any future code conversion will be. This is where Intricity recommends leveraging code generation tooling.
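As a rough illustration of what "pattern-oriented" means, here is a small Python sketch of code generation from a single template. It is not how BladeBridge or any specific code generation product works; the template, the generate_merge function, and the table and column names are all hypothetical. The point is only that when every load statement comes from one pattern plus a bit of metadata, the resulting code is uniform, and uniform code is what conversion tooling can exploit.

```python
from string import Template

# One shared pattern for every load. The SQL is a generic MERGE statement
# written for illustration, not tuned to any particular platform's dialect.
MERGE_PATTERN = Template(
    "MERGE INTO $target AS t\n"
    "USING $source AS s\n"
    "  ON t.$key = s.$key\n"
    "WHEN MATCHED THEN UPDATE SET $assignments\n"
    "WHEN NOT MATCHED THEN INSERT ($columns) VALUES ($values);"
)


def generate_merge(spec: dict) -> str:
    """Render one MERGE statement from a metadata spec (hypothetical shape)."""
    all_columns = [spec["key"], *spec["columns"]]
    return MERGE_PATTERN.substitute(
        target=spec["target"],
        source=spec["source"],
        key=spec["key"],
        assignments=", ".join(f"t.{c} = s.{c}" for c in spec["columns"]),
        columns=", ".join(all_columns),
        values=", ".join(f"s.{c}" for c in all_columns),
    )


# Hypothetical metadata: each table is described, not hand-coded.
SPECS = [
    {"target": "dw.customer", "source": "stg.customer",
     "key": "customer_id", "columns": ["name", "email"]},
    {"target": "dw.product", "source": "stg.product",
     "key": "product_id", "columns": ["title", "price"]},
]

if __name__ == "__main__":
    for spec in SPECS:
        print(generate_merge(spec), end="\n\n")
```

If the target platform ever changes, the hope is that only the pattern needs to be rewritten and the statements regenerated from the same metadata, rather than hunting down hundreds of hand-written variations.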
So the next time somebody brags about their Data Lakehouse with a twinkle in their eye, maybe you can suggest that you're implementing a Data Lakeoperationsciencesharehouse and put them in their place...