Late '90s

Data management strategies have historically followed a graduated curation process that begins with raw data ingestion and then progresses through stages of curation and aggregation for analytics. This process, established in the late 1990s, includes three stages (a toy sketch follows the list):

  1. Data Staging: Data is transferred from source systems to a raw landing zone.
  2. Data Operationalization: Data is cleaned and reconciled into a pristine representation of the source, suitable for querying.
  3. Aggregation for Analytics: Data is further processed to support aggregated analytics.
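
These three stages are easy to picture as plain SQL run against a relational store. The sketch below uses Python's built-in sqlite3 module purely as an illustration; the table names, columns, and filter rules are hypothetical stand-ins for a staging area, an ODS, and an analytics-ready aggregate.

```python
# A toy walk through the three-stage graduation using Python's built-in sqlite3.
# All table and column names here are hypothetical illustrations.
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Data Staging: land the raw extract exactly as it arrived (duplicates and all).
conn.execute("CREATE TABLE stg_orders (order_id, customer_id, amount, status)")
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, 25.0, "COMPLETE"),
        (2, 11, 40.0, "PENDING"),
        (2, 11, 40.0, "PENDING"),  # replication feeds often re-send rows
    ],
)

# 2. Data Operationalization: a de-duplicated, query-ready copy of complete records (the ODS).
conn.execute("""
    CREATE TABLE ods_orders AS
    SELECT DISTINCT order_id, customer_id, amount
    FROM stg_orders
    WHERE status = 'COMPLETE'
""")

# 3. Aggregation for Analytics: summarized facts ready for reporting.
conn.execute("""
    CREATE TABLE agg_customer_sales AS
    SELECT customer_id, SUM(amount) AS total_sales
    FROM ods_orders
    GROUP BY customer_id
""")

print(conn.execute("SELECT * FROM agg_customer_sales").fetchall())  # [(10, 25.0)]
```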

In the late '90s and early 2000s, these layers were typically produced as part of an ETL process. Much of the transformation happened in flight, and the data would get persisted in whatever shape fit organizational needs. The target for landing the data was often called a “staging area” or an “Operational Data Store (ODS)”. Sometimes these layers would be separated to serve different needs around the data. For example, the more real-time the data became, the more often it needed to be staged before landing in the ODS to keep the ODS from housing incomplete transactions. These early architectures emerged out of necessity and were formalized in books by Bill Inmon and, later, Ralph Kimball.

Much as AI is all the buzz today, the buzz tech of the early 2000s was the Data Warehouse. Like AI today, everybody said they were doing it, but few were actually aware of what they were doing with it. The best practice of the graduated architecture that leads to a Data Warehouse is alive and well today. Folks who speak of some ethereal “next best thing” have returned over and over again to the Data Warehouse. Back when the Data Warehouse took its name, end users saw it as the thing that helped with analytics. To architects it was a bevy of data curation layers that graduated the data into an analytically ready state of facts and dimensions. Since most organizations’ users really only interacted with the Data Warehouse, that simplicity fit its time.

 

Ha-Duped

This graduated Data Warehouse architecture dominated the analytical marketplace. Around 2010, however, a new entrant arrived in the form of Google’s MapReduce. Google at the time could do no wrong; its tech was considered almost magical. This was especially the case for business people who couldn’t understand how searches came back so quickly while their own data took an eternity to query. So when Hadoop took the technology and packaged it as an open source project, the stage was set for a good old-fashioned hype cycle.

Unbeknownst to decision makers, Hadoop and corporate transaction data were a couple made in hell. Hadoop didn’t do anything wrong; its platform was made for large swaths of unstructured files. Transaction data, however, was never meant for a carefree data storage framework. In the hype’s zeal to smash these two things together, large corporations rushed in headfirst and spent millions on the “Big Data” Frankenstein. What emerged on the other end of these deployments was a graduated architecture almost identical to the one proposed by the Data Warehouse that preceded it. Yet even with these graduated architectures, Hadoop was devastatingly slow, making the traditional Data Warehouse look like a hero.

The last-ditch effort for unstructured data architectures was to position them as an intermediary to the data warehouse. This was called the Data Lake. The Data Lake was actually a welcome reprieve, since unstructured data stores finally made sense. The Data Lake could act as a place where all data, regardless of format, could live, and from there it could be curated into more governed layers. Essentially, the Data Lake became the “ODS” of times past, but with far more data format options. However, the customers that truly mixed everything into their data lake and saw that as a key differentiator in their analytics were few and far between.

The most useful innovation from that era was Spark. At the very least, Spark made the Data Lake queryable. Spark provided the in-memory muscle to make the sprawling queries on Hadoop actually work. That innovation was destined for far more scalable platforms in the future.

Snowflake put the final nail in the coffin of Hadoop’s corporate ambitions. Customers that had tried Hadoop moved in droves to the new startup. By the time Snowflake went public, it was the largest software IPO to that point. The truth was that the vast majority of what corporations wanted to measure was already represented in structured data, and Snowflake was built for exactly that.

 

Cloud Architecture

Perhaps unsurprisingly, the best practices for deploying a Snowflake environment followed the exact same architectural pattern. Data arrived from replication systems, so it had to be landed in a staging location to ensure records were made whole. From there, the data was landed into a “Data Lake” in Snowflake, a pristine representation of the source applications’ data. The Data Warehouse, in turn, was designed to source its data from the Data Lake.

Again, this graduated curation architecture emerged as the best practice for deploying data.

Strange Bedfellows

In August of 2018, Snowflake and Databricks announced a partnership; right out of the gate they were strange bedfellows. Snowflake was looking for compute, and Databricks wanted the queries. Sales cycles involving the two ended up with odd architectures showcasing opposing messages, depending on who was presenting. “Who is actually going to be running the queries?” customers would ask, and both would answer, “We will.”

Then, in the spring and summer of 2019, Databricks released Delta Lake. This essentially ended whatever collaboration the two organizations had; it was clear they were competitors at that point.

Delta Lake

Delta Lake added ACID compliance to the Spark query architecture, making it possible to reliably query transactional data and get consistent results. The vision for what this would become was end-to-end unification of all data.
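
At the code level, the change was that a plain Spark write to a Delta table became an atomic, versioned commit. Here is a minimal PySpark sketch, assuming the pyspark and delta-spark packages are installed; the table path and columns are illustrative, not taken from any particular deployment.

```python
# A minimal sketch of Delta Lake's ACID guarantees on Spark.
# Assumes the pyspark and delta-spark packages; paths and columns are illustrative.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-acid-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Each write is an atomic commit: readers see the whole batch or none of it.
orders = spark.createDataFrame(
    [(1, "widget", 3), (2, "gadget", 1)],
    ["order_id", "item", "qty"],
)
orders.write.format("delta").mode("append").save("/tmp/delta/orders")

# Reads are consistent snapshots, and earlier committed versions stay queryable.
current = spark.read.format("delta").load("/tmp/delta/orders")
first_commit = (
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/orders")
)
current.show()
```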

Unsurprisingly, Databricks customers deployed the same graduated curation architecture, with an ACID-compliant Data Warehouse at the end.

Other Cloud Data Systems

Snowflake and Databricks are certainly not alone in adopting the curated data architecture. Every major cloud data provider assumes the same practice of graduating data.

 

Marketecture

The excitement around this cloud architecture birthed a new naming convention that people began to call the “Medallion Architecture”. It was quickly taken up as the next buzzword, and sales reps began asking customers whether they had deployed a Medallion Architecture or were still using their old Data Warehouse architecture.

The truth was, this wasn’t something new; it was marketing catching up to a concept that had been in place for nearly 30 years. In formal terms, the Medallion Architecture represents the following layers (a brief sketch follows the list):

  • Bronze Layer: Focuses on the raw, unprocessed data. Challenges associated with managing this layer include data quality and scalability
  • Silver Layer: Data at this stage has been cleaned and organized to support operational reporting and exploratory analytics
  • Gold Layer: Represents the pinnacle of data refinement, where data is aggregated and optimized for strategic business insights
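
In practice, the layers are just successive tables. The sketch below shows one hedged interpretation in PySpark, reusing the Delta-enabled `spark` session from the earlier sketch; the paths, columns, and cleaning rules are hypothetical examples, not a prescribed implementation.

```python
# A hedged PySpark sketch of graduating data through Bronze, Silver, and Gold tables.
# Reuses the Delta-enabled `spark` session from the earlier sketch; all paths,
# columns, and cleaning rules here are hypothetical.
from pyspark.sql import functions as F

# Bronze: land the raw, unprocessed feed as-is.
raw = spark.read.json("/tmp/landing/orders/*.json")
raw.write.format("delta").mode("append").save("/tmp/lake/bronze/orders")

# Silver: cleaned, de-duplicated, typed records for operational reporting
# and exploratory analytics.
bronze = spark.read.format("delta").load("/tmp/lake/bronze/orders")
silver = (
    bronze.dropDuplicates(["order_id"])
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_ts", F.to_timestamp("order_ts"))
)
silver.write.format("delta").mode("overwrite").save("/tmp/lake/silver/orders")

# Gold: aggregated, business-ready facts for strategic insights.
gold = silver.groupBy("customer_id").agg(
    F.count("order_id").alias("order_count"),
    F.sum("amount").alias("lifetime_value"),
)
gold.write.format("delta").mode("overwrite").save("/tmp/lake/gold/customer_value")
```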

There is no harm, of course, in marketecture, as long as customers understand it isn’t a product feature or some kind of major development. In fact, in this case the new naming convention helps end users understand where they should get their data. However, quite often customers do get duped into thinking they are missing out in some big way, when they already have that architecture in-house.

 

Source

Perhaps it is hubris to expect adherence to the naming conventions of the past. Still, the number of wasted cycles decision makers have gone through over the last 30 years, duped by marketing speak, is remarkable. Our current architectures stand on the shoulders of giants. Knowing the source of those architectures doesn’t hurt our ability to appreciate the new features we have today.

Data Architects need to have a little more confidence in the foundational aspects of their world and do a better job of informing buyers where the actual innovations are. When something new comes out, there is often room for doubt, but data has to follow the laws of physics just like everything else.

 
