
What is a Data Catalog?

Jared Hillam


July 19, 2018


There has always been a bit of a no man's land between IT and the business. IT is supposed to know how to work with data, while the business is supposed to know what the data represents. As a result, we often find that neither side knows enough about the data to use it strategically. Tribal behavior tends to emerge from this, with each side guarding its pocket of expertise. Every company deals with this at some level, even those with established data management implementations. Over the years, however, solutions have emerged that can ease this tribal tendency. These solutions are called Data Catalogs.


You can think of a Data Catalog just like you would a retailer's catalog. But instead of giving you information about products, it provides information about the data elements within an organization: everything from the source of the data to what it is used for and the formulas it participates in. Data Catalogs have been around for a long time. They have always provided a place for people to find where data lives and how it is used. However, generating this information used to be very manual, so organizations would often give up on the task altogether. What has reignited interest in the data catalog is the data lake, along with more advanced automation from the data catalog vendors.
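To make the retailer's catalog analogy a little more concrete, here is a minimal Python sketch of what a single catalog entry might hold. The `CatalogEntry` class, its field names, the `search` helper, and the sample entries are all hypothetical and for illustration only; real catalog products use much richer models.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """One data element as a catalog might describe it (hypothetical fields)."""
    name: str
    source: str        # where the data comes from
    description: str   # what the data represents
    used_in: List[str] = field(default_factory=list)   # reports or processes that use it
    formulas: List[str] = field(default_factory=list)  # calculations it participates in


def search(entries, term):
    """Return entries whose name or description mentions the term (case-insensitive)."""
    needle = term.lower()
    return [
        e for e in entries
        if needle in e.name.lower() or needle in e.description.lower()
    ]


# Made-up sample entries; the sources and formulas are assumptions.
catalog = [
    CatalogEntry(
        name="net_revenue",
        source="billing_db.invoices",
        description="Invoice revenue after discounts and refunds",
        used_in=["Monthly finance report"],
        formulas=["gross_margin = net_revenue - cost_of_goods"],
    ),
    CatalogEntry(
        name="customer_region",
        source="crm.accounts",
        description="Sales region assigned to a customer account",
    ),
]

for entry in search(catalog, "revenue"):
    print(entry.name, "comes from", entry.source)
```

A business user searching for "revenue" gets back the element, its source, and the formulas it feeds, without having to ask IT where the number comes from.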

Let’s talk a little about the Data Lake. Data Lakes often suffer a bit of an identity crisis because they don’t gate-keep the incoming data by forcing a schema. This is because the Data Lake’s mission is to be an open door to all of the organization’s data, so you intentionally want data onboarding to be as simple as possible. But that comes with an obvious downside: nobody really understands everything that’s in the Data Lake. Organizations that did deploy a Data Lake found themselves with a massive heap of data and no context for what that data represented.

Before the Data Lake, most catalog vendors spent a significant amount of time focusing on the connection to each source and the flow of the data through the organization. These tools are called Metadata Management tools. You may recall my video on Metadata Management from a few years ago. Before the Data Lake, Metadata Management put a very heavy emphasis on Data Lineage. This is because most of the organization's data went through an extensive series of hops before ultimately landing in enterprise reporting tools. Knowing those hops was critical, and it still is today. However, if the organization has all its data making one hop directly into the data lake without any changes, then lineage becomes more of a “nice to have” for some organizations. Additionally, Data Warehouse modeling practices started designing lineage tracking right into the Data Warehouse. So the real focus these days leans less towards data lineage and more towards the Data Catalog itself.

One mistake that many Data Catalog vendors have made is assuming that the Catalog would be used by a bunch of data nerds. This is mostly a residue of the Metadata Management legacy many vendors transitioned from. Consumers of data these days are all over the organizational hierarchy. These people want to use the data catalog the same way they use Yelp. To achieve this, Data Catalog vendors pack their solutions with a heavy dose of automation to collect meaningful information about the data elements that get imported into their solutions. To the extent that Data Catalog vendors deliver on this plug-and-play experience, their user communities quickly bridge their organizational tribalism.
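As a rough illustration of the kind of automation described above, the following Python sketch profiles records that landed without a schema and summarizes each column. It is not any vendor's method; the `profile_columns` function and the sample records are hypothetical, and it assumes the values are simple hashable types such as numbers and strings.

```python
def profile_columns(rows):
    """Summarize each column found in a list of dict records.

    Assumes every value is hashable (numbers, strings, None).
    """
    profile = {}
    for row in rows:
        for column, value in row.items():
            stats = profile.setdefault(
                column, {"types": set(), "nulls": 0, "distinct": set()}
            )
            if value is None:
                stats["nulls"] += 1
            else:
                stats["types"].add(type(value).__name__)
                stats["distinct"].add(value)
    # Convert the working sets into a plain, readable summary.
    return {
        column: {
            "types": sorted(stats["types"]),
            "nulls": stats["nulls"],
            "distinct_count": len(stats["distinct"]),
        }
        for column, stats in profile.items()
    }


# Made-up records standing in for data dropped into a lake.
records = [
    {"order_id": 1, "region": "West", "discount": None},
    {"order_id": 2, "region": "East", "discount": 0.1},
    {"order_id": 3, "region": "West", "discount": None},
]

summary = profile_columns(records)
for column, stats in summary.items():
    print(column, stats)
```

Even a simple summary like this (observed types, missing values, number of distinct values) gives a non-technical reader some context about a column before anyone has documented it by hand.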

It’s important to realize, however, that a data catalog doesn’t act as a layer to conform the data, but rather as a place to identify its uses. Conforming the data is something that happens in a data warehouse. In fact, the data catalog is often used in conjunction with a data warehouse to point audiences towards its conformed data objects, because the organization has invested so much in that conformity.

Intricity can help you develop a strategy for how to deploy a data catalog in your organization as part of a mature architecture. If you would like to see what these architecture sessions typically cost, click here, and of course, you can reach out to Intricity to Talk with a Specialist at any time.
