
Reference Data Wrangling



January 25, 2017


Reference Data Management

So what is reference data? Think of it as data that resides in your organization but whose standards and naming conventions you don’t own. For example, imagine that one day you decided to prefix all your postal codes with your store numbers. What would happen? All of a sudden, packages addressed with the new postal codes wouldn’t reach their destinations. That is because you don’t control the standards and naming conventions for postal codes; the Postal Service masters them, and you just reference them.

Reference data is critical for any organization because it is a special subset of master data used for classification across all of the organization’s applications and databases. Reference data includes the lookup table and code table data found in virtually every enterprise application. Simple examples of reference data are postal codes, country codes, currency codes, gender codes, and industry codes. Complex reference data originates from multiple applications, is derived from transactional data, or is supplied by external agencies. Reference data is typically defined with a code and a description and has a set of domain values, that is, a list of allowed values.
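As a rough illustration of the "code, description, and domain values" idea, here is a minimal Python sketch. The `ReferenceDataSet` class and its method names are hypothetical, invented for this example, and the currency codes are a small illustrative subset rather than a complete list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceDataSet:
    """A hypothetical reference data set: a name plus its allowed code values."""
    name: str
    values: dict  # maps code -> description

    def is_valid(self, code: str) -> bool:
        # A code is valid only if it appears in the list of allowed domain values.
        return code in self.values

    def describe(self, code: str) -> str:
        # Raises KeyError for any code outside the domain.
        return self.values[code]


# Illustrative subset of ISO 4217 currency codes.
currency_codes = ReferenceDataSet(
    name="currency_code",
    values={"USD": "US Dollar", "EUR": "Euro", "JPY": "Yen"},
)
```

An application would call `currency_codes.is_valid(...)` before accepting a value, so that only codes from the externally mastered list enter its tables.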

Why does Reference Data need to be Managed?

Today, most enterprises manage their critical reference data in spreadsheets or through manual, ad hoc methods, with no centralized mechanism for managing it across the organization. Reference data variations and inconsistencies can be a major source of data quality issues within the enterprise and can cause business losses through system downtime, incorrect transactions, and incorrect reports. Errors in reference data affect the quality of master data in each domain, which in turn affects quality in all dependent transactional systems. The same reference data or code may also have different values in different applications. For example, a gender code may be ‘M’ for Male and ‘F’ for Female in one application but 1 for Male and 2 for Female in another. Manual or custom RDM often lacks change management, audit controls, and granular security and permissions. Mismatches in reference data undermine the integrity of BI reports and cause application integration failures. Several compliance requirements also call for reference data to be monitored and governed.
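The gender code mismatch described above is commonly handled with a crosswalk that maps each application's local codes to one canonical set. The following Python sketch assumes two placeholder applications, `app_a` and `app_b`, and treats ‘M’/‘F’ as the canonical codes; both choices are assumptions made for illustration only.

```python
# Canonical gender codes (an assumed choice for this sketch).
CANONICAL_GENDER = {"M": "Male", "F": "Female"}

# Per-application crosswalks from local code to canonical code.
# "app_a" and "app_b" are placeholder application names.
CROSSWALK = {
    "app_a": {"M": "M", "F": "F"},
    "app_b": {"1": "M", "2": "F"},
}


def to_canonical(app: str, local_code) -> str:
    """Translate an application's local code to the canonical code."""
    mapping = CROSSWALK[app]
    key = str(local_code)  # tolerate numeric codes such as 1 and 2
    if key not in mapping:
        # Surface unmapped values instead of silently passing them through.
        raise ValueError(f"Unmapped code {local_code!r} from {app}")
    return mapping[key]
```

Raising on unmapped values is a deliberate choice here: a silent pass-through is how inconsistent codes tend to leak into reports.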

Tools to Use

Have you already invested in Informatica’s Master Data Management product to maintain and govern your master data? Beyond using Informatica for master data, did you know you can use it to manage your reference data as well? MDM customers can find Reference Data Management as a marketplace item, which can be downloaded for free. Intricity recently implemented RDM at a large insurance company so that it could take advantage of its existing MDM setup while also maintaining and governing its reference data. Besides Informatica, many other vendors offer reference data management solutions, including Collibra, IBM, Oracle, and Teradata. The main advantage of Informatica’s RDM accelerator is that it runs on their MDM platform, where you can define, manage, share, and monitor reference data just like master data.

Reference Data Management Implementation Steps

  1. The first step in implementing RDM is to identify the owners of the applications that contain reference data and to understand what level of governance is needed. This also helps avoid or minimize local maintenance of reference data.
  2. The next step is to identify the data domains for reference data and validate whether they fit the data model supplied with the RDM Accelerator. Reference data models are different from typical party models; they are more dynamic in nature. Reference data models have two parts: the first part defines the reference data set, and the second part holds the actual code values.
  3. After data model validation, the next task is loading the source data. Source data can be loaded using the data integration platform, or it can be imported using the import functionality of Informatica Data Director (IDD).
  4. The next important step is to build the governance process or workflows, for example, who will initiate a change and who will certify the reference data.
  5. Finally, publish the reference data to consumer applications over an agreed communication channel.
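To make the two-part data model and the initiate/certify workflow from the steps above more concrete, here is a minimal Python sketch. It is not Informatica's data model or API: the `ReferenceStore`, `ChangeRequest`, and `Status` names, the steward names, and the rule that only a data set's steward may certify a change are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    CERTIFIED = "certified"


@dataclass
class ChangeRequest:
    """A proposed new or changed code value (hypothetical structure)."""
    set_name: str
    code: str
    description: str
    initiated_by: str
    status: Status = Status.PROPOSED


@dataclass
class ReferenceStore:
    # Part one of the model: each data set's definition (here, just its steward).
    stewards: dict = field(default_factory=dict)
    # Part two of the model: the actual code values per data set.
    codes: dict = field(default_factory=dict)

    def define_set(self, name: str, steward: str) -> None:
        self.stewards[name] = steward
        self.codes.setdefault(name, {})

    def certify(self, request: ChangeRequest, certifier: str) -> None:
        # Assumed rule: only the data set's steward may certify a change.
        if certifier != self.stewards[request.set_name]:
            raise PermissionError(f"{certifier} cannot certify {request.set_name}")
        self.codes[request.set_name][request.code] = request.description
        request.status = Status.CERTIFIED


store = ReferenceStore()
store.define_set("country_code", steward="alice")
request = ChangeRequest("country_code", "SS", "South Sudan", initiated_by="bob")
store.certify(request, certifier="alice")
```

Keeping the set definition separate from the code values is what lets new data sets be added without changing the schema, which is one reading of why the article calls these models "more dynamic" than party models.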

Post Implementation 

Once you create a “golden copy” of reference data, it is critical to maintain it and accommodate ongoing changes so that all downstream systems can leverage it. Like master data, reference data needs to be seamlessly integrated with the systems that consume it. An extensive service layer should be built using Java code or ESB tools, and there should be a flexible mechanism to export and transform reference data for consumption by subscribing applications.
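The export-and-transform mechanism could take many forms. As a small sketch (in Python for brevity, not the Java or ESB service layer mentioned above), the functions below render one golden copy in two formats that different subscribers might expect. The function names and the sample country codes are hypothetical.

```python
import csv
import io
import json

# Illustrative golden copy of a country code set.
GOLDEN_COUNTRY_CODES = {"US": "United States", "CA": "Canada"}


def export_json(codes: dict) -> str:
    """Render the code set as a JSON array of code/description records."""
    records = [{"code": c, "description": d} for c, d in sorted(codes.items())]
    return json.dumps(records)


def export_csv(codes: dict) -> str:
    """Render the code set as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["code", "description"])
    for code, description in sorted(codes.items()):
        writer.writerow([code, description])
    return buffer.getvalue()
```

Sorting by code keeps the output stable between runs, which makes it easier for subscribing applications to diff one published version against the next.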


So while you might not own the mastering of reference data, it still needs governance. Reference Data Management lets you wrangle the inconsistencies between systems within your organization so that the reference data you use is consistent and matches industry standards.

Intricity Experts Article by Vandana Jain

