About Client

The Client is a leader in transaction data analytics for the Financial Services industry. They house many critical analytics for large-scale fortune organizations to analyze ACH transactions, retail banking customer demographics, consumer behaviors, and a host of other critical analytics. 

Challenge

One of the offerings provided by the Client required the matching of credit card transactions to identify unique vendors like Taco Bell, Amazon, etc. This unique information was highly valuable to many analytics and to their Financial Services clients.

The data sets being acquired on behalf of the Client's Financial Services customers was not only large in its own right (4 billion records) as it spanned multiple Financial Services Institutions, but also had daily streams of 12 million additional records.

Navigating Constraints

The Client had been using Snowflake as their enterprise standard, but this particular need to narrow unique vendor records from credit cards had them looking at other compute solutions as brute force machine learning appeared to be the only legitimate way forward.

Training machine learning models on identity resolution for a specific organization’s data requires a much more efficient method of training. The compute cost of training such a massive data set to do brute force compares would take weeks of extensive compute costs to generate models. This is because a brute force comparison requires string comparisons with the following counts

number of records * (number of records -1)/2

4,000,000,000 * (4,000,000,000-14,)/2 =
7,999,999,998,000,000,000 comparisons

That's nearly 8 quintillion comparisons! Therefore, the problem required out-of-the-box thinking. The cost of brute force generation of a machine learning model from scratch would have required them to potentially change platforms and the compute costs would have been astronomical.

Win 1: 

Intricity architected an Identity Resolution solution that avoided the onboarding of a completely separate compute engine for machine learning. The solution coupled supervised regression model and string comparison algorithms to efficiently generate unique record identities.

Win 2:

To optimize the efficiency of the training, Intricity created a cleansing step that standardized the record formats. In turn, the noise was eliminated from the string comparisons which greatly cut down on compute cycles to map appropriate string algorithms. 

Win 3:

Intricity implemented its Identity Management powered by Snowflake. The solution is a supervised regression modeling system that allowed the client team members to compare presented matches and indicate which records were not matches. The presentation of potential matches and the human "teaching" of what should be excluded trained the regression model. The regression model in turn allocated about 30 different string comparison algorithms (predicates) that produced the best match to the data set.

This allows for the nuance of appropriate string comparison algorithms to be applied. The reduced cost of Entity Resolution was tremendous as all the applied algorithmic predicates were natively written in Snowflake and the ML could assign them appropriately.

Win 4:

Intricity created canopies and groups which enabled the nearly 8 quintillion comparisons to be broken down into more logical sets. These represent segments that are likely to contain matches, like the same country. This breaks down the transaction records into sets that are a lot more computable at scale without brute-forcing the comparisons from scratch.

Win 5:

The customer gained a Unique Vendor data set derived from billions of credit card transactions. The unique vendors can be used for a host of products offered to the Client's Financial Services customers including several productized analytics solutions and cleansing offerings.