The Client is a global not-for-profit organization focused on helping young girls maximize their potential through STEM, life skills, and entrepreneurship. Their program memberships measure in the tens of millions and they are one of the most recognizable brands in the world.
With membership in the many tens of millions, their organization was challenged with managing so many unique identities. Over time, many iterations of the same member would flow in, sometimes with different addresses, typos, married names, etc. The challenge was how to narrow this enormous pool of members into unique member IDs. Obtaining these unique IDs would more accurately portray the membership participation and a large portion of downstream analytics.
Member data spanned entire lifetimes, as members would join as children and continue all the way up as alumni and volunteers in their late 80s. This breadth of membership meant that effective tracking of attributes needed extensive nuance, as to which algorithmic methods get attached to the identification of unique members.
With such massive member rolls, traditional solutions priced themselves with a very high barrier to entry. The organization needed a viable way of managing unique records without costing a small fortune. The organization had 8 primary sources of data which all had different touchpoints with their membership. Each touchpoint contained a version of the member records with varying levels of accuracy. Additionally, some of these systems had their own APIs for acquisition.
Training machine learning models on identity resolution for a specific organization’s data required a much more efficient method of training. This is because the quantities of data, although voluminous, are still not enough to effectively train a neuro network from scratch.
String comparisons for Identity Resolution requires computation of
number of records * (number of records -1)/2
The challenge of doing this many comparisons is evident when you consider how many records an organization might have. So if an organization has 40 million records, we're talking about 800 trillion comparisons.
Intricity worked jointly with the client to append their existing Snowflake deployment with additional sources of data. This allowed all of the data to reside in an effective platform for scalable compute. This also effectively centralized the customer domain so it could be calculated for unique values.
To optimize the efficiency Intricity created a standardized cleansing step. This eliminated noise from the string comparisons which the Identity Management solution would be doing in later steps.
Intricity implemented its Identity Management powered by Snowflake. The solution is a supervised regression fit system which allowed the client team members to compare presented matches and indicate which records were not matches. The presentation of potential matches and the human "teaching" of what should be excluded, trains the Identity Management solution. The solution in turn automatically allocates 30 different string comparison algorithms (predicates) that produced the best match to the data set.
This allows for the nuance of appropriate string comparison algorithms to be applied. What is powerful about this method is that the training can occur on smaller data footprints vs the truly enormous data footprint and compute requirements found in Deep Learning.
Intricity created canopies and groups which enabled the nearly 800 trillion comparisons to be broken down into more logical sets. These represent segments that are likely to contain matches, like the same country. This breaks down the membership records into sets that are a lot more computable at scale without brute-forcing the comparisons from scratch.
The customer gained a Unique Members data set which they can flow all their new systems into for consolidation. The membership records can be used for custom mailers, segmentation, alumni donation tracking, onboarding applications, and many other internal activities related to their critical membership.