Maturity Journey of a Snowflake Customer
Written by Jared Hillam
This whitepaper is the result of copious debate regarding the “typical path” a Snowflake customer takes during their maturity journey. The truth is that there isn’t such a thing as a “typical path” because there are too many priorities that vary between organizations. However, there are some distinct technical wins that are measurable. To simplify why these are “wins,” we need to first explain a problem being solved. So this whitepaper will apply them to the narrative of a maturity path that many Snowflake customers do play out.
Stage 1: New to Cloud
For many, Snowflake is the first time organizations experience the true elasticity of the cloud. That's not to say they’ve never used a SAAS application, but in many ways, the sheer performance of a typical SAAS application (like a CRM) is similar to on-prem. However, in the world of data, the opportunity to show what cloud elasticity feels like is very real. On-prem MPP databases have very physical limitations, with forklifted servers as the brick wall keeping scale at a maximum. So the way most organizations would plan on hardware was an attempt to appease the CFO and CIO. Inevitably the CFO resented the massive spend for hardware that was designed like an “Easter Sunday Cathedral”, only to be mostly empty the rest of the year. On the other hand, the CIO had to make a legitimate attempt to feed the mass of people attempting to get information out of their sea of data. If the organization underestimated, they could leave a massive user community needing information in the dark; if they overestimated it meant the organization just sunk cost into a boondoggle. From an end-user perspective, this is how it played out...
Any sufficiently advanced technology
is indistinguishable from magic.
Arthur C. Clarke, "Profiles of The Future"
For those in the on-prem data world, Snowflake’s promises sound too good to be true, but with exposure, the “magic” turns into reality. Organizations that start their Snowflake deployments are often dabbling, setting things up, and testing ideas. These also come with status reports on what their Data Engineers are finding with the platform. The rate at which this happens is largely a function of a business’s aggressiveness. If they’re “all in,” then you’ll see this occur as part of the next stage of maturity. If the business is passive, then it’s often engineers being curious about new tech which leads to opportunistic wins. In either case, many of these Data Engineers don’t realize that there is a good chance, a few months down the road, the entire corpus of the organization's data will be housed in Snowflake. Like anything else, having perfect hindsight of best practices early on goes a long way.
Things to Consider in the New to Cloud Stage
- You need to design a Data Replication architecture that brings your data to Snowflake from your on-prem and SAAS applications.
- Design a blueprint of the future-state process of onboarding and turning data-into-information.
- Naming conventions, naming conventions, naming conventions.
- Use a Landing Zone for data to curate the chaos of live replications so it lands into the Data Lake as clean application data sets.
- When transitioning the data into the Data Lake, pay close attention to sorting to avoid wasted compute due to too many micro-partition scans.
Stage 2: Migration
By internalizing the benefits of Snowflake, the business makes the decision to migrate their existing data-to-information footprint to it. Some people that have been conditioned to think the cloud is just a rented server space, may also believe that migration is just a physical relocation event… not so. Another simplistic view of migration is to think of it as just a bunch of data being migrated. The data itself is simple to migrate. The complexity is in the logic. Snowflake is a full-fledged cloud data platform as a service, with its own ANSI SQL syntax variant. So the complexity of a Snowflake migration is in converting the logic surrounding the legacy database to Snowflake’s modern SQL syntax. This includes Stored Procedures, ETL/ELT code, and physical tables to name a few. Many organizations have invested more than a decade of work into such logic that refines data into information. The prospect of migration can sound impossible to pull off particularly when organizations don’t have the slightest idea what the legacy black box is even doing. So how can migration be possible for such organizations?
This is where the realm of Code Conversion tools comes into the picture. The legacy database and ETL tooling within an organization stores its code in the form of metadata. In a database that metadata is SQL files and programs that orchestrate that SQL. In an ETL tool that metadata can usually be exported into XML files that are readable. This metadata acts as a consistent source to read in the legacy environment’s logic. With that logic exposed, code conversion tooling can automate the syntactical changes so it can be compliant with Snowflake’s format. Each database has varied levels of specialization in automating a conversion. A healthy view of these code conversion efforts is to see them as cyclical automation events rather than push-button conversions. Even ones claiming to be push-button will require manual intervention of a portion of the remaining code.
The pricing of these code conversion efforts is really based on the size and scope of the legacy code landscape. At Intricity, we leverage a metadata analysis tool to determine the conversion effort. This provides details that are also useful for the conversion itself. Once the conversion is in full swing, Intricity leverages a configurable metadata migration tool called BladeBridge (https://bladebridge.com/converter/) to facilitate the conversion. Code that converts with errors undergoes review, and the Intricity team updates the Converter’s configuration to address the errored code. This cyclical process allows for customization to the patterns and conformity to the desired output.
As code gets compiled free of errors, the testing teams run the code with data. This is often conducted by the customer so test cases can be defined. BladeBridge does provide a tool for reconciliation of the legacy and target databases, flagging what differences were found in the data itself. As the conversion of the code is completed and tested, the new environment is rolled out to production with both the legacy and new running in parallel for a time, then terminating the legacy once ready.
Things to Consider in the Migration Stage
- Do not use best guesses on how many code objects you have to size your conversion effort. Code conversion solutions should have a code Analyzer to capture these counts.
- Don’t take on code conversion as a manual effort.
- Don’t forget the ETL/ELT layer.
- The future-state architecture needs to be planned out before the conversion of the code.
- Have a production promotion plan in place.
- Injecting too many enhancements during the code conversion process can make testing difficult to execute. Convert the code first, test it for parity, then enhance.
- Make a RACI (Responsible, Accountable, Consulted, Informed) chart for the data testing process so clear lines of responsibility are drawn.
Stage 3: Consolidation
While database code migration is a critical portion of the data-to-information factory, the data that has undergone this surgical migration represents a narrow portion of the entire data landscape. There are many other stores of data within an organization that sit within its four walls. In legacy environments, this data usually was consolidated in Operational Data Stores, Data Lakes, and Reporting Tables. Additionally, there are applications that, up to this point, have not been centralized.
The centralization of this kind of data has three fundamental requirements that were difficult to encounter within the same datastore.
- Cheap Storage
Hadoop, AWS S3, Google Cloud Storage, and Azure Blob Storage all can check the boxes of cheap storage that is reliable. The issue we encounter in those platforms is that they aren’t very queryable. This point is debatable, as all of them CAN be queried. However, the question is how long it takes for queries to produce results. Is a datastore really queryable when the response time is measured in several minutes or even hours? This is where Snowflake’s micro-partition architecture overlaid on top of reliable immutable cloud storage made such waves. They were the first platform to credibly deliver 1, 2, AND 3.
Beyond the 3 fundamental requirements, Snowflake introduces the ability to decouple the logical instance of the data from the actual data itself. Meaning Dev, Test, and Prod environments are only separated logically and they don’t represent separate copies of the data. With this capability, Data Engineers gain massive productivity as they no longer are wrestling with promotion or storage allocation issues.
Thus, Snowflake is an excellent location to house the broad semi-structured data landscape of Data Lakes, Operational Data Stores, and Reporting Tables. So with Snowflake’s adoption, consolidation of semi-structured becomes a no-brainer.
Further Consolidation of Analytical Structures
The introduction of all these other sources of data introduces opportunities to further consolidate conformity logic for the Data Warehousing based data. Thus organizations often undergo an enrichment of their existing Data Warehouse logic. This includes enhancements to the Data Warehouse model and new integration routines that populate the conformed data. Because of the way Snowflake separates compute and storage, Snowflake allows for true real-time triggered Data Warehouses. While building such does require quite a bit of programming finesse, the increasing popularity of code generation engines makes this a reality.
Things to Consider in the Consolidation Stage
- Generate a Master reset script for the object names in Snowflake to ensure you can govern your content, the count of objects will become high the more you centralize. The FlowObjects Access Control Master Base offering can help with this.
- Usage monitoring across Snowflake will begin to be an important task as your user community will begin to grow dramatically. This is something that the Flow Objects Auditing and Monitoring solution can help with
- If you are enhancing your Data Warehouse models and logic, build your requirements from the business requirements rather than the data itself. This ensures that the output will actually get used by the business and that the business sees they have ownership in the solution.
Stage 4: Enrichment
With the consolidation of the internal data across the entire enterprise, the prospect for ingestion of 3rd party data sources is an obvious remaining need. Externally, there are 100’s of data brokerage services that provide data that can enrich an organization's data landscape.
Intricity has been encouraging data enrichment for nearly 8 years. But enrichment always came with a painful discussion about the adoption of a managed file transfer processes, coupled with ETL transformations to onboard the 3rd party data. This inevitably created friction as the cost of deploying such enrichment kept this external data from being onboarded.
One of Snowflake’s unique value propositions is their Data Marketplace which allows their customers to securely query data sets they subscribe to. These subscriptions completely eliminate the FTP and ETL process requirements in favor of a direct connection to the data sets being subscribed to. This enables on-demand access to enriched data sets, whether they be strategic or tactical in nature or query-ready in SQL.
This enablement means that the organization can now have a secured window into the outside world and see how they relate to internal trends. There are many use cases, but the most obvious recent use case was the COVID data sets which provided organizations with immediate analytics that drove decisions, without spinning up complex integration cycles.
Things to Consider in the Enrichment Stage
- Many of the data shares are free, but some require subscriptions. Plan a budget for enrichment activities with the business stakeholders.
- Not all enrichment activities are strategic and long-running. Many are just tactical.
- Subscriptions that you build automation routines become sticky, so do your shopping in the Marketplace before settling on a particular data set.
- Review Subscriptions for redundant use cases.
Stage 5: Association
Organizations that have truly reached this level of centralization begin to realize that they no longer have a data lake, or a data warehouse, or even an Operational Data Store. They have an unlimited data estate, one that spans nearly every facet of the organization. The question from the beginning is how does this vast quantity of data get mapped to the users of this data? How does the organization avoid landing users in a giant haystack of data that is just noise to the users? How can the user communities get what they want when they want without creating an administrative nightmare? How do we securely support N number of Analytics tools, Integration tools, Dashboards, Business Intelligence layers, and data dumps? This is where the Association stage comes into play. The solution to this problem requires a multi-faceted approach that maps the data to roles then maps the roles to existing Identity Management solutions.
One of the most common answers to the question, “why does Snowflake do it that way” is... micro-partitions. The magic of being able to have all this computational flexibility, parallel speed, instant cloning, and time-traveling, comes with a few linchpins. This can be manageable but organizations must take these linchpins seriously.
The first thing to understand about Snowflake’s access control is that the structure is inverted to what most people expect it to be. In other words, instead of having a parent-to-child inheritance where children objects inherit from their parent object, Snowflake Role Inheritance has an inheritance-up model, where the parent inherits what the child object can do. This confuses administrators that have a preconceived assumption that rights are combined based on what roles the user is assigned. Snowflake doesn’t do this at all by default. Users that are assigned to roles don’t inherit all the rights of the combined roles they participate in. In other words, you can only use one role at a time. This becomes complex to define if you want to create a role that provides a specific slice of data access. Flippant deployments here can lead to some of the most egregious issues to untangle.
Mapping the Many to Many Relationship
Intricity has come up with a solution to address the “one-role-at-a-time” issue. To start, it requires a consistent naming convention for the objects in Snowflake so they can be programmatically mapped to the right user communities. The assembly of the programmatic script requires some base naming convention definitions. This master script is something Intricity has pre-packaged as a Product + Service offering called the FlowObjects Base Access Control.
Access Control Automation
Once the content has been governed, the focus moves to the user communities. Organizations have already spent their time and effort mastering user credentials and grants through their Identity Management systems. These centralized identity systems are a perfect source to control data access. However, to map to these systems, the data itself needs to be mapped to subject-oriented roles. Intricity calls these Subject Primitive Roles. These would be things like Finance, Sales, Accounting, etc. The Subject Primitive Roles then can be mapped to the Identity Management System. To address the “one-role-at-a-time” issue, Intricity creates a specialized role in Snowflake called a User Synthetic Role which acts as a proxy for the user enabling inheritance from the Subject Primitive Roles. Thus you can have an executive that can see both Sales and Accounting data in the same role. The scripts and automation for this solution are part of the FlowObjects Advanced Access Control solution.
Things to Consider in the Association Stage
- It’s not uncommon to require the Association Stage during early adoption of Snowflake.
- The definition of objects and the mapping of data to users require quite a few discussions with business and technical stakeholders; the organization needs to be prepared to document those relationships.
- Often people get Access Control confused with “security.” They are in the same vein, but Access Control is about what people see when they DO have credentials. Snowflake is VERY secure as turning encryption off isn’t even an option.
- There is also an option to add masking to sensitive data sets like social security numbers in the FlowObjects Access Control Master.
Stage 6: Sharing & Marketability
With the access of all the organization's data mapped to user communities, there are so many opportunities that emerge.
Snowflake natively supports the creation of “Data Shares” which allow the organization to secure a data set for 3rd parties to consume. These could be suppliers, partners, customers, or even internal departments. When a Data Share is set up, the organization can configure how the consumption of the Snowflake credits is conducted. If they are set up a Reader Account, the credit consumption is paid for by the organization. If they use a Full Account, the 3rd party’s account will pay for the consumption of compute credits on their own Snowflake instance.
What this means is that organizations that leverage cost accounting to allocate profit and loss for all their departmental activities related to analytics and integration. Additionally, organizations can facilitate their interactions with 3rd parties without resorting to complex data integration processes.
Just as Snowflake provides a gateway to gain access to data sets from 3rd party services, organizations can themselves become suppliers on the Data Marketplace. This means that the value of the organization's data is no longer confined to facilitating decisions, but actual profit center value. The ideation process for new products and services now has a new channel to consider for increasing both the margins and market presence of the offering. The Data Marketplace has both free and subscription-based data sets that Snowflake hosts on behalf of its customers.
Things to Consider in the Sharing & Marketability Stage
- The Association Stage is usually a prerequisite for the Sharing & Marketability to occur. This is because it requires the organization to define sharable segments that can be segregated into user communities.
- Building a marketable data set often requires them to be broken up into a free dataset and a paid dataset. This enables customers to build interest in a billable relationship.
- When considering data sets to share with the market, consider both the tactical and strategic needs of the market. Some data sets may only be valuable for a short period of time. For example, COVID will have a hot useful life span but will at some point be hardly used. Strategic datasets will have a lasting stickiness.
- If your data is not easily consumable by you, it won’t be easily consumable to 3rd parties. Make sure you spend the time to get the modeling right so the data is easily queryable.
As stated earlier, there is no set path for a Snowflake customer. For example, there are data brokerage companies that adopted Snowflake for the sole purpose of replacing their antiquated FTP process in favor of Snowflake’s SQL interface to shared data.
Whatever your reasons are for adopting Snowflake, it will dictate your path. Getting a plan put onto a blueprint BEFORE you start calling vendors will help ensure you’re not distracted by shiny objects along the way.
Intricity conducts such blueprinting exercises, leveraging its experience as the first Snowflake partner and with hundreds of implementations under its belt. If you would like to talk with us about your specific scenario, scroll down and fill out the form to talk with a specialist.
WHO IS INTRICITY?
Intricity is a specialized selection of over 100 Data Management Professionals, with offices located across the USA and Headquarters in New York City. Our team of experts has implemented in a variety of Industries including, Healthcare, Insurance, Manufacturing, Financial Services, Media, Pharmaceutical, Retail, and others. Intricity is uniquely positioned as a partner to the business that deeply understands what makes the data tick. This joint knowledge and acumen has positioned Intricity to beat out its Big 4 competitors time and time again. Intricity’s area of expertise spans the entirety of the information lifecycle. This means when you’re problem involves data; Intricity will be a trusted partner. Intricity's services cover a broad range of data-to-information engineering needs:
WHAT MAKES INTRICITY DIFFERENT?
While Intricity conducts highly intricate and complex data management projects, Intricity is first a foremost a Business User Centric consulting company. Our internal slogan is to Simplify Complexity. This means that we take complex data management challenges and not only make them understandable to the business but also make them easier to operate. Intricity does this through using tools and techniques that are familiar to business people but adapted for IT content.
Intricity authors a highly sought after Data Management Video Series targeted towards Business Stakeholders at https://www.intricity.com/videos. These videos are used in universities across the world. Here is a small set of universities leveraging Intricity’s videos as a teaching tool:
TALK WITH A SPECIALIST
If you would like to talk with an Intricity Specialist about your particular scenario, don’t hesitate to reach out to us. You can write us an email: firstname.lastname@example.org
(C) 2020 by Intricity, LLC
This content is the sole property of Intricity LLC. No reproduction can be made without Intricity's explicit consent.
Intricity, LLC. 244 Fifth Avenue Suite 2026 New York, NY 10001
Phone: 212.461.1100 • Fax: 212.461.1110 • Website: www.intricity.com