
Physical vs Virtual DW

Admin

June 10, 2019


In many of my videos I discuss data warehousing under the assumption that the data warehouse has physical tables, meaning data gets loaded into a database structure that is optimized for querying. However, there is also a virtual approach to data warehousing that relies on database views. There are pros and cons to doing this, and that’s the topic I want to dive into.

First, let’s simplify what a view is. In short, a view is a query on a query. Imagine you have a complex data set, and you run one query to produce a simplified and cleansed “view” of that complex data. Now imagine you point a bar chart at that simplified view. The query your chart runs against the simplified view triggers the query against the more complex database underneath. Thus, the query on the query.
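To make that concrete, here is a minimal SQL sketch. The table and column names (app_orders, app_customers, is_deleted, and so on) are hypothetical stand-ins for a messy application schema; the point is only that the chart’s simple query gets expanded into the view’s more complex query underneath.

```sql
-- Hypothetical example: a view that simplifies a messy application schema.
CREATE OR REPLACE VIEW clean_orders AS
SELECT
    o.order_id,
    c.customer_name,
    o.order_date,
    o.amount
FROM app_orders o
JOIN app_customers c
  ON c.customer_id = o.customer_id
WHERE o.is_deleted = FALSE          -- filter out application bookkeeping flags
  AND o.status_code <> 'DRAFT';

-- A bar chart pointed at the view issues a simple query...
SELECT order_date, SUM(amount) AS total_amount
FROM clean_orders
GROUP BY order_date;
-- ...which the database expands into the join and filters above:
-- the query on the query.
```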

One of the advantages of this approach is that I can load raw data in real time, and when people query that data, they get the very latest records available. This is because the only physical storage is the complex underlying database. You may ask, who says the underlying database has to be complex? Well, complexity comes from the way databases are designed for applications. When an application is receiving data from user activity, its database is designed to run the application screens as well as capture the user data. There are flags, processes, and other minute points of data that come along with the user’s inputs. So consuming data from applications is very messy, and it often requires an API to cobble it all together in a human-readable way. If I need instant access to that data, then views can often be a good approach.
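As a rough illustration of that instant access, here is a hedged sketch of a view that decodes an application’s terse flags into something readable. The names (app_users, proc_flag, src_cd) are invented for the example; because nothing is physically copied, every query against the view reflects the latest rows written to the application tables.

```sql
-- Hypothetical sketch: exposing raw application rows in a readable form.
CREATE OR REPLACE VIEW live_signups AS
SELECT
    u.user_id,
    u.created_at,
    CASE u.src_cd                  -- decode terse application codes
        WHEN 'W' THEN 'Web'
        WHEN 'M' THEN 'Mobile'
        ELSE 'Other'
    END AS signup_channel,
    u.email
FROM app_users u
WHERE u.proc_flag = 'C';           -- keep only completed records

-- Because nothing is copied, this always reflects the latest writes:
SELECT signup_channel, COUNT(*) AS signups
FROM live_signups
GROUP BY signup_channel;
```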

But creating a system that is based purely on real-time views can be a huge juggling feat, particularly if there are multiple transaction systems. Let me share some examples of things you have to carefully manage when streaming data in real time.

1st, you have to deal with late arriving data. This is data that might be part of a transaction but wasn’t written to the database by the time the query fired. Remember, a view is going to retrieve whatever has been written; if it’s not in the database, it doesn’t exist. So if you’re writing views, you need to plan for late arriving data. Many organizations end up just tossing a placeholder into the late arriving record, but this can cause the view to roll up inaccurately. Intricity has strategies for dealing with these scenarios that won’t cause a mess later.
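Here is a hedged sketch of the placeholder pattern described above, using invented table names (raw_sales, raw_customers). The left join keeps a sale visible even when its customer hasn’t landed yet, but it lumps the sale into an ‘Unknown’ bucket, which is exactly how rollups can look wrong until the late record arrives.

```sql
-- Hypothetical sketch: guarding a view against late arriving dimension rows.
-- An inner join would silently drop the sale; the placeholder keeps it visible
-- but can distort rollups by region until the customer record shows up.
CREATE OR REPLACE VIEW sales_by_region AS
SELECT
    COALESCE(c.region, 'Unknown') AS region,
    SUM(s.amount)                 AS total_amount
FROM raw_sales s
LEFT JOIN raw_customers c
       ON c.customer_id = s.customer_id
GROUP BY COALESCE(c.region, 'Unknown');
```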

2nd, tracking slowly changing dimensions is going to require you to persist the data. I give an example of a slowly changing dimension in my video titled “Salesforce ≠ Data Warehouse”. Since most applications don’t store the historical statuses of their records, you’re going to have to create a database that performs that task, and you’re going to have to come up with a query strategy to bring that history into the end analytics.
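To illustrate the kind of persistence a view-only approach can’t give you, here is a minimal sketch of a type-2-style history table and the query that reaches back into it. The names (account_status_history, raw_events, valid_from, valid_to) are hypothetical.

```sql
-- Hypothetical sketch: a history table that keeps every status an account
-- has passed through, since the source application only keeps the latest one.
CREATE TABLE IF NOT EXISTS account_status_history (
    account_id  VARCHAR,
    status      VARCHAR,
    valid_from  TIMESTAMP,
    valid_to    TIMESTAMP           -- NULL for the currently active row
);

-- The analytics query joins each fact to the status that was in effect at the
-- time of the event, not the status the application shows today.
SELECT f.event_id, h.status
FROM raw_events f
JOIN account_status_history h
  ON h.account_id = f.account_id
 AND f.event_ts >= h.valid_from
 AND (h.valid_to IS NULL OR f.event_ts < h.valid_to);
```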

3rd, like I mentioned before, a real-time strategy becomes highly complex once we add more application databases. This is because we have to use the views to conform data between sources, so we start stacking views on views to arrive at the set of data we want to expose to the BI layer.

That takes us into the 4th problem: the complexity of the SQL. At the end of the chain of views there is a SQL query running against the database, and that query can be so complex that it is nearly unauditable. When tuning needs to be done, organizations have to audit layer by layer to tune view performance.

This also takes us to our 5th challenge: views on views can quickly add up and require extensive computing resources. If you’re running against a complex set of real-time views, that means you will spin up huge amounts of compute for uncached queries. This makes for a very expensive landscape.
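To illustrate the stacking described in the 3rd and 4th points, here is a hedged sketch of two sources being conformed through layered views. Every schema, table, and column name here is invented; the takeaway is that each layer looks harmless, but the SQL the database ultimately executes is the expansion of all of them combined.

```sql
-- Hypothetical sketch: views stacking up as soon as a second source appears.
CREATE OR REPLACE VIEW crm_customers_v AS
SELECT cust_id::VARCHAR AS customer_id, cust_nm AS customer_name, 'CRM' AS source
FROM crm.raw_customers;

CREATE OR REPLACE VIEW erp_customers_v AS
SELECT customer_no::VARCHAR AS customer_id, name AS customer_name, 'ERP' AS source
FROM erp.raw_customers;

CREATE OR REPLACE VIEW conformed_customers_v AS        -- a view on views
SELECT * FROM crm_customers_v
UNION ALL
SELECT * FROM erp_customers_v;

CREATE OR REPLACE VIEW bi_customer_counts_v AS         -- a view on a view on views
SELECT source, COUNT(*) AS customer_count
FROM conformed_customers_v
GROUP BY source;
```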

Now, while there are a lot of landmines, there are some real opportunities for using views as a strategy. For example, Intricity has leveraged views to deal with highly volatile application schemas like active Salesforce.com deployments. Intricity used views to buffer schema changes from Salesforce.com so that its data warehouse deployment wouldn’t break if columns or tables underwent deletes or big changes. Buffering that volatility made reports and analytics run more smoothly downstream. Additionally, Intricity has used views in certain use cases alongside a physical table strategy to produce analytics content. This mixed approach saved Intricity’s client from the headache of having everything in views, but still allowed certain points of data to be instantaneous. In preparing this video I created a small grid that weighed the pros and cons between virtual and physical data warehouses. If you’d like to see the grid, click here. If you have some pros or cons I didn’t mention, I’d love to hear them in the video comments. I’ll be sure to add them to the grid.
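As a hedged sketch of the buffering idea (not Intricity’s actual implementation), a view can pin down the column names downstream consumers depend on. The schema name salesforce is an assumed landing area for replicated data; Id, AccountId, StageName, Amount, and CloseDate are standard Salesforce Opportunity fields. If a field is renamed or dropped upstream, only this one view needs fixing rather than everything built on top of it.

```sql
-- Hypothetical sketch: a stable "buffer" view over a replicated Salesforce table.
CREATE OR REPLACE VIEW sfdc_opportunity_v AS
SELECT
    Id        AS opportunity_id,
    AccountId AS account_id,
    StageName AS stage_name,
    Amount    AS amount,
    CloseDate AS close_date
FROM salesforce.Opportunity;
```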

If you’re looking at building a data warehouse or some kind of analytics solution, I recommend you talk with an Intricity specialist about your plans. We can help you close the gap in requirements without creating a beast you can’t later manage.
