LOADING BIG FILES INTO SNOWFLAKE with FlowObjects Data Loader
WHAT DO YOU DO IF YOU HAVE A 50 TERABYTE FILE AND YOU NEED TO LOAD IT TO SNOWFLAKE?
It sounds simple...until you go to do it.
Snowflake does provide a few data loading options in the form of the web GUI and the SnowSQL client, but neither is designed for bulk data loading. At the other extreme, there are very capable data replication suites, but they are designed to listen to CDC logs and carry price tags to match. So the answer to that question gets a little more complicated, because there is no middle ground.
In this session, the FlowObjects team shares a simple Data Loader that fills this gap. The team discusses loading options that are available using the Data Loader and how it can quickly fill a simple need as well as be embedded into other processes your organization might have.
You will learn:
How to load data into Snowflake
Options provided by the Data Loader
Use cases for embedding the Data Loader into existing processes
Jared Hillam (0:17) Good day, everybody, and thank you for taking the time to join us for our presentation of the FlowObjects Data Loader. My name is Jared Hillam, and I'll be hosting today. I'm also here with two of our engineers, Liren Zhang and Gennady Kleiner. What I'll do is jump into the explanation of the Data Loader and then we'll pass it over to Liren to do a quick hands-on demo.
First, a little bit about FlowObjects. FlowObjects is an organization created by Intricity that is focused on selling utilities for Snowflake. There are four utilities that are out today, and really, we're going to be focusing on the Data Loader utility. So I'm going to click on that, and we'll just jump into it.
So Data Loader is a command-line utility that really is intended for loading data. And to explain the necessity of that: just imagine for a second that you have, say, a 100-terabyte file, and let's just say it's a CSV file that needs to be loaded into Snowflake. How would you actually do that? You don't have any tools in place today. So you're met with a spectrum of extremes. On one side, you've got the Snowflake GUI, which is really not designed for loading anything large; it's really just for small tests. And then you've got SnowSQL, which is really designed for querying. It was never designed to load big files; everything would load sequentially, and those kinds of challenges would ensue. So how would you load this file?
Well, in the market, you have these full-blown CDC tools (Change Data Capture tools). They're designed to do production Change Data Capture: to monitor the logs of the different database types you have in-house and know when to trigger a load of the data. And those are great tools; they're necessary for accounts that need to synchronize data up into Snowflake. But in your case, you just have a big file, and you need to load this file. And there really is nothing that addresses that middle ground. Now, certainly, you could use a full-blown CDC tool to load a large file. But a CDC tool comes with a price tag to match, because it's built for production reading of logs. So how do you meet this middle ground?
Well, that really is what the FlowObjects Data Loader is intended to do. It's a very low-cost, very lightweight utility that splits up files and loads them into Snowflake. So it can take large files, and you can set whatever splitting parameters you want. Snowflake's sweet spot is 100-megabyte files, so it'll take that huge file you've got, break it up into 100-megabyte chunks, and then parallel load those chunks into Snowflake. And so you get the benefit of non-sequential, parallel loading: pushing the data across as many parallel channels as you can open up.
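The split step Jared describes can be sketched in a few lines of Python. This is not the Data Loader's own code, just an illustration of chunking a large CSV into roughly 100-megabyte pieces while repeating the header row in each piece:

```python
import os

def split_csv(path, out_dir, chunk_bytes=100 * 1024 * 1024):
    """Split a large CSV into ~chunk_bytes pieces, repeating the header row."""
    os.makedirs(out_dir, exist_ok=True)
    pieces = []
    with open(path, "rb") as src:
        header = src.readline()          # keep the header for every piece
        part, written, out = 0, 0, None
        for line in src:
            # Start a new piece on the first line, or once the current
            # piece has reached the target size.
            if out is None or written >= chunk_bytes:
                if out:
                    out.close()
                part += 1
                piece = os.path.join(out_dir, f"part_{part:04d}.csv")
                out = open(piece, "wb")
                out.write(header)
                written = 0
                pieces.append(piece)
            out.write(line)
            written += len(line)
        if out:
            out.close()
    return pieces
```

Each piece then becomes an independent upload, which is what makes the parallel loading possible.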
In addition to that, the Data Loader auto-compresses the data before loading it into Snowflake. As you know, Snowflake wants to store data in a compressed file format rather than in full, uncompressed form. So the Data Loader auto-compresses these files. You can choose what kind of compression you want to use, so you've got options there, and then it'll automatically push the files up into Snowflake.
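As a rough illustration of that compression step (assuming gzip, one of the formats Snowflake accepts), here is a minimal sketch; the Data Loader's own implementation may differ:

```python
import gzip
import shutil

def gzip_file(path):
    """Compress a split file to gzip, as a stand-in for an auto-compress step."""
    gz_path = path + ".gz"
    # Stream the file through gzip rather than reading it all into memory.
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```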
Now in terms of platforms, it's available on Linux and Windows. It is Python-based, which makes it very flexible. And, of course, it can read any kind of CSVs you've got in your organization, as well as Excel. So that gives you a sense of how the tool works. Now let's talk a little bit about some scenarios, and then I'm going to pass it over to Liren.
A couple of use cases that are useful: One is just the tactical one that I mentioned at the very beginning of this session, which is "hey, I've got a large file and I need to load it." But that's not the only use case.

A lot of use cases are, "Hey, I have this sequence of events that occurs, and at the end of this sequence of events, there's a CSV file, and I need to be able to load that CSV file after the sequence." Again, because it's Python, you can trigger it to run against the CSV file once it's available. So we often see organizations using this as a mechanism to take a pre-existing process that had some kind of sequence to it and trigger it to run and load Snowflake.

The other one is an interesting scenario where perhaps you have a data integration tool that doesn't support Snowflake, or isn't going to support Snowflake; we are seeing that as well. Vendors are sort of trying to force their clients to use the cloud version of their platform, and the challenge there is that budgets and other constraints keep you from doing that. So in this use case, the scenario is, "Hey, my ETL tool doesn't support moving the data up into the cloud to Snowflake; how could I get this ETL tool to do that?" What you can do is just tell your ETL tool, "Hey, when you're done with all of your processing, just dump everything to a CSV file." Then, once that's done, you can capture it with the Data Loader, and the Data Loader will go ahead and push it up into Snowflake as a triggered event.
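A minimal sketch of that "trigger after a sequence" pattern: the command name and flag spellings below are hypothetical stand-ins, not the Data Loader's documented interface, so check its --help output for the real parameters:

```python
import os
import subprocess
import time

def build_loader_cmd(csv_path, table):
    # NOTE: "dataloader", "--file", and "--sf-table" are illustrative
    # placeholders, not the product's real executable or flag names.
    return ["dataloader", "--file", csv_path, "--sf-table", table]

def load_when_ready(csv_path, table, poll_seconds=30, runner=subprocess.run):
    """Wait for an upstream process to drop its CSV, then trigger the load."""
    while not os.path.exists(csv_path):
        time.sleep(poll_seconds)
    return runner(build_loader_cmd(csv_path, table))
```

The same command list can just as easily be placed at the end of a batch file or scheduler job, which is the embedding scenario Jared describes.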
So you can see it's really intended to be able to snap into a process that you already have, as well as act as a very tactical tool. Often when we demo this tool, inevitably, someone on the administration staff says "oh my gosh, I wish I would have had this months ago when I was trying to load XYZ file!" So it just makes it really, really easy to load those files.
And what I'm going to do is just walk you through where you can find pricing and details. So if you go into Data Loader under flowobjects.com, and you just click on "see pricing", you can see that the pricing is very much within the realm of feasibility: $9,600 a year for the organization. That makes it really easy to inject into a process or use it tactically. Now with that, I'm going to pass it over to Liren.
DEMO: FLOW OBJECTS DATA LOADER IN ACTION!
Liren Zhang (7:27) So this is basically what I'm going to do today: demonstrate loading a CSV file that's roughly 300 meg. I'm going to demonstrate how to use this loader with some parameters, including the mandatory parameters, and talk through that at a high level. And then while we're running the Data Loader, I'll go through more details on some of the additional parameters that you can use.
What you're seeing here right now is basically the Snowflake worksheet screen. We're loading this data into what is called the "New York parking ticket" table (that's just publicly available data). Right now, I believe there are 1.7 million records in there. So what I'm going to do is truncate this, just so you can see the process as things get loaded.
All right, so here's a typical command that you're going to use to load that file. This is the file we're going to load; it's about 360 meg, and this is the command we're going to use. There are some typical parameters: obviously, you need to tell it what file you want to load, and then the connection information. These SF account parameters are all the Snowflake connection information. And then you need to tell the Data Loader what table to load the data into. As you can see, it's the "NYC parking ticket" table, as you just saw in the Snowflake interface.
And then there are some things that have to be defined as well: the database information, the schema where the table is, and what warehouse to use. One of the more interesting things later on here is SF stage type. SF stage type means there are three different types of stage: a user stage, a table stage, and a named internal stage in Snowflake. In this example, we're using the table stage. The split file parameter is there because this file is about 360 meg, and Snowflake typically recommends uploading files of about 100 meg; that's the optimal file size for uploading. The next parameter basically says to use the uploads directory; this tells the loader to upload all the files in that folder in parallel, as much as possible, loading everything at the same time. So this creates the parallelism that Jared was talking about earlier.
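Under the hood, a table-stage load like this boils down to Snowflake's PUT and COPY INTO commands. Here is a hedged sketch of that sequence, written against a generic cursor object (in practice a snowflake-connector-python cursor); the file-format name is an illustrative assumption:

```python
def load_via_table_stage(cursor, table, split_dir,
                         file_format="SF_CSV", parallel=4):
    """Upload split files to a table stage in parallel, then copy them in."""
    # PUT uploads (and gzip-compresses) the split files; PARALLEL controls
    # how many upload threads the client uses. @%table is the table stage.
    cursor.execute(
        f"PUT file://{split_dir}/*.csv @%{table} "
        f"PARALLEL={parallel} AUTO_COMPRESS=TRUE"
    )
    # COPY INTO loads every staged file into the table, reusing a named
    # file format object for delimiter, date format, null handling, etc.
    cursor.execute(
        f"COPY INTO {table} FILE_FORMAT = (FORMAT_NAME = '{file_format}')"
    )
```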
Remove split directory: the first step the FlowObjects Data Loader performs is to split the big file into 100-meg files. As you can imagine, with about 360 meg, you're probably going to see four files. At the end, when we're finished loading, those files will still exist in the split directory, and this parameter basically says to remove all of them.
So let me put this in action, and I'll put a couple of screens up so you can see what's going on. While we're loading this file, I can go through a little bit more detail on some of the other parameters available for the Data Loader. As you can see, this is the folder. It immediately created a folder here; let me double-click on that. As you can see, it's already splitting the first file.
Jared Hillam (13:06) And it's doing the split based on the split size that you've designated, right?
Liren Zhang (13:14) Right. By default, once you have a parameter here like "split file", you don't have to tell it what the size needs to be; by default it tries to split them at 100 meg. This is actually going pretty fast. But in the meantime, I'll pull up the parameters here. If you want to get some help, you can certainly do that with "--help". Give it a sec, and it'll pull up a bunch of parameters.
Gennady Kleyner (13:54) While that's being pulled up: hello, everyone, my name is Gennady. I'm responsible for product development here at FlowObjects. I just wanted to mention that all these parameters are available as either command-line arguments or environment variables. That matters for sensitive values like passwords, and it's certainly a benefit for DevOps. One of the reasons we chose this smaller file (we don't have any preset limits) is for the purposes of this demo, so that it doesn't take long to watch it splitting and uploading files over an internet connection.
Liren Zhang (14:33) Yeah, so let's go through some of these parameters while we're waiting for this to be done. As you can see, these are the Snowflake connection parameters. All of these can be created as environment variables in the system. For Windows, you can set them up as environment variables, and for Linux you can export those variables as well, in the bash profile, for example. The key thing, most of the time, is that what I need to hide is really the password. In Windows, you can go into the environment variable settings, set up the SF password variable, and put the actual password in there. Then, at runtime, as you can see here, you don't actually see the SF password in the command. That essentially makes your command line a little more secure: when you embed this application somewhere, the password is not visible to other people.
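That fallback pattern, preferring an explicit value but otherwise reading the secret from the environment, might look like this in Python; the variable name here is illustrative, so use whatever name the Data Loader documents:

```python
import os

def get_sf_password(cli_value=None, env_var="SF_PASSWORD"):
    """Resolve the Snowflake password without it appearing on the command line.

    NOTE: "SF_PASSWORD" is an assumed variable name for illustration.
    """
    password = cli_value or os.environ.get(env_var)
    if password is None:
        raise RuntimeError(f"Set {env_var} or pass the password explicitly")
    return password
```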
Going down the list, there are some interesting things that I want to talk about. Just now I talked about the stage type; there are three different types of staging: the table stage, the user stage, and the named internal stage. And auto-commit, this is actually pretty interesting. Usually in Snowflake, auto-commit is turned on, so whenever a query gets executed, it's automatically committed when it's done. What I typically do is turn that off to let the application manage the transaction.
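That discipline, turning auto-commit off and letting the application own the transaction, is a common pattern with the Snowflake Python connector (connect with autocommit=False, commit on success, roll back on failure). Sketched here against a generic connection object rather than a real Snowflake session:

```python
def transactional_load(conn, statements):
    """Run all load statements in one transaction: commit only if every
    statement succeeds, otherwise roll everything back."""
    cur = conn.cursor()
    try:
        for sql in statements:
            cur.execute(sql)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```

With this wrapper, either every split file's COPY commits together or none of them do, which matches the all-or-nothing load behavior Liren describes.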
So, in the case where something goes wrong during the load, the loader itself actually does a rollback; all of that activity is wrapped in one transaction. So typically, what I do is set this to false. Encoding, delimiter: a lot of these parameters you could define in a file format. If you remember the file format... let me actually just go to the database itself, and you can actually see that. You can define it here.
Jared Hillam (18:06) This is just a demo database we set up, right?
Liren Zhang (18:08) Yep. So you can see there's an SF CSV file format, and you can see a lot of this information being defined right here. So in the parameter itself, you can reference the file format; there it is, right here: SF CSV. That's basically utilizing the file format object in Snowflake itself. It defines the date format, it defines the delimiter, it defines how to treat nulls. Typically it's a concise way of doing that, but there's nothing stopping you from defining them individually. And on "split file size": if you don't call this out, it's 100 meg by default. But if for some reason you want it to be bigger or smaller, you can define "split file size" and put the actual size you want here.
"No purge" is basically saying, "I don't want the Data Loader to purge the files it puts into the Snowflake stage." By default, the files uploaded to the stage are removed at the end of data loading. But if, during development, you want to see what's going on and repeat the process, you may not want to delete the files in the stage. This is the parameter that controls that.
"Remove split directory": this just means that at the end of the run, if you have this on, the split directory is going to get removed. The log option is actually interesting. Right now, all of this output is going to the console itself, but if you put "--log", it actually saves it to a file.
Gennady Kleyner (20:50) One other thing: we also have the ability to log errors to a Snowflake table. So if you specify the error table parameter, the errors will go into a Snowflake table.
Liren Zhang (21:41) Anyway, you can see the process! First, it split the file into four, and then it did a "put". The put operation uploads those individual split files into the Snowflake table stage (because I defined it to use the table stage). Then it does a "copy into" of the four files, and you can see it has 1.47 million records. So with that, I'll take you back to this screen, and we'll take a look at the data itself...
Jared Hillam (22:34) Moment of truth. There it is!
Liren Zhang (22:37) Yep, you got it, the records loaded. So it's fairly simple: to start, you just set things up with a basic set of parameters, and then you can add or remove based on what you need.
Gennady Kleyner (22:58) As Jared mentioned early on, some of the common use cases we have are as part of a scheduling tool, because it is a CLI, a command-line interface. It's easy to embed inside batch files or scripts. And very commonly, it's used as part of DevOps operations as well. So lots of use cases. And while it is Python-based, it comes as a self-contained binary executable, so you don't have to worry about any dependencies.
SCHEDULE A DEMO AND DISCUSSION TODAY! Jared Hillam (23:33) Yeah, and one of the things that I failed to mention about FlowObjects: all of the FlowObjects utilities are solutions that we have been deploying as part of Snowflake deployments. The Data Loader is something that Intricity has been using for years and years. There are so many of these sorts of tricks that we've come up with over time, and that's really what FlowObjects is: a packaging of utilities that we have created over time to solve certain niche problems. And that's really what Data Loader is.
So, if you're interested in Data Loader, then certainly go to flowobjects.com. Go into Data Loader, fill out the form there, and we can get you connected and work with you to get a demo set up in your environment so you can see what it takes. We can do a time-based key so that you can install it and try it. So if there's interest in doing that, please reach out!
VP of Emerging Technologies & Host
Consultant & Innovative IT Leader
Senior Cloud Architect
FlowObjects Webinar: Loading Big Files into Snowflake