This article will show you how to transfer data from SQL Server to Snowflake with SQLpipe – a free ETL tool. At the end of the article, we will discuss the benefits of moving your analytics workload from a traditional relational database management system to a cloud-based data warehouse.
Table of Contents
Getting ready to migrate data from SQL Server to Snowflake
Before starting the migration process, you need to prepare Microsoft SQL Server and Snowflake. The following steps ensure that SQLpipe can connect to the computer running SQL Server and protect you against undesirable data alterations on the schema and table levels.
Configuring your source SQL Server database
First, let’s configure things to ensure we can extract data from Microsoft SQL Server. You will need SQL Server Management Studio or another SQL Server native client like sqlcmd, plus networking admin privileges for this task.
Allowing network access
First, we need to make sure that we set our firewall rules to allow SQLpipe to connect to the server that is running SQL Server.
You typically provision this access by configuring security rules in a public cloud like AWS, GCP, or Azure. If you are running SQL Server on-prem, contact your hosting operations team to ensure that you can connect to the SQL Server instance.
In any case, you must verify that you can communicate with the port on which SQL Server is running. Usually, this is port 1433.
If you’re unfamiliar with altering networking rules, here is some documentation to help you get started:
- Amazon Web Services: Visit the EC2 page in the AWS management console and select the security groups tab on the left. Visit Amazon’s security groups documentation for more info.
- Microsoft Azure: Search “network security groups” in the search bar at the top of Azure’s management console. Visit Microsoft’s security groups documentation for more info.
- Google Cloud Platform: Go to the firewall rules section of the VPC network page and click on the Firewall rules submenu. Visit Google’s VPC firewall rules documentation for more info.
Granting permissions to load data from your source database server (optional)
Creating a user on your source database server specifically for ETL processes is a security best practice. SQL Server allows you to provide granular control to these systems by executing SQL queries.
We recommend creating a user called “sqlpipe” which has read-only access to the databases, schemas, and tables required to perform the job. If you’re unsure how to do this, check out Microsoft’s official docs on the topic.
Configuring your Snowflake data warehouse (optional)
Since Snowflake makes use of a SaaS model, provisioning network access is unnecessary. However, it is good to use their RBAC system only to grant SQLpipe access to the resources it needs. Specifically, you should allow access to a virtual warehouse for computing power and the analytics database you plan to use for storage. Here are Snowflake’s access control docs.
Gather data source credentials
Now you must gather some credentials for your source SQL Server database.
- SQL Server hostname – The IP address or URL at which you can reach the computer that is running SQL Server.
- SQL Server port – The port SQL Server is running on. Usually 1433.
- SQL Server username – Hopefully, one that you’ve created specifically for ETL purposes.
- SQL Server password – The password for the above username.
- SQL Server database name – The name of the database from which you want to extract data.
Gather credentials needed to upload data
- Snowflake account ID – You can find your account ID by going to the Snowflake login screen and copying / pasting the first part of the URL. It will look something like “xyz12345.us-east-1”.
- Snowflake username – Hopefully, an ETL-specific one.
- Snowflake password – The password for the above username.
- Snowflake database name – The name of the database into which you will bulk load data.
Preparing SQLpipe to copy SQL Server data to Snowflake
Now that you’ve provisioned networking access and gathered credentials, it is time to prepare SQLpipe to copy SQL Server data to Snowflake. We will do this by deploying an executable file, ensuring it can run, and building a configuration file out of the information you gathered.
First, create a directory that will hold SQLpipe and a YAML config file. Enter that directory.
mkdir sqlpipe cd sqlpipe
Now, find the correct version of SQLpipe based on your operating system and processor architecture on our downloads page.
If you’re using Linux or macOS, you’ll have to grant SQLpipe permission to run.
sudo chmod +x sqlpipe
Now you can test if SQLpipe can run with the following command on Linux and macOS:
Linux / macOS: ./sqlpipe -h -------------- Windows: sqlpipe.exe -h
You should see the following output:
macOS users may get a permissions error. If so, follow the directions on our macOS permissions page.
Add credentials to SQLpipe
SQLpipe needs credentials to copy SQL Server to Snowflake. You provide these credentials with a configuration file.
You have already done the hard work of gathering the connection info. Now, you just have to format it into a YAML file. Create a file config.yaml in the same directory as the SQLpipe executable file using the following template.
--- MSSQL: DSTYPE: "mssql" HOST: "mssql_hostname" PORT: mssql_port USERNAME: "mssql_username" PASSWORD: "mssql_username" DBNAME: "mssql_dbname" SNOWFLAKE: DSTYPE: "snowflake" ACCOUNTID: "snowflake_hostname" USERNAME: "snowflake_username" PASSWORD: "snowflake_password" DBNAME: "snowflake_dbname"
Here is what a config file might look like:
At this point, your directory should look like this.
Copy data from Microsoft SQL Server to Snowflake with a data pipeline
In this section, we will copy data from Microsoft SQL Server to Snowflake with a data pipeline. To accomplish this, we will start a SQLpipe server and trigger a transfer request. A SQL command loads data from Microsoft SQL Server to Snowflake during this process using the credentials you gathered in the previous step.
SQLpipe is two things. It is a server that can extract Microsoft SQL Server data and upload that data to a Snowflake data warehouse. It is also a client that can send properly formatted HTTP requests to a SQLpipe server. You don’t have to use the SQLpipe executable to send these requests, but it is the easiest way to get started.
Start SQLpipe server
You can start the server with the following command:
Linux / macOS: ./sqlpipe serve -f config.yaml ------------------------------ Windows: sqlpipe.exe serve -f config.yaml
Assuming the server starts successfully. you will get the following message:
Start copying data by sending a transfer request
Open another shell to send a transfer request to the server. Sending an HTTP request is how we tell the server to export data to our Snowflake data warehouse.
A transfer request will look like this.
./sqlpipe transfer \ --serverHost localhost \ --sourceDsName MSSQL \ --targetDsName SNOWFLAKE\ --targetSchema your_schema_name_in_snowflake \ --targetTable your_target_table_in_snowflake \ --query “query_to_run_on_mssql WITH NOLOCK” \ --writeType overwrite_or_insert
If you are on windows, replace the first line with “sqlpipe.exe transfer”.
Here is what a successfully executed command might look like:
The command is somewhat self-explanatory. The “serverHost” field is the host that the SQLpipe server is running on – in this case, localhost. The “sourceDsName” and “targetDsName” fields are the names of the data systems you want to target from the config file.
The “targetSchema” and “targetTable” fields define where to upload the data files in Snowflake.
Last, “query” is the query you want to run on the source database, and “writetype” is how you will write upon upload.
Checking table data with an ETL test
At this point, so long as you didn’t encounter an error, your data should have transferred from Microsoft SQL Server to Snowflake! Log in to your data warehouse via the web UI to see if your SQL Server data has made it there. If you are running a POC, checking the data using Snowflake’s UI should be fine.
However, if you’re setting up a more robust ETL process, this would be an excellent time to set up an automated data integrity test. For example, if you are overwriting the data, checking that there are an expected number of rows is a good start. Alternatively, if you are inserting data, checking that there are more rows than before can be a good strategy.
What is SQLpipe doing?
Export data to a stage with a SQL command
SQLpipe isn’t doing anything magical. It simply mimics what you would have done to migrate data with native tools. Let’s discuss what happens under the hood when using SQLpipe to transfer from Microsoft SQL Server to Snowflake.
SQLpipe creates a CSV file
First of all, both programs are relational database management systems that use the structured query language. Since the relational data model fits nicely in a CSV file, we can use something similar to this format to upload to a data warehouse.
While Microsoft SQL Server can export a CSV natively with bcp, it doesn’t export data in such a way Snowflake can easily understand. A part of what makes SQLpipe great is its ability to build a CSV file that contains standard SQL data types alongside more exotic varieties like binary.
Snowflake retrieves data
Snowflake needs to retrieve the data and put it in a stage before uploading it into a data warehouse. You have two options at this point – an internal stage or an external stage.
Internal stage vs. external stage
An internal stage is a Snowflake-managed blob storage location, similar to an S3 bucket or GCP storage. SQLpipe uses internal stages because they are straightforward to create and drop with a native Snowflake client. External stages might cost a few pennies less but setting them up is more trouble than they’re worth.
Write data to the target table from staged files
If the SQL Server data is formatted correctly, Snowflake can parse the file using a data warehouse. Somewhat confusingly, Snowflake calls the object that provides computing power a data warehouse. SQLpipe creates and deletes a file format for you automatically. If you’re interested in what SQLpipe is doing under the hood, you can check out Snowflake’s docs on the topic.
Because the data from Microsft SQL Server doesn’t precisely match the data types for a Snowflake data warehouse, SQLpipe converts the data along the way. If you choose the “overwrite” writeType flag, it will drop and automatically create a table with compatible types.
Alternatives to loading data with SQLpipe
SQLpipe is not the only solution for moving structured relational data from Microsoft SQL Server to a Snowflake warehouse. Let’s review the other options for data migration.
Migrating data with an ETL platform
There are many companies offering platforms that promise to integrate data sources without having to write any code. These platforms usually use change data capture to perform data replication from a relational database management system to a data warehouse. A great use case for these platforms is data replication from SQL Server to Snowflake, but they usually do much more than that.
Change data capture (CDC)
All RDBMS systems produce database transaction logs – a written record of all changes that have occurred to your data. CDC entails making an initial copy of your database (or at least a selected subset) from SQL Server to Snowflake and then using transaction logs to “keep up” with changes.
When CDC works well, it keeps your data warehouse up to date with Microsoft SQL Server with a relatively small performance hit. However, you can only use it on tables that have single-column primary keys. Also, support for DDL SQL statements is sometimes imperfect.
Primary keys are needed because if you update a row in Microsoft SQL Server, there needs to be a way to look up that row in the Snowflake database. Also, primary keys help prevent rows from being replicated twice – thereby creating duplicate rows. Without proper constraints, it is easy to violate data consistency rules when running fault-tolerant ETL processes.
For example, picture a table called “orders” that doesn’t have a primary key. How would the Snowflake database know what to do if an order got canceled in SQL Server? Since there is no primary key to identify the correct order, there is no way to know which row to alter or delete.
Another potential problem might arise when an ETL process fails in the middle of a job. Your system cannot identify rows that have already been replicated, thereby introducing the possibility of duplicate rows.
Most dimension and fact tables have single-column primary keys, but not all of them do. You can work around this issue by periodically syncing the tables without primary keys via select * commands or maintaining a “last modified” column. Suppose your CDC and traditional sync techniques get out of rhythm. In that case, you can end up with data consistency issues and be left with incorrect results when running analytics on that data!
Why using SQL improves data management
An underrated feature of Microsoft SQL Server is its ability to specify transaction consistency when running SQL commands. Look closely at the POST command we sent to SQLpipe to export table data – you will notice the command ends with a clause WITH NOLOCK. This clause is a table hint, allowing you to override the query optimizer and obtain tighter control over ACID constraints.
Specifically, the NOLOCK clause instructs a SQL Server database to allow changes to a given table while a query is running. Such control over transactions can be helpful in two situations:
- Suppose you are querying a production SQL Server database. In that case, you may not have the luxury of locking a busy, important table for an hour while moving data around. You can run the query that extracts the SQL Server data WITH NOLOCK and accept the chances of encountering a dirty read.
- On the other hand, you may have set up your database with generally loose transactional consistency constraints. However, there may be a table at the center of an analytics initiative that requires precision. In this instance, you may wish to set a higher level of isolation for that particular transaction.
To achieve a high level of data management, you must take transactions into account when moving SQL Server data to a data warehouse.
Using a data lake
While “big data” initiatives seem to be a new phenomenon, data warehouses are mature technology at this point. However, the shimmering technology firms of the early 2000s found themselves with more data than traditional data warehouses could handle. They needed a place to store and query vast amounts of structured and unstructured data – this is what a data lake allows you to do.
The most successful of these early projects revolved around Hadoop and its related technologies. Hadoop File System, or HDFS, is a solution to link together multiple computers to form a mega-hard drive capable of storing enormous amounts of data. On top of this storage platform, numerous query engines sprung up like Hive, Spark, and Presto, among others, to query the vast (and often unstructured) data living inside of HDFS.
This pattern of maintaining a massive pool of data split between multiple computers and drives, then querying that pool with specialized query engines, became known as a data lake. The critical insight that made this paradigm so successful was separating compute power from storage. Other players, such as Spark, have improved the performance and reliability of these systems by using new and improved file systems such as AWS S3.
For whatever reason, data lake tools never quite reached the level of polish that software engineers have come to expect from data warehouses. As a result, data lake initiatives have developed a reputation for failing to deliver business value.
As someone who has implemented data lake systems in the past, I think data lakes are only helpful for massive organizations with loads of unstructured data. Unless you have a clear and present need to make sense of tons of data like images, sounds, or scientific readings of some sort, there isn’t a need to adopt a tool like Spark. Snowflake has implemented the separation of computing power and storage while retaining the sleek user experience of traditional data warehouses.
There is no one size fits all ETL solution to go from Microsoft SQL Server to Snowflake. Despite the multitude of options, many companies choose to move data with native tools like SQL Server Management Studio, SSIS, and BCP. There is nothing wrong with using these technologies if they get the job done.
We hope you will consider SQLpipe as a potential solution because it is easy to use, fast, and fits well with traditional workflow orchestrators like Airflow. If you want help implementing SQLpipe, or would like guidance when choosing among other ETL options, please contact us.