Ingesting data from AWS s3 to Snowflake using Stitch
Stitch is a very simple yet very powerful cloud service which connects to a diversity of data sources. It is a very user-friendly ingestion system and works in a very simple format. Stitch connects to the data sources, pulls the data and loads the data to a target.
In this blog, I am going to connect to Amazon S3, read a file and load the data to Snowflake but first let’s understand few concepts of Stitch. For Stitch to work you need an Integration and a Destination.
- An Integration is your data source. Each integration will contain details about a source system like AWS S3, MySQl, etc.
- Destination is a place like Snowflake which holds all the data from different sources pulled via integration.
Now that we know the concepts, we are ready to start ingestion using Stitch. For the rest of the blog I am going to create Destination, Integration and do the data load.
Let’s look at the process of Integration using an example of AWS S3. As a first step, login to Stitch and click on Integrations
Click on Add Integration and select your data source from the list. Let’s select Amazon S3 CSV
Configuring the integration is very simple but in case you still need assistance, every data source listed has a documentation available. Give the details like your integration name, S3 bucket Name, AWS account ID, your file name or a file pattern
If you want Stitch to limit its file search to a particular directory, then specify the Directory. I have left it open as I just have one file to try. Next specify the Table name, Primary key and file Delimiter.
You can also configure more than one table for an Integration. You can also sync the historical data and set a frequency for data replication. I have left ‘Sync Historical Data’ as default and have configured the Replication Frequency to once in every 24 hours at 10am UTC.
Click on continue. To be able to read file from s3, Stitch needs to have access to the S3 bucket. Create a new IAM Policy and IAM Role to give access. Once the right access is set, Stitch will test the connection. On successful connection you will get the following message.
You could either navigate to Destination from main menu or click ‘Configure Your Warehouse’ in the above screen.
Select Snowflake from the list provided
Configure the snowflake destination. Give the Host, Port, username and password.
To grant stitch access to snowflake, you need to add the required IP address to snowflake security policies. Once done, click on check and Save.
On successful connection you will get the following message
Now come to Integration and select the integration we created ‘s3_stitch_test’
Click on the integration and select the ‘Choose Tables to Replicate’
Select the table ‘Test_Stitch’ . Until now stitch has not read the file. At this point, Stitch read the metadata of the file from s3 bucket. Select the fields you want to replicate. I have selected all the fields. Click on ‘Finalize your Selections’
Selection of columns is an important step as starting from this point, stitch is going to fetch only these columns at every run. Click on Yes, Continue and navigate to Extractions menu.
For testing purpose. Click Run Extraction Now. Notice the status change.
Check the extraction log
Log Starting :
Now, let’s verify the tables in Snowflake
Stitch watches the extraction. If I run the extraction again stitch doesn’t ingest the data. In my next blog I will be creating a new Ingestion or Destination. Keep watching this space and until then happy reading!!