How to bulk load Snowflake tables using Talend Cloud Platform
Talend Cloud is an Integration platform as a service (iPaaS) offering by Talend. It is a fully managed cloud option which has the capabilities of data integration, data stewardship, data preparation, API designer and Tester and Pipeline designer. These tools can be used for lightweight ETL and detecting the schema on the fly. One of the unique features of Talend Cloud is it provides both on premise and cloud execution environments. Remote engines can be installed behind the firewall, which allows one to run tasks on premise, but scheduling and managing tasks can be done using Talend Management console. Similarly, Talend provides Cloud engine which is an execution environment fully managed by Talend. With many companies undergoing the cloud journey because of its ease of use, minimal maintenance and cost effectiveness, Talend cloud certainly will ease and make the process frictionless.
Snowflake database is a completely cloud-managed service for building the analytic data warehouse. It is highly scalable and easy to use as it provides an SQL engine to interact with the underlying database storage.
Prerequisites for the use case:
- User should have valid Talend Cloud license and should have installed Talend Studio version 7.1.1 or higher.
- User should have valid snowflake account.
- User should have valid AWS account (since the files to be loaded are staged in S3 bucket)
Use case description:
In today's world, data is the new oil. It is highly valued and is sought for everywhere. Things like sensors and IoT produce large dataset volumes in gigabytes. Oil needs purification process to convert it into usable form. The same goes for data which needs proper cleansing and aggregation to be convert into a consumable form called information. Hence, the processing of these large datasets/files must be handled efficiently. For this, Snowflake provides bulk load in the form of copy command which ingests the data from large files quickly and efficiently into snowflake tables. Talend has built a component around COPY command. By simply filling in some required parameters, you will be ready to use the COPY command which makes it easier to use and ingest data with ease. For COPY command to load the data, the files must be staged in AWS S3, Google Cloud Storage, or Microsoft Azure. In our current use case the files are staged in S3 bucket.
Creating Talend job for bulk load:
- Talend has a prebuilt component called “tSnowflakeBulkExec” which will be used to execute bulk load and job design as shown in the below screenshot. For this exercise Talend Studio for Cloud Version 7.1.1 has been used.
- SnowflakeConnection(tsnowflakeconnection) creates the connection to snowflake database. LoadEmployee(tSnowflakeBulkExec) executes the COPY command on snowflake database and loads the employee table. CommitLoad(tsnowflakerow) commits the snowflake connection finally CloseConnection(tsnowfalkeclose) closes the snowflake connection.
- The screenshot below shows the configuration of a bulk component. The component is configured such that it loads to Employee table in Snowflake using the file staged in Amazon S3 bucket inside the Snowflake folder.
Creating and Executing Task in Talend Cloud
- After designing of job, right click on job and Publish to Cloud to deploy the job to Talend cloud.
- After publishing to cloud, the corresponding artifact would be available in the respective Workspace. Click on ADD TASK to create a task for the artifact.
Note that artifact is just a binary or executable. To execute the binary a TASK must be created for that binary.
- Verify the below configuration and click on continue.
- Choose the Runtime as Cloud to execute job on Cloud Engine and Run Type as Manual to run it as and when required and click the GO LIVE button to execute the task.
- After going live, the job will run on Talend cloud Engine and will change its status to SUCCESSFUL, as shown below in the screenshot. The data in staged file will be loaded to employee table. You can go back to snowflake query console and query the employee table for data.
With most of the organizations leaning towards cloud solutions and Integration of different technologies within it to handle huge data volumes, Talend cloud with Snowflake provides an easy and seamless way of doing this as depicted here.