In a typical/traditional data warehouse solution, the data is read into ETL memory, processed/transformed in the memory before loading into the target database. With the growing data, the cost of compute is also increasing and hence it becomes vital to look for alternate design. Welcome to pushdown query processing. The basic idea of pushdown is that certain parts of SQL queries or the transformation logic can be “Pushed” to where the data resides in the form of generated SQL statements. So instead of bringing the data to processing logic, we take the logic to where data resides. This is very important for performance reasons.
Snowflake supports Query pushdown with v2.1 and later. This pushdown can help you transition from a traditional ETL process to a more flexible and powerful ELT model. In this blog I will be showcasing how Talend leverages Snowflake query pushdown via ELT.
ETL VS ELT
Before we get into advance details, let rejuvenate the basics. With traditional ETL – Extract Transform Load the data is first Extracted, then transformed and then loaded into target like snowflake. Here most of the data transformation like filtering, sorting, aggregation etc. takes place at ETL tool memory before loading it into target.
With ELT – Extract Load and Transform the data is first Extracted, then loaded and then data transformations are performed. With the ELT model all the data is loaded into snowflake and then the date transformations are performed directly in Snowflake. Snowflake offers powerful SQL capabilities, via query pushdown thereby enabling data transformation to a more effective ELT model. During development using ELT, it is possible to view the code as it will be executed by Snowflake.
TALEND ELT JOB DESIGN
In Talend, there are native components to configure pushdown optimization. These components would convert the transformation logic to an SQL query and also send the query to the snowflake database. The snowflake data base runs the query. In Talend query pushdown can be leveraged using ELT components tELTInput, tELTMap and tELTOutput. These components are available under ELT -> Map -> DB JDBC
Let’s take a quick look at these components
- tELTInput: This component adds input tables for the SQL statement. There is no restriction on the number of input tables. One can add as many Input tables as required by the SQL statement to be executed.
- tELTMap: this is the mapping component where the transformation are defined. This component uses the table(s) provided as input to feed the parameter in the built SQL statement. This component converts the transformation into SQL statements.
- tELTOutput: Carries out the action on the table specified along with the action on the data as specified according to the output schema defined.
Now, lets take build a job to use these components and to utilize snowflake query pushdown. I will explain it with an example. Let’s assume that I have two tables in Snowflake named SALES and CITY. Sales table contains details of item sold, unit sold, sales channel (Online or offline) cost of unit, total revenue, total profit as per Region, country. City table is a dimension table which has Country code, population of country. Now the metric which I need to calculate is the total profit for online sales for each item at Region, country level. The result must be inserted into ONLINE_AGG table.
Now to this logic in ELT format, my job would look as given below:
Let’s look at this job in more detail. As a best practice I have used tPrejob to open the snowflake connection and tPostjob to close the connection. I have also used tDie to handle exceptions at various components. The next few section explains in detail the sections marked in the image above (A,B,C and D)
- A (tELTInput) : this component uses the open snowflake connection and reads data from SALES table. The configuration is given below.
- B (tELTInput) : this component uses the same connection and reads data from COUNTRY table. The detail configuration is given below
- C (tELTMap) : this is an important component as this component transforms the mapping into SQL. Click on the ETL Map Editor to do the join and transformation.
After adding the input tables, I perform an INNER JOIN on the SALES and CITY table. As shown in the image below. This editor can also be used for providing additional where clause, group by clause and order by clause. In the example, I have given an performed the following transformation
- where condition as PUBLIC.SALES.SALES_CHANNEL =’Online’
- as I am doing aggregation on Total_Profit, I have given group by on group by PUBLIC.SALES.REGION , SALES.COUNTRY , PUBLIC.SALES.ITEM_TYPE , PUBLIC.SALES.SALES_CHANNEL columns
- Null handling for column ITEM_TYPE
These transformations are highlighted in the image below.
Now, the beauty of this component is that as you write the transformation, the SQL gets generated. Click on the ‘Generated SQL Select query for table2 output’ to see the generated SQL
To validate the results, I copied this sql, ran it in snowflake worksheet.
- D (tELTOutput) : this component is used to push the data to the table ONLINE_AGG. The detail configuration is given below.
Now that the job design is completed, lets run the job. At runtime, you will see that the records are not bought into Talend memory
Instead the query is executed at snowflake.
To confirm the execution, lets query the history at snowflake.
Expanded view of the query executed
In this blog we saw how we could leverage the power of query pushdown with Talend while working with Snowflake. This job design method enables high utilization of snowflake clusters for processing data. Well that’s all for now, keep watching this space for more blogs and until then happy reading!!