Getting Started with Talend Open Studio: Building a Complex tMap Job
In our previous blog, we walked through a simple job moving data from a CSV file into a Snowflake data warehouse. In this blog, we will explore some of the more advanced features of the tMap component.
As in the last blog, you will be working with customer data in a CSV file and writing out to a Snowflake data warehouse; however, you will also be joining your customer CSV file with transaction data. For this tutorial, you will therefore need Talend Open Studio for Data Integration, two CSV data sources that you would like to join (in this example we use customer and transaction data sets), and a Snowflake warehouse. If you would like to follow a video version of this tutorial, feel free to watch our on-demand webinar and skip to the fourth video.
First, we will join and transform customer data and transaction data. As you join the customer data with transaction data, any customer data that does not find matching transactions will be pushed out to a tLogRow component (which will present the data in a Studio log following run time). The data that is successfully matched will be used to calculate top grossing customer sales before being pushed out into a Sales Report table within our Snowflake database.
Construct Your Job
Before you begin working on this new job, make sure you have all the necessary metadata configurations in your Studio’s Repository. As demonstrated in the previous blog, you will need to import your Customer metadata, and you will need to use the same process to import your Transactions metadata. In addition, you will need to import your Snowflake data warehouse as described in the previous blog if you haven’t done so already.
So that you don't have to start building a new job from scratch, you can take the original job that you created from the last blog (containing your customer data, tMap and Snowflake table) and duplicate it by right-clicking on the job and selecting Duplicate from the dropdown menu. Rename this new job – in this example we will be calling the new job “Generate_SalesReport”.
Now, in the Repository, you can open the duplicated job and begin adjusting it as needed. More specifically, you will need to delete the old Snowflake output component and the Customers table configuration within the tMap.
Once that is done, you can start building out the new flow.
Start building out your new job by first dragging and dropping your Transactions metadata definition from the Repository onto the Design Window as a tFileInputDelimited component, connecting this new component to the tMap as a lookup. An important rule-of-thumb to keep in mind when working with the tMap component is that the first source connected to a tMap is the “Main” dataset. Any dataset linked to the tMap after the “Main” dataset is considered a “Lookup” dataset.
At this point it is a good idea to rename the source connections to the tMap. Naming connections will come in handy when it's time to configure the tMap component. To rename a connection, perform a slow double-click on the connection arrow. The name will become editable. Name the “Main” connection (the Customers dataset) “Customers” and the “Lookup” connection (the Transactions dataset) “Transactions”. Later, we will come back to this tMap and configure it to perform an inner join of customer and transaction data. For now, we will continue to construct the rest of the job flow.
To continue building out the rest of the job flow, connect a tLogRow component as an output from the tMap (in the same way as discussed above, rename this connection Cust_NoTransactions). This tLogRow will capture customer records that have no matching transactions, allowing you to review non-matched customer data within the Studio log after you run your job. In a production job flow, this data would be more valuable in a database table, where it would be available for further analysis, but to keep this discussion simple we will just write it out to a log.
The primary output of our tMap consists of customer data that successfully joins to transaction data. Once joined, this data will be collected using a tAggregateRow component to calculate total quantity and sales of items purchased. To add the tAggregateRow component to the Design Window, either search for it within the Component Palette and then drag and drop it into the Design Window, or click directly in the Design Window and begin typing “tAggregateRow” to automatically locate and place it into your job flow. Now, connect your tAggregateRow to the tMap and name the connection “Cust_Transactions”.
Next, you will want to sort your joined, aggregated data, so add the tSortRow component.
In order to map the data to its final destination, your Snowflake target table, you will need one more tMap. To distinguish between the two tMap components and their intended purposes, make sure to rename this tMap to something like “Map to Snowflake”.
Finally, drag and drop your Snowflake Sales Report table from within the Repository to your Design Window and ensure the Snowflake output is connected to your job. Name that connection “Snowflake” and click “Yes” to get the schema of the target component.
As a best practice, give your job a quick look over and ensure you’ve renamed any connections or components with clear and descriptive labels. With your job constructed, you can now configure your components.
Configuring Your Components
First, double-click to open the Join Data tMap component configuration. On the left, you can see two source tables, each identified by their connection name. To the right, there are two output tables: one for the customers not matched to any transactions and one for the joined data.
Start by joining your customers and transactions data. Click and hold ID from within the Customers table and drag and drop it onto ID from within the Transactions table. The default join type in a tMap component is a Left Outer Join. But you will want to join only those customer IDs that have matching transactions, so switch the Join Model to “Inner Join”.
Within this joined table, we want to include the customer ID in one column and the customers’ full names in a separate column. Since our data has first name and last name as two separate columns, we will need to combine them, creating what is called a new “expression”. To do this, drag and drop both the “first_name” and “last_name” columns onto the same line within the table. We will complete the expression in a bit.
Similarly, we want the Quantity column from the transaction data on its own line, but we also want to use it to complete a mathematical expression. By dragging and dropping Unit Price and Quantity onto the same line within the new table, we can do just that.
You can now take advantage of the “Expression Builder”, which gives you even more control of your data. It offers a list of predefined functions that you can apply directly to this expression, and I highly recommend that you look through the Expression Builder to see what it can offer. Even better, if you know the Java code for your action, you can enter it manually. In this first case, we want to concatenate the first and last names. After adding the correct syntax within the Expression Builder, click Ok.
You will want to use the Expression Builder again for your transaction expression. With the Unit Price and Quantity expression, complete an arithmetic action to get the total transaction value by multiplying the Unit Price by the Quantity. Then, click Ok.
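Because tMap expressions are ordinary Java, the two expressions described above can be sketched as plain methods. This is only an illustration: the class and method names are hypothetical, and the column names “unit_price” and “qty” in the comments are assumptions about how your transaction schema is named.

```java
public class TMapExpressions {

    // Mirrors a tMap expression such as:
    //   Customers.first_name + " " + Customers.last_name
    static String fullName(String firstName, String lastName) {
        return firstName + " " + lastName;
    }

    // Mirrors a tMap expression such as (column names assumed):
    //   Transactions.unit_price * Transactions.qty
    static double transactionCost(double unitPrice, int qty) {
        return unitPrice * qty;
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from the tutorial data.
        System.out.println(fullName("Ada", "Lovelace"));  // prints "Ada Lovelace"
        System.out.println(transactionCost(2.5, 4));      // prints "10.0"
    }
}
```

Inside the Expression Builder you would type only the expression itself, not a full method.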
Remember, we set our Join Model to an Inner Join. However, Talend offers a nice way to capture just the customers who didn’t have transactions. To capture these “rejects” from an Inner Join, first drag and drop ALL the fields from the Customers table to the Cust_NoTransactions output table. Then, select the tool icon at the top right of this table definition and switch “Catch lookup inner join reject” to “true”.
With the fields properly mapped, it is time to move on and review the data below. Rename the first_name field to be simply “name” (since it now includes the last name) and rename the Unit Price column to “transaction_cost” (since it now has the mathematical expression applied). Then, ensure no further adjustments are necessary to the table's column types to avoid any mismatched type conflicts through the flow.
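Conceptually, an inner join with reject capture splits the main rows into two outputs. The sketch below shows that idea in plain Java; it is not the code Talend generates. The class, method, and field names are hypothetical, and it emits one output row per matching transaction, which is an assumption since the tutorial does not state the tMap Match Model setting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InnerJoinRejects {

    /** Holds the two outputs of the join (hypothetical helper type). */
    static final class JoinResult {
        final List<String> matched = new ArrayList<>();  // like Cust_Transactions
        final List<String> rejects = new ArrayList<>();  // like Cust_NoTransactions
    }

    // customers: id -> name (the "Main" flow)
    // transactionsById: id -> transaction costs (the "Lookup" flow)
    static JoinResult join(Map<Integer, String> customers,
                           Map<Integer, List<Double>> transactionsById) {
        JoinResult result = new JoinResult();
        for (Map.Entry<Integer, String> customer : customers.entrySet()) {
            List<Double> costs = transactionsById.get(customer.getKey());
            if (costs == null || costs.isEmpty()) {
                // No lookup match: the customer becomes an inner join reject.
                result.rejects.add(customer.getValue());
            } else {
                // One joined row per matching transaction (assumed behaviour).
                for (Double cost : costs) {
                    result.matched.add(customer.getValue() + ":" + cost);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        Map<Integer, String> customers = new LinkedHashMap<>();
        customers.put(1, "Ada Lovelace");
        customers.put(2, "Alan Turing");
        Map<Integer, List<Double>> transactions = new HashMap<>();
        transactions.put(1, Arrays.asList(10.0, 5.5));

        JoinResult result = join(customers, transactions);
        System.out.println("matched: " + result.matched);
        System.out.println("rejects: " + result.rejects);
    }
}
```

Every customer lands in exactly one of the two lists, which is why the reject output needs all of the customer fields mapped to it.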
With this tMap properly configured, click Ok. And then click “Yes” to propagate the changes.
Next, you will need to configure the Aggregate component. To do this, enter the Component Tab (below the Design Workspace) and edit the schema.
To properly configure the output schema of your tAggregateRow component, first choose the columns on the left that will be grouped. In this case we want to group by ID and Name. So, select “id” and “name” and then click the yellow arrow button pointing to the right. Next, we want to create two new output columns to store our aggregated values. By clicking the green “+” button below the “Aggregate Sales (Output)” section you can add the desired number of output columns. First, create a new output column for the total quantity (“total_qty”) and identify it as an Integer type. Then create another for the total sales (“total_sales”) and set it as a Double type. Next, click Ok, making sure to choose to propagate the changes.
With the output schema configured properly within the tAggregateRow component, we can now configure the Group By and Operations sections of the tAggregateRow component. To add your two Group By output columns and two Operations output columns, go back to the Component Tab. Click the green plus sign below the Group By section twice and the Operations section twice to account for the output columns configured in the tAggregateRow schema. Then, in the Operations section, set the “total_qty” column's function as “sum” and identify the input column position as “qty”. This configures the tAggregateRow component to add all the quantities for each grouped customer ID and output the total value in the “total_qty” column. Likewise, set the “total_sales” function as “sum” and input column position as “transaction_cost”.
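To make the Group By and Operations settings concrete, here is a minimal Java sketch of the same aggregation: group rows by id and name, then sum qty into total_qty and transaction_cost into total_sales. It illustrates the logic only and is not what the tAggregateRow component generates; the class and type names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AggregateSales {

    /** One joined input row (hypothetical type). */
    static final class Row {
        final int id; final String name; final int qty; final double transactionCost;
        Row(int id, String name, int qty, double transactionCost) {
            this.id = id; this.name = name; this.qty = qty; this.transactionCost = transactionCost;
        }
    }

    /** One aggregated output row (hypothetical type). */
    static final class Total {
        final int id; final String name; int totalQty; double totalSales;
        Total(int id, String name) { this.id = id; this.name = name; }
    }

    static List<Total> aggregate(List<Row> rows) {
        // Key each group by the two Group By columns: id and name.
        Map<List<Object>, Total> groups = new LinkedHashMap<>();
        for (Row row : rows) {
            List<Object> key = Arrays.<Object>asList(row.id, row.name);
            Total total = groups.get(key);
            if (total == null) {
                total = new Total(row.id, row.name);
                groups.put(key, total);
            }
            total.totalQty += row.qty;                 // "sum" of qty
            total.totalSales += row.transactionCost;   // "sum" of transaction_cost
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        // Made-up sample rows.
        List<Row> rows = Arrays.asList(
            new Row(1, "Ada Lovelace", 2, 10.0),
            new Row(2, "Alan Turing", 1, 4.0),
            new Row(1, "Ada Lovelace", 3, 5.5));
        for (Total t : aggregate(rows)) {
            System.out.println(t.id + " " + t.name + " " + t.totalQty + " " + t.totalSales);
        }
    }
}
```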
Next, head to the sorting component and configure it to sort by total sales to help us identify who our highest paying customers are. To do this, click on the green “+” sign in the Component Tab, select “total_sales” in the Schema Column, and select “num” to ensure that your data is sorted numerically. Last, choose “desc” so your data will be shown to you in descending order.
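The “num” and “desc” choices amount to a numeric comparison in reverse order. A minimal Java sketch of that sort, using a hypothetical CustomerTotal type and made-up values, could look like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortBySales {

    /** One aggregated row reduced to the fields needed here (hypothetical type). */
    static final class CustomerTotal {
        final String name; final double totalSales;
        CustomerTotal(String name, double totalSales) {
            this.name = name; this.totalSales = totalSales;
        }
    }

    // Returns a new list ordered by totalSales, highest first.
    static List<CustomerTotal> sortDescending(List<CustomerTotal> totals) {
        List<CustomerTotal> sorted = new ArrayList<>(totals);
        // Compare as numbers ("num"), then reverse for descending order ("desc").
        sorted.sort(Comparator.comparingDouble((CustomerTotal t) -> t.totalSales).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<CustomerTotal> totals = Arrays.asList(
            new CustomerTotal("Alan Turing", 4.0),
            new CustomerTotal("Ada Lovelace", 15.5),
            new CustomerTotal("Grace Hopper", 9.0));
        for (CustomerTotal t : sortDescending(totals)) {
            System.out.println(t.name + " " + t.totalSales);
        }
    }
}
```

Sorting numerically matters because a text sort would place “9.0” after “15.5”.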
Now, configure your final tMap component by mapping the customer name, total quantity and total sales to the matching columns of the Snowflake output. Then click Ok and click “Yes” to propagate the changes.
Finally, make sure your tLogRow component is set to present your data in table format, making it easier for you to read the inner join reject data.
Running Your Job
At last, you are ready to run your job!
Thanks to the tLogRow component, within the log, you can see the six customers that were NOT matched with transaction data.
If you head to Snowflake, you can view your “sales_report” table and review your top customers, along with their total quantity and total sales.
And that's how to create a job that joins different sources, captures rejects, and presents the data the way you want it. In our next blog, we will be going through running and debugging your jobs. As always, please comment and let us know if there are any other basic skills you would like us to cover in a tutorial.