A Day In The Life of a Data Integration Developer: How to Build Your First Job in Talend Studio [Part 2]

The video demonstrates a simple example of how to build your first Job in Talend Studio. More advanced features are covered in the rest of this series:

In this installment of A Day in the Life of a Data Integration Developer, we'll be working with a movie ratings data warehouse to capture the average rating of each movie by state. To do this, we will be:

  1. Joining two tables from our data warehouse,
  2. Calculating the average rating by state using an aggregate component, and
  3. Saving it to a separate table in our data warehouse.

1. Joining tables

Right-click within the repository Job Designs folder tree and select Create Job. I'll give it a name and a brief purpose.

data integration job

Let's bring in two database table definitions from the Metadata repository that contain the source data we need. I've got a facts table containing ratings data and a dimension table of users that includes the state information.

I'll drag and drop each of them into our design workspace. And since these are generic metadata table definitions, as I drop each one into the design workspace, I need to specify them as MysqlInput source components.

build a data job

Next, we need a tMap as a way to join these two source components. I can go to the components palette on the right and search for the component, or I can click inside the design workspace, begin typing, and select it to bring it into my Job flow.

Let's connect the data using the tMap. I can right-click and connect using the row function, or I can just drag and stretch connectors using the icons beside the components. As I go, I'll label my connectors by doing a slow double-click on each existing label. I can also rename components. This helps me better identify the tables when configuring the tMap once my Job flow is built.

2. Using an aggregate component

The third component we want is an aggregator, which will calculate the average rating by state.

  1. Click within the design workspace again and search "Aggregate" to locate and select the new component.
  2. Connect it to the tMap and give the output a name.

3. Establishing a target table

The last thing we need is our target table where we want all of our result data to end up. So let's bring in the metadata for my target table and specify this as a MysqlOutput component and connect it to the rest of my flow.

Configuring components

Now that we have all of the pieces of the Job placed and connected, we must configure the components to handle the data flow correctly.

Configure the tMap

First, we will configure the tMap by joining the Ratings_In to the lkp_States.

  1. Link the two data sets by dragging the User ID from the Ratings_In table to the ID field in the lkp_States table.
  2. When the tables are joined, grab the Movie_Id and Rating field from the Ratings_In table, and the State field from the lkp_States table, and drag them into the Ag_Output table. This creates the output dataset that will drive the rest of the flow.
  3. Click OK.

Configure the aggregator

And next we need to configure the aggregator. To do that, I first need to confirm the aggregator schema matches the defined outputs of the component. We want to make sure that the Movie_Id and State columns are moved over into my aggregator output and that a column is available to hold the data that we're aggregating. It's labeled Average_Rating.

Since this column is a calculated average, it needs to account for decimals, so the data type is Float with a length of 5 and a precision of 3.

data integration job

That looks perfect so I'll click OK.

Now with our schema set, we can configure the aggregator to calculate our average rating:

  1. Group by Movie_Id and State, and for our operations, we want to set our Output column to be Average_Rating, the column we just confirmed in our output schema.
  2. Set the function to avg, and we're averaging the Rating field.
  3. I already have my target source mapped to my MySQL target table. So this job is ready to go.

Let's document the flow with a title. I'll select my flow, go to the Components tab, select Show subjob title, and call it “Load Average Rating by State.”

Run your first job

Let's run this Job. I can go into the Run tab within my design palette and complete a basic run. And it ran smoothly.

For complete details on building your first Job in Studio, watch the video above. Next, we’ll cover running processes or jobs, testing, and debugging.


| Last Updated: March 14th, 2019