[TOS tutorial 06] Joining Two Data Sources with the tMap Component

In this tutorial, discover how easy joining two data sources is with the tMap component.

This tutorial uses Talend Open Studio Data Integration version 6.

1. Create a new Job and add a source component

In a new Job, add the movies metadata file to the Job as a tFileInputDelimited component.

Note that the last column in the movies file is directorID. In this tutorial, you will use this column to join the movies file to the directorID column in the directors file.

2. Create a new metadata file based on the simple file named directors.txt and add it as the second source

  1. Create a new delimited metadata file named directors.
  2. To specify a sample file, click Browse next to the File field, select the file txt from the local disk, and click Open.
  3. The field delimiter in this file is a comma. To change the delimiter, in the Field Separator dropdown list, select Comma.
  4. To refresh the file display to reflect the change made, click the Refresh Preview. Note: This file only has two columns: directorID and directorName.
  5. Change the column names to directorID and directorName and the length of the directorName field to 40. The new metadata file, directors 0.1, is displayed under Metadata in the Project Repository. You can now join the directorID column from the movies file with the directorID column from the directors file.
  6. Add the directors metadata file to the Job as the second source input.

3. Add and configure a tMap component to join the two source components

  1. Add a tMap component to the Job Designer.
  2. To establish a flow of data between the two components to the tMap component, first link movies to the tMap_1 component, and then link directors to the tMap_1.
  3. To rename the link between the source component and tMap components, click the link twice and type movies. Similarly, rename the other link as directors.
  4. To open the tMap component wizard, double-click the component. Note that in the left section of the wizard, both movies and directors are listed as inputs to the tMap component.
  5. To create a new output data flow, in the output section of the tMap wizard, click the [+] Name the new output as joinedOutput and click OK. A blank output data flow is created.
  6. To add the movieID, title, releaseYear, and url fields to the output data flow, select the four fields from the movies input and drop them on the output data flow. Similarly, add the directorName field to the output data flow from the directors input.
  7. To configure the join, select the directorID column in the movies table and drag it to the directorID column of the directors table.

4. Store the result in a file

  1. To specify a destination for the output, add a tFileOutputDelimited component to the Job Designer and link the joinedOutput output of the tMap component to it.
  2. To configure the output component, in the Component view of the component, specify the path and name for the output file and include a header row in the output file.
  3. To run the Job, in the Run view, click Run.
  4. To check the moviesComplete.csv file, navigate to the folder in which the file was created and open the file.

Note: Some movies show no director name. These movies had no directorID.

Default behavior in the ETL tool:

  • All rows in the driving table will be displayed.
  • If the join condition was not satisfied, corresponding output fields will be blank. This is a “left outer join”. In contrast, you can configure the join in tMap as an inner join, which only displays rows where the join condition is satisfied.


Ready to get started with Talend?