In this tutorial, discover how metadata can help save a lot of development time, and learn how to create and use them.
This tutorial uses Talend Open Studio for Data Integration version 6.
Talend Open Studio allows you to create and run Java ETL programs or Jobs using predefined components.
Each component can be configured either as a “Built-in” or as a “Repository” component.
For “Built-in” components, information such as how to read the file and what it contains:
- Is defined within the component.
- Applies only to this component.
- Cannot be reused with any other component.
For “Repository” components, information:
- Is saved as metadata.
- Can be efficiently and consistently reused.
- Can be easily maintained because changes to metadata can be propagated to all Jobs that use it.
1. Create a metadata definition for a delimited file
- In the Project Repository, click Metadata, right-click File delimited, and click Create file delimited.
- In the Name field of the wizard, type movies and click Next.
- To specify a sample file, click Browse next to the File field, select the file moviesSorted from the local disk, and click Open. The file is displayed in the File Viewer section of the wizard.
- To define the Property Type settings, click Next. In the wizard window that appears, you can define settings such as how the file should be read, the number of rows, if any, that should be skipped when reading the file, and the maximum number of rows to process.
- To indicate that the first row of the file is column names and should be ignored, in the Preview tab, select the Set heading row as column names. Note that when you do so, the Header checkbox is automatically checked with the value 1.
- To refresh the file display to reflect the change made, click the Refresh Preview button and then click Next.
- In the Name field, type moviesSchema. If the first line of the sample file includes column names, they will be displayed. If not, the columns will appear as Column 0, Column 1, and so on and will have to be renamed manually.
When guessing the schema, Talend only reads the fifty first lines of the sample file and based on the data in these rows, defines the column types and length. You should validate the information displayed or correct it, if necessary.
- Update the displayed schema to reflect the structure of the sample file. In this case, change the length of the title and url fields to 100 and 250 respectively. Also, change the type of the directorID field to integer. Click Finish.
Under Metadata in the Project Repository, the movies 0.1 entry is displayed with the file properties. Under the entry movies 0.1, the schema of the metadata file, moviesSchema, is displayed.
If you need to modify the property type or the schema, right-click on the component in the Project Repository and select Edit File Delimited or Edit Schema.
2. Use the metadata to configure a component
- Create a new Job and name it useMetadata and add a tFileInputDelimited component to it.
Note: By default, the component is configured with “Built-in” parameters.
- In the Property Type field of the Component view, select Repository.
- To select the metadata, click […] next to the field that appears, click movies 0.1, and then click OK.
Note that the parameters set of the metadata is displayed. Also note that all the fields are in grey, indicating that they belong to the metadata and not to the component.
To change the schema, click […] next to text Edit schema and choose an option:
– Change to built-in property to edit the schema for this component only.
– Update repository connection to edit the metadata schema in the repository.
- To view the schema, click […] next to text Edit schema and select View schema.
3. Use the metadata to configure a second component
As an alternative method to use metadata: drag it straight from the Repository to the Designer and choose the tFileInputDelimited component to use it.
Talend allows you to create metadata based on several parameters such as databases, SAP connections, and several file types.
Note: To illustrate this, MySQL Workbench 6.3 CE along with a test dataset called talend_dq is used. You can either try it with a similar configuration or with your own databases.
4. Create a database connection and define it as metadata
- In the Project Repository, click Metadata, right-click Db Connections, and click Create connection.
- In the Name field of the Database Connection wizard, enter MySql and click Next.
- In the DB type field, select MySQL.
- Fill in the connection parameters.
- To check the connection to the database, click Check.
- To close the wizard and create the metadata, click Finish.
- To automatically retrieve all the table schemas, right-click the MySql 0.1 metadata in the Project Repository and then click Retrieve Schema.
- Click Next. The connected database is displayed.
- To select all the tables and views, select the checkbox to the left of the database name and click Next. The database and all tables and details are displayed.
All table schemas have been imported as metadata and can be used.
The tables and the views appear under the mysql 0.1 connection in the Project Repository. To view the field in a table, click the table.
5. Read a database table using the metadata
- To read a table from the list, select the table and drop it on the Job Designer.
- In the Components window, click tMySqlInput and click OK.
A tMysqlInput component is created with the repository information. It used the MySql 0.1 connection, and for the Schema it used the repository information from the metadata table tdq_values.
In addition, Talend generates the SQL query and sends it to the table tdq_values.
- To display the table data, add the tLogRow component and link the tdq_values component to the tLogRow_1 component.
- To run the Job, in the Run view, click Run. The data from the table tdq_values is displayed.