Experience the magic of shuffling columns in Talend Dynamic Schema

If you are a magician specialized in Talend magic, we always hear a key word called Dynamic ingestion of data from various sources to target systems instead of creating individual Talend job for each data flow. In this blog, we will do a quick recap of the concept of Dynamic schema and how we can reorder or shuffle columns when we are employing Dynamic schema in ingestion operations.

There are multiple methods (or magical spells) available for shuffling. Below, I will share with you a simple spell which can be cast for most commonly used files for dynamic ingestion i.e., delimited files. (Statutory Warning: - If you are not a Harry Potter fan, I would recommend opening a new tab with your favorite search engine to quickly check the different magical terms. ?)

Before going to the details regarding shuffling of columns, let us see what Dynamic schema is and what makes this spell more exciting compared to traditional ways of data ingestion.

Dynamic Ingestion – A quick recap about the basics of magic

To make the concept more interesting and to help our new members of Talend developer community, I will explain the concepts with a simple game of arranging a deck of cards (Experienced Talend magicians who know the “Dynamic” spell can safely skip to next heading).

Let us imagine that each row of your data file is a combination of cards where each column of the row is like individual card. Now we are going to move these combinations from source to target in different methods. Sounds interesting? Ok, lets proceed ?

In traditional method, if you must move the card combination from source to target, you will have to move one combination at a time. In the below diagram, you will have one task to move the first combination from source to target and another task to move second combination from source to target.

Coming to Talend context, imagine you have two files with different data types and number of columns as shown below.

In common scenarios, Talend developers might think to create two separate Talend jobs to move from source system to target system. But in the case of big organizations, this approach means, the creation of hundreds of Talend jobs just for ingestion for processing different files with different column combinations.

Let us come back to our card magic game. Imagine if you have a magic wand to move all the different card combinations from source and target, instead of moving one combination at a time.

For the Harry Potter fans, just take your magic wand and cast the spell “Riddikulus”, where the scary task to rearrange all the card combinations magically get converted to a silly and easy task! Well, you can also use the spell Accio, in case you want to summon the cards automatically to target area ?

In Talend world of magic, the data ingestion tasks can be made easy by using the spell “Dynamic”! It magically ingests the data from files or databases even if you are having a huge variation in underlying column schemas for each file.

Talend secret book of magic (help.talend.com) has a specific section for Dynamic schema and you can refer some sample magic spells from this link. Now, the book of Talend magic has given the samples to load the data from multiple files from source to target (either database or file system).

Decoding the magic of shuffling columns in Dynamic schema

The experienced Talend magicians already know the above trick and they must have cast this spell many times when they are in the process of building an ingestion framework using Talend jobs. So, let us step up the magic to next level.

Some of the Talend magicians have asked me how we can do the shuffling of columns while using dynamic ingestion framework for delimited files in source and target. Coming back to the card magic, you can see the difference from the below diagram where the positions of cards have been changed for each row.

Well, it can be achieved in multiple ways (You know by this time that you can use both Riddikulus and Accio spells to create the magical effect). I am going to show one simple trick if your source and target systems are delimited files and you just have to reorder the column patterns (Note:- In the case of databases, they will magically identify the column names based on Dynamic metadata and will populate to correct columns in target tables). If you go to the Diagon Alley of Talend (community.talend.com), you can see examples and details of various other spells (data parsing methods) created by some of the Ace magicians of Talend community.

Ingredients for magic potion (Creation of metadata table)

Since you are going to create a generic ingestion framework to shuffle the columns while using Dynamic schema, the first step will be to create configuration table having below columns. The metadata information table will have input and output file load paths, input and output file names and target column order after shuffling.

The actual spell to load data dynamically after shuffling!

The Talend job to load the data dynamically to target delimited files after column shuffling is as shown below.

Metadata Extraction subjob will extract the data from configuration table and convert them to multiple iterations using tFlowtoIterate component.

A dummy tJava component will be used to orchestrate multiple “On Component Ok” control flows.

The first On Component Ok leads to the header processing subjob for target file. The target schema details for each input row will be copied from tFlowtoIterate col_list column and it will be assigned as input value for the data column of tRowGenerator.

The header data will be populated to target file and the configuration is as shown below. We need to also make sure that the Advanced settings option to “Create directory if does not exist” is selected.

 The next stage will be to pump the actual data from the source delimited file to target delimited file. The input file path directory and input file name variables will be extracted from tFlowtoIterate and will be added to the file name property of the tFileInputDelimited component. The schema will be marked as dynamic to maintain the schema neutrality for different types of files.

Column shuffling will be handled in a tJavaRow component where the input data in Dynamic schema is parsed based on matching metadata column names from the input Dynamic column metadata and the list of target column captured earlier in tFlowtoIterate component. The output data is transmitted as String data type to next component.

The Java code snippet used within tJavaRow can be accessed below.

 

String strArray[] = ((String)globalMap.get("row3.col_list")). split(";");
String tarArray[] = ((String)globalMap.get("row3.col_list")). split(";");

Dynamic columns = row1.data;
String out="";

for (int i = 0; i < columns.getColumnCount(); i++)
{
DynamicMetadata columnMetadata = columns.getColumnMetadata(i);
String inp_col=columnMetadata.getName().toString();
String col_val=(String)row1.data.getColumnValue(inp_col);

for (int j = 0; j < columns.getColumnCount(); j++)
{
if (inp_col.equals(strArray[j].toString()))
{
tarArray[j]=col_val;
}
}

}

for (int t=0; t < columns.getColumnCount(); t++)
{
if (t ==0)
out=tarArray[0];
else
out=out+";"+tarArray[t];
}

row2.data=out;

 

 

The output data in the new shuffled format will be loaded to the target delimited file in Append mode.

The output data in the target files will be according to the reshuffled column list.

You can customize the above job further to incorporate additional data validations and data quality checks but its not the scope of current magical spell.

Conclusion

I would highly recommend reading below interesting Blogs around Dynamic Ingestion of Talend before closing your current tab!

Ready to get started with Talend?