How To Operationalize Meta-Data in Talend with Dynamic Schemas


  • Robert Griswold

    Robert Griswold is a Professional Services Architect at Talend. Mr. Griswold has dedicated his career to keeping current on application development, systems administration and data management practices.

    Robert's experience includes working with many Fortune 100 companies. His focus over the years spans mainframe, client server, Java, batch integration, SOA, and Big Data. While at Talend, Robert has drawn on this experience to drive adoption and enablement of the Talend platform among a growing list of large and small customers.

This blog addresses operationalizing meta-data usage in data management with Talend. To explain further: you have files and tables, and they have schema definitions. The schemas hold information like name, data type, and length. This information can be imported or keyed in with a visual editor like Talend Studio at design time; however, this can add a lot of extra work and is prone to errors. If you could extract this information from a known and governed source and use it at run time, you could automate the creation of tables or file structures. Once tested and verified, the resulting tables can be imported manually if persisted meta-data schemas are desired.

What is schema?

Schema is the definition of the data formats, fields, types and lengths.

Persisted Schema

This is an example of what a persisted meta-data schema looks like in Talend. It is created at design time in Talend.

Data Dictionary File Example

This is what a data dictionary file could look like. It can be used instead of defining a static schema in Talend Meta-Data as displayed above. The data dictionaries used in this process have a static meta-data schema of their own and come from a data modeling tool like Erwin or from a database schema. Your dictionaries should be transformed to a common format, which allows for more code reuse.

Schema for a Data Dictionary

This is a basic layout for a data dictionary using Column Name, Type and Length. The exhaustive list of Talend dictionary items is:
  • ColumnName
  • DBName
  • Type
  • DBType
  • Precision
  • Nullable
  • Key
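As a minimal sketch, one row of the basic layout (Column Name, Type, Length) could be held in a small Java class and parsed from a text line. The class name `DictionaryEntry`, the semicolon-delimited line format, and the sample column names are assumptions made for illustration; they are not part of Talend.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one data dictionary row (not a Talend class).
public class DictionaryEntry {
    public final String columnName;
    public final String type;
    public final int length;

    public DictionaryEntry(String columnName, String type, int length) {
        this.columnName = columnName;
        this.type = type;
        this.length = length;
    }

    // Parse one line in the assumed layout "ColumnName;Type;Length".
    public static DictionaryEntry parse(String line) {
        String[] parts = line.split(";");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        return new DictionaryEntry(parts[0].trim(), parts[1].trim(),
                Integer.parseInt(parts[2].trim()));
    }

    // Parse every non-blank line of a dictionary into entries.
    public static List<DictionaryEntry> parseAll(List<String> lines) {
        List<DictionaryEntry> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parse(line));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "customer_id;Integer;10",
                "customer_name;String;50");
        for (DictionaryEntry e : parseAll(lines)) {
            System.out.println(e.columnName + " " + e.type + " " + e.length);
        }
    }
}
```

Transforming every dictionary to one such common layout is what lets the same parsing code be reused across sources.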
What is the conceptual process?

The diagram below is the logical process for handling dynamic schemas with a data dictionary at run time. Talend already has components that handle dynamic schemas for positional non-Big-Data files, but this pattern can be applied to all other types of data sources and targets.

The process:
  • Load the data dictionary
  • Define input and output as a single data item
    • String for Big Data
    • Dynamic for files, tables and internal schemas
  • Use Java APIs to operationalize the use of data dictionaries
    • For non-Big-Data, use the Talend API, which will be demonstrated in subsequent examples.
    • For Big Data, use Java string utilities.
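The Big Data branch of the process above, where each record arrives as a single String, could be sketched roughly as follows. The class name `RecordSplitter`, the pipe delimiter, and the sample column names are hypothetical; only the idea of pairing the loaded list of column names with Java string utilities comes from the text.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper (not a Talend class): applies dictionary column names
// to a record that was read as one delimited String.
public class RecordSplitter {

    public static Map<String, String> split(String record, String delimiter,
                                            List<String> names) {
        // Pattern.quote treats the delimiter literally; -1 keeps trailing empty fields.
        String[] values = record.split(Pattern.quote(delimiter), -1);
        if (values.length != names.size()) {
            throw new IllegalArgumentException("Record has " + values.length
                    + " fields but the dictionary defines " + names.size());
        }
        // LinkedHashMap preserves the column order given by the dictionary.
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            row.put(names.get(i), values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "name", "city");
        System.out.println(split("42|Ada|London", "|", names));
    }
}
```

Checking the field count against the dictionary catches records that no longer match the governed definition, which a static schema would otherwise surface only at design time.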
This process can be used to:
  • Virtualize schemas for files
    • Fixed (NOTE: There are Talend components to do this as well)
  • Virtualize schemas for Tables
  • Virtualize schemas for Big Data elements like HDFS or Hive Schemas
  • Virtualize Internal Schemas used for Data services or Queues
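For the table case in the list above, a rough sketch of turning dictionary values into a table definition might look like this. The class name `DdlBuilder` and the sample table and columns are hypothetical, and the sketch assumes the dictionary already stores SQL type names; a real job would need a mapping from Talend or source types to the target database's types, plus validation of the identifiers.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Talend class): builds a CREATE TABLE statement
// from the three dictionary lists (names, types, lengths).
public class DdlBuilder {

    public static String createTable(String table, List<String> names,
                                     List<String> types, List<Integer> lengths) {
        if (names.size() != types.size() || names.size() != lengths.size()) {
            throw new IllegalArgumentException("Dictionary lists differ in size");
        }
        StringBuilder ddl = new StringBuilder("CREATE TABLE " + table + " (");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) {
                ddl.append(", ");
            }
            ddl.append(names.get(i)).append(' ').append(types.get(i));
            // Assumption: a length of 0 or less means the type takes no length.
            if (lengths.get(i) > 0) {
                ddl.append('(').append(lengths.get(i)).append(')');
            }
        }
        return ddl.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(createTable("customer",
                Arrays.asList("id", "name"),
                Arrays.asList("INTEGER", "VARCHAR"),
                Arrays.asList(0, 50)));
        // prints: CREATE TABLE customer (id INTEGER, name VARCHAR(50))
    }
}
```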
Load the Dictionary

Below is the code to load the data dictionary from a file or table. The values read from the dictionary are moved into memory in the form of three ArrayLists, one each for the column names, types and lengths. These lists can then be used throughout the data management process to operationalize the processing of data.

    // Running count of the dictionary rows (columns) loaded so far
    int rowCnt = (Integer) globalMap.get("rowCount");

    // Three lists hold the data dictionary columns
    List nameList = new ArrayList();
    List typeList = new ArrayList();
    List lengthList = new ArrayList();

    // After the first row, reuse the lists already stored in the global variables
    if (rowCnt != 0) {
        nameList = (ArrayList) globalMap.get("nameList");
        typeList = (ArrayList) globalMap.get("typeList");
        lengthList = (ArrayList) globalMap.get("lengthList");
    }

    // Move the data dictionary file or table values into the lists
    nameList.add(row3.name); // argument is missing in the source; row3.name is assumed by analogy with the next two lines
    typeList.add(row3.type);
    lengthList.add(row3.length);

    // Put the lists back into the global variables with the new values added
    globalMap.put("rowCount", rowCnt + 1);
    globalMap.put("nameList", nameList);
    globalMap.put("typeList", typeList);
    globalMap.put("lengthList", lengthList);

Define Input and Output

Dynamic Schema for non-Big-Data Components

This is an example of a dynamic schema definition for a delimited file. One field of the type Dynamic is used for the entire record, and the schema is determined at run time.

Dynamic Schema definition for Big-Data Components

This is an example of a dynamic schema for Big Data. Notice that a data type of String is used instead of Dynamic. Talend doesn't support dynamic types for Big Data components.

Java APIs

In the following weeks I will go into specific usages of dynamic schemas and the implementation of Talend jobs, components and Java APIs:
  • Dynamic Schemas for traditional files
  • Dynamic Schemas for Big Data Files
  • Dynamic Schemas for NoSQL Tables
  • Operationalizing with the Meta Data Bridge (Available in a future release of Talend)
Conclusion

The use of dynamic schemas can save on maintenance and development in the data management layer. Dynamic schemas can also be used to create tables or files that can be imported as persisted meta-data schemas. This article is intended to propose some complementary technologies around meta-data management. How you use them will depend on the priority of your architectural objectives. These objectives can be somewhat opposing, such as code maintainability versus strictly governed meta-data.

To find the meta-data approach that works, you can ask questions such as:
  • Does your Talend use case favor code reuse and the use of governed third-party data dictionaries?
    • Use Dynamic Schemas
  • Does your Talend use case favor persisted meta-data? Is this meta-data the vehicle for schema governance?
    • Use Persisted Schemas
  • Do you want to use dynamic schemas to create meta-data for testing and POCs which will eventually end up as persisted meta-data?
    • Use a Hybrid Approach by dynamically creating files or tables that can be imported as persisted schemas.

Join The Conversation



  1. vikas says:

    Hello Robert,

    Thanks for the detailed explanation. I am fairly new to Talend and trying to understand how to convert the logic explained into a Talend job. If you can please provide more details or a screen shot of what components to be used and their order that would be really great and helpful.

    Thanks in advance.

  2. Sam Jebaraj says:

    Great deal of information! How do we manage the dynamic schema metadata in Talend Data Catalog? Is there any Talend DI best practice to follow in order to enable easy metadata import in to Talend Data Catalog? If no, do you have any suggestions for better metadata management and lineage tracking.