Talend Job Design Patterns and Best Practices: Part 4

Our journey in Talend Job Design Patterns & Best Practices is reaching an exciting juncture.  My humble endeavor at providing useful content has taken on a life of its own.  The continued success of the previous blogs in this series (please read Part 1, Part 2, and Part 3 if you haven’t already), plus Technical Boot Camp presentations (thanks to those I have met there for attending) and delivering this material directly to customers, has led to an internal request for a transformation.  Our plan to create several webinars around this series and make them available in the near future is now underway.  Please be a little patient, as it will take some time and coordinated resources, but my hope is to see the first such webinar available sometime in early 2017.  I am certainly looking forward to this and welcome your continued interest and readership.

As promised however, it is time to dive into an additional set of best practices for job design patterns with Talend.  First let me remind you of a simple and often ignored fact: Talend is a Java code generator, and thus crafting developer guidelines fortifies and streamlines the Java code generated by your job design patterns.  It seems obvious, and it is, but well-designed jobs that generate clean Java code, painted onto your canvas using these concepts, are the best way I know to achieve great results.  I call it ‘Success-Driven Projects’.

Success-Driven Talend Projects

Building Talend Jobs can be very straightforward; they can also become quite complex.  The secret to their successful implementation is to adopt and adapt the good habits and discipline required.

From the ‘Foundational Precepts’ discussed at the start of this series to now, my goal has always been to foster an open discussion on these best practices to achieve solid, useful job design patterns with Talend.  Most use cases will benefit from atomic job design and parent/child orchestration, and when projects contain significant reusable code, overall success can be accelerated.  Of course, choose your own path, but at the very least all I ask is: Be consistent!

Database Development Life Cycle - DDLC

But hold on!  Maybe it is not just about Job Design, what about the data?  We are processing data, aren’t we?  For the most part data resides in a database.  I ask you, do databases need best practices?  A rhetorical question?  Data models (schemas) change over time, so Database Designs must have a life cycle too!  It just makes sense.

Databases evolve and we developers need to accommodate this fact.  We have embraced our SDLC process, so it should not be that hard to accept that we need a Database Development Life Cycle alongside it.  It is actually quite straightforward in my mind.  For any environment (DEV/TEST/PROD) a database needs to support:

  • A Fresh INSTALL - based upon the current version of the schema
  • Apply an UPGRADE - drop/create/alter database objects, upgrading one version to the next
  • Data MIGRATION - where a disruptive ‘upgrade’ occurs (like splitting tables)

Understanding the database life cycle and its impact on job design becomes very important.  Versioning your database model is key.  Follow a prescribed design process.  Use graphical diagrams to illustrate the designs.  Create a ‘Data Dictionary’ or ‘Glossary’ and track lineage for historical changes.  I will be writing a separate blog on this topic in more detail soon.  Watch for it.  Meanwhile, please consider the following process when crafting database models.  It is a higher discipline, but it works!

More Job Design Best Practices

Ok.  Here are more job design patterns & best practices for your immediate delight and consumption!  These begin to dive deeper into Talend features that may be common for you or perhaps less frequently used.  My hope is that you will find them helpful.

8 more Best Practices:

tMap Lookups

As many of you already know, the essential tMap component is widely used within Talend Jobs.  This is due to its powerful transformation capabilities. 

The most common use for the tMap component is to map a data flow schema from a source input to a target output: simple, right?  Sure it is!  We know that we can also incorporate multiple source and target schema data flows, offering complex opportunities for joining and/or splitting these data flows as needed, possibly incorporating transformation expressions that control what incoming data is dispersed downstream, and how.  Expressions within a tMap component can occur on the source or the target schemas.  They can also be applied using variables defined within the tMap component.  You can read up on how to accomplish all this in the Talend Components Reference Guide.  Just remember: Great Power comes with Great Responsibility!
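To give a flavor of what such expressions look like, here is a minimal sketch of tMap output column expressions (these are Java snippets typed into an expression field; the column names row1.first_name, row1.last_name, and row1.birth_date are assumptions made purely for illustration):

    // Hypothetical output column expressions inside a tMap:
    row1.first_name + " " + row1.last_name
    row1.birth_date == null ? "UNKNOWN" : TalendDate.formatDate("yyyy-MM-dd", row1.birth_date)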

Another very compelling use for the tMap component is incorporating lookups that join with the source data flow.  While there is no physical limit to how many lookups you can apply to a tMap component, or what comprises the lookup data, there are real and practical considerations to make.

Look at this basic example: Two row generators; one is the source, the other the lookup.  At runtime, the lookup data generation occurs first and then source data is processed. 

Because the join setting on the lookup data is set to ‘Load Once’, all its records are loaded into memory and then processed against the source data result set.  This default behavior provides high-performance joins and can be quite efficient.

Alternatively, you might imagine that loading up millions of lookup rows or dozens of columns could impose considerable memory requirements.  It probably will.  What if multiple lookups, each having millions of rows, are required?  How much memory will that need?  Carefully consider your lookups when many records or hundreds of columns are involved.

Let us examine a trade-off: Memory vs Performance.  There are three Lookup Models available:

  • Load Once - reads all qualifying records into memory
  • Reload at each Row - reads the qualifying row for each source record only
  • Reload at each Row (cache) - reads the qualifying row for each source record, caching it

Clearly, lookup data that has been loaded into memory to join with the source can be quite fast.  However, when memory constraints prevent loading enormous amounts of lookup data, or when you simply do not want to load ALL the lookup data because the use case does not need it, use the ‘Reload at each Row’ lookup model.  Note there is a trick you need to understand to make this work.

First, inside the tMap component, change the lookup model to ‘Reload at each Row’.  Notice that the area below expands to allow input of the ‘Key(s)’ you will need to do the lookup.  Add the keys, which effectively define global variables available outside the tMap component.

For the lookup component, use the (datatype)globalMap.get(“key”) function in the ‘WHERE’ clause of your SQL syntax to apply the saved key value defined in the tMap to the lookup dataset.  This completes the lookup retrieval for each record processed from the source.
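As a minimal sketch of that trick, assuming a hypothetical lookup table named customer_orders and a tMap key named customer_id (both invented for illustration), the lookup input component’s ‘Query’ field, which holds a Java String, might look like this:

    // Hypothetical 'Query' field of the lookup input component; the table,
    // column, and key names are assumptions made for this example.
    "SELECT order_id, order_date, order_total " +
    "FROM   customer_orders " +
    "WHERE  customer_id = " + ((Integer) globalMap.get("customer_id"))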

There you are, efficient lookups, either way!

Global Variables

There are several aspects to the definition and use of what we think of as ‘Global Variables’.  Developers create and use them in Talend Jobs all the time and we refer to them as ‘Context Variables’.  Sometimes these are ‘Built-In’ (local to a job), and sometimes they are found in the ‘Project Repository’ as Context Groups, which allow them to be reused across multiple jobs. 

Either way, these are all ‘Global Variables’ whose value is determined at runtime and which are available for use anywhere within the job that defines them.  You know you are using one whenever a context.varname is embedded in a component, expression, or trigger.  Please remember to place commonly used variables in a ‘Reference Project’ to maximize access across projects.

Talend also provides the tSetGlobalVar and tGlobalVarLoad components that can define, store, and use ‘Global Variables’ at runtime.  The tSetGlobalVar component stores a key-value pair within a job, which is analogous to using a ‘Context Variable’ but provides greater control (like error handling).  Look at my example where a single MAX(date) value is retrieved and then used in a subsequent SQL query to filter a second record set retrieval.

To access the Global Variable use the (datatype)globalMap.get(“key”) function in the SQL ‘WHERE’ clause.  Get very familiar with this function, as you will likely use it a lot once you know its power!
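As a minimal sketch of that pattern (all table, column, and key names here are assumptions): a first flow runs “SELECT MAX(order_date) AS max_order_date FROM orders” into a tSetGlobalVar that stores the value under the key “maxOrderDate”; the second input component’s ‘Query’ field (a Java String) then filters on it:

    // Hypothetical second query; the table, columns, and the "maxOrderDate" key
    // are invented for illustration.
    "SELECT order_id, customer_id, order_total " +
    "FROM   orders " +
    "WHERE  order_date > '" +
        TalendDate.formatDate("yyyy-MM-dd", (java.util.Date) globalMap.get("maxOrderDate")) + "'"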

The tGlobalVarLoad component provides similar capabilities for Big Data jobs where the tSetGlobalVar component is not available.  Look at my example where an aggregated value is calculated and then used in a subsequent read to qualify which records to return.

We are not quite done on this topic.  Hidden in plain sight is a set of ‘System Global Variables’ available within a job, whose values are determined by the components themselves.  We talked about two of them before, in the Error Handling best practice way back in Part 1 of this series: CHILD_RETURN_CODE and ERROR_MESSAGE.  These System Global Variables are typically available for use immediately after the execution of the component that sets their value (see the sketch after the list below).  Depending upon the component, different system variables are available.  Here is a partial list:

  • ERROR_MESSAGE / DIE_MESSAGE / WARN_MESSAGE
  • CHILD_RETURN_CODE / DIE_CODE / WARN_CODE / CHILD_EXCEPTION_STACK
  • NB_LINE / NB_LINE_OK / NB_LINE_REJECT
  • NB_LINE_UPDATED / NB_LINE_INSERTED / NB_LINE_DELETED
  • global.projectName / global.jobName (these are system level; their use is obvious)
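As a minimal sketch of how such a variable is read (the component name tMysqlOutput_1 is an assumption for this example), a tJava placed after the output component could report the insert count like this:

    // Component system variables live in globalMap under "<componentName>_<VARIABLE>";
    // tMysqlOutput_1 is a hypothetical component name.
    System.out.println("Rows inserted: "
        + ((Integer) globalMap.get("tMysqlOutput_1_NB_LINE_INSERTED")));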

Loading Contexts

‘Context Groups’ support highly reusable job design, yet there are still times when we want even more flexibility.  For example, suppose you want to maintain the context variables’ default values externally.  Sometimes having them stored in a file or even a database makes more sense.  Having the ability to maintain their values externally can prove quite effective and can even address some security concerns.  This is where the tContextLoad component comes in.

The example above shows a simple way to design your job to initialize context variables at runtime.  The external file being loaded contains comma-delimited key-value pairs which, as they are read in, override the current values of the corresponding context variables within the job.  In this case, the database connection details are loaded to ensure a desired connection.  Notice that you have some control over error handling here and, in fact, this presents one of the few places where a job can programmatically exit immediately: ‘Die on Error’.  Of course, the tContextLoad component can use a database query just as easily, and I know of several customers who do just that.
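As a minimal sketch, assuming hypothetical connection variables and a comma configured as the field separator on the file input feeding the tContextLoad, the external file might contain lines like these:

    db_host,proddb01.example.com
    db_port,3306
    db_name,sales_dw
    db_user,etl_user

Each key should match a context variable already defined in the job; tContextLoad can warn you about keys that do not.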

There is a corresponding tContextDump component available, which will write out the current context variable values to a file or database.  This can be useful when crafting highly adaptable job designs.

Using Dynamic Schemas

Frequently I am asked how to build jobs that cope with dynamic schemas.  In reality, this is a loaded question, as there are various use cases where dealing with dynamic schemas occurs.  The most common seems to be when you have many tables whose data you want to move to another corresponding set of tables, perhaps in a different database system (say from Oracle to MS SQL Server).  Creating a job to move this data is straightforward, yet almost immediately we conclude that building one job for each table is not practical.  What if there are hundreds of tables?  Can we not simply build a single job that can handle ALL the tables?  Unfortunately, this remains a limitation in Talend.  However, do not be dismayed: we can do it with TWO jobs, one to dump the data and one to load the data.  Acceptable?

Here is my sample job.  It establishes three connections: the first two retrieve the TABLE and COLUMN lists, and the third retrieves the actual data.  By iterating over each table and saving its columns, I can read and write data to a positional flat file (the DUMP process) using the tSetDynamicSchema component.  A similar job would do the same thing, except that the third connection would read the positional file and write to the target data store (the LOAD process).

In this scenario, developers must understand a little bit about the inner workings of their host database.  Most systems like Oracle, MS SQL Server, and MySQL have system tables, often called an ‘Information Schema’, which contain object metadata about a database, including tables and their columns.  Here is a query that extracts a complete table/column list from my TAC v6.1 MySQL database (do you like my preferred SQL syntax formatting?):
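A minimal sketch of the kind of query involved, written here as the Java String a tMysqlInput ‘Query’ field would hold (the database name ‘tac’ and the exact column list are assumptions), might be:

    // Retrieves every table and its columns from MySQL's information_schema;
    // adjust TABLE_SCHEMA to your own database name.
    "SELECT t.TABLE_NAME, c.COLUMN_NAME, c.ORDINAL_POSITION, " +
    "       c.DATA_TYPE, c.CHARACTER_MAXIMUM_LENGTH, c.IS_NULLABLE " +
    "  FROM information_schema.TABLES t " +
    "  JOIN information_schema.COLUMNS c " +
    "    ON c.TABLE_SCHEMA = t.TABLE_SCHEMA " +
    "   AND c.TABLE_NAME   = t.TABLE_NAME " +
    " WHERE t.TABLE_SCHEMA = 'tac' " +
    "   AND t.TABLE_TYPE   = 'BASE TABLE' " +
    " ORDER BY t.TABLE_NAME, c.ORDINAL_POSITION"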

Be sure to use connection credentials having ‘SELECT’ permissions to this usually protected database.

Notice my use of the tJavaFlex component to iterate through the table names found.  I save each ‘Table Name’ and establish a ‘Control Break’ flag, then for each table found I retrieve its sorted column list.  After adjusting for any nulls in column lengths, the saved ‘Dynamic Schema’ is complete.  The conditional ‘IF’ checks the ‘Control Break’ flag when the table name changes and begins the dump process for the current table.  Voilà!
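A minimal sketch of that control-break idea (not the actual job code; the incoming column row2.TABLE_NAME and the globalMap keys are assumptions) could sit in a tJavaFlex like this:

    // Start Code: runs once, before the row loop.
    String previousTable = "";

    // Main Code: runs for every incoming row.
    if (!row2.TABLE_NAME.equals(previousTable)) {
        globalMap.put("controlBreak", Boolean.TRUE);   // table changed - flag a break
        globalMap.put("currentTable", row2.TABLE_NAME);
        previousTable = row2.TABLE_NAME;
    } else {
        globalMap.put("controlBreak", Boolean.FALSE);
    }

    // A downstream 'Run if' trigger condition might then simply be:
    //   ((Boolean) globalMap.get("controlBreak"))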

Dynamic SQL Components

Dynamic code is awesome!  Talend provides several ways to implement it.  In the previous job design, I used a direct approach to retrieving table and column lists from a database.  Talend actually provides host-system-specific components that do the same thing.  These t{DB}TableList and t{DB}ColumnList components (where {DB} is replaced by the host component name) provide direct access to the ‘Information Schema’ metadata without having to actually know anything about it.  Using these components instead for the DUMP/LOAD process previously described could work just as well: but what is the fun in that?

Not all SQL queries need to retrieve or store data.  Sometimes other database operations are required.  Enlist the t{DB}Row and t{DB}SP components for these requirements.  The former allows you to execute almost any SQL query that does not return a result set, like ‘DROP TABLE’.  The latter allows you to execute a ‘Stored Procedure’.
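For instance, a minimal sketch of a t{DB}Row ‘Query’ field (again, a Java String) might look like this, where context.staging_table is a hypothetical context variable invented for this example:

    // Drops a staging table before a reload; the table name comes from a
    // hypothetical context variable.
    "DROP TABLE IF EXISTS " + context.staging_table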

Last but not least is the t{DB}LastInsertId component which retrieves the most recently inserted ‘ID’ from a database output component; very useful on occasion.

CDC

Another common question is: does Talend support CDC, or ‘Change Data Capture’?  The answer is YES, of course: through ‘Publish/Subscribe’ mechanisms tied directly to the host database system involved.  It is important to note that not all database systems support CDC.  Here is the definitive ‘current’ list for CDC support in Talend jobs:

There are three CDC modes available, which include:

  • Trigger (default) - Uses DB Host Triggers that track Inserts, Updates, & Deletes
  • Redo/Archive Log - Used with Oracle 11g and earlier versions only
  • XStream - Used with Oracle 12 and OCI only

Since the ‘Trigger’ mode is the most likely you will use, let us peek at its architecture:

The Talend User Guide, Chapter 11 provides a comprehensive discussion on the CDC process, configuration, and usage within the Studio and in coordination with your host database system.  While quite straightforward conceptually, there is some considerable setup required.  Fully understand your requirements, CDC modes, and job design parameters up front and document them well in your Developer Guidelines!

Once established, the CDC environment provides a robust mechanism for keeping downstream targets (usually a data warehouse) up to date.  Use the t{DB}CDC components within your Talend jobs to extract data that has changed since the last extraction.  While CDC takes time and diligence to configure and operationalize, it is a very useful feature!

Custom Components

While Talend now provides well over 1,000 components on the palette, there are still many reasons to build your own.  Talend developers often encapsulate specialized functionality within a custom component.  Some have built and productized their components, while others publish them with free access on the recently modernized Talend Exchange.  When a component is not available on the Talend Palette, search there instead; you may find exactly what you need.  A ‘Talend Forge’ account is required, but you have probably already created one.

To start, ensure that the directory where custom components will be stored is set properly.  Do this from the ‘Preferences’ menu and choose a common location all developers would use.  Click ‘Apply’, then ‘OK’.

Find the ‘Exchange’ link on the menu bar, which allows the selection and installation of components.  The first time you do this, check ‘Always run in Background’ and click the ‘Run in Background’ button, as it takes time to load up the extensive list of available objects.  From this list, you can ‘View/Download’ objects of interest.  After completing a component download, click on ‘Downloaded Extensions’ to actually install them for use in your Studio.  Once completed, the component will show as ‘Installed’ and will be available from the palette.

A component and its associated files, once installed, can be hard to find.  Look in two places:

  • {talend}/studio/plugins/org.talend.designer.components.exchange{v}
  • {talend}/studio/plugins/org.talend.designer.components.localprovider{v}

If you want to create a custom component yourself, switch to the ‘Component Designer’ perspective within the Studio.  Most custom components utilize ‘JavaJet’, which is the file extension for encapsulating Java code for the ‘Eclipse IDE’.  A decent tutorial on ‘How to create a custom component’ is available for beginners.  While a bit dated (circa 2013), it presents the basics of what you need to know.  There are third-party tutorials out there as well (some are listed in that tutorial).  Here is a good one: Talend by Example: Custom Components.  Also try Googling ‘Custom Components’ to find even more information on creating them.

JobScript API

Normally we use the ‘Designer’ to paint our Talend Job, which then generates the underlying Java code.  Have you ever wondered if a Talend Job can be generated automatically?  Well, there is a way!  Open up any of your jobs.  There are three tabs at the bottom of the canvas: DESIGNER, CODE, & JOBSCRIPT.  Hmm: that is interesting.  You have probably clicked on the CODE tab to inspect the generated Java code.  Have you clicked on the JOBSCRIPT tab?  If you did, were you aware of what you were looking at?  I bet not, for most of you.  This tab shows the script that represents the job design.  Take a closer look next time.  Do you see anything familiar as it pertains to your job design?  Sure, you do…

So what, you say!  Well, let me tell you what!  Suppose you create and maintain metadata somewhere about your job design and run it through a process engine (that you create), generating the properly formatted JobScript, perhaps adjusting key elements to create multiple permutations of the job.  Now that is interesting!

Look in the ‘Project Repository’ under the CODE>ROUTINES section to find the ‘Job Scripts’ folder.  Create a new JobScript (I called mine ‘test_JobScript’).  Open up any of your jobs and copy the JobScript tab contents, pasting it into the JobScript file and save.  Right click on the JobScript and choose ‘Generate Job’.  Now look in the ‘Job Designs’ folder and you will find a newly minted job.  Imagine what you can do now!  Nice!

Conclusion

Whew!  That about does it.  Not that there are no more best practices involved in creating and maintaining Talend job designs; I am sure there are many more.  Instead, let me leave that to a broader community dialogue for now and surmise that this collection (32 in all) offers a comprehensive breadth and depth for success-driven projects using Talend.

Look for my next blog in this series, where we’ll shift gears and discuss how to apply all these best practices to a conventional use case.  Applied technology, solid methodologies, and solutions that achieve results: these form the backbone of where to use these best practices and job design patterns.  Cheers!
