Talend “Job Design Patterns” and Best Practices

Talend developers everywhere, from beginners to the very experienced, often deal with the same question: “What is the best way for me to write this job?”  We know it should be efficient, easy to read, easy to write, and above all (in most cases), easy to maintain.  We also know that Talend Studio is a free-form ‘Canvas’ upon which we ‘Paint’ our code using a comprehensive and colorful palette of Components, Repository Objects, Metadata, and Linkage Options.  How then can we ever be sure that we’ve created a job design using best practices?

Job Design Patterns

Since version 3.4, when I started using Talend, job designs have been very important to me.  At first I did not think of patterns while developing my jobs; I had used Microsoft SSIS and other similar tools before, so a visual editor like Talend was not new to me.  Instead my focus centered on basic functionality, code reusability, then canvas layout, and finally naming conventions.  Today, after developing hundreds of Talend jobs for a variety of use cases, I find my code has become more refined, more reusable, more consistent, and yes, patterns have started to emerge.

After joining Talend in January this year, I’ve had many opportunities to review jobs developed by our customers.  It confirmed my perception that for every developer there are indeed multiple solutions for each use case.  This, I believe, compounds the problem for many of us.  We developers do think alike, but just as often we believe our way is the best, or the only, way to develop a particular job.  Inherently we also know, quietly haunting us upon our shoulder, whispering in our ear, that maybe, just maybe, there is a better way.  Hence we look or ask for best practices: in this case, Job Design Patterns!

Formulating the Basics

When I consider what is needed to achieve the best possible job code, fundamental precepts are always at work.  These come from years of experience making mistakes and improving upon success.  They represent important principles that create a solid foundation upon which to build code and should be (IMHO) taken very seriously; I believe them to include (in no particular order of importance):

- Readability: creating code that can be easily figured out and understood

- Writability: creating straightforward, simple, code in the least amount of time

- Maintainability: creating appropriate complexity with minimal impact from change

- Functionality: creating code that delivers on the requirements

- Reusability: creating sharable objects and atomic units of work

- Conformity: creating real discipline across teams, projects, repositories, and code

- Pliability: creating code that will bend but not break

- Scalability: creating elastic modules that adjust throughput on demand

- Consistency: creating commonality across everything

- Efficiency: creating optimized data flow and component utilization

- Compartmentation: creating atomic, focused modules that serve a single purpose

- Optimization: creating the most functionality with the least amount of code

- Performance: creating effective modules that provide the fastest throughput

Achieving a real balance across these precepts is the key, in particular the first three, as they are in constant contradiction with each other.  You can often get two while sacrificing the third.  Try ordering all of these by importance, if you can!

Guidelines NOT Standards ~ It’s about Discipline!

Before we can really dive into Job Design Patterns, and in conjunction with the basic precepts I’ve just illustrated, let’s make sure we understand some additional details that should be taken into account.  Often I find rigid standards in place which make no room for the unexpected situations that often poke holes into them.  I also find, far too often, the opposite: unyielding, unkempt, and incongruous code from different developers doing basically the same thing; or worse, developers propagating confusing clutters of disjointed, unplanned chaos.  Frankly, I find this sloppy and misguided, as it really does not take much effort to avoid.

For these and other fairly obvious reasons, I prefer first to craft and document ‘Guidelines’, not ‘Standards’.  These encompass the foundational precepts and attach specifics to them.  Once a ‘Development Guidelines’ document is created and adopted by all the teams involved in the SDLC (Software Development Life Cycle) process, the foundation supports structure, definition, and context.  Invest in this and, over the long term, you’ll get results that everyone will be happy with!

Here is a proposed outline that you may utilize for yours (feel free to change/expand on this; heck it’s only a guideline!).

  1. Methodologies which should detail HOW you want to build things
    1. Data Modeling
      1. Holistic / Conceptual / Logical / Physical
      2. Database, NoSQL, EDW, Files
    2. SDLC Process Controls
      1. Waterfall or Agile/Scrum
      2. Requirements & Specifications
    3. Error Handling & Auditing
    4. Data Governance & Stewardship
  2. Technologies which should list TOOLS (internal & external) and how they interrelate
    1. OS & Infrastructure Topology
    2. DB Management Systems
    3. NoSQL Systems
    4. Encryption & Compression
    5. 3rd Party Software Integration
    6. Web Service Interfaces
    7. External Systems Interfaces
  3. Best Practices which should describe WHAT & WHEN particular guidelines are to be followed
    1. Environments (DEV/QA/UAT/PROD)
    2. Naming Conventions
    3. Projects & Jobs & Joblets
    4. Repository Objects
    5. Logging, Monitoring & Notifications
    6. Job Return Codes
    7. Code (Java) Routines
    8. Context Groups & Global Variables
    9. Database & NoSQL Connections
    10. Source/Target Data & Files Schemas
    11. Job Entry & Exit Points
    12. Job Workflow & Layout
    13. Component Utilization
    14. Parallelization
    15. Data Quality
    16. Parent/Child Jobs & Joblets
    17. Data Exchange Protocols
    18. Continuous Integration & Deployment
      1. Integrated Source Code Control (SVN/GIT)
      2. Release Management & Versioning
      3. Automated Testing
      4. Artifact Repository & Promotion
    19. Administration & Operations
      1. Configuration
      2. User Security & Authorizations
      3. Roles & Permissions
      4. Project Management
      5. Job Tasks, Schedules, & Triggers
    20. Archives & Disaster Recovery

Some additional documents I think should be developed and maintained include:

- Module Library: describing all reusable projects, methods, objects, joblets, & context groups

- Data Dictionary: describing all data schemas & related stored procedures

- Data Access Layer: describing all things pertinent to connecting to and manipulating data

Sure, creating documentation like this takes time, but the value over its lifetime far outweighs the cost.  Keep it simple, direct, and up-to-date (it doesn’t need to be a manifesto), and it will make huge contributions to the success of every project that utilizes it by dramatically reducing development mistakes (which can prove to be even more expensive).

Can We Talk About Job Design Patterns Now?

Sure!  But first, one more thing.  It is my belief that every developer can develop both good and bad habits when writing code.  Building upon the good habits is vital.  Start out with some easy ones, like always giving every component a label.  This makes code more readable and understandable (one of our foundational precepts).  Once everyone is making a habit of that, ensure that all jobs are thoughtfully organized into repository folders with meaningful names that make sense for your projects (yes, conformity).  Then have everyone adopt the same style of logging messages, perhaps using a common method wrapper around the System.out.println() function, and establish a common entry/exit point criterion, with options for alternative requirements, for job code (both of these help realize several precepts at once).  Over time, as development teams adopt and utilize well-defined Development Guideline disciplines, project code becomes easier to read, to write, and (my favorite) to maintain by anyone on the team.
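As a sketch of that shared logging habit, here is a minimal Java code routine that wraps System.out.println() so every job logs in one consistent format.  The class and method names are my own illustrations, not part of any Talend library:

```java
// Hypothetical shared logging routine; names are illustrative, not a Talend API.
public class LogUtil {

    // Build one log line in a consistent, parseable layout:
    // timestamp | level | job name | message
    public static String format(String level, String jobName, String message) {
        String ts = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
                .format(new java.util.Date());
        return ts + " | " + level + " | " + jobName + " | " + message;
    }

    // Call this from a tJava component instead of raw System.out.println().
    public static void log(String level, String jobName, String message) {
        System.out.println(format(level, jobName, message));
    }
}
```

Dropped into a shared code routine, every team member then calls something like LogUtil.log("INFO", jobName, "lookup loaded") and the output stays uniform across jobs and projects.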

Job Design Patterns & Best Practices

For me, Talend Job Design Patterns present us with proposed template or skeleton layouts that involve essential and/or required elements focused on a particular use case.  I call them patterns because they can often be reused for similar jobs, thus jumpstarting the code development effort.  As you might expect, there are also common patterns that can be adopted across several different use cases which, when identified and implemented properly, strengthen the overall code base, condense effort, and reduce repetitive but similar code.  So, let’s start there.

Here are 7 Best Practices to consider:

Canvas Workflow & Layout

There are many ways to place components on the job canvas, and just as many ways to link them together.  My preference is to start fundamentally ‘top to bottom’, then work ‘left and right’, where a left-bound flow is generally an error path and a right- and/or downward-bound flow is the desired, or normal, path.  Avoiding link lines that cross over themselves wherever possible is good, and as of v6.0.1 the nicely curved link lines support this strategy quite well.

For me, I am uncomfortable with the ‘zig-zag’ pattern, where components are placed ‘left to right’ serially until they reach the rightmost edge boundary, then the next component drops down and starts back at the left edge for more of the same.  I think this pattern is awkward and can be harder to maintain, but I get it (it’s easy to write).  Use it if you must, but it may indicate that the job is doing more than it should, or is not organized properly.

Atomic Job Modules ~ Parent/Child Jobs

Big jobs with lots of components are, simply put, just hard to understand and maintain.  Avoid this by breaking them down into smaller jobs, or units of work, wherever possible.  Then execute them as child jobs from a parent job (using the tRunJob component) whose purpose includes their control and execution.  This also creates the opportunity to handle errors better and to control what happens next.  Remember: a cluttered job can be hard to understand, difficult to debug and fix, and almost impossible to maintain.  Simple, smaller jobs with a clear purpose make their intent jump off the canvas; they are almost always easy to debug and fix, and maintenance is comparatively a breeze.

While it is perfectly acceptable to create nested Parent/Child job hierarchies, there are practical limitations to consider.  Depending upon job memory utilization, passed parameters, test/debug concerns, and parallelization techniques (described below), a good job design pattern should not exceed 3 nested levels of tRunJob Parent/Child calls.  While it is safe to perhaps go deeper, I think that, with good reason, 5 levels should be more than enough for any use case.

tRunJob vs Joblets

The simple difference between a child job and a joblet is that a child job is ‘Called’ from your job while a joblet is ‘Included’ in it.  Both offer the opportunity to create reusable and/or generic code modules.  A highly effective strategy in any Job Design Pattern is to incorporate their use properly.

Entry & Exit Points

All Talend Jobs need to start and end somewhere.  Talend provides two basic components, tPreJob and tPostJob, whose purpose is to help control what happens before and after the content of a job executes.  I think of these as the ‘Initialize’ and ‘WrapUp’ steps in my code.  They behave as you might expect: the tPreJob executes first, then the real code executes, and finally the tPostJob code executes.  Note that the tPostJob code will execute regardless of whether any devised exit within the code body (like a tDie component, or a component checkbox option to ‘die on error’) is encountered.

Using the tWarn and tDie components should also be part of your consideration for job entry and exit points.  These components provide programmable control over where and how a job should complete.  It also supports improved error handling, logging, and recovery opportunities.
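A ‘Return Codes’ guideline works best when it lives in one small code routine that every tDie and tWarn component references, so the numbers never drift between jobs.  This sketch uses codes and names I’ve invented for illustration; they are not any Talend standard:

```java
// Illustrative shared 'Return Codes' convention; values and names are assumptions.
public class ReturnCodes {
    public static final int SUCCESS        = 0;   // normal completion
    public static final int GENERAL_ERROR  = 1;   // unclassified failure
    public static final int BAD_PARAMETER  = 10;  // invalid context variable
    public static final int CONNECT_FAILED = 20;  // could not reach a source/target
    public static final int DATA_REJECTED  = 30;  // rows failed validation

    // Map a code back to a label for log messages and notifications.
    public static String label(int code) {
        switch (code) {
            case SUCCESS:        return "SUCCESS";
            case BAD_PARAMETER:  return "BAD_PARAMETER";
            case CONNECT_FAILED: return "CONNECT_FAILED";
            case DATA_REJECTED:  return "DATA_REJECTED";
            default:             return "GENERAL_ERROR";
        }
    }
}
```

A tDie component can then pass ReturnCodes.CONNECT_FAILED as its exit code, and any operations script or parent job reading that code can translate it with ReturnCodes.label().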

One thing I like to do for this Job Design pattern is to use the tPreJob to initialize context variables, establish connections, and log important information.  For the tPostJob: closing connections and other important cleanup and more logging.  Fairly straight forward, right?  Do you do this?

Error Handling & Logging

This is very important, perhaps critical, and if you create a common job design pattern properly, a highly reusable mechanism can be established across almost all your projects.  My job pattern is to create a ‘logPROCESSING’ joblet, a consistent, maintainable logging processor that can be included in any job, PLUS well-defined ‘Return Codes’ that offer conformity, reusability, and high efficiency.  Plus it was easy to write, is easy to read, and yes, quite easy to maintain.  I believe that once you’ve developed ‘your way’ of handling and logging errors across your project jobs, there will be a smile on your face a mile wide.  Adapt and Adopt!

Recent versions of Talend have added support for the use of Log4j and a Log Server.  Simply enable the Project Settings>Log4j menu option and configure the Log Stash server in the TAC.  Incorporating this basic functionality into your jobs is definitely a Good Practice!

OnSubJobOK/ERROR vs OnComponentOK/ERROR (& Run If) Component Links

It can sometimes be a bit confusing to any Talend developer what the differences are between the ‘On SubJob’ and ‘On Component’ links.  The ‘OK’ versus ‘ERROR’ part is obvious.  So how do these ‘Trigger Connections’ differ, and how do they affect a job design’s flow?

‘Trigger Connections’ between components define the processing sequence and data flow where dependencies exist between components within a subjob.  A subjob is characterized by a component having one or more components linked to it, dealing with the current data flow.  Multiple subjobs can exist within a single job; each is visualized by default with a blue highlighted box (which can be toggled on/off from the toolbar) around all the related subjob components.

An ‘On Subjob OK/ERROR’ trigger will continue the process at the next ‘linked’ subjob after all components within the current subjob have completed processing.  It should be used only from the starting component of the subjob.  An ‘On Component OK/ERROR’ trigger will continue the process at the next ‘linked’ component after that particular component has completed processing.  A ‘Run If’ trigger can be quite useful when continuation to the next ‘linked’ component depends on a programmable Java expression.
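To make the ‘Run If’ idea concrete, the expression typed on the link is just a Java boolean.  This sketch mimics a common condition, continuing only when an upstream component produced rows, by reading the job’s globalMap; the component name and key follow the usual <component>_<id>_NB_LINE pattern but are illustrative, not from a real job:

```java
import java.util.HashMap;
import java.util.Map;

public class RunIfDemo {
    // The same kind of boolean you would type on a 'Run If' link:
    // continue only if the upstream component produced at least one row.
    public static boolean hasRows(Map<String, Object> globalMap, String key) {
        Integer nbLine = (Integer) globalMap.get(key);
        return nbLine != null && nbLine > 0;
    }

    public static void main(String[] args) {
        // Simulate the globalMap a running job would populate.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileInputDelimited_1_NB_LINE", 42);
        System.out.println(hasRows(globalMap, "tFileInputDelimited_1_NB_LINE")
                ? "continue" : "skip");
    }
}
```

On the actual link you would type only the expression itself, e.g. ((Integer)globalMap.get("tFileInputDelimited_1_NB_LINE")) > 0, and the downstream component fires only when it evaluates true.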

What is a Job Loop?

Significant to almost every Job Design Pattern is the ‘Main Loop’ and any ‘Secondary Loops’ in the code.  These are the points where control over the potential exit of a job’s execution is made.  The ‘Main Loop’ is generally represented by the top-most processing of a data-flow result set; once it completes, the job is finished.  ‘Secondary Loops’ are nested within a higher-order loop and often require considerable control to ensure a job’s proper exit.  I always identify the ‘Main Loop’ and ensure that I add a tWarn and a tDie component to the controlling component.  The tDie is usually set to exit the JVM immediately (but note that even then the tPostJob code will execute).  These top-level exit points use a simple ‘0’ for success and ‘1’ for failure as return codes, but following your established ‘Return Codes’ guideline is best.  ‘Secondary Loops’ (and other critical components in the flow) are great places to incorporate additional tWarn and tDie components (where the tDie is NOT set to exit the JVM immediately).

Most of the Job Design Pattern best practices discussed above are illustrated below.  Notice that, while I’ve adopted useful component labels, even I’ve bent the rules a bit on component placement.  Regardless, the result is a highly readable, maintainable job that was fairly easy to write.

Conclusion

Well ~ I can’t say that all your questions about Job Design Patterns have been answered here; probably not in fact.  But it’s a start!  We’ve covered some fundamentals and proffered a direction and end game.  Hopefully it has been useful and provokes some insightful considerations for you, my gentle reader.

Clearly I’ll need to write another Blog (or perhaps a few) on this topic to cover everything.  The next one will focus on some valuable advanced topics and several Use Cases that we all are likely to encounter in some form.  Additionally the Customer Success Architecture team is working on some sample Talend code to support these use cases.  These will be available in the Talend Help Center for subscribed customers fairly soon.  Stay on the lookout for them.

 


Comments

Jan Lolling
Some things should be added for a successful data ware house project: 1. DI jobs must be restartable. Especially in case of errors the job MUST do all its work within transactions. 2. DI jobs should process the data within chunks. It is always a good design to have a job (lets call him steering-job) which collects the data ranges to process and so called worker jobs which do the processing for one chunk. 3. Performance is less worth than predictability! It is much more important to be able to get a clear prediction based on the current progress instead of doing everything in a huge one-step without the possibility to monitor the current progress. 4. DI jobs should be designed to cause a more constant but lower load on the database instead of causing load peaks.
Dale Anderson
@viralmpatel - thank you so much for your kind words. I sincerely believe that this is an important topic for all our Talend developers. Stay tuned for my next blog in this series which will be released soon. d;)
viralmpatel
@Dale Anderson. Great blog. Absolutely loved it. I will be waiting to read your next series of posts on same track. @Dezzsoke I totally agree with you. I thoroughly enjoyed using Generic loader job :)
Dale Anderson
@Dezzoke, Ah ha - you are getting out in front of me! d;) Context Groups, Log4j, Dynamic SQL, and Parallelization will all be covered in the next post on the series. Watch for it!
Dale Anderson
@BretDeveloper Thanks for reading my blog. I am very happy you found this useful. I believe this is a key topic for expanded discussion. As a CSA here at Talend, I hope all of our customers get value from this series, and contribute to the discussion here in the comments section!
BretDeveloper
I greatly appreciate this blog topic and additional 'best practice' comments and look for more in the same vein. This is a key stepping stone toward establishing confidence with using these tools. Thanks again.
Dezzsoke
Another best practice that I really really miss from here is the usage of contexts. Contexts are awesome, you can control the behavior of your jobs easily through contexts. Each and every database connection should use contexts. And depending on your use case you should create multiple contexts, one for your DEV environment another one for your PROD system. And if you have QA systems then create one for those as well. In our generic data loader job (which is just 3 level deep) we have like 40 context variables, and through that we're able to move over tens of thousands of tables just by triggering the job. The built in Database logging and XML like tWarn messages help us reconciling / keeping track of the job. And when it comes to debugging Log4J comes into the view. Log4j is one of the features why It worth the upgrade to 5.6 or 6.0 It makes developers life much more easier. Your job generates a dynamic SQL and upon execution throw an error. No problems: you can check what was the submitted query through log4j. I don't really get why would anyone develop a custom logging feature when it is possible to use the built in logging feature? By using tWarns and embedding different messages with different codes you can achieve anything that you want. Also another thing I really like is the variables such as: projectName, jobName. currentComponent We use these frequently when we have to create files / folders. For example: "/data/files/" + projectName + "/" + jobName + ".csv" Easy to understand, easy to distinguish.
