
Successful Methodologies with Talend – Part 2

  • Dale Anderson
    Dale Anderson is a Customer Success Architect at Talend. Over a 30 year career, Mr. Anderson has gained extensive experience in a range of disciplines including systems architecture, software development, quality assurance, and product management and honed his skills in database design, modeling, and implementation, as well as data warehousing and business intelligence. A keen advocate for an appropriate temperance of the software workflow process, he has applied his innovations to the often overlooked lifecycle of a database. The Database Development Lifecycle Mr. Anderson has developed incorporates data modeling best practices to address the complexities involved in modeling adaptations, multi-environment fulfilments, fresh installations, schema upgrades, and data migrations for any database implementation.
  • April 16, 2018

Let’s continue our conversation about Job Design Patterns with Talend.  In my last blog, Successful Methodologies with Talend, I discussed the importance of a well-balanced approach:

  • Having a clearly defined ‘Use Case’
  • Incorporating the proper ‘Technology’
  • Defining the way everyone works together with a ‘Methodology’ of choice (Waterfall, Agile, JEP+, or maybe some hybrid of these)

Your Talend projects should be no exception.  Also, when you follow and adopt Best Practices (many are discussed in my previous blogs), you dramatically increase the opportunity for successful Talend Job Designs.  This leads to successful Talend Projects, and joyful cheers!

With all these fundamentals in place, it seems like a good time for me to elaborate on Job Design Patterns themselves.  Brief descriptions of several common patterns are listed in Part 1 of this series.  Now, let’s take a deeper dive.  First, however, to augment our discussion on the ‘Just-Enough-Process’ methodology, I want to reinforce the importance of team communication.  An Agile team (a Scrum team) comprises several players:

  • A Scrum-Master
  • Stakeholders
  • Business Analysts
  • Developers
  • Testers

A successful ‘Sprint’ follows the Agile process with defined milestones and well-defined tasks, communicated at the Scrum level using tracking tools like Jira from Atlassian.  I’ll assume you know some basics about the Agile Methodology; in case you want to learn more, here are some good links:

  • Agile Modeling
  • The Agile Unified Process
  • Disciplined Agile Delivery


JEP+ Communication Channels

To understand how effective communication can propel the software development life cycle, JEP+ defines an integrated approach.  Let’s walk through the ideas behind the diagram.  The Software Quality Assurance (SQA) team provides the agnostic hub of any project sprint conversation.  This is effective because the SQA team’s main purpose is to test and measure results.  As the communication hub, the SQA team can effectively, and objectively, become the epicenter of all Scrum communications.  This has worked very well for me on small and large projects.


As shown, all key milestones and deliverables are managed effectively.  I like this stratagem, not because I defined it, but because it makes sense.  Adopt this style of communication across your Scrum team, using tools of your choice, and it will likely increase your team's knowledge and understanding across any software development project, Talend or otherwise.  Let me know if you want to learn more about JEP+; maybe another blog?

Talend Job Design Patterns

OK, so let’s get into Talend Job Design Patterns.  My previous blog suggested that of the many elements in a successful approach or methodology, for Talend developers, one key element is Job Design Patterns.  What do I mean by that?  Is it a template-based approach to creating jobs?  Well, yes, sort of!  Is it a skeleton, or jumpstart job?  Yeah, that too!  Yet, for me, it is more about the business use case that defines the significant criteria.

Ask yourself: what is the job’s purpose?  What does it need to do?  From there you can devise a proper approach (or pattern) for the job’s design.  Since there are many common use cases, several patterns have emerged for me, where the design depends greatly upon the result I seek.  Unlike some other ETL tools available, Talend integrates both the process and data flow into a single job.  This allows us to take advantage of building reusable code, resulting in sophisticated and pliable jobs.  Creating reusable code is therefore about the orchestration of intelligent code modules.

It is entirely possible, of course, that job design patterns will vary greatly from one use case to another.  This reality should force us to think carefully about job designs and how we should build them.  It should also promote consideration of what can be built as common code modules, reusable across different design patterns.  These can get a bit involved, so let’s examine them individually.  We’ll start with some modest ones:

LIFT-N-SHIFT: Job Design Pattern #1

This job design pattern is perhaps the easiest to understand.  Extract the data from one system and then directly place a copy of the data into another.  Few (if any) transformations are involved; it’s simply a 1:1 mapping of source to target data stores.  Examples of possible transformations may include a data type change, a column length variation, or perhaps adding or removing an operative column or two.  Still, the idea of a ‘Lift-n-Shift’ is to copy data from one storage system to another, quickly.  Best practice asserts that tPreJob and tPostJob components, appropriate use of tWarn and tDie components, and a common exception-handling ‘Joblet’ are highly desirable.
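
Before looking at the canvas, it may help to see the essence of this pattern outside Talend.  The following is a minimal, conceptual sketch in plain Java/JDBC; the connection URLs, credentials, table, and columns are hypothetical, and in an actual Talend job the input/output components, tPreJob/tPostJob, and the exception ‘Joblet’ provide the equivalent plumbing for you.

```java
import java.sql.*;

// A minimal, conceptual 'Lift-n-Shift' in plain JDBC (not Talend-generated code).
// The connection URLs, credentials, table, and columns are hypothetical.
public class LiftAndShift {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection("jdbc:postgresql://source-host/db", "user", "pass");
             Connection tgt = DriverManager.getConnection("jdbc:postgresql://target-host/db", "user", "pass");
             Statement read = src.createStatement();
             ResultSet rs = read.executeQuery("SELECT id, name, created_at FROM customers");
             PreparedStatement write = tgt.prepareStatement(
                     "INSERT INTO customers (id, name, created_at) VALUES (?, ?, ?)")) {

            tgt.setAutoCommit(false);              // commit in batches, much like a commit-interval setting
            int rows = 0;
            while (rs.next()) {
                write.setLong(1, rs.getLong("id"));                  // straight 1:1 column mapping
                write.setString(2, rs.getString("name"));
                write.setTimestamp(3, rs.getTimestamp("created_at"));
                write.addBatch();
                if (++rows % 1000 == 0) {          // flush every 1,000 rows
                    write.executeBatch();
                    tgt.commit();
                }
            }
            write.executeBatch();                  // final flush and commit (the tPostJob moment)
            tgt.commit();
        }
    }
}
```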

Here is what a simple ‘Lift-n-Shift’ job design pattern may look like:


Let’s go through some significant elements of this job design pattern:

  • The layout follows a Top-to-Bottom / Left-to-Right flow: Best Practice #1
  • Use of a common ‘Joblet’ for exception handling: Best Practice #3
  • Entry/Exit Points are well defined: Best Practice #4
    • tPreJob and tPostJob components are in place to initialize and wrap up the job design
    • tDie and tWarn components are used effectively (not excessively)
    • Also, notice the green highlighted components; these entry points must be understood
  • Exception handling is incorporated: Best Practice #5
  • The process completes all selected data and captures rejected rows: Best Practice #6
  • Finally, the Main-Loop is identified as the main sub-job: Best Practice #7

If you haven’t read my blog series on Job Design Patterns & Best Practices, you should.  These callouts will make more sense!

It is also fair to say that this job design pattern may become more complex depending upon the business use case and the technology stack in place.  Consider using a Parent/Child orchestration job design (Best Practice #2).  In most cases, your job design patterns can keep this orchestration to a minimum, instead using the technique I describe for the tSetDynamicSchema component (Best Practice #29).  This approach, which draws on the source information schema details, may even address some of the limited transformations (i.e., data type and size) required.  Keep it simple; make it smart!
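
To give a feel for what a dynamic schema buys you, here is a conceptual Java sketch, not tSetDynamicSchema’s actual generated code: the column list is derived from the source metadata at runtime, so one generic flow can service many tables.  The table name and connections are hypothetical.

```java
import java.sql.*;

// Conceptual only: derive the schema at runtime from the source, then build a matching
// INSERT for the target - the spirit of a dynamic-schema job design.
public class DynamicCopy {
    static void copyTable(Connection src, Connection tgt, String table) throws SQLException {
        try (Statement st = src.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM " + table)) {

            ResultSetMetaData md = rs.getMetaData();
            int cols = md.getColumnCount();

            // Build "INSERT INTO table (c1, c2, ...) VALUES (?, ?, ...)" from the metadata
            StringBuilder names = new StringBuilder();
            StringBuilder marks = new StringBuilder();
            for (int i = 1; i <= cols; i++) {
                if (i > 1) { names.append(", "); marks.append(", "); }
                names.append(md.getColumnName(i));
                marks.append("?");
            }
            String insert = "INSERT INTO " + table + " (" + names + ") VALUES (" + marks + ")";

            try (PreparedStatement ps = tgt.prepareStatement(insert)) {
                while (rs.next()) {
                    for (int i = 1; i <= cols; i++) {
                        ps.setObject(i, rs.getObject(i));  // JDBC handles simple type/size adaptations
                    }
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }
}
```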

MIGRATION: Job Design Pattern #2

A ‘Migration’ job design pattern is a lot like the ‘Lift-n-Shift’ pattern with one important difference: Transformations!  Essentially, when you are copying and/or moving data from source systems to target data stores, significant changes to that data may be required, thus becoming a migration process.  The source and target systems may be completely different (i.e., migrating an Oracle database to MS SQL Server).  Therefore the job design pattern expands upon the ‘Lift-n-Shift’ and must accommodate some or many of the critical steps involved in converting the data.  This may include splitting or combining tables, and accounting for differences between the systems in data types and/or features (new or obsolete).  Plus, you may need to apply specific business rules to the process and data flow in order to achieve the full migration effect.

Here is what a simple ‘Migration’ job design pattern may look like:

Take notice of the same elements and best practices from the ‘Lift-n-Shift’ job design plus some important additions:

  • I’ve created an additional, separate DB connection for the lookup!
    • This is NOT optional; you can’t share the SELECT or INSERT connections simultaneously
    • You may, optionally, define the connection directly in the lookup component instead
    • When multiple lookups are involved I prefer the direct connection strategy
  • The tMap component opts for the correct Lookup Model: Best Practice #25
    • Use the default ‘Load Once’ model for best performance and when there is enough memory on the Job Server to execute the lookup of all rows in the lookup table
    • Use the ‘Reload at Each Row’ model to eliminate any memory limitations, and/or when the lookup table is very large, and/or when not all records are expected to be looked up (a conceptual sketch of both models follows this list)
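
As referenced above, here is a conceptual Java sketch of the trade-off between the two lookup models.  The lookup table and column names are hypothetical, and tMap generates its own equivalent code inside the job; this is only meant to illustrate memory versus round trips.

```java
import java.sql.*;
import java.util.HashMap;
import java.util.Map;

// Conceptual comparison of the two tMap lookup models (illustrative only).
public class LookupModels {

    // 'Load Once': read the whole lookup table into memory, then probe it per row.
    // Fast, but the entire lookup table must fit in the Job Server's heap.
    static Map<String, String> loadOnce(Connection lookupConn) throws SQLException {
        Map<String, String> cache = new HashMap<>();
        try (Statement st = lookupConn.createStatement();
             ResultSet rs = st.executeQuery("SELECT cust_id, region FROM region_lookup")) {
            while (rs.next()) {
                cache.put(rs.getString("cust_id"), rs.getString("region"));
            }
        }
        return cache;  // the main flow then calls cache.get(custId) for each incoming row
    }

    // 'Reload at Each Row': issue one targeted query per incoming row.
    // No memory pressure, and useful when only a fraction of the lookup rows will ever be hit.
    static String reloadAtEachRow(Connection lookupConn, String custId) throws SQLException {
        try (PreparedStatement ps = lookupConn.prepareStatement(
                "SELECT region FROM region_lookup WHERE cust_id = ?")) {
            ps.setString(1, custId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("region") : null;
            }
        }
    }
}
```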

It is reasonable to expect that ‘Migration’ job designs may become quite complex.  Again, the Parent/Child orchestration approach may prove invaluable.  The tSetDynamicSchema approach may dramatically reduce the overall number of jobs required.  Remember also that parallelism techniques may be needed for a migration job design.  Review Best Practice #16 for more on those options.

COMMAND LINE: Job Design Pattern #3

The ‘Command Line’ job design pattern is very different.  The idea here is that the job works like a command-line executable whose parameters control the job’s behavior.  This can be very helpful in many ways, most of which will be highlighted in subsequent blogs from this series.  Think of the parent job as being a command.  This parent job validates argument values and determines what the next steps are.

In the job design pattern below, we can see that the tPreJob component parses the arguments for required values and exits when they are missing.  That’s all.  The main body of the job then checks for specific argument values and determines the process flow.  Using the ‘RunIF’ trigger, we can control the execution of a particular child job.  Clearly, you might need to pass these arguments down to the child jobs, where they can incorporate additional validation and execution controls (see Best Practice #2).
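
A ‘RunIF’ trigger evaluates a Java boolean expression, typically against context variables supplied at launch.  The sketch below is a plain-Java stand-in for that flow, not the Talend-generated job: the argument names runDate and loadCustomers are hypothetical, and it simply illustrates the required-argument check, the conditional branch into a child process, and a return-code test along the lines of Best Practice #5.

```java
// A plain-Java stand-in for the 'Command Line' orchestration flow (illustrative only).
// The argument names runDate and loadCustomers are hypothetical; in Talend they would be
// context variables, and the 'RunIF' triggers would evaluate equivalent boolean expressions.
public class CommandLineOrchestrator {
    public static void main(String[] args) {
        String runDate       = System.getProperty("runDate");        // e.g. -DrunDate=20180416
        String loadCustomers = System.getProperty("loadCustomers");  // e.g. -DloadCustomers=true

        // tPreJob + tDie equivalent: exit immediately when a required argument is missing
        if (runDate == null || runDate.trim().isEmpty()) {
            System.err.println("Missing required argument: runDate");
            System.exit(1);
        }

        // Job-body 'RunIF' equivalent: branch into the child process only when requested
        if ("true".equalsIgnoreCase(loadCustomers)) {
            int childReturnCode = runChildJob();  // a tRunJob in the real job design
            // 'RunIF' on the return code (Best Practice #5): flag the failure like a tDie,
            // but do not exit, so the tPostJob equivalent below still runs
            if (childReturnCode > 5000) {
                System.err.println("Child job failed with return code " + childReturnCode);
            }
        }

        // tPostJob equivalent: wrap things up (close connections, log completion)
        System.out.println("Orchestration complete");
    }

    private static int runChildJob() {
        return 0;  // placeholder for the real child job invocation
    }
}
```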

Here is what a simple ‘Command Line’ job design pattern may look like:

There are several critical elements of this job design pattern:

  • There are no database connections in this top-level orchestration job
    • Instead, the tPreJob checks to see if any arguments have been passed in
    • You may validate values here, but I think that should be handled in the job body
  • The job body checks the arguments individually using the ‘RunIF’ trigger, branching the job flow
  • The ‘RunIF’ check in the tPreJob triggers a tDie component and exits the job directly
    • Why continue if required argument values are missing?
  • The ‘RunIF’ check on the tJava component triggers a tDie component but does not exit the job
    • This allows the tPostJob component to wrap things up properly
  • The ‘RunIF’ checks on the tRunJob components trigger only if the return code is greater than 5000 (see Best Practice #5: Return Codes), but do not exit the job either

In a real-world ‘Command Line’ use case, a considerable intelligence factor can be incorporated into the overall job design, Parent/Child orchestration, and exception handling.  A powerful approach!

DUMP-LOAD: Job Design Pattern #4

The ‘Dump-Load’ job design pattern is a two-step strategy. It’s not too different from a ‘Lift-n-Shift’ and/or ‘Migration’ job design. This approach focuses on the efficient transfer of data from a source to a target data store.  It works quite well on large data sets, where replacing SELECT/INSERT queries with the write/read of flat files using a ‘BULK/INSERT’ process is likely a faster option.


Take notice of several critical elements for the 1st part of this job design pattern:

  • A single database connection is used for reading a CONTROL table
    • This is a very effective design allowing for the execution of the job based upon externally managed ‘metadata’ records
    • A CONTROL record would contain a ‘Source’, ‘RunDate’, ‘Status’, and other useful process/data state values
      • The ‘Status’ values might be: ‘ready to dump’ and ‘dumped’
      • It might even include the actual SQL query needed for the extract
      • It may also incorporate a range condition for limiting the number of records
      • This allows external modification of the extraction code without modifying the job directly
    • Key variables are initialized to craft a meaningful, unique ‘dump’ file name
      • I like the format {drv:}/{path}/YYYYMMDD_{tablename}.dmp (a hypothetical sketch follows this list)
    • With this job design pattern, it is possible to control multiple SOURCE data stores in the same execution
      • The main body of the job design will read from the CONTROL table for any source ready to process
      • Then using a tMap, separated data flows can handle different output formats
    • A common ‘Joblet’ updates the CONTROL record values of the current data process
      • The ‘Joblet’ will perform an appropriate UPDATE and manage its own database connection
        • Setting the ‘Run Date’, current ‘Status’, and ‘Dump File Name’
      • I have also used an ‘in process’ status to help with exceptions that may occur
        • If you choose to set the 1st state to ‘in process’, then an additional use of the ‘Joblet’ is required after the SELECT query has processed successfully, to update the status to ‘dumped’ for that particular CONTROL record
        • In this case, external queries of the CONTROL table will show which SOURCE failed, as the status will remain ‘in process’ after the job completes its execution
      • Whatever works best for your use case: it’s a choice of requisite sophistication
      • Note that this job design allows ‘at-will re-execution’
    • The actual ‘READ’ or extract then occurs and the resulting data is written to a flat file
      • The extraction component creates its own DB connection directly
        • You can choose to create the connection 1st and reuse it if preferable
      • This output file is likely a delimited CSV file
        • You have many options
      • Once all the CONTROL records are processed, the job completes using the ‘tPostJob’, closing the database connection and logging its successful completion
      • As the ‘Dump’ process is decoupled from the ‘Load’ process, it can be run multiple times before loading any dumped files
        • I like this, as anytime you can decouple the data process you introduce flexibility
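
To make the CONTROL-table mechanics concrete, here is a hypothetical Java sketch.  The table name, column names, status values, and file name format are illustrative only and would be shaped by your own metadata design; in the job itself the common ‘Joblet’ performs the equivalent UPDATE.

```java
import java.sql.*;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical CONTROL-table helpers for the 'Dump' step (illustrative names and columns).
public class DumpControl {

    // Build the unique dump file name, e.g. D:/dumps/20180416_customers.dmp
    static String dumpFileName(String drive, String path, String tableName) {
        String yyyymmdd = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE);
        return drive + "/" + path + "/" + yyyymmdd + "_" + tableName + ".dmp";
    }

    // What the common 'Joblet' does conceptually: stamp the CONTROL record for this source.
    static void markDumped(Connection ctrl, String source, String dumpFile) throws SQLException {
        String sql = "UPDATE control_table "
                   + "   SET status = 'dumped', run_date = CURRENT_DATE, dump_file = ? "
                   + " WHERE source = ? AND status IN ('ready to dump', 'in process')";
        try (PreparedStatement ps = ctrl.prepareStatement(sql)) {
            ps.setString(1, dumpFile);
            ps.setString(2, source);
            ps.executeUpdate();
        }
    }
}
```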

Let’s take notice of several critical elements for the 2nd part of this job design pattern:

  • Two database connections are used
    • One for reading the CONTROL table and one for writing to the TARGET data store
  • The same key variables are initialized to set up the unique ‘dump’ file name to be read
    • This may actually be stored in the CONTROL record, yet you still may need to initialize some things
  • This step of the job design pattern controls multiple TARGET data stores within the same execution
    • The main body of the job design will read the CONTROL table for any dump files ready to process
    • Their status will be ‘dumped’ and other values can be retrieved for processing like the actual data file name
    • Then using a tMap, separated data flows can handle the different output formats
  • The same ‘Joblet’ is reused to update the CONTROL record values of the current data process
    • This time the ‘Joblet’ will again UPDATE the current record in process
      • Setting the ‘Run Date’ and current ‘Status’: ‘loaded’
    • Note that this job design also allows ‘at-will re-execution’
  • The actual ‘BULK/INSERT’ then occurs and the data file is written to the TARGET table (see the sketch after this list)
    • The insertion component can create its own DB connection directly
      • I’ve created it in the ‘tPreJob’ flow
      • The trade-off is how you want the ‘Auto Commit’ setting
    • The data file being processed may also require further splitting based upon some business rules
      • These rules can be incorporated into the CONTROL record
      • A ‘tMap’ would handle the expression to split the output flows
      • And as you may guess, you might need to incorporate a lookup before writing the final data
      • Beware, these additional features may determine if you can actually use the host db Bulk/Insert
    • Finally, the process either saves the processed data or captures rejected rows
    • Again, once all the CONTROL records are processed, the job completes using the ‘tPostJob’, closing the database connections and logging its successful completion
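
As a final illustration of the ‘Load’ side, a bulk insert typically hands the dumped flat file to the database engine instead of streaming row-by-row INSERTs.  The sketch below uses MySQL’s LOAD DATA statement purely as an example; the exact statement (and whether LOCAL is even permitted) depends on your target database, its Talend bulk components, and the extra splitting and lookup rules noted above.

```java
import java.sql.*;

// Illustrative 'Load' step: pass the dumped CSV to the database's bulk loader.
// MySQL's LOAD DATA is shown only as an example; other databases use different bulk mechanisms.
public class BulkLoad {
    static void loadDumpFile(Connection tgt, String dumpFile, String targetTable) throws SQLException {
        String sql = "LOAD DATA LOCAL INFILE '" + dumpFile.replace("'", "''") + "' "
                   + "INTO TABLE " + targetTable + " "
                   + "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
                   + "LINES TERMINATED BY '\\n' "
                   + "IGNORE 1 LINES";               // skip the header row written by the dump step
        try (Statement st = tgt.createStatement()) {
            st.execute(sql);
        }
        // The same common 'Joblet' would then stamp the CONTROL record status as 'loaded'
    }
}
```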

Conclusion

This is just the beginning.  There will be more to follow, yet I hope this blog post gets you thinking about all the possibilities now.

Talend is a versatile technology and, coupled with sound methodologies, job design patterns, and best practices, it can deliver cost-effective, process-efficient, and highly productive data management solutions.  These SDLC practices and Job Design Patterns represent important segments in the implementation of successful methodologies.  In my next blog, I will augment these methodologies with additional segments you may find helpful, PLUS I will share more Talend Job Design Patterns!

Till next time…
