I recently started working at Talend as a Customer Success Architect. In this role, I help customers with architecture guidelines and best practices for managing their data strategies with Talend. Before joining Talend, I worked on several data warehouse implementations where Informatica PowerCenter was the ETL tool of choice. Any transition from one technology to another can be a big challenge. Instead of trying to “replicate” in Talend how things are done in PowerCenter, let us take a step back and understand how Talend works, what its capabilities are and how it differs from PowerCenter. In this blog, I will share my experience of switching from Informatica to a more modern integration platform, to help minimize your own migration effort from Informatica to Talend.
Talend vs. Informatica PowerCenter: What’s the Difference?
Both tools do essentially the same thing, moving data from a source to a target, but they go about it in different ways. Both approaches have their merits, and it is important to understand the pros and cons of each before designing your ETL job.
The first thing we need to understand is that even though both tools have a graphical user interface, and both extract data from sources, transform it and load it into a target, their implementations are different. Talend generates native Java code, which can run anywhere Java is supported. PowerCenter, on the other hand, generates metadata that is stored in an RDBMS repository and is read by its proprietary engine at run time.
Because Talend is a code generator, a Job can run either as ETL (on its own standalone server) or as ELT (natively on the target server). The Java code that Talend generates can run on any platform that supports Java: a server in your data center, the cloud or even your laptop.
While both platforms provide components that handle the majority of data integration tasks, there are situations where something custom is required. This usually means custom coding, which in my experience is an arduous and inefficient process in PowerCenter. In Talend, you can build your own custom components in Java and integrate them into the Studio. These are important considerations when you design your data integration job.
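To give a feel for the kind of custom Java logic mentioned above, here is a minimal sketch of a helper class with a static method, which is the general shape a small piece of reusable custom code tends to take. The class name `PhoneCleaner` and the method `digitsOnly` are hypothetical names of my own for illustration. They are not part of any Talend library, and this is not a full custom component.

```java
// Hypothetical helper for illustration only; not part of any Talend library.
class PhoneCleaner {
    private PhoneCleaner() {
        // Utility class: no instances needed.
    }

    /**
     * Keeps only the digits of a phone number.
     * Returns null when the input is null or contains no digits.
     */
    static String digitsOnly(String raw) {
        if (raw == null) {
            return null;
        }
        String digits = raw.replaceAll("[^0-9]", "");
        return digits.isEmpty() ? null : digits;
    }
}
```

A call such as `PhoneCleaner.digitsOnly(phone)` could then be used wherever a Java expression is accepted, assuming the class is available on the Job's classpath.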
How Are My Jobs Designed?
The other important difference between the two tools is how a job is constructed. Let’s start with PowerCenter. The first thing one develops is a Mapping (which is essentially a “data flow”). This is where the mapping between the source and the target and the transformation logic are defined. Once the Mapping is validated and its metadata is saved into the repository, Sessions and Workflows (the “process flow”) are created. Physical connections to the source and target objects are then assigned, tasks are sequenced in their order of execution, and error handling and notification procedures can be implemented.
In Talend, the data flow and the process flow are implemented together. We construct a Job from a wide variety of Components, each of which provides specific functionality. The “process flow” between Components is implemented using “triggers”, and the “data flow” between Components is implemented using “rows” that follow a particular schema.
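The distinction between rows and triggers can be sketched in plain Java. This is a toy model under my own assumptions, not the code Talend generates: `ToyJob`, `runDataFlow` and `runProcessFlow` are hypothetical names. The first method passes rows from one step to the next (the data flow), and the second decides which step runs next depending on whether the previous one succeeded (the process flow).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model for illustration only; this is not Talend-generated code.
class ToyJob {

    /** Data flow: every row travels from the source, through a transform, to the output. */
    static List<String> runDataFlow(List<String> sourceRows, Function<String, String> transform) {
        List<String> output = new ArrayList<>();
        for (String row : sourceRows) {
            output.add(transform.apply(row));
        }
        return output;
    }

    /**
     * Process flow: run the first step, then fire exactly one follow-up step
     * depending on the outcome. Returns true when the first step succeeded.
     */
    static boolean runProcessFlow(Runnable firstStep, Runnable onOk, Runnable onError) {
        boolean succeeded;
        try {
            firstStep.run();
            succeeded = true;
        } catch (RuntimeException e) {
            succeeded = false;
        }
        if (succeeded) {
            onOk.run();
        } else {
            onError.run();
        }
        return succeeded;
    }
}
```

In this sketch the rows carry data while the follow-up steps carry no data at all. They only control what happens next, which is the role triggers play in a Job.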
To help understand, let us compare how PowerCenter concepts map to their Talend equivalents:
Repository
The PowerCenter repository and the Talend Project Repository both contain reusable metadata objects (such as jobs, DB connections and schema definitions). In Talend, these are integrated with either the SVN or the Git source code control system instead of a proprietary one.
Folder
Folders help organize objects based on their functionality. PowerCenter does not allow for subfolders, but Talend does.
Workflow (PowerCenter) and Job (Talend)
The Workflow or Job implements the ETL process flow with all the connections and dependencies defined. In Talend, a Job represents both the process flow and the data flow.
Worklet (PowerCenter) and Joblet (Talend)
A Worklet or Joblet is a set of tasks that is reusable across Workflows or Jobs. You can use it for reusable logic such as error handling, notifications or other repeatable processes.
Session & Mapping
PowerCenter defines connections, file locations and error handling separately in a Session. In Talend, the functions of a Mapping and a Session are combined and implemented in a Component, or in a set of Components linked by process flow or data flow.
Transformation (PowerCenter) and Component (Talend)
Talend has a large library of Components that support various transformations. For example, one of the most frequently used Components, tMap, combines the functionality of the Informatica Expression, Lookup, Router and Joiner transformations.
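To make the tMap comparison concrete, here is a plain-Java sketch of the four roles working together in one step. It is only an analogy under my own assumptions, not how tMap is implemented: the class `TMapSketch`, the `Order` fields and the 100.0 threshold are all hypothetical. The map lookup plays the part of the Lookup and Joiner, the derived label plays the part of the Expression, and the split into matched and rejected outputs plays the part of the Router.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Analogy only; names and the 100.0 threshold are hypothetical.
class TMapSketch {

    /** One input row of the main flow. */
    static class Order {
        final String customerId;
        final double amount;

        Order(String customerId, double amount) {
            this.customerId = customerId;
            this.amount = amount;
        }
    }

    /** Two outputs: rows that found a lookup match and rows that did not. */
    static class Output {
        final List<String> matched = new ArrayList<>();
        final List<String> rejected = new ArrayList<>();
    }

    static Output map(List<Order> orders, Map<String, String> customerNames) {
        Output out = new Output();
        for (Order order : orders) {
            // Lookup / Joiner: find the customer name for this order.
            String name = customerNames.get(order.customerId);
            if (name == null) {
                // Router: orders without a matching customer go to the reject output.
                out.rejected.add(order.customerId);
                continue;
            }
            // Expression: derive new values from the joined row.
            String size = order.amount >= 100.0 ? "LARGE" : "SMALL";
            out.matched.add(name.toUpperCase(Locale.ROOT) + "|" + size);
        }
        return out;
    }
}
```

In PowerCenter, each of these steps would typically be a separate transformation in the Mapping. The point of the sketch is that they all happen in a single pass over the rows.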
Source and Target – Definitions & Connections
In Talend, schema definitions and connections can be hardcoded into each Component. As a best practice, however, they should be defined once in the Repository Metadata and reused across Components.
An In-Depth Look at the Interface
Finally, let’s look at Talend’s Eclipse-based Studio interface and try to understand it from a PowerCenter developer’s perspective.
- The Repository (PowerCenter: Navigator) is where all the resources – Folders, Jobs, Schema Definitions and Connections, parameters and variables are defined.
- The Design Area (PowerCenter: Workspace) is where Jobs are assembled.
- The contextual tabs at the bottom are used to configure and document the Components and to run the Job. They combine several features that PowerCenter provides in its Designer and Workflow Manager tools.
- The Palette (PowerCenter: Transformation toolbar) is a library of all components available.
- The Perspective determines the overall layout of the Studio and the arrangement of the different areas within it. Each major Talend product has its own perspective. The big advantage is that a developer does not need to switch between several tools depending on the product being used, and the unified user interface across products improves developer productivity.
The most important lesson I learned in several years as an Informatica architect is that any technology is only as good as the best practices implemented around it. Talend is no exception to that rule. If you want to realize the full value of your investment in Talend, you need to implement best practices and follow them as part of your software development lifecycle. Here are some links to get you started on Talend Job Design Patterns and Best Practices – Part 1, Part 2, Part 3 and Part 4.
My journey with Talend has just started. What I have learned so far is that once you understand the differences between PowerCenter and Talend, how Talend works and the best practices around it, you can start delivering incredible value to your organization using Talend as a data integration platform. The next part of my journey is exploring the Talend Big Data platform, and what I have seen so far is very exciting. I hope to share more of my findings in my next blog.