Let me start by thanking all those who read the first part of our blog series on converting legacy ETL to modern ETL! Before I begin the second part of this three-blog series, let’s recap the three key aspects under consideration for converting from legacy ETL to modern ETL.
- Is this purely an ETL conversion from the legacy tool to a modern ETL platform like Talend, with the source and target systems staying the same?
- Is the goal to re-platform as well? In other words, will the target system change?
- Will the new platform reside in the cloud or remain on-premises?
In the first blog of the series, we focused on the first question in that list. If you haven’t caught up on that yet, please do so before continuing. In this blog, I will be focusing on the second question mentioned above.
Let’s assume that we have an organization that wants to move away from their legacy backend system to a more advanced, distributed, columnar backend system and at the same time would like to migrate away from legacy ETL to a more modern ETL platform. What’s the best way to approach this initiative? Today, we are going to find out!
Where Should You Get Started?
The first step is to understand the environment. To do that, I often ask myself a few key questions:
- What will the new backend system be based on? Will it be an MPP, columnar database platform or a Hadoop cluster?
- How much data are they processing and how do they anticipate the growth of their data in the next x months/years?
- Finally, is the business looking for new nuggets of insight in the data or is the goal to simply get off the overall legacy environment for a better turnaround?
If the answers to the above questions are an MPP database platform, data growth in line with what the organization has been experiencing, and the ultimate goal of simply getting off the legacy environment for better turnaround, then the strategy outlined in the first blog applies here too, with one addition: the conversion tool must also replace the existing output components with the relevant new ones, for instance swapping an Oracle output for a Netezza or Snowflake output. You should also leverage ELT components in Talend where it makes sense to push the processing of data down to the MPP database rather than processing it outside the database.
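To make the ELT pushdown idea concrete, here is a minimal sketch of the difference between pulling rows out for external processing and generating a single INSERT ... SELECT that the MPP database executes itself, which is essentially what pushdown-style ELT components do. The table and column names are hypothetical, purely for illustration:

```python
def build_pushdown_sql(source_table: str, target_table: str) -> str:
    """Generate one INSERT ... SELECT statement so the aggregation runs
    inside the MPP database instead of on the ETL server.
    Table and column names here are illustrative only."""
    return (
        f"INSERT INTO {target_table} (customer_id, total_amount) "
        f"SELECT customer_id, SUM(amount) "
        f"FROM {source_table} "
        f"GROUP BY customer_id"
    )

# The ETL tool would submit this single statement over JDBC/ODBC;
# no rows ever leave the database.
sql = build_pushdown_sql("staging.orders", "mart.customer_totals")
print(sql)
```

The key design point is that the ETL server only orchestrates: the heavy lifting (joins, aggregations) happens where the data already lives, which is exactly where an MPP engine is strongest.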
Both these aspects can be leveraged using the strategy outlined in the first blog. Please note that, at this point, the strategy remains pretty much the same whether you keep this migration on-premises or move it to the cloud. Cloud migration does call for some special considerations, which I'll cover in the upcoming third blog.
Now, if the answer to those questions is instead a Hadoop cluster to build out a data lake, with medium-to-large anticipated data growth and added business functionality such as self-service queries against the data lake and additional metrics for reporting, then the strategy I'll go through below has proven useful for many of our customers.
Bright minds bring success. Keeping that mantra in mind again, first build your team:
- Core Team – Identify architects, senior developers and SMEs (data analysts, business analysts, and the people who live and breathe data in your organization)
- Talend Experts – Bring in experts in the tool to guide you and provide best practices and solutions throughout the conversion effort. They will also participate in performance-tuning activities
- Conversion Team – A System Integrator partner who can provide people trained and skilled in Talend
- QA Team – Seasoned QA professionals who help you breeze through your QA testing activities
Now comes the approach: Divide the effort into logical waves and follow this approach for each wave.
Base each wave on the existing ETL code and the new functionality that needs to be incorporated in order to migrate to, for instance, a data lake on HDFS or S3:
Identify Data Ingestion & Processing Patterns – Analyze the ETL jobs and categorize them based on overall business and technical functionality. Since a migration is happening, new technical functionality may be identified, such as an “Ingestion Framework” that ingests data in its raw format into the data lake sitting on HDFS or in an S3 bucket. Write down all these patterns. This is where your SMEs, Talend Experts and the architects from your System Integrator partner, working closely together, can help you define the right categories.
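One small but recurring decision inside such an “Ingestion Framework” is where raw data lands in the lake. Below is a sketch of a partitioned raw-zone path convention; the bucket name and layout are assumptions for illustration, not a Talend-defined standard:

```python
from datetime import date

def raw_zone_path(source_system: str, dataset: str, load_date: date,
                  bucket: str = "s3://my-data-lake") -> str:
    """Build a partitioned raw-zone location for an ingested file.
    The bucket name and folder layout are illustrative assumptions;
    the point is that every ingestion job derives its landing path
    the same way instead of hard-coding it."""
    return (f"{bucket}/raw/{source_system}/{dataset}/"
            f"load_date={load_date.isoformat()}/")

path = raw_zone_path("oracle_crm", "customers", date(2019, 5, 1))
print(path)  # s3://my-data-lake/raw/oracle_crm/customers/load_date=2019-05-01/
```

A consistent convention like this is what makes self-service queries over the raw zone feasible later, because query engines can prune on the `load_date=` partition folders.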
Design Job Templates for those Patterns – Once the patterns are identified for a given wave, Talend Experts can help you design the right template for each pattern, whether it is a template for the “Ingestion Framework” or for data loads following specific business rules. Designs will most likely leverage big data components.
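To illustrate what “template” means in practice, here is a minimal sketch in Python: the template fixes the extract → validate → land sequence, and individual jobs only fill in the hooks. This is an analogy for how pattern templates keep jobs uniform, not Talend-generated code:

```python
from abc import ABC, abstractmethod

class IngestionJobTemplate(ABC):
    """Every job built from this pattern runs the same three steps;
    only the hooks vary per job. Illustrative sketch only."""

    def run(self) -> list:
        records = self.extract()
        valid = [r for r in records if self.validate(r)]
        self.land(valid)
        return valid

    @abstractmethod
    def extract(self) -> list: ...

    def validate(self, record) -> bool:
        # Default rule shared by the pattern; override where business rules differ.
        return record is not None

    @abstractmethod
    def land(self, records: list) -> None: ...

class CustomerIngestion(IngestionJobTemplate):
    def extract(self) -> list:
        return [{"id": 1}, None, {"id": 2}]   # stand-in for a real source read

    def land(self, records: list) -> None:
        self.landed = records                 # stand-in for a write to the lake

job = CustomerIngestion()
result = job.run()
print(result)  # [{'id': 1}, {'id': 2}]
```

The governance benefit is the same one the blog describes: a job that needs to deviate from `run()` is immediately visible, and that deviation can be routed to the governing body for approval.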
Develop – Now that the designs for the identified patterns are ready, work iteratively across multiple sprints to develop the jobs required to ingest and process the data. Any deviation from the template, which could be the case in a handful of jobs given data processing complexities, needs to be approved by a governing body consisting of your SMEs and Talend Experts.
Optimize – Focus on job design and performance tuning. This will primarily be driven by data volume and processing complexity. Given the use case, big data components will be common, so the focus will largely be on tuning Spark parameters and queries. Here again, Talend Experts can provide optimal performance-tuning guidelines for each scenario.
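As one concrete example of Spark tuning, a knob that comes up constantly is `spark.sql.shuffle.partitions`. A common rule of thumb (an assumption here, not a Spark or Talend mandate) is to size shuffle partitions at roughly 128 MB each, never dropping below Spark's default of 200:

```python
import math

def suggested_shuffle_partitions(shuffle_bytes: int,
                                 target_partition_bytes: int = 128 * 1024 * 1024,
                                 min_partitions: int = 200) -> int:
    """Rough heuristic: aim for ~128 MB per shuffle partition.
    The 128 MB target and the floor of 200 (Spark's default) are
    rule-of-thumb assumptions; real tuning should be validated
    against the job's actual shuffle metrics."""
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))

# e.g., a job that shuffles about 100 GB:
n = suggested_shuffle_partitions(100 * 1024**3)
print(n)  # 800
```

The result would then be applied to the session, e.g. via `spark.conf.set("spark.sql.shuffle.partitions", n)`, and re-measured, since tuning is always iterative.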
Complete – Unit test and ensure all functional and performance acceptance criteria are satisfied before handing the job over to QA
QA – A mix of SIT, UAT and an automated approach that compares result sets produced by the old and the new ETL jobs (the automated comparison may not be applicable to all jobs, so proper SIT and UAT will still be required). It is extremely important to introduce an element of regression testing to ensure fixes are not breaking other functionality, and performance testing to ensure SLAs are met
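The automated result-set comparison can be sketched very simply: treat each job's output as a multiset of rows and check equality, ignoring row order but catching duplicate-count differences. In a real run the two inputs would be fetched from the old and new targets; here they are inline stand-ins:

```python
from collections import Counter

def result_sets_match(old_rows, new_rows) -> bool:
    """Order-insensitive comparison of two result sets, sensitive to
    duplicate rows. A sketch of the automated-QA idea; production use
    would stream rows from the old and new targets and also report
    which rows differ, not just a boolean."""
    return Counter(map(tuple, old_rows)) == Counter(map(tuple, new_rows))

old = [(1, "a"), (2, "b"), (2, "b")]
new = [(2, "b"), (1, "a"), (2, "b")]   # same rows, different order
print(result_sets_match(old, new))       # True
print(result_sets_match(old, new[:-1]))  # False: a duplicate row is missing
```

This multiset approach is why the comparison catches the subtle case where the new job drops or doubles a duplicate row, which a naive `set()` comparison would miss.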
In the past, we have seen that such conversions take a significant amount of time to complete. It is, therefore, extremely important to set the right expectations with stakeholders and to set milestones that define success criteria for each wave. Have a clear roadmap and work towards it. Keep the business informed and ensure key resources from business and IT are available during critical design discussions and during UAT. This will ensure a smooth transition from your legacy data management platform to a whole new modern big data management platform.
This brings me to the end of the second part of the three-blog series. Below are the five key takeaways of this blog:
- Define a roadmap and spread the conversion effort across multiple waves
- Set milestones at critical junctures and define success criteria for each of them
- Identify core team, Talend experts, a good System Integrator partner and seasoned QA professionals
- Identify patterns, design templates and follow an iterative approach across multiple sprints to implement those patterns
- Leverage Talend experts for optimal performance-tuning guidelines
Stay tuned for the last one!!