Extract/Transform/Load (ETL) is the process used to turn raw data into information that can be used for actionable business intelligence. In an ETL process, data is extracted from applications like Salesforce and SAP, databases like Oracle and Redshift, or file formats like CSV, XML, JSON, or AVRO. That data, once extracted, is transformed into usable information to be loaded into new applications or destinations, or might be loaded into a data lake or data warehouse to be transformed later in an ELT process.
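The three stages above can be illustrated with a minimal, hand-rolled pipeline. This is just a sketch with hypothetical file paths and field names (an exported CSV of contacts with an `email` and `revenue` column), not how any particular ETL tool works:

```python
import csv
import json

def extract(path):
    """Extract: read raw rows from a CSV export (e.g. from a CRM)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalize fields and drop incomplete records."""
    cleaned = []
    for row in rows:
        if not row.get("email"):
            continue  # skip records missing a required field
        cleaned.append({
            "email": row["email"].strip().lower(),
            "revenue": float(row.get("revenue") or 0),
        })
    return cleaned

def load(rows, path):
    """Load: write transformed records to a destination (here, JSON lines)."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

In an ELT variant, the `transform` step would be deferred: raw rows would be loaded into the warehouse or data lake first and reshaped there later.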
These data processes can be built through hand-coded connections or completed with an ETL tool. But just as data processes are evolving to be more cloud-based, ETL tools have to evolve too. While a great deal of enterprise data now comes from cloud sources and sensors, plenty of critical enterprise data still lives in on-premises systems.
We’ll take a close look at scenarios in which you need an ETL tool, what to look for when choosing one, and why ETL tools must function in the cloud.
ETL tools in the cloud and on-premises
Everything in IT is moving to the cloud. Forrester predicts that the global public cloud market will grow by $40 billion to $178 billion this year, and will continue to grow at a 22% CAGR. This is creating a new imperative for data-driven businesses: cloud data warehouses are becoming important, processing that data in the cloud is becoming critical for real-time analytics, and cloud-based SaaS applications are allowing companies to enjoy capabilities and functions in an affordable and efficient way.
What all these cloud-based functions are doing, besides providing new compute power and capabilities, is creating cloud-based data silos on top of the ones that already existed in legacy systems. In a cloud-centric world, organizations have to work with cloud apps, databases, and platforms — along with the data that they generate. But it’s important not to forget the data contained in your on-premises systems. ETL tools should be able to accommodate data from any source — cloud, multi-cloud, hybrid, or on-premises. And, with so much innovation in the cloud space, it’s important that your data management infrastructure can be flexible enough to move effortlessly to whatever cloud provider you desire.
Why you need an ETL tool
Many data professionals ask themselves why they would need an ETL tool when hand coding can often get the job done quickly and at less expense. Hand coding has two major drawbacks:
- Hand coding limits sharing and re-use. Good ETL tools should have a visual design interface, so colleagues can simply look at the flow chart and see what’s been built. With hand coding, it’s difficult to see what’s been done, so developers will often give up and re-do the same work, which is not only redundant but drains time and resources.
- Hand coding drives up maintenance costs. If different people maintain and support the code once it’s in production, their learning curve with a hand-coded approach will be high, and if the code is in production for years, turnover will lead to far higher costs in the long run.
Data integration projects often start small, perhaps as a POC for a quick win; a developer often will need to connect two systems together to complete a particular task, so the perception is they don’t need an ETL tool to handle more complicated processes like real-time analytics or data governance when hand coding will do. But the problem is small projects rarely stay small. Very frequently, these projects get seen and adopted by the rest of the enterprise, so what starts out as a fairly narrow project rapidly expands to being multi-departmental, and suddenly a quick little hand-coded integration isn’t so quick and isn’t so little.
Plus, as the number and volume of data sources grow, and companies increasingly are subject to data protection regulations like GDPR, your ETL processes will need to incorporate data quality functions, data governance capabilities, and metadata management. These capabilities are practically impossible to hand code at the kind of scale most organizations need.
An example is an American telecommunications company which has more than 50 million subscribers. Their advertising team wanted to create a new product to sell personalized, local advertising to conference attendees, so they used Hadoop to track location data of cell phones on their network. Hadoop made data analytics of this scale possible. And when the rest of the business saw the success of this project, suddenly there were 50 more requests to do this level of data analysis. There was simply no way the IT team had the time or resources to do 50 more of these hand-coded integrations. They needed an ETL tool and one that would scale quickly.
How to select the right ETL tool
When you’re considering the purchase of an ETL tool, even for a single project like building a data warehouse in Snowflake, there are a number of features that it will need to have.
Here is the checklist of what you need in a good ETL tool:
- It should be able to both read and write from the entire breadth of the data sources you need, whether they're located in the cloud or on-premises.
- It should be able to do data transformation processes like sorting, filtering, and aggregating.
- Data quality and data governance capabilities should be built in, such as deduplication, matching, and data profiling.
- Collaboration tools should be included. This will make it easier to reuse prior development elements and can lead to more efficient data integration flows. A single job can feed multiple targets instead of having a series of data integration flows all doing roughly the same thing over and over.
- With the shift to cloud systems, the ability to accommodate CI/CD processes is a necessity.
- Your ETL tool should be able to work in any environment, across on-premises, cloud, or hybrid infrastructures.
- An ETL tool should be able to accommodate changing providers easily. You might build a data lake in Redshift today and Snowflake tomorrow, or use AWS as your cloud provider this quarter but Azure the next quarter. It’s important to have an ETL tool that works in a multi-cloud environment and can accommodate switching providers and deployment environments by simply swapping out a few components while keeping the business logic and transformation logic the same.
- An ETL tool should work well with the latest innovations and accommodate new technologies easily. Good ETL tools will be able to integrate with serverless technologies, Spark, Snowflake, machine learning, and more — and will be able to adapt quickly to new technology that has yet to emerge.
- Scalability is very important when choosing tools. It’s tempting to look at a tool vendor that can just handle simple processes — they’re cheap and easy to use. But they don’t scale very well; they’re bound to whatever the machine can scale to, so you have to consider how much RAM or how many CPUs you might be using. This means that as your business grows, you’re putting an artificial limit on the growth of your analytics operation, which in this hypercompetitive business environment, could put you at a disadvantage. It’s also harder to collaborate and re-use elements from one project to another, which means ETL projects often have to be started from scratch, taking time and resources.
- Portability is an important but sometimes overlooked capability for ETL tools. For example, the Apache Hadoop ecosystem is moving incredibly quickly. In 2014 and 2015, MapReduce was the standard, but by the end of 2016 Spark emerged as a new standard. With a hand-coding approach, porting that code from MapReduce to Spark was impossible; leading ETL tools let you do this seamlessly.
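The transformation and data quality items in the checklist above (sorting, filtering, aggregating, deduplication) reduce to a handful of generic operations. A minimal sketch in plain Python, using hypothetical order records, shows the shape of each:

```python
from collections import defaultdict

def dedupe(rows, key):
    """Deduplication: keep the first record seen for each key value."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out

def aggregate(rows, group_key, value_key):
    """Aggregation: sum a numeric field per group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)

# Hypothetical source data, including a duplicate record.
orders = [
    {"id": 1, "region": "east", "amount": 100.0},
    {"id": 1, "region": "east", "amount": 100.0},  # duplicate
    {"id": 2, "region": "west", "amount": 50.0},
]

unique = dedupe(orders, "id")
# Filtering and sorting compose naturally with the steps above.
large = sorted((r for r in unique if r["amount"] >= 50),
               key=lambda r: r["amount"])
by_region = aggregate(unique, "region", "amount")
```

An ETL tool performs these same operations, but declaratively, at scale, and in a visual flow that colleagues can inspect and reuse.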
Consider an ETL tool for the cloud
There’s no question that the world is moving to the cloud. Often, when companies move to the cloud, they reassess their entire infrastructure stack. What worked well on-premises isn’t necessarily going to work well in the cloud. For example, on-premises you have more control over what you can do with your servers and hardware, whereas the cloud is a different paradigm — you need to know how to deal with cloud stacks and private networks outside of your own. The cloud changes how ETL tools work and how you work with them; the coding languages, processes, and platforms differ from what you might encounter on-premises, so you need a tool that can switch seamlessly between the two.
The cloud’s ability to scale is critical to ETL tool selection because it’s hard to predict how big your environment will be in the future. You could have 50 data sources or 500 — and, of course, you need to accommodate new technology that isn’t even on the market yet. A cloud-based ETL tool will allow you to make changes to your data management infrastructure easily.
Another factor to consider when thinking about an ETL tool for the cloud is the pricing model. A major benefit of moving to the cloud is paying only for the resources you actually use. So if you’re buying a product that charges you per node, that directly conflicts with the cloud ethos of being able to spin resources up and down easily, or have many of them at one moment and none the next. Your ETL tools should be able to work the same way; you shouldn’t have to pay for runtimes you’re not using or extra resources that sit idle. You should pay for your ETL tools like any other aspect of cloud infrastructure.
ETL tools and data quality
Data quality is becoming increasingly important to the business as poor data quality costs a significant amount of time and money. And as data privacy regulations like GDPR come online, ensuring both data quality and governance is going to become an increasingly critical enterprise imperative.
As we know, in this era of cloud computing, there is an explosion of data sources which you will want to combine for business intelligence. But every data source on the planet has data quality challenges, and when you integrate them, those challenges are compounded. An ETL project might seem like a matter of a simple connection between two systems, but investing in the quality of data upfront will save your company significant amounts of money and resources. A proactive approach to data quality allows you to check and measure that level of quality before it even really gets into your core systems. That’s why having data quality tools built in to ETL tools is becoming a very important factor in your ETL tool purchase. Take a look at the Definitive Guide to Data Quality to learn more about how a good ETL tool can proactively ensure quality data before it enters your systems and is used.
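The proactive approach described above — measuring quality before data reaches core systems — can be sketched as a simple profiling-and-gating step. The field names, thresholds, and email check here are hypothetical illustrations, not any particular tool's rules:

```python
import re

# A deliberately simple email pattern, just for illustration.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def profile(rows, required=("email", "amount")):
    """Profile an incoming batch: count missing fields, invalid
    formats, and duplicates before anything is loaded."""
    report = {"total": len(rows), "missing": 0, "bad_email": 0, "duplicates": 0}
    seen = set()
    for row in rows:
        if any(not row.get(field) for field in required):
            report["missing"] += 1
            continue
        if not EMAIL_RE.match(row["email"]):
            report["bad_email"] += 1
        if row["email"] in seen:
            report["duplicates"] += 1
        seen.add(row["email"])
    return report

def gate(report, max_error_rate=0.05):
    """Reject a batch whose error rate exceeds a threshold, rather
    than letting bad records flow into downstream systems."""
    errors = report["missing"] + report["bad_email"] + report["duplicates"]
    return errors / max(report["total"], 1) <= max_error_rate
```

In a real ETL tool the profiling, matching, and rejection rules are configured rather than coded, but the principle is the same: quality is checked at the gate, not patched up after the load.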
Talend's ETL tools
Talend has a number of ETL tools that work for any data integration project you might need to complete. If you need a basic tool for ETL processes, Talend’s free Open Studio for Data Integration is perfect for that. It works well with integrations for Redshift, Snowflake, Google BigQuery, and more — single loads of data and easy ingestion processes. If you are looking for a simple yet powerful open source tool, download Open Studio today.
When you’re ready for an ETL tool which includes data quality, data governance, machine learning, collaborative features and other capabilities, the right choice is Talend’s Data Management Platform. It has a dynamic pricing model, and can be seamlessly upgraded from the Open Source product once you decide to add more features and capabilities. You can compare the features of all our data integration products.
Talend’s ETL tools are designed to simplify the complex needs of a growing, data-driven business. Try the tool that’s right for you and make ETL processes in your business easy.