What is Reverse ETL? Meaning and Use Cases
It isn’t easy to turn raw data endpoints into stellar customer experiences or successful marketing campaigns. Companies need a data platform to move data from one place to another and make it available to analysts, business teams, and marketing teams. These are complex workflows involving a range of technologies — and the latest of those technologies is reverse ETL (extract, transform, load).
In a nutshell, reverse ETL is a technology for taking cleaned and processed data from the data warehouse and ingesting it back into business applications such as Salesforce, where it can be used for business operations and forecasting. As a component of the modern data tech stack, it sets businesses up to do more complex operational analytics than they could with BI tools alone.
Reverse ETL has been around since 2018, but it didn’t really start trending until 2021. The idea sounds simple enough, but how do you know if you need reverse ETL? What specific problems does it solve, and what kinds of businesses can benefit from it the most?
A brief history of data management
To appreciate the role of reverse ETL, you must first understand the inherent differences between mature data management platforms and the modular — or “modern” — data tech stack.
For decades, enterprise companies have understood the value of data for market insights and business decision making. As early as the 1980s, technology companies were employing data warehouses as a central data repository. Then, as now, the data warehouse was a single source of truth, providing clean, standardized data to experts and engineers across the business.
But early data warehouses were housed in physical, on-premises servers. Not only did the hardware itself require a massive, up-front investment, these servers were also expensive to operate, requiring highly technical data teams with local access to the physical hardware. To run even a simple SQL query, data engineers would have to manually extract the data, transform it using technology outside of the servers, and finally load it back into the data warehouse. This process became known as "ETL" — extract, transform, load.
Traditional data management platforms evolved in response to this paradigm, taking advantage of controlled data environments and highly technical users to provide a suite of applications for data integration, data quality, data governance, and more. These technologies were robust and flexible, giving enterprise organizations everything they could need for end-to-end management of the entire data lifecycle. While newer data management platforms began to offer features like low-code options for creating data pipelines, there was still an assumption that their customers would have access to expert operators. Many products required a high level of technical knowledge and some hand-coded pipelines or queries, but because the companies that could afford to operate an on-prem data warehouse could also afford the data engineers to oversee data processing, that arrangement worked well — for a time.
Everything changed when the data warehouse made the move to the cloud. With shared servers and remote access, both storage and processing got a lot cheaper. Within a few years, even smaller companies and startups could afford to move their data to a cloud data warehouse, taking advantage of the access and availability of a remote source of truth for their data. But in smaller, scrappier businesses, there are seldom enough engineering resources to go around, leaving no one with the technical skills to build pipelines or models for moving and processing data.
The modern data stack
With cloud storage flattening the field and giving smaller companies access to the advantages of large-scale automation and data operations, an entire ecosystem of point solutions has emerged to fill the gaps. These solutions are often low-code and (comparatively) low-cost, making it possible for younger companies to build out a modular tech stack piece by piece, relying on their existing resources instead of making a massive, up-front investment in a full data infrastructure.
As in the earliest days of data management, the data warehouse is still the foundation of any organization’s data infrastructure. The data warehouse — or its close cousin, the data lake — provides a centralized repository so that data can be managed, cleaned, and applied to business intelligence.
A cloud data warehouse like Snowflake or BigQuery may also have some native processing functionality, giving business users the ability to run SQL queries directly in the warehouse.
The data stored in the warehouse has to come from somewhere, though. That’s where data loaders come into play. A data loader pulls raw data from a variety of data sources, systems, and SaaS applications, then ingests that data into the data warehouse, potentially in real time or near-real time.
The classic approach to this process is “ELT,” which stands for extract, load, transform. While there is a lot of nuance to the differences between ELT and ETL, the simplest explanation is that ETL uses a mature data management platform to transform data before it enters the data warehouse, while ELT is a component of a modern data stack that loads raw data directly into the warehouse where other modules take care of data transformation, taking advantage of the processing power of the data warehouse.
Modern data loaders serve a similar function, but they're designed for a different persona than a classic ETL. Modern data loaders are built for teams with a strong need for agility and direct access to the data sources, but without the same level of technical expertise. To better serve this market, data loaders offer a breadth of pre-built connectors that often focus on smaller, more vertically specialized and self-service SaaS apps.
Stitch is a popular data loader that makes it easy for even non-technical users to extract data from a range of applications and load it into their data warehouse.
The value of data loaders lies in their power to easily and quickly tap into data sources to make that data available in the warehouse. But data loaders copy and paste raw data without making any changes to it. Raw data can be valuable, but in most cases it needs to be cleaned up before it can be useful for analytics or business intelligence; for example, a standard format for phone numbers or addresses must be established.
This kind of data preparation is only a small subset of data transformation. The full possibilities of data transformation and modeling go far beyond that, making it possible to enrich data by crossing multiple sources, change the format of data (such as its structure, granularity) to be fit for analytic purposes, or add business rules on top of the data.
In the modern data stack, many simple transformations can be performed using the native processing power of the warehouse — although that processing is usually limited to SQL queries. Additional technologies, the most popular being the development framework dbt, amplify the native modeling capabilities of the warehouse. These innovations give users the ability to create more complex data models using Java or Python directly in the warehouse, with no need to extract and re-load the data.
At the end of the day, the purpose of all these technologies and processes is to make sure that business teams can get relevant, useful insights out of the data. That requires a business intelligence (BI) or reporting tool, such as Looker.
Reporting tools let users run analytics and build helpful reports dashboards using the company’s most current data directly from the data warehouse. They are often low-code technology, allowing even non-technical users to create visual and informative dashboards for marketing, finance, and other departments across the organization.
Closing the loop: Reverse ETL
But what happens when the dashboard isn’t enough? Reports and BI dashboards are incredibly helpful for high-level decision making and gut-checking various plans and forecasts. But sometimes front-line business users could benefit from the most current data as it has been consolidated and transformed within the data warehouse.
Reverse ETL is the solution to this problem. A reverse ETL tool extracts current data from the data warehouse, transforms it, and loads it into an operational system or application.
This process comes into play in a variety of use cases where business users would like to use transformed data or the results of data modeling in their preferred applications. For example, say you have key metrics that were computed within the data warehouse, such as CLV (customer lifetime value). This data has been made available in a Looker dashboard, but what you really want is to make it available in Salesforce (or other CRM) so the sales team can improve their reporting and forecasting. A reverse ETL tool would be able to pull that metric from the data warehouse and feed it back to Salesforce.
With a modern tech stack, there are three approaches to reverse ETL. First, if you have enough technical people, you can code individual connectors or rely on APIs to move data from the data warehouse into each business application. This is far less likely to happen in a small company, however, where every available engineer is already spoken for.
Second, you can rely on the native integrations that connect tools to each other — for example, Salesforce and Slack. But, while this may be a successful approach with Salesforce, not every SaaS tool will have the native integrations you need.
Finally, you can use purpose-built reverse ETL solutions, such as Hightouch or Census. Just like ELT solutions such as Stitch come preloaded with the connectors you need to ingest data from a range of business applications into your data warehouse, reverse ETL solutions make it easy to load data from the warehouse back into business applications without complicated setup or writing any additional code.
Who needs reverse ETL?
Here comes the million-dollar question: does your business need reverse ETL? And the answer is maybe, but not necessarily.
The “modern” data stack is modern in the sense that it takes advantage of the most current trends in technology, such as the ability to pick and choose between best-of-breed point solutions. But modern does not necessarily mean that it’s the best approach in every situation.
Most enterprise businesses and larger companies still prefer a mature data tech stack for data management because it is so powerful, comprehensive, and customizable — and because they may still be operating in an on-prem or hybrid cloud environment where they must work with local servers for data storage and processing. Besides which, an enterprise business will tend to put in a place a more governed, robust API strategy to serve the same function as reverse ETL.
In a mature data tech stack, there is no need for reverse ETL because there’s no practical difference between building a pipeline to move data from Salesforce into Snowflake vs. a pipeline moving data from Snowflake into Salesforce. As long as you have the resources and expertise, moving data is just moving data. And the more advanced mature data management platforms, including Talend Data Fabric, are increasingly user-friendly, offering low-code options that make it easy for even business users with comparatively low technical expertise to build bi-directional data pipelines.
On the opposite end of the spectrum, many smaller businesses and startups may have no need for reverse ETL. Because they are operating with a smaller team and more limited demands, they may be able to get all the insight they need from dashboards and reports. Even though they are operating on a modern data stack, it may be months or even years before they need to a specific, agile solution to deliver data throughout the entire system.
There is no one-size-fits-all answer for data management. Reverse ETL offers an efficient solution for businesses that need a modern data stack with more data pipelines that can move data not just into the data warehouse, but also back into business applications.
The future of the modern data stack
Reverse ETL is still a young technology, but it is a strong signal of things to come. The main advantage of the modern data stack is its agility. A company can implement only the solutions that provide the most immediate value — then build on to the stack later, as needs and resources change. As a result, the solution is going to be efficient, cost-effective, and responsive.
While there are a lot of advantages to the modern data tech stack, however, it does have its limits. Because the solution is modular, if you don’t have a module for a given function, that function is not going to happen. There are gaps. And while the landscape is constantly changing, as things stand today, the modern data stack has no fully mature modules for data quality, data cataloging, or data governance.
Reverse ETL is plugging one of those gaps, closing the loop between business applications and the data warehouse. As time goes on, new technologies, solutions, and products will continue to arrive to close more of those gaps.
The modern data tech stack does have its limits. But, with the addition of reverse ETL, it has come one step closer to the functionality of the mature tech stack.
Ready to get started with Talend?
More related articles
- Data Extraction Tools: Improving Data Warehouse Performance
- Best Practices for Managing Data Quality: ETL vs ELT
- Data Wrangling vs. ETL
- Data Wrangling: Speeding Up Data Preparation
- ETL in the Cloud: What the Changes Mean for You
- ETL Tools: Finding the Best Cloud-Based ETL Software for your Business
- ETL of the Future: What Data Lakes and The Cloud Mean for ETL
- ETL Testing: An Overview
- ETL vs ELT: Defining the Difference
- Understanding the ETL Architecture Framework
- What is ELT?
- What is ETL?
- Why ELT Tools Are Disrupting the ETL Market