Best practices for managing data quality: ETL vs ELT

For decades, enterprise data projects have relied heavily on traditional ETL for their data processing, integration, and storage needs. The emergence of big data and unstructured data originating from disparate sources has made cloud-based ELT solutions even more attractive.

Although cloud computing has undoubtedly changed the way most organizations approach data integration projects, data quality tools continue to ensure that your organization will benefit from data you can trust.

This article will underscore the relevance of data quality to both ETL and ELT data integration methods by exploring different use cases in which data quality tools have played a relevant role. We will also examine what it takes for data quality tools to be effective for both ETL and ELT.

Key differences

In order to understand the role of data quality and how it is applied to both methods, let’s first go over the key differentiators between ETL and ELT.

The Extract Transform Load (ETL) approach has been the gold standard for data integration for many decades and is commonly used for integrating data from CRMs, ERPs, or other structured data repositories into data warehouses. While ETL processes are designed for internal, relational data warehousing, they require dedicated platforms for the intermediate steps between extracting data and loading it into target repositories.

Extract Load Transform (ELT), on the other hand, addresses the volume, variety, and velocity of big data sources and doesn’t require this intermediate step to load data into target systems. ELT requires less physical infrastructure and dedicated resources because transformation is performed within the target system’s engine. Thus, the shift from ETL to ELT tools is a natural consequence of the big data age and has become the preferred method for data lake integrations.

The key difference between ETL and ELT tools is that ETL transforms data prior to loading it into target systems, while the latter transforms data within those systems.

Data quality with ETL use cases

But wait, is ETL still being used?

It has been said that ETL only has a place in legacy data warehouses used by companies or organizations that don't plan to transition to the cloud. In a cloud-centric world, organizations of all types have to work with cloud apps, databases, and platforms — along with the data that they generate. But it’s important not to forget the data contained in your on-premises systems. ETL tools should be able to accommodate data from any source — cloud, multi-cloud, hybrid, or on-premises. Today, there are ETL tools on the market that have made significant advancements in their functionality by adding data quality capabilities such as data profiling, data cleansing, big data processing, and data governance.

Domino's Pizza delivers data quality

Domino’s wanted to integrate information from over 85,000 structured and unstructured data sources to get a single view of its customers and global operations. The IT architecture in place at Domino’s was preventing them from reaching those goals. They didn’t have a standard way to ingest data and had data quality issues because they were doing a lot of custom — and costly — development.

Domino’s selected Talend Data Fabric for its unified platform capabilities for data integration and big data, combined with data quality tools, to capture, cleanse, standardize, enrich, and store data, so that it could be consumed by multiple teams after the ETL process.

With its modern data platform in place, Domino’s now has a single, trusted source of the truth that it can use to improve business performance, from logistics to financial forecasting, while enabling one-to-one buying experiences across multiple touchpoints.

AstraZeneca: ETL into a data lake

AstraZeneca plc is the seventh-largest pharmaceutical company in the world with operations in over 100 countries and data spread throughout the organization in a wide range of sources and repositories. Having to draw data from CRM, HR, finance systems, and several different versions of SAP ERP systems dispersed throughout the organization slowed down vital reporting and analytics projects. In addition, inconsistencies in reporting from silos of information prevented the company from finding insights hiding in unconnected data sources. They needed to put in place an architecture that could help bring data together in a single source of truth.

Using a data lake on AWS to hold the data from its diverse range of source systems, AstraZeneca leverages Talend for lifting, shifting, transforming, and delivering its data into the cloud, extracting from multiple sources and then pushing that data into Amazon S3. The Talend Jobs are built and then executed in AWS Elastic Beanstalk. After some transformation work, Talend then bulk loads that into Amazon Redshift for analytics.

Because it’s crucial to manage the quality of the data entering the data lake so that is does not become a data swamp, Talend Data Quality has been added to the data scientist AWS workstation. This has allowed the team to develop and automate data transfer and cleansing to assist in advanced analytics. Leveraging data quality through ETL and the data lake lets AstraZeneca’s Sciences and Enabling unit manage itself more efficiently, with a new, higher level of visibility. By consolidating data from global SAP systems, the finance department has created a single source of truth to provide insight and help set long-term strategy.

Data quality for ELT use case

DoubleDown: from ETL to ELT

DoubleDown Interactive is a leading provider of fun-to-play casino games on the internet. DoubleDown’s challenge was to take continuous data feeds of its game-event data and integrate that with other data into a holistic representation of game activity, usability, and trends. The company’s data integration, however, was complex — it required many sources with separate data flow paths and ETL transformations for each data log from the JSON format.

The previous process was to use Talend’s enterprise integration data suite to get the data into a NoSQL database for running database connectors and aggregators. The data was then pulled into a staging area where data quality tools cleaned and transformed the data and organized it to conform with the star schema. This created hidden costs and risks due to the lack of reliability of their data pipeline and the number of ETL transformations required. DoubleDown had to find an alternative method to accelerate the data extraction and transformation process.

DoubleDown opted for an ELT method with a Snowflake cloud data warehouse, because of Snowflake’s scalable cloud architecture and its ability to load and process JSON log data in its native form. All transformations and aggregations previously done in MongoDB, plus several new ones, are now done inside Snowflake. Using Snowflake has brought DoubleDown three important advantages: a faster, more reliable data pipeline; lower costs; and the flexibility to access new data using SQL. In addition, by making the integration more streamlined, DoubleDown leverages data quality tools while running its Talend ELT process every 5 minutes to create a more trusted source of data.

Data quality tools

Both ETL and ELT processes involve staging areas. In ETL, these staging areas are found within the ETL tool. In ELT, by contrast, the staging area is within the data warehouse and the database engine performs the transformations. It is within these staging areas where the data quality tools must also go to work.

Data quality for ETL tools

In an ETL integration, data quality must be managed as the root data is extracted from applications like Salesforce and SAP, databases like Oracle and Redshift, and/or file formats like CSV, XML, JSON, and AVRO. This can lead to a lot of work for the data scientist.

By managing ETL through a unified platform, data quality can be improved in the cloud for better flexibility and scalability.

Data quality for ELT tools

With ELT, on the other hand, data staging occurs after data is loaded into data warehouses, data lakes, or cloud data storage, resulting in increased efficiency and less latency. Consequently, if the target repository doesn’t have data quality tools built in, it will be harder to ensure that the data being transformed after loading is data you can trust.

Pervasive data quality

In either case, the best approach is to establish a pervasive, proactive, and collaborative approach to data quality in your company. Data quality must be something that every team (not just the technical ones) is responsible for, it has to cover every system, and it has to have rules and policies that stop bad data before it ever gets in.

With self-service data preparation tools like Talend Data Preparation, cloud-native platforms with machine learning capabilities make the data preparation process easier. This means that business users who may lack advanced IT skills can run the processes themselves and data scientists can spend more time on analyzing data, rather than on cleaning it. Self-service tools make data preparation a team sport.

The right tools

Regardless of which integration method is being used, data quality tools should offer the following:

  • Profiling – Data is analyzed to determine its quality, volume, and format. Metadata is examined and overall data quality is assessed.
  • Matching – Data is examined to identify and merge related entries within your dataset.
  • Cleansing – Duplicate values are eliminated, missing values are filled in (or the record is discarded), and categories and field options are standardized.
  • Enrichment – Existing data is supplemented with other data sources to increase the overall value of the data.
  • Monitoring – In order to avoid interruptions in data quality, the tool identifies and resolves quality issues quickly, often in real time.

The preferred integration method

The differences between these two methods are not only confined to the order in which you perform the steps. To decide which method to use, you’ll need to consider the following:

  • Underlying storage technologies
  • Your design approach to data warehouse architecture
  • The business use cases for the data warehouse itself

Ultimately, choosing either ETL or ELT will depend on your specific data needs, the types and amounts of data being processed, and how far along your organization is in its digital transformation.

Data quality with ETL and ELT

Integrating your data doesn't have to be complicated or expensive. Ensuring its quality doesn’t have to be a compromise. Talend Data Fabric simplifies your ETL and/or ELT processes with data quality capabilities so your team can work with data you can trust. With over 1,000 connectors and components, you'll be able to move data from virtually any source to your data warehouse more quickly and efficiently than by hand coding alone. We'll help you cut costs, accelerate time to value, and deliver data you can trust.

Learn more about managing data quality in modern data environments with the Definitive Guide to Data Quality.

Ready to get started with Talend?