For decades, enterprise data projects have relied heavily on traditional ETL for their data processing, integration and storage needs. Today, the emergence of big data and unstructured data originating from disparate sources has made cloud-based ELT solutions even more attractive.
The 2018 IDG Cloud Computing Study revealed that 73% percent of organizations had at least one application, or a portion of their computing infrastructure, already in the cloud. Although cloud computing has undoubtedly changed the way most organizations approach data integration projects today, data quality tools continue ensuring that your organization will benefit from data you can trust.
This article will underscore the relevance of data quality to both ETL and ELT data integration methods by exploring different use cases in which data quality tools have played a relevant part role. We will also examine what it takes for data quality tools to be effective for both ETL and ELT.
In order to understand the role of data quality and how it is applied to both methods, let’s first go over the key differentiators between ETL and ELT.
On the one hand, the Extract Transform Load (ETL) approach has been the gold standard for data integration for many decades and is commonly used for integrating data from CRMs, ERPs, or other structured data repositories into data warehouses. While ETL processes are designed for internal, relational data warehousing, they require dedicated platforms for the intermediate steps between extracting data and loading it into target repositories.
Extract Load Transform (ELT), on the other hand, addresses the volume, variety, and velocity of big data sources and don’t require this intermediate step to load data into target systems. ELT requires less physical infrastructure and dedicated resources because transformation is performed within the target system’s engine. Thus, the shift from ETL to ELT tools is a natural consequence of the big data age and has become the preferred method for data lake integrations.
The key difference between ETL and ELT tools is ETL transforms data prior to loading data into target systems, while the latter transforms data within those systems.
Data Quality with ETL use cases
But wait, is ETL still being used?
It has been said that ETL only has a place in legacy data warehouses used by companies or organizations that don’t plan to transition to the cloud. In a cloud-centric world, organizations of all types have to work with cloud apps, databases, and platforms — along with the data that they generate. But it’s important not to forget the data contained in your on-premises systems. ETL tools should be able to accommodate data from any source — cloud, multi-cloud, hybrid, or on-premises. Today, there are ETL tools on the market that have made significant advancements in their functionality by expanding data quality capabilities such as data profiling, data cleansing, big data processing and data governance.
Talend helps Domino’s Pizza deliver data quality
Dominos wanted to integrate information from over 85,000 structured and unstructured data sources to get a single view of its customers and global operations. The IT architecture in place at Domino’s was preventing them from reaching those goals. They didn’t have a standard way to ingest data and had data quality issues because they were doing a lot of custom and costly development.
Domino’s selected Talend Data Fabric for its unified platform capabilities for data integration and big data, combined with the data quality tools, to capture data, cleanse it, standardize it, enrich it, and store it, so that it could be consumed by multiple teams after the ETL process.
With its modern data platform in place, Domino’s now has a trusted, single source of the truth that it can use to improve business performance from logistics to financial forecasting while enabling one-to-one buying experiences across multiple touchpoints.
AstraZeneca: ETL into a data lake
AstraZeneca plc is the seventh-largest pharmaceutical company in the world with operations in in over 100 countries and data dispersed throughout the organization in a wide range of sources and repositories. Having to draw data dispersed throughout the organization from CRM, HR, Finance systems and several different versions of SAP ERP systems slowed down vital reporting and analysis projects. In addition, inconsistencies in reporting from silos of information prevented the company from finding insights hiding in unconnected data sources. They needed to put in place an architecture that could help bring data together in a single source of the truth.
Using a data lake on AWS to hold the data from its diverse range of source systems, AstraZeneca leverages Talend for lifting, shifting, transforming and delivering our data into the cloud, extracting from multiple sources and then pushing that data into Amazon S3. The Talend jobs are built and then executed in AWS Elastic Beanstalk. After some transformation work, Talend then bulk loads that into Amazon Redshift for the analytics.
As it is crucial to manage the quality of the data entering the data lake so that is does not become a data swamp, Talend Data Quality has been added to the Data Scientist AWS workstation. This has allowed the team to develop and automate the data transfer and cleansing to assist in their advanced analytics. Leveraging data quality through ETL and the data lake lets AstraZeneca’s Sciences and Enabling unit manage itself more efficiently, with a new level of visibility. By consolidating data from global SAP systems, the finance department has created a single source of the truth to provide insight and help set long-term strategy.
Data Quality for ELT use case
DoubleDown: from ETL to ELT
DoubleDown Interactive is a leading provider of fun-to-play casino games on the internet. DoubleDown’s challenge was to take continuous data feeds from their game event data and integrate that with other data into a holistic representation of game activity, usability and trends. Their data integration, however, was complex—it required many sources with separate data flow paths and ETL transformations for each data log from the JSON format.
The previous process was to use Talend’s enterprise integration data suite to get the data into a noSQL database for running DB collectors and aggregators. The data was then pulled into a staging area where data quality tools cleaned, transformed, and conformed it to the star schema. This created hidden costs and risks due to the lack of reliability of their data pipeline and the amount of ETL transformations required. DoubleDown had to find an alternative method to hasten the data extraction and transformation process.
DoubleDown opted for an ELT method with a Snowflake cloud data warehouse because of its scalable cloud architecture and its ability to load and process JSON log data in its native form. All previous MongoDB transformations and aggregations, plus several new ones, are now done inside Snowflake. Using Snowflake has brought DoubleDown three important advantages: a faster, more reliable data pipeline; lower costs; and the flexibility to access new data using SQL. In addition, by making the integration more streamlined, they leverage data quality tools while running their Talend ELT process every 5 minutes for a more trusted source of data.
Data Quality tools
Both ETL and ELT processes involve staging areas. In ETL, these staging areas are found within the ETL tool, whereas in ELT, the staging area is within the data warehouse, and the database engine performs the transformations. It is within these staging areas where the data quality tools must also go to work.
Data quality for ETL tools
In an ETL integration, data quality must be managed at the root data is extracted from applications like Salesforce and SAP, databases like Oracle and Redshift, or file formats like CSV, XML, JSON, or AVRO. This can lead to a lot of work for the data scientist
By managing ETL through a unified platform, data quality can be transformed in the cloud for better flexibility and scalability.
Data quality for ELT tools
With ELT, on the other hand, data staging occurs after data is loaded into data warehouses, data lakes, or cloud data storage, resulting in increased efficiency and less latency. Consequently, if the target repository doesn’t have data quality tools built in, it will be harder to ensure that the data being transformed after loading is data you can trust.
Pervasive Data quality
In either case, the best approach is to establish a pervasive, proactive, and collaborative approach to data quality in your company. Data quality must be something that every team (not just the technical ones) has to be responsible for; it has to cover every system; and has to have rules and policies that stop bad data before it ever gets in.
Thanks to self-service data preparation tools like Talend Data Preparation, cloud-native platforms with machine learning capabilities make the data preparation process easier. This means that business users who may lack advanced IT skills can run the processes themselves and data scientists can spend more time on analyzing data, rather than on cleaning it. Self-service tools make data preparation a team sport.
The right tools
Regardless the integration method being used, the data quality tools should do the following:
- Profiling – Data is analyzed to determine its quality, volume, and format. Metadata will be examined, and overall data quality is assessed.
- Matching – Data is examined to identify and merge related entries within your dataset.
- Cleansing – Duplicate values are eliminated, missing values are completed or discarded, and all categories and field options are standardized.
- Enrichment – Existing data is supplemented with other data sources to maximize the value of the data.
- Monitoring – In order to avoid interruptions in data quality, this tool identifies and resolves quality issues quickly, often in real-time.
The preferred integration method
The differences between these two methods are not only confined to the order in which you perform the steps. In order to decide which method to use, you’ll need to consider the following:
- Underlying storage technologies
- Your design approach to data warehouse architecture
- The business use cases for the data warehouse itself
Ultimately, choosing either ETL or ELT will depend on their specific data needs, the types and amounts of data being processed and how far along an organization is in its digital transformation.
Data quality with ETL and ELT
Integrating your data doesn’t have to be complicated or expensive. Ensuring its quality doesn’t have to be a compromise. Talend Data Fabric simplifies your ETL or ELT process with data quality capabilities, so your team can focus on other priorities and work with data you can trust. With over 900 components, you’ll be able to move data from virtually any source to your data warehouse more quickly and efficiently than by hand-coding alone. We’ll help you reduce your spending, accelerate time to value, and deliver data you can trust.
Try Talend Data Fabric for free to see how it can help your business.