ETL of the Future: What Data Lakes and the Cloud Mean for ETL

Advances in IT infrastructure, the emergence of data lakes, and increased reliance on cloud-native technologies have information experts asking: what does the future hold for ETL? While the answer is neither simple nor straightforward, one thing is clear: in an ever-evolving data landscape, companies and organizations must remain vigilant to ensure that their ETL tools and strategies stay efficient, effective, and capable of handling whatever tomorrow brings.

The Future of ETL - The Impact of the Cloud and Data Lakes

Many aspects of the data landscape have undergone dramatic transformations in the past few years. New technologies and methods emerge, organizations’ data management infrastructure continues to evolve, and the amount of available data grows larger each year. To understand what these factors mean for the ETL (extract, transform, load) process, it helps to look more closely at two developments in particular: the shift toward cloud-native technologies and the emergence of data lakes. But first, it’s worth taking a look at the way ETL has already changed.

ETL and the Evolution of IT

Advances in data speed, infrastructure, and data processing have done more to shape the future of ETL than perhaps any other factor. After all, these advances laid the foundation for the shift toward cloud storage and the arrival of big data. Consider the evolution of the internet itself: when the world wide web was invented in 1989 (this is an abbreviated history!), few people had access to the internet. In 1995, there were only 16 million users worldwide, and dial-up speeds would eventually top out at 56 kbps. The early 2000s brought the emergence of fiber optic networks and dramatic improvements in data transfer speeds. Today, there are over 4 billion internet users across the globe, and average connection speeds in the fastest countries have grown to 28.6 Mbps. While that might seem impressive enough, Google Fiber now boasts a connection speed of one gigabit per second.

Along with improved internet speeds and the explosion in the number of internet users, advances in programming and data architecture are also shaping the future of ETL. When Apache Hadoop reached its 1.0 release in 2011, the average organization gained access to a fast, dependable framework for distributed computing. This allowed powerful processors, which previously sat mostly idle, to share in the work of processing large data jobs. The results were significant improvements in speed, capacity, and reliability. As the Hadoop framework grew, more and more companies reduced their dependence on expensive onsite servers in favor of distributed computing clusters, often referred to collectively as the cloud.

In 2013, Spark became an Apache project. This big data analytics engine could process certain in-memory workloads at up to 100 times the speed of Hadoop MapReduce, which made near real-time ETL widely accessible and changed the way industry professionals approached data analytics and business intelligence.
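
To make that concrete, here is a minimal sketch of what a Spark-based ETL job can look like in PySpark. The bucket, file paths, and column names are hypothetical placeholders, not references to any real system:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Extract: read raw CSV data into a distributed DataFrame.
    orders = spark.read.csv("s3://example-bucket/raw/orders.csv",
                            header=True, inferSchema=True)

    # Transform: filter and aggregate across the cluster; the in-memory
    # execution is where Spark outpaces disk-based MapReduce.
    daily_revenue = (
        orders
        .filter(F.col("status") == "completed")
        .groupBy(F.to_date("created_at").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"))
    )

    # Load: write the transformed result to the target store.
    daily_revenue.write.mode("overwrite").parquet(
        "s3://example-bucket/curated/daily_revenue/")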

Today, ETL processes handle vast amounts of data at incredible speeds. ETL has evolved in other ways, too: it can now scale in tandem with the ebb and flow of web traffic, and many cloud service providers charge only for the actual ETL processing time used. The result is ETL that is flexible, fast, and cost-effective.

ETL and the Cloud

In the past, ETL processes were executed locally, or on-site; in other words, ETL was managed in a facility close to the physical location where the data would ultimately be used or stored. Today, ETL processes are increasingly migrating away from centralized data centers and toward systems that run partially or completely in the cloud. The movement toward cloud-native storage and processing is itself driven by advances in technology, faster internet speeds, increased efforts to prevent data loss, and the need to guard against cybersecurity threats.

The trend toward cloud-native storage and processing isn’t the only factor transforming the ETL process. The proliferation of connected devices, improvements in the processes that collect and store information, and the Internet of Things (IoT) have all resulted in a big data boom. As data volumes grow larger and the number of data sources continues to multiply, companies are increasingly dependent on data to maintain their competitive advantage.

As a result of both of these factors, ETL must now be able to accommodate more data from more sources, more quickly than ever before. In many cases, ETL must also be able to handle streaming data, which means it must process data in real time, as it is generated. ETL tools are also evolving in response to the kinds of data that are now available, so that companies can process data effectively and mine it for business intelligence and actionable insights, no matter where it comes from.
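
As a rough illustration of what streaming ETL looks like in practice, here is a hedged sketch using Spark Structured Streaming. The Kafka broker address, topic name, and event schema are assumptions made purely for the example:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

    # Hypothetical schema for incoming purchase events.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Extract: subscribe to events as they are generated.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "events")
           .load())

    # Transform: parse each record and keep only purchase events.
    purchases = (raw
                 .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
                 .select("e.*")
                 .filter(F.col("event_type") == "purchase"))

    # Load: continuously append results to the destination in micro-batches.
    query = (purchases.writeStream
             .format("parquet")
             .option("path", "s3://example-bucket/streaming/purchases/")
             .option("checkpointLocation", "s3://example-bucket/checkpoints/")
             .start())

The key difference from batch ETL is that the query runs continuously: each new event flows through the same transformation logic the moment it arrives.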

ETL in motion: E-commerce and the Cloud

One illustration of how big data and cloud computing have impacted ETL can be found in the world of e-commerce. In the U.S. alone, quarterly retail e-commerce sales grew from $34 billion in 2009 to $127 billion in the first quarter of 2018. Even more recently, from 2014 to 2018, the number of people worldwide making purchases online grew from 1.32 billion to 1.79 billion, a figure on track to reach 2.14 billion by 2021. That means more than a quarter of the world’s population will be using the internet to pay bills and shop.

The increased prominence of e-commerce has placed new demands on ETL tools. With massive amounts of data being generated from online shopping portals, marketing campaigns, and customer service feedback, ETL processes must be capable of handling different kinds of data from multiple sources, often in real-time. And since many retail companies use a variety of data sources and applications, ETL must also be able to handle the complex integrations necessary for turning various processes into a single, unified whole.
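
For instance, a retailer might need to join structured order records with semi-structured customer-feedback events. The sketch below, with hypothetical paths and a hypothetical shared customer_id key, suggests what that unification can look like in PySpark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("integration-sketch").getOrCreate()

    # Two hypothetical sources: structured CSV orders and
    # semi-structured JSON feedback (one JSON object per line).
    orders = spark.read.csv("s3://example-bucket/raw/orders.csv",
                            header=True, inferSchema=True)
    feedback = spark.read.json("s3://example-bucket/raw/feedback/")

    # Integrate: build a single unified view keyed on a shared identifier.
    unified = orders.join(feedback, on="customer_id", how="left")
    unified.write.mode("overwrite").parquet(
        "s3://example-bucket/curated/customer_360/")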

E-commerce giant Groupon provides a great example of the level of complexity that ETL processes must now be able to handle. Groupon’s ETL platform processes and stores 1TB of raw data, manages 1,000 data integration jobs each day, and unifies multiple types of data from a variety of sources.

Data Lakes and ETL

Before we tackle the effect of data lakes on ETL, it might be helpful to spend some time discussing data lakes in general. As with all things data-driven, the way data is collected and stored continues to change. In the past, companies have primarily relied on data warehouses for storing, reporting, and analyzing data. In simpler terms, data warehouses are systems that contain current and historical data that has been processed and standardized. The warehouse is the central location from which all data is retrieved.

In contrast, data lakes are repositories of data in a more fluid sense (pun inevitable). Data lakes store both raw and transformed data, from a variety of sources, in virtually any format. More complex and adaptable than data warehouses, data lakes offer companies the capacity to store data in any form for use at any time. For example, data lakes contain all of the following, illustrated in the short sketch after this list:

  • Unstructured data — data that does not conform to any standardized format
  • Semi-structured data — data stored in its own loose format, but tagged with identifiers that make it accessible in structured environments
  • Structured data — data pre-organized to comply with an expected and explicitly defined layout
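
Here is a small, purely illustrative Python sketch of how those three kinds of data differ; the records themselves are invented for the example:

    import csv
    import io
    import json

    # Structured: rows that conform to an explicit, predefined layout.
    structured = io.StringIO("order_id,amount\n1001,19.99\n1002,5.49\n")
    rows = list(csv.DictReader(structured))

    # Semi-structured: no fixed layout, but self-describing tags
    # (here, JSON keys) make individual fields addressable.
    semi_structured = json.loads('{"user": "u-42", "meta": {"page": "/home"}}')

    # Unstructured: free text with no standardized format; a data lake
    # can store it as-is and interpret it later.
    unstructured = "Customer called to say the checkout page felt slow today."

    print(rows[0]["amount"], semi_structured["meta"]["page"], len(unstructured))

A data lake can hold all three side by side, deferring any transformation until the data is actually needed.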

So as far as ETL is concerned, what do the differences between data warehouses and data lakes mean? The original ETL processes were developed under the data warehouse model, in which data was structured and organized systematically. But some ETL technologies have adapted to the emergence of data lakes and are now called ELT. That’s right: the “extract, transform, load” approach has become the “extract, load, transform” approach. In other words, when it comes to data lakes, the process has to be changed up a bit. Instead of transforming data before it reaches its final destination, different types of data are collected from multiple sources and delivered to a central repository in raw form, where they can then be transformed as needed.
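
A deliberately tiny, self-contained Python sketch makes the difference in ordering concrete. Every name here is an illustrative stand-in, not any vendor's API:

    def extract(source):
        # Pull raw records from the source system.
        return list(source)

    def transform(records):
        # Standardize and filter, e.g. to match a warehouse schema.
        return [r.strip().lower() for r in records if r.strip()]

    def load(records, destination):
        # Deliver records to the target store.
        destination.extend(records)

    raw_source = ["  Alice ", "", "BOB "]

    # ETL: transform first, so only conformed data reaches the warehouse.
    warehouse = []
    load(transform(extract(raw_source)), warehouse)

    # ELT: land raw data in the lake first; transform later, on demand.
    lake = []
    load(extract(raw_source), lake)
    curated = transform(lake)

    print(warehouse)  # ['alice', 'bob']
    print(curated)    # ['alice', 'bob']

In ELT the raw records remain in the lake, so the same data can later be transformed in different ways for different consumers.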

As more organizations move to the data lake storage solution, ETL is in some cases being eclipsed by its cousin ELT. But that doesn’t mean that ETL is going away. In fact, ETL continues to play a vital role in data migration and integration. Either process may be appropriate, depending on the company, the data, and the situation.

Choosing ETL Tools

With so many changes taking place in the data landscape, it can be difficult to know which ETL tools make the most sense. The good news is that ETL tools have continued to evolve in order to meet changing business needs, and the right platform will provide the flexibility and adaptability to manage your data today and tomorrow. So whether you rely on a data warehouse or a data lake, there is an effective ETL solution.

Talend Data Fabric provides both ETL and ELT support, as well as over 900 connectors, so you can manage and control your data no matter where it’s stored. Talend makes it easier for developers and data pros to build, test, troubleshoot, and deploy applications and solutions that use either ETL or ELT, wherever the data sources are located. Take control of your data with a solution that is ready for today and built for tomorrow.

Try Talend today to see how it can simplify all your data management needs.

Ready to get started with Talend?