Ranjith Ramachandran is a senior Big Data and Talend Consultant at Wavicle Data Solutions.
Data lakes: smooth sailing or choppy waters?
In May, Talend announced its support for Databricks’ open source Delta Lake, “a storage layer that sits on top of data lakes to ensure reliable data sources for machine learning and other data science-driven pursuits.” What does this mean for your company, and is Delta Lake right for you?
Since data lakes came into existence nearly a decade ago, they have been touted as a panacea for companies seeking a single repository to house data from all sources, whether internal or external to the organization, cloud-based or on-premises, batch or streaming.
Data lakes remain an ideal repository for storing all types of historical and transactional data to be ingested, organized, stored, assessed, and analyzed. Never before have business analysts been able to access all this data in one place. None of this was sustainable in traditional data warehouses due to the high volume, cost, latency, complexity, and performance requirements. So yes, data lakes cure many of our data woes.
But over time, data lakes have grown exponentially, often to the extent that the volumes of raw, granular data have become overwhelming for analytical purposes, even though they’re intended to make it easy to mine and analyze data.
In fact, the term “data swamp” has emerged to perfectly visualize a data lake gone bad. Data swamps are data lakes with no curation or data life cycle management and minimal to no contextual metadata or data governance. Because of the way the data is stored, it becomes hard to use, or outright unusable.
Now Delta Lake offers a solution to restore reliability to data lakes “by managing transactions across streaming and batch data and across multiple simultaneous readers and writers.” Here, we’ll discuss how Delta Lake overcomes common challenges with data lakes and how you can leverage this technology with Talend to get more value from your data.
Common challenges with data lakes
Whether or not you would classify your data lake as a swamp, you may notice end users struggling with data quality, query performance, and reliability as a result of the volume and raw nature of data in data lakes. Specifically:
Query performance:
- Too many small files (or a few very large ones) mean more time is spent opening and closing files than actually reading content (this is even worse with streaming data)
- Partitioning or indexing breaks down when data has many dimensions and/or high-cardinality columns
- Neither storage systems nor processing engines handle very large numbers of subdirectories and files well
Data quality and reliability:
- Failed production jobs leave data in corrupt state requiring tedious recovery
- Lack of consistency makes it hard to mix appends, deletes, and upserts and get consistent reads
- Lack of schema enforcement creates inconsistent and low-quality data
Generating analytics from data lakes
As organizations set up their data lake solutions, often migrating from traditional data warehousing environments to cloud solutions, they need an analytics environment that can quickly access accurate and consistent data for business applications and reports. For data lakes to serve the analytic needs of the organization, you must follow these key principles:
- Data cataloging and metadata management: To present the data to business, create a catalog or inventory of all data, so business users can search data in simple business terminology and get what they need. But with high volumes of new data added every day, it’s easy to lose control of indexing and cataloging the contents of the data lake.
- Governance and multi-tenancy: Authorizing and granting access to subsets of data requires security and data governance. Delineating who can see which data, and at what granularity level, requires multi-tenancy features. Without these capabilities, data is controlled by only a few data scientists instead of the broader organization and business users.
- Operations: For a data lake to become a key operational business platform, build in high availability, backup, and reliable recovery.
- Self-service: For a data lake to deliver value, build a consistent ingestion process that captures all metadata and schemas. In many cases, business users want to blend their own data with the data from the data lake.
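The cataloging principle above can be sketched in a few lines of Python. This is a hypothetical toy catalog, not Talend's data catalog product: each dataset registered in the lake carries business-facing search terms alongside its physical path, so users can find data in business terminology rather than by file path.

```python
# Toy data catalog (illustrative only): each entry pairs a physical
# location with business-friendly search terms.
CATALOG = [
    {"path": "/lake/raw/clicks", "name": "Click events",
     "terms": ["clicks", "web traffic", "customer behavior"]},
    {"path": "/lake/raw/orders", "name": "Order history",
     "terms": ["orders", "sales", "revenue"]},
]

def search_catalog(term: str):
    """Return datasets whose business terms contain the query string."""
    term = term.lower()
    return [d for d in CATALOG if any(term in t for t in d["terms"])]
```

A business user searching for "sales" would be routed to the order-history dataset without needing to know where it physically lives.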
Yet as data lakes continue to grow in size, including increasing volumes of unstructured data, these principles become increasingly complex to design and implement. Delta Lake was created to simplify this process.
Delta Lake improves reliability and speed of analytics
Talend has committed to seamlessly integrate with Delta Lake, “leveraging its ACID compliance, Time Travel (data versioning), and unified batch and streaming processing. In addition to connecting to a broad range of data sources, including popular SaaS apps and cloud platforms, Talend will empower Delta Lake users with comprehensive data quality and governance features to support machine learning and advanced analytics, natively supporting the full power of the Apache Spark technology underneath Delta Lake.”
The benefits of Delta Lake include:
- Reliability: Failed write jobs do not update the commit log, so partial or corrupt files are never visible to readers
- Consistency: Changes to tables are stored as ordered, atomic commits, and each commit is a set of actions recorded as a file in the log directory. Readers process the log in atomic units, so they always read consistent snapshots. In practice, most writes don’t conflict, and isolation levels are tunable.
- Performance: Small files left behind by many transactions are compacted with the OPTIMIZE command, which can also cluster data on multiple columns (multi-dimensional clustering) to speed up selective queries
- Reduced system complexity: Delta handles both batch and streaming data (via a direct integration with Structured Streaming for low-latency updates), including the ability to write batch and streaming data concurrently to the same table
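The reliability and consistency points above can be illustrated with a small Python sketch. This is a toy model, not the real Delta Lake log format: data files become visible only once a numbered commit file referencing them lands in the log, so a job that fails before committing leaves no trace for readers.

```python
import json, os, tempfile

class ToyDeltaLog:
    """Simplified model of a commit-log-protected table (illustrative only)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(os.path.join(root, "_log"), exist_ok=True)

    def write(self, filename, rows, fail_before_commit=False):
        # 1. Write the data file first; it is invisible until committed.
        with open(os.path.join(self.root, filename), "w") as f:
            json.dump(rows, f)
        if fail_before_commit:
            raise RuntimeError("job crashed before commit")
        # 2. Record the commit as the next numbered log entry.
        version = len(os.listdir(os.path.join(self.root, "_log")))
        with open(os.path.join(self.root, "_log", f"{version:06d}.json"), "w") as f:
            json.dump({"add": [filename]}, f)

    def snapshot(self):
        # Readers see only files referenced by committed log entries,
        # replayed in commit order.
        rows, log_dir = [], os.path.join(self.root, "_log")
        for entry in sorted(os.listdir(log_dir)):
            with open(os.path.join(log_dir, entry)) as f:
                for name in json.load(f)["add"]:
                    with open(os.path.join(self.root, name)) as df:
                        rows.extend(json.load(df))
        return rows
```

A write that crashes before its commit leaves an orphaned data file on disk, but `snapshot()` never reads it, which is the essence of why failed jobs cannot corrupt what readers see.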
Architecting a modern Delta Lake platform with Talend
The architecture diagram below shows how Talend supports Delta Lake integration. Using Talend’s rich base of built-in connectors, as well as MQTT and AMQP to connect to real-time streams, you can easily ingest real-time, batch, and API data into your data lake environment. A data lake accelerator makes it easier to onboard new sources at a greater pace than hand coding every requirement. An accelerator lets you ingest data in a consistent way by capturing all required metadata and schemas of the ingested systems, which is the first principle of deploying a successful data lake.
Talend integrates well with all cloud solution providers. In this architecture diagram, we’re showing the data lake on the Microsoft Azure cloud platform. The storage layer is Azure Data Lake Store (ADLS), and the analytics layer consists of two components: Azure Data Lake Analytics and HDInsight. An alternative in Azure is Azure Blob storage, which provides storage only, with no compute attached.
Alternatively, if you’re using Amazon Web Services, the data lake can be built based on Amazon S3 with all other analytical services sitting on top of S3.
The Talend Big Data Platform integrates with Databricks Delta Lake, where you can take advantage of several features that enable you to query large volumes of data for accurate, reliable analytics:
- Scalable storage: Data is stored as Parquet files on a big data file system or on object stores such as S3 or Azure Blob
- Metadata: A sequence of metadata files tracks the operations made on the table and is stored in the same scalable storage alongside the table
- Schema check and validation: Delta can infer the schema from input data, which reduces the effort of dealing with the schema impact of changing business needs at multiple levels of the pipeline and data stack
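To make the schema check-and-validation idea concrete, here is a minimal Python sketch. This is illustrative only, not Delta's actual API: infer a schema from sample records, then reject rows that do not conform before they can reach the table.

```python
def infer_schema(records):
    """Map each column name to the Python type observed in the sample."""
    schema = {}
    for rec in records:
        for col, val in rec.items():
            schema.setdefault(col, type(val))
    return schema

def validate(records, schema):
    """Split records into (good, rejected) according to the schema."""
    good, rejected = [], []
    for rec in records:
        ok = set(rec) == set(schema) and all(
            isinstance(rec[c], t) for c, t in schema.items())
        (good if ok else rejected).append(rec)
    return good, rejected
```

Enforcing the inferred schema at write time is what keeps a mistyped or misshapen row from silently lowering the quality of the whole table.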
Below is an architecture diagram showing how Talend supports a Delta Lake implementation, followed by instructions for converting a data lake project to Delta Lake with Talend.
Creating or converting a data lake project to Delta Lake with Talend
Below are instructions that highlight how to use Delta Lake through Talend.
Configuration: Set up the Big Data Batch job with Spark configuration under the Run tab. Set the distribution to Databricks and choose the corresponding version.
Under the Databricks section, update the Databricks endpoint (Azure or AWS), cluster ID, and authentication token.
Sample flow: In this sample job, click events collected from a mobile app are joined against customer profiles and loaded as a Parquet file into DBFS. This DBFS file is used in the next step to create the Delta table.
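The join step in this sample flow can be sketched conceptually in plain Python (the actual job runs in Talend on Spark; the `customerId` key and field names here are illustrative): each click event is enriched with the matching customer-profile attributes before being written out.

```python
# Hypothetical lookup data standing in for the customer-profile source.
profiles = {"c1": {"name": "Ada"}, "c2": {"name": "Grace"}}

# Click events as they might arrive from the mobile app.
clicks = [
    {"eventId": "e1", "customerId": "c1", "page": "/home"},
    {"eventId": "e2", "customerId": "c2", "page": "/cart"},
]

# Join: merge each event with its customer's profile attributes.
enriched = [
    {**event, **profiles.get(event["customerId"], {})}
    for event in clicks
]
```

In the real job, Talend components perform this join at Spark scale and write the enriched result to DBFS as Parquet.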
Create Delta table: Creating a Delta table requires the USING DELTA clause in the DDL; in this case, because the file already resides in DBFS, a LOCATION is specified so the table reads its data from that path.
- Convert to Delta table: If the source files are in Parquet format, you can use the SQL CONVERT TO DELTA statement to convert the files in place, creating an unmanaged table:
SQL: CONVERT TO DELTA parquet.`/delta_sample/clicks`
- Partition data: Delta Lake supports partitioning of tables; partitioning the data speeds up queries that have predicates on the partition columns.
CREATE TABLE clicks (date DATE, eventId STRING, data STRING)
USING DELTA
PARTITIONED BY (date)
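Why partitioning helps can be sketched in a few lines of Python. This is a conceptual model of Hive-style partition directories, not Delta's implementation: a predicate on the partition column prunes entire directories before any data file is opened.

```python
# Hypothetical partition layout: files grouped under directories
# named after the partition-column value.
files = {
    "date=2019-06-01": ["part-0.parquet", "part-1.parquet"],
    "date=2019-06-02": ["part-2.parquet"],
}

def files_to_scan(partition_filter):
    """Return only files in partitions whose value satisfies the predicate."""
    return [f for part, fs in files.items()
            if partition_filter(part.split("=", 1)[1])
            for f in fs]
```

A query filtering on a single date touches only that date's directory, rather than scanning every file in the table.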
- Batch upserts: To merge a set of updates and inserts into an existing table, we can use the MERGE INTO statement. For example, the following statement takes a stream of updates and merges it into the clicks table. If a click event is already present with the same eventId, Delta Lake updates the data column using the given expression. When there is no matching event, Delta Lake adds a new row.
MERGE INTO clicks
USING updates
ON clicks.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET clicks.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, eventId, data) VALUES (updates.date, updates.eventId, updates.data)
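The upsert semantics of MERGE can be modeled in plain Python (an illustrative sketch, not Delta's engine): rows are matched on eventId, updated when a match is found, and inserted otherwise.

```python
def merge_into(table, updates):
    """Upsert `updates` into `table`; both are lists of dicts keyed by eventId."""
    by_id = {row["eventId"]: dict(row) for row in table}
    for upd in updates:
        if upd["eventId"] in by_id:
            by_id[upd["eventId"]]["data"] = upd["data"]   # WHEN MATCHED
        else:
            by_id[upd["eventId"]] = dict(upd)             # WHEN NOT MATCHED
    return list(by_id.values())
```

The point of doing this inside Delta Lake rather than by hand is that the whole merge is applied as one atomic commit, so readers never observe a half-merged table.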
- Read table: Any Delta table can be accessed either by its file location or by the Delta table name.
SQL : Either SELECT * FROM delta.`/delta_sample/clicks` or SELECT * FROM clicks
Talend in data egress, analytics, and machine learning at a high level:
- Data egress: Using Talend API services, create APIs faster by eliminating the need to use multiple tools or manually code. Talend covers the complete API development lifecycle, from design and test through documentation, implementation, and deployment, using simple, visual tools and wizards.
- Machine learning: With the Talend toolset, machine learning components are ready to use off the shelf. This ready-made ML software allows data practitioners, no matter their level of experience, to easily work with algorithms, without needing to know how an algorithm works or how it was constructed. At the same time, experts can fine-tune those algorithms as desired. Talend’s machine learning components include tALSModel, tRecommend, tClassifySVM, tClassify, tDecisionTreeModel, tPredict, tGradientBoostedTreeModel, tSVMModel, tLogisticRegressionModel, tNaiveBayesModel, and tRandomForestModel.