Enabling Olympic-level performance and productivity for Delta Lake on Databricks

Recently, Databricks introduced Delta Lake, a new analytics platform that combines the best elements of data lakes and data warehouses in a paradigm it calls a “lakehouse.” Delta Lake expands the breadth and depth of use cases that Databricks customers can take on, and the result is a unified analytics platform with robust support for everything from simple reporting to complex data warehousing to real-time machine learning.

If you use Delta Lake, you have just one place to put data and one place to deploy your jobs, making your architectures more streamlined. And for Databricks users, as for any developers who want a streamlined workflow, a tool like Talend Data Fabric that automates these processes can become a “killer app” thanks to the improvements it brings in productivity, manageability, and governance.

The ‘magic’ of Delta Lake

Databricks is well-known as the team that invented Spark and the first company to commercialize it in a cloud-based environment. Not resting on those laurels, Databricks took another leap forward with Delta Lake, which extends Spark’s processing power with robust database capabilities for serving data to downstream systems. Delta Lake introduces a new file format, Delta, that allows for ACID transactions, data history (a.k.a. time travel), schema enforcement (e.g. a string field must stay a string), schema evolution, and the ability to continuously stream data, including updates.
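To make those capabilities concrete, here is a minimal PySpark sketch of ACID writes, schema enforcement and evolution, and time travel against a Delta table. The table path, column names, and Spark session configuration are illustrative assumptions, not part of the article.

    from pyspark.sql import SparkSession

    # Assumes the delta-spark package is available on the cluster or classpath.
    spark = (SparkSession.builder
             .appName("delta-features-sketch")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # ACID transaction: the write either commits fully or is not visible at all.
    events = spark.createDataFrame([(1, "open"), (2, "closed")], ["id", "status"])
    events.write.format("delta").mode("overwrite").save("/tmp/demo/events")

    # Schema enforcement: appending a mismatched schema fails unless schema
    # evolution is explicitly requested with mergeSchema.
    more = spark.createDataFrame([(3, "open", "web")], ["id", "status", "channel"])
    more.write.format("delta").mode("append") \
        .option("mergeSchema", "true").save("/tmp/demo/events")

    # Time travel: read the table as it existed at an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo/events")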

Using Delta Lake, the general pattern is a “multi-hop” architecture that lands data into ingestion tables (Bronze), then refines those tables with a common data schema and standardized data for consistency (Silver), then publishes that data as features or to an aggregated data store (Gold).
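As a rough illustration of this multi-hop pattern (hand-written, not Talend-generated code), a batch PySpark job might move data through the three layers as sketched below; the paths, columns, and cleansing rules are hypothetical, and spark is the session from the previous sketch.

    from pyspark.sql import functions as F

    # Bronze: land the raw ingested data as-is.
    bronze = spark.read.json("/landing/orders/")
    bronze.write.format("delta").mode("append").save("/delta/bronze/orders")

    # Silver: conform to a common schema and apply basic quality rules.
    silver = (spark.read.format("delta").load("/delta/bronze/orders")
              .select(F.col("order_id").cast("long"),
                      F.upper(F.col("country")).alias("country"),
                      F.col("amount").cast("double"))
              .dropna(subset=["order_id", "amount"]))
    silver.write.format("delta").mode("overwrite").save("/delta/silver/orders")

    # Gold: publish aggregates ready for reporting and ML features.
    gold = silver.groupBy("country").agg(F.sum("amount").alias("total_amount"))
    gold.write.format("delta").mode("overwrite").save("/delta/gold/orders_by_country")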

Data professionals encounter unique challenges at each of those steps, which are represented in this figure.

 

Figure 1: The three levels of a Delta Lake (image from http://delta.io)

Let’s talk about the requirements for building a trusted lakehouse and how Talend can help you prepare and publish trusted data to all downstream users.

 

Building a lakehouse with Talend – Earn your Olympic medal

The figure above shows three logical data stages:

  • Bronze data, ingested from external systems into a raw or landing zone
  • Silver data, refined into tables in a common data model in which the data conforms to quality expectations
  • Gold data, finalized and ready to apply in data science models and machine learning techniques

 

Bronze — ingestion

Most organizations have hundreds of sources they might need data from. Some are simple, like .CSV files, while others are complicated API-based integrations like Salesforce and Marketo. To populate a lakehouse for historical analytics, you need to perform a one-time ingestion (bulk load) of existing records. If you want to keep your data current, as most organizations do, you need to replicate incremental changes to the lakehouse on a scheduled basis, moving only new or changed data each time.
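One common way to apply incremental changes to a Delta table is a key-based MERGE, sketched below with the Delta Lake Python API. The table paths and the customer_id key are assumptions for illustration, and this is not how Stitch or Talend implement replication internally.

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/delta/bronze/customers")
    changes = spark.read.format("delta").load("/staging/customer_changes")

    (target.alias("t")
           .merge(changes.alias("c"), "t.customer_id = c.customer_id")
           .whenMatchedUpdateAll()      # update rows that already exist
           .whenNotMatchedInsertAll()   # insert rows that are new
           .execute())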

Remove the barriers to data ingestion

Stitch Data Loader for Databricks Delta: This month, we added Delta Lake on Databricks as a supported destination for Stitch Data Loader. Stitch is a cloud-first platform for simple data ingestion and replication.

All Databricks users now have a very quick, easy, and reliable way to ingest data from more than 100 SaaS and database sources into their lakehouses. It takes only minutes to set up a Stitch replication job and begin loading data into Delta Lake. There’s no coding and nothing to install: Stitch makes data ingestion easy for everyone, including nontechnical users.

If you’re building BI reports on the raw “Bronze” data, you’re ready to go.

 

Silver – refined tables

At the “Silver” level your data is ready to use in business intelligence and enterprise reporting. It’s in a common data model and the data content itself has gone through your data quality rules, ensuring that it meets your organization’s expectations for trusted data.

However, most of the time you’re dealing with more than one data source, and two or more sources seldom have the same schema. You need a common schema, which requires a mapping exercise so that the data from the ingestion tables matches the schema of the refined tables. Once you have a common data structure, you have to consider the issue of data conformity: is the data standardized?

Is a State field a two-letter abbreviation (e.g. NY) or is it spelled out (New York)?

Is Phone Number standardized (e.g. for a US number, is it (415) 555-1212 or 415-555-1212 or 4155551212)?

These are simple examples; things can get more complicated when it comes to internal data such as sales stages, product codes, and customer statuses.
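As a small illustration (not Talend’s built-in rules), the two standardizations above could be expressed in PySpark roughly as follows, assuming a hypothetical contacts table with state and phone columns.

    from pyspark.sql import functions as F

    contacts = spark.read.format("delta").load("/delta/bronze/contacts")

    standardized = (contacts
        # Map spelled-out state names to two-letter abbreviations.
        .withColumn("state",
                    F.when(F.upper("state") == "NEW YORK", "NY")
                     .when(F.upper("state") == "CALIFORNIA", "CA")
                     .otherwise(F.upper("state")))
        # Strip everything but digits so (415) 555-1212, 415-555-1212,
        # and 4155551212 all end up in the same form.
        .withColumn("phone", F.regexp_replace("phone", r"[^0-9]", "")))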

Assuming you have a data structure that conforms to a standard, what about duplication — especially for reference data, master data, and dimension data?

Although you can find duplicates easily when field contents are exactly the same (John Smith in CRM and John Smith in your billing system), what happens when data sets differ, yet refer to the same person, organization, place, or thing? For example, John Van Ofen who lives at 123 Hanover St and Johan Vanoven who lives at 123 Hanover Strasse could be the same person.
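Fuzzy matching like this is usually handled with dedicated matching components, but as a rough sketch of the idea, Spark’s built-in soundex and levenshtein functions can flag candidate duplicates; the customer_id, last_name, and address columns and the distance threshold are assumptions.

    from pyspark.sql import functions as F

    people = spark.read.format("delta").load("/delta/silver/customers")
    a, b = people.alias("a"), people.alias("b")

    candidates = (a.join(b, F.soundex("a.last_name") == F.soundex("b.last_name"))
                   # Compare each pair only once and skip self-matches.
                   .where(F.col("a.customer_id") < F.col("b.customer_id"))
                   .withColumn("addr_distance",
                               F.levenshtein(F.col("a.address"), F.col("b.address")))
                   .where(F.col("addr_distance") <= 5)
                   .select(F.col("a.customer_id").alias("id_1"),
                           F.col("b.customer_id").alias("id_2"),
                           "addr_distance"))

A real matching job would block on more attributes and feed survivorship rules, which is exactly the kind of work dedicated data quality components take on.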

Data must be restructured to a common format, it must meet organizational standards, and it needs to be deduplicated. Talend Data Fabric, with its suite of transformation and data quality components, can standardize the data and automatically correct it, ensuring your lakehouse data is refined enough to be at the “Silver” level.

 See our data quality documentation for a full description of our data quality features, or download the Definitive Guide to Data Quality to learn how to stop bad data before it enters your system. 

Much like a toolbox, Talend Data Fabric allows data engineers to quickly find components that can solve their data transformation or data quality challenges.

 

Gold – feature engineering and aggregate data store

Once you have all the data loaded into a common structure, you should be able to trust the quality of that data. Now the final step is to identify the features: the field, or combination of fields, to use in your data science and machine learning algorithms. At this step you can also create aggregations that simplify model development.

Aggregation

Aggregation is a common function in data analytics.

Talend offers many aggregation components, including count, min, max, sum, average, median, mean, and others, and uses mappings to apply the components to perform these calculations.
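For comparison, the same kinds of calculations in plain PySpark look like the sketch below; the orders table, grouping key, and columns are hypothetical.

    from pyspark.sql import functions as F

    orders = spark.read.format("delta").load("/delta/silver/orders")

    summary = (orders.groupBy("country")
               .agg(F.count("*").alias("order_count"),
                    F.min("amount").alias("min_amount"),
                    F.max("amount").alias("max_amount"),
                    F.sum("amount").alias("total_amount"),
                    F.avg("amount").alias("avg_amount"),
                    F.expr("percentile_approx(amount, 0.5)").alias("median_amount")))

    summary.write.format("delta").mode("overwrite").save("/delta/gold/order_summary")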

Feature engineering

Feature engineering is an integral step in machine learning. Ensuring that your machine learning models have measurable properties or characteristics (called features) not only makes the data compatible with the model, it also improves the model’s accuracy and performance. Feature engineering is the process of transforming, standardizing, and preparing the data for the ML model.

There are many aspects to feature engineering, and although this article will not attempt to describe them all, these techniques are well described in “Fundamental Techniques of Feature Engineering for Machine Learning”.

That article highlights the key feature engineering functions needed; a few of them are sketched in code after the list below. These include:

  • Imputation – filling in values that are missing from the source
  • Handling Outliers – finding and disregarding values that are outliers
  • Binning – grouping data into common “bins” (e.g. converting values into “High”, “Medium”, and “Low” groupings)
  • Log Transform – standardizing numerical data to correct for magnitude differences
  • One-Hot Encoding – turning categorical data into a table of 1s and 0s that is easy for an ML model to consume
  • Grouping Operations – organizing the data into a pivot table
  • Feature Split – decomposing a value into constituent parts (e.g. a full name into first, middle, and last)
  • Scaling – normalizing numeric values into a range between 0 and 1, or standardizing them by considering standard deviations
  • Extracting Date – identifying the day, month, year, time between dates, holidays, weekends, etc.
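To make a few of these concrete, here is a hedged Spark ML sketch covering imputation, binning, one-hot encoding, a log transform, and date extraction; the input table, columns, and split points are assumptions, and Talend’s components wrap this kind of logic rather than requiring you to write it.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Imputer, Bucketizer, StringIndexer, OneHotEncoder
    from pyspark.sql import functions as F

    df = spark.read.format("delta").load("/delta/silver/customers")

    # Log transform and date extraction as plain column expressions.
    df = (df.withColumn("income_log", F.log1p("income"))
            .withColumn("signup_year", F.year("signup_date"))
            .withColumn("signup_dow", F.dayofweek("signup_date")))

    # Imputation: fill missing income values with the column mean.
    imputer = Imputer(inputCols=["income"], outputCols=["income_filled"], strategy="mean")

    # Binning: group ages into low / medium / high buckets.
    binner = Bucketizer(splits=[0.0, 30.0, 60.0, float("inf")],
                        inputCol="age", outputCol="age_bin")

    # One-hot encoding: categorical state -> vector of 1s and 0s.
    indexer = StringIndexer(inputCol="state", outputCol="state_idx", handleInvalid="keep")
    encoder = OneHotEncoder(inputCols=["state_idx"], outputCols=["state_vec"])

    features = Pipeline(stages=[imputer, binner, indexer, encoder]).fit(df).transform(df)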

As I look at each of these tactics in feature engineering, I keep checking them off as transformation or data quality functions that are provided in Talend Studio as “out of the box” components. In fact, Talend offers many more ML prep components, and a full description can be found in our online documentation.

 

Talend supports key Databricks infrastructure

Delta Lake users need the ability to leverage the Databricks platform and the core Spark and Delta Lake technologies.

Talend has been the leader in native Spark code generation dating back to the first commercial releases of Spark. Native code generation ensures that the logic defined in Talend Studio translates to the highest performance execution while ensuring that code follows the standard practices of Databricks.

With the latest release, Talend adds production-level support for reading and writing Delta tables. Because Data Fabric supports Spark Datasets and DataFrames, Talend jobs can attain the highest possible performance while using the simplest way to define workloads.
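As an example of the kind of DataFrame-level Delta access described here (written by hand rather than generated by Talend), a continuous stream from a Bronze table into a Silver table might look like the sketch below; the paths and checkpoint location are assumptions.

    # Read the Bronze Delta table as a stream of incremental changes.
    bronze_stream = (spark.readStream
                     .format("delta")
                     .load("/delta/bronze/orders"))

    # Lightly refine and continuously append into the Silver table.
    (bronze_stream
        .dropDuplicates(["order_id"])
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/delta/checkpoints/orders_silver")
        .outputMode("append")
        .start("/delta/silver/orders"))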

Power to the people

We surveyed hundreds of Talend customers using Databricks and asked them what data sources they want to load into a lakehouse, whether they want to make a one-time copy or they have an ongoing replication need, and what user profile they want to perform this work.

More than 70% of the respondents said they wanted data from cloud-based sales and marketing applications such as Salesforce, Marketo, and Google Ads. They said they want to do an initial bulk load of data, then keep the lakehouse refreshed at least every day; many said they wanted data refreshed at least every 15 minutes. Lastly, while data engineers accounted for a significant share of those asking for this capability, even more data scientists and business analysts were interested in doing the work themselves.

 

Summary

With built-in capabilities to ingest data into Bronze tables, refine that data into Silver tables, and finalize that data for data science and ML in your Gold tables, Talend provides the complete breadth of functions to build and maintain data pipelines. With a simple configuration setting that targets Databricks, you can easily deploy data pipelines, data quality, and data governance at any level. With Talend, you have an end-to-end data management platform to support BI analytics, data engineering, data science, and machine learning use cases.

 

 
