What is a Data Lake?
The digital universe is doubling in size every two years, and is expected to reach 44 trillion gigabytes by 2020. Up to 90 percent of that data is unstructured or semi-structured, which presents a two-fold challenge: find a way to store all this data and maintain the capacity to process it quickly. This is where a data lake comes in.
What is a data lake?
A data lake is a central storage repository that holds big data from many sources in a raw, granular format. It can store structured, semi-structured, or unstructured data, which means data can be kept in a more flexible format for future use. When storing data, a data lake associates it with identifiers and metadata tags for faster retrieval.
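The idea of storing raw data alongside identifiers and metadata tags can be sketched in a few lines of Python. This is only an illustration: `TinyLake`, `LakeObject`, and the tag names `source` and `fmt` are hypothetical, and an in-memory dictionary stands in for real lake storage.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake plus the metadata used to find it again."""
    object_id: str
    payload: bytes  # stored exactly as received, in its native format
    tags: dict = field(default_factory=dict)


class TinyLake:
    """Hypothetical in-memory stand-in for a data lake's storage and catalog."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes, **tags) -> str:
        # Derive a stable identifier from the content itself.
        object_id = hashlib.sha256(payload).hexdigest()[:12]
        self._objects[object_id] = LakeObject(object_id, payload, dict(tags))
        return object_id

    def find(self, **wanted):
        # Retrieval filters on metadata only; payloads are never parsed here.
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in wanted.items())
        ]


lake = TinyLake()
lake.put(b'{"user": 1, "text": "great product"}', source="reviews", fmt="json")
lake.put(b"2024-01-01,42.0", source="sensors", fmt="csv")
lake.put(b"\x89PNG...", source="scans", fmt="png")

json_objects = lake.find(fmt="json")
```

Structured, semi-structured, and binary payloads sit side by side here, and the lookup relies on the tags alone rather than on the content of each item.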
Coined by James Dixon, CTO of Pentaho, the term “data lake” refers to the ad hoc nature of data in a data lake, as opposed to the clean and processed data stored in traditional data warehouse systems.
Data lakes are usually configured on a cluster of inexpensive and scalable commodity hardware. This allows data to be dumped in the lake in case there is a need for it later without having to worry about storage capacity. The clusters could either exist on-premises or in the cloud.
Data lakes are easily confused with data warehouses, but feature some distinct differences that can offer big benefits to the right organizations—especially as big data and big data processes continue to migrate from on-premises to the cloud.
Benefits of a data lake
A data lake works on a principle called schema-on-read. This means that there is no predefined schema into which data needs to be fitted before storage. Only when the data is read during processing is it parsed and adapted into a schema as needed. This feature saves a lot of time that’s usually spent on defining a schema. This also enables data to be stored as is, in any format.
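A minimal sketch of schema-on-read in Python may make this concrete. The sample records, the `PAYMENT_SCHEMA` mapping, and the `read_with_schema` helper are all assumptions made up for illustration; they do not come from any particular data lake product.

```python
import json

# Raw records are kept exactly as they arrived; no schema was enforced on write.
raw_events = [
    '{"user_id": "17", "amount": "19.99", "country": "DE"}',
    '{"user_id": 42, "amount": 5, "coupon": "SPRING"}',
    '{"user_id": "8", "note": "no amount recorded"}',
]

# The schema is chosen by the reader: field name -> (type converter, default).
PAYMENT_SCHEMA = {
    "user_id": (int, None),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Parse raw lines and project each one onto the schema at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else default
            for name, (convert, default) in schema.items()
        }


payments = list(read_with_schema(raw_events, PAYMENT_SCHEMA))
```

The three raw records have different shapes, yet all of them were stored. A different consumer could read the same lines with a different schema, for example one that keeps `country` and `coupon`, without rewriting anything in storage.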
Because the data is already collected in one place in its raw form, data scientists can access, prepare, and analyze it faster using data lakes. For analytics experts, this vast pool of data, available in various non-traditional formats, supports a variety of use cases such as sentiment analysis or fraud detection.
A data lake and a data warehouse are similar in their basic purpose and objective, which makes them easily confused:
- Both are storage repositories that consolidate the various data stores in an organization.
- The objective of both is to create a one-stop data store that will feed into various applications.
However, there are fundamental distinctions between the two that make them suitable for different scenarios.
- Schema-on-read vs schema-on-write: The schema of a data warehouse is defined and structured before storage (the schema is applied while writing data). A data lake, in contrast, has no predefined schema, which allows it to store data in its native format. So in a data warehouse, most of the data preparation usually happens before the data is stored. In a data lake, it happens later, when the data is actually being used.
- Complex vs simple user accessibility: Because data is not organized in a simplified form before storage, a data lake often needs an expert with a thorough understanding of the various kinds of data and their relationships to read through it. A data warehouse, in contrast, is easily accessible to both technical and non-technical users due to its well-defined and documented schema. Even a new member of the team can begin to use a warehouse quickly.
- Flexibility vs rigidity: With a data warehouse, not only does it take time to define the schema at first, it also takes considerable resources to modify it when requirements change in the future. Data lakes, however, can adapt to changes easily. Also, as the need for storage capacity increases, it is easier to scale the servers on a data lake cluster.
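For contrast with a data lake, the warehouse side of the first distinction (schema-on-write) can be sketched with Python's built-in `sqlite3` module. SQLite is only a stand-in for a warehouse here, and the `payments` table and sample rows are hypothetical.

```python
import sqlite3

# Warehouse side: the table schema exists before any data is written.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (user_id INTEGER NOT NULL, amount REAL NOT NULL)"
)

incoming = [
    {"user_id": 17, "amount": 19.99},
    {"user_id": 8},  # missing the required amount
]

rejected = []
for row in incoming:
    try:
        conn.execute(
            "INSERT INTO payments (user_id, amount) VALUES (?, ?)",
            (row.get("user_id"), row.get("amount")),
        )
    except sqlite3.IntegrityError:
        # Schema-on-write: a non-conforming record fails at load time.
        rejected.append(row)

stored = conn.execute("SELECT user_id, amount FROM payments").fetchall()
```

The incomplete record never reaches the table, so everything that is stored is clean and easy to query. A data lake would have kept that record as is and left the decision about the missing field to whoever reads it later.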
For more on this distinction, and to help determine which is best for your organization, see “Data Lakes vs Data Warehouses”. There is also an emerging open data management architecture that combines the flexibility of a data lake with the data management capabilities of a data warehouse, known as a data lakehouse.
Cloud data lakes or on-premises?
Data lakes are traditionally implemented on-premises on Hadoop clusters, with storage on HDFS and processing resources managed by YARN. Hadoop is scalable, low-cost, and offers good performance with its inherent advantage of data locality (data and compute reside together).
However, there are challenges to creating an on-premises infrastructure:
- Space: Bulky servers occupy real estate, which translates to higher costs.
- Setup: Procuring hardware and setting up data centers isn't straightforward and can take weeks or months.
- Scalability: Scaling up storage capacity takes time and effort because of the increased space requirement and the cost approvals needed from senior executives.
- Estimating requirements: Since scaling isn't easy on-premises, it becomes important to estimate the hardware requirements correctly at the beginning of the project. As data grows unsystematically every day, this is a tough feat to achieve.
- Cost: On-premises cost estimates have generally proven to be higher than those for the cloud alternatives.
Cloud data lakes, on the other hand, help overcome these challenges. A data lake in the cloud is:
- Easier and quicker to get started. Rather than a big bang approach, the cloud allows users to get started incrementally.
- Cost-effective with a pay-as-you-use model.
- Easier to scale up as needs grow, which eliminates the stress of estimating requirements and getting approvals.
The real-estate savings also add to the cost benefits.
Cloud data lake challenges
There are challenges to using a cloud data lake, of course. Some organizations prefer not to store confidential and sensitive information in the cloud due to security risks. While most cloud-based data lake vendors vouch for security and have increased their protection layers over the years, the looming uncertainty over data theft remains.
Another practical challenge is that some organizations already have an established data warehousing system in place to store their structured data. They may choose to migrate all that data to the cloud, or explore a hybrid solution with a common compute engine accessing structured data from the warehouse and unstructured data from the cloud.
Data governance is another concern. A data lake should not become a data swamp that is difficult to wade through. Talend’s platform ensures that data lakes stay clean and accessible.
Data lake architecture: Hadoop, AWS, and Azure
It’s important to remember that there are two components to a data lake: storage and compute. Both storage and compute can be located either on-premises or in the cloud. This results in multiple possible combinations when designing a data lake architecture.
Organizations can choose to stay completely on-premises, move the whole architecture to the cloud, consider multiple clouds, or even a hybrid of these options.
There is no single recipe here. Depending on the needs of an organization, there are several good options.
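The basic storage and compute combinations described above can be enumerated in a short sketch. The `LOCATIONS` tuple and the `label` helper are illustrative assumptions, and multi-cloud variants are left out to keep the example small.

```python
from itertools import product

# Each of the two components can live in either place.
LOCATIONS = ("on-premises", "cloud")

architectures = [
    {"storage": storage, "compute": compute}
    for storage, compute in product(LOCATIONS, repeat=2)
]


def label(arch):
    """Name a combination: same location for both parts, or a hybrid."""
    if arch["storage"] == arch["compute"]:
        return f"fully {arch['storage']}"
    return "hybrid"


labels = [label(arch) for arch in architectures]
```

Two components with two possible locations each give four basic designs, two of which are hybrids. Adding a second cloud provider as a third location would raise that to nine.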
Data lakes on Hadoop
Many people associate Hadoop with data lakes.
A Hadoop cluster of distributed servers solves the concern of big data storage. At the core of Hadoop is its storage layer, HDFS (Hadoop Distributed File System), which stores and replicates data across multiple servers. YARN (Yet Another Resource Negotiator) is the resource manager that decides how to schedule resources on each node. MapReduce is the programming model used by Hadoop to split data into smaller subsets and process them in its cluster of servers.
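The MapReduce programming model can be illustrated with a single-process word count in plain Python. This sketch does not use Hadoop's API: the function names and sample documents are made up, and on a real cluster the map and reduce steps would run in parallel on different servers.

```python
from collections import defaultdict

# Each string stands in for a block of data held on a different server.
documents = ["big data lake", "data lake storage", "big big data"]


def map_phase(doc):
    """Map: emit a (key, value) pair for every word in one input split."""
    return [(word, 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each call to `map_phase` sees only its own split and each call to `reduce_phase` sees only one key, both steps can be distributed across a cluster without the workers needing to coordinate with each other.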
Other than these three core components, the Hadoop ecosystem comprises several supplementary tools such as Hive, Pig, Flume, Sqoop, and Kafka that help with data ingestion, preparation, and extraction. Hadoop data lakes can be set up on-premises as well as in the cloud using enterprise platforms such as Cloudera and Hortonworks. Other cloud data lake offerings, such as Azure's, wrap their functionality around the Hadoop architecture.
Benefits of a data lake on Hadoop:
- More familiarity among technologists
- Less expensive because it is open-source
- Many ETL tools available for integration with Hadoop
- Easy to scale
- Data locality makes computation faster
Data lakes on AWS
AWS has an exhaustive suite of product offerings for its data lake solution.
Amazon Simple Storage Service (Amazon S3) sits at the center of the solution and provides the storage layer. Kinesis Streams, Kinesis Firehose, Snowball, and Direct Connect are data ingestion tools that allow users to transfer massive amounts of data into S3. There is also a database migration service that helps migrate existing on-premises data to the cloud.
In addition to S3, there is DynamoDB, a low-latency NoSQL database, and Elasticsearch, a service that provides a simplified mechanism to query the data lake. Cognito User Pools define user authentication and access to the data lake. Services such as Security Token Service, Key Management Service, CloudWatch, and CloudTrail help secure and monitor the data. For processing and analytics, there are tools such as Redshift, QuickSight, EMR, and Machine Learning.
The long list of product offerings available from AWS comes with a steep initial learning curve. However, the solution's comprehensive functionality finds extensive use in business intelligence applications.
Benefits of a data lake on AWS:
- Exhaustive and feature-rich product suite
- Flexibility to pick and choose products based on unique requirements
- Low costs
- Strong security and compliance standards
- Separation of compute and storage to scale each one as needed
- Collaboration with APN (AWS Partner Network) firms such as Talend ensures seamless AWS onboarding
Data lakes on Azure
Azure Data Lake is the data lake offering from Microsoft. It has a storage layer and an analytics layer: the storage layer is called Azure Data Lake Store (ADLS), and the analytics layer consists of two components, Azure Data Lake Analytics and HDInsight.
ADLS is built on the HDFS standard and has unlimited storage capacity. It can store trillions of files, and a single file can be larger than one petabyte. ADLS allows data to be stored in any format and is secure and scalable. It supports any application that uses the HDFS standard. This makes migration of existing data easier, and also facilitates plug-and-play with other compute engines.
HDInsight is a cloud-based data lake analytics service. Built on top of Hadoop YARN, it allows data to be accessed using tools such as Spark, Hive, Kafka, and Storm. It supports enterprise-grade security through integration with Azure Active Directory.
Azure Data Lake Analytics is also an analytics service, but its approach is different. Rather than using tools such as Hive, it uses a language called U-SQL, a combination of SQL and C#, to access data. It is well suited to big data batch processing because users pay only for the jobs they run.
Benefits of a data lake on Azure:
- Having both storage and compute in the cloud makes it simple to manage
- Strong analytical services with powerful functionality
- Easy to migrate from an existing Hadoop cluster
- Many big data experts are familiar with Hadoop and its tools, so it is easy to find skilled staff
- Integration with Active Directory means no separate effort is needed to manage security
Getting started with data lakes
Data lakes, with their ability to handle velocity and variety, have business intelligence users excited. Now, there is an opportunity to combine processed data with subjective data available on the internet.
It is possible to sift through machine data such as X-rays and MRI scans to determine causal patterns of diseases. In IoT applications, a huge amount of sensor data can be processed at high speed. The retail industry is able to offer an omni-channel experience using a wealth of data mined about the user.
Data lakes are not only useful in advanced predictive analytical applications, but also in regular organizational reporting, especially when it involves different data formats.
The question is no longer whether a data lake is needed, but which solution to use and how to implement it. Take a look at our Definitive Guide to Cloud Data Warehouses and Cloud Data Lakes to learn how to maximize your data lake investment.