Data lakes and data warehouses are both widely used for storing big data, but they are not interchangeable terms. A data lake is a vast pool of raw data, the purpose of which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
The two types of data storage are often confused, but are much more different than they are alike. In fact, the only real similarity between them is their high-level purpose of storing data.
The distinction is important because they serve different purposes and require different sets of eyes to be properly optimized. While a data lake works for one company, a data warehouse will be a better fit for another.
Four key differences between a data lake and a data warehouse
There are several differences between a data lake and a data warehouse. Data structure, ideal users, processing methods, and the overall purpose of the data are the key differentiators.
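Data structure: raw (data lake); processed (data warehouse)
Purpose of data: not yet determined (data lake); currently in use (data warehouse)
Users: data scientists (data lake); business professionals (data warehouse)
Accessibility: highly accessible and quick to update (data lake); more complicated and costly to make changes (data warehouse)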
Data structure: raw vs. processed
Raw data is data that has not yet been processed for a purpose. Perhaps the greatest difference between data lakes and data warehouses is the varying structure of raw vs. processed data. Data lakes primarily store raw, unprocessed data, while data warehouses store processed and refined data.
Because of this, data lakes typically require much larger storage capacity than data warehouses. Additionally, raw, unprocessed data is malleable, can be quickly analyzed for any purpose, and is ideal for machine learning. The risk of all that raw data, however, is that data lakes sometimes become data swamps without appropriate data quality and data governance measures in place.
Data warehouses, by storing only processed data, save on pricey storage space by not maintaining data that may never be used. Additionally, processed data can be easily understood by a larger audience.
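The contrast between raw and processed data can be made concrete with a minimal Python sketch. The event records, field names, and the `process_purchases` helper are hypothetical and chosen only for illustration; real lakes and warehouses use dedicated storage and query engines rather than in-memory lists.

```python
import json

# Hypothetical raw events as they might land in a data lake: stored as-is,
# with inconsistent fields and no agreed-upon purpose yet.
raw_events = [
    '{"user": "ana", "action": "purchase", "amount": "19.99", "device": "ios"}',
    '{"user": "ben", "action": "view", "page": "/pricing"}',
    '{"user": "ana", "action": "purchase", "amount": "5.01"}',
]


def process_purchases(lines):
    """Shape raw events for one specific purpose:
    total purchase amount per user, in integer cents."""
    totals = {}
    for line in lines:
        event = json.loads(line)
        if event.get("action") != "purchase":
            continue  # events irrelevant to this purpose are dropped
        cents = round(float(event["amount"]) * 100)
        totals[event["user"]] = totals.get(event["user"], 0) + cents
    # Fixed shape: every output row has the same two columns.
    return [{"user": u, "total_cents": c} for u, c in sorted(totals.items())]


# Processed rows, the kind of data a warehouse would hold.
warehouse_rows = process_purchases(raw_events)
print(warehouse_rows)
```

The raw list keeps everything, including the page view that this particular report never uses, while the processed rows are smaller and easier to read but answer only the one question they were built for.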
Purpose: undetermined vs in-use
The purpose of individual data pieces in a data lake is not fixed. Raw data flows into a data lake, sometimes with a specific future use in mind and sometimes just to have on hand. This means that data lakes have less organization and less filtration of data than data warehouses.
Processed data is raw data that has been put to a specific use. Since data warehouses only house processed data, all of the data in a data warehouse has been used for a specific purpose within the organization. This means that storage space is not wasted on data that may never be used.
Users: data scientists vs business professionals
Data lakes are often difficult for those unfamiliar with unprocessed data to navigate. Raw, unstructured data usually requires a data scientist and specialized tools to understand and translate it for any specific business use.
Alternatively, there is growing momentum behind data preparation tools that create self-service access to the information stored in data lakes.
Processed data is used in charts, spreadsheets, tables, and more, so that most, if not all, of the employees at a company can read it. Processed data, like that stored in data warehouses, only requires that the user be familiar with the topic represented.
Accessibility: flexible vs secure
Accessibility and ease of use refer to the data repository as a whole, not the data within it. Data lakes impose little structure and are therefore easy to access and easy to change. Plus, any changes to the data can be made quickly, since data lakes have very few limitations.
Data warehouses are, by design, more structured. One major benefit of data warehouses is that the processing and structure of data make the data itself easier to decipher, but the limitations of that structure make data warehouses difficult and costly to manipulate.
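The trade-off between flexibility and structure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real architecture: a plain list stands in for the data lake, an in-memory SQLite table stands in for the warehouse, and the `sales` table, its columns, and the `load_into_warehouse` helper are hypothetical names.

```python
import sqlite3

# Stand-in for a data lake: any record shape is accepted as-is.
lake = []

# Stand-in for a data warehouse: a table with a fixed, enforced schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer TEXT NOT NULL, total_cents INTEGER NOT NULL)"
)


def load_into_warehouse(record):
    """Insert one record; return True on success, False if the schema rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sales (customer, total_cents) VALUES (?, ?)",
                (record.get("customer"), record.get("total_cents")),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # a required column was missing (NULL)


records = [
    {"customer": "ana", "total_cents": 2500},
    {"customer": "ben", "page": "/pricing"},  # new shape, no total_cents
]

accepted = []
for record in records:
    lake.append(record)  # the lake never refuses a record
    accepted.append(load_into_warehouse(record))

print(len(lake), accepted)
```

Supporting the second record shape in the warehouse would require a schema change and a reload, which is the kind of cost the structure imposes. The lake takes the record immediately, but leaves the work of interpreting it to whoever reads it later.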
Data lake vs data warehouse: which is right for me?
Organizations often need both. Data lakes were born out of the need to harness big data and benefit from the raw, granular structured and unstructured data for machine learning, but there is still a need to create data warehouses for analytics use by business users.
Healthcare: data lakes store unstructured information
Data warehouses have been used for many years in the healthcare industry, but they have never been hugely successful. Because of the unstructured nature of much of the data in healthcare (physicians' notes, clinical data, etc.) and the need for real-time insights, data warehouses are generally not an ideal model.
Data lakes allow for a combination of structured and unstructured data, which tends to be a better fit for healthcare companies.
Education: data lakes offer flexible solutions
In recent years, the value of big data in education reform has become enormously apparent. Data about student grades, attendance, and more can not only help failing students get back on track, but can actually help predict potential issues before they occur. Flexible big data solutions have also helped educational institutions streamline billing, improve fundraising, and more.
Much of this data is vast and very raw, so educational institutions often benefit most from the flexibility of data lakes.
Finance: data warehouses appeal to the masses
In finance, as well as other business settings, a data warehouse is often the best storage model because it can be structured for access by the entire company rather than only by data scientists.
Big data has helped the financial services industry make big strides, and data warehouses have played a large part in them. The main reason a financial services company might be swayed away from a data warehouse is cost: a data lake can be more cost-effective, though it is not as effective for other purposes.
Transportation: data lakes help make predictions
Much of the benefit of data lake insight lies in the ability to make predictions.
In the transportation industry, especially in supply chain management, the predictive capability that comes from flexible data in a data lake can have huge benefits, most notably cost savings realized by examining data from forms within the transport pipeline.
The importance of choosing a data lake or data warehouse
The “data lake vs data warehouse” conversation has likely just begun, but the key differences in structure, process, users, and overall agility make each model unique. Depending on your company’s needs, developing the right data lake or data warehouse will be instrumental to growth.