Modern Data Warehouse Architecture: Traditional vs Cloud Data Warehouse
The data warehouse space is changing rapidly. Traditional, on-premises legacy data warehouses are still adept at integrating structured data for business intelligence. It is the increase in diversely structured and formatted big data via the cloud that is making data storage needs more complex.
A cloud-based data warehouse architecture is designed to address the limitations of traditional databases. Moving to a cloud data warehouse will give an enterprise the opportunity to leverage many of the cloud’s benefits for data management. In this article, we’ll explain the differences between traditional and cloud data warehouse architectures and identify the advantages of both.
Understanding on-premises traditional data warehouse architecture
There are a number of different characteristics attributed solely to a traditional data warehouse architecture. These characteristics include varying architectural approaches, designs, models, components, processes and roles — all which influence the architecture’s effectiveness.
Three-tier architecture approach
The three-tier architecture approach is one of the more commonly found approaches to on-premises data warehousing. The three tiers include a bottom, middle, and top layer.
- Bottom Tier: The bottom tier contains the actual database server used to remove data from origin sources.
- Middle Tier: The middle tier has a server for online analytical processing (OLAP) that’s responsible for transforming data. It can either to relational operations or leverage a multidimensional OLAP model for multidimensional data operations.
- Top Tier: The top tier is similar to a user interface layer. It consists of tools for common data warehousing analytics such as reporting and.
Architecture design: Kimball or Inmon
Two of the most frequently used approaches to data warehousing design were created by Ralph Kimball and Bill Inmon. Inmon’s approach is considered top down; it treats the warehouse as a centralized repository for all of an organization’s data. Once there’s a centralized data model for that repository, organizations can use dimensional data marts based on that model. Data marts are repositories for individual business lines.
Kimball’s approach is based on a bottom up method in which data marts are the main methods of storing data. The data warehouse is basically a collection of those data marts that allows for uniform analytics jobs, reporting, and other business intelligence essentials.
Traditional data warehouse architecture models
Most data warehouses rely on one of three different models:
- Virtual data warehouse: Is based on the warehouse operating as the center of an organization’s data assets. It integrates data from each line of business for easy access across the enterprise.
- Data mart: Stresses the individual business units’ data for analytics and reporting. It involves aggregating data from multiple sources for one area of focus like marketing.
- Enterprise data warehouse
On-prem data warehouse architectural components
There are a couple of different structural components that can be included with traditional on-premise data warehouses. All data warehouses have a user layer for the specific data analytics or data mining tasks. If the data sources (another type of structure) contain mostly the same types of data, those sources can be input into the data warehouse structure and analyzed directly through the user layer.
The staging area structure is needed when the data sources contain data of different structures, formats, and data models. Here, data is changed into a summarized structured format so it can be holistically analyzed at the user layer. Depending on the business use case, this structure might not be needed if organizations are only analyzing data of similar types.
The staging area can be assisted by the addition of another structure, data marts. Data marts are useful for housing a specific business line’s summarized data for highly specific queries. For instance, sales department teams could access this data structure for detailed predictive analytics of sales across different locations.
Process and roles involved in traditional data warehouse design
The ETL (extract, transform, and load) process for traditional data warehouse design requires extracting data from sources, staging it with third party ETL tools for transformation, and moving data into the data warehouse for storage. ELT involves removing data from sources and putting them into data stores, then applying transformations within the warehouse. In both of these approaches, each aspect of the data flow is monitored via metadata and systems operations. Data logic sets rules for the analytics and reporting.
IT teams are usually involved in monitoring these processes at each step. However, data scientists may also oversee these steps, especially with the big data repositories commonly used with ELT. Both of these roles supply the results of the analytics performed to business users, who act on them.
There are still several benefits associated with the use of traditional, on-premises data warehouses that work well for integrating similar types of structured data and implementing data quality. However, this approach is much less flexible with semi-structured and structured data. It not only takes longer to adjust this data to the repositories’ uniform data models, but also is expensive to do so because of the separate ETL tools and staging required.
Understanding cloud-based data warehouse architecture
Cloud-based data warehouse architecture is relatively new when compared to legacy options. This data warehouse architecture means that the actual data warehouses are accessed through the cloud. There are several cloud based data warehouses options, each of which has different architectures for the same benefits of integrating, analyzing, and acting on data from different sources. The difference between a cloud-based data warehouse approach compared to that of a traditional approach include:
- Up-front costs: The different components required for traditional, on-premises data warehouses mandate pricey up-front expenses. Since the components of cloud architecture are accessed through the cloud, these expenses don’t apply.
- Ongoing costs: While businesses with on-prem data warehouses must deal with upgrade and maintenance costs, the cloud offers a low, pay-as-you-go model.
- Speed: Cloud-based data warehouse architecture is substantially speedier than on-premises options, partly due to the use of ELT — which is an uncommon process for on-premises counterparts.
- Flexibility: Cloud data warehouses are designed to account for the variety of formats and structures found in big data. Traditional relational options are designed simply to integrate similarly structured data.
- Scale: The elastic resources of the cloud make it ideal for the scale required of big datasets. Additionally, cloud-based data warehousing options can also scale down as needed, which is difficult to do with other approaches.
Some of the more notable cloud data warehouses in the market include Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure SQL Data Warehouse.
Amazon Redshift architecture overview
Redshift leverages the massively parallel processing architecture in which nodes store data in slices via a columnar format. Each node has their own storage, RAM, and compute power. The main types of nodes are leader and compute nodes; the former intakes queries and assigns them to compute nodes to perform the queries.
Because compute nodes can process data in different slices at the same time, Redshift has robust query performance. Compute nodes return the results to leader nodes, which aggregate them for client-side applications. Users can connect directly to Redshift with an assortment of BI or analytics tools to query the data directly where it lives.
Google BigQuery architecture overview
Google BigQuery relies on a serverless architecture in which different machines are used by the provider to manage resources. Consequently, clients aren’t involved in the resource management process. BigQuery’s architecture supports both traditional data loading and data streaming, the latter of which is designed for ingesting data in real-time.
The main architectural component for this cloud data warehouse is Dremel, a massively parallel query engine capable of reading hundreds of millions of rows in seconds. With this approach, data is actually stored in a file management system called Colossus that puts the data in clusters made up of different nodes. The files are distributed in 64 megabyte amounts in a columnar format. Queries are issued from a tree architecture among the different machines the data is stored in, helping with the quick response times.
Snowflake data warehouse architecture
Snowflake’s architecture is similar to Redshift because it also uses a cluster and node approach. The main architectural difference with Snowflake is that the compute capabilities are separate from the storage, producing a few important advantages. The main benefit? The storage location changes depending on whether or not users require computing at the moment.
If the user doesn’t need computation, the data is tiered (meaning moved to) another storage area that is less costly, since that storage area is not used for data computation. Also, separating storage from compute enables Redshift’s architecture to easily scale up and down as needed, well beyond the capacity of on-premises data warehouses. Additionally, the components for data ingestion and analysis are integrated with the storage component.
Microsoft Azure SQL data warehouse architecture
Similar to Redshift’s architecture, the cloud based data warehouse architecture of Microsoft Azure is based on massively parallel processing. Data is stored in relational databases that, because of this architecture, run fast SQL queries of enormous complexity. Additional tools in the Azure cloud ecosystem allow users to create automated pipelines for transmitting, processing and storing data at petabyte scale.
Moving from traditional to a cloud data warehouse
Although traditional database architecture still has its place when working with tight integrations of similar structured data types, the on-premise options begins to break down when there’s more variety to the stored data. Furthermore, on-premises architecture is expensive to attain and maintain, and simply doesn’t function at the speed and flexibility required for modern datasets in the current age of big data.
Cloud-based data warehouse architecture, on the other hand, is designed for the extreme scalability of today’s data integration and analytics needs. Not only does it produce significant performance and integration benefits, but cloud data warehouses are much more cost-efficient, scalable, and flexible for the variety of data formats used by organizations today. Ultimately, cloud-based data warehouse architecture is the most efficient utilization of data warehousing resources.
Organizations can optimize their transition from on-premises options to cloud-based data warehouses by using solutions designed comprehensively to manage the movement of data in the cloud. Talend Data Fabric, for example, focuses on providing well-governed and secure data management that facilitates the sustainability of cloud and hybrid-cloud workflows.
As a single suite of apps for data integration and data integrity, Talend Data Fabric provides you with easy access to your data while supporting the latest cloud data warehouses in the market. Try Talend Data Fabric to ensure trust and speed for your data as you move it to a cloud data warehouse.