Data Warehouse Optimization POC

Talend Big Data and Machine Learning Cookbook

View the rest of the series:

Sandbox Set Up

IoT Predictive Maintenance
Real-time Risk Assessment
Recommendation Engine


In this example, we explore a data warehouse optimization approach that uses the power of Spark to analyze a large dataset before loading it into the data warehouse. This way, we can confirm the integrity of the data and, if necessary, reject the load before bad data corrupts the quality of the data warehouse. This demo is configured to highlight Talend Studio’s built-in Spark engine and can be run without direct connectivity to a big data platform. While Talend Studio’s Spark engine is not a solution for a production environment, it illustrates that even in development, Talend users get full Spark functionality without having to connect to or set up a full-scale big data platform.



Pre-load Analytics

By analyzing large volumes of data BEFORE loading it into your data warehouse, you avoid the costly cleanup of data anomalies after the fact.



Large volume processing with Spark

With Talend and Spark, you can process gigabytes and terabytes of data in a fraction of the time.


ETL Off-loading

Utilizing Talend with a big data platform, you can optimize your data warehouse by removing the costly overhead of data processing.



Access the data warehouse optimization use case portal from the sandbox loading page for quick-run directions and a graphical web interface.

Sandbox Data Warehouse Optimization Loading page

Open Talend Studio within the sandbox environment. For this example, we will be working in the Warehouse_Optimization folder found in the Repository view. We will explore jobs under both the Standard and Big Data Batch job designs. When ready to begin, follow the steps below:

  1. Navigate to the Warehouse_Optimization folder under Big Data Batch jobs. Run the job Step_01_Generate_MonthlyReport. This Spark job reads a large collection of files from a single directory and aggregates the data into a single report.
  2. Navigate to the Warehouse_Optimization folder under Big Data Batch jobs. Run the job Step_02_Month_Over_Month_Comparison. This Spark job takes the newly aggregated data from the previous job and compares it to the same report generated the prior month to surface any data anomalies. In a typical use case, these jobs would run each time a new dataset is scheduled to be loaded into the warehouse (here, monthly). For the purposes of this sandbox, the “previous month’s” data is already included for comparison.
  3. Navigate to the Warehouse_Optimization folder under Standard jobs. Run the job Step_03_GoogleChart_Product_by_Unit. This job uses Google Charts to graph the comparison reports created in the previous step. These charts can be viewed in the web page dashboard.
  4. Review the web page dashboard and the two graphical reports that are generated. You should notice certain products that are significantly out of line with the others, indicating inaccurate data. In a real-life scenario, business analysts familiar with the data would recognize the anomalies and decline the data load to the warehouse.
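The aggregation performed by the Spark job in step 1 can be approximated in plain Python for illustration. The file layout, column names (`product`, `units`), and directory contents below are hypothetical stand-ins for the sandbox dataset; the real job is built from Talend's Spark components, not hand-written code.

```python
import csv
import glob
import os
import tempfile

def aggregate_monthly(data_dir):
    """Read every CSV file in data_dir and total units sold per product."""
    totals = {}
    for path in sorted(glob.glob(os.path.join(data_dir, "*.csv"))):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                totals[row["product"]] = totals.get(row["product"], 0) + int(row["units"])
    return totals

# Demo: write two small part-files into a temp directory, then aggregate them.
demo_dir = tempfile.mkdtemp()
for name, body in [
    ("part-001.csv", "product,units\nwidget,10\ngadget,5\n"),
    ("part-002.csv", "product,units\nwidget,7\ngadget,3\n"),
]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write(body)

report = aggregate_monthly(demo_dir)
print(report)  # {'widget': 17, 'gadget': 8}
```

In the sandbox, the same reduce-by-key pattern runs distributed across Spark partitions instead of in a single loop.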
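The month-over-month check in step 2 boils down to comparing each product's new total against the prior month's and flagging large swings. A minimal sketch follows; the 25% threshold and the sample values are invented for illustration, not taken from the sandbox job.

```python
def month_over_month(current, previous, threshold=0.25):
    """Return {product: pct_change} for products whose totals moved
    more than the threshold relative to the previous month."""
    anomalies = {}
    for product, new_total in current.items():
        old_total = previous.get(product)
        if not old_total:
            continue  # no baseline for brand-new products
        change = (new_total - old_total) / old_total
        if abs(change) > threshold:
            anomalies[product] = round(change, 2)
    return anomalies

current = {"widget": 17, "gadget": 8, "sprocket": 400}
previous = {"widget": 16, "gadget": 9, "sprocket": 110}
print(month_over_month(current, previous))  # {'sprocket': 2.64}
```

A product jumping 264% month over month is exactly the kind of outlier an analyst would question before approving the load.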
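For step 3, the chart itself is rendered by Google Charts in the browser; on the data side, the job only has to emit a table-shaped structure. A sketch of that transformation, with illustrative column names and values:

```python
import json

def to_chart_rows(totals):
    """Convert {product: units} into a Google Charts DataTable-style
    array-of-arrays: a header row followed by one row per product."""
    rows = [["Product", "Units"]]
    rows.extend([product, units] for product, units in sorted(totals.items()))
    return rows

chart_data = to_chart_rows({"widget": 17, "gadget": 8})
print(json.dumps(chart_data))
# [["Product", "Units"], ["gadget", 8], ["widget", 17]]
```

The dashboard page then passes this array to `google.visualization.arrayToDataTable` (or equivalent) to draw the bar chart.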

Data Warehouse demo web page


This example highlights the built-in Spark engine embedded in Talend Studio; the same demo could be configured to run against any of the supported big data platforms. Large data volumes take time to load, and if the data is corrupt or inaccurate, corrective action takes even longer. With Talend’s native Spark connectivity and processing power, the large dataset can be analyzed quickly and easily before loading into the data warehouse. This gives business analysts a preview of the quality of the data and removes the burden of heavy data processing from the data warehouse, freeing the warehouse for its intended purpose: providing quick access to quality reporting.