Talend Big Data and Machine Learning Cookbook
View the rest of the series:
In this example we explore a data warehouse optimization approach that utilizes the power of Spark to perform analytics of a large dataset prior to loading to the data warehouse. This way, we can confirm the integrity of the data and ultimately choose to reject the load before bad data corrupts the quality of the data warehouse. This demo is configured to highlight Talend Studio’s built-in Spark Engine and can be run without direct connectivity to a big data platform. While Talend Studio’s Spark engine is not a solution for a production environment, it does illustrate that even in development, Talend users can get full Spark functionality without having to connect to or set up a full-scale big data platform.
Access the data warehouse optimization use case portal from the sandbox loading page for quick-run directions and graphical web interface.
Open Talend Studio within the sandbox environment. For this example, we will be working in the Warehouse_Optimization folder found in the repository view. We will explore jobs in the Standard and Big Data Batch Job Designs. When ready to begin, follow the steps below:
- Navigate to the Warehouse_Optimization folder under Big Data Batch jobs. Run job Step_01_Generate_MonthlyReport. This is a spark job that reads a large collection of files from a single directory and aggregates the data into a single report.
- Navigate to the Warehouse_Optimization folder under Big Data Batch jobs. Run job Step_02_Month_Over_Month_Comparison. This Spark job takes the new aggregated data from the previous job and compares it to the same report file that was generated the prior month to indicate any data anomalies. In a typical use-case, these jobs would be run each time a new dataset is scheduled to be loaded into the Warehouse (in this case, it is monthly). For the purposes of this sandbox, we have already included the "Previous Month's" data for comparison.
- Navigate to the Warehouse_Optimization folder under Standard jobs. Run job Step_03_GoogleChart_Product_by_Unit. This job simply uses GoogleCharts to graph the generated comparison reports that were created in the previous job. These charts can be viewed in this web page dashboard.
- Review the web page dashboard and the two graphical reports that are generated. You should notice certain products that are significantly out of alignment with the other products, indicating the data is inaccurate. In a real-life scenario, business analysts who are familiar with the data should be able to recognize the anomalies and decline the data load to the warehouse.
This example highlights the use of a built-in Spark Engine that is embedded in Talend Studio. Similarly, this same demo could be configured to run against any of the big data platforms. Large data volumes take time to load and if the data is corrupt or inaccurate it can take even longer for corrective action. But with Talend's native Spark connectivity and processing power, the large dataset can be analyzed quickly and easily before loading into the data warehouse. This gives business analysts a preview of the quality of the data, and removes the burden of heavy data processing from the data warehouse; freeing the warehouse to be used for its intended purpose of providing quick access to quality reporting.