Modern Data Architectures in the Real World: Enabling Business Users and Big Data Processing
Earlier this year, I finished an exciting Proof of Concept (POC) with one of the top energy and utility organizations using the Talend Big Data Platform. I thought I would write a quick blog post on getting started with self-service data in the enterprise, as it is a common theme I have been seeing at many companies focused on digital transformation.
This company was moving an existing on-premise data warehouse (DWH), built on legacy systems and technology, to a cutting-edge, cloud-based environment using AWS EC2, S3, Redshift and Aurora DB.
Multiple tools were used in this POC. Talend Big Data moved data between Amazon S3 and Redshift using the Redshift COPY command (all configured visually and as a reusable process). It then populated and transformed the Redshift dimension and fact tables, ran Data Quality checks along the way, and flagged all invalid records to the web-based Talend Data Stewardship app.
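For readers who have not seen the COPY step outside a visual tool, here is a minimal hand-written sketch of the kind of statement such a load boils down to. It is not the code Talend generates, and the table name, bucket path and IAM role ARN in the usage note are hypothetical placeholders rather than details from the POC.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, ignore_header=True):
    """Return a Redshift COPY statement that bulk-loads CSV files from S3.

    The arguments are interpolated directly into the SQL text, so they must
    come from trusted configuration, never from end-user input.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    clauses = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
    ]
    if ignore_header:
        # Skip the first line of each file, assumed here to be a header row.
        clauses.append("IGNOREHEADER 1")
    return "\n".join(clauses) + ";"
```

Calling `build_copy_statement("staging.meter_readings", "s3://example-bucket/readings/", "arn:aws:iam::123456789012:role/ExampleRedshiftLoad")` returns a statement that could be run through any Redshift SQL client, assuming the role is allowed to read the bucket.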
The flagged records were then fixed by the internal Data Stewards and other business users in the team. Once the records were corrected through Talend’s Data Stewardship app, they were brought back into the data flow in a governed manner and loaded into the end target tables. This meant no data went unused, and governance enabled the prospect to ensure that data usage and compliance were maintained.
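The flag, fix and re-inject loop described above can be sketched in a few lines. This is an illustration of the pattern only, not how Talend Data Stewardship is implemented. The rule names and record fields (`meter_id`, `reading_kwh`) are assumptions made up for the example.

```python
def split_by_quality(records, rules):
    """Split records into (valid, flagged).

    `rules` maps a rule name to a function that returns True when a record
    passes. Each flagged entry keeps the record and the rules it failed, so
    a data steward can see what needs fixing.
    """
    valid, flagged = [], []
    for record in records:
        failures = [name for name, check in rules.items() if not check(record)]
        if failures:
            flagged.append({"record": record, "failed_rules": failures})
        else:
            valid.append(record)
    return valid, flagged


# Hypothetical rules for a meter-reading feed.
EXAMPLE_RULES = {
    "meter_id_present": lambda r: bool(r.get("meter_id")),
    "reading_non_negative": lambda r: isinstance(r.get("reading_kwh"), (int, float))
    and r["reading_kwh"] >= 0,
}
```

Corrected records are simply passed through `split_by_quality` again with the same rules before loading, so a fix that is still invalid goes back to the stewards instead of reaching the target tables.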
Later in the POC, the company’s fact table was copied to an Aurora RDS DB, since Aurora can cope with the many concurrent reads that applications such as analytics and reporting tools generate. In this case, we decided to use Talend Data Preparation, pointed at both the Aurora and Redshift databases, to enable self-service for the business users. In the past, it had been difficult for the prospect to share data with business users and to keep up with their ad-hoc requests for access to data. Using Talend Data Preparation, we were able to demonstrate how IT could govern the process while business users got access to the data through a web UI, without having to go through rigid IT change processes on a daily basis. Business users were also able to tailor the datasets they downloaded using the prepared recipes feature of the Talend Data Prep tool, without IT having to carry out lots of manual transformations.
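To make the idea of a prepared recipe concrete, here is a small sketch of a recipe as an ordered, reusable list of transformation steps applied to a dataset. It mirrors the concept only and is not Talend Data Preparation’s internal format. The step helpers and column names are hypothetical.

```python
def trim(column):
    """Step: strip leading and trailing whitespace from a text column."""
    def step(row):
        row[column] = row[column].strip()
        return row
    return step


def uppercase(column):
    """Step: convert a text column to upper case."""
    def step(row):
        row[column] = row[column].upper()
        return row
    return step


def rename(old, new):
    """Step: rename a column."""
    def step(row):
        row[new] = row.pop(old)
        return row
    return step


def apply_recipe(rows, recipe):
    """Apply each step of the recipe, in order, to a list of row dicts.

    Every step receives a copy of the row, so the input rows are left
    untouched and the same recipe can be replayed on a fresh extract.
    """
    for step in recipe:
        rows = [step(dict(row)) for row in rows]
    return rows
```

A business user’s recipe might then be `[trim("region"), uppercase("region"), rename("kwh", "reading_kwh")]`, saved once and reapplied whenever the underlying data is refreshed.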
Lastly, the company wanted to see how they could make use of a big data framework on AWS EMR in the near future. Talend Big Data provided a visual tool to design the prospect’s Data Warehouse (DWH) processes in the Apache Spark framework and write out to Amazon Redshift tables, just like the ELT/ETL designs. Some of the ETL jobs developed in Talend were also switched to the Apache Spark framework in a single click. This conversion is a truly rich feature that speeds up development when moving from standard ETL to the Spark or MapReduce framework. The Talend Data Preparation tool also works in the Spark framework, so it supports self-service and future-proofing initiatives as well.
Here are a few key observations from this experience to conclude. Services like Amazon Redshift are well suited to transforming large volumes of data because Redshift is a columnar database. To be more specific, on certain occasions I had dimension look-up tables approaching 1 billion records, and the end results were computed in a couple of minutes. This short cycle for computing the dimension and fact tables means companies can run their DWH processes at a higher frequency and in a shorter time window, allowing other business processes to run in between rather than dedicating a single overnight cycle to the DWH. Companies have also experienced huge cost savings, because cloud-based computing platforms let them process large volumes on demand without having to invest in long-term on-premise hardware. Talend tools all feature components that can start up EC2 machines on demand when they are needed to process data, then shut them down once processing is complete to save costs. There are also components to add more nodes to the Redshift cluster when needed and to downsize on demand, as part of a data flow process rather than a long, painful procurement process.
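The start, process and shut down pattern can be sketched outside of Talend as well. The example below is a minimal illustration, not Talend’s component code. It assumes the caller passes in clients that behave like boto3’s EC2 and Redshift clients (for example `boto3.client("ec2")` and `boto3.client("redshift")`), and the function names are my own.

```python
from contextlib import contextmanager


@contextmanager
def running_instances(ec2_client, instance_ids):
    """Start EC2 instances for the duration of a block, then stop them."""
    ec2_client.start_instances(InstanceIds=instance_ids)
    # Block until AWS reports the instances as running.
    ec2_client.get_waiter("instance_running").wait(InstanceIds=instance_ids)
    try:
        yield instance_ids
    finally:
        # Runs even if the processing step raises, so machines are not
        # left switched on (and billed) after a failed job.
        ec2_client.stop_instances(InstanceIds=instance_ids)


def resize_cluster(redshift_client, cluster_id, node_count):
    """Request a resize of a Redshift cluster to `node_count` nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return redshift_client.resize_cluster(
        ClusterIdentifier=cluster_id,
        NumberOfNodes=node_count,
    )
```

A job would wrap its processing step in `with running_instances(ec2, ["i-0123456789abcdef0"]):` (a placeholder instance ID), call `resize_cluster` before a heavy load, and call it again afterwards to scale back down. Running this against real AWS requires credentials and suitable permissions.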
Where a cloud-first strategy was once a burden to achieve, most organizations today can adopt and expand in the cloud easily. This is now my third project in the last 10 months where I have helped prospects move an on-premise DWH into the cloud. As companies look to manage more of their data in the cloud, tools to migrate that data and ultimately govern its use will become critical to achieving a successful deployment.