Diving Into Cloud Data Warehousing and Big Data with Microsoft Azure
Talend has been committed to open source technologies since our earliest days. Over the years, we have continued our track record of innovation in open source by embracing the latest Big Data technologies, such as Spark and Kafka. Now, with our redoubled commitment to cloud, it’s fitting that we talk about our native integration with several Microsoft Azure technologies.
A little over three years ago, Microsoft’s newly appointed CEO, Satya Nadella, committed the company to a cloud-first future. More recently, Microsoft has also embraced open source technologies and, in the Big Data space, has made numerous advances with Azure HDInsight, its own Hadoop distribution. A quick glance at the Azure HDInsight documentation shows that Microsoft has added support for several open source Big Data components in less than a year.
Besides making advances in Big Data, Microsoft also has a strong cloud data warehousing strategy, with Azure SQL Data Warehouse reaching general availability last July. With Microsoft’s large install base of enterprise customers eager to move from its on-premises offerings to Azure, it was a no-brainer for Talend to develop native connectors for services such as Azure HDInsight, Azure SQL Data Warehouse, and Azure Blob Storage. Furthermore, our technical alliances team, led by our Director of Technical Alliances, Ed Ost, has developed numerous reference architectures involving these Azure services.
Creating a Basic Data Warehouse from Transactional Systems
It’s a well-known database principle that OLTP systems should be kept separate from data warehousing. Organizations are generating tremendous amounts of transactional data, driven by the explosion of purchases made through mobile and online channels and by the increased processing demands of back-office systems. Data from these systems can be used for historical trend analysis, such as understanding month-over-month revenue by product and location, the impact of promotions, and sales by channel.
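To make the month-over-month idea concrete, here is a minimal sketch in plain Python. The sample rows, the field layout, and the function names (`monthly_revenue`, `month_over_month`) are all hypothetical and chosen only for illustration; in practice this kind of query would typically run against the data warehouse rather than in application code.

```python
from collections import defaultdict

# Hypothetical transactional rows: (order date as ISO string, product, revenue).
transactions = [
    ("2017-01-05", "widget", 120.0),
    ("2017-01-20", "widget", 80.0),
    ("2017-02-11", "widget", 250.0),
    ("2017-03-02", "widget", 200.0),
]


def monthly_revenue(rows):
    """Sum revenue per calendar month, keyed by 'YYYY-MM' in chronological order."""
    totals = defaultdict(float)
    for order_date, _product, revenue in rows:
        totals[order_date[:7]] += revenue
    return dict(sorted(totals.items()))


def month_over_month(totals):
    """Percentage change of each month relative to the previous month in the data."""
    months = list(totals)
    changes = {}
    for prev, curr in zip(months, months[1:]):
        if totals[prev] == 0:
            changes[curr] = None  # change is undefined when the prior month is zero
        else:
            changes[curr] = (totals[curr] - totals[prev]) * 100.0 / totals[prev]
    return changes


totals = monthly_revenue(transactions)
mom = month_over_month(totals)

for month, change in mom.items():
    print(month, change)
```

The same grouping could be extended with location or channel as part of the key to cover the other analyses mentioned above.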
Talend data integration jobs (see Figure 1) can be used to extract data from OLTP systems and load it directly into Azure Blob Storage “landing zones”. Talend supports both traditional ETL and ELT styles of integration. Alternatively, transactional data can be pulled in over SFTP or other protocols using Talend ingest jobs. Staging transactional data in landing zones is especially important because it allows the data to be cleansed, standardized, and transformed before it reaches the warehouse. Additional transformations can then be used to create data marts or star schemas within Azure SQL Data Warehouse, and any BI tool of choice can be used to report against these data marts. Companies looking to migrate their existing on-premises data warehouses to Azure SQL Data Warehouse can use this reference architecture as a starting point for their data integration projects.
Figure 1 Talend Azure SQL Data Warehouse Deployment Architecture
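The cleanse, standardize, and transform flow described above can be sketched in a few lines of plain Python. This is not Talend job code and does not use any Azure API; the raw rows, the column names, and the helpers `cleanse` and `build_star_schema` are assumptions made purely to illustrate how landing-zone data might be turned into dimension and fact tables.

```python
# Hypothetical raw rows as they might arrive in a landing zone.
raw_rows = [
    {"order_id": "1001", "product": " Widget ", "channel": "MOBILE", "amount": "19.99"},
    {"order_id": "1002", "product": "widget", "channel": "online", "amount": "5.00"},
    {"order_id": "1003", "product": "Gadget", "channel": "Online ", "amount": "n/a"},
]


def cleanse(row):
    """Standardize text fields and parse the amount; return None for unusable rows."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None  # reject rows whose amount cannot be parsed
    return {
        "order_id": int(row["order_id"]),
        "product": row["product"].strip().lower(),
        "channel": row["channel"].strip().lower(),
        "amount": amount,
    }


def build_star_schema(rows):
    """Split cleansed rows into two dimension tables and one fact table."""
    dim_product, dim_channel, fact_sales = {}, {}, []
    for row in rows:
        # Assign a surrogate key the first time each dimension value is seen.
        product_key = dim_product.setdefault(row["product"], len(dim_product) + 1)
        channel_key = dim_channel.setdefault(row["channel"], len(dim_channel) + 1)
        fact_sales.append(
            {
                "order_id": row["order_id"],
                "product_key": product_key,
                "channel_key": channel_key,
                "amount": row["amount"],
            }
        )
    return dim_product, dim_channel, fact_sales


clean_rows = [c for c in map(cleanse, raw_rows) if c is not None]
dim_product, dim_channel, fact_sales = build_star_schema(clean_rows)

print(dim_product)
print(dim_channel)
print(fact_sales)
```

In this sketch, rejected rows are simply dropped; a real pipeline would more likely route them to a reject area in the landing zone for review.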
ETL Offload using Talend, Azure DW, and HDInsight
Now imagine that the organizations in the example above wanted not only to analyze historical data but also to use it in highly interactive applications. Doing so would require analyzing customer interactions across a variety of channels, with the end goal of delivering targeted campaigns, optimizing product mixes, or viewing the real-time efficiency of the supply chain. The datasets, data models, and data structures involved in this effort would be large or complex enough that traditional ETL processes would not suffice. This is where Azure HDInsight can be used in conjunction with Talend. Extracting the data into an Azure Blob Storage landing zone works exactly as in the first use case. However, when it comes to performing some of the transformations, HDInsight with Spark can be used to create the data marts or star schemas in Azure SQL Data Warehouse (Figure 2). This approach is especially advisable for organizations committed to an enterprise architecture with Big Data at its core. Talend supports several Big Data components, including Spark, Hive, Pig, and HBase.
Figure 2 Azure SQL Data Warehouse with ETL Offload using Azure HDInsight
Talend has been at the forefront of the integration space when it comes to enterprise customers using Big Data, and was recently named a leader in the Forrester Wave for Big Data Fabric. Microsoft has numerous enterprise customers moving from on-premises database technologies to the cloud. The opportunity for both companies to grow our current partnership on several fronts, from go-to-market to joint product development to a bilateral sales strategy, is tremendous, and you can expect more integration with Azure technologies in the coming months.