In the space of Big Data, a common pattern found is offloading a traditional data warehouse into a Hadoop environment. Whether it be for primary use or to only store “cold” data, Talend makes it painless to offload.
Many organizations trying to optimize their data architecture have leveraged Hadoop for their cold data or to maintain an archive. With the native code generation for Hadoop, Talend can make this process easy.
Talend already provides out of the box Connectors to support this paradigm using SQOOP; here we are going to focus on how to make the same using Apache Spark.
Apache Spark is a fast and general engine for large-scale data processing. This engine is available in most of the latest Hadoop distribution version (Cloudera, Hortonworks, MapR, AWS EMR, etc…). Built on a massively parallel processing (MPP) architecture, it allows you to massively parallelize a data flow to handle any enterprise workload.
The fastest and most known solution today to bring data from your Databases into Hadoop is to leverage and use SQOOP (Sqoop is leveraging underneath a MapReduce process to perform the offload from RDBMS to Hadoop). Today I wanted to introduce you to something which will perform the same purpose of SQOOP but using SPARK as framework/ending.
In this blog post, I’m going to address first how to use Spark to move 1 table from Oracle or MySQL into Hadoop. Then once we have one working job to do this task; how we can turn this job to be generic to be control by a list of Tables to move from your databases server to Hadoop.
For simplicity, we will key in on the following two scenarios:
- How to move a Table into HDFS from a Spark job.
- How to automate and turn the job above into a Metadata-driven ingestion framework to work on a list of tables.
Moving a Table into HDFS from a Talend Spark Job
In this scenario, we created a very generic job that extract from a Database table and move the data into HDFS using Apache Spark and a generic Query statement such as:
"SELECT concat_ws('"+context.FIELD_SEPARATOR+"', "+context.column_list+" ) as my_data FROM my_table". ) as my_data FROM my_table".
context.FIELD_SEPARATOR = is a context variable at the job level set to ‘,’ or ‘;’ or ‘|’ or others context.column_list = is a context variable which is the concat of the FIELD required to be extracted (for example: field1, field2, field3, etc…)
The Offload piece will execute the query statement natively on Hadoop using Spark. The generated code is deployed directly through YARN.
Automating and turning the Job above into a Metadata driven ingestion framework to work on a list of tables
The offload-preparation process starts at the database. Next, the table list is pulled and contextualized, along with a list of the columns in the table (preparation of the variables to be sent to Offload job). Once this has been completed, a simple call to Offload Job is made through the iteration over tables to offload to Hadoop. The Offload process is the Job we described in the section “How to move a Table into HDFS from a Talend Spark job” above.