
How to Migrate Your Data From On-premise to the Cloud: Amazon S3

  • Eric Higaki
    Eric Higaki joined Talend in 2017 as an Associate Evangelist. In this role, Eric works on the community team to provide technical support and brand awareness. Prior to Talend, Eric worked on the Integration team at the largest protein producing company in the world, while also studying at Colorado State University. Go Rams!
  • March 15, 2018

Migrating to The Cloud

2018 is the year of the cloud, and as more and more companies move to cloud technologies, it is important to understand how your business can best utilize the cloud. One of the biggest issues enterprises face today is moving data from their on-premise databases to their cloud storage. This can be a long and tedious process if you don’t have the correct tools. Luckily, Talend is here to help!

At Talend, I needed to take our on-premise database, MySQL, and migrate it to our cloud storage, Amazon S3. Rather than deal with the complexities of Apache Sqoop, I decided to create a job within Talend that would run whenever we needed to migrate new data to the cloud. This approach saves me precious time that I can spend analyzing my newly migrated data. In this blog, I will review how I built that job. Without further ado, let’s jump right in!

Creating a Connection

As with any Talend job, the first thing we want to do is create the connections. I have a MySQL database, so I am going to use the tMysqlConnection component. I also need to create a connection to my S3 cloud storage using tS3Connection. Because connecting to both MySQL and S3 is the first step every time this job runs, we also need to add a tPrejob in front of both components.

Remember, Talend is a code generation tool; by using tPrejob, I can control what always runs first, ensuring I connect to my databases before anything else. After I configure both connection components, I can connect tPrejob, tMysqlConnection, and tS3Connection together as in the screenshot shown below.

Getting your Tables and Setting the Dynamic Schema

Now that I am connected to both of my storage platforms, I can start my cloud migration process from MySQL to Amazon S3. To start, I need to get a list of all the tables I want to move from the database. Using tMysqlTableList, I can specify which tables I want to list through the “WHERE clause”. In this case, I only want to pull from the customer tables.
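To illustrate, a table-name filter along these lines in the tMysqlTableList settings would restrict the listing to tables whose names start with “customer”. The exact condition below is my own placeholder, not necessarily the filter from the original job; it depends on how your tables are named:

    TABLE_NAME LIKE 'customer%'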

Now that I have the list of all the tables I want to transfer, my next step is to get a list of the columns within each of those tables, which I can do with the tMysqlColumnList component.

Using “tMysql” global variables is a fantastic way to pull values from components. These global variables expose data from the “tMysql” components for other components to use. In this case, ((String)globalMap.get("tMysqlTableList_1_CURRENT_TABLE")) tells the component to pull columns from the table currently being listed by the tMysqlTableList component. Talend makes it easy to retrieve global variables without having to memorize them: all I have to do is type “tMysql” and press Ctrl + Space, and all the “tMysql” global variables appear in a list where I can choose the one I want.

Next, I need to add a tFixedFlowInput to generate the “tableName” and “columnName” columns. The values will only appear within the tFixedFlowInput component if I configure the schema for these columns first. Once the schema is set, I can set the value for each column: ((String)globalMap.get("tMysqlTableList_1_CURRENT_TABLE")) for “tableName” and ((String)globalMap.get("tMysqlColumnList_1_COLUMN_NAME")) for “columnName”.

Adding a tLogRow after the fixed flow will allow me to see the names of the tables and columns that my job is pulling from by displaying the information on the run console. Below is an updated screenshot of my job thus far.

Now it’s time to set the dynamic schema that the data will use when being pulled from my on-premise database. As the name suggests, a dynamic schema is a schema type that changes depending on the columns being read at the time, making it essential to this job.

To set a dynamic schema, I will be using a fancy component called tSetDynamicSchema. Other than having a great name, tSetDynamicSchema allows me to set the schema dynamically based on the value of “columnName”. Now that the schema is dynamic, I don’t need to move each table individually; I can move multiple, different tables with ease.

Reading the Data and Writing the Tables

With my dynamic schema set, I’m ready to start reading the table data using the dynamic type that was created from the tSetDynamicSchema component. Because I am reading data from my on-premise database, I need to use an input component that will read from a MySQL database, tMysqlInput. First, I need to edit the schema of the tMysqlInput component to use the dynamic DB type. I named the column for this schema “dynamic_row” with type “Dynamic” (of course) and DB Type, “VARCHAR”.

After the schema is set, I can move on to configuring the tMysqlInput component, making sure the data is being pulled from the current table being listed by tMysqlTableList.
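For reference, a common way to wire this up (sketched here, not copied from the original job) is to build the tMysqlInput query from the same global variable, so each iteration reads whichever table tMysqlTableList is currently on:

    "SELECT * FROM " + ((String)globalMap.get("tMysqlTableList_1_CURRENT_TABLE"))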

The data is now being read from the current table listed; however, it still needs to be written out to a CSV file. To accomplish this, I am going to use tFileOutputDelimited. I need to make sure the “File Name” follows the correct file path.
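As an example, the “File Name” can be assembled from the current table name so that each table lands in its own CSV file. The folder below is a placeholder, not the path from the original job; use whatever staging directory suits your environment:

    "C:/Talend/output/" + ((String)globalMap.get("tMysqlTableList_1_CURRENT_TABLE")) + ".csv"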

Phew! Don’t worry folks, we’re almost done. This is an updated look at the job that I have created up to this point.

Putting Files on Amazon S3

So far, this job reads all the customer tables and writes them to CSV files in a specified folder. Now that I can pull data from tables located in my on-premise database, I need to finish the job by moving these files to Amazon S3.

tFileList will allow me to get a list of all the files in a specified folder, or in this case, it will allow me to get a list of all the tables that I have pulled from my on-premise database. All I need to do is specify the directory where the files are located.

Once I get a list of all the files, I can start to move them into one of my S3 buckets. The tS3Put component will allow me to do this. All I need to do is specify the “Bucket”, “Key”, and “File”: the “Key” is the name the file will have within S3, and the “File” is the local file that is being uploaded to S3.
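Because tS3Put is iterating over the files returned by tFileList, both the “Key” and the “File” can be driven by tFileList’s global variables. A sketch of the settings is below; the bucket name is a placeholder, not the one from the original job:

    Bucket: "my-migration-bucket"
    Key:    ((String)globalMap.get("tFileList_1_CURRENT_FILE"))
    File:   ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))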

Now that the configuration for tFileList and tS3Put is complete, all that’s left to do is put the finishing touches on the cloud migration job. Remember those connections that I opened at the very beginning of the job? With the help of tPostjob, tMysqlClose, and tS3Close, I can close the connections I opened every single time the job is run. Just like before, I want to be able to control what happens after the main part of the job has run, hence the tPostjob component. Easy-peasy! The finished job should resemble something like this.

Running the Job

If the job runs and everything is in tip-top shape, the run console should match the screenshot below. As you can see, the console shows the table that is being read and written, as well as the corresponding column names.

Now that this job is complete, I can move any tables I want from my on-premise database to my cloud storage without having to build a separate job for each table or mess with pesky hand coding. It feels good to be cloud ready.

Watch this Demo LIVE

Want to catch this demo live? Join us on Thursday, March 22nd on Talend's Facebook page for #TalendDevLive, where I'll be building out this job step-by-step and taking your questions along the way. Don't miss it!
