Before diving into the world of integration, machine learning, etc., I would like to share a scenario many of you might have experienced if you live in a mega city. I lived in the London suburbs for almost 2 years (and it’s a city quite close to my heart), so let me use London as this story’s background. When I moved to London, one question that came to my mind was whether I should buy a car or not. The public transport system in London is dense and amazing (oh, I just love the London Underground, and I miss it in Toronto). Occasionally I really did need a car, especially when traveling with heavy bags to the airport. The question was: did I need to spend a considerable amount of money on a down payment, insurance, maintenance, queueing at gas stations, vacuuming the car, finding a parking space…for the few times I really needed one? I decided against it; it was easier and cheaper to call an Uber or the famous London black cab and get the job done, rather than sorting out everything on that long list.
The choice of a cab vs your own car in Big Data processing
Coming back to the world of Big Data and machine learning, the question remains the same! Do we really need an array of heavy-duty servers running 24/7, managed by an army of engineers, for big data and machine learning processing, or can we do something different? This thought led to the concept of serverless processing in the Big Data arena, where you save the compute costs associated with idle clusters. The technology also automates cluster upscaling, downscaling, and rebalancing based on factors like workload context, SLA, and the priority of each job running on the cluster.
Talend is actively collaborating with industry leaders in big data and machine learning serverless technology. In this blog, I am going to tell the story of the friendship between Talend and Qubole.
Tell me more about Qubole
Readers who have yet to enter this space might not have heard of Qubole. Qubole is one of the market leaders in serverless big data technology:
“Qubole provides you with the flexibility to access, configure, and monitor your big data clusters in the cloud of your choice. Data users get self-service access to data using their interface of choice. Users can query the data through the web-based console in the programming language of choice, build integrated products using the REST API, use the SDK to build applications with Qubole, and connect to third-party BI tools through ODBC/JDBC connectors.”
Talend and Qubole are a good example of the phrase “match made in heaven”: Talend helps customers build complex data jobs and pipelines through its signature graphical user interface, while Qubole handles the infrastructure side seamlessly.
(Picture courtesy: Qubole)
How can you perform machine learning tasks using Talend and Qubole?
Many of you might be thinking that you have heard these stories about seamless data integration numerous times in your IT career. The million-dollar question in your mind might be whether machine learning data processing using Talend and Qubole is really seamless. The answer is an emphatic yes.
Instead of explaining the theory, let us create a quick Talend Job and walk through the steps involved in the flow. The prerequisites and steps for setting up a Qubole account on Amazon Web Services (AWS) to interact with Talend 7.1 are described in detail in the Qubole documentation.
Our story starts from the point where a cluster capable of running Spark jobs has been created in Qubole. Examples of both a Talend Standard job and a big data Batch job using Spark are also available in the Qubole documentation, but our interest is in how easily we can create a machine learning Job using the two tools.
Once the cluster is created for Spark processing, its status can be verified from the Qubole dashboard as shown below.
In this blog, I am going to use the simple Zoo dataset of animal classifications provided by UCI (University of California, Irvine) for prediction.
UCI Machine Learning: https://archive.ics.uci.edu/ml/datasets/Zoo
Source Information — Creator: Richard Forsyth — Donor: Richard S. Forsyth, 8 Grosvenor Avenue, Mapperley Park, Nottingham, NG3 5DX, 0602-621676 — Date: 5/15/1990
For our example Job, the dataset was split into two files. The main dataset, zoo_training.csv, will be used to train the model, and the second dataset, zoo_predict.csv, will be used as the input for prediction. A third file, class.csv, acts as a lookup to resolve each animal-category code to its description. All three files are loaded to S3 as shown below.
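To make the split concrete, here is a minimal, standard-library Python sketch of how such a training/prediction split could be produced. The column names, the abbreviated attribute list, the sample rows, and the 80/20 ratio are all assumptions for illustration; the actual UCI Zoo file has 16 attributes and 101 rows.

```python
import csv
import io
import random

# Hypothetical sample in the spirit of the UCI Zoo layout: animal name,
# a few boolean/numeric attributes (abbreviated here), and a class code
# in the last column. The real dataset has 16 attributes.
zoo_csv = """animal,hair,feathers,eggs,milk,legs,type
aardvark,1,0,0,1,4,1
chicken,0,1,1,0,2,2
frog,0,0,1,0,4,5
herring,0,0,1,0,0,4
wasp,1,0,1,0,6,6
"""

def split_dataset(text, train_ratio=0.8, seed=42):
    """Shuffle the data rows and split them into a training set and a
    prediction set, keeping the header row on both halves."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_ratio)
    return [header] + data[:cut], [header] + data[cut:]

train, predict = split_dataset(zoo_csv)
print(len(train) - 1, len(predict) - 1)  # prints: 4 1
```

In the blog's flow the two halves would then be written out as zoo_training.csv and zoo_predict.csv and uploaded to S3.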
The next step is to create a Talend Job to process the files and to get the prediction output. In three easy stages, we will be able to create the Job as shown below.
- The first stage captures the configuration of the S3 bucket from which we will read the data.
- The second stage trains the machine learning model and builds the decision tree.
- The third stage reads the input data, calculates the prediction, looks up the description for each predicted code, and prints the output to the console.
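The lookup step in the third stage can be sketched in plain Python using only the standard library. This is not the Talend/Spark implementation itself, just an illustration of the join it performs; the class.csv column names, the sample class codes, and the prediction tuples are assumptions made for this example.

```python
import csv
import io

# Hypothetical class.csv lookup in the spirit of the UCI Zoo class file:
# each row maps a class code to a human-readable description.
class_csv = """class_number,class_type
1,Mammal
2,Bird
4,Fish
5,Amphibian
"""

# Hypothetical prediction rows as the model stage might emit them:
# (animal name, predicted class code).
predictions = [("aardvark", "1"), ("chicken", "2"), ("herring", "4")]

# Build the lookup table once, then resolve each predicted code to its
# description, mirroring the lookup the Talend job performs.
lookup = {row["class_number"]: row["class_type"]
          for row in csv.DictReader(io.StringIO(class_csv))}

resolved = [(animal, lookup.get(code, "unknown"))
            for animal, code in predictions]

for animal, label in resolved:
    print(animal, label)
```

Unknown codes fall back to "unknown" rather than failing, which is one reasonable design choice for a lookup stage fed by a model's output.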
For those who are curious to try the sample Job shown in the blog, please download the attached Talend Job and sample input files (Click here to download).
The configuration of Qubole in the Talend job is quite easy, as shown below.
The input and output values for the prediction stage are shown below.
- Input file for Prediction
- Output from Talend job using Qubole
Everything went like a breeze, right? Now imagine the effort you would have to spend if you had to create a big data cluster of your own and hand-code the same machine learning logic. I am sure the long list of tasks for owning a car in London comes to mind, and how calling a cab can make your life easy.
Is this the end of the story?
Absolutely not! We have just seen the fairy-tale ending of the machine learning flow development story using Talend and Qubole. But like any Marvel movie, let me leave some post-credit scenes to keep up the interest until I complete the next blog post in this series. What we have done is just create a sample Job. We still need answers to a lot of other questions, such as how we can operationalize the Job through methods other than Talend Cloud. Did someone hear the word “Docker”? Is it possible to build a continuous integration flow to move Jobs seamlessly? We will meet again soon in the next part of the blog to find the answers.