Using the Spark Machine Learning Library in Talend Components
Talend provides a family of Machine Learning components which are available in the Palette of the Talend Studio if you have subscribed to any Talend Platform product with Big Data or Talend Data Fabric.
These components provide a range of tools and techniques to help you apply Machine Learning to your use cases. These out-of-the-box components support Machine Learning techniques such as Classification, Clustering, Recommendation and Regression. They leverage Spark for scale and performance (i.e. for working with large data sets) and shorten the time it takes to gain insight and value. These components focus on business outcomes, not development tasks, so there is no need to learn complex skills such as R, Python or Java.
However, if you do wish to take advantage of some of the complex ML resources and algorithms available within Apache Spark, it is possible to do so.
As a recap, Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation. Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. Spark’s Machine Learning library, known as MLlib, is a powerful, fast, distributed machine learning framework that sits on top of Spark Core. Many common Machine Learning and statistical algorithms have been implemented and are shipped with MLlib. These can be utilised inside some of Talend’s Machine Learning components.
One important component is the tModelEncoder component. This component performs operations which transform data into the format expected by the Talend model training components such as tLogisticRegressionModel or tRandomForestModel. It receives data from its preceding components and applies a wide range of processing algorithms to transform given columns of this data. It then sends the result to the model training component that follows, which trains and creates a predictive model. Depending on the Talend solution you are using, this component can be used in Spark Batch mode, Spark Streaming mode or both.
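The encode-then-train flow described above can be sketched in plain Python. This is only a conceptual illustration under stated assumptions, not Talend or Spark code: the function names (encode_rows, train_logistic_regression, predict_probability) and the sample columns are hypothetical, labels are indexed alphabetically for simplicity, and a small gradient-descent loop stands in for whatever a training component such as tLogisticRegressionModel does internally.

```python
import math


def encode_rows(rows, feature_columns, label_column):
    """Turn dict rows into (feature_vector, numeric_label) pairs."""
    # Assumption: labels are indexed in alphabetical order for this sketch.
    labels = sorted({row[label_column] for row in rows})
    label_index = {label: float(i) for i, label in enumerate(labels)}
    encoded = [
        ([float(row[column]) for column in feature_columns],
         label_index[row[label_column]])
        for row in rows
    ]
    return encoded, label_index


def predict_probability(weights, bias, features):
    """Return the modelled probability that the label index is 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(encoded, epochs=2000, learning_rate=0.01):
    """Fit a binary logistic regression with stochastic gradient descent."""
    weights = [0.0] * len(encoded[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in encoded:
            error = predict_probability(weights, bias, features) - label
            bias -= learning_rate * error
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias


# Hypothetical input rows, standing in for the data an encoder receives.
rows = [
    {"visits": 1, "spend": 0.2, "segment": "standard"},
    {"visits": 2, "spend": 0.5, "segment": "standard"},
    {"visits": 8, "spend": 4.0, "segment": "gold"},
    {"visits": 9, "spend": 5.0, "segment": "gold"},
]

# Step 1: encode the raw columns. Step 2: hand the result to the trainer.
encoded, label_index = encode_rows(rows, ["visits", "spend"], "segment")
weights, bias = train_logistic_regression(encoded)
```

The point of the split is that the trainer only ever sees numeric vectors and numeric labels; everything specific to the source columns is handled in the encoding step.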
The specific algorithms available in the Transformation column vary depending on the type of the input schema columns that make up the data to be processed. Here is where you can utilise the transformations available in the Spark MLlib Machine Learning library. There are a large number of transformations which you can use, some in batch mode, some in streaming mode and some in both.
There are text processing transformations which perform a number of functions such as the hashing and unhashing of data. There are algorithms to identify similarities in text and extract frequent terms, and there are algorithms to bucket data. There are mathematical algorithms to work on vectors and time series data, and algorithms that work on image data. There are algorithms to expand or quantise data. You can work with regular expressions or tokenise data. There is an algorithm to transform SQL statements, an algorithm to do statistical analysis, one to index strings and one to assemble vectors. Finally, there is RFormula, a very useful algorithm which allows you to define a formula which represents the relationship between variables in your data and then model its output. Overall, there are plenty of algorithms in the MLlib library to suit most needs and use cases, and new ones are added all the time.
Using these algorithms from within a Talend component is easy. Illustrated below is a screenshot showing how you can select an algorithm from within the ‘basic settings’ configuration section of a tModelEncoder component, in this case to build a Random Forest model.
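To make a few of the transformations mentioned above more concrete, here is a minimal plain-Python sketch of what tokenising, term-frequency hashing, bucketing and vector assembly do to a single row. It is an illustration of the ideas only, not Spark's implementation: the function names and sample data are hypothetical, and zlib.crc32 is used as a simple deterministic hash where Spark MLlib's HashingTF uses a different hash function.

```python
import zlib


def tokenize(text):
    """Lower-case a string and split it on whitespace."""
    return text.lower().split()


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector (the hashing trick)."""
    vector = [0.0] * num_features
    for token in tokens:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i + 1]) holding value."""
    for i in range(len(splits) - 1):
        if splits[i] <= value < splits[i + 1]:
            return float(i)
    raise ValueError(f"{value} is outside the configured splits")


def assemble(*parts):
    """Concatenate numbers and vectors into one flat feature vector."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(float(x) for x in part)
        else:
            features.append(float(part))
    return features


# Hypothetical input row with one text column and one numeric column.
row = {"comment": "Great service great price", "age": 42}

tokens = tokenize(row["comment"])
tf = hashing_tf(tokens, num_features=8)
age_bucket = bucketize(row["age"], [0, 18, 35, 65, float("inf")])
features = assemble(tf, age_bucket)
```

The result is a single numeric vector per row, which is the shape of input that model training components generally expect.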
Using these algorithms in the tModelEncoder component allows you to build a wide range of models for many use cases, whether that is modelling who your gold customers are, detecting whether fraud is occurring in your organisation, assessing the suitability of drug treatments or predicting whether some event will happen. All of these use cases, and many more, can be modelled using the different modelling components and model types which are available. In the diagram below, we can see a Talend job which has been built to predict outcomes by using a model created in a previous job. In this job we take data from HDFS, use the model to make predictions with the Talend tPredict component, and then output the results to a file.
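The shape of such a prediction job (read rows, apply a previously built model, write the results) can be sketched in plain Python as follows. Everything here is an assumption made for illustration: the JSON model format, the column names and the function names are hypothetical, in-memory strings stand in for the HDFS input and the output file, and a simple logistic model stands in for whatever model tPredict would actually load.

```python
import csv
import io
import json
import math

# Hypothetical stand-in for a model produced by an earlier training job.
MODEL_JSON = json.dumps({
    "feature_columns": ["visits", "spend"],
    "weights": [0.8, 0.5],
    "bias": -4.0,
    "labels": ["standard", "gold"],  # index 1 is the positive class
})

# Hypothetical stand-in for the input file that the job reads.
INPUT_CSV = "customer_id,visits,spend\nC1,1,0.2\nC2,9,5.0\n"


def predict_label(model, row):
    """Score one row with a logistic model and return the predicted label."""
    features = [float(row[column]) for column in model["feature_columns"]]
    z = model["bias"] + sum(w * x for w, x in zip(model["weights"], features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return model["labels"][1] if probability >= 0.5 else model["labels"][0]


def run_prediction_job(model, input_file, output_file):
    """Read rows, append a prediction column and write the result as CSV."""
    reader = csv.DictReader(input_file)
    fieldnames = list(reader.fieldnames) + ["prediction"]
    writer = csv.DictWriter(output_file, fieldnames=fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["prediction"] = predict_label(model, row)
        writer.writerow(row)


model = json.loads(MODEL_JSON)
output = io.StringIO()
run_prediction_job(model, io.StringIO(INPUT_CSV), output)
print(output.getvalue())
```

Note that the prediction step does no training of its own: it only needs the stored model parameters and input rows that carry the same feature columns the model was built on.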
One important point to note is that Talend’s Machine Learning components are ‘out of the box’ components, and they are configurable. You do not need to be an R programmer or have skills in Python. You should be able to design, build and configure Machine Learning jobs without having to write lots of complex code.
So, as we have shown, Talend is leveraging the power of Spark, Big Data and Machine Learning to allow our customers to do things which just a few years ago we thought were not possible. In many industries and verticals, Talend can empower and enable you to quickly and simply build Machine Learning jobs.
More information on leveraging Spark and Talend’s Machine Learning components can be found on the Talend website, or you can speak to your Talend account executive.