The main purpose of machine learning is to perform learning tasks on unseen data sets, having previously built up experience using training and testing data. Often those tasks include looking for patterns and relationships between variables within the data. In supervised learning, one of the three main types of machine learning, we have some idea of the types of input and output that we are looking for, and in order to build a supervised model quickly and efficiently, it helps to understand some of the relationships between the variables within the data.
As an example, imagine we are an automotive trader. We want to build a machine learning model that can work out the value of second-hand cars. We know from experience that the value depends on the model of car, the condition, the mileage, the service history and so on. Even the color of a car can affect the resale price. What we don’t know is the exact form of the relationships between these variables, and this is where machine learning comes in. We can use a training data set of, say, tens of thousands of sale records to train our model. If it includes all the variables that affect the value of second-hand cars, then we can build a model around those variables and let it learn the relationships from our training data. Once we are happy that the model is performing correctly (we can check its accuracy using test data), we let our model run.
Now imagine a situation where we don’t know, or are not certain about, the relationships between our variables. What can we do? We need some tools that can help us understand those relationships, which could then help us build a model.
This is what we would call unsupervised machine learning. That is, we don’t really understand how elements within the data are related and we can’t classify or categorize those data ourselves, so we need some way to do this.
The most common type of unsupervised learning is called Cluster Analysis, or Clustering. Clustering is the task of grouping a set of objects (whatever they are) in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups (or clusters). It is one of the main tasks of exploratory data mining and a common technique for statistical data analysis. It is used in many fields, and there are lots of different algorithms that can be used to perform the analysis.
The most well-known type of clustering algorithm is K-means clustering. This is a type of unsupervised learning that is used when you have unlabelled data (data without any defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided, so data points are clustered based on feature similarity. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure is a simple way to partition a given data set into a number of clusters that is fixed in advance. The diagram shows a simple illustration of how clustering works.
In this diagram we can see data which is clustered into three main groups: red, blue and green. Data points that are near the center of each cluster are referred to as ‘well-clustered data’, and those on the outside are referred to as ‘loose-clustered data’.
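To make the iterative assignment concrete, here is a minimal sketch of K-means in plain Python. It is only an illustration under stated assumptions, not how the Talend components are implemented: the sample points, the function names (`nearest`, `mean_point`, `kmeans`) and the random choice of starting centers are all made up for this example. It alternates two steps, assigning each point to its nearest center and then moving each center to the mean of its points, until the centers stop moving.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen data points.
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # An empty cluster keeps its previous centroid.
            updated.append(mean_point(members) if members else centroids[c])
        if updated == centroids:
            break  # centroids stopped moving, so we have converged
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical 2-D sample data, invented for illustration only.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.5),
]
centroids, labels = kmeans(points, k=3)

# Distance from each point to its own centroid: small values correspond
# to 'well-clustered' points, large values to 'loose-clustered' ones.
distances = [math.dist(p, centroids[label]) for p, label in zip(points, labels)]
for p, label, d in zip(points, labels, distances):
    print(p, "-> cluster", label, "distance", round(d, 2))
```

Because the starting centers are chosen at random, different seeds can give different groupings, which is why practical implementations usually run the algorithm several times or use a smarter initialization.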
In Talend we have clustering components in our set of Machine Learning components. These consist of three components as shown below.
The tKMeansModel component analyzes sets of data by applying the K-means algorithm. It takes feature vectors, usually pre-processed by the tModelEncoder component, generates a clustering model from them, and writes that model either to memory or to a file. The tPredict and tPredictCluster components then use the clustering model generated by a model training component (here, tKMeansModel) to predict which cluster a given element belongs to. These Talend components are available in all Talend Platform products with Big Data and in Talend Data Fabric, for both the Spark Batch and Spark Streaming frameworks.
As well as the clustering components shown above, there are three other sets of machine learning components available in Talend. There are Classification components, which are used for identifying which of a set of categories a new observation belongs to, on the basis of a training set of data containing observations whose category membership is already known. There are Recommendation components, which seek to predict the “rating” or “preference” that a user would give to a certain item. Finally, there are Regression components. Regression analysis is a process for estimating the relationships among variables. It includes techniques for modeling and analyzing several variables when the main focus is on the relationship between a dependent variable and one or more independent variables.
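As a rough illustration of what regression means in practice, the sketch below fits a straight line by ordinary least squares in plain Python. It is not the Talend Regression components, just the simplest case of one dependent variable and one independent variable. The mileage and price figures are invented for this example, echoing the second-hand car scenario, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: mileage in thousands of miles and sale price.
mileage = [10, 30, 50, 70, 90]
price = [19200, 16900, 15100, 12800, 11000]

slope, intercept = fit_line(mileage, price)
# Estimated price for a hypothetical car with 60,000 miles on the clock.
estimate = slope * 60 + intercept
print("slope:", round(slope, 1), "intercept:", round(intercept, 1))
print("estimated price at 60k miles:", round(estimate))
```

Here price is the dependent variable and mileage is the independent variable. A real model for car values would include many more variables, such as condition and service history.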
So, we can see there are several sets of Talend machine learning components that can help you look for patterns in data. They can help you discover hitherto unknown relationships within your data, classify your data, and build models which can predict future patterns and behavior. For more information on Talend and machine learning, refer to the following page, which includes a video introduction to machine learning that I presented: