Testing Machine Learning Algorithms with K-Fold Cross Validation

  • Norbert Krupa
    Norbert has over 10 years’ experience in the data space working in different industries and various roles; from business intelligence to database administration, consulting as well as architecting high volume, distributed systems.

In an earlier post on Applying Machine Learning to IoT Sensors, I discussed the process for classifying sensor data with a machine learning algorithm. In this post, I’ll give some background on choosing an algorithm and then on validating that choice. I’ll show how the validation technique works and how it can be built in Talend Studio without hand coding.


Given a prediction scenario involving machine learning, the first question to ask is which algorithm is appropriate. Taking the example of predicting a user’s activity based on mobile phone accelerometer data, we must be able to assign each data point to a category (resting, walking, or running). As Talend leverages Spark MLlib out-of-the-box, we evaluate some of the popular algorithms which fall under classification.

This classification exercise considers common algorithms such as Logistic Regression, Decision Tree, Random Forest, and Naïve Bayes. Logistic Regression was ruled out because, as used here, it only supports binary (two group) classification, while our data has three categories. Naïve Bayes can only represent non-negative frequency counts of features, so it was also ruled out, as accelerometer data has negative values. This could be remedied by shifting the data so that all values are non-negative (for example, by subtracting each feature’s minimum value); multiplying all values by a constant such as 100 would not help, because negative values stay negative. This leaves Decision Tree and Random Forest.
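
As a minimal sketch of making accelerometer features non-negative, the Python below shifts each column by its own minimum (scaling by a constant alone would keep negative values negative). This is an illustration under assumptions, not part of the Talend job: the function name `shift_non_negative` is hypothetical, and the three sample rows reuse aX, aY, aZ values from the console output shown later in this post.

```python
def shift_non_negative(rows):
    """Shift every feature column so that its smallest value becomes 0."""
    columns = list(zip(*rows))                 # transpose rows into columns
    minimums = [min(column) for column in columns]
    return [
        [value - low for value, low in zip(row, minimums)]
        for row in rows
    ]


# aX, aY, aZ readings (taken from the sample console output in this post)
samples = [
    [-4.1, 8.07, -16.36],
    [-2.34, 9.69, -0.33],
    [0.0, 0.01, -0.01],
]

shifted = shift_non_negative(samples)
for row in shifted:
    print(row)
```

The same offsets would have to be stored and applied to any new data before prediction, otherwise the model would see features on a different scale than it was trained on.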

At this point, we take both algorithms and measure the accuracy of each model. If we were to reuse the initial training dataset (which has been classified by hand) as test data for each model, we would see a high accuracy rate, because the model has already seen those rows. However, what would the accuracy look like if the model encountered data it has never seen before? To test the accuracy of both algorithms, we leverage a validation technique such as K-Fold.

Applying K-Fold technique

In K-Fold Cross Validation, the training dataset is split into K folds of equal size. In each of K rounds, one fold is held out as the test dataset and the remaining folds are used as the training dataset. For example, if we have a training dataset with 450 events and choose 10-Fold validation, the dataset is broken up into 10 folds of 45 events each.

With 450 events and 10-Fold Cross Validation, each round uses a test dataset of 45 events and a training dataset of 405 events. The process is repeated K times (10), each time holding out a different fold, and the resulting accuracies are averaged to produce an overall, more realistic estimate of the accuracy of the model being tested.
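
The arithmetic behind these numbers is small enough to check in a few lines of Python. The helper name `fold_sizes` is hypothetical, and the sketch assumes the row count divides evenly by K, as it does for 450 events and 10 folds.

```python
def fold_sizes(total_rows, k):
    """Return (test_size, train_size) for a single round of K-Fold."""
    if total_rows % k != 0:
        # Assumption of this sketch: no leftover rows to distribute
        raise ValueError("this sketch assumes total_rows is a multiple of k")
    test_size = total_rows // k
    return test_size, total_rows - test_size


test_size, train_size = fold_sizes(450, 10)
print(test_size, train_size)  # 45 405
```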

To help us determine which algorithm is the most accurate with our dataset, we can build out the validation technique graphically using Talend Studio, and apply it against the algorithms being tested.

Building K-Fold in Talend Studio

Leveraging the out-of-the-box machine learning algorithms, we will build a K-Fold Cross Validation job in Talend Studio and test against a Decision Tree and Random Forest. The training set used for this example can be downloaded on GitHub.

Before we can build the validation, we build a job to encode each model being tested. Each of these jobs consists of reading the training dataset, encoding the model vectors, and saving the model.

Additionally, we build a job to test each model. When testing the model, we want to do five things:

  1. Read the test dataset
  2. Apply the model
  3. Check whether the predicted label matches the hand-classified label
  4. Calculate the sum of matched labels
  5. Save the number of matched labels
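
Steps 3 to 5 come down to comparing two label columns. Here is a minimal Python sketch, assuming the predicted and hand-classified labels are already available as two lists of equal length. Steps 1 and 2 (reading the data and applying the model) are left out, `count_matches` is a hypothetical helper rather than a Talend component, and the four labels are toy values.

```python
def count_matches(predicted, actual):
    """Count the rows where the predicted label equals the known label."""
    if len(predicted) != len(actual):
        raise ValueError("both label lists must have the same length")
    return sum(1 for p, a in zip(predicted, actual) if p == a)


# Toy labels for illustration only
predicted = ["running", "running", "resting", "walking"]
actual = ["running", "walking", "resting", "walking"]

matched = count_matches(predicted, actual)
accuracy = matched / len(actual)
print(matched, accuracy)  # 3 0.75
```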

When the model is tested, each test dataset produces an accuracy, which we store and output at the end of our K-Fold validation. Once these jobs are built, we can begin work on the validation job.

The overall validation job will have three main routines:

  1. Read the full training dataset and partition into a test dataset and training dataset for each K-Fold
  2. Pass the partitioned training dataset to create the model and pass the partitioned test dataset to test the model
  3. Calculate the accuracy for each K-Fold

Partitioning datasets 

For the first piece of the validation job, we need to read the full training dataset, capture the row count, and then configure some variables which will be used in processing the validation. The most obvious variable is the number of K-Folds to be processed. This is stored as a context parameter and is requested when the validation is run:

Based on the number of K-Folds, we will be able to calculate the following:

  1. Row Number: used to filter the rows when creating the test bin and training data
  2. Fold Size: the size (number of rows) of the test bin, calculated from the total number of rows in the original training data and the number of parameterized folds
  3. K Value: the current iteration of the loop, used to work out where the test bin starts and ends
  4. Bin Start: the row number where the current test dataset starts
  5. Bin End: the row number where the current test dataset ends

For development purposes, the row numbers and variables can be sent to the console. When we specify 10 for K-Folds, we should see an output such as:

|                                   loop 0                                  |
|row_number|aX    |aY    |aZ    |label  |k_value|fold_size|bin_start|bin_end|
|1         |-4.1  |8.07  |-16.36|running|0      |45       |1        |45     |
|2         |-2.34 |9.69  |-0.33 |running|0      |45       |1        |45     |
|3         |0.0   |0.01  |-0.01 |resting|0      |45       |1        |45     |
|...       |...   |...   |...   |...    |...    |...      |...      |...    |
|450       |-0.01 |-0.02 |-0.07 |resting|0      |45       |1        |45     |
|                                   loop 9                                  |
|row_number|aX    |aY    |aZ    |label  |k_value|fold_size|bin_start|bin_end|
|1         |-4.1  |8.07  |-16.36|running|9      |45       |406      |450    |
|2         |-2.34 |9.69  |-0.33 |running|9      |45       |406      |450    |
|3         |0.0   |0.01  |-0.01 |resting|9      |45       |406      |450    |
|...       |...   |...   |...   |...    |...    |...      |...      |...    |
|450       |-0.01 |-0.02 |-0.07 |resting|9      |45       |406      |450    |
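
The values in the console output above can be reproduced with a few lines of arithmetic. The Python below is a sketch, not the expressions used inside the Talend job: `bin_bounds` is a hypothetical name, `k_value` is taken to be the zero-based loop index as in the output, and row numbers are assumed to be 1-based and inclusive.

```python
def bin_bounds(total_rows, folds, k_value):
    """Return (fold_size, bin_start, bin_end) for the loop iteration k_value."""
    fold_size = total_rows // folds
    bin_start = k_value * fold_size + 1    # first row of the test bin
    bin_end = bin_start + fold_size - 1    # last row of the test bin
    return fold_size, bin_start, bin_end


for k_value in (0, 9):
    print(k_value, bin_bounds(450, 10, k_value))
# 0 (45, 1, 45)
# 9 (45, 406, 450)
```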

For each loop (starting at 0), we calculate the fold size and find the offset to create the test datasets and training datasets. Once we’ve validated the calculations, we can add a filter which separates the outputs:
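
Outside of Talend, the same filter can be sketched in Python: rows whose number falls between the bin start and bin end go to the test dataset, and all other rows go to the training dataset. The function name `split_rows` is hypothetical, and the rows here are placeholder integers rather than real sensor readings.

```python
def split_rows(rows, bin_start, bin_end):
    """Separate rows into (test, training) using 1-based, inclusive row numbers."""
    test, training = [], []
    for row_number, row in enumerate(rows, start=1):
        if bin_start <= row_number <= bin_end:
            test.append(row)
        else:
            training.append(row)
    return test, training


rows = list(range(1, 451))                  # placeholder for 450 events
test, training = split_rows(rows, 406, 450)  # bounds of loop 9
print(len(test), len(training))  # 45 405
```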

After setting up the mappings, the first main routine of the validation job looks like:

This piece of the validation job writes the test and training datasets for each fold into a directory.

Create and test models

Once the test and training datasets are prepared, we’re ready to generate the model and test it. This is done by calling the jobs built earlier for each type of model (Decision Tree or Random Forest):

The loop iterates K times, once per fold, and picks up the matching test and training datasets on each pass. This process is repeated for each model.

Calculate accuracy of the model

The last step is to aggregate the results from the validation. Again, the loop iterates K times, reading the output of each test to calculate the accuracy of each fold.
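
A minimal Python sketch of this aggregation, assuming each test run saved its number of matched labels and every fold has the same size. The name `overall_accuracy` is hypothetical, and the matched counts below are made-up values for illustration, not results from the jobs described in this post.

```python
from statistics import mean


def overall_accuracy(matched_per_fold, fold_size):
    """Return the per-fold accuracies and their average."""
    per_fold = [matched / fold_size for matched in matched_per_fold]
    return per_fold, mean(per_fold)


# Made-up matched counts for 10 folds of 45 rows each (illustration only)
matched_per_fold = [45, 44, 43, 45, 42, 44, 45, 43, 44, 45]

per_fold, overall = overall_accuracy(matched_per_fold, 45)
for k_value, accuracy in enumerate(per_fold):
    print(f"fold {k_value}: {accuracy:.3f}")
print(f"overall: {overall:.3f}")
```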

After running the validation job against a model, we receive an output of the accuracy of each K-Fold:


All the pieces

Putting it all together, the final K-Fold Cross Validation job will look like:

A model that is not being tested can simply be deactivated. The first routine, which partitions the training data, can also be deactivated after it has run once. We also leveraged context parameters to pass information between this validation job and the create and test model jobs.


Machine Learning provides the ability to learn and make predictions on different types of data. In this example, we’ve taken two classification algorithms (Decision Tree and Random Forest) and used a K-Fold Cross Validation technique to determine which algorithm would have a higher accuracy for classifying the user activity based on accelerometer sensor data.

With Talend’s graphical design environment used to build out the validation technique, choosing an appropriate algorithm was simple. Using the provided training dataset, Random Forest had a slightly higher overall accuracy than Decision Tree using 10-Fold Validation. In addition, each validation job ran in just under 3 minutes with Spark under the covers.

A few cases were not built into this validation job, such as handling odd rows (for example, rows left over when the row count does not divide evenly into folds) or performing clean-up steps after each model has been tested. As with all development, refactoring could also be done to make this more dynamic for other datasets. However, the goal was to leverage a graphical design environment and out-of-the-box machine learning algorithms to help us build and choose an appropriate model.
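
If "odd rows" is read as a row count that does not divide evenly by K, one possible way to handle it is to give the first few folds one extra row each. The Python below is a sketch of that idea only, not something the validation job does: `fold_bounds` is a hypothetical name, and the 453-row example is invented to show the leftover rows being spread out.

```python
def fold_bounds(total_rows, folds):
    """Return 1-based, inclusive (bin_start, bin_end) pairs covering every row."""
    base, remainder = divmod(total_rows, folds)
    bounds = []
    start = 1
    for k_value in range(folds):
        # The first `remainder` folds absorb one leftover row each
        size = base + (1 if k_value < remainder else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds


print(fold_bounds(453, 10)[:4])
# [(1, 46), (47, 92), (93, 138), (139, 183)]
```

With a row count that divides evenly, such as 450 rows and 10 folds, the same function falls back to equal bins of 45 rows.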
