Decision trees are used extensively in machine learning because they are easy to use, easy to interpret, and easy to operationalize. KD Nuggets, one of the most respected sites for data science and machine learning, recently published an article that identified decision trees as a “top 10” algorithm for machine learning. (1)
Besides being easy to understand, use and operationalize, most decision tree algorithms contain very useful features, such as:
- • Support for continuous and categorical inputs
- • Clear indication of most important variables
- • Minimal computation needed for classification
- • Automatic handling of missing values
If you are new to machine learning, some of these concepts may be unfamiliar. The goal of this blog is to provide you with the basics of decision trees using Talend and Apache Spark.
If you want to learn more about advanced analytics, please see the references section below.(2) Some of the concepts presented here are sourced from that material.
Machine Learning Definitions
In this post I’ll be referring to a few terms that warrant a brief explanation:
- • Training Data – a set of data used to train a model, comprised of a vector input variables (features) and a known outcome (target).
- • Test Data – a set of data used to test model performance after it has been trained.
- • Features – training data input variables used to train a model. E.g. “age”, “income”, “loan amount.”
- • Target – output from trained model. E.g. “risk of default.”
Decision Trees Algorithms
Decision trees have several implementations. The three most common are:
- • Invented by Ross Quinlan in 1979
- • Builds the tree from the top down
- • Information gain is used to select the most useful attribute
- • Designed for classification trees
- • Greedy algorithm tends to over-fit data
- • Addresses over-fitting by bottom-up technique known as pruning
- • Handles incomplete data points
- • Handles categorical or continuous data
- • Developed by Quinlan to expand upon ID3
- CART – Classification and Regression Trees
- • Developed by four Berkeley and Stanford statistics professors (1974 – 1984)
- • Focused on accurate assessment when data is noisy (outliers and missing values)
- • Handles categorical or continuous data as a function of the dependent variables
Of the three implementations, CART is the most popular. The name refers to the fact that CART can be used when the target variable is expressed either as a finite list, such as age or hair color (classification), or is continuous, such as annual precipitation (regression). * Note that it is the target variable that determines the type of tree.
Below, I’ll demonstrate a simple classification tree using data well known to the machine learning community. The kyphosis data set (3) reports on children who, having had corrective spinal surgery, exhibited kyphosis post-surgery. Below is a sampling of the kyphosis data set which was used to train a decision tree model:
- • Kyphosis – indicator of whether or not the condition “kyphosis” is present after surgery
- • Age – age of patient in months
- • Number – the number of vertebrae involved
- • Start – the number of the first (topmost) vertebra operated on
The feature variables are 1. Age, 2. Number, and 3. Start. The target variable is indicated by Kyphosis. It is a simple category that has two levels; “absent” and “present.”
Anatomy of a Decision Tree
Below is a visual depiction of a classification tree trained using the kyphosis data set.
Key components of a decision tree (4):
- • Root Node – Top of internal node
- • Branch – Outcome of test
- • Internal Node – Decision on variable
- • Leaf Node – Class label
- • Depth – Minimum number of steps required to reach the node from the root
Decision trees are built using a recursive partitioning algorithm that is easily described using the following pseudo code (2):
- For every input feature variable, assess the best way to split the data into two or more subgroups. Select the best split and divide the data into the subgroups defined by the split.
- Pick one of the subgroups and repeat Step 1. Repeat for every subgroup.
- Continue splitting until all records after the split belong to the same target variable or until another stop condition is met.
The concept of “best split” is pertinent to how decision trees are built. Best split refers to node purity (P) which is a measure of its homogeneity (information gain) or heterogeneity (entropy).
Using the root node as an example:
Root Node: P (Kyphosis = “absent”) = 47/60 = 78% pure on the class Kyphosis = “absent.”
With this information, the decision tree algorithm then iterates through all subgroups to find the next best fit. Here, the feature “Start” where its value is greater than or equal to 12, was chosen as the best split.
Internal Node 1 Left Branch: P (Kyphosis = “absent”) = 32/34 = 94% pure on the class Kyphosis = “absent.” The 57% is the percentage of observations used at this node.
Node pureness is used by the decision tree algorithm as input to much more involved calculations called entropy and information gain.
However, the goal of information gain is to compare the purity of the parent node before the split with the degree of purity after the split. The feature with the greatest information gain is considered the most informative and it is used at the split. (4) In the example above, “Start >= 12” is the most informative feature for the first internal node.
Operationalizing a Decision Tree
As mentioned earlier, decision trees are intuitive and easy to explain. They are also very easy to implement using any programming language that supports conditional statements. Using our decision tree, let’s define a few rules for the conditional logic:
- if (Start >= 12) then (Kyphosis=”absent”)
- if (Start<12) and (Age< 86) and (Number< 4.5) then (Kyphosis=”absent”)
- if (Start < 12) and (Age< 86) and (Number >= 4.5) and (Start<7.5) then (Kyphosis=”present”)
- And so on
Implementing a decision tree model is conceptually straight forward. However, training, testing, and operationalizing a model is not without some challenges.
One such challenge is the maintenance and upkeep of the model. The “learning” part of machine learning alludes to the fact that models need to evolve with the data. This necessitates model re-training and depending on the business use case can occur quite frequently.
To maximize efficiency, model training, re-training, and operationalization needs to be automated and Talend makes this easy.
Enabling the Data-Driven Enterprise
Talend provides a comprehensive eco-system of tools and technologies to facilitate the integration of machine learning into data integration work flows in a continuous and automated way. This enables organizations to focus more on business outcomes and realize a faster time to value for their most valuable asset; their data.
But don’t just take my word for it. See for yourself! Get Started Right Now with Our Hands-on Tutorial: Machine Learning 101 – Decision Trees
As always, I am interested in hearing your thoughts on this topic and encourage your feedback.
 Le, J. (August, 2016). The 10 Algorithms Machine Learning Engineers Need to Know. Accessed from URL
 Abbott, D. (2014). Applied Predictive Analytics. Indianapolis: Wiley
 John M. Chambers and Trevor J. Hastie eds. (1992) Statistical Models in S, Wadsworth and Brooks/Cole, Pacific Grove, CA 1992. Kyphosis data set
 Diertrich, D. Heller B, Yang, B. (2015). Data Science and Big Data Analytics. Indianapolis: Wiley