August 30, 2024
A guest post from Fabrício Ceolin, DevOps Engineer at Comet. Inspired by the growing demand…
Tree-based machine learning methods are among the most commonly used supervised learning methods. They are constructed from two entities: branches and nodes. Tree-based ML methods are built by recursively splitting a training sample, using at each node the feature that splits the data most effectively. The splitting is based on simple decision rules inferred from the training data.
Generally, tree-based ML methods are simple and intuitive: to predict a class label or value, we start at the top of the tree (the root) and follow the branches down through the nodes, comparing the observation's feature values against each node's splitting rule until we reach a leaf.
When making a prediction for the region an observation falls into, tree-based methods use the mean of the training observations in that region for continuous targets, or the mode for categorical targets.
Since the set of rules used to segment the predictor space can be summarized in a visual representation with branches that show all the possible outcomes, these approaches are commonly referred to as decision tree methods.
The methods are flexible and can be applied to either classification or regression problems. Classification and Regression Trees (CART), a term introduced by Leo Breiman, refers to this flexibility: the same approach can solve both linear and non-linear predictive modeling problems.
Decision trees can be classified based on the type of target or response variable.
Classification trees: the default type of decision tree, used when the response variable is categorical, e.g. predicting whether a team will win or lose a game.
Regression trees: used when the target variable is continuous or numerical in nature, e.g. predicting house prices based on year of construction, number of rooms, etc.
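As a quick illustration of the two types, here is a minimal sketch assuming scikit-learn; the toy features and targets below are made up purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy features: [height_cm, age] -- made-up values for illustration only.
X = [[150, 10], [165, 12], [140, 9], [172, 13], [155, 11], [168, 12]]

# Classification tree: categorical target (1 = plays basketball, 0 = doesn't).
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X, [0, 1, 0, 1, 0, 1])
print(clf.predict([[160, 12]]))  # the mode of the training labels in its region

# Regression tree: continuous target (e.g. points scored per game).
reg = DecisionTreeRegressor(max_depth=2, random_state=42)
reg.fit(X, [2.0, 11.5, 1.0, 14.0, 3.5, 12.0])
print(reg.predict([[160, 12]]))  # the mean of the training values in its region
```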
i) Root node — this represents the entire population or the sample, which gets divided into two or more homogenous subsets.
ii) Splitting — subdividing a node into two or more sub-nodes.
iii) Decision node — this is when a sub-node is divided into further sub-nodes.
iv) Leaf/Terminal node — this is the final/last node that we consider for our model output. It cannot be split further.
v) Pruning — removing unnecessary sub-nodes of a decision node to combat overfitting.
vi) Branch/Sub-tree — the sub-section of the entire tree.
vii) Parent and Child node — a node that’s subdivided into a sub-node is a parent, while the sub-node is the child node.
Where a tree is split affects its accuracy. Tree-based machine learning models use different algorithms to decide where to split a node into two or more sub-nodes, with the goal of making the resulting sub-nodes as homogenous as possible. Which algorithm is used depends on the type of target variable.
Suppose you're the basketball coach at a grade school. The inter-school basketball competitions are approaching, and you want to run a survey to determine which students play basketball in their leisure time. The sample selected is 40 students, and the candidate variables are gender, height, and class.
As a coach, you’d want to select the students based on the most significant input variable among the three variables.
Decision tree algorithms will help the coach identify the right students by splitting on the variable that creates the most homogenous sets of student players.
CART is used to train a decision tree. It first splits the training set into two subsets using a single feature k and threshold t_k (e.g. height ≥ 150 cm). The algorithm searches for the pair (k, t_k) that produces the purest subsets. The cost function the algorithm tries to minimize for classification is given by:
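J(k, t_k) = (m_left/m) × G_left + (m_right/m) × G_right

where G_left and G_right measure the impurity (e.g. the Gini impurity) of the left and right subsets, and m_left and m_right are the numbers of training instances in those subsets, out of the m instances at the node being split.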
The CART algorithm then splits the child nodes recursively using the same approach, and it stops once it reaches the maximum depth or can no longer find a split that reduces impurity.
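To make the greedy search concrete, here is a minimal, plain-Python sketch (the helper names and the toy sample are made up for illustration, not the article's data): for every feature k and candidate threshold t_k it computes the weighted Gini impurity of the two resulting subsets and keeps the pair with the lowest cost.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    """Return (feature_index, threshold, cost) minimizing the weighted impurity."""
    m = len(y)
    best = (None, None, float("inf"))
    for k in range(len(X[0])):                    # each feature
        for tk in sorted({row[k] for row in X}):  # each candidate threshold
            left = [y[i] for i in range(m) if X[i][k] < tk]
            right = [y[i] for i in range(m) if X[i][k] >= tk]
            if not left or not right:
                continue
            cost = (len(left) / m) * gini_impurity(left) \
                 + (len(right) / m) * gini_impurity(right)
            if cost < best[2]:
                best = (k, tk, cost)
    return best

# Toy sample: [height_cm, grade], label = plays basketball (1) or not (0).
X = [[150, 5], [162, 6], [148, 5], [170, 7], [155, 6], [168, 7]]
y = [0, 1, 0, 1, 0, 1]
print(best_split(X, y))  # -> (0, 162, 0.0): splitting on height at 162 cm gives pure subsets
```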
We'll now have a look at the two commonly used criteria for measuring the impurity of a node: the Gini index and entropy.
The Gini index is based on the idea that if we select two items from a population at random, they will be of the same class with probability 1 if the population is pure. In the convention used here, the Gini index of a node is the probability that two randomly chosen items belong to the same class (p² + q² for a binary target), so the higher the value of the Gini index, the higher the homogeneity. It is normally used with categorical target variables, such as pass or fail, and it performs only binary splits. The CART algorithm uses the Gini measure to create binary splits.
Referring to the example above, where the basketball coach wants to separate students based on whether they play basketball or not, we'll use the input variables gender, height, and class and identify which split produces the most homogenous sub-nodes according to the Gini index.
We calculate the weighted Gini index of the sub-nodes produced by each of the three candidate splits: gender, height, and class.
In this example, the split on gender gives the highest weighted Gini index, so the split takes place on gender, which results in the most homogenous sub-nodes.
To obtain the Gini impurity, we simply take the complement:

Gini impurity = 1 - Gini index
The split on class has the highest Gini impurity and will produce the least homogenous sub-nodes.
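Here is a worked sketch of this comparison, using the convention above (Gini index = p² + q² per sub-node, weighted by sub-node size; Gini impurity = 1 - Gini index); the per-split counts below are hypothetical, chosen to illustrate the calculation rather than reproduce the article's actual survey numbers.

```python
def gini_index(players, size):
    """p^2 + q^2 for a sub-node, where p is the fraction who play basketball."""
    p = players / size
    q = 1 - p
    return p ** 2 + q ** 2

def weighted_gini(sub_nodes, total):
    """Weighted Gini index over sub-nodes given as (players, size) pairs."""
    return sum(size / total * gini_index(players, size) for players, size in sub_nodes)

total = 40
splits = {
    # split name: [(players, size) per sub-node] -- hypothetical counts
    "gender": [(2, 15), (18, 25)],
    "height": [(8, 22), (12, 18)],
    "class":  [(9, 21), (11, 19)],
}

for name, sub_nodes in splits.items():
    index = weighted_gini(sub_nodes, total)
    print(f"{name}: Gini index = {index:.3f}, Gini impurity = {1 - index:.3f}")
```

With these hypothetical counts, gender has the highest weighted Gini index and class the highest Gini impurity, matching the conclusion above.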
In machine learning, entropy is commonly used as an impurity measure. If a sample is completely homogenous, the entropy is zero, and if the sample is equally divided (50%-50%), it has an entropy of one.
Entropy for a binary split is calculated by the following equation:

Entropy = -p log2(p) - q log2(q)

where p and q are the probabilities of, say, success and failure in a particular node. The same idea extends to categorical target variables with more than two classes by summing over all the classes. The lower the entropy, the better the split.
We'll now calculate the entropy for the gender, height, and class splits in the example above.
In this example, the split on gender has the lowest entropy; hence it is the most favorable split and gives the most homogenous sub-nodes.
Information gain can be calculated from entropy: it is the reduction in entropy achieved by a split.

Information gain = entropy(parent) - weighted average entropy(children)

The higher the information gain, the better the split.
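A short sketch of these calculations, reusing the hypothetical counts from the Gini sketch above (20 of the 40 students play basketball overall):

```python
from math import log2

def entropy(players, size):
    """-p*log2(p) - q*log2(q); zero if the node is pure."""
    p = players / size
    if p in (0.0, 1.0):
        return 0.0
    q = 1 - p
    return -p * log2(p) - q * log2(q)

# Parent node: 20 of 40 students play basketball (hypothetical), so entropy = 1.0.
parent = entropy(20, 40)

# Gender split (hypothetical counts): 2 of 15 play in one sub-node, 18 of 25 in the other.
children = (15 / 40) * entropy(2, 15) + (25 / 40) * entropy(18, 25)

print(f"parent entropy   = {parent:.3f}")
print(f"children entropy = {children:.3f}")
print(f"information gain = {parent - children:.3f}")
```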
Decision trees make very few assumptions about the training data, as opposed to linear models, which assume a linear relationship between the features and the target.
Such a model is known as a non-parametric model: the number of parameters isn't determined prior to training, so the model structure is free to fit closely to the data. If left unconstrained, the tree structure adapts itself to the training data, which usually results in overfitting.
A parametric model, such as a linear model, has a predetermined number of parameters; thus, its degrees of freedom are limited. This reduces the risk of overfitting but increases the risk of underfitting. To avoid overfitting the training data, we need to restrict the decision tree's freedom during training. This is known as regularization.
The regularization hyperparameters depend on the algorithm used. Decision tree algorithms expose hyperparameters that restrict the shape of the tree, which helps prevent overfitting (though restricting too much risks underfitting). In scikit-learn's decision tree classifier, they include:
min_samples_split: the minimum number of samples a node must have before it can be split. Increasing it reduces overfitting.
min_samples_leaf: the minimum number of samples a leaf node must have.
max_depth: the maximum depth of the tree. A higher depth allows the model to learn relations that are very specific to a particular sample.
max_leaf_nodes: the maximum number of leaf nodes.
max_features: the maximum number of features that are evaluated for splitting at each node.
min_weight_fraction_leaf: the same as min_samples_leaf but expressed as a fraction of the total number of weighted instances.
Increasing the min_* hyperparameters or reducing max_* hyperparameters regularizes the model.
The figure above shows, on the left, a decision tree trained with the default hyperparameters (no restrictions) and, on the right, one trained with min_samples_leaf=4. It's evident that the model on the left is overfitting, while the model on the right generalizes better to unseen data.
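To reproduce that comparison in code, here is a minimal sketch assuming scikit-learn (where these hyperparameter names come from); the moons dataset is a stand-in chosen for illustration, not the data behind the figure.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No restrictions: the tree is free to fit the training data very closely.
unrestricted = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Regularized: increase min_* hyperparameters and/or decrease max_* hyperparameters.
regularized = DecisionTreeClassifier(
    min_samples_leaf=4,  # every leaf must contain at least 4 samples
    max_depth=6,         # cap how deep the tree can grow
    random_state=42,
).fit(X_train, y_train)

print("unrestricted test accuracy:", unrestricted.score(X_test, y_test))
print("regularized test accuracy: ", regularized.score(X_test, y_test))
```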
How do we implement pruning in decision trees?
A decision tree without constraints such as max_depth uses a greedy approach: it recursively looks for the best split, which can lead to exponential growth in the number of nodes. A node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides isn't statistically significant.
Standard statistical tests, such as the chi-squared (χ²) test, are used to estimate the probability that the improvement is purely the result of chance (the null hypothesis).
If the probability (p-value) is higher than a given threshold (5%), then the node is considered unnecessary, and its children are deleted. Pruning continues until all the unnecessary nodes have been removed.
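Here is a sketch of that test, assuming SciPy is available; the class counts in the contingency table are hypothetical, and a real implementation would read them from the two leaf children of the node being evaluated.

```python
from scipy.stats import chi2_contingency

# Rows = left/right leaf child, columns = class counts (plays / doesn't play).
observed = [[8, 7],   # left leaf:  8 positives, 7 negatives
            [9, 6]]   # right leaf: 9 positives, 6 negatives

chi2, p_value, dof, expected = chi2_contingency(observed)

if p_value > 0.05:
    print(f"p = {p_value:.2f} > 0.05: improvement likely due to chance, prune the children")
else:
    print(f"p = {p_value:.2f} <= 0.05: keep the split")
```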
If we aggregate the predictions of a group of predictors (regressors or classifiers), we'll often get better predictions than with any individual predictor.
A group of predictive models combined to improve accuracy and stability is called an ensemble, and the technique is known as ensemble learning. Random forest is one such ensemble method.
We can train a group of decision tree classifiers, each on a different random subset of the training set. To make the final prediction, we obtain the predictions of all the individual trees, then predict the class that gets the most votes.
Such an ensemble of decision trees is called a random forest and is one of the most powerful machine learning algorithms for straightforward classification or regression tasks. It can also be used for dimensionality reduction and to handle missing values and outliers, and it supports other exploratory data analysis steps.
Random forest algorithms introduce extra randomness when growing trees: instead of searching for the very best feature when splitting a node, they search for the best feature among a random subset of features. This results in greater tree diversity, yielding an overall better model.
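As a closing sketch, here is a random forest assuming scikit-learn, reusing the same stand-in moons dataset as earlier; each tree is trained on a random bootstrap sample of the data, considers a random subset of features at every split, and the forest predicts by majority vote.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=500,     # number of trees in the ensemble
    max_features="sqrt",  # random subset of features considered at each split
    min_samples_leaf=4,   # the same per-tree regularization as before
    random_state=42,
    n_jobs=-1,            # train trees in parallel
)
forest.fit(X_train, y_train)
print("random forest test accuracy:", forest.score(X_test, y_test))
```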
Tree-based algorithms are really important for every data scientist to learn. In this article, we've covered a wide range of details about tree-based methods: their features and structure, attribute selection measures (the Gini index and entropy), regularization and pruning, and random forests.