# weLiveIn.space

## Decision trees in machine-learning

Or: how to calculate the chances of surviving on the Titanic

Have you ever wondered how to calculate the chances of a certain outcome of some event? It turns out it can be quite hard! Predictive analytics is at the forefront of modern data science and machine learning, fueling Google, Facebook, and company to pour money into their R&D departments. As of today there are several sophisticated techniques for classification and prediction: Hui Li from SAS wrote a good article about which algorithm to use when. The condensed result of his article is this very informative graphic:

In this article we stay in the lower-left "Classification" leaf. Image taken from here.

In this example we will look at a technique that is widely used in data science: **decision trees**. We will use this approach to calculate the chances of surviving on the Titanic. We will code it using Python's sklearn implementation of decision trees and will dig deep enough into the math to understand what is going on.

### Learning from disaster: The Titanic dataset

The sinking of the Titanic is a tragic tale of hubris, human error, bad luck and unusual circumstances. The largest ship built at the time collided with an iceberg that split the ship's hull open on the starboard side along five watertight compartments. Since the ship was only designed to withstand four flooded compartments, it foundered with approximately 2224 people on board. Only 705 survived, making it one of the deadliest maritime disasters in modern history.

The task for this article is to build a model with which we can infer the chances of surviving this catastrophe. Not to be cynical, but the Titanic dataset provides a great exercise for this sort of data exploration: it is not too large, and it exhibits very strong correlations with certain factors, such as whether the passenger was male, female or a child. This is mostly because the crew had the order "women and children first" during the evacuation of the ship.

### Explore data with Python

**All of the code is available as a Jupyter notebook here**

We start our script with some standard imports. We will use pandas for managing the data, graphviz for visualizing the derived tree, and the 'tree' implementation from the sklearn module.
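Since the notebook isn't embedded here, a minimal sketch of those imports could look like this:

```python
import pandas as pd       # data management
from sklearn import tree  # DecisionTreeClassifier and export_graphviz
# The graphviz Python package is only needed for rendering the tree later:
# import graphviz
```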

First we download the dataset from Kaggle. If you don't know Kaggle: it's a great platform for data scientists of any sort, providing a large number of datasets, exercises, tutorials and much more. The Titanic dataset comes as a comma-separated values (CSV) file, and we can import it with pandas' read_csv function.

Let's have a look at it. We use the iloc method to see the first 10 entries:
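The loading step might look like the following sketch. Since the Kaggle file isn't bundled with this article, the snippet inlines a few rows of the CSV (same columns and format; "train.csv" as the file name is an assumption about the Kaggle download) so that it runs standalone:

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("train.csv"): same columns and format as the
# Kaggle file, with only three passengers inlined.
csv_data = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
"""
df = pd.read_csv(io.StringIO(csv_data))

# .iloc slices rows by position; on the full file, df.iloc[:10] shows
# the first 10 passengers.
print(df.iloc[:10])
```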

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |

The first 10 entries from the Titanic dataset, obtained from Kaggle.

The dataset consists of 12 columns, which we will call features from now on, for every passenger. Most are self-explanatory, except for SibSp, the number of siblings and spouses on board, and Parch, the number of parents and children on board. A full explanation of all features is on the Kaggle dashboard.

The first thing we have to do is get a feel for the data. A good starting point is to look at correlations; in our case we look at all correlations with the 'Survived' feature. This is easily done in pandas, but before we calculate it we need to transform the 'Sex' feature into numerical values. We use an if/else clause within a list comprehension for that.
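A sketch of that step; the dataframe here is a tiny illustrative stand-in, so the printed numbers will not match the full-dataset values below:

```python
import pandas as pd

# Tiny stand-in for the full Titanic dataframe loaded above.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
})

# if/else inside a list comprehension: female -> 0, male -> 1.
df["SexNumerical"] = [0 if sex == "female" else 1 for sex in df["Sex"]]

# Absolute correlation of every numeric column with 'Survived'.
corr = df.corr(numeric_only=True)["Survived"].abs().sort_values()
print(corr)
```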

This gives us:

| Feature | Correlation with 'Survived' (absolute) |
|---|---|
| PassengerId | 0.005007 |
| SibSp | 0.035322 |
| Age | 0.077221 |
| Parch | 0.081629 |
| Fare | 0.257307 |
| Pclass | 0.338481 |
| SexNumerical | 0.543351 |
| Survived | 1.000000 |

It is obvious that 'SexNumerical' is the dominant feature in our dataset, so we should use it for our decision tree. To keep the tree from getting too complicated we will limit ourselves to three features. 'Pclass' will be the second feature: it has the second-highest correlation, and it makes sense, since people from first class had a higher chance of survival simply because they were not stuffed into the bottom of the ship. The third-highest correlation is 'Fare', but we will not use it because it carries essentially the same information as 'Pclass': 'Fare' is just the amount of money someone paid for the trip. Since 'Sex' has such a strong correlation, we split the dataset and look at all correlations separately for female and male passengers:
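A sketch of the per-sex correlations, again on illustrative stand-in data (the real numbers below come from the full dataset):

```python
import pandas as pd

# Illustrative stand-in; the real numbers come from the full dataset.
df = pd.DataFrame({
    "Survived": [0, 1, 0, 1, 1, 0],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 14.0, 54.0],
})

# Correlations with 'Survived', computed separately per sex.
male_corr = df[df["Sex"] == "male"].corr(numeric_only=True)["Survived"].abs()
female_corr = df[df["Sex"] == "female"].corr(numeric_only=True)["Survived"].abs()
print(male_corr, female_corr, sep="\n")
```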

| Feature | Male | Female |
|---|---|---|
| PassengerId | 0.040477 | 0.008790 |
| SibSp | 0.020238 | 0.263284 |
| Age | 0.119618 | 0.116109 |
| Parch | 0.096318 | 0.223644 |
| Fare | 0.171288 | 0.218466 |
| Pclass | 0.220618 | 0.477114 |
| SexNumerical | NaN | NaN |
| Survived | 1.000000 | 1.000000 |

Furthermore, we want to know whether more men or more women died:
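One possible way to count this (stand-in data again; the full dataset gives the counts quoted below):

```python
import pandas as pd

# Stand-in data; the full dataset yields the counts quoted below.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 1],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

survivors = df[df["Survived"] == 1]
n_women = (survivors["Sex"] == "female").sum()
n_men = (survivors["Sex"] == "male").sum()
print(f"{n_women} women and {n_men} men survived")
```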

233 women and 109 men survived

So, to shed some light on why men died that much more often, we are interested in the correlations in the 'Male' column; on that list, after the already-chosen 'Pclass' and the discarded 'Fare', 'Age' has the highest correlation. So we choose 'Age' as our third feature.

As data scientists, the first thing we have to do is clean our data of NaNs.
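A minimal sketch of the cleaning step with `dropna`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Survived":     [0, 0, 1],
    "SexNumerical": [1, 1, 0],
    "Age":          [22.0, np.nan, 26.0],  # one missing age, as for Mr. Moran
    "Pclass":       [3, 3, 3],
})

# Keep only rows where every feature we use is present.
df = df.dropna(subset=["Survived", "SexNumerical", "Age", "Pclass"])
print(len(df))
```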

Now we split the data into a training set and an evaluation set. We use 90% of all data for the training and 10% for the evaluation. I will explain later why we need these two datasets.
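The split might be coded along these lines (sklearn's `train_test_split` would do the same job; the shuffle seed is an arbitrary choice here):

```python
import pandas as pd

# Ten dummy rows; in the article the split is done on the cleaned dataframe.
df = pd.DataFrame({"Survived": [0, 1] * 5, "Age": range(10)})

# Shuffle, then use the first 90% for training and the last 10% for evaluation.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
cut = int(0.9 * len(shuffled))
train, evaluation = shuffled.iloc[:cut], shuffled.iloc[cut:]
print(len(train), len(evaluation))
```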

We now prepare our three feature columns for the decision tree routine by stacking them together:
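A sketch of the stacking step using NumPy's `column_stack` (the exact stacking call in the notebook may differ):

```python
import numpy as np
import pandas as pd

# Stand-in training dataframe.
train = pd.DataFrame({
    "Survived":     [0, 1, 1],
    "SexNumerical": [1, 0, 0],
    "Age":          [22.0, 38.0, 26.0],
    "Pclass":       [3, 1, 3],
})

# Stack the three feature columns into one (n_samples, 3) array.
X = np.column_stack((train["SexNumerical"], train["Age"], train["Pclass"]))
y = train["Survived"].values
print(X.shape, y.shape)
```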

Now to the fitting part: define the classifier and fit it to the data. I chose *max_depth* and *max_leaf_nodes* so that the output is easy to interpret. You can fiddle around with these two parameters to get better accuracy.

To visualize our graph we use graphviz:
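Fitting and exporting the tree could look like this sketch; the toy data, the `random_state`, and the exact `max_depth`/`max_leaf_nodes` values are assumptions for illustration, not necessarily the settings used in the notebook:

```python
import numpy as np
from sklearn import tree

# Toy training data in (SexNumerical, Age, Pclass) order; the real X, y come
# from the stacking step above. max_depth/max_leaf_nodes are illustrative.
X = np.array([[1, 22.0, 3], [0, 38.0, 1], [0, 26.0, 3],
              [1, 35.0, 3], [0, 35.0, 1], [1, 2.0, 3]])
y = np.array([0, 1, 1, 0, 1, 1])

clf = tree.DecisionTreeClassifier(max_depth=3, max_leaf_nodes=6, random_state=0)
clf.fit(X, y)

# export_graphviz returns DOT source; with the graphviz package installed,
# graphviz.Source(dot).render("titanic_tree") draws it to a file.
dot = tree.export_graphviz(clf, out_file=None,
                           feature_names=["SexNumerical", "Age", "Pclass"],
                           class_names=["died", "survived"],
                           filled=True, rounded=True)
print(dot.splitlines()[0])
```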

This gives us this decision tree:

I manually modified the colors and the shape of each node a bit. The tree in the Jupyter notebook will look slightly different but contains roughly the same information. Note that, due to different initial conditions, the tree can look a bit different every time you run this code.

**Samples**: The number of available samples in that node. For the root node this means we have training data for 642 passengers.

**Value**: Since this is binary classification, the tuple at each node tells us how many samples fall into each category. For the root node it reads: of 642 passengers, 379 died and 263 survived.

**Gini**: The Gini impurity is a measure of misclassification. If it is 0.0, the probability of misclassifying an item from this node is zero. We want a very low Gini impurity, because it makes for good predictions. In other words: if gini = 0.0, the outcome at this node is certain. I will explain this parameter in depth in the first math section below. See also: here

**This tree is telling a story:**

The root node at the top essentially asks: *"Out of these 642 passengers, what are the chances of survival if the passenger was a man or a woman?"* The answer is one leaf further down. If "being a woman" (Sex < 0.5) is *true*, we are on the women's branch (left side). "Value" shows (57, 177); those are all the women on board as (not survived, survived): 177 women survived while 57 died. Since more women survived than died in that node, "Class" shows *survived*. The Gini index is *0.369*; if it were 0, all women would have survived. Going further down the women's branch, we see that the bottom-left node has a Gini index of 0.106. Predictions based on that node are therefore very good: of the 143 women who were not traveling in third class, only 8 died. So if you fell into that category, the chance of dying was only 5.6 %.

Looking at the male side, we see that a lot of men died (322 out of 408) and, even worse, the next leaf shows that if the passenger was older than 13 years, the chances of dying were insanely high: of 377 men, only 68 survived. If the passenger was a child, the chances looked much better: of 31 young boys under the age of 13, only 13 died. All of these 13 boys were traveling in third class. This can be seen in the last node, which has a Gini index of 0.0: all boys younger than 13 who were traveling in first or second class survived.

Now we should have three questions: *"How do we make predictions with our tree?"*, *"How good are these predictions?"* and *"Where is that tree coming from anyway?"*. The first two are easy and we will address them first; the last involves math and will be covered in the first math section.

*1. How to make predictions with our tree?*

Very easy: by walking down the tree. We'll take my personal data as an example. I am a 33-year-old man who probably couldn't have afforded first class back then, so I will put myself in second class. We start at the root node (the one on top); since I am a man (Sex = 1), we progress along the right branch to the second node, which asks whether I am younger than 13 years. I am definitely older than that, so we progress again along the right branch to the node that divides the classes. I travel second class, so I find my final prediction node by once again progressing on the right side. This final node has a sample size of 285, meaning there were 285 men older than 13 years traveling in second or third class. Of these 285 men, 252 died. Basically almost everyone; to be precise: 252 / 285 = **88.42 %**. That leaves my chances of survival at **11.58 %**, which is kind of depressing.

All of this is also implemented in the sklearn class object. We can simply use the *predict_proba* method of our tree object:
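A sketch of that call; the toy model here won't reproduce the 11.58 % figure, which comes from the tree trained on the full dataset:

```python
import numpy as np
from sklearn import tree

# Same toy tree as in the fitting sketch above.
X = np.array([[1, 22.0, 3], [0, 38.0, 1], [0, 26.0, 3],
              [1, 35.0, 3], [0, 35.0, 1], [1, 2.0, 3]])
y = np.array([0, 1, 1, 0, 1, 1])
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# One passenger: a 33-year-old man in second class.
# predict_proba returns [[P(not survived), P(survived)]].
proba = clf.predict_proba([[1, 33.0, 2]])
print(f"{proba[0][1] * 100:.2f} %")
```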

11.58 %

*2. How good are these predictions?*

Remember that we split our original dataset into two subsets: a training and an evaluation dataset. The first contained 90% of all available data and was used during the training procedure. Now we use the evaluation dataset to test the accuracy of our trained model on passenger data it has not seen during training. What we do is feed all entries of the evaluation dataset into our tree and compare the output of our model with the ground truth provided in the evaluation dataset. We code it by defining a Python lambda function with a list for the prediction part and then using a regular function with the *apply* method of our pandas dataframe:
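The evaluation could be sketched like this; the model and the evaluation dataframe are toy stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Toy model and a toy evaluation dataframe standing in for the 10% hold-out.
X = np.array([[1, 22.0, 3], [0, 38.0, 1], [0, 26.0, 3], [1, 35.0, 3]])
y = np.array([0, 1, 1, 0])
clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

evaluation = pd.DataFrame({
    "SexNumerical": [1, 0], "Age": [54.0, 14.0],
    "Pclass": [1, 2], "Survived": [0, 1],
})

# A lambda with a list for the prediction, a regular function with .apply
# for comparing against the ground truth.
predict = lambda row: clf.predict([[row["SexNumerical"], row["Age"], row["Pclass"]]])[0]

def is_correct(row):
    return predict(row) == row["Survived"]

accuracy = evaluation.apply(is_correct, axis=1).mean()
print(f"{accuracy * 100:.2f} % was correctly predicted")
```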

80.56 % was correctly predicted

So we have a model with roughly 81% accuracy. A moderate result, but since this is only an introduction and we chose our parameters for simplicity rather than accuracy, it is quite alright. A model with such low predictive capability should not be used in a production system; if the distribution you are inferring from has very small base rates, such a model will fail badly (see also my blog post about Bayes statistics). This leaves us with the third, and most complicated, question:

*3. Where is that tree coming from?*

- Intuition:

A decision tree is nothing complicated. Splitting a decision into two possible outcomes, even building complete trees, is something we do all the time in our head: *Should I go to work or stay home? Am I equally productive when I stay home? Isn’t my way to work so long that the efficiency bonus is nulled out anyway? Should I find a new job?*

The question that arises when applying this technique to machine learning is how to find a metric that makes good splits, and then a procedure to apply this metric iteratively to our data. Several metrics are available, but the two most widely used are *Gini impurity* and *information gain*. We will look at these two and code an example, but first we need to establish a procedure for applying those metrics.

- The recursive formulation:

My description of the involved math follows Breiman, Leo, et al., *Classification and Regression Trees*, CRC Press, 1984, and the scikit-learn documentation.

Given a training vector \( x_{i}\in\mathbf{R}^{n},i=1,\dots,l \) and a label vector \( y\in\mathbf{R}^{l} \), where \( l \) is the number of samples (the number of passengers in our case) and \( n \) is the dimensionality of the training vector (3 for us, as we only use *Sex*, *Age* and *Pclass*). To get a feel for this, here are the first 5 entries of \( x_{i} \) and \( y \) in the Titanic dataset:

| Survived (label vector) | SexNumerical | Age | Pclass |
|---|---|---|---|
| 0 | 1 | 22.0 | 3 |
| 1 | 0 | 38.0 | 1 |
| 1 | 0 | 26.0 | 3 |
| 1 | 0 | 35.0 | 1 |
| 0 | 1 | 35.0 | 3 |

The first 5 entries of our data; the last three columns form the training vector. Each row is a vector: the training vector is 3-dimensional and the label is a scalar.

Let the data at a node \( m \) be denoted by the set \(Q \). The tuple at which a split possibly occurs is called a candidate split, we write it as: \(\Theta = \left(j,t_{m}\right) \), where \( j \) is a feature and \( t_{m} \) is a threshold for the node \( m \). We then want to divide \(Q \) into two subsets:

\[ Q_{left}\left(\Theta\right) = \left\{\left(x,y\right) \mid x_{j} \leq t_{m}\right\} \] \[ Q_{right}\left(\Theta\right) = Q \setminus Q_{left}\left( \Theta \right) \]

This just says that whatever ends up on the left side can't be on the right side, and that we use a threshold value, applied to a feature, to determine where we split our data into two subsets. The function that determines what makes a good split is defined as:

\[ G\left(Q, \Theta\right) = \frac{n_{left}}{N_{m}}H\left(Q_{left}\left(\Theta\right)\right)+\frac{n_{right}}{N_{m}}H\left(Q_{right}\left(\Theta\right)\right) \]

where \( N_{m} \) is the number of available samples for the node \( m \). \( n_{left} \) and \( n_{right} \) are the number of samples for the left or the right split and \( H \) is the metric that we are using.

This function is nothing else than a weighted average over a metric that gets our candidate split datasets as input. Since our metric function \( H \) measures the homogeneity of a dataset, the weighted average in \( G \) gives us a way to determine whether a split would produce homogeneous datasets on both sides, which is what we want. We call such a function a cost function, and our ultimate goal is to minimize it by varying the threshold \( t_{m} \) and the feature parameter \( j \); that is, we want to find the tuple \( \left(j,t_{m}\right) \) that minimizes \( G\left(Q, \Theta\right) \):

\[ \tilde{\Theta} = \operatorname{argmin}_{\Theta}\, G\left(Q, \Theta\right) \]

This brings us to the choice of metric. Commonly used "metrics" like the *Gini impurity* and *information gain* are not metrics in the strict mathematical sense, as they do not satisfy the subadditivity (triangle inequality) condition. However, they have a rigorous theoretical and empirical background which is beyond the scope of this article. The first one we want to understand is the *Gini impurity*:

\[ H_{gini}\left(Q_{m}\right) = \sum_{i=1}^{J} p_{mi} \sum_{k\neq i} p_{mk} = \sum_{k=1}^{J} p_{mk} (1-p_{mk}) \]

\( J \) is the number of possible outcomes (2 in our case \( \rightarrow \) (*survived*, *not survived*)) and \( p_{mk} \) is the probability of one particular outcome \( k \) in the current node \( m \). The transformation is valid because the probabilities have to add up to 1: \( \sum_{k=1}^{J} p_{mk} = 1 \Rightarrow \sum_{k\neq i} p_{mk} = 1-p_{mi} \). We can calculate these probabilities straight from the number of occurrences within one subset:

\[ p_{mk} = \frac{1}{n_m} \sum_{x_{i}\in \hat{Q}} \mathbf{I}\left(y_{i}=k\right) \]

where \( n_m \) is the number of samples within the subset (this could be the number of *female/male* passengers, if we split on the *Sex* feature). \( \mathbf{I} \) is an indicator function: it is 1 when a sample fulfills the condition and 0 otherwise, so the sum counts the samples that fulfill it. An example is: *the number of female passengers that survived*. In this case \( k = \) *survived* and \( \hat{Q} \) is the set of all female passengers, because \( m \) is a node that splits on the feature *Sex*.

The other metric I want to highlight is called *information gain* in the context of decision trees but is really just based on the cross entropy, a concept from information theory, which was founded by Claude Shannon in 1948. I can't overstate the importance of Shannon's work, which quantifies how much "information", measured in "bits" or "shannons", is needed to encode or restore a signal; results like his source coding theorem and the Nyquist-Shannon sampling theorem are of utmost importance in basically everything that has to do with electronics, analytics, statistics or cryptography. The cross entropy measures how much information is needed to identify an event drawn from a set of events with true distribution \( p \) when the events are encoded according to a different distribution \( q \) over the same set. Since we are interested in the homogeneity or "pureness" of a dataset, we evaluate this metric with \( p=q \), which reduces it to the Shannon entropy:

\[ H_{info}\left(Q_{m}\right) = - \sum_{k=1}^{J} p_{mk}\, \log p_{mk}. \]

From a mathematical standpoint we have everything we need. The procedure for calculating the tree looks like this:

- Get all the data \( Q \) that falls into the node \( m \).
- Minimize \( G\left(Q, \Theta\right) \):
  - Calculate the probabilities \( p_{mk} \) for each outcome of each feature.
  - Calculate the Gini impurity / information gain \( H_{gini|info} \) for each candidate split.
  - Find the arguments \( \tilde{\Theta} \) that minimize the cost function \( G\left(Q, \Theta\right) \).
- Make the split, go to the next node, and start again with step 1.

- How to code all this:

Decision trees are always coded top-down. This means we start at the root node using all the data, then go down one step along a branch and calculate the next node with the data that falls into that branch. We will work through this with the Gini impurity on the root node. First we have to define all necessary functions:

And now we just run our procedure as described. Please note that we raster all possible parameters for all features: 2 parameters for "Sex", 3 parameters for "Pclass" and, in our case, 100 parameters for "Age". You can change the "100" as you see fit; smaller numbers yield coarser results.
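The functions and the raster loop together might look like this sketch (the dataframe is a stand-in; the thresholds and costs quoted below come from the real training set):

```python
import numpy as np
import pandas as pd

def gini(labels):
    """Gini impurity H(Q) = sum_k p_k * (1 - p_k)."""
    labels = list(labels)
    impurity = 0.0
    for k in set(labels):
        p_k = labels.count(k) / len(labels)  # probability via indicator count
        impurity += p_k * (1 - p_k)
    return impurity

def cost(values, labels, threshold):
    """Weighted average G(Q, theta) of the Gini impurities of both splits."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return float("inf")  # a candidate split must put data on both sides
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Stand-in training data; the real run produces the numbers listed below.
df = pd.DataFrame({
    "Survived":     [0, 1, 1, 1, 0, 0, 1, 0],
    "SexNumerical": [1, 0, 0, 0, 1, 1, 0, 1],
    "Age":          [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 4.0, 2.0],
    "Pclass":       [3, 1, 3, 1, 3, 1, 2, 3],
})

# Raster the candidate thresholds per feature and keep the cheapest split.
for feature, thresholds in [("SexNumerical", [0.5]),
                            ("Pclass", [1.5, 2.5]),
                            ("Age", np.linspace(0, 80, 100))]:
    best_cost, best_t = min((cost(df[feature], df["Survived"], t), t)
                            for t in thresholds)
    print(f"{feature}: cost {best_cost:.3f} at threshold {best_t:.2f}")
```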

The best results for each feature are:

‘SexNumerical’ \( \rightarrow \) Cost: 0.346, Threshold: 0.50, Gini_Left: 0.369, Gini_Right: 0.333

‘Age’ \( \rightarrow \) Cost: 0.470, Threshold: 6.05, Gini_Left: 0.393, Gini_Right: 0.475

‘Pclass’ \( \rightarrow \) Cost: 0.572, Threshold: 2.00, Gini_Left: 0.490, Gini_Right: 0.439

So the best split for our root node uses the *SexNumerical* feature with a threshold of \( 0.5 \); for this tuple the cost is minimal. This comes as no surprise: it is exactly what the sklearn algorithm did, and when we compare the Gini impurities of the *female* and *male* nodes in the plotted tree above, we see they are exactly as we calculated them (0.369 on the *female* side and 0.333 on the *male* side). The next step would be to go down the *male* or *female* branch and repeat the calculation, except that we would no longer consider the *SexNumerical* feature and would only use data where *SexNumerical* is 0 (female) or 1 (male).

- One downside:

The problem with this approach is that we have to sweep over all possible parameters when calculating a node. To guarantee the best tree for our data we would have to evaluate every possible sequence of splits, and the number of such trees grows explosively with the number of features; constructing an optimal decision tree is known to be NP-hard and is not feasible for everyday scenarios, sometimes not feasible at all. Therefore something called a greedy algorithm is used when computing the tree. A greedy algorithm makes the locally optimal choice at each node, optimizing only the immediate split without looking further ahead. It is therefore "short-sighted" and may fail to produce the optimal result. This can be seen here:

The goal is to find the path with the largest sum. The greedy approach fails to find the optimal solution in this scenario: in the direct neighborhood of the 7 lies the pair (3, 12), and without exploring all paths the locally optimal choice is to go to the 12 and then to the 6. Had we explored all paths, we would have found the 99 after walking along the 3. Source.

This is the reason why your tree may look different, and may yield different accuracies, every time you run a professional implementation like the one from sklearn, even with the same parameters on the same data.

### Conclusion

This was a lot to learn, and there is even more; we can improve on many fronts. I haven't talked about the *bias-variance tradeoff*, which eventually leads to *overfitting* in decision trees; *pruning* a tree via cross-validation; or *variance reduction* using *bootstrap aggregating* or *boosting*, which leads to the natural evolution of decision trees: *random forest* methods. Those topics will be part of my next article.

Let's recap what we learned: a decision tree can be computed in a supervised manner (that is, with labels). We applied this technique to a dataset containing passenger information from the Titanic and were able to build a model with \( \approx 81\% \) accuracy on the survival outcome of passengers it was not trained on. We are now able to answer the question: *"How high were my chances of surviving on the Titanic?"* Furthermore, we looked at the math involved, developed an intuition for it, and calculated all values for the root node in a step-by-step approach.

Decision trees are a fast and efficient approach for problems that are too large to analyze manually. They are easy to comprehend, intuitive to explain to an outsider, and should be in the repertoire of every data scientist. I hope this article helps you understand the underlying concepts, the intuition behind them, and the wide range of problems they can be applied to.