# weLiveIn.space

## Decision trees in machine-learning

Or: how to calculate the chances of surviving on the titanic

Have you ever wondered how to calculate the chances of an outcome of some event? Turns out: It can be quiet hard! Predictive analytics is at the forefront of modern data-science and machine-learning fueling Google, Facebook, and Co. to pour money into their R&D departments. As of today, there are some sophisticated techniques for classification and prediction: Hui Li from SAS wrote an excellent article about which algorithm when to use. The condensed result of his article is this graphic:

In this article we are solely in the lower left "Classification" leaf. Image taken from here. |

In this example, we look at a technique widely used in data science: **Decision trees**. We use this approach to calculate the chances of surviving on the Titanic. We code this using Pythons sklearn implementation of decision trees and dig deep enough into the math to understand what is going on.

### Learning from disaster: The Titanic dataset

The sinking of the Titanic is a tragic tale of hubris, human errors, bad luck and unusual circumstances. The largest ship ever built collided with an iceberg that split the ships hull starboard along five watertight compartments. Since the ship was only designed to withstand four flooded compartments, it foundered with approximately 2224 people on board. Only 705 survived, making it one of the deadliest maritime disasters in modern history.

The task for this article is to establish a model with which we can infer the chances of surviving this catastrophe. Not to be cynical but the Titanic dataset provides an excellent exercise for this sort of data exploration since it is not too large and it exhibits robust correlations with specific factors like: Was the passenger male or female or was it a child? These correlations are present mostly because the crew had the order “women and children first” during the evacuation of the ship.

### Explore data with Python

**All of the code is available as a jupyter notebook here**

We start our script with some standard imports. We use pandas for managing the data and Graphviz for visualizing our derived tree and the ‘tree’ implementation from the sklearn module.

First, we download the dataset from kaggle. If you don’t know kaggle: It’s an excellent platform for data scientist of any sort, providing a large number of datasets, exercises, tutorials and much more. The Titanic dataset comes as a “comma-separated-values” file (CSV), and we can import it with pandas read_csv function.

Lets have a look at it. We use the iloc method so see the first 10 entries

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |

1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |

2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |

3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |

4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |

6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |

7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |

8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |

9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |

The first 10 entries from the Titanic dataset. Obtained from Kaggle. |

The dataset consists of 12 columns; we call them features from now on, for every passenger. Most are self-explanatory except for SibSp, which is the number of siblings on board, and Parch, which is the number of parents on board. A full explanation of all features is on the kaggle dashboard.

The first thing we have to do is to get a feel for the data. A good starting point is to look at correlations. In our case, we look at all correlations with the ‘Survived’ feature, but before doing that, we need to transform the ‘Sex’ feature into numerical values. We use an if/then clause within list comprehension for that.

This gives us:

PassengerId | 0.005007 |

SibSp | 0.035322 |

Age | 0.077221 |

Parch | 0.081629 |

Fare | 0.257307 |

Pclass | 0.338481 |

SexNumerical | 0.543351 |

Survived | 1.000000 |

It is evident that ‘SexNumerical’ is the dominant feature in our dataset, so we should use it for our decision tree. To make our tree not too complicated we limit ourselves to three features. ‘Pclass’ is the second feature, it has the second highest correlation, and it makes sense since people from the first class had a higher chance of survival simply because they were not stuffed in the bottom of the ship. The third highest correlation is “Fare” but we don’t use it because it is the same as ‘Pclass’. ‘Fare’ is just the amount of money someone paid for the trip. Since ‘sex’ has such a strong correlation we split the dataset and look at all correlations for female and male passengers:

Male | Female | |
---|---|---|

PassengerId | 0.040477 | 0.008790 |

SibSp | 0.020238 | 0.263284 |

Age | 0.119618 | 0.116109 |

Parch | 0.096318 | 0.223644 |

Fare | 0.171288 | 0.218466 |

Pclass | 0.220618 | 0.477114 |

SexNumerical | NaN | NaN |

Survived | 1.000000 | 1.000000 |

Furthermore we want to know if more men or more women died:

233 women and 109 men survived

So to shed some light on why men died that more often we are interested in the correlations with ‘Male’, and on that list ‘Age’ has the next highest correlation. So we choose ‘Age’ as our third feature.

As data-scientists, the first thing we have to do is to clean our data from NaNs.

Now we split the data between a training set and an evaluation set. We use 90% of all data for the training and 10% for the evaluation. I explain later for why we need these two datasets.

We now prepare our three feature columns for the decision tree routine by stacking them all together.

Now to the fitting part. Define the classifier and fit it to the data. I choose *max_depth* and *max_leaf_nodes* in a way so that the output is easy to interpret. You can fiddle around with these two parameters to get better accuracy.

To visualize our graph we use graphviz:

This gives us this decision tree:

I modified manually the colors and the shape of each node a bit. The tree in the jupyter notebook looks slightly different but contains roughly the same information. Note that due to different initial conditions every time you run this code, the tree can look a bit different. |

Samples |
This gives us the number of all available samples in that node. So for the root node, this means we have training data of 642 passengers. |

Value |
Since this is binary classification at each node the tuple tells us how many samples fall into which category. So for the root node, it says: Of 642 passengers 379 died, and 263 lived. |

Gini |
The Gini impurity is a measure of misclassification. If it is 0.0, then the probability of misclassifying an item out of this category is zero. A very low Gini impurity is what we want because it makes for useful predictions. In other words: If Gini = 0.0 then we can be entirely sure that if we have to use this node, the outcome is certain. I explain this parameter in depth in the first math section below. See also: here |

**This tree is telling a story:**

The root node at the top asks the question essentially: *“Out of 642 how are the chances of survival if the passenger was a man/woman?”*. The answer is one leaf further down. We see that if “being woman” (Sex < 0.5) is *True* then we are on the women branch (left side). “Value” is showing (57, 177). Those are all the women on board that: (not survived, survived), so 177 women survived while 57 died. Since more woman survived than died in that node, “Class” is showing *survived*. The Gini index is *0.369*. If it were 0, then all woman had survived. Going further down the “women” branch we see that the bottom left node has a Gini index of 0.106 which means that predictions based on that node are excellent. Out of 143 women, who were traveling not in the third class, only 8 died. So if you were to fall into that category chances that you die are only 5.6 %.

Looking on the male side we see that many men died - 322 out of 408 - and, even worse, the next leaf is showing that if the age is above 13 years chances are insanely high for the passenger to die. Of 377 men only 68 survived. If the passenger was a child, the chances looked much better. Of 31 young boys under the age of 13, only 13 died. All of these 13 boys were traveling in the third class. The class affiliation is in the last node, which has a Gini index of 0.0. So all the boys that were younger than 13 years and were traveling in the first or the second class survived.

Now we should have three questions: *“How to make predictions with our tree?”*, *“How good are these predictions?”* and *“How to derive this tree”*. The former is easy, and we address it first, the latter involves math and is in the first math section.

*1. How to make predictions with our tree?*

Very easy; By climbing up the tree. We’ll take my private data as an example. I am a 33-year-old man who probably couldn’t afford first class back then, so I put myself in the second class. We are starting at the root node (the one on top), since I am a man (Sex=1), we’ll progress on the right branch to the second node. The second node asks if I am younger then 13 years and since I am older we’ll progress on the right branch of the node. I traveled second class, so I find my final prediction node by once again progressing on the right side. This final node has a sample size of 285. So, there were 285 men older than 13 years and traveling second or third class. Of this 285 men, 252 did die. Basically almost everyone, to precise: 252 / 2.85 = **88.42 %**. That leaves my chances of survival at **11.58 %**, which is depressing.

All of this is in the sklearn class object. We can use the *predict_proba* method of our tree object:

11.58 %

*2. How good are these predictions?*

We remember we had split our original dataset into two subsets: A training and an evaluation dataset. The first one contained 90% of all available data used during the training procedure. Now we use the evaluation dataset to test the accuracy of our trained model on passenger data the model has not seen during the training process. What we are doing is to feed all the entries in our evaluation dataset into our tree and compare the output of our model with the ground truth provided in the evaluation dataset. We start coding it by defining a python lambda function and list comprehension for the prediction part and then using a regular function with the *apply* method of our pandas dataframe:

80.56 % was correctly predicted

So we have a model with roughly 81% accuracy; A reasonable result but since this is only an introduction and we chose our parameter for simplicity and not for accuracy it is quite all right. A model which has that low prediction capabilities should not be used in a production system. If you have small base rates in the distribution from which you are inferring such a model explodes, see also my blog post about Bayes statistics.

*3. How to derive this tree?*

- Intuition:

A decision tree is nothing complicated. Splitting a question into two possible outcomes, even building entire trees, is something we do all the time in our head: *Should I go to work or stay home? Am I equally productive when I stay home?*

The question that arises when applying this technique to machine learning is to find some metric that does good splits and then a procedure to iteratively apply this metric to our data. There are several metrics available, but the two most used are the *“Gini impurity”* and the *“Information gain.”* We look at these two and code an example, but first, we need to establish a procedure for applying those metrics.

- The recursive formulation:

My description of the involved math follows Breiman, Leo, et al. Classification and regression trees. CRC press, 1984 and the scikit-learn documentation.

Given a training vector: \( x_{i}\in\mathbf{R}^{n},i=1,\dots,l \) and a label vector: \( y\in\mathbf{R}^{1} \). Where \( l \) is the
number of samples (number of passengers in our case) and \( n \) is the dimensionality of the training vector (3 for us, as we only used: *Sex*, *Age* and *Class*). To get a feel for this: Here are the first 5 entries of \( x_{i} \) and \( y \) in the titanic dataset:

Label vector | Training vector | ||
---|---|---|---|

Survived | SexNumerical | Age | Pclass |

0 | 1 | 22.0 | 3 |

1 | 0 | 38.0 | 1 |

1 | 0 | 26.0 | 3 |

1 | 0 | 35.0 | 1 |

0 | 1 | 35.0 | 3 |

The first 5 entries of our data, color-coded to give an intuition for the training and the label vector. Each row is a vector. So the training vector is 3-dimensional and the label vector is a scalar. |

Let the data at a node \( m \) be denoted by the set \(Q \). The tuple at which a split possibly occurs is called a candidate split, we write it as: \(\Theta = \left(j,t_{m}\right) \), where \( j \) is a feature and \( t_{m} \) is a threshold for the node \( m \). We then want to divide \(Q \) into two subsets:

\[ Q_{left}\left(\Theta\right) = \left(x,y\right)|x_{j} \leq t_{m} \] \[ Q_{right}\left(\Theta\right) = Q \setminus Q_{left}\left( \Theta \right) \]

This is just saying that what ends up on the left side can’t be on the right side and that we use some threshold value, applied to some feature, to determine where we want to split our data into two subsets. Our function which determines what makes a good split is defined as:

\[ G\left(Q, \Theta\right) = \frac{n_{left}}{N_{m}}H\left(Q_{left}\left(\Theta\right)\right)+\frac{n_{right}}{N_{m}}H\left(Q_{right}\left(\Theta\right)\right) \]

where \( N_{m} \) is the number of available samples for the node \( m \). \( n_{left} \) and \( n_{right} \) are the number of samples for the left or the right split and \( H \) is the metric that we are using.

This function is nothing else then a weighted average over a metric that gets our candidate split datasets as input data. Since our metric function \( H \) measures the homogeneity of a dataset, the weighted average in \( G \) gives us way to determine if a split would provide a similar dataset on both sides, which is what we want. We call such a function a cost function and our ultimate goal is to minimize this expression by varying the threshold \( t_{m} \) and the feature parameter \( j \), so we want to find a \( \left(j,t_{m}\right) \) tuple that minimizes \( G\left(Q, \Theta\right) \):

\[ \tilde{\Theta} = argmin_{\Theta}\left(G\left(Q, \Theta\right)\right) \]

Next thing is the choice of metric. Commonly used metrics like the *“Gini impurity”* and *“Information gain”* are not a metric in a
mathematical definition, as they lack to satisfy the subadditivity or triangle inequality
condition. However, they all have a rigorous theoretical and empirical background which is beyond the scope of this article. The first one that we want to understand is the *Gini impurity*:

\[ H_{gini}\left(Q_{m}\right) = \sum_{i=1}^{J} p_{mi} \sum_{k\neq i} p_{mk} = \sum_{k=1}^{J} p_{mk} (1-p_{mk}) \]

\( J \) is the number of possible outcomes (2 in our case \( \rightarrow \)(*survived*, *not survived*)) and \( p_{mk} \) is the probability
for one particular outcome \( k \) in the current node \( m \). The first transformation is valid because we are working with probabilities whose
sum has to add up to 1. So: \( \sum_{i=1}^{J} p_{i}+\sum_{k=1}^{J} p_{k}=1 \Rightarrow \sum_{i=1}^{J} p_{i} = \sum_{k=1}^{J} (1-p_{k}) \). We can calculate theses probabilities straight from the number of occurrences within one subset:

\[ p_{mk} = \frac{1}{n_m} \sum_{x_{i}\in \hat{Q}} \mathbf{I}\left(y_{i}=k\right) \]

where \( n_m \) is the number of samples within the subset (this could be the number of *“female/male”* passengers, if we choose to split with the *Sex* feature). \( \mathbf{I} \) is called an indicator function that returns the number of samples that fulfill a certain condition. An example is: *The number of female passenger that survived*. In this case \( k = \) *survived* and \( \hat{Q} \) is the set over all female passengers, because \( m \) is a node that splits for the feature *Sex*.

The other metric that I want to highlight is called in the context of decision trees *information gain* but is just the
cross entropy. Cross entropy is a concept from information theory which was founded by Claude Shannon
in 1948. I can’t overstate the importance of Shannon’s work who derived an analytical expression about how much “information” is needed to
restore some signal in a dataset. This theorem is known as the Shannon-Nyquist-Theorem and is of utter importance in basically everything
that has to do with electronics, analytics, statistics or cryptography. The cross entropy is an expression that gives a measure about how
much information, measured in “bits” or “shannons”, is needed
to identify an event drawn from a set of events that has a true distribution \( p \) but was measured via a different distribution \( q \)
coming from the same set of events. Since we are interested in the homogeneity or “pureness” of a dataset we evaluate this metric with \( p=q \):

\[ H_{info}\left(Q_{m}\right) = - \sum_{k=1}^{J} p_{mk}\, \log p_{mk}. \]

From a mathematical standpoint we have everything we need. The procedure for calculating the tree looks like this:

- Get all the data \( Q \) that falls into the node \( m \).
- Minimize \( G\left(Q, \Theta\right) \)

- Calculate the probabilities for each outcome of each feature. \( p_{mk} \)
- Calculate the gini impurity/information gain for a given feature \( H_{gini|info} \)
- Find the arguments \( \tilde{\Theta} \) that minimizes the cost function \( G\left(Q, \Theta\right) \)
- Make the split, go to the next node and start again with step 1.

- How to code all this:

Decision trees are always coded top-down, meaning we start with the root node using all the data and then going down one step along a leaf and calculate the next node with the data that falls into this category. We exercise this with the Gini impurity on the root node. First, we have to define all necessary functions:

So now we start our procedure as described, please note that we are rastering all possible parameter for all outcomes, meaning we have 2 parameters for “Sex,” 3 parameter for “Pclass” and - in our case - 100 parameter for “Age.” You can change the “100” as you see fit. However, smaller numbers yield coarser results.

The best results for each feature are:

‘SexNumerical’ \( \rightarrow \) Cost: 0.346, Threshold: 0.50, Gini_Left: 0.369, Gini_Right: 0.333

‘Age’ \( \rightarrow \) Cost: 0.470, Threshold: 6.05, Gini_Left: 0.393, Gini_Right: 0.475

‘Pclass’ \( \rightarrow \) Cost: 0.572, Threshold: 2.00, Gini_Left: 0.490, Gini_Right: 0.439

So the best split for our root node is with the *SexNumerical* feature with a threshold of 0.5. For this tuple the cost is minimal, which comes as no surprise, this is precisely what the sklearn algorithm did and when we compare the Gini impurities of the *female* and the *male* node in the plotted tree above we see they are precise as we calculated them (0.369 on the *female* side and 0.333 on the *male* side). The next step would be now to go on the male or *female* side and repeat the calculation except that we now would not consider the *SexNumerical* feature and would only use data where *SexNumerical* is either 0 (*female*) or 1 (*male*).

- One downside:

The problem with this approach is that we have to sum over all possible parameter when calculating a node, meaning we need to compute every possible path to obtain the best tree for our data. Therefore, the computational cost of this algorithm is growing polynomially with the number of features we are analyzing. In this scenario, the complexity to build an optimal tree is called NP-hard and is not feasible for every-day scenarios, sometimes even not feasible at all. Therefore something called a greedy algorithm is used when computing the tree. A greedy algorithm chooses subsets randomly within the dataset and optimizes only the immediate neighborhood of this subset; this is called: “making the locally optimal choice.” It is, therefore “short-sighted” and may fail to produce the optimal results, which is visualized here:

The goal is to find the largest number. The greedy approach fails in this scenario to converge to the optimal solution. In the direct neighborhood of 7 is the subset (3, 12), by not summing over all possible paths the locally optimal choice is to go to the 12 and then to the 6. If we had explored all possible paths, we would have found the 99. Source. |

This locality is the reason why your tree may look different and may yield different accuracies, every time you run a professional implementation like the one from sklearn, even with the same parameters on the same data.

### Conclusion

This article covered a lot to learn, and there is even more, we can improve on many fronts. I haven’t talked about: The *bias-variance-tradeoff* which leads eventually to *overfitting* in decision trees, *pruning* of a tree via cross-validation and then *variance reduction* using *bootstrap aggregating* or *boosting* which leads to the natural evolution of decision trees called *random forest methods*. Those topics are part of a next article.

Let’s recap what we learned: We have learned that a decision tree can be computed in a supervised manner (with labels that is). We applied this technique to a dataset containing passenger information from the Titanic and were able to build a model that has \( \approx 81\% \) accuracy on the survival outcome of passengers it was not trained on. We are now able to answer the question: *“How high were my chances to survive on the Titanic?”*. Furthermore, we have looked at the math involved, developed an intuition for it and calculated all values for the root node in a step-by-step approach.

Decision trees are a fast and efficient approach for problems that are too large to be analyzed manually. They are easy to comprehend and intuitive to explain for someone outstanding and should be in the repertoire of every data-scientist. I hope this article helps you to understand the underlying concepts, the intuition behind it and the range of use-cases for where this technique might help you.