The importance of Bayesian statistics

Explained with predictive policing.

Why statistics?

We are living in the age of big data. Yet even though data is at the center of societal interest, most people know very little about how we make sense of vast amounts of information. In this article, I want to present and explain a simple statistical model (Bayes’ theorem) and motivate it with a real-world application: predictive policing.

Complex statistics are becoming more and more a part of our daily lives, although most of the time we are not aware of it. All the recommendations we get from Amazon, Google, Facebook, and so on shape what we consume and how we perceive our environment. All the hype about machine learning, neural networks, A.I., and what-not is, at its core, pure statistics. It’s always about probability distributions, metrics, f-divergences, likelihood functions, and regression problems. I will talk about all of that in subsequent articles, but for now, we stick to the basics. To better understand what is presented to us all the time, we need to understand what statistics are, and we should be able to interpret a sentence like: “The probability of something is such and such.”

The case

Let me start by motivating the topic of this article with reference to a science-fiction movie: Minority Report. In this movie, three children have the psychic ability to foresee crimes, which the police use to prevent these crimes before they happen. But of course, there is a problem. The children, called precogs, sometimes see different outcomes for potential crimes. In short, they are not 100% sure about the suspect. This uncertainty is, of course, unacceptable and - spoiler alert - the program is shut down in the end. Later in this article, I will make the case that even a tiny deviation from a 100% success rate - or true positive rate - has substantial consequences for the list of potential suspects.

With the rise of computational power and big data, police forces saw the need to get a “precog” system of their own, and so predictive policing was born. There are a lot of competing ideas on how statistics and machine learning can support police work, and I want to point out that I am not writing this article to discredit any of them. There has been a comprehensive discussion of the ethical and feasibility issues in the New York Times, ProPublica, and the MIT Technology Review, all of which are highly recommended. My goal here is to emphasize that we have to pay very close attention when it comes to statistical inference these days, because it is everywhere and sometimes counter-intuitive. I also highly recommend you take a look at [Algorithm Watch](https://algorithmwatch.org/en/), a Berlin-based non-profit that’s trying to shed some light on complicated algorithmic processes that would otherwise go unnoticed. But let us get back to predictive policing.

In February 2014, the Chicago Police Department (CPD) was sending police officers to the homes of potential suspects considered most likely to be involved in a crime, because these people appeared on a list called the “Heat List”. This list was not compiled manually but by a machine learning algorithm which took several factors into account. The CPD claimed this list contained the “400 most dangerous people in Chicago” but never fully disclosed how the algorithm predicts potential suspects. One official statement was that it is “based on empirical data compared with known associates of the identified person”, but the Conference on Civil and Human Rights concluded in their 2014 report that there “… is no public, comprehensive description of the algorithm’s input.”

The funding for the CPD’s predictive policing project came from the National Institute of Justice (NIJ), which in 2009 made millions of dollars available for “the application of analytical techniques - particularly quantitative techniques - to identify likely targets for police intervention and prevent crime or solve past crimes by making statistical predictions.” Back in 2014, during the first field test, one person on the list was Robert McDaniel, who had never committed a crime but had the wrong social connections; he was visited by the police, who offered “social services and a tailored warning.” McDaniel said in an interview with the Chicago Tribune: “I haven’t done nothing that the next kid growing up hadn’t done. Smoke weed. Shoot dice.”

Pressed on how reliable the predictions of the CPD algorithm were, Miles Wernick, the technical lead of the CPD predictive policing program, answered: “These are persons who the model has determined are those most likely to be involved in a shooting or homicide, with probabilities that are hundreds of times that of an ordinary citizen”. Or as Steven Caluris, the Deputy Chief of Crime Control Strategies of the CPD, put it:

“If you end up on that list, there’s a reason you’re there”.

The first quote, by Wernick, will be the basis for our calculations, which will show that the second quote, by Caluris, is a dangerous and mostly incorrect statement.

We need some math

First, we will make an educated guess about the probability that someone who is on this list was never involved in a violent crime. Since Wernick stated that the algorithm takes criminal history into account and implied that ordinary citizens would not appear on this list, we can relate criminal records directly to the prediction capabilities of the algorithm. If you have never committed a violent crime and one day the police are standing at your door, saying you might become an offender because statistics told them so, you have to agree that this is highly problematic. Second, we will make a guess about a far more serious scenario: the possibility that an actual future murderer is on the list. But for both we need to take a look at some math:

In statistics there is one very fundamental theorem. It’s called Bayes’ theorem, and it looks like this:

\[ P(A\mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}\] A and B are so-called events and:

  • \( P(A) \) and \( P(B) \) are the probabilities of observing one of the events without regard to the other. \( P(A) \) is called the prior and \( P(B) \) the marginal.
  • \( P(A\mid B) \) is the conditional probability of observing event A given that B is true. This is what we’re interested in and it is called the posterior.
  • \( P(B\mid A) \) is the conditional probability of observing event B given that A is true, it’s often called the likelihood.

We will go through every term and develop an interpretation of how to understand this equation. In short: Bayes’ theorem tells us how likely a hypothesis is, given our prior belief in it and a new piece of evidence.
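To make the notation concrete, here is Bayes’ theorem as a minimal Python sketch (the function and argument names are my own, purely for illustration):

```python
def bayes_posterior(likelihood: float, prior: float, marginal: float) -> float:
    """Return the posterior P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / marginal
```

So let’s start with our first question: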

1. How probable is it that someone on this list has never committed a violent crime?

So, A and B are events. That means they are sets of outcomes, and for our first question A and B have two possible outcomes each: \[ A = \{ \text{“criminal record”} , \text{“no criminal record”} \} \] \[ B = \{ \text{“on list”} , \text{“not on list”} \} \]

We are interested in: \[ P(\text{“no criminal record”} \mid \text{“on list”}) \] that is, the probability that someone who is on the list has no criminal record for violent crimes.

  • Finding the likelihood
    In this scenario we presume a binary criterion for the list: either you’re on it or you’re not. We are looking for the likelihood \( P(\text{“on list”} \mid \text{“no criminal record”}) \), and for that we use the quote from Miles Wernick, who promised that a suspect on this list has a probability of being involved in a violent crime “hundreds of times that of an ordinary citizen”. We make a conservative guess and assume that both the true positive rate and the true negative rate of the CPD’s algorithm are 99.8%. This means that a future violent criminal appears on the list with a probability of 99.8%, while an ordinary citizen appears on it with a probability of only 0.2%. Someone with a record is therefore \( 0.998 / 0.002 \approx 500 \) times more likely to appear on the list than an ordinary citizen, which should be sufficient to satisfy Miles Wernick’s statement.

  • Finding the prior
    Now this is a bit tricky. Sadly, there is no publication - or at least I was unable to find one - about the number of citizens with a violent crime record. Instead, we start with the Strategic Plan of the Bureau of Justice Statistics, which states that nearly 68 million American citizens have a criminal record, be it violent or not. This is actually a pretty insane number; for instance, it is higher than the entire U.S. population of 1900, and it is also on par with the number of Americans who have a Bachelor’s degree, see also here. Now we look at the percentage of all crimes that were violent crimes, which can be found in the F.B.I.’s yearly crime report. It states that arrests for violent crimes accounted for 4.7% of all crime-related arrests in 2015. This translates to roughly 3.2 million Americans, or about 1.3%, who most probably have a violent criminal record, and 98.7% who do not.

  • Finding the marginal
    To find the marginal \( P(B)\), that is \( P(\text{“on list”})\) and \( P(\text{“not on list”})\), we need a substitute. We don’t know the probability of appearing on the heat list without regard to a criminal record, but we know both likelihoods \( P(B\mid A) \) and \( P(B\mid \neg A) \), and we also know the priors for having and not having a criminal record (\( P(A) \) and \( P(\neg A) \)). We can therefore write: \[ P(B) = P(B\mid A) P(A) + P(B\mid \neg A) P(\neg A) \] Let’s think about this again and write it out for one outcome. For the case “on list”, this statement means that the probability of appearing on the list, without regard to a criminal record, is equal to the sum of the probability of appearing on the list with a record and the probability of appearing on the list without one. Since you can only have one of these two outcomes, all scenarios are covered. The code sketch below this list puts all three pieces together.
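Here is the whole setup as a small, self-contained Python sketch, using the estimates from above (the 99.8% rates and the variable names are our assumptions, not published CPD figures):

```python
# Priors: our estimate of who has a violent criminal record
p_record = 0.013     # P("criminal record")
p_no_record = 0.987  # P("no criminal record")

# Likelihoods: the assumed 99.8% true positive / true negative rates
p_list_given_record = 0.998     # P("on list" | "criminal record")
p_list_given_no_record = 0.002  # P("on list" | "no criminal record")

# Marginal via the law of total probability: P("on list")
p_on_list = (p_list_given_record * p_record
             + p_list_given_no_record * p_no_record)
print(p_on_list)  # ~0.0149
```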

Now it’s time to put everything together. Our final equation looks like this: \[ P(A\mid B) = \frac{P(B \mid A) \, P(A)}{P(B\mid A) P(A) + P(B\mid \neg A) P(\neg A)}\] And we’re ready to insert all the numbers: \[ P(\text{“no criminal record”} \mid \text{“on list”}) = \frac{0.002 \cdot 0.987}{0.002 \cdot 0.987+ 0.998 \cdot 0.013} = 0.132 \] Please bear in mind that these are all rough estimates based on conservative initial guesses, mainly aiming to provide intuition for statistical processes, but the bottom line here is:

While the algorithm identifies future criminals with a 99.8% true positive rate, the probability that a given person on the list was never arrested for a violent crime is 13.2%. This means that the CPD’s original list of the 400 most dangerous people contained, under these assumptions, approximately 53 people who had never even been arrested for a violent crime.
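Continuing the sketch (with the marginal repeated so it runs on its own), we can check both numbers:

```python
p_no_record = 0.987
p_on_list = 0.998 * 0.013 + 0.002 * 0.987  # marginal from above, ~0.0149

# Posterior: P("no criminal record" | "on list")
p_no_record_given_list = 0.002 * p_no_record / p_on_list
print(p_no_record_given_list)        # ~0.132
print(400 * p_no_record_given_list)  # ~53 people on a list of 400
```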

It is very important to remember that the CPD never explicitly said they would only consider citizens with a violent criminal record. These 53 people may therefore be on the list on purpose, which is controversial at the least. We will now look at a more serious crime and then give an intuitive explanation for these numbers.

2. How probable is it that an actual future murderer is on this list?

If someone in January 2015 decided to become a murderer, then by the end of 2015 we would have a record of that. We can therefore look at all murders committed in that year and estimate the chances that such a person would have appeared on the heat list in January 2015.

The number of murders throughout 2015 was 15,696, which corresponds to a murder rate of 4.9 per 100,000 inhabitants. We claim that every murder was committed by a different person, which is probably not correct, but it gives us an upper bound for our calculations. We also need to assume that the algorithm is in principle able to identify a future murderer with reasonable accuracy. This is not necessarily true either, but we presume it has the same true positive and true negative rates as in our previous calculation. Both were 99.8%, derived from the statement of Miles Wernick.
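As a quick back-of-the-envelope check of that prior (the population figure is my rough assumption for 2015):

```python
murders_2015 = 15_696
us_population = 321_000_000  # approximate 2015 U.S. population (assumption)

p_murderer = murders_2015 / us_population
print(p_murderer)  # ~0.000049, i.e. 4.9 per 100,000
```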

In statistics there is a funny name for a handy thing: the confusion matrix. It gives you an overview of the true/false positive/negative rates in a compact form. Here is ours:

|                            | Predicted: no future murderer | Predicted: future murderer |
|----------------------------|-------------------------------|----------------------------|
| Actual: no future murderer | 99.8%                         | 0.2%                       |
| Actual: future murderer    | 0.2%                          | 99.8%                      |


Let’s think about this again. We are not only saying that this algorithm is almost perfect at picking the future murderers out of a pool of people who are about to become murderers (true positives), we are also saying that it is almost perfect at recognizing an ordinary citizen without murderous intentions (true negatives).

So, given all these boundary conditions, let’s see what the chances are that a person on the list is an actual future murderer. The math hasn’t changed, so we plug our numbers into Bayes’ theorem:

\[ P(\text{“future murderer”} \mid \text{“on list”}) = \frac{0.998 \cdot 0.000049}{0.998 \cdot 0.000049 + 0.002 \cdot 0.999951} = 0.0239 \]

And there you have it:

The chance that a person on this list is an actual future murderer is only about 2.4%.
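Again as a self-contained sketch, under the same assumptions as above:

```python
p_murderer = 15_696 / 321_000_000  # prior, ~0.000049

posterior = (0.998 * p_murderer
             / (0.998 * p_murderer + 0.002 * (1 - p_murderer)))
print(posterior)  # ~0.024
```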

So it has questionable usefulness. Well, how can that be? If you paid attention, you noticed that our prediction probability for future murderers is roughly proportional to the murder rate. It is therefore interesting to make this calculation for all the years for which there is murder data, and indeed it proves true: the more murderers there are in general, the better our algorithm works, as can be seen here:

Figure: The probability of finding an actual future murderer (red, right y-axis) and the murder rate (blue, left y-axis). Data taken from here.
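We can reproduce this trend with a few hypothetical murder rates (the rates below are made up for illustration; the 99.8% rates are the same assumption as before):

```python
# Posterior P("future murderer" | "on list") for hypothetical murder rates
for rate_per_100k in [2, 5, 10, 25, 50]:
    p = rate_per_100k / 100_000
    post = 0.998 * p / (0.998 * p + 0.002 * (1 - p))
    print(f"{rate_per_100k:>3} per 100,000 -> {post:.1%}")
```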

To understand this we need to have a look at base rates.

The base rate fallacy

The base rate essentially answers the question: “…out of how many?” Here is an easy example: a certain test for the common cold is always 99% right, and out of 100,000 people, 1,000 have a cold. You would think that if you get a positive test result, you have a 99% chance of having a cold. But in truth, it is only about 50%, because the test correctly identifies 990 of the 1,000 people who have a cold, but it also turns out positive for 1% of the 99,000 healthy people. So of the 1,980 people with a positive test, only 990 actually have a cold.
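A quick sanity check of that arithmetic in Python:

```python
population, sick = 100_000, 1_000
accuracy = 0.99

true_positives = accuracy * sick                        # 990 sick people test positive
false_positives = (1 - accuracy) * (population - sick)  # 990 healthy people test positive

# Probability of actually having a cold, given a positive test
print(true_positives / (true_positives + false_positives))  # 0.5
```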

Base rates are very important in statistics but are often neglected. There is research (here, here and here) suggesting that most people even prefer non-Bayesian explanations. This is troublesome because it consistently leads to wrong conclusions and expectations.

Let us look at our police algorithm example in a simplified way. First, we assume the algorithm has a true positive rate and a true negative rate of 100%, and furthermore we state that 100 people out of a pool of 500 are classified as future criminals. We can plot this, where blue dots are people who go on to become future criminals and red dots are ordinary citizens:

Future Criminals First

We see that every future criminal is correctly identified, and the chance of a future criminal being on the list is 100%.
Now let us consider the case where the algorithm still never flags an ordinary citizen (the true negative rate remains 100%), but it misses a large share of the actual future criminals, i.e., the true positive rate has dropped. This would look like this:

Future Criminals Second

The algorithm now misclassifies some future criminals as ordinary citizens. Although every single person on our heat list is going to be a future criminal, there is now a considerable number of criminals who are not on the list. Even worse, there are twice as many criminals off the list as on it. This means: out of 500 people, 300 are future criminals, but only 100 are labeled as such. Therefore, if you are a future criminal, the chance that you are on the list is only about 33%. And this even though our algorithm is never wrong when it labels someone a future criminal. So we see, base rates are a significant factor when drawing conclusions from statistics.
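The whole toy example boils down to one line of arithmetic:

```python
criminals, on_list = 300, 100  # out of 500 people; no false positives assumed
print(on_list / criminals)     # ~0.33: chance a future criminal is on the list
```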

“Bayesian thinking” is also crucial for everyday scenarios, like the real chance that you are drunk when a breathalyzer test comes back positive, or how likely it is that you have cancer after a positive screening result. Wikipedia has an article about the base rate fallacy with further information, and I recommend reading it.

Bottom Line

In reality, you never have a true positive rate of 100%, and since we are talking about statistics at the scale of the American population, even a tiny deviation from 100% directly affects millions of citizens. So the actual predictions (the true positives) are shaky to begin with, and on top of that you always have base rates that need to be taken into account; otherwise you end up drawing the wrong conclusions, just as Steven Caluris did when he stated that there would be a reason for being on the list. Very often there isn’t one, just statistics.

Please bear in mind that this article is not an accurate assessment of the usefulness, or morality, of predictive policing, but merely an introduction to Bayesian thinking. I hope it helps you the next time you need to draw conclusions of your own.