The Bahamas Leaks, Offshore Leaks, Panama Papers and Paradise Papers constitute some of the most significant leaks of confidential information in recent history. Although most of these documents do not describe illegal activity, they reveal the dubious financial dealings of people trying to avoid proper taxation in their home countries. The International Consortium of Investigative Journalists (ICIJ) runs a database from which anyone can download these documents for testing and research. We'll do exactly that and show how to work with this dataset. The ICIJ database contains information on more than 785,000 offshore entities, covers nearly 80 years up to 2016, and involves people and companies from more than 200 countries.
The data in these datasets is already nicely prepared, sorted and ready to go. This article focuses on the data science aspects of working through the dataset with Python; for articles about the leaks themselves, I refer the reader to the investigations page of the ICIJ, which gathers the reports on every data leak.
- A list of current German and American politicians (we scrape Wikipedia for that)
- Whoever you like (just type a name)
- Make some closing remarks
1. Download the data
To obtain the data, we go to this page and download compressed archives of the Bahamas Leaks, Offshore Leaks, Panama Papers and Paradise Papers. Extract them all into the same folder. On Linux you can do this, for example:
2. Take a first look at the data
Every leak database contains five or six files, but the labeling and data structure follow the same pattern. There are *edges* files that mainly contain an ID, a relation type and a second ID, linking the nodes in the dataset. The *address* files hold an ID and the address associated with it, and the same goes for the *entity*, *intermediary*, *officer* and *other* files: all of them map an ID to the attributes belonging to it. So, for example, if I were the officer of a company, I could look up my ID in the *officer* file using my name. Then I'd go to the *edges* file and look for all entries related to that ID. There would probably be something like:
With julians_company_id, I could go to the *entity* file and look up the name of the company. I could also search the *edges* file for a line like:
And then, with julians_company_address_id, I could go to the *address* file and look up the address of my company. I think you get the picture…
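As a runnable illustration of this lookup chain, here is a sketch on tiny made-up stand-ins for the node and edges files; all IDs, names and relation types are invented, but the columns mirror the pattern described above:

```python
import pandas as pd

# Tiny stand-ins for the real officer/entity/address/edges files.
officers = pd.DataFrame({"node_id": [10], "name": ["julian"]})
entities = pd.DataFrame({"node_id": [20], "name": ["julians company ltd."]})
addresses = pd.DataFrame({"node_id": [30], "address": ["1 example street"]})
edges = pd.DataFrame({
    "START_ID": [10, 20],
    "TYPE": ["officer_of", "registered_address"],
    "END_ID": [20, 30],
})

# Officer name -> officer ID
my_id = officers.loc[officers["name"] == "julian", "node_id"].iloc[0]
# Officer ID -> company ID via the edges file
company_id = edges.loc[
    (edges["START_ID"] == my_id) & (edges["TYPE"] == "officer_of"), "END_ID"
].iloc[0]
# Company ID -> address ID -> address, again via the edges file
address_id = edges.loc[
    (edges["START_ID"] == company_id) & (edges["TYPE"] == "registered_address"), "END_ID"
].iloc[0]

print(entities.loc[entities["node_id"] == company_id, "name"].iloc[0])
print(addresses.loc[addresses["node_id"] == address_id, "address"].iloc[0])
```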
One thing I noticed right away is that the file headers, which contain the column labels, differ between the Bahamas Leaks and all the other leak files. You can check this for the *edges* files yourself in your favorite bash/zsh terminal:
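For example, run this from the folder you extracted into; the glob pattern is an assumption about how the extracted files are named:

```shell
# Print the header line of every edges file.
for f in *edges*.csv; do
    if [ -f "$f" ]; then echo "== $f"; head -1 "$f"; fi
done
```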
The START_ID, TYPE and END_ID labels are called node_1, rel_type and node_2 in the Bahamas Leaks file. So we have to change them; otherwise we end up missing these IDs when reading in all the files. You can do this either manually or with the absurdly powerful sed command-line tool:
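A sketch of that call; the file name and the exact quoting of the header fields are assumptions about the extracted CSV, so check them against your copy first:

```shell
# Rewrite the header of the Bahamas edges file in place, keeping a .bak
# backup of the original; only line 1 is touched.
f="bahamas_leaks.edges.csv"
if [ -f "$f" ]; then
    sed -i.bak "1s/\"node_1\"/\"START_ID\"/; 1s/\"rel_type\"/\"TYPE\"/; 1s/\"node_2\"/\"END_ID\"/" "$f"
fi
```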
This line tells sed to go into the Bahamas Leaks edges file and change the labels so that they agree with the other *edges* files. The command looks a bit cumbersome because we have to shell-escape the quotation marks around the labels, but it gets the job done.
3. Print and plot some key statistics
We start by importing some packages. We use os and glob for parsing file names and joining path strings, pandas for importing the data and providing the powerful dataframe class, matplotlib for plotting, wikipedia for scraping Wikipedia, and seaborn, tqdm and pprint for styling the output of the Jupyter cells.
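The imports might look like this; seaborn, wikipedia and tqdm are only needed in later sections, so they are noted here as comments:

```python
# Also used later in the article (install via pip if missing):
#   import seaborn as sns
#   import wikipedia
#   from tqdm import tqdm
import os
from glob import glob
from pprint import pprint

import pandas as pd
import matplotlib.pyplot as plt
```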
Next, we define some functions for extracting key characteristics from the data. The unique_vals function provides a fast way of getting the unique entries out of a list of strings. The compare function casts two lists to sets, computes their intersection and returns it as a list. Finally, there are two lookup functions, one for the offshore data and one for the People with Significant Control dataset. We'll come back to these two functions later.
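A minimal sketch of the two set-based helpers; the two lookup functions are dataset-specific and omitted here:

```python
def unique_vals(values):
    """Deduplicate a list of strings quickly by casting it to a set."""
    return list(set(values))


def compare(list_a, list_b):
    """Return the intersection of two lists, computed via sets."""
    return list(set(list_a) & set(list_b))
```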
Now it's time to load the data into a pandas dataframe. We use pandas' concat function in combination with a list comprehension to concatenate all files into one dataframe. Next, we use the string method lower to transform all names in the dataframe to lowercase. This way we don't have to query every uppercase/lowercase combination of a name.
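A sketch of the loading step; the glob pattern is an assumption about how the extracted CSVs are named, and the guard makes it safe to run before downloading:

```python
from glob import glob

import pandas as pd

# Concatenate every node file into one dataframe.
node_files = glob("*nodes*.csv")
if node_files:
    offshore_df = pd.concat(
        [pd.read_csv(f, low_memory=False) for f in node_files],
        ignore_index=True, sort=False,
    )
    # Lowercase every name so queries need not try case variants.
    offshore_df["name"] = offshore_df["name"].str.lower()
```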
Then we print all the keys of the dataframe. These correspond to the column labels of the CSV files.
From this list, we handpick the most interesting categories. We are interested in the company name, company type, country and jurisdiction (“name”, “company_type”, “countries”, “jurisdiction_description”). We also want to know which company provided the service to go offshore (“service_provider”) and which leak database the data comes from (“sourceID”):
Having chosen that, we print out basic statistics for all these categories:
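For instance, via pandas' value_counts, here on a tiny stand-in dataframe with two of the categories; the real counts of course look different:

```python
import pandas as pd

# Tiny stand-in for the concatenated leaks dataframe.
offshore_df = pd.DataFrame({
    "countries": ["Malta", "Hong Kong", "Hong Kong", None],
    "jurisdiction_description": ["Bahamas", "Bahamas", "Panama", "Bahamas"],
})

categories = ["countries", "jurisdiction_description"]
for cat in categories:
    # value_counts() drops NaNs by default (the cleaning step) and
    # sorts by the number of occurrences.
    print(offshore_df[cat].value_counts().head(10), "\n")
```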
This yields tables in which we count the occurrences of every entry after cleaning out NaNs. In the countries category, for example, we can clearly see that most entries come from Malta and Hong Kong, while most jurisdictions lie in the Bahamas and most companies are of the type Standard International Company.
Some people prefer graphics to tables, so let’s plot the same statistics using nice histograms:
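One way to sketch such a plot, again on stand-in counts; the figure size and file name are arbitrary choices:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs headless
import matplotlib.pyplot as plt

# Stand-in data; in the article the counts come from the leaks dataframe.
counts = pd.Series({"Malta": 3, "Hong Kong": 2, "Bahamas": 1}, name="countries")

ax = counts.plot.barh(figsize=(8, 4))
ax.set_xlabel("number of entries")
ax.invert_yaxis()  # largest bar on top
plt.tight_layout()
plt.savefig("countries_hist.png")
```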
For convenience, I put all the images into an image slider; they would otherwise clutter this article.
This is nice and all, but we want to look for connections to the real world, to real people. So the first thing we do is download an index of names that the UK government deems People with Significant Control.
4. Look for names within the data
4.1 People with Significant Control
The People with Significant Control (PSC) index is a list curated by the UK government's Companies House. Everyone on this list owns more than 25% of the shares or voting rights in a UK company, has the right to appoint or remove the majority of the board of directors, or can otherwise exercise significant control over a company. You can download the database as one large JSON file whose name contains the publication date. I'm using the version from February 2, but this should work for subsequent releases as well. For a full description of all the fields available in this file, please visit this site.
Reading in this dataset requires a lot of RAM. Because of the way Python handles memory allocation, you temporarily need more than 40 GB; once everything is read in, you still need 26 GB… so you have been warned. Pandas' read_json function can do the heavy lifting for us here:
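A sketch of the call; the file name is an assumption (the real snapshot carries its publication date in the name), and the guard avoids crashing before you have the multi-GB download:

```python
import os

import pandas as pd

# Hypothetical file name -- replace with your PSC snapshot.
psc_path = "psc_snapshot.json"
if os.path.exists(psc_path):
    # The snapshot is newline-delimited JSON, one record per line.
    psc_df = pd.read_json(psc_path, lines=True)
```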
This results in a dataframe with two keys: company_number and data. In data, we have all the JSON fields for each specific company.
The first step is to extract all unique names from the PSC list. Given the size of this dataset, we need to think about performance. We could loop through every item in the PSC list and only add a name to our unique-name list if it is not already there, but that would invoke a string search over the whole list on every iteration, which would be very expensive. Instead, we append all names to one list and afterwards feed it to our unique_vals function, which casts the list to a set and returns it as a list again. We build two lists: one with the full names and one with only the last names:
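The idea can be sketched like this on a few stand-in records; the name and name_elements fields follow the published PSC schema, but the entries are invented:

```python
def unique_vals(values):
    """Deduplicate once at the end via a set (as defined earlier)."""
    return list(set(values))


# Stand-ins for the "data" field of the PSC records.
records = [
    {"name": "Jane Doe", "name_elements": {"surname": "Doe"}},
    {"name": "Jane Doe", "name_elements": {"surname": "Doe"}},
    {"name": "John Smith", "name_elements": {"surname": "Smith"}},
]

full_names, last_names = [], []
for rec in records:  # append everything first ...
    if "name" in rec:
        full_names.append(rec["name"].lower())
    if "name_elements" in rec:
        last_names.append(rec["name_elements"]["surname"].lower())

# ... and deduplicate once, instead of searching the list per iteration.
psc_names = unique_vals(full_names)
psc_surnames = unique_vals(last_names)
```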
We do the same thing with our offshore dataset:
Next, we compare both lists; this is done inside the predefined compare function, utilizing the beautiful set data type and its intersection method. Then we print everything out:
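On stand-in name lists, the comparison looks like this; in the article the lists come from the PSC snapshot and the offshore dataframe:

```python
def compare(list_a, list_b):
    """Intersection of two lists via sets (as defined earlier)."""
    return list(set(list_a) & set(list_b))


psc_names = ["jane doe", "john smith"]
offshore_names = ["john smith", "acme ltd"]

matches = compare(psc_names, offshore_names)
print(len(matches), "matching names:", matches)
```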
This gives us:
So 15,714 names match, and going through all these entries would be a very time-consuming task. Instead, we can pick a single name and print the results using our predefined name_lookup_psc and name_lookup_offshore functions:
These functions give us a nicely formatted summary of all entries in both datasets that belong to this name:
Feel free to look through all 15,000 entries.
At this point, it is essential to note that just by looking at names, we have no proof that a match across both datasets really refers to the same person. Please keep that in mind. Any follow-up research on these people requires various other sources to establish a connection to the offshore leaks.
We can, of course, also have a look at politicians. First, let us look at German ones, because I am from Berlin and naturally interested in corruption within the German political landscape. The problem is that there is no official list of all German politicians. The best I could do is scrape Wikipedia, which works reasonably well. For that, we use the Python module wikipedia and scrape the content of the List of German politicians page. This gives us links to lists of politicians from every party. Since some parties no longer exist, like the Nazi party or the Centre Party of the Weimar Republic, we handpick the lists that we want to keep; this is done with the idx_to_keep variable. That done, we scrape the content of the party-member lists in the same way and end up with a fairly large list of almost all active and former German politicians.
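A sketch of that scraping step; it assumes the wikipedia package and network access, german_politician_names is a hypothetical helper, and idx_to_keep holds the handpicked list indices:

```python
# Assumes the `wikipedia` package (pip install wikipedia); the import is
# guarded so the sketch loads even without it installed.
try:
    import wikipedia
except ImportError:
    wikipedia = None


def german_politician_names(idx_to_keep):
    """Hypothetical helper: collect names from the party-member list
    pages linked on the 'List of German politicians' overview page."""
    overview = wikipedia.page("List of German politicians")
    party_pages = [overview.links[i] for i in idx_to_keep]
    names = []
    for title in party_pages:
        # Each party list page links to the individual politicians.
        names.extend(link.lower() for link in wikipedia.page(title).links)
    return list(set(names))
```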
This will print out the following:
I shortened the output here in the blog for brevity. This guy, Josef Bauer, is an excellent example of why a name is a terrible identifier: it is not even close to unique. One Josef Bauer was a politician from Wasserburg am Inn in the Christian Social Union of Bavaria who died in 1989. The Josef Bauer from the offshore leaks database lived in Breitenfurt, which is close to Vienna, and was/is an officer of a company, sky gourmet malta ltd, that was founded in the 90s; you can read that on their odd website.
Hmm, this also does not seem to be the same person. The US politician is originally from Atlanta and now sits in the Georgia State Senate, while the David Scott from the offshore leaks is connected to Japan and the UK. So we see: reliably finding people within the offshore leaks is not easy, and this article is a mere introduction to how your research could start. Go and have fun with it!
One last thing: you want to know whether your own name is on the list? Sure, go ahead:
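For example with a substring search over the name column, shown here on a stand-in for the concatenated, lowercased leaks dataframe:

```python
import pandas as pd

# Stand-in dataframe; use the real offshore_df from earlier instead.
offshore_df = pd.DataFrame({"name": ["acme ltd", "jane doe", "john smith"]})

my_name = "jane doe"  # put your own name here, lowercased
hits = offshore_df[offshore_df["name"].str.contains(my_name, na=False)]
print(hits)
```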
A blog about people learning about machines learning about people.
I write in this blog about machine learning, data science, programming and sometimes their political implications. Currently, I am a research assistant at the Max-Born-Institute Berlin, working on my Ph.D. in physics and machine learning.