Taking a closer look at all the Offshore Leaks

Find content within the Bahamas Leaks, Offshore Leaks, Panama Papers and Paradise Papers

The Bahamas Leaks, Offshore Leaks, Panama Papers and Paradise Papers constitute the most significant leak of confidential information in recent history. Although most of these documents do not describe illegal activity, they reveal the dubious financial dealings of people who try to avoid proper taxation in their home countries. The International Consortium of Investigative Journalists (ICIJ) runs a database where everyone can download these documents for testing and research. We’ll do exactly that and show how to work with this dataset. The ICIJ database contains information on more than 785,000 offshore entities, covers nearly 80 years up to 2016, and includes people and companies from more than 200 countries.

Data from these datasets comes already nicely prepared, sorted and good to go. This article focuses heavily on the data science aspect of working through this dataset using Python. For articles about the leaks themselves, I refer the reader to the investigations page of the ICIJ, where all reports about every data leak are gathered.

All of the code is available as a Jupyter notebook here.

So, let us start by outlining what we will do in this article:

  1. Download the data from the ICIJ database website
  2. Look through it and get a feeling for the data
  3. Print and plot some key statistics
  4. Look for names within the data. For this, we will download some additional lists of interest:
    1. The register of People with Significant Control
    2. A list of current German and American politicians (we scrape Wikipedia for that)
    3. Whoever you like (just type a name)
  5. Make some closing remarks

1. Download the data

To obtain the data, we go to this page and download the compressed archives of the Bahamas Leaks, Offshore Leaks, Panama Papers and Paradise Papers. Extract them all into the same folder. If you’re on Linux, you can do this, for example:

> mkdir offshore_leaks
> unzip "csv_*.zip" -d "offshore_leaks"

2. Take a first look at the data

Every leak database contains 5 or 6 files, but the labeling and data structure follow the same pattern. There are *edges* files that mainly contain an ID, a relation type and a second ID, describing the connections between all nodes in the dataset. The *address* files hold an ID and the address associated with it; the same goes for the *entity*, *intermediary*, *officer* and *other* files, all of which hold an ID and an associated attribute for that ID. So, for example, if I were an officer of a company, I could look up my ID in the *officer* file using my name. Then I’d go to the *edges* file and look for all entries related to that ID. There would probably be something like:

"julians_id", "officer_of", "julians_company_id"

With julians_company_id I could go to the *entity* file and look up the name of the company. I could also look in the *edges* file for a line like:

"julians_company_id", "registered_address", "julians_company_address_id"

And then with julians_company_address_id, go to the *address* file and look up the address for my company. I think you get the picture…
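To make the lookup chain concrete, here is a small sketch in pandas. The names and IDs are purely hypothetical, and the exact file names may differ slightly depending on the release you downloaded:

import pandas as pd

officers = pd.read_csv("offshore_leaks/paradise_papers.nodes.officer.csv")
entities = pd.read_csv("offshore_leaks/paradise_papers.nodes.entity.csv")
edges = pd.read_csv("offshore_leaks/paradise_papers.edges.csv")

# 1. find my node id in the officer file via my name
my_id = officers.node_id[officers.name == "Julian Zimmermann"].iloc[0]
# 2. follow all edges starting at that id and keep the "officer_of" relations
company_ids = edges.loc[
    (edges["START_ID"] == my_id) & (edges["TYPE"] == "officer_of"), "END_ID"]
# 3. resolve the target ids to company names in the entity file
print(entities.name[entities.node_id.isin(company_ids)])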

One thing I noticed right away is that the file headers containing the column labels differ between the Bahamas Leaks and all the other leak files. You can see this for the *edges* files yourself by running this in your favorite bash/zsh terminal:

> head -c 30 offshore_leaks/*edges*.csv
==> offshore_leaks/bahamas_leaks.edges.csv <==
"node_1","rel_type","node_2","
==> offshore_leaks/offshore_leaks.edges.csv <==
"START_ID","TYPE","END_ID","li
==> offshore_leaks/panama_papers.edges.csv <==
"START_ID","TYPE","END_ID","li
==> offshore_leaks/paradise_papers.edges.csv <==
"START_ID","TYPE","END_ID","li

The START_ID, TYPE and END_ID labels are called node_1, rel_type and node_2 in the Bahamas Leaks file. So we have to change that; otherwise, we would miss these IDs when reading in all the files. You can do this either manually or with the absurdly powerful sed command line tool:

> sed -i -e \
'1s/\"node_1\",\"rel_type\",\"node_2\"/\"START_ID\",\"TYPE\",\"END_ID\"/g' \
./offshore_leaks/bahamas_leaks.edges.csv

This line tells sed to go into the Bahamas Leaks edges file and change the labels so that they agree with the other *edges* files. The command looks a bit cumbersome because all quotation marks inside the pattern are escaped (strictly speaking, this is not necessary inside single quotes), but it gets the job done.
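If you prefer not to touch the raw files at all, an alternative (just a sketch, not what the rest of this article does) is to rename the columns in pandas right after reading the Bahamas file:

import pandas as pd

bahamas_edges = pd.read_csv("offshore_leaks/bahamas_leaks.edges.csv", low_memory=False)
bahamas_edges = bahamas_edges.rename(
    columns={"node_1": "START_ID", "rel_type": "TYPE", "node_2": "END_ID"})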

3. Printing and plotting some key statistics

We start by importing some packages. We use os and glob for parsing file names and joining path strings, pandas for importing the data and providing the potent dataframe class, matplotlib for plotting, wikipedia for scraping Wikipedia, seaborn for nicer plot styling, and tqdm and pprint for progress bars and prettier output in the Jupyter cells.

import os
import glob
import pandas as pd
import matplotlib.pyplot as plt
import wikipedia
from tqdm import tqdm
from pprint import pprint
import seaborn as sns
pd.set_option("display.max_columns", 24)
sns.set_style("whitegrid")
sns.set_context("paper")
sns.despine(offset=10, trim=True)

Next, we define some functions for extracting key characteristics from the data. The unique_vals function provides a fast way of getting the unique entries out of a list of strings while preserving their order. The compare function casts two lists to sets, calculates their intersection and returns it as a list. Finally, there are two lookup functions, one for the offshore data and one for the People with Significant Control dataset. We’ll address these two functions later.

"""
Functions that we need later
"""
def unique_vals(seq):
    """Find unique entries in a sequence and preserve the order."""
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]


def compare(A, B):
    """Compare two lists of strings in a fast manner."""
    return list(set(A).intersection(B))


# define a function that prints all connection to a single name
def name_lookup_offshore(name, df):
    """
    Look up names within the offshore data.

    Input:
        name: (str) the name to look out for
        df: (pandas dataframe) the dataframe that holds the offshore data
    """
    vals = pd.unique(df.node_id[df.name == str(name).lower()].values)
    print("#######################################")
    print("######### offshore database lookup:")
    print("#######################################")
    for ident_id in vals:
        old_id = 0
        for item in df[df.START_ID == ident_id].itertuples():
            try:
                relation_title = item.TYPE
                id_ = int(item.END_ID)
                if id_ == old_id:
                    raise IndexError()
                else:
                    old_id = int(item.END_ID)
            except IndexError:
                id_ = False

            if relation_title == "registered_address" and id_:
                # we're dealing with a private address here, so END_ID points
                # to the address file where the address is saved in the
                # address field.
                try:
                    private_address = df.address.where(
                        df.node_id == item.END_ID).dropna().values[0]
                except IndexError:
                    private_address = "No address in the system"

                print("{}; {}:  {}, with id:{}".format(name, relation_title,
                                                       private_address, id_))

                print("Find the interactive graph for this node at: "
                      "https://offshoreleaks.icij.org/nodes/{}".format(id_))
                print("------------------------")

            elif relation_title == "officer_of" and id_:
                try:
                    cmp_name = df.name.where(
                        df.node_id == id_).dropna().values[0]
                except IndexError:
                    cmp_name = "No name in the system"
                try:
                    cmp_jur = df.jurisdiction_description.where(
                        df.node_id == id_).dropna().values[0]
                except IndexError:
                    cmp_jur = "No jurisdiction_description in the system"
                try:
                    comp_id = df.END_ID.where(
                        df.START_ID == id_).dropna().values[0]
                    cmp_address = df.address.where(
                        df.node_id == comp_id).dropna().values[0]
                except IndexError:
                    cmp_address = "No address in the system"

                print(
                    "{}; {}:  {}, registered at: {}, jurisdiction: {}".format(
                        name, relation_title, cmp_name, cmp_address, cmp_jur) +
                    ", with unique leak id:{}".format(id_))
                print("Find the interactive graph for this node at: "
                      "https://offshoreleaks.icij.org/nodes/{}".format(id_))
                print("------------------------")


def name_lookup_psc(name, psc_names, psc_idxs, psc_df):
    """
    Look up names within the psc data.

    Input:
        name: (str) the name to look out for
        psc_names: (str or list of str) the full list
            of all names of the psc data
        psc_idxs: (number or list of numbers) the full list
            of all indexes of the psc data
        psc_df: (pandas dataframe) the dataframe that
            holds the psc data
    """
    if psc_df is not None:
        print("#######################################")
        print("######### psc database lookup:")
        print("#######################################")
        indexes = [i for i, x in enumerate(psc_names) if x == name]
        if len(indexes) > 1:
            print("Found multiple companies")
        for idx in indexes:
            cmp_number = psc_df.company_number[psc_idxs[idx]]
            print("------------------------")
            print("company number: {}".format(cmp_number))
            print("Find more information about this company at:")
            try:
                print(
                    "https://beta.companieshouse.gov.uk/company/{:08d}".format(
                        int(cmp_number)))
            except ValueError:
                print("https://beta.companieshouse.gov.uk/company/{}".format(
                    cmp_number))
            print("------------------------")
            dict_ = psc_df.data[psc_idxs[idx]]
            pprint(dict_)

Now it’s time to load the data into a pandas dataframe. We use pandas’ concat function in combination with a list comprehension to concatenate all files into one dataframe. Next, we use the string method lower to transform all names within the dataframe to lowercase. This way, we do not have to search for every uppercase/lowercase combination of a name when querying the data.

# load the data
data_folder = "/path/to/offshore_leak_data"
files = glob.glob(os.path.join(data_folder, "**", "*.csv"), recursive=True)
data_frame = pd.concat(
    [pd.read_csv(f, low_memory=False) for f in files], sort=True)
data_frame.name = data_frame.name.str.lower()  # convert all strings in the name column to lowercase

Then, we print all the keys of the dataframe. These correspond to the column labels of the CSV files.

# lets see what columns are available
print(data_frame.keys())
Index(['END_ID', 'START_ID', 'TYPE', 'address', 'closed_date', 'company_type',
       'countries', 'country_codes', 'end_date', 'ibcRUC', 'inactivation_date',
       'incorporation_date', 'jurisdiction', 'jurisdiction_description',
       'labels(n)', 'link', 'name', 'node_id', 'note', 'service_provider',
       'sourceID', 'start_date', 'status', 'struck_off_date', 'type',
       'valid_until'],
      dtype='object')

From this list, we handpick the most interesting categories. We are interested in the company name, company type, its country and jurisdiction (“name”, “company_type”, “countries”, “jurisdiction_description”). We also want to know which company provided the service to go offshore (“service_provider”) and which leak database the data comes from (“sourceID”):

# we'll pick the most interesting categories
cats = ("company_type", "countries", "jurisdiction_description", "name",
        "service_provider", "sourceID")

Having chosen that, we print out basic statistics for each of these categories:

# print the top-3 histogram entries for every column
for key in cats:
    try:
        __data = data_frame.dropna(how="any", subset=[key])
        print("#######################################")
        print("##### {} #######".format(key))
        print("#######################################")
        print(__data[key].value_counts()[:3])
    except (ValueError, KeyError):
        pass

This yields tables in which we count the occurrences of every entry after cleaning out NaNs. For example, in the countries category we can clearly see that most entries come from Malta and Hong Kong, that most jurisdictions lie in the Bahamas, and that the most common company type is the Standard International Company.

#######################################
##### company_type #######
#######################################
Standard International Company                 40072
Standard Company under IBC Act                 26214
Business Company Limited by Shares             25130
Name: company_type, dtype: int64
#######################################
##### countries #######
#######################################
Malta                     129661
Hong Kong                 100243
China                      72621
Name: countries, dtype: int64
#######################################
##### jurisdiction_description #######
#######################################
Bahamas                   209686
British Virgin Islands    153872
Malta                      83934
Name: jurisdiction_description, dtype: int64
#######################################
##### name #######
#######################################
the bearer                         71850
el portador                         9351
bearer A                            2667
Name: name, dtype: int64
#######################################
##### service_provider #######
#######################################
Mossack Fonseca               213634
Portcullis Trustnet            61123
Commonwealth Trust Limited     44393
Name: service_provider, dtype: int64
#######################################
##### sourceID #######
#######################################
Panama Papers                                    1233702
Paradise Papers - Malta corporate registry       1090703
Offshore Leaks                                    841225
Name: sourceID, dtype: int64

Some people prefer graphics to tables, so let’s plot the same statistics using nice histograms:

# do the same but with real histogram plots
plt.close()
plt.rcParams.update({
    "figure.max_open_warning": 0
})  # we are creating a lot of plots
for nr, key in enumerate(cats):
    if "node" not in key:
        fig, ax = plt.subplots(1, 1, figsize=(16, 8))
        __data = (data_frame.dropna(how="any", subset=[key]))
        try:
            __data = __data[key].value_counts()[:10]
            print("Plotting {}".format(key))
            sns.barplot(
                x=__data.index, y=__data.values, ax=ax, palette="GnBu_d")
            ax.title.set_text(key)
            plt.xticks(rotation=45)
            fig.autofmt_xdate()
            plt.tight_layout()
            plt.savefig(str(nr) + "_" + key + ".png")
        except (ValueError, KeyError):
            pass
print("Printing to screen")
plt.show()

For convenience, I prepared an image slider for all the images; otherwise they would clutter this article.

(Image slider: Company types, Countries, Jurisdictions, Names, Service providers, Source IDs)

This is nice and all, but we want to look for connections to the real world, to real people. So the first thing we do is download an index of names that the UK government deems People with Significant Control.

4. Look for names within the data

4.1 People with Significant Control

The People with Significant Control (PSC) index is a list curated by the UK government’s Companies House. Everyone on this list owns more than 25% of the shares or voting rights in a UK company, has the right to appoint or remove the majority of the board of directors, or can otherwise exercise significant control over a company. You can download the database as a large JSON file that has the date of publication in its name. I’m using the version from February 2, but this should work for all subsequent releases as well. For a full description of all available information in this file, please visit this site.

Reading in this dataset requires a lot of RAM. Due to the way Python handles memory allocation, you temporarily need more than 40 GB of RAM, and once everything is read in, you still need about 26 GB… so you have been warned. Pandas’ read_json function can do the heavy lifting for us here:

# read in the psc data
data_folder = "/path/to/psc/data"
psc_folder = "People_with_significant_control_UK"
psc_file = glob.glob(os.path.join(data_folder, psc_folder, "*.json"))
psc_ = pd.read_json(psc_file[0], lines=True)
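If your machine cannot hold the whole file in memory, one possible workaround (a sketch only; the rest of this article works on the full dataframe) is to read the JSON lines file in chunks and keep just the fields we actually use later, the company number and the name inside data:

# read the PSC file chunk by chunk and keep only the columns we need
reader = pd.read_json(psc_file[0], lines=True, chunksize=100000)
slim_chunks = []
for chunk in reader:
    names = chunk.data.apply(lambda d: d.get("name"))
    slim_chunks.append(
        pd.DataFrame({"company_number": chunk.company_number, "name": names}))
psc_slim = pd.concat(slim_chunks, ignore_index=True)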

This results in a dataframe with two keys: company_number and data. In data, we have all the JSON fields related to each specific company. The first step is to get all unique names from the PSC list. Due to the size of this dataset, we need to think about performance. We could loop through every item in the PSC list and only add a name to our unique name list if it is not already on it, but that would invoke a string search over the whole list on every iteration, which would be very expensive. Instead, we append all names to a list and afterwards feed this list to our unique_vals function, which uses a set to filter out duplicates while preserving the order. We build two lists: one with the full names and one with only the last names:

# get all the unique names from the psc data
name_list_full = []
idx_list = []
last_name_list = []
for idx in tqdm(range(len(psc_.data))):
    try:
        cur_name = psc_.data[idx]["name"]
        # drop the title (e.g. "Mr ") by cutting off everything up to the first space
        name_list_full.append(str.lower(cur_name[cur_name.find(" ") + 1:]))
        idx_list.append(idx)
        # keep only the part after the last space as the last name
        last_name_list.append(str.lower(cur_name[-cur_name[::-1].find(" "):]))
    except KeyError:
        pass

name_list = unique_vals(
    name_list_full
)  # this is faster than checking for existence inside the for loop
last_name_list = unique_vals(last_name_list)

We do the same thing with our offshore dataset:

# do the same for the offshore data
offshore_name_list = []
offshore_last_name_list = []
for offshore_name in tqdm(data_frame.name.str.lower().unique()):
    try:
        cur_name = str(offshore_name)
        offshore_name_list.append(str(cur_name))
        offshore_last_name_list.append(
            str(cur_name[-cur_name[::-1].find(" "):]))
    except KeyError:
        pass

Next, we compare both lists, which is done using the beautiful set data type and its intersection method inside the predefined compare function. Then we print out everything:

mutual_last_names = compare(last_name_list, offshore_last_name_list)
mutual_full_names = compare(name_list, offshore_name_list)
print("For the full name we found {} entries in the offshore leaks".format(
    len(mutual_full_names)))
print(
    "For only the last name we found {} entries in the offshore leaks".format(
        len(mutual_last_names)))

This gives us:

For the full name we found 15714 entries in the offshore leaks
For only the last name we found 41900 entries in the offshore leaks

So, 15714 names came up, and going through all these entries would be a very time-consuming task. Instead, we can pick a single name and print out the results using our predefined name_lookup_psc and name_lookup_offshore functions:

# look for a name
lookup_name = mutual_full_names[100]
print(lookup_name)
name_lookup_offshore(lookup_name, data_frame)
name_lookup_psc(lookup_name, name_list_full, idx_list, psc_)

These functions give us a rather nicely formatted summary of all entries in both datasets that belong to this name:

dave john elzas
#######################################
######### offshore database lookup:
#######################################
dave john elzas; registered_address:
14, CHEMIN ALBERT-DUFOUR 1222 VESENAZ, with id:58010482
Find the interactive graph for this node at: https://offshoreleaks.icij.org/nodes/58010482
------------------------
dave john elzas; officer_of:  narida limited, registered at:
2nd Floor, Europa Centre St. Anne Street,
FLORIANAFRN1400, MALTA, jurisdiction: Malta, with unique leak id:55033968
Find the interactive graph for this node at: https://offshoreleaks.icij.org/nodes/55033968
------------------------
#######################################
######### psc database lookup:
#######################################
------------------------
company number: 09900385
Find more information about this company at:
https://beta.companieshouse.gov.uk/company/09900385
------------------------
{'address': {'address_line_1': '19th Floor Newton Tower',
             'address_line_2': 'Sir William Newton Street',
             'country': 'Mauritius',
             'locality': 'Port Louis',
             'premises': 'C/O Gmg Trust Ltd'},
 'country_of_residence': 'Switzerland',
 'date_of_birth': {'month': 9, 'year': 1966},
 'etag': 'a12af678f7ef7858cfd5f9857cf2a977bd388c98',
 'kind': 'individual-person-with-significant-control',
 'name': 'Mr Dave John Elzas',
 'name_elements': {'forename': 'Dave',
                   'middle_name': 'John',
                   'surname': 'Elzas',
                   'title': 'Mr'},
 'nationality': 'Dutch',
 'natures_of_control': ['significant-influence-or-control-as-trust'],
 'notified_on': '2016-04-06'}
 

Feel free to look through all 15000 entries.

At this point, it is essential to note that just by looking at names we have no proof that our matches in both datasets are really the same person. Please keep that in mind. For any follow-up research on these people, you need several other sources to establish a connection to the offshore leaks.
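As an illustration of what such an additional check could look like, here is a small sketch; the helper plausible_match and its logic are my own and not part of the original analysis. It only accepts a name match if the PSC record’s country of residence also appears among the offshore entries for that name:

def plausible_match(name, df, psc_record):
    """Accept a name match only if the countries agree as well (a weak heuristic)."""
    offshore_countries = df.countries[
        df.name == str(name).lower()].dropna().str.lower().tolist()
    psc_country = str(psc_record.get("country_of_residence", "")).lower()
    # the countries column may hold several values, so use a substring check
    return bool(psc_country) and any(
        psc_country in entry for entry in offshore_countries)

You would call it with the dictionary stored in psc_.data for the matched index; a failed check does not disprove a match, it just tells you that the name alone is not enough.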

We can, of course, also have a look at politicians. First, let us look at German ones, because I am from Berlin and naturally interested in corruption within the German political landscape. The problem is that there is no official list of all German politicians. The best I could do is to scrape Wikipedia, which works relatively well. For that, we use the Python module wikipedia and scrape the content of the List of German politicians page. This gives us links to lists of politicians for every party. Since some parties no longer exist, like the Nazi parties or the Centre Party of the Weimar Republic, we handpick the lists that we want to keep. This is done with the idx_to_keep variable. That done, we scrape the content of each party member list in the same way and end up with a relatively large list of almost all active and inactive German politicians.

german_politicians = []
wiki_data = wikipedia.page("List of German politicians")
# This Wikipedia page also contains parties that don't exist anymore,
# so we manually define the indexes of the lists
# for parties that are still active
idx_to_keep = (0, 3, 4, 6, 7, 8, 12, 14)
for idx, poli_list in enumerate(wiki_data.links):
    if idx in idx_to_keep:
        print(poli_list)
        german_politicians.append(wikipedia.page(poli_list).links)
# Since we appended full lists, we now have one large list of several sublists,
# so we need to flatten our list:
german_politicians_ori = [
    item for sublist in german_politicians for item in sublist
]
# we need to clean the list of name suffixes like
# "Firstname Lastname (politician)"; we also need to get rid of
# non-name entries like "List of ...."
german_politicians = []
german_politicians_ori = unique_vals(german_politicians_ori)
for item in german_politicians_ori:
    if "List" not in item and "Members" not in item:
        if "(" in item:
            german_politicians.append(item[:item.find("(") - 1].lower())
        else:
            german_politicians.append(item.lower())

politicians_full_names = compare(german_politicians, offshore_name_list)
# Print all results
print("We found {} german politicians in the offshore leaks".format(
    len(politicians_full_names)))
if politicians_full_names:
    for lookup_name in politicians_full_names:
        print(lookup_name)
        name_lookup_offshore(lookup_name, data_frame)

This will print out the following:

List of Bavarian Christian Social Union politicians
List of German Christian Democratic Union politicians
List of German Communist Party politicians
List of German Free Democratic Party politicians
List of German Green Party politicians
List of German Left Party politicians
List of National Democratic Party of Germany politicians
List of Social Democratic Party of Germany politicians
We found 5 german politicians in the offshore leaks
josef bauer
#######################################
######### offshore database lookup:
#######################################
josef bauer; registered_address:  2384, LIESINGTALSTRASSE 58, BREITENFURT, with id:58022447
Find the interactive graph for this node at: https://offshoreleaks.icij.org/nodes/58022447
------------------------
josef bauer; officer_of:  sky gourmet malta ltd,
registered at: JAMES CONFECTIONERY VELLERAN STREET,
FGURAFGR1900, MALTA, jurisdiction: Malta, with unique leak id:55033415
Find the interactive graph for this node at: https://offshoreleaks.icij.org/nodes/55033415
------------------------
⋮
⋮
⋮

I shortened the output here in the blog for brevity. This guy, Josef Bauer, is an excellent example of why a name is a horrible identifier for things: it is not even close to being unique. The Josef Bauer from the Wikipedia list was a politician from Wasserburg am Inn in the Christian Social Union of Bavaria and died in 1989. The Josef Bauer from the offshore leaks database lived in Breitenfurt, which is close to Vienna, and was/is an officer of a company, sky gourmet malta ltd, that was founded in the 90s; you can read that on their weird website.

The next thing is to check out American politicians. A list of current legislators is thankfully readily available as a JSON download from this GitHub repo; here is the direct link to the file I used.

american_congress = pd.read_json("legislators-current.json")
delegates = [
    american_congress.name[idx]["official_full"].lower()
    for idx in range(american_congress.shape[0])
]

delegates_full_names = compare(delegates, offshore_name_list)
print("We found {} American politicians in the offshore leaks".format(
    len(delegates_full_names)))

if delegates_full_names:
    for lookup_name in delegates_full_names:
        print(lookup_name)
        name_lookup_offshore(lookup_name, data_frame)

This will give us:

We found 4 American politicians in the offshore leaks
david scott
#######################################
######### offshore database lookup:
#######################################
david scott; registered_address:
Homat Ruby #301 Minami Aoyama 1-26-6 Minato-ku, Tokyo 107 Japan, with id:237672
Find the interactive graph for this node at: https://offshoreleaks.icij.org/nodes/237672
------------------------
david scott; registered_address:
33, KENILWORTH ROAD, BRIDGE OF ALLAN, STIRLING FK94RP, with id:58027627
Find the interactive graph for this node at: https://offshoreleaks.icij.org/nodes/58027627
------------------------
david scott; officer_of:
gozo aquaculture limited, registered at: 52, OLD THEATRE STREET,
VALLETTAVLT 05, MALTA, jurisdiction: Malta, with unique leak id:55010764
Find the interactive graph for this node at: https://offshoreleaks.icij.org/nodes/55010764
------------------------
⋮
⋮
⋮

Hmmm, this also doesn’t seem to be the same person. The US politician represents Georgia in the US House of Representatives (he previously served in the Georgia State Senate), while the David Scott from the offshore leaks is connected to Japan and the UK. So we see that reliably finding people within the offshore leaks is not so easy, and this article provides a mere introduction to how your research could start. Go and have fun with it!

One last thing. You want to know if your own name is on the list? Sure, go ahead:

my_name = "Julian Zimmermann" # That's me
me_in_the_leaks = compare([my_name.lower()], offshore_name_list)
if me_in_the_leaks:
    print("We found you")
    lookup_name = me_in_the_leaks[0]
    name_lookup_offshore(lookup_name, data_frame)
else:
    print("You are not in it")

Which prints out:

You are not in it

Phew, so far my record is clean.