The Paradise Papers: How the Cloud Helped Expose the Hidden Wealth of the Global Elite
In early 2016, the International Consortium of Investigative Journalists (ICIJ) published the Panama Papers, one of the biggest tax-related data leaks in recent history, involving 2.6 terabytes (TB) of information. It exposed the widespread use of offshore tax havens and shell companies by thousands of wealthy individuals and political officials, including the British and Icelandic Prime Ministers.
As if that weren’t striking enough, the Paradise Papers followed shortly after: 1.4 terabytes of documents leaked to two reporters at the German newspaper Süddeutsche Zeitung. When they were published in late 2017, the Paradise Papers revelations caused major embarrassment and potential legal troubles for tech giants such as Apple and Facebook, not to mention Bono, Shakira, the Queen of England and President Donald Trump’s Commerce Secretary, Wilbur Ross. Aside from revealing the deliberate steps these individuals and organizations took to evade taxes, the Paradise Papers also exposed efforts to violate international sanctions and hide compromising financial relationships.
In both cases, Talend Data Fabric and the cloud, chiefly Amazon Web Services, played a pivotal role: they let ICIJ take charge of the information and gave it the scalability to process and share that information with hundreds of globally dispersed journalists, everyday users with little to no developer or deep technical experience. ICIJ was also able to provide those reporters with powerful analytical tools to help them make sense of the huge volume of unstructured and structured data at their disposal and mine it for meaningful insights.
Drowning in an Ocean of Data
Even though the Paradise Papers leak was smaller in size than the Panama Papers (1.4 TB versus 2.6 TB), the number of documents was actually larger: 13.4 million versus 11.5 million. This made the task of sharing and making sense of all that information even more complex for three primary reasons:
- More sources – The Panama Papers all came from a single source, Mossack Fonseca, whereas the Paradise Papers came largely from offshore law firm Appleby, but also from Asiaciti Trust, a smaller, family-owned trust company, and the company registries of 19 different tax havens.
- Data variety – The Paradise Papers were mostly emails, but the data set also included 70 years’ worth of loan agreements, trust deeds, Excel and CSV files, PDFs and images. ICIJ had to process all that content and make it searchable, along with structured data from several database platforms.
- Collaboration – ICIJ needed to share the same set of documents with more than 380 journalists, in 67 countries, on six continents, working in 30 languages, and eventually with the entire public. Organizing the data in a format that a broad, non-technical audience could digest was therefore quite an undertaking.
The Life-Jacket: A Perfect Cloud and Big Data Solution
Given the three challenges listed above, using the cloud to scale the project was the logical choice. The cloud vastly simplifies processing and collaboration among a large set of globally dispersed users with widely varying skill levels.
“Processing the data was a big job that took months,” says ICIJ’s head of data and research Mar Cabra. “However, it took less time [than it would have otherwise] because we used parallel processing in the cloud.” This included firing up more than 20 servers in AWS for parallel optical character recognition (OCR) and processing. It also included the use of the Talend Data Fabric platform for most of the processing and orchestration.
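To give a rough sense of what “parallel processing” means here, the following is a minimal sketch in Python, not ICIJ’s actual pipeline. It assumes each document can be processed independently, so a batch can be fanned out across a pool of workers. The function `extract_text` is a hypothetical stand-in for an OCR or text-extraction step, and a local thread pool stands in for the fleet of cloud servers.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(document: bytes) -> str:
    """Hypothetical stand-in for an OCR / text-extraction step.

    A real step would call an OCR engine; here we only decode and
    normalize the bytes so the sketch stays self-contained.
    """
    return document.decode("utf-8", errors="replace").strip().lower()


def process_in_parallel(documents, workers=4):
    """Apply extract_text to every document using a pool of workers.

    executor.map preserves input order, so the results line up
    one-to-one with the input documents.
    """
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(extract_text, documents))


if __name__ == "__main__":
    # Illustrative batch only; the contents are made up.
    batch = [b"  Trust Deed  ", b"Loan Agreement", b" Email "]
    print(process_in_parallel(batch))
```

Because no document depends on another, adding workers (or, at ICIJ’s scale, servers) shortens the wall-clock time roughly in proportion, which is the effect Cabra describes.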
“Talend is our preferred solution when it comes to cleaning, transforming, and integrating the data we receive,” said Pierre Romera, CTO at ICIJ. “It works as a crucial mechanism for enabling us to build a robust database and when coupled with AWS, we could significantly cut the time needed to process the huge volumes of information we had.”
ICIJ also took advantage of several open source tools, such as VeraCrypt for hidden files and encryption, Apache Tika to extract text from PDFs and other non-text files, Apache Solr for indexing and search, and Blacklight for a search interface. Amazon’s services came in handy as well, including CloudFront for secure global content delivery, ElastiCache’s in-memory data store for caching and quick access, Relational Database Service, Route 53 for DNS, and of course EC2 and S3 for scalable compute and storage.
Charting a Course for Connecting the Dots
ICIJ used Talend to process and load all the unstructured data into a Neo4j graph database and harnessed the Linkurious graph visualization platform to provide access for users. The latter was invaluable for its ability to display links among the players in a graphical format, vastly simplifying the task for reporters of connecting the dots.
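As an illustration of what “connecting the dots” in a graph looks like, here is a small Python sketch. It is an in-memory stand-in, not Neo4j or Linkurious, and every entity name in it is invented. It assumes relationships are stored as undirected links between entities and uses a breadth-first search to find the shortest chain linking two of them.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (entity, entity) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Return the shortest chain of entities from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sort neighbors so the result is deterministic.
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


if __name__ == "__main__":
    # Entirely fictional entities and links, for illustration only.
    links = [
        ("Person A", "Shell Co 1"),
        ("Shell Co 1", "Trust X"),
        ("Trust X", "Person B"),
        ("Person C", "Shell Co 2"),
    ]
    graph = build_graph(links)
    print(shortest_path(graph, "Person A", "Person B"))
```

A graph database answers this kind of question natively and at far larger scale; the sketch only shows why storing data as nodes and links makes indirect relationships easy to surface.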
The ICIJ provided three cloud-enabled platforms: the Global I-Hub for secure internal communications, the Global Knowledge Center for document research, and Linkurious for data connections.
“Utilizing the cloud was an obvious choice for us due to the nature of our mission and the large volume of data we had to process,” said Romera. “Cloud technology offered the scalability we needed, when we needed it, with robust power for processing and security.”
Both the Panama and Paradise papers are now available to the public, who can make use of the same tools offered to reporters to glean even more revelations. None of this would have been possible without the cloud’s unmatched scalability and global accessibility.