Recognising ‘bad actors’ in data leaks with AI

Luis Flores, Michel Schammel, Anna Vissens

December 9, 2022 at 8:41 a.m.

Photograph: Dominic Lipinski/PA

For the 2022 JournalismAI fellowship, the Guardian teamed up with Daily Maverick and Follow the Money with a shared aim to uncover ‘bad actors’ hidden in extensive digital corpora, of which whistleblower leaks have become the most emblematic example.

Investigative journalism involves large data dumps and a lot of manual work. Our aim in this project was to build a pipeline powered by AI to automatically surface and organise people of interest from data leaks. This would make uncovering “bad actors” quicker and easier, and lead to exclusive stories.

Handling, investigating, and visualising large quantities of data invariably involves several steps, from obtaining the data to using multiple tools to search for entities (people or organisations) in this data and finding the connections between them. In our experience, one of the big challenges newsrooms face in this process is matching the entities they are able to extract from large datasets using Natural Language Processing (NLP) to real-world people, places and things. For instance, when several individuals with political interests share the same surname, linking a mention to the right person is often far from trivial.

Entity linking pipeline

The aim of the project was to train an entity linker, a model which disambiguates entities mentioned in text (‘Document’ in the diagram below) by linking them to unique identifiers. This requires a database (Knowledge base), as well as a function to generate plausible candidates from that database, and a machine learning model to pick the right candidate using the local context of the mention.
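
The three components described above (knowledge base, candidate generator, and a model that picks a candidate from local context) can be sketched in a few lines of Python. This is an illustrative toy, not the project's model: the `KBEntry` type, the two knowledge-base entries, their identifiers, and the word-overlap scorer standing in for the trained machine learning model are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class KBEntry:
    entity_id: str    # unique identifier in the knowledge base
    name: str
    description: str


# Hypothetical knowledge base: two entries that share a name.
KNOWLEDGE_BASE = [
    KBEntry("kb-001", "Adam Smith",
            "philosopher and economist, author of The Wealth of Nations"),
    KBEntry("kb-002", "Adam Smith",
            "fictional character in a television drama series"),
]


def tokens(text: str) -> Set[str]:
    """Lower-cased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?()'\"").lower() for w in text.split()} - {""}


def generate_candidates(mention: str) -> List[KBEntry]:
    """Candidate generator: here, a simple case-insensitive name match."""
    return [e for e in KNOWLEDGE_BASE if e.name.lower() == mention.lower()]


def link(mention: str, context: str) -> Optional[str]:
    """Return the identifier of the candidate whose description shares the
    most words with the context, or None if there is no candidate.

    A trained model would replace this simple overlap count.
    """
    candidates = generate_candidates(mention)
    if not candidates:
        return None
    context_words = tokens(context)
    best = max(candidates,
               key=lambda e: len(tokens(e.description) & context_words))
    return best.entity_id
```

Calling `link("Adam Smith", "The economist wrote about the wealth of nations.")` selects the economist entry, while a sentence about a television series selects the fictional character. A mention with no matching name returns `None`.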

Named entity recognition (NER) is the process of locating and classifying named entities mentioned in text into predefined categories, such as person, organisation, or location.

Once we extract mentions of personal names in text, the entity linker should be able to identify the right candidate for the mention and provide a link back to the knowledge base.

To train our prototype we decided to use a set of articles from the Guardian newsroom. This had several advantages, including that it was readily accessible via the Guardian’s content API, and contained a wide range of well-known entities to develop and validate our approach.

At the same time, we wanted the project to reflect the real-life challenges faced by investigative journalists and for the model to work with less well-known entities. So instead of relying on WikiData, a common choice in this field, we decided to combine two databases containing information on people of interest: LittleSis and Open Sanctions.

Training the model

To train the entity linker model we needed to create a training dataset consisting of a large number of examples where different names mentioned in text were linked to the corresponding correct identifiers in the database.

We used two tools created by Explosion, a software company specialising in NLP tools: spaCy, an open-source library for advanced NLP, and Prodigy, an annotation tool for quick and efficient labelling of training data.

As with all NLP projects involving labelling by multiple annotators, we started by writing a clear and concise guide. By writing these instructions we tried to improve consistency between annotators who might take a slightly different approach to the task. We even created a decision tree chart (shown below) so the annotators would understand the context in the same way.

The guide also included some edge cases we encountered and discussed during our group annotation sessions, which shaped the guide nicely and helped us align our approach to the task.

Overall, the team annotated more than 4,000 examples linking mentions in text to entries in the database, and these efforts successfully led to a functioning prototype of an entity linker model.

Based on our preliminary analysis, the model’s performance on a validation dataset was acceptable but it was clear that it did not reflect real-life scenarios well. For example, the model struggled to reliably disambiguate single name mentions.

Challenges

Creating a training dataset for an entity linking task proved extremely challenging. Having decided to merge two databases to form our knowledge base, we were faced with resolving issues such as inconsistent formatting, duplicate entities, and contradictory or outdated descriptions. We thus had to invest a considerable amount of time in cleaning the databases.

In order to surface the right candidates for each annotation we had to create an algorithm to filter and refine the entries in the knowledge base. In the end, we used a combined metric of fuzzy string matching and shared vocabulary between the text containing the mention and the knowledge base descriptions.
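
A combined metric of this kind might look like the sketch below. The exact scoring is not described in this article, so everything here is an assumption: `difflib.SequenceMatcher` from the Python standard library stands in for the fuzzy string matcher, Jaccard overlap stands in for "shared vocabulary", and the equal weighting, the stop-word list, and the sample entries are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set, Tuple

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on", "is", "was"}


def vocabulary(text: str) -> Set[str]:
    """Lower-cased content words with surrounding punctuation stripped."""
    words = {w.strip(".,;:!?()'\"").lower() for w in text.split()}
    return {w for w in words if w and w not in STOP_WORDS}


def name_similarity(mention: str, name: str) -> float:
    """Fuzzy string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, mention.lower(), name.lower()).ratio()


def vocabulary_overlap(context: str, description: str) -> float:
    """Jaccard overlap between the context and a knowledge-base description."""
    a, b = vocabulary(context), vocabulary(description)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_score(mention: str, name: str, context: str, description: str,
                   name_weight: float = 0.5) -> float:
    """Weighted sum of name similarity and shared vocabulary."""
    return (name_weight * name_similarity(mention, name)
            + (1.0 - name_weight) * vocabulary_overlap(context, description))


def rank_candidates(mention: str, context: str,
                    entries: Dict[str, Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Score every (name, description) entry and sort best first."""
    scored = [(entity_id, combined_score(mention, name, context, description))
              for entity_id, (name, description) in entries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In practice the top few entries from `rank_candidates` would be shown to the annotator, and a minimum score could be used to discard implausible matches. Both the cut-off and the weighting would need tuning against real data.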

Despite our best efforts, it became clear that our training dataset did not contain enough examples of mentions for which multiple candidates shared the same name (such as Adam Smith, the renowned philosopher and economist, vs a character on a popular TV show or the name of an institution). These were a crucial type of example that the model needed in order to learn as intended. There were also very few examples of mentions consisting of a single name, which are particularly hard to disambiguate, so the model did not learn to deal with them.
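
One way to spot this kind of gap early is to count how many annotated examples fall into each hard category. The sketch below assumes a hypothetical record layout, a dictionary with a `mention` string and a list of `candidate_names`. The annotation format actually used in the project is not shown in this article.

```python
from collections import Counter
from typing import Dict, List


def hard_case_counts(examples: List[Dict]) -> Counter:
    """Count examples with same-name candidates and single-name mentions."""
    counts: Counter = Counter()
    for example in examples:
        counts["total"] += 1
        mention = example["mention"].lower()
        names = [name.lower() for name in example["candidate_names"]]
        # More than one candidate carries exactly the mentioned name.
        if names.count(mention) > 1:
            counts["same_name_candidates"] += 1
        # The mention is a single token, such as a surname on its own.
        if len(example["mention"].split()) == 1:
            counts["single_name_mention"] += 1
    return counts
```

If either count is a small share of the total, the model will see few of the cases it most needs to learn from.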

Although the team annotated a large number of examples, not all of these annotations could be used to train the model as many of them needed more context to find a match.

Much more training data was needed to show the model enough examples to learn from. On average, one annotator could go through 100 examples an hour, of which only about half would end up in the training dataset. This means one annotator would need to work full time for nearly 6 weeks to collect 10,000 useful examples.
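
The estimate can be checked with a few lines of arithmetic. The throughput, the share of usable annotations, and the 10,000 target come from the text. The 35-hour working week is an assumption, chosen only to show how the total annotation hours translate into weeks.

```python
def annotation_hours(target_examples: int, examples_per_hour: float,
                     usable_fraction: float) -> float:
    """Hours of annotation needed to collect the target number of usable examples."""
    useful_per_hour = examples_per_hour * usable_fraction
    return target_examples / useful_per_hour


HOURS_PER_WEEK = 35  # assumed length of a working week

hours = annotation_hours(10_000, examples_per_hour=100, usable_fraction=0.5)
weeks = hours / HOURS_PER_WEEK
print(f"{hours:.0f} hours, about {weeks:.1f} weeks at {HOURS_PER_WEEK} hours a week")
# prints "200 hours, about 5.7 weeks at 35 hours a week"
```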

When it came to training, we faced another challenge. Candidates often had similar contexts, for example relating to politics, which made it difficult for the model to discriminate between them, especially if the paragraph in which they were mentioned contained only generic words relating to politics.

What’s next?

Despite the many challenges we faced, some of which are not yet fully resolved, the lessons we have learnt from this project have been invaluable. We now have a much better understanding of the advantages and limitations of entity linking, some of which are transferable across the suite of tools and techniques we typically rely on. These will no doubt be extremely useful in future projects.

Once we have a robust entity linking model which performs well in more challenging tasks, we can extend it to other entities such as organisations. Eventually this model could be used to disambiguate entities so that we can auto-generate graphs of the relationships between persons, organisations and other entities we find in large document collections.

The foundational and exploratory work we have done together over the last few months has highlighted the exceptionally high standards of accuracy which tools used by investigative journalists must meet. It’s a challenging task, but one we will continue to develop through our own work and upcoming JournalismAI initiatives.

This project is part of the 2022 JournalismAI Fellowship Programme. The Fellowship brought together 46 journalists and technologists from across the world to collaboratively explore innovative solutions to improve journalism via the use of AI technologies. You can explore all the Fellowship projects at this link.

JournalismAI is a project of Polis – the journalism think-tank at the London School of Economics and Political Science – and it’s sponsored by the Google News Initiative. If you want to know more about the Fellowship and the other JournalismAI activities, sign up for the newsletter or get in touch with the team via hello@journalismai.info
