Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation

Laurel Orr, Megan Leszczynski, Simran Arora, Neel Guha, Xiao Ling, Sen Wu, and Chris Ré

November 12, 2020

Named entity disambiguation (NED) is the process of mapping “strings” to “things” in a knowledge base. You have likely already used a system that requires NED multiple times today. Every time you ask a question to your personal assistant or issue a search query on your favorite browser, these systems use NED to understand what people, places, and things (entities) are being talked about.

Named entity disambiguation example. The ambiguous “Lincoln” refers to the car, not the person or location.

Take the example shown above. You ask your personal assistant “What is the average gas mileage of a Lincoln?”. The assistant would need NED to know that “Lincoln” refers to Lincoln Motors (the car company)—not the former president or city in Nebraska. The ambiguity of mentions in text is what makes NED so challenging as it requires the use of subtle cues.

The spectrum of entities. Popular (head) entities occur frequently in data while rare (tail) entities are infrequent.

NED gets more interesting when we examine the full spectrum of entities shown above, specifically the rarer tail and unseen entities. These are entities that occur infrequently or not at all in data. Performance over the tail is critical because the majority of entities are rare. In Wikidata, only 13% of entities even have Wikipedia pages as a source of textual information.

Bootleg compared to a BERT-based baseline model (Févry et al., 2020), showing average F1 versus the number of times an entity occurred in the training data. Because Wikidata has 15x as many entities as Wikipedia (most of them rare) and the baseline model needs to see an entity about 100 times on average to achieve 60 F1, the baseline model would need to train on data roughly 1,500x the size of Wikipedia to achieve 60 F1 over all entities.

Prior approaches to NED use BERT-based systems to memorize textual patterns associated with an entity (e.g., Abraham Lincoln is associated with “president”). As shown above, the SotA BERT-based baseline from Févry et al. does a great job at memorizing patterns over popular entities (it achieves 86 F1 points over all entities). For the rare entities, it does much worse (58 F1 points lower on the tail). One possible way to improve tail performance is to simply train over more data, but this would likely require training over data 1,500x the size of Wikipedia for the model to achieve 60 F1 points over all entities!
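The 1,500x figure is back-of-the-envelope arithmetic, which the short sketch below reproduces. The variable names are ours, the two inputs are the figures quoted in the post, and the estimate assumes that the data required grows in proportion to the number of entities.

```python
# Back-of-the-envelope estimate using the figures quoted in the post.
entity_ratio = 15          # Wikidata has roughly 15x as many entities as Wikipedia
occurrences_needed = 100   # times the baseline must see an entity to reach ~60 F1

# Assumption: every additional entity needs about `occurrences_needed` mentions,
# so the data requirement scales with the product of the two factors.
data_multiplier = entity_ratio * occurrences_needed

print(f"Training data needed: ~{data_multiplier:,}x the size of Wikipedia")
# prints "Training data needed: ~1,500x the size of Wikipedia"
```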

In this blog post, we present Bootleg, a self-supervised approach to NED that is better able to handle rare entities.

Tail Disambiguation through NED Reasoning Patterns

The question we are left with is: how do we disambiguate these rare entities? Our insight is that humans disambiguate entities, including rare entities, by using signals from text as well as from entity relations and types. For example, the sentence “What is the gas mileage of a Lincoln?” requires reasoning that cars have a gas mileage, while people and locations do not. The same pattern can be used to reason that the mention of “Bluebird” in “What is the average gas mileage of a Bluebird?” refers to the car, a Nissan Bluebird, not the animal. Our goal in Bootleg is to train a model to reason over entity types and relations and better identify these tail entities.

Through empirical analysis, we found four reasoning patterns for NED, shown and defined in the figure below.

Four reasoning patterns of NED. Each pattern uses some combination of entity, type, and relation information.

These patterns rely on signals from entities, types, and relations. Luckily, tail entities do not have equally rare types and relations. This means we should be able to learn type and relation patterns from our data that can apply to tail entities.

Bootleg: A Model for Tail NED

Bootleg takes as input a sentence, determines the possible entity candidates that could be mentioned in the sentence, and outputs the most likely candidates. The core insight that enables Bootleg to better identify rare entities is in how it internally represents entities.

The creation of an entity candidate representation. Each candidate is a combination of a learned entity embedding, type embedding, and relation embedding.

Similar to how words are often represented by continuous word embeddings (e.g., BERT or ELMo), Bootleg represents entity candidates as a combination of a unique entity embedding, a type embedding, and a relation embedding, as shown above. For example, each car entity will get the same car type embedding (likewise for relations) which will encode patterns learned over all cars in the training data. A rare car can then use this global “car type” knowledge for disambiguation, as it will have the car embedding as part of its representation.
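To make the sharing concrete, here is a minimal sketch of how a candidate representation could be assembled. It is not Bootleg's implementation: the toy entities, the type and relation labels, the random vectors standing in for learned embeddings, and the choice to combine the three parts by concatenation are all assumptions made for illustration.

```python
import random

random.seed(0)
DIM = 4  # tiny embedding size, for illustration only


def new_embedding(dim=DIM):
    """Stand-in for a learned embedding: a random vector."""
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


# Hypothetical toy metadata: each entity has one type and one relation.
entity_info = {
    "Lincoln (car)": {"type": "car", "relation": "made by automaker"},
    "Nissan Bluebird": {"type": "car", "relation": "made by automaker"},
    "Abraham Lincoln": {"type": "person", "relation": "held office"},
}

# One embedding per entity, but only one per distinct type and relation.
entity_emb = {name: new_embedding() for name in entity_info}
type_emb = {t: new_embedding()
            for t in sorted({info["type"] for info in entity_info.values()})}
relation_emb = {r: new_embedding()
                for r in sorted({info["relation"] for info in entity_info.values()})}


def candidate_representation(name):
    """Concatenate the entity, type, and relation embeddings (assumed combination)."""
    info = entity_info[name]
    return entity_emb[name] + type_emb[info["type"]] + relation_emb[info["relation"]]


lincoln = candidate_representation("Lincoln (car)")
bluebird = candidate_representation("Nissan Bluebird")

# The two cars differ in their entity part but share the type and relation parts.
print(lincoln[:DIM] != bluebird[:DIM])
print(lincoln[DIM:] == bluebird[DIM:])
```

Because the type and relation parts are shared, anything the model learns about cars from frequent entities is available to a rare car the first time it appears.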

To output the correct entities, Bootleg uses these representations in a stacked Transformer module to allow the model to naturally learn the useful patterns for disambiguation without hard-coded rules. Bootleg then scores the output candidate representations and returns the most likely candidates.
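The final scoring step can be sketched as follows. The Transformer stack is left out entirely; the sketch only shows how output candidate vectors might be turned into a ranking, using a dot product with a scoring vector followed by a softmax. The candidate vectors, the scoring vector, and the function names are hypothetical and hand-picked, not taken from Bootleg.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(candidate_vectors, scoring_vector):
    """Score each candidate by a dot product and sort by probability."""
    names = list(candidate_vectors)
    scores = [
        sum(x * w for x, w in zip(candidate_vectors[name], scoring_vector))
        for name in names
    ]
    probabilities = softmax(scores)
    return sorted(zip(names, probabilities), key=lambda pair: pair[1], reverse=True)


# Hypothetical output representations for the mention "Lincoln".
candidates = {
    "Lincoln (car)": [0.9, 0.1, 0.8],
    "Abraham Lincoln": [0.2, 0.7, 0.1],
    "Lincoln, Nebraska": [0.1, 0.3, 0.2],
}
scoring_vector = [1.0, -0.5, 1.0]  # stands in for learned scoring weights

ranking = rank_candidates(candidates, scoring_vector)
for name, probability in ranking:
    print(f"{name}: {probability:.3f}")
```

With these hand-picked numbers the car candidate ranks first; in a trained model the ordering would come from the learned representations rather than from values chosen by hand.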

There are other exciting techniques we present in our paper regarding regularization and weak labeling to improve tail performance.

Bootleg Improves Tail Performance and Allows for Knowledge Transfer

Our simple insight of training a model to reason over types and relations provides state-of-the-art performance on three standard NED benchmarks – matching or exceeding SotA by up to 5.6 F1 points – and outperforms a BERT-based NED baseline by 5.4 F1 points over all entities and 40 F1 points over tail entities (see F1 versus entity occurrence plot above).

Benchmark	System	Precision	Recall	F1
KORE50	Hu et al., 2019	80.0	79.8	79.9
KORE50	Bootleg	86.0	85.4	85.7
RSS500	Phan et al., 2019	82.3	82.3	82.3
RSS500	Bootleg	82.5	82.5	82.5
AIDA CoNLL YAGO	Févry et al., 2020	-	96.7	-
AIDA CoNLL YAGO	Bootleg	96.9	96.7	96.8
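As a sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes F1 for the Bootleg rows from the precision and recall values listed above; the function and variable names are ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the Bootleg rows in the table above.
bootleg_rows = {
    "KORE50": (86.0, 85.4, 85.7),
    "RSS500": (82.5, 82.5, 82.5),
    "AIDA CoNLL YAGO": (96.9, 96.7, 96.8),
}

for benchmark, (precision, recall, reported) in bootleg_rows.items():
    computed = round(f1_score(precision, recall), 1)
    print(f"{benchmark}: computed F1 = {computed}, reported F1 = {reported}")
```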

We’ll now show how the entity knowledge encoded in Bootleg’s entity representations can transfer to non-NED tasks. We extract our entity representations and use them in both a production task at a major technology company and a relation extraction task. We find that using Bootleg embeddings in the production task provides an 8% lift in performance and even improves quality in Spanish, French, and German. We repeat this experiment by adding Bootleg representations to a SotA model for the TACRED relation extraction task (see tutorial). We find this Bootleg-enhanced model sets a new SotA by 1 F1 point.

Model	TACRED F1
Bootleg-Enhanced	80.3
KnowBERT	79.3
SpanBERT	78.0

These results suggest that Bootleg entity representations can transfer entity knowledge to other language tasks!

Recap

To recap, we described the problem of the tail of NED and showed that existing NED systems fall short at disambiguating these rare, yet important entities. We then introduced four reasoning patterns for NED and described how we trained Bootleg to learn these patterns through the use of embeddings and Transformer modules. We finally showed that Bootleg is a SotA NED system that better disambiguates rare entities than prior methods. Further, Bootleg learns representations that can transfer entity knowledge to non-NED tasks.

We are actively developing Bootleg and would love to hear your thoughts. See our website, source code, and paper.
