Question Answering with Knowledge

From search engines to personal assistants, we use question-answering systems every day. When we ask a question (“Where was the painter of the Mona Lisa born?”), the system needs to gather background knowledge (“The Mona Lisa was painted by Leonardo da Vinci”, “Leonardo da Vinci was born in Italy”) and reason over it to produce the answer (“Italy”).

Knowledge sources
In recent AI research, such background knowledge is commonly available in the forms of knowledge graphs (KGs) and language models (LMs) pre-trained on a large set of documents. In KGs, entities are represented as nodes and relations between them as edges, e.g. [Leonardo da Vinci — born in — Italy]. Examples of KGs include Freebase (general-purpose facts)1, ConceptNet (commonsense)2, and UMLS (biomedical facts)3. Examples of pre-trained LMs include BERT (trained on Wikipedia articles and 10,000 books)4, RoBERTa (extending BERT)5, BioBERT (trained on biomedical publications)6, and GPT-3 (the largest public LM to date)7.

The two knowledge sources have complementary strengths. LMs can be pre-trained on any unstructured text and thus have a broad coverage of knowledge. On the other hand, KGs are more structured and help for logical reasoning by providing paths between entities. KGs also include knowledge that may not be commonly stated in text: for instance, people do not often state obvious facts like “people breathe” and compositional sentences like “The birthplace of the painter of the Mona Lisa is Italy”.

In our recent work8 published at NAACL 2021, we study how to effectively combine both sources of knowledge, LMs and KGs, to perform question answering.

Problem setup and Challenges
We consider a question answering setup illustrated in the figure below, where given a question and answer choices if any (combined, we call them the QA context) the system predicts an answer. Using LMs and KGs for question answering presents two challenges. Given a QA context (purple box in the figure), the system needs to first identify informative knowledge from a large KG (green box), and then capture the nuance of the QA context and the structure of the KG to jointly reason over them.

In existing systems that use LMs and KGs, such as RelationNet9, KagNet10 and MHGRN11, extracted KG subgraphs tended to be noisy, and the interactions between the QA context and KG were not modeled. In this work, we introduce promising solutions to the aforementioned two challenges: i) KG relevance scoring, where we estimate the relevance of KG nodes conditioned on the QA context, and ii) Joint graph, where we connect the QA context and KG as a joint graph to model their interactions.


We design an end-to-end question answering model that uses a pre-trained LM and KG. First, as commonly done in existing systems, we use an LM to obtain a vector representation for the QA context, and retrieve a KG subgraph by entity linking. Then, in order to identify informative knowledge from the KG, we estimate the relevance of KG nodes conditioned on the QA context (see the “KG Relevance Scoring” section below). Next, to jointly reason with the QA context and KG, we connect them as a joint graph and update their representations (see the “Joint Reasoning” section below). Finally, we combine the representations of the QA context and KG to predict the answer.

KG Relevance Scoring
Real-world KGs are huge, with millions of entities. How can we effectively extract a KG subgraph that is most relevant to the given question? Let’s consider an example question in the figure: “A revolving door is convenient for two direction travel, but also serves as a security measure at what?”. Common methods for extracting a KG subgraph link entities in the QA context such as “travel”, “door”, “security” and “bank” (topic entities; blue and red nodes in the figure left) and retrieve their 1- or 2-hop neighbors from the KG (gray nodes in the figure left). However, this may introduce many entity nodes that are semantically irrelevant to the QA context, especially when the number of hops or entities in the QA context increases. In this example, 1-hop neighbors may include nodes like “holiday”, “riverbank”, “human” and “place”, but they are off-topic or too generic.

Joint Reasoning
Now we have the QA context and the retrieved KG ready. How can we jointly reason over them to obtain the answer? To create a joint reasoning space, we explicitly connect them in a graph, where we view the QA context as a node (purple node in the figure) and connect it to each topic entity in the KG (blue and red nodes in the figure). As this joint graph intuitively provides a working memory for reasoning, we call it the working graph. Each node in the working graph is associated with one of the four types: purple is the QA context node, blue is an entity in the question, orange is an entity in the answer choices, and gray is any other entity. The representation of each node is initialized as the LM representation of the QA context (for the QA context node) or entity name (for KG nodes). The working graph essentially unifies the two modalities, text and KG, into one graph.

To reason on the working graph, we mutually update the representation of the QA context node and the KG via graph attention networks (GAT). The basic idea of GAT is to update the representation of each node by letting neighboring nodes send message vectors to each other for multiple layers. Concretely, in our model, we update the representation of each node t by the rule shown on the figure right, where m is the message vector from the neighbor node s, α is the attention weight between the current node t and neighbor node s. For more details about GAT, we refer readers to 12. Below are examples of how the message passing can look like, where a thicker edge indicates a higher attention weight.

Let’s use our question answering model!

We apply and evaluate our question answering model (we call QA-GNN) on two QA benchmarks that require reasoning with knowledge:

  • CommonsenseQA13: contains questions that test commonsense knowledge (e.g. “What do people typically do while playing guitar?”)
  • OpenBookQA14: contains questions that test elementary science knowledge (e.g. “Which of the following objects would let the most heat travel through?”)

For our LM component, we use RoBERTa, which was pre-trained on Wikipedia articles, books and other popular web documents. For our KG component, we use ConceptNet, which contains a million entities and covers commonsense facts such as [round brush — used for — painting].

QA-GNN improves on existing methods of using LMs and KGs for question answering
We compare with a baseline that only uses the LM (RoBERTa) without the KG, and existing LM+KG models (RelationNet, KagNet and MHGRN). The main innovations we made in QA-GNN are that we perform the KG relevance scoring w.r.t. questions and that we mutually update the text and KG representations on the joint graph, while existing methods combined text and KG representations at later stages. We find that these two techniques provide improvement on the question answering accuracy e.g. 71%→73% on CommonsenseQA and 67%→70% on OpenBookQA (figure below).

Case studies: When is KG helpful and when is LM?
Let’s look at several question-answering examples in the CommonsenseQA benchmark, and see when/how the KG component or the LM component of our model is helpful. In each figure below, blue nodes are entities in the question, and red nodes are answer choices, where the bolded entity is the correct answer and the entity with (P) is the prediction by our model. As shown in the next two figures, we find that the KG component is especially useful when the KG provides concrete facts (e.g. [postpone — antonym — hasten] in the first figure) or paths (e.g. [chicken egg — egg — chicken — barn] in the second figure) that help for answering the questions.

On the other hand, we find that the LM component is especially helpful when the question requires language nuance and commonsense that are not available in the KG. For instance, in the next two figures, if we simply follow the paths in the KG, we may reach answers like “night sky” or “water” in the first and second questions respectively. While they are not completely wrong answers, “universe” and “soup” are better collocations.


In this work, we studied how to combine two sources of background knowledge (pre-trained LM and KG) to do better in question answering. To solve this problem, we introduced a new model QA-GNN, which has two innovations:

  1. KG relevance scoring: We use a pre-trained LM to score KG nodes conditioned on a question. This is a general framework to weight information on KGs.
  2. Joint reasoning over text and KGs: We connect the QA context and KG to form a joint graph, and mutually update their representations via a LM and graph neural network.

Through case studies we also identified the complementary strengths of pre-trained LMs and KGs as knowledge sources.

You can check out our full paper here and our source code/data on GitHub. If you have questions, please feel free to email us.


This blog post is based on the paper:

Many thanks to my collaborators and advisors, Hongyu Ren, Antoine Bosselut, Percy Liang and Jure Leskovec for their help. Many thanks to Megha Srivastava and Sidd Karamcheti for edits on this blog post.

  1. Freebase: a collaboratively created graph database for structuring human knowledge. Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. 

  2. Conceptnet 5.5: An open multilingual graph of general knowledge. Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Website here

  3. The unified medical language system (UMLS): integrating biomedical terminology. Olivier Bodenreider. 2004 

  4. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019 

  5. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. 2019 

  6. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang. 2019. 

  7. Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. 2020 

  8. QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. 2021. 

  9. A Simple Neural Network Module for Relational Reasoning. Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. 2017. 

  10. Kagnet: Knowledge-aware Graph Networks for Commonsense Reasoning. Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. 

  11. Scalable multi-hop relational reasoning for knowledge-aware question answering. Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, and Xiang Ren. 2020. 

  12. Graph Attention Networks. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio. 2018. 

  13. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant. 2019. Dataset website here

  14. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal. 2018. Dataset website here.s