The Stanford Wordnet Project

Overview

It is a long-standing dream of AI to have algorithms automatically read and obtain knowledge from text. By applying a learning algorithm to parsed text, we have developed methods that can automatically identify the concepts in the text and the relations between them. For example, reading the phrase "heavy water rich in the doubly heavy hydrogen atom called deuterium", our algorithm learns (and adds to its semantic network) the fact that deuterium is a type of atom (Snow et al., 2005). By applying this procedure (and its extensions: Snow et al., 2006; Snow et al., 2007) to large amounts of text, our algorithms automatically acquire hundreds of thousands of items of world knowledge and use them to produce significantly enhanced versions of WordNet (made freely available online). WordNet (a laboriously hand-coded database) is a major NLP resource, but it has proven very expensive to build and maintain manually. By automatically inducing knowledge to add to WordNet, our work provides an even greater NLP resource (e.g., significantly greater precision/recall in identifying various relations), at a tiny fraction of the cost.
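To make the deuterium example concrete, here is a minimal sketch of pattern-based hypernym extraction. It is not the method of the papers, which learn patterns over parsed text; it swaps in two hand-written lexical patterns in the spirit of Hearst (1992) and captures only single words. The names PATTERNS and extract_hypernym_pairs are hypothetical.

```python
import re

# Hand-written patterns (an assumption for illustration, not learned from data).
# Each entry: (compiled regex, group number of the hyponym, group number of the hypernym).
PATTERNS = [
    (re.compile(r"\b(\w+) called (\w+)"), 2, 1),   # "atom called deuterium"
    (re.compile(r"\b(\w+) such as (\w+)"), 2, 1),  # "metals such as copper"
]


def extract_hypernym_pairs(text):
    """Return (hyponym, hypernym) pairs found by the hand-written patterns."""
    pairs = []
    for pattern, hypo_group, hyper_group in PATTERNS:
        for match in pattern.finditer(text):
            pairs.append((match.group(hypo_group), match.group(hyper_group)))
    return pairs


phrase = "heavy water rich in the doubly heavy hydrogen atom called deuterium"
print(extract_hypernym_pairs(phrase))  # [('deuterium', 'atom')]
```

Fixed surface patterns like these miss most phrasings and cannot handle multi-word noun phrases, which is one motivation for learning patterns from parsed text instead.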

Team

Rion Snow
Daniel Jurafsky
Andrew Ng

Resources

Augmented Wordnets

These lexical resources (and the method of their construction) are described in Semantic Taxonomy Induction from Heterogenous Evidence (ACL-06). They are automatically augmented versions of WordNet 2.1 (available at http://wordnet.princeton.edu).

WN 2.1 +10000 synsets: WN_plus10k.tgz
WN 2.1 +20000 synsets: WN_plus20k.tgz
WN 2.1 +30000 synsets: WN_plus30k.tgz
WN 2.1 +40000 synsets: WN_plus40k.tgz

New!
WN 2.1 + 400,000 synsets (cropped for ./wn compatibility): wn400k_cropped.tgz
WN 2.1 + 400,000 synsets (all, won't work with ./wn binary): wn400k_all.tgz

Installation instructions:
Each tarball contains a full set of the data files used by the wn and wnb executables distributed in the standard WordNet 2.1 package, as well as by common interfaces such as WordNet::QueryData and Lingua::WordNet. To install an extended Wordnet, copy the data files from one of the tarballs above into the dict/ folder of an existing WordNet 2.1 installation.
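The copy step can be sketched in Python as follows. This is a minimal sketch under assumptions, not an official installer: the internal layout of the tarballs is not described here, so the function flattens any directory structure and copies every regular file by its base name. The function name install_extended_wordnet and the ".orig" backup suffix are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def install_extended_wordnet(tarball, wordnet_dict_dir, backup_suffix=".orig"):
    """Copy every regular file in the tarball into the WordNet dict/ folder.

    Assumption: directory structure inside the tarball is ignored and files
    are copied by base name. An existing file is saved once as
    <name><backup_suffix> before it is overwritten.
    Returns the sorted list of file names that were written.
    """
    dict_dir = Path(wordnet_dict_dir)
    if not dict_dir.is_dir():
        raise FileNotFoundError(f"not a directory: {dict_dir}")

    written = []
    with tarfile.open(tarball) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            name = Path(member.name).name
            dest = dict_dir / name
            backup = dict_dir / (name + backup_suffix)
            if dest.exists() and not backup.exists():
                shutil.copy2(dest, backup)  # keep the original WordNet 2.1 file
            with tar.extractfile(member) as src, open(dest, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(name)
    return sorted(written)
```

A call might look like install_extended_wordnet("WN_plus10k.tgz", "/path/to/WordNet-2.1/dict"), where the second argument is a placeholder for wherever WordNet 2.1 is installed. Restoring the ".orig" copies returns the installation to the stock database.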

Disclaimer:
Due to restrictions in the WordNet format (a synset can have at most 999 direct links, and the character set is restricted), only a subset of the links discussed in our paper is included in the above extended Wordnets. To receive a full version of our extended Wordnets in the BerkeleyDB format, please e-mail rion@cs.stanford.edu.
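The 999-link limit follows from the WordNet data file layout, where each synset line stores its pointer count as a three-digit decimal field. The sketch below reads that field and reports how many further links would still fit. The field positions follow the WordNet database file documentation; the example line is made up for illustration and is not copied from any distributed file, and the names parse_pointers and links_that_fit are hypothetical.

```python
MAX_POINTERS = 999  # p_cnt is a three-digit decimal field, so it cannot exceed 999

# Illustrative synset line (invented offsets and gloss), in data.noun layout:
# offset lex_filenum ss_type w_cnt word lex_id p_cnt [symbol offset pos src/tgt]... | gloss
EXAMPLE = (
    "00001740 03 n 01 entity 0 003 "
    "~ 00001930 n 0000 ~ 00002137 n 0000 ~ 04424418 n 0000 "
    "| illustrative gloss"
)


def parse_pointers(data_line):
    """Return the (symbol, offset, pos, source_target) pointers of a synset line."""
    fields = data_line.split("|", 1)[0].split()
    w_cnt = int(fields[3], 16)        # word count: two-digit hexadecimal
    p_cnt_index = 4 + 2 * w_cnt       # skip the (word, lex_id) pairs
    p_cnt = int(fields[p_cnt_index])  # pointer count: three-digit decimal
    start = p_cnt_index + 1
    return [tuple(fields[start + 4 * i:start + 4 * i + 4]) for i in range(p_cnt)]


def links_that_fit(data_line, candidate_count):
    """How many of candidate_count new links fit before the synset reaches the limit."""
    room = MAX_POINTERS - len(parse_pointers(data_line))
    return max(0, min(candidate_count, room))


print(len(parse_pointers(EXAMPLE)))   # 3
print(links_that_fit(EXAMPLE, 1000))  # 996
```

A synset near the top of the noun hierarchy can attract far more than 999 induced hyponyms, which is why some links have to be dropped when writing this format.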

Sense-clustered Wordnets

These lexical resources (and the method of their construction) are described in Learning to Merge Word Senses (EMNLP-07). They are automatically sense-clustered versions of WordNet 2.1 (available from http://wordnet.princeton.edu).

WN 2.1 -1091 synsets: senseclusteredWN_1.tgz
WN 2.1 -2783 synsets: senseclusteredWN_0.5.tgz
WN 2.1 -3100 synsets: senseclusteredWN_0.tgz
WN 2.1 -3584 synsets: senseclusteredWN_-0.5.tgz
WN 2.1 -9868 synsets: senseclusteredWN_-0.9.tgz
WN 2.1 -12619 synsets: senseclusteredWN_-0.95.tgz
WN 2.1 -19370 synsets: senseclusteredWN_-1.tgz
WN 2.1 -32065 synsets: senseclusteredWN_-1.5.tgz

Relevant Papers

Rion Snow, Sushant Prakash, and Andrew Y. Ng, "Learning to Merge Word Senses". EMNLP 2007. [pdf]
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng, "Semantic taxonomy induction from heterogenous evidence". COLING/ACL, 2006. Received Best Paper Award. [pdf]
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng, "Learning syntactic patterns for automatic hypernym discovery". NIPS 17, 2005. [pdf]

Related Work

Sharon Caraballo, "Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text". Brown University Ph.D. Thesis, 2001. [ps]
Marti Hearst, "Automatic Acquisition of Hyponyms from Large Text Corpora". COLING 1992. [pdf]
George Miller, "WordNet: a lexical database for English". Communications of the ACM, 1995. [pdf] [WordNet homepage]
