The Stanford Wordnet Project
It is a long-standing dream of AI to have algorithms automatically read and obtain knowledge from text. By applying a learning algorithm to parsed text, we have developed methods that can automatically identify the concepts in the text and the relations between them. For example, reading the phrase "heavy water rich in the doubly heavy hydrogen atom called deuterium", our algorithm learns (and adds to its semantic network) the fact that deuterium is a type of atom (Snow et al., 2005). By applying this procedure (and its extensions: Snow et al., 2006; Snow et al., 2007) to large amounts of text, our algorithms automatically acquire hundreds of thousands of items of world knowledge, and use these to produce significantly enhanced versions of WordNet (made freely available online). WordNet (a laboriously hand-coded database) is a major NLP resource, but has proven very expensive to build and maintain manually. By automatically inducing knowledge to add to WordNet, our work provides an even greater NLP resource (e.g., significantly greater precision/recall in identifying various relations), at a tiny fraction of the cost.
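To make the deuterium example concrete, here is a minimal sketch of how one lexical pattern can yield an is-a fact. It is not the published method, which learns many patterns from parsed text. The single regular expression and the function name `extract_hypernym_pairs` are assumptions made purely for illustration.

```python
import re

# Hypothetical single pattern: "<hypernym> called <hyponym>".
# The published work learns patterns from parsed text; this regex only
# illustrates the shape of the fact that gets extracted.
CALLED_PATTERN = re.compile(r"\b(\w+) called (\w+)\b")


def extract_hypernym_pairs(text):
    """Return (hyponym, hypernym) pairs matched by the 'X called Y' pattern."""
    return [(m.group(2), m.group(1)) for m in CALLED_PATTERN.finditer(text)]


phrase = "heavy water rich in the doubly heavy hydrogen atom called deuterium"
pairs = extract_hypernym_pairs(phrase)
print(pairs)  # [('deuterium', 'atom')]
```

A hand-written pattern like this is brittle, which is presumably why the approach described above learns its patterns from data rather than listing them manually.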
The following lexical resources (and the method of their construction) are described in Semantic Taxonomy Induction from Heterogenous Evidence (ACL-06). They are automatically augmented versions of WordNet 2.1 (available at http://wordnet.princeton.edu).
WN 2.1 +10000 synsets: WN_plus10k.tgz
WN 2.1 +20000 synsets: WN_plus20k.tgz
WN 2.1 +30000 synsets: WN_plus30k.tgz
WN 2.1 +40000 synsets: WN_plus40k.tgz
WN 2.1 +400000 synsets (cropped for ./wn compatibility): wn400k_cropped.tgz
WN 2.1 +400000 synsets (all, won't work with ./wn binary): wn400k_all.tgz
Each tarball contains a full set of the data files used by the standard wn and wnb executables distributed in the standard WordNet 2.1 package, as well as by common interfaces such as WordNet::QueryData and Lingua::WordNet. To install an extended WordNet, copy the data files included in one of the above tarballs into the dict/ folder of a preexisting installation of WordNet 2.1.
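The installation step can be scripted. The sketch below is one possible way to do it, under stated assumptions: the internal layout of the tarballs is not documented here, so the code copies every regular file in the archive into dict/ by base name, and it optionally backs up any file it is about to overwrite. The function name `install_extended_wordnet` and the `backup_dir` parameter are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def install_extended_wordnet(tarball, dict_dir, backup_dir=None):
    """Copy every regular file in `tarball` into `dict_dir`.

    Assumption: the archive may nest its files in a subfolder, so each
    member is written under its base name only. If `backup_dir` is given,
    any file about to be overwritten is copied there first.
    Returns the sorted list of file names written.
    """
    dict_dir = Path(dict_dir)
    if not dict_dir.is_dir():
        raise FileNotFoundError(f"WordNet dict folder not found: {dict_dir}")

    copied = []
    with tarfile.open(tarball) as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and special files
            name = Path(member.name).name
            target = dict_dir / name
            if backup_dir is not None and target.exists():
                Path(backup_dir).mkdir(parents=True, exist_ok=True)
                shutil.copy2(target, Path(backup_dir) / name)
            # Stream the member straight into dict/ without unpacking
            # the whole archive to disk first.
            with archive.extractfile(member) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            copied.append(name)
    return sorted(copied)
```

For example, `install_extended_wordnet("WN_plus10k.tgz", "/path/to/WordNet-2.1/dict", backup_dir="dict_backup")` would install the +10000 version, where both paths are placeholders for your own installation.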
Due to restrictions in the WordNet format (i.e., a synset is limited to having at most 999 direct links, and character set restrictions), only a subset of the links discussed in our paper are added in the above extended WordNets. To receive a full version of our extended WordNets in the BerkeleyDB format, please e-mail email@example.com.
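The 999-link limit follows from the data file layout, where the pointer count is stored as a three-digit decimal field. The sketch below shows how one might read that count from a single line of a data file, assuming the field order given in the wndb(5WN) man page (offset, lexicographer file number, part of speech, hexadecimal word count, word/lex_id pairs, then the pointer count). The sample line is constructed for illustration and is not taken from the actual database; the helper names are hypothetical.

```python
MAX_POINTERS = 999  # the pointer count field holds three decimal digits


def pointer_count(data_line):
    """Return the number of pointers (direct links) recorded on a data line."""
    # Everything after the first '|' is the gloss and is ignored here.
    fields = data_line.split("|", 1)[0].split()
    word_count = int(fields[3], 16)  # w_cnt is a two-digit hexadecimal number
    # Skip the four leading fields and the (word, lex_id) pairs.
    return int(fields[4 + 2 * word_count])


def fits_pointer_limit(data_line, extra_links):
    """Check whether `extra_links` more pointers would still fit in the field."""
    return pointer_count(data_line) + extra_links <= MAX_POINTERS


# Constructed sample line (made-up offsets), not real WordNet data.
SAMPLE = (
    "00000001 03 n 02 deuterium 0 heavy_hydrogen 0 "
    "001 @ 00000002 n 0000 | illustrative gloss"
)

print(pointer_count(SAMPLE))            # 1
print(fits_pointer_limit(SAMPLE, 998))  # True
print(fits_pointer_limit(SAMPLE, 999))  # False
```

A check like this indicates which synsets would lose links when an induced taxonomy is written back into the standard format.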
The following lexical resources (and the method of their construction) are described in Learning to Merge Word Senses (EMNLP-07). They are automatically sense-clustered versions of WordNet 2.1 (available from http://wordnet.princeton.edu).
WN 2.1 -1091 synsets: senseclusteredWN_1.tgz
WN 2.1 -2783 synsets: senseclusteredWN_0.5.tgz
WN 2.1 -3100 synsets: senseclusteredWN_0.tgz
WN 2.1 -3584 synsets: senseclusteredWN_-0.5.tgz
WN 2.1 -9868 synsets: senseclusteredWN_-0.9.tgz
WN 2.1 -12619 synsets: senseclusteredWN_-0.95.tgz
WN 2.1 -19370 synsets: senseclusteredWN_-1.tgz
WN 2.1 -32065 synsets: senseclusteredWN_-1.5.tgz
- Rion Snow, Sushant Prakash, and Andrew Y. Ng, "Learning to Merge Word Senses". EMNLP 2007. [pdf]
- Rion Snow, Daniel Jurafsky, and Andrew Y. Ng, "Semantic Taxonomy Induction from Heterogenous Evidence". COLING/ACL 2006. Received Best Paper Award. [pdf]
- Rion Snow, Daniel Jurafsky, and Andrew Y. Ng, "Learning Syntactic Patterns for Automatic Hypernym Discovery". NIPS 17, 2005. [pdf]
- Sharon Caraballo, "Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text". Brown University Ph.D. Thesis, 2001. [ps]
- Marti Hearst, "Automatic Acquisition of Hyponyms from Large Text Corpora". COLING 1992. [pdf]
- George Miller, "WordNet: A Lexical Database for English". Communications of the ACM, 1995. [pdf] [WordNet homepage]