The Wordnet synonym database can be converted into a Lucene index, which allows for rapid synonym lookup. User query terms can then be expanded with these synonym sets as a way of boosting recall. Query expansion is a well-known technique for improving retrieval performance in Information Retrieval systems.
The online Wordnet search:
This article covers building and querying a Lucene index of Wordnet synonyms, with the goal of enabling computationally efficient, thread-safe synonym lookups at runtime.
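To make the idea of query expansion concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the `QueryExpander` class and the in-memory synonym map are hypothetical stand-ins for the Wordnet index built later in this article, and the toy synonyms are for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of Lucene): expands query terms with synonyms
// taken from an in-memory map that stands in for the Wordnet index.
class QueryExpander {

    // Returns each lower-cased term followed by its synonyms,
    // dropping duplicates and keeping first-seen order.
    static List<String> expand(List<String> terms, Map<String, List<String>> synonyms) {
        Set<String> expanded = new LinkedHashSet<>();
        for (String term : terms) {
            String key = term.toLowerCase();
            expanded.add(key);
            expanded.addAll(synonyms.getOrDefault(key, Collections.<String>emptyList()));
        }
        return new ArrayList<>(expanded);
    }

    public static void main(String[] args) {
        // Toy data for illustration only.
        Map<String, List<String>> toy = new HashMap<>();
        toy.put("car", Arrays.asList("auto", "automobile"));
        System.out.println(expand(Arrays.asList("red", "car"), toy));
    }
}
```

A document that mentions only "automobile" can now match a query for "car", which is where the recall gain comes from. The cost is that loosely related synonyms can reduce precision.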
Building the Index
To build the index, you can call Syns2Index either directly from the command line or from another Java class.
I prefer to use a simple wrapper:
The referenced "pl" file is the Wordnet Prolog Distribution (see references below).
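For readers who have not opened the Prolog file, each synonym fact in "wn_s.pl" has the form `s(synset_id,w_num,'word',ss_type,sense_number,tag_count).`, and words that share a synset id are synonyms. The sketch below is a plain-Java illustration of that grouping, not the code Syns2Index actually runs. The `PrologSynsetParser` class is hypothetical, and the synset ids in the demo are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of how wn_s.pl facts map to synonym groups.
class PrologSynsetParser {

    // Matches one fact of the form s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
    private static final Pattern FACT =
        Pattern.compile("^s\\((\\d+),\\d+,'(.*)',[a-z],\\d+,\\d+\\)\\.$");

    // Groups lower-cased words by synset id; lines that are not s(...) facts are skipped.
    static Map<String, List<String>> groupBySynset(List<String> lines) {
        Map<String, List<String>> synsets = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = FACT.matcher(line.trim());
            if (!m.matches()) {
                continue;
            }
            // Prolog escapes a single quote inside an atom by doubling it.
            String word = m.group(2).replace("''", "'").toLowerCase();
            synsets.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(word);
        }
        return synsets;
    }

    // Returns every other word that shares at least one synset with the given word.
    static Set<String> synonymsOf(String word, Map<String, List<String>> synsets) {
        String key = word.toLowerCase();
        Set<String> result = new LinkedHashSet<>();
        for (List<String> members : synsets.values()) {
            if (members.contains(key)) {
                result.addAll(members);
            }
        }
        result.remove(key);
        return result;
    }

    public static void main(String[] args) {
        // Made-up synset ids, for illustration only.
        List<String> lines = Arrays.asList(
            "s(100000001,1,'car',n,1,0).",
            "s(100000001,2,'auto',n,1,0).",
            "s(100000002,1,'car',n,2,0).",
            "s(100000002,2,'railcar',n,1,0).");
        System.out.println(synonymsOf("car", groupBySynset(lines)));
    }
}
```

Holding every synset in memory like this is fine for a demo. The point of the Lucene index is to avoid doing so at runtime.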
Querying the Index
Note: Imports have been removed for readability.
LUKE is a handy development and diagnostic tool.
LUKE can open pre-existing Lucene indexes and display or modify their content.
I've connected LUKE to the Wordnet index, and can perform graphical queries:
Usage and Test Case
We note that the test case output is identical to the results in the screenshot captured from the online synonym lookup on the official Wordnet site.
I use Maven, and my POM looks something like this:
References
- [MvnRepository] Lucene Wordnet 3.3.0
  - Also contains the JAR file if you do not plan to use Maven.
  - This dependency was last updated Jun 26, 2011.
- [Princeton.edu] Wordnet Main Site
  - Prolog Distribution (deep link)
    - Unpack the tar file, extract the "wn_s.pl" Prolog file, and make it available to the QueryBuilder.
  - Online Synonym Search
    - Used to derive the screenshot at the beginning of this article.
- Query Expansion:
  - Wikipedia Entry, Stanford NLP
    - "Use of a thesaurus can be combined with ideas of term weighting: for instance, one might weight added terms less than original query terms."
- [Google Code] LUKE (Lucene Analyzer)
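The term-weighting idea quoted in the query expansion reference can be sketched with Lucene's classic query syntax, where `term^0.5` lowers a term's boost. The `WeightedExpansion` class, the in-memory synonym map, and the 0.5 boost are all assumptions for illustration. The sketch only builds a query string and does not escape query-parser special characters.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds a query string in which added synonyms
// carry a lower boost than the user's original terms.
class WeightedExpansion {

    static String toBoostedQuery(List<String> terms,
                                 Map<String, List<String>> synonyms,
                                 double synonymBoost) {
        String boost = "^" + String.format(Locale.ROOT, "%.1f", synonymBoost);
        StringJoiner query = new StringJoiner(" ");
        for (String term : terms) {
            // Original terms keep the default boost of 1.0.
            query.add(quoteIfPhrase(term));
            for (String syn : synonyms.getOrDefault(term, Collections.<String>emptyList())) {
                query.add(quoteIfPhrase(syn) + boost);
            }
        }
        return query.toString();
    }

    // Multi-word synonyms are wrapped in quotes so they are treated as phrases.
    private static String quoteIfPhrase(String s) {
        return s.contains(" ") ? "\"" + s + "\"" : s;
    }

    public static void main(String[] args) {
        // Toy data for illustration only.
        Map<String, List<String>> toy = new HashMap<>();
        toy.put("car", Arrays.asList("auto", "motor car"));
        System.out.println(toBoostedQuery(Arrays.asList("fast", "car"), toy, 0.5));
    }
}
```

The resulting string can be handed to a query parser, so documents matching the original terms rank above those that match only through a synonym.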