Introduction
The Wordnet synonym database can be converted into a Lucene index, which allows for rapid synonym lookup. User query terms can then be expanded with these synonym sets as a way of boosting recall. Query expansion is a well-known technique for improving retrieval performance in Information Retrieval systems.
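As a rough illustration of query expansion, the sketch below adds the synonyms of each query term to the query. The `QueryExpansionSketch` class and its in-memory synonym map are hypothetical stand-ins for a real synonym lookup; they are not part of Lucene or Wordnet.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: expands query terms using an in-memory synonym map. */
class QueryExpansionSketch {

    /**
     * Returns each (lower-cased) query term followed by its synonyms.
     * Duplicates are dropped and insertion order is preserved.
     */
    static Set<String> expand(List<String> terms, Map<String, List<String>> synonyms) {
        Set<String> expanded = new LinkedHashSet<String>();
        for (String term : terms) {
            String key = term.toLowerCase(Locale.ROOT);
            expanded.add(key);
            List<String> found = synonyms.get(key);
            if (found != null) {
                expanded.addAll(found);
            }
        }
        return expanded;
    }
}
```

A term with no entry in the map is passed through unchanged, so expansion can only widen the query, never narrow it.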
[Screenshot: the online Wordnet search]
This article covers building and querying a Lucene index of Wordnet synonyms, with the goal of enabling computationally efficient, thread-safe synonym lookups at runtime.
Building the Index
To build the index, you can call Syns2Index either directly from the command line or from within another Java class.
I prefer to use a simple wrapper:
    public static void main(String... args) throws Throwable {
        // Arguments: the Wordnet Prolog file, then the directory to write the index to
        Syns2Index.main(new String[] {
                "wn_s.pl",
                Constants.WORDNET_SYNONYMS_DIR.getAbsolutePath() });
    }
The referenced "wn_s.pl" file is part of the Wordnet Prolog distribution (see the references below).
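To give a feel for what the Prolog file contains, here is a conceptual sketch that derives synonyms from sense lines of the form `s(synset_id,w_num,'word',ss_type,sense_number,tag_count).`. It is not the actual Syns2Index implementation: the `PrologSynonymSketch` class is hypothetical, it keeps every word it finds, and it simply treats words that share a synset id as synonyms of each other.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: groups the words in wn_s.pl sense lines by synset. */
class PrologSynonymSketch {

    // Captures the synset id and the quoted word; the trailing tag count is optional.
    private static final Pattern SENSE =
            Pattern.compile("^s\\((\\d+),\\d+,'(.*)',[a-z],\\d+(?:,\\d+)?\\)\\.$");

    static Map<String, Set<String>> synonymsByWord(List<String> lines) {
        // Pass 1: collect the words that belong to each synset id.
        Map<String, List<String>> wordsBySynset = new LinkedHashMap<String, List<String>>();
        for (String line : lines) {
            Matcher m = SENSE.matcher(line.trim());
            if (!m.matches()) {
                continue; // skip anything that is not a sense line
            }
            // Prolog escapes a single quote inside an atom by doubling it.
            String word = m.group(2).replace("''", "'").toLowerCase(Locale.ROOT);
            List<String> words = wordsBySynset.get(m.group(1));
            if (words == null) {
                words = new ArrayList<String>();
                wordsBySynset.put(m.group(1), words);
            }
            words.add(word);
        }

        // Pass 2: a word's synonyms are the other members of its synsets.
        Map<String, Set<String>> result = new TreeMap<String, Set<String>>();
        for (List<String> words : wordsBySynset.values()) {
            for (String word : words) {
                Set<String> syns = result.get(word);
                if (syns == null) {
                    syns = new TreeSet<String>();
                    result.put(word, syns);
                }
                for (String other : words) {
                    if (!other.equals(word)) {
                        syns.add(other);
                    }
                }
            }
        }
        return result;
    }
}
```

A word that appears in several synsets collects the members of all of them, which is why a lookup can return synonyms from more than one sense of the word.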
Querying the Index
To query the index:
    package com.mycompany.wordnet.synonyms.svc;

    import java.io.IOException;
    import java.util.Collection;
    import java.util.Set;
    import java.util.TreeSet;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TotalHitCountCollector;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.wordnet.Syns2Index;

    public class QueryIndex {

        public static LogManager logger = new LogManager(QueryIndex.class);

        private FSDirectory directory;

        private IndexSearcher searcher;

        public QueryIndex() throws BusinessException {
            init();
        }

        /** Releases the searcher and the underlying directory. */
        public void close() throws BusinessException {
            try {

                if (null != searcher) searcher.close();
                if (null != directory) directory.close();

                this.searcher = null;
                this.directory = null;

            } catch (IOException e) {
                logger.error(e);
                throw new BusinessException("Unable to shutdown Lucene Index");
            }
        }

        /** Opens the index directory written by Syns2Index. */
        private void init() throws BusinessException {
            try {

                directory = FSDirectory.open(Constants.WORDNET_SYNONYMS_DIR);
                searcher = new IndexSearcher(directory);

            } catch (Exception e) {
                logger.error(e);
                throw new BusinessException("Unable to open Lucene Directory (path = %s)", Constants.WORDNET_SYNONYMS_DIR.getAbsolutePath());
            }
        }

        /** Returns the sorted set of synonyms stored for the given term. */
        public Collection<String> process(String term) throws BusinessException {
            try {

                // Re-open the index if close() was called earlier
                if (null == directory || null == searcher) init();

                // Exact match on the "word" field; first pass only counts the hits
                Query query = new TermQuery(new Term(Syns2Index.F_WORD, term));
                TotalHitCountCollector thcc = new TotalHitCountCollector();
                searcher.search(query, thcc);

                Set<String> results = new TreeSet<String>();

                // Second pass: collect the synonym values of the top 10 matching documents
                ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
                for (ScoreDoc hit : hits) {
                    Document doc = searcher.doc(hit.doc);
                    String[] values = doc.getValues(Syns2Index.F_SYN);
                    results.addAll(SetUtils.toSet(values));
                }

                if (0 == thcc.getTotalHits()) logger.debug("No Results Found (term = %s)", term);
                else logger.info("Synonyms Found (term = %s, total = %s, list = %s)", term, results.size(), SetUtils.toString(results, ", "));

                return results;

            } catch (IOException e) {
                logger.error(e);
                throw new BusinessException("Unable to Execute Query (term = %s)", term);
            }
        }
    }

Note: Imports for the project-specific helper classes (LogManager, BusinessException, Constants and SetUtils) have been omitted for readability.
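The SetUtils helper used in the listing is not shown in this article. A minimal stand-in might look like the sketch below, assuming (from the way it is called) that `toSet` wraps an array in a set and `toString` joins the elements with a delimiter. The class name `SetUtilsSketch` is hypothetical and the real helper may behave differently.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for the SetUtils helper, inferred from its call sites. */
final class SetUtilsSketch {

    private SetUtilsSketch() {
        // static helpers only
    }

    /** Copies the array into a set, dropping duplicates; a null array yields an empty set. */
    static Set<String> toSet(String[] values) {
        Set<String> set = new LinkedHashSet<String>();
        if (values != null) {
            set.addAll(Arrays.asList(values));
        }
        return set;
    }

    /** Joins the values in iteration order, separated by the delimiter. */
    static String toString(Collection<String> values, String delimiter) {
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = values.iterator();
        while (it.hasNext()) {
            sb.append(it.next());
            if (it.hasNext()) {
                sb.append(delimiter);
            }
        }
        return sb.toString();
    }
}
```

Because the query class collects its results in a TreeSet, joining them this way produces an alphabetically ordered list.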
LUKE
LUKE is a handy development and diagnostic tool.
LUKE can be used to open existing Lucene indexes and to display and modify their content.
I've connected LUKE to the Wordnet index and can run queries graphically.
[Screenshot: LUKE connected to the Wordnet index]
Usage and Test Case
    @Test
    public void run() throws Throwable {

        QueryIndex queryIndex = new QueryIndex();
        assertNotNull(queryIndex);

        Collection<String> list = queryIndex.process("automobile");
        assertNotNull(list);
        assertFalse(list.isEmpty());

        assertEquals("auto, car, machine, motorcar", SetUtils.toString(list, ", "));

        queryIndex.close();
    }
Note that the test case output is identical to the screenshot captured from the online synonym lookup on the official Wordnet site.
Build Environment
I use Maven, and my POM looks something like this:
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

        <modelVersion>4.0.0</modelVersion>
        <groupId>com.mycompany.wordnet.synonyms</groupId>
        <artifactId>wordnet-synonyms</artifactId>
        <version>1.0.0</version>
        <packaging>jar</packaging>
        <url />
        <name>Wordnet Synonyms</name>
        <inceptionYear>2015</inceptionYear>
        <description>Wordnet Synonym Lookup via Lucene Index</description>

        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.2</version>
                    <configuration>
                        <source>1.7</source>
                        <target>1.7</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>

        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
            <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        </properties>

        <dependencies>
            <dependency>
                <groupId>org.apache.lucene</groupId>
                <artifactId>lucene-wordnet</artifactId>
                <version>3.3.0</version>
            </dependency>
        </dependencies>

    </project>
References
- [MvnRepository] Lucene Wordnet 3.3.0
  - Also contains the JAR file if you do not plan to use Maven.
  - This dependency was last updated Jun 26, 2011.
- [Princeton.edu] Wordnet Main Site
  - Prolog Distribution (deep link)
    - Unpack the archive, extract the "wn_s.pl" Prolog file, and make it available to the index builder (Syns2Index).
  - Online Synonym Search
    - Used to derive the screenshot at the beginning of this article.
- Query Expansion:
  - Wikipedia Entry, Stanford NLP
    - Use of a thesaurus can be combined with ideas of term weighting: for instance, one might weight added terms less than original query terms.
- [Google Code] LUKE (Lucene Analyzer)
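The term-weighting idea mentioned in the references above can be sketched as follows: the original term keeps its default weight, while each added synonym gets a lower boost using the `^` operator of the Lucene query parser syntax. The `WeightedExpansionSketch` class and the boost value are hypothetical, and the sketch assumes single-word terms without special characters (anything else would need quoting or escaping).

```java
import java.util.Collection;

/** Hypothetical sketch: builds a query string in which synonyms weigh less than the original term. */
class WeightedExpansionSketch {

    /**
     * Joins the original term and its synonyms with OR.
     * Each synonym is suffixed with "^boost" so that it contributes less to the score.
     */
    static String toQueryString(String term, Collection<String> synonyms, double synonymBoost) {
        StringBuilder sb = new StringBuilder(term);
        for (String synonym : synonyms) {
            if (synonym.equals(term)) {
                continue; // do not down-weight the original term
            }
            sb.append(" OR ").append(synonym).append('^').append(synonymBoost);
        }
        return sb.toString();
    }
}
```

For example, calling `toQueryString("automobile", Arrays.asList("auto", "car"), 0.5)` yields `automobile OR auto^0.5 OR car^0.5`, which could then be handed to a query parser.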