Tuesday, March 17, 2015

Query Expansion with Wordnet

Introduction


The Wordnet synonym database can be converted into a Lucene index, which allows rapid synonym lookup. User query terms can then be expanded with these synonym sets to boost recall. Query expansion is a well-known technique for improving retrieval performance in Information Retrieval systems.

The online Wordnet search:


This article covers the building and querying of a Lucene search index of Wordnet synonyms for the purpose of enabling computationally-efficient and thread-safe synonym lookups at runtime.
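Before getting to Lucene, here is a minimal sketch of the expansion idea itself. It is only an illustration: the class name SimpleQueryExpansion and the in-memory map are my own stand-ins for the synonym index, not part of Lucene or Wordnet.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: an in-memory map stands in for the synonym index.
final class SimpleQueryExpansion {

    /** Returns the original terms followed by any synonyms found for them. */
    static Set<String> expand(List<String> terms, Map<String, List<String>> synonyms) {
        // LinkedHashSet keeps insertion order and drops duplicates.
        Set<String> expanded = new LinkedHashSet<String>(terms);
        for (String term : terms) {
            List<String> found = synonyms.get(term);
            if (found != null) {
                expanded.addAll(found);
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<String, List<String>>();
        synonyms.put("automobile", Arrays.asList("auto", "car", "machine", "motorcar"));

        // prints [automobile, repair, auto, car, machine, motorcar]
        System.out.println(expand(Arrays.asList("automobile", "repair"), synonyms));
    }
}
```

A document that only mentions "car" would now match a query for "automobile", which is where the recall gain comes from.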


Building the Index


To build the index, you can either call Syns2Index directly from the command line or invoke it from another Java class.

I prefer to use a simple wrapper:
public static void main(String... args) throws Throwable {
 Syns2Index.main(new String[] { "wn_s.pl", Constants.WORDNET_SYNONYMS_DIR.getAbsolutePath() });
}


The referenced "wn_s.pl" file comes from the Wordnet Prolog Distribution (see references below).
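To give a feel for what the Prolog file contains, here is a rough sketch that parses a few facts and groups words by synset. As far as I know, each line has the form s(synset_id, w_num, 'word', ss_type, sense_number, tag_count), with a single quote inside a word written as two single quotes. The class name PrologSynsetSketch and the synset ids in the sample are made up for illustration, and this is not how Syns2Index itself is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of reading synonym facts; not the Syns2Index implementation.
final class PrologSynsetSketch {

    // Assumed fact layout: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
    private static final Pattern FACT =
            Pattern.compile("^s\\((\\d+),(\\d+),'(.*)',([a-z]),(\\d+),(\\d+)\\)\\.$");

    /** Groups words by synset id, preserving the order of the input lines. */
    static Map<String, List<String>> groupBySynset(List<String> lines) {
        Map<String, List<String>> synsets = new LinkedHashMap<String, List<String>>();
        for (String line : lines) {
            Matcher m = FACT.matcher(line.trim());
            if (!m.matches()) {
                continue; // skip anything that does not look like a synonym fact
            }
            String word = m.group(3).replace("''", "'"); // undo Prolog quote doubling
            List<String> words = synsets.get(m.group(1));
            if (words == null) {
                words = new ArrayList<String>();
                synsets.put(m.group(1), words);
            }
            words.add(word);
        }
        return synsets;
    }

    /** Collects every word that shares at least one synset with the given word. */
    static Set<String> synonymsOf(String word, Map<String, List<String>> synsets) {
        Set<String> result = new TreeSet<String>();
        for (List<String> words : synsets.values()) {
            if (words.contains(word)) {
                result.addAll(words);
            }
        }
        result.remove(word);
        return result;
    }

    public static void main(String[] args) {
        // Sample facts with made-up synset ids and tag counts.
        List<String> lines = Arrays.asList(
                "s(100000001,1,'car',n,1,0).",
                "s(100000001,2,'auto',n,1,0).",
                "s(100000001,3,'automobile',n,1,0).",
                "s(100000001,4,'machine',n,6,0).",
                "s(100000001,5,'motorcar',n,1,0).");

        // prints [auto, car, machine, motorcar]
        System.out.println(synonymsOf("automobile", groupBySynset(lines)));
    }
}
```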


Querying the Index


To query the index, I use the following class:
package com.mycompany.wordnet.synonyms.svc;

public class QueryIndex {

 public static LogManager logger = new LogManager(QueryIndex.class);

 private FSDirectory  directory;

 private IndexSearcher  searcher;

 public QueryIndex() throws BusinessException {
  init();
 }

 public void close() throws BusinessException {
  try {

   if (null != searcher) searcher.close();
   if (null != directory) directory.close();

   this.searcher = null;
   this.directory = null;

  } catch (IOException e) {
   logger.error(e);
   throw new BusinessException("Unable to shutdown Lucene Index");
  }
 }

 private void init() throws BusinessException {
  try {

   directory = FSDirectory.open(Constants.WORDNET_SYNONYMS_DIR);
   searcher = new IndexSearcher(directory);

  } catch (Exception e) {
   logger.error(e);
   throw new BusinessException("Unable to open Lucene Directory (path = %s)", Constants.WORDNET_SYNONYMS_DIR.getAbsolutePath());
  }
 }

 public Collection<String> process(String term) throws BusinessException {
  try {

   if (null == directory || null == searcher) init();
   Query query = new TermQuery(new Term(Syns2Index.F_WORD, term));

   TotalHitCountCollector thcc = new TotalHitCountCollector();
   searcher.search(query, thcc);

   Set<String> results = new TreeSet<String>();
   ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;

   for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);

    String[] values = doc.getValues(Syns2Index.F_SYN);
    results.addAll(SetUtils.toSet(values));
   }

   if (0 == thcc.getTotalHits()) logger.debug("No Results Found (term = %s)", term);
   else logger.info("Synonyms Found (term = %s, total = %s, list = %s)", term, results.size(), SetUtils.toString(results, ", "));

   return results;

  } catch (IOException e) {
   logger.error(e);
   throw new BusinessException("Unable to Execute Query (term = %s)", term);
  }
 }
}

Note: Imports have been removed for increased readability.


LUKE


LUKE is a handy development and diagnostic tool.

LUKE can open pre-existing Lucene indexes to display and modify their content.

I've connected LUKE to the Wordnet index, and can perform graphical queries:




Usage and Test Case



@Test
public void run() throws Throwable {

 QueryIndex queryIndex = new QueryIndex();
 assertNotNull(queryIndex);

 Collection<String> list = queryIndex.process("automobile");
 assertNotNull(list);
 assertFalse(list.isEmpty());
 assertEquals("auto, car, machine, motorcar", SetUtils.toString(list, ", "));

 queryIndex.close();
}

We note that the test case output is identical to the screenshot captured from the online synonym lookup on the official Wordnet site.
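Once synonyms can be looked up, they can be folded back into the user's query. The Stanford NLP reference listed at the end of this article suggests weighting added terms less than the original ones, so here is a rough sketch that builds a query string in Lucene's classic query syntax (OR plus a ^boost suffix). The SynonymLookup interface is a hypothetical stand-in for QueryIndex.process, the class name and the 0.5 boost are arbitrary choices of mine, and the sketch assumes single-word terms that need no quoting or escaping.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Locale;

// Hypothetical sketch: expands one term into a boosted OR query string.
final class BoostedExpansionSketch {

    /** Stand-in for QueryIndex.process(term); any synonym source would do. */
    interface SynonymLookup {
        Collection<String> lookup(String term);
    }

    /** Keeps the original term at full weight and down-weights each synonym. */
    static String expand(String term, SynonymLookup lookup, float synonymBoost) {
        String boost = String.format(Locale.ROOT, "%.1f", synonymBoost);
        StringBuilder query = new StringBuilder(term);
        for (String synonym : lookup.lookup(term)) {
            query.append(" OR ").append(synonym).append('^').append(boost);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Canned results that mirror the test case above.
        SynonymLookup canned = new SynonymLookup() {
            @Override
            public Collection<String> lookup(String term) {
                return Arrays.asList("auto", "car", "machine", "motorcar");
            }
        };

        // prints automobile OR auto^0.5 OR car^0.5 OR machine^0.5 OR motorcar^0.5
        System.out.println(expand("automobile", canned, 0.5f));
    }
}
```

The resulting string could then be handed to a query parser. Whether a boost of 0.5 is sensible depends on the collection, so treat it as a starting point for tuning.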


Build Environment


I use Maven, and my POM looks something like this:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
 <modelVersion>4.0.0</modelVersion>

 <groupId>com.mycompany.wordnet.synonyms</groupId>
 <artifactId>wordnet-synonyms</artifactId>
 <version>1.0.0</version>
 <packaging>jar</packaging>

 <url />
 <name>Wordnet Synonyms</name>
 <inceptionYear>2015</inceptionYear>
 <description>Wordnet Synonym Lookup via Lucene Index</description>

 <build>
  <plugins>
   <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>3.2</version>
    <configuration>
     <source>1.7</source>
     <target>1.7</target>
    </configuration>
   </plugin>
  </plugins>
 </build>

 <properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
 </properties>

 <dependencies>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-wordnet</artifactId>
   <version>3.3.0</version>
  </dependency>
 </dependencies>

</project>




References

  1. [MvnRepository] Lucene Wordnet 3.3.0
    1. Also contains the JAR file if you do not plan to use Maven
    2. This dependency was last updated Jun 26, 2011.
  2. [Princeton.edu] Wordnet Main Site
    1. Prolog Distribution (deep link)
      1. Extract the archive, locate the "wn_s.pl" Prolog file, and make it available to the index builder (Syns2Index).
    2. Online Synonym Search
      1. Used to derive the screenshot at the beginning of this article.
  3. Query Expansion:
    1. Wikipedia Entry, Stanford NLP
      1. Use of a thesaurus can be combined with ideas of term weighting: for instance, one might weight added terms less than original query terms.
  4. [Google Code] LUKE (Lucene Analyzer)
