Tuesday, March 17, 2015

Query Expansion with Wordnet

Introduction


The Wordnet synonym database can be converted into a Lucene index, which allows for rapid synonym lookup. User query terms can then be expanded with these synonym sets as a way of boosting recall.  Query expansion is a well-known technique for improving retrieval performance in Information Retrieval systems.

The online Wordnet search:


This article covers the building and querying of a Lucene search index of Wordnet synonyms for the purpose of enabling computationally-efficient and thread-safe synonym lookups at runtime.
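To make the idea of query expansion concrete, here is a minimal sketch that does not touch Lucene at all. It assumes the synonyms are already available in an in-memory map and builds a query string in Lucene's classic query syntax, where a `^` suffix boosts a term. The class name `QueryExpansionSketch`, the `expand` method and the boost of 0.5 are all hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: expands query terms with synonyms taken from an in-memory map. */
final class QueryExpansionSketch {

    /**
     * Builds a space-separated query string. Original terms are kept as typed;
     * each synonym is appended with a "^boost" suffix so that it can be weighted
     * lower than the term the user actually entered.
     */
    static String expand(List<String> terms, Map<String, List<String>> synonyms, float synonymBoost) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            clauses.add(term);
            for (String synonym : synonyms.getOrDefault(term, Collections.<String>emptyList())) {
                clauses.add(String.format(Locale.ROOT, "%s^%.1f", synonym, synonymBoost));
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new LinkedHashMap<>();
        synonyms.put("automobile", Arrays.asList("auto", "car", "machine", "motorcar"));

        // prints: red automobile auto^0.5 car^0.5 machine^0.5 motorcar^0.5
        System.out.println(expand(Arrays.asList("red", "automobile"), synonyms, 0.5f));
    }
}
```

Multi-word synonyms would need to be quoted as phrases before being boosted; the sketch leaves that out to stay short.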


Building the Index


To build the index, you can either call Syns2Index directly from the command line or invoke it from another Java class.

I prefer to use a simple wrapper:
public static void main(String... args) throws Throwable {
 Syns2Index.main(new String[] { "wn_s.pl", Constants.WORDNET_SYNONYMS_DIR.getAbsolutePath() });
}


The referenced "pl" file is the Wordnet Prolog Distribution (see references below).
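For readers curious about what the Prolog file contains, the sketch below parses facts of the form `s(synset_id,w_num,'word',ss_type,sense_number,tag_count).` and groups words that share a synset. This is only an illustration of the file format, not how Syns2Index is implemented; the class name `PrologSynonymSketch` is hypothetical, and the synset ids and counts in the sample lines are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for "s" facts; not the Syns2Index implementation. */
final class PrologSynonymSketch {

    // s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
    private static final Pattern FACT =
            Pattern.compile("^s\\((\\d+),(\\d+),'(.*)',([a-z]),(\\d+),(\\d+)\\)\\.$");

    /** Maps every word to the other words that share at least one synset with it. */
    static Map<String, Set<String>> synonymsByWord(List<String> lines) {
        // First pass: collect the members of each synset.
        Map<String, Set<String>> wordsBySynset = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = FACT.matcher(line.trim());
            if (!m.matches()) continue; // skip anything that is not an "s" fact
            String word = m.group(3).replace("''", "'"); // Prolog escapes ' by doubling it
            wordsBySynset.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(word);
        }
        // Second pass: every member of a synset is a synonym of the other members.
        Map<String, Set<String>> result = new TreeMap<>();
        for (Set<String> synset : wordsBySynset.values()) {
            for (String word : synset) {
                result.computeIfAbsent(word, k -> new TreeSet<>()).addAll(synset);
            }
        }
        // A word is not its own synonym.
        for (Map.Entry<String, Set<String>> entry : result.entrySet()) {
            entry.getValue().remove(entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "s(100000001,1,'car',n,1,0).",
                "s(100000001,2,'auto',n,1,0).",
                "s(100000001,3,'automobile',n,1,0).",
                "s(100000001,4,'machine',n,1,0).",
                "s(100000001,5,'motorcar',n,1,0).");

        // prints: [auto, car, machine, motorcar]
        System.out.println(synonymsByWord(lines).get("automobile"));
    }
}
```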


Querying the Index


To query the index, I use a small class that wraps a Lucene IndexSearcher:
package com.mycompany.wordnet.synonyms.svc;

public class QueryIndex {

 public static LogManager logger = new LogManager(QueryIndex.class);

 private FSDirectory  directory;

 private IndexSearcher  searcher;

 public QueryIndex() throws BusinessException {
  init();
 }

 public void close() throws BusinessException {
  try {

   if (null != searcher) searcher.close();
   if (null != directory) directory.close();

   this.searcher = null;
   this.directory = null;

  } catch (IOException e) {
   logger.error(e);
   throw new BusinessException("Unable to shutdown Lucene Index");
  }
 }

 private void init() throws BusinessException {
  try {

   directory = FSDirectory.open(Constants.WORDNET_SYNONYMS_DIR);
   searcher = new IndexSearcher(directory);

  } catch (Exception e) {
   logger.error(e);
   throw new BusinessException("Unable to open Lucene Directory (path = %s)", Constants.WORDNET_SYNONYMS_DIR.getAbsolutePath());
  }
 }

 public Collection<String> process(String term) throws BusinessException {
  try {

   if (null == directory || null == searcher) init();
   Query query = new TermQuery(new Term(Syns2Index.F_WORD, term));

   // A single search returns both the top hits and the total hit count.
   TopDocs topDocs = searcher.search(query, 10);

   Set<String> results = new TreeSet<String>();

   for (ScoreDoc hit : topDocs.scoreDocs) {
    Document doc = searcher.doc(hit.doc);

    String[] values = doc.getValues(Syns2Index.F_SYN);
    results.addAll(SetUtils.toSet(values));
   }

   if (0 == topDocs.totalHits) logger.debug("No Results Found (term = %s)", term);
   else logger.info("Synonyms Found (term = %s, total = %s, list = %s)", term, results.size(), SetUtils.toString(results, ", "));

   return results;

  } catch (IOException e) {
   logger.error(e);
   throw new BusinessException("Unable to Execute Query (term = %s)", term);
  }
 }
}

Note: Imports have been removed for increased readability.


LUKE


LUKE is a handy development and diagnostic tool.

LUKE can be used to open existing Lucene indexes in order to display and modify their content.

I've connected LUKE to the Wordnet index, and can perform graphical queries:




Usage and Test Case



@Test
public void run() throws Throwable {

 QueryIndex queryIndex = new QueryIndex();
 assertNotNull(queryIndex);

 Collection<String> list = queryIndex.process("automobile");
 assertNotNull(list);
 assertFalse(list.isEmpty());
 assertEquals("auto, car, machine, motorcar", SetUtils.toString(list, ", "));

 queryIndex.close();
}

We note that the test case output is identical to the screenshot captured from the online synonym lookup on the official Wordnet site.


Build Environment


I use Maven, and my POM looks something like this:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
 <modelVersion>4.0.0</modelVersion>

 <groupId>com.mycompany.wordnet.synonyms</groupId>
 <artifactId>wordnet-synonyms</artifactId>
 <version>1.0.0</version>
 <packaging>jar</packaging>

 <url />
 <name>Wordnet Synonyms</name>
 <inceptionYear>2015</inceptionYear>
 <description>Wordnet Synonym Lookup via Lucene Index</description>

 <build>
  <plugins>
   <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>3.2</version>
    <configuration>
     <source>1.7</source>
     <target>1.7</target>
    </configuration>
   </plugin>
  </plugins>
 </build>

 <properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
 </properties>

 <dependencies>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-wordnet</artifactId>
   <version>3.3.0</version>
  </dependency>
 </dependencies>

</project>




References

  1. [MvnRepository] Lucene Wordnet 3.3.0
    1. Also contains the JAR file if you do not plan to use Maven
    2. This dependency was last updated Jun 26, 2011.
  2. [Princeton.edu] Wordnet Main Site
    1. Prolog Distribution (deep link)
      1. Unpack the archive, extract the "wn_s.pl" Prolog file, and make it available to Syns2Index.
    2. Online Synonym Search
      1. Used to derive the screenshot at the beginning of this article.
  3. Query Expansion:
    1. Wikipedia Entry, Stanford NLP
      1. Use of a thesaurus can be combined with ideas of term weighting: for instance, one might weight added terms less than original query terms.
  4. [Google Code] LUKE (Lucene Analyzer)

Monday, March 16, 2015

Querying the Lucene Index

Note: Imports removed for legibility.
package com.ibm.ted.ebear.lucene;

public final class QueryDemo1 {

 public static LogManager logger = new LogManager(QueryDemo1.class);

 private static BooleanQuery createBooleanQuery(Analyzer analyzer) throws ParseException {
  BooleanQuery booleanQuery = new BooleanQuery();

  Query query1 = new QueryParser("content", analyzer).parse(String.format("(%s) AND (%s)", "alpha", "beta"));
  query1.setBoost(0.9f);
  booleanQuery.add(query1, Occur.SHOULD);

  Query query2 = new QueryParser("title", analyzer).parse(String.format("(%s) AND (%s)", "alpha", "beta"));
  query2.setBoost(1.1f);
  booleanQuery.add(query2, Occur.SHOULD);

  return booleanQuery;
 }

 public static void main(String... args) throws Throwable {

  Directory directory = FSDirectory.open(new File("/home/astoruser/lucene/01/"));
  IndexReader reader = DirectoryReader.open(directory);

  Analyzer analyzer = new StandardAnalyzer();
  IndexSearcher searcher = new IndexSearcher(reader);

  TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
  BooleanQuery booleanQuery = createBooleanQuery(analyzer);
  searcher.search(booleanQuery, collector);

  ScoreDoc[] hits = collector.topDocs().scoreDocs;
  for (ScoreDoc hit : hits) {

   Document doc = searcher.doc(hit.doc);
   logger.debug("Result (score = %s, filename = %s)", hit.score, doc.get("filename"));
  }
 }
}

Building the Lucene Index

Introduction


Source documents can either be loaded into Lucene on a nearly as-is basis, or be passed through a custom parser.  Such a parser may be responsible for various forms of pre-processing on the source document and/or removing information that should not be placed within the index.

The output of the parser is a collection of Lucene documents, which are then loaded into the Lucene Index and made available for querying.

Fig 1: Building the Lucene Index

At a minimum, a Lucene developer will need to create a component that returns one or more Lucene Documents given one or more incoming source documents.

The component may leverage existing technology (such as Apache Tika) for transforming incoming document types (PDF, DOC, etc) into plain text representations.  Or, the component may require custom logic for extracting information.

The developer will also need a strategy for reducing the incoming document into a set of Key/Value pairs.  This could be as simple as treating the entire document as a single unit (e.g. key=all, value=<everything>).  It's more likely that multiple Key/Value pairs will be used, including special treatment of document metadata, as a method for enhancing search and the display of search results.


Logical Architecture


  1. The Lucene Document
  2. Fields
    1. String Fields
    2. Text Fields
    3. Custom Fields
    The Lucene API is easy to use and easy to understand. Getting a basic search index up and running is not difficult. The complexity lies in loading and tuning the index and in manipulating the underlying content -- this is both a science and an art form.

    Hierarchy:
    • A Lucene Index contains multiple documents.
    • A Lucene Document contains multiple fields.
    • Each field contains a Key/Value pair.


    Fig 2: The Index Hierarchy

    Hence, the API is a simple, structured hierarchy of Key/Value pairs extracted from incoming source documents.
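The hierarchy above can be modelled in a few lines of plain Java. This is only a sketch to make the containment relationships concrete; `SketchIndex`, `SketchDocument` and `SketchField` are hypothetical types and not Lucene classes, and the lookup is a simple linear scan rather than an inverted index.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the hierarchy; illustrative types, not Lucene classes. */
final class IndexHierarchySketch {

    /** A field holds exactly one Key/Value pair. */
    static final class SketchField {
        final String key;
        final String value;

        SketchField(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** A document holds multiple fields. */
    static final class SketchDocument {
        final List<SketchField> fields = new ArrayList<>();

        SketchDocument add(String key, String value) {
            fields.add(new SketchField(key, value));
            return this;
        }

        /** Returns the value of the first field with the given key, or null if absent. */
        String get(String key) {
            for (SketchField field : fields) {
                if (field.key.equals(key)) return field.value;
            }
            return null;
        }
    }

    /** An index holds multiple documents. */
    static final class SketchIndex {
        final List<SketchDocument> documents = new ArrayList<>();

        void add(SketchDocument document) {
            documents.add(document);
        }

        /** Exact-match lookup by linear scan over the first field with the given key. */
        List<SketchDocument> find(String key, String value) {
            List<SketchDocument> matches = new ArrayList<>();
            for (SketchDocument document : documents) {
                if (value.equals(document.get(key))) matches.add(document);
            }
            return matches;
        }
    }

    public static void main(String[] args) {
        SketchIndex index = new SketchIndex();
        index.add(new SketchDocument().add("id", "1").add("title", "Hamlet"));
        index.add(new SketchDocument().add("id", "2").add("title", "Macbeth"));

        // prints: 2
        System.out.println(index.find("title", "Macbeth").get(0).get("id"));
    }
}
```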

    For the sake of this tutorial, we'll assume that each source document has a corresponding Lucene document.  This isn't always true.  A document could hold a single Key/Value pair, which generally happens when structured data is loaded into a Lucene index: the contents of a customer table, for example, could be represented using a single document for each customer name.

    We'll make the assumption we're dealing with unstructured text, and that for each source document we will create a Lucene document that contains multiple fields, and each field contains a single Key/Value pair.  We'll have to carefully choose the proper type of field to represent our Key/Value entry, depending on how we plan to use that data in our search queries.

    Assuming the code to create the Lucene documents already exists, the latter half of the process depicted in Fig 1 looks like this:
    Collection<Document> docs = ...
    LuceneIndexer indexer = new LuceneIndexer("/home/user/lucene/");
    indexer.add(docs);
    indexer.close();
    


    The only question that needs to be answered at this point is how to create Lucene Document instances from unstructured source data.


    The Lucene Document


    In Lucene, a Document is the unit of search and index. An index consists of one or more Documents. Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.

    A Lucene Document doesn't necessarily have a one-to-one correspondence with an incoming text document, nor does it need to resemble one. If Lucene is being used to index structured data (e.g. a database table of users), then each user would be represented in the index as a Lucene Document.

    A Document consists of one or more Fields. A Field is simply a name-value pair. For example, a Field commonly found in applications is title. In the case of a title Field, the field name is title and the value is the title of that content item. Indexing in Lucene thus involves creating Documents comprising one or more Fields, and adding these Documents to an IndexWriter.

    Given a hypothetical source document, these are some Key/Value pairs I would be interested in:
    String id = getId(inputDocument.getName());
    String title = getTitle(inputDocument.getName());
    String page = String.valueOf(inputDocument.getPage());
    String uri = getUri(inputDocument.getName());
    String filename = inputDocument.getName();
    String content = getContent(inputDocument);
    

    The implementation of these methods is not important.  This is a hypothetical source document, and therefore this data is hypothetical.  There is no implication here that every incoming document will have any or all of these content items (id, title, page, uri, etc).


    Lucene Fields


    Each of these Key/Value pairs will be contained within a Lucene Field.  The type of Lucene Field I choose is important, and will impact how the data can be found during the execution of a Search Query.
    1. Id
      1. This is a numeric identifier that can be used to uniquely identify the document.
      2. It may be useful for correlating the incoming document to documents in other data sources.
    2. Title
      1. The title of the source document
      2. This will be useful for displaying in the final search results back to the user
    3. Page
      1. Given a multi-page document, it is often useful to make each page a separate Lucene Document.  In this case, we'll want to record the exact page number for traceability in the search results.
    4. URI
      1. A uniform resource identifier for the source document, either a network path or a URL.
      2. This is particularly useful in the final search results if you want to give the user a way to examine the underlying source document.
      3. This can also be useful for restricting a query to search or exclude a certain domain.
    5. Filename
      1. The underlying filename of the incoming document.
      2. Useful in the search results, but may not be applicable in the case of a web page.
    6. Content
      1. The actual text for the document.
      2. This could be very large, and that's fine.  
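The "Page" item above suggests one Lucene Document per page. A minimal sketch of that split is shown here, under the assumption that the extracted plain text separates pages with a form-feed character, which is common for some text extractors but by no means guaranteed. The class `PageSplitSketch` and its method are hypothetical, and each page is represented as a plain map of Key/Value pairs rather than a Lucene Document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns one multi-page text into one Key/Value map per page. */
final class PageSplitSketch {

    /** Assumes pages are separated by a form feed ('\f'); page numbers start at 1. */
    static List<Map<String, String>> toPageDocuments(String filename, String text) {
        List<Map<String, String>> documents = new ArrayList<>();
        String[] pages = text.split("\f", -1); // -1 keeps trailing empty pages
        for (int i = 0; i < pages.length; i++) {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("filename", filename);
            fields.put("page", String.valueOf(i + 1));
            fields.put("content", pages[i]);
            documents.add(fields);
        }
        return documents;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = toPageDocuments("report.txt", "first page\fsecond page");

        // prints: {filename=report.txt, page=2, content=second page}
        System.out.println(docs.get(1));
    }
}
```

Each map could then be converted into a Lucene Document, with "page" recorded for traceability in the search results.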

    To get a sense of other fields that could be added to a Lucene Document, I recommend looking at various Ontologies that have been created and used in the industry in the last few years.  Dublin Core is a small set of standard vocabulary terms that can be used to describe web resources.  The W3C PROV Ontology is an interoperable vocabulary for describing how digital entities are influenced by agents, activities or other entities.

    Proper use of known taxonomies and ontologies can provide a standards-based way of extending your Lucene search index and encourage lateral thinking in terms of what data is extracted from the incoming source documents.


    Field Types


    There are two basic Field types:
    1. String Fields
      1. Used for atomic values that should not be tokenized into a set of words for indexing.
      2. Id, URI and Page are all examples of content items that should be placed within a String Field type.
    2. Text Fields
      1. Used for fields that should be tokenized into a set of words.
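The practical difference between the two field types shows up at query time. The sketch below mimics it with a tiny in-memory inverted index: an atomic value is indexed as one term, while a tokenized value is split into words. The class `FieldTypeSketch` is hypothetical, and the lower-casing and splitting rule is a simplifying assumption, not the behaviour of any particular Lucene analyzer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory index contrasting atomic and tokenized fields. */
final class FieldTypeSketch {

    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    /** String-field style: the whole value is indexed as a single term. */
    void addAtomic(int docId, String field, String value) {
        post(docId, field, value);
    }

    /** Text-field style: the value is lower-cased and split into word tokens. */
    void addTokenized(int docId, String field, String value) {
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) post(docId, field, token);
        }
    }

    /** Returns the ids of the documents whose field contains exactly this term. */
    Set<Integer> lookup(String field, String term) {
        return postings.getOrDefault(field, Collections.<String, Set<Integer>>emptyMap())
                .getOrDefault(term, Collections.<Integer>emptySet());
    }

    private void post(int docId, String field, String term) {
        postings.computeIfAbsent(field, k -> new HashMap<>())
                .computeIfAbsent(term, k -> new TreeSet<>())
                .add(docId);
    }

    public static void main(String[] args) {
        FieldTypeSketch index = new FieldTypeSketch();
        index.addAtomic(1, "uri", "http://example.com/docs/report.pdf");
        index.addTokenized(1, "title", "Annual Report 2015");

        System.out.println(index.lookup("title", "report")); // prints: [1]
        System.out.println(index.lookup("uri", "report"));   // prints: []
        System.out.println(index.lookup("uri", "http://example.com/docs/report.pdf")); // prints: [1]
    }
}
```

A single word finds the document through the tokenized title, whereas the atomic URI only matches when the query supplies the complete value.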

    I'm going to implement the above fields like this:
    doc.add(new VectorTextField("content", content, Field.Store.YES));
    doc.add(new VectorTextField("title", title, Field.Store.YES));
    doc.add(new VectorTextField("filename", filename, Field.Store.YES));
    doc.add(new StringField("id", id, Field.Store.YES));
    doc.add(new StringField("uri", uri, Field.Store.YES));
    doc.add(new StringField("page", page, Field.Store.YES));
    


    The VectorTextField is a custom type that logically extends the Text Field functionality. This field type also stores term vectors (with positions), which permits retrieval of term vector information during the query result stage.

    The attribution is given inline within the source code below:
    package com.yourpackage;
    
    import java.io.Reader;
    
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    
    /* http://stackoverflow.com/questions/11945728/how-to-use-termvector-lucene-4-0 */
    public class VectorTextField extends Field {
    
     /* Indexed, tokenized, not stored. */
     public static final FieldType TYPE_NOT_STORED = new FieldType();
    
     /* Indexed, tokenized, stored. */
     public static final FieldType TYPE_STORED  = new FieldType();
    
     static {
      TYPE_NOT_STORED.setIndexed(true);
      TYPE_NOT_STORED.setTokenized(true);
      TYPE_NOT_STORED.setStoreTermVectors(true);
      TYPE_NOT_STORED.setStoreTermVectorPositions(true);
      TYPE_NOT_STORED.freeze();
    
      TYPE_STORED.setIndexed(true);
      TYPE_STORED.setTokenized(true);
      TYPE_STORED.setStored(true);
      TYPE_STORED.setStoreTermVectors(true);
      TYPE_STORED.setStoreTermVectorPositions(true);
      TYPE_STORED.freeze();
     }
    
     /** Creates a new TextField with Reader value. */
     public VectorTextField(String name, Reader reader, Store store) {
      super(name, reader, store == Store.YES ? TYPE_STORED : TYPE_NOT_STORED);
     }
    
     /** Creates a new TextField with String value. */
     public VectorTextField(String name, String value, Store store) {
      super(name, value, store == Store.YES ? TYPE_STORED : TYPE_NOT_STORED);
     }
    
     /** Creates a new un-stored TextField with TokenStream value. */
     public VectorTextField(String name, TokenStream stream) {
      super(name, stream, TYPE_NOT_STORED);
     }
    }
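To give a feel for what "term vectors with positions" means, here is a conceptual sketch in plain Java. It is not Lucene's storage format or API; the class `TermVectorSketch` is hypothetical, and the tokenization (lower-casing, splitting on anything that is not a letter or digit) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Conceptual illustration of a term vector with positions; not Lucene's implementation. */
final class TermVectorSketch {

    /** Maps each lower-cased token to the zero-based positions at which it occurs. */
    static Map<String, List<Integer>> termPositions(String text) {
        Map<String, List<Integer>> vector = new TreeMap<>();
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue; // a leading separator yields one empty token
            vector.computeIfAbsent(token, k -> new ArrayList<>()).add(position++);
        }
        return vector;
    }

    public static void main(String[] args) {
        // prints: {be=[1, 5], not=[3], or=[2], to=[0, 4]}
        System.out.println(termPositions("To be or not to be"));
    }
}
```

The term frequency falls out of the same structure: it is simply the length of each position list.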
    



    Environment Setup


    I use Maven, and prefer to create a single POM for Lucene, then reference this POM in other projects.

    Here's the Apache Lucene POM I've created:
    <project 
     xmlns="http://maven.apache.org/POM/4.0.0" 
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <modelVersion>4.0.0</modelVersion>
    
     <groupId>lucene-dependencies</groupId>
     <artifactId>lucene-dependencies</artifactId>
     <version>4.10.1</version>
     <packaging>pom</packaging>
    
     <properties>
      <lucene-core.version>4.10.1</lucene-core.version>
      <lucene-analyzers.version>4.10.1</lucene-analyzers.version>
      <lucene-queryparser.version>4.10.1</lucene-queryparser.version>
     </properties>
    
     <dependencies>
      <dependency>
       <groupId>org.apache.lucene</groupId>
       <artifactId>lucene-core</artifactId>
       <version>${lucene-core.version}</version>
      </dependency>
      <dependency>
       <groupId>org.apache.lucene</groupId>
       <artifactId>lucene-analyzers-common</artifactId>
       <version>${lucene-analyzers.version}</version>
      </dependency>
      <dependency>
       <groupId>org.apache.lucene</groupId>
       <artifactId>lucene-queryparser</artifactId>
       <version>${lucene-queryparser.version}</version>
      </dependency>
     </dependencies>
     
    </project>
    


    Then in other projects, I simply reference this as:
    <dependency>
     <groupId>lucene-dependencies</groupId>
     <artifactId>lucene-dependencies</artifactId>
     <version>4.10.1</version>
     <type>pom</type>
    </dependency>
    



    References

    1. McCandless, Michael, et al. Lucene in Action, 2nd Ed. Manning Publications, 2010. Book.
      1. The source code is a little dated 5 years on, given extensive API changes since this book's publication.  The concepts however remain relatively unchanged, and this is a well-written text.
    2. [LingPipe, 08-March-2014] Lucene 4 Essentials
      1. Good overview of Lucene's inverted index structure.