DB: Building the Lucene Index

Introduction

Source documents can either be loaded into Lucene on a nearly as-is basis, or be passed through a custom parser. Such a parser may be responsible for various forms of pre-processing on the source document and/or removing information that should not be placed within the index.

The output of the parser is a collection of Lucene documents that are then loaded into the Lucene index and made available for querying.

Fig 1: Building the Lucene Index

At a minimum, a Lucene developer will need to create a component that returns one or more Lucene Documents given one or more incoming source documents.

The component may leverage existing technology (such as Apache Tika) for transforming incoming document types (PDF, DOC, etc.) into plain text representations. Alternatively, the component may require custom logic for extracting information.

The developer will also need a strategy for reducing the incoming document into a set of Key/Value pairs. This could be as simple as treating the entire document as a single unit (e.g. key=all, value=<everything>). It's more likely that multiple Key/Value pairs will be used, including special treatment of document metadata, as a method for enhancing search and the display of search results.
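As a rough illustration of that reduction step, here is a minimal plain-Java sketch that does not use the Lucene API. The class name KeyValueReducer and the input layout (a block of "Key: value" lines, a blank line, then the body) are assumptions made up for this example; a real source format will need its own rules.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Lucene): reduces plain text to Key/Value pairs. */
final class KeyValueReducer {

    /**
     * Assumed layout: an optional block of "Key: value" lines, a blank line,
     * then the body. If the first block is not purely "Key: value" lines,
     * the whole text is treated as the body and stored under "content".
     */
    static Map<String, String> reduce(String source) {
        Map<String, String> pairs = new LinkedHashMap<>();
        String body = source;
        String[] parts = source.split("\\r?\\n\\r?\\n", 2);
        if (parts.length == 2) {
            Map<String, String> header = parseHeader(parts[0]);
            if (header != null) {
                pairs.putAll(header);
                body = parts[1];
            }
        }
        pairs.put("content", body.trim());
        return pairs;
    }

    /** Returns the parsed header, or null if any line is not "Key: value". */
    private static Map<String, String> parseHeader(String block) {
        Map<String, String> header = new LinkedHashMap<>();
        for (String line : block.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                return null;
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            header.put(key, line.substring(colon + 1).trim());
        }
        return header;
    }

    public static void main(String[] args) {
        String source = "Title: Hamlet\nAuthor: William Shakespeare\n\nTo be, or not to be.";
        System.out.println(reduce(source));
    }
}
```

A document with no recognizable metadata block degrades to the single-unit case described above: one pair, with everything under "content".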

Logical Architecture

The Lucene Document
Fields

String Fields
Text Fields
Custom Fields

The Lucene API is easy to use and easy to understand. Getting a basic search index up and running is not difficult. The complexity lies in loading and tuning the index and in manipulating the underlying content; this is both a science and an art form.

Hierarchy:

A Lucene Index contains multiple documents.
A Lucene Document contains multiple fields.
Each field contains a Key/Value pair.

Fig 2: The Index Hierarchy

Hence, the API is a simple, structured hierarchy of Key/Value pairs extracted from incoming source documents.
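To make the hierarchy concrete, here is a toy model in plain Java. These are not the Lucene classes; ToyIndex, ToyDocument and ToyField are names invented for this sketch, and the real API adds field types, analysis and storage options on top of this shape.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of the index hierarchy; these are NOT the Lucene classes. */
final class ToyIndexModel {

    /** A field is a single Key/Value pair. */
    static final class ToyField {
        final String key;
        final String value;

        ToyField(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** A document contains multiple fields. */
    static final class ToyDocument {
        private final List<ToyField> fields = new ArrayList<>();

        ToyDocument add(String key, String value) {
            fields.add(new ToyField(key, value));
            return this;
        }

        List<ToyField> fields() {
            return Collections.unmodifiableList(fields);
        }

        /** Returns the first value stored under the key, or null if absent. */
        String get(String key) {
            for (ToyField field : fields) {
                if (field.key.equals(key)) {
                    return field.value;
                }
            }
            return null;
        }
    }

    /** An index contains multiple documents. */
    static final class ToyIndex {
        private final List<ToyDocument> documents = new ArrayList<>();

        void add(ToyDocument document) {
            documents.add(document);
        }

        int size() {
            return documents.size();
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(new ToyDocument().add("title", "Hamlet").add("page", "1"));
        System.out.println("documents in index: " + index.size());
    }
}
```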

For the sake of this tutorial, we'll assume that each source document will have a corresponding Lucene document. There might be times when this isn't true. A document could have a single Key/Value pair. This generally happens when structured data is being loaded into a Lucene index. For example, the contents of a customer table could be represented using a single document for each customer name.

We'll make the assumption we're dealing with unstructured text, and that for each source document we will create a Lucene document that contains multiple fields, and each field contains a single Key/Value pair. We'll have to carefully choose the proper type of field to represent our Key/Value entry, depending on how we plan to use that data in our search queries.

Assuming the code to create the Lucene documents already exists, the latter half of the process depicted in Fig 1 looks like this:

Collection<Document> docs = ...
LuceneIndexer indexer = new LuceneIndexer("/home/user/lucene/");
indexer.add(docs);
indexer.close();

The only question that needs to be answered at this point is how to create Lucene Document instances from unstructured source data.

The Lucene Document

In Lucene, a Document is the unit of search and index. An index consists of one or more Documents. Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.

A Lucene Document doesn't necessarily have a one-to-one correspondence with an incoming text document, nor does it need to resemble one. If Lucene is being used to index structured text (e.g. a database table of users), then each user would be represented in the index as a Lucene Document.

A Document consists of one or more Fields. A Field is simply a name-value pair. For example, a Field commonly found in applications is title. In the case of a title Field, the field name is title and the value is the title of that content item. Indexing in Lucene thus involves creating Documents comprising one or more Fields, and adding these Documents to an IndexWriter.

Given a hypothetical source document, these are some Key/Value pairs I would be interested in:

String id = getId(inputDocument.getName());
String title = getTitle(inputDocument.getName());
String page = String.valueOf(inputDocument.getPage());
String uri = getUri(inputDocument.getName());
String filename = inputDocument.getName();
String content = getContent(inputDocument);

The implementation of these methods is not important. This is a hypothetical source document, and therefore this data is hypothetical. There is no implication here that every incoming document will have any or all of these content items (id, title, page, uri, etc.).

Lucene Fields

Each of these Key/Value pairs will be contained within a Lucene Field. The type of Lucene Field I choose is important, and will impact how the data can be found during the execution of a Search Query.

Id

This is a numeric identifier that can be used to uniquely identify the document.
It may be useful for correlating the incoming document to documents in other data sources.

Title

The title of the source document
This will be useful for displaying in the final search results back to the user

Page

Given a multi-page document, it is often useful to make each page a separate Lucene Document. In this case, we'll want to record the exact page number for traceability in the search results.
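Here is a minimal sketch of the page-per-document idea in plain Java. It assumes the extracted text marks page breaks with a form feed character, which may not hold for your extraction tool, and the class name PageSplitter is hypothetical. Each resulting map would become one Lucene Document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns one multi-page text into one Key/Value map per page. */
final class PageSplitter {

    /** Assumption: pages are separated by a form feed character ('\f'). */
    static List<Map<String, String>> split(String filename, String text) {
        List<Map<String, String>> pages = new ArrayList<>();
        String[] chunks = text.split("\f", -1);
        for (int i = 0; i < chunks.length; i++) {
            String content = chunks[i].trim();
            if (content.isEmpty()) {
                continue; // skip blank pages, but keep the original page numbering
            }
            Map<String, String> page = new LinkedHashMap<>();
            page.put("filename", filename);
            page.put("page", String.valueOf(i + 1)); // 1-based page number
            page.put("content", content);
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        for (Map<String, String> page : split("report.txt", "First page\fSecond page")) {
            System.out.println(page);
        }
    }
}
```

Keeping the page number tied to the position in the source, rather than to the count of non-empty pages, is what preserves traceability when blank pages are skipped.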

URI

A uniform resource identifier, either a network path or a URL.
This is particularly useful in the final search results if you want to give the user a way to examine the underlying source document.
This can also be useful for restricting a query to search or exclude a certain domain.
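One way to make domain restriction straightforward is to extract the host once, at indexing time, and keep it as its own atomic value. The sketch below uses only java.net.URI from the JDK. The class name UriDomain and the idea of a separate "domain" key are assumptions for illustration, not something the rest of this post relies on.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical helper: derives a domain value from a document's URI. */
final class UriDomain {

    /**
     * Returns the lower-cased host of the URI, or an empty string when the
     * URI has no host (for example a local file path) or cannot be parsed.
     */
    static String domainOf(String uri) {
        try {
            String host = new URI(uri).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(domainOf("http://Example.org/docs/hamlet.pdf"));
    }
}
```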

Filename

The underlying filename of the incoming document.
Useful in the search results, but may not be applicable in the case of a web page.

Content

The actual text for the document.
This could be very large, and that's fine.

To get a sense of other fields that could be added to a Lucene Document, I recommend looking at various ontologies that have been created and used in the industry in the last few years. Dublin Core is a small set of standard vocabulary terms that can be used to describe web resources. The W3C PROV Ontology is an interoperable vocabulary for defining the influence on digital entities by agents, activities or other entities.

Proper use of known taxonomies and ontologies can provide a standards-based way of extending your Lucene search index and encourage lateral thinking in terms of what data is extracted from the incoming source documents.

Field Types

There are two basic Field types:

String Fields

Used for atomic values that should not be tokenized into a set of words for indexing.
Id, URI and Page are all examples of content items that should be placed within a String Field type.

Text Fields

Used for fields that should be tokenized into a set of words.
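The practical difference is in which terms end up in the index. The plain-Java sketch below only approximates it: the atomic case keeps the whole value as one term, while the tokenized case lower-cases the text and splits on anything that is not a letter or digit. That tokenizer is a crude stand-in I made up for illustration; a real Lucene analyzer does considerably more.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Illustration only: which terms an atomic value and a tokenized value would yield. */
final class FieldTermsSketch {

    /** String Field style: the entire value is a single, unchanged term. */
    static List<String> atomicTerms(String value) {
        return Collections.singletonList(value);
    }

    /** Text Field style, crudely approximated: lower-case, split on non-letters/digits. */
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String uri = "http://example.org/a.pdf";
        System.out.println(atomicTerms(uri));
        System.out.println(tokenizedTerms(uri));
    }
}
```

This is why a URI belongs in a String Field: tokenized, it falls apart into fragments such as "http" and "pdf", and an exact match on the full value is no longer possible.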

I'm going to implement the above fields like this:

doc.add(new VectorTextField("content", content, Field.Store.YES));
doc.add(new VectorTextField("title", title, Field.Store.YES));
doc.add(new VectorTextField("filename", filename, Field.Store.YES));
doc.add(new StringField("id", id, Field.Store.YES));
doc.add(new StringField("uri", uri, Field.Store.YES));
doc.add(new StringField("page", page, Field.Store.YES));

The VectorTextField is a custom type that logically extends the Text Field functionality. This field type also stores term vectors (with positions), which permits retrieval of term vector information during the query result stage.

The attribution is given inline within the source code below:

package com.yourpackage;

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;

/* http://stackoverflow.com/questions/11945728/how-to-use-termvector-lucene-4-0 */
public class VectorTextField extends Field {

 /* Indexed, tokenized, not stored; term vectors with positions. */
 public static final FieldType TYPE_NOT_STORED = new FieldType();

 /* Indexed, tokenized, stored; term vectors with positions. */
 public static final FieldType TYPE_STORED  = new FieldType();

 static {
  TYPE_NOT_STORED.setIndexed(true);
  TYPE_NOT_STORED.setTokenized(true);
  TYPE_NOT_STORED.setStoreTermVectors(true);
  TYPE_NOT_STORED.setStoreTermVectorPositions(true);
  TYPE_NOT_STORED.freeze();

  TYPE_STORED.setIndexed(true);
  TYPE_STORED.setTokenized(true);
  TYPE_STORED.setStored(true);
  TYPE_STORED.setStoreTermVectors(true);
  TYPE_STORED.setStoreTermVectorPositions(true);
  TYPE_STORED.freeze();
 }

 /** Creates a new VectorTextField with Reader value.
  *  Note: Lucene rejects stored fields that have a Reader value, so pass Store.NO here. */
 public VectorTextField(String name, Reader reader, Store store) {
  super(name, reader, store == Store.YES ? TYPE_STORED : TYPE_NOT_STORED);
 }

 /** Creates a new VectorTextField with String value. */
 public VectorTextField(String name, String value, Store store) {
  super(name, value, store == Store.YES ? TYPE_STORED : TYPE_NOT_STORED);
 }

 /** Creates a new un-stored VectorTextField with TokenStream value. */
 public VectorTextField(String name, TokenStream stream) {
  super(name, stream, TYPE_NOT_STORED);
 }
}
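As a rough picture of what a term vector with positions records for a single field, here is a plain-Java sketch that maps each term to the token positions where it occurs. It does not use the Lucene API, the tokenizer is the same crude stand-in (lower-case, split on non-letters/digits), and the class name TermVectorSketch is invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: a term-to-positions map for one field value. */
final class TermVectorSketch {

    /** Maps each term to the 0-based token positions at which it occurs. */
    static Map<String, List<Integer>> termPositions(String text) {
        Map<String, List<Integer>> vector = new TreeMap<>();
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            vector.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            position++;
        }
        return vector;
    }

    public static void main(String[] args) {
        // prints {be=[1, 5], not=[3], or=[2], to=[0, 4]}
        System.out.println(termPositions("To be, or not to be"));
    }
}
```

The length of each position list is that term's frequency within the field, which is the kind of per-document information that becomes available at query time once term vectors are stored.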

Environment Setup

I use Maven, and prefer to create a single POM for Lucene, then reference this POM in other projects.

Here's the Apache Lucene POM I've created:

<project 
 xmlns="http://maven.apache.org/POM/4.0.0" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
 <modelVersion>4.0.0</modelVersion>

 <groupId>lucene-dependencies</groupId>
 <artifactId>lucene-dependencies</artifactId>
 <version>4.10.1</version>
 <packaging>pom</packaging>

 <properties>
  <lucene-core.version>4.10.1</lucene-core.version>
  <lucene-analyzers.version>4.10.1</lucene-analyzers.version>
  <lucene-queryparser.version>4.10.1</lucene-queryparser.version>
 </properties>

 <dependencies>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-core</artifactId>
   <version>${lucene-core.version}</version>
  </dependency>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-analyzers-common</artifactId>
   <version>${lucene-analyzers.version}</version>
  </dependency>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-queryparser</artifactId>
   <version>${lucene-queryparser.version}</version>
  </dependency>
 </dependencies>
 
</project>

Then in other projects, I simply reference this as:

<dependency>
 <groupId>lucene-dependencies</groupId>
 <artifactId>lucene-dependencies</artifactId>
 <version>4.10.1</version>
 <type>pom</type>
</dependency>

References

McCandless, Michael, et al. Lucene in Action, 2nd Ed. Manning Publications, 2010. Book.

The source code is a little dated 5 years on, given extensive API changes since this book's publication. The concepts however remain relatively unchanged, and this is a well-written text.

[LingPipe, 08-March-2014] Lucene 4 Essentials

Good overview of Lucene's inverted index structure.

DB

Monday, March 16, 2015