Thursday, April 23, 2015

The Apache Solr Query Architecture

Overview

Fig 1: Solr Query Architecture

The diagram above shows a document that contains the text: "the greenhouse gas effect".

A user query for "climate change" will produce a scored relevance match against this document, even though the query and the document share no words.

The user query against the Solr server is very simple:
q=climate change
Solr will augment this query in two stages:

  1. The Request Handler adds metadata about how the query should be executed.  Relevance settings, the number of rows returned, whether highlighting is used, and the fields to query and return are all specified here.
    1. The query then becomes:
      q=climate change&defType=edismax&wt=xml&fl=id title text&qf=title^2 text&rows=10&pf=title^2 text&ps=5&echoParams=all&hl=true&hl.fl=title text&debug=true
  2. The Query Analyzer performs a linguistic analysis of the user query.  Tokenization, pattern filtering, stemming, synonyms, etc. are all specified here.
    1. The query then becomes:
      q=(+((DisjunctionMaxQuery((text:climate | title:climate^2.0 | speaker:climate)) DisjunctionMaxQuery((text:change | title:change^2.0 | speaker:change)))~2) DisjunctionMaxQuery((title:"(greenhouse ghg climate climate deforestation pollution greenhouse carbon co2 methane nitrous n2o hydroflurocarbons hfcs perfluorocarbons pfcs sulfur sf6) (gas change shift gasses dioxide oxide hexafluoride)"~5^2.0 | text:"(greenhouse ghg climate climate deforestation pollution greenhouse carbon co2 methane nitrous n2o hydroflurocarbons hfcs perfluorocarbons pfcs sulfur sf6) (gas change shift gasses dioxide oxide hexafluoride)"~5)))/no_coord&defType=edismax&wt=xml&fl=id title text&qf=title^2 text&rows=10&pf=title^2 text&ps=5&echoParams=all&hl=true&hl.fl=title text&debug=true

This is a powerful design technique: the user query stays simple, while Solr builds a complex and highly specific query to find relevant documents.
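To make the "very simple" user query concrete, here is a minimal Python sketch of how a client might build the request URL. The host and core name are hypothetical placeholders; "/docQuery" is the handler name from the solrconfig.xml example in this post. Note that the client sends nothing but the q parameter.

```python
from urllib.parse import urlencode

# Hypothetical host and core name, used only for illustration.
SOLR_BASE = "http://localhost:8983/solr/mycore"


def build_query_url(user_query, handler="/docQuery"):
    """Return a URL that carries only the user's text in the q parameter."""
    return SOLR_BASE + handler + "?" + urlencode({"q": user_query})


url = build_query_url("climate change")
print(url)  # http://localhost:8983/solr/mycore/docQuery?q=climate+change
```

Everything else (query parser, boosts, highlighting) is left to the server-side configuration.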


Request Handler


The first augmentation stage is controlled by the request handler.

Request handlers are defined within solrconfig.xml:
<requestHandler name="/docQuery" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="wt">xml</str>
    <str name="fl">id author abstract heading text</str>
    <str name="qf">title^4 abstract^2 text</str>
    <str name="rows">10</str>
    <str name="pf">title^4 abstract^2 text</str>
    <str name="ps">5</str>
    <str name="echoParams">all</str>
    <str name="mm">3&lt;-1 5&lt;-2 6&lt;-40%</str> 
    <str name="hl">true</str>
    <str name="hl.fl">title abstract text</str>
    <str name="debug">true</str>
    <str name="explain">true</str>
  </lst>
</requestHandler> 

This request handler produces the node labeled "Augmented User Query 1".  This is already a major benefit of the configuration files: the user (or the application) doesn't have to append all of this information to the query string, because Solr applies it by default to each query.
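The effect of the "defaults" list can be sketched in a few lines of Python. This is a simplified model, not Solr's implementation: it assumes single-valued parameters and copies only a subset of the handler's defaults. The function name effective_params is made up for this example.

```python
# A subset of the defaults from the "/docQuery" handler above.
HANDLER_DEFAULTS = {
    "defType": "edismax",
    "wt": "xml",
    "qf": "title^4 abstract^2 text",
    "rows": "10",
    "hl": "true",
}


def effective_params(request_params, defaults=HANDLER_DEFAULTS):
    """Merge handler defaults with request parameters.

    A value supplied on the request wins over the handler default,
    which is how a "defaults" list behaves.
    """
    merged = dict(defaults)        # start from the configured defaults
    merged.update(request_params)  # request-time values take precedence
    return merged


params = effective_params({"q": "climate change"})
override = effective_params({"q": "climate change", "rows": "25"})
print(params["defType"], params["rows"])  # edismax 10
print(override["rows"])                   # 25
```

The second call shows that a client can still override a default when it needs to, for example to ask for more rows.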


Query Analyzer


Next, the query is augmented by the query analyzer of each field being searched.

Within the schema.xml file, I have a field type defined for text:
    <fieldType name="text_doc" class="solr.TextField" positionIncrementGap="100">

      <!-- Indexer -->
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([a-zA-Z])\1+" replacement="$1$1" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([a-zA-Z])(/)([a-zA-Z])" replacement="$1 or $3" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\()(.)+(\))" replacement="" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          preserveOriginal="1"
          catenateWords="1"
          generateNumberParts="1"
          catenateNumbers="1"
          catenateAll="1"
          types="wdfftypes.txt" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.EnglishPossessiveFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
        <filter class="solr.KStemFilterFactory" />
      </analyzer>

      <!-- Query Analyzer -->
      <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([a-zA-Z])\1+" replacement="$1$1" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([a-zA-Z])(/)([a-zA-Z])" replacement="$1 or $3" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\()(.)+(\))" replacement="$2" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          preserveOriginal="1"
          catenateWords="1"
          generateNumberParts="1"
          catenateNumbers="1"
          catenateAll="1"
          types="wdfftypes.txt" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.EnglishPossessiveFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
        <filter class="solr.KStemFilterFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
      </analyzer>

    </fieldType>


Note that in the configuration file above, there are two analyzers specified for the field type called "text_doc".  One analyzer is for the document text being indexed during the ingestion phase.  The other analyzer is for the user query text that triggers the search (for the indexed text).  The two configurations are largely identical, except that synonyms are applied only at query time.
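To see what the pattern-replace rules do to raw text before tokenization, the three index-time charFilter patterns can be tried out in Python. This is only an approximation for experimenting: Solr uses Java regular expressions, and Java's "$1" replacement syntax is written as "\1" in Python. The function name apply_char_filters is made up for this sketch.

```python
import re

# Python translations of the three index-time PatternReplaceCharFilterFactory rules.
CHAR_FILTERS = [
    # Collapse a run of the same letter down to two ("coool" -> "cool").
    (re.compile(r"([a-zA-Z])\1+"), r"\1\1"),
    # Rewrite a slash between letters as the word "or" ("oil/gas" -> "oil or gas").
    (re.compile(r"([a-zA-Z])(/)([a-zA-Z])"), r"\1 or \3"),
    # Drop parenthesized text, including the parentheses.
    (re.compile(r"(\()(.)+(\))"), ""),
]


def apply_char_filters(text):
    """Apply each rule in order, replacing every match."""
    for pattern, replacement in CHAR_FILTERS:
        text = pattern.sub(replacement, text)
    return text


print(apply_char_filters("coool"))    # cool
print(apply_char_filters("oil/gas"))  # oil or gas
```

One thing worth knowing about the third pattern: it is greedy, so when a line contains several parenthesized groups it removes everything from the first "(" to the last ")".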

This is an important concept to grasp.  If the indexed content and the user query are both treated by (nearly) identical analyzers, it's going to be a lot easier to find relevant text.  As a counterexample, imagine having to design a tokenization pipeline for user queries against content indexed by multiple, unknown configurations.  You would likely have to fall back on aggressive stemming and wildcards to boost recall, and that would come at the expense of precision.
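A toy Python model shows why matching analyzers plus query-time synonyms let "climate change" find "the greenhouse gas effect". This is a sketch under heavy simplifying assumptions: the stopword set and the two-entry synonym map are hypothetical stand-ins for stopwords.txt and synonyms.txt (the synonym pairs are modeled on the expanded query shown earlier), stemming is left out, and Solr's actual scoring is not modeled at all.

```python
# Hypothetical, heavily reduced stand-ins for stopwords.txt and synonyms.txt.
STOPWORDS = {"the", "a", "of"}
SYNONYMS = {
    "climate": ["climate", "greenhouse"],
    "change": ["change", "gas"],
}


def index_analyzer(text):
    """Tokenize on whitespace, lowercase, and drop stopwords."""
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in STOPWORDS]


def query_analyzer(text):
    """Run the same steps as indexing, then expand synonyms per position."""
    return [SYNONYMS.get(t, [t]) for t in index_analyzer(text)]


def matches(doc_text, query_text):
    """True if every query position has at least one variant in the document."""
    indexed = set(index_analyzer(doc_text))
    return all(indexed & set(variants) for variants in query_analyzer(query_text))


print(matches("The greenhouse gas effect", "climate change"))       # True
print(matches("The greenhouse gas effect", "ocean acidification"))  # False
```

Because both sides share the same tokenization and normalization, the only thing the query side has to add is the synonym expansion.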


