Introduction
There are a variety of ways to configure the tokenizer within Solr.
Tokenizer configuration can be automated for the purpose of testing how each variation performs. In summary: the schema.xml file is incrementally adjusted, a Docker/Solr container is launched, a Java analysis query is issued against the container, and the analysis result is written to a file.
An activity diagram depicting this flow is shown:
Fig 1: Activity Diagram
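Each schema variation in this post toggles nine binary flags on the WordDelimiterFilterFactory (shown in the next section), so the full space of 512 variations can be swept by a driver loop around the scripts below. A minimal sketch (the loop itself is my own scaffolding, not one of the original scripts):

#!/bin/bash
# enumerate all 2^9 = 512 combinations of the nine binary flags
for combo in {0,1}{0,1}{0,1}{0,1}{0,1}{0,1}{0,1}{0,1}{0,1}; do
    # split the 9-character combination into nine positional arguments
    ./modify-schema.sh ${combo:0:1} ${combo:1:1} ${combo:2:1} \
                       ${combo:3:1} ${combo:4:1} ${combo:5:1} \
                       ${combo:6:1} ${combo:7:1} ${combo:8:1}
done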
The Query Analyzer
The Query Analyzer depicted below is incrementally modified using the sed command within a shell script:
<!-- Query Analyzer -->
<analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
        pattern="([a-zA-Z])\+1" replacement="$1$1" />
    <charFilter class="solr.PatternReplaceCharFilterFactory"
        pattern="([a-zA-Z])(/)([a-zA-Z])" replacement="$1 or $3" />
    <charFilter class="solr.PatternReplaceCharFilterFactory"
        pattern="(\()(.)+(\))" replacement="$2" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="#1"
        splitOnCaseChange="#2"
        splitOnNumerics="#3"
        stemEnglishPossessive="#4"
        preserveOriginal="#5"
        catenateWords="#6"
        generateNumberParts="#7"
        catenateNumbers="#8"
        catenateAll="#9"
        types="wdfftypes.txt" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.KStemFilterFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
</analyzer>
The shell script that performs this incremental modification is invoked like this:
$ ./modify-schema.sh 0 1 0 0 0 0 1 1 0
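The nine arguments are substituted positionally for the #1 through #9 placeholders (generateWordParts through catenateAll) in the analyzer above. A quick sanity check that no placeholder survived a run (this check is my own addition, not part of the original scripts):

# should print 0 once every placeholder has been substituted
grep -c '#[1-9]' solr_data/transcripts/conf/schema.xml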
modify-schema.sh
This script likely isn't going to win any points for style, but it takes the user params and seds them into the schema.xml file:
clear

# be root
echo root | sudo -S echo 'done'

# echo params
echo 'params: '1=$1, 2=$2, 3=$3, 4=$4, 5=$5, 6=$6, 7=$7, 8=$8, 9=$9

# place the shell args into the schema file
cat solr_data/transcripts/conf/schema.xml.bak > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#1/'$1'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml | sed 's/#2/'$2'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#3/'$3'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml | sed 's/#4/'$4'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#5/'$5'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml | sed 's/#6/'$6'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#7/'$7'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml | sed 's/#8/'$8'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#9/'$9'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml > solr_data/transcripts/conf/schema.xml

# launch the docker container in a new instance
gnome-terminal -e ./reset.sh

# sleep for 20 seconds (give solr time to instantiate)
sleep 20

# invoke the JAR file that runs the analysis
java -cp \
    uber-ebear-scripts-testing-1.0.0.jar \
    com.ibm.ted.ebear.scripts.testing.SolrjAnalysis \
    $1 $2 $3 $4 $5 $6 $7 $8 $9

# stop the solr instance
./stop.sh
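The alternating temp-file dance above can be tightened considerably: with GNU sed's in-place flag and bash's indirect parameter expansion, the nine substitutions collapse into a loop. A sketch under those assumptions (not the original script):

cp solr_data/transcripts/conf/schema.xml.bak solr_data/transcripts/conf/schema.xml
for i in 1 2 3 4 5 6 7 8 9; do
    # ${!i} expands to the value of positional parameter $i
    sed -i "s/#$i/${!i}/" solr_data/transcripts/conf/schema.xml
done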
reset.sh
Resets the Solr Container:
clear
echo root | sudo -S echo 'done'
./stop.sh
sudo rm -rf /home/craig/ebear/solr
./run.sh
stop.sh
Stops the Solr Container:
# stop all docker containers
# <https://coderwall.com/p/ewk0mq/stop-remove-all-docker-containers>
sudo docker stop $(sudo docker ps -a -q)
sudo docker rm $(sudo docker ps -a -q)

# remove untagged images
# <http://jimhoskins.com/2013/07/27/remove-untagged-docker-images.html>
sudo docker rmi $(sudo docker images | grep "^<none>" | awk '{print $3}')
run.sh
Launches a new Solr Container:
create_dirs() {
    if [ ! -d "$2" ]; then
        echo "creating data volume ..."
        mkdir -p $2/conf
        cp $1/core.properties $2
        cp $1/conf/* $2/conf
        mkdir -p $2/data
        chmod -R 777 $2
    else
        echo "data volume already exists"
    fi
}

# copy SOLR core for 'transcripts'
create_dirs \
    solr_data/transcripts \
    ~/ebear/solr/transcripts

sudo docker-compose up
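The docker-compose.yml itself is not shown in this post. As a rough equivalent, assuming the official solr image with its default port and the data volume created above (both assumptions on my part), the container could also be launched directly:

sudo docker run -d --name solr \
    -p 8983:8983 \
    -v ~/ebear/solr:/opt/solr/server/solr \
    solr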
Solrj Analysis
This class accesses the Solr analyzer and writes the results to a file.
The SolrjAnalysis class holds a list of terms to test the tokenizer against. These terms are URL-encoded and inserted into a hard-coded query string against the Solr server (a known LAN IP). An XML/XPath analysis is performed against the XML returned from the query, and the results are formatted as TSV for appending to a file. There is a fair bit of hard-coding in here at the moment, but that could easily be moved out to external properties files or into parameters fed to the main method at runtime.
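The analysis endpoint can also be exercised by hand before wiring everything together, using the same host and field type that the class below hard-codes (the field value here is abbreviated to two of the test terms):

curl 'http://192.168.1.73:8983/solr/transcripts/analysis/field?wt=xml&analysis.fieldtype=text_transcript&analysis.fieldvalue=WiFi%20Wi-Fi'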
package com.mycompany.testing;

import java.net.*;
import java.util.*;

import javax.xml.parsers.*;

import org.w3c.dom.*;

import com.mycompany.utils.*;

public class SolrjAnalysis {

    public static LogManager logger = new LogManager(SolrjAnalysis.class);

    public static void main(String... args) throws Throwable {
        // the terms to run through the analyzer
        List<String> terms = ListUtils.toList("WiFi", "WiFi's", "Wi-Fi", "O'Reilly's", "U.S.A", "can't",
            "what're", "afford.", "that!", "where?", "well,well", "now...", "craigtrim@gmail.com",
            "http://www.ibm.com", "@cmtrm", "#theartoftokenization");

        String encodedUrl = URLEncoder.encode(StringUtils.toString(terms, " "), Codepage.UTF_8.toString());
        logger.debug("Created Encoded URL String: %s", encodedUrl);

        // query the Solr field-analysis endpoint
        String urlString = "http://192.168.1.73:8983/solr/transcripts/analysis/field?wt=xml&analysis.fieldvalue="
            + encodedUrl + "&analysis.fieldtype=text_transcript";
        URL url = new URL(urlString);
        URLConnection conn = url.openConnection();

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document dom = builder.parse(conn.getInputStream());

        // group each token variation by filter type and position
        Map<String, Map<String, Set<String>>> totalManipulationsByType = new HashMap<String, Map<String, Set<String>>>();

        Collection<String> types = getTypes(dom);
        List<String> orderedTypes = new ArrayList<String>();

        for (String type : types) {
            Map<String, Set<String>> innerMap = totalManipulationsByType.containsKey(type)
                ? totalManipulationsByType.get(type)
                : new HashMap<String, Set<String>>();

            Collection<Element> elements = XpathUtils.evaluateElements(dom,
                String.format("descendant-or-self::arr[@name='%s']/lst", type));
            for (Element element : elements) {
                String position = XpathUtils.evaluateText(element,
                    "descendant-or-self::arr[@name='positionHistory']/int/text()");
                String text = XpathUtils.evaluateText(element,
                    "descendant-or-self::str[@name='text']/text()");
                logger.debug("Extracted Variation (position = %s, text = %s, type = %s, )", position, text, type);

                if (!orderedTypes.contains(type))
                    orderedTypes.add(type);

                Set<String> innerSet = (innerMap.containsKey(position))
                    ? innerMap.get(position)
                    : new HashSet<String>();
                innerSet.add(text);
                innerMap.put(position, innerSet);
            }

            totalManipulationsByType.put(type, innerMap);
        }

        // render the results as TSV, one row per term
        StringBuilder sb = new StringBuilder();
        sb.append(getHeader(orderedTypes));

        String _params = StringUtils.toString(args, "\t");
        for (int i = 1; i < terms.size() + 1; i++) {
            StringBuilder sbBody = new StringBuilder();
            sbBody.append(_params + "\t");
            for (String key : orderedTypes) {
                Set<String> values = totalManipulationsByType.get(key).get(String.valueOf(i));
                if (null == values)
                    sbBody.append("\t");
                else
                    sbBody.append(StringUtils.toString(values, ",") + "\t");
            }
            sb.append(sbBody.toString() + "\n");
        }

        System.err.println(sb.toString());
        FileUtils.toFile(sb, "/home/craig/ebear/analysis.dat", true, Codepage.UTF_8);
    }

    private static String getHeader(List<String> orderedTypes) {
        StringBuilder sbH1 = new StringBuilder();
        sbH1.append("generateWordParts\tsplitOnCaseChange\tsplitOnNumerics\tstemEnglishPossessive\tpreserveOriginal\tcatenateWords\tgenerateNumberParts\tcatenateNumbers\tcatenateAll");

        StringBuilder sbH2 = new StringBuilder();
        for (String key : orderedTypes) {
            String _key = StringUtils.substringAfterLast(key, ".");
            sbH2.append(_key + "\t");
        }

        return sbH1 + "\t" + sbH2.toString() + "\n";
    }

    private static Collection<String> getTypes(Document dom) throws Throwable {
        List<String> list = new ArrayList<String>();

        Collection<Element> elements = XpathUtils.evaluateElements(dom,
            "descendant-or-self::lst[@name='index']/arr");
        for (Element element : elements)
            list.add(element.getAttribute("name"));

        logger.debug("Extracted Types (total = %s):\n\t%s", list.size(), StringUtils.toString(list, "\n\t"));
        return list;
    }
}
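Each run appends one tab-separated row per test term to /home/craig/ebear/analysis.dat, prefixed with the nine flag values, so successive invocations of modify-schema.sh build up a single matrix that compares every tokenizer variation side by side.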