Introduction
There are a variety of ways to configure the tokenizer within Solr.
Tokenizer configuration can be automated for the purpose of testing how each variation performs. In summary: the schema.xml file is incrementally adjusted, a Docker/Solr container is launched, a Java analysis query is issued against the container, and the analysis result is written to a file.
An activity diagram depicting this flow is shown:
Fig 1: Activity Diagram
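Each schema variation in this post toggles nine binary flags on the WordDelimiterFilterFactory (shown in the next section), so the full space of 512 variations can be swept by a driver loop around the scripts below. A minimal sketch (the loop itself is my own scaffolding, not one of the original scripts):

#!/bin/bash
# enumerate all 2^9 = 512 combinations of the nine binary flags
for combo in {0,1}{0,1}{0,1}{0,1}{0,1}{0,1}{0,1}{0,1}{0,1}; do
    # split the 9-character combination into nine positional arguments
    ./modify-schema.sh ${combo:0:1} ${combo:1:1} ${combo:2:1} \
                       ${combo:3:1} ${combo:4:1} ${combo:5:1} \
                       ${combo:6:1} ${combo:7:1} ${combo:8:1}
done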
The Query Analyzer
The Query Analyzer depicted below is incrementally modified using the sed command within a shell script:
<!-- Query Analyzer -->
<analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
        pattern="([a-zA-Z])\+1" replacement="$1$1" />
    <charFilter class="solr.PatternReplaceCharFilterFactory"
        pattern="([a-zA-Z])(/)([a-zA-Z])" replacement="$1 or $3" />
    <charFilter class="solr.PatternReplaceCharFilterFactory"
        pattern="(\()(.)+(\))" replacement="$2" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="#1"
        splitOnCaseChange="#2"
        splitOnNumerics="#3"
        stemEnglishPossessive="#4"
        preserveOriginal="#5"
        catenateWords="#6"
        generateNumberParts="#7"
        catenateNumbers="#8"
        catenateAll="#9"
        types="wdfftypes.txt" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.KStemFilterFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
</analyzer>
The shell script that performs this incremental modification is invoked like this:
$ ./modify-schema.sh 0 1 0 0 0 0 1 1 0
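The nine arguments are substituted positionally for the #1 through #9 placeholders (generateWordParts through catenateAll) in the analyzer above. A quick sanity check that no placeholder survived a run (this check is my own addition, not part of the original scripts):

# should print 0 once every placeholder has been substituted
grep -c '#[1-9]' solr_data/transcripts/conf/schema.xml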
modify-schema.sh
This script likely isn't going to win any points for style, but it takes the user params and seds them into the schema.xml file:
clear

# be root
echo root | sudo -S echo 'done'

# echo params
echo 'params: '1=$1, 2=$2, 3=$3, 4=$4, 5=$5, 6=$6, 7=$7, 8=$8, 9=$9

# place the shell args into the schema file
cat solr_data/transcripts/conf/schema.xml.bak > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#1/'$1'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml | sed 's/#2/'$2'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#3/'$3'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml | sed 's/#4/'$4'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#5/'$5'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml | sed 's/#6/'$6'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#7/'$7'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml | sed 's/#8/'$8'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml | sed 's/#9/'$9'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml > solr_data/transcripts/conf/schema.xml

# launch the docker container in a new instance
gnome-terminal -e ./reset.sh

# sleep for 20 seconds (give solr time to instantiate)
sleep 20

# invoke the JAR file that runs the analysis
java -cp \
    uber-ebear-scripts-testing-1.0.0.jar \
    com.ibm.ted.ebear.scripts.testing.SolrjAnalysis \
    $1 $2 $3 $4 $5 $6 $7 $8 $9

# stop the solr instance
./stop.sh
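The alternating temp-file dance above can be tightened considerably: with GNU sed's in-place flag and bash's indirect parameter expansion, the nine substitutions collapse into a loop. A sketch under those assumptions (not the original script):

cp solr_data/transcripts/conf/schema.xml.bak solr_data/transcripts/conf/schema.xml
for i in 1 2 3 4 5 6 7 8 9; do
    # ${!i} expands to the value of positional parameter $i
    sed -i "s/#$i/${!i}/" solr_data/transcripts/conf/schema.xml
done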
reset.sh
Resets the Solr Container:
clear
echo root | sudo -S echo 'done'
./stop.sh
sudo rm -rf /home/craig/ebear/solr
./run.sh
stop.sh
Stops the Solr Container:
# stop all docker containers
# <https://coderwall.com/p/ewk0mq/stop-remove-all-docker-containers>
sudo docker stop $(sudo docker ps -a -q)
sudo docker rm $(sudo docker ps -a -q)

# remove untagged images
# <http://jimhoskins.com/2013/07/27/remove-untagged-docker-images.html>
sudo docker rmi $(sudo docker images | grep "^<none>" | awk '{print $3}')
run.sh
Launches a new Solr Container:
create_dirs() {
    if [ ! -d "$2" ]; then
        echo "creating data volume ..."
        mkdir -p $2/conf
        cp $1/core.properties $2
        cp $1/conf/* $2/conf
        mkdir -p $2/data
        chmod -R 777 $2
    else
        echo "data volume already exists"
    fi
}

# copy SOLR core for 'transcripts'
create_dirs \
    solr_data/transcripts \
    ~/ebear/solr/transcripts

sudo docker-compose up
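The docker-compose.yml itself is not shown in this post. As a rough equivalent, assuming the official solr image with its default port and the data volume created above (both assumptions on my part), the container could also be launched directly:

sudo docker run -d --name solr \
    -p 8983:8983 \
    -v ~/ebear/solr:/opt/solr/server/solr \
    solr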
Solrj Analysis
This class accesses the Solr analyzer and writes the results to a file.
The SolrjAnalysis class holds a list of terms to test the tokenizer against. These terms are URL-encoded and inserted into a hard-coded query string against the Solr server (a known LAN IP). An XML/XPath analysis is performed against the XML returned from the query, and the results are formatted as TSV for appending to a file. There is a fair bit of hard-coding in here at the moment, but that could easily be moved out to external properties files or into parameters fed to the main method at runtime.
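The analysis endpoint can also be exercised by hand before wiring everything together, using the same host and field type that the class below hard-codes (the field value here is abbreviated to two of the test terms):

curl 'http://192.168.1.73:8983/solr/transcripts/analysis/field?wt=xml&analysis.fieldtype=text_transcript&analysis.fieldvalue=WiFi%20Wi-Fi'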
package com.mycompany.testing;

import java.net.*;
import java.util.*;

import javax.xml.parsers.*;

import org.w3c.dom.*;

import com.mycompany.utils.*;

public class SolrjAnalysis {

    public static LogManager logger = new LogManager(SolrjAnalysis.class);

    public static void main(String... args) throws Throwable {
        // the terms to run through the analyzer
        List<String> terms = ListUtils.toList("WiFi", "WiFi's", "Wi-Fi", "O'Reilly's", "U.S.A", "can't",
            "what're", "afford.", "that!", "where?", "well,well", "now...", "craigtrim@gmail.com",
            "http://www.ibm.com", "@cmtrm", "#theartoftokenization");

        String encodedUrl = URLEncoder.encode(StringUtils.toString(terms, " "), Codepage.UTF_8.toString());
        logger.debug("Created Encoded URL String: %s", encodedUrl);

        // query the Solr field-analysis endpoint
        String urlString = "http://192.168.1.73:8983/solr/transcripts/analysis/field?wt=xml&analysis.fieldvalue="
            + encodedUrl + "&analysis.fieldtype=text_transcript";
        URL url = new URL(urlString);
        URLConnection conn = url.openConnection();

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document dom = builder.parse(conn.getInputStream());

        // group each token variation by filter type and position
        Map<String, Map<String, Set<String>>> totalManipulationsByType = new HashMap<String, Map<String, Set<String>>>();

        Collection<String> types = getTypes(dom);
        List<String> orderedTypes = new ArrayList<String>();

        for (String type : types) {
            Map<String, Set<String>> innerMap = totalManipulationsByType.containsKey(type)
                ? totalManipulationsByType.get(type)
                : new HashMap<String, Set<String>>();

            Collection<Element> elements = XpathUtils.evaluateElements(dom,
                String.format("descendant-or-self::arr[@name='%s']/lst", type));
            for (Element element : elements) {
                String position = XpathUtils.evaluateText(element,
                    "descendant-or-self::arr[@name='positionHistory']/int/text()");
                String text = XpathUtils.evaluateText(element,
                    "descendant-or-self::str[@name='text']/text()");
                logger.debug("Extracted Variation (position = %s, text = %s, type = %s, )", position, text, type);

                if (!orderedTypes.contains(type))
                    orderedTypes.add(type);

                Set<String> innerSet = (innerMap.containsKey(position))
                    ? innerMap.get(position)
                    : new HashSet<String>();
                innerSet.add(text);
                innerMap.put(position, innerSet);
            }

            totalManipulationsByType.put(type, innerMap);
        }

        // render the results as TSV, one row per term
        StringBuilder sb = new StringBuilder();
        sb.append(getHeader(orderedTypes));

        String _params = StringUtils.toString(args, "\t");
        for (int i = 1; i < terms.size() + 1; i++) {
            StringBuilder sbBody = new StringBuilder();
            sbBody.append(_params + "\t");
            for (String key : orderedTypes) {
                Set<String> values = totalManipulationsByType.get(key).get(String.valueOf(i));
                if (null == values)
                    sbBody.append("\t");
                else
                    sbBody.append(StringUtils.toString(values, ",") + "\t");
            }
            sb.append(sbBody.toString() + "\n");
        }

        System.err.println(sb.toString());
        FileUtils.toFile(sb, "/home/craig/ebear/analysis.dat", true, Codepage.UTF_8);
    }

    private static String getHeader(List<String> orderedTypes) {
        StringBuilder sbH1 = new StringBuilder();
        sbH1.append("generateWordParts\tsplitOnCaseChange\tsplitOnNumerics\tstemEnglishPossessive\tpreserveOriginal\tcatenateWords\tgenerateNumberParts\tcatenateNumbers\tcatenateAll");

        StringBuilder sbH2 = new StringBuilder();
        for (String key : orderedTypes) {
            String _key = StringUtils.substringAfterLast(key, ".");
            sbH2.append(_key + "\t");
        }

        return sbH1 + "\t" + sbH2.toString() + "\n";
    }

    private static Collection<String> getTypes(Document dom) throws Throwable {
        List<String> list = new ArrayList<String>();

        Collection<Element> elements = XpathUtils.evaluateElements(dom,
            "descendant-or-self::lst[@name='index']/arr");
        for (Element element : elements)
            list.add(element.getAttribute("name"));

        logger.debug("Extracted Types (total = %s):\n\t%s", list.size(), StringUtils.toString(list, "\n\t"));
        return list;
    }
}
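Each run appends one tab-separated row per test term to /home/craig/ebear/analysis.dat, prefixed with the nine flag values, so successive invocations of modify-schema.sh build up a single matrix that compares every tokenizer variation side by side.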