There are a variety of ways to configure the tokenizer within Solr.
This configuration can be automated in order to test how each variation performs. In summary: the schema.xml file is incrementally adjusted, a Solr Docker container is launched, a Java analysis query is issued against it, and the analysis result is written to file.
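The loop just described can be sketched as a small driver script. This is a sketch only: the helper script names, jar, and class name are assumptions, and DRY_RUN=1 merely prints each step rather than executing it.

```shell
#!/bin/sh
# Hypothetical driver for the flow above: for each candidate set of
# word-delimiter flags, rewrite schema.xml, relaunch Solr in Docker,
# and run the analysis client. Script/class names are assumptions.
DRY_RUN=${DRY_RUN:-1}   # default to printing the commands only

run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

main() {
  for flags in "1 1 0" "1 0 1" "0 1 1"; do         # candidate flag combinations
    run ./update-schema.sh $flags                   # 1. sed the flags into schema.xml
    run ./restart-solr-container.sh                 # 2. relaunch the Solr Docker container
    run java -cp solrj-analyzer.jar SolrjAnalyzer   # 3. run the analysis query, append TSV
  done
}

main
```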
An activity diagram depicting this flow is shown:
|Fig 1: Activity Diagram|
The Query Analyzer
The Query Analyzer depicted below is incrementally modified using the sed command within a shell script:
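A representative query analyzer with the word-delimiter filter might look like this. This is a sketch: the field-type name, tokenizer, and initial attribute values are assumptions, and the filter the post calls the LuceneWordDelimiterFactory is typically registered in schema.xml as solr.WordDelimiterFilterFactory.

```xml
<fieldType name="text_general" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- The numeric flags below are the values the shell script rewrites with sed -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```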
The shell script that performs this incremental modification is invoked like this:
The trailing numbers are placed into the relevant sections of the LuceneWordDelimiterFactory (LWDF) above.
This script likely isn't going to win any points for style, but it takes the user params and uses sed to substitute them into the schema.xml file:
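A minimal sketch of such a script, written as a shell function so it can be exercised directly. The parameter order, the flag names it rewrites, and the schema path are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch: sed the word-delimiter flag values into schema.xml.
# Usage: update_schema <schema-file> <generateWordParts> <generateNumberParts> <catenateWords>
update_schema() {
  schema=$1
  # -i.bak edits in place, keeping a backup of the previous variation
  sed -i.bak \
    -e "s/generateWordParts=\"[01]\"/generateWordParts=\"$2\"/" \
    -e "s/generateNumberParts=\"[01]\"/generateNumberParts=\"$3\"/" \
    -e "s/catenateWords=\"[01]\"/catenateWords=\"$4\"/" \
    "$schema"
}
```

Invoked with the schema path followed by the trailing flag values, e.g. `update_schema schema.xml 1 0 1`.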
Resets the Solr Container:
Stops the Solr Container:
Launches a new Solr Container:
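The three Docker steps above might look like the following. This is a dry-run sketch: the container name, image tag, and schema mount path are assumptions, and DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical Docker lifecycle helpers for the Solr container.
# Container name, image tag, and mount path are assumptions.
DRY_RUN=${DRY_RUN:-1}   # default to printing the commands only

run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

# Stops (and removes) the Solr container
stop_solr() {
  run docker stop solr
  run docker rm solr
}

# Launches a new Solr container with the freshly edited schema.xml mounted in
launch_solr() {
  run docker run -d --name solr -p 8983:8983 \
    -v "$PWD/schema.xml:/opt/solr/server/solr/mycore/conf/schema.xml" \
    solr:8
}

# Resets the Solr container: stop the old one, launch a new one
reset_solr() {
  stop_solr
  launch_solr
}
```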
This class accesses the Solr analyzer and writes the results to file.
The Solrj Analyzer has a list of terms to test the tokenizer against. These terms are URL-encoded and inserted into a hard-coded query string against the Solr server (known LAN IP). An XML/XPath analysis is performed on the XML returned from the query, and the results are formatted as TSV rows for appending to file. There's a fair bit of hard-coding in here at the moment, but that could easily be moved out into external properties files or into parameters fed to the main method at runtime.
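The query-building step can be sketched in shell. This is a sketch only: the LAN IP, core name, field type, and the minimal encoder are assumptions, and the actual class does this in Java before running XPath over the response.

```shell
#!/bin/sh
# Hypothetical sketch of building the analysis query URL.
SOLR_HOST=${SOLR_HOST:-192.168.1.10}   # placeholder for the known LAN IP

urlencode() {
  # Minimal percent-encoding covering the characters in the sample terms;
  # a full implementation would encode every reserved byte.
  printf '%s\n' "$1" | sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/&/%26/g' -e 's/+/%2B/g' -e 's|/|%2F|g'
}

build_query() {
  # Solr's field analysis handler; the core and field-type names are assumptions
  echo "http://$SOLR_HOST:8983/solr/mycore/analysis/field?analysis.fieldtype=text_general&analysis.query=$(urlencode "$1")"
}

# The Java class then fetches this URL, runs XPath over the returned XML,
# and appends the token output to the results file as a TSV row.
```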
- [Blogger] Docker and Solr