Wednesday, May 27, 2015

Cloudant DB and Java on Bluemix

Dependencies


I've added this dependency to my Maven POM file:
<dependency>
 <groupId>com.cloudant</groupId>
 <artifactId>cloudant-client</artifactId>
 <version>1.0.1</version>
</dependency>



Java Client


Assuming a POJO named MyDoc with two fields
(a private String _id and a private java.util.Collection<String> lines, with the usual getters and setters),
this client will add and retrieve an instance of the class:
public class CloudantTest {

 public static final String DBNAME = "my-db";

 public static LogManager logger = new LogManager(CloudantTest.class);

 public static void main(String... args) throws Throwable {

  String url = ... ;
  String username = ... ;
  String password = ... ;

  CloudantClient client = new CloudantClient(url, username, password);
  logger.debug("Connected to Cloudant:\n\turl = %s\n\tserver-version = %s", url, client.serverVersion());

  List<String> databases = client.getAllDbs();

  /* drop the database if it exists */
  for (String db : databases)
   if (DBNAME.equals(db)) client.deleteDB(DBNAME, "delete database");

  /* create the db */
  client.createDB(DBNAME);
  Database db = client.database(DBNAME, true);

  /* the document does not yet exist, so find() will throw an exception */
  try {
   db.find(MyDoc.class, "100");
  } catch (NoDocumentException e) {
   logger.debug("Document not found (id = %s)", "100");
  }

  MyDoc doc = getDoc();
  Response response = db.save(doc);
  logger.debug("Saved Document (id = %s)", response.getId());

  doc = db.find(MyDoc.class, "100");
  logger.debug("Found Document (id = %s)", doc.get_id());

  client.deleteDB(DBNAME, "delete database");
 }
 
 private static MyDoc getDoc() {
  List<String> lines = new ArrayList<String>();
  lines.add("transcript line 1");
  lines.add("transcript line 2");
  lines.add("transcript line 3");

  MyDoc doc = new MyDoc();
  doc.set_id("100");
  doc.setLines(lines);
  
  return doc;
 }
}
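
For reference, a minimal sketch of the MyDoc POJO assumed above (the exact class is not shown in this post; the _id field maps to the Cloudant document id, and lines holds the document body):

```java
import java.util.Collection;

/* Minimal sketch of the MyDoc POJO assumed by CloudantTest:
   _id maps to the Cloudant document id; lines holds the body. */
public class MyDoc {

 private String _id;
 private Collection<String> lines;

 public String get_id() { return _id; }
 public void set_id(String _id) { this._id = _id; }

 public Collection<String> getLines() { return lines; }
 public void setLines(Collection<String> lines) { this.lines = lines; }
}
```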



References

  1. [Github] A Java client for Cloudant

Wednesday, May 13, 2015

Generating an Automated Tokenization Test for Solr

Introduction


There are a variety of ways to configure the tokenizer within Solr.

The configuration of the tokenizer can be automated for the purpose of testing how each variation performs.  In summary, the schema.xml file is incrementally adjusted, a Docker/Solr container is launched, a Java analysis query is issued against Docker/Solr, and the analysis result is written to file.

An activity diagram depicting this flow is shown:

Fig 1: Activity Diagram


The Query Analyzer


The Query Analyzer depicted below is incrementally modified using the sed command within a shell script:
<!-- Query Analyzer -->
<analyzer type="query">
 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([a-zA-Z])\+1" replacement="$1$1" />
 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([a-zA-Z])(/)([a-zA-Z])" replacement="$1 or $3" />
 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\()(.)+(\))" replacement="$2" />
 <tokenizer class="solr.WhitespaceTokenizerFactory" />
 <filter class="solr.WordDelimiterFilterFactory"
  generateWordParts="#1"
  splitOnCaseChange="#2"
  splitOnNumerics="#3"
  stemEnglishPossessive="#4"
  preserveOriginal="#5"
  catenateWords="#6"
  generateNumberParts="#7"
  catenateNumbers="#8"
  catenateAll="#9"
  types="wdfftypes.txt" />
 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
 <filter class="solr.LowerCaseFilterFactory" />
 <filter class="solr.ASCIIFoldingFilterFactory" />
 <filter class="solr.KStemFilterFactory" />
 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
</analyzer>

The shell script that performs this incremental modification is invoked like this:
$ ./modify-schema.sh 0 1 0 0 0 0 1 1 0
The trailing numbers are substituted into the corresponding #n placeholders of the WordDelimiterFilterFactory (WDFF) above.
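
Since each of the nine flags is binary, a full sweep over all 2^9 = 512 variations can be driven by a small loop around this invocation. A sketch (the actual call to modify-schema.sh is left commented out, since its path and runtime environment are assumed):

```shell
#!/bin/sh
# Enumerate all 512 on/off combinations of the nine WDFF flags and print
# the argument string that would be passed to modify-schema.sh.
i=0
while [ $i -lt 512 ]; do
 flags=""
 for bit in 256 128 64 32 16 8 4 2 1; do
  flags="$flags $(( (i / bit) % 2 ))"
 done
 echo "./modify-schema.sh$flags"
 # ./modify-schema.sh$flags   # uncomment to run the full sweep
 i=$((i + 1))
done
```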


modify-schema.sh


This script likely isn't going to win any points for style, but it takes the user params and seds them into the schema.xml file:
clear

# be root
echo root | sudo -S echo 'done'

# echo params
echo 'params: '1=$1, 2=$2, 3=$3, 4=$4, 5=$5, 6=$6, 7=$7, 8=$8, 9=$9

# place the shell args into the schema file
cat solr_data/transcripts/conf/schema.xml.bak       > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml  | sed 's/#1/'$1'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml  | sed 's/#2/'$2'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml  | sed 's/#3/'$3'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml  | sed 's/#4/'$4'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml  | sed 's/#5/'$5'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml  | sed 's/#6/'$6'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml  | sed 's/#7/'$7'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml  | sed 's/#8/'$8'/' > solr_data/transcripts/conf/schema.xml
cat solr_data/transcripts/conf/schema.xml  | sed 's/#9/'$9'/' > solr_data/transcripts/conf/temp.xml
cat solr_data/transcripts/conf/temp.xml         > solr_data/transcripts/conf/schema.xml

# launch the docker container in a new instance
gnome-terminal -e ./reset.sh

# sleep for 20 seconds (give solr time to instantiate)
sleep 20

# invoke the JAR file that runs the analysis
java -cp \
  uber-ebear-scripts-testing-1.0.0.jar \
  com.ibm.ted.ebear.scripts.testing.SolrjAnalysis \
  $1 $2 $3 $4 $5 $6 $7 $8 $9

# stop the solr instance
./stop.sh
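
The ten cat/sed round-trips above can also be collapsed into a single loop. A sketch of an equivalent helper (the function name is illustrative, paths are parameterized rather than hard-coded, and GNU sed's -i flag is assumed):

```shell
#!/bin/sh
# Replace placeholders #1..#9 in a schema file with the supplied flag values.
substitute_flags() {
 # $1 = schema file (a pristine copy is expected at $1.bak);
 # remaining args = the nine flag values
 schema=$1; shift
 cp "$schema.bak" "$schema"
 n=1
 for flag in "$@"; do
  sed -i "s/#$n/$flag/" "$schema"
  n=$((n + 1))
 done
}
```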



reset.sh


Resets the Solr Container:
clear
echo root | sudo -S echo 'done'
./stop.sh
sudo rm -rf /home/craig/ebear/solr
./run.sh



stop.sh


Stops the Solr Container:
# stop all docker containers
# <https://coderwall.com/p/ewk0mq/stop-remove-all-docker-containers>
sudo docker stop $(sudo docker ps -a -q)
sudo docker rm $(sudo docker ps -a -q)
# <http://jimhoskins.com/2013/07/27/remove-untagged-docker-images.html>
sudo docker rmi $(sudo docker images | grep "^<none>" | awk '{print $3}')



run.sh


Launches a new Solr Container:
create_dirs() {
 if [ ! -d "$2" ]; then
  echo "creating data volume ..."
  mkdir -p $2/conf
  cp $1/core.properties $2
  cp $1/conf/* $2/conf
  mkdir -p $2/data
  chmod -R 777 $2
 else
  echo "data volume already exists"
 fi
}

# copy SOLR core for 'transcripts'
create_dirs \
  solr_data/transcripts \
  ~/ebear/solr/transcripts

sudo docker-compose up
    

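The docker-compose.yml itself isn't shown in this post. A hypothetical minimal version, assuming a 2015-era community Solr image and the data volume created by create_dirs() above (image name and container mount point are assumptions):

```yaml
solr:
  image: makuk66/docker-solr
  ports:
    - "8983:8983"
  volumes:
    - ~/ebear/solr/transcripts:/opt/solr/server/solr/transcripts
```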


Solrj Analysis


This class accesses the Solr analyzer and writes the results to file.

The SolrJ Analyzer has a list of terms to test the tokenizer against. These terms are URL encoded and inserted into a hard-coded query string against the Solr server (known LAN IP). An XML/XPath analysis is performed against the XML returned from the query, and the results are prepared into a TSV format for appending to file. There's a fair bit of hard-coding in here at the moment, but that can easily be externalized into properties files or parameters fed to the main method at runtime.

package com.mycompany.testing;

import java.net.*;
import java.util.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import com.mycompany.utils.*;

public class SolrjAnalysis {

 public static LogManager logger = new LogManager(SolrjAnalysis.class);

 public static void main(String... args) throws Throwable {

  List<String> terms = ListUtils.toList("WiFi", "WiFi's", "Wi-Fi", "O'Reilly's", "U.S.A", "can't", "what're", "afford.", "that!", "where?", "well,well", "now...", "craigtrim@gmail.com", "http://www.ibm.com", "@cmtrm", "#theartoftokenization");
  String encodedUrl = URLEncoder.encode(StringUtils.toString(terms, " "), Codepage.UTF_8.toString());
  logger.debug("Created Encoded URL String: %s", encodedUrl);

  String urlString = "http://192.168.1.73:8983/solr/transcripts/analysis/field?wt=xml&analysis.fieldvalue=" + encodedUrl + "&analysis.fieldtype=text_transcript";

  URL url = new URL(urlString);
  URLConnection conn = url.openConnection();

  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  DocumentBuilder builder = factory.newDocumentBuilder();
  Document dom = builder.parse(conn.getInputStream());

  Map<String, Map<String, Set<String>>> totalManipulationsByType = new HashMap<String, Map<String, Set<String>>>();

  Collection<String> types = getTypes(dom);

  List<String> orderedTypes = new ArrayList<String>();
  for (String type : types) {

   Map<String, Set<String>> innerMap = totalManipulationsByType.containsKey(type) ? totalManipulationsByType.get(type) : new HashMap<String, Set<String>>();

   Collection<Element> elements = XpathUtils.evaluateElements(dom, String.format("descendant-or-self::arr[@name='%s']/lst", type));
   for (Element element : elements) {

    String position = XpathUtils.evaluateText(element, "descendant-or-self::arr[@name='positionHistory']/int/text()");
    String text = XpathUtils.evaluateText(element, "descendant-or-self::str[@name='text']/text()");

    logger.debug("Extracted Variation (position = %s, text = %s, type = %s)", position, text, type);
    if (!orderedTypes.contains(type)) orderedTypes.add(type);

    Set<String> innerSet = (innerMap.containsKey(position)) ? innerMap.get(position) : new HashSet<String>();
    innerSet.add(text);
    innerMap.put(position, innerSet);
   }

   totalManipulationsByType.put(type, innerMap);
  }

  StringBuilder sb = new StringBuilder();

  sb.append(getHeader(orderedTypes));

  String _params = StringUtils.toString(args, "\t");

  for (int i = 1; i < terms.size() + 1; i++) {
   StringBuilder sbBody = new StringBuilder();
   sbBody.append(_params + "\t");

   for (String key : orderedTypes) {
    Set<String> values = totalManipulationsByType.get(key).get(String.valueOf(i));
    if (null == values) sbBody.append("\t");
    else sbBody.append(StringUtils.toString(values, ",") + "\t");
   }

   sb.append(sbBody.toString() + "\n");
  }

  System.err.println(sb.toString());
  FileUtils.toFile(sb, "/home/craig/ebear/analysis.dat", true, Codepage.UTF_8);
 }

 private static String getHeader(List<String> orderedTypes) {

  StringBuilder sbH1 = new StringBuilder();
  sbH1.append("generateWordParts\tsplitOnCaseChange\tsplitOnNumerics\tstemEnglishPossessive\tpreserveOriginal\tcatenateWords\tgenerateNumberParts\tcatenateNumbers\tcatenateAll");

  StringBuilder sbH2 = new StringBuilder();
  for (String key : orderedTypes) {

   String _key = StringUtils.substringAfterLast(key, ".");
   sbH2.append(_key + "\t");
  }

  return sbH1 + "\t" + sbH2.toString() + "\n";
 }

 private static Collection<String> getTypes(Document dom) throws Throwable {
  List<String> list = new ArrayList<String>();

  Collection<Element> elements = XpathUtils.evaluateElements(dom, "descendant-or-self::lst[@name='index']/arr");
  for (Element element : elements)
   list.add(element.getAttribute("name"));

  logger.debug("Extracted Types (total = %s):\n\t%s", list.size(), StringUtils.toString(list, "\n\t"));
  return list;
 }
}
    



References

  1. [Blogger] Docker and Solr

SolrJ: Java API for Solr

Introduction


SolrJ is a Java client for accessing Solr. It offers a Java interface for adding, updating, and querying the Solr index.


HttpSolrServer


I use Spring to load this properties file:
solr.host   = 127.0.0.1
solr.port   = 8983
solr.core   = documents

# defaults to 0.  > 1 not recommended
solr.maxretries   = 1

# 5 seconds to establish TCP
solr.connectiontimeout  = 5000

# socket read timeout
solr.socketreadtimeout  = 50000

# max connections per host
solr.maxconnectionsperhost = 250

# max total connections
solr.maxtotalconnections = 100

# defaults to false
solr.followredirects  = false

# defaults to false
# Server side must support gzip or deflate for this to have any effect.
solr.allowcompression  = true

and to populate this method:
public static HttpSolrServer transform(String url, boolean allowcompression,
 Integer connectiontimeout, String core, boolean followredirects,
 Integer maxconnectionsperhost, Integer maxretries,
 Integer maxtotalconnections, Integer socketreadtimeout)
 throws AdapterValidationException {

 HttpSolrServer server = new HttpSolrServer(url);

 server.setMaxRetries(maxretries);
 server.setConnectionTimeout(connectiontimeout);

 /* Setting the XML response parser is only required for cross
    version compatibility and only when one side is 1.4.1 or
    earlier and the other side is 3.1 or later. */
 server.setParser(new XMLResponseParser());

 /* The following settings are provided here for completeness.
    They will not normally be required, and should only be used
    after consulting javadocs to know whether they are truly required. */
 server.setSoTimeout(socketreadtimeout);
 server.setDefaultMaxConnectionsPerHost(maxconnectionsperhost);
 server.setMaxTotalConnections(maxtotalconnections);
 server.setFollowRedirects(followredirects);
 server.setAllowCompression(allowcompression);

 return server;
}
    

By populating the method above with the values from the properties file, the system is able to work with a configured instance of HttpSolrServer.
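
The url parameter above is derived from the host, port, and core properties. A dependency-free sketch of that derivation (the original uses Spring; the class and method names here are illustrative, using plain java.util.Properties instead):

```java
import java.io.StringReader;
import java.util.Properties;

public class SolrProps {

 /* Build the Solr base URL from the solr.host / solr.port / solr.core
    properties shown above. */
 public static String baseUrl(Properties p) {
  return String.format("http://%s:%s/solr/%s",
    p.getProperty("solr.host"),
    p.getProperty("solr.port"),
    p.getProperty("solr.core"));
 }

 public static void main(String... args) throws Exception {
  Properties p = new Properties();
  p.load(new StringReader("solr.host=127.0.0.1\nsolr.port=8983\nsolr.core=documents\n"));
  System.out.println(baseUrl(p)); // http://127.0.0.1:8983/solr/documents
 }
}
```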


Query Snippets


This code will find the total number of records that mention the word "climate":
public static void main(String... args) throws Throwable {

 HttpSolrServer server = HttpSolrServerAdapter.transform();
 assertNotNull(server);

 SolrQuery q = new SolrQuery("text:climate");
 q.setRows(0); // don't actually request any data

 long total = server.query(q).getResults().getNumFound();
 logger.info("Total Records (total = %s)", total);
}
    


A reusable method for executing a query:
public static QueryResponse execute(
 String queryName,
 HttpSolrServer server,
 SolrQuery solrQuery)
 throws BusinessException {
 try {

  QueryResponse queryResponse = server.query(solrQuery);
  logger.debug("Query Statistics " +
   "(query-name = %s, elapsed-time = %s, query-time = %s, status = %s, request-url = %s)",
    queryName,
    queryResponse.getElapsedTime(),
    queryResponse.getQTime(),
    queryResponse.getStatus(),
    queryResponse.getRequestUrl());

  SolrDocumentList solrDocumentList = queryResponse.getResults();
  logger.debug("Total Records " +
   "(query-name = %s, total = %s, query = %s)",
    queryName,
    StringUtils.format(solrDocumentList.size()),
    solrQuery.toString());

  return queryResponse;

 } catch (SolrServerException e) {
  logger.error(e);
  throw new BusinessException("Unable to Query Server (query-name = %s, message = %s)", queryName, e.getMessage());
 }
}
    



References

  1. [Apache] SolrJ Wiki
  2. [Apache] SolrJ Reference Guide