- Ubuntu Linux 14.04
- MongoDB 2.6.6 [installation how-to]
- Oracle JDK 7 [installation how-to]
- Maven 3.2.3 [installation how-to]
- Eclipse Luna 4.4.1 [installation how-to]
- Spring JPA 4.1.4
- Spring Data for MongoDB 1.6.1
Introduction

Spring Data for MongoDB is part of the umbrella Spring Data project, which aims to provide a familiar and consistent Spring-based programming model for new datastores while retaining store-specific features and capabilities.
I started to use Spring Data for MongoDB because the default query API in Mongo was awkward for Java.
For example, searching for i > 50 is represented as:
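The original listing is not shown here; a sketch of what the raw driver query plausibly looked like, assuming the classic mongo-java-driver API (the field name i follows the example above):

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

public class RawQueryExample {

    public static DBObject greaterThanFifty() {
        // Builds { "i" : { "$gt" : 50 } } -- nested BasicDBObjects
        // stand in for the JSON query document.
        return new BasicDBObject("i", new BasicDBObject("$gt", 50));
        // Executed against a raw collection handle, e.g.:
        //   DBCursor cursor = collection.find(greaterThanFifty());
    }
}
```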
The equivalent Spring enabled query is:
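The Spring version is also missing from this copy; a sketch using Spring Data MongoDB's Criteria/Query API (the mongoOps call in the comment assumes a configured MongoTemplate):

```java
import static org.springframework.data.mongodb.core.query.Criteria.where;

import org.springframework.data.mongodb.core.query.Query;

public class SpringQueryExample {

    public static Query greaterThanFifty() {
        // Reads almost like the predicate itself: i > 50.
        // Executed via a MongoTemplate, e.g.:
        //   List<Person> result = mongoOps.find(greaterThanFifty(), Person.class);
        return new Query(where("i").gt(50));
    }
}
```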
While these are both simple cases, the former is awkward both syntactically and semantically. Semantically, meaning is obscured as the query grows in length: for queries with multiple conditions, a large number of BasicDBObject instances have to be created and nested to simulate a pipeline. Syntactically, operators like ">" and "<" have to be written as escaped keywords ("$gt", "$lt").
Spring support brings further advantages around deployment, integration and environment support for MongoDB in enterprise applications.
Person.java (the domain object):
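The listing is omitted in this copy; a minimal sketch of what Person.java plausibly looked like. The field names are assumptions, and a real mapping would typically add Spring Data's @Id / @Document annotations:

```java
public class Person {

    private String id;       // mapped to MongoDB's _id in the real class
    private String name;     // raw name as read from the source file
    private String soundex;  // Soundex encoding, precomputed at load time

    public Person(String id, String name, String soundex) {
        this.id = id;
        this.name = name;
        this.soundex = soundex;
    }

    public String getId() { return id; }
    public String getName() { return name; }
    public String getSoundex() { return soundex; }

    @Override
    public String toString() {
        return "Person[id=" + id + ", name=" + name + ", soundex=" + soundex + "]";
    }
}
```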
The Soundex Use Case
The Soundex algorithm belongs to the class of approximate string matching algorithms.
The goal is for homophones (e.g. Jon Smith, John Smythe) to be encoded to the same representation so that they can be matched despite minor differences in spelling.
The soundex encoder is provided through an apache-commons codec:
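The codec listing is missing here; usage is essentially a one-liner with commons-codec's org.apache.commons.codec.language.Soundex (the wrapper class and method name below are mine):

```java
import org.apache.commons.codec.language.Soundex;

public class SoundexExample {

    private static final Soundex SOUNDEX = new Soundex();

    public static String encode(String name) {
        // Produces a letter followed by three digits, e.g. "Robert" -> "R163"
        return SOUNDEX.soundex(name);
    }
}
```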
Since the Soundex algorithm is designed for English phonology only, each String is first checked to confirm it contains only characters of the English alphabet, and is then passed to the codec:
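The check itself is not shown in this copy; a sketch, assuming a plain regex test for A–Z (the method name encodeIfEnglish is mine):

```java
import org.apache.commons.codec.language.Soundex;

public class NameEncoder {

    private static final Soundex SOUNDEX = new Soundex();

    /**
     * Returns the Soundex code for a name, or null when the name
     * contains characters outside the English alphabet (Soundex is
     * defined for English phonology only).
     */
    public static String encodeIfEnglish(String name) {
        if (name == null || !name.matches("[A-Za-z]+")) {
            return null; // discard non-English names
        }
        return SOUNDEX.soundex(name);
    }
}
```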
This is a simple test case that demonstrates the Soundex algorithm working correctly:
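The test listing is absent from this copy; a plausible reconstruction as a JUnit 4 test (the class and method names are mine), checking that homophones collapse to the same code:

```java
import static org.junit.Assert.assertEquals;

import org.apache.commons.codec.language.Soundex;
import org.junit.Test;

public class SoundexTest {

    private final Soundex soundex = new Soundex();

    @Test
    public void homophonesEncodeIdentically() {
        // Different spellings, same pronunciation -> same code
        assertEquals(soundex.soundex("Smith"), soundex.soundex("Smythe"));
        assertEquals(soundex.soundex("Jon"), soundex.soundex("John"));
        assertEquals("S530", soundex.soundex("Smith"));
    }
}
```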
I loaded this data into MongoDB using the mongoOps.insert(...) command. Admittedly, this test hardly flexes the use case for Spring/Mongo; I expect that to come in the analysis stage. Insertion performance was tracked across 24 large files and 75 million records. Roughly 50% of the names were non-English and had to be discarded.
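For reference, a minimal sketch of the insert step through Spring's MongoOperations interface. The Person shape and the sample records here are stand-ins, not the post's actual loader:

```java
import java.util.Arrays;
import java.util.List;

import org.springframework.data.mongodb.core.MongoOperations;

public class PersonLoader {

    // Minimal stand-in for the domain class used in this post
    public static class Person {
        public String name;
        public String soundex;

        public Person(String name, String soundex) {
            this.name = name;
            this.soundex = soundex;
        }
    }

    public static List<Person> buildBatch() {
        return Arrays.asList(
                new Person("Smith", "S530"),
                new Person("Smythe", "S530"));
    }

    public static void load(MongoOperations mongoOps) {
        // Inserts the whole batch into the collection mapped to Person
        mongoOps.insert(buildBatch(), Person.class);
    }
}
```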
The x-axis represents the number of records being loaded (in millions). The y-axis represents the insertion time per record in milliseconds (ms). The jagged green line is the actual insertion performance on a ms-per-record basis. The lighter dotted green line is a linear trendline through the actual data, and it seems to exhibit slightly better than O(1/2 n) growth. For comparison, three hypothetical (dotted) lines are drawn: the blue line is O(n), the orange is O(1/2 n) and the purple is O(log n).
The load performance is very reasonable. The total time to process the entire dataset was 19 minutes (10 GB over a local gigabit-ethernet LAN, with minimal computation prior to the database insertion, on a quad-core VirtualBox image with 16 GB of RAM running Ubuntu 14).