Our current project is using ElasticSearch for, er, search.
Hierarchical categorization of content found itself pinned to our sprint planning board this week and so I decided to take a look into what options were available.
Our classification is a pretty basic hierarchy dividing and redividing things into groups, where each new group is a sub-species of its parent group.
Essentially, at this stage we have two main requirements:
1. Hierarchical Search
By classifying X as:
/a/b/c/d
We must be able to search for and find X for the following paths:
/a
/a/b
/a/b/c
/a/b/c/d
(I'm using /a/b/c/d for brevity, but imagine /Computer Peripherals/Hard Drives/External/USB for example).
2. Drill Down with facets
To allow for searching from the more general to the more specific we want to be able to allow the children of a given facet to be returned.
For example, when we ask for the direct children of /a as facets, the system should return /a/b.
A useful starting point came not from the ElasticSearch documentation but from the
Solr wiki, which is a worthwhile read in my opinion. Much of what follows is explained in fuller detail there.
A nice solution to requirement 1 is provided by Lucene's
PathHierarchyTokenizer, which is
exposed by ElasticSearch.
As described in the docs this tokenizer allows us to take input like:
/a/b/c/d
and produce the following tokens:
/a
/a/b
/a/b/c
/a/b/c/d
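As a quick way to see this in action, the analyze API can be pointed at the built-in path_hierarchy tokenizer (this assumes a node running on localhost:9200; depending on your ElasticSearch version you may need to run this against an existing index, e.g. /test/_analyze):
curl -XGET 'http://localhost:9200/_analyze?tokenizer=path_hierarchy' -d '/a/b/c/d'
The tokens section of the response should contain /a, /a/b, /a/b/c and /a/b/c/d.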
This works pretty well; however, using this tokenizer alone doesn't make it particularly easy to constrain the depth of the classification (Requirement 2). We therefore decided to follow the seemingly hacky but certainly workable idea (see facet.prefix in the
Solr article) of storing depth information along with the path.
So our indexed category terms will now be stored as:
0/a
1/a/b
2/a/b/c
3/a/b/c/d
i.e. 1/a/b represents the fact that a/b is 1 level below /a.
With this information we can now use a
Regex Pattern to get facets for a particular depth (more on this later in the article).
Encoding the depth
We gave some thought to how we might go about encoding depth along with our categories and opted for writing a custom
TokenFilter with input provided from the token stream of the PathHierarchyTokenizer.
The remainder of this article is intended to illustrate how to write and test a TokenFilter as well as how to deploy and reference it for use with ElasticSearch. The full
source code is available on GitHub.
We are using Maven 3 to manage this customisation, so having created a new Maven (jar) project in my IDE, I set up the required dependencies (the full POM is
here).
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch</artifactId>
  <version>${elasticsearch.version}</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-test-framework</artifactId>
  <version>${lucene.version}</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>${junit.version}</version>
  <scope>test</scope>
</dependency>
Note: the lucene-test-framework allows us to easily test our TokenFilter without having to keep deploying to ElasticSearch.
Step 1 Write a failing test
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.path.PathHierarchyTokenizer;
import org.junit.Test;

public class TokenCountFilterTest {

  private class TestAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(final String fieldName, final Reader reader) {
      return new TokenCountFilter(new PathHierarchyTokenizer(reader));
    }
  }

  /**
   * Create a string of ${pathElementCount} path elements
   * and assert that each path in the hierarchy is prepended with its depth.
   *
   * e.g. given the input /a/a/a/a
   *
   * we would expect the output
   *
   * 0/a
   * 1/a/a
   * 2/a/a/a
   * 3/a/a/a/a
   *
   * @throws IOException
   */
  @Test
  public void testTokens() throws IOException {
    int pathElementCount = 10;
    StringBuilder input = new StringBuilder();
    String[] output = new String[pathElementCount];
    String pathElement = "/a";
    for (int i = 0; i < pathElementCount; i++) {
      input.append(pathElement);
      output[i] = i + input.toString();
    }
    Analyzer testAnalyzer = new TestAnalyzer();
    BaseTokenStreamTestCase.assertAnalyzesTo(testAnalyzer, input.toString(), output);
  }
}
In the TestAnalyzer inner class we create an Analyzer for use with the test.
TokenCountFilter is the filter that we are going to write to amend the tokens produced by the instance of PathHierarchyTokenizer passed as an argument to its constructor.
The javadoc for the
testTokens method describes how the test will work. Note that most of the work is managed by
BaseTokenStreamTestCase. This Lucene test class provides the static method
assertAnalyzesTo, which accepts an Analyzer to test, an input string to analyze, e.g.
"/a/b/c", and an array of expected output tokens, e.g.
["0/a", "1/a/b", "2/a/b/c"], which should be emitted after tokenization. Internally BaseTokenStreamTestCase utilizes the JUnit framework to make test assertions based on the input and expected output.
Step 2 Write the TokenFilter
Note: This may not be the most efficient way of achieving our goal but it performs well enough in our tests. It is important to note that token filters operate on every token produced for a stream. With this in mind one must consider their use carefully. Our category paths will seldom contain more than 5 elements and consequently this filter will not greatly affect performance when used with the PathHierarchyTokenizer. Any enhancements we make will be covered in a future article, and of course we welcome any advice from your own experience.
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenCountFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  private int count = 0;

  public TokenCountFilter(final TokenStream input) {
    super(input);
  }

  /**
   * Prepend the value of ${count} to each token and increment ${count}
   */
  @Override
  public final boolean incrementToken() throws IOException {
    final boolean increment = input.incrementToken();
    if (increment) {
      final int originalLength = termAtt.length();
      final char[] prefix = String.valueOf(count++).toCharArray();
      final int newLength = originalLength + prefix.length;
      // ensure the term buffer can hold the depth prefix plus the original term
      final char[] resizedBuffer = termAtt.resizeBuffer(newLength);
      // shift the original term right to make room for the depth prefix
      System.arraycopy(resizedBuffer, 0, resizedBuffer, prefix.length, originalLength);
      // copy the depth prefix into the start of the buffer
      System.arraycopy(prefix, 0, resizedBuffer, 0, prefix.length);
      termAtt.setLength(newLength);
    }
    return increment;
  }

  @Override
  public void end() throws IOException {
    super.end();
    count = 0;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    count = 0;
  }
}
For each call to input.incrementToken that returns true we end up with a new token in the termAtt buffer. The remainder of the method resizes the termAtt buffer and then refills it with the value of
count followed by the original characters, giving us the required depth encoded before the term in the index.
Running our test class will now give us a green bar.
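If you prefer the command line, the test can be run via Maven (assuming the standard Surefire setup and the test class name used above):
mvn test -Dtest=TokenCountFilterTest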
Step 3 Create a factory to provide the TokenFilter to ElasticSearch
import org.apache.lucene.analysis.TokenStream;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.analysis.AnalysisSettingsRequired;
import org.elasticsearch.index.settings.IndexSettings;

@AnalysisSettingsRequired
public class TokenCountFilterFactory extends AbstractTokenFilterFactory {

  @Inject
  public TokenCountFilterFactory(Index index, @IndexSettings Settings indexSettings,
      @Assisted String name, @Assisted Settings settings) {
    super(index, indexSettings, name, settings);
  }

  @Override
  public TokenStream create(TokenStream tokenStream) {
    return new TokenCountFilter(tokenStream);
  }
}
This code is fairly self-explanatory and is essentially boilerplate which follows the same pattern as the other token filter factories in ElasticSearch, i.e. a factory method that wraps the supplied TokenStream in a new instance of our TokenCountFilter.
Step 4 Make the new code available to ElasticSearch
Running:
mvn clean package
will compile the code and generate a jar file. This jar should then be added to $ES_HOME/lib. A restart of ElasticSearch will ensure that our classes are loaded.
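For example (the jar name below is a placeholder; use whatever artifact name your build produces, and restart the node however it is managed in your environment):
mvn clean package
# the exact jar name depends on your POM
cp target/<your-artifact>.jar $ES_HOME/lib/
# stop and start ElasticSearch so the new classes are picked up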
Step 5 Try it out
First we can set up a new 'test' index which will use our TokenFilter as part of its analysis:
curl -XPUT 'http://localhost:9200/test/' -d '
{
"index": {},
"settings": {
"analysis": {
"analyzer": {
"depth_path_hierarchy": { "type": "custom", "tokenizer": "path_hierarchy", "filter": ["token_count"] }
},
"filter" : {
"token_count": {"type": "com.springyweb.elasticsearch.index.analysis.TokenCountFilterFactory"}
}
}
}
}'
The "filter" section defines our new "token_count" filter. Note that we actually define the type as the fully qualified name of TokenCountFilterFactory, which is the class responsible for creating our TokenFilter.
The "analyzer" section defines a "depth_path_hierarchy" analyzer. Note that we specify the "path_hierarchy" tokenizer, which provides the tokens for our "token_count" filter.
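As a sanity check (assuming the index was created successfully and a node is listening on localhost:9200), the analyze API can be run against the new index to confirm that the depth-prefixed tokens are emitted:
curl -XGET 'http://localhost:9200/test/_analyze?analyzer=depth_path_hierarchy' -d '/a/b/c/d'
The response should list the tokens 0/a, 1/a/b, 2/a/b/c and 3/a/b/c/d.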
Next we set up a mapping for a type called 'content', which we shall be storing in our 'test' index.
curl -XPUT 'http://localhost:9200/test/content/_mapping' -d '
{
  "content" : {
    "properties" : {
      "categories" : {"type" : "string", "index_analyzer" : "depth_path_hierarchy"}
    }
  }
}'
A
Mapping defines how a document should be mapped to the search engine. In this case we have specified that the
"depth_path_hierarchy" analyzer should be used when indexing our "categories" property. Note: for this particular analyzer to be used solely for indexing (i.e. not for search) we used the key
"index_analyzer" in our request. When creating a mapping for a property it is also possible to specify which analyzer is to be used when searching by adding the key/value
"search_analyzer" : "<analyzer name>", or one can simply use the single key/value
"analyzer" : "<analyzer name>", which uses the same analyzer for both indexing and searching the property. Further details regarding mapping can be found
here.
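Purely for illustration (this isn't part of our setup, and the 'other_content' type below is made up), a mapping that combines an index-time analyzer with a different search-time analyzer might look something like this, using the built-in "keyword" analyzer at search time:
# hypothetical example - not used elsewhere in this article
curl -XPUT 'http://localhost:9200/test/other_content/_mapping' -d '
{
  "other_content" : {
    "properties" : {
      "categories" : {
        "type" : "string",
        "index_analyzer" : "depth_path_hierarchy",
        "search_analyzer" : "keyword"
      }
    }
  }
}'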
We are now in a position to add some test content to our 'test' index. (Take note of the categories)
curl -XPUT http://localhost:9200/test/content/1 -d '{
"name": "test 1",
"categories": ["/foo/bar/baz"]
}'
curl -XPUT http://localhost:9200/test/content/2 -d '{
"name": "test 2",
"categories": ["/foo/baz/bar"]
}'
The search shown below verifies that we can find both pieces of content beneath the root "0/foo" category, which means that Requirement 1 is satisfied.
Note that we use a
term query, which is not analyzed. The reason for this is that we want to search for an entire term, e.g.
/a/b/c/d, without that term being tokenized in any way. This makes sense as
otherwise a search for /a/b/c/d would itself be tokenized into path elements (if we used the
"depth_path_hierarchy" analyzer, for example) and
could therefore find something categorized more generally, /a/b/c
for example, which we don't want.
curl -XGET http://localhost:9200/test/content/_search -d '{
"query" : {
"term" : { "categories": "0/foo" }
}
}'
returns:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.5945348,
"hits" : [ {
"_index" : "test",
"_type" : "content",
"_id" : "1",
"_score" : 0.5945348, "_source" : {
"name": "test 1",
"categories": ["/foo/bar/baz"]
}
}, {
"_index" : "test",
"_type" : "content",
"_id" : "2",
"_score" : 0.5945348, "_source" : {
"name": "test 2",
"categories": ["/foo/baz/bar"]
}
} ]
}
}
Our final test is to try to fulfill Requirement 2, namely to return direct children as facets based on a given path.
curl -XGET http://localhost:9200/test/content/_search?pretty=true -d '{
"query" : {
"match_all" : { }
},
"facets" : {
"category" : {
"terms" : {
"field" : "categories",
"regex" : "^1/foo/.*"
}
}
}
}'
Note the regex in the terms facet. Here we are limiting the facets returned to those that start with 1/foo, i.e. a depth of 1 beneath /foo.
The response shows that we get the required facet terms, '1/foo/baz' and '1/foo/bar':
"facets" : {
"category" : {
"_type" : "terms",
"missing" : 0,
"total" : 6,
"other" : 4,
"terms" : [ {
"term" : "1/foo/baz",
"count" : 1
}, {
"term" : "1/foo/bar",
"count" : 1
} ]
}
}
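To drill down a further level the same pattern applies. As a hypothetical follow-up using the same test data, we can constrain the query to content beneath /foo/bar and ask for its direct children by bumping the depth prefix in both the term and the regex:
curl -XGET http://localhost:9200/test/content/_search?pretty=true -d '{
  "query" : {
    "term" : { "categories" : "1/foo/bar" }
  },
  "facets" : {
    "category" : {
      "terms" : {
        "field" : "categories",
        "regex" : "^2/foo/bar/.*"
      }
    }
  }
}'
With the two documents indexed above, this should match only 'test 1' and return a single facet term, 2/foo/bar/baz, with a count of 1.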
So that wraps this rather long post up. No doubt there are numerous approaches to solving this kind of problem, but this article was intended partly as a solution and more generally as an attempt
to illustrate some features of ElasticSearch that you may not be familiar with.
You can download the source code for this article:
here.
Happy searching!