Tuesday, 24 January 2012

Hierarchical Faceting With Elastic Search

Our current project is using ElasticSearch for, er, search. Hierarchical categorization of content found itself pinned to our sprint planning board this week, so I decided to look into the available options. Our classification is a fairly basic hierarchy that divides and subdivides things into groups, where each new group is a sub-species of its parent group.
Essentially, at this stage we have two main requirements:

1. Hierarchical Search

By classifying X as:
/a/b/c/d
We must be able to search for and find X for the following paths:
/a
/a/b
/a/b/c
/a/b/c/d
(I'm using /a/b/c/d for brevity, but imagine /Computer Peripherals/Hard Drives/External/USB for example).

2. Drill Down with facets

To allow searching from the more general to the more specific, we want to be able to return the children of a given facet. e.g. when we ask for the direct children of /a as facets, the system should return /a/b.

A useful starting point came not from the ElasticSearch documentation but from the Solr wiki, which is a worthwhile read in my opinion. Much of what follows is explained in fuller detail there.

A nice solution to requirement 1 is provided by Lucene's PathHierarchyTokenizer, which is exposed by ElasticSearch.
As described in the docs this tokenizer allows us to take input like:
/a/b/c/d
and produce the following tokens:
/a
/a/b
/a/b/c
/a/b/c/d

This works pretty well; however, using this tokenizer alone makes it hard to constrain the depth of the classification (Requirement 2). We therefore decided to follow the seemingly hacky but certainly workable idea (see facet.prefix in the Solr article) of storing depth information along with the path.

So our indexed category terms will now be stored as:
0/a
1/a/b
2/a/b/c
3/a/b/c/d
i.e. 1/a/b represents the fact that /a/b is 1 level below /a. With this information we can now use a regex pattern to restrict facets to a particular depth - more on this later in the article.
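To make the scheme concrete, here is a small standalone sketch (plain Java, no Lucene; the class and method names are ours, purely for illustration) that produces the depth-prefixed terms for a given path:

```java
import java.util.ArrayList;
import java.util.List;

public class DepthEncodedPaths {

    // Produce the depth-prefixed hierarchy terms for a path,
    // e.g. "/a/b/c" -> ["0/a", "1/a/b", "2/a/b/c"].
    public static List<String> encode(String path) {
        List<String> terms = new ArrayList<String>();
        StringBuilder prefix = new StringBuilder();
        int depth = 0;
        for (String element : path.substring(1).split("/")) {
            prefix.append('/').append(element);
            terms.add(depth++ + prefix.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String term : encode("/a/b/c/d")) {
            System.out.println(term); // 0/a, 1/a/b, 2/a/b/c, 3/a/b/c/d
        }
    }
}
```

In the real index this work is split between the PathHierarchyTokenizer (the cumulative prefixes) and the custom filter we build below (the depth digits).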

Encoding the depth

We gave some thought to how we might go about encoding depth along with our categories and opted for writing a custom TokenFilter with input provided from the token stream of the PathHierarchyTokenizer.

The remainder of this article is intended to illustrate how to write and test a TokenFilter as well as how to deploy and reference it for use with ElasticSearch. The full source code is available on GitHub.

We are using Maven 3 for managing this customisation, so having created a new Maven (jar) project in my IDE, I set up the required dependencies (full POM is here).

   
    
        
<dependencies>
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>${elasticsearch.version}</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-test-framework</artifactId>
        <version>${lucene.version}</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>${junit.version}</version>
        <scope>test</scope>
    </dependency>
</dependencies>
Note: the lucene-test-framework allows us to easily test our TokenFilter without having to keep deploying to ElasticSearch.

Step 1 Write a failing test
public class TokenCountFilterTest {

    private class TestAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(final String fieldName, final Reader reader) {
        return new TokenCountFilter(new PathHierarchyTokenizer(reader));
      }   
    }
    
    /**
     * Create a string of ${pathElementCount} path elements
     * and assert that each path in the hierarchy is prepended with its depth.
     * 
     * e.g given the input /a/a/a/a
     * 
     * We would expect the output
     * 
     * 0/a
     * 1/a/a
     * 2/a/a/a
     * 3/a/a/a/a
     * @throws IOException 
     * 
     */
    @Test
    public void testTokens() throws IOException {
      int pathElementCount = 10;
      StringBuffer input = new StringBuffer();
      String[] output = new String[pathElementCount];
      String pathElement = "/a";
      for(int i = 0; i< pathElementCount; i++) {
       input.append(pathElement);
       output[i] = i + input.toString(); 
      }
      
      Analyzer testAnalyzer = new TestAnalyzer();
      BaseTokenStreamTestCase.assertAnalyzesTo(testAnalyzer, input.toString(),output);
    }
}

In the inner TestAnalyzer class we create an Analyzer for use with the test. TokenCountFilter is the filter that we are going to write; it amends the tokens produced by the PathHierarchyTokenizer instance passed to its constructor.

The javadoc for the testTokens method describes how the test works. Note that most of the work is managed by BaseTokenStreamTestCase. This Lucene test class provides the static method assertAnalyzesTo, which accepts an Analyzer to test, an input string to analyze, e.g. "/a/b/c", and an array of expected output tokens, e.g. ["0/a", "1/a/b", "2/a/b/c"], which should be emitted after tokenization. Internally, BaseTokenStreamTestCase uses the JUnit framework to make assertions based on the input and expected output.

Step 2 Write the TokenFilter

Note: This may not be the most efficient way of achieving our goal, but it performs well enough in our tests. It is important to note that token filters operate on every token produced for a stream, so their use must be considered carefully. Our category paths will seldom contain more than 5 elements, so this filter will not greatly affect performance when used with the PathHierarchyTokenizer. Any future enhancements will be covered in a follow-up article, and of course we welcome advice from your own experience.

public class TokenCountFilter extends TokenFilter {

private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private int count = 0;

public TokenCountFilter(final TokenStream input) {
  super(input);
}

/**
 * Prepend the value of ${count} to each token and increment ${count}
 */
@Override
public final boolean incrementToken() throws IOException {
  final boolean increment = input.incrementToken();
  if (increment) {
    final char[] count = String.valueOf(this.count++).toCharArray();
    final int originalLength = termAtt.length();
    final int newLength = originalLength + count.length;
    // resizeBuffer guarantees capacity for the term plus its depth prefix
    final char[] resizedBuffer = termAtt.resizeBuffer(newLength);
    // shift the original term right to make room for the prefix...
    System.arraycopy(resizedBuffer, 0, resizedBuffer, count.length, originalLength);
    // ...then write the prefix into the freed space
    System.arraycopy(count, 0, resizedBuffer, 0, count.length);
    termAtt.setLength(newLength);
  }
  return increment;
}

@Override 
public void end() throws IOException {
  super.end();
  count = 0;
}

@Override
public void reset() throws IOException {
  super.reset();
  count = 0;
}
}

For each call to input.incrementToken that returns true we have a new token in the buffer held by termAtt. The remainder of the method resizes that buffer and refills it with the value of count followed by the original chars, giving us the required depth encoded before the term in the index.
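The in-place buffer manipulation is the easiest part to get wrong, so here is a standalone sketch of the same technique outside Lucene (the class and method names are hypothetical, purely for illustration): grow the buffer, shift the existing chars right, then write the prefix into the gap.

```java
import java.util.Arrays;

public class PrependSketch {

    // Prepend a prefix to the first `length` chars of `buffer`,
    // mirroring the two arraycopy calls in incrementToken above.
    public static String prepend(char[] buffer, int length, String prefix) {
        char[] p = prefix.toCharArray();
        // grow the buffer, as CharTermAttribute.resizeBuffer does
        char[] resized = Arrays.copyOf(buffer, length + p.length);
        // shift the original term right to make room for the prefix
        System.arraycopy(resized, 0, resized, p.length, length);
        // write the prefix into the freed space
        System.arraycopy(p, 0, resized, 0, p.length);
        return new String(resized, 0, length + p.length);
    }

    public static void main(String[] args) {
        System.out.println(prepend("/a/b".toCharArray(), 4, "1")); // prints "1/a/b"
    }
}
```

Note that System.arraycopy is safe for the overlapping shift: it behaves as if the source range were first copied to a temporary buffer.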

Running our test class will now give us a green bar.

Step 3 Create a factory to provide the TokenFilter to ElasticSearch
@AnalysisSettingsRequired
public class TokenCountFilterFactory extends AbstractTokenFilterFactory {
  
  @Inject
  public TokenCountFilterFactory(Index index, @IndexSettings Settings indexSettings,
      @Assisted String name, @Assisted Settings settings) { 
      super(index, indexSettings, name, settings);
  }
  
  @Override
  public TokenStream create(TokenStream tokenStream) {
      return new TokenCountFilter(tokenStream);
  }
}
This code is fairly self-explanatory; it follows the same pattern as the other token filter factories in ElasticSearch, with the create method wrapping the supplied TokenStream in a new instance of our TokenCountFilter.

Step 4 Make the new code available to ElasticSearch

Running:

mvn clean package


will compile the code and generate a jar file. This jar should then be added to $ES_HOME/lib. A restart of ElasticSearch will ensure that our classes get loaded.

Step 5 Try it out

First we can setup a new 'test' index which will use our TokenFilter as part of its analysis:
curl -XPUT 'http://localhost:9200/test/' -d '
{
  "index": {},
  "settings": {
    "analysis": {
      "analyzer": {
        "depth_path_hierarchy": { "type": "custom", "tokenizer": "path_hierarchy", "filter": ["token_count"] }
      },
      "filter" : {
  "token_count": {"type": "com.springyweb.elasticsearch.index.analysis.TokenCountFilterFactory"}
      }
    }
  }
}'
The "filter" section defines our new "token_count" filter. Note that we actually define the type as TokenCountFilterFactory, the class responsible for creating our TokenFilter. The "analyzer" section defines a "depth_path_hierarchy" analyzer; note that we specify the "path_hierarchy" tokenizer, which will provide the tokens for our "token_count" filter.

Next we setup a mapping for a type called 'content' which we shall be storing in our 'test' index.
curl -XPUT 'http://localhost:9200/test/content/_mapping' -d '
{
    "content" : {
        "properties" : {
            "categories" : {"type" : "string", "index_analyzer" : "depth_path_hierarchy"}
        }
    }
}'
A mapping defines how a document is mapped to the search engine. In this case we have specified that the "depth_path_hierarchy" analyzer should be used when indexing our "categories" property. Note: to apply this analyzer only at index time (i.e. not at search time) we used the key "index_analyzer" in our request. When creating a mapping for a property it is also possible to specify the analyzer used at search time by adding "search_analyzer" : "<analyzer name>", or one can simply use the single key "analyzer" : "<analyzer name>", which applies the same analyzer to both indexing and searching. Further details regarding mapping can be found here.
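For example, a property that analyzes paths at index time but treats search input as a single term might be mapped like this (a hypothetical fragment, not used in this article; "keyword" is one of ElasticSearch's built-in analyzers, which emits the whole input as one token):

```json
"categories" : {
    "type" : "string",
    "index_analyzer" : "depth_path_hierarchy",
    "search_analyzer" : "keyword"
}
```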

We are now in a position to add some test content to our 'test' index. (Take note of the categories)
curl -XPUT http://localhost:9200/test/content/1 -d '{
    "name": "test 1",
  "categories": ["/foo/bar/baz"]
}'

curl -XPUT http://localhost:9200/test/content/2 -d '{
    "name": "test 2",
  "categories": ["/foo/baz/bar"]
}'

The search shown below verifies that we can find both pieces of content beneath the root "0/foo" category, which means that Requirement 1 is satisfied.
Note that we use a term query, which is not analyzed. The reason is that we want to search for an entire term, e.g. /a/b/c/d, without that term being tokenized in any way. This makes sense because otherwise a search for /a/b/c/d would be tokenized into path elements (if we used the "depth_path_hierarchy" analyzer, for example) and could therefore match something categorized more generally, /a/b/c for instance, which we don't want.

curl -XGET http://localhost:9200/test/content/_search -d '{
    "query" : {
        "term" : { "categories": "0/foo" }
    }
}'

returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.5945348,
    "hits" : [ {
      "_index" : "test",
      "_type" : "content",
      "_id" : "1",
      "_score" : 0.5945348, "_source" : {
    "name": "test 1",
"categories": ["/foo/bar/baz"]
}
    }, {
      "_index" : "test",
      "_type" : "content",
      "_id" : "2",
      "_score" : 0.5945348, "_source" : {
    "name": "test 2",
"categories": ["/foo/baz/bar"]
}
    } ]
  }
}
Our final test is to try to fulfill Requirement 2, namely to return direct children as facets based on a given path.
curl -XGET http://localhost:9200/test/content/_search?pretty=true -d '{
    "query" : {
            "match_all" : {  }
        },
    "facets" : {
        "category" : { 
            "terms" : {
                "field" : "categories",
                "regex" : "^1/foo/.*"
            } 
       }
    }
}'

Note the regex in the facet definition. Here we limit the facets returned to those that start with 1/foo, i.e. a depth of 1 beneath /foo. The response shows that we get the required facet terms, '1/foo/baz' and '1/foo/bar':
"facets" : {
   "category" : {
     "_type" : "terms",
     "missing" : 0,
     "total" : 6,
     "other" : 4,
     "terms" : [ {
       "term" : "1/foo/baz",
       "count" : 1
     }, {
       "term" : "1/foo/bar",
       "count" : 1
     } ]
   }
 }
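One small client-side consequence of the depth prefix: facet terms come back as "1/foo/baz" rather than "/foo/baz", so a UI will typically strip the leading digits before display. A trivial helper (hypothetical, not part of this article's code) might look like this:

```java
public class FacetTermDisplay {

    // Strip the leading depth digits from an indexed facet term,
    // e.g. "1/foo/baz" -> "/foo/baz".
    public static String displayPath(String term) {
        int slash = term.indexOf('/');
        return slash >= 0 ? term.substring(slash) : term;
    }

    public static void main(String[] args) {
        System.out.println(displayPath("1/foo/baz")); // prints "/foo/baz"
    }
}
```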
So that wraps up this rather long post. No doubt there are numerous approaches to solving this kind of problem, but this article was intended partly as a solution and more generally as an illustration of some features of ElasticSearch that you may not be familiar with.

You can download the source code for this article: here.

Happy searching!

3 comments:

  1. Thanks for the post! It was helpful.

  2. I have one question though: with this approach, if we say "give me all items under the category 'a/b/c', and exactly under this category", then we have a problem: we'll end up fetching all items under 'a/b/c/d' as well. I am thinking that we should tokenize in the following way:
    0/a
    1/a/b
    2/a/b/c^

    The ^ above will be added to the exact full hierarchy. This way, if we want to get items exactly under a/b/c, we can search for 2/a/b/c^.

  3. Hi, this is really helpful. I have a question: similar to the above, when I need to search for an item under 'c' and show the breadcrumb for that item, like a/b/c (showing the hierarchy), how would I do that?
