Thursday, 8 March 2012

Run GWT samples from Eclipse

I'm documenting this here as it was a surprisingly awkward job to get the Google Web Toolkit samples running locally. First, install the GWT plugin for Eclipse.

Next check out and compile a local copy of the GWT source code. You'll need to fetch the main GWT source and the prerequisite tools.

    ~$ mkdir ~/gwt; cd ~/gwt
    ~/gwt$ svn checkout http://google-web-toolkit.googlecode.com/svn/tools/ tools
    ~/gwt$ export GWT_TOOLS=`pwd`/tools
    ~/gwt$ svn checkout http://google-web-toolkit.googlecode.com/svn/trunk/ trunk
    ~/gwt$ cd trunk
    ~/gwt/trunk$ ant dist

This should produce a zip file trunk/build/dist/gwt-0.0.0.zip. We need to unzip this so that the GWT plugin in Eclipse can work with it.

    ~/gwt/trunk$ cd ..
    ~/gwt$ mkdir trunk-dist; cd trunk-dist
    ~/gwt/trunk-dist$ unzip ../trunk/build/dist/gwt-0.0.0.zip

Now it is ready to be configured in the Eclipse GWT plugin. Open your Eclipse preferences, navigate to Google → Web Toolkit, click the Add… button, browse to your new gwt-0.0.0 directory, and give it a name like GWT-TRUNK.

Eclipse Properties, GWT SDK

Now we are set up to run the trunk samples, but for some reason we still can't directly open them in Eclipse. A workaround I'm using is to create a new project with the Eclipse GWT plugin, then delete all of the files and copy in the ones from the sample you're trying to run. For example, to run the Showcase sample, do the following:

Create a new Web Application Project (the first entry in the "g" drop down menu in your toolbar). Call it samples-showcase, give it any package name, pick the GWT TRUNK SDK we set up earlier, deselect Use Google App Engine and Generate project sample code, then click Finish.

Eclipse GWT Create Project

This will create the samples-showcase directory in your workspace directory. Change to this directory, delete the folders it created, and copy in the folders from the Showcase sample:

    ~$ cd YOUR_WORKSPACE_DIR/samples-showcase
    samples-showcase$ rm -rf src war
    samples-showcase$ cp -R ~/gwt/trunk/samples/showcase/* .

Go back to Eclipse and refresh your samples-showcase project. At this point you might get an error in Eclipse: a red X mark on the war/WEB-INF folder, and in the Problems panel the message "The GWT SDK JAR gwt-servlet.jar is missing in the WEB-INF/lib directory". Right-click this message and choose "Quick Fix" to get the following dialog box:

The GWT SDK JAR gwt-servlet.jar is missing in the WEB-INF/lib directory

Click Finish and it should be fixed.

You should now be able to run the sample (yay!). Right-click the samples-showcase project in the Package Explorer, choose Run As → Web Application, and the local URL should appear in your Development Mode view tab. Double click the URL to open it in your default browser, or right-click → Copy, and paste it into the URL bar of your preferred browser.

Wednesday, 1 February 2012

Tip: Content assist for static imports in Eclipse

You can add a list of "Favourites" in your Eclipse settings to help with content assist for standard static methods. For example, assertNotNull() and friends when writing unit tests.

To enable this, in your Eclipse Preferences window, navigate to Java / Editor / Content Assist / Favorites. I've got the following types in mine:

  • org.hamcrest.Matchers
  • org.junit.Assert
  • org.springframework.test.web.server.request.MockMvcRequestBuilders
  • org.springframework.test.web.server.result.MockMvcResultHandlers
  • org.springframework.test.web.server.result.MockMvcResultMatchers
  • org.springframework.test.web.server.setup.MockMvcBuilders
Add them one at a time with the New Type… button.

Tuesday, 24 January 2012

Hierarchical Faceting With Elastic Search

Our current project is using ElasticSearch for, er, search. Hierarchical categorization of content found itself pinned to our sprint planning board this week, so I decided to take a look at what options were available. Our classification is a pretty basic hierarchy, dividing and redividing things into groups, where each new group is a sub-species of its parent group.
Essentially, at this stage we have two main requirements:

1. Hierarchical Search

By classifying X as:
/a/b/c/d
We must be able to search for and find X for the following paths:
/a
/a/b
/a/b/c
/a/b/c/d
(I'm using /a/b/c/d for brevity, but imagine /Computer Peripherals/Hard Drives/External/USB for example).

2. Drill Down with facets

To allow for searching from the more general to the more specific, we want to be able to return the children of a given facet. e.g. when we ask for the direct children of /a as facets, the system should return /a/b.

A useful starting point came not from the ElasticSearch documentation but from the Solr wiki, which is a worthwhile read in my opinion. Much of what follows is explained in fuller detail there.

A nice solution to requirement 1 is provided by Lucene's PathHierarchyTokenizer, which is exposed by ElasticSearch.
As described in the docs this tokenizer allows us to take input like:
/a/b/c/d
and produce the following tokens:
/a
/a/b
/a/b/c
/a/b/c/d

This works pretty well; however, using this tokenizer alone makes it hard to constrain the depth of the classification (Requirement 2). We therefore decided to follow the seemingly hacky but certainly workable idea (see facet.prefix in the Solr article) of storing depth information along with the path.

So our indexed category terms will now be stored as:
0/a
1/a/b
2/a/b/c
3/a/b/c/d
i.e. 1/a/b represents the fact that /a/b is 1 level below /a. With this information we can now use a regex pattern to get facets for a particular depth - more on this later in the article.
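To make the encoding concrete, here is a minimal, self-contained sketch (plain Java, no Lucene involved; the class and method names are my own, purely for illustration) that expands a path into the depth-prefixed tokens described above:

```java
import java.util.ArrayList;
import java.util.List;

public class DepthPrefixDemo {

    // Expand "/a/b/c" into ["0/a", "1/a/b", "2/a/b/c"], mimicking the
    // PathHierarchyTokenizer followed by a depth-prefixing filter.
    static List<String> depthPrefixedPaths(String path) {
        List<String> tokens = new ArrayList<>();
        int depth = 0;
        StringBuilder current = new StringBuilder();
        for (String element : path.substring(1).split("/")) {
            current.append('/').append(element);
            tokens.add(depth++ + current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [0/a, 1/a/b, 2/a/b/c, 3/a/b/c/d]
        System.out.println(depthPrefixedPaths("/a/b/c/d"));
    }
}
```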

Encoding the depth

We gave some thought to how we might go about encoding depth along with our categories and opted for writing a custom TokenFilter with input provided from the token stream of the PathHierarchyTokenizer.

The remainder of this article is intended to illustrate how to write and test a TokenFilter as well as how to deploy and reference it for use with ElasticSearch. The full source code is available on GitHub.

We are using Maven 3 for managing this customisation, so having created a new Maven (jar) project in my IDE, I set up the required dependencies (Full POM is here).

    <dependencies>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
            <version>${elasticsearch.version}</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-test-framework</artifactId>
            <version>${lucene.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>${junit.version}</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
Note: the lucene-test-framework allows us to easily test our TokenFilter without having to keep deploying to ElasticSearch.

Step 1 Write a failing test
public class TokenCountFilterTest {

    private class TestAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(final String fieldName, final Reader reader) {
        return new TokenCountFilter(new PathHierarchyTokenizer(reader));
      }   
    }
    
    /**
     * Create a string of ${pathElementCount} path elements
     * and assert that each path in the hierarchy is prepended with its depth.
     * 
     * e.g given the input /a/a/a/a
     * 
     * We would expect the output
     * 
     * 0/a
     * 1/a/a
     * 2/a/a/a
     * 3/a/a/a/a
     * @throws IOException 
     * 
     */
    @Test
    public void testTokens() throws IOException {
      int pathElementCount = 10;
      StringBuffer input = new StringBuffer();
      String[] output = new String[pathElementCount];
      String pathElement = "/a";
      for(int i = 0; i< pathElementCount; i++) {
       input.append(pathElement);
       output[i] = i + input.toString(); 
      }
      
      Analyzer testAnalyzer = new TestAnalyzer();
      BaseTokenStreamTestCase.assertAnalyzesTo(testAnalyzer, input.toString(),output);
    }
}

In the TestAnalyzer inner class we create an Analyzer for use with the test. TokenCountFilter is the filter that we are going to write; it amends the tokens produced by the PathHierarchyTokenizer instance passed to its constructor.

The javadoc for the testTokens method describes how the test will work. Note that most of the work is managed by BaseTokenStreamTestCase. This Lucene test class provides the static method assertAnalyzesTo, which accepts an Analyzer to test, an input string to analyze, e.g. "/a/b/c", and an array of expected output tokens, e.g. ["0/a", "1/a/b", "2/a/b/c"], which should be emitted after tokenization. Internally BaseTokenStreamTestCase uses the JUnit framework to make assertions based on the input and expected output.

Step 2 Write the TokenFilter

Note: This may not be the most efficient way of achieving our goal, but it performs well enough in our tests. It is important to note that token filters operate on every token produced for a stream, so one must consider their use carefully. Our category paths will seldom contain more than 5 elements, and consequently this filter will not greatly affect performance when used with the PathHierarchyTokenizer. Any future enhancements by us will be posted in a future article, and of course we welcome any advice from your own experience.

public class TokenCountFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private int count = 0;

  public TokenCountFilter(final TokenStream input) {
    super(input);
  }

  /**
   * Prepend the value of ${count} to each token and increment ${count}.
   */
  @Override
  public final boolean incrementToken() throws IOException {
    final boolean increment = input.incrementToken();
    if (increment) {
      final char[] count = String.valueOf(this.count++).toCharArray();
      final int originalLength = termAtt.length();
      final int newLength = originalLength + count.length;
      final char[] resizedBuffer = termAtt.resizeBuffer(newLength);
      // Shift the original term right by count.length, then write the depth digits at the front.
      System.arraycopy(resizedBuffer, 0, resizedBuffer, count.length, originalLength);
      System.arraycopy(count, 0, resizedBuffer, 0, count.length);
      termAtt.setLength(newLength);
    }
    return increment;
  }

  @Override
  public void end() throws IOException {
    super.end();
    count = 0;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    count = 0;
  }
}

Each call to input.incrementToken that returns true leaves a new token in the termAtt buffer (the CharTermAttribute declared at the top of the class). The remainder of the method resizes that buffer, shifts the original characters along, and writes the value of count in front of them, hence providing us with the required depth encoded before the term in the index.

Running our test class will now give us a green bar.
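The shift-then-overwrite trick on the term buffer is perhaps the least obvious part, so here is a standalone sketch of the same idea on a plain char[] (the class and method names are hypothetical; no Lucene classes are involved):

```java
public class PrefixShiftDemo {

    // Prepend a prefix to the first termLength chars of a buffer in place:
    // shift the original chars right, then write the prefix at the front.
    // Assumes the buffer has capacity for termLength + prefix.length chars.
    static String prefixInPlace(char[] buffer, int termLength, char[] prefix) {
        System.arraycopy(buffer, 0, buffer, prefix.length, termLength);
        System.arraycopy(prefix, 0, buffer, 0, prefix.length);
        return new String(buffer, 0, termLength + prefix.length);
    }

    public static void main(String[] args) {
        char[] buffer = new char[16];
        "/a/b".getChars(0, 4, buffer, 0);
        System.out.println(prefixInPlace(buffer, 4, "1".toCharArray())); // prints 1/a/b
    }
}
```

The filter does the same thing, except that the buffer belongs to the CharTermAttribute and must first be grown with resizeBuffer.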

Step 3 Create a factory to provide the TokenFilter to ElasticSearch
@AnalysisSettingsRequired
public class TokenCountFilterFactory extends AbstractTokenFilterFactory {
  
  @Inject
  public TokenCountFilterFactory(Index index, @IndexSettings Settings indexSettings,
      @Assisted String name, @Assisted Settings settings) { 
      super(index, indexSettings, name, settings);
  }
  
  @Override
  public TokenStream create(TokenStream tokenStream) {
      return new TokenCountFilter(tokenStream);
  }
}
This code is fairly self-explanatory and follows the same pattern as the other token filter factories in ElasticSearch: the create method wraps the supplied TokenStream in a new instance of our TokenCountFilter.

Step 4 Make the new code available to ElasticSearch

Running:

mvn clean package


will compile the code and generate a jar file. This jar should then be added to $ES_HOME/lib. A restart of ElasticSearch will ensure that our classes are loaded.

Step 5 Try it out

First we can set up a new 'test' index which will use our TokenFilter as part of its analysis:
curl -XPUT 'http://localhost:9200/test/' -d '
{
  "index": {},
  "settings": {
    "analysis": {
      "analyzer": {
        "depth_path_hierarchy": { "type": "custom", "tokenizer": "path_hierarchy", "filter": ["token_count"] }
      },
      "filter" : {
  "token_count": {"type": "com.springyweb.elasticsearch.index.analysis.TokenCountFilterFactory"}
      }
    }
  }
}'
The "filter" section defines our new "token_count" filter. Note that the type we give is TokenCountFilterFactory, the class responsible for creating our TokenFilter. The "analyzer" section defines a "depth_path_hierarchy" analyzer; note that we specify the "path_hierarchy" tokenizer, which will provide the tokens for our "token_count" filter.

Next we set up a mapping for a type called 'content', which we shall store in our 'test' index.
curl -XPUT 'http://localhost:9200/test/content/_mapping' -d '
{
     "content" : {
         "properties" : {
             "categories" : {"type" : "string", "index_analyzer" : "depth_path_hierarchy"}
         }
     }
}'
A mapping defines how a document should be mapped to the search engine. Here we have specified that the "depth_path_hierarchy" analyzer should be used when indexing our "categories" property. Note: for this analyzer to be used solely for indexing (i.e. not for search) we used the key "index_analyzer" in our request. When creating a mapping for a property it is also possible to specify the analyzer to be used when searching by adding "search_analyzer" : "<analyzer name>", or one can simply use "analyzer" : "<analyzer name>", which applies the same analyzer to both indexing and searching. Further details regarding mapping can be found here.

We are now in a position to add some test content to our 'test' index (take note of the categories):
curl -XPUT http://localhost:9200/test/content/1 -d '{
    "name": "test 1",
    "categories": ["/foo/bar/baz"]
}'

curl -XPUT http://localhost:9200/test/content/2 -d '{
    "name": "test 2",
    "categories": ["/foo/baz/bar"]
}'

The search shown below verifies that we can find both pieces of content beneath the root "0/foo" category, which means that Requirement 1 is satisfied.
Note that we use a term query, which is not analyzed; the reason is that we want to search for an entire term, e.g. /a/b/c/d, without that term being tokenized in any way. This makes sense because otherwise a search for /a/b/c/d would itself be tokenized into path elements (if we used the "depth_path_hierarchy" analyzer, for example) and could therefore find something categorized more generally, /a/b/c for example, which we don't want.

curl -XGET http://localhost:9200/test/content/_search -d '{
    "query" : {
        "term" : { "categories": "0/foo" }
    }
}'

returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.5945348,
    "hits" : [ {
      "_index" : "test",
      "_type" : "content",
      "_id" : "1",
      "_score" : 0.5945348,
      "_source" : {
        "name": "test 1",
        "categories": ["/foo/bar/baz"]
      }
    }, {
      "_index" : "test",
      "_type" : "content",
      "_id" : "2",
      "_score" : 0.5945348,
      "_source" : {
        "name": "test 2",
        "categories": ["/foo/baz/bar"]
      }
    } ]
  }
}
Our final test is to try to fulfill Requirement 2, namely to return direct children as facets based on a given path.
curl -XGET http://localhost:9200/test/content/_search?pretty=true -d '{
    "query" : {
            "match_all" : {  }
        },
    "facets" : {
        "category" : { 
            "terms" : {
                "field" : "categories",
                "regex" : "^1/foo/.*"
            } 
       }
    }
}'

Note the regex in the terms facet. Here we are limiting the facets returned to those that start with 1/foo, i.e. a depth of 1 beneath /foo. The response shows that we get the required facet terms, '1/foo/baz' and '1/foo/bar':
"facets" : {
   "category" : {
     "_type" : "terms",
     "missing" : 0,
     "total" : 6,
     "other" : 4,
     "terms" : [ {
       "term" : "1/foo/baz",
       "count" : 1
     }, {
       "term" : "1/foo/bar",
       "count" : 1
     } ]
   }
So that wraps up this rather long post. No doubt there are numerous approaches to solving this kind of problem, but this article is intended partly as a solution and more generally as an illustration of some features of ElasticSearch that you may not be familiar with.

You can download the source code for this article: here.

Happy searching!

Tuesday, 13 December 2011

Error building Erlang on Lion using Macports

Macports was hanging during my install of the rabbitmq-server port. Running port -v showed me that an error was occurring during Erlang compilation (RabbitMQ is written in Erlang). The issue, it seems, is due to the optimisation settings used by llvm-gcc-4.2. The solution is to patch the Erlang Portfile. Edit:
/opt/local/var/macports/sources/rsync.macports.org/release/ports/lang/erlang/Portfile
Search for:
if {${configure.compiler} == "clang"} { 
    configure.compiler llvm-gcc-4.2 
} 
And insert the following directly below it:
configure.cflags-delete "-std=c99"
configure.cflags-delete -O2
configure.cxxflags-delete -O2
configure.cflags-append -O0
configure.cxxflags-append -O0
Save the file and then re-run the port install.

Thursday, 8 December 2011

Useful MongoDB book

Kristina Chodorow's '50 Tips and Tricks for MongoDB Developers' is a concise and really useful resource. It backed up some of my assumptions and taught me a great deal that I had not considered.

Wednesday, 30 November 2011

Java Virtual Machine Process Status Tool

I'm sure I'm just way behind the times, but I never knew about the Java Virtual Machine Process Status Tool even though it's been around since Java 1.5. Running jps on the command line with no options shows me that my Hadoop processes are running.
18513 DataNode
18582 SecondaryNameNode
18761 Jps
18699 TaskTracker
18631 JobTracker
With the -l option I can see the package names too:
18513 org.apache.hadoop.hdfs.server.datanode.DataNode
18763 sun.tools.jps.Jps
18582 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
18699 org.apache.hadoop.mapred.TaskTracker
18631 org.apache.hadoop.mapred.JobTracker
With -v I can see the arguments passed to the JVM.
18513 DataNode -Dfile.encoding=UTF-8 -Xserver -Dproc_datano...
18582 SecondaryNameNode -Dfile.encoding=UTF-8 -Dproc_seconda...
18766 Jps -Dfile.encoding=UTF-8 -Dapplication.home=/System/Li...
18699 TaskTracker -Dfile.encoding=UTF-8 -Dproc_tasktracker -Xm ...
18631 JobTracker -Dfile.encoding=UTF-8 -Dproc_jobtracker -Xmx ...
A useful tool that is certainly easier than
 ps aux | grep java
Do note, though, that what looks like the OS process id is actually the local VM identifier, which according to the jps docs "is typically, but not necessarily, the operating system's process identifier for the JVM process".

Hadoop on OSX 10.7

When starting Hadoop using start-all.sh I kept seeing the following in my log file:
"Unable to load realm info from SCDynamicStore"
A workaround is suggested on the Hadoop Jira under HADOOP-7489, which is to specify the kerberos config on the command line:
 -Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk
This works perfectly, but it is easier if you add it to the HADOOP_OPTS environment variable in $HADOOP_HOME/conf/hadoop-env.sh. Mine looks like this:
export HADOOP_OPTS="-server -Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"