Wednesday, 30 November 2011

Java Virtual Machine Process Status Tool

I'm sure I'm just way behind the times, but I never knew about the Java Virtual Machine Process Status Tool (jps), even though it's been around since Java 1.5. Running jps on the command line with no options shows me that my Hadoop processes are running:
18513 DataNode
18582 SecondaryNameNode
18761 Jps
18699 TaskTracker
18631 JobTracker
With the -l option I can see the package names too:
18513 org.apache.hadoop.hdfs.server.datanode.DataNode
18582 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
18699 org.apache.hadoop.mapred.TaskTracker
18631 org.apache.hadoop.mapred.JobTracker
With -v I can see the arguments passed to the JVM:
18513 DataNode -Dfile.encoding=UTF-8 -Xserver -Dproc_datano...
18582 SecondaryNameNode -Dfile.encoding=UTF-8 -Dproc_seconda...
18766 Jps -Dfile.encoding=UTF-8 -Dapplication.home=/System/Li...
18699 TaskTracker -Dfile.encoding=UTF-8 -Dproc_tasktracker -Xm ...
18631 JobTracker -Dfile.encoding=UTF-8 -Dproc_jobtracker -Xmx ...
A useful tool that is certainly easier than
 ps aux | grep java
Do note, though, that what looks like the OS process ID is actually the local VM identifier, which according to the jps docs "is typically, but not necessarily, the operating system's process identifier for the JVM process".
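Since the plain jps output is just "identifier name" pairs, it is easy to script against. Here is a minimal sketch, assuming jps is on the PATH and that you are using the short names from the no-options output. The function name missing_daemons is my own invention, not part of jps or Hadoop. It reads jps-style lines on stdin and prints any expected daemon that is not listed.

```shell
# Print each expected daemon name that does NOT appear in jps-style output.
# Reads "<lvmid> <name>" lines on stdin; expected names are the arguments.
# (missing_daemons is a hypothetical helper, not a jps feature.)
missing_daemons() {
  awk -v want="$*" '
    BEGIN { n = split(want, names, " ") }   # expected names, in order
    { seen[$2] = 1 }                        # second column is the short name
    END {
      for (i = 1; i <= n; i++)
        if (!(names[i] in seen)) print names[i]
    }'
}

# Usage (needs a JDK with jps installed):
#   jps | missing_daemons DataNode SecondaryNameNode TaskTracker JobTracker
```

With jps -l the second column is the fully qualified class name, so the names passed in would need to match that form instead.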

Hadoop on OSX 10.7

When starting Hadoop I kept seeing the following in my log file:
"Unable to load realm info from SCDynamicStore"
A workaround is suggested on the Hadoop Jira under HADOOP-7489, which is to specify the Kerberos config on the command line.
This works perfectly, but it is easier to add it to the HADOOP_OPTS environment variable in $HADOOP_HOME/conf/. Mine looks like this:
export HADOOP_OPTS="-server"
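For illustration, here is a sketch of what a HADOOP_OPTS line carrying Kerberos settings could look like. The java.security.krb5.realm and java.security.krb5.kdc system properties are standard Java ones, but the realm and KDC values below are placeholders I have made up, not the values from the Jira ticket or from my own config, so substitute whatever the workaround or your environment calls for.

```shell
# Placeholder values (assumptions): replace with the realm and KDC that the
# HADOOP-7489 workaround, or your own environment, calls for.
KRB5_REALM="EXAMPLE.COM"
KRB5_KDC="kdc.example.com"

# Pass the Kerberos settings to every Hadoop JVM as system properties.
export HADOOP_OPTS="-server -Djava.security.krb5.realm=${KRB5_REALM} -Djava.security.krb5.kdc=${KRB5_KDC}"
```

Keeping the values in their own variables just makes the line easier to read and change.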

Tuesday, 29 November 2011

ElasticSearch is full of Springy goodness

Firstly, I am a huge fan of Solr. It is easy to test, blisteringly fast, and incredibly stable. I would recommend it as a solution to many search problems in a heartbeat. However, despite the recent efforts of SolrCloud, I have found that, as a scalable cloud solution, Solr is still in its infancy.

Thanks to a useful blog article I was able to test some multi-core examples out (note: if you do follow the blog example then beware of capitalisation issues with attribute names, e.g. "instancedir" should be camel case: "instanceDir"). My major issue is that the whole approach feels too clunky, particularly for a solution such as ours, which requires multi-tenancy and therefore a large number of cores. At the time of writing SolrCloud doesn't seem to have a nice way of discovering other nodes in a cluster. It all works but it isn't slick.

ElasticSearch on the other hand has been designed for the cloud. James Cook has written a brilliant tutorial on running ElasticSearch on EC2. In just a few hours I had created some test EC2 instances running ElasticSearch.
Discovery of other ElasticSearch instances was trivial. The cloud-aws plugin allows for several options, e.g. security groups or tags. When I stopped the "master" instance, the "slave" instance noticed and took over as master. Asynchronous backup to S3 just worked.
Multi-tenancy has been made really easy too: if you post to an index that doesn't exist, it gets created. Trivial to get going and so far very impressive.
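To make the "post to an index that doesn't exist" point concrete, here is a rough sketch using curl. It assumes a node listening on the default localhost:9200 and the index/type URL layout of the REST API at the time of writing; newer releases lay out the URL differently. The es_index_doc helper and the tenant and type names are hypothetical.

```shell
# Assumed default: a node listening on localhost:9200.
ES_HOST="${ES_HOST:-localhost:9200}"

# Index one JSON document. As described above, the index is created on the
# first write if it does not already exist.
#   $1 = index name (e.g. one per tenant), $2 = type name, $3 = JSON body
# (es_index_doc is a hypothetical helper, not part of ElasticSearch.)
es_index_doc() {
  curl -s -XPOST "http://${ES_HOST}/$1/$2" \
       -H 'Content-Type: application/json' \
       -d "$3"
}

# Usage (hypothetical names; needs a running node):
#   es_index_doc tenant42 article '{"title": "hello"}'
```

One index per tenant is only one way to do multi-tenancy, but it shows why not having to create the index up front is so convenient.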

Developing an API service with Ruby

Our current project involves creating a JSON-based API service, which can be called either from a web browser or from other code. We started developing it directly in Rails, but then found the Grape project on GitHub.
Grape seems to be a much better fit for the project than Rails was. We will still use Rails for the front-end website, but using Grape for the API service means that the code is a lot simpler.