Writing a Solr Analysis Filter Plugin

June 10th, 2008

As we’ve been working on getting better results out of the phonetic search we’re currently doing at derdubor, I started writing a plugin for Solr to return better results when searching for Norwegian names. We’ve been using the standard phonetic filter from Solr 1.2 so far, with the double metaphone encoder turning each regular token into a phonetic value. The trouble with this is that a double metaphone value is just four letters, which means that a search word such as ‘trafikkontroll’ ends up matching ‘Dyrvik’. The latter is a name, while the former is a regular search string which would be better served through an article view. TRAFIKKONTROLL resolves to TRFK in double metaphone, while DYRVIK resolves to DRVK. T and D are considered similar, as are V and F, and voilà, you’ve got yourself a match in the search result, but not a visual one (or a semantic one, as the words have very different meanings).

To solve this, I decided to write a custom filter plugin which we could tune to names that are in use in Norway. I’ll leave the reasoning behind the rules, and hopefully the complete filter function we’re applying, for another post.

First you need a factory that’s able to produce filters when Solr asks for them:

NorwegianNameFilterFactory.java:

    package no.derdubor.solr.analysis;

    import java.util.Map;

    import org.apache.solr.analysis.BaseTokenFilterFactory;
    import org.apache.lucene.analysis.TokenStream;

    public class NorwegianNameFilterFactory extends BaseTokenFilterFactory
    {
        /* the attributes set on the <filter> element in schema.xml */
        Map<String,String> args;

        public Map<String,String> getArgs()
        {
            return args;
        }

        /* called by Solr with the <filter> attributes when the schema is loaded */
        public void init(Map<String,String> args)
        {
            this.args = args;
        }

        /* called by Solr each time it needs a new filter wrapped around a token stream */
        public NorwegianNameFilter create(TokenStream input)
        {
            return new NorwegianNameFilter(input);
        }
    }

To compile this example yourself, put the file in no/derdubor/solr/analysis/ (which matches no.derdubor.solr.analysis in the package statement). With the Solr and Lucene jars on your classpath, the command is

    javac -source 1.6 -target 1.6 no/derdubor/solr/analysis/NorwegianNameFilterFactory.java

The factory will not compile until the filter itself (which is returned from the create method above) exists too:

NorwegianNameFilter.java:

    package no.derdubor.solr.analysis;

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class NorwegianNameFilter extends TokenFilter
    {
        public NorwegianNameFilter(TokenStream input)
        {
            super(input);
        }

        public Token next() throws IOException
        {
            return parseToken(this.input.next());
        }

        public Token next(Token result) throws IOException
        {
            /* pass the reusable token on so the upstream stream can fill it */
            return parseToken(this.input.next(result));
        }

        protected Token parseToken(Token in)
        {
            /* the upstream stream returns null when there are no more tokens */
            if (in == null)
            {
                return null;
            }

            /* do magic stuff with in.termBuffer() here (a char[] which can be manipulated) */
            /* set the changed length of the new term with in.setTermLength(); before returning it */
            return in;
        }
    }
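To give an idea of what the “magic stuff” could look like, here is a minimal sketch of an in-place edit on a char[] term buffer. It is not the rule set we actually use; the class name TermBufferSketch and the rule itself (collapsing repeated letters) are made up for illustration. It has no Lucene dependency, so it can be tried on its own.

```java
/** Hypothetical helper for illustration only, not the actual derdubor name filter. */
final class TermBufferSketch
{
    /**
     * Collapses runs of the same character in buffer[0..length) in place
     * and returns the new logical length of the term.
     */
    static int collapseRepeats(char[] buffer, int length)
    {
        if (length == 0)
        {
            return 0;
        }

        int write = 1; // the first character is always kept
        for (int read = 1; read < length; read++)
        {
            // copy a character only if it differs from the last one kept
            if (buffer[read] != buffer[write - 1])
            {
                buffer[write++] = buffer[read];
            }
        }
        return write;
    }

    public static void main(String[] args)
    {
        char[] term = "trafikkontroll".toCharArray();
        int newLength = collapseRepeats(term, term.length);
        System.out.println(new String(term, 0, newLength)); // prints "trafikontrol"
    }
}
```

Inside parseToken, a helper like this would presumably be called as in.setTermLength(TermBufferSketch.collapseRepeats(in.termBuffer(), in.termLength())), so the token is changed without allocating a new string.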

You should now be able to compile both files:

    javac -source 1.6 -target 1.6 no/derdubor/solr/analysis/*.java

After compiling the plugin, create a jar file which contains it. This will be the “distributable” version of your plugin, and should contain the .class files of your application.

    jar cvf derdubor-solr-norwegiannamefilter.jar no/derdubor/solr/analysis/*.class

The file you just created (derdubor-solr-norwegiannamefilter.jar in the example above) belongs in your Solr home directory. This is where you keep your bin/ and conf/ directories (conf/ contains schema.xml, etc). Create a lib directory in the Solr home directory and copy the jar file into it; lib/ is where your custom libraries will live.

Restart Solr and check that everything still works as it should. If everything still seems normal, it’s time to enable your filter. In one of your <analyzer> chains, you can simply append a <filter> element to insert your own filter into the chain:

    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="no.derdubor.solr.analysis.NorwegianNameFilterFactory" />
    </analyzer>

Restart Solr again, and if everything still works as it should, you’re all set! Time to index some new data and commit it (remember that you’ll need to reindex for things to work as you expect, since data that is already indexed is not reprocessed when you edit your configuration files). Do a few searches through the admin interface to see that everything works as it should. I’ve used the “debug” option to .. well, debug .. my plugin while developing it. A very neat trick is to see what terms your filter expands to: if you set type="query" on the analyzer element, it will be applied to all queries against that field, and the resulting terms show up in the first debug section of the result (you’ll have to scroll down to the end to see this). If you need to debug things to a greater extent, you can attach a debugger or simply use the Good Old Proven Way of println! (the output will end up in catalina.out in logs/ in your Tomcat directory). Good luck!

Potential Problems and How To Solve Them

  • If you get an error about incompatible class versions, check that the JVM on your Solr search server (java -version) is the same version as, or newer than, the one you compile with on your own development machine. If it is older, pass -source 1.5 -target 1.5 to javac to get 1.5 compatible class files instead of 1.6.
  • If you get an error about missing config or something similar, or that Solr is unable to find the method it’s searching for (generally triggered by a ReflectionException), remember to define your classes public! public class NorwegianNameFilter is your friend! It took at least half an hour until I realized what this simple issue was…

Any comments and followups are of course welcome!

String Metrics

April 25th, 2008

«Estimate» stumbled across this awesome page with different string metric algorithms earlier today. Here you’ll find descriptions and implementations of Hamming distance, Levenshtein distance, Needleman-Wunsch distance, Smith-Waterman distance and dozens of others. Invaluable if you’re ever going to need to compare strings against each other and need some way to measure their similarity.
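As a taste of what such a metric looks like, here is a minimal sketch of Levenshtein distance (the number of single-character insertions, deletions and substitutions needed to turn one string into another). This is a textbook two-row version written for this post, not code taken from the linked page, and the class name LevenshteinSketch is made up.

```java
/** Illustrative sketch of Levenshtein distance using two rolling rows. */
final class LevenshteinSketch
{
    static int distance(String a, String b)
    {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];

        // turning the empty prefix of a into the first j characters of b takes j insertions
        for (int j = 0; j <= b.length(); j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= a.length(); i++)
        {
            current[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++)
            {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                    Math.min(current[j - 1] + 1, previous[j] + 1), // insertion, deletion
                    previous[j - 1] + cost);                       // substitution or match
            }

            // the row just filled becomes the previous row for the next character of a
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args)
    {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```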

New Times Ahead, Baby!

April 24th, 2008

As the most observant people out there probably have noticed, I’ve given the site a little face lift to bring it into the next century (so bring it on, 2100!!!11). The illustration was done by the very talented Anette Heiberg - Children’s Book Illustrator - who is also the one single person that manages to live together with me. A neat little coincidence there!

Anyways, the new design is dark, but I’ve decided to use the inverse header for each post as it makes scanning the page to find the items _very_ effective. I like it, so it stays.

Happy Happy Joy Joy!

Christer and His Quest For More Zend_Form-age

April 24th, 2008

I finally found out why Christer had been so quiet all day: he’s obviously been writing the largest post seen in the history of blogs. His introduction to Translating Zend Form Error Messages is a giant of a beast, and gives a thorough introduction to using Zend_Translate together with Zend_Form, with resource files presenting a user interface in several localized versions.

Solr: Using the dismax Query Handler and Still Limit a Specific Field

April 23rd, 2008

While working with the facets for our search result earlier today, I came across the need to limit the search against Solr on one specific field in addition to our regular search string (which we run against several fields with different weights). The situation was something like this:

  • Lastname
  • AggregateSearchField
  • AggregatePhoneticSearchField

We do the searches against the AggregateSearchField and the AggregatePhoneticSearchField, where we weight the exact match higher than the phonetic matches. This ensures that the more specific matches are ranked higher than those that are merely similar. We do this for several different field groupings, but that’s not relevant for this post, so let’s just assume that these are the three fields that matter. We search against two of them, and use Lastname as a facet / navigator field to allow users to get more specific with their search.

However, while users should be allowed to get more specific with their search when selecting one of the facets, it should not change their regular search. And since the dismax handler will search through all the allowed fields for a given value, you cannot just append Lastname:facetValue to the search string and be done with it (dismax does not support fielded searches through the regular query). After a bit of searching through our friends over at Google, I finally stumbled across the solution (which I of course should have seen on the Solr wiki): use the fq parameter. This allows you to submit a “Filter Query” in your request, which will be used to further filter your existing query through another set of queries. This fits in very neatly with keeping your original query and then appending filter queries for each facet limitation that gets set.

A small code example for Solrj. Here filterQueries is a HashMap<String, String> which contains the facets; filterQueries.put("Lastname", "Smith") will add a limitation on the field “Lastname” being “Smith” (you might want to escape any double quotes in the facet values):

    if (filterQueries != null)
    {
        for (String q : filterQueries.keySet())
        {
            String value = filterQueries.get(q);
            query.addFilterQuery(q + ":\"" + value + "\""); // this adds Lastname:"Smith" as a filter query
        }
    }

So now we can just parse the query string for valid facet limitations, and set the fields in the filterQueries HashMap accordingly. As we already have a list of facet fields to include, this is as simple as iterating that list and checking for the parameters in the request variables.
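A rough sketch of that last step, including the escaping mentioned earlier, could look like the following. The class and method names are made up for this example, the request parameters are represented as a plain Map rather than a servlet request, and backslash-escaping quotes inside the phrase is my assumption about what the query parser accepts, so verify it against your Solr version.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; names and the escaping rule are assumptions. */
final class FacetFilterSketch
{
    /** Picks out the request parameters that name one of the known facet fields. */
    static Map<String, String> extractFilters(List<String> facetFields, Map<String, String> requestParams)
    {
        Map<String, String> filterQueries = new LinkedHashMap<String, String>();
        for (String field : facetFields)
        {
            String value = requestParams.get(field);
            if (value != null && value.length() > 0)
            {
                filterQueries.put(field, value);
            }
        }
        return filterQueries;
    }

    /** Builds field:"value", escaping backslashes and double quotes in the value. */
    static String toFilterQuery(String field, String value)
    {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    public static void main(String[] args)
    {
        Map<String, String> params = new HashMap<String, String>();
        params.put("q", "smith");
        params.put("Lastname", "Smith");

        List<String> facetFields = Arrays.asList("Lastname", "City");
        for (Map.Entry<String, String> entry : extractFilters(facetFields, params).entrySet())
        {
            // each of these strings is what would be handed to query.addFilterQuery(...)
            System.out.println(toFilterQuery(entry.getKey(), entry.getValue()));
        }
    }
}
```

Only parameters that match a known facet field become filter queries, so the regular q parameter is left alone and unknown parameters cannot be used to filter on arbitrary fields.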

Many thanks to Mike Klaas, whose reply in the dismax and Plone thread indexed by nabble.com sent me in the right direction.

Handling Large Datasets at Google

April 23rd, 2008

High Scalability has a neat post today highlighting a recent presentation given by Jeff Dean from Google at this year’s Data-Intensive Computing Symposium. The presentation, named “Handling Large Datasets at Google: Current Systems and Future Directions” (video, hosted by Yahoo!) (slides), dips into quite a number of issues and thoughts about what it takes to run something handling petabytes of data.

I’ll leave you with a quite interesting list shown on slide 8 (of 58) under the title “Typical first year for a new cluster”:

  • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  • ~1 network rewiring (rolling ~5% of machines down over 2-day span)
  • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  • ~5 racks go wonky (40-80 machines see 50% packet loss)
  • ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  • ~12 router reloads (takes out DNS and external vips for a couple minutes)
  • ~3 router failures (have to immediately pull traffic for an hour)
  • ~dozens of minor 30-second blips for dns
  • ~1000 individual machine failures
  • ~thousands of hard drive failures

The Importance of The Double-Click Time

April 23rd, 2008

Raymond Chen’s blog “The Old New Thing” is an invaluable source of interesting theories and histories about the inner workings of Windows. If you haven’t read his book “The Old New Thing: Practical Development Throughout the Evolution of Windows” yet, add it to your wishlist now. Although some parts of it can be a bit too much code and internals, the stories and the appendices are simply awesome. Well worth it.

But this post wasn’t supposed to be about that, so I’ll leave you with what I intended to write about instead: the recently posted entry about how several different values are derived from the double-click time setting.

Typesetting on the World Wide Web

April 23rd, 2008

The awesome people over at Smashing Magazine have a neat article up today about 5 principles and ideas for setting type on the web. While I do not agree with the usability concept of some of the examples (in particular, the first and last examples in section 4 are painful to watch), the article is informative and presents quite a few issues and good tips about typography and the web. Keep a bookmark available for the next time you’re sketching up a new site (.. which I’ll have to do with the design around here soon ..).

Memcached Internals

April 22nd, 2008

Ilya Grigorik has posted a very good summary of a talk that Brian Aker and Alan Kasindorf gave about memcached at the MySQL User Conference last week. The article is straight to the point in regards to several key attributes of memcached, and serves up almost 30 direct tips and tidbits about how to use memcached in a more optimal way. Awesome reading, and well worth checking out together with the slides from the memcached talk.

PHP Vikinger Registration Up and Running

April 22nd, 2008

This year’s version of the unconference PHP Vikinger is taking place on the 21st of June in Skien, Norway. Derick has just opened up the registration, which involves using high tech methods such as an e-mail client and writing your name and other relevant information. One thing’s for sure: Eirik, Christer and I are heading out, and hopefully we’ll get a few more friends to join in.. and that includes YOU!