Writing a Solr Analysis Filter Plugin

Update: If you’re writing a plugin for a Solr version after 1.4.1 or for Lucene 3.0+, be sure to read Updating a Solr Analysis Plugin to Lucene 4.0 as well. A few of the method calls used below have changed in the new API.

While working on improving the phonetic search we currently run at derdubor, I started writing a plugin for Solr to return better results when searching for Norwegian names. So far we’ve been using the standard phonetic filter from Solr 1.2, with the double metaphone encoder turning each regular token into a phonetic value. The trouble is that a double metaphone value is just four simple letters, which means that a search word such as ‘trafikkontroll’ gets the same meaning as ‘Dyrvik’. The latter is a name, while the former is a regular search string that would be better served through an article view. TRAFIKKONTROLL resolves to TRFK in double metaphone, while DYRVIK resolves to DRVK. T and D are considered similar, as are V and F, and voilà, you’ve got yourself a match in the search result, but not a visual one (or a semantic one, as the words have very different meanings).
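To make the collision concrete, here is a small sketch of the effect described above. It is not the Double Metaphone algorithm; it only takes the two codes quoted in the text and folds the letter pairs described as similar (T/D and V/F) onto one letter each. The class and method names are made up for this illustration.

```java
// Hypothetical illustration only: this is NOT Double Metaphone. It folds the
// consonant pairs the text calls "similar" onto a single representative letter.
final class PhoneticCollisionDemo {

    // Assumed equivalences, taken from the description in the text: D~T and V~F.
    static char foldChar(char c) {
        switch (c) {
            case 'D': return 'T';
            case 'V': return 'F';
            default:  return c;
        }
    }

    // Applies the folding to every character of a phonetic code.
    static String fold(String code) {
        StringBuilder out = new StringBuilder(code.length());
        for (int i = 0; i < code.length(); i++) {
            out.append(foldChar(code.charAt(i)));
        }
        return out.toString();
    }

    // Two codes "match" when they are identical after folding.
    static boolean collides(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        // Codes quoted in the text for TRAFIKKONTROLL and DYRVIK.
        System.out.println(fold("TRFK") + " vs " + fold("DRVK"));
        System.out.println("collides: " + collides("TRFK", "DRVK"));
    }
}
```

With only four letters to compare and two of them folded together, a long compound word and a short surname end up indistinguishable, which is the problem the custom filter is meant to address.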

To solve this, I decided to write a custom filter plugin that we could tune to the names in use in Norway. I’ll leave the reasoning behind the rules, and hopefully the complete filter function we’re applying, for another post.

First you need a factory that’s able to produce filters when Solr asks for them:

NorwegianNameFilterFactory.java:

    package no.derdubor.solr.analysis;

    import java.util.Map;

    import org.apache.solr.analysis.BaseTokenFilterFactory;
    import org.apache.lucene.analysis.TokenStream;

    public class NorwegianNameFilterFactory extends BaseTokenFilterFactory
    {
        /* the attributes given on the <filter> element in schema.xml */
        Map<String,String> args;

        public Map<String,String> getArgs()
        {
            return args;
        }

        public void init(Map<String,String> args)
        {
            this.args = args;
        }

        /* called by Solr whenever it needs a new filter instance */
        public NorwegianNameFilter create(TokenStream input)
        {
            return new NorwegianNameFilter(input);
        }
    }

To compile this example yourself, put the file in no/derdubor/solr/analysis/ (which matches the no.derdubor.solr.analysis package statement), make sure apache-solr-core.jar and lucene-core.jar are in your classpath, and run

    javac -source 1.6 -target 1.6 no/derdubor/solr/analysis/NorwegianNameFilterFactory.java

You’ll of course also need the filter itself (which is returned from the create method above):

    package no.derdubor.solr.analysis;

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class NorwegianNameFilter extends TokenFilter
    {
        public NorwegianNameFilter(TokenStream input)
        {
            super(input);
        }

        public Token next() throws IOException
        {
            return parseToken(this.input.next());
        }

        public Token next(Token result) throws IOException
        {
            /* pass the reusable token on to the wrapped stream */
            return parseToken(this.input.next(result));
        }

        protected Token parseToken(Token in)
        {
            /* the wrapped stream returns null when there are no more tokens */
            if (in == null)
            {
                return null;
            }

            /* do magic stuff with in.termBuffer() here (a char[] which can be manipulated) */
            /* set the changed length of the new term with in.setTermLength(); before returning it */
            return in;
        }
    }
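The “magic stuff” in parseToken comes down to editing a char[] in place and reporting the new length. The sketch below shows that pattern in plain Java without any Lucene classes. The rule it applies (collapsing doubled letters) is only a stand-in chosen for the example, not the rule our filter uses, and the class and method names are hypothetical. Inside the filter, the buffer would come from in.termBuffer(), the current length presumably from in.termLength(), and the returned value would be passed to in.setTermLength().

```java
// Hypothetical sketch of in-place term rewriting; no Lucene dependency.
final class TermBufferSketch {

    // Collapses runs of the same character in buffer[0..length) in place and
    // returns the new logical length. Characters beyond that length are stale
    // and must be ignored by the caller.
    static int collapseDoubles(char[] buffer, int length) {
        if (length == 0) {
            return 0;
        }
        int write = 1; // next position to write a kept character to
        for (int read = 1; read < length; read++) {
            if (buffer[read] != buffer[write - 1]) {
                buffer[write++] = buffer[read];
            }
        }
        return write;
    }

    public static void main(String[] args) {
        char[] buffer = "trafikkontroll".toCharArray();
        int newLength = collapseDoubles(buffer, buffer.length);
        // Only the first newLength characters form the rewritten term.
        System.out.println(new String(buffer, 0, newLength));
    }
}
```

Working on the existing buffer avoids allocating a new string per token, which is the reason the Token API exposes the char[] and a separate length in the first place.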

You should now be able to compile both files:

    javac -source 1.6 -target 1.6 no/derdubor/solr/analysis/*.java

After compiling the plugin, create a jar file that contains it. This will be the “distributable” version of your plugin, and should contain the .class files of your application.

    jar cvf derdubor-solr-norwegiannamefilter.jar no/derdubor/solr/analysis/*.class

The file you just created (derdubor-solr-norwegiannamefilter.jar in the example above) belongs in a lib/ directory inside your Solr home directory. The Solr home directory is where you keep your bin/ and conf/ directories (conf/ contains schema.xml, etc.). Create lib/ there if it doesn’t already exist, and copy the jar file into it. This is where your custom libraries will live.

Restart Solr and check that everything still works as it should. If everything still seems normal, it’s time to enable your filter. In one of the <analyzer> chains in your schema.xml, you can simply append a <filter> element to insert your own filter into the chain:

    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="no.derdubor.solr.analysis.NorwegianNameFilterFactory" />
    </analyzer>
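Attributes placed on a <filter> element (such as generateWordParts="1" above) are handed to the factory as the args map in init(). If you want your own filter to be configurable the same way, reading a setting from that map could look roughly like the sketch below. The maxCodeLength attribute and the helper name are hypothetical and not part of the plugin shown in this post.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for reading an optional integer setting from the
// Map<String,String> that a factory receives in init().
final class FilterArgsSketch {

    // Returns the named argument as an int, or defaultValue when it is absent or blank.
    static int intArg(Map<String, String> args, String name, int defaultValue) {
        String raw = args.get(name);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] unused) {
        // Stands in for the attributes of a <filter ... maxCodeLength="6" /> element.
        Map<String, String> args = new HashMap<String, String>();
        args.put("maxCodeLength", "6");

        System.out.println(intArg(args, "maxCodeLength", 4));
        System.out.println(intArg(new HashMap<String, String>(), "maxCodeLength", 4));
    }
}
```

The factory would then pass the parsed value on to the filter’s constructor in create().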

Restart Solr again, and if everything still works as it should, you’re all set! Time to index some new data and commit it. Remember that you’ll need to reindex your existing data for things to work as you expect, since no stored data is reprocessed when you edit your configuration files.

Do a few searches through the admin interface to see that everything works as it should. I’ve used the “debug” option to .. well, debug .. my plugin while developing it. A very neat trick is to look at which terms your filter expands a query to. These are shown in the first debug section of the result (you’ll have to scroll down to the end to see it). If you set type="query" on the analyzer section, the analyzer will be applied to all queries against that field.

If you need to debug things to a greater extent, you can attach a debugger or simply use the Good Old Proven Way of println! (The output will end up in catalina.out in logs/ in your Tomcat directory.) Good luck!

Potential Problems and How To Solve Them

  • If you get an error about incompatible class versions, check that your Solr search server is actually running the same (or a newer) version of the JVM (java -version) as the one you use on your own development machine. When compiling, use -source 1.5 -target 1.5 to force 1.5-compatible class files instead of 1.6.
  • If you get an error about missing config or something similar, or Solr is unable to find the method it’s looking for (generally triggered by a ReflectionException), remember to declare your classes public! public class NorwegianNameFilter is your friend! It took at least half an hour until I realized what this simple issue was…
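If you’re unsure which version your class files were compiled for, you can read it straight out of the .class file: after the four magic bytes 0xCAFEBABE come a two-byte minor and a two-byte major version, where major 49 means Java 5 and 50 means Java 6. The sketch below does just that; the class name and the command-line handling are only an illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: reads the major version from the header of a .class file.
final class ClassFileVersion {

    // Expects at least the first 8 bytes of a class file:
    // magic (4 bytes), minor version (2 bytes), major version (2 bytes), all big-endian.
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        int magic = ((classBytes[0] & 0xFF) << 24) | ((classBytes[1] & 0xFF) << 16)
                  | ((classBytes[2] & 0xFF) << 8)  |  (classBytes[3] & 0xFF);
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        return ((classBytes[6] & 0xFF) << 8) | (classBytes[7] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassFileVersion path/to/Some.class");
            return;
        }
        // 49 = Java 5, 50 = Java 6
        System.out.println(majorVersion(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

If the number printed is higher than what the JVM on the search server supports, recompile with a lower target.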

Any comments and followups are of course welcome!


3 Responses to “Writing a Solr Analysis Filter Plugin”

  1. MyD Says:

    Thanks for your hints.

    It took me 2h to find out that I forgot to define my class PUBLIC :)

    Cheers, MyD

  2. aladeck Says:

    Thank your for your post

  3. Mats Lindh » Blog Archive » Updating a Solr Analysis Plugin from 1.4.1 (Lucene 2.9) to Solr / Lucene 4.0 (current trunk) Says:

    [...] years and a couple of weeks ago I wrote a post about how to get started writing a simple Solr Analysis Plugin to handle incoming tokens and modifying them in place when an update is [...]
