Updating a Solr Analysis Plugin from 1.4.1 (Lucene 2.9) to Solr / Lucene 4.0 (current trunk)

Three years and a couple of weeks ago I wrote a post about how to get started writing a simple Solr Analysis Plugin that handles incoming tokens and modifies them in place when an update is requested.

Since then the whole version number structure of Solr has changed (and is now in sync with the underlying Lucene version), and not surprisingly, the current API has also been updated. This means that a few small changes are required to get your analysis plugins running on the current trunk of Lucene and Solr.

The main change is that the interface previously named TermAttribute is now named CharTermAttribute, which means that any imports will have to change:

- import org.apache.lucene.analysis.tokenattributes.TermAttribute; 
+ import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; 

Any declarations of TermAttributes will need to be CharTermAttributes instead:

- private TermAttribute termAtt; 
+ private CharTermAttribute termAtt; 
  public NorwegianNameFilter(TokenStream input) 
  { 
      super(input); 
-     termAtt = (TermAttribute) addAttribute(TermAttribute.class); 
+     termAtt = input.getAttribute(CharTermAttribute.class); 
  } 

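We now fetch the attribute from the current TokenStream with getAttribute(). The old addAttribute() call has not gone away: it is generic now, so the cast is no longer needed, and it is still what Lucene's own filters use. The difference to be aware of is that, in the 4.0 API, getAttribute() throws an IllegalArgumentException if the stream does not already contain the attribute, while addAttribute() creates it when it is missing. We also change any references to TermAttribute.class to CharTermAttribute.class.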

The actual TermAttribute interface has also changed, meaning we’ll have to change a few of the old method calls:

- termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength())); 
+ termAtt.setLength(this.parseBuffer(termAtt.buffer(), termAtt.length())); 

.setTermLength() => .setLength()
.termBuffer() => .buffer()
.termLength() => .length()

The methods behave in the same manner as in the previous API. .buffer() returns a char array (char[]) holding the current term, which you can modify in place. .length() returns the number of characters in the buffer that are actually in use (the buffer itself can be larger), and .setLength() sets the new length (if you’re collapsing characters).
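To make the buffer / length contract a bit more concrete, here is a minimal sketch of what a parseBuffer() implementation could look like. The rule it applies (rewriting a double "a" as "å") is purely hypothetical and not the logic of the original filter, and the class name BufferCollapseSketch is made up for the example. It uses no Lucene classes, so it can be run on its own:

```java
// Hypothetical stand-in for parseBuffer(); the real filter's rules are not shown in this post.
final class BufferCollapseSketch
{
    // Rewrites "aa" / "Aa" as a single a-with-ring in place and returns the new used length.
    static int parseBuffer(char[] buffer, int bufferLength)
    {
        int write = 0; // next position to write to; never ahead of the read position

        for (int read = 0; read < bufferLength; read++)
        {
            char c = buffer[read];
            boolean doubleA = (c == 'a' || c == 'A')
                && read + 1 < bufferLength
                && (buffer[read + 1] == 'a' || buffer[read + 1] == 'A');

            if (doubleA)
            {
                // \u00e5 is lowercase a-with-ring, \u00c5 is the uppercase variant
                buffer[write++] = (c == 'a') ? '\u00e5' : '\u00c5';
                read++; // skip the second 'a' of the pair
            }
            else
            {
                buffer[write++] = c;
            }
        }

        // Only buffer[0..write) is valid now; anything after it is stale.
        return write;
    }

    public static void main(String[] args)
    {
        // The buffer is deliberately larger than the term, as it can be in Lucene.
        char[] buffer = new char[16];
        "Haakon".getChars(0, 6, buffer, 0);

        int length = parseBuffer(buffer, 6);
        System.out.println(new String(buffer, 0, length));
    }
}
```

Since the write position never overtakes the read position, the same array can be used for both input and output. The returned value is what you would hand to setLength(), so the stale characters left at the end of the buffer are ignored.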

The new implementation of our analysis filter skeleton:

package no.derdubor.solr.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NorwegianNameFilter extends TokenFilter
{
    private CharTermAttribute termAtt;

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
        termAtt = input.getAttribute(CharTermAttribute.class);
    }

    // Lucene asserts that incrementToken() (or the whole class) is final
    @Override
    public final boolean incrementToken() throws IOException
    {
        if (this.input.incrementToken())
        {
            termAtt.setLength(this.parseBuffer(termAtt.buffer(), termAtt.length()));
            return true;
        }
        
        return false;
    }
    
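    protected int parseBuffer(char[] buffer, int bufferLength)
    {
        // Modify the first bufferLength characters of buffer in place here
        // and return the new length. This placeholder leaves the term unchanged.
        return bufferLength;
    }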
}
