Porting SOLR Token Filter from Lucene 2.4.x to Lucene 2.9.x

I had trouble getting our current token filter to work after recompiling against the nightly builds of SOLR. The problem seemed to stem from the recent upgrade to Lucene 2.9.0 (not released yet, but SOLR nightly is bleeding edge!). There’s functionality added for backwards compatibility, and while that might have worked, things didn’t quite come together as they should somewhere along the way. So I decided to port our filter over to the new model, where incrementToken() is the New Way ™ of doing stuff. Helped by the current lowercase filter in the SVN trunk of Lucene, I made it all the way through.

Our old code:

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
    }

    public Token next() throws IOException
    {
        return parseToken(this.input.next());
    }
 
    public Token next(Token result) throws IOException
    {
        return parseToken(this.input.next());
    }

Compiling this with Lucene 2.9.0 gave me a new warning:

    Note: .. uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.

To the internet mobile!

Turns out next() and next(Token) have been deprecated in the new TokenStream API, and the New True Way is to use the incrementToken() method instead.

Our new code:

    private TermAttribute termAtt;

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException
    {
        if (this.input.incrementToken())
        {
            termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength()));
            return true;
        }
        
        return false;
    }

A few gotchas along the way: incrementToken() needs to be called on the input token stream, not on the filter itself (super.incrementToken() will give you a stack overflow). This call moves the token stream one step forward. We also decided to move the buffer handling into the parse function, so remember to pass along the length of the “live” part of the buffer (the buffer may be larger, but only the content up to termLength() is valid).
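
To make the delegation pattern a bit more concrete, here is a small self-contained sketch. Nothing in it is a Lucene class: Term, Stream, WordSource and LowerCasingFilter are hypothetical stand-ins I made up to mimic a shared term buffer and a filter wrapping its input, so treat it as an illustration of the idea rather than how Lucene is implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class FilterChainSketch {
    /** Stand-in for a shared term attribute: a buffer plus the length of its valid prefix. */
    static class Term {
        char[] buffer = new char[32];
        int length = 0;
    }

    /** Stand-in for a token stream: moves to the next token, or returns false at the end. */
    interface Stream {
        boolean incrementToken();
    }

    /** Source that copies each word (assumed to fit in the buffer) into the shared term. */
    static class WordSource implements Stream {
        private final Term term;
        private final String[] words;
        private int next = 0;

        WordSource(Term term, String... words) {
            this.term = term;
            this.words = words;
        }

        public boolean incrementToken() {
            if (next >= words.length) {
                return false;
            }
            String word = words[next++];
            // Old characters past word.length() stay in the buffer; only the length changes.
            word.getChars(0, word.length(), term.buffer, 0);
            term.length = word.length();
            return true;
        }
    }

    /** Filter that advances its input and then lowercases the live part of the buffer. */
    static class LowerCasingFilter implements Stream {
        private final Stream input;
        private final Term term;

        LowerCasingFilter(Stream input, Term term) {
            this.input = input;
            this.term = term;
        }

        public boolean incrementToken() {
            // Advance the wrapped stream, not this filter.
            if (!input.incrementToken()) {
                return false;
            }
            for (int i = 0; i < term.length; i++) {
                term.buffer[i] = Character.toLowerCase(term.buffer[i]);
            }
            return true;
        }
    }

    /** Runs the chain and collects the live part of the buffer for every token. */
    public static List<String> collect(String... words) {
        Term term = new Term();
        Stream chain = new LowerCasingFilter(new WordSource(term, words), term);
        List<String> out = new ArrayList<>();
        while (chain.incrementToken()) {
            out.add(new String(term.buffer, 0, term.length));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect("Kristiansen", "Ola")); // prints [kristiansen, ola]
    }
}
```

After the second token the buffer still contains leftover characters from the longer first token, which is why the consumer only reads up to the stored length.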

The return value from our parseBuffer function is the actual amount of usable data in the buffer after we’ve had our way with it. The concept is to modify the buffer in place, so that we avoid allocating or deallocating memory.
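
Our actual parseBuffer isn’t shown here, so the following is only a sketch of the in-place idea under made-up rules: it assumes the filter should lowercase each character and drop hyphens, which is purely an illustration and not what our name filter really does. The class name ParseBufferSketch is hypothetical too.

```java
public class ParseBufferSketch {
    /**
     * Rewrites the first `length` chars of `buffer` in place and returns how many
     * chars are valid afterwards. The rules (lowercase, drop '-') are assumptions
     * chosen for illustration only.
     */
    public static int parseBuffer(char[] buffer, int length) {
        int write = 0;
        for (int read = 0; read < length; read++) {
            char c = buffer[read];
            if (c == '-') {
                continue; // skip the character, so the output shrinks by one
            }
            // write never overtakes read, so no unread input is overwritten
            buffer[write++] = Character.toLowerCase(c);
        }
        return write;
    }

    public static void main(String[] args) {
        char[] buffer = new char[16];                 // larger than the term itself
        "Anne-Marie".getChars(0, 10, buffer, 0);      // live part is the first 10 chars
        int newLength = parseBuffer(buffer, 10);
        System.out.println(new String(buffer, 0, newLength)); // prints annemarie
    }
}
```

Since the write position never passes the read position, the same array can serve as both input and output, and the returned length is what you would hand to setTermLength().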

Hopefully this will help other people with the same problem!

Fatal error: Undefined class constant ‘ATTR_DEFAULT_FETCH_MODE’

This is one of the common error messages that seem to appear after installing PHP, in particular under Ubuntu or Debian (where I experienced it). The reason is that the PDO version you’ve just installed is too old for the constant to exist, usually because you were naive enough to install the extension from PECL instead of using the default supplied by Ubuntu. If you did the same as me:

    apt-get install 
    pecl install pdo
    pecl install pdo_mysql

    /etc/init.d/apache2 restart

then everything seems to work, except for that missing constant. What the fsck?!

The reason is that the PECL version of PDO is no longer maintained (I’d suggest automagically pushing the newest version to PECL too, just so pecl install and pecl upgrade work as expected). When you did pecl install pdo, you overwrote the more recent version provided by the default php5-common package (under Ubuntu, at least) with the outdated one.

To solve the problem, reinstall the packaged versions (if you also did pecl install pdo_mysql, you’ll have to replace that one too):

    apt-get install --reinstall php5-common php5-mysql php5-mysqli

Restart. Relive. Re.. eh .. yeah. Rerere.

If you’re building PHP from source instead, you’ll need to pass these options to ./configure (plus the flag for any other PDO driver you need):

    --enable-pdo --with-pdo-mysql --with-pdo-pgsql