Modifying a Lucene Snowball Stemmer

This post is written for advanced users. If you do not know what SVN (Subversion) is or if you’re not ready to get your hands dirty, there might be something more interesting to read on Wikipedia. As usual. This is an introduction to getting a Lucene development environment running, then a Solr environment, and lastly to creating your own Snowball stemmer. Read on if that seems interesting. The recipe for regenerating the Snowball stemmer (I’ll get back to that…) assumes that you’re running Linux. Please leave a comment if you’ve generated the stemmer class under another operating system.

When indexing data in Lucene (a full-text document search library) and Solr (which uses Lucene), you may provide a stemmer, a piece of code responsible for “normalizing” words to their common form (horses => horse, indexing => index, etc.), to give your users better and more relevant results when they search. The default stemmer in Lucene and Solr uses a library named Snowball, which was created to do just this kind of thing. Snowball uses a small definition language of its own to generate parsers that other applications can embed to provide proper stemming.

By using Snowball, Lucene is able to provide a nice collection of default stemmers for several languages, and these work as they should for most selections. I did however have an issue with the Norwegian stemmer, as it ignores a complete category of words where the base form ends in the same letters as the plural versions of other words. An example:

one: elektriker
several: elektrikere
those: elektrikerene

The base form is “elektriker”, while “elektrikere” and “elektrikerene” are plural versions of the same word (the word means “electrician”, btw).

Let’s compare this to another word, such as “bus”:

one: buss
several: busser
those: bussene

Here the base form is “buss”, while the other two are plural. Let’s apply the same rules to all six words:

buss => buss
busser => buss [strips “er”]
bussene => buss [strips “ene”]

elektrikerene => “elektriker” [strips “ene”]
elektrikere => “elektriker” [strips “e”]

So far everything has gone as planned. We’re able to search for ‘elektrikerene’ and get hits that say ‘elektrikere’, just as planned. All is not perfect, though. We’ve forgotten one word, and evil forces will say that I forgot it on purpose:

elektriker => ?

The problem is that “elektriker” (which is the single form of the word) ends in -er. The rule defined for a word in the class of “buss” says that -er should be stripped (and this is correct for the majority of words). The result then becomes:

elektriker => “elektrik” [strips “er”]
elektrikere => “elektriker” [strips “e”]
elektrikerene => “elektriker” [strips “ene”]

As you can see, there’s a mismatch between the form that the plurals get chopped down to and the stem of the singular word.

My solution, while not perfect in any way, simply adds a few more terms so that we’re able to strip all these words down to the same form:

elektriker => “elektrik” [strips “er”]
elektrikere => “elektrik” [strips “ere”]
elektrikerene => “elektrik” [strips “erene”]

I decided to go this route as it’s a lot easier than building a large list of words where no stemming should be performed. It might give us a few false positives, but the most important part is that it provides the same results for the singular and plural versions of the same word. When the search results differ for such basic items, the user gets a real “WTF” moment, especially when the two plural versions of the word are considered identical.
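The mismatch and the fix can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the suffix tuples are a tiny subset of the real Norwegian definition, the helper names are made up for this post, and the region check that Snowball performs before stripping anything is ignored.

```python
# Simplified illustration of suffix stripping. This is NOT the Snowball
# algorithm: only a handful of suffixes are listed and the R1 region
# check is ignored.

ORIGINAL_SUFFIXES = ("e", "ene", "er")
EXTENDED_SUFFIXES = ORIGINAL_SUFFIXES + ("ere", "erene")

WORDS = ["elektriker", "elektrikere", "elektrikerene",
         "buss", "busser", "bussene"]


def strip_suffix(word, suffixes):
    """Remove the longest suffix in `suffixes` that `word` ends with."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def stems(words, suffixes):
    """Map every word to its stem under the given suffix list."""
    return {word: strip_suffix(word, suffixes) for word in words}


before = stems(WORDS, ORIGINAL_SUFFIXES)
after = stems(WORDS, EXTENDED_SUFFIXES)

for word in WORDS:
    print(word, "=>", before[word], "(original)", after[word], "(extended)")
```

With the original list the singular "elektriker" ends up with a different stem than its plurals, while the extended list brings all three forms down to the same stem and leaves the "buss" family untouched.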

To solve this problem we’re going to change the Snowball parser and build a new version of the stemmer that we can use in Lucene and Solr.

Getting Snowball

To generate the Java class that Lucene uses when attempting to stem a phrase (such as the NorwegianStemmer, EnglishStemmer, etc), you’ll need the Snowball distribution. This distribution also includes example stemming algorithms (which have been used to generate the current stemmers in Lucene).

You’ll need to download the application from the snowball download page – in particular the “Snowball, algorithms and libstemmer library” version [direct link].

After extracting the file you’ll have a directory named snowball_code, which contains among other files the snowball binary and a directory named algorithms. The algorithms-directory keeps all the different default stemmers, and this is where you’ll find a good starting point for the changes you’re about to do.

But first, we’ll make sure we have the development version of Lucene installed and ready to go.

Getting Lucene

You can check out the current SVN trunk of Lucene by doing:

svn checkout http://svn.apache.org/repos/asf/lucene/java/trunk lucene/java/trunk

This will give you the bleeding edge version of Lucene available for a bit of toying around. If you decide to build Solr 1.4 from SVN (as we’ll do further down), you do not have to build Lucene 2.9 from SVN – as it already is included pre-built.

If you need to build the complete version of Lucene (and all contribs), you can do that by moving into the Lucene trunk:

cd lucene/java/trunk/
ant dist (this will also create .zip and .tgz distributions)

If you already have Lucene 2.9 (.. or whatever version you’re on when you’re reading this), you can get by with just compiling the snowball contrib to Lucene, from lucene/java/trunk/:

cd contrib/snowball/
ant jar

This will create (if everything works as it should) a file named lucene-snowball-2.9-dev.jar (.. or another version number, depending on your version of Lucene). The file will be located in a sub directory of the build directory on the root of the lucene checkout (.. and the path will be shown after you’ve run ant jar): lucene/java/trunk/build/contrib/snowball/.

If you got the lucene-snowball-2.9-dev.jar file compiled, things are looking good! Let’s move on to getting the bleeding edge version of Solr up and running (if you have an existing Solr version that you’re using and do not want to upgrade, skip the following steps .. but be sure to know what you’re doing .. which coincidentally you also should be knowing if you’re building stuff from SVN as we are. Oh the joy!).

Getting Solr

Getting and building Solr from SVN is very straightforward. First, check it out from Subversion:

svn co http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr/trunk/

And then simply build the war file for your favourite container:

cd solr/trunk/
ant dist

Voilà – you should now have an apache-solr-1.4-dev.war (or something similar) in the build/ directory. You can test that this works by replacing your regular Solr installation (.. make a backup first ..) and restarting your application server.

Editing the stemmer definition

After extracting the snowball distribution, you’re left with a snowball_code directory, which contains algorithms and then norwegian (in addition to several other stemmer languages). My example here expands the definition used in the norwegian stemmer, but the examples will work with all the included stemmers.

Open up one of the files (I chose the iso-8859-1 version, but I might have to adjust this to work for UTF-8/16 later. I’ll try to post an update in regards to that) and take a look around. The snowball language is interesting, and you can find more information about it at the Snowball site.

I’ll not include a complete dump of the stemming definition here, but the interesting part (for what we’re attempting to do) is the main_suffix function:

define main_suffix as (
    setlimit tomark p1 for ([substring])
    among(
        'a' 'e' 'ede' 'ande' 'ende' 'ane' 'ene' 'hetene' 'en' 'heten' 'ar'          
        'er' 'heter' 'as' 'es' 'edes' 'endes' 'enes' 'hetenes' 'ens'
        'hetens' 'ers' 'ets' 'et' 'het' 'ast' 
            (delete)
        's'
            (s_ending or ('k' non-v) delete)
        'erte' 'ert'
            (<-'er')
    )
)

This simply means that any of the suffixes in the first three lines will be deleted from the end of a word (given by the (delete) command after the definitions). The problem with our example above is that none of the lines will capture an "ere" or "erene" ending, which is what we need to actually solve the problem.

We simply add them to the list of defined endings:

    among(
        ... 'hetene' 'en' 'heten' 'ar' 'ere' 'erene' 'eren'
        ...
        ...
            (delete)

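I made sure to add the definitions before the shorter versions (such as 'er'), but I don't think that is actually required: as far as I can tell from the Snowball documentation, among always picks the longest matching suffix, regardless of the order in which the suffixes are listed.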
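To see why the order should not matter, here is a small Python sketch comparing a naive "first listed suffix wins" search with a "longest suffix wins" search. It assumes that among really behaves like the longest-match variant, which is how I read the Snowball documentation; both function names are made up for the illustration.

```python
def first_match(word, suffixes):
    """Strip the first listed suffix that matches (order-dependent)."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def longest_match(word, suffixes):
    """Strip the longest matching suffix (order-independent)."""
    matches = [s for s in suffixes if word.endswith(s)]
    if not matches:
        return word
    return word[: -len(max(matches, key=len))]


short_first = ["e", "ene", "erene"]
long_first = ["erene", "ene", "e"]
word = "elektrikerene"

# The naive search gives different stems depending on the list order ...
print(first_match(word, short_first), first_match(word, long_first))
# ... while the longest-match search gives the same stem for both orders.
print(longest_match(word, short_first), longest_match(word, long_first))
```

If the generated stemmer behaved like first_match, putting 'e' before 'erene' would break the new endings, so listing the longer suffixes first is a harmless precaution either way.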

Save the file under a new file name so you still have the old stemmers available.

Compiling a New Version of the Snowball Stemmer

After editing and saving your stemmer, it's now time to generate the Java class that Lucene will use to generate the base forms of the words. After extracting the snowball archive, you should have a binary file named snowball in the snowball_code directory. If you simply run this file with snowball_code as your current working directory:

./snowball

You'll get a list of options that Snowball can accept when generating the stemmer class. We're only going to use three of them:

-j[ava] Tell Snowball that we want to generate a Java class
-n[ame] Tell Snowball the name of the class we want generated
-o <filename> The filename of the output file. No extension.

So to compile our NorwegianExStemmer from our modified file, we run:

./snowball algorithms/norwegian/stem2_ISO_8859_1.sbl -j -n NorwegianExStemmer -o NorwegianExStemmer

(pardon the excellent file name stem2...). This will give you one new file in the current working directory: NorwegianExStemmer.java! We've actually built a stemming class! Woohoo! (You may do a few dance moves here. I'll wait.)

We're now going to insert the new class into the Lucene contrib .jar-file.

Rebuild the Lucene JAR Library

Copy the new class file into the version of Lucene you checked out from SVN (the contrib/ directory lives in the root of that checkout):

cp NorwegianExStemmer.java /contrib/snowball/src/java/org/tartarus/snowball/ext

Then we simply have to rebuild the .jar file containing all the stemmers:

cd /contrib/snowball/
ant jar

This will create lucene-snowball-2.9-dev.jar in <lucenetrunk>/build/contrib/. You now have a library containing your stemmer (and all the other default stemmers from Lucene)!

The last part is simply getting the updated stemmer library into Solr, and this will be a simple copy and rebuild:

Inserting the new Lucene Library Into Solr

From the build/contrib directory in Lucene, copy the jar file into the lib/ directory of Solr:

cp lucene-snowball-2.9-dev.jar lib/

Be sure to overwrite any existing files (.. and if you have another version of Lucene in Solr, do a complete rebuild and replace all the Lucene related files in Solr). Rebuild Solr:

cd 
ant dist

Copy the new apache-solr-1.4-dev.war (check the correct name in the directory yourself) from the build/ directory in Solr to your application server's home as solr.war (.. if you use another name, use that). This is webapps/ if you're using Tomcat. Remember to back up the old .war file, just to be sure you can restore everything if you've borked something.

Add Your New Stemmer In schema.xml

After compiling and packaging the stemmer, it's time to tell Solr that it should use the newly created stemmer. Remember that a stemmer works both when indexing and querying, so we're going to need to reindex our collection after implementing a new stemmer.

The usual place to add the stemmer is the definition of your text fields under the <analyzer>-sections for index and query (remember to change it in BOTH places!!):
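A minimal sketch of what such a field type might look like, assuming Solr's stock solr.SnowballPorterFilterFactory and its language attribute. The field type name, the tokenizer and the lowercase filter are assumptions for the sake of the example; keep whatever your own schema.xml already uses and only change the Snowball filter line.

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- the stemmer used when indexing -->
    <filter class="solr.SnowballPorterFilterFactory" language="NorwegianEx"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- the same stemmer must be used when querying -->
    <filter class="solr.SnowballPorterFilterFactory" language="NorwegianEx"/>
  </analyzer>
</fieldType>
```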


Change NorwegianEx into the name of your class (without the Stemmer part; Lucene adds that for you automagically). Do this in both locations (or more, if you have custom datatypes with their own indexing or query steps).

Restart Application Server and Reindex!

If you're using Tomcat as your application server this might simply be (depending on your setup and distribution):

cd /path/to/tomcat/bin
./shutdown.sh
./startup.sh

Please consult the documentation for your application server for information about how to do a proper restart.

After you've restarted the application server, you're going to need to reindex your collection before everything works as planned. You can however already check that your stemmer works as planned at this stage. Log into the Solr admin interface, select the extended / advanced query view, enter your query (which should now be stemmed in another way than before), check the "debug" box and submit your search. The resulting XML document will show you the result of your query in the parsedquery element.
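If you'd rather script the check than click through the admin interface, something along these lines might do. It's a sketch under assumptions: the host, port and path are guesses at a default Solr setup, the helper names are made up, and the sample response is a hypothetical, heavily trimmed document that only exists to show where the parsedquery element lives.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: Solr answers on localhost:8983 under /solr; adjust to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def debug_query_url(query, base_url=SOLR_SELECT_URL):
    """Build a select URL that asks Solr to include debug output."""
    return base_url + "?" + urlencode({"q": query, "debugQuery": "on"})


def parsed_query(response_xml):
    """Return the text of the parsedquery element, or None if it is absent."""
    root = ET.fromstring(response_xml)
    for element in root.iter("str"):
        if element.get("name") == "parsedquery":
            return element.text
    return None


# Hypothetical, trimmed response used only to exercise parsed_query().
SAMPLE_RESPONSE = """<response>
  <lst name="debug">
    <str name="rawquerystring">elektrikere</str>
    <str name="parsedquery">text:elektrik</str>
  </lst>
</response>"""

print(debug_query_url("elektrikere"))
print(parsed_query(SAMPLE_RESPONSE))
```

Fetch the URL with your tool of choice and feed the response body to parsed_query(); if the singular and plural forms of a word come back as the same term, the new stemmer is in use.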

Download the Generated Stemmer

If you're just looking for an improved stemmer for Norwegian words (with the very, very simple changes outlined above, and which might give problems when concerned with UTF-8 (.. please leave a comment if that's the case)), you can simply download NorwegianExStemmer.java. Follow the guide above for adding it to your Lucene / Solr installation.

Please leave a comment if something is confusing or if you want free help. Send me an email if you're looking for a consultant.

Removing (dropping) a Foreign Key Constraint in PostgreSQL

Had a need to drop a Foreign Key Constraint in PostgreSQL 8.x today, and this is how you do it:

database=> \d table_name;
           Table "public.table_name"
      Column       |          Type          | Modifiers
-------------------+------------------------+-----------
 id                | integer                |
 field             | character varying(20)  |
 field_description | character varying(150) |
Indexes:
    [..]
Foreign-key constraints:
    "table_name_id_fkey" FOREIGN KEY (id) REFERENCES other_table(id) ON DELETE CASCADE

database=> ALTER TABLE table_name DROP CONSTRAINT "table_name_id_fkey";
ALTER TABLE
database=>

As simple as that. The name of the constraint is shown when describing the table with \d under “Foreign-key constraints”, and you simply do an ALTER statement to drop the constraint.

Translating Drizzle to Norwegian

As Monty asked on his blog yesterday for help with translations of the current strings available in Drizzle, I sat down for a couple of hours yesterday and a couple of hours today to at least attempt to contribute something to the project. As my primary language is Norwegian and I have some experience writing, I decided to tackle the Norwegian (Bokmål, not Nynorsk) translation of Drizzle. I’ve currently finished the 358 available messages, but I’d really appreciate it if someone spent a couple of minutes / hours to read through them and confirm that my assumptions are sane.

The most troubling part when it comes to definitions is the issue of MySQL/Drizzle’s ‘relay log’, which I translated into ‘replikasjonslogg’, which basically means “replication log”. This sounds much better in Norwegian, but then the code suddenly mentioned both a “replication log” and a “relay log”. I tried finding out what the semantic difference in MySQL was, but was unable to grok anything from the MySQL manual or through a Google search. If anyone has any advice here, it’d be very appreciated. I also made a few notes of where there are obvious errors in the original English strings:

Error on close of '%'s (Errcode: %d)
 - Located in mysys/errors.c:28 

Errcode:
Can't read value for symlink '%s' (Error %d)
 - Located in mysys/errors.c:47 
Can't create symlink '%s' pointing at '%s' (Error %d)
 - Located in mysys/errors.c:48
Error on realpath() on '%s' (Error %d)
 - Located in mysys/errors.c:49 

%*s(Defaults to on; use --skip-%s to disable.)
 - Missing space
 - Located in mysys/my_getopt.c:1170 

The event could not be processed no other hanlder error happened
 - hanlder
 -  Located in mysys/my_handler_errors.h:118 

SSL information in the master info file ('%s') are ignored because this MySQL slave was compiled without SSL support.
 - MySQL
 - Located in drizzled/rpl_mi.cc:276 

Slave I/O thread killed while waitnig to reconnect after a failed registration on master
 - waitnig
 - Located in drizzled/slave.cc:90 

Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog'..
 - mysqlbinlog
 - Located in drizzled/slave.cc:1864 

Found wrong key definition in %s; Please do "ALTER TABLE '%s' FORCE " to fix it!
 - misplaced space
 - Located in drizzled/table.cc:1162 

Table '%-.64s' was created with a different version of MySQL and cannot be read
 - MySQL
 - Located in drizzled/table.cc:1818 

I could probably submit a patch for this, but seeing as the source is very much in flux these days, I think I’ll wait until it settles down a bit, unless someone is interested in reviewing and committing an “unimportant” patch at this stage.

BTW: Launchpad worked great for doing translations, so I’m going to look into using gettext and Launchpad for doing translations for pwned.no and my other services in the future.

Rounding Up The Remaining Database Posts

To finally be able to close my now-ready-to-be-archived Firefox window, I’m rounding up the three other items I was going to post about in one single batch here:

Ulf Wendel has a post up about PDO_MYSQLND: The new features of PDO_MYSQL in PHP 5.3. Besides being yet another introduction to how mysqlnd differs from the regular libmysqlclient, Ulf writes in detail about how mysqlnd brings other speedups to PDO in general, by allowing the drivers to return zvals directly. This allows the driver to return data without requiring an explicit copy by the overlying architecture. Interesting stuff and well worth a read for anyone, regardless of whether you actually know what a zval is.

Nicklas Westerlund has a post about MySQL Back to Basics: Analyze, Check, Optimize, and Repair on the Pythian blog, featuring an overview of the useful – and abused – methods of rescuing and keeping your data intact. Do regular and good backups. It’s as easy as that. This might however help when you’re in a hurry or need to fix a corruption that has occurred. Read it.

The last item on today’s list is Sphinx – a free open-source full-text search engine. Sphinx uses its own indexing and retrieval system, while Solr is built on top of Lucene. Haven’t had much time to play with it yet, but it’s worth checking out. A native PHP module has also popped up (and that’s where I read about it just now), so if you need a fast and native PHP interface to a full search engine without blowing the big bucks, this may be what you’re looking for.

The Day of Four Letter Abbreviations: BASE vs ACID

Any person that has spent some time with databases will have heard the acronym ACID at one time or another. ACID is what transactions aim to achieve, so that you’re able to define a set of operations that are connected to each other. ACID is a definition of what these transactions should be able to provide to the user of the RDBMS.

Via xarpb I got hold of an article from ACM Queue, titled BASE: An ACID Alternative:

If ACID provides the consistency choice for partitioned databases, then how do you achieve availability instead? One answer is BASE (basically available, soft state, eventually consistent).

BASE is diametrically opposed to ACID. Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux. Although this sounds impossible to cope with, in reality it is quite manageable and leads to levels of scalability that cannot be obtained with ACID.

Proves to be a very interesting read, and something that’s very much in line with how we’re already partitioning large databases, where the application tier is responsible for a lot of the distributed “transactions”.

Drizzle – Making MySQL Leaner and Meaner

As noted, I recently went for a total of five days of vacation. In the meantime, the intarwebs of blogospheres and the seas of web 2.0 exploded with posts and discussion over Drizzle. Drizzle is a fork of MySQL with the intention of making things more suitable for cloud computing and the regular web use cases. I’m actually quite intrigued by this, and I really look forward to getting some more time to read up on the issues at hand. It will be interesting to see how things compare to CouchDB, Amazon S3, Solr and other services that take a different road than regular relational databases. Interesting. Anyways, here are the posts I’d suggest checking out to get better and more usable information about Drizzle:

I’ll take it for a testdrive as soon as the first public version becomes available, and I’m looking forward to it. Might be fun!