SEVERE: Error in xpath:java.lang.RuntimeException: solrconfig.xml missing luceneMatchVersion

One of the things that changed from Solr 1.4.1 to 1.5+ was the introduction of a parameter that tells Solr / Lucene which compatibility version its index files should be created with and read as.

Solr now refuses to start if you do not provide this setting (if you’re upgrading a previous installation from 1.4.1 or earlier). The fix isn’t entirely straightforward, and you’ll probably have to recreate your index files if you’re just arriving at the scene with Solr / Lucene 3.2 or 4.0. Solr 3.0 (1.5) might be able to upgrade the files from the 2.9 format, but if you’re jumping from Lucene 2.9 to 4.0, the easiest solution seems to be to delete the current index and reindex. (Set up replication, disable replication from the master, query the slave while reindexing the master, and so on, and you’ll have no downtime while doing this!)

You’ll also need to add a parameter to your solrconfig.xml file, inside the <config> section:

<luceneMatchVersion>LUCENE_CURRENT</luceneMatchVersion>

Other valid values are LUCENE_30, LUCENE_31, LUCENE_32 and LUCENE_40. These values represent specific versions of the index structure, while LUCENE_CURRENT resolves to whichever version matches the particular release of Lucene you’re running. The index format will be upgraded automagically between most releases, so you’ll probably be fine using LUCENE_CURRENT. If, however, you are trying to load index files that are more than one version older, you may have to use one of the other values. If you want to avoid any possible surprises when updating your Solr installation, you probably want to set this to one of the versioned values.
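If you do decide to pin the version, a minimal sketch of the entry (placed directly inside <config> in solrconfig.xml) could look like this; the exact value should of course match the index format you are actually running:

  <!-- pin the index format to a specific version instead of tracking LUCENE_CURRENT -->
  <luceneMatchVersion>LUCENE_32</luceneMatchVersion>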

Updating a Solr Analysis Plugin from 1.4.1 (Lucene 2.9) to Solr / Lucene 4.0 (current trunk)

Three years and a couple of weeks ago I wrote a post about how to get started writing a simple Solr Analysis Plugin to handle incoming tokens and modify them in place when an update is requested.

Since then the whole version number structure of Solr has changed (and is now in sync with the underlying Lucene version), and not surprisingly, the current API has also been updated. This means that a few small changes are required to get your analysis plugins running on the current trunk of Lucene and Solr.

The main change is that what was previously named TermAttribute is now named CharTermAttribute, which means that any imports will have to change:

- import org.apache.lucene.analysis.tokenattributes.TermAttribute; 
+ import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; 

Any declarations of TermAttributes will need to be CharTermAttributes instead:

- private TermAttribute termAtt; 
+ private CharTermAttribute termAtt; 
  public NorwegianNameFilter(TokenStream input) 
  { 
      super(input); 
-     termAtt = (TermAttribute) addAttribute(TermAttribute.class); 
+     termAtt = input.getAttribute(CharTermAttribute.class); 
  } 

We now fetch the attribute from the current TokenStream (not sure if the old way I did it has been deprecated, but this seems to be the suggested way now). We also change any references to TermAttribute.class to CharTermAttribute.class.

The actual TermAttribute interface has also changed, meaning we’ll have to change a few of the old method calls:

- termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength())); 
+ termAtt.setLength(this.parseBuffer(termAtt.buffer(), termAtt.length())); 

.setTermLength() => .setLength()
.termBuffer() => .buffer()
.termLength() => .length()

The methods behave in the same manner as in the previous API: .buffer() retrieves a char array (char[]) which is the current buffer of the actual term and which you can modify in place, .length() retrieves the current length of the term (the buffer can be larger than the part actually in use), and .setLength() sets the new length of the term (for example if you’re collapsing characters).

The new implementation of our analysis filter skeleton:

package no.derdubor.solr.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NorwegianNameFilter extends TokenFilter
{
    private CharTermAttribute termAtt;

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
        termAtt = input.getAttribute(CharTermAttribute.class);
    }

    public boolean incrementToken() throws IOException
    {
        if (this.input.incrementToken())
        {
            termAtt.setLength(this.parseBuffer(termAtt.buffer(), termAtt.length()));
            return true;
        }
        
        return false;
    }
    
    protected int parseBuffer(char[] buffer, int bufferLength)
    {
        // The actual token processing goes here: modify the buffer in place
        // and return the number of characters that are still in use.
        return bufferLength;
    }
}
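
parseBuffer() is where the actual work happens. Just to illustrate the buffer contract described above (this is not what NorwegianNameFilter actually does), a version that lowercases the term and collapses runs of whitespace in place could look something like this:

    protected int parseBuffer(char[] buffer, int bufferLength)
    {
        // Illustration only: lowercase the live part of the buffer and collapse
        // runs of whitespace, writing the result back into the same buffer.
        int writePos = 0;

        for (int readPos = 0; readPos < bufferLength; readPos++)
        {
            char current = Character.toLowerCase(buffer[readPos]);

            if (Character.isWhitespace(current))
            {
                // Skip the character if we just wrote a space.
                if (writePos > 0 && buffer[writePos - 1] == ' ')
                {
                    continue;
                }

                current = ' ';
            }

            buffer[writePos++] = current;
        }

        // The return value becomes the new length of the term.
        return writePos;
    }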

SOLR: java.io.FileNotFoundException: no segments* file found

While playing around with one of my development SOLR installations (this time under Windows), I suddenly got a weird error message when feeding data to one of the fresh cores.


SEVERE: java.lang.RuntimeException: java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.SimpleFSDirectory@C:\temp\solr\*\data\index: files:

Taking a look at the contents of the index\ directory, I found that it was in fact empty. That seemed weird, but my initial guess was that Lucene / SOLR would treat this as a new installation and create the files.

Turns out the issue is that it won’t – as long as the index directory exists, Lucene / SOLR goes looking for the segment files.

Thanks to an old post to the solr-dev list by Yonik, the easiest fix is to simply delete the index directory and restart your servlet container (Tomcat in this case).
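On this particular Windows setup that boils down to something along these lines (the path follows the error message above, <corename> is just a placeholder for the actual core directory, and Tomcat should be stopped first):

rem stop Tomcat first, then remove the empty index directory
rmdir /S /Q C:\temp\solr\<corename>\data\index
rem start Tomcat again and Solr will create a fresh index directory with new segment files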

Porting SOLR Token Filter from Lucene 2.4.x to Lucene 2.9.x

I had trouble getting our current token filter to work after recompiling with the nightly builds of SOLR, which seemed to stem from the recently adopted upgrade to 2.9.0 of Lucene (not released yet, but SOLR nightly is bleeding edge!). There’s functionality added for backwards compatibility, and while that might have worked, things didn’t really come together as they should (somewhere or other). So I decided to port our filter over to the new model, where incrementToken() is the New Way ™ of doing stuff. Helped by the current lowercase filter in the SVN trunk of Lucene, I made it all the way through.

Our old code:

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
    }

    public Token next() throws IOException
    {
        return parseToken(this.input.next());
    }
 
    public Token next(Token result) throws IOException
    {
        return parseToken(this.input.next());
    }

Compiling this with Lucene 2.9.0 gave me a new warning:

Note: .. uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

To the internet mobile!

Turns out next() and next(Token) have been deprecated in the new TokenStream implementation, and the New True Way is to use the incrementToken() method instead.

Our new code:

    private TermAttribute termAtt;

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException
    {
        if (this.input.incrementToken())
        {
            termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength()));
            return true;
        }
        
        return false;
    }

A few gotchas along the way: incrementToken() needs to be called on the input token stream, not on the filter itself (super.incrementToken() will give you a stack overflow). This moves the token stream one step forward. We also decided to move the buffer handling into the parseBuffer function, and remember to include the length of the “live” part of the buffer (the buffer will be larger, but only the content up to termLength is valid).

The return value from our parseBuffer function is the actual amount of usable data in the buffer after we’ve had our way with it. The concept is to modify the buffer in place, so that we avoid allocating or deallocating memory.

Hopefully this will help other people with the same problem!

How To Make Solr Go 45% Faster

If you’re still looking for a good reason to spend a few minutes tuning your SOLR caches (documentCache, filterCache and queryResultCache), I’ll give you two numbers:

avgTimePerRequest : 126.148822
avgTimePerRequest : 70.026436 

The first is with the default cache settings, the latter is with a very small change. Yep. That’s a 45% speed increase. So, the interesting question is what I actually changed in the cache configuration – although I should warn you, the answer is very, very, very complicated:

The cache size. The default size (at least for our current 1.3 installation) is to keep 512 elements in the cache. When someone on the solr-user list asked for an introduction to what the different cache statistics meant, I remembered that I hadn’t actually tweaked the settings at all. The SOLR server has been running for a year now, so we have quite a good idea of how it performs and what kind of queries we are seeing. The stats indicated that a lot more cached entries got evicted than I was hoping to see, and this gave us a lower cache hit rate (about 50%).

The simple change was to increase the size of the cache (from 512 to 16384), so that we’re able to keep more documents in memory before evicting them. After running 24 hours with the new setup we’re now seeing cache hit rates of 99%, 68% and 67%. The relevant sections of the solrconfig.xml file look something like this:
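(The size of 16384 is the actual change; the cache class, initialSize and autowarmCount values below are reasonable placeholders rather than the exact ones from our config, so adjust them to your own setup.)

<filterCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>

<queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="1024"/>

<documentCache class="solr.LRUCache" size="16384" initialSize="16384" autowarmCount="0"/>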

The document cache fills about 4 times as fast as the filter cache, so we might have to tweak the settings further to suit our load pattern even better.

So what now?

The next step would be to try the FastLRUCache which is included with Solr 1.4 (currently in SVN and nightlies). If my memory serves me right the changes are mostly related to locking, so I’m not sure if we’ll see any significantly better performance.

We’ll also make further adjustments to the size of each of the caches to better match our usage.

Solr Becoming Slow After a While

This is perhaps the most obvious and “not very helpful” post for quite a few people, but for those who experience this issue, it’ll save the day. While doing a test indexing run of around 6 million documents, things would get really slow the moment I passed 1 million documents in the index. Weird. Optimizing didn’t help, as it died with an exception after a while.

The reason?

Not enough free disk space. Solr was indexing to a different partition than I thought.

Solved everything.

Modifying a Lucene Snowball Stemmer

This post is written for advanced users. If you do not know what SVN (Subversion) is or if you’re not ready to get your hands dirty, there might be something more interesting to read on Wikipedia. As usual. This is an introduction to getting a Lucene development environment running, then a Solr environment, and lastly to creating your own Snowball stemmer. Read on if that seems interesting. The recipe for regenerating the Snowball stemmer (I’ll get back to that…) assumes that you’re running Linux. Please leave a comment if you’ve generated the stemmer class under another operating system.

When indexing data in Lucene (a fulltext document search library) and Solr (which uses Lucene), you may provide a stemmer (a piece of code responsible for “normalizing” words to their common form: horses => horse, indexing => index, etc.) to give your users better and more relevant results when they search. The default stemmer in Lucene and Solr uses a library named Snowball which was created to do just this kind of thing. Snowball uses a small definition language of its own to generate parsers that other applications can embed to provide proper stemming.

By using Snowball, Lucene is able to provide a nice collection of default stemmers for several languages, and these work as they should in most cases. I did, however, have an issue with the Norwegian stemmer, as it mishandles a whole category of words whose base form ends in the same letters as the plural forms of other words. An example:

one: elektriker
several: elektrikere
those: elektrikerene

The base form is “elektriker”, while “elektrikere” and “elektrikerene” are plural versions of the same word (the word means “electrician”, btw).

Let’s compare this to another word, such as “buss” (“bus” in English):

one: buss
several: busser
those: bussene

Here the base form is “buss”, while the two others are plural. Let’s apply the same rules to all six words:

buss => buss
busser => buss [strips “er”]
bussene => buss [strips “ene”]

elektrikerene => “elektriker” [strips “ene”]
elektrikere => “elektriker” [strips “e”]

So far everything has gone as planned: we’re able to search for ‘elektrikerene’ and get hits that say ‘elektrikere’. All is not perfect, though. We’ve forgotten one word, and evil forces will say that I forgot it on purpose:

elektriker => ?

The problem is that “elektriker” (which is the singular form of the word) ends in -er. The rule defined for a word in the class of “buss” says that -er should be stripped (and this is correct for the majority of words). The result then becomes:

elektriker => “elektrik” [strips “er”]
elektrikere => “elektriker” [strips “e”]
elektrikerene => “elektriker” [strips “ene”]

As you can see, there’s a mismatch between the form that the plurals get chopped down to and the singular word.

My solution, while not perfect in any way, simply adds a few more terms so that we’re able to strip all these words down to the same form:

elektriker => “elektrik” [strips “er”]
elektrikere => “elektrik” [strips “ere”]
elektrikerene => “elektrik” [strips “erene”]

I decided to go this route as it’s a lot easier than building a large list of words where no stemming should be performed. It might give us a few false positives, but the most important part is that it provides the same results for the singular and plural versions of the same word. When the search results differ for such basic items, the user gets a real “WTF” moment, especially when the two plural versions of the word are considered identical.

To solve this problem we’re going to change the Snowball parser and build a new version of the stemmer that we can use in Lucene and Solr.

Getting Snowball

To generate the Java class that Lucene uses when attempting to stem a phrase (such as the NorwegianStemmer, EnglishStemmer, etc), you’ll need the Snowball distribution. This distribution also includes example stemming algorithms (which have been used to generate the current stemmers in Lucene).

You’ll need to download the application from the snowball download page – in particular the “Snowball, algorithms and libstemmer library” version [direct link].

After extracting the file you’ll have a directory named snowball_code, which contains among other files the snowball binary and a directory named algorithms. The algorithms-directory keeps all the different default stemmers, and this is where you’ll find a good starting point for the changes you’re about to do.

But first, we’ll make sure we have the development version of Lucene installed and ready to go.

Getting Lucene

You can check out the current SVN trunk of Lucene by doing:

svn checkout http://svn.apache.org/repos/asf/lucene/java/trunk lucene/java/trunk

This will give you the bleeding edge version of Lucene available for a bit of toying around. If you decide to build Solr 1.4 from SVN (as we’ll do further down), you do not have to build Lucene 2.9 from SVN – as it already is included pre-built.

If you need to build the complete version of Lucene (and all contribs), you can do that by moving into the Lucene trunk:

cd lucene/java/trunk/
ant dist (this will also create .zip and .tgz distributions)

If you already have Lucene 2.9 (.. or whatever version you’re on when you’re reading this), you can get by with just compiling the snowball contrib to Lucene, from lucene/java/trunk/:

cd contrib/snowball/
ant jar

This will create (if everything works as it should) a file named lucene-snowball-2.9-dev.jar (.. or another version number, depending on your version of Lucene). The file will be located in a subdirectory of the build directory at the root of the Lucene checkout (.. and the path will be shown after you’ve run ant jar): lucene/java/trunk/build/contrib/snowball/.

If you got the lucene-snowball-2.9-dev.jar file compiled, things are looking good! Let’s move on to getting the bleeding edge version of Solr up and running (if you have an existing Solr version that you’re using and do not want to upgrade, skip the following steps .. but be sure you know what you’re doing .. which, coincidentally, you also should if you’re building stuff from SVN as we are. Oh the joy!).

Getting Solr

Getting and building Solr from SVN is very straight forward. First, check it out from Subversion:

svn co http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr/trunk/

And then simply build the war file for your favourite container:

cd solr/trunk/
ant dist

Voilà – you should now have an apache-solr-1.4-dev.war (or something similar) in the build/ directory. You can test that this works by replacing your regular Solr installation (.. make a backup first ..) and restarting your application server.

Editing the stemmer definition

After extracting the snowball distribution, you’re left with a snowball_code directory, which contains algorithms and, inside that, norwegian (in addition to several other stemmer languages). My example here expands the definition used in the norwegian stemmer, but the examples will work with all the included stemmers.

Open up one of the files (I chose the iso-8859-1 version, but I might have to adjust this to work for UTF-8/16 later. I’ll try to post an update in regards to that) and take a look around. The snowball language is interesting, and you can find more information about it at the Snowball site.

I’ll not include a complete dump of the stemming definition here, but the interesting part (for what we’re attempting to do) is the main_suffix function:

define main_suffix as (
    setlimit tomark p1 for ([substring])
    among(
        'a' 'e' 'ede' 'ande' 'ende' 'ane' 'ene' 'hetene' 'en' 'heten' 'ar'          
        'er' 'heter' 'as' 'es' 'edes' 'endes' 'enes' 'hetenes' 'ens'
        'hetens' 'ers' 'ets' 'et' 'het' 'ast' 
            (delete)
        's'
            (s_ending or ('k' non-v) delete)
        'erte' 'ert'
            (<-'er')
    )
)

This simply means that for any word ending in one of the suffixes in the first three lines, that suffix will be deleted (as given by the (delete) command behind the definitions). The problem, given our example above, is that none of the lines will capture an "ere" or "erene" ending - which is what we need to actually solve the problem.

We simply add them to the list of defined endings:

    among(
        ... 'hetene' 'en' 'heten' 'ar' 'ere' 'erene' 'eren'
        ...
        ...
            (delete)

I made sure to add the definitions before the shorter versions (such as 'er'), but I'm not sure it's actually required (.. I don't think it is).

Save the file under a new file name so you still have the old stemmers available.

Compiling a New Version of the Snowball Stemmer

After editing and saving your stemmer, it's now time to generate the Java class that Lucene will use to generate the base forms of the words. After extracting the snowball archive, you should have a binary file named snowball in the snowball_code directory. If you simply run this file with snowball_code as your current working directory:

./snowball

You'll get a list of options that Snowball can accept when generating the stemmer class. We're only going to use three of them:

-j[ava] Tell Snowball that we want to generate a Java class
-n[ame] Tell Snowball the name of the class we want generated
-o <filename> The filename of the output file. No extension.

So to compile our NorwegianExStemmer from our modified file, we run:

./snowball algorithms/norwegian/stem2_ISO_8859_1.sbl -j -n NorwegianExStemmer -o NorwegianExStemmer

(pardon the excellent file name stem2...). This will give you one new file in the current working directory: NorwegianExStemmer.java! We've actually built a stemming class! Woohoo! (You may do a few dance moves here. I'll wait.)

We're now going to insert the new class into the Lucene contrib .jar-file.

Rebuild the Lucene JAR Library

Copy the new class file into the version of Lucene you checked out from SVN:

cp NorwegianExStemmer.java <lucenetrunk>/contrib/snowball/src/java/org/tartarus/snowball/ext/

Then we simply have to rebuild the .jar file containing all the stemmers:

cd <lucenetrunk>/contrib/snowball/
ant jar

This will create lucene-snowball-2.9-dev.jar in <lucenetrunk>/build/contrib/. You now have a library containing your stemmer (and all the other default stemmers from Lucene)!

The last part is simply getting the updated stemmer library into Solr, and this will be a simple copy and rebuild:

Inserting the new Lucene Library Into Solr

From the build/contrib directory in Lucene, copy the jar file into the lib/ directory of Solr:

cp lucene-snowball-2.9-dev.jar lib/

Be sure to overwrite any existing files (.. and if you have another version of Lucene in Solr, do a complete rebuild and replace all the Lucene related files in Solr). Rebuild Solr:

cd solr/trunk/
ant dist

Copy the new apache-solr-1.4-dev.war (check the correct name in the directory yourself) from the build/ directory in Solr to your application server's deployment directory as solr.war (.. if you use another name, use that). This is webapps/ if you're using Tomcat. Remember to back up the old .war file, just to be sure you can restore everything if you've borked something.

Add Your New Stemmer In schema.xml

After compiling and packaging the stemmer, it's time to tell Solr that it should use the newly created stemmer. Remember that a stemmer works both when indexing and querying, so we're going to need to reindex our collection after implementing a new stemmer.

The usual place to add the stemmer is in the definition of your text fields, under the <analyzer>-sections for index and query (remember to change it in BOTH places!!):
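A sketch of what that can look like (the tokenizer and the other filters here are just a typical example chain, not necessarily the ones you are using; the important line is the SnowballPorterFilterFactory one):

<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="NorwegianEx"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="NorwegianEx"/>
  </analyzer>
</fieldType>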


Change NorwegianEx into the name of your class (without the Stemmer part; Lucene adds that for you automagically). After changing both locations (or more, if you have custom data types with their own index or query analyzer chains), you're ready for the next step:

Restart Application Server and Reindex!

If you're using Tomcat as your application server this might simply be (depending on your setup and distribution):

cd /path/to/tomcat/bin
./shutdown.sh
./startup.sh

Please consult the documentation for your application server for information about how to do a proper restart.

After you've restarted the application server, you're going to need to reindex your collection before everything works as planned. You can, however, already check at this stage that your stemmer works the way you intended. Log into the Solr admin interface, select the extended / advanced query view, enter a query (which should now be stemmed differently than before), check the "debug" box and submit your search. The resulting XML document will show you the parsed version of your query in the parsedquery element.
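For example, if Solr is running locally on Tomcat's default port, a request along these lines will show the stemmed terms in the parsedquery element (adjust host, port and query to your own setup):

http://localhost:8080/solr/select?q=elektrikere&debugQuery=on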

Download the Generated Stemmer

If you're just looking for an improved stemmer for Norwegian words (with the very, very simple changes outlined above, which might cause problems with UTF-8 content (.. please leave a comment if that's the case)), you can simply download NorwegianExStemmer.java. Follow the guide above for adding it to your Lucene / Solr installation.

Please leave a comment if something is confusing or if you want free help. Send me an email if you're looking for a consultant.

Using Solrj – A short guide to getting started with Solrj

As Solrj – The Java Interface for Solr – is slated to be released together with Solr 1.3, it’s time to take a closer look! Solrj is the preferred and easiest way of talking to a Solr server from Java (unless you’re using Embedded Solr). This way you get everything in a neat little package and can avoid parsing and working with XML etc. directly. Everything is tucked neatly away under a few classes, and since the web generally lacks a good example of how to use Solrj, I’m going to share a small class I wrote for testing the data we were indexing at work. As Solr 1.2 is currently the most recent version available at apache.org, you’ll have to take a look at the Apache Solr Nightly Builds website and download the latest version. The documentation is also contained in the archive, so if you’re going to do any serious Solrj development, this is the place to get it.

Oh well, enough of that, let’s cut to the chase. We start by creating a CommonsHttpSolrServer instance, which we provide with the URL of our Solr server as the only argument to the constructor. You may also provide your own parsers, but I’ll leave that for those who need it. I don’t. By default your Solr installation is running on port 8080 and under the solr path, but you’ll have to accommodate your own setup here. I’ve included the complete source file for download.

// the imports used throughout this example (Solrj classes plus the standard collections)
import java.util.*;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

class SolrjTest
{
    public void query(String q)
    {
        CommonsHttpSolrServer server = null;

        try
        {
            server = new CommonsHttpSolrServer("http://localhost:8080/solr/");
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }

The next thing we’re going to do is to actually create the query we’re about to ask the Solr server about, and this means building a SolrQuery object. We simply instantiate the object and then start setting the query values to what we’re looking for. The setQueryType call can be dropped to use the default query handler, but as we currently use dismax, that’s what I’ve used here. You can also turn on faceting (to create navigators/facets) and add the fields you want facets for.

        SolrQuery query = new SolrQuery();
        query.setQuery(q);
        query.setQueryType("dismax");
        query.setFacet(true);
        query.addFacetField("firstname");
        query.addFacetField("lastname");
        query.setFacetMinCount(2);
        query.setIncludeScore(true);

Then we simply query the server by calling server.query, which takes our parameters, builds the query URL, sends it to the server and parses the response for us.

        try
        {
            QueryResponse qr = server.query(query);

The result can then be fetched by calling .getResults() on the QueryResponse object, qr.

            SolrDocumentList sdl = qr.getResults();

We then output the information fetched in the query. You can change this to print all fields or other stuff, but as this is a simple application for searching a database of names, we just collect the first and last name of each entry and print them out. Before we do that, we print a small header containing information about the query, such as the number of elements found and which element we started on.

            System.out.println("Found: " + sdl.getNumFound());
            System.out.println("Start: " + sdl.getStart());
            System.out.println("Max Score: " + sdl.getMaxScore());
            System.out.println("--------------------------------");

            ArrayList<HashMap<String, Object>> hitsOnPage = new ArrayList<HashMap<String, Object>>();

            for(SolrDocument d : sdl)
            {
                HashMap<String, Object> values = new HashMap<String, Object>();

                for(Iterator<Map.Entry<String, Object>> i = d.iterator(); i.hasNext(); )
                {
                    Map.Entry<String, Object> e2 = i.next();
                    values.put(e2.getKey(), e2.getValue());
                }

                hitsOnPage.add(values);
                System.out.println(values.get("displayname") + " (" + values.get("displayphone") + ")");
            }

After this we output the facets and their information, just so you can see how you’d go about fetching this information from Solr too:

            List<FacetField> facets = qr.getFacetFields();

            for(FacetField facet : facets)
            {
                List<FacetField.Count> facetEntries = facet.getValues();

                for(FacetField.Count fcount : facetEntries)
                {
                    System.out.println(fcount.getName() + ": " + fcount.getCount());
                }
            }
        }
        catch (SolrServerException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args)
    {
        SolrjTest solrj = new SolrjTest();
        solrj.query(args[0]);
    }
}

And there you have it, a very simple application to just test the interface against Solr. You’ll need to add the jar-files from the lib/-directory in the solrj archive (and from the solr library itself) to compile and run the example.
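To give a rough idea of how to compile and run it (this assumes you have copied the Solrj jar and its dependencies into a local lib/ directory and are on Java 6 or newer for the wildcard classpath; the exact jar names depend on the nightly you downloaded, and the query string is just an example):

javac -cp "lib/*" SolrjTest.java
java -cp ".:lib/*" SolrjTest "john"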

Download: SolrTest.java