Solr – Mats Lindh

SEVERE: org.apache.solr.common.SolrException: can not sort on unindexed field: geodist()

This error may occur if you’re using sort=geodist() in your Solr Spatial / Geographic Search. The reason is probably that you have an empty pt= value or that the parameter is missing altogether.

You might also want to make sure that your Solr version is new enough to support sorting by functions, but if you’re doing anything useful with spatial searches you’re probably updated enough – at least for geodist(). :-)
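One way to catch this before the request ever reaches Solr is to validate the parameters in your query layer. The sketch below is only an illustration: the class and method names are made up for this example (they are not part of Solr or SolrJ), and it just builds the parameter map without talking to Solr.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the parameters for a distance-sorted query and
// refuses to do so when sfield or pt is missing or blank.
final class GeodistParams {
    static Map<String, String> build(String sfield, String pt) {
        requireNonBlank("sfield", sfield);
        requireNonBlank("pt", pt);

        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");
        params.put("sfield", sfield);
        params.put("pt", pt);
        params.put("sort", "geodist() asc");
        return params;
    }

    private static void requireNonBlank(String name, String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(name + " must be set when sorting by geodist()");
        }
    }
}
```

Failing early in the application gives a clearer error message than the "can not sort on unindexed field" exception from Solr.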

Replacement for Deprecated / Removed BaseTokenFilterFactory

When writing plugins for Solr you’d previously extend the BaseTokenFilterFactory, but at some point since I last built trunk, that changed to TokenFilterFactory, which is located in the util package of Lucene (org.apache.lucene.analysis.util) instead.

Diff:

- import org.apache.solr.analysis.BaseTokenFilterFactory; 
+ import org.apache.lucene.analysis.util.TokenFilterFactory; 

...

- public class xxxxxFilterFactory extends BaseTokenFilterFactory 
+ public class xxxxxFilterFactory extends TokenFilterFactory

Solr: Missing Geographic Distance in Response When Using fl=_dist_:geodist()

A question that came up on the Freenode Solr IRC channel today was about _dist_:geodist() failing to include a field named _dist_ in the response. This field should contain the return value of geodist(), i.e. the distance from the source point to the document. The example from the wiki is:

&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist() asc&fl=_dist_:geodist()

The _dist_:geodist() in the fl= parameter adds a field named _dist_ with the returned contents from geodist() (the :-syntax creates a field name alias).

.. but why is it missing? Probably because using geodist() in the fl parameter is only supported on Solr 4.0 and later (which is currently only in trunk).

There’s a workaround on the wiki which you can use: make geodist() the query itself, so that the score returned for each document is the value of geodist(). The drawback is that you can no longer score documents in any other way:

q={!func}geodist()

Solr Response Empty from PHP, but Works in Browser or CURL?

Weird issue that I think I’ve stumbled upon earlier, but which yet again reared its head yesterday. Certain application containers (possibly Jetty in this case) will for some reason not produce any output from Solr (or other applications I’d guess) if the request is made with HTTP/1.0 as the version identifier (“GET /…/ HTTP/1.0” as the first line of the request). The native HTTP support in PHP identifies itself as HTTP/1.0 as it doesn’t support chunked transfer encoding, which then turns into a magical problem where requests that used to work don’t work any longer (the response is just zero bytes in size, while all other headers are identical), yet still work as expected if you open them in your browser.

The solution is to either gamble on the server not sending any chunked responses and set protocol_version in the stream context that you pass to the file retrieving function (see the list of HTTP wrapper settings; I don’t think it’s a good idea to define protocol_version as a float, but .. well), or use cURL instead. The Solr pecl extension uses cURL internally, so it’s not affected by this issue.

Solr: Replication not starting?

After upgrading our Solr-servers from 1.4.1 to 4.0-trunk (to be sure we were ready for the next version), I had trouble with getting replication to start again. It worked perfectly back with 1.4.1, but after upgrading to 4.0-trunk, it simply wouldn’t start.

I had to upgrade the machines individually (to allow the current index to continue serving requests), so I removed the replication and directed all the traffic to the slave. After updating the master (which worked after actually remembering to clean out the old webapps from Tomcat and adding a few new settings) and reindexing, most of the traffic was directed to it, and the slave was upgraded to the new Solr version. I turned on replication again, updated the configuration file with the needed settings and started the slave. Nothing happened. Weird.

Time to debug!

On any slave there’s a “replication.properties” file in the data directory ($SOLRHOME/data) which contains information about the current replication status. This file was created, indicating that the replication was at least attempting to run. If you open the file in a text editor (or just cat it), you should be able to read a bit of meta information about the replication state.

replicationFailedAtList=1311072270004,1311072240006..
timesFailed=11

Seems like it’s trying, but for some reason it doesn’t work. First thing to check would be to grep for replication in the log on both the master and the slave, and see if there are any requests being made at all. There might be, but the replication still doesn’t start.

Try fetching the current state yourself to see what response the master is serving. You can do this by using “GET” or “wget” or “curl” to make an HTTP request to the master Solr-server from the slave together with the URL from “masterUrl” in the requestHandler for /replication from solrconfig.xml:

GET http://example.com/solr/replication?command=indexversion

This should respond with something close to:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <long name="indexversion">1310994445934</long>
  <long name="generation">2</long>
</response>

If “indexversion” is 0, this means that the master hasn’t triggered a replication yet, which may seem weird if you’ve just started the server and the slave doesn’t have any data at all.

The reason might be that the master has not been instructed to actually trigger a replication event (and unless a replication event has been triggered, the indexversion will be 0):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>

If you only have “commit” in the above list, a replication event will not be triggered unless you’ve actually performed a commit after the slave has connected for the first time. If you add “startup”, the replication will also be triggered when the master starts up (so that any connecting slaves will start replicating right away).

To fix the issue without restarting any nodes, issue a single commit to the master and watch as the slaves start replicating. To issue a commit through curl:

curl http://example.com/solr/update -H "Content-Type: text/xml" --data-binary '<commit />'
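If you want to automate the indexversion check from earlier in this post, the response is simple enough to parse with the XML support in the JDK. This is a rough sketch and not part of Solr: the class and method names are invented for the example, and it assumes you have already fetched the response body as a string (with curl or any HTTP client).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: inspects the XML returned by
// /replication?command=indexversion on the master.
final class ReplicationCheck {
    // Returns the value of <long name="indexversion"> from the response.
    static long indexVersion(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse replication response", e);
        }

        NodeList longs = doc.getElementsByTagName("long");
        for (int i = 0; i < longs.getLength(); i++) {
            Element element = (Element) longs.item(i);
            if ("indexversion".equals(element.getAttribute("name"))) {
                return Long.parseLong(element.getTextContent().trim());
            }
        }
        throw new IllegalArgumentException("no indexversion element in response");
    }

    // An indexversion of 0 means no replication event has been triggered yet.
    static boolean masterIsReadyToReplicate(String xml) {
        return indexVersion(xml) != 0L;
    }
}
```

If masterIsReadyToReplicate() returns false, issuing a commit to the master (or adding “startup” to replicateAfter) is the thing to try.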

SEVERE: Error in xpath:java.lang.RuntimeException: solrconfig.xml missing luceneMatchVersion

One of the things that changed from Solr 1.4.1 to 1.5+ was the introduction of a parameter to tell Solr / Lucene which compatibility version its index files should be created and used in.

Solr now refuses to start if you do not provide this setting (if you’re upgrading a previous installation from 1.4.1 or earlier). The fix isn’t really straightforward, and you’ll probably have to recreate your index files if you’re just arriving at the scene with Solr / Lucene 3.2 and 4.0. Solr 3.0 (1.5) might be able to upgrade the files from the 2.9 version, but if you’re jumping from Lucene 2.9 to 4.0, the easiest solution seems to be to delete the current index and reindex (set up replication, disable replication from the master, query the slave while reindexing the master, etc.. and you’ll have no downtime while doing this!).

You’ll need to add a parameter to your solrconfig.xml file as well in the <config> section.

<luceneMatchVersion>LUCENE_CURRENT</luceneMatchVersion>

Other valid values are LUCENE_30, LUCENE_31, LUCENE_32 and LUCENE_40. These values represent specific versions of the index structure, while LUCENE_CURRENT will use the version depending on which particular release of Lucene you’re using. The version format will be upgraded automagically between most releases, so you’ll probably be fine by using LUCENE_CURRENT. If you however are trying to load index files that are more than one version older, you may have to use one of the other values. If you want to avoid any possible surprises when updating your Solr installation, you probably want to set this to one of the versioned values.

Updating a Solr Analysis Plugin from 1.4.1 (Lucene 2.9) to Solr / Lucene 4.0 (current trunk)

Three years and a couple of weeks ago I wrote a post about how to get started writing a simple Solr Analysis Plugin to handle incoming tokens and modifying them in place when an update is requested.

Since then the whole version number structure of Solr has changed (and is now in sync with the underlying Lucene version), and not surprisingly, the current API has also been updated. This means that a few small changes are required to get your analysis plugins running on the current trunk of Lucene and Solr.

The main change is that the previously named TermAttribute is now named CharTermAttribute. This means that any imports will have to change:

- import org.apache.lucene.analysis.tokenattributes.TermAttribute; 
+ import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Any declarations of TermAttributes will need to be CharTermAttributes instead:

- private TermAttribute termAtt; 
+ private CharTermAttribute termAtt;

  public NorwegianNameFilter(TokenStream input) 
  { 
      super(input); 
-     termAtt = (TermAttribute) addAttribute(TermAttribute.class); 
+     termAtt = input.getAttribute(CharTermAttribute.class); 
  }

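We now fetch the attribute from the current TokenStream (not sure if the old way I did it has been deprecated, but this seems to be the suggested way now). Note that getAttribute() only succeeds when a component earlier in the chain has already registered the attribute; addAttribute(CharTermAttribute.class) is still available, no longer needs the cast, and registers the attribute if it is missing. We also change any references to TermAttribute.class to CharTermAttribute.class.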

The actual TermAttribute interface has also changed, meaning we’ll have to change a few of the old method calls:

- termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength())); 
+ termAtt.setLength(this.parseBuffer(termAtt.buffer(), termAtt.length()));

.setTermLength() => .setLength()
.termBuffer() => .buffer()
.termLength() => .length()

The methods behave in the same manner as in the previous API: .buffer() retrieves a char array (char[]) holding the current term, which you can modify in place; .length() retrieves the length of the part of the buffer that is in use (the buffer itself can be larger); and .setLength() sets the new length (if you’re collapsing characters, for example).
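To make the buffer / length contract concrete, here is a small stand-alone sketch of what a parseBuffer() implementation could look like. The rule it applies (collapsing runs of the same character into one) is purely an assumption for illustration and is not the rule our actual filter uses; the class name is made up as well, and the code has no Lucene dependency so it can be tried on its own.

```java
// Hypothetical example of an in-place buffer rewrite: collapses runs of the
// same character ("aanne" becomes "ane") and returns the new length.
final class BufferCollapse {
    static int parseBuffer(char[] buffer, int bufferLength) {
        if (bufferLength == 0) {
            return 0;
        }

        // buffer[0 .. write) always holds the collapsed result so far.
        int write = 1;
        for (int read = 1; read < bufferLength; read++) {
            if (buffer[read] != buffer[write - 1]) {
                buffer[write++] = buffer[read];
            }
        }

        // Anything from index 'write' onwards is leftover and must be ignored,
        // which is why the caller passes the return value to setLength().
        return write;
    }
}
```

Only the first bufferLength characters are read, so any unused space at the end of the array is never touched.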

The new implementation of our analysis filter skeleton:

package no.derdubor.solr.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NorwegianNameFilter extends TokenFilter
{
    private CharTermAttribute termAtt;

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
        // Fetch the term attribute from the wrapped TokenStream.
        termAtt = input.getAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException
    {
        if (this.input.incrementToken())
        {
            // Rewrite the term in place and store its (possibly shorter) length.
            termAtt.setLength(this.parseBuffer(termAtt.buffer(), termAtt.length()));
            return true;
        }

        return false;
    }

    protected int parseBuffer(char[] buffer, int bufferLength)
    {
        // Modify buffer[0 .. bufferLength) in place here and return the new length.
        // Placeholder: leave the term untouched.
        return bufferLength;
    }
}

Solr, Memory Usage and Dynamic Fields

One of the many great things about Solr is that it allows you to add dynamic fields – you can define a certain pattern that a field will have to follow, but it can then use any field name that matches the pattern.

We’ve been using one such dynamic field to add a sort field for our documents:

xxx_Category_Subcategory: 300

This would allow us to sort by this field to get the priority of our documents in this particular category and subcategory. A document would contain somewhere between 1 and 15 such fields. The total selection of unique field names is somewhere around 1200 across all documents.

Be small, be happy

As long as our collection was quite small (<10k documents) this scheme worked great. When our collection grew to around 500k documents, we started seeing out of memory errors quite often. At the worst rate we got an out of memory exception every 30 minutes, and had to restart the Solr server. Performance didn't suffer, but obviously we couldn't continue restarting servers until we got bored.

After removing a few other possible issues (such as our stable random sort) I was rather stumped that things didn't improve. The total amount of data in our dynamic fields was rather low, somewhere around 2.5 - 3.5m integers, or possibly somewhere around 50-70MB in total. The JVM should be able to fit everything about these fields in memory and query them for the fields we're trying to find, but a heap dump of the JVM just before it hit the out of memory exception revealed that we were getting quite a few GBs of Lucene's FieldCache objects. These objects cache the value of a field for the total set of documents available in the index, and you're sadly not able to tune this cache through the Solr configuration (at least not for 1.4 as far as I could find).

Less Dynamic Fields, More Manual Labor

After pondering this issue a bit I came to the conclusion that our problem was related to the dynamic fields we had, and the fact that we used them for sorting. Lucene / Solr keeps one set of field caches for each field when it’s used for sorting, to avoid having to do duplicate work later. For us, this meant that each time we sorted a new field, an array had to be created with the size of the total document set. As long as we just had 10k documents, these arrays were small enough that we had enough memory available – when the document set grew to almost 500k documents, not so much.

This means that the total memory required for field caches grows with DocumentsInIndex * FieldsSortedBy. As long as our DocumentsInIndex was just 10k, the memory available to the JVM was enough to keep sorting by the number of fields we did. When the number of documents grew, the memory usage grew by the same factor and we got our OutOfMemoryError.
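To get a feeling for the numbers, here is a back-of-the-envelope estimate. It assumes that each cached sort field costs one 4-byte int per document in the index, and that all of the roughly 1200 field names end up being sorted on at some point. Both are assumptions for illustration, not measurements from our heap dump, and the class name is made up.

```java
// Rough estimate of the memory needed for sort caches, assuming one int
// (4 bytes) per document for every field that has been sorted on.
final class FieldCacheEstimate {
    static final int BYTES_PER_INT = 4;

    static long bytes(long documentsInIndex, long fieldsSortedBy) {
        return documentsInIndex * fieldsSortedBy * BYTES_PER_INT;
    }
}
```

With 10,000 documents and 1,200 fields this gives 48,000,000 bytes (about 48 MB). With 500,000 documents it gives 2,400,000,000 bytes (about 2.4 GB), which is the same order of magnitude as what the heap dump showed, even though the fields themselves only hold a few million values in total.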

The Solution

Our solution could probably be more elegant, but currently we’ve moved the sorting to our application layer instead of the data provider layer. We’re requesting the complete set of hits from the Solr-server in the category anyway, so we’re able to sort it in the application – and by using a response format other than XML we’re also doing it rather quickly. This means that we’re not using sorting at all, and are only querying against one multivalued field to see if the category key is present there at all.
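Our application is not written in Java, but the idea of sorting in the application layer is simple enough to sketch in it. Everything here is hypothetical: it assumes the application already has a map from document id to the priority for the requested category (how you get hold of those priorities is up to you), and that a lower number means the document should be listed first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical application-side sort: orders document ids by their priority
// in one category instead of asking Solr to sort on a dynamic field.
final class CategorySorter {
    static List<String> sortByPriority(Map<String, Integer> priorityByDocId) {
        List<String> ids = new ArrayList<>(priorityByDocId.keySet());
        // Ascending order: lowest priority value first (an assumption).
        ids.sort((a, b) -> Integer.compare(priorityByDocId.get(a), priorityByDocId.get(b)));
        return ids;
    }
}
```

Since the sort only touches the hits for one category, its cost depends on the size of that category rather than on the total number of documents in the index.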

Note: Another solution we considered was to divide our index into several Solr cores. This would allow us to keep the number of documents in each core low, and therefore also keep the fieldcache size in check. We know that each category could very well live on just one core as we won’t be mixing it with data from the other cores (and for that we could keep a separate core with all the documents, just not use it for searching across dynamic fields). We dropped this plan because of the rather worrying increase in complexity in our Solr installation. This could however help in your own case. :-)

mod_jk and Internal Server Error (HTTP 500)

We’ve extended our previously single Solr-node to a few nodes in a cluster. This allows us to run queries against one node while updating or configuring another, distributing the load across several servers (although we’re not there yet load wise) and being able to handle any out of memory or other critical errors.

While Solr supports querying several cores or distributing the queries internally, we decided to move the load balancing and handling of failed nodes higher up in the hierarchy. We’re now doing simple load balancing and handling of failed nodes by using mod_jk in our existing Apache-based environment. mod_jk also handles failed servers without any administrator interaction. We were already using mod_jk for our main web frontend, and since we use Tomcat as our application container for Solr, things should be a breeze!

Well, no. After copying our existing mod_jk setup, configuring our new workers and restarting Apache, all I got was the well known 500 INTERNAL SERVER ERROR. Here’s the worker configuration file:

worker.list=loadbalancer,status

worker.solr1.port=8009
worker.solr1.host=10.0.0.4
worker.solr1.type=ajp13
worker.solr1.lbfactor=1
worker.solr1.cachesize=10

worker.solr2.port=8009
worker.solr2.host=10.0.0.5
worker.solr2.type=ajp13
worker.solr2.lbfactor=4
worker.solr2.cachesize=10

worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=solr1,solr2
worker.loadbalancer.sticky_session=0

worker.status.type=status

This provides us with two solr servers and one status worker (the status worker is responsible for providing a simple web interface for enabling/disabling/seeing the status of the other workers), configured with a 1:4 load balancing (the second server has quite a bit more memory available for Solr).

I provided the configuration of the workers through the JkWorkersFile configuration setting (in a VirtualHost block… don’t do that):

JkWorkersFile conf/workers.properties

I also enabled debug logging to attempt to find the problem (still in a VirtualHost block):

JkLogFile logs/mod_jk.log
JkLogLevel debug
JkLogStampFormat "[%a %b %d %H:%M:%S %Y]"

Other mod_jk settings (in the VirtualHost block) were:

JkOptions +ForwardKeySize +ForwardURICompat -ForwardDirectories
JkRequestLogFormat "%w %V %T"
JkShmFile logs/jk.shm
JkMount /* loadbalancer

<Location /jkstatus>
    JkMount status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>

Still no solution. Peeking at the log files mod_jk provided, I was able to deduce the following:

[debug] map_uri_to_worker::jk_uri_worker_map.c (525): Attempting to map context URI '/jkstatus'
[debug] map_uri_to_worker::jk_uri_worker_map.c (550): Found an exact match status -> /jkstatus
[debug] jk_handler::mod_jk.c (1920): Into handler jakarta-servlet worker=status r->proxyreq=0
[debug] wc_get_worker_for_name::jk_worker.c (111): did not find a worker status
[info]  jk_handler::mod_jk.c (2071): Could not find a worker for worker name=status

This indicates that mod_jk was unable to find a worker matching the name I provided in the JkMount statement above: status. Weird. I added some garbage characters to the “JkWorkersFile” setting, and Apache complained that it was unable to find the workers file. Changed it back, reloaded, and still nothing. It was apparently unable to find the worker. The map did however work, as it tried to launch a worker.

Looking back at the start up sequence of mod_jk, the following was found in the log:

[debug] build_worker_map::jk_worker.c (236): creating worker ajp13
[debug] wc_create_worker::jk_worker.c (141): about to create instance ajp13 of ajp13
[debug] wc_create_worker::jk_worker.c (154): about to validate and init ajp13
[debug] ajp_validate::jk_ajp_common.c (1922): worker ajp13 contact is 'localhost:8009'
[debug] ajp_init::jk_ajp_common.c (2047): setting endpoint options:
[debug] ajp_init::jk_ajp_common.c (2050): keepalive:        0
[debug] ajp_init::jk_ajp_common.c (2054): timeout:          -1
[debug] ajp_init::jk_ajp_common.c (2058): buffer size:      0
[debug] ajp_init::jk_ajp_common.c (2062): pool timeout:     0
[debug] ajp_init::jk_ajp_common.c (2066): connect timeout:  0
[debug] ajp_init::jk_ajp_common.c (2070): reply timeout:    0
[debug] ajp_init::jk_ajp_common.c (2074): prepost timeout:  0
[debug] ajp_init::jk_ajp_common.c (2078): recovery options: 0
[debug] ajp_init::jk_ajp_common.c (2082): retries:          2
[debug] ajp_init::jk_ajp_common.c (2086): max packet size:  8192
[debug] ajp_create_endpoint_cache::jk_ajp_common.c (1959): setting connection pool size to 1 with min 0

It took a bit of time, but I realized that this tells me that mod_jk created _a default_ worker named ajp13. Apparently it was not reading my workers file at all, but it still complained if I changed the file name. You’d think that a setting which complains when the file is missing would also load the file when it is there. But .. well. After an hour of attempting to find out why the workers didn’t load, revising the workers file to a minimal example and trying with just a single status worker, I concluded that my workers file was correct, and that mod_jk obviously found it when it attempted to load it.

Then I suddenly discovered the small notice in the mod_jk configuration manual:

JkWorkersFile: This directive is only allowed once. It must be put into the global part of the configuration.

JkWorkersFile can not be defined in a <VirtualHost> section. It will NOT complain if you do it, it’ll just never define any workers. It will complain if the file doesn’t exist, even if it never tries to actually load it.

Confusing.

Moving the JkWorkersFile statement out of the <VirtualHost> block and into the global part of the configuration, next to the LoadModule statement, solved the issue. This is also the case for JkWorkerProperty.

Solr, Tomcat and HTTP/1.1 505 HTTP Version Not Supported

During today’s hacking aboot I came across the above error from our Solr query library. The error indicates that some part of Tomcat was unable to parse the “GET / HTTP/1.1” request line, more specifically that it was unable to determine the “HTTP/1.1” part. A problem like this could be introduced by having a space in the query string (and it not being escaped properly), so that the request would have been for “GET /?a=b c HTTP/1.1”. After running both the working and the non-working query through ngrep and Wireshark, this did however not seem to be the problem. My spaces were properly escaped using plus signs (GET /?a=b+c HTTP/1.1).

There does however seem to be a problem (at least with our version of Tomcat – 6.0.20) which results in the +-s being resolved before the request is handed off to the code that attempts to parse the header, so even though it is properly escaped using “+”, it still barfs.

The solution:

Use %20 to escape spaces instead of + signs; simply adding str_replace(" ", "%20", ..); in our query layer solved the problem.
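Our query layer is PHP, but if yours happens to be written in Java the same fix could look roughly like the sketch below. The class and method names are made up for this example. URLEncoder encodes a space as “+” and a literal plus sign as “%2B”, so replacing “+” afterwards only affects the spaces.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: encodes a query string value with %20 for spaces
// instead of the "+" that URLEncoder produces by default.
final class QueryEscaper {
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }
}
```

A request for a=b c then goes out as GET /?a=b%20c HTTP/1.1, which avoids the “+” handling described above.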