How To Dismantle An Atomic HTTP Query .. String.

Following up on yesterday’s gripe about PHP’s (old and now useless) automagic translation of dots in GET and POST parameter names to underscores, today’s edition manipulates the query string in place instead of returning it as an array.

This is useful if you have a query string you want to pass on to another service, where PHP’s default behaviour will make things barf. That might happen because of the dot translation issue, or because some services (such as Solr) rely on a parameter name being repeatable (if you let PHP parse the string, the second value simply overwrites the first).
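
To see the mangling in action, here is a minimal sketch of what PHP’s own parsing does to such a query string (the parameter names are made up):

    // parse_str() applies the same rules PHP uses for $_GET and $_POST
    parse_str('foo.bar=1&fq=type:a&fq=type:b', $parsed);

    var_dump($parsed);
    // ["foo_bar"] => "1"       -- the dot has become an underscore
    // ["fq"]      => "type:b"  -- the first fq value has been overwritten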

// Removes one or more parameter names from a raw query string without parsing
// it into an array first (and thereby avoiding PHP's dot-to-underscore mangling).
function http_dismantle_query($queryString, $remove)
{
    $removeKeys = array();

    if (is_array($remove))
    {
        foreach($remove as $removeKey)
        {
            $removeKeys[$removeKey] = true;
        }
    }
    else
    {
        $removeKeys[$remove] = true;
    }

    $resultEntries = array();
    $segments = explode("&", $queryString);

    foreach($segments as $segment)
    {
        $parts = explode('=', $segment);

        $key = urldecode(array_shift($parts));

        if (!isset($removeKeys[$key]))
        {
            $resultEntries[] = $segment;
        }
    }

    return join('&', $resultEntries);
}
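
A quick usage sketch (the query string and parameter names are just examples):

    $original = 'q=foo.bar&start=0&rows=10&debugQuery=true';

    // remove a single parameter ...
    echo http_dismantle_query($original, 'debugQuery');
    // q=foo.bar&start=0&rows=10

    // ... or several at once
    echo http_dismantle_query($original, array('start', 'rows'));
    // q=foo.bar&debugQuery=true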

I’m not really sure what I’ll call the next function in this series, but there sure are loads of candidates out there.

Boosting By Date in Solr 1.4

One of the things introduced with Solr 1.4 is the ms() function, which returns the number of milliseconds since the Unix epoch for a timestamp (or the difference in milliseconds between two timestamps). This means that you can now write date boosts without having to resort to ord() or rord().

The best solution for boosting documents based on a date field at query time (so you avoid having to update the boost factor as time passes) seems to be the boost query type. The boost query type passes the query on to your default query handler and lets that resolve the query itself, but provides a boost for each document based on the function you supply.

An example of how to solve this issue can be found on the SolrRelevancy part of the Solr Wiki:

{!boost b=recip(ms(NOW,publishedTime),3.16e-11,1,1)}query

This takes the number of milliseconds between NOW and the time the document was published (publishedTime is one of the fields YOU have to provide when you’re indexing; it might be “created” or something else that suits your needs) and multiplies that number by 3.16e-11, which is roughly 1 / (the number of milliseconds in a year). The result of the function is 1 for a document published right now, about 0.5 for a document published a year ago, and it keeps dropping towards 0 as the document gets older.
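
To get a feel for the numbers: recip(x,m,a,b) is defined as a/(m*x + b), so with the arguments above the boost becomes 1 / (3.16e-11 * age_in_ms + 1). A quick back-of-the-envelope sketch in plain PHP, just to illustrate the curve:

    function dateBoost($ageInMs)
    {
        // recip(x, m, a, b) = a / (m * x + b), here with m = 3.16e-11 and a = b = 1
        return 1 / (3.16e-11 * $ageInMs + 1);
    }

    $msPerYear = 365 * 24 * 60 * 60 * 1000;

    echo dateBoost(0);                // 1     - published right now
    echo dateBoost($msPerYear);       // ~0.5  - one year old
    echo dateBoost(10 * $msPerYear);  // ~0.09 - ten years old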

The Solr Wiki also contains examples of how you can split your boost query into several parts to make it easier to read.

Escaping Characters in a Solr Query / Solr URL

We’re using our own Solr library at Derdubor at the moment, but so far we’ve only been using it for indexing content. The query part was never standardized in our common library, as we usually used an alternative output format, but during the last few days that has changed. We now have a parser for the default XML output format, and we also support facets and field queries (or constraints, as they’re abstracted in our library).

This means that we’re feeding content into the query that may contain foreign characters, in particular characters that have special meaning in a Solr query. You can find the complete list of characters that need to be escaped in a Solr or Lucene query in the Lucene manual.

To escape the characters we use this very simple and stupid PHP method:

    static public function escapeSolrValue($string)
    {
        $match = array('\\', '+', '-', '&', '|', '!', '(', ')', '{', '}', '[', ']', '^', '~', '*', '?', ':', '"', ';', ' ');
        $replace = array('\\\\', '\\+', '\\-', '\\&', '\\|', '\\!', '\\(', '\\)', '\\{', '\\}', '\\[', '\\]', '\\^', '\\~', '\\*', '\\?', '\\:', '\\"', '\\;', '\\ ');
        $string = str_replace($match, $replace, $string);

        return $string;
    }

We used a regular expression first, but the sheer amount of backslashes made it a regular .. hell … to read. So to make it easier for the people maintaining this in the future, we took the easy-to-read / easy-to-maintain road for this one.
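
For completeness, a short usage sketch (the class name and the field are made up; in our case the method lives in our internal Solr library):

    // Build a field query (constraint) from untrusted user input.
    // Spaces and quotes are escaped as well, so no extra quoting is needed.
    $safeValue = Derdubor_Solr_Query::escapeSolrValue($_GET['author']);
    $constraint = 'author:' . $safeValue;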

SOLR: java.io.FileNotFoundException: no segments* file found

While playing around with one of my development SOLR installations (this time under Windows), I suddenly got a weird error message when feeding data to one of the fresh cores.


SEVERE: java.lang.RuntimeException: java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.SimpleFSDirectory@C:\temp\solr\*\data\index: files:

Taking a look at the contents of the index\ directory, it was in fact empty. That seemed weird, but my initial guess was that Lucene / SOLR would treat this as a new installation and create the missing files.

Turns out the issue is that it won’t – as long as the index directory exists, Lucene / SOLR goes looking for the segment files.

Thanks to an old post to the solr-dev list by Yonik, the easiest fix is to simply delete the index directory itself and restart your servlet container (Tomcat in this case).

Porting SOLR Token Filter from Lucene 2.4.x to Lucene 2.9.x

I had trouble getting our current token filter to work after recompiling against the nightly builds of SOLR, which seemed to stem from the recently adopted upgrade to Lucene 2.9.0 (not released yet, but SOLR nightly is bleeding edge!). There’s functionality added for backwards compatibility, and while that might have worked, things didn’t really come together as they should have (somewhere or other). So I decided to port our filter over to the new model, where incrementToken() is the New Way ™ of doing stuff. Helped by the current lowercase filter in the SVN trunk of Lucene, I made it all the way through.

Our old code:

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
    }

    public Token next() throws IOException
    {
        return parseToken(this.input.next());
    }
 
    public Token next(Token result) throws IOException
    {
        return parseToken(this.input.next());
    }

Compiling this with Lucene 2.9.0 gave me a new warning:

Note: .. uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

To the internet mobile!

Turns out next() and next(Token) have been deprecated in the new TokenStream implementation, and the New True Way is to use the incrementToken() method instead.

Our new code:

    private TermAttribute termAtt;

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException
    {
        // Advance the wrapped stream first; false means there are no more tokens.
        if (this.input.incrementToken())
        {
            // Rewrite the term buffer in place and register the new valid length.
            termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength()));
            return true;
        }

        return false;
    }

A few gotchas along the way: incrementToken() needs to be called on the input token stream, not on the filter itself (super.incrementToken() will give you a stack overflow). This moves the token stream one step forward. We also decided to move the buffer handling into the parse function, and remember to include the length of the “live” part of the buffer (the buffer will be larger, but only the content up to termLength is valid).

The return value from our parseBuffer function is the actual amount of usable data in the buffer after we’ve had our way with it. The concept is to modify the buffer in place, so that we avoid allocating or deallocating memory.

Hopefully this will help other people with the same problem!

Finding Substring in a String in Bash

If you’re ever in need of checking whether a variable in bash contains a certain string (not equal to, just contains), the =~ operator inside the [[ ]] conditional comes in very handy. =~ is usually used to match a variable against a regular expression, but it can also be used for substrings (you may also use == *str*, but I prefer this way).

This short example submits a document to Solr using curl, then emails the result if the Solr server responded with an error (.. I tried matching against the error code or something similar instead, but didn’t find a better way. If you have something better, please leave a comment!):

    CURLRESULT=`cat $i | curl -s -X POST -H 'Content-Type: text/xml' -d @- $URL`
    if [[ $CURLRESULT =~ "Error report" ]]
    then
        echo "Error!! Error!! CRISIS IMMINENT!!"
        echo $CURLRESULT | mail -s "Error importing to SOLR" mail@example.com
        exit
    fi

Neat to check that everything went OK before you remove the files you’ve submitted.

How To Make Solr Go 45% Faster

If you’re still looking for a good reason to spend a few minutes tuning your SOLR caches (documentCache, filterCache and queryResultCache), I’ll give you two numbers:

avgTimePerRequest : 126.148822
avgTimePerRequest : 70.026436 

The first is with the default cache settings, the latter is with a very small change. Yep. That’s a 45% speed increase. So, the interesting question is what I actually changed in the cache configuration – although I should warn you, the answer is very, very, very complicated:

The cache size. The default (at least for our current 1.3 installation) is to keep 512 elements in each cache. When someone on the solr-user list asked for an introduction to what the different cache statistics meant, I remembered that I hadn’t actually tweaked the settings at all. The SOLR server has been running for a year now, so we have a pretty good idea of how it performs and what kind of queries we see. The stats showed that a lot more cached entries got evicted than I was hoping to see, which gave us a rather low cache hit rate (about 50%).

The simple change was to increase the size of the caches (from 512 to 16384 entries), so that we’re able to keep more entries in memory before evicting them. After running 24 hours with the new setup we’re now seeing cache hit ratios of 99%, 68% and 67%. The relevant sections of the solrconfig.xml file are the documentCache, filterCache and queryResultCache entries, where the size attribute controls how many entries each cache can hold.
The document cache fills about four times as fast as the filter cache, so we might have to tweak the sizes further to better match our load pattern.

So what now?

The next step would be to try the FastLRUCache which is included with Solr 1.4 (currently in SVN and the nightlies). If my memory serves me right the changes are mostly related to locking, so I’m not sure if we’ll see significantly better performance.

We’ll also make further adjustments to the size of each of the caches to better match our usage.

Solr Becoming Slow After a While

This is perhaps the most obvious and “not very helpful” post for quite a few people, but for those who run into this issue, it’ll save the day. While running a test indexing routine of around 6 million documents, things got really slow the moment I passed 1 million documents in the index. Weird. Optimizing didn’t help either, as it died with an exception after a while.

The reason?

Not enough free disk space. Solr was indexing to a different partition than I thought.

Solved everything.

Shell Script For Submitting Documents to Solr

Here’s a small shell script I’m using to submit pre-made XML documents to Solr. The documents are usually produced by some other program before being submitted to the Solr server. The script submits all the files in a directory to the server (here, all the files in the documents directory, relative to the location of the script).

You’ll have to update the URL and the directory (documents) below. We usually group 1,000 documents together in a single file, so a commit happens for every thousand documents. If you use autocommit in Solr, you can remove that line. The script requires curl to talk to the Solr server.

URL=http://localhost:8080/solr/update
cd documents || exit

for i in $( ls ); do
    # POST the contents of the file to the update handler
    cat $i | curl -X POST -H 'Content-Type: text/xml' -d @- $URL
    # commit after each file; remove this line if you rely on autocommit
    curl $URL -H "Content-Type: text/xml" --data-binary '<commit/>'
    echo item: $i
done

Making Solr Requests with urllib2 in Python

When making XML requests to Solr (a fulltext document search engine) to index, commit, update or delete documents, the request is submitted as an HTTP POST containing an XML document. urllib2 supports submitting POST data by using the second parameter to the urlopen() call:

f = urllib2.urlopen("http://example.com/", "key=value")

The first attempt involved simply adding the XML data as the second parameter, but that made the Solr webapp return a “400 – Bad Request” error. The reason for Solr barfing is that the urlopen() function sets the Content-Type to application/x-www-form-urlencoded. We can solve this by setting the Content-Type header ourselves:

solrReq = urllib2.Request(updateURL, '<commit/>')  # the XML POST body; a plain commit is the simplest example
solrReq.add_header("Content-Type", "text/xml")
solrPoster = urllib2.urlopen(solrReq)
response = solrPoster.read()
solrPoster.close()

Other XML-based Solr requests, such as adding and removing documents from the index, will also work by changing the Content-Type header.

The same approach also lets you use urllib2 to submit SOAP requests, XML-RPC requests or anything else that requires you to control the complete POST body of the request.