Boosting By Date in Solr 1.4

One of the things introduced with Solr 1.4 is the ms() function, which returns the number of milliseconds between a timestamp and the Unix epoch. This means that you can now write date-based boosts without having to resort to ord() or rord().

The best solution for boosting documents based on a date field at query time (to avoid having to update the boost factor as time progresses) seems to be the boost query type. The boost query type passes the query on to your default query handler and lets it resolve the query itself, but provides a boost for each document based on the fields queried.

An example of how to solve this issue can be found on the SolrRelevancy part of the Solr Wiki:

{!boost b=recip(ms(NOW,publishedTime),3.16e-11,1,1)}query

This takes the number of milliseconds between NOW and the time the document was published (publishedTime is one of the fields YOU have to provide when you’re indexing; this might be “created” or something else that suits your needs) and multiplies that number by 3.16e-11, which is roughly 1 / (the number of milliseconds in a year). Since recip(x,m,a,b) is defined as a / (m*x + b), the result of the function is 1 for a document that was just published, about 0.5 for a document that is a year old, and it approaches 0 as documents get older.
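
To get a feel for the curve, here’s a quick sanity check of the math in plain Java. The recip() helper below just mirrors Solr’s a / (m*x + b) definition; it’s an illustration, not Solr code:

public class RecipBoost
{
    public static void main(String[] args)
    {
        double msPerYear = 365.25 * 24 * 60 * 60 * 1000; // roughly 3.16e10

        System.out.println(recip(0, 3.16e-11, 1, 1));              // just published: 1.0
        System.out.println(recip(msPerYear, 3.16e-11, 1, 1));      // one year old: ~0.5
        System.out.println(recip(10 * msPerYear, 3.16e-11, 1, 1)); // ten years old: ~0.09
    }

    // Solr defines recip(x, m, a, b) as a / (m * x + b)
    static double recip(double x, double m, double a, double b)
    {
        return a / (m * x + b);
    }
}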

The Solr Wiki also contains examples of how you can divide your boost query into several parts to make it easier to read.

Java and NetBeans: Illegal escape character

When defining strings in programming languages, they’re usually delimited by double quotes ("), as in “This is a string” and “Hello World”. The immediate question is: what do you do when the string itself should contain a "? “Hello "World"” is hard to read and practically impossible for the compiler (which tries to make sense of everything you’ve written) to parse. To solve this (and similar issues), people started using escape characters: special characters that tell the parser to pay attention to the following character(s) (some escape sequences may contain more than one character after the escape character).

Usually the escape character is \, and rewriting our example above we end up with “Hello \"World\"”. The parser sees the \, parses the next character in a special mode, and inserts the " into the string itself instead of using it as a delimiter. In Java, C, PHP, Python and several other languages there are also escape sequences that do something other than just insert the character following the escape character:

\n – Inserts a new line.
\t – Inserts a tab character.
\xNN – Inserts a byte with the byte value provided (\x13, \xFF, etc.) in languages like C, PHP and Python; Java does not support \x escapes, and uses \uNNNN unicode escapes instead.

A list of the different escape sequences that PHP supports can be found in the PHP manual.

Anyway, the issue here is that Java found an escape sequence it doesn’t know how to handle. Attempting to define a string such as “! # \ % &” will trigger this message: the parser sees the escape character \ and then attempts to parse the following character, which is a space (" "). “\ ” is not a valid escape sequence in the Java language specification, and the compiler (whose message NetBeans or Eclipse relays) is trying to tell you this is probably not what you want.

The correct way to define the string above is to escape the escape character (now we’re getting meta): “! # \\ % &”. This defines a string containing just a single backslash.
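
A tiny example showing both the broken and the corrected string (the class name is just for illustration):

public class EscapeDemo
{
    public static void main(String[] args)
    {
        // String broken = "! # \ % &"; // won't compile: illegal escape character
        String fixed = "! # \\ % &";    // contains a single backslash
        String greeting = "Hello \"World\"";

        System.out.println(fixed);    // prints: ! # \ % &
        System.out.println(greeting); // prints: Hello "World"
    }
}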

SOLR: java.io.FileNotFoundException: no segments* file found

While playing around with one of my development SOLR installations (this time under Windows), I suddenly got a weird error message when feeding data to one of the fresh cores.

SEVERE: java.lang.RuntimeException: java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.SimpleFSDirectory@C:\temp\solr\*\data\index: files:

Taking a look at the contents of the index\ directory, it was in fact empty. That seemed weird, but my initial guess was that Lucene / SOLR would treat this as a new installation and create the files.

Turns out the issue is that it won’t – as long as the index directory exists, Lucene / SOLR goes looking for the segment files.

Thanks to an old post to the solr-dev list by Yonik, the easiest fix turned out to be to simply delete the index directory and restart your servlet container (Tomcat in this case).

Porting SOLR Token Filter from Lucene 2.4.x to Lucene 2.9.x

I had trouble getting our current token filter to work after recompiling with the nightly builds of SOLR, which seemed to stem from the recently adopted upgrade to 2.9.0 of Lucene (not released yet, but SOLR nightly is bleeding edge!). There’s functionality added for backwards compatibility, and while that might have worked, things didn’t really come together as they should somewhere along the way. So I decided to port our filter over to the new model, where incrementToken() is the New Way ™ of doing stuff. Helped by the current lowercase filter in the SVN trunk of Lucene, I made it all the way through.

Our old code:

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
    }

    public Token next() throws IOException
    {
        return parseToken(this.input.next());
    }
 
    public Token next(Token result) throws IOException
    {
        return parseToken(this.input.next());
    }

Compiling this with Lucene 2.9.0 gave me a new warning:

Note: .. uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

To the internet mobile!

Turns out next() and next(Token) have been deprecated in the new TokenStream implementation, and the New True Way is to use the incrementToken() method instead.

Our new code:

package no.derdubor.solr.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NorwegianNameFilter extends TokenFilter
{
    private TermAttribute termAtt;

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException
    {
        if (this.input.incrementToken())
        {
            termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength()));
            return true;
        }

        return false;
    }
}

A few gotchas along the way: incrementToken() needs to be called on the input token stream, not on the filter itself (super.incrementToken() will give you a stack overflow). This moves the token stream one step forward. We also moved the buffer handling into our parse method; remember to pass along the length of the “live” part of the buffer (the buffer will be larger, but only the content up to termLength() is valid).

The return value from our parseBuffer function is the actual amount of usable data in the buffer after we’ve had our way with it. The concept is to modify the buffer in place, so that we avoid allocating or deallocating memory.
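
The actual name normalization logic is ours (and a topic for that later post), but to make the class above compile, here’s a toy parseBuffer() that illustrates the in-place contract by simply stripping non-letter characters and returning the new valid length:

    protected int parseBuffer(char[] buffer, int length)
    {
        int write = 0;

        /* shift the characters we keep to the left, in place */
        for (int read = 0; read < length; read++)
        {
            if (Character.isLetter(buffer[read]))
            {
                buffer[write++] = buffer[read];
            }
        }

        /* only buffer[0..write) is valid from here on */
        return write;
    }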

Hopefully this will help other people with the same problem!

New Adventures in Reverse Engineering

Before I go into the gory details of this post, I’ll start by saying that this method is probably not the right solution for you. This is not something you want to do if you have ready access to any source code or if you have an existing relationship with the 3rd party that provided the library you’re using. Do not do this. This is not for you.

With that out of the way, this is the part for those who actually are interested in getting down and dirty with Java, and maybe solving a problem that’s hard to solve otherwise.

The setting: we have a library for interfacing with another internal web service, where the library was provided in binary form by a 3rd party as part of the agreement when the service was delivered to us. The problem is that, for some unknown reason, this library is perfectly capable of understanding UTF-8, both as input from us and as input from the web service, but all web related methods in the result class return data encoded as ISO-8859-1. The original workaround was to keep two copies of the query in the query string: the original query under one particular key, and an ISO-8859-1 version under the key the library reads. This requires loads of special casing, manually handling that single parameter, etc. It works to a certain degree as long as the library is the only component in the mix. The troubles really began to surface when we started querying other services based on the same query. We’d then have to special case all methods that were used in URLs, as they returned ISO-8859-1, while all other libraries and encodings use UTF-8.

The library has since been made into a separate product with a hefty price tag, so upgrading the library was not an acceptable solution for us. Another solution had to be found, and this is where things start to get interesting.

Writing a proxy class to handle the encoding issue transparently

This was the solution we attempted first, but it requires us to implement quite a few methods, to add additional code to the method that provides access to the library, and to extend and embrace parts of the object. It could have been done quite easily by simply changing one method of a class to call super.methodName() and re-encode the result, but we would have to change several classes (these objects live 3-4 levels down in the result object from the library), which adds both developer and runtime overhead. Not good.

Decompiling the library

The next step was to decompile the library to see how its code actually worked. This proved to be a good way of finding out how we could possibly solve the issue. We could have tried to fix the issue in the code and then recompile the library, but some of the class files were too new for jad to decompile them completely. The decompilation did, however, reveal the problem in the code:

    if (encoding != null)
    {
        return encoding.toString();
    }

    return "ISO-8859-1";

This was neatly located in a helper method that ran on every property used when generating a query string. The encoding variable is retrieved from a global settings object, only accessible in the same library. That object is empty in our version of the library, so not much help there. But here’s the little detail that leads into the next part, and actually made this hack possible: “ISO-8859-1” is a constant. This means that it gets neatly tucked away as a UTF-8 string when the class file is generated. Let’s get down and dirty.

Binary patching the encoding in the class file

We’ll start by taking a look at a hex dump of our class file, after searching for the string “ISO” in the ASCII representation (“ISO” in UTF-8 is identical to its ASCII representation):

[Image: Binary Patching a Java Class – hex dump of the class file with the string “ISO-8859-1” highlighted]

I’ve highlighted the interesting part, where “ISO-8859-1” is stored in the file. This is where we want to make our surgical incision and have the method return the string “UTF-8” instead. There is one important thing you should be aware of if you’ve never done any hex editing of files before: the byte offsets of parts of the file may be very important. Sadly, the strings “UTF-8” and “ISO-8859-1” have different lengths, which would require us to either delete the bytes following “UTF-8” or pad it with spaces instead (“UTF-8     ”). The first method might leave the rest of the file skewed, the latter might not work if the code using the value doesn’t trim the string first.

To solve this issue, we turn to our good friend, the VM Spec: The class File Format, which contains all the details of how the class file format is designed. The interesting parts:

In the ClassFile structure:

cp_info constant_pool[constant_pool_count-1];

As we’re looking at a constant, this is where it should be stored. The cp_info is defined as:

cp_info {
    u1 tag;
    u1 info[];
}

The tag contains the type of the constant; the contents of the info[] array vary depending on that type. If we take a look at the table in Chapter 4.4, we see that the identifier for a UTF-8 string constant is:

CONSTANT_Utf8 	1

So we should have the value 1 in the byte describing this constant (as the actual byte value, not the ASCII character). If the tag is one, the complete structure is:

CONSTANT_Utf8_info {
    u1 tag;
    u2 length;
    u1 bytes[length];
}

The tag should be 1 as the byte value; the length is two bytes giving the length of the actual string stored (since the length is stored in two bytes (u2), a string can be at most 2^16 - 1 bytes long). After that identifier, we should have length bytes of UTF-8 data.

If we go back to our hex dump, we can now make more sense of the data we’re seeing:

[Image: hex dump of the constant pool entry, showing the tag byte 0x01 and the two length bytes 0x00 0x0A]

The byte shown as 0x01 in hex is the value 1 for the tag of the structure. The following 0x00 0x0A are the two bytes making up the length of the string:

    0000 0000 0000 1010 binary = 10 decimal

    ISO-8859-1
    1234567890 

This shows that the length of our string “ISO-8859-1” is 10 bytes in UTF-8, which is the same value that is stored in the two bytes showing the length of the string in the structure.
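
Expressed as code, reading such an entry back is straightforward. A small sketch (the class and method names are mine; class files technically store strings in Java’s modified UTF-8, which is identical to UTF-8 for plain ASCII strings like this one):

public class Utf8ConstantReader
{
    // reads one CONSTANT_Utf8_info entry starting at offset pos
    static String readUtf8Constant(byte[] cls, int pos)
    {
        if (cls[pos] != 1) // u1 tag: must be 1 (CONSTANT_Utf8)
        {
            throw new IllegalArgumentException("not a CONSTANT_Utf8 entry");
        }

        // u2 length, big endian: 0x00 0x0A gives 10 for "ISO-8859-1"
        int length = ((cls[pos + 1] & 0xFF) << 8) | (cls[pos + 2] & 0xFF);

        // u1 bytes[length]: the string data itself
        return new String(cls, pos + 3, length, java.nio.charset.StandardCharsets.UTF_8);
    }
}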

Heading back to our original goal: changing the length of the string stored. We change the string bytes to “UTF-8”, which is five bytes. We then change the stored length of the string:

    00 0A becomes
    00 05

We save our changes and recreate the jar file with all the previous classes and our changed one.
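
If you’d rather script the patch than hex edit by hand, here’s a minimal sketch of the same edit in Java. It assumes the “ISO-8859-1” constant occurs exactly once in the class file, and it leans on the fact that constant pool entries are length prefixed, so the file can safely shrink:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PatchEncodingConstant
{
    public static void main(String[] args) throws IOException
    {
        byte[] cls = Files.readAllBytes(Paths.get(args[0]));

        // the CONSTANT_Utf8_info entries: tag 1, u2 length, then the string bytes
        byte[] oldEntry = utf8Entry("ISO-8859-1");
        byte[] newEntry = utf8Entry("UTF-8");

        int pos = indexOf(cls, oldEntry);
        if (pos < 0)
        {
            throw new IllegalStateException("constant not found");
        }

        // splice the new, shorter entry into place
        byte[] patched = new byte[cls.length - oldEntry.length + newEntry.length];
        System.arraycopy(cls, 0, patched, 0, pos);
        System.arraycopy(newEntry, 0, patched, pos, newEntry.length);
        System.arraycopy(cls, pos + oldEntry.length, patched, pos + newEntry.length,
                cls.length - pos - oldEntry.length);

        Files.write(Paths.get(args[0]), patched);
    }

    // tag (u1) + length (u2, big endian) + the string bytes
    private static byte[] utf8Entry(String s)
    {
        byte[] utf = s.getBytes(StandardCharsets.UTF_8);
        byte[] entry = new byte[3 + utf.length];
        entry[0] = 1; // CONSTANT_Utf8
        entry[1] = (byte) (utf.length >> 8);
        entry[2] = (byte) (utf.length & 0xFF);
        System.arraycopy(utf, 0, entry, 3, utf.length);
        return entry;
    }

    private static int indexOf(byte[] haystack, byte[] needle)
    {
        outer:
        for (int i = 0; i <= haystack.length - needle.length; i++)
        {
            for (int j = 0; j < needle.length; j++)
            {
                if (haystack[i + j] != needle[j])
                {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}

Run it with the path to the extracted .class file as its only argument, then rebuild the jar as described above.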

After inserting our new JAR-file into our maven repository as a new build and updating our local repository, we now have complete UTF-8 support from start to finish. Yey!

Java printf and Booleans

The stdout mapping in Java, System.out (a PrintStream), supports printf (formatting a string when outputting it) in the same manner as other languages such as C. By using System.out.printf instead of System.out.print, you can provide a format string which the method uses to write the output in a custom format.

As Java supports several native datatypes that aren’t available as regular C types, you also have a few other options. One of these is the %b / %B identifier, which represents a boolean value.

The format string “%b” expects one argument, a boolean. It will be written as either ‘true’ or ‘false’. You would (as of writing) get the same result by using the %s identifier, as that converts the boolean to a string, which in turn gives true or false, but it’s always better to be explicit about what you’re doing than to assume that people understand that you understood something (which could even change later..).
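
A small self-contained example (the class name is mine):

public class PrintfBoolean
{
    public static void main(String[] args)
    {
        boolean enabled = false;

        System.out.printf("enabled: %b%n", enabled); // prints: enabled: false
        System.out.printf("ENABLED: %B%n", enabled); // prints: ENABLED: FALSE
        System.out.printf("enabled: %s%n", enabled); // same output as %b, but less explicit
    }
}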

Informa and getCategories Truncates Title at “/”

I stumbled across a weird issue with Informa and the ItemIF.getCategories method today. The categories we retrieve are separated with / to indicate their full hierarchy, but Informa only gave me the first part of the category (just “Properties” out of “Properties / Houses”). The solution is to explicitly access the category element of the item itself:

String categoryDomain = item.getAttributeValue("category", "domain");
String categoryTitle = item.getElementValue("category");

This should be extended to support several category-elements, but as we only get one in our feeds, this solved the problem for us.

Informa and Custom XML Namespaces in RSS

While integrating a custom search application into a Java-based web application, I came across the need to access properties in custom namespaces through the Informa RSS library. Or to put it another way: I needed access to those properties, and Informa had been used for RSS parsing in the previous versions of the web application. The people who developed the original version of the application had decided to extend the Informa library into their own version, adding several .get&lt;NameOfCustomProperty&gt; methods. After thinking about this for approximately 2 seconds, I decided that having to support and modify a custom version of Informa was not the right track for us.

My initial thought was that their decision to customize Informa to support these methods had to come from the idea that Informa did not support custom namespaces out of the box. I did a few searches over at Google, and found nothing useful. Reading through the documentation for Informa didn’t do me any good either, so I tried to find an alternative library instead. Did a bit of searching here too, and stumbled across a hit for one of the util classes of Informa (.. again). That class did support custom namespaces, so the backend support was there at least. Then it struck me while reading the documentation for Informa and ChannelIF again: Informa did support it, as it inherited the methods from further up in the hierarchy. The getElementValue and getElementValues methods of the ChannelIF and ItemIF classes allow you to fetch the contents of elements with custom namespaces in a very straightforward manner.

System.out.println(item.getElementValue("exampleNS:field"));

This simply returns the string contained between <exampleNS:field> and </exampleNS:field>.

Hoooray! We now have support for these additional fields, and we do not have to keep Informa manually in sync with the version in our application. Why the original developers decided to fork the Informa library to add their own properties I may never know, but I’ll update this post if they decide to step forward!

Google Releases Their Protocol Buffers

Fresh from the Google Open Source Blog comes the news that Google has released their Protocol Buffers specification and accompanying libraries. The code and specification have been released at Protocol Buffers on Google Code.

Protocol Buffers is a data format for fast exchange and parsing of data and messages between computers. It is similar to simple uses of XML in this regard, but both the message size on the wire and the parsing time are heavily optimized for busy sites. There is no need to spend loads of time doing XML parsing when you could be doing something useful instead. It’s very easy to interact with the messages through the generated classes (for C++, Java and Python), and future versions of the same schema are compatible with old versions (new fields are simply ignored by older parsers).
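
To give a feel for how the generated classes are used from Java, here’s a hedged sketch; the Person message, its fields and the com.example.protos package are inventions for this example, not taken from the released code:

// assuming a schema like this, compiled with: protoc --java_out=. person.proto
//
//   option java_package = "com.example.protos";
//   option java_multiple_files = true;
//
//   message Person {
//     required string name  = 1;
//     optional string email = 2;
//   }

import com.example.protos.Person;

public class ProtoDemo
{
    public static void main(String[] args) throws Exception
    {
        // messages are immutable and built through a generated builder
        Person p = Person.newBuilder()
                .setName("Alice")
                .setEmail("alice@example.com")
                .build();

        // serialize to the compact binary wire format...
        byte[] wire = p.toByteArray();

        // ...and parse it back; fields added in newer schema versions are
        // skipped by older parsers, which is what keeps versions compatible
        Person parsed = Person.parseFrom(wire);
        System.out.println(parsed.getName());
    }
}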

There’s still no PHP implementation available, so I guess it’s time to get going and lay down some code during the summer. Anyone up for the job?

Writing a Solr Analysis Filter Plugin

Update: If you’re writing a plugin for a Solr version after 1.4.1 or Lucene 3.0+, you should be sure to read Updating a Solr Analysis Plugin to Lucene 4.0 as well. A few of the method calls used below have changed in the new API.

As we’ve been working on getting better results out of the phonetic search we’re currently doing at derdubor, I started writing a plugin for Solr to be able to return better search results when searching for Norwegian names. We’ve been using the standard phonetic filter from Solr 1.2 so far, using the double metaphone encoder to encode a regular token as a phonetic value. The trouble with this is that a double metaphone value is four simple letters, which means that search words such as ‘trafikkontroll’ end up matching ‘Dyrvik’. The latter is a name, while the former is a regular search term that would be better served through an article view. TRAFIKKONTROLL resolves to TRFK in double metaphone, while DYRVIK resolves to DRVK. T and D are considered similar, as are V and F, and voilà, you’ve got yourself a match in the search result, but not a visual one (or a semantic one, as the words have very different meanings).

To solve this, I decided to write a custom filter plugin which we can tune to the names in use in Norway. I’ll write about the reasoning behind the filter logic and hopefully post the complete filter function we’re applying, but I’ll leave that for another post.

First you need a factory that’s able to produce filters when Solr asks for them:

NorwegianNameFilterFactory.java:

package no.derdubor.solr.analysis;

import java.util.Map;

import org.apache.solr.analysis.BaseTokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;

public class NorwegianNameFilterFactory extends BaseTokenFilterFactory
{
    Map args;

    public Map getArgs()
    {
        return args;
    }

    public void init(Map args)
    {
        this.args = args;
    }

    public NorwegianNameFilter create(TokenStream input)
    {
        return new NorwegianNameFilter(input);
    }
}

To compile this example yourself, put the file in no/derdubor/solr/analysis/ (which matches no.derdubor.solr.analysis; in the package statement), and run

javac -source 1.6 -target 1.6 no/derdubor/solr/analysis/NorwegianNameFilterFactory.java

(you’ll need apache-solr-core.jar and lucene-core.jar in your classpath to do this)

to compile it. You’ll of course also need the filter itself (which is returned from the create-method above):

package no.derdubor.solr.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class NorwegianNameFilter extends TokenFilter
{
    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
    }

    public Token next() throws IOException
    {
        return parseToken(this.input.next());
    }

    public Token next(Token result) throws IOException
    {
        return parseToken(this.input.next());
    }

    protected Token parseToken(Token in)
    {
        /* do magic stuff with in.termBuffer() here (a char[] which can be manipulated) */
        /* set the changed length of the new term with in.setTermLength(); before returning it */
        return in;
    }
}

You should now be able to compile both files:

javac -source 1.6 -target 1.6 no/derdubor/solr/analysis/*.java

After compiling the plugin, create a jar file which contains your plugin. This will be the “distributable” version of your plugin, and should contain the .class files of your application.

jar cvf derdubor-solr-norwegiannamefilter.jar no/derdubor/solr/analysis/*.class

Move the file you just created (derdubor-solr-norwegiannamefilter.jar in the example above) into your Solr home directory; that’s the directory which contains bin/ and conf/ (where schema.xml lives). Create a lib directory there if it doesn’t already exist. This is where your custom libraries will live, so copy the jar file into this directory (lib/).

Restart Solr and check that everything still works as it should. If everything still seems normal, it’s time to enable your filter. In one of your analyzer chains in schema.xml, you can simply append a <filter> element to insert your own filter into the chain. The field type and tokenizer below are just an example (use whatever chain you already have); the important line is the <filter> element referencing our factory class:


    <fieldType name="text_name" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="no.derdubor.solr.analysis.NorwegianNameFilterFactory"/>
      </analyzer>
    </fieldType>

Restart Solr again, and if everything still works as it should, you’re all set! Time to index some new data (remember that you’ll need to reindex the data for things to work as you expect, since no stored data is reprocessed when you edit your configuration files) and commit it! Do a few searches through the admin interface to check that everything works as it should.

I’ve used the “debug” option to .. well, debug .. my plugin while developing it. A very neat trick is to see what terms your filter expands to (if you set type=”query” in the analyzer section, it will be applied to all queries against that field), which will be shown in the first debug section when looking at the result (you’ll have to scroll down to the end to see it). If you need to debug things to a greater extent, you can attach a debugger or simply use the Good Old Proven Way of println! (the output ends up in catalina.out in logs/ in your Tomcat directory). Good luck!

Potential Problems and How To Solve Them

  • If you get an error about incompatible class versions, check that you’re actually running the same (or a newer) version of the JVM (java -version) on your Solr search server as the one you use on your own development machine (use -source 1.5 -target 1.5 to force 1.5 compatible class files instead of 1.6 when compiling).
  • If you get an error about missing config or something similar, or that Solr is unable to find the method it’s searching for (generally surfacing as a ReflectionException), remember to declare your classes public! public class NorwegianNameFilter is your friend! It took at least half an hour before I realized what this simple issue was…

Any comments and followups are of course welcome!