Avoid Escaping Spaces in the Query String in a Solr Query

Following up on the previous post about escaping values in a Solr query string, it’s important to note that you should not escape spaces in the query itself. The reason for this is that if you escape spaces in the query “foo bar”, the search will be performed on the term “foo bar” itself, and not with “foo” as one term and “bar” as the other. This will only return documents that has the string “foo bar” in sequence.

The solution is to either remove the space from the escape list in the previous function – and use another function for escaping values where you actually should escape the spaces – or break up the string into “escapable” parts.

The code included beneath performs the last task; it splits the string into different parts delimited by space and then escapes each part of the query by itself.

$queryParts = explode(' ', $this->getQuery());
$queryEscaped = array();

foreach($queryParts as $queryPart)
{
    $queryEscaped[] = self::escapeSolrValue($queryPart);
}

$queryEscaped = join(' ', $queryEscaped);

Java and NetBeans: Illegal escape character

When defining strings in programming languages, they’re usually delimited by ” and “, such as “This is a string” and “Hello World”. The immediate question is what do you do when the string itself should contain a “? “Hello “World”” is hard to read and practically impossible to parse for the compiler (which tries to make sense out of everything you’ve written). To solve this (and similiar issues) people started using escape characters, special characters that tell the parser that it should pay attention to the following character(s) (some escape sequences may contain more than one character after the escape character).

Usually the escape character is \, and rewriting our example above we’ll end up with “Hello \”World\””. The parser sees the \, telling it that it should parse the next characters in a special mode and then inserts the ” into the string itself instead of using it as a delimiter. In Java, C, PHP, Python and several other languages there are also special versions of the escape sequences that does something else than just insert the character following the escape character.

\n – Inserts a new line.
\t – Inserts a tab character.
\xNN – Inserts a byte with the byte value provided (\x13, \xFF, etc).

A list of the different escape sequences that PHP supports can be found in the PHP manual.

Anyways, the issue is that Java found an escape sequence that it doesn’t know how to handle. Attempting to define a string such as “! # \ % &” will trigger this message, as it sees the escape character \, and then attempts to parse the following byte – which is a space (” “). The escape sequence “\ ” is not a valid escape sequence in the Java language specification, and the parser (or NetBeans or Eclipse) is trying to tell you this is probably not what you want.

The correct way to define the string above would be to escape the escape character (now we’re getting meta): “! # \\ % &”. This would define a string with just a single backlash in it.

Escaping Characters in a Solr Query / Solr URL

We’re using our own Solr library at Derdubor at the moment, but we’ve only been using it for indexing content. The query part was never standardized in our common library as we usually used an alternative output format, but during the last days that has changed. We now have a parser for the default XML outputter and we’re also supporting facets and field queries (or constraints as they’re abstracted as in our library).

This means that we’re feeding content into the query that may contain foreign characters, in particular those who have special meaning in a Solr query. You can find the complete list of characters that need to be escaped in a SOLR or Lucene query in the Lucene manual.

To escape the characters we use this very simple and stupid PHP method:

    static public function escapeSolrValue($string)
    {
        $match = array('\\', '+', '-', '&', '|', '!', '(', ')', '{', '}', '[', ']', '^', '~', '*', '?', ':', '"', ';', ' ');
        $replace = array('\\\\', '\\+', '\\-', '\\&', '\\|', '\\!', '\\(', '\\)', '\\{', '\\}', '\\[', '\\]', '\\^', '\\~', '\\*', '\\?', '\\:', '\\"', '\\;', '\\ ');
        $string = str_replace($match, $replace, $string);

        return $string;
    }

We used a regular expression first, but the sheer amount of backslashes made it a regular .. hell … to read. So to make it easier for the persons maintaining this in the future, we went the easy to read / easy to maintain road for this one.