Escaping Characters in a Solr Query / Solr URL

We’re using our own Solr library at Derdubor at the moment, but so far we’ve only used it for indexing content. The query part was never standardized in our common library because we usually used an alternative output format, but over the last few days that has changed. We now have a parser for the default XML output, and we also support facets and field queries (or constraints, as our library abstracts them).

This means that we’re feeding content into the query that may contain arbitrary characters, in particular those that have a special meaning in a Solr query. You can find the complete list of characters that need to be escaped in a Solr or Lucene query in the Lucene manual.

To escape the characters we use this very simple and stupid PHP method:

    public static function escapeSolrValue($string)
    {
        // The backslash must come first: str_replace() applies the pairs in order,
        // so listing it later would re-escape the backslashes added by earlier pairs.
        $match = array('\\', '+', '-', '&', '|', '!', '(', ')', '{', '}', '[', ']', '^', '~', '*', '?', ':', '"', ';', ' ');
        $replace = array('\\\\', '\\+', '\\-', '\\&', '\\|', '\\!', '\\(', '\\)', '\\{', '\\}', '\\[', '\\]', '\\^', '\\~', '\\*', '\\?', '\\:', '\\"', '\\;', '\\ ');

        return str_replace($match, $replace, $string);
    }

We used a regular expression first, but the sheer number of backslashes made it a regular ... hell ... to read. So, to make life easier for the people who will maintain this in the future, we went down the easy-to-read, easy-to-maintain road for this one.
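
To illustrate the backslash problem, here is a minimal sketch of what a regex-based equivalent could look like. It is a reconstruction under assumptions, not the expression originally used; the function name `escapeSolrValueRegex` is hypothetical, and the character class simply mirrors the 20 characters in the `str_replace()` list above.

```php
<?php
// Hypothetical regex-based variant of escapeSolrValue(); the name is illustrative only.
function escapeSolrValueRegex(string $string): string
{
    // Character class holding the same 20 characters as the str_replace() list.
    // In a single-quoted PHP string, '\\\\' yields the regex token \\ (one literal backslash),
    // and '\\-', '\\[', '\\]' yield the regex escapes \-, \[ and \].
    $pattern = '/[\\\\+\\-&|!(){}\\[\\]^~*?:"; ]/';

    // '\\\\$0' is the replacement \\$0: a literal backslash followed by the matched character.
    return preg_replace($pattern, '\\\\$0', $string);
}

echo escapeSolrValueRegex('title:(hello there)'), PHP_EOL;
// prints: title\:\(hello\ there\)
```

The result is the same as with the array-based method, but every backslash has to survive two layers of quoting (the PHP string and the regex), which is exactly what makes it hard to read and easy to break.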

18 thoughts on “Escaping Characters in a Solr Query / Solr URL”

  1. Good point. I’ve used SolrJ quite a bit before, but I never thought about validating it against the same behaviour. SolrJ also escapes " and ; which were missing from my list. I’ve added them now.

    Thanks for the update!

  2. The ClientUtils class also escapes the space. For example:

    Input: hello there
    Expected: hello there
    Actual: hello\ there

    This causes a problem, as the final string becomes hello\+there when sent over HTTP.

    Regards,
    Satish.

  3. This also gives bad errors for date facets.

    For example, your method turns the query into this:
    http://localhost:8080/test/select?q=fqdn\:b\*&facet=on&facet.date.start=NOW&facet.date.end=2012\-02\-05T13\:37\:29\+00\:00Z&facet.date=ending&facet.date.gap=\+7DAY&rows=25&wt=json

    Which makes Solr write this to the error log:

    INFO: [] webapp=/test path=/select params={facet.date.start=NOW&facet=on&q=fqdn\:b\*&facet.date=ending&facet.date.gap=\+7DAY&wt=json&facet.date.end=2012\-02\-05T13\:36\:21\+00\:00Z&rows=25} hits=0 status=400 QTime=1
    07-Dec-2011 13:37:29 org.apache.solr.common.SolrException log
    SEVERE: org.apache.solr.common.SolrException: date facet ‘end’ is not a valid Date string: 2012\-02\-05T13\:37\:29\ 00\:00Z

  4. There’s an escape function in Apache_Solr_Service, if you are using that to connect from PHP:

    $string = Apache_Solr_Service::escape($string);

    for phrases:

    $phrase = Apache_Solr_Service::escapePhrase($phrase);

    Or, for a bit of convenience, this will create the phrase and escape it:

    $phrase = Apache_Solr_Service::phrase($string);

  5. Shalin Shekhar Mangar posted a great link to Solr’s own ClientUtils.escapeQueryChars function. The link has moved to here:
    https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/solrj/src/java/org/apache/solr/client/solrj/util/ClientUtils.java

    For convenience, the function is here:
    /**
     * See: {@link org.apache.lucene.queryparser.classic queryparser syntax}
     * for more information on Escaping Special Characters
     */
    public static String escapeQueryChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // These characters are part of the query syntax and must be escaped
            if (c == '\\' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':'
                    || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c == '}' || c == '~'
                    || c == '*' || c == '?' || c == '|' || c == '&' || c == ';' || c == '/'
                    || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

  6. Hi, I am still facing a problem with + and space when they are actually sent over HTTP. It gives the correct output from the Solr browser, but not from the HTTP request. Can anyone help?

  7. That’s usually caused by not URL-encoding the request properly in your application when calling Solr. Exactly how you do that depends on which language and/or framework you’re using. For PHP you can use rawurlencode or urlencode.

  8. Even more readable:
    $map = [
        '\\' => '\\\\',
        '+' => '\\+',
        '-' => '\\-',
        '&' => '\\&',
        '|' => '\\|',
        '!' => '\\!',
        '(' => '\\(',
        ')' => '\\)',
        '{' => '\\{',
        '}' => '\\}',
        '[' => '\\[',
        ']' => '\\]',
        '^' => '\\^',
        '~' => '\\~',
        '*' => '\\*',
        '?' => '\\?',
        ':' => '\\:',
        '"' => '\\"',
        ';' => '\\;',
        ' ' => '\\ ',
    ];
    return str_replace(array_keys($map), array_values($map), $input);
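
Several of the comments above run into the same two issues: escaping a whole query or URL instead of a single value, and skipping URL encoding when the request is sent. Here is a minimal sketch of one way to keep the two steps apart. It rests on some assumptions: `buildSelectUrl` is a hypothetical helper, the standalone `escapeSolrValue` only mirrors the method from the post, and the host, field name and facet parameter are borrowed from the example in comment 3.

```php
<?php
// Stand-in for the escapeSolrValue() method from the post (same 20 characters).
function escapeSolrValue(string $value): string
{
    $special = ['\\', '+', '-', '&', '|', '!', '(', ')', '{', '}', '[', ']',
                '^', '~', '*', '?', ':', '"', ';', ' '];
    // Build the matching replacement list by prefixing each character with a backslash.
    $escaped = array_map(static function ($c) { return '\\' . $c; }, $special);

    return str_replace($special, $escaped, $value);
}

// Hypothetical helper: escapes only the raw value, then URL-encodes all parameters.
function buildSelectUrl(string $baseUrl, string $field, string $rawValue, array $extraParams = []): string
{
    // Step 1: Solr escaping applies to the value only. The field name, the ':' separator
    // and parameters such as facet.date.gap are syntax and are passed through unescaped.
    $params = ['q' => $field . ':' . escapeSolrValue($rawValue)] + $extraParams;

    // Step 2: URL encoding applies to everything. PHP_QUERY_RFC3986 makes
    // http_build_query() encode a space as %20 and a literal '+' as %2B.
    return $baseUrl . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

$url = buildSelectUrl('http://localhost:8080/test/select', 'fqdn', 'hello there', [
    'facet.date.gap' => '+7DAY',
    'wt'             => 'json',
]);
echo $url, PHP_EOL;
// prints: http://localhost:8080/test/select?q=fqdn%3Ahello%5C%20there&facet.date.gap=%2B7DAY&wt=json
```

With this split, the backslash-escaped space travels as `%5C%20` rather than turning into `\+`, and the `+` in `+7DAY` reaches Solr as a plus sign instead of being decoded as a space.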
