Avoid Escaping Spaces in the Query String in a Solr Query

Following up on the previous post about escaping values in a Solr query string, it’s important to note that you should not escape spaces in the query itself. If you escape the space in the query “foo bar”, the search is performed on the single term “foo bar”, and not with “foo” as one term and “bar” as the other. This will only return documents that have the string “foo bar” in sequence.

The solution is to either remove the space from the escape list in the previous function – and use another function for escaping values where you actually should escape the spaces – or break up the string into “escapable” parts.

The code included beneath does the latter; it splits the string into parts delimited by spaces and then escapes each part of the query by itself.

$queryParts = explode(' ', $this->getQuery());
$queryEscaped = array();

foreach($queryParts as $queryPart)
{
    $queryEscaped[] = self::escapeSolrValue($queryPart);
}

$queryEscaped = join(' ', $queryEscaped);
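
As a self-contained sketch of the same idea, the snippet below wraps the split-and-escape step in a function. The names escapeSolrTerm and escapeSolrQuery are made up for this example, and the character list is shortened (and leaves out the space) just to keep it readable. It also assumes you want runs of whitespace collapsed, which is why it uses preg_split instead of explode.

```php
<?php
// Hypothetical stand-in for escapeSolrValue(): shortened character list,
// and deliberately WITHOUT the space.
function escapeSolrTerm($term)
{
    $match = array('\\', '+', '-', '!', '(', ')', ':', '"');
    $replace = array('\\\\', '\\+', '\\-', '\\!', '\\(', '\\)', '\\:', '\\"');

    return str_replace($match, $replace, $term);
}

// Hypothetical helper: escape each whitespace-delimited term by itself.
function escapeSolrQuery($query)
{
    // PREG_SPLIT_NO_EMPTY drops the empty strings that repeated spaces
    // would otherwise produce.
    $terms = preg_split('/\s+/', $query, -1, PREG_SPLIT_NO_EMPTY);

    return join(' ', array_map('escapeSolrTerm', $terms));
}

echo escapeSolrQuery('foo  bar:baz'), "\n"; // foo bar\:baz
```

The space between the terms survives unescaped, so Solr still sees two separate terms.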

A Simple Smarty Modifier to Generate a Chart Through Google Chart API

After the longest title of my blog so far follows one of the shortest posts.

The function has two required parameters. The first one is provided automagically for you by Smarty (it’s the value of the variable you’re applying the modifier to) and should be an array of objects containing the values you want to graph. The only required argument you have to provide to the modifier yourself is the name of the method to call on each object to fetch the value to graph.

Usage:
{$objects|googlechart:"getValue"}

This will dynamically load your plugin from the file modifier.googlechart.php in your Smarty plugins directory, or you can register the plugin manually by calling register_modifier on the template object after you’ve created it.

function smarty_modifier_googlechart($points, $method, $size = "600x200", $low = 0, $high = 0)
{
    $pointStr = '';
    $maxValue = 0;
    $minValue = PHP_INT_MAX;
    
    foreach($points as $point)
    {
        if ($point->$method() > $maxValue)
        {
            $maxValue = $point->$method();
        }

        if ($point->$method() < $minValue)
        {
            $minValue = $point->$method();
        }
    }

    if (!empty($high))
    {
        $maxValue = $high;
    }

    // avoid a division by zero when there are no points or all values are 0
    $scale = ($maxValue > 0) ? 100 / $maxValue : 0;

    foreach($points as $point)
    {
        $pointStr .= (int) ($point->$method() * $scale) . ',';
    }

    $pointStr = substr($pointStr, 0, -1);

    // labels (5)
    $labels = array();

    $steps = 4;
    $interval = $maxValue / $steps;

    for($i = 0; $i < $steps; $i++)
    {
        $labels[] = (int) ($i * $interval);
    }

    $labels[] = (int) $maxValue;

    return 'http://chart.apis.google.com/chart?cht=lc&chd=t:' . $pointStr . '&chs=' . $size . '&chxt=y&chxl=0:|' . join('|', $labels);
}
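
The scaling step is the part that is easiest to get wrong, so here is a minimal sketch of just that, with made-up sample values. The text encoding (chd=t:) expects values between 0 and 100, which is why everything is scaled so that the largest value lands on 100.

```php
<?php
// Made-up sample values; in the modifier these come from $point->$method().
$values = array(10, 25, 50);

// Scale so that the largest value ends up at 100.
$scale = 100 / max($values);

$scaled = array();

foreach ($values as $value)
{
    $scaled[] = (int) ($value * $scale);
}

echo join(',', $scaled), "\n"; // 20,50,100
```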

The function does not support the short version of the Google Chart API Just Yet (tm) as it is a simple proof of concept hack made a few months ago.

How To Dismantle An Atomic HTTP Query .. String.

Following up on yesterday’s gripe about PHP’s (old and now useless) automagic translation of dots in GET and POST parameters to underscores, today’s edition manipulates the query string in place instead of returning it as an array.

This is useful if you have a query string you want to pass on to another service, and for some reason the default behaviour in PHP will barf barf and barf. That might happen because of the dot translation issue, or because some services (such as Solr) rely on a parameter name being repeatable (in PHP the second parameter value will overwrite the first).

function http_dismantle_query($queryString, $remove)
{
    $removeKeys = array();

    if (is_array($remove))
    {
        foreach($remove as $removeKey)
        {
            $removeKeys[$removeKey] = true;
        }
    }
    else
    {
        $removeKeys[$remove] = true;
    }

    $resultEntries = array();
    $segments = explode("&", $queryString);

    foreach($segments as $segment)
    {
        $parts = explode('=', $segment);

        $key = urldecode(array_shift($parts));

        if (!isset($removeKeys[$key]))
        {
            $resultEntries[] = $segment;
        }
    }

    return join('&', $resultEntries);
}
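
To show the repeated parameter problem in practice, here is a small sketch. The function dismantleQuerySketch is a condensed, hypothetical variant of the function above (included only so the example runs on its own), and the query string is made up.

```php
<?php
// Condensed, hypothetical variant of the function above: drop every
// segment whose decoded key is listed in $remove.
function dismantleQuerySketch($queryString, array $remove)
{
    $kept = array();

    foreach (explode('&', $queryString) as $segment)
    {
        // The key is everything before the first '='.
        $parts = explode('=', $segment, 2);
        $key = urldecode($parts[0]);

        if (!in_array($key, $remove, true))
        {
            $kept[] = $segment;
        }
    }

    return join('&', $kept);
}

$raw = 'q=foo&fq=type:a&fq=year:2009&page=2';

// PHP's own parser keeps only the last value of a repeated name.
parse_str($raw, $parsed);
echo $parsed['fq'], "\n"; // year:2009

// Working on the raw string keeps both fq parameters intact.
echo dismantleQuerySketch($raw, array('page')), "\n";
// q=foo&fq=type:a&fq=year:2009
```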

I’m not really sure what I’ll call the next function in this series, but there sure are loads of candidates out there.

Getting Dots to Work in PHP and GET / POST / COOKIE Variable Names

One of the oldest and ugliest relics of the register_globals era of PHP is the fact that all dots in request variable names get replaced with “_”. If your variable was named “foo.bar”, PHP will serve it to you as “foo_bar”. You cannot turn this off, you cannot use extract() or parse_str() to avoid it and you’re mostly left out in the dark. Luckily the QUERY_STRING environment variable (in _SERVER if you’re running mod_php, etc) contains the raw string, and this string contains the dots.

The following “parser” is a work in progress and does not currently support the array syntax for keys that PHP allows, but it solves the issue for regular vars. I will try to extend this later on to actually replicate the functionality of the regular parser.

Here’s the code. No warranties. Ugly hack. You’re warned. Leave a comment if you have any good suggestions regarding this (.. or know of an existing library doing the same..).

function http_demolish_query($queryString)
{
    $result = array();
    $segments = explode("&", $queryString);

    foreach($segments as $segment)
    {
        $parts = explode('=', $segment);

        $key = urldecode(array_shift($parts));
        $value = null;

        if ($parts)
        {
            $value = urldecode(join('=', $parts));
        }

        $result[$key] = $value;
    }

    return $result;
}
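
If you want to see the difference for yourself, the sketch below compares PHP’s built-in parse_str() with a hand-rolled split of the same string. The query string is a made-up literal here; in a real request you would read it from $_SERVER['QUERY_STRING'].

```php
<?php
$raw = 'foo.bar=1&plain=2';

// PHP's built-in parser rewrites the dot in the name to an underscore.
parse_str($raw, $builtin);
var_dump(isset($builtin['foo_bar'])); // bool(true)
var_dump(isset($builtin['foo.bar'])); // bool(false)

// Splitting the raw string by hand keeps the name untouched.
$manual = array();

foreach (explode('&', $raw) as $segment)
{
    $parts = explode('=', $segment, 2);
    $manual[urldecode($parts[0])] = isset($parts[1]) ? urldecode($parts[1]) : null;
}

var_dump($manual['foo.bar']); // string(1) "1"
```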

(OK, that’s not the real function name, but it’s aptly named to be the nemesis of http_build_query)

Retrieving URLs in Parallel With CURL and PHP

As we’ve recently added support for querying Solr servers in parallel, one of the things we added was a simple class to allow us to query several servers at the same time. The CURL library (which has a PHP extension) even provides an abstraction layer for doing the nitty gritty work for you, as long as you keep track of the resources. The code beneath is based on examples in the documentation and a few tweaks of my own.

The code beneath is licensed under an MIT license. You can also download the file (gzipped).

class Footo_Content_Retrieve_HTTP_CURLParallel
{
    /**
     * Fetch a collection of URLs in parallel using cURL. The results are
     * returned as an associative array, with the URLs as the key and the
     * content of the URLs as the value.
     *
     * @param array $addresses An array of URLs to fetch.
     * @return array The content of each URL that we've been asked to fetch.
     **/
    public function retrieve($addresses)
    {
        $multiHandle = curl_multi_init();
        $handles = array();
        $results = array();

        foreach($addresses as $url)
        {
            $handle = curl_init($url);
            $handles[$url] = $handle;

            curl_setopt_array($handle, array(
                CURLOPT_HEADER => false,
                CURLOPT_RETURNTRANSFER => true,
            ));

            curl_multi_add_handle($multiHandle, $handle);
        }

        //execute the handles
        $result = CURLM_CALL_MULTI_PERFORM;
        $running = false;

        // set up and make any requests..
        while ($result == CURLM_CALL_MULTI_PERFORM)
        {
            $result = curl_multi_exec($multiHandle, $running);
        }

        // wait until data arrives on all sockets
        while($running && ($result == CURLM_OK))
        {
            if (curl_multi_select($multiHandle) > -1)
            {
                $result = CURLM_CALL_MULTI_PERFORM;

                // while we need to process sockets
                while ($result == CURLM_CALL_MULTI_PERFORM)
                {
                    $result = curl_multi_exec($multiHandle, $running);
                }
            }
        }

        // clean up
        foreach($handles as $url => $handle)
        {
            $results[$url] = curl_multi_getcontent($handle);

            curl_multi_remove_handle($multiHandle, $handle);
            curl_close($handle);
        }

        curl_multi_close($multiHandle);

        return $results;
    }
}

Escaping Characters in a Solr Query / Solr URL

We’re using our own Solr library at Derdubor at the moment, but we’ve only been using it for indexing content. The query part was never standardized in our common library as we usually used an alternative output format, but during the last few days that has changed. We now have a parser for the default XML outputter and we’re also supporting facets and field queries (or constraints, as they’re abstracted in our library).

This means that we’re feeding content into the query that may contain foreign characters, in particular those that have special meaning in a Solr query. You can find the complete list of characters that need to be escaped in a Solr or Lucene query in the Lucene manual.

To escape the characters we use this very simple and stupid PHP method:

    static public function escapeSolrValue($string)
    {
        $match = array('\\', '+', '-', '&', '|', '!', '(', ')', '{', '}', '[', ']', '^', '~', '*', '?', ':', '"', ';', ' ');
        $replace = array('\\\\', '\\+', '\\-', '\\&', '\\|', '\\!', '\\(', '\\)', '\\{', '\\}', '\\[', '\\]', '\\^', '\\~', '\\*', '\\?', '\\:', '\\"', '\\;', '\\ ');
        $string = str_replace($match, $replace, $string);

        return $string;
    }
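
One detail worth pointing out: when str_replace() is given arrays, it applies the replacements one after another, so the backslash has to come first in the list. The sketch below uses a two-character list (an assumption to keep it short) to show what happens if it doesn’t.

```php
<?php
$input = 'a\\b+c'; // the five characters a\b+c

// Backslash first: existing backslashes are doubled before any new ones
// are added.
$good = str_replace(array('\\', '+'), array('\\\\', '\\+'), $input);

// Backslash last: the backslash just added in front of '+' gets doubled too.
$bad = str_replace(array('+', '\\'), array('\\+', '\\\\'), $input);

echo $good, "\n"; // a\\b\+c
echo $bad, "\n";  // a\\b\\+c
```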

We used a regular expression first, but the sheer amount of backslashes made it a regular .. hell … to read. So to make it easier for the people maintaining this in the future, we went the easy to read / easy to maintain road for this one.

PHP: Fatal error: Can’t use method return value in write context

Just a quick post to help anyone struggling with this error message, as this issue gets raised from time to time on support forums.

The reason for the error is usually that you’re attempting to use empty or isset on a function instead of a variable. While it may be obvious that this doesn’t make sense for isset(), the same cannot be said for empty(). You simply meant to check if the value returned from the function was an empty value; why shouldn’t you be able to do just that?

The reason is that empty($foo) is more or less syntactic sugar for !isset($foo) || !$foo. When written this way you can see that the isset() part of the statement doesn’t make sense for functions. This leaves us with simply the !$foo part. The solution is to drop the empty() part and negate the return value instead:

Instead of:

if (empty($obj->method()))
{
}

Simply drop the empty construct and negate the value:

if (!$obj->method())
{
}
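
If you would rather keep empty() in there, the other way around the error is to store the return value in a variable first. A minimal sketch, with a made-up Basket class standing in for your own object (note that from PHP 5.5 onwards empty() accepts arbitrary expressions, so the error only shows up on older versions):

```php
<?php
// Hypothetical class, only here to give us a method to call.
class Basket
{
    public function getItems()
    {
        return array();
    }
}

$basket = new Basket();

// Works on every PHP version: store the return value first...
$items = $basket->getItems();

if (empty($items))
{
    echo "basket is empty\n";
}

// ...or skip empty() and negate the return value directly.
if (!$basket->getItems())
{
    echo "still empty\n";
}
```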

Missed Schedule for Posts in WordPress

As I started queuing the posts for the previous run of “Ready for 2010”-articles, I came across a problem with my WordPress installation. The scheduled articles didn’t show up when they were scheduled, and the only thing shown in the WordPress administration interface was a message about “Missed Schedule”. No shit, Sherlock.

The reason behind the message is that the wp-cron.php file didn’t run as it should. WordPress usually tries to run this every now and then by inserting a reference to the file through the web site. Apparently this behavior was borked on my blog. I have a perfectly working cron implementation on my server, so instead of relying on WordPress to do some kind of magic to insert a reference to the file and kick off the processing with a web request, I added a reference to wp-cron.php in my usual crontab.

I have no idea how often wp-cron really should be run, but decided that a five minute resolution was enough for my use. The crontab entry is included here:

*/5 * * * * cd <directory of blog> && php wp-cron.php

This runs the cron script from the proper directory, and seems to work fine.

Writing a Munin Plugin

I have to admit something. I’ve become addicted.

One of the things I finally got around to doing while living the quiet life over the Christmas holiday was to dive a bit further into Munin – a simple framework for collecting information from your computers and servers and making nice graphs that you can watch while you’re bored.

I’m not going to write a lot about how you can create your own Munin plugin to create your own graphs, as they have a very simple tutorial that gives you all the basics of writing Munin plugins. The only things you need to remember are these two tidbits:

  1. When Munin first registers your plugin, it runs your script with config as the only argument. This provides Munin with the name of the graph, the labels and names (keys) of the graphs you’re providing values for, information about the axis, etc.
  2. When Munin runs your script without the config argument, it expects you to give it values for the keys you provided it in the configuration.
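
The two modes above can be sketched in PHP like this. Everything specific in it is made up for illustration: the graph title, the “queue” key and the hard-coded value are placeholders for whatever you actually want to measure.

```php
<?php
// Hypothetical plugin: title, key name and value are placeholders.
function muninPluginOutput(array $arguments)
{
    if (isset($arguments[1]) && $arguments[1] == 'config')
    {
        // 1. Describe the graph and the keys we will report values for.
        return "graph_title Pending jobs\n"
             . "graph_vlabel jobs\n"
             . "queue.label Jobs in queue\n";
    }

    // 2. Report the current value for each key declared above.
    $pending = 42; // replace with a real measurement

    return "queue.value " . $pending . "\n";
}

// $argv is only populated on the command line, which is how Munin runs us.
echo muninPluginOutput(isset($argv) ? $argv : array());
```

Keeping the output in a function that takes the argument list makes it easy to check both modes by hand before handing the script over to Munin.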

You enable and disable plugins by creating symlinks in /etc/munin/plugins (at least under Debian / Ubuntu), and plugins are usually stored in /usr/share/munin/plugins.

I keep my plugins archived together with the rest of the repository for my web projects, and then either symlink the content into the plugins-directory or create a simple wrapper script that changes the current directory to the location of the script and then invokes it (to make the current working directory be correct).

A very simple bash script that does this – and passes through any parameters given to the script:

#!/bin/bash
cd  && php ./