Avoid Escaping Spaces in the Query String in a Solr Query

Following up on the previous post about escaping values in a Solr query string, it’s important to note that you should not escape spaces in the query itself. If you escape the space in the query “foo bar”, the search will be performed on the single term “foo bar”, and not with “foo” as one term and “bar” as the other. This will only return documents that have the string “foo bar” in sequence.

The solution is to either remove the space from the escape list in the previous function – and use another function for escaping values where you actually should escape the spaces – or break up the string into “escapable” parts.

The code included beneath performs the last task; it splits the string into different parts delimited by space and then escapes each part of the query by itself.

$queryParts = explode(' ', $this->getQuery());
$queryEscaped = array();

foreach($queryParts as $queryPart)
{
    $queryEscaped[] = self::escapeSolrValue($queryPart);
}

$queryEscaped = join(' ', $queryEscaped);
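
The same idea can be written as a self-contained sketch. Note that escapeSolrValueSketch() is only a stand-in for the escape function from the previous post (which isn’t shown here): the list of characters it escapes is my assumption, and escapeSolrQuery() is a made-up name. Splitting on runs of whitespace instead of a single space also avoids empty parts when the query contains double spaces.

```php
<?php
// Hypothetical stand-in for the escape function from the previous post:
// puts a backslash in front of characters the Lucene query parser treats
// as special. The character list is an assumption, adjust it as needed.
function escapeSolrValueSketch($value)
{
    return addcslashes($value, '\\+-!(){}[]^"~*?:');
}

// Escape each whitespace-separated part on its own, so the spaces between
// the parts are left untouched.
function escapeSolrQuery($query)
{
    $parts = preg_split('/\s+/', $query, -1, PREG_SPLIT_NO_EMPTY);

    return join(' ', array_map('escapeSolrValueSketch', $parts));
}

echo escapeSolrQuery('title:foo  bar!'), "\n"; // title\:foo bar\!
```

The colon and the exclamation mark are escaped, while the (collapsed) space still separates two terms.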

A Simple Smarty Modifier to Generate a Chart Through Google Chart API

After the longest title of my blog so far follows one of the shortest posts.

The function has two required parameters – the first one is provided automagically for you by Smarty (it’s the value of the variable you’re applying the modifier to). This should be an array of objects containing the values you want to graph. The only required argument you have to provide to the modifier yourself is the name of the method to use for fetching the values for graphing.

Usage:
{$objects|googlechart:"getValue"}

This will dynamically load your plugin from the file modifier.googlechart.php in your Smarty plugins directory, or you can register the plugin manually by calling register_modifier on the template object after you’ve created it.

function smarty_modifier_googlechart($points, $method, $size = "600x200", $low = 0, $high = 0)
{
    $pointStr = '';
    $maxValue = 0;
    $minValue = PHP_INT_MAX;
    
    foreach($points as $point)
    {
        if ($point->$method() > $maxValue)
        {
            $maxValue = $point->$method();
        }

        if ($point->$method() < $minValue)
        {
            $minValue = $point->$method();
        }
    }

    if (!empty($high))
    {
        $maxValue = $high;
    }

    // avoid a division by zero when there are no points or all values are zero
    $scale = ($maxValue > 0) ? 100 / $maxValue : 0;

    foreach($points as $point)
    {
        $pointStr .= (int) ($point->$method() * $scale) . ',';
    }

    $pointStr = substr($pointStr, 0, -1);

    // labels (5)
    $labels = array();

    $steps = 4;
    $interval = $maxValue / $steps;

    for($i = 0; $i < $steps; $i++)
    {
        $labels[] = (int) ($i * $interval);
    }

    $labels[] = (int) $maxValue;

    return 'http://chart.apis.google.com/chart?cht=lc&chd=t:' . $pointStr . '&chs=' . $size . '&chxt=y&chxl=0:|' . join('|', $labels);
}

The function does not support the short version of the Google Chart API Just Yet (tm) as it is a simple proof of concept hack made a few months ago.

How To Dismantle An Atomic HTTP Query .. String.

Following up on yesterday’s gripe about PHP’s (old and now useless) automagic translation of dots in GET and POST parameters to underscores, today’s edition manipulates the query string in place instead of returning it as an array.

This is useful if you have a query string you want to pass on to another service, and for some reason the default behaviour in PHP will barf barf and barf. That might happen because of the dot translation issue, or because some services (such as Solr) rely on a parameter name being repeatable (in PHP the second parameter value will overwrite the first).
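
To see both problems in one go, here’s a minimal sketch using parse_str(), which mangles names the same way the request parser does. The query string itself is made up for the example.

```php
<?php
// A made-up query string with a repeated parameter (fq) and a dot in a name.
$queryString = 'fq=type:article&fq=year:2009&facet.field=author';

// parse_str() applies the same name handling as $_GET.
parse_str($queryString, $parsed);

// Only the last fq survives, and facet.field has become facet_field.
var_dump($parsed);
```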

function http_dismantle_query($queryString, $remove)
{
    $removeKeys = array();

    if (is_array($remove))
    {
        foreach($remove as $removeKey)
        {
            $removeKeys[$removeKey] = true;
        }
    }
    else
    {
        $removeKeys[$remove] = true;
    }

    $resultEntries = array();
    $segments = explode("&", $queryString);

    foreach($segments as $segment)
    {
        $parts = explode('=', $segment);

        $key = urldecode(array_shift($parts));

        if (!isset($removeKeys[$key]))
        {
            $resultEntries[] = $segment;
        }
    }

    return join('&', $resultEntries);
}

I’m not really sure what I’ll call the next function in this series, but there sure are loads of candidates out there.

Getting Dots to Work in PHP and GET / POST / COOKIE Variable Names

One of the oldest and ugliest relics of the register_globals era of PHP is the fact that all dots in request variable names get replaced with “_”. If your variable was named “foo.bar”, PHP will serve it to you as “foo_bar”. You cannot turn this off, you cannot use extract() or parse_str() to avoid it and you’re mostly left out in the dark. Luckily the QUERY_STRING environment variable (in _SERVER if you’re running mod_php, etc) contains the raw string, and this string contains the dots.

The following “parser” is a work in progress and does currently not support the array syntax for keys that PHP allows, but it solves the issue for regular vars. I will try to extend this later on to actually replicate the functionality of the regular parser.

Here’s the code. No warranties. Ugly hack. You’re warned. Leave a comment if you have any good suggestions regarding this (.. or know of an existing library doing the same..).

function http_demolish_query($queryString)
{
    $result = array();
    $segments = explode("&", $queryString);

    foreach($segments as $segment)
    {
        $parts = explode('=', $segment);

        $key = urldecode(array_shift($parts));
        $value = null;

        if ($parts)
        {
            $value = urldecode(join('=', $parts));
        }

        $result[$key] = $value;
    }

    return $result;
}

(OK, that’s not the real function name, but it’s aptly named to be the nemesis of http_build_query)
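
The function above still lets a repeated key overwrite the earlier value. A possible variant, sketched here under the made-up name http_demolish_query_multi, keeps every value by always mapping a key to a list of values. It is not part of the original code, just one way to take it further.

```php
<?php
// Hypothetical variant of the parser above: every key maps to a list of
// values, so a repeated key keeps all of its values instead of only the last.
function http_demolish_query_multi($queryString)
{
    $result = array();

    foreach (explode('&', $queryString) as $segment)
    {
        // skip empty segments such as the one produced by "a=1&&b=2"
        if ($segment === '')
        {
            continue;
        }

        // a limit of 2 keeps any '=' inside the value intact
        $parts = explode('=', $segment, 2);

        $key = urldecode($parts[0]);
        $value = isset($parts[1]) ? urldecode($parts[1]) : null;

        $result[$key][] = $value;
    }

    return $result;
}

var_dump(http_demolish_query_multi('foo.bar=1&fq=a&fq=b%20c&flag'));
```

The price is that you always have to index into the list, even for keys that only occur once.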

Boosting By Date in Solr 1.4

One of the things introduced with Solr 1.4 is the ms() function for getting the number of milliseconds for a timestamp since the Unix epoch. This means that you can now write date boosts without having to resort to ord() or rord().

The best solution for boosting documents based on a field at query time (to avoid having to update the boost factor based on date as time progresses) seems to be to use the boost query type. The boost query type will pass the query on to your default query handler and let that resolve the query itself, but will provide boosts for each document based on the fields queried.

An example of how to solve this issue can be found on the SolrRelevancy part of the Solr Wiki:

{!boost b=recip(ms(NOW,publishedTime),3.16e-11,1,1)}query

This will take the number of milliseconds between NOW and the time the document was published (publishedTime is one of the fields YOU have to provide when you’re indexing; this might be “created” or something else that suits your needs) and then multiply that number by 3.16e-11, which is roughly 1 / (the number of milliseconds in a year). Since recip(x,m,a,b) evaluates to a / (m*x + b), the result of the function is 1 for a document that was just published, about 0.5 for a document that is a year old, and it keeps falling towards 0 as the document gets older.
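
To get a feel for the numbers, the boost can be recomputed outside of Solr. This is only a sketch: it assumes the documented definition recip(x,m,a,b) = a / (m*x + b), and dateBoost() is a made-up helper that uses the constants from the example above.

```php
<?php
// recip(x, m, a, b) as documented for Solr function queries: a / (m * x + b)
function recip($x, $m, $a, $b)
{
    return $a / ($m * $x + $b);
}

// Hypothetical helper: the boost for a document of the given age in days,
// using the constants from the example query.
function dateBoost($ageInDays)
{
    $ageInMs = $ageInDays * 24 * 60 * 60 * 1000;

    return recip($ageInMs, 3.16e-11, 1, 1);
}

// Print the boost for a new document, a month, a year and two years.
foreach (array(0, 30, 365, 730) as $days)
{
    printf("%4d days: %.2f\n", $days, dateBoost($days));
}
```

A new document gets a boost of 1, a year-old document lands at roughly half of that, and the value keeps shrinking from there without ever reaching zero.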

The Solr Wiki also contains an example of how you can divide your boost query into several parts to make it easier to read.

Relevant Meta Tags for Facebook Share

I spent the evening making sure the different pages on Gamer.no gave relevant titles and descriptions when shared on Facebook. The implementation was quite straightforward, but finding the actual documentation for which elements Facebook supports ate a bit of development time. After navigating through four or five wiki pages at the development wiki describing various parts of the Facebook Share system, I finally found the page getting down to the metal about which elements you should include.

The page can be found at Facebook Share – Specifying Meta Tags.

We’ve currently implemented title, description and medium. We also had image_src, but decided against it at the moment – the first feedback we received made it clear people preferred to select their own image. This may however be because the first batch of people was a bit too technically competent, so we’ll probably use image_src later (.. does Facebook support providing several images through image_src?).

Retrieving URLs in Parallel With CURL and PHP

As we’ve recently added support for querying Solr servers in parallel, one of the things we added was a simple class to allow us to query several servers at the same time. The CURL library (which has a PHP extension) even provides an abstraction layer for doing the nitty gritty work for you, as long as you keep track of the resources. The code beneath is based on examples in the documentation and a few tweaks of my own.

The code beneath is licensed under an MIT license. You can also download the file (gzipped).

class Footo_Content_Retrieve_HTTP_CURLParallel
{
    /**
     * Fetch a collection of URLs in parallel using cURL. The results are
     * returned as an associative array, with the URLs as the key and the
     * content of the URLs as the value.
     *
     * @param array $addresses An array of URLs to fetch.
     * @return array The content of each URL that we've been asked to fetch.
     **/
    public function retrieve($addresses)
    {
        $multiHandle = curl_multi_init();
        $handles = array();
        $results = array();

        foreach($addresses as $url)
        {
            $handle = curl_init($url);
            $handles[$url] = $handle;

            curl_setopt_array($handle, array(
                CURLOPT_HEADER => false,
                CURLOPT_RETURNTRANSFER => true,
            ));

            curl_multi_add_handle($multiHandle, $handle);
        }

        //execute the handles
        $result = CURLM_CALL_MULTI_PERFORM;
        $running = false;

        // set up and make any requests..
        while ($result == CURLM_CALL_MULTI_PERFORM)
        {
            $result = curl_multi_exec($multiHandle, $running);
        }

        // wait until data arrives on all sockets
        while($running && ($result == CURLM_OK))
        {
            if (curl_multi_select($multiHandle) > -1)
            {
                $result = CURLM_CALL_MULTI_PERFORM;

                // while we need to process sockets
                while ($result == CURLM_CALL_MULTI_PERFORM)
                {
                    $result = curl_multi_exec($multiHandle, $running);
                }
            }
        }

        // clean up
        foreach($handles as $url => $handle)
        {
            $results[$url] = curl_multi_getcontent($handle);

            curl_multi_remove_handle($multiHandle, $handle);
            curl_close($handle);
        }

        curl_multi_close($multiHandle);

        return $results;
    }
}

Download the file.

Java and NetBeans: Illegal escape character

When defining strings in programming languages, they’re usually delimited by double quotes, such as “This is a string” and “Hello World”. The immediate question is what you do when the string itself should contain a double quote. “Hello “World”” is hard to read and practically impossible to parse for the compiler (which tries to make sense out of everything you’ve written). To solve this (and similar issues) people started using escape characters, special characters that tell the parser that it should pay attention to the following character(s) (some escape sequences may contain more than one character after the escape character).

Usually the escape character is \, and rewriting our example above we’ll end up with “Hello \”World\””. The parser sees the \, telling it that it should parse the next characters in a special mode and then inserts the ” into the string itself instead of using it as a delimiter. In Java, C, PHP, Python and several other languages there are also special versions of the escape sequences that do something else than just insert the character following the escape character.

\n – Inserts a new line.
\t – Inserts a tab character.
\xNN – Inserts a byte with the byte value provided (\x13, \xFF, etc). Java does not support this form and uses \uNNNN for Unicode characters instead.

A list of the different escape sequences that PHP supports can be found in the PHP manual.

Anyways, the issue is that Java found an escape sequence that it doesn’t know how to handle. Attempting to define a string such as “! # \ % &” will trigger this message, as it sees the escape character \, and then attempts to parse the following byte – which is a space (” “). The escape sequence “\ ” is not a valid escape sequence in the Java language specification, and the parser (or NetBeans or Eclipse) is trying to tell you this is probably not what you want.

The correct way to define the string above would be to escape the escape character (now we’re getting meta): “! # \\ % &”. This would define a string with just a single backslash in it.
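
The same two escapes can be tried out in PHP’s double-quoted strings. Keep in mind that this is a sketch in PHP rather than Java: PHP is more lenient and silently keeps an unknown sequence such as “\ ” as it is, where Java refuses to compile it.

```php
<?php
// \" inserts a double quote instead of ending the string
$quoted = "Hello \"World\"";

// \\ inserts a single backslash
$escaped = "! # \\ % &";

echo $quoted, "\n";           // Hello "World"
echo $escaped, "\n";          // ! # \ % &
echo strlen($escaped), "\n";  // 9, the two characters \\ became one
```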

A Quick Introduction to chmod and Octal Numbers

Someone asked what the difference between doing a chmod 777 and chmod 755 is today, and hopefully this short, informal post will provide you with the answer (if you want to jump straight through to the conclusion, man chmod).

Octal Numbers

The number you provide as an argument to chmod is an octal number telling chmod what access you want to provide to a file (or a directory, device, etc – an entry on the file system). The number is in fact three discrete values: 7, 5 and 5. Each of the values corresponds to a set of three bits, each bit being either zero or one. Three bits make up a value from 0 – 7, hence an octal number (a decimal number has the digits 0 – 9 for each digit, an octal number has 0 – 7, a binary number has 0 – 1, a hexadecimal number has 0 – F (15)).

If you tried to count from 0 to 10 (decimal) in octal, it’d be: 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12. 12 in octal is the same value as 10 in decimal. The big difference from decimal is that both octal and hexadecimal map very neatly on top of binary numbers, with each digit being exactly three or four bits.

The usual way to write an octal number in a programming language is by putting a zero in front of it, such as 0755. This tells the compiler that the number is written in octal notation, and the value is then parsed as such. chmod parses all numbers as octal, and does actually handle four digits. Since missing digits are considered to be zero, the first digit is usually not included (or simply given as a zero – which will look the same as the representation used in certain programming languages). The first, usually unused, digit has a special meaning, setting the “set user id” (suid), “set group id” (sgid) or the “restricted deletion” or “sticky” attributes (you can read more about these options in the manual page).
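
PHP is one of the languages that use the leading zero, so the conversions are easy to check there. A small sketch using only the built-in decoct() and octdec() functions:

```php
<?php
// A leading zero makes PHP read the literal as octal.
$mode = 0755;

echo $mode, "\n";          // 493, the same value printed in decimal
echo decoct($mode), "\n";  // 755, converted back to an octal string

// Converting between the two bases explicitly.
echo decoct(10), "\n";     // 12, decimal 10 written in octal
echo octdec('755'), "\n";  // 493, octal 755 written in decimal
```

Forgetting the zero is a classic mistake: 755 without it is the decimal number 755, which is 1363 in octal and a very different set of permissions.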

File permissions

Now that we know what an octal number is, it’s time to look at how the file permissions work. Each file has three sets of permissions, one set for the user owning the file, one set for the group owning the file and one set for anyone else. If you want to take a look at these values on a unix based system, simply type ls -l to list files in a verbose way. Your result will look something like:

-rw-r--r--  1 mats mats        35 2008-08-23 20:24 IMPORTANTFILE

The permissions are listed in the first column, containing “-rw-r--r--”. The first character “-” indicates the type of the entry: “-” for a regular file, “d” for a directory and so on. (The suid and sgid bits show up as an “s” in place of the “x” in the user or group set.)

This leaves us with “rw-r--r--” – the three sets of permissions. “rw-” is for the user owning the file, “r--” is for the group owning the file and the last “r--” is for anyone else (or ‘other’ as it’s called). The “r” means read, the “w” means write and the currently missing letter is “x”, which means execute (for files) or search (for directories). The “execute” setting is used to let bash (or another shell) attempt to run the file as a script, attempting to parse the first line as a path to the interpreter for the file (i.e. #!/usr/bin/python).

We have three flags (read, write, execute) that can be either on or off. This should remind us of three bits, either being 0 (not set) or 1 (set). And an octal digit is exactly three bits. This means that an octal digit maps exactly to the bit sequence needed to set permissions for a file. A 7 is “111”, a 5 is “101”, a 4 is “100” and so on. Mapping this to permissions:

7 = 111 = rwx
6 = 110 = rw-
5 = 101 = r-x
4 = 100 = r--
3 = 011 = -wx
2 = 010 = -w-
1 = 001 = --x
0 = 000 = ---

When calling chmod 755 on a directory we’re telling chmod to “set the read, write and search bits for me, the read and search bits for the group and the read and search bits for other users” (‘search’ for directories, ‘execute’ for files).

Another example is 644 that maps to 110 100 100, which again maps to “rw-r--r--” which usually is the standard access mode for files (and 755 for directories).
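
The mapping from digits to letters is mechanical enough to write down as code. Here’s a minimal sketch (modeToSymbols is a made-up name) that only looks at the three permission digits and ignores the special fourth one:

```php
<?php
// Hypothetical helper: turn the lower three octal digits of a mode into
// the rwx notation that ls -l prints.
function modeToSymbols($mode)
{
    $symbols = '';

    // user, group and other occupy three bits each, from high to low
    foreach (array(6, 3, 0) as $shift)
    {
        $bits = ($mode >> $shift) & 7;

        $symbols .= ($bits & 4) ? 'r' : '-';
        $symbols .= ($bits & 2) ? 'w' : '-';
        $symbols .= ($bits & 1) ? 'x' : '-';
    }

    return $symbols;
}

echo modeToSymbols(0755), "\n"; // rwxr-xr-x
echo modeToSymbols(0644), "\n"; // rw-r--r--
```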

Handling Permissions With Symbols

I’m now going to eliminate the need for remembering everything I’ve written so far in the post, but at least you’ll know what people are talking about when they’re telling you to chmod something this-or-that.

You can also use the symbols directly with chmod, either adding, removing or setting the permissions for one of the three groups.

Examples:

To remove all access for other users (but leaving group and user intact)
chmod o-rwx file

To give everyone read access
chmod a+r file

To give everyone read – and search – access
chmod a+rx directory

To set particular user modes for each group
chmod u=rw,g=w,o=w file (a file that the user can read, but anyone can write to)

And with that I chmod this post a+r.

jQuery, .getJSON and the Same-Origin Policy

When creating a simple mash-up with data from external sources, you usually want to read the data in a suitable format – such as JSON. The tool for the job tends to be JavaScript, running in your favourite browser. The only problem is that requests made with XHR (XMLHttpRequest) have to follow the same-origin policy, meaning that the request cannot be made for a resource living on another host than the host serving the original request.

To get around this, clients usually use JSONP – a simple modification of the usual JSON output. The data is still JSON, but the output is wrapped in a call to a callback function named in the request, so loading the response as a script triggers that function in the local browser. This way the creator of the data actually tells the browser (in so many hacky ways) that it’s OK, I’ve actually thought this through. Help yourself.
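
On the server side the difference is tiny. The sketch below shows roughly what a JSONP-capable service does in PHP; jsonpBody is a made-up helper, and the pattern used to check the callback name is my own conservative assumption rather than anything a particular service is known to use.

```php
<?php
// Hypothetical helper: wrap a value as a JSONP response body. The callback
// name comes from the request (e.g. ?callback=handleData), so it is checked
// against a conservative pattern before being echoed back.
function jsonpBody($callback, $data)
{
    if (!preg_match('/\A[A-Za-z_$][A-Za-z0-9_$]*\z/', $callback))
    {
        throw new InvalidArgumentException('Invalid callback name');
    }

    return $callback . '(' . json_encode($data) . ');';
}

echo jsonpBody('handleData', array('title' => 'Example')), "\n";
// handleData({"title":"Example"});
```

Checking the name matters because the callback is copied straight from the query string into something the browser will execute.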

In jQuery you can trigger the usual handling of events by using “?” as the name of your callback function. jQuery will handle this transparently and then trigger the function you provided to .getJSON in the first place.

Example

var url = "http://feeds.delicious.com/v2/json/recent?callback=?";

$.getJSON(url, function(data) { alert(data); });

There’s an article up at IBM’s developerWorks giving quite a few more examples and information about the issue.