Finding Your Way Around for the Weekend

If you’re interested in a bit of a weekend challenge, we’re currently hosting a small PHP coding competition in one of the IRC channels I’m active in. The deadline is this Sunday (August 10th, 2008) at midnight (if you’re in another timezone, just use your own Sunday’s midnight), but the challenge should prove interesting later, too.

If you’re interested in trying to solve labyrinths with the smallest amount of source code, this might be for you!

Resolutions Be Gone!

(Haha! Thanks to the wonderful internet, I’m on vacation .. and still posting things in my blog .. that I wrote BEFORE we left! How’s that for TIME TRAVEL???)

We should really stop using the resolution information in our statistics scripts for anything. It satisfies nothing but curiosity, and it should never be used for making any decisions about how a page should be structured. Most people have probably stopped reading halfway through that and decided I’ve gone completely nutters. What do you mean? But .. how can we know how many pixels we can use for our design?

You’re out of luck! Hopefully I’m not singing Lady Madonna and strolling around somewhere in Europe wearing a pink tube top just yet, and I still have my sanity. Instead of basing our decisions on resolution, we should start paying attention to the actual browser window size. This is interesting from a usability perspective, as it considers the size we can actually use without demanding that our users resize their active browser window. The desktops I use actively usually run at 1920×1200, but my browser window is considerably smaller. If a page suddenly required 1920×1200 of me, I’d be annoyed (and yes, that includes those of you who still think your content is SOOOOO important that you should open windows in fullscreen mode, and then only use 50% of it for content..), even though I actually have a resolution that supports that kind of content.

Since this means that we’d see a lot more sizes than the regular 1024×768, 1280×1024, etc., we’d simply create intervals of “acceptable sizes” instead. We’d have (640-800), (801-980), (981-1099), and so on, and we’d be able to estimate how much content we could actually show a certain percentage of our users within their first view of our information. Even though people have 1024×768 as their resolution, we have no idea how much content they actually see. Whether the majority of people see 500 pixels of content, 300 pixels of content, or something else entirely (and that would of course also change based on font size, but .. stay on track with me here) is the information that’s actually useful to base your decisions on.
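
To make the idea concrete, here’s a quick PHP sketch of the bucketing (the interval boundaries match the ones above; how the viewport width gets reported from the browser to your logs is left as an exercise):

<?php
// Map a logged viewport width to one of the "acceptable size" intervals.
function viewportBucket($width)
{
    if ($width < 640)   return 'below 640';
    if ($width <= 800)  return '640-800';
    if ($width <= 980)  return '801-980';
    if ($width <= 1099) return '981-1099';
    return '1100 and above';
}

print viewportBucket(1014); // "981-1099": a maximized 1024x768 browser, minus window chrome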

So, anyone with me? How about support for this metric in Google Analytics? Anyone know of any existing statistics project that has added support for the actual viewport instead of the screen resolution?

PHP, ImageMagick and Cropping to GIF: Digging into GIFs again!

Christer had an interesting case today, where he tried to resize and crop an image with the Imagick extension for PHP. Everything went as planned, and the image was cropped and resized as it should be, but after writing it to disk and opening it again, the image’s size was the same as if he hadn’t done the crop. The content of the image outside the crop area was removed (simply set as transparent), but the image was still returned in its uncropped size.

The PHP module for binding ImageMagick is quite thin (it mostly marshals between the ImageMagick methods and PHP user space), so my guess is that this is a weird behaviour with a good enough reason somewhere down in ImageMagick. It might be a bug, but I haven’t had time to try to reproduce it with convert or mogrify yet. If anyone wants to give that a go, feel free; Christer has posted the code, so simply try to recreate the same symptoms using one of those two tools.

Anyway, this post is not about the issue itself, as Christer has done a neat analysis and write-up of that; instead I’ll take a more detailed look at the issue within the GIF file itself. As chance would have it, I recently participated in a competition at the Norwegian demoscene IRC hangout where the goal was to recreate the Norwegian flag in an HTML page in the smallest space possible. This ended up being a competition to see who could molest and optimize GIF images the most while browsers were still able to display them. From this experience I had quite good knowledge of how GIF files are built internally, and I was able to make a good guess at what the actual issue in the resulting file could be.

Since GIF files can be animated, a single file may contain several “images” (which would be the frames in the animation). These images can have their own size and position within the “larger image”:

 _________
| im1     |
|    _____|
|   |     |
|   | im2 |
|___|_____|

im1 may then represent the first image and im2 the second image in the file. The second image will only update the area that it covers, and this will leave the rest of the image “as it is”. Since a GIF image may contain a large number of these images, a “global” size is defined for the image. This global size covers all the images, and is the total area that these images will be drawn into. If an image is drawn outside of this area (in part or whole), it will be clipped against the viewport.

This should provide enough background to at least give a general feeling about what COULD be the problem here, but to actually find out what’s happening, we’ll dig into a GIF file format specification and the file that was created. This simple reference provides a general layout of the GIF file, and we’ll use that to take a look at what values the file we ended up with had:

On the left we have the actual byte values in hex, and on the right we have the corresponding ASCII characters represented by those values. As you can see, the first six bytes of the file (0x47 0x49 0x46 0x38 0x39 0x61 (0x is the general way of prefixing numbers that should be interpreted as hexadecimal)) correspond to “GIF89a” (you can do this exercise yourself armed with this ASCII Table; simply look up 47 in the Hx column, then 49, etc.). Those six bytes are what we call the signature of a GIF file (although the version part can be different, e.g. GIF87a, depending on the version used).

The next fields in the specification read:

Offset   Length   Contents
  6      2 bytes  Logical Screen Width
  8      2 bytes  Logical Screen Height

So bytes 6-7 and bytes 8-9 tell us the logical size of the whole GIF file (the canvas the images will be drawn onto). In our test file, that’s represented as:

Width: 0x67 0x01
Height: 0x70 0x00

The byte order here is Little Endian, which means that the least significant byte is placed first. Since we have two bytes for each value, we can calculate the decimal value of the width by multiplying:

0x67 0x01 = 6 * 16 + 7 + (0 * 16 + 1) * 256 = 359
                                        ^-- Since we're in the next byte, we multiply with 256.

You can also do this with the Windows calculator, by entering 167 in hex mode and then selecting dec (for decimal). The reason for multiplying the second byte by 256 is that this byte provides the value of the “next 8 bits”, while the first byte provides the first 8 bits. If we look at the bits themselves:

0x67 | 0x01: 0110 0111 | 0000 0001

Little Endian says that the least significant byte comes first, so to get the raw bit value, we turn the bytes around:

0000 0001 0110 0111

As you can see, the second byte (0x01) ends up as the high 8 bits of the value, which is why we multiply it by 256.

We can also calculate the height:

0x70 0x00 = 7 * 16 + 0 + (0 * 16 + 0) * 256 = 112
                          ^-- both digits of the second byte are zero

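As a side note, the same calculation takes only a few lines of PHP. This is just a sketch (the filename is made up), but unpack()’s “v” format is exactly this 16-bit little-endian layout:

<?php
// Read the first ten bytes of the GIF: six bytes of signature,
// then the logical screen width and height, each a 16-bit
// little-endian ("v") value.
$header = file_get_contents('cropped.gif', false, null, 0, 10);
$size = unpack('vwidth/vheight', substr($header, 6, 4));
printf("Logical screen size: %dx%d\n", $size['width'], $size['height']);
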
So the global header of the GIF that was generated says that the size of the image is 359×112, which is why the image is rendered larger than it should have been. We then take a look at the Image section of the GIF file (all GIF files should contain at least one), which is defined as:

Offset   Length   Contents
  0      1 byte   Image Separator (0x2c)
  1      2 bytes  Image Left Position
  3      2 bytes  Image Top Position
  5      2 bytes  Image Width
  7      2 bytes  Image Height

Armed with this information, we examine the area where the image section starts:

The start of the Image section is the “Image Separator”, a byte value of 0x2c, shown highlighted in the image above. This is where the image section starts, and the offsets in the table are relative to this location. The next four bytes tell us where in the global viewport the upper left corner of this image should be drawn. The values here are 0x01 0x00 twice, simply meaning (1,1), or one pixel down and one pixel out from the upper left corner (which is also related to the issue posted by Christer, but we’ll ignore that one for now). The next values, however, are the ones we’re interested in: Image Width and Image Height:

Width:
0x73 0x00 = 7 * 16 + 3 + (0 * 16 + 0) * 256 = 115
Height:
0x6F 0x00 = 6 * 16 + 15 + (0 * 16 + 0) * 256 = 111

This means that the dimension of the image actually supplied in the GIF file is 115×111 pixels, and it should be drawn beginning one pixel down and one pixel out (as given by the 0x01 0x00 in the x,y fields above). Compare this to the reported global size of the image (359×112), and we can see where our transparent space is coming from. Browsers (and other image viewers) create a canvas of 359×112 pixels, while only drawing an image into the 115 leftmost pixels. The rest is left transparent, but it’s still there, as the file says that’s the size of the viewport. If we manually change the width of the viewport to 0x74 0x00 (116, to account for the one-pixel offset) in the GIF header itself, the image displays properly. To illustrate with another great ascii drawing:


               viewport
 _____________________________________
|           |                         |
|  actual   |                         |
|  image    |                         |
|  drawn    |                         |
|           |                         |
|           |                         |
|           |                         |
|___________|_________________________|

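If you’d rather do that manual header surgery programmatically, a short PHP sketch will do (the file names are made up; pack('v', 116) produces exactly the bytes 0x74 0x00):

<?php
// Overwrite the width field of the logical screen descriptor
// (bytes 6 and 7) with 116: the image width of 115 plus the
// one-pixel left offset, encoded as 16-bit little-endian.
$data = file_get_contents('broken.gif');
$data = substr_replace($data, pack('v', 116), 6, 2);
file_put_contents('fixed.gif', $data);
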
The solution to the problem here was to call the setImagePage method of the image object, as that allows us to set the values for the global image ourselves (and we know how wide the image was supposed to be).
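
In code, the fix could look something like this sketch (the file names and crop values are illustrative, matching the sizes discussed above):

<?php
$image = new Imagick('source.gif');
$image->cropImage(115, 111, 1, 1);
// Reset the "page" (the global GIF canvas) to the cropped size,
// so the old 359x112 canvas isn't written into the file.
$image->setImagePage(115, 111, 0, 0);
$image->writeImage('fixed.gif');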

Bonus knowledge: This issue did not occur when saving to a JPEG file, as JPEG files do not have the same capability of storing several subimages inside one file and do not have the same rendering model as GIF files. ImageMagick knows this, and does not use the page values when rendering the file.

Hopefully this has provided a small introduction to how files are structured, what you can learn armed with a hex editor and a file format specification, and a few insights into what you can do when you’re faced with a very weird problem.

Google Releases Their Protocol Buffers

Fresh from the Google Open Source Blog comes news that Google has released their Protocol Buffers specification and accompanying libraries. The code and specification have been released at Protocol Buffers on Google Code.

Protocol Buffers is a data format for fast exchange and parsing of data and messages between computers. It is similar to simple uses of XML in this manner, but the message size on the wire and the parsing time are heavily optimized for busy sites. There’s no need to spend loads of time doing XML parsing when you could instead be doing something useful. It’s very easy to interact with the messages through the generated classes (for C++, Java and Python), and future versions of the same schema are compatible with old versions (as new fields are simply ignored by older parsers).
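
To give a feel for the format, here’s a tiny illustrative schema (the message and field names are my own invention, not from Google’s examples). The numbered tags are what end up on the wire, and a parser that doesn’t know a tag simply skips it, which is what makes old and new versions compatible:

message LogEntry {
  required string message = 1;
  optional int32 severity = 2;
}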

Still no PHP implementation available, so I guess it’s time to get going and lay down some code during the summer. Anyone up for the job?

Writing a Solr Analysis Filter Plugin

Update: If you’re writing a plugin for a Solr version after 1.4.1 or Lucene 3.0+, you should be sure to read Updating a Solr Analysis Plugin to Lucene 4.0 as well. A few of the method calls used below have changed in the new API.

As we’ve been working on getting better results out of the phonetic search we’re currently doing at derdubor, I started writing a plugin for Solr to be able to return better search results when searching for Norwegian names. We’ve been using the standard phonetic filter from Solr 1.2 so far, with the double metaphone encoder for encoding a regular token as a phonetic value. The trouble with this is that a double metaphone value is four simple letters, which means that search words such as ‘trafikkontroll’ get the same encoding as ‘Dyrvik’. The latter is a name, while the former is a regular search term that would be better served through an article view. TRAFIKKONTROLL resolves to TRFK in double metaphone, while DYRVIK resolves to DRVK. T and D are considered similar, as are V and F, and voilà, you’ve got yourself a match in the search result, but not a visual one (or a semantic one, as the words have very different meanings).

To solve this, I decided to write a custom filter plugin which we could tune to names in use in Norway. I’ll write about the reasoning behind our handling of the words themselves and hopefully post the complete filter function we’re applying, but I’ll leave that for another post.

First you need a factory that’s able to produce filters when Solr asks for them:

NorwegianNameFilterFactory.java:

package no.derdubor.solr.analysis;

import java.util.Map;

import org.apache.solr.analysis.BaseTokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;

public class NorwegianNameFilterFactory extends BaseTokenFilterFactory
{
    Map args;

    public Map getArgs()
    {
        return args;
    }

    public void init(Map args)
    {
        this.args = args;
    }

    /* called by Solr whenever a new filter instance is needed in an analyzer chain */
    public NorwegianNameFilter create(TokenStream input)
    {
        return new NorwegianNameFilter(input);
    }
}

To compile this example yourself, put the file in no/derdubor/solr/analysis/ (which matches no.derdubor.solr.analysis; in the package statement), and run

javac no/derdubor/solr/analysis/NorwegianNameFilterFactory.java

(you’ll need apache-solr-core.jar and lucene-core.jar in your classpath to do this)

to compile it. You’ll of course also need the filter itself (which is returned from the create-method above):

package no.derdubor.solr.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class NorwegianNameFilter extends TokenFilter
{
    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
    }

    public Token next() throws IOException
    {
        return parseToken(this.input.next());
    }

    public Token next(Token result) throws IOException
    {
        /* pass the reusable token through to the wrapped stream instead of allocating a new one */
        return parseToken(this.input.next(result));
    }

    protected Token parseToken(Token in)
    {
        /* the wrapped stream returns null when it is exhausted */
        if (in == null)
        {
            return null;
        }

        /* do magic stuff with in.termBuffer() here (a char[] which can be manipulated) */
        /* set the changed length of the new term with in.setTermLength(); before returning it */
        return in;
    }
}

You should now be able to compile both files:

javac no/derdubor/solr/analysis/*.java

After compiling the plugin, create a jar file which contains your plugin. This will be the “distributable” version of your plugin, and should contain the .class files of your application.

jar cvf derdubor-solr-norwegiannamefilter.jar no/derdubor/solr/analysis/*.class

Move the file you just created (derdubor-solr-norwegiannamefilter.jar in the example above) into your Solr home directory, the directory where you keep bin/ and conf/ (the latter containing schema.xml, etc.). Create a lib directory in the Solr home directory if it doesn’t already exist; this is where your custom libraries live, so copy the jar into lib/.
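
In shell terms, something like this (the paths are illustrative):

cd /path/to/solr/home
mkdir lib
cp ~/derdubor-solr-norwegiannamefilter.jar lib/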

Restart Solr and check that everything still works as it should. If everything still seems normal, it’s time to enable your filter. In one of your <filter>-chains, you can simply append a <filter> element to insert your own filter into the chain, for example (the tokenizer and other filters here are just an illustration; the important part is the last <filter> line):

<fieldtype name="name_phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="no.derdubor.solr.analysis.NorwegianNameFilterFactory"/>
  </analyzer>
</fieldtype>
Restart Solr again, and if everything still works as it should, you’re all set! Time to index some new data (remember that you’ll need to reindex your data for things to work as you expect, since no stored data is processed when you edit your configuration files) and commit it! Do a few searches through the admin interface to verify that everything works as it should. I’ve used the “debug” option to .. well, debug .. my plugin while developing it. A very neat trick is to look at what terms your filter expands to (if you set type="query" in the analyzer section, it will be applied to all queries against that field); these are shown in the first debug section when looking at the result (you’ll have to scroll down to the end to see it). If you need to debug things to a greater extent, you can attach a debugger or simply use the Good Old Proven Way of println! (those end up in catalina.out in logs/ in your Tomcat directory). Good luck!

Potential Problems and How To Solve Them

  • If you get an error about incompatible class versions, check that you’re actually running the same (or a newer) version of the JVM (java -version) on your Solr search server as the one you use on your own development machine (use -source 1.5 -target 1.5 when compiling to produce 1.5-compatible class files instead of 1.6).
  • If you get an error about missing config or something similar, or that Solr is unable to find the method it’s searching for (generally triggered by a ReflectionException), remember to declare your classes public! public class NorwegianNameFilter is your friend! It took at least half an hour before I realized what this simple issue was…

Any comments and followups are of course welcome!

Misunderstanding How in_array Works

Brian Moon has a post about how in_array() really, really sucks. This is not a problem with in_array per se, but with failing to recognize the proper way to solve a problem like this. Some of the comments have already touched on this matter, but I’ll attempt to expand further and describe the problem.

You have two arrays, a1 and b2. You’re interested in removing all the values from a1 that are also in b2. If you’re doing the naive approach (which Brian Moon describes), you’ll simply do:

foreach($a1 as $key => $value)
{
    foreach($b2 as $key2 => $value2)
    {
        if ($value == $value2)
        {
            unset($a1[$key]);
        }
    }
}

(ignore any potential side effects of manipulating $a1 while looping through it for now)

This will work for small sizes of a1 and b2, but as soon as the number of entries starts to increase (let’s call them m and n), you’ll see that the running time of your function approaches O(m*n), which can be treated as O(n²) when the arrays are of comparable size. This is not good, and it is the same complexity that you’ll find in naive sorting algorithms. It means that each element you add to the arrays increases your processing time quadratically (since you have two nested loops here). in_array is simply a shortcut for the inner loop (the inner foreach) in this example: it loops through each element of the array and checks if it matches the needle we’re searching for.

Compare this to using array_flip on the array to search first, so that the values become the keys:

$b2 = array_flip($b2);

foreach($a1 as $key => $value)
{
    if (isset($b2[$value]))
    {
        unset($a1[$key]);
    }
}

But why is isset($b2[$value]) any faster than using in_array? Doesn’t the application just have to loop through a different set of values instead (this time the keys instead of the values)? No. This is where hashing comes into the picture. As a key can be any string value in PHP, the string is hashed and resolved to an internal array id. This means that internally, the following is happening:

$arr[$id] => find location by converting $id to an internal array location (on the C-level) => simply index the array by using this value

Instead of going through all the valid keys, the $id is converted once, and then checked to see if there is any value stored at that location. This is a simplification, but explains the concept. The complexity of this conversion may depend on the length of the key (depending on the choice of hashing function), but we’ll ignore this here, and simply give it a complexity of O(1). Looking up the index in the array is also an O(1) operation (it takes the same time, regardless if we’re looking at item #3 or #4818, it’s simply reading from different locations in memory).

This means that while our outer loop is still looping through n elements, we’re now just doing a constant-time lookup in the inner loop. The amount of work done in the lookup does not depend on the number of elements in b2, which means that our algorithm instead grows in a linear fashion (O(n)).
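
To see the difference for yourself, here’s a small self-contained benchmark sketch (the array sizes and variable names are arbitrary):

<?php
// Build two overlapping arrays of a few thousand elements.
$a1 = range(0, 5000);
$b2 = range(2500, 7500);

// Naive approach: in_array() scans all of $b2 for every element of $a1.
$start = microtime(true);
$result = array();
foreach($a1 as $value)
{
    if (!in_array($value, $b2))
    {
        $result[] = $value;
    }
}
printf("in_array:           %.4f seconds\n", microtime(true) - $start);

// Hash approach: flip $b2 once, then do constant-time isset() lookups.
$start = microtime(true);
$lookup = array_flip($b2);
$result = array();
foreach($a1 as $value)
{
    if (!isset($lookup[$value]))
    {
        $result[] = $value;
    }
}
printf("array_flip + isset: %.4f seconds\n", microtime(true) - $start);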

Further reading can be found at Wikipedia: Hash function, Big O Notation. I’d also suggest reading an introductory book on algorithms and data structures. Which book depends on your skill set, but if anyone wants suggestions, just leave a comment and I’ll add a few when I get home to my bookshelf tonight.

Implementing a Duck Operator with Reflection

Following up on this post regarding a “duck operator” in PHP, I went ahead and wrote a very, very, very simple implementation using the Reflection API in PHP to get the same functionality.

<?php
    /* Example setup: an interface describing a Cow, one describing a
       Sheep, and a class that happens to have the Cow methods without
       declaring "implements" anywhere. These definitions are a minimal
       illustration; any class with moo() and eatGrass() would do. */
    interface Cow
    {
        public function moo();
        public function eatGrass();
    }

    interface Sheep
    {
        public function baa();
    }

    class MooingGrassEater
    {
        public function moo() { }
        public function eatGrass() { }
    }

    /* collect the method names exposed by a reflected class or object */
    function getMethodProperties($reflection)
    {
        $ret = array();

        foreach($reflection->getMethods() as $method)
        {
            $ret[$method->getName()] = true;
        }

        return $ret;
    }

    function it_quacks($object, $interface)
    {
        $reflectionClass = new ReflectionClass($interface);
        $reflectionObject = new ReflectionObject($object);

        $reflectionClassMethods = getMethodProperties($reflectionClass);
        $reflectionObjectMethods = getMethodProperties($reflectionObject);

        foreach($reflectionClassMethods as $methodName => $methodData)
        {
            if (empty($reflectionObjectMethods[$methodName]))
            {
                return false;
            }
        }

        return true;
    }

    if (it_quacks(new MooingGrassEater(), 'Cow'))
    {
        print("A MooingGrassEater can be seen as a Cow\n");
    }
    else
    {
        print("A MooingGrassEater has no hope of being recognized as a Cow\n");
    }

    if (it_quacks(new MooingGrassEater(), 'Sheep'))
    {
        print("A MooingGrassEater can be seen as a Sheep\n");
    }
    else
    {
        print("A MooingGrassEater has no hope of being recognized as a Sheep\n");
    }
?>

Obvious missing pieces are of course comparing the number of arguments to the methods and whether they’re optional, so that you further ensure call safety. But hey, it’s just an example implementation. Read the original linked page for more information about the concept.
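
For illustration, here’s a sketch of how such an argument check could look inside the it_quacks loop above, using ReflectionMethod::getNumberOfRequiredParameters() (this extension is mine, not part of the original post):

foreach($reflectionClassMethods as $methodName => $methodData)
{
    if (empty($reflectionObjectMethods[$methodName]))
    {
        return false;
    }

    /* stricter: the object must not require more arguments than
       the interface declares for the same method */
    $requiredByObject = $reflectionObject->getMethod($methodName)->getNumberOfRequiredParameters();
    $requiredByClass = $reflectionClass->getMethod($methodName)->getNumberOfRequiredParameters();

    if ($requiredByObject > $requiredByClass)
    {
        return false;
    }
}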

Strange Spambots

Recently I’ve noticed quite a few spambots submitting random comments on a few sites that I run, and while that’s not surprising, the content kind of is. The comments are simple, text only comments mentioning a product of some sort, together with a few random words or characters. No links. Nothing.

My current guess is that the messages may be probes to see if there is a word filter active for the words they attempt to submit, and that when they find that the comment goes through, they submit their long list of links and other interesting stuff. The problem is that the sites filter all comments that contain more than one URL and all occurrences of “[url”. That has not let a single linked comment through in two years, but now the volume of these comments is getting ridiculous. Guess I’ll have to add some new magic feature with JavaScript.

Spectator PHP Debugger and an Update on WebGrind

Stumbled upon Spectator – a PHP debugger written in XUL. Seems like a promising project, and I’m always in favor of people who actually do what they say other people should do :-)

Also worth noting is that the first releases of WebGrind are out; it seems to be a neat tool for those who need to make sense of a few cachegrind files (for example if you’re using Xdebug and its profiling functionality).

Free Flash Based File Upload Applications

When I started writing Swoooosh, the main reason was that I needed a free multiple-file upload component for a customer project (where uploading multiple files was not part of the original specification, but was added later), and I was left with a few components with dubious licenses and weird attribution requests that left you guessing. I hoped someone would instead release something under an MIT-style license (or LGPL, BSD, etc.) that would be free for all kinds of usage and could be further extended by the community.

Luckily a few alternatives have emerged since then, and Swoooosh isn’t really that relevant any longer (it was a good exercise for writing Flex and ActionScript, tho):

And yes, Christer, I’m going to implement one of these and commit it to SVN any moment now. :-)