Keeping Your Code in Check: Part 1 - DRY

October 10th, 2008

DRY - Don’t Repeat Yourself - is a central principle for everyone who wants to keep their code maintainable and in a clean and pristine state. It will save you from tedious tasks in the future, although it requires you to do a bit more thinking up front. The clue is to recognize when you’re about to commit the first programming sin: repeating parts of code.

Lets first be completely clear: not repeating code does not mean your code should be as general as possible. Doing that will only end up in a horrible mess that your only hope of selling is to label it “Enterprise” and push all the configuration options into XML. If your way to success is to craft a new buzzword and get the consulting business to sell your product, that might be how you do it. If you’re interesting in writing maintainable and solid software that other people can play around with and get the grip of quite quickly; don’t.

Keep it specialized to do what you need it do easily, but keep it general enough to allow you to reuse it for similar tasks.

The reason why I’m posting this now is unsurprisingly enough that I ran into this issue while coding a library at work. This issue is so common that any programmer will run into it several times a day, but the issue at hand today illustrates the problem perfectly. In a class that I was writing to return a set of data to the user, I decided to add three methods for sorting the data in different manners before returning it. The problem is that the data is kept in different array keys, and it’s the arrays themselves I want to sort. Example:

['displayValue' => 'Aloha Beach', 'count' => 13, 'value' => 'aloha']
['displayValue' => 'Norway', count => 3, 'value' => 'norway']
['displayValue' => 'Sweden Swapparoo', count => 7, 'value' => 'sweden']

To sort this in PHP, you’d use the usort function with a callback function / method that sorts the arrays by the value of the right key. My first version was something like:

  1. private function sortByValue($a, $b)
  2. {
  3.     if (!isset($a['value'], $b['value']))
  4.     {
  5.         if (isset($a['value']))
  6.         {
  7.             return -1;
  8.         }
  9.  
  10.         if (isset($b['value']))
  11.         {
  12.             return 1;
  13.         }
  14.    
  15.         return 0;
  16.     }
  17.  
  18.     return strcmp($a['value'], $b['value']);
  19. }
  20.  
  21. private function sortByCount($a, $b)
  22. {
  23.     if (!isset($a['count'], $b['count']))
  24.     {
  25.         if (isset($a['count']))
  26.         {
  27.             return -1;
  28.         }
  29.  
  30.         if (isset($b['count']))
  31.         {
  32.             return 1;
  33.         }
  34.    
  35.         return 0;
  36.     }
  37.  
  38.     return ($b['count'] - $a['count']);
  39. }

As I typed the second if (isset()) in the second method, I finally realized that I were sitting at work and writing the exact same function twice, with just a little twist between them in regards to which field to sort by - and how to determine the sort. One field was numeric, the other was a string. Our goal is to re-use as much as the common code from both methods without typing it up — or much more important, maintaining — it twice.

As you can see, almost the whole function is identicial, except for the key name (’value’ vs ‘count’) and how the comparison is done (numeric vs string). We need to handle these two issues to be able to use the same function for both purposes, so we rework the two functions into one function for general use for sorting on the value of a key:

  1. private function sortByArrayField($a, $b, $field)
  2. {
  3.     if (!isset($a[$field], $b[$field]))
  4.     {
  5.         if (isset($a[$field]))
  6.         {
  7.             return -1;
  8.         }
  9.  
  10.         if (isset($b[$field]))
  11.         {
  12.             return 1;
  13.         }
  14.    
  15.         return 0;
  16.     }
  17.  
  18.     if (is_numeric($a[$field]) && is_numeric($b[$field]))
  19.     {
  20.         return ($b[$field] - $a[$field]);
  21.     }
  22.  
  23.     return strcmp($a[$field], $b[$field]);
  24. }

This way we handle both the comparison method (if both values are numeric, we do the comparison as a numeric value by simply subtracting the first from the second) and the field to sort (as a third parameter).

Sadly usort does not provide that third parameter for us automagically, but by creating a few simple helper methods for using our refactored function, we get all “configuration” set in those three methods. The code base only contains one implementation of the method itself, but several ways to use it. Example of these three helper methods:

  1. private function sortByValue($a, $b)
  2. {
  3.     return $this->sortByArrayField($a, $b, 'value');
  4. }
  5.  
  6. private function sortByCount($a, $b)
  7. {
  8.     return $this->sortByArrayField($a, $b, 'count');
  9. }
  10.  
  11. private function sortByDisplayValue($a, $b)
  12. {
  13.     return $this->sortByArrayField($a, $b, 'displayValue');
  14. }

If someone now comes along and decides to add another column to our array, such as ‘price’, we can simply add another sorting callback to accomodate this:

  1. private function sortByPrice($a, $b)
  2. {
  3.     return $this->sortByArrayField($a, $b, 'price');
  4. }

The sortByArrayField method is already well proven and tried from our previous usage, and by simply changing the field we’re sorting by, we still get the power of the callback, get a completely new sort criteria and the maintainability of just one method.

The First Rule of Debugging

October 8th, 2008

Seems like there are quite a lot of ideas about what the first rule of debugging are, and a quick Google Search gives you insightful suggestions such as:

These are all valuable insights that provide clear value about what and how you could attack the problem of the Bug That Wasn’t Supposed To Be There (there are surely bugs that actually were supposed to be there, because of invalid domain specifications, etc.). I did however find one small column from DDJ that led into the same thing I’m going to write now: “Before you go to fix it, be sure you’re fixing the right thing.

I will assume that you already have uncovered that you actually have a bug — you’re not getting the results you expected, and something is to blame. You start out by trying to isolate the problem, and try to recreate the broken situation with different inputs. You try it on another system to see if there’s something about the configuration, about the dataset, about the versions of your library or something, which we’ll disover later, is completely irrelevant.

I obviously forgot this rule at work today, and that’s why I’m writing this post now. Remember kids. Always make sure you’re fixing the right thing! I’ve started porting one of the front end services we use to federate searches from a borked Java library to a PHP-based implementation instead. This involves talking to an internal search server that I hadn’t written code to interface with before, other than the existing search service. The implementation went from zero to usable in almost no time, and I now have code that provides a nice foundation for the remaining work. The problem is that I obviously decided to fix another issue with the search server that I had looked into earlier, and it seems that I thought that since I now had new and improved experience with the service, I’d be better off this time.

So I set out trying to get a new sorting scheme working for the search results. At first I failed, but then I stumbled across some documentation that threw me in the right direction. I did a few minor changes, and the sorting changed. Woohooo! Well, it almost worked. The order was not quite as expected, but the results seemed to be close enough for the proximity search to be working. It just had to be something in the documentation or implementation that I had missed! Further digging and experimentation for the greater part of an hour or so left me blank, but I managed to get another search ordering which made me think that there was something peculiar about my input parameters that created the issue.

After almost two and a half hour of this, I happened to read four lines that were neatly tucked away in a “process” part of the resulting XML document from the search service. In the diffuse light glooming from my Dell monitor, I could read the now obvious words: “<subsystem> No license.”.

Yep. It never had anything to do with the actual result. It was never even active. The change of the sorting must have happened for some other reason (such as removing the default values or anything like that). I weren’t supposed to debug why the sorting was incorrect, I was supposed to find out why the subsystem didn’t load.

Grrr.

Mats’ (borrowed) first rule of debugging: Make sure you debug the right problem. Do not make assumptions.

(and a rule of documentation: if the feature may not be available in the installation, write how you can check this before actually giving examples and detailing how to use the feature)

Post-commit hook failed with error output: .. and nothing more.

September 27th, 2008

While getting the trac-svn integration up and running in one of our repositories tonight, I stumbled across this issue. Most links I found on Google said to try to run the post-commit script by itself under the same user, which worked just fine. Checked that all my paths were absolute paths to avoid assuming any CWD, but still nothing. After trying to minimize the problem I discovered that even with a post-commit-script of just environmental assignments, things were still failing. This nabble archived thread did however give a very important hint by Ryan Schmidt (who obviously has solved the post-commit problems for every single developer out there, if I were to judge by the number of times his name comes up. Awesome work, Ryan.):

Your post-commit must also begin with a line like “#!/bin/sh” and have its executable bit set.

Wow. Homer Simpson “Doh!”-moment right there. No #!/bin/bash (or #!/bin/sh) ment that SVN weren’t able to run the script with the proper interpreter. When running it from the command line, bash were already running of course, so it just assumed that it was a bash script .. and it were right.

Thanks Ryan, saved me a couple of hours tonight.

More on Domain Specific Languages

September 3rd, 2008

My previous post on the power of micro languages peaked quite a bit of interest. Today Gamasutra published a feature article documenting the creation of Whimsy, a domain specific language developed to create images resembling the art of Rodney Alan Greenblat. The language is actually quite similar to what I used myself:

superegg 0.15,0.10,3.5 at .3,.7 size 1.2 black distort .01
petals 14 0.05 size 1.8 petalblue
inner .88,.01 tvpurple

The statements are one on each single line, with the command / drawing primitive as the first literal, then parameters to the command.

Unit-Testing Code Which Uses a Database

August 20th, 2008

How to unit-test code that interacts with a database appeared on the blog of Baron Schwartz, and to be really boring, I agree with what he’s writing. Unit-testing database connectivity and storage is not hard. If it is, it might be a good time to redo that architecture you’ve been talking about.

An important point that Baron mentions is that you _NEVER_ _EVER_ run your tests on your production servers. That will of course be disastrous, as your tests needs a predefined state of the database to be valid for testing. The solution I’ve been using to handle this, is to always set up my environment to use another database when doing the tests. This way, you’ll never end up with running the tests on a live database by accident. I handle this in my AllTests.php file, where the test suites and shared fixtures are set up. We dump the contents of the developer database (databasename), create a new database (databasename_test) and insert all the current table structures and indexes. This way we get an accurate copy of the table definitions currently defined by the developer (so that we don’t run the tests against an old set of tables), and we test that the code works as it should with the active definitions.

The simplest way to do this, is to use mysqldump and mysql through a call to exec. If you’re not in a trusted environment, please, please, please add the appropriate shell argument escape commands. It can however be argued that if you’re allowing random people to change your database login information, you probably have bigger problems than doing unit testing..

  1. exec('mysqldump -u ' . $username . ' -p' . $password . ' ' . $dbname . ' | mysql -u ' . $username . ' -p' . $password . ' ' . $database . '_test');

It would be very interesting to get more information about which measures Baron advocates for detecting a production system. We have configuration settings for our applications which also defines if this is a development or production system, in addition to the fact that our testing code only touches databases which end in _test.

The Results of our Weekend Challenge

August 13th, 2008

The weekend PHP size competiton I mentioned on friday has come to an end, with the results being as follows:

      CNU (253 bytes)
      Helge (261 bytes)
      Ymgve (267 bytes)
      Me (365 bytes)
      dibon&zep (which never delivered, but had working solutions)

For those who are interested in the strategies and tactics employed by the contestants, I’ve included a small write-up and analysis of the various contributions. We decided early on that using eval() and gzinflate on the content were something that everyone could apply, so the size would be counted for the decompressed code. If any contestant implemented their own decompressor in the user space code themselves, we would accept that. It would not be any advantage anyways, as the decompressor code would eat up more space than the best contributions used in total.

I’ve included the writers own comments if they had any. The biggest change in size happened when people started using one dimensional strings instead of arrays.

CNU’s contribution

  1. $b='';foreach(file('maps.txt',2)as$l){if(!$l){$q[1]=strpos($b,'.');for($b=str_split($b);$p=&$q[++$l];)foreach(array($p-$w,$p+$w,$p-1,$p+1)as$x){$c=$b[$p]+1;$d=&$b[$x];if($d=='X')echo$c.'
  2. ';if($d==' ')$q[]=$x;$d=$c;}$q=$b='';}$b.=$l;$w=strlen($l);}

Helge’s contribution

  1. $m="";foreach(file("maps.txt")as$l){if($l!="
  2. "&&($m.=$l)+$w=strlen($l))continue;for($y=$d[$v[]=strpos($m,'.')]=0;strpos($m,'X')-$u=$v[$y++];)for($x=-$w;$x<=$w;$x+=$x+1?$w-1:2)in_array($n=$u+$x,$v)|$m[$n]=="#"||$d[$v[]=$n]=$d[$u]+1;echo"$d[$u]
  3. ";$m=$v="";}

Helge also included an annotated version with comments in norwegian:

  1. /**
  2.  * $l = én linje (av maps.txt)
  3.  * $m = ett map
  4.  * $w = vidden til et kart
  5.  * $d = distansen til et vilkårlig punkt
  6.  * $v = open/closed-array (putter både squares
  7.         of interest og visited squares her)
  8.  * $u = "current square", utgangspunktet til å
  9.         generere naboer
  10.  * $n = nabo
  11.  * $x = hjelpevariabel for å generere $n
  12.  * $y = hjelpevariabel for å generere $u
  13.  */
  14.  
  15.  
  16. /* Klarte aldri å finne en måte å skippe initialisering av
  17.    $m her.. (den resettes på siste linja og.. føles waste)
  18.    * En idé var å bruke to for(;;)'s istedenfor en foreach
  19.    og continue, men jeg tror jeg hadde økt i kode da.. */
  20. $m="";
  21.  
  22. /* Foreach var nyttig fant jeg ut.. den stopper når filen er
  23.    ferdig lest av seg selv, så slipper kode på sjekking av det,
  24.    $f[$y++], osv. */
  25. foreach(file("maps.txt") as $l) {
  26.     if(
  27.         /* Når $l=="\n" (som betyr slutten på et kart) skal
  28.            vi slutte å appende til $m og heller breake if'en
  29.            så vi kan starte pathfinding */
  30.         $l != "\n" &&
  31.  
  32.         /* Bruker en 1-dimensjonal string istedenfor array
  33.            (= win!) */
  34.         ($m .= $l) +
  35.  
  36.         /* Trenger vidden når jeg skal regne ut naboene til et
  37.            gitt felt. */
  38.         $w = strlen($l)
  39.     )
  40.         /* Gjør så $l blir feedet en ny linje av kartet vårt til
  41.            kartet er ferdig */
  42.         continue;
  43.  
  44.     /* Pathfinding start! */
  45.     for(
  46.         /* Setter $y og $d[starten] lik 0, og putter startposisjon
  47.            inn i $v i samma slengen. */
  48.         $y = $d[ $v[] = strpos($m, '.') ] = 0;
  49.  
  50.         /**
  51.          * Gjør så for(;;) breaker når sluttposisjon-$u blir 0
  52.          * (synonymt med $u == strpos($m, 'X');
  53.          * Tar også neste $u fra $v i samma slengen */
  54.         strpos($m, 'X') - $u = $v[ $y++ ];
  55.        
  56.     )
  57.         /* For hver nabo */
  58.         for(
  59.             /* Genererer naboer.
  60.              * Den starter på -width (som blir opp), går så til -1
  61.                (som blir venstre), 1 (som blir høyre) og tilslutt
  62.                width (som blir ned) */
  63.             $x = -$w;
  64.             $x <= $w;
  65.             $x += $x+1 ? $w-1 : 2
  66.         )
  67.             /* Om naboen (summen av $u og $x) er i $v: continue */
  68.             in_array($n = $u+$x, $v) | /* fancy bitwise or som kan
  69.                                           brukes fordi det er snakk
  70.                                           om 2 booleans */
  71.  
  72.             /* Om naboen er en '#' på kartet: continue */
  73.             $m[$n] == "#" ||
  74.  
  75.             /* Hvis ikke, sett distansen til naboen lik distansen
  76.                til $u + 1 og putt naboen inn i $v */
  77.             $d[ $v[] = $n ] = $d[$u]+1;
  78.  
  79.     /* Skriv ut distansen til $u (skal være sluttpunktet nå) */
  80.     echo "$d[$u]\n";
  81.  
  82.     /* Disse må resettes.. */
  83.     $m=$v="";
  84. }

Ymgve’s contribution

  1. foreach(file("maps.txt")as$l){for($i=0;-75<$e=ord($l[$i++])-88;$m[]=$e+56?$e:1e6,$z=array(-1,1,-$i,$i))$e+42||$q[0]=count($m);if($i<3){for($d=$k=$c=0;;$c++-$k||$i++&$k=$d)for($j=4;$j;$g>$i&&($g=$i)&$q[++$d]=$p)if(!$g=&$m[$p=$q[$c]+$z[$j]])break 2;echo"$i
  2. ";}}

My own contribution

Breadth-first Search, Delivered Version, 365 bytes:

  1. <?php $y=0;foreach(file('maps.txt')as$l)if("
  2. "==$l){for(;$q&&list($y,$x,$d)=array_shift($q);)if(!empty($m[$y][$x])){$k=&$m[$y][$x];echo$k=='X'?"$d
  3. ":'';$k=0;$q[]=array($y,$x-1,++$d);$q[]=array($y-1,$x,$d);$q[]=array($y,$x+1,$d);$q[]=array($y+1,$x,$d);}$m=$y=null;}else{for($x=0;$l[$x++]!="
  4. ";$m[$y][$x]=$l[$x]=="#"?0:$l[$x])if($l[$x]=='.')$q[]=array($y,$x,0);$y++;}

Prettyprinted:

  1. <?php
  2. $y=0;
  3. foreach(file('maps.txt')as$l)
  4. if("
  5. "==$l)
  6. {
  7.     for(;$q&&list($y,$x,$d)=array_shift($q);)
  8.         if(!empty($m[$y][$x]))
  9.         {
  10.             $k=&$m[$y][$x];
  11.             echo$k=='X'?"$d
  12. ":'';
  13.             $k=0;
  14.             $q[]=array($y,$x-1,++$d);
  15.             $q[]=array($y-1,$x,$d);
  16.             $q[]=array($y,$x+1,$d);
  17.             $q[]=array($y+1,$x,$d);
  18.         }
  19.        
  20.         $m=$y=null;
  21. }
  22. else
  23. {
  24.     for($x=0;$l[$x++]!="
  25. ";$m[$y][$x]=$l[$x]=="#"?0:$l[$x])
  26.         if($l[$x]=='.')$q[]=array($y,$x,0);
  27.     $y++;
  28. }

This is mostly your usual run-of-the-mill BFS, where the search itself is implemented as a simple queue.

We loop through all the lines in the file (observe that you can remove the space character on both sides of as, saving you two bytes!):

  1. foreach(file('maps.txt')as$l)

The part within the else condition builds the map to search for the exit point:

  1.     // for each line in the file, we do this
  2.     for($x=0;$l[$x++]!="
  3. ";$m[$y][$x]=$l[$x]=="#"?0:$l[$x])
  4.         if($l[$x]=='.')$q[]=array($y,$x,0);

The for() loops through each character on the line, until it hits the newline marker at the end. Instead of entering “\n”, you simply use the actual newline. This way you save another byte. The map ($m) is then populated using the current y,x coordinate (y being the line, x being the character on that line).

$m[$y][$x]=$l[$x]==”#”?0:$l[$x] can be expanded to:

  1. if ($l[$x] == "#")
  2. {
  3.     $m[$y][$x] = 0;
  4. }
  5. else
  6. {
  7.     $m[$y][$x] = $l[$x];
  8. }

Meaning that if the character is a wall, we store the value zero; if any other value, we store the character itself. We’ll use this attribute when searching to determine when we’ve reached our goal. All of this is kept inside the for()-construct itself. The only thing contained in the for()-loop, is the statement if($l[$x]==’.')$q[]=array($y,$x,0);. This checks if the current character is the starting point, and if that’s the case, adds it to our queue of points to check (making it the starting point of our BFS).

When we hit an empty line in the file (through the if(”
“==$l) check), we execute our BFS.

  1.     /* while $q is valid (this handles empty labyrinths), we fetch the
  2.         current y and x coordinates, in addition to d, the distance to
  3.         the location we're fetching from the queue. No {} here, as the
  4.         for only contains the if() statement below. */
  5.     for(;$q&&list($y,$x,$d)=array_shift($q);)
  6.  
  7.         /* if this spot has a value (remember that we assigned 0 to walls,
  8.             while we kept the character value for other fields when we
  9.             parsed the file. We use empty() as the coordinates may not be
  10.             valid indexes into the array, as we don't do any checks when
  11.             we add them. one call to empty() is better than four ifs and
  12.             duplicate array references). */
  13.         if(!empty($m[$y][$x]))
  14.         {
  15.             // create a reference (saves one byte instead of referencing $m[$y][$x] twice.
  16.             $k=&$m[$y][$x];
  17.  
  18.             /* if we've found the X, echo the distance to the X. Under any
  19.                 under circumstance, you'd break; out of the for/while here,
  20.                 but as bytes count more than speed, we save six bytes by
  21.                 searching the whole labyrinth instead of ending early. Funny
  22.                 detail: as echo is a language construct, you can skip the space.
  23.                 If the character is not 'X', we simply echo an empty string
  24.                 instead of using a conditional echo. Saves a couple of bytes. */
  25.             echo$k=='X'?"$d
  26. ":'';
  27.  
  28.             // unset the value, so that we don't visit this node again.
  29.             $k=0;
  30.  
  31.             /* add all neighbours to our queue. We pre-increment $d, instead
  32.                 of adding +1 in all the adds. */
  33.             $q[]=array($y,$x-1,++$d);
  34.             $q[]=array($y-1,$x,$d);
  35.             $q[]=array($y,$x+1,$d);
  36.             $q[]=array($y+1,$x,$d);
  37.         }
  38.        
  39.         /* Reset the values. PHP generates a warning if a value has never
  40.             been set before when used, but will happily accept null values
  41.             as empty arrays or 0. The double assignment is also a trick worth
  42.             remembering to save bytes. */
  43.         $m=$y=null;

Some of tricks used in the competition

  • Multi-variable assignment: $a=$b=0;
  • Actual newline when detecting a newline: if ($a==”
    “){}
  • Using null to be able to reset both integers and arrays in one statement.
  • The more code you can fit into a for() statement, the better. You get three “free” line endings by doing this. for(;;) is the same length as while(), but you can get several statements evaluated.
  • You can skip the spaces in foreach().
  • Create references to avoid using long array indices several times.
  • The ternary operator (?:) is your friend!
  • Remember that {}’s can be dropped when you have a single statement that contains all subsequent code (as long as you don’t get any conflicts with else etc. on different levels):
    for(;;) <– no { here.
    if ()
    {

    }

  • Simulating arrays with strings

The scripts were run with a set of pre-made labyrinths and a collection of 20 random labyrinths (more like maps, but they still test the code the same). The script for creating the labyrinths is available here. All delivered contributions passed all tests.

Finding Your Way Around for the Weekend

August 8th, 2008

If you’re interested in a bit of a weekend challenge, we’re currently hosting a small PHP coding competition at one of the IRC channels I’m active in. The challenge’s deadline is this sunday (10th of August, 2008) at midnight (if you’re in another timezone, just use your own sunday’s midnight), but the challenge should always prove interesting later too.

If you’re interested in trying to solve labyrinths with the smallest amount of source code, this might be for you!

Resolutions Be Gone!

July 22nd, 2008

(Haha! Thanks to the wonderful internet, I’m on vacation .. and still posting things in my blog .. that I wrote BEFORE we left! How’s that for TIME TRAVEL???)

We should really stop using the resolution information in our statistics scripts for anything. It’s nothing other than curiousity, and it should never be used for making any decisions regarding how a page should be structured. Most people probably stopped reading halfway through and thought that I’m going completely nutters. What do you mean? But .. how can we know how many pixels we can use for our design?

You’re out of luck! Hopefully I’m not singing Lady Madonna and strolling around somewhere in Europe wearing a pink tubetop just yet, and I still have my sanity. Instead of basing our decisions on resolution, we should start paying more attention to actual browser window size. This is interesting from a usability perspective, as it considers the size that we actually can use without demanding that our users start resizing their active browsing window. The desktops I use actively are usually running in 1920×1200, but my browser window is considerably smaller. If a page suddenly would require me to use 1920×1200, I’d be annoyed (and yes, that includes those of you who still think your content is SOOOOO important you should open up windows in fullscreen mode, and still only use 50% of it for content..), even though I actually have a resolution that supports that kind of content.

Since this means that we’d see a lot more sizes than the regular 1024×768, 1280×1024, etc, we’d simply create intervals of “acceptable sizes” instead. We’d have (640-800), (801-980), (981-1099) instead, and we’d be able to give an estimate of how much content we could actually show a certain percentage of our users within their first view of our information. Even though people have 1024×768 as their resolution, we have no idea about how much content they actually see. I have no idea if that means that the majority of people see 500 pixels of content, 300 pixels of content, etc (and that would of course also change based on font size, but .. stay on track with me here), which is actually the useful information to base your decisions on.

So, anyone with me? How about support for this metric in Google Analytics? Anyone know of any existing statistics project that has added support for the actual viewport instead of the screen resolution?

PHP, ImageMagick and Cropping to GIF: Digging into GIFs again!

July 15th, 2008

Christer had an interesting case today, where he tried to resize and crop an image with the Imagick extension for PHP. Everything went as planned, the image was cropped and resized at it should be, but after writing it to disk and opening it again, the image’s size was the same as if he hadn’t done the crop. The content of the image outside the crop area was removed (simply set as transparent), but the image was still returned in it’s uncropped size.

The PHP module for binding ImageMagick is quite simple (simply marshalling between the ImageMagick methods and the PHP user space), so my guess is that this is a weird behaviour with a good enough reason somewhere down in ImageMagick. It might be a bug, but I haven’t had the time to attempt to reproduce it with convert or mogrify yet. If anyone wants to attempt that, feel free. Christer has posted the code, so simply attempt to recreate the same symptoms by using one of these two tools.

Anyways, this post was not to be about the issue itself, as Christer has done a neat analysis and write-up of that, but I’ll give a more detailed look at the issue within the GIF file itself. As chance would have it, I recently participated in a competition at the norwegian demoscene IRC hangout where the goal was to recreate the norwegian flag in an HTML page in the smallest space possible. This ended up being a competition to see who could molest and optimize GIF images the most, while browsers still were able to display them. From this experience I had a quite good knowledge of how GIF files are built internally, and I were able to do a good guess of what could be the actual issue in the resulting file.

Since GIF files can be animated, a single file may contain several “images” (which would be the frames in the animation). These images can have their own size and position within the “larger image”:

 _________
| im1     |
|    _____|
|   |     |
|   | im2 |
|___|_____|

im1 may then represent the first image and im2 the second image in the file. The second image will only update the area that it covers, and this will leave the rest of the image “as it is”. Since a GIF image may contain a large number of these images, a “global” size is defined for the image. This global size covers all the images, and is the total area that these images will be drawn into. If an image is drawn outside of this area (in part or whole), it will be clipped against the viewport.

This should provide enough background to at least give a general feeling about what COULD be the problem here, but to actually find out what’s happening, we’ll dig into a GIF file format specification and the file that was created. This simple reference provides a general layout of the GIF file, and we’ll use that to take a look at what values the file we ended up with had:

On the left we have the actual byte values in hex and on the right we have the corresponding ASCII character represented by that value. As you can see, the first six bytes of the file (0×47 0×49 0×46 0×38 0×39 0×61 (0x is the general way of prefixing numbers that should be interpreted as hexadecimal)) corresponds to “GIF89a” (You can do this exercise yourself armed with this Ascii Table. Simply look up 47 in the Hx column, then 49, etc). Those six bytes are what we call the signature of a GIF file (although the number can be different, i.e. GIF87a, depending on the version used).

The next fields in the specification reads:

Offset   Length   Contents
  6      2 bytes  
  8      2 bytes  

So byte 6-7 and byte 8-9 should tell us the logical size of the whole gif file (which the images will be drawn onto). In our test file here, that’s represented as:

Width: 0x67 0x01
Height: 0x70 0x00

The byte order here is Little Endian, which means that the least important values are placed first. Since we have two bytes for each value, we can calculate the decimal value of the width by multiplying:

0x67 0x01 = 6 * 16 + 7 + (0 * 16 + 1) * 256 = 359
                                        ^-- Since we're in the next byte, we multiply with 256.

You can also do this with the windows calculator, by entering 167 while being in hexmode, then selecting dec (for decimal). The reason for multiplying the second byte with 256 is that this byte provides the value of the “next 8 bits”, while the first provided the value for the first 8 bits. If we see the bits themselves:

0x70 | 0x01: 0111 0000 | 0000 0001

Little Endian says that the least significant bits come first, so to get the raw bit values, we turn it around:

0000 0001 0111 0000

As you can see, the value of the second byte (0×01) can be multiplied with 256 (which is the last 8 bits).

We can also calculate the height:

0x70 0x00 = 7 * 16 + 0 + (0 * 16 + 0) * 256 = 112
                          ^-- both numbers in the second byte is zero

Alas, the global header of the GIF image that were generated says that the size of the image is 359×112, which is why the image is rendered larger than it should have been. We then take a look at the Image section of the GIF file (all GIF files should contain at least one), which is defined as:

Offset   Length   Contents
  0      1 byte   Image Separator (0x2c)
  1      2 bytes  Image Left Position
  3      2 bytes  Image Top Position
  5      2 bytes  Image Width
  7      2 bytes  Image Height

Armed with this information, we examine the area where the image section starts:

The start of the Image section is the “Image Separator”, a byte value of 0×2c, shown highlighted in the image above. This is where the image section starts, and the offsets in the table is relative to this location. The next four bytes tells us where in the global viewport the upper left corner of this image should be drawn. The values here are 0×01 0×00 twice, simply meaning (1,1), or one pixel down and out from the upper left corner (which is also related to the issue posted by Christer, but we ignore that one here now). The next values are however those we are interested in, which provides Image Width and Image Height:

Width:
0x73 0x00 = 7 * 16 + 3 + (0 * 16 + 0) * 256 = 115
Height:
0x6F 0x00 = 6 * 16 + 15 + (0 * 16 + 0) * 256 = 111

This means that the dimension of the image that’s actually supplied in the GIF file, is 115×111 pixels and should be drawn beginning one pixel down and one pixel out (as given by 0×01 0×00 in the x,y-fields above). Compare this to the reported global size of the image (359×112), and we can see where our transparent space is coming from. The browsers (and other image viewers) create a canvas the size of 359×112 pixels, while only drawing an image into the 115 leftmost pixels. The rest is left transparent, but they’re still there as the file says that’s the size of the viewport. If we manually change the size of the viewport to 0×74 0×00 in the GIF header itself, the image displays properly. To illustrate with another great ascii drawing:


               viewport
 _____________________________________
|           |                         |
|  actual   |                         |
|  image    |                         |
|  drawn    |                         |
|           |                         |
|           |                         |
|           |                         |
|___________|_________________________|

The solution to the problem here were to call the setImagePage method of the image object, as that allows us to set the values for the global image ourselves (and we know how wide the image were supposed to be).

Bonus knowledge: This issue did not occur when saving to a JPEG file, as JPEG files does not have the same capability of storing several subimages inside one file, and does not have the same rendering subsystem as GIF files. ImageMagick knows this, and does not use the page-values when rendering the file.

Hopefully this has provided a minor introduction into how files are structured, what you can learn armed with a hex editor and a file format specification and provided a few insights into what you can do when you’re faced with a very weird problem.

Google Releases Their Protocol Buffers

July 8th, 2008

Fresh from the Google Open Source Blog comes news that Google has released their Protocol Buffers specification and accompanying libraries. The code and specification has been release at Protocol Buffers on Google Code.

Protocol Buffers is a data format for fast exchange and parsing of data and messages between computers. It is similar to simple uses of XML in this manner, but the messages size on the wire and their parsing time is very much optimized for busy sites. There is no need to spend loads of time doing XML parsing when you instead could do something useful. It’s very easy to interact with the messages through the generated classes (for C++, Java and Python), and future versions of the same schema are compatible with old versions (as new fields are just ignored by older parsers).

Still no PHP implementation available, so guess it’s time to get going and lay down some code during the summer. Anyone up for the job?