Going Hex Hunting in ZIP Files

It’s been a couple of weeks since I last had to go hex hunting, but fear not – we’re back in action! We discovered this issue after adding support for uploading ZIP files to the platform powering Gamer.no. All our test cases worked perfectly, but as soon as we released the feature, our users discovered that ZIP files from one particular site never worked. WinRAR and 7zip were both able to extract the files, so we left the issue at that for the time being.

This evening I finally got around to trying to find out exactly what was going on. I started by updating my SVN trunk checkout of PHP6, adding the zip module to my configure and running make clean and make. The issue was still present in HEAD, so that meant getting my hands dirty and trying to find out exactly what we were up against.

The first thing on the table was to find out what the zip module was getting all antsy about when trying to open the file. A few well-placed printfs (all hail the magic of printf debugging) told me that the check this particular ZIP file failed was:

if ((comlen < cd->comment_len) || (cd->nentry != i)) { 

The cd struct is the “end of central directory” struct, containing metadata about the archive and an optional comment about the ZIP file. Both comlen and cd->comment_len were zero (as the file didn’t have a comment), so the troublemaker was the cd->nentry != i comparison. The library that the PHP module uses reads i by itself and then reads the struct. nentry is the number of files in “this disk” of the archive – a ZIP archive can be split across several physical disks (think back to 8″, 5.25″ and 3.5″ floppies) – while i is the total number of files in the archive. The library only supports single-disk archives (not archives spanning several disks), so these two values should be identical. For some reason they weren’t, meaning that the ZIP files generated by this particular site are actually invalid ZIP files. WinRAR and 7zip just make the best of the situation, and do so very nicely.

Here’s the hex dump of the end of central directory section from one of the non-working files:

Hex dump of the end of central directory section from a ZIP file

The first four bytes are the signature of the section (50 4B 05 06, or PK\5\6). Then follow two bytes with the “number of this disk” (00 00 here) and two bytes with the “number of the disk with the start of the central directory” (00 00 here again; the library in PHP doesn’t support archives spanning multiple disks, so it simply compares these four bytes to 00 00 00 00). Then we get to our magic numbers, two bytes each: the “total number of entries in the central directory on this disk” and the “total number of entries in the central directory”. These should logically be the same for single-disk archives, but here they’re 6 and 3 (06 00 and 03 00, little-endian) – and the correct number of files in the archive is 3.
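If you want to poke at these fields yourself, here’s a small sketch (my own, not part of ext/zip) that locates the record and dumps the values described above using unpack() – the file name is just an example:

$data = file_get_contents('broken.zip');
$startpos = strrpos($data, "PK\5\6");

if ($startpos !== false)
{
    // the 18 bytes after the signature hold the fields described above
    // (little-endian: v = 16-bit value, V = 32-bit value)
    print_r(unpack(
        'vdiskNumber/vcdStartDisk/ventriesThisDisk/ventriesTotal/VcdSize/VcdOffset/vcommentLength',
        substr($data, $startpos + 4, 18)
    ));
}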

The solution for us is to use the correct number (the “total number of entries in the central directory”) for both values. To do this we simply patch the two bytes of the binary ZIP file and write the result back (we could do this in place with fopen, fseek and fwrite, but we’re lazy, and this is not time-sensitive code):

    /**
     * Attempt to repair the ZIP file by correcting a wrong "total number of
     * entries on this disk" value in the end of central directory record.
     */
    public static function repairZIPFile($file)
    {
        // read the whole file into memory
        $data = file_get_contents($file);

        // try to find the end of central directory record (it should be
        // the last thing in the file, so we search from the end)
        $startpos = strrpos($data, "PK\5\6");

        // strrpos() returns false if the signature isn't found, and 0 is a
        // valid offset for an empty archive, so compare strictly
        if ($startpos !== false)
        {
            // repair the file by copying the "total files" value into the
            // "files on this disk" field - PHP's zip module doesn't handle
            // multi-disk archives anyway
            $data[$startpos + 8] = $data[$startpos + 10];
            $data[$startpos + 9] = $data[$startpos + 11];

            file_put_contents($file, $data);
        }
    }
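For the record, here’s roughly how we call it – run the repair before handing the file over to ext/zip (the ZipHelper class name and the paths are just for illustration):

ZipHelper::repairZIPFile('/tmp/upload.zip');

$zip = new ZipArchive();
if ($zip->open('/tmp/upload.zip') === true)
{
    $zip->extractTo('/tmp/extracted');
    $zip->close();
}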

And voilà – we’re a happy bunch again!

Supporting 2-pass Parallel Encoding with x264 and ffmpeg

If you’re doing several encodes of a single input file (to encode several different size / bitrate combinations) in parallel with x264, you’re going to have a problem. The first pass creates three files with information for the second pass, and you’re unable to change their file names into something better. Judging by a Google search for the issue this is a problem for quite a lot of people, and no one seems to have a proper solution.

I have one. Well, probably not a proper solution, but at least it works! The trick is to realize that ffmpeg/x264 creates these files in the current working directory. To run several encodings in parallel, you simply have to give each encoding process its own directory, and then use absolute paths to the source and destination files (and any other paths). Let it create the files there, then clean up and delete the directories afterwards.

I’ve included some example code in PHP showing how you could solve something like this. I simply use the output file name as the directory name here, and create the directory in the system temp directory.

$tempDir = sys_get_temp_dir() . '/' . $outputFilename;
mkdir($tempDir, 0700, true);
chdir($tempDir);

After doing the encode, we’ll have to clean up. The three files that ffmpeg/x264 creates are ffmpeg2pass-0.log, x264_2pass.log and x264_2pass.log.mbtree.

unlink($tempDir . '/ffmpeg2pass-0.log');
unlink($tempDir . '/x264_2pass.log');
unlink($tempDir . '/x264_2pass.log.mbtree');
rmdir($tempDir);
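Putting it all together, a small wrapper could look something like this (a sketch – the function name is my own, and the actual pass commands are whatever your encoding pipeline builds, using absolute paths):

function encodeWithOwnWorkDir($outputFilename, array $passCommands)
{
    $tempDir = sys_get_temp_dir() . '/' . basename($outputFilename);
    mkdir($tempDir, 0700, true);

    // run the passes from inside the private directory, so the stats
    // files end up here instead of colliding with other encodes
    $oldDir = getcwd();
    chdir($tempDir);

    foreach ($passCommands as $command)
    {
        exec($command);
    }

    chdir($oldDir);

    // remove the stats files and then the directory itself
    foreach (array('ffmpeg2pass-0.log', 'x264_2pass.log', 'x264_2pass.log.mbtree') as $logFile)
    {
        if (file_exists($tempDir . '/' . $logFile))
        {
            unlink($tempDir . '/' . $logFile);
        }
    }

    rmdir($tempDir);
}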

And that should hopefully solve it!

Patching The PHP Gearman Extension

Update: it seems that this behaviour in libgearman changed from 0.8 to 0.10, and according to Eric Day (eday), the behaviour will change back to the old one with 0.11.

After upgrading to the most recent versions of the Gearman extension for PHP and libgearman, Suhosin started complaining about a heap overwrite problem. The error only popped up for certain response sizes, which made me guess that it could be a buffer overrun or something strange going on in the code handling the response.

Seeing this as an excellent opportunity to get more familiar with the Gearman code, I dug into the whole shebang yesterday and continued my quest for cleansing today. After quite a few hours of getting to know the code and attempting to understand the general flow, I was finally able to find – and fix – the problem.

The first symptom of the issue was that the Gearman extension at certain times failed to return the complete response from the gearman server. I created a small application that returned responses of different sizes, showing that the problem was all over the place. While a response of n bytes worked, n+1 returned only n bytes, and n+2 resulted in a heap overflow.
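The test application was simple enough – roughly something like this, with a worker that echoes back as many bytes as the client asks for (the function name and the sizes below are arbitrary; I stepped through a whole range of values):

// worker.php
function makeResponse($job)
{
    return str_repeat('x', (int) $job->workload());
}

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('makeResponse', 'makeResponse');

while ($worker->work());

// client.php
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

foreach (array(1023, 1024, 1025) as $size)
{
    $response = $client->do('makeResponse', (string) $size);
    printf("asked for %d bytes, got %d\n", $size, strlen($response));
}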

The issue was caused by an invalid efree, where the code in question was:

void _php_task_free(gearman_task_st *task, void *context) {
    gearman_task_obj *obj = (gearman_task_obj *)context;
    TSRMLS_FETCH();

    if (obj->flags & GEARMAN_TASK_OBJ_DEAD) {
        GEARMAN_ZVAL_DONE(obj->zdata)
        GEARMAN_ZVAL_DONE(obj->zworkload)
        efree(obj);
    }
    else {
        obj->flags &= ~GEARMAN_TASK_OBJ_CREATED;
    }
}

This seems innocent enough, and I really had trouble seeing how it could lead to the observed behaviour. That meant going on a wild goose chase around the Gearman code, trying to piece together how things worked. After a few proper debug rounds, I finally discovered the issue: in certain cases, the context variable was not a gearman_task_obj struct at all. The gearman_task_obj struct is allocated by php_gearman and then assigned to the task in question, making it possible for the extension to tag an internal structure onto the task in libgearman. Under certain conditions this struct is not created, and by default libgearman assigns the client struct to the context instead (this is also available as task->client). So instead of the gearman_task_obj that was assumed to be present, we actually got a gearman_client struct.

That explains why things went sour, but why exactly did I see the behaviour I saw? To answer that, we’ll have to take a look at the actual contents of the structs. The client struct contains a value keeping the number of bytes in the response, while the task_obj struct keeps the flags (which is what the code above checks and updates). Coincidentally these two int values are aligned similarly in the two structs – resulting in the number of bytes in the response being used as the flags value. Depending on that value, the code either cleared a flag bit or attempted a free using other offsets into the struct. The call to efree() would then use whatever values happened to line up with the locations in task_obj, resulting in heap corruption. Suhosin caught it, while it would probably have produced a few weird bugs (where the last byte would go missing) under an unprotected PHP installation. +1 for Suhosin!

The patch for php_gearman.c is available, and should be applied against 0.6.0. Although I’ve had a few looks around the code, the patch might still introduce a memory leak. People who know the code way better than I do will probably commit a better patch, and the issue should be fixed in 0.7.0 of the extension.

Porting SOLR Token Filter from Lucene 2.4.x to Lucene 2.9.x

I had trouble getting our current token filter to work after recompiling against the nightly builds of SOLR, which seemed to stem from the recently adopted upgrade to Lucene 2.9.0 (not released yet, but SOLR nightly is bleeding edge!). There’s functionality added for backwards compatibility, and while that might have worked, things didn’t really come together as they should have (somewhere or other). So I decided to port our filter over to the new model, where incrementToken() is the New Way ™ of doing stuff. Helped by the current lowercase filter in the SVN trunk of Lucene, I made it all the way through.

Our old code:

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
    }

    public Token next() throws IOException
    {
        return parseToken(this.input.next());
    }
 
    public Token next(Token result) throws IOException
    {
        return parseToken(this.input.next());
    }

Compiling this with Lucene 2.9.0 gave me a new warning:

Note: .. uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

To the internet mobile!

Turns out next() and next(Token) have been deprecated in the new TokenStream implementation, and the New True Way is to use the incrementToken() method instead.

Our new code:

    private TermAttribute termAtt;

    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException
    {
        if (this.input.incrementToken())
        {
            termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength()));
            return true;
        }
        
        return false;
    }

A few gotchas along the way: incrementToken() needs to be called on the input token stream, not on the filter itself (super.incrementToken() will give you a stack overflow). This moves the token stream one step forward. We also decided to move the buffer handling into the parse function, and to remember to include the length of the “live” part of the buffer (the buffer will be larger, but only the content up to termLength is valid).

The return value from our parseBuffer function is the actual amount of usable data in the buffer after we’ve had our way with it. The concept is to modify the buffer in place, so that we avoid allocating or deallocating memory.

Hopefully this will help other people with the same problem!

Fatal error: Undefined class constant ‘ATTR_DEFAULT_FETCH_MODE’

This is one of the common error messages that seems to appear after installing PHP – in particular under Ubuntu or Debian (where I experienced it). The reason for this is that the PDO version you’ve just installed is too old for the constant to exist, usually because you were naive enough to install the extension from PECL instead of using the default supplied by Ubuntu. If you did the same as me:

apt-get install 
pecl install pdo
pecl install pdo_mysql

/etc/init.d/apache2 restart

And everything seems to work, except for that missing constant. What the fsck?!

The reason is that the PECL version of PDO is no longer maintained (I’d suggest automagically pushing the newest version to PECL too, just so pecl install and pecl upgrade work as expected). The fact is that when you did pecl install pdo, you destroyed the more recent version provided by the default php5-common package (under Ubuntu, at least).
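If you’re unsure which PDO build is actually loaded, phpversion() takes an extension name – comparing what it reports before and after the reinstall below is an easy way to confirm the swap took:

php -r 'echo phpversion("pdo"), "\n";'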

To solve the problem, reinstall the packages (and if you also did pecl install pdo_mysql, you’ll have to replace that one too…):

apt-get install --reinstall php5-common php5-mysql php5-mysqli

Restart. Relive. Re.. eh .. yeah. Rerere.

If you’re building from source, you’ll need to add:

--enable-pdo --with-pdo-mysql --with-pdo-pgsql (and any other driver you need)

I survived!

(Yes, the headline recycling is becoming a trend. Get on with the program!)

The view from the bridge at Hvaler at 05:45

I can happily report that I survived this year’s version of Birkebeinerrittet! Together with 16,000 (!) other people, I set out from Rena on Saturday morning, heading for Lillehammer – 94.5 kilometers away and with the highest mountain in the world (1100 meters) in between. I left my home at Hvaler at 05:30 in the morning, joined my parents at Rolvsøy at 06:00 and left for Rena. About 3 hours and 25 minutes later we arrived, and I went to get my starting kit (transponder, number plate / start bib, etc.).

The starting kit for Birkebeinerrittet 2009

The track had been made about 3 km longer than last year, as we now followed an alternative route out of Rena. Instead of riding the bridge over highway 3 right after the start, everyone started in the opposite direction and went under the highway instead. Nothing much to report here; everything worked out fine and the additional kilometers don’t really matter. Compared to last year I spent two minutes more up to the first registration point at Skramstad, which means I kept almost exactly the same pace as last year. Remember to pick the LEFT track when the road splits (I did that last year, while I picked the right track this year – which seemed a lot steeper).

People getting ready for their start at Birkebeinerrittet 2009
One of the groups heading out for their start!

But before we go any further, I’ll have to mention the weather. Oh, the weather. It had been raining for at least a day before the trip over the mountain, which meant that everything was muddy and dirty. Grenserittet was also muddy, but there the mud was localized to a few key areas. At Birkebeinerrittet everything was muddy (a few areas a lot more than others, of course), and people were approaching zero recognizability. As one guy asked me at the second stage: “Atle?” (another common Norwegian name) “Noooo?” “Oh, sorry. It was impossible to see who you were with all the mud..”. VG has a collection of pictures showing the mud problem.

One of the things I’ve had on my to-do list was to get a nice pair of glasses to use while biking. While I actually managed to get a new set of long biking shorts and a new long-sleeved bike jacket before starting (and yes, those were probably this year’s best investment), I failed to get a pair of glasses. And how I regret that. It was completely impossible to follow anyone’s back wheel because of all the dirt that came blasting! I had to remove chunks of dirt from the corner of my eye for a day and a half after finishing the race. Quite a new experience!

The race went a lot better than last year, even under the current conditions. Although behind my previous time at the second checkpoint, I had a lot more energy and endurance this time. I was still able to get up a bit of speed and passed quite a few other riders on my way to the next checkpoint. When I reached the famous “Rosinbakken” (“Raisin hill”), I was experiencing quite an energy loss, and I’ve realized in retrospect that this was because I failed to get any new energy into my body during the 20 km leading up to the hill. I try to get at least one serving of energy gel every 30 minutes, but I think I went at least an hour and a half in this segment. After getting some carbs into the system everything went a lot better, and I was able to get up on my bike and put in a few stints up until the highest point of the track.

In the middle of one of the downhill segments right before the second checkpoint we suddenly met three sheep walking right in the middle of the track! After a bit of panic braking we managed to avoid them, and they trotted along the road as if nothing had happened. There were a lot of sheep along the track as usual, but on two occasions they went a bit further than just grazing by the side. An amazing experience anyhow.

After passing the highest point, it’s downhill almost exclusively until the finish. I’m usually a lot better at the downhill segments than the uphill parts, and I was able to tag along with a train of five other bikers. We really got up some speed and passed lots of other riders, and I was happy to finally get a bit of effective riding in. Next year I’ll hopefully be able to tag along with someone for most of the trip, making it a faster journey for all of us. We’ll see.

In the last downhill segment, after riding through the spectator stands around the ski jumping hill from the Olympics at Lillehammer (and riding down the hill from the freestyle skiing competition at the same Olympics), the guy right in front of me went over his handlebars and crashed into the side of the road. He was apparently OK, but it seemed a rather unpleasant experience. The rest of the track was covered with five centimeters of mud, which I managed to ride all the way through – although I almost went for an “I’ll plant my complete body into the mud here, thank you” after the rider right in front of me suddenly had problems keeping her speed and I tried feverishly to free my shoes from my pedals. I saved it, and could ride the last 200 meters and finish my ride (although I’m not sure anyone would have seen any difference whether I had fallen into the mud or not..)!

Two of my friends who rode the race for the first time, Christer and Magne, also finished. I’m happy to report (.. and Magne is not) that I actually managed to strike back after Magne crushed my time by 40 minutes at Grenserittet a year back. Ten minutes ahead baby, it’s all the time in the world! Christer had a very bad day with two punctures and three chain breakdowns. He finished in about 5:54.

Christer

Year  Skramstad  Bringbu   Kvarstad  Storåsen  Goal
2009  00:46:53   01:48:55  03:15:48  04:55:07  05:54:36

Magne

Year  Skramstad  Bringbu   Kvarstad  Storåsen  Goal
2010  00:47:55   01:42:17  02:50:42  04:14:53  05:21:34
2009  00:53:17   01:57:57  03:14:41  04:38:18  05:34:01

Mats

Year  Skramstad  Bringbu   Kvarstad  Storåsen  Goal
2011  00:48:15   01:38:03  02:38:26  03:51:49  04:37:00
2010  00:50:43   01:51:31  03:03:17  04:30:54  05:29:07
2009  00:53:04   01:59:25  03:05:39  04:30:10  05:24:30
2008  00:51:36   01:47:20  03:08:54  04:47:30  05:47:13
1996  –          –         –         –         05:47:50

I’ll leave you with the final impression of one tired man and his new friend, the mud. This is after getting hosed down with water at least once to try to clean the mud off my face.

Myself after finishing Birkebeinerrittet 2009 - A bit muddy!

So, are you ready for next year? I am! (.. even after having the hiccups on the way back home .. for at least a couple of hours.)

Birkebeinerrittet Tomorrow

(Yes, I recycled my headline from last year, thanks for noticing)

Yet again I set sail (yes, that’s what you do when you’re biking) over the mountain between Rena and Lillehammer tomorrow! 94.5 km of gravel, dirt and mud awaits! The weather seems to be pretty OK for the area tomorrow (a millimeter of downpour), but the forecast for tonight looks a bit rainy. That’s going to make everything a bit slippery tomorrow at least, but hopefully the roads are up to standard and everything goes as planned.

I’ve recently bought a new bike, so this will be the first really long trip with the new set of wheels (I’ve managed to do about 150 km during the last seven days to at least settle everything in). Looking forward to it!

Three goals for this year too, and amazingly they’re the same as for last year:

Primary goal: finishing.
Secondary goal: finishing below 5 hours.
Third goal: escape death.

I finished in 5:47 last year, but I’m feeling a tad more optimistic this year (.. well, I always do). Five hours, here I come!

What Happened To My Beautiful En-dashes?!

First, a small introduction to the problem: we’re running everything in UTF-8 all the way. A few sites we’re reading feeds from use ISO-8859-1 as their charset, but they either supply the feed with the correct encoding specified, or the feeds arrive as UTF-8. Everything works nicely, except for the mentioned-in-the-headline en-dashes. Firefox only shows 00 96 (0x00 0x96), but everything looks correct when you view the headlines and similar content on the original site.

Strange.

The digging, oh all the digging.

After the already mentioned digging (yes, the digging) in data at the large search engines (OK, maybe I did a search or two), I discovered that the Windows cp1252 encoding uses 0x96 to store en-dashes. That seems familiar! We’re seeing 0x96 as one of the byte values above, so apparently cp1252 is sneaking into the mix somewhere along the line. Most of the clients using the CMSes run Windows, so they might be to blame.

ISO-8859-1 enters the scene

As the sites (and feeds) declare ISO-8859-1 as their encoding, I thought it would be interesting to see what ISO-8859-1 defines as the representation of the byte value 0x96. Lo and behold: 0x96 is not defined in ISO-8859-1. Which actually provides us with the solution.

I welcome thee, Mr. Solution

When the ISO-8859-1 encoded string is converted into UTF-8, the bytes with the value 0x96 (which is the en-dash in cp1252) are simply carried over as a valid UTF-8 sequence representing a character that has no visible representation.

We’re saying that the string is ISO-8859-1, although in reality it is either cp1252 or a mangled mix of ISO-8859-1 and cp1252 (for the en-dashes, at least).

If you’re on the parsing end of this mumbo jumbo, one solution is to replace the generated UTF-8 sequence (0xc2 0x96, converted from 0x96 in ISO-8859-1) with the proper one (0xe2 0x80 0x93):

$data = str_replace("\xc2\x96", "\xE2\x80\x93", $data);
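And if the feeds mangle more than just the en-dash, the same trick extends to the rest of the cp1252 punctuation that ends up in the 0x80–0x9F control range. A sketch – the table only covers the characters we’ve actually run into, so extend it as needed:

$map = array(
    "\xC2\x91" => "\xE2\x80\x98", // left single quotation mark
    "\xC2\x92" => "\xE2\x80\x99", // right single quotation mark
    "\xC2\x93" => "\xE2\x80\x9C", // left double quotation mark
    "\xC2\x94" => "\xE2\x80\x9D", // right double quotation mark
    "\xC2\x96" => "\xE2\x80\x93", // en-dash
    "\xC2\x97" => "\xE2\x80\x94", // em-dash
);

$data = strtr($data, $map);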

And voilà, everything works.

The Thumbs Up! of Awesome Approval

Every once in a while a few new, interesting tools surface and become a natural part of how a developer works. I’ve taken a look at the tools I’ve introduced into my regular workflow during the last six months.

NetBeans

NetBeans got the first version of what has become awesome PHP support in version 6.5, and after version 6.7 was released just before the summer, things have become very stable. NetBeans is absolutely worth looking into for PHP development (and Java), and you sure can’t beat the price (free!). In the good old days NetBeans was slow as hell, but I’ve not noticed any serious issues in 6.7 (.. although we didn’t really have quad cores and 4GB of memory back then either). Go try it out today!

Balsamiq Mockups

Balsamiq is an awesome tool for making quick mockups of UI designs. Previously I’d play around in Adobe Photoshop, dragging layers around and being concerned with all the wrong things. Mockups abstracts away all the UI elements (and comes with a vast library of standard elements), which makes it very easy to experiment and to focus on the usability instead of the design and its implementation. For someone who’s more interested in the experience and the programming than the actual design (.. I’ll know what I want when I see it!), this makes it easy to convey my suggestions and create small, visual notes of my own usability ideas.

You can try it out for free at their website, and they even give away licenses to people who are active in open source development (disclaimer: I got a free license, but the experiences are all my own. This is not paid (or unpaid) advertising or product placement).

GitHub

I’ve been playing around with git a bit, and after writing a patch for the PEAR module for Gearman (.. which still doesn’t seem to have made it anywhere significant), I signed up for GitHub to be able to fork the project and submit my patch there. A very good technical solution, partnered with an easy way of notifying the original developers of your patch (which you simply provide in your own branch) by submitting a “pull request”, makes it very easy both to receive patches and to submit patches to projects hosted at GitHub.

Thumbs up!