Going Hex Hunting in ZIP Files

It’s been a couple of weeks since I last had to go hex hunting, but fear not – we’re back in action! We discovered this issue after adding support for uploading ZIP files in the platform powering Gamer.no. All our test cases worked perfectly, but as soon as we released the feature our users discovered zip files from one particular site never worked. WinRAR and 7zip were both able to extract the files, so we just left the issue at that for the time being.

This evening I finally got around to trying to find out exactly what was going on. I started by updating my SVN trunk checkout of PHP6, adding the zip module to my configure and running make clean and make. The issue was still present in HEAD, so that meant getting my hands dirty and trying to find out exactly what we were up against.

The first thing on the table was to find out what the zip module was getting all antsy about when trying to open the file. A few well placed printfs (all hail the magic of printf debugging) told me that the line that this particular ZIP file did not pass was:

if ((comlen < cd->comment_len) || (cd->nentry != i)) { 

The cd struct is the “end of central directory” struct, containing meta data about the archive and an optional comment about the ZIP file. The comlen and the cd->comment_len were both zero (as the file didn’t have a comment), so the trouble maker was the cd->nentry != i statement. The library that the PHP module uses reads i by itself and then reads the struct. nentry is the number of files in “this disk” of the archive (a ZIP archive can be divided across several physical disks (think back to 8″, 5.25″ and 3.5″ disks)), while i is the “total number of files in the archive”. The PHP library only supports single archives (not archives spanning several disks), so these two values should be identical. For some reason, they weren’t – meaning that the zip files generate by this particular site actually are invalid ZIP-files. WinRAR and 7zip just makes the best of the situation and does it very nicely.

Here’s the hex dump of the end of central directory section from one of the non-working files:

Hex dump of the end of central directory section from a ZIP file

The first four bytes are the signature of the section (50 4B 05 06, or PK\5\6), then we have 2 bytes which is the “number of this disk” (00 00 here) and 2 bytes with “number of the disk with the start of the central directory” (00 00 here again) (the library in PHP doesn’t support archives spanning multiple disks, so it just compares this section to 00 00 00 00). Then we have our magic numbers, both two bytes: “total number of entries in the central directory on this disk” and “total number of entries in the central directory”. These should logically be the same for single file archives, but they’re 6 and 3 (06 00 and 03 00) here (the correct number of files in the archive is 3).

The solution for us is to use the correct number (“total number of entries in the central directory”) for both the values. To do this we simply patch the two bytes of the binary zip file (.. we could do this with fopen, fseek, fwrite in place, but we’re lazy. This is not time sensitive code.) and rewrite it in place:

     * Attempt to repair the zip file by correcting wrong total number of files in archive
    static public function repairZIPFile($file)
        // lets read in the file
        $data = file_get_contents($file);

        // lets try to find the end of the central directory record (should be
        // the last thing in the file, so we search from the end)
        $startpos = strrpos($data, "PK\5\6");

        // if we found a header..
        if ($startpos)
            // attempt to repair the file by copying the "total files" to the "files on this disk" field.
            // PHP's ZIP module doesn't handle multidisks anyway..
            $data[$startpos+8] = $data[$startpos+10];
            $data[$startpos+9] = $data[$startpos+11];

            file_put_contents($file, $data);

And Voilá – we’re a happy bunch again!