Ready for 2010: Check Your Indexes

One of the many things you should keep a continuous watch on during the life of any of your applications is the performance of your SQL queries. You might be caching the hell out of your database layer, but sooner or later you’ll have to hit the database server to retrieve data. And if that starts to happen often enough while you’re growing, you’ll see your SQL daemon taking up the largest part of your disk I/O and CPU time. This might not be a problem at the load you’re seeing now, but could you handle a 10-fold increase in traffic? .. or how about 100x? (which, if I remember correctly, is the scale factor Google uses when developing applications)

Indexes Are Your Friend

During the Christmas holiday I got around to taking a look at some of the queries running at one of my longest-living, most active sites: pwned.no. Pwned is a tournament engine running on top of PHP and MySQL, containing about 40.000 tournaments, 450.000 matches and several other database structures. The site has performed well over the years, and there haven’t been any performance issues other than a few attempts at DoS-ing the site with TCP open requests (the last one during the holiday, actually).

Two weeks ago the server suddenly showed loads well above 30 – while it usually hovers around 0.3 – 0.4 at the busiest hours of the day. The reason? One of the previously less used functions of the tournament engine – using a group stage in your tournament – had suddenly become popular in at least one high-traffic tournament. This part of the code had never seen much use before, but when the traffic spike happened everything went bananas (B-A-N-A-N-A-S. Now that’s stuck in your head. No problem.) The culprit: the query used a couple of columns in a WHERE statement that weren’t indexed, and it ran against the table containing the matches for the tournaments. This meant that over 400.000 rows were scanned each time the query ran, so mysqld started hogging every resource it could. The Apache children then had to wait, pushing the load a bit too high for my liking. Two CREATE INDEX calls later, the load went back down and everything chugged along nicely again.
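To illustrate what that kind of fix looks like (note: the table and column names below are made up – the actual schema isn’t shown here), EXPLAIN is what reveals the full table scan, and a compound index is the cure:

```sql
-- Hypothetical schema: the matches table queried on two unindexed columns
EXPLAIN SELECT * FROM matches WHERE tournament_id = 1234 AND group_id = 7;
-- Before: type = ALL, rows ~ 400000 (a full table scan)

CREATE INDEX idx_tournament_group ON matches (tournament_id, group_id);

EXPLAIN SELECT * FROM matches WHERE tournament_id = 1234 AND group_id = 7;
-- After: type = ref, and only the matching rows are examined
```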

My strategy for discovering queries that might need a better index scheme (or, if that’s “impossible”, a proper caching layer in front of them):

  1. Run your development server with slow-query-log=1, log-queries-not-using-indexes=1 and long-query-time=<an appropriately low value, such as 0.05 – depends on your setup>. You can also provide a log file name with log-slow-queries=/var/log/mysql/… in your my.cnf file for MySQL. This logs all potential optimization candidates to the log file (it won’t necessarily give you a complete list of queries worth optimizing, but it should provide a few good hints). Be sure to use actual data from your site when working on your development version, as some issues only appear when the data set reaches a certain size – such as the 400.000 rows in the example mentioned above.
  2. Connect to your MySQL server and issue
    SHOW PROCESSLIST

    and

    SHOW FULL PROCESSLIST

    statements every now and then. This will let you see any queries that run often and way too long (though they’ll have to be running when you issue the command). You might not catch the real culprit, but if you’re seeing MySQL chugging along at 100% CPU and are wondering what’s happening, check what the threads are doing. You’ll hopefully see just which query is wreaking havoc on your server.

  3. Add a statistics layer in front of your MySQL calls in your application. If you’re using PDO you can subclass it to keep a bit of statistics about your queries around: the number of times each query is run, the total time spent running it, and other interesting values. We’re using a version of this in the development version of Gamer.no and I’ll probably upload the class to my github repository as soon as I get a bit of free time in the new year.
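A rough sketch of what such a PDO subclass could look like (the class name and details are my own guesses – the actual Gamer.no class hasn’t been published):

```php
<?php
// Hypothetical statistics-collecting PDO subclass. Prepared statements
// would need the same treatment via PDO::ATTR_STATEMENT_CLASS; this only
// times query() and exec() calls.
class StatsPDO extends PDO
{
    public $stats = array();

    #[\ReturnTypeWillChange]
    public function query($sql, ...$args)
    {
        $start = microtime(true);
        $result = parent::query($sql, ...$args);
        $this->record($sql, microtime(true) - $start);
        return $result;
    }

    #[\ReturnTypeWillChange]
    public function exec($sql)
    {
        $start = microtime(true);
        $result = parent::exec($sql);
        $this->record($sql, microtime(true) - $start);
        return $result;
    }

    protected function record($sql, $time)
    {
        if (!isset($this->stats[$sql])) {
            $this->stats[$sql] = array('count' => 0, 'total' => 0.0);
        }
        $this->stats[$sql]['count']++;
        $this->stats[$sql]['total'] += $time;
    }
}

$db = new StatsPDO('sqlite::memory:');
$db->exec('CREATE TABLE t (id INTEGER)');
$db->query('SELECT * FROM t');
$db->query('SELECT * FROM t');
echo $db->stats['SELECT * FROM t']['count'] . "\n"; // 2
```

Sorting $stats by total time afterwards gives you an instant list of the queries worth a closer look.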

Not sure what I’ll take a closer look at tomorrow, but hopefully I’ll decide before everything collapses!

What’s your strategy for indexes? What methods do you use for finding queries that need a bit more love? Leave a comment below!

Read all the articles in the Ready for 2010-series

Going Hex Hunting in ZIP Files

It’s been a couple of weeks since I last had to go hex hunting, but fear not – we’re back in action! We discovered this issue after adding support for uploading ZIP files to the platform powering Gamer.no. All our test cases worked perfectly, but as soon as we released the feature, our users discovered that ZIP files from one particular site never worked. WinRAR and 7zip were both able to extract the files, so we just left the issue at that for the time being.

This evening I finally got around to trying to find out exactly what was going on. I started by updating my SVN trunk checkout of PHP6, adding the zip module to my configure and running make clean and make. The issue was still present in HEAD, so that meant getting my hands dirty and trying to find out exactly what we were up against.

The first thing on the table was to find out what the zip module was getting all antsy about when trying to open the file. A few well-placed printfs (all hail the magic of printf debugging) told me that the check this particular ZIP file failed was:

if ((comlen < cd->comment_len) || (cd->nentry != i)) { 

The cd struct is the “end of central directory” struct, containing metadata about the archive and an optional comment about the ZIP file. comlen and cd->comment_len were both zero (as the file didn’t have a comment), so the troublemaker was the cd->nentry != i part. The library that the PHP module uses reads i from the file on its own, and then reads the struct. nentry is the number of files in “this disk” of the archive (a ZIP archive can be divided across several physical disks – think back to 8″, 5.25″ and 3.5″ disks), while i is the “total number of files in the archive”. The PHP library only supports single archives (not archives spanning several disks), so these two values should be identical. For some reason they weren’t – meaning that the ZIP files generated by this particular site are actually invalid ZIP files. WinRAR and 7zip just make the best of the situation, and do so very nicely.

Here’s the hex dump of the end of central directory section from one of the non-working files:

Hex dump of the end of central directory section from a ZIP file

The first four bytes are the signature of the section (50 4B 05 06, or PK\5\6), then we have 2 bytes with the “number of this disk” (00 00 here) and 2 bytes with the “number of the disk with the start of the central directory” (00 00 here again) (the library in PHP doesn’t support archives spanning multiple disks, so it just compares this section to 00 00 00 00). Then we have our magic numbers, both two bytes: the “total number of entries in the central directory on this disk” and the “total number of entries in the central directory”. These should logically be the same for single-disk archives, but here they’re 6 and 3 (06 00 and 03 00) – while the correct number of files in the archive is 3.
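For the curious, here’s a small sketch that parses those fields out of a raw end of central directory record (the byte string below is constructed by hand to mirror the broken file, not taken from the actual dump):

```php
<?php
// Build an EOCD record with the mismatching entry counts described above.
$eocd = "PK\x05\x06"   // signature (50 4B 05 06)
      . "\x00\x00"     // number of this disk
      . "\x00\x00"     // number of the disk with the start of the central directory
      . "\x06\x00"     // entries in the central directory on this disk
      . "\x03\x00";    // total entries in the central directory

// All multi-byte ZIP fields are little-endian; 'v' reads a 16-bit LE integer.
$fields = unpack('vdisk/vcdstartdisk/vdiskentries/vtotalentries', substr($eocd, 4));

printf("on this disk: %d, in total: %d\n",
       $fields['diskentries'], $fields['totalentries']); // 6 vs. 3 – the mismatch
```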

The solution for us is to use the correct number (“total number of entries in the central directory”) for both values. To do this we simply read the whole file, patch the two bytes and write it back (we could do this in place with fopen, fseek and fwrite, but we’re lazy – this is not time-sensitive code):

    /**
     * Attempt to repair the zip file by correcting the wrong total number of files in the archive
     */
    public static function repairZIPFile($file)
    {
        // read the whole file into memory
        $data = file_get_contents($file);

        // try to find the end of central directory record (it should be
        // the last thing in the file, so we search from the end)
        $startpos = strrpos($data, "PK\5\6");

        // if we found a header.. (strict compare, as position 0 would be falsy)
        if ($startpos !== false)
        {
            // attempt to repair the file by copying the "total files" field to the
            // "files on this disk" field. PHP's ZIP module doesn't handle multidisks anyway..
            $data[$startpos+8] = $data[$startpos+10];
            $data[$startpos+9] = $data[$startpos+11];

            file_put_contents($file, $data);
        }
    }

And voilà – we’re a happy bunch again!

Supporting 2-pass Parallel Encoding with x264 and ffmpeg

If you’re doing several encodes of a single input file (to encode several different size / bitrate combinations) in parallel with x264, you’re going to have a problem. The first pass creates three files with information for the second pass, and you’re unable to change these file names into something better. This seems to be a problem for quite a lot of people according to a Google search for the issue, and no one seems to have a proper solution.

I have one. Well, probably not a proper solution, but at least it works! The trick is to realize that ffmpeg/x264 creates these files in the current working directory. To run several encodings in parallel, you simply have to give each encoding process its own directory, and then use absolute paths to the source and destination files (and any other paths). Let it create the files there, then clean up and delete the directory afterwards.

I’ve included some example PHP code showing how you could solve something like this. I simply use the output file name as the directory name here, and create the directory in the system temp directory.

$tempDir = sys_get_temp_dir() . '/' . $outputFilename;
mkdir($tempDir, 0700, true);
chdir($tempDir);

After doing the encode, we’ll have to clean up. The three files that ffmpeg/x264 creates are ffmpeg2pass-0.log, x264_2pass.log and x264_2pass.log.mbtree.

unlink($tempDir . '/ffmpeg2pass-0.log');
unlink($tempDir . '/x264_2pass.log');
unlink($tempDir . '/x264_2pass.log.mbtree');
rmdir($tempDir);
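Putting it all together, here’s a sketch of a per-encode wrapper (the ffmpeg command lines are illustrative placeholders – substitute your own size/bitrate options):

```php
<?php
// Run one 2-pass encode inside its own working directory, so parallel
// encodes don't clobber each other's stats files. The ffmpeg arguments
// below are placeholders, not a tested command line.
function encodeInOwnDirectory($inputFile, $outputFile)
{
    $tempDir = sys_get_temp_dir() . '/' . basename($outputFile);
    mkdir($tempDir, 0700, true);

    $oldDir = getcwd();
    chdir($tempDir); // ffmpeg/x264 writes its 2-pass log files here

    // Use absolute paths, since we've changed the working directory.
    $in  = escapeshellarg($inputFile);
    $out = escapeshellarg($outputFile);
    shell_exec("ffmpeg -y -i $in -pass 1 -an -f rawvideo /dev/null");
    shell_exec("ffmpeg -y -i $in -pass 2 $out");

    chdir($oldDir);

    // Clean up the stats files and the directory itself.
    foreach (array('ffmpeg2pass-0.log', 'x264_2pass.log', 'x264_2pass.log.mbtree') as $file) {
        @unlink($tempDir . '/' . $file);
    }
    rmdir($tempDir);
}
```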

And that should hopefully solve it!

Patching The PHP Gearman Extension


Update: it seems that this behaviour in libgearman changed from 0.8 to 0.10, and according to Eric Day (eday), the behaviour will change back to the old one with 0.11.

After upgrading to the most recent versions of the Gearman extension for PHP and libgearman, Suhosin started complaining about a heap overwrite problem. The error only popped up for certain response sizes, which made me guess that it could be a buffer overrun or something strange going on in the code handling the response.

Seeing this as an excellent opportunity to get more familiar with the Gearman code, I dug into the whole shebang yesterday and continued my quest for cleansing today. After quite a few hours of getting to know the code and attempting to understand the general flow, I was finally able to find – and fix – the problem.

The first symptom of the issue was that the Gearman extension at certain times failed to return the complete response from the gearman server. I created a small application that returned responses of different sizes, showing that the problem was all over the place. While n worked, n+1 returned only n bytes, and n+2 resulted in a heap overflow.

The issue was caused by an invalid efree, where the code in question was:

void _php_task_free(gearman_task_st *task, void *context) {
    gearman_task_obj *obj = (gearman_task_obj *)context;
    TSRMLS_FETCH();

    if (obj->flags & GEARMAN_TASK_OBJ_DEAD) {
        GEARMAN_ZVAL_DONE(obj->zdata)
        GEARMAN_ZVAL_DONE(obj->zworkload)
        efree(obj);
    }
    else {
        obj->flags &= ~GEARMAN_TASK_OBJ_CREATED;
    }
}

This seems innocent enough, and I really had trouble seeing how it could lead to the observed behaviour. This meant going on a wild goose chase around the Gearman code, trying to piece together how things worked. After a few proper debug rounds, I finally discovered the issue: the context variable is not always a gearman_task_obj struct. The gearman_task_obj struct is allocated by php_gearman and then assigned to the task in question, which makes it possible for the extension to tag an internal structure along with the task in libgearman. Under certain conditions this struct is not created, and by default libgearman then assigns the client struct to the context instead (this is also available as task->client). So instead of the gearman_task_obj that was assumed to be present, we actually got a gearman_client struct.

That explains why things went sour, but why exactly did I see the behaviour I saw? Well, to answer that, we’ll have to take a look at the actual contents of the structs. The client struct contains a value keeping the number of bytes in the response, while the task_obj struct keeps the flags (which is what the code above checks and updates). Coincidentally these two int values are aligned similarly in the two structs – resulting in the number of bytes in the response being used as the flags value. This value is then either modified (under certain conditions) or leads to a free using other offsets into the struct. The call to efree() then uses some effectively random values (or, more specifically, whatever values line up with the expected locations in task_obj), resulting in heap corruption. Suhosin caught it, while under an unprotected PHP installation it would probably just have generated a few weird bugs (where the last byte would go missing). +1 for Suhosin!

The patch for php_gearman.c is available, and should be applied against 0.6.0. Although I’ve looked it over a few times, it might still introduce a memory leak. People who know the code way better than I do will probably commit a better patch, and the issue should be fixed in 0.7.0 of the extension.

Fatal error: Undefined class constant ‘ATTR_DEFAULT_FETCH_MODE’

This is one of the common error messages that seems to appear after installing PHP – in particular under Ubuntu or Debian (where I experienced it). The reason is that the PDO version you’ve just installed is too old for the constant to exist, usually because you were naive enough to install the extension from PECL instead of using the default supplied by Ubuntu. If you did the same as me:

apt-get install 
pecl install pdo
pecl install pdo_mysql

/etc/init.d/apache2 restart

And everything seems to work, except for that missing constant. What the fsck?!

The reason is that the PECL version of PDO is no longer maintained (I’d suggest automagically pushing the newest version to PECL too, just so pecl install and pecl upgrade work as expected). The fact is that when you did pecl install pdo, you overwrote the more recent version provided by the default php5-common package (under Ubuntu, at least).

To solve the problem (if you also did pecl install pdo_mysql, you’ll have to replace that one too…), run:

apt-get install --reinstall php5-common php5-mysql php5-mysqli

Restart. Relive. Re.. eh .. yeah. Rerere.

If you’re building from source, you’ll need to add:

--enable-pdo --with-pdo-mysql --with-pdo-pgsql (and any other driver you need)

What Happened To My Beautiful En-dashes?!

First, a small introduction to the problem: we’re running stuff in UTF-8 all the way. A few sites we’re reading feeds from use ISO-8859-1 as their charset, but they either supply the feed with the correct encoding specified or the feeds arrive as UTF-8. Everything works nicely, except for the mentioned-in-the-headline en-dashes. Firefox only shows 00 96 (0x00 0x96), but everything looks correct when you view the headlines and similar stuff on the original site.

Strange.

The digging, oh all the digging.

After the already mentioned digging (yes, the digging) in data at the large search engines (ok, maybe I did a search or two), I discovered that the Windows cp1252 encoding uses 0x96 to store en-dashes. That seems familiar! We’re seeing 0x96 as one of the byte values above, so apparently cp1252 is sneaking into the mix somewhere along the line. Most of the clients using the CMS-es are on Windows, so they’re the likely culprits.

ISO-8859-1 enters the scene

As the sites (and feeds) declare ISO-8859-1 as their encoding, I thought it would be interesting to see what ISO-8859-1 defines as the representation of the byte value 0x96. Lo and behold: 0x96 is not a printable character in ISO-8859-1 – it falls in the C1 control range. Which actually provides us with the solution.

I welcome thee, Mr. Solution

When the ISO-8859-1 encoded string is converted into UTF-8, each byte with the value 0x96 (which is the en-dash in cp1252) is simply converted to the valid UTF-8 sequence for the control character U+0096 – a character with no visible representation.

We’re saying that the string is ISO-8859-1, although in reality it is either cp1252 or a mangled mix of ISO-8859-1 and cp1252 (for the en-dashes, at least).

If you’re on the parsing end of this mumbo jumbo, one solution is to replace the generated UTF-8 sequence (0xc2 0x96) (converted from 0x96 in ISO-8859-1) with the proper one (0xe2 0x80 0x93):

$data = str_replace("\xc2\x96", "\xE2\x80\x93", $data);

And voilà, everything works.
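To see the whole chain in one go, here’s a small sketch reproducing both the mangling and the fix (the sample string is made up):

```php
<?php
// A cp1252 string claiming to be ISO-8859-1: 0x96 is the cp1252 en-dash.
$cp1252 = "2009\x962010";

// Converting "from ISO-8859-1" maps byte 0x96 to the control character
// U+0096, encoded in UTF-8 as 0xc2 0x96.
$utf8 = mb_convert_encoding($cp1252, 'UTF-8', 'ISO-8859-1');

// Replace the bogus sequence with a real en-dash (U+2013).
$fixed = str_replace("\xc2\x96", "\xE2\x80\x93", $utf8);

echo $fixed . "\n"; // 2009–2010, with a proper en-dash
```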

The Thumbs Up! of Awesome Approval

Every once in a while a few new interesting tools surface and become a natural part of how a developer works. I’ve taken a look at which tools I’ve introduced into my regular workflow during the last six months.

NetBeans

NetBeans got the first version of what has become awesome PHP support in version 6.5, and after version 6.7 was released just before the summer, things have become very stable. NetBeans is absolutely worth looking into for PHP development (and Java), and you sure can’t beat the price (free!). In the good old days NetBeans was slow as hell, but I’ve not noticed any serious issues in 6.7 (.. although we didn’t really have quad cores and 4GB of memory back then either). Go try it out today!

Balsamiq Mockups

Balsamiq is an awesome tool for making quick mockups of UI designs. Previously I’d play around in Adobe Photoshop, dragging layers around and being concerned with all the wrong things. Mockups abstracts away all the UI elements (and comes with a vast library of standard elements), which makes it very easy to experiment and focus on the usability instead of the design and its implementation. For someone who’s more interested in the experience and the programming than the actual design (.. I’ll know what I want when I see it!), this makes it easy to convey my suggestions and create small, visual notes of my own usability ideas.

You can try it out for free at their website, and they even give away licenses to people who are active in Open Source Development (disclaimer: I got a free license, but the experiences are all my own. This is not paid (or unpaid) advertising or product placement.)

GitHub

I’ve been playing around with git a bit, but after writing a patch for the PEAR-module for Gearman (.. which still doesn’t seem to have made it anywhere significant), I signed up for github to be able to fork the project and submit my patch there. A very good technical solution partnered with an easy way of notifying the original developers of your patch (which you simply provide in your own branch) by submitting a “pull request” makes it very easy to both have patches supplied to you and to submit patches to projects hosted at GitHub.

Thumbs up!

Parsing XML With Namespaces with SimpleXML

There’s one thing SimpleXML for PHP is horrible to use for: parsing XML containing namespaces. Namespaces require special handling, and the only way I’ve found to refer to an element in another namespace is to use the ->children() method with the namespace. I’m sure there’s an easier way than this, and if you know of any, please leave a comment!

Let’s start with the following XML snippet (using SOAP as an example):


    <soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope">
        <soap:Body>
            <queryInstantStreamResponse>asdasd</queryInstantStreamResponse>
        </soap:Body>
    </soap:Envelope>

The easiest way would be to “ignore” the namespaces and simply do $root->{'soap:Envelope'} to access the element. This will not work, as SimpleXML is quite peculiar about its namespaces (.. while everything else is simple and easy to use).

One solution is to provide the namespace you’re interested in to the $element->children() method, which returns all the children of the element in a particular namespace (or without arguments, outside any namespace):

$sxml = new SimpleXMLElement(file_get_contents('soap.xml'));

foreach($sxml->children('http://www.w3.org/2001/12/soap-envelope') as $el)
{
    if ($el->getName() == 'Body')
    {
        /* ... */
    }
}

Yes. That’s quite horrible.

But luckily the xpath method can help us:

$elements = $sxml->xpath('//soap:Envelope/soap:Body/queryInstantStreamResponse');

This will actually fetch all the elements titled “queryInstantStreamResponse” which are children of soap:Envelope and soap:Body. And this works as you’d expect, without having to use children(), provide the actual namespace URI, etc.

The xpath method returns an array containing all the matching elements, so in this case you’ll receive an array with a single element, containing the text inside the queryInstantStreamResponse element.
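If the document uses a different prefix (or none at all), SimpleXMLElement::registerXPathNamespace() lets you bind your own prefix for the xpath query – a sketch, with a SOAP envelope inlined for completeness:

```php
<?php
$xml = <<<'XML'
<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope">
    <soap:Body>
        <queryInstantStreamResponse>asdasd</queryInstantStreamResponse>
    </soap:Body>
</soap:Envelope>
XML;

$sxml = new SimpleXMLElement($xml);

// Bind our own prefix 's' to the namespace URI, independent of whatever
// prefix the document itself happens to use.
$sxml->registerXPathNamespace('s', 'http://www.w3.org/2001/12/soap-envelope');
$elements = $sxml->xpath('//s:Envelope/s:Body/queryInstantStreamResponse');

echo (string) $elements[0] . "\n"; // asdasd
```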

There should be an easier way than this.

NTFS Junctions and PHP 5.3.0

After upgrading to PHP 5.3.0 on my Windows XP workstation, junctions have suddenly stopped working in all PHP-related code. I use junctions to link directories from their version-specific paths (NTFS symlinks were first introduced with Vista, so I’m still using junctions), but after upgrading, none of the libraries living in directories linked through junctions work.

This seems to be a known bug, Files on NTFS Mounted Volumes (Junctions) inaccessible, although I’m also seeing the issue with completely local files (not mounted from remote file systems). Seems like the thing to do is to wait for 5.3.1 and hope the issue gets resolved .. if it gets fixed by then. For the time being I’ll manually copy the directories.

Update: I’ve added a log of a test session showing the problem.

Adding Support for Asynchronous Status Requests for Net_Gearman

I spent the evening yesterday playing around a bit more with Gearman, a system for farming out tasks to workers across several servers. As my workstation at home still runs Windows, the only PHP library available is Net_Gearman from PEAR. Net_Gearman supports tasks (something to do), sets (a collection of tasks), workers (the processes that perform the tasks) and clients (which request tasks to be performed). The gearman protocol supports retrieving the current status of a task from the gearman server (which contains information about how the worker is progressing, reported by the worker itself), but Net_Gearman did not.

The reason for ‘did not’ is that I’ve created a small patchset adding the functionality to Net_Gearman. All internal methods and properties are still used as they were before, but I’ve added two helper methods for retrieving the socket connection for a particular gearman server (Net_Gearman usually just picks a random server, but we need to contact the server that’s responsible for the task) and a getStatus(server, handle) method to the Gearman client. I’ve also added a property to the Task class keeping the address of the server the task was assigned to.

After submitting a task to be performed in the background (you do not need this to get the status for foreground tasks, as you can provide a callback to handle that), your Task object will have its handle and server properties set. These can be used to retrieve status information about the task later. You’ll still need to provide the possible servers to the Gearman client when creating the client (through the constructor).

Example of creating a task and retrieving the server / handle pair after starting the task:

require_once 'Net/Gearman/Client.php';

$client = new Net_Gearman_Client(array('host:4730'));

$task = new Net_Gearman_Task('Reverse', range(1,5));
$task->type = Net_Gearman_Task::JOB_BACKGROUND;

$set = new Net_Gearman_Set();
$set->addTask($task);

$client->runSet($set);

print("Status information: \n");
print($task->handle . "\n");
print($task->server . "\n");

Retrieving the status:

require_once 'Net/Gearman/Client.php';

$client = new Net_Gearman_Client(array('host:4730'));
$status = $client->getStatus('host:4730', 'H:mats-ubuntu:1');

The array returned from the getStatus() method is the same array as returned from the gearman server and contains information about the current status (numerator, denominator, finished, etc, var_dump it to get the current structure). I’ve also added the patchset to the Issue tracker for Net_Gearman at github.

The patchset (created from the current master branch at github) can be downloaded here: GearmanGetStatusSupport.tar.gz.

UPDATE: I’ve finally gotten around to creating my own fork of Net_Gearman on github too. This fork features the patch mentioned above.