Ready for 2010: HTTP Headers and Client Side Caching

There’s a few easy changes you can do to your website setup to speed up content delivery and eat up less bandwidth: configure proper expire values and if possible, keep your static resources on a separate domain.

The HTTP Expires Header

Expires tells the client how long it can keep the current version of a resource as the most recent one. If you set the Expires-header a while into the future, the browser will not make a new request for the file until the resource, well, expires (depending on the cache settings for the browser, requesting a reload (such as shift-reloading in a browser), etc. which can expire the resource earlier). The potential problem is the case where a resource actually changes, such as deploying a change to your stylesheet or external javascript files.

The fix for this is to include something about the file which changes when the file is physically updated on the disk. This can be the last modified time (please keep this cached in your web application, you do not want to hit the disk to retrieve the value for each page view), the current revision number from your revision control system (such as SVN – you can get the current revision of a file by using svn info, and please, cache that value to. You do not want to call svn for each page view :-)) or something else, such as the md5 or crc32 hash of the file. The important part is that you include this value as part of the request, making the URL to the resource unique depending on the version of the resource. You can safely ignore this part of the URL in your rewrite / controller routing magic / handling application, as the only function it has is to tell the browser that it has to request a new file and not use the old one anymore.

Examples of URL-schemes To Get Around Expires:-headers

  1. flickr uses as simple .v in their URLs to indicate the version of the file: http://l.yimg.com/g/css/c_sets.css.v74709.14
  2. On Gamer.no we use the current SVN revision: /css/main.css?v=1120M
  3. vg.no uses the current date, followed with an identifier that probably indicates the current revision for that day: css/frontpage.css?20091203-1

It’s important to remember that the identifier is not used to deliver an older version of the file depending on the parameter, just to make the browser see the new resource. The old URL can still serve the new resource – and if you need to keep old versions around, you’ve probably solved this issue already.

Use a Separate Domain for Static Resources

By using another, separate domain for your static resources, you’re letting browsers fetch the static resources while they’re still processing your HTML. The HTTP/1.1 specification says that browsers never should request more than two files at the same time from the same domain. When you host your static resources on another domain, you tell the browser that it can go ahead and fetch those resources while being busy with downloading other items from your main site.

After you’ve moved your static resources to a separate domain, you’ll usually also end up using less bandwidth. Since you’re now delivering the most requested content from another host, cookies will not be included in the request from the browser. When a browser makes a request for a resource on a certain host, it includes all the cookies that have been set for that domain. This happens independent of which files it’s requesting, and if you have a large number of separate files (which you probably could include into one larger file – resulting in fewer HTTP requests), these Cookie-headers can add up to a significant amount of bandwidth. The HTTP server will also have less work to do, making everyone happier!

If you use www. as a prefix for all your regular HTTP requests and take care of setting your cookies in the www.example.com domain, you should be able to simply use something like static.example.com for your static content and avoid leaking cookies into the other subdomain. If you have loads of static content, you can also use several separate subdomains for your files, but be sure to let the request for a certain file point to the same subdomain each time – otherwise you’ll end up with the browser requesting four copies of the same, identical file and actually breaking the regular cache in the browser (which uses If-Modified-Since to tell the server when it last downloaded the file. We want to avoid the browser making the request again at all). At pwned.no I calculate the crc32 of the filename and use that value to determine which static host the request should use. We also redirect any requests directly to pwned.no to www.pwned.no to make the cookie structure consistent. We do however not set the Expires-header yet, but that might be a part of the next update to the site.

Do you have a particular caching strategy you use for client side content? What kind of URL format works best for you? Leave a comment!

Read all the articles in the Ready for 2010-series

Ready for 2010: Check Your Indexes

One of the many things you should try to keep a continuous watch for during the life of any of your applications are the performance of your SQL queries. You might be caching the hell out of your database layer, but some time you’ll have to hit the database server to retrieve data. And if that starts to happen often enough while you’re growing, you will see your SQL daemon taking up the largest part of your disk io and your CPU time. This might not be a problem for the load you’re seeing now, but could you handle a 10 fold increase in traffic? .. or how about 100x? (which, if I remember correctly, is what Google uses as the scale factor when developing applications)

Indexes Are Your Friend

During the christmas holiday I got around to taking a look at some of the queries running at one of my longest living, most active sites: pwned.no. Pwned is a tournament engine running on top of PHP and MySQL, containing about 40.000 tournaments, 450.000 matches and several other database structures. The site has performed well over the years and there hasn’t been any performance issues other than a few attempts at DoS-ing the site with TCP open requests (the last one during the holiday, actually).

Two weeks ago the server suddenly showed loads well above 30 – while it usually hovers around 0.3 – 0.4 at the busiest hours of the day. The reason? One of the previously less used functions of the tournament engine, using a group stage in your tournament, had suddenly become popular in at least one high traffic tournament. This part of the code had never been used much before, but when the traffic spike happened everything went bananas (B-A-N-A-N-A-S. Now that’s stuck in your head. No problem.) The reason: the query used a couple of columns in a WHERE-statement that wasn’t indexed, and the query ran against the table containing the matches for the tournament. This meant that over 400.000 rows were scanned each time the query ran, meaning that mysqld started hogging every resource it could. The Apache childs then had to wait, making the load a bit too much for my liking. Two CREATE INDEX-calls later the load went back down and everything chugged along nicely again.

My strategy for discovering queries that might need a better index scheme (or if “impossible”, a proper caching layer in front of it):

  1. Run your development server with slow-query-log=1, log-queries-not-using-indexes=1 and long-query-time=<an appropriately low value, such as 0.05 – depends on your setup>. You can also provide a log file name with the log-slow-queries=/var/log/mysql/… in your my.cnf-file for MySQL. This will log all potential queries for optimizing to the log file (this will not necessarily provide you with a complete list of good queries to optimize, but it might provide a few good hints). Be sure to use actual data from your site when working on your development version, as you might start seeing issues when the size of the data set reaches a certain size – such as 400.000 rows in the example mentioned above)
  2. Connect to your MySQL server and issue
    SHOW PROCESSLIST

    and

    SHOW FULL PROCESSLIST

    statements every now and then. This will let you see any queries that run often and way too long (but they’ll have to run when you issue the command). You might not catch the real culprit, but if you’re seing MySQL chugging along with 100% CPU and are wondering what’s happening, try to check out what the threads are doing. You’ll hopefully see just which query is wreaking havoc with your server.

  3. Add a statistics layer in front of your MySQL calls in your application. If you’re using PDO you can subclass it to keep a bit of statistics about your queries around. The number of times each query is run, the time it took in total running the query and other interesting values. We’re using a version of this in the development version of Gamer.no and I’ll probably upload the class to my github repository as soon as I get a bit of free time in the new year.

Not sure what I’ll take a closer look at tomorrow, but hopefully I’ll decide before everything collapses!

What are your strategy for indexes? What methods do you use for finding queries that need a bit more love? Leave a comment below!

Read all the articles in the Ready for 2010-series

Redesigning the “Sign Up Team” Form at pwned.no

I’ve spent a couple of days on and off redesigning the “sign up a team” at my tournament site, pwned.no. The currently live form is divided into two sections, depending on wether you want to sign up one of your existing teams or want to just sign up a temporarily created team — or one you haven’t created a proper team for yet:

Old signup form at pwned.no

I’m not happy about the fact that the form is actually divided into two parts, meaning that you have to read two sections to actually understand that you can sign up with an existing team – or write the name of one right there. The forms also have information that are common for both forms, so we’re duplicating a few user interface elements.

The new form still lack a bit of functionality to dynamically enable and disable the submit button based on the selection in the form, but since disabling the form made it look uglier than ugly, I haven’t added that bit yet. The screenshots also differs a bit since Opera and Firefox doesn’t quite agree on the double border CSS.

The new, complete version:

pwned-signup-new

And if you don’t have any existing teams (clans) available, the form now includes a link to the page where you can create one – and try to explains why you should do that.

pwned-signup-no-clans

Also, if you’re not logged in, you’re just presented with the simple form. I think I’m going to add a notice about signing in with a bit of text in the same gist as the create clan bit.

pwned-signup-not-admin

Hopefully this will help reduce any confusion, but luckily the form is quite small anyways. I’ve also added more proper error hilighting, both before signing up and after failing to do so for any reason. Small fixes, but they should increase usability.

Any suggestions of potential ways to enhance this form even more would of course be appreciated!