Ready for 2010: HTTP Headers and Client Side Caching

There are a few easy changes you can make to your website setup to speed up content delivery and use less bandwidth: configure proper Expires values and, if possible, keep your static resources on a separate domain.

The HTTP Expires Header

The Expires header tells the client how long it can treat the current version of a resource as the most recent one. If you set the Expires header a while into the future, the browser will not make a new request for the file until the resource, well, expires (depending on the browser's cache settings, whether the user forces a reload, such as shift-reloading, etc., which can expire the resource earlier). The potential problem is the case where a resource actually changes, such as when you deploy a change to your stylesheet or external JavaScript files.
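As a minimal sketch, here is what those headers can look like when sent from PHP. In practice the web server configuration usually handles this for static files, and the one-year lifetime below is just an arbitrary value chosen for the example:

    <?php
    // Far-future expiry for a resource. The lifetime is an assumption;
    // pick whatever fits your deployment.
    $lifetime = 365 * 24 * 60 * 60; // one year, in seconds

    header('Cache-Control: public, max-age=' . $lifetime);
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $lifetime) . ' GMT');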

The fix for this is to include something about the file which changes when the file is physically updated on disk. This can be the last modified time (please keep this cached in your web application – you do not want to hit the disk to retrieve the value for each page view), the current revision number from your revision control system (such as SVN – you can get the current revision of a file by using svn info, and please, cache that value too; you do not want to call svn for each page view :-)) or something else, such as the md5 or crc32 hash of the file. The important part is that you include this value as part of the request, making the URL to the resource unique depending on the version of the resource. You can safely ignore this part of the URL in your rewrite / controller routing magic / handling application, as its only function is to tell the browser that it has to request a new file and not use the old one anymore.
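A hypothetical helper along those lines, using the file's last modified time. The function name and the document-root-relative path are made up for the example; a shared cache such as APC or memcached would avoid hitting the disk on every page view, as recommended above – here the value is only memoised for the duration of the request to keep the sketch self-contained:

    <?php
    // Append the file's mtime to the URL so the URL changes whenever the file does.
    function asset_url($path)
    {
        static $versions = array();

        if (!isset($versions[$path])) {
            $versions[$path] = filemtime($_SERVER['DOCUMENT_ROOT'] . $path);
        }

        return $path . '?v=' . $versions[$path];
    }

    // Usage in a template: echo asset_url('/css/main.css');
    // which renders something like /css/main.css?v=1262300400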

Examples of URL Schemes To Get Around Expires Headers

  1. Flickr uses a simple .v in their URLs to indicate the version of the file: http://l.yimg.com/g/css/c_sets.css.v74709.14
  2. On Gamer.no we use the current SVN revision: /css/main.css?v=1120M
  3. vg.no uses the current date, followed by an identifier that probably indicates the current revision for that day: css/frontpage.css?20091203-1

It’s important to remember that the identifier is not used to deliver an older version of the file depending on the parameter – it’s just there to make the browser see a new resource. The old URL can still serve the new content – and if you actually need to keep old versions around, you’ve probably solved this issue already.

Use a Separate Domain for Static Resources

By using another, separate domain for your static resources, you’re letting browsers fetch the static resources while they’re still processing your HTML. The HTTP/1.1 specification recommends that a client should not keep more than two simultaneous connections to the same host. When you host your static resources on another domain, you tell the browser that it can go ahead and fetch those resources while it’s busy downloading other items from your main site.

After you’ve moved your static resources to a separate domain, you’ll usually also end up using less bandwidth. Since you’re now delivering the most requested content from another host, cookies will not be included in those requests. When a browser makes a request for a resource on a certain host, it includes all the cookies that have been set for that domain. This happens independently of which files it’s requesting, and if you have a large number of separate files (which you probably could combine into one larger file, resulting in fewer HTTP requests), these Cookie headers can add up to a significant amount of bandwidth. The HTTP server will also have less work to do, making everyone happier!
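A minimal sketch of keeping cookies scoped to the www domain so that requests to the static domain carry no Cookie header at all (the cookie name, value and domain are placeholders for the example):

    <?php
    // Scope the cookie to www.example.com; requests to static.example.com
    // will then be sent without any Cookie header.
    $sessionId = md5(uniqid('', true)); // placeholder value for the example

    setcookie('session_id', $sessionId, 0, '/', 'www.example.com');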

If you use www. as a prefix for all your regular HTTP requests and take care of setting your cookies for the www.example.com domain, you should be able to simply use something like static.example.com for your static content and avoid leaking cookies into the other subdomain. If you have loads of static content, you can also spread your files across several separate subdomains, but be sure to let the request for a certain file point to the same subdomain each time – otherwise you’ll end up with the browser requesting a copy of the same, identical file from each subdomain, actually breaking the regular cache in the browser (which uses If-Modified-Since to tell the server when it last downloaded the file – and we want to avoid the browser making the request again at all). At pwned.no I calculate the crc32 of the filename and use that value to determine which static host the request should use. We also redirect any requests directly to pwned.no to www.pwned.no to keep the cookie structure consistent. We do, however, not set the Expires header yet, but that might be part of the next update to the site.
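A sketch of that crc32 approach – the host names are made up for the example, but the important property is that the same file name always maps to the same host:

    <?php
    // Pick a static host deterministically from the file name, so the browser
    // never downloads duplicate copies of the same file from different hosts.
    function static_host($filename)
    {
        $hosts = array(
            'static1.example.com',
            'static2.example.com',
            'static3.example.com',
            'static4.example.com',
        );

        $index = abs(crc32($filename)) % count($hosts);

        return 'http://' . $hosts[$index] . $filename;
    }

    // static_host('/gfx/logo.png') always resolves to the same host.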

Do you have a particular caching strategy you use for client side content? What kind of URL format works best for you? Leave a comment!


Ready for 2010: Check Your Indexes

One of the many things you should keep a continuous watch on during the life of any of your applications is the performance of your SQL queries. You might be caching the hell out of your database layer, but at some point you’ll have to hit the database server to retrieve data. And if that starts to happen often enough while you’re growing, you will see your SQL daemon taking up the largest part of your disk I/O and your CPU time. This might not be a problem for the load you’re seeing now, but could you handle a 10-fold increase in traffic? .. or how about 100x? (which, if I remember correctly, is what Google uses as the scale factor when developing applications)

Indexes Are Your Friend

During the Christmas holiday I got around to taking a look at some of the queries running at one of my longest-living, most active sites: pwned.no. Pwned is a tournament engine running on top of PHP and MySQL, containing about 40,000 tournaments, 450,000 matches and several other database structures. The site has performed well over the years and there haven’t been any performance issues other than a few attempts at DoS-ing the site with TCP open requests (the last one during the holiday, actually).

Two weeks ago the server suddenly showed loads well above 30 – while it usually hovers around 0.3 – 0.4 at the busiest hours of the day. The reason? One of the previously less used functions of the tournament engine, using a group stage in your tournament, had suddenly become popular in at least one high-traffic tournament. This part of the code had never been used much before, but when the traffic spike happened everything went bananas (B-A-N-A-N-A-S. Now that’s stuck in your head. No problem.) The cause: the query used a couple of columns in a WHERE clause that weren’t indexed, and the query ran against the table containing the matches for the tournament. This meant that over 400,000 rows were scanned each time the query ran, and mysqld started hogging every resource it could. The Apache children then had to wait, making the load a bit too much for my liking. Two CREATE INDEX calls later the load went back down and everything chugged along nicely again.

My strategy for discovering queries that might need a better index scheme (or, if that’s “impossible”, a proper caching layer in front of them):

  1. Run your development server with slow-query-log=1, log-queries-not-using-indexes=1 and long-query-time=<an appropriately low value, such as 0.05 – depends on your setup>. You can also provide a log file name with log-slow-queries=/var/log/mysql/… in your my.cnf file for MySQL. This will log all queries that are candidates for optimization to the log file (it will not necessarily give you a complete list of good queries to optimize, but it should provide a few good hints). Be sure to use actual data from your site when working on your development version, as you might only start seeing issues when the data set reaches a certain size – such as the 400,000 rows in the example mentioned above.
  2. Connect to your MySQL server and issue
    SHOW PROCESSLIST

    and

    SHOW FULL PROCESSLIST

    statements every now and then. This will let you see any queries that run often and way too long (though they have to be running at the moment you issue the command). You might not catch the real culprit, but if you’re seeing MySQL chugging along with 100% CPU and are wondering what’s happening, try checking out what the threads are doing. You’ll hopefully see just which query is wreaking havoc on your server.

  3. Add a statistics layer in front of your MySQL calls in your application. If you’re using PDO you can subclass it to keep a bit of statistics about your queries around: the number of times each query is run, the total time spent running it, and other interesting values – a rough sketch of such a class follows below. We’re using a version of this in the development version of Gamer.no and I’ll probably upload the class to my github repository as soon as I get a bit of free time in the new year.
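A rough sketch of what such a subclass might look like – the class name and the public $stats property are made up for the example, and the real Gamer.no class certainly differs:

    <?php
    // Collects per-query counts and accumulated run time. Use it anywhere
    // you would otherwise create a plain PDO instance.
    class StatisticsPDO extends PDO
    {
        public $stats = array();

        public function query($statement, $fetchMode = null, ...$fetchModeArgs)
        {
            $start = microtime(true);

            $result = ($fetchMode === null)
                ? parent::query($statement)
                : parent::query($statement, $fetchMode, ...$fetchModeArgs);

            $this->record($statement, microtime(true) - $start);

            return $result;
        }

        public function exec($statement)
        {
            $start = microtime(true);
            $result = parent::exec($statement);
            $this->record($statement, microtime(true) - $start);

            return $result;
        }

        protected function record($statement, $seconds)
        {
            if (!isset($this->stats[$statement])) {
                $this->stats[$statement] = array('count' => 0, 'time' => 0.0);
            }

            $this->stats[$statement]['count']++;
            $this->stats[$statement]['time'] += $seconds;
        }
    }

Prepared statements would need the same treatment (by wrapping prepare() and the resulting statement object), but the idea is the same: dump $stats at the end of the request and the heaviest queries stand out immediately.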

Not sure what I’ll take a closer look at tomorrow, but hopefully I’ll decide before everything collapses!

What’s your strategy for indexes? What methods do you use for finding queries that need a bit more love? Leave a comment below!


Ready for 2010: Upgrade Critical Software

You might remember that WordPress installation you did a couple of years ago, and that you’ve been ignoring the “upgrade now” message for almost as long. This “Ready for 2010” message is sponsored by the “Get Your Software Up To Date” foundation.

Before starting down this road, it’s a good idea to make sure your backups work. :-)

Yes, you should do these things all year round, but this should at least be a perfect occasion to take the time to check out that old server you installed just to test stuff at home, the server where you’re hosting your private blog, your mail server, etc. We geeks have some sort of weird ability to accumulate a couple of servers in strange places and forget about them – while still using them for something on a daily basis.

Update and Upgrade Your Distribution

Spend a couple of hours getting the distribution up to date. For Linux-based servers this usually means using the package system’s update manager, such as apt-get or aptitude on Debian or Ubuntu-based servers, up2date on Red Hat, the SUSE update manager, etc. On Windows-based servers you’ll be running the update manager, getting all the recent fixes and patches into the core set of tools.

If you’re feeling a bit adventurous you should also consider upgrading your distribution to a newer version if one is available. This will make newer versions of the software you’re running available, get new features into your applications, and other Good Things. It might, however, break a few existing things: the layout of certain configuration files, new default values for some settings and other small stuff. Be sure to set aside a couple of hours for this – and don’t do it just as you’re leaving the country for a couple of months.

Find – and Upgrade – Installed Web Applications

It’s very important to keep web applications you’ve installed, such as WordPress, up to date. As their nature makes them available over the internet, they’re often a preferred vector for automated attacks against your server. As soon as a remote exploit has been found, you’ll start noticing attempts to break into your server in your web server’s access logs. Usually the attacks only work under a particular configuration, but there’s always someone who gets in trouble because of just that factor. That might be you!

WordPress (and several other web applications) also has an automagical update mode, where you can update the software simply by clicking a link in the admin interface. Be sure to spend the half hour needed to get the automagical update working where it’s available – and do it now!

Upgrade Embedded Software

Our lives are filled with devices that run any sort of embedded software. Your mobile phone, your digital camera, your wireless router, your TV, your media player, your game consoles, your network attached storage disk (NAS), etc.

Check out the manufacturer’s website for the different devices (or, if you’re a nerd like .. well, me, you might have replaced the firmware with an alternative one) and see if there are any updates available. Several devices are also able to update themselves, so be sure to log in to the device (and discover that you don’t remember the password you set a couple of years ago) and check if you can just click a button to make everything go Happy Happy Joy Joy.

You might also have several applications installed on your mobile phone – check for updates and any critical fixes. You do not want your mobile phone to leak private details out onto the world wide web or through Bluetooth.

Any other issues one should be sure to cover when doing this? Leave a comment below!


There’s A Difference Between Being Inspired By and Outright Copying

A recently launched service that has gotten way too much attention in the Norwegian press today is Qpsa.no – another “WHAT ARE YOU DOING NOW” service. Their business idea? Reimplementing Twitter, copying its look and defending it with “It’s in Norwegian!”.

First, let’s get this out of the way; I have absolutely no problem with people implementing services similar to other people’s, iterating on concepts, being inspired by others and, in general, standing on the shoulders of giants. I do however have a slight problem with people directly copying other people’s success stories and passing them off as “revolutionizing social networks in Norway”. And although some news items have pointed out the link to Twitter, they have all failed to point out the fact that this is a blatant ripoff of the original service.

First, let’s start by looking at the main page:

Then we browse over to our international friends and discover their main page:

This seems way too similar to be a coincidence, so their “inspiration” seems quite obvious. In particular, notice the green box and the content of the page (nothing other than a sign-up form). They’ve even managed to get quite a few birds in there too. The only thing they’re missing is the beautiful layout and look of Twitter, but hey, you can’t have it all. Or can you? On to the next comparison:

Versus our now known international friends yet again:

Hmm. This seems quite similar (thanks to Mister Noname for getting me a screenshot of his tweets). Guess it’s not really that much about actually trying to be original, but more about just copying what other people have created.

Their defense for creating the site: Twitter is not available in Norwegian, and Twitter is slow (Twitter doesn’t scale! [two bonus meme points]). Yes, Twitter is slow from time to time, but this is where it gets even more interesting. Neither of the people behind the application is a web developer, and they obviously haven’t given much thought to why Twitter is slow.

My guess is that Twitter will hopefully register a formal complaint, or the people behind qpsa.no will get wiser and change their look. Maybe they’ll even try to actually build on the idea that created Twitter, and create something that is worth checking out. The largest Norwegian community, Nettby, has over 700,000 users (if we compare that to U.S. numbers, it would mean a US site with somewhere around 47 million active users), and could probably add this feature just as quickly – and with an established user base of that size, it would be a steam roller against a bird. A twittering little creature.

Bonus points for using “It’s in Norwegian!” as the main defense, then giving your service a name based on a Spanish phrase.

A Group Is Its Own Worst Enemy

I read an article about Clay Shirky a day or two ago on one of the blogs I follow (sorry, I don’t remember which (edit: Coding Horror: It’s Clay Shirky’s Internet, We Just Live In It)), which linked to a writeup of a presentation he had given about the structure and workings of social networks and some thoughts about the issues that surround them. It’s insight delivered one word at a time and most certainly worth a read for anyone in the online business.