Handling Large Datasets at Google

High Scalability has a neat post today highlighting a recent presentation given by Jeff Dean from Google at this year’s Data-Intensive Computing Symposium. The presentation, named “Handling Large Datasets at Google: Current Systems and Future Directions” (video (hosted by Yahoo!), slides), dips into quite a number of issues and thoughts about what it takes to run something that handles petabytes of data.

I’ll leave you with a quite interesting list shown in slide 8 (of 58) under the title “Typical first year for a new cluster”:

  • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  • ~1 network rewiring (rolling ~5% of machines down over 2-day span)
  • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  • ~5 racks go wonky (40-80 machines see 50% packet loss)
  • ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  • ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
  • ~3 router failures (have to immediately pull traffic for an hour)
  • ~dozens of minor 30-second blips for DNS
  • ~1000 individual machine failures
  • ~thousands of hard drive failures

The Importance of The Double-Click Time

Raymond Chen’s blog “The Old New Thing” is an invaluable source of interesting theories and histories about the inner workings of Windows. If you haven’t read his book “The Old New Thing: Practical Development Throughout the Evolution of Windows” yet, add it to your wishlist now. Although some parts of it dive a bit too deep into code and internals, the stories and the appendices are simply awesome. Well worth it.

But this post wasn’t supposed to be about that, so I’ll leave you with what I intended to write about instead: the recently posted entry about how several different values are derived from the double-click time setting.

Typesetting on the World Wide Web

The awesome people over at Smashing Magazine have a neat article up today about 5 principles and ideas for setting type on the web. While I do not agree with the usability aspects of some of the examples (in particular, the first and last examples in section 4 are painful to watch), the article is informative and presents quite a few issues and good tips about typography on the web. Keep a bookmark handy for the next time you’re sketching up a new site (.. which I’ll have to do with the design around here soon ..).

Memcached Internals

Ilya Grigorik has posted a very good summary of a talk that Brian Aker and Alan Kasindorf gave about memcached at the MySQL User Conference last week. The article gets straight to the point on several key aspects of memcached, and serves up almost 30 direct tips and tidbits about how to use memcached more optimally. Awesome reading, and well worth checking out together with the slides from the memcached talk.

PHP Vikinger Registration Up and Running

This year’s edition of the unconference PHP Vikinger takes place on the 21st of June in Skien, Norway. Derick has just opened up registration, which involves using high-tech methods such as an e-mail client and writing your name and other relevant information. One thing’s for sure: Eirik, Christer and I are heading out, and hopefully we’ll get a few more friends to join in.. and that includes YOU!

Canon EOS 5D Mark II Coming?

Wired’s Gadget Lab has a neat scoop regarding the long-awaited upgraded version of the Canon EOS 5D! According to information that was supposedly put online by the German division of Canon, the upgraded full-frame camera gets a DIGIC III processor, a total of 16 megapixels and a 6.5 fps shooting speed. I’m already drooling enough to make a small puddle.

Earlier rumors have indicated a price somewhere around $3299 ($3000 – $3500). An announcement from Canon is expected on Friday, but it’s not known whether it will have anything to do with a Canon EOS 5D Mark II.

New Week, New Book: Software Estimation: Demystifying the Black Art

It’s a new week, and as I finished my previous “to read while taking the train” book last week, I’ve now started on another well-received book, this time about software estimates. The book is published by Microsoft Press and is named Software Estimation: Demystifying the Black Art. The author, Steve McConnell, also has a webpage online, in addition to keeping the blog 10x Software Development. It’s been a good read so far, and considering the current exchange rate between Norwegian kroner and the US dollar, it’s a steal at $26.39 at Amazon.

Stuart Herbert Takes a Look at apache2-mpm-itk

Stuart Herbert has taken a closer look at apache2-mpm-itk, a patch for the Apache 2 prefork MPM that enables Apache to switch which user it runs as based on which VirtualHost serves the request. The author of the module, Steinar H. Gunderson, is a good friend of mine, and it’s always good to see familiar names getting attention for the things they’re writing.

The post from Stuart is pretty straightforward, but he fails to mention that apache2-mpm-itk is available as a regular package in all current Debian versions. Simply apt-get away, and you’re all set.
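For reference, here's a minimal sketch of what using the module looks like. The hostnames, paths and user/group names below are made-up examples; the AssignUserId directive is the one provided by mpm-itk:

```apache
# On Debian: apt-get install apache2-mpm-itk

<VirtualHost *:80>
    ServerName example.org
    DocumentRoot /var/www/example.org

    # mpm-itk: handle all requests for this VirtualHost
    # as this user and group instead of the default www-data
    AssignUserId exampleuser examplegroup
</VirtualHost>
```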

Solr: Deleting Multiple Documents with One Request

One of the final steps in my current Solr adventure was to make it possible to remove a large number of documents from the index at the same time. As we’re currently using Solr to store phone information, we may have to remove several thousand records in one large update. The examples on the Solr Wiki show how to remove a single document by posting a simple XML document, or how to remove documents by query. I would rather avoid beating our Solr server with 300k single delete requests, so I tried the obvious tactics of submitting several IDs in one document, making several <delete> elements in one document, etc., but nothing worked the way I wanted it to.

After a bit of searching and stumbling around with Google, I finally found this very useful tip from Erik Hatcher. The trick is to simply rewrite the delete request as a delete-by-query, and then submit all the IDs to be removed as one big OR query. On our development machine, Solr removed 1000 documents in somewhere around 900 ms. Needless to say, that’s more than fast enough, and it solved our problem.

To sum it up, write the delete-by-query statement as:

id:(123123 OR 13371337 OR 42424242 .. ) 
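A quick sketch of how such a batched delete could be put together and posted to Solr. The endpoint URL and the batch size are assumptions (adjust for your own setup; keeping batches around 1000 IDs stays below Solr's default boolean clause limit):

```python
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumed endpoint
BATCH_SIZE = 1000  # stay below Solr's default maxBooleanClauses (1024)

def build_delete_xml(ids):
    """Build a delete-by-query XML document for a batch of document IDs."""
    query = "id:(" + " OR ".join(str(i) for i in ids) + ")"
    return "<delete><query>%s</query></delete>" % query

def delete_documents(ids):
    """POST one delete-by-query request per batch of IDs to Solr."""
    for start in range(0, len(ids), BATCH_SIZE):
        xml = build_delete_xml(ids[start:start + BATCH_SIZE])
        req = urllib.request.Request(
            SOLR_UPDATE_URL,
            data=xml.encode("utf-8"),
            headers={"Content-Type": "text/xml"},
        )
        urllib.request.urlopen(req)

# Show the XML generated for a single small batch
print(build_delete_xml([123123, 13371337, 42424242]))
```

Remember that deletes, like other updates, only become visible once a commit is issued.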

Thanks intarwebs!