New Times Ahead, Baby!

As the most observant people out there have probably noticed, I’ve given the site a little face lift to bring it into the next century (so bring it on, 2100!!!11). The illustration was done by the very talented Anette Heiberg, Children’s Book Illustrator, who is also the one single person who manages to live together with me. A neat little coincidence there!

Anyways, the new design is dark, but I’ve decided to use the inverse header for each post, as it makes scanning the page for individual items _very_ effective. I like it, so it stays.

Happy Happy Joy Joy!

Christer and His Quest For More Zend_Form-age

I finally found out why Christer had been so quiet all day: he’s obviously been writing the largest post seen in the history of blogs. His introduction to Translating Zend Form Error Messages is a giant of a beast, and gives a thorough introduction to using Zend_Translate together with Zend_Form, with resource files driving a user interface in several localized versions.

Solving UTF-8 Problems With Solr and Tomcat

Came across an issue with searching for UTF-8 characters in Solr today: the search worked just as it should (probably since we’re using a phonetic field to search), but our facets and limitations didn’t. This happened as soon as a value contained a non-ASCII character (code point above 127), in our case the Norwegian letters Æ, Ø or Å.

The solution was presented by Charlie Jackson on the Solr-user mailing list, and is quite simply to add URIEncoding="UTF-8" to the appropriate connector in the Tomcat server.xml file. This is also documented on the Solr on Tomcat page in the Solr Wiki.
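To see why the connector setting matters, here is a minimal sketch of what happens to a percent-encoded Ø when the bytes are decoded with the wrong charset. It assumes the connector falls back to ISO-8859-1 when URIEncoding is not set, as older Tomcat versions did by default; the class and method names are made up for the illustration and are not part of Tomcat or Solr.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UriDecodingDemo {
    // Decodes a percent-encoded query value with the given charset,
    // which is roughly what a servlet container does for URI parameters.
    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        // "Ø" (U+00D8) is sent as the two UTF-8 bytes C3 98.
        String encoded = "%C3%98";

        String asUtf8 = decode(encoded, StandardCharsets.UTF_8);
        String asLatin1 = decode(encoded, StandardCharsets.ISO_8859_1);

        // Decoded as UTF-8 the value is a single character; decoded as
        // ISO-8859-1 each byte becomes its own character, so the value
        // no longer matches what is stored in the index.
        System.out.println("UTF-8 length: " + asUtf8.length());
        System.out.println("ISO-8859-1 length: " + asLatin1.length());
    }
}
```

A facet value that has been mangled this way simply matches nothing, which fits the symptom of the facets breaking while the phonetic search kept working.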

Solr: Using the dismax Query Handler and Still Limit a Specific Field

While working with the facets for our search result earlier today, I came across the need to limit the search against Solr on one specific field in addition to our regular search string (which we run against several fields with different weights). The situation was something like this:

  • Lastname
  • AggregateSearchField
  • AggregatePhoneticSearchField

We do the searches against the AggregateSearchField and the AggregatePhoneticSearchField, where we weight the exact match higher than the phonetic matches. This ensures that the more specific matches are ranked higher than those that are merely similar. We do this for several different field groupings, but that’s not relevant for this post, so let’s just assume that these are the three relevant fields. We search against two of them, and use Lastname as a facet / navigator field to allow users to get more specific with their search.
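As a rough illustration of the weighting, dismax takes its field list through the qf parameter, where each field can carry a boost in the form field^boost. The sketch below only builds that string; the boost values 2.0 and 1.0 are assumptions picked for the example and not the weights we actually run with, and the class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DismaxWeights {
    // Builds a dismax "qf" value such as "FieldA^2.0 FieldB^1.0"
    // from an ordered map of field name to boost.
    static String buildQf(Map<String, Double> boosts) {
        StringBuilder qf = new StringBuilder();
        for (Map.Entry<String, Double> entry : boosts.entrySet()) {
            if (qf.length() > 0) {
                qf.append(' ');
            }
            qf.append(entry.getKey()).append('^').append(entry.getValue());
        }
        return qf.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("AggregateSearchField", 2.0);         // assumed boost: exact matches count more
        boosts.put("AggregatePhoneticSearchField", 1.0); // assumed boost: phonetic matches count less
        System.out.println(buildQf(boosts));
    }
}
```

The resulting string would be passed as the qf request parameter, or set in the handler defaults in solrconfig.xml.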

However, while users should be allowed to get more specific with their search when selecting one of the facets, it should not change their regular search. And since the dismax handler will search through all the allowed fields for a given value, you cannot just append Lastname:facetValue to the search string and be done with it (dismax does not support fielded searches through the regular query). After a bit of searching through our friends over at Google, I finally stumbled across the solution (which I of course should have seen on the Solr wiki): use the fq parameter. This allows you to submit a “Filter Query” in your request, which restricts the results of your existing query to the documents that also match the filter. This fits very neatly with keeping your original query and then appending one filter query for each facet limitation that gets set.

A small code example for Solrj, where filterQueries is a HashMap<String, String> containing the facets. filterQueries.put("Lastname", "Smith") will add a limitation on the field "Lastname" being "Smith" (you might want to escape any double quotes in the facet values):

if (filterQueries != null) {
    for (String field : filterQueries.keySet()) {
        String value = filterQueries.get(field);
        query.addFilterQuery(field + ":\"" + value + "\""); // this adds Lastname:"Smith" as a filter query
    }
}

So now we can just parse the query string for valid facet limitations, and set the fields in the filterQueries HashMap accordingly. As we already have a list of facet fields to include, this is as simple as iterating over that list and checking for the parameters in the request variables.
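The parsing step could look roughly like the sketch below. It is a standalone illustration rather than our actual code: the request parameters are represented as a plain Map<String, String>, the class and method names are made up, and the quote escaping is just one way of handling the escaping issue mentioned earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FacetFilters {
    // Picks out the request parameters that match a known facet field.
    // Parameters that are not facet fields (such as the search string) are ignored.
    static Map<String, String> parse(Map<String, String> params, List<String> facetFields) {
        Map<String, String> filterQueries = new LinkedHashMap<>();
        for (String field : facetFields) {
            String value = params.get(field);
            if (value != null && !value.isEmpty()) {
                filterQueries.put(field, value);
            }
        }
        return filterQueries;
    }

    // Turns each entry into a quoted filter query string such as Lastname:"Smith",
    // escaping backslashes and double quotes inside the value.
    static List<String> toFilterQueries(Map<String, String> filterQueries) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : filterQueries.entrySet()) {
            String escaped = entry.getValue().replace("\\", "\\\\").replace("\"", "\\\"");
            result.add(entry.getKey() + ":\"" + escaped + "\"");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "john");         // the regular search string, left untouched
        params.put("Lastname", "Smith"); // a facet limitation chosen by the user

        List<String> facetFields = new ArrayList<>();
        facetFields.add("Lastname");

        for (String fq : toFilterQueries(parse(params, facetFields))) {
            System.out.println(fq);
        }
    }
}
```

Each resulting string can then be handed to query.addFilterQuery(...) in place of the manual concatenation.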

Many thanks to Mike Klaas, whose reply in the dismax and Plone thread sent me in the right direction.

Handling Large Datasets at Google

High Scalability has a neat post today highlighting a recent presentation given by Jeff Dean from Google at this year’s Data-Intensive Computing Symposium. The presentation, named “Handling Large Datasets at Google: Current Systems and Future Directions” (video hosted by Yahoo!, slides), covers quite a number of issues and thoughts about what it takes to run something handling petabytes of data.

I’ll leave you with quite an interesting list shown in slide 8 (of 58) under the title of “Typical first year for a new cluster”:

  • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  • ~1 network rewiring (rolling ~5% of machines down over 2-day span)
  • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  • ~5 racks go wonky (40-80 machines see 50% packet loss)
  • ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  • ~12 router reloads (takes out DNS and external vips for a couple minutes)
  • ~3 router failures (have to immediately pull traffic for an hour)
  • ~dozens of minor 30-second blips for dns
  • ~1000 individual machine failures
  • ~thousands of hard drive failures

The Importance of The Double-Click Time

Raymond Chen’s blog “The Old New Thing” is an invaluable source of interesting theories and histories about the inner workings of Windows. If you haven’t read his book “The Old New Thing: Practical Development Throughout the Evolution of Windows” yet, add it to your wishlist now. Although some parts of it can be a bit too much code and internals, the stories and the appendices are simply awesome. Well worth it.

But this post wasn’t supposed to be about that, so I’ll leave you with what I intended to write about instead: the recently posted entry about how several different values are derived from the double-click time setting.

Typesetting on the World Wide Web

The awesome people over at Smashing Magazine have a neat article up today about 5 principles and ideas for setting type on the web. While I do not agree with the usability of some of the examples (in particular, the first and last examples in section 4 are painful to look at), the article is informative and presents quite a few issues and good tips about typography and the web. Keep a bookmark available for the next time you’re sketching up a new site (.. which I’ll have to do with the design around here soon ..).

Memcached Internals

Ilya Grigorik has posted a very good summary of a talk that Brian Aker and Alan Kasindorf gave about memcached at the MySQL User Conference last week. The article is straight to the point about several key attributes of memcached, and serves up almost 30 direct tips and tidbits about how to use memcached in a more optimal way. Awesome reading, and well worth checking out together with the slides from the memcached talk.

PHP Vikinger Registration Up and Running

This year’s version of the unconference PHP Vikinger is taking place on the 21st of June in Skien, Norway. Derick has just opened up the registration, which involves using high tech methods such as an e-mail client and writing your name and other relevant information. One thing’s for sure: Eirik, Christer and I are heading out, and hopefully we’ll get a few more friends to join in... and that includes YOU!