Solr: Using the dismax Query Handler and Still Limiting a Specific Field

While working with the facets for our search results earlier today, I came across the need to limit the search against Solr on one specific field in addition to our regular search string (which we run against several fields with different weights). The situation was something like this:

  • Lastname
  • AggregateSearchField
  • AggregatePhoneticSearchField

We do the searches against the AggregateSearchField and the AggregatePhoneticSearchField, where we weight the exact matches higher than the phonetic matches. This ensures that the more specific matches are ranked higher than those that are merely similar. We do this for several different field groupings, but that’s not relevant for this post, so let’s just assume that these are the three relevant fields. We search against two of them, and use Lastname as a facet / navigator field to allow users to get more specific with their search.

However, while users should be able to narrow their search by selecting one of the facets, doing so should not change their regular search. And since the dismax handler will search through all the allowed fields for a given value, you cannot just append Lastname:facetValue to the search string and be done with it (dismax does not support fielded searches through the regular query). After a bit of searching with the help of our friends over at Google, I finally stumbled across the solution (which I of course should have seen on the Solr wiki): use the fq parameter. This allows you to submit a “Filter Query” in your request, which restricts the results of your main query to the documents that also match the filter, without affecting how they are scored. This fits very neatly with keeping your original query and then appending a filter query for each facet limitation that gets set.
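To make the split between the main query and the filter concrete, here is a minimal sketch of the request parameters involved, using only the Java standard library. The class name, the field weights (2.0 and 0.5) and the use of defType=dismax are assumptions for illustration; older setups select a dismax request handler with the qt parameter instead, and the real weights live in our own configuration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the query string of a Solr select request.
class DismaxRequestSketch {
    // URL-encode a single name=value pair.
    static String param(String name, String value) {
        return name + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    static String buildQueryString(String userQuery, List<String> filterQueries) {
        List<String> parts = new ArrayList<>();
        parts.add(param("defType", "dismax")); // assumption: dismax selected via defType
        parts.add(param("q", userQuery));      // the user's search string, left untouched
        // Assumed weights: exact matches count more than phonetic ones.
        parts.add(param("qf", "AggregateSearchField^2.0 AggregatePhoneticSearchField^0.5"));
        for (String fq : filterQueries) {
            parts.add(param("fq", fq));        // one fq parameter per facet limitation
        }
        return String.join("&", parts);
    }

    public static void main(String[] args) {
        System.out.println(buildQueryString("john smith", List.of("Lastname:\"Smith\"")));
    }
}
```

The point of the sketch is that q never changes when a facet is selected; each selection only adds another fq parameter.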

A small code example for SolrJ. Here filterQueries is a HashMap<String, String> which contains the facets; filterQueries.put("Lastname", "Smith") will add a limitation requiring the field Lastname to be "Smith" (you might want to escape double quotes in the facet values):

if (filterQueries != null)
{
    // iterate over the entries directly instead of looking up each key again
    for (Map.Entry<String, String> facet : filterQueries.entrySet())
    {
        // this adds e.g. Lastname:"Smith" as a filter query
        query.addFilterQuery(facet.getKey() + ":\"" + facet.getValue() + "\"");
    }
}
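
The escaping mentioned above could look something like the following. This is a sketch under the assumption that, inside a double-quoted phrase, only the double quote and the backslash need a backslash in front of them; the class and method names are made up for the example. SolrJ also ships ClientUtils.escapeQueryChars, which escapes a wider set of characters and may be the better choice for unquoted values.

```java
// Hypothetical helper for building quoted filter queries safely.
class FilterQueryEscaper {
    // Escape a value so it can be placed between double quotes.
    static String escapePhrase(String value) {
        // Backslashes first, so the backslashes added for quotes are not doubled.
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build field:"value" with the value escaped.
    static String toFilterQuery(String field, String value) {
        return field + ":\"" + escapePhrase(value) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toFilterQuery("Lastname", "Smith"));
        System.out.println(toFilterQuery("Lastname", "O\"Neil"));
    }
}
```

With this in place, the call in the loop above would pass the result of toFilterQuery to addFilterQuery instead of concatenating the raw value.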

So now we can just parse the query string for valid facet limitations, and set the fields in the filterQueries HashMap accordingly. As we already have a list of facet fields to include, this is as simple as iterating over that list and checking for the parameters in the request variables.
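
That parsing step might be sketched as below. The Map<String, String[]> shape mirrors what a servlet request's getParameterMap() returns, which is an assumption about the web layer; the class name and the City field are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks facet limitations out of the request parameters.
class FacetParameterParser {
    static Map<String, String> parse(List<String> facetFields,
                                     Map<String, String[]> requestParams) {
        Map<String, String> filterQueries = new LinkedHashMap<>();
        // Only fields from our own facet list are considered, so arbitrary
        // request parameters can never end up as filter queries.
        for (String field : facetFields) {
            String[] values = requestParams.get(field);
            if (values != null && values.length > 0 && !values[0].isEmpty()) {
                filterQueries.put(field, values[0]);
            }
        }
        return filterQueries;
    }

    public static void main(String[] args) {
        Map<String, String[]> params = Map.of(
                "q", new String[] {"john"},
                "Lastname", new String[] {"Smith"});
        System.out.println(parse(List.of("Lastname", "City"), params));
    }
}
```

Iterating over the known facet fields, rather than over the incoming parameters, doubles as a whitelist.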

A big thank-you to Mike Klaas, whose reply in the dismax and Plone thread indexed by nabble.com sent me in the right direction.

Handling Large Datasets at Google

High Scalability has a neat post today highlighting a recent presentation given by Jeff Dean from Google at this year’s Data-Intensive Computing Symposium. The presentation, named “Handling Large Datasets at Google: Current Systems and Future Directions” (video, hosted by Yahoo!; slides), touches on quite a number of issues and thoughts about what it takes to run something handling petabytes of data.

I’ll leave you with quite an interesting list shown in slide 8 (of 58) under the title of “Typical first year for a new cluster”:

  • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  • ~1 network rewiring (rolling ~5% of machines down over 2-day span)
  • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  • ~5 racks go wonky (40-80 machines see 50% packet loss)
  • ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  • ~12 router reloads (takes out DNS and external vips for a couple minutes)
  • ~3 router failures (have to immediately pull traffic for an hour)
  • ~dozens of minor 30-second blips for dns
  • ~1000 individual machine failures
  • ~thousands of hard drive failures

The Importance of The Double-Click Time

Raymond Chen’s blog “The Old New Thing” is an invaluable source of interesting theories and histories about the inner workings of Windows. If you haven’t read his book “The Old New Thing: Practical Development Throughout the Evolution of Windows” yet, add it to your wishlist now. Although some parts of it can be a bit too much code and internals, the stories and the appendices are simply awesome. Well worth it.

But this post wasn’t supposed to be about that, so I’ll leave you with what I intended to write about instead: the recently posted entry about how several different values are derived from the double-click time setting.

Typesetting on the World Wide Web

The awesome people over at Smashing Magazine have a neat article up today about 5 principles and ideas for setting type on the web. While I do not agree with the usability concept of some of the examples (in particular, the first and last examples in section 4 are painful to watch), the article is informative and presents quite a few issues and good tips about typography and the web. Keep a bookmark available for the next time you’re sketching up a new site (... which I’ll have to do with the design around here soon ...).

Memcached Internals

Ilya Grigorik has posted a very good summary of a talk that Brian Aker and Alan Kasindorf gave about memcached at the MySQL User Conference last week. The article is straight to the point regarding several key attributes of memcached, and serves up almost 30 direct tips and tidbits about how to use memcached in a more optimal way. Awesome reading, and well worth checking out together with the slides from the memcached talk.

PHP Vikinger Registration Up and Running

This year’s version of the unconference PHP Vikinger is taking place on the 21st of June in Skien, Norway. Derick has just opened up the registration, which involves using high-tech methods such as an e-mail client and writing your name and other relevant information. One thing’s for sure: Eirik, Christer and I are heading out, and hopefully we’ll get a few more friends to join in... and that includes YOU!

Canon EOS 5D Mark II Coming?

Wired’s Gadget Lab has noted a neat scoop regarding the long-awaited upgraded version of the Canon EOS 5D! According to the information that was supposedly put online by the German division of Canon, the upgraded full-frame camera gets a Digic III processor, a total of 16 megapixels and 6.5 fps shooting speed. I’m already drooling enough to make a small puddle.

Earlier rumors have indicated a price somewhere around $3299 ($3000 – $3500). An announcement from Canon is to be released on Friday, but it’s not known whether that will have anything to do with a Canon EOS 5D Mark II.

New Week, New Book: Software Estimation: Demystifying the Black Art

It’s a new week, and as I finished my previous “to read while taking the train” book last week, I’ve now started on another well-received book, this time about software estimates. The book is published by Microsoft Press and is named Software Estimation: Demystifying the Black Art. The author, Steve McConnell, also has a webpage online, in addition to keeping the blog 10x Software Development. It’s been a good read so far, and considering the current exchange rate between the Norwegian krone and the US dollar, it’s a steal at $26.39 at Amazon.

Stuart Herbert Takes a Look at apache2-mpm-itk

Stuart Herbert has taken a closer look at apache2-mpm-itk, a patch for the apache2 prefork handler that enables Apache to switch which user it runs as based on which VirtualHost serves the request. The author of the module, Steinar H. Gunderson, is a good friend of mine, and it’s always good to see familiar names getting attention for things they’re writing.

The post from Stuart is pretty straightforward, but he fails to mention that apache2-mpm-itk is available as a regular package in all current Debian versions. Simply apt-get away, and you’re all set.