Support for Solr in eZ Components’ Search

The new release of eZ Components (2008.1) adds a new Search module, and the first implementation included is an interface for sending search requests and new documents to a Solr installation. An introduction can be found over at the eZ Components Search Tutorial. The new release of eZ Components requires at least PHP 5.2.1 (and if you’re not already running at least 5.2.5, it’s time to get moving. The world is moving. Fast.).


Writing a Solr Analysis Filter Plugin

Update: If you’re writing a plugin for a Solr version after 1.4.1 or Lucene 3.0+, be sure to read Updating a Solr Analysis Plugin to Lucene 4.0 as well. A few of the method calls used below have changed in the newer API.

As we’ve been working on getting better results out of the phonetic search we’re currently doing at derdubor, I started writing a plugin for Solr to return better results when searching for Norwegian names. So far we’ve been using the standard phonetic filter from Solr 1.2, with the double metaphone encoder for encoding a regular token as a phonetic value. The trouble with this is that a double metaphone value is four simple letters, which means that a search word such as ‘trafikkontroll’ gets the same meaning as ‘Dyrvik’. The latter is a name, while the former is a regular search term that would be better served through an article view. TRAFIKKONTROLL resolves to TRFK in double metaphone, while DYRVIK resolves to DRVK. T and D are considered similar, as are V and F, and voilà: you’ve got yourself a match in the search result, but not a visual one (or a semantic one, as the words have very different meanings).
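If you want a quick feel for how a phonetic encoder collapses words like this, PHP bundles the original (single) metaphone algorithm as metaphone(). Note that this is not the double metaphone encoder Solr uses, so the actual codes will differ; it just illustrates the principle:

<?php
/* PHP only ships the original metaphone algorithm, not double metaphone,
   so the codes below won't match Solr's output. It merely shows how a
   phonetic encoder reduces very different words to short codes that can
   end up colliding. */
foreach (array('trafikkontroll', 'Dyrvik') as $word)
{
    printf("%s => %s\n", $word, metaphone($word));
}
?>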

To solve this, I decided to write a custom filter plugin that we can tune to the names in use in Norway. I’ll write about the reasoning behind the actual filtering rules, and hopefully post the complete filter function we’re applying, but I’ll leave that for another post.

First you need a factory that’s able to produce filters when Solr asks for them:

NorwegianNameFilterFactory.java:

package no.derdubor.solr.analysis;

import java.util.Map;

import org.apache.solr.analysis.BaseTokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;

public class NorwegianNameFilterFactory extends BaseTokenFilterFactory
{
    Map args;

    public Map getArgs()
    {
        return args;
    }

    /* called by Solr with the arguments defined for the filter in schema.xml */
    public void init(Map args)
    {
        this.args = args;
    }

    /* called whenever Solr needs a filter instance for an analysis chain */
    public NorwegianNameFilter create(TokenStream input)
    {
        return new NorwegianNameFilter(input);
    }
}
}

To compile this example yourself, put the file in no/derdubor/solr/analysis/ (which matches no.derdubor.solr.analysis in the package statement), and compile it with the Solr and Lucene jars on your classpath:

javac -cp apache-solr-core.jar:lucene-core.jar no/derdubor/solr/analysis/NorwegianNameFilterFactory.java

(adjust -cp to point at wherever your apache-solr-core.jar and lucene-core.jar live)

You’ll of course also need the filter itself (which is returned from the create method above):

package no.derdubor.solr.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class NorwegianNameFilter extends TokenFilter
{
    public NorwegianNameFilter(TokenStream input)
    {
        super(input);
    }

    public Token next() throws IOException
    {
        return parseToken(this.input.next());
    }

    public Token next(Token result) throws IOException
    {
        /* pass the reusable token instance on to the wrapped stream */
        return parseToken(this.input.next(result));
    }

    protected Token parseToken(Token in)
    {
        /* the stream hands us null when it is exhausted */
        if (in == null)
        {
            return null;
        }

        /* do magic stuff with in.termBuffer() here (a char[] which can be manipulated) */
        /* set the changed length of the new term with in.setTermLength(); before returning it */
        return in;
    }
}

You should now be able to compile both files:

javac -cp apache-solr-core.jar:lucene-core.jar no/derdubor/solr/analysis/*.java

After compiling the plugin, create a jar file which contains your plugin. This will be the “distributable” version of your plugin, and it should contain the .class files of your application.

jar cvf derdubor-solr-norwegiannamefilter.jar no/derdubor/solr/analysis/*.class

Move the file you just created (derdubor-solr-norwegiannamefilter.jar in the example above) into your Solr home directory. This is the directory that holds your bin/ and conf/ directories (conf/ is where schema.xml lives). Create a lib directory in the Solr home directory; this is where your custom libraries will live, so copy the jar file into this directory (lib/).

Restart Solr and check that everything still works as it should. If everything still seems normal, it’s time to enable your filter. In one of your <filter> chains in schema.xml, you can simply append a <filter> element to insert your own filter into the chain, for example:

    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="no.derdubor.solr.analysis.NorwegianNameFilterFactory"/>
    </analyzer>

(the tokenizer and the lowercase filter are just examples of an existing chain; the last <filter> line is what hooks in the new plugin)
Restart Solr again, and if everything still works as it should, you’re all set! Time to index some new data (remember that you’ll need to reindex the data for things to work as you expect, since no stored data is processed when you edit your configuration files) and commit it! Do a few searches through the admin interface to see that everything works as it should.

I’ve used the “debug” option to .. well, debug .. my plugin while developing it. A very neat trick is to see what terms your filter expands a query to (if you set type=”query” in the analyzer section, it will be applied to all queries against that field), which is shown in the first debug section when looking at the result (you’ll have to scroll down to the end to see it). If you need to debug things to a greater extent, you can attach a debugger or simply use the Good Old Proven Way of println! (these end up in catalina.out in logs/ in your Tomcat directory). Good luck!

Potential Problems and How To Solve Them

  • If you get an error about incompatible class versions, check that you’re actually running the same (or a newer) version of the JVM (java -version) on your Solr search server as the one on your development machine (compile with -source 1.5 -target 1.5 to produce 1.5-compatible class files instead of 1.6).
  • If you get an error about missing configuration or something similar, or Solr is unable to find the method it’s looking for (generally triggered by a ReflectionException), remember to declare your classes public! public class NorwegianNameFilter is your friend! It took me at least half an hour to realize what this simple issue was…

Any comments and followups are of course welcome!

Followup on The Missing Statistics in OpenX

After my previous post about the missing OpenX statistics caused by crashed MySQL tables, I got a very nice and helpful comment from one of the OpenX developers. To put it in one single word: awesome. If you’re ever going to run a company and have to look after your customers (even if you release your project as open source), simply do that. People will feel that someone is looking out for them.

Anyways, as promised, this is the follow-up. We didn’t manage to get the impression statistics back, but the missing clicks returned after repairing the tables. The tip from Arlen didn’t help either, but I do have a few suggestions for how to make the script easier to use.

I was kind of perplexed about how to give the dates for the time interval it was going to rebuild the statistics for. The trick was to change two define()s at the top of the code. Not very user friendly, so I made a small change to use $argc and $argv instead. That way I could do:

    php regenerateAdServerStatistics.php "2008-06-01 10:00:00" "2008-06-01 10:59:59"

instead of having to edit the file and change the defines every time. After this simple change, I could also write a small helper script that runs regenerateAdServerStatistics.php for all the operation intervals within the larger interval (an operation interval is an hour, while my interval spanned several days).

So, here it is, regenerateForPeriod.php:

 ");
    }

    $start = $argv[1];
    $end = $argv[2];

    $start_ts = strtotime($start);
    $end_ts = strtotime($end);

    if (!$start_ts || !$end_ts || ($start_ts >= $end_ts))
    {
        exit("Invalid dates.");
    }

    /* round the start time down to the beginning of its hour */
    $current_ts = mktime(date('H', $start_ts), 0, 0, date('m', $start_ts), date('d', $start_ts), date('Y', $start_ts));

    while($current_ts < $end_ts)
    {
        system('php regenerateAdServerStatistics.php "' . date('Y-m-d H', $current_ts) . ':00:00" "' . date('Y-m-d H', $current_ts) . ':59:59"');
        $current_ts += 3600;
    }
?>

This runs the regenerateAdServerStatistics.php script for each operation interval. If your ad server uses a larger interval than 3600 seconds, change that value to something more appropriate. Before doing this, you’ll want to remove the sleep(10) and the warning in regenerateAdServerStatistics.php, so that you don’t have to wait ten seconds for each invocation of the script. I removed the warning and the sleep altogether, but hopefully someone will commit a command line parameter to regenerateAdServerStatistics.php that removes the delay. I didn’t have time to clean up the code and submit an official patch today, but if there is interest, leave a comment and I’ll consider it.

Misunderstanding How in_array Works

Brian Moon has a post about how in_array() really, really sucks. This is not a problem with in_array() per se, but with failing to recognize the proper way to solve a problem like this. Some of the comments have already touched on the matter, but I’ll attempt to expand on it and describe the problem further.

You have two arrays, a1 and b2. You’re interested in removing all the values from a1 that are also in b2. The naive approach (which Brian Moon describes) is simply:

foreach($a1 as $key => $value)
{
    foreach($b2 as $key2 => $value2)
    {
        if ($value == $value2)
        {
            unset($a1[$key]);
        }
    }
}

(ignore any potential side effects of manipulating $a1 while looping through it for now)

This will work for small sizes of a1 and b2, but as soon as the number of entries starts to increase (let’s call them m and n), the running time of your function approaches O(m*n), which can be written as O(n²) as both values grow. This is not good, and is the same complexity you’ll find in naive sorting algorithms. It means that as the arrays grow, your processing time increases quadratically (since you have two nested loops here). in_array() is simply a shortcut for the inner loop (the inner foreach) in this example: it walks through each element of the array and checks if it matches the needle we’re searching for, as the rewritten loop below shows.
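In code, that means the naive example above can be written like this, with identical behaviour and identical complexity:

foreach($a1 as $key => $value)
{
    /* in_array() walks through $b2 element by element, just like the inner foreach did */
    if (in_array($value, $b2))
    {
        unset($a1[$key]);
    }
}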

Compare this to using array_flip() on the array to search in first, so that its values become its keys:

$b2 = array_flip($b2);

foreach($a1 as $key => $value)
{
    if (isset($b2[$value]))
    {
        unset($a1[$key]);
    }
}

But why is isset($arr[$key]) any faster than using in_array()? Doesn’t the application just have to loop through a different set of values instead (this time the keys instead of the values)? No. This is where hashing comes into the picture. Since $key can be any string value in PHP, the string is hashed and resolved to an internal array id. This means that, internally, the following happens:

$arr[$id] => find the location by hashing $id to an internal array location (on the C level) => simply index the array by using this value

Instead of going through all the valid keys, $id is converted once, and we then check whether any value is stored at that location. This is a simplification, but it explains the concept. The complexity of this conversion may depend on the length of the key (depending on the choice of hash function), but we’ll ignore that here and simply give it a complexity of O(1). Looking up the index in the array is also an O(1) operation (it takes the same time regardless of whether we’re looking at item #3 or item #4818; it’s simply reading from a different location in memory).

This means that while our outer loop still iterates over n elements, we’re now just doing a constant-time lookup in the inner step. The amount of work done there does not depend on the number of elements in b2, so our algorithm instead grows in a linear fashion (O(n)).
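To convince yourself, here’s a rough benchmark sketch. The array sizes and contents are arbitrary choices of mine, and the absolute numbers will vary from machine to machine, but the gap widens quadratically as $n grows:

<?php
/* compare the naive in_array() approach with array_flip() + isset() */
$n = 5000;
$a1 = range(0, $n - 1);
$b2 = range((int) ($n / 2), (int) ($n * 1.5)); /* half the values overlap $a1 */

$copy = $a1;
$start = microtime(true);
foreach($copy as $key => $value)
{
    if (in_array($value, $b2))
    {
        unset($copy[$key]);
    }
}
printf("in_array:           %.4f seconds\n", microtime(true) - $start);

$copy = $a1;
$start = microtime(true);
$lookup = array_flip($b2); /* values become keys, enabling hash lookups */
foreach($copy as $key => $value)
{
    if (isset($lookup[$value]))
    {
        unset($copy[$key]);
    }
}
printf("array_flip + isset: %.4f seconds\n", microtime(true) - $start);
?>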

Further reading can be found at Wikipedia: Hash function, Big O notation. I’d also suggest reading an introductory book on algorithms and data structures. The right book depends on your skill set, but if anyone wants suggestions, just leave a comment and I’ll add a few when I get home to my bookshelf tonight.

Implementing a Duck Operator with Reflection

Following up on this post regarding a “duck operator” in PHP, I went ahead and wrote a very, very, very simple implementation using the reflection API in PHP to get the same functionality.

<?php
    function getMethodProperties($reflectionClass)
    {
        $ret = array();

        /* map each method name to true for constant-time lookups below */
        foreach($reflectionClass->getMethods() as $method)
        {
            $ret[$method->getName()] = true;
        }

        return $ret;
    }
    
    function it_quacks($object, $interface)
    {
        $reflectionClass = new ReflectionClass($interface);
        $reflectionObject = new ReflectionObject($object);
        
        $reflectionClassMethods = getMethodProperties($reflectionClass);
        $reflectionObjectMethods = getMethodProperties($reflectionObject);
        
        foreach($reflectionClassMethods as $methodName => $methodData)
        {
            if (empty($reflectionObjectMethods[$methodName]))
            {
                return false;
            }
        }
        
        return true;
    }

    if (it_quacks(new MooingGrassEater(), 'Cow'))
    {
        print("A MooingGrassEater can be seen as a Cow\n");
    }
    else
    {
        print("A MooingGrassEater has no hope of being recognized as a Cow\n");
    }

    if (it_quacks(new MooingGrassEater(), 'Sheep'))
    {
        print("A MooingGrassEater can be seen as a Sheep\n");
    }
    else
    {
        print("A MooingGrassEater has no hope of being recognized as a Sheep\n");
    }
?>
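If you want to run the example, here’s a minimal set of declarations that matches the output above. The method lists are my own assumptions based on the printed messages, not the original declarations:

<?php
    /* assumed example declarations: a Cow moos and eats grass, a Sheep baas,
       and a MooingGrassEater implements neither interface explicitly */
    interface Cow
    {
        public function moo();
        public function eatGrass();
    }

    interface Sheep
    {
        public function baa();
    }

    class MooingGrassEater
    {
        public function moo() { }
        public function eatGrass() { }
    }
?>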

Obvious missing points are of course comparing the number of arguments to the methods and whether they’re optional, so that you can further ensure call safety. But hey, it’s just an example implementation. Read the original linked page for more information about the concept.
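A minimal sketch of how that argument check could look, replacing the true markers with parameter counts. This is my own extension of the example above, not part of the original idea:

<?php
    /* record parameter counts for each method instead of just `true` */
    function getMethodProperties($reflectionClass)
    {
        $ret = array();

        foreach($reflectionClass->getMethods() as $method)
        {
            $ret[$method->getName()] = array(
                'required' => $method->getNumberOfRequiredParameters(),
                'total' => $method->getNumberOfParameters(),
            );
        }

        return $ret;
    }

    function it_quacks($object, $interface)
    {
        $interfaceMethods = getMethodProperties(new ReflectionClass($interface));
        $objectMethods = getMethodProperties(new ReflectionObject($object));

        foreach($interfaceMethods as $methodName => $methodData)
        {
            if (empty($objectMethods[$methodName]))
            {
                return false;
            }

            /* the object may not require more arguments than the interface
               defines, and must accept at least as many in total */
            if ($objectMethods[$methodName]['required'] > $methodData['required']
                || $objectMethods[$methodName]['total'] < $methodData['total'])
            {
                return false;
            }
        }

        return true;
    }
?>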

Debugging Missing Statistics in OpenAds (OpenX)

Our statistics in OpenAds had suddenly gone missing in action, and I started suspecting a few errors we’d gotten earlier about fubar-ed MyISAM tables. First, check out debug.log (or maintenance.log if you’re running a newer version than us) in the var directory of your OpenAds installation. The easiest thing to do is to search for the string ’emergency’, which is written to the log each time something fails in MySQL. The MDB2 error message that’s included will show you the error message from MySQL in one of its fields (about 15-25 lines down), which gives you the reason for the error (if MySQL is to blame).

Some tables had been marked as crashed in our MySQL installation, so we had to find out which ones to fix. A quick run with myisamchk in the MySQL data directory for the database gave us a few hints:

myisamchk *.MYI > /tmp/myisamcheckoutput

Redirecting stdout to /tmp/myisamcheckoutput leaves just the error messages on stderr (OpenAds has quite a few tables, so your console would fill up quickly otherwise). You can then inspect the full report with less /tmp/myisamcheckoutput.

If any tables are having problems, you can run:

REPAIR TABLE <table name>;

in your MySQL console, and the table should be repaired in the background. After doing this, it’s time to get maintenance back up and running again.

Run the maintenance.php file manually (or wait until it gets triggered within the next hour):

php <path to your OpenAds installation>/scripts/maintenance/maintenance.php

Good Advice for Buying Technical Books

Brian K. Jones has a neat list of things to look for when buying a new book. To sum it up in a few points:

  • Give Any New Version 6 Months Before Buying a Book About It
  • Take reviews with several grains of salt
  • Look for “Timeless Tomes”
  • Look at the Copyright Date
  • Be Wary of Growth in Second Editions

These are all good points, but I’d like to add a few rules of my own that help decide which books end up in my reading queue:

  • If you’re happy with the book you’re currently reading, use Amazon’s suggestion system to find more books about the same subject. This has worked surprisingly well for me (and I’m sure Amazon is happy about that..), but limit yourself to one additional book this way.
  • While reading a good book, if the author mentions another book that he found interesting or that provided insight into what you’re now reading, add it to your wishlist. Good authors usually suggest good and insightful books.
  • To further extend the point of “Look at the copyright date”: use this date to find potential “Timeless Tomes”. If a book was first published in the 1970s or 1980s and is in its 21st printing now, I’ll personally guarantee that the book is worth reading. It might not be about a subject you’re interested in, but it might give you new insights or a better background in something you never would have read otherwise.
  • Use your wishlist on Amazon to keep track of interesting books that you stumble across. This is particularly useful for those of us who live in non-native Amazon locations, so that we’re able to combine shipping for items. I’ll never order a single book, so if I forget to add one to my wishlist, I’ll probably never remember it when I actually order books. Use it.
  • Check out the blog of the author if you’re able to (and the author actually has a blog). Also, if a blog that you follow on a daily basis suggests a book, it’s probably worth getting.
  • If you’re out travelling, buy a cheap, simple paperback about a subject you’re unfamiliar with. Try to get something a bit populist (not too academic), as that makes the introduction to the subject easier and is more suitable for reading under noisy conditions. Keep it cheap, so that if you’re unhappy with the book, you can just donate it to another traveller or a local book donation program.

Strange Spambots

Recently I’ve noticed quite a few spambots submitting random comments on a few sites that I run, and while that’s not surprising, the content kind of is. The comments are simple, text-only comments mentioning a product of some sort, together with a few random words or characters. No links. Nothing.

My current guess is that these messages are probes to see if there is a word filter active for the words they attempt to submit, and that when they find that a comment goes through, they’ll submit their long list of links and other interesting stuff. The problem (for them) is that the sites filter all comments that contain more than one URL, and all occurrences of “[url”. That hasn’t let a single linked comment through in two years, but now the volume of these probe comments is getting ridiculous. Guess I’ll have to add some new magic feature with JavaScript.
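For the curious, the filter boils down to something like this sketch. The function name and the exact URL pattern are my own choices; adapt them to your comment system:

<?php
    /* reject a comment if it contains more than one URL or any
       occurrence of "[url" (BBCode-style links) */
    function comment_looks_like_spam($comment)
    {
        if (stripos($comment, '[url') !== false)
        {
            return true;
        }

        $urlCount = preg_match_all('#https?://#i', $comment, $matches);

        return $urlCount > 1;
    }
?>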

Spectator PHP Debugger and an Update on WebGrind

Stumbled upon Spectator, a PHP debugger written in XUL. Seems like a promising project, and I’m always in favor of people who actually do what they say other people should do :-)

Also worth noting is that the first releases of webgrind are out; it seems to be a neat tool for those who need to make sense of a few cachegrind files (the kind KCacheGrind reads, for example if you’re using Xdebug and its profiling functionality).
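If you haven’t generated any profiling data yet, settings along these lines in php.ini (Xdebug 2.x; the output directory is just an example) make Xdebug write cachegrind files that webgrind can pick up:

; enable Xdebug's profiler and choose where the cachegrind files end up
xdebug.profiler_enable = 1
xdebug.profiler_output_dir = /tmp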

Free Flash Based File Upload Applications

When I started writing Swoooosh, the main reason was that after needing a free multiple-file-upload component for a customer project (where uploading multiple files was not part of the original specification, but was added later), I was left with a few components with dubious licenses and weird attribution requests that left you guessing. I hoped instead that someone would release something under an MIT-style license (or LGPL, BSD, etc.), free for all kinds of usage and open to being further extended by the community.

Luckily a few alternatives have emerged since then, and Swoooosh isn’t really that relevant any longer (it was a good exercise in writing Flex and ActionScript, tho):

And yes, Christer, I’m going to implement one of these and commit it to SVN any moment now. :-)