Support for Solr in eZ Components’ Search

June 17th, 2008

The new release of eZ Components (2008.1) adds a new Search module, and the first implementation included is an interface for sending search requests and new documents to a Solr installation. An introduction can be found over at the eZ Components Search Tutorial. The new release of eZ Components requires at least PHP 5.2.1 (and if you’re not already running at least 5.2.5, it’s time to get moving. The world is moving. Fast.). Creating a handler for your Solr server looks like this:

    <?php
    require_once 'tutorial_autoload.php';

    // on localhost with the default port
    $handler = new ezcSearchSolrHandler;

    // on another host with a different port
    $handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
    ?>

Writing a Solr Analysis Filter Plugin

June 10th, 2008

We’ve been working on getting better results out of the phonetic search we’re currently running at derdubor, so I started writing a plugin for Solr that returns better results when searching for Norwegian names. So far we’ve been using the standard phonetic filter from Solr 1.2, with the double metaphone encoder turning each regular token into a phonetic value. The trouble is that a double metaphone value is just four letters, which means that a search word such as ‘trafikkontroll’ ends up equivalent to ‘Dyrvik’. The latter is a name, while the former is a regular search string that would be better served through an article view. TRAFIKKONTROLL resolves to TRFK in double metaphone, while DYRVIK resolves to DRVK. T and D are considered similar, as are V and F, and voilà, you’ve got yourself a match in the search result, but not a visual one (or a semantic one, as the words have very different meanings).
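To make the collision concrete, here is a minimal sketch of a toy phonetic key. It is not the real Double Metaphone algorithm (the class name PhoneticKeyDemo and its rules are invented for illustration); it only mimics the two properties described above: similar-sounding consonants are folded together, and the key is cut off after four letters.

```java
import java.util.Locale;

/** Toy phonetic key for illustration only; NOT the real Double Metaphone algorithm. */
final class PhoneticKeyDemo {

    /** Builds a consonant key of at most four letters. */
    static String key(String word) {
        StringBuilder out = new StringBuilder();
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if ("AEIOUY".indexOf(c) >= 0) {
                continue; // vowels (and Y) are ignored in this toy scheme
            }
            if (c == 'D') c = 'T'; // D is folded into T
            if (c == 'V') c = 'F'; // V is folded into F
            // skip a code identical to the one emitted just before it
            if (out.length() > 0 && out.charAt(out.length() - 1) == c) {
                continue;
            }
            out.append(c);
            if (out.length() == 4) {
                break; // the key is capped at four letters
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(key("trafikkontroll"));
        System.out.println(key("Dyrvik"));
    }
}
```

Both words get the same key here, because everything after the fourth consonant of ‘trafikkontroll’ is thrown away. That is exactly the kind of over-eager match a longer or name-specific encoding avoids.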

To solve this, I decided to write a custom filter plugin which we could tune to the names that are in use in Norway. I’ll save the reasoning behind the rules, and hopefully the complete filter function we’re applying, for another post.

First you need a factory that’s able to produce filters when Solr asks for them:

NorwegianNameFilterFactory.java:

    package no.derdubor.solr.analysis;

    import java.util.Map;

    import org.apache.solr.analysis.BaseTokenFilterFactory;
    import org.apache.lucene.analysis.TokenStream;

    public class NorwegianNameFilterFactory extends BaseTokenFilterFactory
    {
        // attributes from the <filter> element in schema.xml
        Map<String,String> args;

        public Map<String,String> getArgs()
        {
            return args;
        }

        // called by Solr with the attributes of the <filter> element
        public void init(Map<String,String> args)
        {
            this.args = args;
        }

        // called by Solr whenever it needs a filter wrapped around a token stream
        public NorwegianNameFilter create(TokenStream input)
        {
            return new NorwegianNameFilter(input);
        }
    }

To compile this example yourself, put the file in no/derdubor/solr/analysis/ (which matches no.derdubor.solr.analysis in the package statement), make sure the Solr and Lucene jars are on your classpath, and run:

    javac -6 no/derdubor/solr/analysis/NorwegianNameFilterFactory.java

The -6 switch is an Eclipse compiler (ecj) option; Sun’s javac expects -source 1.6 -target 1.6 instead. You’ll of course also need the filter itself (which is returned from the create method above):

    package no.derdubor.solr.analysis;

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class NorwegianNameFilter extends TokenFilter
    {
        public NorwegianNameFilter(TokenStream input)
        {
            super(input);
        }

        public Token next() throws IOException
        {
            return parseToken(this.input.next());
        }

        public Token next(Token result) throws IOException
        {
            // pass the reusable token on to the wrapped stream
            return parseToken(this.input.next(result));
        }

        protected Token parseToken(Token in)
        {
            // the wrapped stream returns null when there are no more tokens
            if (in == null)
            {
                return null;
            }

            /* do magic stuff with in.termBuffer() here (a char[] which can be manipulated) */
            /* set the changed length of the new term with in.setTermLength(); before returning it */
            return in;
        }
    }

You should now be able to compile both files:

    javac -6 no/derdubor/solr/analysis/*.java
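As for the “magic stuff” in parseToken, the general pattern is to rewrite the char[] in place and report the new length. Here is a minimal, self-contained sketch of that pattern. The rule itself (collapsing repeated letters) and the names TermBufferDemo and collapseRepeats are made up for illustration and are not the filter we actually run.

```java
/** Hypothetical in-place term rewrite; the rule is an illustration only. */
final class TermBufferDemo {

    /**
     * Collapses runs of the same character in buf[0..len) in place
     * and returns the new length of the term.
     */
    static int collapseRepeats(char[] buf, int len) {
        if (len == 0) {
            return 0;
        }
        int write = 1; // buf[0] is always kept
        for (int read = 1; read < len; read++) {
            if (buf[read] != buf[write - 1]) {
                buf[write++] = buf[read];
            }
        }
        return write;
    }

    public static void main(String[] args) {
        char[] term = "trafikkontroll".toCharArray();
        int newLength = collapseRepeats(term, term.length);
        System.out.println(new String(term, 0, newLength));
    }
}
```

Inside the filter this would presumably be wired up as in.setTermLength(collapseRepeats(in.termBuffer(), in.termLength())), assuming the Lucene version in use offers these Token methods alongside termBuffer().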

After compiling the plugin, create a jar file that contains it. This will be the “distributable” version of your plugin, and should contain the .class files of your application.

    jar cvf derdubor-solr-norwegiannamefilter.jar no/derdubor/solr/analysis/*.class

Move the file you just created (derdubor-solr-norwegiannamefilter.jar in the example above) into a lib/ directory inside your Solr home directory. The Solr home is where you keep your bin/ and conf/ directories (conf/ contains schema.xml, etc.). Create lib/ if it doesn’t exist yet; this is where your custom libraries will live.

Restart Solr and check that everything still works as it should. If everything still seems normal, it’s time to enable your filter. In one of your <analyzer> chains, you can simply append a <filter> element to insert your own filter into the chain:

    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="no.derdubor.solr.analysis.NorwegianNameFilterFactory" />
    </analyzer>

Restart Solr again, and if everything still works as it should, you’re all set! Time to index some new data and commit it (remember that you’ll need to reindex for things to work as you expect, since data that is already indexed is not reprocessed when you edit your configuration files). Do a few searches through the admin interface to see that everything works as it should.

I’ve used the “debug” option to, well, debug my plugin while developing it. A very neat trick is to see what terms your filter expands to: if you set type="query" on the analyzer element, the filter is applied to all queries against that field, and the resulting terms are shown in the first debug section of the result (you’ll have to scroll down to the end to see this). If you need to debug things to a greater extent, you can attach a debugger or simply use the Good Old Proven Way of println! (the output ends up in catalina.out in logs/ in your Tomcat directory). Good luck!

Potential Problems and How To Solve Them

  • If you get an error about incompatible class versions, check that the JVM on your Solr search server (java -version) is the same version as (or newer than) the one you compile with on your own development machine. Use -5 when compiling to force 1.5-compatible class files instead of 1.6.
  • If you get an error about missing config or something similar, or that Solr is unable to find the method it’s searching for (generally triggered by a ReflectionException), remember to declare your classes public! public class NorwegianNameFilter is your friend! It took me at least half an hour to realize what this simple issue was…
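If you’re unsure which version a .class file was compiled for, the answer is in its first eight bytes: the magic number 0xCAFEBABE, followed by the minor and major version (major 49 is Java 5, 50 is Java 6). A minimal sketch for reading it, with the class name ClassVersionDemo made up for this example:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

/** Reads the class file version from the first eight bytes of a .class file. */
final class ClassVersionDemo {

    /** Returns the major version (49 = Java 5, 50 = Java 6) from a class file header. */
    static int majorVersion(byte[] header) {
        if (header.length < 8
                || (header[0] & 0xFF) != 0xCA || (header[1] & 0xFF) != 0xFE
                || (header[2] & 0xFF) != 0xBA || (header[3] & 0xFF) != 0xBE) {
            throw new IllegalArgumentException("not a class file header");
        }
        // bytes 4-5 hold the minor version, bytes 6-7 the major version
        return ((header[6] & 0xFF) << 8) | (header[7] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassVersionDemo path/to/Some.class");
            return;
        }
        byte[] header = new byte[8];
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            in.readFully(header);
        } finally {
            in.close();
        }
        System.out.println("major version: " + majorVersion(header));
    }
}
```

Run it against one of the class files from your jar; if the number is higher than what the JVM on the search server supports, recompile with the older target.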

Any comments and followups are of course welcome!

David Cummins on Fulltext Search as a Webservice

May 4th, 2008

David Cummins has a neat little post up about replicating some of Solr’s features in a PHP-based solution. The title, “Fulltext search as a webservice”, should sound familiar to anyone who knows Solr’s approach, and David describes how they built a similar solution on top of Zend_Search_Lucene (Solr also uses Lucene in the backend). Seems like it would be easier to just set up a dedicated Solr cluster instead, but hey, how often has “it would be easier to do something else” sparked innovation?

I’d also like to note that the coming Solr 1.3 supports PHP serialization as an output format, so you can just unserialize() the response from Solr. This should make integration between PHP and Solr even easier in the future. While on the subject, I’d like to suggest reading Stemming in Zend_Search_Lucene too, an introduction to adding filters to Zend_Search_Lucene. Also worth a look is the Search Tools in PHP presentation from phplondon.

Solving UTF-8 Problems With Solr and Tomcat

April 24th, 2008

Came across an issue with searching for UTF-8 characters in Solr today. The search itself worked just as it should (probably since we’re using a phonetic field to search), but our facets and limitations didn’t. This happened as soon as a value contained a non-ASCII character (code point above 127), in our case the Norwegian letters Æ, Ø or Å.

The solution was presented by Charlie Jackson on the solr-user mailing list, and is quite simply to add URIEncoding="UTF-8" to the appropriate connector in the Tomcat server.xml file. This is also documented on the Solr on Tomcat page in the Solr Wiki.
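The underlying problem is easy to reproduce outside Tomcat. The sketch below (the class name UriEncodingDemo is invented for this example) percent-encodes a value as UTF-8, the way a client would, and then decodes it with two different charsets. It assumes that a connector without URIEncoding decodes the request URI as ISO-8859-1, which is the documented default for the Tomcat versions current at the time of writing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Shows what happens when UTF-8 encoded URL parameters are decoded with the wrong charset. */
final class UriEncodingDemo {

    /** Encodes value as UTF-8 (as the client does) and decodes it with decodeCharset (as the server does). */
    static String roundTrip(String value, String decodeCharset) {
        try {
            String onTheWire = URLEncoder.encode(value, "UTF-8");
            return URLDecoder.decode(onTheWire, decodeCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u00D8vre"; // a value starting with the Norwegian letter O-slash
        System.out.println(name.equals(roundTrip(name, "UTF-8")));      // charsets agree
        System.out.println(name.equals(roundTrip(name, "ISO-8859-1"))); // charsets disagree
    }
}
```

With the wrong charset the single letter comes back as two characters, so an exact-match facet or filter value no longer matches anything in the index, while a forgiving phonetic field may still produce hits.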

Using Solrj - A short guide to getting started with Solrj

April 17th, 2008

As Solrj, the Java interface for Solr, is slated for release together with Solr 1.3, it’s time to take a closer look! Solrj is the preferred, easiest way of talking to a Solr server from Java (unless you’re using Embedded Solr). You get everything in a neat little package and avoid parsing and working with XML directly. Everything is tucked neatly away under a few classes, and since the web generally lacks a good example of how to use Solrj, I’m going to share a small class I wrote for testing the data we were indexing at work. As Solr 1.2 is the most recent version available at apache.org, you’ll have to take a look at the Apache Solr Nightly Builds website and download the latest build. The documentation is also contained in the archive, so if you’re going to do any serious Solrj development, that’s the place to look.

Oh well, enough of that, let’s cut to the chase. We start by creating a CommonsHttpSolrServer instance, passing the URL of our Solr server as the only constructor argument. You may also provide your own parsers, but I’ll leave that for those who need it. I don’t. By default your Solr installation is running on port 8080 and under the solr directory, but you’ll have to adapt this to your own setup. I’ve included the complete source file for download.

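    import java.util.*;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    class SolrjTest
    {
        public void query(String q)
        {
            CommonsHttpSolrServer server = null;

            try
            {
                server = new CommonsHttpSolrServer("http://localhost:8080/solr/");
            }
            catch(Exception e)
            {
                e.printStackTrace();
                return; // nothing to query without a server
            }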

The next thing we’re going to do is to create the query we’re about to send to the Solr server, and this means building a SolrQuery object. We simply instantiate the object and then set the query values to what we’re looking for. The setQueryType call can be dropped to use the default query type handler, but as we currently use dismax, this is what I’ve used here. You can then also turn on faceting (to create navigators/facets) and add the fields you want facets for.

            SolrQuery query = new SolrQuery();
            query.setQuery(q);
            query.setQueryType("dismax");
            query.setFacet(true);
            query.addFacetField("firstname");
            query.addFacetField("lastname");
            query.setFacetMinCount(2);
            query.setIncludeScore(true);

Then we simply query the server by calling server.query, which takes our parameters, builds the query URL, sends it to the server and parses the response for us.

            try
            {
                QueryResponse qr = server.query(query);

The result can then be fetched by calling getResults() on the QueryResponse object, qr.

                SolrDocumentList sdl = qr.getResults();

We then output the information fetched in the query. You can change this to print all fields or other stuff, but as this is a simple application for searching a database of names, we just collect the first and last name of each entry and print them out. Before we do that, we print a small header containing information about the query, such as the number of elements found and which element we started on.

                System.out.println("Found: " + sdl.getNumFound());
                System.out.println("Start: " + sdl.getStart());
                System.out.println("Max Score: " + sdl.getMaxScore());
                System.out.println("--------------------------------");

                ArrayList<HashMap<String, Object>> hitsOnPage = new ArrayList<HashMap<String, Object>>();

                for(SolrDocument d : sdl)
                {
                    HashMap<String, Object> values = new HashMap<String, Object>();

                    // copy every field of the document into a plain map
                    for(Iterator<Map.Entry<String, Object>> i = d.iterator(); i.hasNext(); )
                    {
                        Map.Entry<String, Object> e2 = i.next();
                        values.put(e2.getKey(), e2.getValue());
                    }

                    hitsOnPage.add(values);
                    System.out.println(values.get("displayname") + " (" + values.get("displayphone") + ")");
                }

After this we output the facets and their information, just so you can see how you’d go about fetching this information from Solr too:

                List<FacetField> facets = qr.getFacetFields();

                for(FacetField facet : facets)
                {
                    List<FacetField.Count> facetEntries = facet.getValues();

                    for(FacetField.Count fcount : facetEntries)
                    {
                        System.out.println(fcount.getName() + ": " + fcount.getCount());
                    }
                }
            }
            catch (SolrServerException e)
            {
                e.printStackTrace();
            }
        }

        public static void main(String[] args)
        {
            SolrjTest solrj = new SolrjTest();
            solrj.query(args[0]);
        }
    }

And there you have it, a very simple application to just test the interface against Solr. You’ll need to add the jar-files from the lib/-directory in the solrj archive (and from the solr library itself) to compile and run the example.
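To get a feeling for what server.query does with the SolrQuery before sending it, here is a rough sketch that turns the same settings into a query string by hand. This is not how Solrj is implemented; the class QueryStringDemo is invented for illustration, and the mapping from setters to parameter names (qt, facet, facet.field, facet.mincount, fl) is an assumption based on Solr’s standard request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a Solr-style query string by hand; a sketch, not Solrj's implementation. */
final class QueryStringDemo {

    /** Joins name/value pairs as name=value&name=value, URL-encoding both sides as UTF-8. */
    static String build(String[][] params) {
        StringBuilder sb = new StringBuilder();
        for (String[] p : params) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(p[0])).append('=').append(encode(p[1]));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String[][] params = {
            {"q", args.length > 0 ? args[0] : "ola nordmann"},
            {"qt", "dismax"},
            {"facet", "true"},
            {"facet.field", "firstname"}, // repeated parameters are allowed
            {"facet.field", "lastname"},
            {"facet.mincount", "2"},
            {"fl", "*,score"},            // assumed equivalent of setIncludeScore(true)
        };
        System.out.println("http://localhost:8080/solr/select?" + build(params));
    }
}
```

Pasting the printed URL into a browser is a handy way to compare the raw response with what Solrj hands back.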

Download: SolrTest.java

Writing a Custom Validator for Zend_Form_Element

April 17th, 2008

My good friend Christer has written a simple tutorial on how to write a custom validator for a Zend_Form_Element. If you’ve ever laid your hands on Zend_Form, you’ll want to have a look at this for a short and concise introduction to the topic. He’ll show you how to create a “repeat the password”-field by creating a custom validator and hooking it onto the original password field. Neat stuff.