Java and NetBeans: Illegal escape character

January 23rd, 2010

When defining strings in programming languages, they’re usually delimited by ” and “, such as “This is a string” and “Hello World”. The immediate question is what do you do when the string itself should contain a “? “Hello “World”" is hard to read and practically impossible to parse for the compiler (which tries to make sense out of everything you’ve written). To solve this (and similiar issues) people started using escape characters, special characters that tell the parser that it should pay attention to the following character(s) (some escape sequences may contain more than one character after the escape character).

Usually the escape character is \, and rewriting our example above we’ll end up with “Hello \”World\”". The parser sees the \, telling it that it should parse the next characters in a special mode and then inserts the ” into the string itself instead of using it as a delimiter. In Java, C, PHP, Python and several other languages there are also special versions of the escape sequences that does something else than just insert the character following the escape character.

\n – Inserts a new line.
\t – Inserts a tab character.
\xNN – Inserts a byte with the byte value provided (\x13, \xFF, etc).

A list of the different escape sequences that PHP supports can be found in the PHP manual.

Anyways, the issue is that Java found an escape sequence that it doesn’t know how to handle. Attempting to define a string such as “! # \ % &” will trigger this message, as it sees the escape character \, and then attempts to parse the following byte – which is a space (” “). The escape sequence “\ ” is not a valid escape sequence in the Java language specification, and the parser (or NetBeans or Eclipse) is trying to tell you this is probably not what you want.

The correct way to define the string above would be to escape the escape character (now we’re getting meta): “! # \\ % &”. This would define a string with just a single backlash in it.

Porting SOLR Token Filter from Lucene 2.4.x to Lucene 2.9.x

September 25th, 2009

I had trouble getting our current token filter to work after recompiling with the nightly builds of SOLR, which seemed to stem from the recently adopted upgrade to 2.9.0 of Lucene (not released yet, but SOLR nightly is bleeding edge!). There’s functionality added for backwards compability, and while that might have worked, things didn’t really come together as it should (somewhere or the other). So I decided to port our filter over to the new model, where incrementToken() is the New Way ™ of doing stuff. Helped by the current lowercase filter in the SVN trunk of Lucene, I made it all the way through.

Our old code:

  1.     public NorwegianNameFilter(TokenStream input)
  2.     {
  3.         super(input);
  4.     }
  5.  
  6.     public Token next() throws IOException
  7.     {
  8.         return parseToken(this.input.next());
  9.     }
  10.  
  11.     public Token next(Token result) throws IOException
  12.     {
  13.         return parseToken(this.input.next());
  14.     }

Compiling this with Lucene 2.9.0 gave me a new warning:

Note: .. uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

To the internet mobile!

Turns out next() and next(Token) has been deprecated in the new TokenStream implementation, and the New True Way is to use the incrementToken() method instead.

Our new code:

  1.     private TermAttribute termAtt;
  2.  
  3.     public NorwegianNameFilter(TokenStream input)
  4.     {
  5.         super(input);
  6.         termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  7.     }
  8.  
  9.     public boolean incrementToken() throws IOException
  10.     {
  11.         if (this.input.incrementToken())
  12.         {
  13.             termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength()));
  14.             return true;
  15.         }
  16.        
  17.         return false;
  18.     }

A few gotcha’s along the way: incrementToken needs to be called on the input token string, not on the filter (super.incrementToken() will give you a stack overflow). This moves the token stream one step forward. We also decided to move the buffer handling into the parse token function to handle this, and remember to include the length of the “live” part of the buffer (the buffer will be larger, but only the content up to termLength will be valid).

The return value from our parseBuffer function is the actual amount of usable data in the buffer after we’ve had our way with it. The concept is to modify the buffer in place, so that we avoid allocating or deallocating memory.

Hopefully this will help other people with the same problem!

The Results of Our Recent Python Competition

May 29th, 2009

Last week we had yet another competition where the goal is to create the smallest program that solves a particular problem. This time the problem to solve was a simple XML parsing routine with a few extra rules to make the parsing itself easier to implement (The complete rule set). This time python was chosen as the required language of the submissions.

The winning contribution from Helge:

  1. from sys import stdin
  2. p=0
  3. s=str.split
  4. for a in s(stdin.read(),'<'):
  5.  a=s(a,'>')[0];b=a.strip('/');p-=a<b
  6.  if'@'<a:print' '*p*4+s(b)[0];p+=a==b

The contribution from Tobias:

  1. from sys import stdin
  2. i=stdin.read()
  3. s=x=t=0
  4. k=i.find
  5. while x<len(i):
  6.     if i[x]=="<":
  7.         if i[x:x+4]=="<!–":x=k("–>")
  8.         elif i[x+1]=="/":s-=1
  9.         else:
  10.             u=0
  11.             while u<s:print " "*4,;u+=1
  12.             if i[k(">",x)-1]=="/":t=1
  13.             else:t=0;s+=1
  14.             print i[x+1:k(">",x)-t].strip()
  15.     x+=1

The contribution from Harald:

  1. from sys import stdin
  2. l=stdin.read()
  3. e,p,c,x=0,0,0,0
  4. r=""
  5. for i in l:
  6.        if l[e:e+2]==']>'or l[e:e+2]=='->':
  7.                c=0
  8.        if l[e:e+2]=='<!':
  9.                c=1
  10.        if p and i==' 'or i=='/'or i=='>':
  11.                p=0
  12.                if i=='/' and l[e+1]=='>':
  13.                        x-=1
  14.        if p and not c:
  15.                r+=i
  16.        if not c and i=='<'and l[e+1]!='/':
  17.                r+="\n"+(' '*4)*x
  18.                x+=1
  19.                p=1
  20.        if i=='<'and l[e+1]=='/':
  21.                x-=1
  22.        e+=1

If any of the contributors want to provide a better description of their solutions, feel free to leave a comment!

Thanks to all the participants!

Modifying a Lucene Snowball Stemmer

May 27th, 2009

This post is written for advanced users. If you do not know what SVN (Subversion) is or if you’re not ready to get your hands dirty, there might be something more interesting to read on Wikipedia. As usual. This is an introduction to how to get a Lucene development environment running, a Solr environment and lastly, to create your own Snowball stemmer. Read on if that seems interesting. The receipe for regenerating the Snowball stemmer (I’ll get back to that…) assumes that you’re running Linux. Please leave a comment if you’ve generated the stemmer class under another operating system.

When indexing data in Lucene (a fulltext document search library) and Solr (which uses Lucene), you may provide a stemmer (a piece of code responsible for “normalizing” words to their common form (horses => horse, indexing => index, etc)) to give your users better and more relevant results when they search. The default stemmer in Lucene and Solr uses a library named Snowball which was created to do just this kind of thing. Snowball uses a small definition language of its own to generate parsers that other applications can embed to provide proper stemming.

By using Snowball Lucene is able to provide a nice collection of default stemmers for several languages, and these work as they should for most selections. I did however have an issue with the Norwegian stemmer, as it ignores a complete category of words where the base form end in the same letters as plural versions of other words. An example:

one: elektriker
several: elektrikere
those: elektrikerene

The base form is “elektriker”, while “elektrikere” and “elektrikerene” are plural versions of the same word (the word means “electrician”, btw).

Lets compare this to another word, such as “Bus”:

one: buss
several: busser
those: bussene

Here the base form is “buss”, while the two other are plural. Lets apply the same rules to all six words:

buss => buss
busser => buss [strips "er"]
bussene => buss [strips "ene"]

elektrikerene => “elektriker” [strips "ene"]
elektrikere => “elektriker” [strips "e"]

So far everything has gone as planned. We’re able to search for ‘elektrikerene’ and get hits that say ‘elektrikere’, just as planned. All is not perfect, though. We’ve forgotten one word, and evil forces will say that I forgot it on purpose:

elektriker => ?

The problem is that “elektriker” (which is the single form of the word) ends in -er. The rule defined for a word in the class of “buss” says that -er should be stripped (and this is correct for the majority of words). The result then becomes:

elektriker => “elektrik” [strips "er"]
elektrikere => “elektriker” [strips "e"]
elektrikerene => “elektriker” [strips "ene"]

As you can see, there’s a mismatch between the form that the plurals gets chopped down to and the singular word.

My solution, while not perfect in any way, simply adds a few more terms so that we’re able to strip all these words down to the same form:

elektriker => “elektrik” [strips "er"]
elektrikere => “elektrik” [strips "ere"]
elektrikerene => “elektrik” [strips "erene"]

I decided to go this route as it’s a lot easier than building a large selection of words where no stemming should be performed. It might give us a few false positives, but the most important part is that it provides the same results for the singular and plural versions of the same word. When the search results differ for such basic items, the user gets a real “WTF” moment, especially when the two plural versions of the word is considered identical.

To solve this problem we’re going to change the Snowball parser and build a new version of the stemmer that we can use in Lucene and Solr.

Getting Snowball

To generate the Java class that Lucene uses when attempting to stem a phrase (such as the NorwegianStemmer, EnglishStemmer, etc), you’ll need the Snowball distribution. This distribution also includes example stemming algorithms (which have been used to generate the current stemmers in Lucene).

You’ll need to download the application from the snowball download page – in particular the “Snowball, algorithms and libstemmer library” version [direct link].

After extracting the file you’ll have a directory named snowball_code, which contains among other files the snowball binary and a directory named algorithms. The algorithms-directory keeps all the different default stemmers, and this is where you’ll find a good starting point for the changes you’re about to do.

But first, we’ll make sure we have the development version of Lucene installed and ready to go.

Getting Lucene

You can check out the current SVN trunk of Lucene by doing:

  1. svn checkout http://svn.apache.org/repos/asf/lucene/java/trunk lucene/java/trunk

This will give you the bleeding edge version of Lucene available for a bit of toying around. If you decide to build Solr 1.4 from SVN (as we’ll do further down), you do not have to build Lucene 2.9 from SVN – as it already is included pre-built.

If you need to build the complete version of Lucene (and all contribs), you can do that by moving into the Lucene trunk:

  1. cd lucene/java/trunk/
  2. ant dist (this will also create .zip and .tgz distributions)

If you already have Lucene 2.9 (.. or whatever version you’re on when you’re reading this), you can get by with just compiling the snowball contrib to Lucene, from lucene/java/trunk/:

  1. cd contrib/snowball/
  2. ant jar

This will create (if everything works as it should) a file named lucene-snowball-2.9-dev.jar (.. or another version number, depending on your version of Lucene). The file will be located in a sub directory of the build directory on the root of the lucene checkout (.. and the path will be shown after you’ve run ant jar): lucene/java/trunk/build/contrib/snowball/.

If you got the lucene-snowball-2.9-dev.jar file compiled, things are looking good! Let’s move on getting the bleeding edge version of Solr up and running (if you have an existing Solr version that you’re using and do not want to upgrade, skip the following steps .. but be sure to know what you’re doing .. which coincidentally you also should be knowing if you’re building stuff from SVN as we are. Oh the joy!).

Getting Solr

Getting and building Solr from SVN is very straight forward. First, check it out from Subversion:

  1. svn co http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr/trunk/

And then simply build the war file for your favourite container:

  1. cd solr/trunk/
  2. ant dist

Voilá – you should now have a apache-solr-1.4-dev.war (or something similiar) in the build/ directory. You can test that this works by replacing your regular solr installation (.. make a backup first..) and restarting your application server.

Editing the stemmer definition

After extracting the snowball distribution, you’re left with a snowball_code directory, which contains algorithms and then norwegian (in addition to several other stemmer languages). My example here expands the definition used in the norwegian stemmer, but the examples will work with all the included stemmers.

Open up one of the files (I chose the iso-8859-1 version, but I might have to adjust this to work for UTF-8/16 later. I’ll try to post an update in regards to that) and take a look around. The snowball language is interesting, and you can find more information about it at
the Snowball site.

I’ll not include a complete dump of the stemming definition here, but the interesting part (for what we’re attempting to do) is the main_suffix function:

  1. define main_suffix as (
  2.     setlimit tomark p1 for ([substring])
  3.     among(
  4.         'a' 'e' 'ede' 'ande' 'ende' 'ane' 'ene' 'hetene' 'en' 'heten' 'ar'          
  5.         'er' 'heter' 'as' 'es' 'edes' 'endes' 'enes' 'hetenes' 'ens'
  6.         'hetens' 'ers' 'ets' 'et' 'het' 'ast'
  7.             (delete)
  8.         's'
  9.             (s_ending or ('k' non-v) delete)
  10.         'erte' 'ert'
  11.             (<-'er')
  12.     )
  13. )

This simply means that for any word ending in any of the suffixes in the three first lines will be deleted (given by the (delete) command behind the definitions). The problem provided our example above is that neither of the lines will capture an “ere” ending or “erene” – which we’ll need to actually solve the problem.

We simply add them to the list of defined endings:

  1.     among(
  2.         … 'hetene' 'en' 'heten' 'ar' 'ere' 'erene' 'eren'
  3.         …
  4.         …
  5.             (delete)

I made sure to add the definitions before the shorter versions (such as ‘er’), but I’m not sure (.. I don’t think) if it actually is required.

Save the file under a new file name so you still have the old stemmers available.

Compiling a New Version of the Snowball Stemmer

After editing and saving your stemmer, it’s now time to generate the Java class that Lucene will use to generate it base forms of the words. After extracting the snowball archive, you should have a binary file named snowball in the snowball_code directory. If you simply run this file with snowball_code as your current working directory:

./snowball

You’ll get a list of options that Snowball can accept when generating the stemmer class. We’re only going to use three of them:

  1. -j[ava] Tell Snowball that we want to generate a Java class
  2. -n[ame] Tell Snowball the name of the class we want generated
  3. -o &lt;filename&gt; The filename of the output file. No extension.

So to compile our NorwegianExStemmer from our modified file, we run:

  1. ./snowball algorithms/norwegian/stem2_ISO_8859_1.sbl -j -n NorwegianExStemmer -o NorwegianExStemmer

(pardon the excellent file name stem2…). This will give you one new file in the current working directory: NorwegianExStemmer.java! We’ve actually built a stemming class! Woohoo! (You may do a few dance moves here. I’ll wait.)

We’re now going to insert the new class into the Lucene contrib .jar-file.

Rebuild the Lucene JAR Library

Copy the new class file into the version of Lucene you checked out from SVN:

  1. cp NorwegianExStemmer.java <lucenetrunk>/contrib/snowball/src/java/org/tartaru/snowball/ext

Then we simply have to rebuild the .jar file containing all the stemmers:

  1. cd <lucenetrunk>/contrib/snowball/
  2. ant jar

This will create lucene-snowball-2.9-dev.jar in <lucenetrunk>/build/contrib/. You now have a library containing your stemmer (and all the other default stemmers from Lucene)!

The last part is simply getting the updated stemmer library into Solr, and this will be a simple copy and rebuild:

Inserting the new Lucene Library Into Solr

From the build/contrib directory in Lucene, copy the jar file into the lib/ directory of Solr:

  1. cp lucene-snowball-2.9-dev.jar <solrtrunk>lib/

Be sure to overwrite any existing files (.. and if you have another version of Lucene in Solr, do a complete rebuild and replace all the Lucene related files in Solr). Rebuild Solr:

  1. cd <solrtrunk>
  2. ant dist

Copy the new apache-solr-1.4-dev.war (check the correct name in the directory yourself) from the build/ directory in Solr to your application servers home as solr.war (.. if you use another name, use that). This is webapps/ if you’re using Tomcat. Remember to back up the old .war file, just to be sure you can restore everything if you’ve borked something.

Add Your New Stemmer In schema.xml

After compiling and packaging the stemmer, it’s time to tell Solr that it should use the newly created stemmer. Remember that a stemmer works both when indexing and querying, so we’re going to need to reindex our collection after implementing a new stemmer.

The usual place to add the stemmer is the definition of your text fields under the <analyzer>-sections for index and query (remember to change it BOTH places!!):

  1. <filter class="solr.SnowballPorterFilterFactory" language="NorwegianEx" />

Change NorwegianEx into the name of your class (without the Stemmer-part, Lucene adds that for you automagically. After changing both locations (or more if you have custom datatypes and indexing or query steps).

Restart Application Server and Reindex!

If you’re using Tomcat as your application server this might simply be (depending on your setup and distribution):

  1. cd /path/to/tomcat/bin
  2. ./shutdown.sh
  3. ./startup.sh

Please consult the documentation for your application server for information about how to do a proper restart.

After you’ve restarted the application server, you’re going to need to reindex your collection before everything works as planned. You can however check that your stemmer works as you’ve planned already at this stage. Log into the Solr admin interface, select the extended / advanced query view, enter your query (which should now be stemmed in another way than before), check the “debug” box and submit your search. The resulting XML document will show you the resulting of your query in the parsedquery element.

Download the Generated Stemmer

If you’re just looking for an improved stemmer for norwegian words (with the very, very simple changes outlined above, and which might give problems when concerned with UTF-8 (.. please leave a comment if that’s the case)), you can simply download NorwegianExStemmer.java. Follow the guide above for adding it to your Lucene / Solr installation.

Please leave a comment if something is confusing or if you want free help. Send me an email if you’re looking for a consultant.

New Adventures in Reverse Engineering

February 19th, 2009

Before I go into the gory details of this post, I’ll start by saying that this method is probably not the right solution for you. This is not something you want to do if you have readily access to any source code or if you have an existing relationship with the 3rd party that provided the library you’re using. Do not do this. This is not for you.

With that out of the mind, this is the part for those who actually are interested in getting down and dirty with Java, and maybe solving a problem that’s hard to solve otherwise.

The setting: We have a library for interfacing with another internal web service, where the library was provided in binary form by a 3rd party as part of the agreement when the service were delivered to us. The problem is that due to some unknown matter, this library is perfectly capable of understanding UTF-8, both as input from us and as input from the web service, but all web related methods in the result class returns data encoded as ISO-8859-1. The original solution to this was to keep two different parts of the query string — the original query in one particular key — and the key for the library in ISO-8859-1. This needs loads of special casing, manually handling that single parameter, etc. This works to a certain degree as long as the library is the only component in the mix. The troubles really began to surface when we started querying other services based on the same query. We’d then have to special case all methods that were used in URLs, as they returned ISO-8859-1 — and all other libraries and encodings are using UTF-8.

The library has since been made into a separate product with a hefty price tag, so upgrading the library was not an acceptable solution for us. Another solution had to be found, and this is were things starts to get interesting.

Writing a proxy class to handle the encoding issue transparently

This was the solution we attempted first, but this requires us to implement quite a few methods, to add additional code to the method that provides access to the library and to extend and embrace parts of the object. This could have been done quite easily by simply changing one method of the class to reference super.methodName() and then returning that result, but as we have to change several classes (these objects live 3-4 levels down into the result object from the library) which add both developer and runtime overhead. Not good.

Decompiling the library

The next step was to decompile the library to see how the code of the library actually worked. This proved to be a good way to find out how we could possibly solve the issue. We could try to fix the issue in the code and then recompile the library, but some of the class files were too new for jad to decompile them completely. The decompilation did however show the problem with the code:

if (encoding != null)
  1.     {
  2.         return encoding.toString() :
  3.     }
  4.  
  5.     return "ISO-8859-1";

This was neatly located in a helper method that ran on every property used when generating a query string. The encoding variable is retrieved from a global settings object, only accessible in the same library. This object is empty in our version of the library, so not much help there. But here’s the little detail that leads into the next part, and actually made this hack possible: “ISO-8859-1″ is constant. This means that it gets neatly tucked away as an UTF-8 string when the class file is generated. Let’s gets down and dirty.

Binary patching the encoding in the class file

We’ll start by taking a look at the hexdump in our class file, after searching for the string “ISO” in the ASCII representation (“ISO” in UTF-8 is identical to the ASCII representation):

Binary Patching a Java Class

I’ve highlighted the interesting part where “ISO-8859-1″ is stored in the file. This is where we want to do our surgical incision and make the method return the string “UTF-8″ instead. There is one important thing you should be aware of if you’ve never done any hex editing of files before, and that is the fact that the byte offset of parts of the file may be very important. Sadly, the strings “UTF-8″ and “ISO-8859-1″ have different lengths, and as such, would require us to either delete bytes following “UTF-8″ or put spaces there instead (“UTF-8     “). The first method might leave the rest of the file skewed, the latter might not work if the method used for encoding the value doesn’t trim the string first.

To solve this issue, we turn to our good friend VM Spec: The class File Format, which contains all the details of how the class file format is designed. Interesting parts:

In the ClassFile structure:

cp_info constant_pool[constant_pool_count-1];

As we’re looking at a constant, this is where it should be stored. The cp_info is defined as:

cp_info {
    u1 tag;
    u1 info[];
}

The tag contains the type of constant, the info[] array varies depending on the type of the constant. If we take a look at the table in Chapter 4.4, we see that the the identifier for a unicode string is:

CONSTANT_Utf8 	1

So we should have the value of 1 in the byte (as the actual value, not the ascii character) describing this constant. If the value is one, the complete structure is:

    CONSTANT_Utf8_info {
    	u1 tag;
    	u2 length;
    	u1 bytes[length];
    }

The tag should be 1 as the byte value, the length should be two bytes describing the length of the actual string saved (since we’re storing the length in two bytes (u2), it can be a maximum of 2^16 bytes in total). After that identifier, we should have length number of bytes with UTF-8 data.

If we go back to our hex dump, we can now make more sense of the data we’re seeing:

binarypatchingclass-length

The byte shown as 0×01 in hex is the value 1 for the tag of the structure. The 0×00 0×0A is the two bytes making up the length of the string:

    0000 0000 0000 1010 binary = 10 decimal

    ISO-8859-1
    1234567890

This shows that the length of our string “ISO-8859-1″ is 10 bytes in UTF-8, which is the same value that is stored in the two bytes showing the length of the string in the structure.

Heading back to our original goal: changing the length of the string stored. We change the string bytes to “UTF-8″, which is five bytes. We then change the stored length of the string:

    00 0A becomes
    00 05

We save our changes and re-create the jar file again, with all the previous classes and our changed one.

After inserting our new JAR-file into our maven repository as a new build and updating our local repository, we now have complete UTF-8 support from start to finish. Yey!

Communicating The Right Thing Through Code

January 9th, 2009

While trying to fix a larger bug in a module I never had touched before, I came across the following code. While technically correct (it does what it’s supposed to do, and there is no way to currently get it to do something wrong (.. an update on just that), does have a serious flaw:

  1. $result = get_entry($id);
  2. if(is_bool($result))
  3. {
  4.     die('bailing out');
  5. }

Hopefully you can see the error in what the code communicates; namely that the return type from the function is used to what should be considered an error.

While this works as the only way the function can return a boolean value is if it returns false, the person reading the code at a later date will wonder what the code is supposed to do – he or she might not have any knowledge about how the method works. Maybe the method just sets up some resource, a global variable (.. no, don’t do that. DON’T.), etc, but the code does not communicate what we really expect.

As PHP is dynamically typed, checking for type before comparing is perfectly OK, as long as you’re not counting “no returned elements” as an error. The following code more clearly communicates it intent (=== in PHP is comparison based on both type and content, which means that both the type of the variable and the content of it have to match. 0 === “0″ will be considered false.):

  1. $result = get_entry($id);
  2. if($result === false)
  3. {
  4.     die('bailing out');
  5. }

Or if you’re interested in getting to know if the element returned is actually considered false (such as an empty array, an empty string, etc), just drop one of the equal signs:

  1. $result = get_entry($id);
  2. if($result == false)
  3. {
  4.     die('bailing out');
  5. }

I’m also not fond of using die() as a method for stopping a faulty request, as that should be properly logged and dealt with in the best manner possible, but I’ll leave that for a later post.

Java printf and Booleans

October 25th, 2008

The available stdout mapping in Java, System.out (PrintStream), supports printf (to format a string when outputting it) in the same manner as other languages such as C. By using System.out.printf instead of System.out.print, you can also provide a formatting string which the method can use to write the output in a custom format.

As Java supports several other native datatypes that weren’t available as regular C-types, you also have a few other options. One of these are the %b / %B identifier, which represents a boolean value.

The format string “%b” expects one argument, a boolean. This will be written as either ‘true’ or ‘false. You would (as of writing) get the same result by using the %s identifier as this would cast the boolean of a string, which in turn would give true or false, but it’s always better to be explicit about what you are doing than assuming that people understand that you understood something (which even could change..).

Informa and getCategories Truncates Title at “/”

October 22nd, 2008

I stumbled across a weird issue in Informa and the ItemIF.getCategories method today. The categories we retrieve are separated with / to indicate their full hierarchy, but Informa only gave me the first part of the category (just “Properties” of “Properties / Houses”). The solution to this is to explicitly access the category element of the object itself:

  1. String categoryDomain = item.getAttributeValue("category", "domain");
  2. String categoryTitle = item.getElementValue("category");

This should be extended to support several category-elements, but as we only get one in our feeds, this solved the problem for us.

Informa and Custom XML Namespaces in RSS

September 25th, 2008

While integrating a custom search application into a Java-based web application, I came across the need to access properties in custom namespaces through the Informa RSS library. Or to put it in another way; i needed to access to properties, Informa had been used for RSS parsing in the previous versions of the web application. The people who developed the original version of the application had decided to extend the Informa library into their own version, and had added several methods for .get<NameOfCustomProperty> etc. After thinking about this for approximately 2 seconds, I decided that having to support and modify a custom version of Informa was not the right track for us.

My initial thought was that their decision to customize Informa to support these methods had to come from the idea that Informa did not support custom namespaces out of the box. I did a few searchas over at Google, and found nothing useful. Reading through the documentation for Informa didn’t do me any good either, so I tried to find an alternative library instead. Did a bit of searching here too, and stumbled across a hit for one of the util classes for Informa (.. again). This did support custom namespaces, so the backend support was there at least. Then it struck me while reading the documentation for Informa and ChannelIF again; Informa did support it, as it inherited the methods from further up in the hierarchy. The getElementValue and getElementValues methods of the ChannelIF and ItemIF classes allows you to fetch the contents of elements with custom namespaces in a very easy to like manner.

  1. System.out.println(item.getElementValue("exampleNS:field"));

This simply returns the string contained between <exampleNS:field> and </exampleNS:field>

Hoooray! We now have support for these additional fields, and we do not have to keep Informa manually in sync with the version in our application. Why the original developers decided to fork the Informa library to add their own properties I may never know, but I’ll update this post if they decide to step forward!

PHP Vikinger Notes

June 21st, 2008

Just a few notes from PHP Vikinger which were arranged by Derick Rethans in Norway today. Things went mostly smoothly and people in general seemed to have a very good time. These are just some of the random notes I made during the sessions.

All in all it was a good unconference, with a friendly and laid back tone and hopefully people got what they came for. Next time I’ll try to prepare a simple presentation on some interesting and hopefully not too familiar topic and actually contribute something too. We drove from Halden and Fredrikstad to Skien in the morning and back in the evening, which worked out quite OK, except for .. well, the lack of sleep in the morning. But everyone survived and managed to stay awake, so I conclude that the trip was a great success.

To sum it all up: a banana is a fruit and a tomato is a berry. You probably had to be there for that one.

Thanks for the unconference, and hopefully I’ll be able to attend more events in the future too.

UPDATE: Derick also has a writeup online from PHP Vikinger.

Derick and Sebastian Readying a Presentation