What Happened To My Beautiful En-dashes?!

First, a small introduction to the problem: We’re running stuff in UTF-8 all the way. A few sites we’re reading feeds from are using ISO-8859-1 as their charset, but they either supply the feed with the correct encoding specified or the feeds arrive as UTF-8. Everything works nicely, except for the en-dashes mentioned in the headline. Firefox only shows 00 96 (0x00 0x96) in their place, but everything looks correct when you view the headlines and similar content on the original site.

Strange.

The digging, oh all the digging.

After the already mentioned digging (yes, the digging) in data at the large search engines (ok, maybe I did a search or two), I discovered that the Windows cp1252 encoding uses 0x96 to store en-dashes. This seems familiar! We’re seeing 0x96 as one of the byte values above, so apparently cp1252 is sneaking into the mix somewhere along the line. Most of the clients using the CMSes run Windows, so they may well be to blame.

ISO-8859-1 enters the scene

As the sites (and feeds) provide ISO-8859-1 as their encoding, I thought it would be interesting to see what ISO-8859-1 defines as the representation for the byte value 0x96. Lo and behold: 0x96 is not defined as a printable character in ISO-8859-1; the range 0x80 to 0x9F is reserved for control codes. Which actually provides us with the solution.

I welcome thee, Mr. Solution

When the ISO-8859-1 encoded string is converted into UTF-8, each byte with the value 0x96 (which is the en-dash in cp1252) is simply converted to a valid UTF-8 sequence for the code point U+0096, which is a control character with nothing to display.

We’re saying that the string is ISO-8859-1, although in reality it is either cp1252 or a mangled mix of ISO-8859-1 and cp1252 (for the en-dashes, at least).
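
To see the mismatch in isolation, here is a minimal Java sketch that decodes the single byte 0x96 under both charsets and prints the UTF-8 bytes that come out the other end. The class and method names (EnDashDemo, utf8Hex) are made up for the illustration, and it assumes the JVM ships the windows-1252 charset, which the common ones do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EnDashDemo {
    // Decode the raw bytes with the declared charset, re-encode the result
    // as UTF-8, and return those bytes as space-separated lowercase hex.
    static String utf8Hex(byte[] raw, Charset declared) {
        String decoded = new String(raw, declared);
        StringBuilder hex = new StringBuilder();
        for (byte b : decoded.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0x96 };
        // Declared as ISO-8859-1: the byte becomes U+0096.
        System.out.println(utf8Hex(raw, StandardCharsets.ISO_8859_1)); // c2 96
        // Declared as cp1252: the byte becomes U+2013, the en-dash.
        System.out.println(utf8Hex(raw, Charset.forName("windows-1252"))); // e2 80 93
    }
}
```

The first line of output is the sequence that ends up in our data, the second is the one we actually wanted.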

If you’re on the parsing end of this mumbo jumbo, one solution is to replace the generated UTF-8 sequence (0xc2 0x96), converted from 0x96 in ISO-8859-1, with the proper one for the en-dash (0xe2 0x80 0x93):

// Swap the UTF-8 bytes for U+0096 with the UTF-8 bytes for the en-dash (U+2013).
$data = str_replace("\xc2\x96", "\xe2\x80\x93", $data);

And voilà, everything works.
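
The en-dash is probably not the only cp1252 character that gets caught this way, since the whole 0x80 to 0x9F range is affected. As a rough sketch of a more general repair, assuming you are working on an already decoded string in Java rather than on raw bytes as in the PHP line above, each character in that range can be pushed back through the cp1252 decoder. Cp1252Repair and repair are names invented for this example.

```java
import java.nio.charset.Charset;

class Cp1252Repair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Re-map characters in the C1 range (U+0080..U+009F) to what the same
    // byte value means in cp1252. Everything else is copied unchanged.
    static String repair(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80 && c <= 0x9f) {
                String mapped = new String(new byte[] { (byte) c }, CP1252);
                // A few byte values are unassigned in cp1252. If the decoder
                // answers with the replacement character, keep the original.
                out.append(mapped.equals("\uFFFD") ? String.valueOf(c) : mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling Cp1252Repair.repair on a headline containing U+0096 gives back the same headline with a real en-dash. This only makes sense for text you know went through the ISO-8859-1 mislabelling; genuine C1 control characters would be rewritten too.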

Informa and getCategories Truncates Title at “/”

I stumbled across a weird issue in Informa and the ItemIF.getCategories method today. The categories we retrieve are separated with / to indicate their full hierarchy, but Informa only gave me the first part of the category (just “Properties” instead of “Properties / Houses”). The solution to this is to explicitly access the category element of the item itself:

String categoryDomain = item.getAttributeValue("category", "domain");
String categoryTitle = item.getElementValue("category");

This should be extended to support several category elements, but as we only get one in our feeds, this solved the problem for us.
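
Once the full title is available, the hierarchy can be pulled apart with plain Java. The sketch below is not part of Informa; CategoryPath and split are hypothetical names, and it assumes that / never occurs inside a single category name.

```java
import java.util.ArrayList;
import java.util.List;

class CategoryPath {
    // Split a title such as "Properties / Houses" into its hierarchy levels,
    // trimming whitespace and skipping empty segments.
    static List<String> split(String categoryTitle) {
        List<String> parts = new ArrayList<>();
        if (categoryTitle == null) {
            return parts;
        }
        for (String part : categoryTitle.split("/")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }
}
```

With the categoryTitle from above, CategoryPath.split(categoryTitle) would give a list with “Properties” first and “Houses” second.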

Informa and Custom XML Namespaces in RSS

While integrating a custom search application into a Java-based web application, I came across the need to access properties in custom namespaces through the Informa RSS library. Or to put it another way: I needed access to these properties, and Informa had been used for RSS parsing in the previous versions of the web application. The people who developed the original version of the application had decided to extend the Informa library into their own version, and had added several methods such as .get<NameOfCustomProperty>. After thinking about this for approximately 2 seconds, I decided that having to support and modify a custom version of Informa was not the right track for us.

My initial thought was that their decision to customize Informa to support these methods had to come from the idea that Informa did not support custom namespaces out of the box. I did a few searches over at Google, and found nothing useful. Reading through the documentation for Informa didn’t do me any good either, so I tried to find an alternative library instead. I did a bit of searching here too, and stumbled across a hit for one of the util classes for Informa (.. again). This did support custom namespaces, so the backend support was there at least. Then it struck me while reading the documentation for Informa and ChannelIF again: Informa did support it, as ChannelIF inherited the methods from further up in the hierarchy. The getElementValue and getElementValues methods of the ChannelIF and ItemIF interfaces allow you to fetch the contents of elements with custom namespaces in a very easy-to-like manner.

System.out.println(item.getElementValue("exampleNS:field"));

This simply returns the string contained between <exampleNS:field> and </exampleNS:field>.
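
If there are several custom fields, a small helper outside the library can collect them without touching Informa itself. The following is only a sketch: ElementSource is a hypothetical stand-in interface with the single getElementValue call used above, so that the example compiles without Informa on the classpath, and CustomFields and read are made-up names. What Informa returns for a missing element is not covered here, so the sketch treats null and blank values the same way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CustomFields {
    // Hypothetical stand-in for the one lookup call used in this post.
    interface ElementSource {
        String getElementValue(String path);
    }

    // Look up each "prefix:name" element and keep the non-blank values,
    // keyed by the bare field name, in the order they were requested.
    static Map<String, String> read(ElementSource item, String prefix, String... names) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String name : names) {
            String value = item.getElementValue(prefix + ":" + name);
            if (value != null && !value.trim().isEmpty()) {
                values.put(name, value.trim());
            }
        }
        return values;
    }
}
```

With a real item this could be called as CustomFields.read(item::getElementValue, "exampleNS", "field"), which keeps the custom property names in our own code instead of in a modified copy of the library.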

Hooray! We now have support for these additional fields, and we do not have to manually keep a modified copy of Informa in sync with the version used in our application. Why the original developers decided to fork the Informa library to add their own properties I may never know, but I’ll update this post if they decide to step forward!