parsing – Mats Lindh

Varnish: No ESI processing, first char not ‘<‘

I’ve spent some time setting up varnish for a service at work where we deliver JSON-fragments from different resources as a unified document. We build the resulting JSON document by including the different JSON fragments through an ESI (Edge Side Includes) statement in the response. Varnish then fetches these fragments from the different services, allowing for independent caching duration for each of the services, combining then into a complete document.

The main “problem” is that of Varnish 2.0.4, varnish does a check to see if the content returned from the server actually is HTML-based content. It does this by check if the first character of the returned document is ‘<'. The goal is to avoid including binary data directly into a text document. Since we’re working with JSON instead of regular HTML/XML, the first character in our responses is not a ‘<', and the ESI statement did not get evaluated. You can however disable this check in varnishd by changing the esi_syntax configuration setting - just add the preferred value to the arguments when starting varnish. To disable the content check, add:

varnishd … -p esi_syntax=0x1

I’m not sure if you can set this for just one handler / match in varnish as this solved our issue, but feel free to leave a comment if you have any experience with this.

What Happened To My Beautiful En-dashes?!

First, a small introduction to the problem: We’re running stuff in UTF-8 all the way. A few sites we’re reading feeds from are using ISO-8859-1 as their charset, but they either supply the feed with the correct encoding specific or the feeds arrive as UTF-8. Everything works nicely, except for the mentioned-in-the-headline en-dashes. Firefox only shows 00 96 (0x00 0x96), but everything looks correct when you view the headlines and similiar stuff on the original site.

Strange.

The digging, oh all the digging.

After the already mentioned digging (yes, the digging) in data at the large search engines (ok, maybe I did a search or two), I discovered that the windows cp1252 encoding uses 0x96 to store endashes. This seems similiar! We’re seeing 0x96 as one of the byte values above, so apparently cp1252 is sneaking into the mix somewhere along the lines. Most of the clients using the CMS-es are windows, so they might apparently be to blame.

ISO-8859-1 enters the scene

As the sites (and feeds) provide ISO-8859-1 as their encoding, I thought it would be interesting to see what ISO-8859-1 defines as the representation for the byte value 0x96. Lo’ and behold: 0x96 is not defined in ISO-8859-1. Which actually provides us with the solution.

I welcome thee, Mr. Solution

When the ISO-8859-1 encoded string is converted into UTF-8, the bytes with the value 0x96 (which is the endash in cp1252) is simply inserted as a valid code sequence in UTF-8 which represents a character that’s not defined.

We’re saying that the string is ISO-8859-1, although in reality it is either cp1252 or a mangled version of iso-8859-1 and cp1252 (for the endashes, at least).

If you’re on the parsing end of this mumbo jumbo, one solution is to replace the generated UTF-8 sequence (0xc2 0x96) (converted from 0x96 i ISO-8859-1) with the proper one (0xe2 0x80 0x93):

$data = str_replace("\xc2\x96", "\xE2\x80\x93", $data);

And voilá, everything works.

parser error : Detected an entity reference loop

While importing a rather large XML-document (45MB+) into a database today, I ran into a weird problem on one specific server. The server runs SUSE Enterprise and presented an error that neither other test server gave. After a bit of digging around on the web I were able to collect enough information from several different threads about what could be the source of the problem.

It seems that the limit was introduced in libxml2 about half a year ago to avoid some other memory problems, but this apparently borked quite a few legitimate uses. As I have very little experience with administrating SUSE Enterprise based servers, I quickly shrugged off trying to update the packages and possibly recompiling PHP. Luckily one of the comments in a thread about the problem saved the day.

If you find yourself running into this message; swap your named entities in the XML file (such as < and >) to numeric entities (such as < and >). This way libxml2 just replaces everything with the byte value while parsing instead of trying to be smart and keep an updated cache.