Fixing an Issue With PHP's SoapClient Overwriting Duplicate Attribute and Tag Names

The setting:

A SOAP response contains an Id attribute – and an element with the exact same name directly beneath the element carrying the attribute (an immediate child). Something like this (the enclosing element name is made up, but the structure matches what we saw):

<entry z:Id="i1">
    <Id>foobar</Id>
</entry>
The problem is that the result object generated by SoapClient (at least as of PHP 5.2.12) contains the attribute value, not the element value. In our case we could ignore the z:Id attribute, as it was simply an id used to identify the element in the response (this might be something that ASP.NET or some other .NET component does).

Our solution is to subclass the built-in SoapClient and override the __doRequest method, stripping out the part of the response that provides the wrong value for the Id field:

class Provider_SoapClient extends SoapClient
{
    public function __doRequest($request, $location, $action, $version)
    {
        $result = parent::__doRequest($request, $location, $action, $version);
        // Strip the z:Id attribute so it no longer shadows the <Id> element in the result
        $result = preg_replace('/ z:Id="i[0-9]+"/', '', $result);
        return $result;
    }
}

This removes the attribute from all the elements in the response (there is no danger of that string appearing in any of the element values here – if there is, be sure to adjust the regular expression). And voilà, it works!
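For reference, the subclass is used exactly like the regular SoapClient – roughly like this (the WSDL URL and operation name below are just placeholders for whatever your service exposes):

$client = new Provider_SoapClient('http://example.com/service?wsdl');
$result = $client->SomeOperation(array('foo' => 'bar'));
echo $result->Id; // now holds the element value instead of the stripped attribute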

What Happened To My Beautiful En-dashes?!

First, a small introduction to the problem: we're running everything in UTF-8 all the way. A few of the sites we're reading feeds from use ISO-8859-1 as their charset, but they either supply the feed with the correct encoding specified or the feeds arrive as UTF-8. Everything works nicely, except for the en-dashes mentioned in the headline. Firefox only shows 00 96 (the bytes 0x00 0x96), but everything looks correct when you view the headlines and similar content on the original site.

Strange.

The digging, oh all the digging.

After the already mentioned digging (yes, the digging) in data at the large search engines (ok, maybe I just did a search or two), I discovered that the Windows cp1252 encoding uses 0x96 to store en-dashes. That sounds familiar! We're seeing 0x96 as one of the byte values above, so apparently cp1252 is sneaking into the mix somewhere along the line. Most of the clients using the CMSes run Windows, so they are the likely culprits.

ISO-8859-1 enters the scene

As the sites (and feeds) declare ISO-8859-1 as their encoding, I thought it would be interesting to see what ISO-8859-1 defines as the representation of the byte value 0x96. Lo and behold: 0x96 is not defined as a printable character in ISO-8859-1 (the 0x80–0x9F range is reserved for control characters). Which actually provides us with the solution.

I welcome thee, Mr. Solution

When the ISO-8859-1 encoded string is converted to UTF-8, a byte with the value 0x96 (the en-dash in cp1252) is simply mapped to the corresponding code point U+0096 – a control character with no visible representation – and encoded as a perfectly valid UTF-8 sequence.

We're claiming that the string is ISO-8859-1, although in reality it is either cp1252 or a mangled mix of ISO-8859-1 and cp1252 (for the en-dashes, at least).

If you're on the parsing end of this mumbo jumbo, one solution is to replace the generated UTF-8 sequence (0xC2 0x96, converted from 0x96 in ISO-8859-1) with the proper en-dash sequence (0xE2 0x80 0x93):

$data = str_replace("\xc2\x96", "\xE2\x80\x93", $data);

And voilà, everything works.
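If other cp1252 punctuation leaks into the feeds the same way, the trick extends naturally. Here's a rough sketch covering the most common offenders (smart quotes and the two dashes) – trim or extend the mapping to match what actually shows up in your data:

// cp1252 bytes such as 0x91–0x94, 0x96 and 0x97, mislabelled as ISO-8859-1 and
// converted to UTF-8, end up as the two-byte sequences 0xC2 0x91 ... 0xC2 0x97.
// Map them back to the characters they were meant to be.
$map = array(
    "\xc2\x91" => "\xe2\x80\x98", // left single quote
    "\xc2\x92" => "\xe2\x80\x99", // right single quote
    "\xc2\x93" => "\xe2\x80\x9c", // left double quote
    "\xc2\x94" => "\xe2\x80\x9d", // right double quote
    "\xc2\x96" => "\xe2\x80\x93", // en dash
    "\xc2\x97" => "\xe2\x80\x94", // em dash
);
$data = strtr($data, $map);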

Shell Script For Submitting Documents to Solr

Here's a small shell script I'm using to submit pre-made XML documents to Solr. The documents are usually produced by some other program before being submitted to the Solr server. The script submits all the files in a directory to the server; here, all the files in the documents directory (relative to the location of the script) will be submitted.

You'll have to update the URL and the directory (documents) below. We usually group 1,000 documents together in a single file, so the commit happens for every thousand documents. If you use autocommit in Solr, you can remove that line. The script requires curl to talk to the Solr server.

URL=http://localhost:8080/solr/update
cd documents || exit

for i in $( ls ); do
    # POST the file as the XML request body
    cat "$i" | curl -X POST -H 'Content-Type: text/xml' --data-binary @- $URL
    # Commit after each file (drop this if you rely on autocommit)
    curl $URL -H "Content-Type: text/xml" --data-binary '<commit/>'
    echo item: $i
done

Making Solr Requests with urllib2 in Python

When making XML requests to Solr (a full-text document search engine) to index, commit, update or delete documents, the request is submitted as an HTTP POST with an XML document as the body. urllib2 supports submitting POST data through the second parameter to the urlopen() call:

f = urllib2.urlopen("http://example.com/", "key=value")

The first attempt involved simply passing the XML data as the second parameter, but that made the Solr webapp return a “400 – Bad Request” error. The reason for Solr barfing is that the urlopen() function sets the Content-Type to application/x-www-form-urlencoded. We can solve this by changing the Content-Type header:

import urllib2

updateURL = "http://localhost:8080/solr/update"  # the Solr update handler

# The XML body of the request – here a plain commit
solrReq = urllib2.Request(updateURL, '<commit/>')
solrReq.add_header("Content-Type", "text/xml")
solrPoster = urllib2.urlopen(solrReq)
response = solrPoster.read()
solrPoster.close()

Other XML-based Solr requests, such as adding and removing documents from the index, will also work by changing the Content-Type header.

The same code will also allow you to use urllib2 to submit SOAP or XML-RPC requests, or to speak any other protocol that requires you to control the complete POST body of the request.

parser error : Detected an entity reference loop

While importing a rather large XML document (45 MB+) into a database today, I ran into a weird problem on one specific server. The server runs SUSE Enterprise and presented an error that none of the other test servers gave. After a bit of digging around on the web I was able to collect enough information from several different threads about what could be the source of the problem.

It seems that the limit was introduced in libxml2 about half a year ago to avoid some other memory problems, but it apparently borked quite a few legitimate uses. As I have very little experience administrating SUSE Enterprise based servers, I quickly shrugged off trying to update the packages and possibly recompile PHP. Luckily one of the comments in a thread about the problem saved the day.

If you find yourself running into this message, swap the named entities in the XML file (such as &lt; and &gt;) for numeric entities (such as &#60; and &#62;). This way libxml2 simply replaces each entity with its byte value while parsing, instead of trying to be smart and keep an updated cache.
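If you want to do the swap programmatically before handing the document to libxml2, something along these lines does the job in PHP – the list below only covers the five predefined XML entities, so extend it if your documents use others:

// Replace the predefined named entities with their numeric equivalents
$named   = array('&lt;',  '&gt;',  '&amp;', '&quot;', '&apos;');
$numeric = array('&#60;', '&#62;', '&#38;', '&#34;',  '&#39;');
$xml = str_replace($named, $numeric, $xml);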

Google Releases Their Protocol Buffers

Fresh from the Google Open Source Blog comes news that Google has released their Protocol Buffers specification and accompanying libraries. The code and specification have been released as Protocol Buffers on Google Code.

Protocol Buffers is a data format for fast exchange and parsing of data and messages between computers. In that respect it is similar to simple uses of XML, but the size of the messages on the wire and the time needed to parse them are heavily optimized for busy sites. There is no need to spend loads of time on XML parsing when you could be doing something useful instead. It's very easy to interact with the messages through the generated classes (for C++, Java and Python), and future versions of the same schema are compatible with old versions (new fields are simply ignored by older parsers).

There is still no PHP implementation available, so I guess it's time to get going and lay down some code during the summer. Anyone up for the job?

Exploring Adobe Flex For The First Time

Adobe Flex is an SDK from Adobe for creating Flash applications based on XML and ActionScript. No need for Flash MX or Flash CS3 – just markup, markup, markup and some ECMAScript thrown in to make it all come together. While I'm no fan of Flash content on web sites, this adventure has its purpose: making an open source multi-file upload application. Every now and then I run into the need to let people upload several files at once (think 20 – 100 files), and selecting each one by itself would be really tiresome. The only good solution I've come across so far is the Flash based uploaders, so, well, here we go. The plan is to release the result under an MIT-based license.

Flex has been quite good to me so far. The UI of the application is built as an XML file, and the code is simply added as ActionScript in a CDATA section of the XML file. Simple. You then run the XML file (MXML) through a compiler (mxmlc) and get a ready-to-run .swf file out. Works like a charm.