What Happened To My Beautiful En-dashes?!

First, a small introduction to the problem: We’re running stuff in UTF-8 all the way. A few sites we’re reading feeds from are using ISO-8859-1 as their charset, but they either supply the feed with the correct encoding specific or the feeds arrive as UTF-8. Everything works nicely, except for the mentioned-in-the-headline en-dashes. Firefox only shows 00 96 (0x00 0x96), but everything looks correct when you view the headlines and similiar stuff on the original site.

Strange.

The digging, oh all the digging.

After the already mentioned digging (yes, the digging) in data at the large search engines (ok, maybe I did a search or two), I discovered that the windows cp1252 encoding uses 0x96 to store endashes. This seems similiar! We’re seeing 0x96 as one of the byte values above, so apparently cp1252 is sneaking into the mix somewhere along the lines. Most of the clients using the CMS-es are windows, so they might apparently be to blame.

ISO-8859-1 enters the scene

As the sites (and feeds) provide ISO-8859-1 as their encoding, I thought it would be interesting to see what ISO-8859-1 defines as the representation for the byte value 0x96. Lo’ and behold: 0x96 is not defined in ISO-8859-1. Which actually provides us with the solution.

I welcome thee, Mr. Solution

When the ISO-8859-1 encoded string is converted into UTF-8, the bytes with the value 0x96 (which is the endash in cp1252) is simply inserted as a valid code sequence in UTF-8 which represents a character that’s not defined.

We’re saying that the string is ISO-8859-1, although in reality it is either cp1252 or a mangled version of iso-8859-1 and cp1252 (for the endashes, at least).

If you’re on the parsing end of this mumbo jumbo, one solution is to replace the generated UTF-8 sequence (0xc2 0x96) (converted from 0x96 i ISO-8859-1) with the proper one (0xe2 0x80 0x93):

$data = str_replace("\xc2\x96", "\xE2\x80\x93", $data);

And voilá, everything works.

The Thumbs Up! of Awesome Approval

Every once in a while a few new interesting tools surface themselves and become a natural part of how a developer works. I’ve taken a look at which tools I’ve introduced in my regular workflow during the last six months.

NetBeans

NetBeans got the first version of what has become awesome PHP support in version 6.5, and after version 6.7 got released just before the summer, things have become very stable. NetBeans is absolutely worth looking into for PHP development (and Java), and you sure can’t beat the price (free!). In the good old days NetBeans were slow as hell, but I’ve not noticed any serious issues in 6.7 (.. although we didn’t really have quad cores and 4GB of memory back then either). Go try it out today!

Balsamiq Mockups

Balsamiq is an awesome tool for making quick mockups for UI designs. Previous I’d play around in Adobe Photoshop, dragging layers around and being concerned with all the wrong things. Mockups abstracts away all the UI elements (and comes with a vast library of standard elements), which makes it very easy to experiment and focus on the usability instead of the design and its implementation. For someone who’s more interested in the experience and the programming than the actual design (.. I’ll know what I want when I see it!) this makes it easy to convey my suggestions and create small, visual notes of my own usabilityideas.

You can try it out for free at their website, and they even give away licenses to people who are active in Open Source Development (disclaimer: I got a free license, but the experiences are all my own. This is not paid (or unpaid) advertising or product placement.)

GitHub

I’ve been playing around with git a bit, but after writing a patch for the PEAR-module for Gearman (.. which still doesn’t seem to have made it anywhere significant), I signed up for github to be able to fork the project and submit my patch there. A very good technical solution partnered with an easy way of notifying the original developers of your patch (which you simply provide in your own branch) by submitting a “pull request” makes it very easy to both have patches supplied to you and to submit patches to projects hosted at GitHub.

Thumbs up!

Parsing XML With Namespaces with SimpleXML

There’s one thing SimpleXML for PHP is horrible to use for: parsing XML containing namespaces. Namespaces requires special handling, and the only way I’ve found that allows you to refer to an element in another namespace, is to use the ->children() method with the namespace. I’m sure there’s an easier way than this, and if you know of any, please leave a comment!

Let’s start with the following XML snippet (using SOAP as an example):


    
        
            asdasd
        
    

The easiest way to do this is to “ignore” the namespaces, and simply do $root->{soap:Envelope} that to access the property. This will not work, as SimpleXML is quite peculiar about it’s namespaces (.. while everything else is simple and easy to use).

One solution is to provide the namespace you’re interested in to the $element->children() method, which returns all the children of the element in a particular namespace (or without arguments, outside any namespace):

$sxml = new SimpleXMLElement(file_get_contents('soap.xml'));

foreach($sxml->children('http://www.w3.org/2001/12/soap-envelope') as $el)
{
    if ($el->getName() == 'Body')
    {
        /* ... */
    }
}

Yes. That’s quite horrible.

But luckily the xpath method can help us:

$elements = $sxml->xpath('//soap:Envelope/soap:Body/queryInstantStreamResponse');

This will actually fetch all the elements titled “queryInstantStreamResponse” which are childs of soap:Envelope and soap:Body. And this works as you expect it to, without having to use children, provide the actual namespace URI, etc.

The xpath method returns an array containing all the matching elements, so in this case you’ll receive an array with a single element, containing the text inside the queryInstantStreamResponse element.

There should be an easier way than this.

Finding Substring in a String in Bash

If you’re ever in the need of checking if a variable in bash contains a certain string (and is not equal to, just a part of), the =~ operator to the test function comes in very handy. =~ is usually used to compare a variable against a regular expression, but can also be used for substrings (you may also use == *str*, but I prefer this way).

This short example submits a document to solr using curl, then emails the result if the Solr server responded with an error (.. I tried mapping this against the error code or something similiar instead, but didn’t find a better way. If you have something better, please leave a comment!):

    CURLRESULT=`cat $i | curl -s -X POST -H 'Content-Type: text/xml' -d @- $URL`
    if [[ $CURLRESULT =~ "Error report" ]]
      then 
	echo "Error!! Error!! CRISIS IMMINENT!!"
        echo $CURLRESULT | mail -s "Error importing to SOLR" mail@example.com
        exit
    fi

Neat to check that everything went OK before you remove the files you’ve submitted.

Avoiding Resetting the Scroll Position in a Textarea When Inserting Content

Now, that’s quite a headline. And this post will explain just the simple concept posted in the headline. How to avoid (at least) firefox from scrolling to the top when you insert content into a textarea.

It’s simple. Very simple. And it was shown to be so very simple for someone who didn’t remember scrollTop by this thread.

In jQuery (which we use with the caret plugin):

currentScrollPosition = $("#textareaId").scrollTop();
/* do stuff */
$("#textareaId").scrollTop(currentScrollPosition);

Yep. So simple that it actually hurts a bit.

Changing The Source Directory in NetBeans

After reorganizing the directory structure on my workstation a bit, NetBeans refused to load the sources for the projects I have configured. The reason for this is probably because I store the NetBeans metadata files separate from the Source directory itself (as I don’t want the NetBeans project files in the repository, etc.).

NetBeans did however not have the option of choosing another source location (changing the existing location) for an existing project. Well, no worries. Luckily the NetBeans project files are in straight forward text format, so we can easily change it there instead! Close NetBeans (so that you don’t accidentally overwrite the new project file) Find the project directory and open “nbproject”. The file “project.properties” is the one we’re after. Close to the top (on line 3 here) you should find:

src.dir=<old path>

Simply change it to:

src.dir=<new path>

while remembering to pay attention to any escape sequences etc along the way (under Windows, \ needs to be escaped, so c:\\Directory\\SubDirectory is the appropriate path).

Adding Support for Asynchronous Status Requests for Net_Gearman

I spent the evening yesterday playing around a bit more with Gearman, a system for farming out tasks to workers across several servers. As my workstation at home still runs Windows, the only PHP library available is the Net_Gearman in PEAR. Net_Gearman supports tasks (something to do), sets (a collection of tasks), workers (the processes that performs the task) and clients (which requests tasks to be performed). The gearman protocol supports retrieving the current status of a task from the gearman server (which contains information about how the worker is progressing, reported by the worker itself), but Net_Gearman did not.

The reason for ‘did not’ is that I’ve created a small patchset to add the functionality to Net_Gearman. All internal methods and properties are still used as they were before, but I’ve added two helper methods for retrieving the socket connection for a particular gearman server (Net_Gearman usually just picks a random server, but we need to contact the server that’s responsible for the task) and a getStatus(server, handle) method to the Gearman Client. I’ve also added a property keeping the address of the server which were assigned the task to the Task class.

After submitting a task to be performed in the background (you do not need this to get the status for foreground tasks, as you can provide a callback to handle that), your Task object will have its handle and server properties set. These can be used to retrieve status information about the task later. You’ll still need to provide the possible servers to the Gearman client when creating the client (through the constructor).

Example of creating a task and retrieving the server / handle pair after starting the task:

require_once 'Net/Gearman/Client.php';

$client = new Net_Gearman_Client(array('host:4730'));

$task = new Net_Gearman_Task('Reverse', range(1,5));
$task->type = Net_Gearman_Task::JOB_BACKGROUND;

$set = new Net_Gearman_Set();
$set->addTask($task);

$client->runSet($set);

print("Status information: \n");
print($task->handle . "\n");
print($task->server . "\n");

Retrieving the status:

require_once 'Net/Gearman/Client.php';

$client = new Net_Gearman_Client(array('host:4730'));
$status = $client->getStatus('host:4730', 'H:mats-ubuntu:1');

The array returned from the getStatus() method is the same array as returned from the gearman server and contains information about the current status (numerator, denominator, finished, etc, var_dump it to get the current structure). I’ve also added the patchset to the Issue tracker for Net_Gearman at github.

The patchset (created from the current master branch at github) can be downloaded here: GearmanGetStatusSupport.tar.gz.

UPDATE: I’ve finally gotten around to creating my own fork of NET_Gearman on github too. This fork features the patch mentioned above.

Making Solr Requests with urllib2 in Python

When making XML requests to Solr (A fulltext document search engine) for indexing, committing, updating or deleting documents, the request is submitted as an HTTP POST containg an XML document to the server. urllib2 supports submitting POST data by using the second parameter to the urlopen() call:

f = urllib2.urlopen("http://example.com/", "key=value")

The first attempt involved simply adding the XML data as the second parameter, but that made the Solr Webapp return a “400 – Bad Request” error. The reason for Solr barfing is that the urlopen() function sets the Content-Type to application/x-www-form-urlencoded. We can solve this by changing the Content-Type header:

solrReq = urllib2.Request(updateURL, '')
solrReq.add_header("Content-Type", "text/xml")
solrPoster = urllib2.urlopen(solrReq)
response = solrPoster.read()
solrPoster.close()

Other XML-based Solr requests, such as adding and removing documents from the index, will also work by changing the Content-Type header.

The same code will also allow you to use urllib to submit SOAP, XML-RPC-requests and use other protocols that require you to change the complete POST body of the request.

Adding SVN Revision to a Configuration File

After a while you realize that the best way to serve almost-never-changing content is to give the content an expire date way ahead in the future. The allows your server and your network pipes to do more sensible stuff than delivering the same old versions of files again and again and again and again.

A problem does however surface when you want to update the files and make the visiting user request the new version instead of the old. The trick here is to change the URL for the resource, so that the browser requests the new file. You can do this by appending a version number to the file and either rewriting it behind the scenes to the original file, or by appending a timestamp (or some other item) to the URL as a GET value. The web server ignores this for regular files, but as it identifies a new unique resource, the web browser has to request it again and use the new and improved ™ file.

Using the timestamp of the file is a bit cumbersome and requires you to hit the disk one additional time each time you’re going to show an URL to one of the almost-static resources, but luckily we already have an identifier describing which version the file is in: the SVN revision number (.. if you use subversion, that is). You could use the SVN revision for each file by itself, but we usually decide that the global version number for SVN is good enough. This means that each time you update the live code base through svn up or something like that (remember to block .svn directories and their files if you run your production directory from a SVN branch. This can be discussed over and over, but I’m growing more and more fond of actually doing just that..). To avoid having to call svnversion each time, it’s useful to be able to insert the current revision number into the configuration file for the application (or a header file / bootstrap file).

Here’s an example of how you can insert the current SVN revision into a config file for a PHP application.

  1. Create a backup of the current configuration file.
  2. Update the current revision through svn up.
  3. Retrieve the current revision number from svnversion.
  4. Insert the revision number using sed into a temporary copy of the configuration file.
  5. Move the new configuration file into place as the current configuration file.
  6. Party like it’s 1999!

This assumes that you use an array named $config in your configuration file. I suggest that you name it something else, but for simplicity I’m going with that here. First, create a $config[‘svn’] entry in your config file. If you have some other naming scheme, you’re going to have to change the relevant parts below.

#!/bin/bash
cp ./config/config.php ./config/config.backup.php
svn up
VERSION=`svnversion .`
echo $VERSION
sed "s/config\['svn'\] = '[0-9M]*';/config\['svn'\] = '$VERSION';/" < ./config/config.php > ./config/config.fixed.php
mv ./config/config.fixed.php ./config/config.php

Save this into a file named upgrade.sh, make it executable by doing chmod u+x upgrade.sh and run it by typing ./upgrade.sh.

And this is where you put your hands above your head and wave them about. When you’re done with that, you can refer to your current SVN revision using $config[‘svn’] in your PHP application (preferrably in your template or where you build the URLs to your static resources). Simply append ?v=$config[‘svn’] to your current filenames. When you have a new version available, run ./upgrade.sh (or whatever name you gave the script) again and let your users enjoy the new experience.

The Results of Our Recent Python Competition

Last week we had yet another competition where the goal is to create the smallest program that solves a particular problem. This time the problem to solve was a simple XML parsing routine with a few extra rules to make the parsing itself easier to implement (The complete rule set). This time python was chosen as the required language of the submissions.

The winning contribution from Helge:

from sys import stdin
p=0
s=str.split
for a in s(stdin.read(),'<'):
 a=s(a,'>')[0];b=a.strip('/');p-=a

The contribution from Tobias:

from sys import stdin
i=stdin.read()
s=x=t=0
k=i.find
while x")
        elif i[x+1]=="/":s-=1
        else:
            u=0
            while u",x)-1]=="/":t=1
            else:t=0;s+=1
            print i[x+1:k(">",x)-t].strip()
    x+=1

The contribution from Harald:

from sys import stdin
l=stdin.read()
e,p,c,x=0,0,0,0
r=""
for i in l:
       if l[e:e+2]==']>'or l[e:e+2]=='->':
               c=0
       if l[e:e+2]=='':
               p=0
               if i=='/' and l[e+1]=='>':
                       x-=1
       if p and not c:
               r+=i
       if not c and i=='<'and l[e+1]!='/':
               r+="\n"+(' '*4)*x
               x+=1
               p=1
       if i=='<'and l[e+1]=='/':
               x-=1
       e+=1

If any of the contributors want to provide a better description of their solutions, feel free to leave a comment!

Thanks to all the participants!