Parsing XML With Namespaces with SimpleXML

There’s one thing SimpleXML for PHP is horrible to use for: parsing XML containing namespaces. Namespaces require special handling, and the only way I’ve found to refer to an element in another namespace is to use the ->children() method with the namespace. I’m sure there’s an easier way than this, and if you know of any, please leave a comment!

Let’s start with the following XML snippet (using SOAP as an example):

<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope">
    <soap:Body>
        <queryInstantStreamResponse>
            asdasd
        </queryInstantStreamResponse>
    </soap:Body>
</soap:Envelope>

The intuitive approach is to “ignore” the namespaces and simply do $root->{'soap:Envelope'} to access the element. This will not work, as SimpleXML is quite peculiar about its namespaces (.. while everything else is simple and easy to use).

One solution is to provide the namespace you’re interested in to the $element->children() method, which returns all the children of the element in a particular namespace (or without arguments, outside any namespace):

$sxml = new SimpleXMLElement(file_get_contents('soap.xml'));

// Grab only the children that live in the SOAP envelope namespace
foreach ($sxml->children('http://www.w3.org/2001/12/soap-envelope') as $el)
{
    if ($el->getName() == 'Body')
    {
        /* ... */
    }
}

Yes. That’s quite horrible.

But luckily the xpath method can help us:

$elements = $sxml->xpath('//soap:Envelope/soap:Body/queryInstantStreamResponse');

This will actually fetch all the elements named “queryInstantStreamResponse” which are children of soap:Envelope and soap:Body. And this works as you expect it to, without having to use children(), provide the actual namespace URI, etc.

The xpath method returns an array containing all the matching elements, so in this case you’ll receive an array with a single element, containing the text inside the queryInstantStreamResponse element.
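Put together, a self-contained version could look like this (a sketch, assuming soap.xml contains the snippet above; registerXPathNamespace() binds our own prefix instead of relying on the one declared in the document):

$sxml = new SimpleXMLElement(file_get_contents('soap.xml'));

// Bind the prefix used in the XPath expression to the namespace URI,
// so the query works even if the document declares a different prefix
$sxml->registerXPathNamespace('soap', 'http://www.w3.org/2001/12/soap-envelope');

$elements = $sxml->xpath('//soap:Envelope/soap:Body/queryInstantStreamResponse');

if (!empty($elements))
{
    echo (string) $elements[0]; // "asdasd" for the snippet above
}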

There should be an easier way than this.

Finding Substring in a String in Bash

If you’re ever in need of checking whether a variable in bash contains a certain string (not equal to, just a part of), the =~ operator in the [[ ]] conditional expression comes in very handy. =~ is usually used to compare a variable against a regular expression, but since the match is unanchored and a quoted right-hand side is taken literally, it also works for substrings (you may also use == *str*, but I prefer this way).
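A minimal sketch of both forms (haystack is just a made-up variable):

    haystack="foo bar baz"

    # =~ with a quoted right-hand side is matched literally, and since
    # regex matches are unanchored, this is effectively a substring test
    if [[ $haystack =~ "bar" ]]; then
        echo "found with =~"
    fi

    # the glob variant; the asterisks must be left unquoted
    if [[ $haystack == *bar* ]]; then
        echo "found with =="
    fi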

This short example submits a document to Solr using curl, then emails the result if the Solr server responded with an error (.. I tried mapping this against the error code or something similar instead, but didn’t find a better way. If you have something better, please leave a comment!):

    CURLRESULT=`cat "$i" | curl -s -X POST -H 'Content-Type: text/xml' -d @- "$URL"`
    if [[ $CURLRESULT =~ "Error report" ]]
    then
        echo "Error!! Error!! CRISIS IMMINENT!!"
        echo "$CURLRESULT" | mail -s "Error importing to SOLR" mail@example.com
        exit 1
    fi

Neat to check that everything went OK before you remove the files you’ve submitted.

Avoiding Resetting the Scroll Position in a Textarea When Inserting Content

Now, that’s quite a headline. And this post will explain just the simple concept stated in the headline: how to keep (at least) Firefox from scrolling to the top when you insert content into a textarea.

It’s simple. Very simple. And it was shown to be just that simple, for someone who didn’t remember scrollTop, by this thread.

In jQuery (which we use with the caret plugin):

// Remember the current scroll offset before touching the content
var currentScrollPosition = $("#textareaId").scrollTop();
/* insert content, move the caret, etc. */
$("#textareaId").scrollTop(currentScrollPosition);

Yep. So simple that it actually hurts a bit.
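For reference, the same trick without jQuery (a minimal sketch, using the same made-up element id):

var ta = document.getElementById('textareaId');
var pos = ta.scrollTop;       // remember the offset
ta.value += 'inserted text';  // any change that would reset the scroll
ta.scrollTop = pos;           // and put it back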

NTFS Junctions and PHP 5.3.0

After upgrading to PHP 5.3.0 on my Windows XP workstation, junctions have suddenly stopped working in any PHP-related code. I use junctions to link directories from their version-specific paths (NTFS symlinks were first introduced with Vista, so I’m still using junctions), but after upgrading, none of the libraries which live in directories linked through junctions work.

This seems to be a known bug, Files on NTFS Mounted Volumes (Junctions) inaccessible, although I’m also seeing the issue with completely local files (not mounted from remote file systems). Seems like the thing to do is to wait for 5.3.1 and hope it resolves the issue .. if it gets fixed by then. For the time being I’ll manually copy the directories.

Update: I’ve added a log of a test session showing the problem.

Changing The Source Directory in NetBeans

After reorganizing the directory structure on my workstation a bit, NetBeans refused to load the sources for the projects I have configured. The reason is probably that I store the NetBeans metadata files separately from the source directory itself (as I don’t want the NetBeans project files in the repository, etc.).

NetBeans does, however, not offer the option of choosing another source location (changing the existing location) for an existing project. Well, no worries. Luckily the NetBeans project files are in a straightforward text format, so we can easily change it there instead! Close NetBeans (so that you don’t accidentally overwrite the new project file), then find the project directory and open “nbproject”. The file “project.properties” is the one we’re after. Close to the top (on line 3 here) you should find:

src.dir=<old path>

Simply change it to:

src.dir=<new path>

while remembering to pay attention to any escape sequences etc. along the way (under Windows, \ needs to be escaped, so c:\\Directory\\SubDirectory is the appropriate path).
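For example (a made-up Windows path):

src.dir=c:\\Projects\\MyProject\\src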

I Made It Again!

This day seemed quite distant back in February when the signup for Grenserittet (“Cross Countries”) opened up, but after just short of five hours on the bike, I have now returned! Yet again the 81 kilometers between Strømstad and Halden were the goal of the trip, and yet again it proved to be harder than planned. Christer also did the race, but I failed to make any kind of contact with him during the day – even though we started in the same group. He did an amazing race and finished in 3 hours and 49 minutes, almost an hour less than me.

The biggest issue this year was that the weather turned really nasty during the last two days before the race, giving us 60mm of rain in total. This means one thing: mud, mud and more mud. I’ve never seen so much .. mud. In one area things had gotten so bad that the organizers declared that part of the track unrideable, and chose to lead the competitors around the area instead. Luckily the weather today was warm and comfy (not too hot), so the parts of the track that hadn’t gone completely muddy were quite good.

Compared to last year I had a much, much slower start this year, averaging 13.37 (!) km/h during the first 29 kms, down from 19 km/h the previous year. Last year things went from bad to worse, and the last 50 kms saw slower and slower average speeds. This year it was just the opposite, and after the first checkpoint I averaged 17.8 km/h, 20.5 km/h and 22.6 km/h over the remaining legs. Seems like I at least had a bit more energy late in the race this year. The total time this year came to 4 hours, 45 minutes and 57 seconds, close to two minutes better than last year. Considering the state of the track in the forest, I’m almost happy with the time. Next year, though…

Now it’s four weeks until Birkebeiner’n, an 89.5km-long race over the mountains between Rena and Lillehammer. I have no idea why I keep doing this, but I’ll still do it next year.

Adding Support for Asynchronous Status Requests for Net_Gearman

I spent the evening yesterday playing around a bit more with Gearman, a system for farming out tasks to workers across several servers. As my workstation at home still runs Windows, the only PHP library available is Net_Gearman from PEAR. Net_Gearman supports tasks (something to do), sets (a collection of tasks), workers (the processes that perform the tasks) and clients (which request tasks to be performed). The Gearman protocol supports retrieving the current status of a task from the Gearman server (which contains information about how the worker is progressing, reported by the worker itself), but Net_Gearman did not.

The reason for ‘did not’ is that I’ve created a small patchset adding the functionality to Net_Gearman. All internal methods and properties are still used as they were before, but I’ve added two helper methods to the Gearman client: one for retrieving the socket connection for a particular Gearman server (Net_Gearman usually just picks a random server, but we need to contact the server that’s responsible for the task), and a getStatus(server, handle) method. I’ve also added a property to the Task class keeping the address of the server that was assigned the task.

After submitting a task to be performed in the background (you don’t need this to get the status of foreground tasks, as you can provide a callback to handle that), your Task object will have its handle and server properties set. These can be used to retrieve status information about the task later. You’ll still need to provide the possible servers to the Gearman client when creating it (through the constructor).

Example of creating a task and retrieving the server / handle pair after starting the task:

require_once 'Net/Gearman/Client.php';

$client = new Net_Gearman_Client(array('host:4730'));

// Run the task in the background; the client returns without waiting
$task = new Net_Gearman_Task('Reverse', range(1, 5));
$task->type = Net_Gearman_Task::JOB_BACKGROUND;

$set = new Net_Gearman_Set();
$set->addTask($task);

$client->runSet($set);

// With the patch applied, the task now knows its handle and server
print("Status information: \n");
print($task->handle . "\n");
print($task->server . "\n");

Retrieving the status:

require_once 'Net/Gearman/Client.php';

$client = new Net_Gearman_Client(array('host:4730'));
$status = $client->getStatus('host:4730', 'H:mats-ubuntu:1');

The array returned from the getStatus() method is the same array as returned from the Gearman server and contains information about the current status (numerator, denominator, finished, etc. – var_dump() it to see the exact structure). I’ve also added the patchset to the issue tracker for Net_Gearman at GitHub.
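With the server and handle from the first example, polling until the job finishes could look something like this (a sketch against the patched client, using the array keys mentioned above):

// Poll the status of the background task started earlier
do
{
    sleep(1);
    $status = $client->getStatus($task->server, $task->handle);
    printf("Progress: %s/%s\n", $status['numerator'], $status['denominator']);
}
while (empty($status['finished']));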

The patchset (created from the current master branch at github) can be downloaded here: GearmanGetStatusSupport.tar.gz.

UPDATE: I’ve finally gotten around to creating my own fork of Net_Gearman on GitHub too. This fork features the patch mentioned above.

How To Make Solr Go 45% Faster

If you’re still looking for a good reason to spend a few minutes tuning your Solr caches (documentCache, filterCache and queryResultCache), I’ll give you two numbers:

avgTimePerRequest : 126.148822
avgTimePerRequest : 70.026436 

The first is with the default cache settings, the latter is with a very small change. Yep. That’s a 45% speed increase. So the interesting question is what I actually changed in the cache configuration – although I should warn you, the answer is very, very, very complicated:

The cache size. The default (at least for our current 1.3 installation) is to keep 512 elements in each cache. When someone on the solr-user list asked for an introduction to what the different cache statistics meant, I remembered that I hadn’t actually tweaked the settings at all. The Solr server has been running for a year now, so we have a pretty good idea of how it performs and what kind of queries we’re seeing. The stats indicated that a lot more cache entries got evicted than I was hoping to see, which gave us a low cache hit rate (about 50%).
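These numbers come from Solr’s admin statistics page, which you can fetch with something like this (host and port are assumptions matching the setup below):

curl 'http://localhost:8080/solr/admin/stats.jsp'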

The simple change was to increase the size of the caches (from 512 to 16384), so that we’re able to keep more entries in memory before evicting them. After running 24 hours with the new setup, we’re now seeing cache hit rates of 99%, 68% and 67%. The relevant sections of the solrconfig.xml file are:

<filterCache class="solr.LRUCache" size="16384"/>
<queryResultCache class="solr.LRUCache" size="16384"/>
<documentCache class="solr.LRUCache" size="16384"/>

The document cache fills about four times as fast as the filter cache, so we might have to tweak the settings further to suit our load pattern even better.

So what now?

The next step would be to try the FastLRUCache which is included with Solr 1.4 (currently in SVN and nightlies). If my memory serves me right, the changes are mostly related to locking, so I’m not sure if we’ll see significantly better performance.
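The switch itself should just be a matter of changing the class attribute on each cache, something like (a sketch, assuming the sizes above):

<filterCache class="solr.FastLRUCache" size="16384"/>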

We’ll also make further adjustments to the size of each of the caches to better match our usage.

Solr Becoming Slow After a While

This is perhaps the most obvious and “not very helpful” post for quite a few people, but for those who experience this issue, it’ll save the day. While running a test indexing routine of around 6 million documents, things would get really slow the moment I passed 1 million documents in the index. Weird. Optimizing didn’t help, as it died with an exception after a while.

The reason?

Not enough free disk space. Solr was indexing to a different partition than I thought.

Solved everything.

Shell Script For Submitting Documents to Solr

Here’s a small shell script I’m using to submit pre-made XML documents to Solr. The documents are usually produced by some other program before being submitted to the Solr server. This way we submit all the files in a directory to the server (here, all the files in the documents directory, relative to the location of the script, will be submitted).

You’ll have to update the URL and the directory (documents) below. We usually group together 1,000 documents in a single file, so a commit happens for every thousand documents. If you use autocommit in Solr, you can remove that line. The script requires curl to talk to the Solr server.
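For reference, the autocommit alternative is configured in the updateHandler section of solrconfig.xml, something like this (a sketch, not our setup):

<autoCommit>
    <maxDocs>1000</maxDocs>
</autoCommit>

And the script itself: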

URL=http://localhost:8080/solr/update
cd documents || exit

# Submit each file, then tell Solr to commit the batch
for i in *; do
    cat "$i" | curl -X POST -H 'Content-Type: text/xml' -d @- "$URL"
    curl "$URL" -H 'Content-Type: text/xml' --data-binary '<commit/>'
    echo "item: $i"
done