PHP Norge Meeting Tomorrow (May 15th, 2008)

There’s a member meeting in PHP Norge (the Norwegian PHP user group) coming up tomorrow, but it seems like I’m not going to make it this time either. I suddenly got a meeting in Fredrikstad that might run late, and Christer is leaving for New York on Friday, so he’s not attending either. Too bad, but hopefully all the other guys will have a great time. We’re still on for PHP Vikinger, so I guess we’ll meet there.

All other PHP users in the area are of course encouraged to attend!

Google Launches Friend Connect

Google announced their release of Friend Connect today, a platform built to share social data across social networking sites, and to let third parties build applications on top of the social networks already available at other sites. Together with OpenSocial, this will hopefully make it possible to maintain your friend graph across sites and applications, so that you don’t have to reconnect with your friends on ALL the social sites you use out there. I also hope that it shifts some of the lock-in and power away from the largest players in the market, and makes it possible for us smaller developers to let our customers use their existing data with us. I’ve signed up for the beta program, but I’m guessing quite a few companies are going to get in before my ticket shows up.

Anyway, the plan is to integrate the service over at the best site for electronic tournaments (pwned.no) and at a few other projects I have brewing in the “this will be cool” section. More about those projects later.

The Details of Liberty City

As a small break from all the programming-related posts, I present you with an amazing photo set on Flickr showing real-life New York locations compared to their counterparts in Grand Theft Auto 4. The level of detail that the people doing the 3D models for GTA4 have managed to get into the game is simply remarkable. [via Boing Boing]

The Graph of Company Classification


I’ve been meaning to do this for quite some time, but I never found the time until yesterday evening. Equipped with the data we’ve made searchable at Derdubor, I dug into the classification of the companies that our data provider supplies us with. Their classification uses the standard NACE codes for communicating what type of business we’re dealing with, and this set of classifications is standardized across European nations (a new standard was released in 2007 to further synchronize the classification across nations).

My goal was to explore the graph that describes the relationships between the different classification groups. A company may be classified in more than one group, and by using this as an edge between the classifications, I set out and wrote a small Python program for parsing the input file and building the graph in memory. For rendering the graph I planned on using the excellent GraphViz application, originally created at AT&T just for the purpose of creating beautifully rendered graphs from network descriptions.

My Python program therefore outputs a file in the dot language, which I then run through neato [PDF] to render the beautiful graph as a PDF.

An example from my generated dot-file:

graph bransjer {
	graph [overlap=scale];
	node [color=lightblue2, width=0.1, fontsize=12, height=0.1, style=filled];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Forsikringsagenter og assurandører" [penwidth=1.15441176471];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Hjelpevirksomhet for forsikring og pensj" [penwidth=1.23382352941];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Skadeforsikring" [penwidth=1.35294117647];

The penwidth attribute sets the width of the line between the nodes, and each "string" -- "string" entry describes an edge between two nodes.

I first ran into problems with managing this enormous graph (we’re talking 500k relations here), as trying to render the complete graph would throw both dot and neato off (as soon as we pass 2000 relations, things begin to go awry). Actually, this turned out to be a good thing, as it made me (with Jørn chipping in a bit) think a bit more about what I actually wanted to graph. I’m not really interested in places where there are only one or two links between different classification groups, as these may be wrongly entered, very peculiar businesses, etc. (with a total of 500k registrations, such things are quite common). Instead, I focused on the top ~1000 edges. By limiting my data set to the top 1000 most common relationships between groups, I’m able to render the graph in just under ten seconds, including the time to parse and build the graph in Python before filtering it down.
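To make the pipeline a bit more concrete, here is a minimal sketch of how such a generator could look. The input layout, file names and the penwidth scaling are my own assumptions for this example, not the actual format from our data provider, but the idea is the same: count how often two classifications occur together, keep the most common pairs, and emit them as dot edges.

from collections import Counter
from itertools import combinations

TOP_EDGES = 1000  # keep only the most common relations

def build_edge_counts(path):
    """Count how often two classification groups occur together.

    Assumes one company per line, with its classification names
    separated by tabs (this layout is made up for the example)."""
    counts = Counter()
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            groups = sorted(set(line.strip().split('\t')))
            for a, b in combinations(groups, 2):
                counts[(a, b)] += 1
    return counts

def write_dot(counts, out):
    """Emit the most common edges as a graph in the dot language."""
    top = counts.most_common(TOP_EDGES)
    max_count = top[0][1] if top else 1
    out.write('graph bransjer {\n')
    out.write('\tgraph [overlap=scale];\n')
    out.write('\tnode [color=lightblue2, width=0.1, fontsize=12, '
              'height=0.1, style=filled];\n')
    for (a, b), count in top:
        # scale the line width with how common the relation is;
        # the exact scaling is just a guess at something readable
        penwidth = 1.0 + count / float(max_count)
        out.write('\t"%s" -- "%s" [penwidth=%s];\n' % (a, b, penwidth))
    out.write('}\n')

if __name__ == '__main__':
    with open('bransjer.dot', 'w', encoding='utf-8') as out:
        write_dot(build_edge_counts('klassifiseringer.txt'), out)
    # render the result afterwards with:
    #   neato -Tpdf bransjer.dot -o bransjer.pdf

The important part is the most_common() cut-off; as mentioned above, neato copes fine with around a thousand edges, while feeding it the complete set of relations throws it off completely.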

The resulting graph of NACE connections is quite interesting, and shows that most classifications are connected in some way. If I were to further increase the number of edges, the subgraphs that are currently unconnected to the “main graph” would probably establish connections as well. An interesting observation is that most health-service-related businesses (such as doctors, hospitals, etc.) live in their own subgraph disconnected from the main graph (the one in the upper right). Another interesting detail is the single link from the “main graph” up into the IT consultancy group (web design, application development, etc.), which sits at the top of the graph.

The Design of Everyday Things

One of the books in my last order-a-lot-of-interesting-books-from-Amazon craze was The Design of Everyday Things. I’ve just started reading it, after finishing Robert Hoekman Jr.‘s Designing the Moment on my regular train trip between Fredrikstad and Oslo (a good book, by the way). The Design of Everyday Things is one of those books that gets mentioned all the time and that I’ve never really gotten around to reading, and I have to say it’s been an enlightening experience. Since I’ve been reading quite a few user interaction books lately, much of this is material I’ve already been introduced to elsewhere, but it’s very nice to be able to put it into a broader context.

I’d recommend the book to anyone who’s ever going to design a user interface, so get to it and order it today if you haven’t already.

Yahoo!, SearchMonkey and Microformats

Both Rasmus and Sara have posts up about a new feature of Yahoo! Search that actually seems to be a step forward in terms of search engine functionality. It makes it possible for third-party developers to run code on Yahoo!’s servers to enhance how their own pages are presented in the search results.

The first examples show how they’ve used microformats to give a better presentation of the businesses listed. I’ve previously implemented the hCard microformat at Derdubor, where we have a local directory search for businesses in Norway. All our search hits and profile pages are marked up with microformats, so that a compliant parser is able to fetch the business information and present it to our users in a proper way. It’s simply great to see Yahoo! add this kind of support for new formats, and I’m already looking forward to playing with it to give better results for pwned.no and a few other projects I’m playing around with.

Christer Goes Bug Hunting

Christer had the pleasure of hunting down a bug in Zend Framework a couple of days ago, and he has just posted a nice article about the bug in the Zend MVC components and how he debugged it. If you’ve never used a debugger before, the article will probably be helpful as well, and it also gives a little insight into how the Zend Framework MVC components work.

The Power of Micro Languages

A “micro language” is a small, domain-specific language. These small languages are often invented on a project basis or carried over from previous projects (or in some cases, a standardized language is used). The power of this approach comes from being able to move the rules of the application out of the application logic, and instead maintain a separate set of rules independent of the application. This way you can add and remove business rules without having to re-test or rewrite your application. The concept has been around for years, and it’s in use in more places than you might imagine, but I see people over and over again who do not see the benefits of separating the rules from the application itself.

These micro languages can be described in a markup language (such as XML), as a subset of an existing language, or as your own simple-to-parse language. In this small introduction I’ll use an example from the current codebase of pwned.no, a site for arranging tournaments in several different online games.

When writing the engine that powers pwned.no I wanted to give the users several different types of tournaments to create. You have the regular single elimination tournaments, where winning a match moves you on to the next round; you have the double elimination events, where a team is not out until it has lost two matches; and you have the possibility of a group stage where the best teams (and possibly the worst) move on to the finals. Even then, you realize that you might have situations where you want to combine these different formats into one large tournament (for example: elimination -> group stage -> double elimination). Embedding this logic into the application would create an unmaintainable mess of special cases and special handling, and each time I wanted to add a tournament type, there would be a real risk of breaking another one.

Instead I decided to implement a small language that describes tournaments, and a parser that loads and processes the language and finally stores it in an underlying database structure describing how the tournament should progress when the next round starts. This meant a bit more planning and sketching up front, but the solution makes the system more flexible and more stable. When someone requests a new tournament format now, I simply create a small file describing the flow of the tournament, and everything works as it should. No code edits, no unit tests that need to be run, nothing at all, as long as the file is a valid tournament description file.

A tournament description file looks like this (here for an eight-team tournament with a group stage, followed by a single elimination stage for the two best teams from each group):

ID GROUP8
TEAMS 8
TOURNAMENTSTAGE 3 FINAL
MATCH
.ID FINALE
TOURNAMENTSTAGE 2 SEMIFINALS
MATCH
.ID SEMIFINALE1
.WIN FINALE
MATCH
.ID SEMIFINALE2
.WIN FINALE
TOURNAMENTSTAGE 1 PRELIMINARY_GROUPPLAY
GROUP
.ID GROUP1
.PLACETOMATCH 1 SEMIFINALE1
.PLACETOMATCH 2 SEMIFINALE2
GROUP
.ID GROUP2
.PLACETOMATCH 1 SEMIFINALE2
.PLACETOMATCH 2 SEMIFINALE1

The ID is an internal identifier which is resolved to a string in the currently active language, the TEAMS identifier tells the parser how many teams can sign up for the tournament, and the rest of the lines describe the different stages of the tournament. The file can be parsed top-to-bottom, as any identifier referenced later in the file has already been defined higher up in the hierarchy. The .WIN commands tell the parser where the winning team should move next, and the .PLACETOMATCH commands indicate where the different placements in the group stage should move next. If I wanted to add a lower bracket too, I’d simply add .PLACETOMATCH 3 LOWERBRACKET1 and a MATCH with .ID LOWERBRACKET1.
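The parser doesn’t have to be anything fancy either. The real one lives in the pwned.no codebase, but as a rough illustration of the idea, here’s a minimal Python sketch of how a file like this could be read into memory (the in-memory structure and the names are my own choices for the example, not the actual implementation):

def parse_tournament(path):
    """Parse a tournament description file into a simple dict structure.

    Illustration only; the keywords follow the example file above."""
    tournament = {'id': None, 'teams': 0, 'stages': []}
    units = {}           # .ID -> match/group dict, used to resolve references
    current_stage = None
    current_unit = None

    with open(path, encoding='utf-8') as fh:
        for raw in fh:
            line = raw.strip()
            if not line:
                continue
            parts = line.split()
            keyword = parts[0]

            if keyword == 'ID':
                tournament['id'] = parts[1]
            elif keyword == 'TEAMS':
                tournament['teams'] = int(parts[1])
            elif keyword == 'TOURNAMENTSTAGE':
                current_stage = {'number': int(parts[1]),
                                 'name': parts[2], 'units': []}
                tournament['stages'].append(current_stage)
            elif keyword in ('MATCH', 'GROUP'):
                current_unit = {'type': keyword, 'id': None,
                                'win': None, 'placements': {}}
                current_stage['units'].append(current_unit)
            elif keyword == '.ID':
                current_unit['id'] = parts[1]
                units[parts[1]] = current_unit
            elif keyword == '.WIN':
                # identifiers are always defined further up in the file,
                # so a single top-to-bottom pass is enough to resolve them
                current_unit['win'] = units[parts[1]]
            elif keyword == '.PLACETOMATCH':
                current_unit['placements'][int(parts[1])] = units[parts[2]]
            else:
                raise ValueError('Unknown keyword: ' + keyword)

    return tournament

From a structure like this it’s a small step to write the stages and their links into the database tables that drive the tournament forward, and adding a new format is just a matter of writing a new description file.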

You could of course use XML for this instead, but this very simple and very easy-to-parse language has so far been used to create a total of almost 25,000 tournaments, and everything has worked without a hitch. After resolving the initial issues with the parser in my development version, several new tournament formats have been added (such as a 256-team tournament) in the years since the application was released.

This was just a very small introduction to the concept of micro languages, and you’ll find loads more information about them online. You may ask what the difference between a micro language and a configuration file is, and the truth is that they’re quite similar. But where the configuration file describes settings for the application, the micro language describes rules for the application. You may of course have configuration files that also contain rules, but then you’ll need a well-defined and expressive language to define those rules. The domain where this language is used is so small and limited (there are only so many concepts in use in tournaments) that a simple language like this fits. If your micro language starts evolving into a full-fledged programming language, you’d probably be better off embedding an existing scripting language (such as Python), or moving the rules out into separate modules in your application.

David Cummins on Fulltext Search as a Webservice

David Cummins has a neat little post up about replicating some of Solr’s features in a PHP-based solution. The title of his post, “Fulltext search as a webservice”, should sound familiar if you know Solr’s approach, and David describes how they built a similar solution on top of Zend_Search_Lucene (Solr also uses Lucene in the backend). It seems like it would be easier to just set up a dedicated Solr cluster instead, but hey, how often has “it would be easier to do something else” sparked innovation?

I’d also like to note that the upcoming Solr 1.3 supports PHP serialization as an output format, so you can just unserialize() the response from Solr. That should make for even easier integration between PHP and Solr in the future. While on the subject, I’d also suggest reading Stemming in Zend_Search_Lucene, an introduction to adding filters to Zend_Search_Lucene. Also worth a look is the Search Tools in PHP presentation from phplondon.

Web Frontends for Xdebug

David Coallier has a post up about a Google Summer of Code project that has been launched to finally get a web frontend for Xdebug, something many people have been requesting for quite some time. At the same time, Webgrind has been released, another web-based frontend that replicates some of the features of KCachegrind. We go several years without a decent frontend, and then two get announced in the same week. Neat! You can check out Webgrind at their Google Code page.