Google Releases Their Protocol Buffers

July 8th, 2008

Fresh from the Google Open Source Blog comes news that Google has released its Protocol Buffers specification and accompanying libraries. The code and specification have been released at Protocol Buffers on Google Code.

Protocol Buffers is a data format for fast exchange and parsing of data and messages between computers. In this respect it is similar to simple uses of XML, but both the size of the messages on the wire and the time needed to parse them are heavily optimized for busy sites. There is no need to spend loads of time on XML parsing when you could be doing something useful instead. It’s very easy to interact with the messages through the generated classes (for C++, Java and Python), and messages written with a newer version of a schema can still be read by code built against an older version (older parsers simply ignore the new fields).
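To illustrate why new fields do not break old readers, here is a minimal hand-rolled sketch of the wire format in Python. It is not the API of the released library (which generates classes from a schema); the function names and the field numbers are made up for the example, and only two wire types (varint and length-delimited) are handled.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, new_pos) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    """Encode one field: ints as wire type 0, bytes as wire type 2."""
    if isinstance(value, int):
        return encode_varint(number << 3 | 0) + encode_varint(value)
    return encode_varint(number << 3 | 2) + encode_varint(len(value)) + value


def parse_known(data, known):
    """Parse a message, keeping only the field numbers listed in `known`."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in known:  # unknown field numbers are read past and dropped
            fields[known[number]] = value
    return fields


# A "new" writer emits fields 1 and 2; an "old" reader only knows field 1.
new_message = encode_field(1, 150) + encode_field(2, b"hello")
old_view = parse_known(new_message, {1: "id"})
```

Because every field carries its own number and wire type, the old reader can tell how many bytes to skip for a field it has never heard of, which is what makes the schema evolution described above possible.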

There is still no PHP implementation available, so I guess it’s time to get going and lay down some code during the summer. Anyone up for the job?

The Graph of Company Classification

May 12th, 2008


I’ve been meaning to do this for quite some time, but I never found the time until yesterday evening. Equipped with the data we’ve made searchable at Derdubor, I dug into the classification of the companies that our data provider supplies. The classification uses the standard NACE codes to communicate what type of business we’re dealing with, and this set of classifications is standardized across European nations (a new standard was released in 2007 to further synchronize the classification across the nations).

My goal was to explore the graph that describes the relationships between the different classification groups. A company may be classified in more than one group, and by treating each such shared company as an edge between the classifications, I wrote a small Python program that parses the input file and builds the graph in memory. For rendering the graph I planned on using the excellent GraphViz application, originally created at AT&T for the purpose of creating beautifully rendered graphs of network descriptions.
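The edge-building step could look roughly like the sketch below. The input format of the real data file is not shown here, so the example simply assumes one list of classification groups per company; the function name and the sample data are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def count_shared_classifications(companies):
    """Count, for each pair of groups, how many companies belong to both."""
    edges = Counter()
    for groups in companies:
        # Sort the groups so ("A", "B") and ("B", "A") end up as one undirected edge.
        for pair in combinations(sorted(set(groups)), 2):
            edges[pair] += 1
    return edges


# Hypothetical input: one list of classification groups per company.
companies = [
    ["Skadeforsikring", "Forsikringsagenter og assurandører"],
    ["Skadeforsikring", "Forsikringsagenter og assurandører"],
    ["Skadeforsikring"],
]
edges = count_shared_classifications(companies)
```

A company with only one classification contributes no edges, and a company with three classifications contributes three.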

My Python program therefore outputs a file in the dot language, which I then run through neato to render the graph as a PDF.

An example from my generated dot-file:

graph bransjer {
	graph [overlap=scale];
	node [color=lightblue2, width=0.1, fontsize=12, height=0.1, style=filled];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Forsikringsagenter og assurandører" [penwidth=1.15441176471];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Hjelpevirksomhet for forsikring og pensj" [penwidth=1.23382352941];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Skadeforsikring" [penwidth=1.35294117647];
}

The penwidth attribute sets the width of the line between two nodes, and each "string" -- "string" entry describes an edge between the nodes.
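A file like this can be produced with a few lines of Python. The sketch below is an assumption about how it might be done, not the original program: the function names are invented, and the penwidth formula (1 plus the edge's share of the heaviest edge) is only a guess that yields values in the same range as the example.

```python
def quote(label):
    """Wrap a label in double quotes, escaping any quotes inside it."""
    return '"%s"' % label.replace('"', '\\"')


def to_dot(edges, name="bransjer"):
    """Render a {(group_a, group_b): count} mapping as an undirected dot graph."""
    lines = [
        "graph %s {" % name,
        "\tgraph [overlap=scale];",
        "\tnode [color=lightblue2, width=0.1, fontsize=12, height=0.1, style=filled];",
    ]
    heaviest = max(edges.values(), default=1)
    for (a, b), count in sorted(edges.items()):
        # Assumed scaling: thicker lines for more common relationships.
        penwidth = 1.0 + count / heaviest
        lines.append("\t%s -- %s [penwidth=%s];" % (quote(a), quote(b), penwidth))
    lines.append("}")
    return "\n".join(lines) + "\n"


dot_source = to_dot({("A", "B"): 2, ("A", "C"): 1})
```

Written to a file such as bransjer.dot, the output can then be rendered with something like `neato -Tpdf bransjer.dot -o bransjer.pdf`, provided the installed GraphViz build supports PDF output.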

I first ran into problems managing this enormous graph (we’re talking 500k relations here), as trying to render the complete graph would throw both dot and neato off (as soon as we pass 2000 relations, things begin to go awry). This actually turned out to be a good thing, as it made me (with Jørn chipping in a bit) think more about what I actually wanted to graph. I’m not really interested in places where there are only one or two links between different classification groups, as these may be wrongly entered, very peculiar businesses, etc. (with a total of 500k registrations, such things are quite common). Instead, I focused on roughly the top 1000 edges. By limiting my data set to the 1000 most common relationships between groups, I’m able to render the graph in just under ten seconds, including the time to parse and build the graph in Python before filtering it down.
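The filtering itself is a one-liner once the edges are counted. A minimal sketch, assuming the edges are held in a mapping from group pairs to counts (the function name and the sample counts are made up):

```python
from collections import Counter


def strongest_edges(edges, limit=1000):
    """Keep only the `limit` most common relationships between groups."""
    return dict(Counter(edges).most_common(limit))


# Hypothetical counts: the single-company link is the one that gets dropped.
edges = {("A", "B"): 40, ("A", "C"): 1, ("B", "C"): 7}
top_two = strongest_edges(edges, limit=2)
```

Cutting by rank rather than by a fixed minimum count keeps the graph at a size neato can handle regardless of how the counts are distributed.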

The resulting graph of NACE connections is quite interesting, and shows that most classifications are connected in some way. If I extended the number of edges further, the subgraphs that are currently unconnected to the “main graph” would probably become connected to it. An interesting observation is that most health service-related businesses (such as doctors, hospitals, etc.) live in their own subgraph, disconnected from the main graph (this is the subgraph in the upper right). Another interesting part is the single link from the “main graph” up into the IT consultancy business group (web design, application development, etc.), which is placed at the top of the graph.