The Graph of Company Classification


I’ve been meaning to do this for quite some time, but I never found the time before yesterday’s evening. Equipped with the data we’ve made searchable at Derdubor, I digged into the classification of the companies that our dataprovider provides us with. Their classification uses the standard NACE codes for communicating what type of business we’re dealing with, and this set of different classifications is standardized across european nations (there is a new standard that was released in 2007, to further synchronize the classification across the nations).

My goal was to explore the graph that describes the relationship between the different groups of classification. A company may be classified in more than one group, and by using this as a edge in the graph between the classifications, I set out and wrote a small Python program for parsing the input file and building the graph in memory. For rendering the graph I planned on using the excellent GraphViz application, originally created at AT&T just for the purpose of creating beautifully rendered graphs of network descriptions.

My Python-program therefor outputs a file in the dot language, which I then run through neato [PDF] to render the beautiful graph as a PDF.

An example from my generated dot-file:

graph bransjer {
	graph [overlap=scale];
	node [color=lightblue2, width=0.1, fontsize=12, height=0.1, style=filled];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Forsikringsagenter og assurandører" [penwidth=1.15441176471];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Hjelpevirksomhet for forsikring og pensj" [penwidth=1.23382352941];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Skadeforsikring" [penwidth=1.35294117647];

The penwidth=-attributes sets the width of the line between the nodes, and the “string” — “string”-entries describes an edge between the nodes.

I first ran into problems with managing this enormous graph (we’re talking 500k relations here), as trying to render the complete graph would throw both dot and neato off (as soon as we pass 2000 relations, things begin to go awry). Actually, this turned out to be a good thing, as it made me (and with Jørn chipping in a bit) think a bit more about what I actually wanted to graph. I’m not really interested in places where there only are one or two links between different classification groups, as these may be wrongly entered, very peculiar businesses etc. (with a total of 500k registrations, such things are quite common). Instead, I focused on the top ~1000 edges. By limiting my data set to the top 1000 most common relationship between groups, I’m able to render the graph in just below ten seconds, including time to parse and build the graph in Python before filtering it down.

The resulting graph of NACE connections is quite interesting, and shows that most classifications are connected in some way. If I further extend the number of edges, the sub graphs that are left unconnected to the “main graph” would probably establish connections. An interesting observation is that most health service-related businesses (such as doctors, hospitals, etc) live in their own sub graph disconnected from the main graph (this is the graph in the upper right). Another interesting part is the single link from the “main graph” and up into the IT consultancy business group (webdesign, application development, etc) which is placed in the top of the graph.