Fixing dpkg / apt-get Problem With Python2.6

February 7th, 2010

While trying to upgrade to Python 2.6 on one of my development machines tonight, I was greeted by an error message after running apt-get install python2.6:

After unpacking 0B of additional disk space will be used.
Setting up python2.6-minimal (2.6.4-4) ...
Linking and byte-compiling packages for runtime python2.6...
pycentral: pycentral rtinstall: installed runtime python2.6 not found
pycentral rtinstall: installed runtime python2.6 not found
dpkg: error processing python2.6-minimal (--configure):
 subprocess post-installation script returned error exit status 1
dpkg: dependency problems prevent configuration of python2.6:
 python2.6 depends on python2.6-minimal (= 2.6.4-4); however:
  Package python2.6-minimal is not configured yet.
dpkg: error processing python2.6 (--configure):
 dependency problems - leaving unconfigured
Errors were encountered while processing:
 python2.6-minimal
 python2.6
E: Sub-process /usr/bin/dpkg returned an error code (1)

Installing python2.6-minimal directly wouldn’t work, and installing python2.6 again ran into the same problem.

Luckily the Launchpad thread for python-central provided the answer: Upgrade python-central first!

:~# apt-get install python-central
[snip]
Setting up python2.6 (2.6.4-4) ...
Setting up python-central (0.6.14+nmu2) ...
:~#

Making Solr Requests with urllib2 in Python

May 30th, 2009

When making XML requests to Solr (a full-text document search engine) for indexing, committing, updating or deleting documents, the request is submitted to the server as an HTTP POST containing an XML document. urllib2 supports submitting POST data through the second parameter of the urlopen() call:

    import urllib2
    f = urllib2.urlopen("http://example.com/", "key=value")

The first attempt involved simply adding the XML data as the second parameter, but that made the Solr Webapp return a “400 – Bad Request” error. The reason for Solr barfing is that the urlopen() function sets the Content-Type to application/x-www-form-urlencoded. We can solve this by changing the Content-Type header:

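    import urllib2

    # updateURL must hold the URL of your Solr update handler
    solrReq = urllib2.Request(updateURL, '<commit waitFlush="false" waitSearcher="false"/>')
    # Override the default form-encoded Content-Type so the body is treated as XML
    solrReq.add_header("Content-Type", "text/xml")
    solrPoster = urllib2.urlopen(solrReq)
    response = solrPoster.read()
    solrPoster.close()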

Other XML-based Solr requests, such as adding and removing documents from the index, will also work by changing the Content-Type header.
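As a rough sketch of what that could look like for adding and deleting documents: the helper names below are made up for illustration, and the `<add>` and `<delete>` bodies follow Solr's XML update message format. The import fallback is only there so the sketch also runs on Python 3, where urllib2 lives in urllib.request.

```python
try:
    import urllib2  # Python 2, as used in this post
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent

from xml.sax.saxutils import escape, quoteattr


def add_document_xml(fields):
    """Build an <add> message from a dict of field name -> string value."""
    parts = ["<field name=%s>%s</field>" % (quoteattr(name), escape(value))
             for name, value in sorted(fields.items())]
    return "<add><doc>%s</doc></add>" % "".join(parts)


def delete_by_id_xml(doc_id):
    """Build a <delete> message for a single document id."""
    return "<delete><id>%s</id></delete>" % escape(doc_id)


def solr_update_request(update_url, xml_body):
    """Wrap an XML body in a POST request with an XML Content-Type."""
    request = urllib2.Request(update_url, xml_body.encode("utf-8"))
    request.add_header("Content-Type", "text/xml; charset=utf-8")
    return request
```

Sending it works as before: pass the result of solr_update_request(updateURL, delete_by_id_xml("42")) to urllib2.urlopen() and read the response. Remember that Solr only makes the change visible after a commit.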

The same code also lets you use urllib2 to submit SOAP and XML-RPC requests, or to speak any other protocol that requires you to control the complete POST body of the request.
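To illustrate with XML-RPC, here is a minimal sketch that builds the request body with the standard library and posts it the same way. The function name, the endpoint URL and the "add" method in the usage note are placeholders, and in real code you would normally just use xmlrpclib.ServerProxy, which handles all of this for you.

```python
try:
    import urllib2    # Python 2
    import xmlrpclib
except ImportError:
    import urllib.request as urllib2   # Python 3 equivalents
    import xmlrpc.client as xmlrpclib


def xmlrpc_request(endpoint_url, method_name, params):
    """Build a POST request whose body is an XML-RPC method call."""
    body = xmlrpclib.dumps(tuple(params), methodname=method_name)
    request = urllib2.Request(endpoint_url, body.encode("utf-8"))
    # Same trick as for Solr: replace the default form-encoded Content-Type
    request.add_header("Content-Type", "text/xml")
    return request
```

Something like urllib2.urlopen(xmlrpc_request("http://example.com/RPC2", "add", [1, 2])) would then send the call, and xmlrpclib.loads() can decode the response body.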

The Results of Our Recent Python Competition

May 29th, 2009

Last week we had yet another competition where the goal was to create the smallest program that solves a particular problem. This time the problem was a simple XML parsing routine, with a few extra rules to make the parsing itself easier to implement (the complete rule set). Python was chosen as the required language for the submissions.

The winning contribution from Helge:

    from sys import stdin
    p=0
    s=str.split
    for a in s(stdin.read(),'<'):
     a=s(a,'>')[0];b=a.strip('/');p-=a<b
     if'@'<a:print' '*p*4+s(b)[0];p+=a==b
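For anyone who finds the golfed version hard to follow, here is one un-golfed reading of what Helge's entry appears to do: split the input on '<', look at what comes before the next '>', and track the nesting depth. This is my interpretation, not Helge's own description. The names tag_outline and indent are mine, and the sketch uses isalpha() where the original compares against '@', so it behaves differently for unusual tag names. Like the original, it relies on the simplified rules (for instance, no '<' inside comments or attribute values).

```python
def tag_outline(xml_text, indent=4):
    """Return the element names in document order, indented by depth."""
    lines = []
    depth = 0
    for chunk in xml_text.split("<"):
        tag = chunk.split(">")[0]          # text between '<' and '>'
        if tag.startswith("/"):            # closing tag: one level up
            depth -= 1
            continue
        if not tag[:1].isalpha():          # skip comments, <?xml ...?>, text
            continue
        stripped = tag.strip("/")
        lines.append(" " * (depth * indent) + stripped.split()[0])
        if tag == stripped:                # not self-closing: one level down
            depth += 1
    return lines
```

Calling "\n".join(tag_outline(sys.stdin.read())) and printing the result gives the same kind of indented listing the competition asked for.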

The contribution from Tobias:

    from sys import stdin
    i=stdin.read()
    s=x=t=0
    k=i.find
    while x<len(i):
        if i[x]=="<":
            if i[x:x+4]=="<!--":x=k("-->")
            elif i[x+1]=="/":s-=1
            else:
                u=0
                while u<s:print " "*4,;u+=1
                if i[k(">",x)-1]=="/":t=1
                else:t=0;s+=1
                print i[x+1:k(">",x)-t].strip()
        x+=1

The contribution from Harald:

    from sys import stdin
    l=stdin.read()
    e,p,c,x=0,0,0,0
    r=""
    for i in l:
        if l[e:e+2]==']>'or l[e:e+2]=='->':
            c=0
        if l[e:e+2]=='<!':
            c=1
        if p and i==' 'or i=='/'or i=='>':
            p=0
            if i=='/' and l[e+1]=='>':
                x-=1
        if p and not c:
            r+=i
        if not c and i=='<'and l[e+1]!='/':
            r+="\n"+(' '*4)*x
            x+=1
            p=1
        if i=='<'and l[e+1]=='/':
            x-=1
        e+=1

If any of the contributors want to provide a better description of their solutions, feel free to leave a comment!

Thanks to all the participants!

ImportError: No module named trac.web.modpython_frontend

February 17th, 2009

One of the reasons why you might get the error:

ImportError: No module named trac.web.modpython_frontend

after installing Trac is that Apache may be unable to create the Python egg cache, as detailed in the Trac wiki. An egg cache that is not set up correctly produces the error above. Create a directory for the cache files, change the owner to www-data.www-data (or something else, depending on which user you run Trac under) and rejoice.

The settings needed in the vhost configuration (.. or wherever you have your configuration ..):

    
    <Location />
        SetHandler mod_python
        PythonInterpreter main_interpreter
        PythonHandler trac.web.modpython_frontend
        PythonOption TracEnvParentDir /path/to/trac
        PythonOption TracUriRoot /
        PythonOption PYTHON_EGG_CACHE /path/to/directory/you/created
    </Location>
    

You can easily do a quick test by setting the path to /tmp and checking if that solved your problem. If it did, create a dedicated directory and live happily ever after. If it didn’t, continue your quest. Check for genshi and other dependencies. Do a search on Google ™.
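If you want to check a candidate directory before touching the Apache configuration, a small sketch like the following can help. The function name is made up, and the check only means something when it is run as the same user Apache runs as (for example through sudo -u www-data), since it simply tries to create a temporary file in the directory.

```python
import os
import tempfile


def can_write_egg_cache(path):
    """Return True if the current user can create files in `path`."""
    if not os.path.isdir(path):
        return False
    try:
        # The temporary file is removed again when the block exits
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        return False
```

If this returns False for the directory you configured as PYTHON_EGG_CACHE, fix the ownership or permissions first.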

Hopefully everything works again.

The Graph of Company Classification

May 12th, 2008


I’ve been meaning to do this for quite some time, but I never found the time before yesterday evening. Equipped with the data we’ve made searchable at Derdubor, I dug into the classification of the companies that our data provider supplies. Their classification uses the standard NACE codes to communicate what type of business we’re dealing with, and this set of classifications is standardized across European nations (a new standard was released in 2007 to further synchronize the classification across the nations).

My goal was to explore the graph that describes the relationships between the different classification groups. A company may be classified in more than one group, and each such co-classification can be used as an edge in the graph between the two classifications. I wrote a small Python program for parsing the input file and building the graph in memory. For rendering the graph I planned on using the excellent GraphViz application, originally created at AT&T for exactly the purpose of creating beautifully rendered graphs of network descriptions.
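The edge-building step can be sketched roughly like this. The format of the input file isn't described here, so the sketch assumes the parsing is already done and that you have a mapping from each company to its classification groups. The names company_groups and count_edges are made up for the example.

```python
from collections import Counter
from itertools import combinations


def count_edges(company_groups):
    """Count how many companies share each pair of classification groups.

    company_groups maps a company id to an iterable of group names.
    Returns a Counter keyed by (group_a, group_b) with group_a < group_b.
    """
    edges = Counter()
    for groups in company_groups.values():
        # Every pair of groups on the same company is one undirected edge
        for a, b in combinations(sorted(set(groups)), 2):
            edges[(a, b)] += 1
    return edges
```

Sorting each company's groups before pairing them means the same two groups always end up under the same key, regardless of the order they appear in the data.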

My Python program therefore outputs a file in the dot language, which I then run through neato to render the beautiful graph as a PDF.
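If you would rather trigger the rendering from Python than from the shell, something along these lines should do. It assumes neato is installed and on your PATH, and the function names and file names are placeholders. The -Tpdf and -o options are neato's standard output format and output file switches.

```python
import subprocess


def neato_pdf_command(dot_path, pdf_path):
    """Return the neato command line that renders dot_path to a PDF."""
    return ["neato", "-Tpdf", dot_path, "-o", pdf_path]


def render_pdf(dot_path, pdf_path):
    """Run neato; raises CalledProcessError if the rendering fails."""
    subprocess.check_call(neato_pdf_command(dot_path, pdf_path))
```

For example, render_pdf("bransjer.dot", "bransjer.pdf") would write the PDF next to the dot file.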

An example from my generated dot-file:

graph bransjer {
	graph [overlap=scale];
	node [color=lightblue2, width=0.1, fontsize=12, height=0.1, style=filled];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Forsikringsagenter og assurandører" [penwidth=1.15441176471];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Hjelpevirksomhet for forsikring og pensj" [penwidth=1.23382352941];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Skadeforsikring" [penwidth=1.35294117647];

The penwidth attribute sets the width of the line between the nodes, and each "string" -- "string" entry describes an edge between two nodes.
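Generating a file like that from edge counts takes only a few lines. The sketch below is an illustration rather than the program I actually ran: the function name is made up, and the penwidth formula (1 plus the edge's share of the heaviest edge) is just one plausible choice for making the line width follow the number of shared companies.

```python
def to_dot(edge_counts, name="bransjer"):
    """Render {(group_a, group_b): count} as an undirected dot graph."""
    heaviest = max(edge_counts.values()) if edge_counts else 1
    lines = [
        "graph %s {" % name,
        "\tgraph [overlap=scale];",
        "\tnode [color=lightblue2, width=0.1, fontsize=12, height=0.1, style=filled];",
    ]
    for (a, b), count in sorted(edge_counts.items()):
        # Hypothetical scaling: between 1.0 and 2.0 depending on the count
        penwidth = 1.0 + float(count) / heaviest
        lines.append('\t"%s" -- "%s" [penwidth=%s];' % (
            a.replace('"', '\\"'), b.replace('"', '\\"'), penwidth))
    lines.append("}")
    return "\n".join(lines)
```

The double quotes inside group names are escaped because dot uses them to delimit node names.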

I first ran into problems with managing this enormous graph (we’re talking 500k relations here), as trying to render the complete graph would throw both dot and neato off (as soon as we pass 2000 relations, things begin to go awry). Actually, this turned out to be a good thing, as it made me (with Jørn chipping in a bit) think more about what I actually wanted to graph. I’m not really interested in places where there are only one or two links between different classification groups, as these may be wrongly entered, very peculiar businesses etc. (with a total of 500k registrations, such things are quite common). Instead, I focused on the top ~1000 edges. By limiting my data set to the 1000 most common relationships between groups, I’m able to render the graph in just below ten seconds, including the time to parse and build the graph in Python before filtering it down.
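The filtering itself is the easy part. Assuming the edge counts live in a dict keyed by group pair (an assumption, as are the names below), collections.Counter does the work:

```python
from collections import Counter


def top_edges(edge_counts, limit=1000):
    """Return the `limit` most common edges as ((a, b), count) pairs."""
    return Counter(edge_counts).most_common(limit)
```

Edges that only occur once or twice fall off the end of the list automatically, which takes care of the mistyped registrations and one-of-a-kind businesses.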

The resulting graph of NACE connections is quite interesting, and shows that most classifications are connected in some way. If I extended the number of edges further, the subgraphs that are currently left unconnected to the “main graph” would probably become connected to it. An interesting observation is that most health service-related businesses (such as doctors, hospitals, etc.) live in their own subgraph, disconnected from the main graph (this is the graph in the upper right). Another interesting part is the single link from the “main graph” up into the IT consultancy business group (web design, application development, etc.), which is placed at the top of the graph.