Python, httplib and Empty Content for 200/201 Responses

While hacking together a client for Imbo in Python, I wasn’t able to read the response from a connection initiated with httplib. If the request errored out (HTTP response code 400/403/404) everything worked as it should, but if the response code was 200 or 201, the response read from the httplib connection was empty (read by using getresponse()).

Turns out the issue was related to calling close() on the connection before reading the response. This apparently works if there’s an error (which means that the response should be rather small), but not if there’s a regular “OK” response from the server. It’s not enough to just retrieve the HTTPResponse object; you have to call read() on it before closing the connection.

connection.request(method, path, data)
# read the response while the connection is still open
data = connection.getresponse().read()
connection.close()

(Compared to the previous approach, which retrieved the HTTPResponse object, closed the connection and then read the response.)
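For reference, here’s a more complete sketch of the working order of operations; the host and path below are placeholders, not something from the original client:

import httplib

connection = httplib.HTTPConnection("imbo.example.com")  # placeholder host
connection.request("GET", "/status.json")                # placeholder path

# read() has to happen while the connection is still open;
# closing first leaves you with an empty body for 200/201 responses
response = connection.getresponse()
body = response.read()
status = response.status

connection.close()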

Parse a DSN string in Python

A simple hack to get the different parts of a DSN string (the kind used by PDO in PHP):

import re

def parse_dsn(dsn):
    # split the driver prefix from the rest of the DSN ("driver:key=value;...")
    m = re.search("([a-zA-Z0-9]+):(.*)", dsn)
    values = {}

    if (m and m.group(1) and m.group(2)):
        values['driver'] = m.group(1)
        # pick out each key=value pair from the remainder
        m_options = re.findall("([a-zA-Z0-9]+)=([a-zA-Z0-9]+)", m.group(2))

        for pair in m_options:
            values[pair[0]] = pair[1]

    return values

The returned dictionary contains one entry for each of the entries in the DSN.
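For example, with a made-up MySQL-style DSN:

print parse_dsn("mysql:host=localhost;dbname=testdb;port=3306")
# {'driver': 'mysql', 'host': 'localhost', 'dbname': 'testdb', 'port': '3306'}
# (key order may vary, since this is a plain dictionary)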

Update: Helge also submitted a simplified version of the above:

driver, rest = dsn.split(':', 1)
values = dict(re.findall(r'(\w+)=(\w+)', rest), driver=driver)

Fixing dpkg / apt-get Problem With Python2.6

While trying to upgrade to Python 2.6 on one of my development machines tonight, I was faced with an error message after running apt-get install python2.6:

After unpacking 0B of additional disk space will be used.
Setting up python2.6-minimal (2.6.4-4) ...
Linking and byte-compiling packages for runtime python2.6...
pycentral: pycentral rtinstall: installed runtime python2.6 not found
pycentral rtinstall: installed runtime python2.6 not found
dpkg: error processing python2.6-minimal (--configure):
 subprocess post-installation script returned error exit status 1
dpkg: dependency problems prevent configuration of python2.6:
 python2.6 depends on python2.6-minimal (= 2.6.4-4); however:
  Package python2.6-minimal is not configured yet.
dpkg: error processing python2.6 (--configure):
 dependency problems - leaving unconfigured
Errors were encountered while processing:
 python2.6-minimal
 python2.6
E: Sub-process /usr/bin/dpkg returned an error code (1)

Attempting to install python2.6-minimal on its own wouldn’t work, and attempting to reinstall python2.6 ran into the same problem.

Luckily the Launchpad thread for python-central provided the answer: Upgrade python-central first!

:~# apt-get install python-central
[snip]
Setting up python2.6 (2.6.4-4) ...
Setting up python-central (0.6.14+nmu2) ...
:~#

Making Solr Requests with urllib2 in Python

When making XML requests to Solr (a full-text document search engine) for indexing, committing, updating or deleting documents, the request is submitted as an HTTP POST containing an XML document. urllib2 supports submitting POST data by using the second parameter to the urlopen() call:

f = urllib2.urlopen("http://example.com/", "key=value")

The first attempt involved simply adding the XML data as the second parameter, but that made the Solr Webapp return a “400 – Bad Request” error. The reason for Solr barfing is that the urlopen() function sets the Content-Type to application/x-www-form-urlencoded. We can solve this by changing the Content-Type header:

import urllib2

# updateURL is the address of the Solr update handler,
# e.g. "http://localhost:8983/solr/update"
solrReq = urllib2.Request(updateURL, '<commit/>')  # the POST body is the XML command to send
solrReq.add_header("Content-Type", "text/xml")
solrPoster = urllib2.urlopen(solrReq)
response = solrPoster.read()
solrPoster.close()

Other XML-based Solr requests, such as adding and removing documents from the index, will also work by changing the Content-Type header.
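As an illustration, here’s a sketch of adding a document the same way; the URL, field names and values are placeholders and not taken from the original setup:

import urllib2

updateURL = "http://localhost:8983/solr/update"
addXML = '<add><doc><field name="id">1</field><field name="title">Hello</field></doc></add>'

solrReq = urllib2.Request(updateURL, addXML)
solrReq.add_header("Content-Type", "text/xml")
solrPoster = urllib2.urlopen(solrReq)
response = solrPoster.read()
solrPoster.close()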

The same approach will also allow you to use urllib2 to submit SOAP or XML-RPC requests, and to use other protocols that require you to control the complete POST body of the request.

The Results of Our Recent Python Competition

Last week we had yet another competition where the goal was to create the smallest program that solves a particular problem. This time the problem to solve was a simple XML parsing routine, with a few extra rules to make the parsing itself easier to implement (the complete rule set). Python was chosen as the required language for the submissions.

The winning contribution from Helge:

from sys import stdin
p=0
s=str.split
for a in s(stdin.read(),'<'):
 a=s(a,'>')[0];b=a.strip('/');p-=a

The contribution from Tobias:

from sys import stdin
i=stdin.read()
s=x=t=0
k=i.find
while x")
        elif i[x+1]=="/":s-=1
        else:
            u=0
            while u",x)-1]=="/":t=1
            else:t=0;s+=1
            print i[x+1:k(">",x)-t].strip()
    x+=1

The contribution from Harald:

from sys import stdin
l=stdin.read()
e,p,c,x=0,0,0,0
r=""
for i in l:
       if l[e:e+2]==']>'or l[e:e+2]=='->':
               c=0
       if l[e:e+2]=='':
               p=0
               if i=='/' and l[e+1]=='>':
                       x-=1
       if p and not c:
               r+=i
       if not c and i=='<'and l[e+1]!='/':
               r+="\n"+(' '*4)*x
               x+=1
               p=1
       if i=='<'and l[e+1]=='/':
               x-=1
       e+=1

If any of the contributors want to provide a better description of their solutions, feel free to leave a comment!

Thanks to all the participants!

ImportError: No module named trac.web.modpython_frontend

One of the reasons why you might get the error:

ImportError: No module named trac.web.modpython_frontend

after installing Trac is that Apache may not be able to create the Python egg cache, which is detailed in the Trac wiki right here. This will also generate the above error if not set up correctly. Create a directory for the files, change the owner to www-data.www-data (or something else, depending on which user you run Trac under) and rejoice.

The settings needed in the vhost configuration (.. or wherever you have your configuration ..):

    
<Location />
    SetHandler mod_python
    PythonInterpreter main_interpreter
    PythonHandler trac.web.modpython_frontend
    PythonOption TracEnvParentDir /path/to/trac
    PythonOption TracUriRoot /
    PythonOption PYTHON_EGG_CACHE /path/to/directory/you/created
</Location>

You can easily do a quick test by setting the path to /tmp and checking if that solved your problem. If it did, create a dedicated directory and live happily ever after. If it didn’t, continue your quest. Check for genshi and other dependencies. Do a search on Google ™.

Hopefully everything works again.

BTW: Another reason for this error might be that your Trac installation is no longer available (if your installation uses a version number in the library path and you upgraded the Python version, this path will change, and your old libraries may not have been copied over), so it might help to reinstall Trac in your new environment:

easy_install -U Trac

.. and then try again (thanks to Christer for reporting on this after he had the same problem).

Google Releases Their Protocol Buffers

Fresh from the Google Open Source Blog comes news that Google has released their Protocol Buffers specification and accompanying libraries. The code and specification have been released at Protocol Buffers on Google Code.

Protocol Buffers is a data format for fast exchange and parsing of data and messages between computers. It is similar to simple uses of XML in this regard, but the message size on the wire and the parsing time are heavily optimized for busy sites. There is no need to spend loads of time doing XML parsing when you could be doing something useful instead. It’s very easy to interact with the messages through the generated classes (for C++, Java and Python), and future versions of the same schema are compatible with old versions (as new fields are simply ignored by older parsers).
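A small sketch of what that looks like from Python, assuming a made-up person.proto schema compiled with protoc --python_out:

# person.proto (hypothetical):
#   message Person {
#     required string name = 1;
#     optional int32 id = 2;
#   }
import person_pb2

p = person_pb2.Person()
p.name = "Alice"
p.id = 42

data = p.SerializeToString()   # compact binary wire format

q = person_pb2.Person()
q.ParseFromString(data)        # fields unknown to an older schema are simply ignored
print q.name, q.id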

Still no PHP implementation available, so I guess it’s time to get going and lay down some code during the summer. Anyone up for the job?

The Graph of Company Classification


I’ve been meaning to do this for quite some time, but I never found the time before yesterday evening. Equipped with the data we’ve made searchable at Derdubor, I dug into the classification of the companies that our data provider supplies us with. Their classification uses the standard NACE codes for communicating what type of business we’re dealing with, and this set of classifications is standardized across European nations (a new standard was released in 2007 to further synchronize the classification across the nations).

My goal was to explore the graph that describes the relationship between the different groups of classification. A company may be classified in more than one group, and by using this as an edge in the graph between the classifications, I set out and wrote a small Python program for parsing the input file and building the graph in memory. For rendering the graph I planned on using the excellent GraphViz application, originally created at AT&T just for the purpose of creating beautifully rendered graphs of network descriptions.

My Python program therefore outputs a file in the dot language, which I then run through neato to render the beautiful graph as a PDF.

An example from my generated dot-file:

graph bransjer {
	graph [overlap=scale];
	node [color=lightblue2, width=0.1, fontsize=12, height=0.1, style=filled];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Forsikringsagenter og assurandører" [penwidth=1.15441176471];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Hjelpevirksomhet for forsikring og pensj" [penwidth=1.23382352941];
	"Forsikr.,pensjonsfond-unntatt off. trygd" -- "Skadeforsikring" [penwidth=1.35294117647];

The penwidth attribute sets the width of the line between the nodes, and each “string” -- “string” entry describes an edge between two nodes.

I first ran into problems with managing this enormous graph (we’re talking 500k relations here), as trying to render the complete graph would throw both dot and neato off (as soon as we pass 2000 relations, things begin to go awry). Actually, this turned out to be a good thing, as it made me (with Jørn chipping in a bit) think a bit more about what I actually wanted to graph. I’m not really interested in places where there are only one or two links between different classification groups, as these may be wrongly entered registrations, very peculiar businesses, etc. (with a total of 500k registrations, such things are quite common). Instead, I focused on the top ~1000 edges. By limiting my data set to the 1000 most common relationships between groups, I’m able to render the graph in just below ten seconds, including the time it takes to parse and build the graph in Python before filtering it down.
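A minimal sketch of the approach (not the original program); the input file name, its format and the penwidth scaling are assumptions made here for illustration:

from collections import defaultdict
from itertools import combinations

groups = defaultdict(set)   # company id -> set of classifications
for line in open("registrations.tsv"):   # assumed format: "company<TAB>classification"
    company, nace = line.rstrip("\n").split("\t")
    groups[company].add(nace)

edges = defaultdict(int)    # (classification, classification) -> number of co-occurrences
for classifications in groups.values():
    for a, b in combinations(sorted(classifications), 2):
        edges[(a, b)] += 1

# keep only the most common relations so neato can cope with the graph
top = sorted(edges.items(), key=lambda e: e[1], reverse=True)[:1000]

print 'graph bransjer {'
print '    graph [overlap=scale];'
print '    node [color=lightblue2, width=0.1, fontsize=12, height=0.1, style=filled];'
for (a, b), count in top:
    print '    "%s" -- "%s" [penwidth=%s];' % (a, b, 1 + count / 136.0)
print '}'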

The resulting graph of NACE connections is quite interesting, and shows that most classifications are connected in some way. If I further extend the number of edges, the sub graphs that are left unconnected to the “main graph” would probably establish connections. An interesting observation is that most health service-related businesses (such as doctors, hospitals, etc.) live in their own sub graph disconnected from the main graph (this is the graph in the upper right). Another interesting part is the single link from the “main graph” up into the IT consultancy business group (web design, application development, etc.), which is placed at the top of the graph.