Parse a DSN string in Python

A simple hack to get at the different parts of a DSN string (the connection string format used by PDO in PHP):

import re

def parse_dsn(dsn):
    # Split the driver prefix ("mysql:...") from the rest of the DSN
    m = re.search(r"([a-zA-Z0-9]+):(.*)", dsn)
    values = {}

    if m and m.group(1) and m.group(2):
        values['driver'] = m.group(1)
        m_options = re.findall(r"([a-zA-Z0-9]+)=([a-zA-Z0-9]+)", m.group(2))

        for pair in m_options:
            values[pair[0]] = pair[1]

    return values

The returned dictionary contains the driver name and one entry for each key=value pair in the DSN.

Update: helge also submitted a simplified version of the above:

driver, rest = dsn.split(':', 1)
values = dict(re.findall(r'(\w+)=(\w+)', rest), driver=driver)
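
As a quick sanity check, both versions give the same result for a typical MySQL-style PDO DSN (the host and database names below are just examples):

print(parse_dsn("mysql:host=localhost;dbname=testdb;port=3306"))
# {'driver': 'mysql', 'host': 'localhost', 'dbname': 'testdb', 'port': '3306'}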

btinfohash – A Small Python Utility for Getting an info_hash

While attempting to add torrents to a closed tracker I needed the info_hash (the SHA-1 hash of the bencoded info dictionary) for a given torrent file, as this is what the tracker uses to identify the torrent. I tried searching the usual sources but came up empty (I later discovered that btshowmetainfo from the original Python BitTorrent client package also shows the calculated info_hash of a torrent file), so I wrote a simple Python script for all your shell scripting or command line needs.

#!/usr/bin/python
import bencode, BTL, hashlib, sys

if len(sys.argv) < 2:
    sys.stderr.write("Usage: " + sys.argv[0] + " <torrent file>\n")
    sys.exit(1)

try:
    with open(sys.argv[1], "rb") as torrent_file:
        try:
            bstruct = bencode.bdecode(torrent_file.read())

            if 'info' in bstruct:
                # The info_hash is the SHA-1 digest of the bencoded info dictionary
                print(hashlib.sha1(bencode.bencode(bstruct['info'])).hexdigest())
            else:
                sys.stderr.write("Did not find an info element in the torrent file.\n")
                sys.exit(1)
        except BTL.BTFailure:
            sys.stderr.write("Torrent file did not contain a valid bencoded structure.\n")
            sys.exit(1)
except IOError as e:
    sys.stderr.write("Could not open file: " + str(e) + "\n")
    sys.exit(1)

This depends on bencode.py and BTL.py from the bencode Python package (their .eggs are borked, so you’ll have to copy the two .py files yourself after downloading the source distribution).

up2date: The package .. is signed, but with an unknown GPG key.

After attempting to install libevent-devel on one of our cluster nodes, up2date started complaining about missing GPG signatures. We use the Fedora EPEL repository for fairly new packages, and apparently the key was missing or not up to date.

The error messages:

warning: rpmts_HdrFromFdno: V3 DSA signature: NOKEY, key ID 217521f6
The package .. is signed, but with an unknown GPG key. Aborting...
Package .. has a unknown GPG signature. Aborting...

The fix for us was to download the keys from Fedora’s key page (we just needed the EPEL key) and add them with rpm:

wget https://fedoraproject.org/static/217521F6.txt
rpm --import 217521F6.txt 

.. and things worked as they should yet again – and we still get our packages verified. You can also confirm that the key ID from the error message (217521F6) is the same as the key you’ve downloaded from Fedora (if you’re using another repository, it will probably provide its keys as well).

Solr, Memory Usage and Dynamic Fields

One of the many great things about Solr is that it allows you to add dynamic fields – you define a naming pattern that a field has to follow, and any field whose name matches that pattern can then be used in documents without being declared explicitly.

We’ve been using one such dynamic field to add a sort field for our documents:

xxx_Category_Subcategory: 300

This would allow us to sort by this field to get the priority of our documents in that particular category and subcategory. A document would contain somewhere between 1 and 15 such fields. The total number of unique field names is around 1200 across all documents.

Be small, be happy

As long as our collection was quite small (<10k documents) this scheme worked great. When the collection grew to around 500k documents, we started seeing out of memory errors quite often. At the worst rate we got an out of memory exception every 30 minutes and had to restart the Solr server. Performance didn't suffer, but obviously we couldn't continue restarting servers until we got bored.

After ruling out a few other possible issues (such as our stable random sort) I was rather stumped that things didn't improve. The total amount of data in our dynamic fields was rather low, somewhere around 2.5 - 3.5 million integers, or possibly somewhere around 50-70MB in total. The JVM should be able to fit everything about these fields in memory and query them for the fields we're trying to find, but a heap dump of the JVM taken just before it hit the out of memory exception revealed quite a few GBs of Lucene FieldCache objects. These objects cache the value of a field for the total set of documents in the index, and you're sadly not able to tune this cache through the Solr configuration (at least not in 1.4, as far as I could find).

Less Dynamic Fields, More Manual Labor

After pondering this issue a bit I came to the conclusion that our problem was related to the dynamic fields and the fact that we used them for sorting. Lucene / Solr keeps one field cache per field that is used for sorting, to avoid having to do duplicate work later. For us, this meant that each time we sorted on a new field, an array had to be created with one entry for every document in the index. As long as we just had 10k documents, these arrays were small enough that we had enough memory available – when the document set grew to almost 500k documents, not so much.

This means that the total memory required for field caches scales with DocumentsInIndex * FieldsSortedBy. As long as our DocumentsInIndex was just 10k, the memory available to the JVM was enough to keep sorting by the number of fields we did. When the number of documents grew, the memory usage grew by the same factor and we got our OutOfMemoryException.
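
A rough back-of-the-envelope sketch shows the scale of the problem (the four bytes per entry is my assumption, based on the cache keeping one 32-bit integer per document for every field that has been sorted on):

docs = 500000           # documents in the index
fields = 1200           # unique dynamic field names that may be sorted on
bytes_per_entry = 4     # assumed size of one cached integer value

total_bytes = docs * fields * bytes_per_entry
print(total_bytes / (1024.0 ** 3))  # roughly 2.2 GB of field cache in the worst case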

The Solution

Our solution could probably be more elegant, but for now we’ve moved the sorting from the data provider layer to our application layer. We’re requesting the complete set of hits for the category from the Solr server anyway, so we’re able to sort them in the application – and by using a response format other than XML we’re also doing it rather quickly. This means that we’re not sorting on the Solr side at all, and are only querying against one multivalued field to see if the category key is present.
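
A minimal sketch of the application-side sorting, assuming the Solr response has already been decoded into a list of documents and that each document carries a dictionary of category priorities (the field and key names here are made up for illustration):

def sort_hits(docs, category_key):
    # Documents without a priority for the category sort last
    return sorted(docs, key=lambda doc: doc.get('priorities', {}).get(category_key, float('inf')))

hits = [
    {'id': 1, 'priorities': {'Category_Subcategory': 300}},
    {'id': 2, 'priorities': {'Category_Subcategory': 100}},
    {'id': 3, 'priorities': {}},
]
print([doc['id'] for doc in sort_hits(hits, 'Category_Subcategory')])  # [2, 1, 3]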

Note: Another solution we considered was to divide our index into several Solr cores. This would allow us to keep the number of documents in each core low, and therefore also keep the FieldCache size in check. We know that each category could very well live on just one core, as we wouldn’t be mixing it with data from the other cores (and for searches across everything we could keep a separate core with all the documents, just not use it for searching across the dynamic fields). We dropped this plan because of the rather worrying increase in complexity of our Solr installation. It could however help in your own case. :-)

Introducing Mismi – Amazon Price Comparison for Norwegian Customers

My main project in December was Mismi – a service that compares the total price of items from Amazon.com and from Amazon.co.uk for Norwegians. The solution is built on top of the Zend_Service_Amazon class (with a few extensions of my own).

The reasoning behind the service is that several factors come into play when deciding whether to order a product from the US or the UK: the exchange rates for GBP and USD, the shipping cost, the delivery situation for the item and whether the item is sold in that store at all.

The user enters a list of URLs to the products they’re considering purchasing from an Amazon store, presses submit and gets back a list of which items are in stock, which store has each item the cheapest and what the total sum of an order placed at each store would be. In addition I added an alpha-stage feature just before Christmas which will also tell you the “optimum” set of items for the orders – “order item 1,4,7,9 from .com, item 2,3,5,6,8 from .co.uk”. This took quite a bit of hacking – you also have to consider the base shipping price of each order, the shipping cost per item and other fun things.
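
The “optimum” split itself boils down to trying the possible assignments of items to stores and keeping the cheapest total. A toy sketch of that idea (all prices and shipping rates below are invented, and the real service also has to factor in the GBP and USD exchange rates):

from itertools import product

base_shipping = {'com': 8.0, 'couk': 6.0}      # charged once per order, hypothetical numbers
item_shipping = {'com': 3.0, 'couk': 3.5}      # charged per item
prices = [                                     # price of each item in each store
    {'com': 20.0, 'couk': 21.0},
    {'com': 15.0, 'couk': 12.0},
    {'com': 9.0,  'couk': 9.5},
]

def order_total(assignment):
    total = 0.0
    for store in ('com', 'couk'):
        items = [i for i, s in enumerate(assignment) if s == store]
        if items:
            total += base_shipping[store]
            total += sum(prices[i][store] + item_shipping[store] for i in items)
    return total

# Try every possible assignment of items to stores and keep the cheapest one
best = min(product(('com', 'couk'), repeat=len(prices)), key=order_total)
print(best, order_total(best))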

Feel free to play with it over at mismi.e-mats.org. It’s in Norwegian, but it should be easy to understand anyhow with the description above.

mod_jk and Internal Server Error (HTTP 500)

We’ve extended our previously single Solr node to a few nodes in a cluster. This allows us to run queries against one node while updating or configuring another, to distribute the load across several servers (although we’re not there yet load-wise) and to handle any out of memory or other critical errors.

While Solr supports querying several cores and distributing queries internally, we decided to move the load balancing and handling of failed nodes higher up in the hierarchy. We now do both with mod_jk in our existing Apache-based environment, which also takes failed servers out of rotation without any administrator interaction. We were already using mod_jk for our main web frontend, and since we use Tomcat as the application container for Solr, things should be a breeze!

Well, no. After copying our existing mod_jk setup, configuring our new workers and restarting Apache, all I got was the well known 500 INTERNAL SERVER ERROR. Here’s the worker configuration file:

worker.list=loadbalancer,status

worker.solr1.port=8009
worker.solr1.host=10.0.0.4
worker.solr1.type=ajp13
worker.solr1.lbfactor=1
worker.solr1.cachesize=10

worker.solr2.port=8009
worker.solr2.host=10.0.0.5
worker.solr2.type=ajp13
worker.solr2.lbfactor=4
worker.solr2.cachesize=10

worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=solr1,solr2
worker.loadbalancer.sticky_session=0

worker.status.type=status

This provides us with two Solr workers and one status worker (the status worker provides a simple web interface for enabling, disabling and seeing the status of the other workers), configured with 1:4 load balancing (the second server has quite a bit more memory available for Solr).

I provided the configuration of the workers through the JkWorkersFile configuration setting (in a VirtualHost block… don’t do that):

JkWorkersFile conf/workers.properties

I also enabled debug logging to try to find the problem (still in a VirtualHost block):

JkLogFile logs/mod_jk.log
JkLogLevel debug
JkLogStampFormat "[%a %b %d %H:%M:%S %Y]"

Other mod_jk settings (in the VirtualHost block) were:

JkOptions +ForwardKeySize +ForwardURICompat -ForwardDirectories
JkRequestLogFormat "%w %V %T"
JkShmFile logs/jk.shm
JkMount /* loadbalancer

<Location /jkstatus>
    JkMount status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>

Still no solution. Peeking at the log files mod_jk provided, I was able to deduce the following:

[debug] map_uri_to_worker::jk_uri_worker_map.c (525): Attempting to map context URI '/jkstatus'
[debug] map_uri_to_worker::jk_uri_worker_map.c (550): Found an exact match status -> /jkstatus
[debug] jk_handler::mod_jk.c (1920): Into handler jakarta-servlet worker=status r->proxyreq=0
[debug] wc_get_worker_for_name::jk_worker.c (111): did not find a worker status
[info]  jk_handler::mod_jk.c (2071): Could not find a worker for worker name=status

This indicates that mod_jk was unable to find a worker matching the name I provided in the JkMount statement above: status. Weird. I added some garbage characters to the “JkWorkersFile” setting, and Apache complained that it was unable to find the workers file. Changed it back, reloaded, and still nothing. It was apparently unable to find the worker. The URI mapping did however work, as it tried to invoke a worker.

Looking back at the startup sequence of mod_jk, I found the following in the log:

[debug] build_worker_map::jk_worker.c (236): creating worker ajp13
[debug] wc_create_worker::jk_worker.c (141): about to create instance ajp13 of ajp13
[debug] wc_create_worker::jk_worker.c (154): about to validate and init ajp13
[debug] ajp_validate::jk_ajp_common.c (1922): worker ajp13 contact is 'localhost:8009'
[debug] ajp_init::jk_ajp_common.c (2047): setting endpoint options:
[debug] ajp_init::jk_ajp_common.c (2050): keepalive:        0
[debug] ajp_init::jk_ajp_common.c (2054): timeout:          -1
[debug] ajp_init::jk_ajp_common.c (2058): buffer size:      0
[debug] ajp_init::jk_ajp_common.c (2062): pool timeout:     0
[debug] ajp_init::jk_ajp_common.c (2066): connect timeout:  0
[debug] ajp_init::jk_ajp_common.c (2070): reply timeout:    0
[debug] ajp_init::jk_ajp_common.c (2074): prepost timeout:  0
[debug] ajp_init::jk_ajp_common.c (2078): recovery options: 0
[debug] ajp_init::jk_ajp_common.c (2082): retries:          2
[debug] ajp_init::jk_ajp_common.c (2086): max packet size:  8192
[debug] ajp_create_endpoint_cache::jk_ajp_common.c (1959): setting connection pool size to 1 with min 0

It took a bit of time, but I realized that this tells me that mod_jk created _a default_ worker named ajp13. Apparently it was not reading my workers file at all, but it still complained if I changed the file name. You’d think that a setting that complains when the configuration file doesn’t exist would actually load it when it does. But .. well. After an hour of trying to find out why the workers didn’t load, reducing the workers file to a minimal example and trying with just a single status worker, I concluded that my workers file was correct, and that mod_jk obviously found it when it attempted to load it.

Then I suddenly discovered the small notice in the mod_jk configuration manual:

JkWorkersFile: This directive is only allowed once. It must be put into the global part of the configuration.

JkWorkersFile cannot be defined in a <VirtualHost> section. It will NOT complain if you do; it’ll just never define any workers. It will, however, complain if the file doesn’t exist, even though it never actually tries to load it.

Confusing.

Moving the JkWorkersFile statement out of the <VirtualHost> block and up to the global configuration, next to the LoadModule statement, solved the issue. The same goes for JkWorkerProperty.
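
For reference, the working placement is at the global server level, roughly like this (the module path and file names depend on your installation):

# httpd.conf, global server configuration (not inside a <VirtualHost> block)
LoadModule jk_module modules/mod_jk.so
JkWorkersFile conf/workers.properties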