SQLAlchemy, MySQL and UTF-8

January 25th, 2013

While SQLAlchemy uses UTF-8 by default, the charset used when communicating with MySQL affects the encoding of the returned data. To be sure that everything is handled properly as UTF-8 (in the console you might run SET NAMES 'utf8' for this, but don’t do that here), add ?charset=utf8 to your connection URL:

    mysql://user:password@localhost/database?charset=utf8
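If the connection URL comes from configuration you don’t control, the parameter can be appended in code before the URL is handed to SQLAlchemy. This is a minimal sketch using only the standard library; the helper name with_charset is made up for this example, and it leaves an explicitly configured charset alone.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_charset(url, charset="utf8"):
    """Return url with a charset query parameter added if it has none."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    # Keep an existing charset; only add one when it is missing.
    query.setdefault("charset", charset)
    return urlunsplit(parts._replace(query=urlencode(query)))


url = with_charset("mysql://user:password@localhost/database")
# The result can then be passed to sqlalchemy.create_engine(url).
```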

Thanks to RustyFluff at StackOverflow.

Debugging Python’s Memory Usage with Dowser

January 24th, 2013

As I mentioned in my previous post, I had to hunt down a leak (which was intentional considering the functionality) somewhere in a batch import task in my Pyramid app. I’ve never played around with any memory profilers in Python before, so this was a proper opportunity to see what the different options were. StackOverflow to the rescue as usual, with a handful of suggestions for Python memory profilers.

After trying a few, I ended up with Dowser. Dowser fit my use case neatly: my application was a long-running, console-based process (Dowser uses cherrypy to launch its own HTTP server, so it was a good thing that it didn’t conflict with any existing server), and I could pause it at a proper location before it consumed too much memory (a time.sleep(largevaluehere) worked nicely, thank you).

Installing Dowser was relatively pain free (a few of the other options I tried either needed custom patches, or required the process to run all the way through before giving me the information I needed).

I needed to get a few dependencies installed:

    pip install pil

.. which Dowser uses to generate sparkline diagrams, and cherrypy itself:

    easy_install cherrypy

.. and last, checking out the latest version of Dowser from SVN:

    svn co http://svn.aminus.net/misc/dowser dowser

I modified the example from the Stack Overflow question above a bit, and ended up with a small helper function in the application’s helper library:

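    def launch_memory_usage_server(port=8080):
        # Imported here so the dependencies are only needed when profiling.
        import cherrypy
        import dowser

        # Serve Dowser's object statistics at the root of the embedded server.
        cherrypy.tree.mount(dowser.Root())
        cherrypy.config.update({
            'environment': 'embedded',
            'server.socket_port': port
        })

        # Start the HTTP server in the background and return to the caller.
        cherrypy.engine.start()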

Then calling launch_memory_usage_server() somewhere early in my code launched the HTTP interface (http://localhost:8080/), which let me watch memory usage while the import process was running. This helped me narrow down where the issue showed up (we were leaking MySQLdb cursors at an alarming rate), and digging deeper into the structure hinted at the underlying cause (the debug toolbar was active for a console script).

Leaking Memory / Cursors with SQLAlchemy and Pyramid

January 24th, 2013

After spending the better part of the day trying to find out why the fsck my console script for importing a dataset through SQLAlchemy needed just above 7 GB of memory before barfing out and swapping like a madman, I finally found the solution.

Make sure that Pyramid’s debug toolbar is disabled. It’ll keep a reference around to every query run through SQLAlchemy (for .. well, debugging purposes, obviously). This causes an issue if you’re running a very large number of queries, and you’re not going to use the debug toolbar from the console anyway, so .. get rid of it.

I created a second version of my development.ini, a development_console.ini that doesn’t load the debug toolbar, and finally stuff Just Worked ™ again.
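One way to produce the console variant is to copy development.ini and drop the toolbar from the include list. The sketch below assumes the toolbar is enabled through a pyramid.includes entry named pyramid_debugtoolbar in the [app:main] section, as in the standard Pyramid scaffolds; your file may be laid out differently, and the function name strip_debugtoolbar is made up for this example. Note that configparser discards comments when writing the file back out.

```python
import configparser
import io


def strip_debugtoolbar(ini_text, section="app:main"):
    """Return ini_text with pyramid_debugtoolbar removed from pyramid.includes."""
    parser = configparser.RawConfigParser()  # no %(here)s interpolation
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(ini_text)

    # pyramid.includes is a whitespace separated list, often one name per line.
    includes = parser.get(section, "pyramid.includes", fallback="").split()
    kept = [name for name in includes if name != "pyramid_debugtoolbar"]
    parser.set(section, "pyramid.includes", "\n".join(kept))

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

Reading development.ini, passing its contents through the function and writing the result to development_console.ini gives a console configuration that otherwise matches the original.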

Replacement for Deprecated / Removed BaseTokenFilterFactory

June 8th, 2012

When writing plugins for Solr you’d previously extend BaseTokenFilterFactory, but at some point since I last built trunk, that changed to TokenFilterFactory – which is located in the util package of Lucene instead.

Diff:

    - import org.apache.solr.analysis.BaseTokenFilterFactory;
    + import org.apache.lucene.analysis.util.TokenFilterFactory;

    - public class xxxxxFilterFactory extends BaseTokenFilterFactory
    + public class xxxxxFilterFactory extends TokenFilterFactory

php-amqplib: Uncaught exception ‘Exception’ with message ‘Error reading data. Recevived 0 instead of expected 1 bytes’

May 1st, 2012

I’ve been playing around with RabbitMQ recently, but trying to find out what caused the above error included a trip through wireshark and an attempt to dig through the source code of php-amqplib. It seems that it’s (usually) caused by a permission problem: either the wrong username / password combination as reported by some on the wide internet, or by my own issue: the authenticated user didn’t have access to the vhost I tried to associate my connection with.

You can see the active permissions for a vhost path by using rabbitmqctl:

    sudo rabbitmqctl list_permissions -p /vhostname

.. or, if you’ve installed the web management plugin for RabbitMQ: select Virtual Hosts in the menu, then select the vhost you want to see permissions for.

You can give a user (all out) access to the vhost by using rabbitmqctl:

    sudo rabbitmqctl set_permissions -p /vhostname guest ".*" ".*" ".*"

.. or by adding the permissions through the web management interface, where you can select the user and the permission regexes for the user/vhost combination.

A Gentle Introduction to Gearman and its Concepts

August 1st, 2011

Gearman (an anagram for “Manager”) is a system for farming out work units to several different servers (or several processes on one server), allowing the calling code to do something completely different while the task is performed. Gearman is not intended for inter-process communication, but is a way to tell other processes that there is work available, and to let these processes (called workers) grab a piece of work for themselves.

One of the common themes that show up at the gearman IRC channel on freenode is an attempt to understand what gearman is and how everything fits together. I’ll try to explain the different concepts and what the different responsibilities of a working gearman infrastructure are. There’s also a “Getting Started” guide on the Gearman web site with a bit of example code and installation instructions, so you might want to keep that open in another tab. So here we go: a simple gearman tutorial explaining the concepts and not just throwing example code your way.

There are three core components of a gearman installation: a client (someone requesting a task to be performed), a worker (someone performing a task) and the server (which coordinates tasks between clients and workers). All three components need to be running for you to be able to do something useful with gearman. It’s worth noting that I’ll use the name “task” for a single item to be performed; you’ll also see this named “function” (which is the name of the actual function the task asks to be performed – a server offers several “functions” that a client can call). Some APIs might also refer to a “task” as a collection of functions to be called. I’ll use the first definition: a task is a call to a function on the server, together with the data for the task and a task identifier. Several subsequent tasks can call the same function.

I’ll go a bit more in detail about each of these components, but it’s important that you understand how everything is interconnected first. An exchange of messages between the different parts can be illustrated as follows:

client -> server: ask server to perform a task
server -> client: acknowledges the request and assigns an identifier to it
server -> all workers: tell workers registered for the task that there is work to be performed
worker -> server: I'll perform the task you just told us about
server -> worker: ok, go ahead, here's the information about the task.
worker -> server: here's the result of the task performed
server -> client: here's the result of the task you asked me to get someone to do for you

The idea behind the server telling all the workers that there is work available is to let the worker that responds fastest get the task, as it’s assumed that this is the worker with the least load on the machine it’s running on (as it responds quickly, that machine is not busy doing other things). As I wrote above, the worker is the piece of code actually doing the work – the worker performs the task that a client has submitted to the gearman server.

You’ll find that most of Gearman is designed according to the same principle – keep stuff simple. The server only needs to keep track of which workers perform which functions, and then let the workers grab a task when it becomes available.
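To make the division of responsibilities concrete, here is a toy in-memory model of the exchange described above. It is only an illustration in Python: it does not speak the Gearman protocol, and the class and method names (ToyServer, submit, grab) are made up for this example.

```python
from collections import deque


class ToyServer:
    """Keeps one queue of pending tasks per function name."""

    def __init__(self):
        self.queues = {}  # function name -> deque of (task_id, workload)
        self.next_id = 0

    def submit(self, function, workload):
        """Client side: queue a task and hand back its identifier."""
        self.next_id += 1
        task_id = "task-%d" % self.next_id
        self.queues.setdefault(function, deque()).append((task_id, workload))
        return task_id

    def grab(self, function):
        """Worker side: take the oldest pending task, or None if idle."""
        queue = self.queues.get(function)
        return queue.popleft() if queue else None


server = ToyServer()
task_id = server.submit("reverse", "hello")   # client -> server
job = server.grab("reverse")                  # worker -> server
result = job[1][::-1]                         # the worker does the work
```

The server never needs to know what “reverse” does; it only matches function names to whoever asks for work.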

The Gearman Client

In Gearman the client is the piece of code that connects to the server and asks for a task to be performed. This can be a dynamic web page (running in Python, Ruby, PHP, Perl or another language with a suitable Gearman library), a completely separate application that connects to Gearman, a worker (to submit a new task or to divide the current task into several smaller tasks to be performed by other workers) or a combination of the above. The important part is that this is simply a client – it has a task that needs to be handled, and it’ll ask the Gearman server to find someone who can perform the task.

The client can be run in synchronous (blocking) or asynchronous (non-blocking) mode. The first will make the client wait until the task has been performed by a worker (and if no worker is available, it’ll wait indefinitely or until reaching a timeout in the client), while the latter will simply fire-and-forget the task to the Gearman server (the server will confirm that the task has been received) and then go on its merry way afterwards. The Gearman server will provide a task identifier which the asynchronous client can use to query the current state of the task it asked to be performed (as long as the actual worker provides such updates).

A small example of how a client might work (using PHP):

    <?php
    $client = new GearmanClient();
    $client->addServer('localhost', 4730);

    $arguments = array(
        'url' => 'http://www.example.com/',
    );

    $client->addTaskBackground('fetchURL', json_encode($arguments));

    $client->runTasks();

This will submit a request to a Gearman server running on the same machine as the script, asking for the function “fetchURL” to be run, and including an array of arguments to the function (you could simply include just the URL, but I find that this way is easier to extend in the future – and using JSON for data exchange makes the worker code more programming language independent). This code uses addTaskBackground to submit the task to be performed in an asynchronous manner. We’re not interested in the result of this task in this particular piece of code – the worker will either provide the result through other means (storing it in a database, in memcache, call an API function telling us that it’s finished) or perhaps we’re not interested in the result at all, just that we’ve attempted to perform the task. If you’re using the synchronous interface, the data returned from the worker will be returned to your code as the return value from the client.
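The point about JSON keeping the worker language independent can be shown with a couple of lines of Python: a worker written in another language only has to decode the same payload. This is a sketch of the payload handling only, with no Gearman library involved, and the function names are made up for this example.

```python
import json


def encode_workload(url):
    """Build the same kind of payload the PHP client sends."""
    return json.dumps({"url": url})


def decode_workload(workload):
    """Return the URL from a JSON workload, or None if it is missing."""
    arguments = json.loads(workload)
    return arguments.get("url")


# PHP's json_encode escapes slashes by default; json.loads accepts that.
from_php = '{"url":"http:\\/\\/www.example.com\\/"}'
url = decode_workload(from_php)
```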

As you can see, the client code is very, very simple. There is no actual work being performed here, we’re just telling the server that we’d like some work to be performed for us.

The Gearman Worker

The Gearman worker is where all the actual work (.. who’d guess) is performed. This is the application that receives a notice that it has to wake up and do a bit of hard work, and which actually goes out and does just that. What kind of work it does depends on what you’re using Gearman for, but a couple of use cases could be to resize an image into smaller sizes (such as thumbnails), to convert an uploaded video into another format for a specific device, sending notification emails, updating an internal search engine such as Solr and quite a few other tasks. As long as the application does not need the result of the task to continue running (no need for waiting for an e-mail to be delivered if you’re going to show a “Your information has been saved” message), then Gearman (and other alternative message queues) is a valid solution.

You’ll run each worker as its own process. A worker can perform several different functions (although you should (usually) stay away from multi-threading to perform them at the same time). If you want more than one task to be performed at the same time (e.g., if you want to send 30 e-mails in parallel), you’ll start several copies of the same worker as separate processes (30 workers in that case). There are several daemons and frameworks that can help you manage the number of processes available depending on server and task load, such as supervisord and GearmanManager (a PHP daemon). Another possible solution is to use screen to start several workers, which also will allow you to attach to the output of any worker at any time.

How the worker performs its work is up to the worker itself. In most cases you’ll have to write a bit of code to expose your code as a Gearman function (so that clients can submit tasks to perform that function), but this code will usually just instantiate the worker framework from the Gearman library you’re using, letting you register what functions you’ll be able to perform and attaching callbacks telling the library what part of your own code should be called when a request to perform a task arrives.

A simple example modified from the Gearman Getting Started guide:

    <?php
    $worker = new GearmanWorker();
    $worker->addServer("localhost", 4730);
    $worker->addFunction("fetchURL", "fetch_url");

    // Block and handle jobs one at a time until work() reports a failure.
    while ($worker->work());

    function fetch_url($job)
    {
        // Pass true to get an associative array, so $arguments['url'] works.
        $arguments = json_decode($job->workload(), true);

        if (!empty($arguments['url']))
        {
            print("Fetching " . $arguments['url'] . "\n");
            return file_get_contents($arguments['url']);
        }
    }

The $worker->work() method call will wait until a job arrives, then execute the callback as defined in the addFunction call. addFunction instructs the worker to tell the gearman server that this worker is able to perform any tasks calling the “fetchURL” function. The callback provided to the library (“call this PHP function (‘fetch_url’) when tasks want to call ‘fetchURL’”) will then receive the job object containing information about the job (task) to be performed. The workload() method returns the workload – the information we included in addition to which function to call in the client example. The server receives the workload from the client and then sends it to the worker together with the task information.
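The registration step boils down to a lookup table from function names to callbacks. Here is a toy Python version of that idea, with no Gearman library involved; ToyWorker, handle and fetch_url_stub are names made up for this example, and the callback only pretends to fetch the URL.

```python
import json


class ToyWorker:
    """Maps function names to the callbacks that perform them."""

    def __init__(self):
        self.callbacks = {}

    def add_function(self, name, callback):
        self.callbacks[name] = callback

    def handle(self, name, workload):
        """Run the callback registered for name with the given workload."""
        return self.callbacks[name](workload)


def fetch_url_stub(workload):
    # Stand-in for the real work: decode the payload and report the URL.
    arguments = json.loads(workload)
    return "would fetch " + arguments["url"]


worker = ToyWorker()
worker.add_function("fetchURL", fetch_url_stub)
output = worker.handle("fetchURL", '{"url": "http://www.example.com/"}')
```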

Since our client calls the server using the asynchronous interface, it won’t wait for the worker to return the web page contents. By using ->do() or one of the other foreground methods in the PHP Gearman library instead, the client would wait and receive the contents as the return value.

The Gearman Server

The Gearman server used is usually the C version of the server. There’s also a Perl version, but these days the C server is the one being actively developed. There’s not much to say about the server: you usually just start it and let it run by itself, doing what it was supposed to do all along.

I’ve got one simple suggestion if you’re just playing around with Gearman for the first time: start the server with the -vvv option. This will make gearmand a lot noisier, and will allow you to see clients registering themselves with the server, pinging the server and getting a bit more information about what’s happening inside the server process.

You’ll also want to provide an IP address that the gearman server should bind to – by default it binds to all interfaces, and since gearmand does not have any authentication built in by default, you don’t want to expose your server to the whole world.

Here’s an example of how we start gearmand at one of our servers:

    screen -d -m -S gearmand /usr/local/sbin/gearmand -L 127.0.0.1 -p 4730 -vvv

You can drop the part related to screen if you just want to play with gearmand:

    /usr/local/sbin/gearmand -L 127.0.0.1 -p 4730 -vvv

If you have gearmand in your path and not in the same location as us, drop /usr/local/sbin :-) This will bind gearmand to your localhost and use the default port (earlier the default port was something other than 4730, so we provide it just in case).

Making it all come together

The easiest way to play around with gearman is to simply open three terminal windows: one for gearmand with logging turned on, one for your worker and its output and the last window for a client sending a task request to gearmand (you can use the ‘gearman’ binary for this, just be sure to include any data in an appropriate format). As you submit a task for a function that the worker has registered, you should see it pick it up and then start processing the task as soon as possible. After a while (depending on how you’ve implemented your worker and what function it performs) the result should appear in your client.
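While experimenting it can also help to ask gearmand which functions have workers registered. As far as I know, gearmand answers a plain-text “status” command on its port with one tab separated line per function (name, jobs in the queue, running jobs, available workers), ending with a line containing a single dot. Treat that layout as an assumption and check it against your version. A small Python sketch that parses such a response (parse_status is a made-up name):

```python
def parse_status(text):
    """Parse the assumed 'status' response format into a dictionary."""
    functions = {}
    for line in text.splitlines():
        if line == ".":
            break  # a single dot terminates the list
        name, total, running, workers = line.split("\t")
        functions[name] = {
            "queued": int(total),
            "running": int(running),
            "workers": int(workers),
        }
    return functions


sample = "fetchURL\t2\t1\t3\n.\n"
status = parse_status(sample)
```

A function that shows up with zero workers means your worker never registered, which is usually the first thing to check when a task just sits there.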

Our production setups usually use a web application (PHP or python/django) as the client in the above scenario. The functions are usually long running tasks, such as analysing GPS paths, encoding videos and downloading files or internal web site analytics (where we just want to get things logged and not wait for the actual logging to complete). The web application submits a request to gearmand as soon as a file has been received, with a payload of the path to the file to be processed. The workers perform their function and then store the information back into the database or to disk, then usually call a web service to tell the web application that the work has been performed and any internal state can be updated to include (and show) the result of the task.

Message queues (such as Gearman) have become one of the core technologies behind many modern web applications (and non-web applications for that matter), so there’s really no reason to avoid at least playing around a bit with them and adding another possible tool to your future options.

Creating / Generating a URL Friendly Snippet in Django

March 19th, 2011

When generating URLs that say something about the content they point to, you usually need to create URL-friendly strings (such as “creating-generating-a-url-friendly-snippet-in-django” for this post). Earlier today I needed to build something like that in Django, but came up empty handed for a while – simply because I didn’t realize that they’re called slugs in the Django documentation (which in turn inherited the term from the newspaper world). That helped quite a bit, and here are the things you need to know:

If you want your model to perform validation and automagically add a database index for the slug, put a SlugField in your model. You could use a regular CharField, but this will not provide any validation of the field in the admin interface (and you’d have to add db_index yourself). A SlugField will however not populate itself with a value, but this Stack Overflow question has a few pointers on how you could implement that. I’ll cite the interesting parts here:

    from django.db import models
    from django.template.defaultfilters import slugify

    class test(models.Model):
        q = models.CharField(max_length=30)
        s = models.SlugField()

        def save(self, *args, **kwargs):
            # Only generate the slug when the object is first created.
            if not self.id:
                self.s = slugify(self.q)

            super(test, self).save(*args, **kwargs)

This will populate the slug field the first time the entry is saved, and will not update the slug value if the object itself is updated.
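To get a feel for what ends up in the slug field, here is a rough standard library approximation of the transformation. It is not Django’s implementation, just a sketch of the same idea (strip accents and punctuation, lowercase, join words with hyphens), and simple_slugify is a made-up name.

```python
import re
import unicodedata


def simple_slugify(value):
    """Approximate a slug: ASCII only, lowercase, words joined by hyphens."""
    # Decompose accented characters and drop whatever is not ASCII.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Remove everything except word characters, whitespace and hyphens.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)


slug = simple_slugify("Creating / Generating a URL Friendly Snippet in Django")
```

For the title of this post the result is the string quoted at the top: creating-generating-a-url-friendly-snippet-in-django.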

You can however also handle this yourself in the get_absolute_url of your object, if you don’t need to perform any lookups on the value (although you could still save it to avoid regenerating it later). This will also give you the current slug for the entry (again, if you only use it as a part of the URL and do not validate it in any way):

    from django.db import models
    from django.template.defaultfilters import slugify

    class test(models.Model):
        q = models.CharField(max_length=30)

        # models.permalink reverses the (view name, arguments) tuple into a URL.
        @models.permalink
        def get_absolute_url(self):
            return ('reviewerer.views.sites_detail', [str(self.id), slugify(self.q)])

Hopefully that will help you solve your issue .. or at least search for the correct term next time :-)

Parse a DSN string in Python

January 31st, 2011

A simple hack to get the different parts of a DSN string (which are used in PDO in PHP):

    import re

    def parse_dsn(dsn):
        # Split "driver:key=value;key=value" into the driver and the rest.
        m = re.search("([a-zA-Z0-9]+):(.*)", dsn)
        values = {}

        if m and m.group(1) and m.group(2):
            values['driver'] = m.group(1)
            # Note: keys and values are limited to letters and digits.
            m_options = re.findall("([a-zA-Z0-9]+)=([a-zA-Z0-9]+)", m.group(2))

            for pair in m_options:
                values[pair[0]] = pair[1]

        return values

The returned dictionary contains one entry for each of the entries in the DSN.

Update: helge also submitted a simplified version of the above:

    import re

    driver, rest = dsn.split(':', 1)
    values = dict(re.findall(r'(\w+)=(\w+)', rest), driver=driver)
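One caveat with both patterns above: values containing dots or dashes (such as host=127.0.0.1) are cut short, since the regular expressions only accept word characters. A sketch of a variant that splits on the separators instead, assuming the usual PDO layout of driver:key=value;key=value:

```python
def parse_dsn(dsn):
    """Parse 'driver:key=value;key=value' into a dictionary."""
    driver, _, rest = dsn.partition(":")
    values = {"driver": driver}

    for pair in rest.split(";"):
        key, separator, value = pair.partition("=")
        # Skip empty fragments, e.g. from a trailing semicolon.
        if separator:
            values[key.strip()] = value.strip()

    return values


parsed = parse_dsn("mysql:host=127.0.0.1;dbname=test")
```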

Introducing Mismi – Amazon Price Comparison for Norwegian Customers

January 10th, 2011

My main project in December was Mismi – a service that compares the total price of items from Amazon.com and from Amazon.co.uk for Norwegians. The solution is built on top of the Zend_Service_Amazon class (with a few extensions of my own).

The reasoning behind making the service is that there are several factors that are in play when deciding whether to order a product from the US or from the UK: the exchange rate for GBP and USD, the shipping cost, the delivery situation for the item and whether the item is sold in the store at all.

The user enters a list of the URLs to the products they’re considering purchasing from an Amazon store, presses submit and gets a list back of which items are in stock, where each item is the cheapest and what the total sum of an order placed at the store would be. In addition I added an alpha stage feature just before Christmas which will also tell you the “optimum” set of items for the orders – “order item 1,4,7,9 from .com, item 2,3,5,6,8 from .co.uk”. This took quite a bit of hacking – you also have to consider the initial price of shipping, shipping for each item and other fun things.

Feel free to play with it over at mismi.e-mats.org. It’s in Norwegian, but it should be easy to understand anyhow with the description above.

mod_jk and Internal Server Error (HTTP 500)

January 4th, 2011

We’ve extended our previously single Solr-node to a few nodes in a cluster. This allows us to run queries against one node while updating or configuring another, distributing the load across several servers (although we’re not there yet load wise) and being able to handle any out of memory or other critical errors.

While Solr supports querying several cores or distributing the queries internally, we decided to move the load balancing and handling of failed nodes higher up in the hierarchy. We’re now doing simple load balancing and handling of failed nodes by using mod_jk in our existing Apache-based environment. mod_jk also handles failed servers without any administrator interaction. We were already using mod_jk for our main web frontend, and since we use Tomcat as our application container for Solr, things should be a breeze!

Well, no. After copying our existing mod_jk setup, configuring our new workers and restarting Apache, all I got was the well known 500 INTERNAL SERVER ERROR. Here’s the worker configuration file:

worker.list=loadbalancer,status

worker.solr1.port=8009
worker.solr1.host=10.0.0.4
worker.solr1.type=ajp13
worker.solr1.lbfactor=1
worker.solr1.cachesize=10

worker.solr2.port=8009
worker.solr2.host=10.0.0.5
worker.solr2.type=ajp13
worker.solr2.lbfactor=4
worker.solr2.cachesize=10

worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=solr1,solr2
worker.loadbalancer.sticky_session=0

worker.status.type=status

This provides us with two solr servers and one status worker (the status worker is responsible for providing a simple web interface for enabling/disabling/seeing the status of the other workers), configured with a 1:4 load balancing (the second server has quite a bit more memory available for Solr).

I provided the configuration of the workers through the JkWorkersFile configuration setting (in a VirtualHost block… don’t do that):

JkWorkersFile conf/workers.properties

I’d also enable debug logging to attempt to find the problem (still in a VirtualHost block):

JkLogFile logs/mod_jk.log
JkLogLevel debug
JkLogStampFormat "[%a %b %d %H:%M:%S %Y]"

Other mod_jk settings (in the VirtualHost block) were:

JkOptions +ForwardKeySize +ForwardURICompat -ForwardDirectories
JkRequestLogFormat "%w %V %T"
JkShmFile logs/jk.shm
JkMount /* loadbalancer

<Location /jkstatus>
    JkMount status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>

Still no solution. Peeking at the log files mod_jk provided, I was able to deduce the following:

[debug] map_uri_to_worker::jk_uri_worker_map.c (525): Attempting to map context URI '/jkstatus'
[debug] map_uri_to_worker::jk_uri_worker_map.c (550): Found an exact match status -> /jkstatus
[debug] jk_handler::mod_jk.c (1920): Into handler jakarta-servlet worker=status r->proxyreq=0
[debug] wc_get_worker_for_name::jk_worker.c (111): did not find a worker status
[info]  jk_handler::mod_jk.c (2071): Could not find a worker for worker name=status

This indicates that mod_jk was unable to find a worker matching the name I provided in the JkMount statement above: status. Weird. I added some garbage characters to the “JkWorkersFile” setting, and Apache complained that it was unable to find the workers file. Changed it back, reloaded, and still nothing. It was apparently unable to find the worker. The mapping did however work, as it tried to launch a worker.

Looking back at the start-up sequence of mod_jk, I found the following in the log:

[debug] build_worker_map::jk_worker.c (236): creating worker ajp13
[debug] wc_create_worker::jk_worker.c (141): about to create instance ajp13 of ajp13
[debug] wc_create_worker::jk_worker.c (154): about to validate and init ajp13
[debug] ajp_validate::jk_ajp_common.c (1922): worker ajp13 contact is 'localhost:8009'
[debug] ajp_init::jk_ajp_common.c (2047): setting endpoint options:
[debug] ajp_init::jk_ajp_common.c (2050): keepalive:        0
[debug] ajp_init::jk_ajp_common.c (2054): timeout:          -1
[debug] ajp_init::jk_ajp_common.c (2058): buffer size:      0
[debug] ajp_init::jk_ajp_common.c (2062): pool timeout:     0
[debug] ajp_init::jk_ajp_common.c (2066): connect timeout:  0
[debug] ajp_init::jk_ajp_common.c (2070): reply timeout:    0
[debug] ajp_init::jk_ajp_common.c (2074): prepost timeout:  0
[debug] ajp_init::jk_ajp_common.c (2078): recovery options: 0
[debug] ajp_init::jk_ajp_common.c (2082): retries:          2
[debug] ajp_init::jk_ajp_common.c (2086): max packet size:  8192
[debug] ajp_create_endpoint_cache::jk_ajp_common.c (1959): setting connection pool size to 1 with min 0

It took a bit of time, but I realized that this tells me that mod_jk created _a default_ worker named ajp13. Apparently it was not reading my worker file at all, but it still complained if I changed the file name. You’d think that the setting which loads the configuration file would work when it complains when it doesn’t. But .. well. After an hour of attempting to find out why the workers didn’t load, revising the workers file to a minimal example, trying with just a single status worker, I concluded that my workers file was correct, and obviously mod_jk found it when it attempted to load it.

Then I suddenly discovered the small notice in the mod_jk configuration manual:

JkWorkersFile: This directive is only allowed once. It must be put into the global part of the configuration.

JkWorkersFile can not be defined in a <VirtualHost> section. It will NOT complain if you do it, it’ll just never define any workers. It will complain if the file doesn’t exist, even if it never tries to actually load it.

Confusing.

Moving the JkWorkersFile statement out of the <VirtualHost> block and into the global configuration, next to the LoadModule statement, solved the issue. The same restriction applies to JkWorkerProperty.
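Since Apache won’t warn about this, a few lines of Python can flag the mistake before you spend an hour on it. This is a naive sketch that scans a configuration file line by line; it does not follow Include directives or understand the rest of Apache’s syntax, and the function name misplaced_directives is made up for this example.

```python
import re


def misplaced_directives(config_text,
                         names=("JkWorkersFile", "JkWorkerProperty")):
    """Return (line number, line) pairs for directives inside <VirtualHost>."""
    inside_vhost = False
    found = []

    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        words = stripped.split()

        if re.match(r"<VirtualHost\b", stripped, re.IGNORECASE):
            inside_vhost = True
        elif re.match(r"</VirtualHost>", stripped, re.IGNORECASE):
            inside_vhost = False
        elif inside_vhost and words and words[0] in names:
            found.append((number, stripped))

    return found


config = """LoadModule jk_module modules/mod_jk.so
<VirtualHost *:80>
    JkWorkersFile conf/workers.properties
    JkMount /* loadbalancer
</VirtualHost>
"""
problems = misplaced_directives(config)
```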