Solr: Missing Geographic Distance in Response When Using fl=_dist_:geodist()

A question that arose in the Freenode Solr IRC channel today was about _dist_:geodist() failing to include a field named _dist_ in the response – a field which should contain the return value of geodist(), i.e. the distance from the source point to the matched document. The example from the wiki is:

&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist() asc&fl=_dist_:geodist()

The _dist_:geodist() in the fl= parameter adds a field named _dist_ containing the value returned by geodist() (the :-syntax creates a field name alias).

.. but why is it missing? Probably because using geodist() in the fl parameter is only supported in Solr 4.0 and later (which currently only exists in trunk).

There’s a workaround on the wiki which you can use – but it will not allow you to score documents in any other way, as the score returned will be the value of geodist(). The trick is to use geodist() itself as the query:

q={!func}geodist()
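
A minimal sketch of what the complete workaround request could look like, reusing the sfield and pt parameters from the example above (sort=score asc and fl=*,score are my assumptions about how you’d sort by and expose the distance-as-score):

q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score asc&fl=*,score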

THIS. CHANGES. EVERYTHING. – Useful Bash/*nix Tricks I Never Stumbled Across in the Last 15 Years

A thread at /r/linux set out to reveal all the magic ways of increasing productivity under Linux (or other *nix-based OSes), and like most people I thought there wouldn’t be much news here.

But I was wrong. So very, very wrong.

  1. disown – disowns a running process, making it continue in the background even if you have to log out or close a long-running ssh session because you’re going somewhere. If you’ve ever thought “why the fsck didn’t I run this under screen?”, then this trick is for you. This is a new future, and I’m proud to be a part of it.
  2. CTRL+r in bash – allows you to search your bash history buffer. I’ve known about this, I’ve just never picked up the habit of actually using it. Will do that now.
  3. ssh-copy-id – Appends your public key to the authorized_keys file at the destination computer.
  4. man ascii – the manual page entry for ascii contains an ascii table, right there in your terminal.
  5. xargs --max-procs and parallel – lets xargs run the commands it generates in parallel, keeping up to the given number of processes running at once instead of starting them one by one (see the small example after this list).
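
As a minimal sketch of the xargs variant (the file pattern, the process count and gzip are just placeholder choices):

find . -name '*.log' | xargs --max-procs=4 -n 1 gzip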

Head over to the thread for other goodies, such as a sudo trick for writing files you’ve opened directly in vim without the correct permissions.

Google Doubleclick DFP – Getting the Debug Console When Running in Asynchronous Mode

When trying to find out why a particular ad campaign isn’t being delivered the way you thought it should, the DFP documentation indicates that you should append “google_debug” to the URL, and a new window should pop up with the debug information. After digging through documentation and searching Google for a couple of hours without getting this to work, I finally found the relevant part of the DFP documentation.

Here’s the kicker: in the new, improved ad handler (gpt.js) you’ll have to append “google_console=1” to your URL, then press CTRL+F10 to launch the debug console. No google_debug, no google_pubconsole, just google_console and CTRL+F10.
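
As an example of what that might look like before you press CTRL+F10 (the page URL is just a placeholder):

http://www.example.com/some-article.html?google_console=1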

Hopefully this will help someone else who’s searching for the relevant terms.

Missing Statistics in OpenX Again – This Time in 2.8.7

After upgrading to OpenX 2.8.7 from 2.4.1 our statistics suddenly seemed to have vanished. Debugging an issue like this isn’t exactly straightforward, but after digging through Google searches, wiki pages at OpenX and, well, reading the source (brrrrrrrrr), I think I’ve nailed it.

After upgrading to 2.8.7 the DeliveryLog plugin didn’t get installed – which meant that no deliveries / clicks / impressions were logged. After discovering that this functionality had been moved to a plugin, I tried simply unzipping the plugin and copying the files to the plugins/ directory. This seemed to make OpenX recognize the plugin under “groups” in the plugin menu, but not under the “plugin” menu. It also still didn’t actually log anything, which could be considered a problem.

All the Google searches had shown that OpenX had changed the logging format to a new table structure (named buckets), but they don’t provide any way of restoring / creating the bucket tables if they don’t exist, and they don’t give any error about the bucket tables missing if the plugin doesn’t load. I couldn’t find anything at all about how the tables should look or which tables should be installed, but I finally tried to simply install the plugin through the web interface (log in as Administrator -> select Plugins in the top menu) by uploading the zip file directly, and then FINALLY the post-install script ran. That created the tables (I’ll dump the definitions later if someone needs them), and after reloading the ads the bucket tables started getting values.

Now we’ll just have to hope that they actually get aggregated into something useful as well..

PS: I’m less than impressed by the OpenX upgrade procedure; it always seems to fsck up some detail that leaves your installation in limbo, without being able to detect that something has gone wrong and provide a way to resolve the issue. I understand that they need to – and want to – focus on their paid product, so well, I’ll keep having to fix things manually for a while, but Google’s Doubleclick for Small Businesses may see a new customer soon.

Python, httplib and Empty Content for 200/201 Responses

While hacking together a client for Imbo in Python, I wasn’t able to read the response from a connection initiated with httplib. If the request errored out (HTTP response code 400/403/404) everything worked as it should, but if the response code was 200 / 201, the response read from the httplib connection was empty (read by using getresponse()).

Turns out the issue was related to calling close on the connection before reading the response. This apparently works if there’s an error (which means that the response should be rather small), but not if there’s a regular “OK” response from the server (it’s not enough to just retrieve the HTTPResponse object – you have to call read() on it before closing the connection).

import httplib  # use http.client on Python 3

connection = httplib.HTTPConnection(host)
connection.request(method, path, data)
# read the response *before* closing the connection
data = connection.getresponse().read()
connection.close()

(Compared to the previous solution, which retrieved the HTTPResponse object, closed the connection and then read the response.)
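
For comparison, a sketch of that broken ordering (using the same hypothetical host, method, path and data values as above):

connection.request(method, path, data)
response = connection.getresponse()
# broken: closing before read() leaves the body empty for 200/201 responses
connection.close()
data = response.read()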

Evolution & Exchange: Unable to retrieve message

Some time after upgrading to Ubuntu 11.10 I ended up with the dreaded “Unable to retrieve message” in Evolution (which I use for Exchange connectivity). This has usually corrected itself by simply restarting Evolution, but this time nothing would help. I stumbled across a thread that provided a few ways of possibly solving the issue, but the .evolution directory it points to isn’t where a live installation in Ubuntu keeps its data.

Turns out the directory is:

.local/share/evolution

As both my mail store and address book live on the Exchange server, I decided to just move the evolution directory to a new name and let Evolution recreate it from scratch. This takes a bit of time while Evolution indexes everything, but after a while everything was back to normal.
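
As a sketch of that move (with Evolution shut down first; the .bak suffix is just my choice of backup name):

mv ~/.local/share/evolution ~/.local/share/evolution.bak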

Solr Response Empty from PHP, but Works in Browser or CURL?

A weird issue that I think I’ve stumbled upon earlier, but which yet again reared its head yesterday. Certain application containers (possibly Jetty in this case) will for some reason not produce any output from Solr (or other applications, I’d guess) if the request is made with HTTP/1.0 as the version identifier (“GET /…/ HTTP/1.0” as the first line of the request). The native HTTP support in PHP identifies itself as HTTP/1.0, as it doesn’t support request chunking, which then turns into a magical problem with requests that used to work but don’t work any longer (the response is just zero bytes in size – all other headers are identical) – yet still work as expected if you open them in your browser.

The solution is either to gamble on the server not sending any chunked responses and set protocol_version to 1.1 in the stream context that you pass to the file-retrieving function (see the list of HTTP wrapper settings – I don’t think it’s a good idea to define protocol_version as a float, but, well), or to use cURL instead. The Solr PECL extension uses cURL internally, so it isn’t affected by this issue.

E:Error, pkgProblemResolver::Resolve generated breaks

While attempting to upgrade to Ubuntu 11.10 (Oneiric) from 11.04, do-release-upgrade refused to do anything useful. The only message it felt like delivering was “E:Error, pkgProblemResolver::Resolve generated breaks”. Googling didn’t turn up much, but a forum thread (which I seem to have lost now) suggested (among other things) removing any references to external (3rd party) APT repositories. I thought do-release-upgrade did this by itself, but apparently not …

Commenting out the external repositories in /etc/apt/sources.list and in /etc/apt/sources.list.d/* solved the problem (I had Spotify, Dropbox and Google Chrome repositories there), allowing do-release-upgrade to do its thing.

Checking Status of a Background Task in python-gearman

After stumbling over a question on Stack Overflow about how you’d use python-gearman to check the current status of a running background task, I decided to dig a bit deeper into python-gearman and .. well, answer how you’d do just that.

It turns out it wasn’t as straightforward as it should have been, but at least I managed to solve it using the current API. First of all you’ll have to keep track of which Gearman server gets your task, and what handle it has assigned to the task. These two values identify a currently running task, and since the handle isn’t globally unique, you’ll also have to keep track of the server (so you know where to ask).

To request the current status of a long-running task you’ll have to create appropriate instances of GearmanJob and GearmanJobRequest yourself.

Here’s a small example of how you can do this:

import gearman

client = gearman.GearmanClient(['localhost'])
result = client.submit_job('reverse', 'this is a string', background=True)

The connection information is available through result.job.connection (.gearman_host and .gearman_port), while the handle is available through result.job.handle.
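
In other words, these are the values you’d persist somewhere (the variable names are just mine):

# identify the submitted task: which server got it, and the handle it was given
server = (result.job.connection.gearman_host, result.job.connection.gearman_port)
handle = result.job.handle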

To check the status of a currently running job you create a GearmanClient, but only supply the server you want to query for the current state:

client = gearman.GearmanClient(['localhost'])

# configure the job to request status for - the last four arguments aren't needed for status requests.
j = gearman.job.GearmanJob(client.connection_list[0], result.job.handle, None, None, None, None)

# create a job request 
jr = gearman.job.GearmanJobRequest(j)
jr.state = 'CREATED'

# request the state from gearmand
res = client.get_job_status(jr)

# the res structure should now be filled with the status information about the task
print(str(res.status.numerator) + " / " + str(res.status.denominator))

That should at least solve the problem until python-gearman gets an easier API to do these kinds of requests.

Update: I’ve also added a convenience function to my python-gearman fork at github.

Gearman and Locking for Identical Jobs / Tasks

A question that came up on #gearman on freenode today was how to make sure that a task is only performed by one worker at a time (remember from our previous introduction to Gearman that a worker is the actual piece of code performing a task that has been submitted to gearmand).

I had a few naive suggestions:

Run memcache with a low timeout: add a key when the task arrives, with a low timeout value; if the add fails, simply return, as someone else is probably already performing the task.

Add a function for each unique identification value that can be processed, and only register one worker for each function (I like the memcache solution way better…, but it’d work – at least for a bit).

But neither of these is a good solution to the problem; luckily Brian Moon also saw the question and was quick to point out that Gearman actually has a built-in mechanism for handling de-duplication of tasks. I’ve never used it myself (only read about it a couple of times), so it’s a good thing that Brian paid attention :-)

The solution: use gearman_job_unique (in the PHP extension this value, named $unique in the documentation, can be tacked on to the end of most methods that add tasks or perform them directly, such as the do* methods). If Gearman sees a unique value that a worker is already active for, it won’t resubmit the task, but will simply return the same result when the first worker finishes (unless it’s a background task, in which case the second call just returns – from the submitter’s point of view there’s no difference between the task being newly submitted and it already running, if you’re counting on Gearman to de-duplicate your tasks).
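
As a small sketch of the same idea with python-gearman (the function name, workload and unique value are just placeholders, and I’m assuming submit_job’s unique argument maps to gearman_job_unique):

import gearman

client = gearman.GearmanClient(['localhost'])

# both submissions carry the same unique value, so gearmand should coalesce
# them and only hand the work to a single worker
client.submit_job('resize_image', 'image-1234.jpg', unique='image-1234', background=True)
client.submit_job('resize_image', 'image-1234.jpg', unique='image-1234', background=True)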

So if you need to lock and exit, remember that Gearman has de-duplication of tasks with the same unique value built in. I tend to forget.