Introducing Mismi – Amazon Price Comparison for Norwegian Customers

My main project in December was Mismi – a service that compares the total price of items from Amazon.com and from Amazon.co.uk for Norwegians. The solution is built on top of the Zend_Service_Amazon class (with a few extensions of my own).

The reasoning behind making the service is that there are several factors that are in play when deciding whether to order a product from the US or from the UK: the exchange rate for GBP and USD, the shipping cost, the delivery situation for the item and whether the item is sold in the store at all.

The user enters a list of URLs to the products they’re considering purchasing from an Amazon store, presses submit and gets back a list showing which items are in stock, where each item is cheapest and what the total sum of an order placed at each store would be. In addition, I added an alpha-stage feature just before Christmas which will also tell you the “optimum” set of items for the orders – “order item 1,4,7,9 from .com, item 2,3,5,6,8 from .co.uk”. This took quite a bit of hacking – you also have to consider the initial price of shipping, shipping for each item and other fun things.
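To give an idea of why the “optimum” split takes some care, here is a minimal brute-force sketch. It is not the code running on Mismi: the function name, the prices and the shipping rates are all made up for illustration, and it assumes prices have already been converted to NOK.

```php
<?php
// Sketch only: prices are assumed to be converted to NOK already,
// and null means the store does not sell the item.
function cheapest_split(array $items, array $baseShipping, array $perItemShipping)
{
    $names = array_keys($items);
    $stores = array_keys($baseShipping);
    $storeCount = count($stores);
    $best = null;

    // Try every possible assignment of items to stores (brute force).
    $combinations = pow($storeCount, count($names));

    for ($i = 0; $i < $combinations; $i++) {
        $n = $i;
        $total = 0;
        $plan = array();
        $valid = true;

        foreach ($names as $name) {
            // Read $i as a number in base $storeCount: one digit per item.
            $store = $stores[$n % $storeCount];
            $n = (int) ($n / $storeCount);

            if ($items[$name][$store] === null) {
                $valid = false;
                break;
            }

            $plan[$store][] = $name;
            $total += $items[$name][$store] + $perItemShipping[$store];
        }

        if (!$valid) {
            continue;
        }

        // The initial shipping price is paid once per store that gets an order.
        foreach (array_keys($plan) as $store) {
            $total += $baseShipping[$store];
        }

        if ($best === null || $total < $best['total']) {
            $best = array('total' => $total, 'plan' => $plan);
        }
    }

    return $best;
}

// Made-up example data (NOK).
$items = array(
    'book' => array('com' => 100, 'uk' => 120),
    'dvd'  => array('com' => 200, 'uk' => 150),
    'game' => array('com' => null, 'uk' => 300),
);

$best = cheapest_split($items, array('com' => 50, 'uk' => 40), array('com' => 20, 'uk' => 15));
```

With these numbers the book is cheaper at .com on its own, but the extra initial shipping charge makes a single .co.uk order the cheapest overall (655 in total). Trying every combination only works for short lists, since the number of combinations doubles with each item.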

Feel free to play with it over at mismi.e-mats.org. It’s in Norwegian, but it should be easy to understand anyhow with the description above.

mod_jk and Internal Server Error (HTTP 500)

We’ve extended our previously single Solr-node to a few nodes in a cluster. This allows us to run queries against one node while updating or configuring another, distributing the load across several servers (although we’re not there yet load wise) and being able to handle any out of memory or other critical errors.

While Solr supports querying several cores or distributing the queries internally, we decided to move the load balancing and handling of failed nodes higher up in the hierarchy. We’re now doing simple load balancing and handling of failed nodes by using mod_jk in our existing Apache-based environment. mod_jk also handles failed servers without any administrator interaction. We were already using mod_jk for our main web frontend, and since we use Tomcat as our application container for Solr, things should be a breeze!

Well, no. After copying our existing mod_jk setup, configuring our new workers and restarting Apache, all I got was the well known 500 INTERNAL SERVER ERROR. Here’s the worker configuration file:

worker.list=loadbalancer,status

worker.solr1.port=8009
worker.solr1.host=10.0.0.4
worker.solr1.type=ajp13
worker.solr1.lbfactor=1
worker.solr1.cachesize=10

worker.solr2.port=8009
worker.solr2.host=10.0.0.5
worker.solr2.type=ajp13
worker.solr2.lbfactor=4
worker.solr2.cachesize=10

worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=solr1,solr2
worker.loadbalancer.sticky_session=0

worker.status.type=status

This provides us with two solr servers and one status worker (the status worker is responsible for providing a simple web interface for enabling/disabling/seeing the status of the other workers), configured with a 1:4 load balancing (the second server has quite a bit more memory available for Solr).

I provided the configuration of the workers through the JkWorkersFile configuration setting (in a VirtualHost block… don’t do that):

JkWorkersFile conf/workers.properties

I also enabled debug logging to attempt to find the problem (still in a VirtualHost block):

JkLogFile logs/mod_jk.log
JkLogLevel debug
JkLogStampFormat "[%a %b %d %H:%M:%S %Y]"

Other mod_jk settings (in the VirtualHost block) were:

JkOptions +ForwardKeySize +ForwardURICompat -ForwardDirectories
JkRequestLogFormat "%w %V %T"
JkShmFile logs/jk.shm
JkMount /* loadbalancer

<Location /jkstatus>
    JkMount status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>

Still no solution. Peeking at the log files mod_jk provided, I was able to deduce the following:

[debug] map_uri_to_worker::jk_uri_worker_map.c (525): Attempting to map context URI '/jkstatus'
[debug] map_uri_to_worker::jk_uri_worker_map.c (550): Found an exact match status -> /jkstatus
[debug] jk_handler::mod_jk.c (1920): Into handler jakarta-servlet worker=status r->proxyreq=0
[debug] wc_get_worker_for_name::jk_worker.c (111): did not find a worker status
[info]  jk_handler::mod_jk.c (2071): Could not find a worker for worker name=status

This indicates that mod_jk was unable to find a worker matching the name I provided in the JkMount statement above: status. Weird. I added some garbage characters to the “JkWorkersFile” setting, and Apache complained that it was unable to find the workers file. I changed it back, reloaded, and still nothing. It was apparently unable to find the worker. The mapping did work, however, as it tried to launch a worker.

Looking back at the start-up sequence of mod_jk, I found the following in the log:

[debug] build_worker_map::jk_worker.c (236): creating worker ajp13
[debug] wc_create_worker::jk_worker.c (141): about to create instance ajp13 of ajp13
[debug] wc_create_worker::jk_worker.c (154): about to validate and init ajp13
[debug] ajp_validate::jk_ajp_common.c (1922): worker ajp13 contact is 'localhost:8009'
[debug] ajp_init::jk_ajp_common.c (2047): setting endpoint options:
[debug] ajp_init::jk_ajp_common.c (2050): keepalive:        0
[debug] ajp_init::jk_ajp_common.c (2054): timeout:          -1
[debug] ajp_init::jk_ajp_common.c (2058): buffer size:      0
[debug] ajp_init::jk_ajp_common.c (2062): pool timeout:     0
[debug] ajp_init::jk_ajp_common.c (2066): connect timeout:  0
[debug] ajp_init::jk_ajp_common.c (2070): reply timeout:    0
[debug] ajp_init::jk_ajp_common.c (2074): prepost timeout:  0
[debug] ajp_init::jk_ajp_common.c (2078): recovery options: 0
[debug] ajp_init::jk_ajp_common.c (2082): retries:          2
[debug] ajp_init::jk_ajp_common.c (2086): max packet size:  8192
[debug] ajp_create_endpoint_cache::jk_ajp_common.c (1959): setting connection pool size to 1 with min 0

It took a bit of time, but I realized that this tells me that mod_jk created _a default_ worker named ajp13. Apparently it was not reading my workers file at all, even though it complained if I changed the file name. You’d think a setting that complains when the file is missing would also load the file when it’s there. But .. well. After an hour of trying to find out why the workers didn’t load, reducing the workers file to a minimal example and trying with just a single status worker, I concluded that my workers file was correct, and that mod_jk obviously found it when it attempted to load it.

Then I suddenly discovered the small notice in the mod_jk configuration manual:

JkWorkersFile: This directive is only allowed once. It must be put into the global part of the configuration.

JkWorkersFile cannot be defined in a <VirtualHost> section. It will NOT complain if you do it, it’ll just never define any workers. It will complain if the file doesn’t exist, even if it never tries to actually load it.

Confusing.

Moving the JkWorkersFile statement out of the <VirtualHost> block and placing it in the global configuration, next to the LoadModule statement, solved the issue. The same restriction applies to JkWorkerProperty.
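For reference, a layout along these lines is what the fix amounts to. This is a sketch rather than our exact configuration: the module path, the VirtualHost address and the ServerName are assumptions, while the mod_jk directives are the ones used above.

```apache
# Global (server-level) configuration, outside any <VirtualHost>.
# The module path below is an assumption; adjust it to your installation.
LoadModule jk_module modules/mod_jk.so

JkWorkersFile conf/workers.properties
JkShmFile logs/jk.shm
JkLogFile logs/mod_jk.log
JkLogLevel info

# Hypothetical virtual host; only the mounts live in here.
<VirtualHost *:80>
    ServerName solr.example.org

    JkMount /* loadbalancer

    <Location /jkstatus>
        JkMount status
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
    </Location>
</VirtualHost>
```

The point is simply that the workers file is loaded once, at server level, and the virtual hosts only refer to the workers by name.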

Patch for Max Iterations for a foreach Block in Smarty

After running into a need for a max iteration count on a foreach block tonight, and seeing that several others have had the same need over the years, I’ve created a simple patch to add max= as an attribute to the foreach block. I tried to search the archives for a reason why this hadn’t already been included, so feel free to ignore this patch if there are proper reasons why this isn’t available as an argument. There are cases where a simple break in the loop is more efficient than making a copy with array_slice, for instance if you need the same data in several places but in different slice sizes.

The patch also contains three tests to test the max attribute.

The patch is available here: smarty.foreach.max.patch. The patch is against the current SVN trunk of 2010-08-08.

Example:

{foreach item=x from=[0,1,2,3,4,5,6,7,8,9] max=5}{$x}{/foreach}

Output:

01234
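In plain PHP terms, the attribute boils down to breaking out of the loop once the limit is reached. The sketch below is not taken from the patch; render_first is a hypothetical helper that only illustrates the break-versus-array_slice point.

```php
<?php
// Hypothetical helper, not part of Smarty or the patch: concatenate at most
// $max items, stopping early with break instead of slicing the array first.
function render_first(array $items, $max)
{
    $out = '';
    $seen = 0;

    foreach ($items as $item) {
        if ($seen >= $max) {
            break;
        }

        $out .= $item;
        $seen++;
    }

    return $out;
}

$viaBreak = render_first(range(0, 9), 5);
$viaSlice = implode('', array_slice(range(0, 9), 0, 5));
```

Both variables end up holding the same string as the template example above; the difference is only that the first version never builds a second array.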

Unbreak My Hea.. Firefox Ctrl Click Please!

When we launched Gamer.no over a year ago, we had to come up with a wallpaper advertising solution in a rush (everything was a rush back then, as we built and launched a site from scratch in just under four days, or 96 hours, after disagreements between the previous owner and Gamer). While this solution has worked .. good enough .. it has always had a few irksome bugs that I’ve never really had the right inspiration to track down. Usually I’ve spent an hour and decided that the time wasn’t worth it at the moment and then moved onto something else, but today! Today is a glorious day!

The bug has been fixed!

The wallpaper element is placed around the main content div, which sadly also makes the wallpaper element receive any click events that the main content div receives. This leads to the wallpaper getting clicked and the wallpaper ad window opening regardless of where people click – which will get very, very annoying very quickly. So to battle this issue, the original solution was to call .stopPropagation() on the evt object in a click handler for the main content div. This solved the issue and everyone rejoiced! However, all was not perfect in paradise.

Some time later we discovered that the .stopPropagation() fix borked ctrl-clicking a link in Firefox. Other browsers handled it just fine, but Firefox was obviously not happy. Not happy at all. Mad and going on a killing spree, it shot down the proposed fixes from both myself and other people who had a brief look at the code. It wasn’t a big issue, as we only run the wallpaper code for small intervals of time and people didn’t complain (maybe we were some of the few who had the issue).

Today I decided to have a look at the issue again, and finally I realized that we had been way too focused on our call to .stopPropagation(). Everyone had been planning how we could get .stopPropagation() to do what we wanted it to do – after all – the issue was that .stopPropagation() didn’t behave when we ctrl-clicked in Firefox. But wait.

If you instead think of the original problem (the window.open gets triggered when people click the inner element instead of the outer), there may be alternatives to using stopPropagation. And yes, THAT was quite a simple fix. Instead of trying to stop the event from bubbling up through the cloud.. let’s just set a status variable that tells the code handling the wallpaper click that THIS CLICK IS NOT FOR YOU BAD HANDLER GO AWAY LET OTHER GROWNUPS HANDLE THIS. So that I did.

$(document).ready(function () {
    // set by the content handler (which runs first, as the event bubbles outwards)
    var innerClick = false;
    $('#wallpaper').click(function() {
        if (innerClick)
        {
            innerClick = false;
            return true;
        }
        
        window.open("..");
    });
    $('#content').click(function(evt) {
        innerClick = true;
    });
});

As soon as I actually spent some time on what we were trying to solve instead of what seemed like the cause of the issue .. everything went better than expected.

Fixing Issue With PHPs SoapClient Overwriting Duplicate Attribute and Tag Names

The setting:

A SOAP response contains an element with an Id attribute, and directly beneath that element (as an immediate child) there is an element with the exact same name:


  foobar

The problem is that the result object generated by SoapClient (at least as of PHP 5.2.12) contains the attribute value, and not the element value. In our case we could ignore the z:Id attribute, as it was simply an Id to identify the element in the response (this might be something that ASP.NET or some other .NET component does).

Our solution is to subclass the internal SoapClient and override the __doRequest method, stripping out the part of the response that gives the wrong value for the Id field:

class Provider_SoapClient extends SoapClient
{
    public function __doRequest($request, $location, $action, $version)
    {
        $result = parent::__doRequest($request, $location, $action, $version);
        $result = preg_replace('/ z:Id="i[0-9]+"/', '', $result);
        return $result;
    }
}

This removes the attribute from all the elements. In our case there was no danger of the string being present anywhere else in the response; if there is in yours, be sure to adjust the regular expression. And voilà, it works!
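To see what the regular expression does without involving a real service, here is a small sketch that runs it on a response fragment. The fragment and the element names Item and Name are made up; only the z:Id attribute and the Id element come from the case described above.

```php
<?php
// Hypothetical response fragment; the Item and Name elements are invented.
$response = '<Item z:Id="i12"><Id>foobar</Id><Name z:Id="i13">baz</Name></Item>';

// Same pattern as in Provider_SoapClient::__doRequest.
function strip_z_ids($xml)
{
    return preg_replace('/ z:Id="i[0-9]+"/', '', $xml);
}

$cleaned = strip_z_ids($response);
```

After the replacement only the Id element is left to populate the Id property, so the attribute value can no longer shadow it.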

Avoid Escaping Spaces in the Query String in a Solr Query

Following up on the previous post about escaping values in a Solr query string, it’s important to note that you should not escape spaces in the query itself. The reason for this is that if you escape spaces in the query “foo bar”, the search will be performed on the single term “foo bar”, and not with “foo” as one term and “bar” as the other. This will only return documents that have the string “foo bar” in sequence.

The solution is to either remove the space from the escape list in the previous function – and use another function for escaping values where you actually should escape the spaces – or break up the string into “escapable” parts.

The code included beneath takes the latter approach; it splits the string into parts delimited by spaces and then escapes each part of the query by itself.

$queryParts = explode(' ', $this->getQuery());
$queryEscaped = array();

foreach($queryParts as $queryPart)
{
    $queryEscaped[] = self::escapeSolrValue($queryPart);
}

$queryEscaped = join(' ', $queryEscaped);
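Since the snippet depends on the escape function from the previous post, here is a self-contained sketch of the same idea. The escape_solr_value function below is a stand-in I wrote for this example, with an assumed list of special characters; it is not necessarily identical to the original escapeSolrValue.

```php
<?php
// Stand-in escaper (assumption): backslash-escape characters with special
// meaning in the Lucene query syntax, including the space.
function escape_solr_value($value)
{
    // The backslash must come first so the escapes added later are not escaped again.
    $special = array('\\', '+', '-', '&', '|', '!', '(', ')', '{', '}',
                     '[', ']', '^', '~', '*', '?', ':', '"', ';', ' ');
    $escaped = array();

    foreach ($special as $char) {
        $escaped[] = '\\' . $char;
    }

    return str_replace($special, $escaped, $value);
}

// Escape each space-delimited part by itself, so the spaces between terms survive.
function escape_solr_query($query)
{
    return implode(' ', array_map('escape_solr_value', explode(' ', $query)));
}

$asValue = escape_solr_value('foo bar');        // one term containing an escaped space
$asQuery = escape_solr_query('title:foo (bar)'); // two terms, each escaped separately
```

The first call is what you want for a single field value, the second for free-text queries where each word should remain its own term.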

A Simple Smarty Modifier to Generate a Chart Through Google Chart API

After the longest title of my blog so far follows one of the shortest posts.

The function has two required parameters – the first one is provided automagically for you by Smarty (it’s the value of the variable you’re applying the modifier to). This should be an array of objects containing the values you want to graph. The only required argument you have to provide to the modifier yourself is the name of the method to use for fetching the values for graphing.

Usage:
{$objects|googlechart:"getValue"}

This will dynamically load your plugin from the file modifier.googlechart.php in your Smarty plugins directory, or you can register the plugin manually by calling register_modifier on the template object after you’ve created it.

function smarty_modifier_googlechart($points, $method, $size = "600x200", $low = 0, $high = 0)
{
    $pointStr = '';
    $maxValue = 0;
    $minValue = PHP_INT_MAX;
    
    foreach($points as $point)
    {
        if ($point->$method() > $maxValue)
        {
            $maxValue = $point->$method();
        }

        if ($point->$method() < $minValue)
        {
            $minValue = $point->$method();
        }
    }

    if (!empty($high))
    {
        $maxValue = $high;
    }

    $scale = 100 / $maxValue;

    foreach($points as $point)
    {
        $pointStr .= (int) ($point->$method() * $scale) . ',';
    }

    $pointStr = substr($pointStr, 0, -1);

    // labels (5)
    $labels = array();

    $steps = 4;
    $interval = $maxValue / $steps;

    for($i = 0; $i < $steps; $i++)
    {
        $labels[] = (int) ($i * $interval);
    }

    $labels[] = (int) $maxValue;

    return 'http://chart.apis.google.com/chart?cht=lc&chd=t:' . $pointStr . '&chs=' . $size . '&chxt=y&chxl=0:|' . join('|', $labels);
}
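To make the scaling concrete, the sketch below runs the same arithmetic on three sample values. SamplePoint and chart_parts are hypothetical names used only for this illustration; the modifier itself works on whatever objects you pass it.

```php
<?php
// Hypothetical stand-in for the objects passed to the modifier.
class SamplePoint
{
    private $value;

    public function __construct($value)
    {
        $this->value = $value;
    }

    public function getValue()
    {
        return $this->value;
    }
}

// Same arithmetic as the modifier: scale the values to 0-100 and build five axis labels.
function chart_parts(array $points, $method)
{
    $values = array();

    foreach ($points as $point) {
        $values[] = $point->$method();
    }

    $maxValue = max($values);
    $scale = 100 / $maxValue;
    $scaled = array();

    foreach ($values as $value) {
        $scaled[] = (int) ($value * $scale);
    }

    $labels = array();

    for ($i = 0; $i < 4; $i++) {
        $labels[] = (int) ($i * $maxValue / 4);
    }

    $labels[] = (int) $maxValue;

    return array(
        'chd'  => 't:' . implode(',', $scaled),
        'chxl' => '0:|' . implode('|', $labels),
    );
}

$parts = chart_parts(
    array(new SamplePoint(10), new SamplePoint(25), new SamplePoint(50)),
    'getValue'
);
```

For the values 10, 25 and 50 this gives the data string t:20,50,100 and the labels 0:|0|12|25|37|50. Note that a maximum value of 0 would cause a division by zero, both here and in the modifier.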

The function does not support the short version of the Google Chart API Just Yet (tm), as it is a simple proof of concept hack made a few months ago.

How To Dismantle An Atomic HTTP Query .. String.

Following up on yesterday’s gripe about PHPs (old and now useless) automagic translation of dots in GET and POST parameters to underscores, today’s edition manipulates the query string in place instead of returning it as an array.

This is useful if you have a query string you want to pass on to another service, and for some reason the default behaviour in PHP will barf barf and barf. That might happen because of the dot translation issue or that some services (such as Solr) rely on a parameter name being repeatable (in PHP the second parameter value will overwrite the first).

function http_dismantle_query($queryString, $remove)
{
    $removeKeys = array();

    if (is_array($remove))
    {
        foreach($remove as $removeKey)
        {
            $removeKeys[$removeKey] = true;
        }
    }
    else
    {
        $removeKeys[$remove] = true;
    }

    $resultEntries = array();
    $segments = explode("&", $queryString);

    foreach($segments as $segment)
    {
        $parts = explode('=', $segment);

        $key = urldecode(array_shift($parts));

        if (!isset($removeKeys[$key]))
        {
            $resultEntries[] = $segment;
        }
    }

    return join('&', $resultEntries);
}
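As an illustration of the behaviour, here is a compact stand-in with a sample Solr-style query string. strip_query_keys is a hypothetical name and a shortened rewrite for this example, not a replacement for the function above; the parameter names in the sample are made up as well.

```php
<?php
// Compact stand-in (assumption) with the same behaviour: drop the segments
// whose decoded key is listed in $remove and leave everything else untouched.
function strip_query_keys($queryString, array $remove)
{
    $kept = array();

    foreach (explode('&', $queryString) as $segment) {
        $pair = explode('=', $segment, 2);

        if (!in_array(urldecode($pair[0]), $remove, true)) {
            $kept[] = $segment;
        }
    }

    return implode('&', $kept);
}

$forwarded = strip_query_keys(
    'q=foo&fq=type%3Aa&fq=lang%3Ano&wt=json&debug.query=1',
    array('wt', 'debug.query')
);
```

The repeated fq parameters and the original encoding pass through untouched, which is the whole point of working on the raw string instead of on PHP’s parsed array.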

I’m not really sure what I’ll call the next function in this series, but there sure are loads of candidates out there.

Getting Dots to Work in PHP and GET / POST / COOKIE Variable Names

One of the oldest and ugliest relics of the register_globals era of PHP is the fact that all dots in request variable names get replaced with “_”. If your variable was named “foo.bar”, PHP will serve it to you as “foo_bar”. You cannot turn this off, you cannot use extract() or parse_str() to avoid it and you’re mostly left out in the dark. Luckily the QUERY_STRING environment variable (in _SERVER if you’re running mod_php, etc) contains the raw string, and this string contains the dots.

The following “parser” is a work in progress and does not currently support the array syntax for keys that PHP allows, but it solves the issue for regular vars. I will try to extend this later on to actually replicate the functionality of the regular parser.

Here’s the code. No warranties. Ugly hack. You’re warned. Leave a comment if you have any good suggestions regarding this (.. or know of an existing library doing the same..).

function http_demolish_query($queryString)
{
    $result = array();
    $segments = explode("&", $queryString);

    foreach($segments as $segment)
    {
        $parts = explode('=', $segment);

        $key = urldecode(array_shift($parts));
        $value = null;

        if ($parts)
        {
            $value = urldecode(join('=', $parts));
        }

        $result[$key] = $value;
    }

    return $result;
}

(OK, that’s not the real function name, but it’s aptly named to be the nemesis of http_build_query)
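One thing to be aware of is that the function above keeps only the last value when a key is repeated, just like PHP itself. If you need every value, a variant along these lines might do. The name http_parse_query_multi is made up, and this is a sketch rather than something we run in production.

```php
<?php
// Hypothetical variant: keep dots in the keys and collect every value of a
// repeated key in a list instead of letting the last one win.
function http_parse_query_multi($queryString)
{
    $result = array();

    foreach (explode('&', $queryString) as $segment) {
        if ($segment === '') {
            continue;
        }

        $parts = explode('=', $segment, 2);
        $key = urldecode($parts[0]);
        $value = isset($parts[1]) ? urldecode($parts[1]) : null;

        $result[$key][] = $value;
    }

    return $result;
}

$parsed = http_parse_query_multi('foo.bar=1&fq=type%3Aa&fq=lang%3Ano&flag');
```

Every key maps to a list here, even when it only occurs once, which keeps the calling code simple at the cost of an extra [0] for single values.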

Retrieving URLs in Parallel With CURL and PHP

As we’ve recently added support for querying Solr servers in parallel, one of the things we added was a simple class to allow us to query several servers at the same time. The CURL library (which has a PHP extension) even provides an abstraction layer for doing the nitty gritty work for you, as long as you keep track of the resources. The code beneath is based on examples in the documentation and a few tweaks of my own.

The code beneath is licensed under an MIT license. You can also download the file (gzipped).

class Footo_Content_Retrieve_HTTP_CURLParallel
{
    /**
     * Fetch a collection of URLs in parallel using cURL. The results are
     * returned as an associative array, with the URLs as the key and the
     * content of the URLs as the value.
     *
     * @param array $addresses An array of URLs to fetch.
     * @return array The content of each URL that we've been asked to fetch.
     **/
    public function retrieve($addresses)
    {
        $multiHandle = curl_multi_init();
        $handles = array();
        $results = array();

        foreach($addresses as $url)
        {
            $handle = curl_init($url);
            $handles[$url] = $handle;

            curl_setopt_array($handle, array(
                CURLOPT_HEADER => false,
                CURLOPT_RETURNTRANSFER => true,
            ));

            curl_multi_add_handle($multiHandle, $handle);
        }

        //execute the handles
        $result = CURLM_CALL_MULTI_PERFORM;
        $running = false;

        // set up and make any requests..
        while ($result == CURLM_CALL_MULTI_PERFORM)
        {
            $result = curl_multi_exec($multiHandle, $running);
        }

        // wait until data arrives on all sockets
        while($running && ($result == CURLM_OK))
        {
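            // curl_multi_select() may return -1 on some systems; pause briefly so
            // the loop does not spin, then let cURL process the sockets anyway
            if (curl_multi_select($multiHandle) === -1)
            {
                usleep(100);
            }

            $result = CURLM_CALL_MULTI_PERFORM;

            // while we need to process sockets
            while ($result == CURLM_CALL_MULTI_PERFORM)
            {
                $result = curl_multi_exec($multiHandle, $running);
            }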
        }

        // clean up
        foreach($handles as $url => $handle)
        {
            $results[$url] = curl_multi_getcontent($handle);

            curl_multi_remove_handle($multiHandle, $handle);
            curl_close($handle);
        }

        curl_multi_close($multiHandle);

        return $results;
    }
}
