Misunderstanding How in_array Works

Brian Moon has a post about how in_array() really, really sucks. This is not a problem with in_array per se, but failing to recognize the proper way to solve a problem like this. Some of the comments has already touched on this matter, but I’ll attempt to further expand and describe the problem.

You have two arrays; a1 and b2. You’re interested in removing all the values from a1 that also are in b2. If you’re doing the naive approach (which Brian Moon describes), you’ll simply do:

foreach($a1 as $key => $value)
{
    foreach($b2 as $key2 => $value2)
    {
        if ($value == $value2)
        {
            unset($a1[$key]);
        }
    }
}

(ignore any potential side effects of manipulating $a1 while looping through it for now)

This will work for small sizes of a1 and b2, but as soon as the number of entries starts to increase (let’s call them m and n), you’ll see that the growth of your function will approach O(m*n), which can be written as O(n²) as both values approach infinity. This is not good and is the same complexity that you’ll find in naive sorting algorithms. This means that for each element you add to the array, your processing time increases quadratically (since you have two loops here). in_array is simply a shortcut for the inner loop (the inner foreach loop) in this example. It loops through each element of the array and checks if it matches the needle we’re searching for.

Compare this to using array_flip on the array to search first, so that the values becomes the keys:

foreach($a1 as $key => $value)
{
    if (isset($b2[$key]))
    {
        unset($a1[$key]);
    }
}

But why is isset($arr[$key]) any faster than using in_array? Doesn’t the application just have to loop through a different set of values instead (this time, the keys instead of the values)? No. This is where hashing comes into the picture. As $key can be any string value in PHP, the string is hashed and resolved to an internal array id. This means that internally, the following is happening:

$arr[$id] => find location by converting $id to an internal array location (on the C-level) => simply index the array by using this value

Instead of going through all the valid keys, the $id is converted once, and then checked to see if there is any value stored at that location. This is a simplification, but explains the concept. The complexity of this conversion may depend on the length of the key (depending on the choice of hashing function), but we’ll ignore this here, and simply give it a complexity of O(1). Looking up the index in the array is also an O(1) operation (it takes the same time, regardless if we’re looking at item #3 or #4818, it’s simply reading from different locations in memory).

This means that while our other loop is still looping through n elements, we’re now just doing a constant lookup in the innerloop. The amount of work done in the inner loop does not depend on the number of elements in b2, and this means that our algorithm instead grows in a linear fashion (O(n)).

Further reading can be found at Wikipedia: Hash function, Big O Notation. I’ll also suggest reading an introductionary book into the field of algorithms and datastructures. The type of book depends on your skillset, but if anyone wants any suggestions, just leave a comment and I’ll add a few as I get home to my bookshelf tonight.

Implementing a Duck Operator with Reflection

Following up on this post regarding a “duck operator” in PHP, I went ahead and wrote a very, very, very simple implementation using reflection api in php to get the same functionality.

getMethods() as $method)
        {
            $ret[$method->getName()] = true;
        }
        
        return $ret;
    }
    
    function it_quacks($object, $interface)
    {
        $reflectionClass = new ReflectionClass($interface);
        $reflectionObject = new ReflectionObject($object);
        
        $reflectionClassMethods = getMethodProperties($reflectionClass);
        $reflectionObjectMethods = getMethodProperties($reflectionObject);
        
        foreach($reflectionClassMethods as $methodName => $methodData)
        {
            if (empty($reflectionObjectMethods[$methodName]))
            {
                return false;
            }
        }
        
        return true;
    }

    if (it_quacks(new MooingGrassEater(), 'Cow'))
    {
        print("A MooingGrassEater can be seen as a Cow\n");
    }
    else
    {
        print("A MooingGrassEater has no hope of being recognized as a Cow\n");
    }

    if (it_quacks(new MooingGrassEater(), 'Sheep'))
    {
        print("A MooingGrassEater can be seen as a Sheep\n");
    }
    else
    {
        print("A MooingGrassEater has no hope of being recognized as a Sheep\n");
    }
?>

Missing obvious points are of course to compare the number of arguments to the methods and wether they’re optional, so that you further ensure call safety. But hey, it’s just an example implementation. Read the original linked page for more information about the concept.

Short Array Syntax for PHP

En route from the aggregated stream of Planet PHP comes a small post from Brian Moon about Stan’s suggestion for introducing the [] syntax for creating lists in PHP.

I like Python. This is like Python. By association, I like this. This would mean that you now could do:

$a = [1, 2, 3];

instead of $a = array(1,2,3); and:

$a = [‘one’ => 1, ‘two’ => 2]

instead of $a = array(‘one’ => 1, ‘two’ => 2);

Syntactial sugar, but still, sweet sugar.

If we just could get [:]-syntax for slicing and dicing too now..

Swoooosh – Free Open Source Flash-based Multi File Uploader

As I’ve mentioned a few times before I’ve been playing around with Adobe Flex. I finally got some more time to play with it tonight, so I got everything together to a semi-usable shape. A few things are still missing, such as moving the active uploads to the top and handling more than a total number of x queued uploads (at a certain level the progress bars will just disappear out of the Flash area, then magically appear as enough other items are finished).

Download Swoooosh.tar.bz2!

I’m looking for any response on this, and if anyone want to play around with it, please go ahead. It should be fairly simple to set up. I’ve included a brief description of the arguments it accepts below. Everything is released under a slightly modified MIT-based license, where the only change is that I’ve removed the need for keeping the copyright notice in anything that’s not the source code itself. Use it for anything you’d like, and if you make something useful, I’d be happy if you would contribute at patch back to me so that I could update the library itself.

You can see the application in action at my test installation. I’ll remove this test later, and be advised that the files actually will be transferred to my webserver. I’m just going to run rm -f * anyways, but if anyone breaks in and steals your precious uploaded files, you’re the one to blame.

== ARGUMENTS ==

The arguments to the flash file are provided in the flashVars attribute.

There are two required parameters:

destinationURL
The destination where all files are uploaded.

redirectWhenDoneURL
The URL the client is redirected to when all files have been uploaded.

Remember to urlencode both values.

Example:

<SEE THE INSTALL FILE IN THE ARCHIVE>

Optional parameters that are available is:

progressIndicatorColor: "#bfbfbf"
The color of the progress bar.

progressIndicatorBackgroundColor: "white"
The color of the empty bar before any progress has been made.

progressIndicatorWidth: 300
The width of the progress bar indicator.

uploadButtonText: "Click here to upload files!"
The text of the button the user has to click to start uploading files.


== COMPILING ==

To compile the SWF-file from the source code, download the Adobe Flex 3 SDK,
then run mxmlc against Swoooosh/Swoooosh.mxml:

mxmlc Swoooosh/Swoooosh.mxml


== CONTACT ==

Any and all comments are welcome. See the included LICENSE file for
information about usage. Short words: do whatever you want, just don't
claim you wrote it without contributing.

All patches are of course welcome.

UPDATE

As I’ve received many comments about the contents of test.php (the file that receives the post), here is the smallest version:

if (is_array($_FILES))
{
    foreach($_FILES as $file)
    {
        file_put_contents('directory_name_to_write_files_to/' . uniqid(), file_get_contents($file['tmp_name']));
    }
}

This will simply loop through all submitted files and write them to the temporary directory. As with other file uploads in PHP, you can access the original name of the file with the ‘name’ element in the $file array. DO NOT USE FILE NAME FROM ‘name’ WHEN WRITING THE FILE TO DISK. DOING THAT IS A VERY BAD IDEA, AS IT ALLOWS PEOPLE TO CREATE ANY FILE WITH ANY NAME (INCLUDING PHP-FILES WHICH CAN BE RUN IF THEY’RE AVAILABLE THROUGH THE WEB SERVER. YEP.). CAPS OFF.

Remember to make the directory you’re saving the files in WRITABLE for the process that writes the files (might be www-data or whatever user your webserver is running under). If you want to debug the response from the server regardless of what’s shown in the flash UI, use Wireshark to see the raw contents of packets and the conversions between the client and the server.

Even More Programming Challenges

Estimate (one of the two conspirators in “The Rule of Estimate and Dibon”) directed me to The Sphere Online Judge today. If you’ve grown tired of the challenges over at Project Euler or just want something different for a change, head on over. The challenges ranges from quite simple to very complex, and this time you’ll have to provide the program that solves the problem. This means that there is a time factor present and you can’t solve the problem using parallell processing.

For those who have attended NWERC or another version of ACMs programming contests, this is the same stuff (“NM i Programmering” for the norwegian people out there). Go go go!

The Power of Micro Languages

A “micro language” is a small, domain specific language. These small languages are often invented on a project basis or carried over from previous projects (or in some cases, a standardized language is used). The power of this comes as we’re able to move the rules of the application from the application logic, and instead can maintain a seperate set of rules independent of the application. This way you can add and remove business rules without having to re-test or rewrite your application. The concept has been around for years, and it’s in use more places than you could imagine, but I see people over and over again which does not see the benefits of separating the rules from the application itself.

These micro languages can be described in a markup language (such as XML), as a subset of other existing languages, or as your own simple-to-parse language. In this small introduction I’ll use an example from the current codebase of pwned.no, a site for arranging tournaments in several different online games.

When writing the engine that powers pwned.no I wanted to give the users several different options of tournaments to create. You have the regular single elimination tournaments, where winning a match moves you on to the next level, you have the double elimination events, where your team is not out before they’ve lost two matches, and you have the possibility of a group stage where the best teams (and possibly the worst) move on to the finals. Even after this, you realize that you might have situations where you want to combine these different sets of tournaments into a large one (example: elimination -> group stage -> double elimination). Embedding this logic into the application would create an unmaintainable mess of special cases and special handling, and each time I wanted to add a tournament, the possibility of breaking some other tournament were a possibility.

Instead I decided to implement a small language that describes tournaments, a parser to load and process the language and finally store it in an underlying database structure of how the tournament should progress when the next round starts. This meant that the job up front demanded a bit more planning and sketching, but the solution makes the system more flexible and more stable. When someone requests a new tournament format now, I simply create a small file describing the flow in the tournament, and everything works as it should. No code edits, no unit tests that need to be run, nothing at all, as long as the file is a valid tournament description file.

A tournament description file looks like this (for a tournament with a group stage, consisting of eight teams and with a single elimination stage for the two best teams from each group):

ID GROUP8
TEAMS 8
TOURNAMENTSTAGE 3 FINAL
MATCH
.ID FINALE
TOURNAMENTSTAGE 2 SEMIFINALS
MATCH
.ID SEMIFINALE1
.WIN FINALE
MATCH
.ID SEMIFINALE2
.WIN FINALE
TOURNAMENTSTAGE 1 PRELIMINARY_GROUPPLAY
GROUP
.ID GROUP1
.PLACETOMATCH 1 SEMIFINALE1
.PLACETOMATCH 2 SEMIFINALE2
GROUP
.ID GROUP2
.PLACETOMATCH 1 SEMIFINALE2
.PLACETOMATCH 2 SEMIFINALE1

The ID is an internal identifier which is resolved to a string in the currently active language, the TEAMS identifier tells the parser how many teams who can sign up for this tournament, and the rest of the lines are descriptions of the different stages of the tournament. The file can be parsed top-to-bottom, as the identifiers mentioned later in the file already exists higher up in the hierarchy. The .WIN commands tells the parser where the winning team should move next, and the .PLACETOMATCH commands indicates where the different places in the group stage should move next. If I wanted to add a lower bracket too, I’d simply add .PLACETOMATCH 3 LOWERBRACKET1 and a MATCH with .ID LOWERBRACKET1.

You could of course use XML for this instead, but this very simple and very easy to parse language has so far created a total of almost 25.000 tournaments, and everything has worked without a hitch. After resolving the first issues with the parser in my development version, there are now several new tournament formats that has been added (such as a tournament with 256 teams) in the years after the application was released.

This was just a very small introduction to the concept of micro languages, and you’ll find loads of more information about them online. You may ask what the difference between a micro language and a configuration file is, and the truth is that they’re quite similiar. But as the configuration file describes different settings for the application, the micro language describe different rules for the application. You may of course have configuration files that also contain rules, but you’ll need a well-defined and expressive language to define your rules. The domain where this language is used is so small and limited (there are only that many different concepts that are used in tournaments), so a simple language as this fits. If you’re in a situation where the micro language evolves into a full fledged programming language, you’d probably be better off with embedding an existing scripting language (such as Python), or moving the rules out into seperate modules in your application.