Retrieving URLs in Parallel With CURL and PHP

We recently added support for querying Solr servers in parallel, and as part of that work we wrote a simple class for querying several servers at the same time. The cURL library (which has a PHP extension) even provides an abstraction layer, the curl_multi functions, that does the nitty-gritty work for you, as long as you keep track of the resources. The code beneath is based on examples in the documentation, with a few tweaks of my own.

The code beneath is licensed under an MIT license. You can also download the file (gzipped).

class Footo_Content_Retrieve_HTTP_CURLParallel
{
    /**
     * Fetch a collection of URLs in parallel using cURL. The results are
     * returned as an associative array, with the URLs as keys and the
     * contents of the URLs as values.
     *
     * @param array $addresses An array of URLs to fetch.
     * @return array The content of each URL that we've been asked to fetch.
     **/
    public function retrieve($addresses)
    {
        $multiHandle = curl_multi_init();
        $handles = array();
        $results = array();

        foreach($addresses as $url)
        {
            $handle = curl_init($url);
            $handles[$url] = $handle;

            curl_setopt_array($handle, array(
                CURLOPT_HEADER => false,
                CURLOPT_RETURNTRANSFER => true,
            ));

            curl_multi_add_handle($multiHandle, $handle);
        }

        // start the requests; curl_multi_exec() must be called until
        // it no longer asks to be called again
        $result = CURLM_CALL_MULTI_PERFORM;
        $running = 0;

        while ($result == CURLM_CALL_MULTI_PERFORM)
        {
            $result = curl_multi_exec($multiHandle, $running);
        }

        // wait until data arrives on a socket, then process it
        while ($running && ($result == CURLM_OK))
        {
            if (curl_multi_select($multiHandle) === -1)
            {
                // curl_multi_select() may fail spuriously; back off
                // briefly instead of busy-waiting
                usleep(100);
                continue;
            }

            $result = CURLM_CALL_MULTI_PERFORM;

            // process sockets while cURL has work to do
            while ($result == CURLM_CALL_MULTI_PERFORM)
            {
                $result = curl_multi_exec($multiHandle, $running);
            }
        }

        // clean up
        foreach($handles as $url => $handle)
        {
            $results[$url] = curl_multi_getcontent($handle);

            curl_multi_remove_handle($multiHandle, $handle);
            curl_close($handle);
        }

        curl_multi_close($multiHandle);

        return $results;
    }
}
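
A minimal usage sketch (the URLs below are placeholders; substitute your own servers):

$retriever = new Footo_Content_Retrieve_HTTP_CURLParallel();

// hypothetical Solr select URLs; replace with your own servers
$results = $retriever->retrieve(array(
    'http://solr1.example.com:8080/solr/select?q=foo',
    'http://solr2.example.com:8080/solr/select?q=foo',
));

// the results are keyed by URL, so the order of the input
// array doesn't matter
foreach ($results as $url => $content)
{
    printf("%s returned %d bytes\n", $url, strlen($content));
}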


Shell Script For Submitting Documents to Solr

Here’s a small shell script I’m using to submit pre-made XML documents to Solr. The documents are usually produced by some other program before being submitted to the Solr server. The script submits every file in a directory to the server; here, all the files in the documents directory (relative to the location of the script) will be submitted.

You’ll have to update the URL and the directory (documents) below. We usually group 1,000 documents together in a single file, so a commit happens after every thousand documents; if you use autocommit in Solr, you can remove the commit line. The script requires cURL to talk to the Solr server.

#!/bin/sh
URL=http://localhost:8080/solr/update
cd documents || exit

for i in *; do
    # --data-binary keeps the XML intact; plain -d would strip newlines
    curl -X POST -H 'Content-Type: text/xml' --data-binary @"$i" "$URL"

    # commit after each file; remove this line if you use autocommit
    curl "$URL" -H 'Content-Type: text/xml' --data-binary '<commit/>'

    echo "item: $i"
done
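
For reference, each file the script submits is just a standard Solr XML update message. A minimal sketch with made-up fields:

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">First document</field>
  </doc>
  <!-- ...and so on, up to around a thousand docs per file -->
</add>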