Scalability – Mats Lindh

php-amqplib: Uncaught exception ‘Exception’ with message ‘Error reading data. Recevived 0 instead of expected 1 bytes’

I’ve been playing around with RabbitMQ recently, but trying to find out what caused the above error included a trip through wireshark and an attempt to dig through the source code of php-amqplib. It seems that it’s (usually) caused by a permission problem: either the wrong username / password combination as reported by some on the wide internet, or by my own issue: the authenticated user didn’t have access to the vhost I tried to associate my connection with.

You can see the active permissions for a vhost path by using rabbitmqctl:

sudo rabbitmqctl list_permissions -p /vhostname

.. or you if you’ve installed the web management plugin for rabbitmq: select Virtual Hosts in the menu, then select the vhost you want to see permissions for.

You can give a user (all out) access to the vhost by using rabbitmqctl:

sudo rabbitmqctl set_permissions -p /vhostname guest ".*" ".*" ".*"

.. or by adding the permissions through the web management interface, where you can select the user and the permission regexes for the user/vhost combination.

Solr: Replication not starting?

After upgrading our Solr-servers from 1.4.1 to 4.0-trunk (to be sure we were ready for the next version), I had trouble with getting replication to start again. It worked perfectly back with 1.4.1, but after upgrading to 4.0-trunk, it simply wouldn’t start.

I had to upgrade the machines individually (to allow the current index to continue serve requests), I removed the replication and then directed all the traffic to the slave. After updating the master (which worked after actually remembering to clean out the old webapps from Tomcat and adding a few new settings) and reindexing, most of the traffic were directed to it, and the slave were upgraded to the new Solr-version. I turned on replication again, updated the configuration file with the needed settings and started the slave. Nothing happened. Weird.

Time to debug!

On any slaves there’s a “replication.properties” file in the data directory ($SOLRHOME/data) which contain information about the current replication status. This file were created, indicating that at least the replication was attempting to run. If you open the file in a text editor (or just cat it), you should be able to read a bit of meta information about the replication state.

replicationFailedAtList=1311072270004,1311072240006..
timesFailed=11

Seems like it’s trying, but for some reason it doesn’t work. First thing to check would be to grep for replication in the log on both the master and the slave, and see if there’s any requests being made at all. There might be, but the replication still doesn’t start.

Try fetching the current state yourself to see what response the master is serving. You can do this by using “GET” or “wget” or “curl” to make an HTTP request to the master Solr-server from the slave together with the URL from “masterUrl” in the requestHandler for /replication from solrconfig.xml:

GET http://example.com/solr/replication?command=indexversion

This should respond with something close to:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0
    <int name="QTime">0
  </lst>
  <long name="indexversion">1310994445934
  <long name="generation">2
</response>

If “indexversion” is 0, this means that the master hasn’t triggered a replication yet, which may seem weird if you’ve just started the server and the slave doesn’t have any data at all.

The reason might be that the master has not been instructed to actually trigger a replication event (and unless a replication event has been triggered, the indexversion will be 0):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit
    <str name="replicateAfter">startup
    <str name="replicateAfter">optimize

If you only have “commit” in the above list, a replication event will not be triggered unless you’ve actually performed a commit after the slave has connected for the first time. If you add “startup”, the replication will also be triggered when the master starts up (so that any connecting slaves will start replicating right away).

To fix the issue without restarting any nodes, issue a single commit to the master and watch as the slaves start replicating. To issue a commit through curl:

curl http://example.com/solr/update -H "Content-Type: text/xml" --data-binary '<commit />'

The First Rule of Scalability

Don’t do slow shit often.

Following in the path of the previous Golden Rule of Frameworks, I give you the first rule of scalability.

Randy Shoup on Lessons of Scaling eBay

InfoQ has an article about scaling ebay online written by Randy Shoup about some of the lessons they’ve learned over at eBay during the years when it comes to scaling one of the largest internet sites in the world. The article is a very interesting read, and I’ll sum up the seven main points:

Partition by Function
Split Horizontally
Avoid Distributed Transactions
Decouple Functions Asynchronously
Move Processing To Asynchronous Flows
Virtualize At All Levels
Cache Appropriately