Solr, Memory Usage and Dynamic Fields

One of the many great things about Solr is that it allows you to add dynamic fields: you define a pattern that a field name has to follow, and any field whose name matches that pattern can then be used without declaring it explicitly.

We’ve been using one such dynamic field to add a sort field for our documents:

xxx_Category_Subcategory: 300

Sorting by this field gives us the priority of our documents within that particular category and subcategory. A document contains somewhere between 1 and 15 such fields, and there are around 1200 unique field names in total across all documents.

Be small, be happy

As long as our collection was quite small (<10k documents) this scheme worked great. When the collection grew to around 500k documents, we started seeing out of memory errors quite often. At the worst rate we got an out of memory exception every 30 minutes and had to restart the Solr server. Performance didn't suffer, but we obviously couldn't keep restarting servers forever. After ruling out a few other possible causes (such as our stable random sort) I was rather stumped that things didn't improve.

The total amount of data in our dynamic fields was rather low, somewhere around 2.5 - 3.5 million integers, or roughly 50-70MB in total. The JVM should be able to fit everything about these fields in memory and answer our queries against them, but a heap dump taken just before the out of memory exception hit revealed quite a few GBs of Lucene's FieldCache objects. These objects cache the values of a field for the complete set of documents in the index, and sadly you're not able to tune this cache through the Solr configuration (at least not in 1.4, as far as I could find).

Less Dynamic Fields, More Manual Labor

After pondering the issue a bit I came to the conclusion that our problem was related to the dynamic fields we had, and the fact that we used them for sorting. Lucene / Solr keeps one field cache per field that is used for sorting, to avoid having to do the same work again later. For us, this meant that each time we sorted by a new field, an array had to be created with the size of the total document set. As long as we just had 10k documents, these arrays were small enough that we had enough memory available; when the document set grew to almost 500k documents, not so much.

This means that the total memory required for the field caches scales with DocumentsInIndex * FieldsSortedBy. As long as DocumentsInIndex was just 10k, the memory available to the JVM was enough to sort by as many fields as we did. When the number of documents grew, the memory usage grew by the same factor, and we got our OutOfMemoryException.
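As a rough back-of-the-envelope check (assuming, as Lucene's FieldCache does for integer sort fields, about four bytes per document per sorted field):

500,000 documents * 4 bytes ≈ 2 MB per field sorted by
2 MB * ~1,200 possible field names ≈ 2.4 GB in the worst case

which lines up nicely with the several GBs of FieldCache objects in the heap dump.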

The Solution

Our solution could probably be more elegant, but for now we've moved the sorting from the data provider layer into our application layer. We're requesting the complete set of hits in the category from the Solr server anyway, so we're able to sort it in the application, and by using a response format other than XML we're doing that rather quickly too. This means we're not using Solr's sorting at all, and only query against a single multivalued field to see whether the category key is present, as sketched below.
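To illustrate the approach, here's a minimal sketch in Python rather than our actual application code. The URL, the category_keys field and the priority lookup are made-up names, and Solr 1.4's wt=json is used as the non-XML response format:

# Minimal sketch, not our production code: fetch the complete set of hits
# for a category from Solr as JSON and sort it in the application layer.
import json
import urllib.request
from urllib.parse import urlencode

SOLR_URL = "http://localhost:8983/solr/select"  # assumed location of the select handler

def fetch_category(category_key):
    """Ask Solr for every document carrying this category key."""
    params = urlencode({
        "q": "*:*",
        "fq": "category_keys:" + category_key,  # one multivalued field instead of sorting on dynamic fields
        "wt": "json",                           # JSON response instead of XML
        "rows": 100000,                         # large enough to cover the whole category
    })
    with urllib.request.urlopen(SOLR_URL + "?" + params) as response:
        return json.load(response)["response"]["docs"]

def priority_for(doc, category_key):
    """Hypothetical lookup of a document's priority within the category
    (from a stored field or from application data, depending on your setup)."""
    return doc.get("xxx_" + category_key, 0)

def sorted_category(category_key):
    docs = fetch_category(category_key)
    # The sort happens here, so Solr never has to build a FieldCache for it.
    return sorted(docs, key=lambda d: priority_for(d, category_key), reverse=True)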

Note: Another solution we considered was to divide our index into several Solr cores. This would allow us to keep the number of documents in each core low, and therefore also keep the FieldCache size in check. Each category could very well live on just one core, as we won't be mixing it with data from the other cores (and for searches across all documents we could keep a separate core with everything, just without sorting on the dynamic fields). We dropped this plan because of the rather worrying increase in the complexity of our Solr installation, but it could very well help in your case. :-)

Varnish: No ESI processing, first char not ‘<‘

I've spent some time setting up Varnish for a service at work where we deliver JSON fragments from different resources as a unified document. We build the resulting JSON document by including the different fragments through ESI (Edge Side Includes) statements in the response. Varnish then fetches these fragments from the different services, allowing for an independent caching duration for each service, and combines them into a complete document.
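To make that a bit more concrete, the parent document we serve looks roughly like this (a simplified, made-up example; the fragment URLs are not our real ones). Varnish replaces each ESI include tag with the JSON fragment fetched from the corresponding backend, each with its own caching duration:

{
    "user":    <esi:include src="/fragments/user.json"/>,
    "orders":  <esi:include src="/fragments/orders.json"/>,
    "banners": <esi:include src="/fragments/banners.json"/>
}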

The main "problem" is that as of Varnish 2.0.4, varnishd checks whether the content returned from the backend actually looks like markup. It does this by checking if the first character of the returned document is '<'; the goal is to avoid running ESI processing on binary data included in a text document. Since we're working with JSON instead of regular HTML/XML, the first character in our responses is not a '<', and the ESI statements did not get evaluated. You can however disable this check by changing the esi_syntax configuration setting: just add the preferred value to the arguments when starting varnishd. To disable the content check, add:

varnishd … -p esi_syntax=0x1

I'm not sure whether you can set this for just one handler / match in Varnish; since this solved our issue I haven't dug any deeper, but feel free to leave a comment if you have any experience with it.
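For reference, ESI processing itself is enabled in VCL, and that part can at least be scoped to specific requests. A minimal sketch, assuming Varnish 2.x syntax and a made-up URL pattern:

sub vcl_fetch {
    if (req.url ~ "^/combined/") {
        esi;  /* parse this response for ESI include tags */
    }
}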

A Quick Introduction to chmod and Octal Numbers

Someone asked today what the difference between chmod 777 and chmod 755 is, and hopefully this short, informal post will provide you with the answer (if you want to skip straight to the conclusion: man chmod).

Octal Numbers

The number you provide as an argument to chmod is an octal number telling chmod what access you want to give to a file (or a directory, device, etc: an entry on the file system). The number is in fact three discrete values: 7, 5 and 5. Each value corresponds to a set of three bits, each bit being either zero or one. Three bits make up a value from 0 to 7, hence an octal digit (a decimal digit goes from 0 to 9, an octal digit from 0 to 7, a binary digit from 0 to 1, and a hexadecimal digit from 0 to F (15)).

If you tried to count from 0 to 10 (decimal) in octal, it'd be: 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12. 12 in octal is the same value as 10 in decimal. The nice thing is that both octal and hexadecimal map very neatly onto binary numbers, each digit being exactly three or four bits.

The usual way to write an octal number in a programming language is to prefix it with a zero, such as 0755. This tells the compiler that the number is written in octal notation, and the value is then parsed as such. chmod parses all numbers as octal, and actually handles four digits. Since missing digits are considered to be zero, the first digit is usually not included (or written simply as a zero, which makes it look the same as the representation used in certain programming languages). This first, usually unused digit has a special meaning: it sets the "set user id" (suid), "set group id" (sgid), "restricted deletion" or "sticky" attributes (you can read more about these options in the manual page).
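As a small illustration of octal modes in a programming language (a Python sketch with a placeholder file name; note that Python 3 writes octal literals as 0o755, while C and PHP use the plain leading zero):

# Octal file modes from code; the file name is just a placeholder.
import os

os.chmod("somefile", 0o755)   # rwxr-xr-x
os.chmod("somefile", 0o644)   # rw-r--r--
os.chmod("somefile", 0o4755)  # same as 755, plus the suid bit in the fourth digit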

File permissions

Now that we know what an octal number is, it’s time to look at how the file permissions work. Each file has three sets of permissions, one set for the user owning the file, one set for the group owning the file and one set for anyone else. If you want to take a look at these values on a unix based system, simply type ls -l to list files in a verbose way. Your result will look something like:

-rw-r--r--  1 mats mats        35 2008-08-23 20:24 IMPORTANTFILE

The permissions are listed in the first column, containing "-rw-r--r--". The first character indicates the file type: "-" for a regular file, "d" for a directory, "l" for a symbolic link and so on.

This leaves us with "rw-r--r--": the three sets of permissions. "rw-" is for the user owning the file, "r--" is for the group owning the file and the last "r--" is for anyone else (or 'other' as it's called). The "r" means read, the "w" means write and the currently missing letter is "x", which means execute (for files) or search (for directories). The "execute" setting is what lets bash (or another shell) attempt to run the file as a script, parsing the first line as the path to the interpreter for the file (e.g. #!/usr/bin/python), as in the small example below.
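A tiny (made-up) example of such a script; once it has been made executable, for instance with chmod u+x hello.py, the shell reads the first line to find the interpreter and runs the file with it:

#!/usr/bin/python
# hello.py - the execute bit plus this first line is what lets you run ./hello.py directly
print("Hello from an executable script")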

We have three flags (read, write, execute) that can be either on or off. This should remind us of three bits, either being 0 (not set) or 1 (set). And an octal digit is exactly three bits. This means that an octal digit maps exactly to the bit sequence needed to set permissions for a file. A 7 is “111”, a 5 is “101”, a 4 is “100” and so on. Mapping this to permissions:

7 = 111 = rwx
6 = 110 = rw-
5 = 101 = r-x
4 = 100 = r--
3 = 011 = -wx
2 = 010 = -w-
1 = 001 = --x
0 = 000 = ---

When calling chmod 755 on a directory we’re telling chmod to “set the read, write and search bits for me, the read and search bits for the group and the read and search bits for other users” (‘search’ for directories, ‘execute’ for files).

Another example is 644 that maps to 110 100 100, which again maps to "rw-r--r--" which usually is the standard access mode for files (and 755 for directories).
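If you want to play with the mapping yourself, a few lines of Python will print the same strings (purely illustrative, and it only covers the last three digits, so no suid/sticky bits):

# Turn an octal mode such as 0o755 into the rwx string ls would show.
def rwx(mode):
    flags = "rwxrwxrwx"
    return "".join(flag if mode & (1 << (8 - i)) else "-"
                   for i, flag in enumerate(flags))

print(rwx(0o755))  # rwxr-xr-x
print(rwx(0o644))  # rw-r--r--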

Handling Permissions With Symbols

I’m now going to eliminate the need for remembering everything I’ve written so far in the post, but at least you’ll know what people are talking about when they’re telling you to chmod something this-or-that.

You can also use the symbols directly with chmod, either adding, removing or setting the permissions for one of the three groups.

Examples:

To remove all access for other users (but leaving group and user intact)
chmod o-rwx file

To give everyone read access
chmod a+r file

To give everyone read (and search) access
chmod a+rx directory

To set particular user modes for each group
chmod u=rw,g=w,o=w file (a file that only the owner can read, but everyone can write to)

And with that I chmod this post a+r.

The Thumbs Up! of Awesome Approval

Every once in a while a few new, interesting tools surface and become a natural part of how a developer works. I've taken a look at the tools I've introduced into my regular workflow during the last six months.

NetBeans

NetBeans got the first version of what has become awesome PHP support in version 6.5, and after 6.7 was released just before the summer, things have become very stable. NetBeans is absolutely worth looking into for PHP development (and Java), and you sure can't beat the price (free!). In the good old days NetBeans was slow as hell, but I've not noticed any serious issues in 6.7 (.. although we didn't really have quad cores and 4GB of memory back then either). Go try it out today!

Balsamiq Mockups

Balsamiq is an awesome tool for making quick mockups of UI designs. Previously I'd play around in Adobe Photoshop, dragging layers around and being concerned with all the wrong things. Mockups abstracts away all the UI elements (and comes with a vast library of standard elements), which makes it very easy to experiment and to focus on usability instead of the design and its implementation. For someone who's more interested in the experience and the programming than the actual design (.. I'll know what I want when I see it!), this makes it easy to convey my suggestions and create small, visual notes of my own usability ideas.

You can try it out for free at their website, and they even give away licenses to people who are active in Open Source Development (disclaimer: I got a free license, but the experiences are all my own. This is not paid (or unpaid) advertising or product placement.)

GitHub

I've been playing around with git a bit, but after writing a patch for the PEAR module for Gearman (.. which still doesn't seem to have made it anywhere significant), I signed up for GitHub to be able to fork the project and submit my patch there. A very good technical solution, paired with an easy way of notifying the original developers of your patch (which you simply provide in your own branch) by submitting a "pull request", makes it very easy both to have patches supplied to you and to submit patches to projects hosted at GitHub.

Thumbs up!

UTF-8 and Putty

After making the move to UTF-8 for all applications in my daily routine a couple of weeks ago, putty was the last client to cause any sort of problem. I logged into my home server to read E-mail today, and putty apparently got everything mixed up as it tried parsing the UTF-8 encoded text as ISO-8859-1.

There is however a quick and easy solution in the more recent versions of putty (.. only a couple of years old?):


Menu -> Change Settings -> Window -> Translation -> Received data assumed to be in character set

Set that value to UTF-8, and everything will be A-OK!