New Adventures in Reverse Engineering

Before I go into the gory details of this post, I’ll start by saying that this method is probably not the right solution for you. This is not something you want to do if you have ready access to any source code or if you have an existing relationship with the 3rd party that provided the library you’re using. Do not do this. This is not for you.

With that out of the way, this is the part for those who are actually interested in getting down and dirty with Java, and maybe solving a problem that’s hard to solve otherwise.

The setting: We have a library for interfacing with another internal web service, where the library was provided in binary form by a 3rd party as part of the agreement when the service was delivered to us. The problem is that, for some unknown reason, this library is perfectly capable of understanding UTF-8, both as input from us and as input from the web service, but all web-related methods in the result class return data encoded as ISO-8859-1. The original solution was to keep the query in two different parts of the query string: the original UTF-8 query in one particular key, and the ISO-8859-1 version in the key the library reads. This requires loads of special casing, manually handling that single parameter, and so on. It works to a certain degree as long as the library is the only component in the mix. The trouble really began to surface when we started querying other services based on the same query: we would then have to special case all methods that were used in URLs, as they returned ISO-8859-1, while all the other libraries and services use UTF-8.
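To make the special casing concrete, here’s a rough sketch of the kind of query building this forces on you. The parameter names and the helper itself are made up for illustration and are not part of the actual library:

    import java.net.URLEncoder;

    // Hypothetical helper illustrating the workaround described above: the same
    // query value is carried twice, URL-encoded as UTF-8 for everything else and
    // as ISO-8859-1 for the key the library reads. Parameter names are made up.
    public final class QueryBuilder {
        public static String buildQuery(String query) throws Exception {
            return "q=" + URLEncoder.encode(query, "UTF-8")
                 + "&q_legacy=" + URLEncoder.encode(query, "ISO-8859-1");
        }
    }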

The library has since been made into a separate product with a hefty price tag, so upgrading the library was not an acceptable solution for us. Another solution had to be found, and this is where things start to get interesting.

Writing a proxy class to handle the encoding issue transparently

This was the solution we attempted first, but it requires us to implement quite a few methods, add additional code to the method that provides access to the library, and extend and embrace parts of the object. A single class could be handled quite easily by changing one method to call super.methodName() and return a re-encoded result, but we would have to change several classes (these objects live 3-4 levels down in the result object from the library), which adds both developer and runtime overhead. Not good.
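To illustrate the idea, here’s a minimal sketch of such a proxy. The class and method names (SearchResult, getUrl()) are hypothetical, and the exact conversion would depend on how the value actually comes out mis-encoded:

    import java.nio.charset.StandardCharsets;

    // Hypothetical proxy: extend the result class and repair the encoding on the
    // way out. Every exposed method would need a wrapper like this, and the real
    // objects live 3-4 levels down in the result tree, which is why we gave up
    // on this approach.
    public class Utf8SearchResult extends SearchResult {
        @Override
        public String getUrl() {
            return fixEncoding(super.getUrl());
        }

        private static String fixEncoding(String value) {
            // Placeholder for the actual ISO-8859-1 -> UTF-8 repair
            return new String(value.getBytes(StandardCharsets.ISO_8859_1),
                              StandardCharsets.UTF_8);
        }
    }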

Decompiling the library

The next step was to decompile the library to see how its code actually worked. This proved to be a good way to find out how we could possibly solve the issue. We could try to fix the issue in the decompiled source and then recompile the library, but some of the class files were too new for jad to decompile them completely. The decompilation did, however, show the problem in the code:

    if (encoding != null)
    {
        return encoding.toString();
    }

    return "ISO-8859-1";

This was neatly located in a helper method that ran on every property used when generating a query string. The encoding variable is retrieved from a global settings object, only accessible from within the library. This object is empty in our version of the library, so not much help there. But here’s the little detail that leads into the next part, and that actually made this hack possible: “ISO-8859-1” is a constant. This means that it gets neatly tucked away as a UTF-8 string in the constant pool when the class file is generated. Let’s get down and dirty.

Binary patching the encoding in the class file

We’ll start by taking a look at a hex dump of our class file, after searching for the string “ISO” in the ASCII representation (“ISO” in UTF-8 is identical to its ASCII representation):

[Image: hex dump of the class file, with the bytes of “ISO-8859-1” highlighted]

I’ve highlighted the interesting part where “ISO-8859-1” is stored in the file. This is where we want to make our surgical incision and have the method return the string “UTF-8” instead. There is one important thing you should be aware of if you’ve never done any hex editing of files before: the byte offsets of the different parts of the file can be very important. Sadly, the strings “UTF-8” and “ISO-8859-1” have different lengths, which would require us to either delete the bytes left over after “UTF-8” or pad with spaces instead (“UTF-8     ”). The first option might leave the rest of the file skewed; the latter might not work if the code using the value doesn’t trim the string first.
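(As an aside, you don’t strictly need a hex editor to find the spot. A small sketch like the following, with an assumed class file name, prints the offset of every occurrence of the literal:)

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Print the byte offset of every occurrence of "ISO-8859-1" in the class
    // file. The file name is an assumption made for the sake of the example.
    public final class FindLiteral {
        public static void main(String[] args) throws Exception {
            byte[] clazz = Files.readAllBytes(Paths.get("QueryStringHelper.class"));
            byte[] needle = "ISO-8859-1".getBytes(StandardCharsets.US_ASCII);

            for (int i = 0; i <= clazz.length - needle.length; i++) {
                boolean match = true;
                for (int j = 0; j < needle.length; j++) {
                    if (clazz[i + j] != needle[j]) {
                        match = false;
                        break;
                    }
                }
                if (match) {
                    System.out.printf("Found \"ISO-8859-1\" at offset 0x%X%n", i);
                }
            }
        }
    }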

To solve this issue, we turn to our good friend VM Spec: The class File Format, which contains all the details of how the class file format is designed. Interesting parts:

In the ClassFile structure:

cp_info constant_pool[constant_pool_count-1];

As we’re looking at a constant, this is where it should be stored. The cp_info is defined as:

cp_info {
    u1 tag;
    u1 info[];
}

The tag contains the type of the constant, while the contents of the info[] array vary depending on that type. If we take a look at the table in Chapter 4.4, we see that the identifier for a UTF-8 string constant is:

CONSTANT_Utf8 	1

So we should have the value 1 in the byte describing this constant (as the actual byte value, not the ASCII character). If the value is one, the complete structure is:

    CONSTANT_Utf8_info {
        u1 tag;
        u2 length;
        u1 bytes[length];
    }

The tag should be 1 as the byte value; the length is two bytes giving the length of the actual string stored (since the length is stored in two bytes (u2), a string can be at most 2^16 - 1 = 65535 bytes in total). After the length, we should have length bytes of UTF-8 data.

If we go back to our hex dump, we can now make more sense of the data we’re seeing:

[Image: hex dump showing the tag byte (0x01) and the two length bytes (0x00 0x0A) in front of “ISO-8859-1”]

The byte shown as 0x01 is the value 1 for the tag of the structure. The bytes 0x00 0x0A are the two bytes making up the length of the string:

    0000 0000 0000 1010 binary = 10 decimal

    ISO-8859-1
    1234567890 

This shows that the length of our string “ISO-8859-1” is 10 bytes in UTF-8, which matches the value stored in the two length bytes of the structure.
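To double-check the reading, the same structure can be decoded programmatically. A small sketch, assuming the class file name used earlier and that the offset of the tag byte (three bytes before the “I” of “ISO-8859-1”) is passed on the command line:

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Decode the CONSTANT_Utf8_info entry at a given offset and print its parts.
    // The class file name is an assumption; the offset is the position of the
    // tag byte, i.e. three bytes before the start of the string in the hex dump.
    public final class ReadUtf8Constant {
        public static void main(String[] args) throws Exception {
            byte[] clazz = Files.readAllBytes(Paths.get("QueryStringHelper.class"));
            int tagOffset = Integer.decode(args[0]);   // e.g. "0x1a2b"

            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(clazz, tagOffset, clazz.length - tagOffset));
            int tag = in.readUnsignedByte();           // should be 1 (CONSTANT_Utf8)
            int length = in.readUnsignedShort();       // big-endian u2, 10 for "ISO-8859-1"
            byte[] bytes = new byte[length];
            in.readFully(bytes);                       // the string data itself

            System.out.println("tag=" + tag + " length=" + length
                    + " value=" + new String(bytes, StandardCharsets.UTF_8));
        }
    }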

Heading back to our original goal: changing the string that is stored. We replace the ten bytes of “ISO-8859-1” with the five bytes of “UTF-8”, removing the five bytes that are left over, and then change the stored length of the string to match:

    00 0A becomes
    00 05
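If you’d rather script the edit than do it by hand in a hex editor, here is a sketch along the same lines, again assuming the class file name and taking the offset of the tag byte as an argument:

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Apply the patch programmatically: keep the tag byte, write the new length
    // (0x00 0x05) and the bytes of "UTF-8", and drop the old length and string
    // so no leftover bytes remain. The class file name is an assumption.
    public final class PatchEncoding {
        public static void main(String[] args) throws Exception {
            Path path = Paths.get("QueryStringHelper.class");
            int tagOffset = Integer.decode(args[0]);            // offset of the 0x01 tag byte

            byte[] clazz = Files.readAllBytes(path);
            int oldLength = ((clazz[tagOffset + 1] & 0xFF) << 8)
                          | (clazz[tagOffset + 2] & 0xFF);      // 0x00 0x0A = 10
            byte[] newValue = "UTF-8".getBytes(StandardCharsets.UTF_8);

            ByteArrayOutputStream out = new ByteArrayOutputStream(clazz.length);
            out.write(clazz, 0, tagOffset + 1);                 // everything up to and including the tag
            out.write(newValue.length >> 8);                    // new length, high byte: 0x00
            out.write(newValue.length & 0xFF);                  // new length, low byte:  0x05
            out.write(newValue, 0, newValue.length);            // the string "UTF-8"
            int rest = tagOffset + 3 + oldLength;               // skip the old length and string
            out.write(clazz, rest, clazz.length - rest);        // everything after the old entry
            Files.write(path, out.toByteArray());
        }
    }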

We save our changes and re-create the jar file, with all the previous classes and our patched one.
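Rebuilding the jar can be done with the jar tool, or with a few lines of code if you want the whole patch scripted. All file and entry names below are assumptions:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import java.util.zip.ZipOutputStream;

    // Copy every entry from the original jar into a new one, swapping in the
    // patched class along the way. File and entry names are assumptions.
    public final class RebuildJar {
        public static void main(String[] args) throws Exception {
            String patchedName = "com/vendor/QueryStringHelper.class";
            byte[] patchedBytes = Files.readAllBytes(Paths.get("QueryStringHelper.class"));

            try (ZipInputStream in = new ZipInputStream(new FileInputStream("library.jar"));
                 ZipOutputStream out = new ZipOutputStream(new FileOutputStream("library-patched.jar"))) {
                ZipEntry entry;
                byte[] buf = new byte[8192];
                while ((entry = in.getNextEntry()) != null) {
                    out.putNextEntry(new ZipEntry(entry.getName()));
                    if (entry.getName().equals(patchedName)) {
                        out.write(patchedBytes);               // swap in the patched class
                    } else {
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            out.write(buf, 0, n);              // copy the original entry unchanged
                        }
                    }
                    out.closeEntry();
                }
            }
        }
    }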

After inserting our new JAR file into our Maven repository as a new build and updating our local repository, we now have complete UTF-8 support from start to finish. Yey!