svn: Can’t convert string from native encoding to ‘UTF-8’

The error “svn: Can’t convert string from native encoding to ‘UTF-8’” suddenly made it impossible to update one of the projects on our staging servers. The project contains loads of files under SVN control, and several data directories which up to this point weren’t svn:ignore’d. One of the files in one of these directories had Norwegian letters in ISO-8859-1 in its filename (which didn’t work in the project anyhow, as it was something left around from earlier).

This single file kept svn from updating or doing anything useful with the files that actually were under SVN control. When Subversion analyzed the directory structure to check which files it should attempt to update, it would bail out with the error message about the file name not being in UTF-8 before even reaching any versioned files. You’d think it would be better to ignore errors for filenames that aren’t part of the repository and that you’re not trying to add, but there’s probably a good reason for this behaviour.

Anyways: the solution was to delete the file. We weren’t using it anyway. There’s also a good chapter in the SVN Book about localization issues, which contains information on how you can solve the issue by changing your active character set instead.
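If you’re not sure which file is the offender, one way to hunt it down (assuming GNU find, with the C locale forcing a byte-wise match) is to list filenames containing bytes outside printable ASCII:

LC_ALL=C find . -name '*[! -~]*'

Delete (or rename) whatever turns up, and svn should be back in business.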

Supporting 2-pass Parallel Encoding with x264 and ffmpeg

If you’re doing several encodes of a single input file (to encode several different size / bitrate combinations) in parallel with x264, you’re going to have a problem. The first pass will create three files with information for the second pass, and you’re unable to change these file names into something better. Judging by a Google search for the issue, this is a problem for quite a lot of people, and no one seems to have a proper solution.

I have one. Well, probably not a proper solution, but at least it works! The trick is to realize that ffmpeg/x264 creates these files in the current working directory. To run several encodings in parallel, you simply have to give each encoding process its own directory, and then use absolute paths to the source and destination files (and any other paths). Let it create the stats files there, then clean up and delete the directory afterwards.

I’ve included some example PHP code showing how you could solve something like this. I simply use the output file name as the directory name, and create the directory in the system temp directory.

// Give this encode its own working directory under the system temp
// dir, named after the output file to keep it unique.
$tempDir = sys_get_temp_dir() . '/' . $outputFilename;
mkdir($tempDir, 0700, true);
chdir($tempDir);
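
Between the setup above and the cleanup below goes the actual encode. A rough sketch of what that could look like (the exact flag names vary between ffmpeg versions, and $inputFile / $outputFile are assumed to be absolute paths):

// First pass: analysis only; the video output is discarded, but the
// stats files are written to the current (per-encode) directory.
exec('ffmpeg -y -i ' . escapeshellarg($inputFile) .
     ' -an -vcodec libx264 -b 800k -pass 1 -f rawvideo /dev/null');

// Second pass picks up the stats files from the same directory.
exec('ffmpeg -y -i ' . escapeshellarg($inputFile) .
     ' -vcodec libx264 -b 800k -pass 2 ' . escapeshellarg($outputFile));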

After doing the encode, we’ll have to clean up. The three files that ffmpeg/x264 creates are ffmpeg2pass-0.log, x264_2pass.log and x264_2pass.log.mbtree.

// Step back out of the directory first; some systems refuse to
// remove the current working directory.
chdir(sys_get_temp_dir());

// Remove the stats files from both passes, then the directory itself.
unlink($tempDir . '/ffmpeg2pass-0.log');
unlink($tempDir . '/x264_2pass.log');
unlink($tempDir . '/x264_2pass.log.mbtree');
rmdir($tempDir);

And that should hopefully solve it!
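
For the parallel part itself, nothing fancy is needed. Assuming the snippets above live in a CLI script (encode.php is a hypothetical name here) that takes absolute input and output paths, you can just start several processes:

php encode.php /data/source.avi large.mp4 &
php encode.php /data/source.avi small.mp4 &
wait

Each process sets up and tears down its own temp directory, so the stats files never collide.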

What Happened To My Beautiful En-dashes?!

First, a small introduction to the problem: we’re running stuff in UTF-8 all the way. A few sites we’re reading feeds from are using ISO-8859-1 as their charset, but they either supply the feed with the correct encoding specified or the feeds arrive as UTF-8. Everything works nicely, except for the mentioned-in-the-headline en-dashes. Firefox only shows 00 96 (0x00 0x96), but everything looks correct when you view the headlines and similar content on the original site.

Strange.

The digging, oh all the digging.

After the already mentioned digging (yes, the digging) in data at the large search engines (OK, maybe I did a search or two), I discovered that the Windows cp1252 encoding uses 0x96 to store en-dashes. That seems familiar! We’re seeing 0x96 as one of the byte values above, so apparently cp1252 is sneaking into the mix somewhere along the line. Most of the clients using the CMSes run Windows, so they’re probably the ones to blame.

ISO-8859-1 enters the scene

As the sites (and feeds) declare ISO-8859-1 as their encoding, I thought it would be interesting to see what ISO-8859-1 defines as the representation of the byte value 0x96. Lo and behold: 0x96 is not defined as a character in ISO-8859-1 at all (the 0x80–0x9F range is left for control codes). Which actually provides us with the solution.

I welcome thee, Mr. Solution

When the ISO-8859-1 encoded string is converted into UTF-8, each byte with the value 0x96 (which is the en-dash in cp1252) is simply converted to the valid UTF-8 sequence 0xC2 0x96, which represents a control character rather than anything printable.

We’re saying that the string is ISO-8859-1, although in reality it is either cp1252 or a mangled mix of ISO-8859-1 and cp1252 (for the en-dashes, at least).
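
If you control the decoding step, the cleanest fix is to treat the input as Windows-1252 instead of ISO-8859-1 in the first place, since cp1252 assigns printable characters to the 0x80–0x9F range that ISO-8859-1 leaves undefined. A sketch, assuming $raw holds the undecoded feed data:

// Decode as Windows-1252, so 0x96 becomes a real en-dash (U+2013)
// instead of the invisible control character U+0096.
$data = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');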

If you’re on the parsing end of this mumbo jumbo, another solution is to replace the generated UTF-8 sequence (0xC2 0x96, converted from 0x96 in ISO-8859-1) with the proper en-dash sequence (0xE2 0x80 0x93):

$data = str_replace("\xc2\x96", "\xE2\x80\x93", $data);

And voilà, everything works.

UTF-8 and Putty

After making the move to UTF-8 for all applications in my daily routine a couple of weeks ago, PuTTY was the last client to cause any sort of problem. I logged into my home server to read E-mail today, and PuTTY apparently got everything mixed up as it tried to parse the UTF-8 encoded text as ISO-8859-1.

There is, however, a quick and easy solution in more recent versions of PuTTY (.. only a couple of years old?):

Menu -> Change Settings -> Window -> Translation -> Received data assumed to be in character set

Set that value to UTF-8, and everything will be A-OK!

Getting ÆØÅ to Work in mutt / PuTTY

After reinstalling the server (see the previous post), mutt didn’t show the Norwegian letters ÆØÅ properly any longer (.. and yes, I use mutt to read my E-mail. Nothing else comes close.) .. The issue was apparently related to the current locale settings, but a quick check showed things to be perfectly valid (.. although not UTF-8, but that’s another issue):

mats@computer:~$ locale
LANG=nb_NO.iso88591
LC_CTYPE="nb_NO.iso88591"
LC_NUMERIC="nb_NO.iso88591"
LC_TIME="nb_NO.iso88591"
LC_COLLATE="nb_NO.iso88591"
LC_MONETARY="nb_NO.iso88591"
LC_MESSAGES="nb_NO.iso88591"
LC_PAPER="nb_NO.iso88591"
LC_NAME="nb_NO.iso88591"
LC_ADDRESS="nb_NO.iso88591"
LC_TELEPHONE="nb_NO.iso88591"
LC_MEASUREMENT="nb_NO.iso88591"
LC_IDENTIFICATION="nb_NO.iso88591"

Why didn’t mutt show the proper letters then? Everything seemed to be OK .. Instead, it just kept showing “?” wherever any of ÆØÅ should appear.

Well, the settings are one thing, but if the locale itself isn’t available on the system, things ain’t gonna be any better. So let’s fix that:

apt-get install locales

And .. well, now we at least have the locale definitions available, but before we can use them, we need to generate the binary version. Find /etc/locale.gen and open the file in a suitable editor.

Find the line for the locale you’re using and uncomment it:

# nb_NO ISO-8859-1
# nb_NO.UTF-8 UTF-8

becomes:

nb_NO ISO-8859-1
nb_NO.UTF-8 UTF-8

Then run ‘locale-gen’ as root. Wait a few seconds and the locales will be generated. Run mutt. Be happy.
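
To verify that the locale actually got built, ‘locale -a’ lists every locale available on the system:

locale -a | grep nb_NO

If nb_NO shows up there, mutt (and everything else) should be able to use it.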