Making Friends With Emacs and UTF-8

After struggling with UTF-8 support in Emacs earlier (I .. well .. err .. use Emacs as my .. mail editor), I finally found a configuration that works with both PuTTY and gterm as my terminals.

The important settings I ended up using in my .emacs:

;; Use UTF-8 as the preferred / default coding system everywhere
(prefer-coding-system       'utf-8)
(set-default-coding-systems 'utf-8)

;; Make the terminal and keyboard talk UTF-8 as well
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)

;; Ask X for UTF-8 when pasting from other applications
(setq x-select-request-type '(UTF8_STRING COMPOUND_TEXT TEXT STRING))

Archived here for my own future reference.

svn: Can’t convert string from native encoding to ‘UTF-8’

The error “svn: Can’t convert string from native encoding to ‘UTF-8’” suddenly made it impossible to update one of the projects on our staging servers. The project contains loads of files under SVN control, plus several data directories that up to this point hadn’t been svn:ignore’d. One of the files in those directories had Norwegian letters in ISO-8859-1 in its filename (a leftover from earlier that didn’t work in the project anyhow).

This single file kept svn from updating or doing anything useful with the files actually under version control. When Subversion scanned the directory structure to figure out which files to update, it barfed on the non-UTF-8 filename before ever getting to the real files. You’d think it would be smarter to ignore encoding errors for filenames that aren’t under version control and that you’re not trying to add, but there’s probably a good reason for this behaviour.

Anyways, the solution: delete the file. We didn’t use it anyway. The SVN Book also has a good chapter on localization issues which explains how to work around the problem by changing your active character set (locale).
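
If you’d rather locate the offending file(s) than hunt by hand, here’s a quick, hypothetical PHP sketch (not from the original post; assumes the mbstring extension) that walks a working copy and lists filenames that aren’t valid UTF-8:

&lt;?php
// Walk the working copy and report filenames that aren't valid UTF-8.
$root = isset($argv[1]) ? $argv[1] : '.';

$iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
);

foreach ($iterator as $file) {
    $name = $file->getFilename();
    if (!mb_check_encoding($name, 'UTF-8')) {
        // Print the raw bytes in hex, since the broken name may not display cleanly
        echo bin2hex($name) . "  " . $file->getPathname() . "\n";
    }
}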

What Happened To My Beautiful En-dashes?!

First, a small introduction to the problem: we’re running everything in UTF-8 all the way through. A few of the sites we read feeds from use ISO-8859-1 as their charset, but they either declare the correct encoding in the feed or deliver it as UTF-8. Everything works nicely, except for the en-dashes mentioned in the headline. Firefox just shows 00 96 (0x00 0x96) where the dash should be, even though the headlines and similar content look correct on the original site.


The digging, oh all the digging.

After the already mentioned digging (yes, the digging) through data at the large search engines (ok, maybe I did a search or two), I discovered that the Windows cp1252 encoding uses 0x96 for en-dashes. That looks familiar! We’re seeing 0x96 as one of the byte values above, so cp1252 is apparently sneaking into the mix somewhere along the line. Most of the clients using the CMSes are on Windows, so they’re the likely culprits.
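
If you want to confirm which representation you’re actually getting, dumping the raw bytes makes it obvious (a quick, hypothetical sketch; $headline stands in for whatever string you pulled from the feed):

// A cp1252 en-dash shows up as "96"; a proper UTF-8 en-dash as "e28093".
echo bin2hex($headline), "\n";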

ISO-8859-1 enters the scene

Since the sites (and feeds) declare ISO-8859-1 as their encoding, I thought it would be interesting to see what ISO-8859-1 says the byte value 0x96 represents. Lo and behold: 0x96 has no printable character in ISO-8859-1 (the 0x80-0x9F range is reserved for C1 control codes). Which actually points us toward the solution.

I welcome thee, Mr. Solution

When the supposedly ISO-8859-1 encoded string is converted to UTF-8, each byte with the value 0x96 (the en-dash in cp1252) is mapped straight to the code point U+0096, a C1 control character, and comes out as the perfectly valid UTF-8 sequence 0xC2 0x96, which no browser will render as a dash.
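
You can see the mis-conversion happen with a one-liner (a hypothetical sketch; mb_convert_encoding could just as well be iconv here):

// 0x96 read as ISO-8859-1 becomes U+0096, i.e. the bytes C2 96 in UTF-8
echo bin2hex(mb_convert_encoding("\x96", 'UTF-8', 'ISO-8859-1')), "\n"; // "c296"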

We claim the string is ISO-8859-1, while in reality it is either cp1252 or a mangled mix of ISO-8859-1 and cp1252 (for the en-dashes, at least).

If you’re on the receiving end of this mumbo jumbo, one solution is to replace the generated UTF-8 sequence (0xC2 0x96, converted from 0x96 as if it were ISO-8859-1) with the proper en-dash sequence (0xE2 0x80 0x93):

$data = str_replace("\xc2\x96", "\xE2\x80\x93", $data);

And voilà, everything works.
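
Note that the str_replace only handles the en-dash; other cp1252-only bytes (smart quotes at 0x91-0x94, the em-dash at 0x97, the euro sign at 0x80, and so on) get mangled the same way. If the source really is cp1252 throughout, a cleaner fix is to decode it with the right charset up front. A hedged sketch, assuming $raw holds the undecoded feed data and your mbstring build knows the Windows-1252 mapping (iconv('CP1252', 'UTF-8', $raw) is an alternative):

// Decode as Windows-1252 instead of ISO-8859-1, so 0x96 becomes a real en-dash
$data = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');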