Encoding, UTF-8 and ISO-8859-1

From time to time, this happens: your web server sends some content with a given character encoding. A proxy server passes that content on to a web browser, adding, removing or modifying a header that some admin has configured (with good intentions). The user’s browser picks up the content and some characters appear broken, perhaps looking like squares.

I’ve only seen this with UTF-8 content (and headers) being re-served as ISO-8859-1, but I’m sure other combinations appear in the wild.
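To make the failure concrete, here is a minimal Python sketch of what the browser ends up doing when UTF-8 bytes arrive labelled as ISO-8859-1. The sample word is my own choice for illustration; any non-ASCII character behaves the same way.

```python
# Text as the origin server encodes it: UTF-8 bytes on the wire.
original = "café"
wire_bytes = original.encode("utf-8")

# A browser that is told "charset=ISO-8859-1" treats every byte as one character,
# so the two-byte UTF-8 sequence for "é" turns into two separate characters.
misread = wire_bytes.decode("iso-8859-1")

# As long as no bytes were altered along the way, reversing the steps repairs it.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(misread)
print(repaired)
```

The round trip only works because ISO-8859-1 maps every byte value to some character, so nothing is lost by the wrong decoding itself.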

What’s one to do? The real fix, obviously, is to make sure the proxy server neither recodes content nor modifies headers it shouldn’t touch. One such header is Content-Type, which declares the character encoding of the response.
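One way to check whether a proxy has touched Content-Type is to compare the charset the origin declares with the charset the browser receives. The sketch below uses Python's standard library to parse the header; the helper name and the two header values are hypothetical examples, not output captured from a real server.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Hypothetical values: what the origin sends and what arrives behind the proxy.
origin_header = "text/html; charset=UTF-8"
received_header = "text/html; charset=ISO-8859-1"

header_was_rewritten = declared_charset(origin_header) != declared_charset(received_header)
print(header_was_rewritten)
```

In practice the two header values would come from requesting the same URL directly from the origin and through the proxy.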

Another option is to change the content at its source and send it in the encoding that the proxy server expects. This can be done by recoding the characters into the proxy-expected character encoding and making sure that the content, on the originating server, is served with matching headers.

iconv -f utf-8 -t iso-8859-1 input.txt

This example is available at GitHub.
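The same recoding can be sketched in Python, which also makes one caveat visible: ISO-8859-1 covers far fewer characters than UTF-8, so some input cannot be converted. The function name and its errors parameter are my own choices for illustration.

```python
def recode_utf8_to_latin1(data: bytes, errors: str = "strict") -> bytes:
    """Decode UTF-8 bytes and re-encode them as ISO-8859-1.

    With errors="strict", a character that has no ISO-8859-1 equivalent raises
    UnicodeEncodeError; with errors="replace" it becomes "?" instead.
    """
    text = data.decode("utf-8")
    return text.encode("iso-8859-1", errors=errors)


# "é" exists in ISO-8859-1, so two UTF-8 bytes become one byte.
converted = recode_utf8_to_latin1("é".encode("utf-8"))

# The euro sign does not exist in ISO-8859-1, so it has to be replaced.
lossy = recode_utf8_to_latin1("€".encode("utf-8"), errors="replace")
```

Whether to fail loudly or substitute a placeholder depends on the content; for anything user-facing, failing during the build is probably safer than silently serving question marks.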

You may also run into the situation where a character is split up into two glyphs: a + ^ = â. This is not really a byte encoding issue, but rather a question of whether a letter and its accent are stored as two code points (a base letter followed by a combining mark) or as one precomposed character. Converting between the two forms is called Unicode normalization, which is touched upon in a blog post over at Softwareschneiderei.
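A short Python sketch of the normalization case, using the standard unicodedata module. The variable names are illustrative; NFC and NFD are the standard composed and decomposed normalization forms.

```python
import unicodedata

# "a" followed by U+0302 COMBINING CIRCUMFLEX ACCENT: two code points.
decomposed = "a\u0302"

# NFC composes the pair into the single precomposed character U+00E2.
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and splits it up again.
split_again = unicodedata.normalize("NFD", composed)

print(len(decomposed), len(composed))
```

Both forms are meant to display identically, but they compare as different strings, so normalizing to one form before storing or comparing text avoids surprises.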