Re: A question on character sets


 

Dano,

> I guess part of what confuses me is why, if the message is composed
> in this web page window, which seems to understand UTF-8, and it goes
> to my machine, which also understands UTF-8, does it pick up the
> circumflex A [ Â ] somewhere in between?

I suspect the problem happened right on your machine, not somewhere in
between. That would be the case if you read your email with an older
email client that does not itself recognize UTF-8.

That is, your message as it reached me had the header field:

Content-Type: text/plain; charset="utf-8"

The email reader I'm using right now (Thunderbird with Eudora OSE
extensions) does recognize utf-8, and displays your message properly.
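
If you want to see what a client has to go on, here is a minimal Python sketch that reads the declared character set from that header field. It uses the standard library's email module; the message body and the variable names are made up for illustration, and this is not a claim about how Thunderbird or Eudora do it internally.

```python
from email import message_from_string

# A stand-in message carrying the same header field as quoted above;
# the body text is invented for this example.
raw_message = 'Content-Type: text/plain; charset="utf-8"\n\nJust a test.\n'

msg = message_from_string(raw_message)

# get_content_charset() returns the charset parameter in lower case,
# or None when the header does not declare one.
declared = msg.get_content_charset()
print(declared)
```

A client that knows the declared character set decodes the body with it; one that does not has to fall back to whatever it does know.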

My other email client, Eudora Classic, does not recognize utf-8 as a
character set and falls back to ISO/IEC 8859-1. In UTF-8 these
extended characters (beyond the basic 7-bit ASCII characters) are
transmitted as multi-byte sequences. So in the email body the Â, ÿ or ®
characters are each transmitted as a two-byte sequence. But my poor old
Eudora knows nothing of multi-byte sequences, so it displays each byte
as a separate character; the first byte happens to come out as Â for ®
and the other characters near it in Unicode value (U+0080 through
U+00BF).
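
The effect is easy to reproduce. The following Python sketch encodes a few characters as UTF-8 and then misreads the bytes one at a time as ISO-8859-1, which is roughly what I am assuming a client unaware of UTF-8 does. The function name misread_as_latin1 is only an illustration.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode each byte as an ISO-8859-1 character."""
    return text.encode("utf-8").decode("iso-8859-1")


for ch in ["®", "ÿ", "Â"]:
    raw = ch.encode("utf-8")
    # repr() keeps any non-printing characters visible as escapes.
    print(ch, raw.hex(), repr(misread_as_latin1(ch)))
```

For ® the bytes are C2 AE, and byte C2 on its own is Â in ISO-8859-1, so the character shows up as Â®. Going the other way, encoding the misread text as ISO-8859-1 and decoding it as UTF-8 recovers the original.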

In other words, nobody "picked up" the Â; the byte that displays as Â
was in the message body from the beginning. Note that it doesn't matter
that my Win7 system is perfectly capable of handling Unicode - the
problem is that Eudora didn't know how to tell Windows to display those
characters properly.

Other Unicode characters (with higher code point numbers) take three or
four bytes to represent; four is the longest sequence current UTF-8
allows. Those show even more "extraneous" characters when displayed by
Eudora.

So my guess is that your problem comes from an email client that
neither recognizes UTF-8 nor knows how to deal with it in a message
body.
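
To illustrate the longer sequences mentioned above, here is a short Python sketch that counts the UTF-8 bytes for a few sample characters and how many characters a byte-at-a-time ISO-8859-1 reading would show instead. The sample characters and the helper name utf8_length are my own choices for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


samples = ["A", "®", "€", "😀"]

for ch in samples:
    # Each UTF-8 byte becomes one displayed character when misread.
    shown = ch.encode("utf-8").decode("iso-8859-1")
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), {len(shown)} character(s) when misread")
```

The plain ASCII letter survives unchanged, while the others turn into two, three and four characters respectively.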

-- Shal
