A question on character sets


 

I have what may be a dumb question - but that never stopped me before! I've used the Windows Alt Key Numeric Codes (ALT + XXXX on the numeric keypad) for many years, but they don't seem to work reliably in UTF-8. Is there a way in composing a message to key in the codes for the various other characters not on the keyboard so they can be used in UTF-8?
Dano


 

On Wed, Feb 18, 2015 at 7:01 AM, D R Stinson <dano@...> wrote:
I have what may be a dumb question - but that never stopped me before! I've used the Windows Alt Key Numeric Codes (ALT + XXXX on the numeric keypad) for many years, but they don't seem to work reliably in UTF-8. Is there a way in composing a message to key in the codes for the various other characters not on the keyboard so they can be used in UTF-8?

Ahh, alt codes. I used to use those to do line graphics on the screens of my BBS in the mid-1980s. Had them all memorized and everything. Ahem.

Not a dumb question. Unfortunately, I'm not familiar with Windows anymore, so I can't tell you how to enter Unicode characters if you're using that. On a Mac, it's fairly easy. In Settings, in Keyboard, check 'Show Keyboard & Character Viewers in menu bar'. That puts the character viewer in the menu bar. Click that and it brings up an app listing various Unicode characters that you can drag and drop into your message.

Here's a pointing finger: ☞

Mark


 

It's my understanding that what will appear depends on which character set is in effect. Windows uses the IBM 437 (OEM) set for Alt codes typed without a leading zero, and the Windows ANSI set for codes typed with one. I do know that the cent sign ¢ (Alt0162), degree sign ° (Alt0176), copyright © (Alt0169), and mu µ (Alt0181) are the ones I use the most and work fine. (I don't understand all I know about this!)
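A rough Python sketch of how the leading zero changes the result. It assumes a US/Western Windows setup where the OEM code page is cp437 and the ANSI code page is cp1252; `alt_code_to_char` is a made-up helper for illustration, not anything Windows exposes.

```python
def alt_code_to_char(digits: str) -> str:
    """Map the digits typed after ALT to a character (hypothetical helper)."""
    value = int(digits)
    if not 0 < value <= 255:
        raise ValueError("only codes 1-255 are modeled here")
    # Assumption: a leading zero selects the ANSI code page (cp1252 here),
    # no leading zero selects the OEM code page (cp437 here).
    codec = "cp1252" if digits.startswith("0") else "cp437"
    return bytes([value]).decode(codec)

# Same number, different character, depending on the leading zero.
for code in ("0162", "162", "0176", "0169", "0181"):
    print(f"ALT+{code} -> {alt_code_to_char(code)}")
```

A few byte values are unassigned in cp1252, so this sketch raises a decode error for those rather than guessing.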

Duane


 

Thanks, Duane -
That's what I'm used to as well. I can create a message in Windows with them, even composed as plain text, and they look fine when I send them. (Don't forget Registered Trade Mark (ALT+0174 ®) or the fractions ¼, ½, and ¾ (0188, 0189, and 0190).) But for some reason Y! screws them all up. I figured it was something I was doing wrong. I'm trying to send this from the web site.
Dano


 

Duane -
This is your post that arrived in my email. Note the 'capital A with
circumflex over it' before each symbol. That's what seems to happen with Y!.
I don't know what causes it, or why it seems to read okay on the Beta web
page.
Dano

----- Original Message -----

It's my understanding that what will appear depends on which character set
you have selected. The set that Windows uses by default for UTF-8 is the
IBM 437 keyboard. I do know that the cent sign Â¢ (Alt0162), degree sign Â°
(Alt0176), copyright Â© (Alt0169), and mu Âµ (Alt0181) are the ones I use
the most and work fine. (I don't understand all I know about this!)

Duane


 

Dano,

I have what may be a dumb question - but that never stopped me before!
I've used the Windows Alt Key Numeric Codes (ALT + XXXX on the numeric
keypad) for many years, but they don't seem to work reliably in UTF-8.
I thought alt codes were too old-school for Unicode. But it turns out that there's a registry value, EnableHexNumpad, that you can set to allow entering Unicode code points (in hex) directly.
http://en.wikipedia.org/wiki/Alt_code
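If you go the hex route you need the code point of each character you want. A small Python sketch for looking those up; `hex_code_point` is just an illustrative helper name, and the sample characters are ones that came up in this thread.

```python
import unicodedata

def hex_code_point(ch: str) -> str:
    """Return the four-digit (or longer) hex code point of one character."""
    return f"{ord(ch):04X}"

# Characters mentioned in this thread: registered sign, one quarter, pointing finger.
for ch in ("\u00ae", "\u00bc", "\u261e"):
    print(ch, "U+" + hex_code_point(ch), unicodedata.name(ch))

# Going the other way: from hex digits back to the character.
print(chr(int("00AE", 16)))
```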

Is there a way in composing a message to key in the codes for the
various other characters not on the keyboard so they can be used in
UTF-8?
An easier method (for those who don't want to memorize Unicode code points) is to use the Character Map application. That's under Accessories, System Tools in the Start menu. It is a frequent occupant of my recently used list in the Start menu.

-- Shal


 

Dano,

This is your post that arrived in my email. Note the 'capital A with
circumflex over it' before each symbol. That's what seems to happen with
Y!.
No, that's what happens with an email reader that interprets the message using an ISO code page instead of UTF-8. The A with circumflex is 0xC2 in the Windows Western code page, and 0xC2 is one of the common lead-in bytes of a UTF-8 byte sequence.
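The effect is easy to reproduce. A minimal Python sketch, assuming the reader falls back to the Windows Western code page (cp1252); `misread_utf8` is a hypothetical name for what such a reader effectively does.

```python
def misread_utf8(text: str, wrong_codec: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong code page."""
    return text.encode("utf-8").decode(wrong_codec)

# The four symbols from Duane's post: each becomes "A-circumflex + symbol".
for ch in ("\u00a2", "\u00b0", "\u00a9", "\u00b5"):
    raw = ch.encode("utf-8")
    print(ch, raw.hex(" ").upper(), "->", misread_utf8(ch))
```

Every character from U+0080 through U+00BF has 0xC2 as its first UTF-8 byte, which is why the same stray letter shows up in front of all of them.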

I'm very familiar with this problem as my email client, Eudora Classic, is too old for Unicode; so I see this all the time.

I don't know what causes it, or why it seems to read okay on the Beta
web page.
It will read OK on nearly any web system (where Unicode is the lingua franca) as well as in Thunderbird and other email clients that support UTF-8 message bodies.

-- Shal


 

Thanks, Shal -
I can enter them fine on my Win7 machine right on the Beta web page with the ALT and four digits. The problem comes as you can see below (above?). The character renders but is preceded by a circumflex A.
Dano


 

Except, as you can see [ ¼ — ½ – ¾ ° piñon ® © ] when I send it from the web page to the web page.
Dano


 

Dano,

I can enter them fine on my Win7 machine right on the Beta web page with
the ALT and four digits.
What happens when you do that (using four digits, including a leading zero) is that Windows looks up the letterform using that number in the Windows-1252 character set. So for example ALT+0174 is the Registered Sign ®
http://en.wikipedia.org/wiki/Windows-1252

Then Windows converts that letterform to its Unicode code point, U+00AE, which by no accident at all is 174 in decimal. That's no accident because the first 256 Unicode code points were deliberately chosen to match the prior popular code page, ISO/IEC 8859-1 (which is the same as Windows-1252 for most characters).

Then, when that letterform needs to be transmitted (as in email) its numeric value is converted to UTF-8 (Unicode Transformation Format, 8-bit). This is necessary because the message body might include Unicode code points greater than 255, which can't be represented in a single 8-bit byte. The UTF-8 byte sequence for U+00AE is 0xC2 0xAE.
http://en.wikipedia.org/wiki/UTF-8
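Those three steps can be traced in a few lines of Python. This is only a sketch: it assumes the Windows code page involved is the Western one (cp1252), and `utf8_two_byte` is a hand-rolled helper added here to show the bit layout, not a library function.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Hand-encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte form covers U+0080..U+07FF only")
    return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])

alt_code = 174                                # the digits typed as ALT+0174
char = bytes([alt_code]).decode("cp1252")     # step 1: code page lookup (assumed cp1252)
code_point = ord(char)                        # step 2: Unicode code point
wire_bytes = char.encode("utf-8")             # step 3: bytes in the message body

print(char, f"U+{code_point:04X}", wire_bytes.hex(" ").upper())
print(utf8_two_byte(code_point) == wire_bytes)
```

The hand-rolled encoder and Python's built-in one agree, giving the 0xC2 0xAE pair described above.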

So, when that gets displayed by an application that doesn't support UTF-8, the bytes are usually interpreted using one of the old code pages, and you see the two character sequence Â® (Capital A with circumflex, Registered Sign).

As an exercise for the reader, try sending ALT+0255; it should come out as 0xC3 0xBF (Capital A with tilde, Inverted Question Mark) on incompatible systems.
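For anyone who would rather check the exercise than send a test message, a short Python sketch, again assuming cp1252 on both the typing side and the misreading side:

```python
import unicodedata

typed = bytes([255]).decode("cp1252")    # what ALT+0255 produces (assumed cp1252)
raw = typed.encode("utf-8")              # bytes placed in the message body
shown = raw.decode("cp1252")             # what a non-UTF-8 reader displays

print(typed, raw.hex(" ").upper(), shown)
for ch in shown:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```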

Interesting. For ALT+0255 my Win7 system gives ÿ (Small Letter Y with Diaeresis), which is the ISO/IEC 8859-1 and Windows-1252 (Western) assignment. Windows-1251 is the Cyrillic code page, where byte 255 is я (Cyrillic Small Letter Ya), so the code page in play here is Windows-1252, not Windows-1251.
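A quick way to compare what byte 255 means in each of these code pages, using Python's built-in codecs (`decode_byte` is just an illustrative helper):

```python
import unicodedata

def decode_byte(value: int, codec: str) -> str:
    """Decode a single byte value with the named code page."""
    return bytes([value]).decode(codec)

for codec in ("latin-1", "cp1252", "cp1251"):
    ch = decode_byte(0xFF, codec)
    print(f"{codec:8} {ch} U+{ord(ch):04X} {unicodedata.name(ch)}")
```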

-- Shal
Note: this message is composed in an email client that will send it in an 8-bit code page, not UTF-8


 

Thanks, Shal, for explaining this. I am understanding part of it, even though I'm not really that tech savvy. I guess part of what confuses me is why, if the message is composed in this web page window, which seems to understand UTF-8, and it goes to my machine, which also understands UTF-8, does it pick up the circumflex A [ Â ] somewhere in between? (I want to see if that circumflex A symbol stays since I put it in with an ALT code.) I'm guessing that somewhere in between is a machine that's not just passing the raw code through. Maybe it's the dread NSA trying to read our secret mail? :-)
Dano

ALT+0255 gives me this [ ÿ ]



 

Dano,

I guess part of what confuses me is why, if the message is composed
in this web page window, which seems to understand UTF-8, and it goes
to my machine, which also understands UTF-8, does it pick up the
circumflex A [ Â ] somewhere in between?
I suspect the problem happened right in your machine not somewhere in
between - if you read your email with an older email client which itself
does not recognize UTF-8.

That is, your message as it arrived to me had the header field:

Content-Type: text/plain; charset="utf-8"
The email reader I'm using right now (Thunderbird with Eudora OSE
extensions) does recognize utf-8, and displays your message properly.

My other email client, Eudora Classic, does not recognize utf-8 as a
character set and defaults back to ISO/IEC 8859-1. In UTF-8 these
extended characters (beyond the basic 7-bit ASCII characters) are
transmitted as multi-byte sequences. So in the email body the Â, ÿ or ®
characters are transmitted as two-byte sequences. But my poor old Eudora
knows nothing of two-byte character sequences, so it displays each byte
as a character; and the first happens to come out as Â for ® and those
characters near it in Unicode value.

In other words, nobody "picked up" the Â, the byte representing it was
in the message body from the beginning. Note that it doesn't matter that
my Win7 system is perfectly capable of handling Unicode - the problem is
that Eudora didn't know how to tell Windows to display them properly.
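Here is a small Python sketch of the difference between a client that honors the charset in the header and one that ignores it. The message is a made-up two-header example, and decoding as latin-1 is only an assumption standing in for what an older client does.

```python
from email import message_from_bytes

# A tiny hand-built message; the body holds the UTF-8 bytes for the registered sign.
raw_message = (
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Registered: \xc2\xae\r\n"
)

msg = message_from_bytes(raw_message)
body_bytes = msg.get_payload(decode=True)   # the body as raw bytes
charset = msg.get_content_charset()         # charset named in the header

honors_header = body_bytes.decode(charset)      # UTF-8 aware client
ignores_header = body_bytes.decode("latin-1")   # assumed fallback of an old client

print(charset)
print(honors_header.strip())
print(ignores_header.strip())
```

The bytes in the body are identical in both cases; only the decoding step differs.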

Other Unicode characters (with higher code point numbers) take three
or four bytes to represent. Those have even more
"extraneous" characters when displayed by Eudora.

So my guess is that your problem comes about from an email client that
does not recognize UTF-8 nor know how to deal with it in a message body.
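To see how many "extraneous" characters each kind of character turns into, here is a hedged Python sketch. It assumes the client misreads the bytes as cp1252; `utf8_len` is an illustrative helper and the sample characters are arbitrary picks.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# ASCII letter, registered sign, pointing finger, and an emoji (U+1F600).
for ch in ("A", "\u00ae", "\u261e", "\U0001F600"):
    raw = ch.encode("utf-8")
    # errors="replace" guards against the few bytes cp1252 leaves unassigned.
    misread = raw.decode("cp1252", errors="replace")
    print(f"U+{ord(ch):04X} {utf8_len(ch)} byte(s) {raw.hex(' ').upper()} -> {misread}")
```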

-- Shal