Shal, it is just a little bit more complicated than I suspected.

I have a test group on groups.io, just like you do.

After extensive testing today, I have found that if I send a new message, regardless of what I try, I cannot get it to produce the garbage characters.

But, if I forward that original message that first caused all the trouble, it screws up every time.

So, apparently it has something to do with whatever else is in the message.

It must be that the quoted text sets something up so that it cannot cope with the double spaces and quotation marks.

Anyway, I received the message from Mark saying that he noticed it. Not sure if anything needs to be done, Mark; maybe it is just the nature of the beast, as they say.


It is fairly common for rich-text systems, such as those based on HTML, to turn successive spaces into non-breaking spaces. Such systems typically collapse multiple ordinary spaces into a single space when displaying text, so the conversion to non-breaking spaces preserves the original count of spaces.
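
As a rough illustration of that conversion, here is a minimal Python sketch. The helper names (preserve_space_runs, collapse_spaces) are made up for this example, and the exact replacement pattern varies between mail clients and editors, so treat it as one plausible scheme rather than what any particular system does.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def collapse_spaces(text: str) -> str:
    """Mimic HTML-style display: a run of ordinary spaces becomes one space."""
    return re.sub(r" {2,}", " ", text)


def preserve_space_runs(text: str) -> str:
    """Hypothetical helper: in each run of two or more spaces, replace all
    but the last space with NBSP so the run survives collapsing."""
    def repl(match: re.Match) -> str:
        run = match.group(0)
        return NBSP * (len(run) - 1) + " "
    return re.sub(r" {2,}", repl, text)


sample = "end.  Next"                      # two spaces after the period
collapsed = collapse_spaces(sample)        # the second space is lost
protected = preserve_space_runs(sample)    # NBSP followed by a space
survives = collapse_spaces(protected)      # nothing left to collapse
```

The protected text keeps its original length after collapsing, which is the point of the conversion, but it now contains a character outside plain ASCII.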

The unfortunate thing, in this case, is that the non-breaking space isn't part of the base 7-bit ASCII set that is common to most character sets. The result is that it gets misinterpreted when a system decodes it using a different character set than the original.
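
To make the mismatch concrete, the following Python sketch encodes a non-breaking space in two common character sets and then decodes each with the wrong one. The sample text is invented, and the choice of UTF-8, ISO-8859-1 and Windows-1252 is an assumption for illustration; I don't know which character sets were actually involved in the message that went wrong.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE
text = "end." + NBSP + " Next"

# NBSP has no 7-bit ASCII representation.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same character is a different byte sequence in each character set.
utf8_bytes = text.encode("utf-8")          # NBSP becomes the two bytes C2 A0
latin1_bytes = text.encode("iso-8859-1")   # NBSP becomes the single byte A0

# Decoding UTF-8 bytes as Windows-1252 shows a stray capital A with
# circumflex (U+00C2) in front of the non-breaking space.
garbled = utf8_bytes.decode("cp1252")

# Decoding ISO-8859-1 bytes as UTF-8 fails on the lone A0 byte; with
# errors="replace" it turns into the replacement character U+FFFD.
replaced = latin1_bytes.decode("utf-8", errors="replace")
```

Either way the reader sees an extra or substituted character where the second space should have been.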

The lesson for Groups.io is that messages posted by email have a specified character set, and that should be preserved in messages passed through. But where messages are displayed in the archive or built into digests, careful handling is required. Perhaps the best approach is to convert the text to UTF-8 encoding, but that has its own issues.

My recommendation for the Archives would be to preserve the original message text unaltered, and capture the character set encoding as metadata. Then convert to UTF-8 on the fly for display. If a message gets edited, update the metadata to reflect the character set used in the editing - presumably UTF-8. Of course, that means the metadata must be versioned along with the text.
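
A minimal Python sketch of that recommendation follows. The class and method names (MessageVersion, ArchivedMessage, store_original, edit, display_bytes) are hypothetical and say nothing about how Groups.io actually stores messages; the sketch only shows the raw bytes and the character set travelling together, version by version.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class MessageVersion:
    raw: bytes     # body bytes exactly as received or as saved by an edit
    charset: str   # character set that applies to those bytes


@dataclass
class ArchivedMessage:
    versions: List[MessageVersion] = field(default_factory=list)

    def store_original(self, raw: bytes, charset: str) -> None:
        """Keep the posted bytes unaltered, with their declared charset."""
        self.versions.append(MessageVersion(raw, charset))

    def edit(self, new_text: str) -> None:
        """An edit adds a new version; here edits are assumed to be UTF-8."""
        self.versions.append(MessageVersion(new_text.encode("utf-8"), "utf-8"))

    def display_bytes(self, version: int = -1) -> bytes:
        """Convert the chosen version to UTF-8 on the fly for display."""
        v = self.versions[version]
        return v.raw.decode(v.charset, errors="replace").encode("utf-8")


msg = ArchivedMessage()
msg.store_original(b"end.\xa0 Next", "iso-8859-1")   # as posted by email
shown = msg.display_bytes()                          # UTF-8 for the web page
msg.edit("end. Next")                                # moderator edit
```

Because each version carries its own charset, the original bytes can always be decoded correctly even after the message has been edited in a different encoding.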

