Channel: Active questions tagged utf-8 - Stack Overflow

How to handle this subtle ambiguity of UTF-8 BOM?

UTF-8 permits, but does not require, a byte order mark (BOM) at the beginning of a byte sequence. This seems to create a subtle ambiguity, because the BOM is itself the encoding of the Unicode character U+FEFF.

For example, what character string does the following UTF-8 byte sequence (shown in hexadecimal) represent?

EF, BB, BF, 42, 43, 44

It could represent the string "BCD" (3 characters), with the first three bytes (EF, BB, BF) treated as a BOM. This seems to be the usual interpretation.

However, it could also represent the string "[U+FEFF]BCD" (4 characters), with the first three bytes (EF, BB, BF) treated not as a BOM but as the ordinary UTF-8 encoding of the character U+FEFF.
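
Both readings can be reproduced with Python's standard codecs, which is one concrete way to see the ambiguity. This is only an illustration of how one runtime exposes the two interpretations, not a statement of what the Unicode standard requires: the `utf-8` codec keeps a leading U+FEFF as a character, while the `utf-8-sig` codec treats it as a signature and drops it.

```python
# The six bytes from the question.
data = bytes([0xEF, 0xBB, 0xBF, 0x42, 0x43, 0x44])

# Reading 1: plain UTF-8. The leading EF BB BF is decoded as the
# character U+FEFF, so the result has 4 characters.
as_plain = data.decode("utf-8")

# Reading 2: UTF-8 with signature. The leading EF BB BF is treated as
# a BOM and removed, so the result has 3 characters.
as_sig = data.decode("utf-8-sig")

print(repr(as_plain), len(as_plain))  # '\ufeffBCD' 4
print(repr(as_sig), len(as_sig))      # 'BCD' 3
```

The same bytes yield two different strings depending solely on which convention the reader chooses.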

So how should this ambiguity be handled? Does UTF-8 have a rule that the byte sequence EF, BB, BF at the very beginning of the data must be interpreted as a BOM rather than as the encoding of the character U+FEFF? If so, then UTF-8 would seem unable to encode certain Unicode strings, namely any string that starts with U+FEFF.
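
The concern about strings that start with U+FEFF can be made concrete with a small round-trip experiment. The sketch below assumes a reader that always strips a leading BOM (modelled here with Python's `utf-8-sig` decoder) and compares two writers: one that emits no BOM and one that always prepends one. It shows what this particular codec pair does and should not be read as the rule the standard prescribes.

```python
# A string whose first character really is U+FEFF.
text = "\ufeffBCD"

# Writer A emits plain UTF-8 with no BOM. A reader that always strips
# a leading EF BB BF then loses the first character.
written_plain = text.encode("utf-8")
lost = written_plain.decode("utf-8-sig")

# Writer B always prepends a BOM. The reader strips only that first
# BOM, so the U+FEFF that belongs to the text survives.
written_sig = text.encode("utf-8-sig")
kept = written_sig.decode("utf-8-sig")

print(written_plain.hex(" "))  # ef bb bf 42 43 44
print(repr(lost))              # 'BCD'
print(written_sig.hex(" "))    # ef bb bf ef bb bf 42 43 44
print(repr(kept))              # '\ufeffBCD'
```

Under these assumptions the string is representable only when writer and reader agree on the BOM convention in advance, which is exactly the ambiguity being asked about.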

Other Unicode encodings, such as UTF-16, may have a similar problem.
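
As an illustration of the UTF-16 case, the sketch below decodes one byte sequence with two of Python's standard codecs. The bytes FF FE followed by "BCD" in little-endian form are a hypothetical example chosen to mirror the UTF-8 one above. In Python, the generic `utf-16` codec consumes a leading BOM to determine byte order, while `utf-16-le` has the byte order fixed by its name and keeps the U+FEFF as a character.

```python
# FF FE, then 'B', 'C', 'D' as little-endian 16-bit code units.
data16 = bytes([0xFF, 0xFE, 0x42, 0x00, 0x43, 0x00, 0x44, 0x00])

# Generic UTF-16: the leading FF FE is read as a BOM and removed.
generic = data16.decode("utf-16")

# Explicit little-endian UTF-16: the leading FF FE is decoded as the
# character U+FEFF and stays in the result.
explicit_le = data16.decode("utf-16-le")

print(repr(generic), len(generic))          # 'BCD' 3
print(repr(explicit_le), len(explicit_le))  # '\ufeffBCD' 4
```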

