While porting legacy code to C++20, I replaced ordinary string literals (expected to contain UTF-8 encoded text) with UTF-8 string literals (the ones prefixed with `u8`).
In doing so, I ran into an issue with the octal escape sequences which I had used in the past to encode UTF-8 sequences byte by byte:
While"\303\274"
was the proper encoding of ü
,u8"\303\274"
ended up in ü
.
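For context, a minimal sketch (my own illustration, not from the legacy code base) of what the old byte-wise literals relied on, assuming the output is viewed as UTF-8:

```c++
#include <iostream>

int main()
{
    // In an ordinary narrow string literal, each octal escape is exactly one
    // byte, so this array holds the two UTF-8 code units of U+00FC (ü).
    const char text[] = "\303\274";
    for (unsigned char c : text)
        if (c)                                            // skip the terminating nul
            std::cout << std::hex << unsigned{c} << ' ';  // prints: c3 bc
    std::cout << '\n';
}
```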
I investigated this further and found the following on cppreference.com:
> For each numeric escape sequence, given *v* as the integer value represented by the octal or hexadecimal number comprising the sequence of digits in the escape sequence, and T as the string literal's array element type (see the table above):
>
> - If *v* does not exceed the range of representable values of T, then the escape sequence contributes **a single code unit** with value *v*.

(Emphasis mine)
In my own words: in UTF-8 string literals, octal (`\ooo`) and hex (`\xXX`) escape sequences are interpreted as Unicode code points, similar to the Unicode escape sequences (`\uXXXX` and `\UXXXXXXXX`).
Hence, this appeared reasonable to me: for UTF-8 string literals, Unicode escape sequences should be favored over the byte-wise octal sequences I had used in the past.
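To spell out that reading, here is a small sketch (my own illustration, not from the ported code; the names `byOctal`/`byUnicode` are made up). It only prints the number of code units each literal ends up with, which is exactly where the two interpretations differ:

```c++
#include <iostream>

// Two possible readings of u8"\374" (octal 374 == 0xFC == code point U+00FC, ü):
//  (a) "code point" reading: the escape is re-encoded as UTF-8, yielding the two
//      code units 0xC3 0xBC - the same as u8"\u00fc" (MSVC behaves like this).
//  (b) "code unit" reading: the escape contributes the single code unit 0xFC
//      as-is, which is not valid UTF-8 on its own (g++/clang behave like this).
int main()
{
    const char8_t byOctal[]   = u8"\374";
    const char8_t byUnicode[] = u8"\u00fc";
    std::cout << "octal escape: "   << (sizeof byOctal - 1)   << " code unit(s)\n"
              << "\\u escape:     " << (sizeof byUnicode - 1) << " code unit(s)\n";
}
```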
Out of curiosity (and for the purpose of demonstration), I made a small test on coliru and was surprised to see that with `g++ -std=c++20`, the octal sequences are still interpreted as single bytes. With the above quote in mind, I came to the conclusion:
MSVC seems to be correct, and g++ wrong.
I made an MCVE which I ran in my local Visual Studio 2019:
```c++
#include <iostream>
#include <string_view>

void dump(std::string_view text)
{
  const char digits[] = "0123456789abcdef";
  for (unsigned char c : text) {
    std::cout << ' '
      << digits[c >> 4]
      << digits[c & 0xf];
  }
}

#define DEBUG(...) std::cout << #__VA_ARGS__ << ";\n"; __VA_ARGS__

int main()
{
  DEBUG(const char* const text = "\344\270\255");
  DEBUG(dump(text));
  std::cout << '\n';
  DEBUG(const char8_t* const u8text = u8"\344\270\255");
  DEBUG(dump((const char*)u8text));
  std::cout << '\n';
  DEBUG(const char8_t* const u8textU = u8"\u4e2d");
  DEBUG(dump((const char*)u8textU));
  std::cout << '\n';
}
```
Output for MSVC:
```
const char* const text = "\344\270\255";
dump(text);
 e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
 c3 a4 c2 b8 c2 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
 e4 b8 ad
```
(Please note that the dumps of the 1st and 3rd literals are identical, while the second yields longer UTF-8 sequences because each octal escape sequence is interpreted as a Unicode code point.)
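For reference, the second dump is exactly what you get when each octal value is first read as a code point and then UTF-8 encoded: `\344` = U+00E4 → `c3 a4`, `\270` = U+00B8 → `c2 b8`, `\255` = U+00AD → `c2 ad`. A small sketch of that two-byte encoding step (my own illustration, not part of the MCVE):

```c++
#include <cstdio>

// Encode a code point in the range U+0080..U+07FF as two UTF-8 code units.
void encode2(unsigned cp)
{
    std::printf("U+%04X -> %02x %02x\n",
                cp,
                0xC0 | (cp >> 6),     // leading byte:      110xxxxx
                0x80 | (cp & 0x3F));  // continuation byte: 10xxxxxx
}

int main()
{
    // The three octal escape values from the MCVE, read as code points:
    encode2(0344);  // U+00E4 -> c3 a4
    encode2(0270);  // U+00B8 -> c2 b8
    encode2(0255);  // U+00AD -> c2 ad
}
```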
The same code run in Compiler Explorer, compiled with g++ (13.2):
```
const char* const text = "\344\270\255";
dump(text);
 e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
 e4 b8 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
 e4 b8 ad
```
The same code run in Compiler Explorer, compiled with clang (17.0.1):
```
const char* const text = "\344\270\255";
dump(text);
 e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
 e4 b8 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
 e4 b8 ad
```
Is my conclusion correct that MSVC handles this correctly according to the C++ standard, in contrast to g++ and clang?
What I found by web search beforehand:
- C++20 with u8, char8_t and std::string
- Using UTF-8 string-literal prefixes portably between C++17 and C++20
Using hex escape sequences instead of octal sequences doesn't change anything: Demo on Compiler Explorer.
I preferred the somewhat unusual octal escape sequences because they are limited to 3 digits, so no unrelated character can extend them unintentionally, in contrast to hex escape sequences.
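To illustrate that difference (my own example, the names are made up):

```c++
#include <iostream>

// An octal escape stops after at most 3 digits, so the 'a' below is a
// separate character: the array holds {0xE4, 'a', 0}.
const char okOctal[] = "\344a";

// A hex escape has no digit limit: "\xe4a" would be parsed as one escape
// with value 0xE4A, which does not fit into a char, so the line is rejected
// rather than treated as {0xE4, 'a', 0}.
// const char badHex[] = "\xe4a";

// The usual workaround for hex escapes is to split the literal:
const char okHex[] = "\xe4" "a";

int main()
{
    std::cout << sizeof okOctal << ' ' << sizeof okHex << '\n';  // prints: 3 3
}
```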
Update:
When I was about to file a bug report for MSVC, I realized that this had already been done:
escape sequences in unicode string literals are overencoded (non conforming => compiler bug)