While porting legacy code to C++20, I replaced ordinary string literals (expected to contain UTF-8 encoded text) with UTF-8 string literals (the ones prefixed with `u8`).
In doing so, I ran into an issue with the octal escape sequences which I had used in the past to encode UTF-8 sequences byte by byte:
While"\303\274"
was the proper encoding of ü
,u8"\303\274"
ended up in ü
.
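For context, a minimal sketch (my own illustration, not from the legacy code base) of what the old byte-wise literals relied on, assuming the output is viewed as UTF-8:

```c++
#include <iostream>

int main()
{
    // In an ordinary narrow string literal, each octal escape is exactly one
    // byte, so this array holds the two UTF-8 code units of U+00FC (ü).
    const char text[] = "\303\274";
    for (unsigned char c : text)
        if (c)                                            // skip the terminating nul
            std::cout << std::hex << unsigned{c} << ' ';  // prints: c3 bc
    std::cout << '\n';
}
```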
I investigated this further and found the following on cppreference.com:
> For each numeric escape sequence, given *v* as the integer value represented by the octal or hexadecimal number comprising the sequence of digits in the escape sequence, and T as the string literal's array element type (see the table above):
>
> - If *v* does not exceed the range of representable values of T, then the escape sequence contributes **a single code unit** with value *v*.

(Emphasis mine)
In my own words: in UTF-8 string literals, octal (`\ooo`) and hex (`\xXX`) escape sequences are interpreted as Unicode code points, similar to the Unicode escape sequences (`\uXXXX` and `\UXXXXXXXX`).
Hence, this appeared reasonable to me: for UTF-8 string literals, Unicode escape sequences should be favored over the byte-wise octal sequences I had used in the past.
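To spell out that reading, here is a small sketch (my own illustration, not from the ported code; the names `byOctal`/`byUnicode` are made up). It only prints the number of code units each literal ends up with, which is exactly where the two interpretations differ:

```c++
#include <iostream>

// Two possible readings of u8"\374" (octal 374 == 0xFC == code point U+00FC, ü):
//  (a) "code point" reading: the escape is re-encoded as UTF-8, yielding the two
//      code units 0xC3 0xBC - the same as u8"\u00fc" (MSVC behaves like this).
//  (b) "code unit" reading: the escape contributes the single code unit 0xFC
//      as-is, which is not valid UTF-8 on its own (g++/clang behave like this).
int main()
{
    const char8_t byOctal[]   = u8"\374";
    const char8_t byUnicode[] = u8"\u00fc";
    std::cout << "octal escape: "   << (sizeof byOctal - 1)   << " code unit(s)\n"
              << "\\u escape:     " << (sizeof byUnicode - 1) << " code unit(s)\n";
}
```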
Out of curiosity (and for the purpose of demonstration), I made a small test on coliru and was surprised to see that with `g++ -std=c++20`, the octal sequences are still interpreted as single bytes. With the above quote in mind, I came to the conclusion:
MSVC seems to be correct, and g++ wrong.
I made an MCVE which I ran in my local Visual Studio 2019:
```c++
#include <iostream>
#include <string_view>

void dump(std::string_view text)
{
  const char digits[] = "0123456789abcdef";
  for (unsigned char c : text) {
    std::cout << ' '
      << digits[c >> 4]
      << digits[c & 0xf];
  }
}

#define DEBUG(...) std::cout << #__VA_ARGS__ << ";\n"; __VA_ARGS__

int main()
{
  DEBUG(const char* const text = "\344\270\255");
  DEBUG(dump(text));
  std::cout << '\n';
  DEBUG(const char8_t* const u8text = u8"\344\270\255");
  DEBUG(dump((const char*)u8text));
  std::cout << '\n';
  DEBUG(const char8_t* const u8textU = u8"\u4e2d");
  DEBUG(dump((const char*)u8textU));
  std::cout << '\n';
}
```
Output for MSVC:
```
const char* const text = "\344\270\255";
dump(text);
 e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
 c3 a4 c2 b8 c2 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
 e4 b8 ad
```
(Please note that the dumps of the 1st and 3rd literals are identical, while the second yields longer UTF-8 sequences because each octal escape sequence is interpreted as a Unicode code point.)
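For reference, the second dump is exactly what you get when each octal value is first read as a code point and then UTF-8 encoded: `\344` = U+00E4 → `c3 a4`, `\270` = U+00B8 → `c2 b8`, `\255` = U+00AD → `c2 ad`. A small sketch of that two-byte encoding step (my own illustration, not part of the MCVE):

```c++
#include <cstdio>

// Encode a code point in the range U+0080..U+07FF as two UTF-8 code units.
void encode2(unsigned cp)
{
    std::printf("U+%04X -> %02x %02x\n",
                cp,
                0xC0 | (cp >> 6),     // leading byte:      110xxxxx
                0x80 | (cp & 0x3F));  // continuation byte: 10xxxxxx
}

int main()
{
    // The three octal escape values from the MCVE, read as code points:
    encode2(0344);  // U+00E4 -> c3 a4
    encode2(0270);  // U+00B8 -> c2 b8
    encode2(0255);  // U+00AD -> c2 ad
}
```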
The same code run in Compiler Explorer, compiled with g++ (13.2):
```
const char* const text = "\344\270\255";
dump(text);
 e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
 e4 b8 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
 e4 b8 ad
```
The same code run in Compiler Explorer, compiled with clang (17.0.1):
```
const char* const text = "\344\270\255";
dump(text);
 e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
 e4 b8 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
 e4 b8 ad
```
Is my conclusion correct that MSVC handles this correctly according to the C++ standard, in contrast to g++ and clang?
What I found by web search beforehand:
- C++20 with u8, char8_t and std::string
- Using UTF-8 string-literal prefixes portably between C++17 and C++20
Using hex escape sequences instead of octal sequences doesn't change anything: Demo on Compiler Explorer.
I preferred the somewhat unusual octal escape sequences because they are limited to 3 digits, so no unrelated character can extend them unintentionally, in contrast to hex escape sequences.
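To illustrate that difference (my own example, the names are made up):

```c++
#include <iostream>

// An octal escape stops after at most 3 digits, so the 'a' below is a
// separate character: the array holds {0xE4, 'a', 0}.
const char okOctal[] = "\344a";

// A hex escape has no digit limit: "\xe4a" would be parsed as one escape
// with value 0xE4A, which does not fit into a char, so the line is rejected
// rather than treated as {0xE4, 'a', 0}.
// const char badHex[] = "\xe4a";

// The usual workaround for hex escapes is to split the literal:
const char okHex[] = "\xe4" "a";

int main()
{
    std::cout << sizeof okOctal << ' ' << sizeof okHex << '\n';  // prints: 3 3
}
```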
Update:
When I was about to file a bug report for MSVC, I realized that this had already been done:
escape sequences in unicode string literals are overencoded (non conforming => compiler bug)