I am facing difficulties while using C++20's std::u8string. However, I believe the problem also occurs with the older std::string.
UTF-8 is a variable-width encoding that represents a single Unicode code point with one to four bytes.
Since std::u8string is simply an alias for std::basic_string<char8_t>, it stores and iterates over individual bytes (code units) and has no member functions for iterating character by character.
I also found this question on obtaining the correct length of std::u8string, but the answers did not help because they either converted the string into another encoding or were tailored solely to obtaining the string's length.
    #include <iostream>
    #include <string>

    int main()
    {
        std::u8string utf8 = u8"α.β";
        for (auto c : utf8)
        {
            std::cout << std::hex << (0xff & c) << " ";
        }
        return 0;
    }
Note: The above program uses the bitwise AND operator so that each byte is printed as a hex value. What I actually want, however, is to iterate over the string one character at a time, reading each multi-byte character into an int.
The above example uses two characters that are each represented by multiple bytes, 'α' and 'β'.
After compiling with g++ 11 on Ubuntu, the example program prints five values instead of three:
ce b1 2e ce b2
The sequence ce b1 corresponds to 'α', and ce b2 corresponds to 'β'.
I don't want to convert the string into UTF-16 or UTF-32 by using <codecvt>, since switching between encodings hurts performance.
Is there a way to iterate over a std::u8string character by character?