Channel: Active questions tagged utf-8 - Stack Overflow

Is there a way to iterate through std::u8string character by character?

I am facing difficulties while using C++20's std::u8string. However, I believe the problem also occurs with the older std::string.

UTF-8 is a variable-length encoding: a single Unicode character (code point) is represented by a sequence of one to four bytes.

Since std::u8string is just an alias for std::basic_string<char8_t>, its iterators and operator[] walk the string one code unit (byte) at a time, and the class has no member functions for iterating character by character.

I also found this question on obtaining the correct length of std::u8string, but the answers did not help because they either 'converted' the encoding into another, or were optimized solely for obtaining the length of the string.

#include <iostream>
#include <string>

int main()
{
    std::u8string utf8 = u8"α.β";
    for (auto c : utf8)
    {
        std::cout << std::hex << (0xff & c) << " ";
    }
    return 0;
}

Note: The above program uses the bit-wise AND operator only to print each byte as a proper hex value. What I actually want is to iterate over the string so that each step yields one whole character as an int, even when that character is encoded as multiple bytes.

The above example uses two multi-byte-represented characters, 'α' and 'β'.

After compiling with g++ 11 on Ubuntu, the example program prints five values instead of three:

ce b1 2e ce b2

The sequence ce b1 corresponds to 'α', and ce b2 corresponds to 'β'.

I don't want to convert the string into UTF-16 or UTF-32 by using <codecvt>, since it hurts performance to switch between encodings.

Is there a method for std::u8string to iterate the string character by character?

