Channel: Active questions tagged utf-8 - Stack Overflow

Why does Encoding.UTF8.GetMaxByteCount(1) return 6?


The TLDR here is simple: What's a sequence of chars that would make either UTF8's Encoding or Encoder return 6 (or even 5) bytes for a single char, as GetMaxByteCount implies it might?

The non-TLDR:

Despite what the docs led me to expect, there is no sign that the UTF8 Encod-ing either considers potential leftover surrogates from a previous decoder operation, or includes the worst case for the currently selected EncoderFallback. Note that the UTF8 Encod-er does support cached bytes, but UTF8's Encod-ing apparently does not.
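One way to probe the fallback side of this is sketched below, assuming a replacement fallback with a deliberately long replacement string. The names `FallbackProbe` and `WithReplacement` are made up for illustration, and "????" is an arbitrary choice. The sketch only prints the numbers so they can be compared; it does not assume what `GetMaxByteCount` will report.

```csharp
using System;
using System.Text;

public static class FallbackProbe
{
    // Hypothetical helper: builds a UTF-8 encoding whose encoder fallback
    // substitutes the given replacement string for unencodable chars.
    public static Encoding WithReplacement(string replacement)
    {
        return Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback("?"));
    }

    public static void Main()
    {
        Encoding enc = WithReplacement("????");
        char[] loneHigh = { '\uD83D' }; // high surrogate with no partner

        // Compare the fallback's own worst case, the advertised bound,
        // and what a lone surrogate actually produces.
        Console.WriteLine("fallback MaxCharCount: {0}", enc.EncoderFallback.MaxCharCount);
        Console.WriteLine("GetMaxByteCount(1): {0}", enc.GetMaxByteCount(1));
        Console.WriteLine("actual bytes: {0}", enc.GetBytes(loneHigh).Length);
    }
}
```

If the bound moves when the replacement string gets longer, the fallback is being factored in after all.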

And while the Encod-er will return as many as 4 bytes from the submission of a single character, I've never been able to get more than 3 from the Encod-ing. And yet GetMaxByteCount is telling me there can sometimes be 6?

Is there some trick here? Maybe some case where a malformed set of characters might return longer-than-expected sequences? I'm looking for some specific examples.

Here's some code you can use to experiment:

    using System;
    using System.Text;

    string smilelyface = "😄"; // <--- 2 chars, encodes to 4 UTF8 bytes
    Encoding enc = Encoding.UTF8;

    int mbc = enc.GetMaxByteCount(1);
    Console.WriteLine("mbc: {0}", mbc); // <---- 6

    byte[] sixbytes = new byte[mbc];

    // Encoding.GetBytes, one char at a time
    int gb = enc.GetBytes(smilelyface.AsSpan(0, 1), sixbytes); // Encode 1st char
    Console.WriteLine("retval: {0}", gb); // <----- 3
    gb = enc.GetBytes(smilelyface.AsSpan(1, 1), sixbytes); // Encode 2nd char
    Console.WriteLine("retval: {0}", gb); // <----- 3

    // Encoding.TryGetBytes, one char at a time
    bool b = enc.TryGetBytes(smilelyface.AsSpan(0, 1), sixbytes, out int outbyteswritten);
    Console.WriteLine("outbyteswritten: {0}", outbyteswritten); // <----- 3
    b = enc.TryGetBytes(smilelyface.AsSpan(1, 1), sixbytes, out outbyteswritten);
    Console.WriteLine("outbyteswritten: {0}", outbyteswritten); // <----- 3

    // Encoder.Convert, one char at a time (flush: false, so state carries over)
    Encoder encr = enc.GetEncoder();
    encr.Convert(smilelyface.AsSpan(0, 1), sixbytes, false, out int charsused, out int bytesused, out bool completed);
    Console.WriteLine("BytesUsed: {0}", bytesused); // <----- 0
    encr.Convert(smilelyface.AsSpan(1, 1), sixbytes, false, out charsused, out bytesused, out completed);
    Console.WriteLine("BytesUsed: {0}", bytesused); // <----- 4

You'll note that the Encod-ing never returns more than 3 bytes, and in this case those 3 bytes are the Unicode replacement character (U+FFFD), suggesting the UTF8 Encod-ing has no intention of caching anything for use by subsequent encoding calls. Hard to see how you can get more than 3 bytes that way.

And while the Encod-er does cache, I still can't see how to get it to output 6 bytes, just 4.

I get that GetMaxByteCount is supposed to be 'worst case,' but AFAICT, worst case here is either 3 or 4 depending on whether we're talking about the Encoding, or the Encoder (the docs are unclear about which. Both?).
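For comparison, here is a minimal sketch that prints the advertised bound next to what a 3-byte BMP char actually produces for a few char counts. The names `BoundCheck` and `Measure` are made up for illustration, and '€' (U+20AC) is just one convenient char that takes 3 UTF-8 bytes.

```csharp
using System;
using System.Text;

public static class BoundCheck
{
    // Returns (advertised bound, bytes actually needed) for n copies of '€'.
    public static (int Max, int Actual) Measure(Encoding enc, int n)
    {
        string threeByteChars = new string('\u20AC', n);
        return (enc.GetMaxByteCount(n), enc.GetByteCount(threeByteChars));
    }

    public static void Main()
    {
        for (int n = 0; n <= 4; n++)
        {
            var (max, actual) = Measure(Encoding.UTF8, n);
            Console.WriteLine("chars: {0}  GetMaxByteCount: {1}  actual: {2}", n, max, actual);
        }
    }
}
```

Watching how the gap between the two columns changes as the char count grows may hint at what the bound is reserving room for.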

Can you really get 6 from encoding 1 char? How?

