The TLDR here is simple: What's a sequence of chars that would make either UTF8's Encoding or Encoder return 6 (or even 5) bytes for a single char, as GetMaxByteCount implies it might?
The non-TLDR:
Despite what the docs led me to expect, there is no sign that the UTF8 Encoding either considers potential leftover surrogates from a previous encoder operation, or includes the worst case for the currently selected EncoderFallback. Note that the UTF8 Encoder does support cached bytes, but the UTF8 Encoding apparently does not.
And while the Encoder will return as many as 4 bytes from the submission of a single character, I've never been able to get more than 3 from the Encoding. And yet GetMaxByteCount is telling me there can sometimes be 6?
Is there some trick here? Maybe some case where a malformed set of characters might return longer-than-expected sequences? I'm looking for some specific examples.
Here's some code you can use to experiment:
```csharp
using System;
using System.Text;

string smilelyface = "😄"; // <--- 2 chars, encodes to 4 UTF8 bytes
Encoding enc = Encoding.UTF8;

int mbc = enc.GetMaxByteCount(1);
Console.WriteLine("mbc: {0}", mbc); // <---- 6
byte[] sixbytes = new byte[mbc];

int gb = enc.GetBytes(smilelyface.AsSpan(0, 1), sixbytes); // Encode 1st char
Console.WriteLine("retval: {0}", gb); // <----- 3
gb = enc.GetBytes(smilelyface.AsSpan(1, 1), sixbytes); // Encode 2nd char
Console.WriteLine("retval: {0}", gb); // <----- 3

bool b = enc.TryGetBytes(smilelyface.AsSpan(0, 1), sixbytes, out int outbyteswritten);
Console.WriteLine("outbyteswritten: {0}", outbyteswritten); // <----- 3
b = enc.TryGetBytes(smilelyface.AsSpan(1, 1), sixbytes, out outbyteswritten);
Console.WriteLine("outbyteswritten: {0}", outbyteswritten); // <----- 3

Encoder encr = enc.GetEncoder();
encr.Convert(smilelyface.AsSpan(0, 1), sixbytes, false, out int charsused, out int bytesused, out bool completed);
Console.WriteLine("BytesUsed: {0}", bytesused); // <----- 0
encr.Convert(smilelyface.AsSpan(1, 1), sixbytes, false, out charsused, out bytesused, out completed);
Console.WriteLine("BytesUsed: {0}", bytesused); // <----- 4
```
You'll note that the Encoding never returns more than 3 bytes, and in this case those 3 bytes are the UTF-8 encoding of the Unicode replacement character (U+FFFD), suggesting the UTF8 Encoding has no intention of caching anything for use by subsequent encoding calls. It's hard to see how you could get more than 3 bytes that way.
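You can check the byte values directly. Here's a minimal sketch of my own (a lone high surrogate stands in for the smiley's first char):

```csharp
using System;
using System.Text;

// GetBytes on a lone high surrogate: no caching, just an immediate fallback.
byte[] buf = new byte[8];
int n = Encoding.UTF8.GetBytes("\uD83D".AsSpan(), buf);
Console.WriteLine(BitConverter.ToString(buf, 0, n)); // EF-BF-BD, i.e. U+FFFD in UTF-8
```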
And while the Encoder does cache, I still can't see how to get it to output 6 bytes; the most I can coax out of it is 4.
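The closest I've come, besides completing the surrogate pair, is leaving a high surrogate cached and then submitting a character that can't complete it (another sketch of mine; 'A' is an arbitrary non-surrogate):

```csharp
using System;
using System.Text;

byte[] buf = new byte[8];
Encoder encr = Encoding.UTF8.GetEncoder();

// Cache a lone high surrogate (flush: false): nothing is emitted yet.
encr.Convert("\uD83D".AsSpan(), buf, false, out _, out int written, out _);
Console.WriteLine(written); // 0

// A char that can't pair with it: the cached surrogate falls back to
// U+FFFD (3 bytes) and 'A' adds 1 more, so still only 4 bytes for 1 char.
encr.Convert("A".AsSpan(), buf, false, out _, out written, out _);
Console.WriteLine(BitConverter.ToString(buf, 0, written)); // EF-BF-BD-41
```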
I get that GetMaxByteCount is supposed to be 'worst case,' but AFAICT the worst case here is either 3 or 4, depending on whether we're talking about the Encoding or the Encoder (the docs are unclear about which one they mean. Both?).
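For what it's worth, GetMaxByteCount does seem to fold the fallback into its arithmetic, even though I can't get GetBytes to actually emit that many bytes. A quick check, using an arbitrary 3-char replacement string:

```csharp
using System;
using System.Text;

// UTF-8 with a 3-char replacement fallback instead of the default single U+FFFD.
Encoding enc3 = Encoding.GetEncoding(
    "utf-8",
    new EncoderReplacementFallback("???"),
    DecoderFallback.ReplacementFallback);

Console.WriteLine(enc3.GetMaxByteCount(1)); // larger than 6: the fallback's MaxCharCount multiplies in
```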
Can you really get 6 from encoding 1 char? How?