I need to cut an input string at a certain byte length and discard the rest. Every good answer to such a request involves encoding the string, slicing the required portion, decoding the result and removing the "replacement character".
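For reference, the usual approach described above can be sketched roughly as follows. The helper name `truncateUtf8` is made up for illustration, and the sketch assumes the input does not itself contain U+FFFD right at the cut point.

```javascript
// Hypothetical helper: keep at most maxBytes of the UTF-8 encoding of str.
function truncateUtf8(str, maxBytes) {
  const bytes = new TextEncoder().encode(str); // TextEncoder always produces UTF-8
  if (bytes.length <= maxBytes) return str;
  const decoded = new TextDecoder("utf-8").decode(bytes.slice(0, maxBytes));
  // A multi-byte sequence cut short at the end decodes to U+FFFD, so strip one if present.
  // Assumption: a genuine U+FFFD at the very end of the kept text is not expected.
  return decoded.replace(/\uFFFD$/, "");
}
```

This only guards against splitting a single encoded code point, which is why the case below slips through.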
But I discovered an edge case where the character "✊🏼" starts 3 bytes before the cut point.
Here's some code demonstrating it.
    function cutString(str, b1, b2) {
      const te = new TextEncoder(), // takes no argument; always encodes as UTF-8
        td = new TextDecoder("utf-8"),
        enc = te.encode(str),
        cut = enc.slice(b1, b2), // byte offsets, not character offsets
        res = td.decode(cut);
      console.log(res);
    }

    cutString("abc✊🏼def", 0, 5)
    > abc�
    cutString("abc✊🏼def", 0, 6)
    > abc✊
    cutString("abc✊🏼def", 6)
    > 🏼def
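If I cut the string after the 5th byte, everything works: the partial character decodes to a replacement character, which I can remove. If I cut it after the 6th byte, I get valid UTF-8 but the wrong output, because the fist loses its skin-tone modifier.
Here are the bits of the two "fist" characters.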
    function showBits(char) {
      const te = new TextEncoder(),
        enc = te.encode(char),
        arr = Array.from(enc),
        map = arr.map(i => i.toString(2).padStart(8, "0")); // pad so every byte shows 8 bits
      console.log(map.join(" "));
    }

    showBits("✊🏼")
    > 11100010 10011100 10001010 11110000 10011111 10001111 10111100
    showBits("✊")
    > 11100010 10011100 10001010
So if I find the following in the next 4 bytes after the cut point, I know that I need to remove the final character from the output:
    11110000 10011111 10001111 10111100
But what if the fist appears 4 bytes before the cut point?
    cutString("ab✊🏼cdef", 0, 6)
    > ab✊�
Then I need to remove the replacement character and check whether the bytes after the cut point are the last 3 of the 4 bytes above.
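To make that check concrete, here is a rough sketch that handles only this one 4-byte sequence (U+1F3FC, the second code point of "✊🏼"). The name `splitsModifier` is hypothetical, and this is not the general solution being asked for.

```javascript
// Bytes of U+1F3FC, taken from the bit dump above.
const MODIFIER = [0xf0, 0x9f, 0x8f, 0xbc];

// Hypothetical helper: true if the code point that starts at, or spans,
// byte offset `cut` is this modifier.
function splitsModifier(bytes, cut) {
  let start = cut;
  // Continuation bytes look like 10xxxxxx; step back to the lead byte.
  while (start > 0 && (bytes[start] & 0xc0) === 0x80) start--;
  return MODIFIER.every((b, i) => bytes[start + i] === b);
}

const enc = new TextEncoder();
console.log(splitsModifier(enc.encode("abc✊🏼def"), 6)); // cut falls right before the modifier
console.log(splitsModifier(enc.encode("ab✊🏼cdef"), 6)); // cut falls inside the modifier
```

When it returns true, the last complete character of the decoded output would also have to be dropped. It says nothing about other sequences that can follow a character, which is the part I can't enumerate.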
I don't know what possible byte combinations might appear after the cut point to indicate a split multi-byte character. I was hoping someone would be able to explain it and help me create a general solution.
Thanks.