Channel: Active questions tagged utf-8 - Stack Overflow

Cut a string at byte length: how to identify invalid utf-8 characters?


I need to cut an input string at a certain byte length and discard the rest. Every good answer to such a request involves encoding the string, slicing the required portion, decoding the result and removing the "replacement character".
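For reference, that usual approach might look roughly like the sketch below. It is not taken from any particular answer: the name `cutToBytes` is made up for illustration, and it assumes the cut always starts at byte 0.

```javascript
// Hypothetical helper: cut `str` to at most `maxBytes` UTF-8 bytes.
function cutToBytes(str, maxBytes) {
  // Encode the string to UTF-8 and keep only the leading bytes.
  const bytes = new TextEncoder().encode(str);
  const cut = bytes.slice(0, maxBytes);
  // The default (non-fatal) decoder replaces a truncated trailing
  // sequence with U+FFFD, the replacement character.
  const decoded = new TextDecoder("utf-8").decode(cut);
  // Drop a single trailing replacement character, if there is one.
  return decoded.replace(/\uFFFD$/, "");
}

console.log(cutToBytes("abcdéf", 5)); // "abcd" (the 2-byte "é" is split and dropped)
```

One caveat with this sketch: it would also strip a genuine U+FFFD that happened to be the last character inside the cut.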

But I discovered an edge case where the character "✊🏼" appears 3 bytes before the cut point.

Here's some code demonstrating it.

    function cutString(str, b1, b2){
      const te = new TextEncoder("utf-8"),
        td = new TextDecoder("utf-8"),
        enc = te.encode(str),
        cut = enc.slice(b1, b2),
        res = td.decode(cut);
      console.log(res);
    }

    cutString("abc✊🏼def", 0, 5)
    > abc�

    cutString("abc✊🏼def", 0, 6)
    > abc✊

    cutString("abc✊🏼def", 6)
    > 🏼def

If I cut the string after the 5th byte, everything works: the split character turns into a replacement character that I can strip. If I cut it after the 6th byte, I get valid but incorrect UTF-8 output: a plain "✊" instead of "✊🏼".

Here are the bits of the two "fist" characters.

    function showBits(char){
      const te = new TextEncoder("utf-8"),
        enc = te.encode(char),
        arr = Array.from(enc),
        map = arr.map(i => i.toString(2))
      console.log(map.join(" "));
    }

    showBits("✊🏼")
    > 11100010 10011100 10001010 11110000 10011111 10001111 10111100

    showBits("✊")
    > 11100010 10011100 10001010

So if I find the following in the next 4 bytes after the cut point, I know that I need to remove the final character from the output:

11110000 10011111 10001111 10111100

But what if the fist appears 4 bytes before the cut point?

    cutString("ab✊🏼cdef", 0, 6)
    > ab✊�

Then I need to remove the replacement character and check for the last 3 of those bytes above.

I don't know what possible byte combinations might appear after the cut point to indicate a split multi-byte character. I was hoping someone would be able to explain it and help me create a general solution.
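A note on the bytes shown above: the first three (`11100010 10011100 10001010`) encode U+270A, and the last four (`11110000 10011111 10001111 10111100`) encode U+1F3FC, a skin-tone modifier. Both are complete, valid code points, so a cut between them is not a split multi-byte character, and no UTF-8 byte pattern will flag it. What is being split is a grapheme cluster, which is defined at the Unicode text level rather than the byte level.

One possible direction, sketched under assumptions, is to segment the string into grapheme clusters first and keep whole clusters while they fit in the byte budget. The name `cutAtGraphemeBoundary` is hypothetical, the sketch only cuts from the start of the string, and it assumes a runtime that provides `Intl.Segmenter`.

```javascript
// Hypothetical helper: keep whole grapheme clusters while their
// combined UTF-8 size stays within `maxBytes`.
function cutAtGraphemeBoundary(str, maxBytes) {
  const encoder = new TextEncoder();
  // Assumption: Intl.Segmenter is available in the target runtime.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let used = 0;
  for (const { segment } of segmenter.segment(str)) {
    // UTF-8 byte length of this cluster (may span several code points).
    const size = encoder.encode(segment).length;
    // Stop at the first cluster that would exceed the budget.
    if (used + size > maxBytes) break;
    result += segment;
    used += size;
  }
  return result;
}

console.log(cutAtGraphemeBoundary("abc✊🏼def", 6));  // "abc"
console.log(cutAtGraphemeBoundary("abc✊🏼def", 10)); // "abc✊🏼"
```

Because the output is built only from whole clusters, it never contains a truncated byte sequence, so no replacement character needs to be removed afterwards.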

Thanks.

