Channel: Active questions tagged utf-8 - Stack Overflow

How to effectively substring UTF-8 encoded String to max length in bytes?

I am looking for a solution to a problem I recently faced in Java: limiting a filename to 255 bytes in UTF-8.

Given that a single character can be encoded as multiple bytes in UTF-8, this is not as simple as:

String sampleString = "컴퓨터";
byte[] bytes = sampleString.getBytes("utf8");
String limitedString = new String(bytes, 0, 5, "utf8");

because the cut can fall in the middle of a multi-byte character, which is what happens in the example above:

컴�
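To make the failure concrete, here is a small sketch that prints how many UTF-8 bytes each character of the sample string needs. The class name `Utf8Widths` and the helper `widths` are made up for this illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class Utf8Widths {
    // Returns the number of UTF-8 bytes needed for each code point of s.
    static int[] widths(String s) {
        return s.codePoints()
                .map(cp -> new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8).length)
                .toArray();
    }

    public static void main(String[] args) {
        // Each Hangul syllable here takes 3 bytes, so byte offset 5
        // lands inside the second character.
        System.out.println(Arrays.toString(widths("컴퓨터"))); // prints [3, 3, 3]
    }
}
```

With 3 bytes per character, the first 5 bytes contain one whole character plus two bytes of the next, and the decoder replaces that incomplete tail with the replacement character.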

I looked for a good solution but could not find one. ChatGPT suggested using a StringBuilder, appending one character at a time and checking whether the limit has been reached, something like this (this isn't ChatGPT's code, it is my own interpretation):

String sampleString = "컴퓨터";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < sampleString.length(); ) {
    int codePoint = sampleString.codePointAt(i);
    // build temporary string; appendCodePoint adds the character, not its numeric value
    String temp = new StringBuilder(sb).appendCodePoint(codePoint).toString();
    if (temp.getBytes("utf8").length > 5) { // convert it back to bytes and check size
        break;                              // if it does not fit, break
    }
    sb.appendCodePoint(codePoint);          // add that tested character otherwise
    i += Character.charCount(codePoint);    // step over a surrogate pair as one character
}

and then the result is as expected:

컴

but I see this as a very memory-expensive solution, since it builds a new temporary string and a new byte array on every iteration. Perhaps a much more performant one exists out there?
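One possible direction, sketched here under stated assumptions rather than offered as the definitive answer, is to avoid encoding anything at all. The UTF-8 width of a code point follows directly from its numeric range, so the loop can add up widths and stop before the limit is exceeded. The class `Utf8Truncate` and its methods `utf8Length` and `truncate` are hypothetical names chosen for this example, and `maxBytes` is assumed to be non-negative.

```java
// Hypothetical helper class, for illustration only.
final class Utf8Truncate {
    // UTF-8 width of a single code point, derived from its numeric range.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Returns the longest prefix of s that ends on a code point boundary
    // and whose UTF-8 encoding is at most maxBytes long.
    static String truncate(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int width = utf8Length(cp);
            if (bytes + width > maxBytes) {
                break; // the next character would not fit
            }
            bytes += width;
            i += Character.charCount(cp); // 2 chars for a surrogate pair
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(truncate("컴퓨터", 5)); // prints 컴
    }
}
```

This sketch allocates only the final substring. An unpaired surrogate is counted as 3 bytes, while `String.getBytes` would encode it as a single `?`, so the estimate can only be too high and the result never exceeds the limit. It cuts at code point boundaries only, so a combining mark could still be separated from its base character.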

