I am looking for a solution to a problem I ran into recently in Java: how to limit a filename to 255 bytes when it is encoded as UTF-8.
Given that a single UTF-8 character can be represented by multiple bytes, this is not as simple as:
    String sampleString = "컴퓨터";
    byte[] bytes = sampleString.getBytes("utf8");
    String limitedString = new String(bytes, 0, 5, "utf8");

because the cut can land in the middle of a character, as it does in the example above, which produces:
컴�
I was looking for a good solution but could not find one. ChatGPT suggested using a StringBuilder, adding characters one by one and checking whether the limit has been reached, something like this (this is my own interpretation, not ChatGPT's code):
    String sampleString = "컴퓨터";
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < sampleString.length(); ) {
        int codePoint = sampleString.codePointAt(i);
        // build a temporary string with the next code point appended
        String temp = new StringBuilder(sb).appendCodePoint(codePoint).toString();
        if (temp.getBytes("utf8").length > 5) { // convert it to bytes and check the size
            break; // if it does not fit, stop
        }
        sb.appendCodePoint(codePoint); // otherwise keep the tested character
        i += Character.charCount(codePoint); // step over a surrogate pair if needed
    }
and then the result is as expected:
컴
but this looks very memory-expensive to me, since every iteration builds a new string and a new byte array. Is there a more performant approach?
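One direction that might avoid the repeated encoding, as a rough sketch rather than a vetted solution: walk the string by code point and add up how many bytes each code point takes in UTF-8 (1 to 4, depending on its value), stopping before the total exceeds the limit. The class name `Utf8Truncate` and the helper `truncateUtf8` are hypothetical names made up for illustration.

```java
class Utf8Truncate {

    // Hypothetical helper: returns the longest prefix of s whose UTF-8
    // encoding fits in maxBytes, without splitting a code point.
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0; // UTF-8 bytes used by the prefix so far
        int i = 0;     // index (in chars) of the end of the prefix
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 length of a code point, derived from its numeric range.
            // Assumption: well-formed input. An unpaired surrogate is counted
            // as 3 bytes here, which can only over-estimate what getBytes
            // produces for it, so the result still fits in the limit.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break; // the next code point would not fit
            }
            bytes += size;
            i += Character.charCount(cp); // 2 chars for a surrogate pair
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(truncateUtf8("컴퓨터", 5)); // prints 컴
    }
}
```

This allocates only the final substring, but it truncates at code point boundaries, not at grapheme boundaries, so a combining mark or an emoji sequence could still be separated from its base character.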