Channel: Active questions tagged utf-8 - Stack Overflow

String.length() doesn't work for "Rolling On the Floor Laughing" : 🤣?

I'm trying to print the first 30 characters of some UTF-8 strings, and I noticed that Java's String.substring() is returning some funky strings. I've boiled it down to the following.

I'm expecting "🤣" to be a String with length 1, and String.substring() not to cut it in the middle. Why is my expectation not met? Java thinks it has length 2.

I'm pretty sure the UTF-8 encoding for 🤣 (U+1F923, "Rolling On the Floor Laughing") is:

0xF0 0x9F 0xA4 0xA3
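As a sanity check on those four bytes, here is a minimal sketch that encodes U+1F923 to UTF-8 and prints the result in hex. The class name `Utf8Check` and the helper `hexUtf8` are hypothetical names made up for this illustration. The string is built from the code point number rather than an emoji literal, so the result does not depend on the source file encoding passed to javac.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Hypothetical helper: UTF-8 bytes of one code point, as space-separated hex.
    static String hexUtf8(int codePoint) {
        // Character.toChars yields the UTF-16 char(s) for the code point.
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative byte values print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexUtf8(0x1F923)); // prints: 0xF0 0x9F 0xA4 0xA3
    }
}
```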

And so I expect this tiny program:

import java.nio.charset.StandardCharsets;

public class Foo {
  public static void main(String[] args){
    String str = "🤣";
    // These are the UTF-8 bytes for "ROLLING ON THE FLOOR LAUGHING"
    byte[] raw = {(byte)0xf0, (byte)0x9f, (byte)0xa4, (byte)0xa3};
    String str2 = new String(raw, StandardCharsets.UTF_8);
    System.out.println(str.equals(str2));
    System.out.println(str.length());
    System.out.println(str.substring(0,1));
  }
}

To print out:

true
1
🤣

But in fact it prints out:

true
2
?

Am I doing something wrong?

I've tried a custom Java 11.0.20.1 build and these standard Ubuntu packages, with the same results:

$ javac -version
javac 19.0.2
$ java -version
openjdk version "19.0.2" 2023-01-17
OpenJDK Runtime Environment (build 19.0.2+7-Ubuntu-0ubuntu322.04)
OpenJDK 64-Bit Server VM (build 19.0.2+7-Ubuntu-0ubuntu322.04, mixed mode, sharing)

python3 does what I expect:

$ python3 -c 'print(len("🤣"))'
1
$ python3 -c 'print("🤣"[0])'
🤣
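For comparison with the Python output, here is a sketch of the same two operations in Java, counted in Unicode code points instead of `char` values. It assumes code points are the unit I actually want; `String.length()` counts UTF-16 `char` units, and U+1F923 takes two of them. `CodePoints`, `codePointLength` and `firstCodePoints` are hypothetical names for this illustration, while `codePointCount` and `offsetByCodePoints` are standard `String` methods.

```java
public class CodePoints {
    // Hypothetical helper: number of Unicode code points in s.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: the first n code points of s (or all of s if it is shorter).
    // Assumes n is not negative.
    static String firstCodePoints(String s, int n) {
        int count = s.codePointCount(0, s.length());
        // offsetByCodePoints maps a code point count to a char index,
        // so the cut never lands between the two halves of a surrogate pair.
        int end = s.offsetByCodePoints(0, Math.min(n, count));
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // Built from the code point number to avoid depending on source encoding.
        String str = new String(Character.toChars(0x1F923));
        System.out.println(str.length());                        // 2
        System.out.println(codePointLength(str));                // 1
        System.out.println(firstCodePoints(str, 1).equals(str)); // true
    }
}
```

With helpers like these, the "first 30 characters" case would be `firstCodePoints(text, 30)`. This still counts code points, not user-perceived characters, so sequences such as an emoji followed by a skin-tone modifier would count as more than one.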
