The UTF-8 string in this example seems to be encoded with too many bytes!
The input string: "👉TEST📍TEST"
- “👉” (U+1F449): A hand pointing right
- “T”, “E”, “S”, “T”: Basic Latin letters
- “📍” (U+1F4CD): A round pushpin
- “T”, “E”, “S”, “T”: Basic Latin letters
This string is stored in a UTF-8 encoded file. When I open the file in a hexadecimal editor, I see the 16 bytes below, as expected. When I paste the string into online tools, I get the same 16 bytes.
f0 9f 91 89 54 45 53 54 f0 9f 93 8d 54 45 53 54
\_________/ \_________/ \_________/ \_________/
  U+1F449   T  E  S  T    U+1F4CD   T  E  S  T
   “👉”                    “📍”
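To double-check the expected bytes by hand, here is a minimal sketch of the four-byte UTF-8 encoding rule for code points above U+FFFF. The function name `utf8-encode-4` is my own for illustration; it is not part of Babel.

```lisp
;; Sketch: encode one code point in U+10000..U+10FFFF as four UTF-8 octets.
;; Layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
(defun utf8-encode-4 (code-point)
  "Return a list of the four UTF-8 octets for CODE-POINT."
  (assert (<= #x10000 code-point #x10FFFF))
  (list (logior #xF0 (ldb (byte 3 18) code-point))   ; top 3 bits
        (logior #x80 (ldb (byte 6 12) code-point))   ; next 6 bits
        (logior #x80 (ldb (byte 6 6) code-point))    ; next 6 bits
        (logior #x80 (ldb (byte 6 0) code-point))))  ; low 6 bits

(format t "~{~2,'0x~^ ~}~%" (utf8-encode-4 #x1F449)) ; F0 9F 91 89
(format t "~{~2,'0x~^ ~}~%" (utf8-encode-4 #x1F4CD)) ; F0 9F 93 8D
```

Both results match what the hex editor shows, so 4 + 4 + 4 + 4 = 16 bytes is indeed the expected length.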
However, the result of the function babel:string-to-octets is different: I get 20 bytes.
(defun print-hex (octets)
  (dotimes (offset (length octets))
    (let ((byte (aref octets offset)))
      (format t "~2,'0x " byte)))
  (format t "(~A bytes)~%" (length octets)))

(let ((string "👉TEST📍TEST"))
  (format t "TEST STRING [~A]~%" string)
  (print-hex (babel:string-to-octets string))
  (print-hex (babel:string-to-octets string :encoding :UTF-8)))

TEST STRING [👉TEST📍TEST]
ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54 (20 bytes)
ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54 (20 bytes)
If we analyze this further:
ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54
\_______________/ \_________/ \_______________/ \_________/
       ???        T  E  S  T         ???        T  E  S  T
       ^^^                           ^^^
UTF-16 surrogate pair?        UTF-16 surrogate pair?
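The surrogate-pair guess can be checked with a little arithmetic. This is a minimal sketch: the helper names `decode-3-byte` and `combine-surrogates` are made up for this post, and the code only handles the three-byte form seen in the dump above.

```lisp
;; Sketch: decode one three-byte sequence (1110xxxx 10xxxxxx 10xxxxxx)
;; into the 16-bit value it carries.
(defun decode-3-byte (b1 b2 b3)
  (logior (ash (ldb (byte 4 0) b1) 12)
          (ash (ldb (byte 6 0) b2) 6)
          (ldb (byte 6 0) b3)))

;; Standard UTF-16 rule for joining a high and a low surrogate.
(defun combine-surrogates (high low)
  (+ #x10000
     (ash (- high #xD800) 10)
     (- low #xDC00)))

(let ((high (decode-3-byte #xED #xA0 #xBD))
      (low  (decode-3-byte #xED #xB1 #x89)))
  (format t "~X ~X -> U+~X~%" high low (combine-surrogates high low)))
;; prints: D83D DC49 -> U+1F449
```

So the first six bytes are the two UTF-16 surrogates D83D and DC49, each encoded separately as a three-byte sequence, and together they do stand for U+1F449. The same calculation on ED A0 BD ED B3 8D gives D83D DCCD, that is U+1F4CD. This accounts for the extra bytes: 6 + 4 + 6 + 4 = 20.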
How do I get the 16 bytes from the input string?
Another interesting behavior that highlights the same issue: converting the string to octets and then back to a string signals an encoding error on the first character.
(let ((string "👉TEST📍TEST"))
  (babel:octets-to-string (babel:string-to-octets string)))

debugger invoked on a BABEL-ENCODINGS:CHARACTER-OUT-OF-RANGE in thread
#<THREAD "main thread" RUNNING {100F080003}>:
  Illegal :UTF-8 character starting at position 0.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
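One more check that might narrow this down is to look at the Lisp string itself before Babel touches it. This is only a diagnostic sketch, and `surrogate-positions` is a hypothetical helper of my own. I have not confirmed where the surrogates come from. If the string holds one character per emoji, its length should be 10 and no character code should fall in the surrogate range #xD800..#xDFFF. A length of 12 with codes such as D83D and DC49 would suggest the surrogates are already in the string when it reaches babel:string-to-octets.

```lisp
;; Sketch: list the indices of characters whose code is a UTF-16 surrogate.
(defun surrogate-positions (string)
  (loop for ch across string
        for i from 0
        when (<= #xD800 (char-code ch) #xDFFF)
          collect i))

(let ((string "👉TEST📍TEST"))
  (format t "length: ~D~%" (length string))
  (format t "codes: ~{~X~^ ~}~%" (map 'list #'char-code string))
  (format t "surrogates at: ~S~%" (surrogate-positions string)))
```

The output depends on how the string literal reached the image (source file encoding, REPL, editor integration), so I am not listing an expected result here.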