For a project of mine I need to be able to read every possible multi-byte UTF-8 code point from the Windows console. Since it is well known that Windows works internally with wchar_t (UTF-16), I tried an approach that reads even the "strangest" Unicode characters: Greek and Cyrillic letters work fine, CJK works fine, and math symbols work fine too. But whenever an emoji, or any other code point that needs four UTF-8 bytes, is typed into the input routine, the sequence EF BF BD (the "replacement character") is returned instead of the expected bytes. For example, with the emoji 😀 one would expect the sequence F0 9F 98 80, but EF BF BD is returned instead.
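For reference, those expected bytes are simply the UTF-8 encoding of U+1F600. A minimal standalone sketch (not part of the program below) that derives them from the character's UTF-16 surrogate pair via WideCharToMultiByte prints exactly F0 9F 98 80:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    // U+1F600 (😀) expressed as a UTF-16 surrogate pair
    WCHAR emoji[] = { 0xD83D, 0xDE00 };
    char utf8[8];

    // Convert the pair to UTF-8; for U+1F600 this yields F0 9F 98 80
    int len = WideCharToMultiByte(CP_UTF8, 0, emoji, 2, utf8, sizeof(utf8), NULL, NULL);
    for (int i = 0; i < len; i++)
        printf("%02X ", (unsigned char)utf8[i]);
    putchar('\n');
    return 0;
}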
I have tried changing the code page to 65001, and I have tried both the legacy conhost (which is known to lack emoji support) and the new Windows Terminal (which does support emoji).
The code I'm using to read the input and print its hexadecimal MBCS representation is the following:
#include <stdio.h>
#include <windows.h>

int main()
{
    HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
    if (hStdin == INVALID_HANDLE_VALUE)
        return 1;

    // Switch the console to UTF-8 if it is not there already
    unsigned int codepage = GetConsoleOutputCP();
    if (codepage != 65001)
    {
        fprintf(stderr, "[WARNING] Non Unicode codepage found (%u), changing to 65001\n", codepage);
        SetConsoleOutputCP(65001);
        SetConsoleCP(65001);
    }

    DWORD fdwSaveOldMode;
    INPUT_RECORD irInBuf[128];
    DWORD cNumRead, i;

    if (!GetConsoleMode(hStdin, &fdwSaveOldMode))
        return 1;
    SetConsoleMode(hStdin, fdwSaveOldMode & ~ENABLE_MOUSE_INPUT);

    while (1)
    {
        if (!ReadConsoleInput(hStdin, irInBuf, 128, &cNumRead))
            return 1;

        for (i = 0; i < cNumRead; i++)
        {
            switch (irInBuf[i].EventType)
            {
            case KEY_EVENT: // keyboard input
                // Print each byte delivered for the key-down event, masked to its low 8 bits
                if (irInBuf[i].Event.KeyEvent.uChar.UnicodeChar && irInBuf[i].Event.KeyEvent.bKeyDown)
                    printf("Press UChar: %03hd\t0x%02x\n",
                           irInBuf[i].Event.KeyEvent.uChar.UnicodeChar,
                           irInBuf[i].Event.KeyEvent.uChar.UnicodeChar & (~0xff00));
                break;
            }
        }

        if (irInBuf[0].Event.KeyEvent.bKeyDown)
            putchar('\n');
        FlushConsoleInputBuffer(hStdin);
    }

    SetConsoleMode(hStdin, fdwSaveOldMode);
}
When running this code and typing θ, its correct multi-byte UTF-8 representation (CE B8) is shown; however, as soon as an emoji is typed, the replacement character is returned twice.
Output:
Press UChar: 206    0xce
Press UChar: 184    0xb8
Press UChar: 239    0xef
Press UChar: 191    0xbf
Press UChar: 189    0xbd
Press UChar: 239    0xef
Press UChar: 191    0xbf
Press UChar: 189    0xbd
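For comparison, here is a rough, untested sketch of what I mean by the UTF-16 route: reading the same key events with ReadConsoleInputW and converting to UTF-8 manually with WideCharToMultiByte. The surrogate-pair handling here is my own guess and is not part of the program above:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
    if (hStdin == INVALID_HANDLE_VALUE)
        return 1;

    INPUT_RECORD rec[128];
    DWORD n;
    WCHAR highSurrogate = 0; // buffered high surrogate, waiting for its low half

    while (ReadConsoleInputW(hStdin, rec, 128, &n))
    {
        for (DWORD i = 0; i < n; i++)
        {
            if (rec[i].EventType != KEY_EVENT || !rec[i].Event.KeyEvent.bKeyDown)
                continue;

            WCHAR wc = rec[i].Event.KeyEvent.uChar.UnicodeChar;
            if (!wc)
                continue;

            WCHAR pair[2];
            int cch = 1;
            if (IS_HIGH_SURROGATE(wc))
            {
                highSurrogate = wc; // wait for the matching low surrogate
                continue;
            }
            if (IS_LOW_SURROGATE(wc) && highSurrogate)
            {
                pair[0] = highSurrogate;
                pair[1] = wc;
                cch = 2;
                highSurrogate = 0;
            }
            else
            {
                pair[0] = wc;
            }

            // Convert the UTF-16 unit(s) for this key press to UTF-8 and dump the bytes
            char utf8[8];
            int len = WideCharToMultiByte(CP_UTF8, 0, pair, cch, utf8, sizeof(utf8), NULL, NULL);
            for (int b = 0; b < len; b++)
                printf("0x%02x ", (unsigned char)utf8[b]);
            putchar('\n');
        }
    }
    return 0;
}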