Channel: Active questions tagged utf-8 - Stack Overflow

Reading emojis using WinAPI console functions returns UTF-8 Replacement Character instead


For a project of mine I need to be able to read every possible multi-byte UTF-8 code point from the Windows console. Since Windows works internally with wchar_t (UTF-16), I tried an approach that reads even the "strangest" Unicode characters: Greek and Cyrillic letters work fine, CJK works fine too, and math characters also work. But whenever an emoji, or any other code point that requires a 4-byte UTF-8 sequence, is typed into the input routine, the sequence EF BF BD ("Replacement Character") is returned instead of the expected bytes. For example, for the emoji 😀 one would expect the sequence F0 9F 98 80, but EF BF BD is returned instead.

I have tried changing the code page to 65001, and I have tried both the legacy conhost (which is known for lacking emoji support) and the new Windows Terminal (which does support emojis).

The code I'm using to read the input and print its hexadecimal multi-byte representation is the following:

#include <stdio.h>
#include <windows.h>

int main()
{
    HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
    if (hStdin == INVALID_HANDLE_VALUE)
        return 1;

    unsigned int codepage = GetConsoleOutputCP();
    if (codepage != 65001)
    {
        fprintf(stderr, "[WARNING] Non Unicode codepage found (%u), changing to 65001\n", codepage);
        SetConsoleOutputCP(65001);
        SetConsoleCP(65001);
    }

    DWORD fdwSaveOldMode;
    INPUT_RECORD irInBuf[128];
    DWORD cNumRead, i;

    if (!GetConsoleMode(hStdin, &fdwSaveOldMode))
        return 1;
    SetConsoleMode(hStdin, fdwSaveOldMode & ~ENABLE_MOUSE_INPUT);

    while (1)
    {
        if (!ReadConsoleInput(hStdin, irInBuf, 128, &cNumRead))
            return 1;

        for (i = 0; i < cNumRead; i++)
        {
            switch (irInBuf[i].EventType)
            {
            case KEY_EVENT: // keyboard input
                if (irInBuf[i].Event.KeyEvent.uChar.UnicodeChar && irInBuf[i].Event.KeyEvent.bKeyDown)
                    printf("Press UChar: %03hd\t0x%02x\n",
                           irInBuf[i].Event.KeyEvent.uChar.UnicodeChar,
                           irInBuf[i].Event.KeyEvent.uChar.UnicodeChar & (~0xff00));
                break;
            }
        }

        if (irInBuf[0].Event.KeyEvent.bKeyDown)
            putchar('\n');
        FlushConsoleInputBuffer(hStdin);
    }

    SetConsoleMode(hStdin, fdwSaveOldMode);
}

When running this code and typing θ, its correct multi-byte UTF-8 representation is shown; however, once an emoji is entered, the replacement character is returned twice.

Output:

Press UChar: 206        0xce
Press UChar: 184        0xb8
Press UChar: 239        0xef
Press UChar: 191        0xbf
Press UChar: 189        0xbd
Press UChar: 239        0xef
Press UChar: 191        0xbf
Press UChar: 189        0xbd
