Channel: Active questions tagged utf-8 - Stack Overflow

Why is the vocabulary size of byte-level BPE smaller than Unicode's vocabulary size?


I recently read the GPT-2 paper, which says:

This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256.

I really don't understand this passage. Unicode represents about 130K characters, so how can that be reduced to 256? Where do the remaining ~129K characters go? What am I missing? Does byte-level BPE allow different characters to share the same representation?

I don't understand the logic. Below are my questions:

  • Why is the vocabulary size reduced (from 130K to 256)?
  • What is the logic behind BBPE (byte-level BPE)?

Detailed question

Thank you for your answer, but I still don't get it. Let's say we have 130K unique characters. What we want (and what BBPE does) is to reduce this base vocabulary of unique symbols. Each Unicode character can be converted to 1 to 4 bytes using UTF-8 encoding. The original BBPE paper (Neural Machine Translation with Byte-Level Subwords) says:
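
To make the "1 to 4 bytes" statement concrete, here is a minimal Python sketch. The sample characters and the helper name `utf8_byte_values` are my own choices for illustration, not something taken from either paper. It only shows that every character, however large its code point, comes out as a short sequence of integers that are each below 256.

```python
def utf8_byte_values(ch: str) -> list[int]:
    """Return the UTF-8 bytes of a single character as a list of integers."""
    return list(ch.encode("utf-8"))

# Illustrative characters (an assumption of this sketch): ASCII, Latin with
# an accent, Hangul, and an emoji.
samples = ["A", "é", "한", "😀"]

for ch in samples:
    values = utf8_byte_values(ch)
    # ord(ch) is the Unicode code point; values are the bytes that encode it.
    print(f"U+{ord(ch):04X} {ch!r}: {len(values)} byte(s) -> {values}")
```

The four samples take 1, 2, 3, and 4 bytes respectively, and no individual byte value ever exceeds 255.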

Representing text at the level of bytes and using the 256 bytes set as vocabulary is a potential solution to this issue.

A single byte can take 256 distinct values (2^8), whereas representing every unique Unicode character requires about 2^17 (131,072) distinct values. In that case, where does the 256-byte vocabulary in the original paper come from? I understand neither the logic nor how to derive this result.
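
As far as I can tell, the 256 is simply the number of values one 8-bit byte can hold, and a character is then written as a sequence of such bytes rather than as one symbol. Below is a small Python sketch of that idea. The names `base_vocab` and `text` and the sample string are hypothetical, chosen only for this illustration.

```python
# Every possible value of one 8-bit byte: 2**8 = 256 entries.
base_vocab = list(range(256))

# Illustrative text mixing scripts (an assumption of this sketch).
text = "héllo 한국어 😀"

# Encode to UTF-8 and treat each byte as one token id.
ids = list(text.encode("utf-8"))

# Every id falls inside the 256-entry base vocabulary...
assert all(i in base_vocab for i in ids)
# ...and the original text is fully recoverable from the byte sequence.
assert bytes(ids).decode("utf-8") == text

print("base vocabulary size:", len(base_vocab))
print("characters:", len(text), "byte tokens:", len(ids))
```

The cost of the smaller vocabulary is a longer sequence: the byte token count is larger than the character count whenever the text contains non-ASCII characters.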

Here are my questions again, in more detail:

  • How does BBPE work?
  • Why is the vocabulary size reduced (from 130K to 256 bytes)?
    • Either way, don't we still need space for 130K entries in the vocabulary? What is the difference between representing unique characters as Unicode characters and as bytes?
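
For reference, this is my rough understanding of the general BPE idea applied to bytes, written as a toy Python sketch. It is not the GPT-2 or BBPE paper implementation, and the function names (`most_frequent_pair`, `merge`, `train_toy_bbpe`) are hypothetical. It starts from the 256 single-byte entries and repeatedly adds one new entry for the most frequent adjacent pair.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_toy_bbpe(text, num_merges):
    """Toy byte-level BPE: returns the final ids and the learned vocabulary."""
    ids = list(text.encode("utf-8"))
    # Base vocabulary: one entry per possible byte value.
    vocab = {i: bytes([i]) for i in range(256)}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # multi-byte tokens are numbered after the base 256
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        ids = merge(ids, pair, new_id)
    return ids, vocab

ids, vocab = train_toy_bbpe("한국 한국 한국", num_merges=3)
print(len(vocab), ids)
```

In this sketch the vocabulary grows only by the number of merges performed, so it never needs one entry per Unicode character. Characters that are never merged remain representable as their raw bytes.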

Since I have little knowledge of computer architecture and programming, please let me know if there's something I missed.

Sincerely, thank you.

