Channel: Active questions tagged utf-8 - Stack Overflow

python2.7 - How to decode JSON without decoding the UTF-8 inside of it?

I need a function to decode UTF-8 encoded JSON. This function should take a UTF-8 encoded JSON string and convert it to UTF-8 encoded objects. The following code works:

    import json

    # Helper: re-encode the unicode keys and values of one decoded object as UTF-8
    def Obj_To_UTF8(o):
        res = {}
        for k, v in o.items():
            k = k.encode('utf-8')
            v = v.encode('utf-8') if isinstance(v, unicode) else v
            res[k] = v
        return res

    # Load UTF-8 encoded JSON to UTF-8 encoded objects
    def Load_JSON(s):
        return json.loads(s, object_hook = Obj_To_UTF8)

but is rather ridiculous, as json.loads is decoding the UTF-8 just so we can encode it again. How can I decode JSON without decoding the UTF-8 inside of it?
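
One limitation of the hook approach is worth noting: `object_hook` is only called for JSON objects, so strings nested inside arrays (or a bare top-level string) come back as unicode. Below is a minimal sketch of a recursive variant, assuming the goal is byte strings everywhere. The names `to_utf8`, `load_json_utf8` and `text_type` are made up for illustration, and this still performs the decode/encode round trip the question wants to avoid.

```python
import json

# Hypothetical alias: `unicode` on Python 2, `str` on Python 3.
text_type = type(u"")

def to_utf8(value):
    """Recursively re-encode every text string in a decoded JSON value as UTF-8 bytes."""
    if isinstance(value, text_type):
        return value.encode("utf-8")
    if isinstance(value, list):
        return [to_utf8(v) for v in value]
    if isinstance(value, dict):
        return {to_utf8(k): to_utf8(v) for k, v in value.items()}
    # Numbers, booleans and None pass through unchanged.
    return value

def load_json_utf8(s):
    # Still decodes to text first, then encodes back to bytes.
    return to_utf8(json.loads(s))

result = load_json_utf8(b'{"k": ["caf\xc3\xa9", 1]}')
```

Here the string inside the array is converted as well, which the single-level hook would leave as unicode.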

Background

I emphasize that the JSON is already UTF-8 encoded. This matters because I want to process the decoded JSON in exactly the same character encoding as the encoded JSON, using the same byte-based functions I use to process files, sockets, etc. For the sake of modularity, the program should therefore not have to deal with unicode objects at all. (If we were changing the encoding or writing a text editor or font renderer, that would be the place to bring in unicode objects. If we're not doing those things, then unicode objects and the associated encoding/decoding are just a headache. Hopefully the example above illustrates this pretty well.)

For those unfamiliar with UTF-8, it is designed to support matching and tokenization directly on the bytes, specifically so that it is compatible with encoding-agnostic software (some of which is over 50 years old and still kicking). Languages like C and Python 2 make it easy to process strings directly as bytes, and therefore avoid constantly encoding and decoding them.
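
To make that property concrete: in UTF-8, every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter byte can never appear inside an encoded non-ASCII character. A minimal sketch with made-up sample data:

```python
# Sample text is made up for illustration; it contains 2-byte and 3-byte characters.
text = u"na\u00efve,caf\u00e9,\u65e5\u672c\u8a9e"
data = text.encode("utf-8")

# Tokenize on the ASCII comma without decoding anything.
fields = data.split(b",")

# Every byte that belongs to a non-ASCII character is >= 0x80,
# so none of them can be mistaken for the comma (0x2C).
multibyte = bytearray(u"\u00ef\u00e9\u65e5\u672c\u8a9e".encode("utf-8"))
high_bits_set = all(b >= 0x80 for b in multibyte)

# Each field is therefore still valid UTF-8 on its own.
round_trip = [f.decode("utf-8") for f in fields]
```

The split happens purely on bytes, yet each resulting field decodes cleanly.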

Unfortunately, json.loads seems to stray from these principles, for no obvious good reason. I can think of only two reasons why a JSON implementation would even care about character encoding:

  1. The \u escape feature, which is optional (and turned off by default in web browsers, so that they tend to emit UTF-8 sequences for non-ASCII characters).
  2. Validating that the received string is proper UTF-8. Hopefully this pedantry would also be optional. (Apart from the UTF-8 fiat, JSON's actual delimiting scheme works perfectly well with binary data.)
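
Both points can be observed with Python's standard `json` module. The sketch below assumes byte-string input (accepted by `json.loads` on Python 2 and on Python 3.6+). The helper name `is_rejected` is made up for illustration.

```python
import json

# The same string, written once with a \u escape and once as raw UTF-8 bytes.
escaped = b'"caf\\u00e9"'
raw = b'"caf\xc3\xa9"'
same_value = json.loads(escaped) == json.loads(raw)

# On the encoding side, ensure_ascii controls whether \u escapes are emitted.
with_escapes = json.dumps(u"caf\u00e9")
without_escapes = json.dumps(u"caf\u00e9", ensure_ascii=False)

def is_rejected(payload):
    """Return True if json.loads refuses the given byte string."""
    try:
        json.loads(payload)
    except ValueError:  # UnicodeDecodeError is a subclass of ValueError
        return True
    return False

# 0xFF can never occur in well-formed UTF-8, so the decoder raises
# even though the quotes delimit the string unambiguously.
invalid_rejected = is_rejected(b'"\xff"')
```

To turn a \u escape into bytes, a decoder has to know the target encoding, which is the one place where encoding awareness looks unavoidable.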

So maybe the JSON implementation cares about the character encoding, but forcing all JSON applications to deal with character encoding is poor modularity: rather than separating concerns, it unnecessarily ties JSON decoding to UTF-8 decoding. So a decent JSON implementation would at least provide a way to disable the UTF-8 decoding.
