Channel: Active questions tagged utf-8 - Stack Overflow

How to detect and fix incorrect character encoding

An upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, transcodes them from ISO-8859-1 to UTF-8, and sends the result to my service labeled as UTF-8.

The upstream service is out of my control. They may fix it, or it may never be fixed.

I know that I can undo this by transcoding the text from UTF-8 back to ISO-8859-1 and then relabeling the resulting bytes as UTF-8. But what happens if my upstream fixes their issue?

Is there any way to detect this issue and fix the encoding only when I find a bad encoding?

I'm also not sure that the encoding the upstream assumes is ISO-8859-1. I think the upstream is Perl, so that encoding makes sense, and every sample I've tried decoded correctly when I applied the ISO-8859-1 fix.


When the source sends e2 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â followed by the invisible control characters U+009C and U+0094).

  • utf-8 string ✔ as bytes: e2 9c 94
  • bytes e2 9c 94 as latin1 string: â, U+009C, U+0094
  • that latin1 string encoded as utf-8, as bytes: c3 a2 c2 9c c2 94
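
The three steps above can be reproduced in Ruby. This is only a minimal sketch: hex_bytes and simulate_upstream are hypothetical helper names, and it assumes the upstream really does misread the bytes as ISO-8859-1, which is not confirmed. It uses the check mark U+2714, whose UTF-8 bytes are e2 9c 94.

```ruby
# Hypothetical helper: format a string's bytes as space-separated hex.
def hex_bytes(str)
  str.bytes.map { |b| format('%02x', b) }.join(' ')
end

# Hypothetical helper: mimic the suspected upstream bug by reinterpreting
# UTF-8 bytes as ISO-8859-1 and transcoding the result to UTF-8.
def simulate_upstream(str)
  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

original = "\u2714" # check mark, U+2714
mangled  = simulate_upstream(original)

puts hex_bytes(original) # e2 9c 94
puts hex_bytes(mangled)  # c3 a2 c2 9c c2 94
```

Each original byte at or above 0x80 becomes a two-byte UTF-8 sequence, which is why three bytes turn into six.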

I can fix it by applying upstream.encode('ISO-8859-1').force_encoding('UTF-8'), but it will break as soon as the upstream issue is fixed.
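
One possible guard, sketched here as a heuristic rather than a definitive answer: attempt the reverse conversion and keep the result only if it succeeds and yields valid UTF-8. The name repair_double_utf8 is hypothetical, and the sketch assumes the wrongly assumed encoding is ISO-8859-1. Correctly encoded text usually fails one of the two checks, either because it contains characters outside ISO-8859-1 or because its Latin-1 bytes do not form valid UTF-8.

```ruby
# Hypothetical helper: undo the double encoding only when the input looks
# double-encoded; otherwise return the input unchanged.
def repair_double_utf8(text)
  # Step back to the bytes the upstream originally received. This raises
  # if the text contains characters that ISO-8859-1 cannot represent.
  candidate = text.encode('ISO-8859-1').force_encoding('UTF-8')

  # Double-encoded text yields valid UTF-8 here. Pure ASCII also passes,
  # but it comes back byte-for-byte identical, so that is harmless.
  candidate.valid_encoding? ? candidate : text
rescue EncodingError
  # Covers Encoding::UndefinedConversionError (characters outside Latin-1)
  # and Encoding::InvalidByteSequenceError (input was not valid UTF-8).
  text
end

mangled = "\u2714".dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts repair_double_utf8(mangled) == "\u2714"  # true: repaired
puts repair_double_utf8("\u2714") == "\u2714" # true: left alone
```

The heuristic can misfire: a correctly encoded string that happens to consist only of Latin-1 characters forming a valid UTF-8 byte sequence (for example the literal two characters Ã©) would be "repaired" into something else. Such text is rare in practice, but the check is a probability judgment, not a proof.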

