Channel: Active questions tagged utf-8 - Stack Overflow

Backpropagation of wrongly (double) encoded CSV


I have a CSV file that someone encoded wrongly. It looks like this:

movieId,title,actors
(...)
61,Eye for an Eye (1996),(a ton of other actors)|Dolores VelÌÁzquez|(more actors)
59,The Confessional (1995),(a ton of other actors)|Richard FrÌ©chette|Fran̤ois Papineau|Marie Gignac|Normand Daneau|Anne-Marie Cadieux|Suzanne ClÌ©ment|Lynda Beaulieu|Pascal Rollin|Billy Merasty|Paul HÌ©bert|Marthe Turgeon|Adreanne Lepage-Beaulieu|AndrÌ©e-Anne ThÌ©roux-Faille|Rodrigue Proteau|Philippe Paquin|Pierre HÌ©bert|Nathalie D'Anjou|Danielle Fichaud|Jules Philip|Jacques Laroche|Claude-Nicolas Demers|Jean-Philippe CÌ«tÌ©|Tristan Wiseman|Marc-Olivier Tremblay|Jacques Brouillet|Jean-Paul L'Allier|Denis Bernard|RenÌ©e Hudon|Serge Laflamme|Carl Mathieu
(...)

As you can see, instead of umlauts and accented letters (ÄÖÜ, É, À, Û etc.), the actor names contain combinations of other special characters. I suspect this is because the text was encoded twice in a row, so that two bytes that belong together in UTF-8 and form one umlaut or accented letter were each taken individually and turned into two separate UTF-8 symbols.
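To make the suspected mechanism concrete, here is a minimal sketch of a double encoding and its reversal. The choice of Latin-1 as the misapplied codec is purely an assumption for illustration, and the function names are hypothetical.

```python
def double_encode(text: str, wrong_codec: str = "latin-1") -> str:
    """Encode as UTF-8, then misread each byte using a single-byte codec."""
    return text.encode("utf-8").decode(wrong_codec)


def undo_double_encode(text: str, wrong_codec: str = "latin-1") -> str:
    """Reverse the step above: recover the raw bytes, then read them as UTF-8."""
    return text.encode(wrong_codec).decode("utf-8")


broken = double_encode("Frédéric")
print(broken)                      # FrÃ©dÃ©ric
print(undo_double_encode(broken))  # Frédéric
```

With Latin-1 the first symbol comes out as "Ã", not the "Ì" seen in the file, so plain UTF-8-read-as-Latin-1 is probably not the exact combination here, even though the general pattern (one letter turning into two symbols) looks similar.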

My goal is to restore the correct Umlauts etc.

I have found that all broken umlauts etc. follow the same scheme: the first symbol is an "Ì", followed by a second symbol, unless the original letter was an "Á", as in "Ángel", which appears as "Ìngel" in the CSV that I have.

The broken sequences seem to be case-sensitive: the original letters Á and á map to different broken sequences in the file.
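To see exactly what a broken sequence consists of, it can help to list its code points and UTF-8 bytes instead of relying on how it is rendered. A small sketch (the helper name `describe` is hypothetical):

```python
import unicodedata


def describe(text: str):
    """Return (code point, UTF-8 bytes as hex, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", ch.encode("utf-8").hex(), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# "Ì©" as it appears in "FrÌ©chette", written with escapes to avoid copy/paste damage
for row in describe("\u00cc\u00a9"):
    print(row)
# ('U+00CC', 'c38c', 'LATIN CAPITAL LETTER I WITH GRAVE')
# ('U+00A9', 'c2a9', 'COPYRIGHT SIGN')
```

Running this over the real file contents, rather than over text copied from a browser, avoids any additional conversion introduced by copy and paste.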

I have tried every common encoding to rule out that this is just something very similar to UTF-8, but only UTF-8 comes close to being correct (the other encodings break more characters and the Umlauts etc. are always broken).

I have tried regex replacements for broken sequences where I know the original actor name and can therefore infer which broken combination maps back to which original letter. The problem is that a broken sequence is not always two symbols long: as the Á example above shows, it can be a single symbol, so I cannot replace it with a regex until all other replacements have been done. Along the way, some replacements went wrong, so I suspect that some very special letters are represented by a combination of three bytes instead of two.

I think this faulty CSV has been generated in Java.

Is there any way for me to

  1. Find out which two encodings were applied one after the other to produce the broken file, and
  2. Fix the errors somehow programmatically?
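For point 1, a brute-force search over codec pairs is one possible starting point. The sketch below assumes the damage is a single "encode with A, decode with B" step. The candidate list is an arbitrary selection of codecs from Python's standard library, and `find_codec_pairs` is a hypothetical helper name.

```python
from itertools import permutations

# Arbitrary shortlist; extend it with any codec name Python knows about.
CANDIDATES = ["utf-8", "latin-1", "cp1252", "mac_roman", "cp437", "cp850", "iso8859_15"]


def find_codec_pairs(samples, codecs=CANDIDATES):
    """Return (encode_with, decode_with) pairs that turn every known
    original string into its observed broken form."""
    hits = []
    for enc, dec in permutations(codecs, 2):
        try:
            if all(orig.encode(enc).decode(dec) == broken for orig, broken in samples):
                hits.append((enc, dec))
        except UnicodeError:
            # This pair cannot even represent the sample; skip it.
            continue
    return hits


# Synthetic sample (not taken from the file) to show the output format
print(find_codec_pairs([("é", "\u00c3\u00a9")]))
```

Feeding in known (original, broken) pairs from the file would narrow the list down. An empty result would suggest that more than one conversion step, or a lossy step, is involved.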

Edit: Here is a list of character sequences in the CSV that I have, together with the original characters that I know:

current | original
===================
Ì_      | ü or ä or í
̦      | ö
ÌÏ      | Ü
ÌÐ      | Ö
̵      | õ
ÌÁ      | á
ÌÙ      | ß
åÁÌ     | ¡
å¡2     | °C or ° (I am not sure)
̨      | î
ÌÈ      | û
Ì«      | ô
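If no clean codec pair turns up, a table-driven replacement is a fallback. The sketch below uses only some of the unambiguous rows from the table above and builds a single regex with the longest keys first, so a one-symbol rule (such as a lone "Ì" for "Á") could be added without firing ahead of the two-symbol rules. The names `KNOWN_FIXES` and `make_fixer` are hypothetical.

```python
import re

# Some unambiguous rows from the table, written as escapes to avoid
# copy/paste damage. Ambiguous rows such as "Ì_" are deliberately left out.
KNOWN_FIXES = {
    "\u00cc\u00cf": "Ü",  # ÌÏ
    "\u00cc\u00d0": "Ö",  # ÌÐ
    "\u00cc\u00c1": "á",  # ÌÁ
    "\u00cc\u00d9": "ß",  # ÌÙ
    "\u00cc\u00c8": "û",  # ÌÈ
    "\u00cc\u00ab": "ô",  # Ì«
}


def make_fixer(table):
    """Build a function that replaces every key of `table` in one pass."""
    # Longest keys first, so a short key never pre-empts a longer one.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)


fix = make_fixer(KNOWN_FIXES)
print(fix("Dolores Vel\u00cc\u00c1zquez"))  # Dolores Velázquez
```

A row like "Ì_", which stands for ü, ä or í, cannot be repaired by any lookup table, since the information needed to tell the originals apart is no longer in the file.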
