I have a CSV file that someone encoded wrongly. It looks like this:
movieId,title,actors
(...)
61,Eye for an Eye (1996),(a ton of other actors)|Dolores VelÌÁzquez|(more actors)
59,The Confessional (1995),(a ton of other actors)|Richard FrÌ©chette|Fran̤ois Papineau|Marie Gignac|Normand Daneau|Anne-Marie Cadieux|Suzanne ClÌ©ment|Lynda Beaulieu|Pascal Rollin|Billy Merasty|Paul HÌ©bert|Marthe Turgeon|Adreanne Lepage-Beaulieu|AndrÌ©e-Anne ThÌ©roux-Faille|Rodrigue Proteau|Philippe Paquin|Pierre HÌ©bert|Nathalie D'Anjou|Danielle Fichaud|Jules Philip|Jacques Laroche|Claude-Nicolas Demers|Jean-Philippe CÌ«tÌ©|Tristan Wiseman|Marc-Olivier Tremblay|Jacques Brouillet|Jean-Paul L'Allier|Denis Bernard|RenÌ©e Hudon|Serge Laflamme|Carl Mathieu
(...)
As you can see, instead of umlauts and accented letters (ÄÖÜ, É, À, Û, etc.), the actors' names contain combinations of other special characters. I suspect the text was encoded twice in a row: the two bytes that together form one umlaut or accented letter in UTF-8 were each interpreted as a separate character and then written out again as two separate UTF-8 symbols.
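To illustrate the kind of double encoding I suspect, here is a minimal Python sketch. The name "Frédéric" and the choice of Latin-1 as the wrong codec are assumptions made only for the example. My file shows "Ì" rather than "Ã" as the first character, so the codec actually involved is presumably a different one.

```python
# Illustration only: Latin-1 is an assumed codec, not the one from my file.
original = "Frédéric"

# Step 1: the text is correctly encoded as UTF-8 ("é" becomes two bytes).
utf8_bytes = original.encode("utf-8")

# Step 2: the bytes are wrongly decoded with a single-byte codec,
# so each of the two bytes turns into its own character.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # FrÃ©dÃ©ric

# If both codecs are known and nothing was lost, the damage is reversible.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored)  # Frédéric
```

If my file were produced this way, reversing the two steps with the right pair of codecs should restore the names without any manual mapping.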
My goal is to restore the correct Umlauts etc.
I have found that all broken umlauts etc. follow this scheme: the first character is an "Ì", followed by a second symbol, unless the original letter was an "Á", as in "Ángel", which appears as "Ìngel" in my CSV.
The broken sequences seem to be case-sensitive: the original letters Á and á do not map to the same broken sequence in the file.
I have tried every common encoding to rule out that this is just something very similar to UTF-8, but only UTF-8 comes close to being correct (the other encodings break more characters and the Umlauts etc. are always broken).
I have tried using regular expressions to replace broken sequences for which I know the original actor name, and can therefore infer which broken combination maps back to which original letter. The problem is that a broken sequence is not always two characters long: as the Á example above shows, it can be a single character, so I cannot replace it with a regex until all other replacements have been done. Along the way, some replacements went wrong, so I suspect that some rarer letters ended up as a combination of 3 bytes instead of 2.
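For reference, this is roughly the replacement approach I mean, as a minimal Python sketch. The mapping is a partial, hand-collected one taken from the list in my edit below, and the names `BROKEN_TO_ORIGINAL` and `repair` are made up for this example. Sorting the keys longest first lets the two-character sequences win over the bare "Ì".

```python
import re

# Partial mapping from the list in this question; deliberately incomplete.
BROKEN_TO_ORIGINAL = {
    "ÌÁ": "á",
    "ÌÏ": "Ü",
    "ÌÐ": "Ö",
    "ÌÙ": "ß",
    "ÌÈ": "û",
    "Ì«": "ô",
    "Ì": "Á",  # single-character case, must be tried last
}

# Longest keys first, so "ÌÁ" is matched before the bare "Ì".
_pattern = re.compile(
    "|".join(
        re.escape(key)
        for key in sorted(BROKEN_TO_ORIGINAL, key=len, reverse=True)
    )
)


def repair(text: str) -> str:
    """Replace every known broken sequence in a single pass."""
    return _pattern.sub(lambda m: BROKEN_TO_ORIGINAL[m.group(0)], text)


print(repair("Dolores VelÌÁzquez"))  # Dolores Velázquez
print(repair("Ìngel"))               # Ángel
```

The weakness is visible right away: any sequence that is missing from the mapping still starts with "Ì", so it gets turned into "Á" plus a leftover symbol, which is the kind of wrong replacement I keep running into.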
I think this faulty CSV has been generated in Java.
Is there any way for me to
- Find out which two encodings were applied one after the other, which led to the broken file, and
- Fix the errors somehow programmatically?
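For the first point, what I have in mind is a brute-force search along these lines. This is only a sketch: the candidate list is an arbitrary assumption, the function name `find_encoding_pairs` is made up, and it only detects a single "written with one codec, read with another" round trip. If the real corruption involved more steps or lost information, it will find nothing.

```python
from itertools import permutations

# Arbitrary candidate list (an assumption); extend it as needed.
CANDIDATES = [
    "utf-8", "latin-1", "cp1252", "mac_roman",
    "cp437", "cp850", "iso8859_15",
]


def find_encoding_pairs(original, broken, candidates=CANDIDATES):
    """Return every (written_as, read_as) pair that turns original into broken."""
    matches = []
    for written_as, read_as in permutations(candidates, 2):
        try:
            if original.encode(written_as).decode(read_as) == broken:
                matches.append((written_as, read_as))
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot represent the sample at all
    return matches


# Demo with a well-known kind of mojibake, not with data from my file.
# The result includes ('utf-8', 'latin-1').
print(find_encoding_pairs("é", "Ã©"))
```

For my data, the inputs would be a known original letter and the matching broken sequence copied exactly from the file. Several codecs can give the same result for one sample, so more than one known pair would be needed to narrow it down.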
Edit: Here is a list of broken character sequences in my CSV, together with the original characters that I know:
current | original
===================
Ì_ | ü or ä or í
̦ | ö
ÌÏ | Ü
ÌÐ | Ö
̵ | õ
ÌÁ | á
ÌÙ | ß
åÁÌ | ¡
å¡2 | °C or ° (I am not sure)
̨ | î
ÌÈ | û
Ì« | ô