I wrote this script, which removes all trailing whitespace characters and replaces badly encoded French characters with the correct ones.
Removing the trailing whitespace works, but replacing the French characters does not.
The files to read and write are encoded in UTF-8, so I added the utf-8 declaration at the top of my script, but in the end every bad character (like \u00e9) is replaced by a little square.
Any idea why?
Script:
    # --*-- encoding: utf-8 --*--
    import fileinput
    import sys

    CRLF = "\r\n"
    ACCENT_AIGU = "\\u00e9"
    ACCENT_GRAVE = "\\u00e8"
    C_CEDILLE = "\\u00e7"
    A_ACCENTUE = "\\u00e0"
    E_CIRCONFLEXE = "\\u00ea"
    CURRENT_ENCODING = "utf-8"

    #Getting filepath
    print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
    path = str(raw_input())
    path.replace("\\", "/")

    #removing trailing whitespace characters
    for line in fileinput.FileInput(path, inplace=1):
        if line != CRLF:
            line = line.rstrip()
            print line
            print >>sys.stderr, line
        else:
            print CRLF
            print >>sys.stderr, CRLF
    fileinput.close()

    #Replacing bad characters
    for line in fileinput.FileInput(path, inplace=1):
        line = line.decode(CURRENT_ENCODING)
        line = line.replace(ACCENT_AIGU, "é")
        line = line.replace(ACCENT_GRAVE, "è")
        line = line.replace(A_ACCENTUE, "à")
        line = line.replace(E_CIRCONFLEXE, "ê")
        line = line.replace(C_CEDILLE, "ç")
        line.encode(CURRENT_ENCODING)
        sys.stdout.write(line) #avoid CRLF added by print
        print >>sys.stderr, line
    fileinput.close()
EDIT
The input file contains text like this:
     * Cette m\u00e9thode permet d'appeller le service du module de tourn\u00e9e
     * <code>rechercherTechnicien</code> et retourne la liste repr\u00e9sentant le num\u00e9ro
     * de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien et la dur\u00e9e
     * th\u00e9orique por se rendre au point d'intervention.
     *
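To make clear what the file actually holds, here is a minimal Python 3 sketch (not part of my Python 2 script; the sample string and the `decode_escapes` helper are made up for illustration). Each `\u00e9` in the file is six literal characters, not the single character é, so it has to be converted explicitly:

```python
import re

# Hypothetical sample, as it would appear in the file: the escape is a
# literal backslash sequence, hence the raw string.
sample = r"m\u00e9thode"


def decode_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


print(len(r"\u00e9"))          # length of the escape as stored in the file
print(decode_escapes(sample))  # the same word with the escape converted
```

Text without any escape sequence passes through `decode_escapes` unchanged.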
EDIT2
Final code, if anyone is interested: the first part replaces the badly encoded characters, and the second part removes all trailing whitespace characters.
    # --*-- encoding: iso-8859-1 --*--
    import fileinput
    import re

    CRLF = "\r\n"

    # Prompt (in French): "Please enter the file path (use \ or /, either works)"
    print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
    path = str(raw_input())
    path = path.replace("\\", "/")

    def unicodize(seg):
        # A literal \uXXXX escape is decoded to the character it names;
        # any other segment is decoded as plain UTF-8 bytes.
        if re.match(r'\\u[0-9a-f]{4}', seg):
            return seg.decode('unicode-escape')
        return seg.decode('utf-8')

    print "Replacing badly encoded characters"
    with open(path,"r") as f:
        content = f.read()

    # The capturing group makes re.split keep the escapes as separate segments.
    replaced = (unicodize(seg) for seg in re.split(r'(\\u[0-9a-f]{4})',content))

    with open(path, "w") as o:
        o.write(''.join(replaced).encode("utf-8"))

    print "Removing trailing whitespace characters"
    for line in fileinput.FileInput(path, inplace=1):
        if line != CRLF:
            line = line.rstrip()
            print line
        else:
            print CRLF
    fileinput.close()

    print "Done!"
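For anyone on Python 3, the same two steps could look roughly like this. It is only a sketch under a few assumptions: the file is UTF-8, the escapes are plain four-digit `\uXXXX` sequences like the ones in my sample, and the names `unescape`, `strip_trailing_whitespace` and `clean_file` are mine, not from any library.

```python
import re

# Matches a literal backslash, "u", and four hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape(text):
    """Turn literal \\uXXXX sequences into the characters they encode."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def strip_trailing_whitespace(text):
    """Strip trailing whitespace from every line, keeping blank lines."""
    return "\n".join(line.rstrip() for line in text.split("\n"))


def clean_file(path):
    """Rewrite the file at `path` in place (assumed to be UTF-8)."""
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    cleaned = strip_trailing_whitespace(unescape(content))
    with open(path, "w", encoding="utf-8") as f:
        f.write(cleaned)
```

Calling `clean_file(path)` does both passes in one read and one write instead of going through `fileinput` twice. One caveat: an escape in the surrogate range (for example `\ud83d`) would become a lone surrogate that cannot be written as UTF-8, so this only suits text like my sample.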