Channel: Active questions tagged utf-8 - Stack Overflow

Python 2.7 reading and writing "éèàçê" from utf-8 file


I wrote this script, which strips trailing whitespace from every line and replaces the badly encoded French characters with the correct ones.

Stripping the trailing whitespace works, but replacing the French characters does not.

The files being read and written are encoded in UTF-8, so I added a UTF-8 encoding declaration at the top of my script, but in the end every bad character (like \u00e9) is replaced by a little square.

Any idea why?

Script:

    # --*-- encoding: utf-8 --*--
    import fileinput
    import sys

    CRLF = "\r\n"
    ACCENT_AIGU = "\\u00e9"
    ACCENT_GRAVE = "\\u00e8"
    C_CEDILLE = "\\u00e7"
    A_ACCENTUE = "\\u00e0"
    E_CIRCONFLEXE = "\\u00ea"
    CURRENT_ENCODING = "utf-8"

    #Getting filepath
    print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
    path = str(raw_input())
    path.replace("\\", "/")

    #removing trailing whitespace characters
    for line in fileinput.FileInput(path, inplace=1):
        if line != CRLF:
            line = line.rstrip()
            print line
            print >>sys.stderr, line
        else:
            print CRLF
            print >>sys.stderr, CRLF
    fileinput.close()

    #Replacing bad wharacters
    for line in fileinput.FileInput(path, inplace=1):
        line = line.decode(CURRENT_ENCODING)
        line = line.replace(ACCENT_AIGU, "é")
        line = line.replace(ACCENT_GRAVE, "è")
        line = line.replace(A_ACCENTUE, "à")
        line = line.replace(E_CIRCONFLEXE, "ê")
        line = line.replace(C_CEDILLE, "ç")
        line.encode(CURRENT_ENCODING)
        sys.stdout.write(line) #avoid CRLF added by print
        print >>sys.stderr, line
    fileinput.close()
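
A side note on the script above, offered as a small illustration rather than a diagnosis of the squares: Python strings are immutable, so `replace()` and `encode()` return new objects instead of changing the original. The calls `path.replace("\\", "/")` and `line.encode(CURRENT_ENCODING)` in the script discard their results. The sample path below is made up for the example.

```python
# Strings are immutable: replace() and encode() return new objects.
path = "C:\\temp\\fichier.java"   # hypothetical sample path

path.replace("\\", "/")           # result discarded, path is unchanged
unchanged = path

path = path.replace("\\", "/")    # result kept by re-assigning
normalized = path

text = u"m\u00e9thode"
encoded = text.encode("utf-8")    # a new byte string; text itself is untouched
```

Here `unchanged` still contains the backslashes, while `normalized` only contains forward slashes.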

EDIT

The input file contains this kind of text:

     * Cette m\u00e9thode permet d'appeller le service du module de tourn\u00e9e
     * <code>rechercherTechnicien</code> et retourne la liste repr\u00e9sentant le num\u00e9ro
     * de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien et la dur\u00e9e
     * th\u00e9orique por se rendre au point d'intervention.
     *
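
To make the sample concrete: assuming the file really stores the literal text `\u00e9` (a backslash, the letter `u`, and four hex digits), each occurrence is six ASCII characters on disk, not the single character é. A minimal sketch of the difference, with variable names chosen only for illustration:

```python
escaped = "m\\u00e9thode"   # backslash + 'u' + four hex digits, as stored in the file
real = u"m\u00e9thode"      # the actual character é (U+00E9)

escaped_len = len(escaped)  # the escape alone takes six characters
real_len = len(real)        # é counts as a single character

# The four hex digits after "\u" name the code point of the intended character.
code_point = int(escaped[3:7], 16)
```

So the replacement step has to search for the six-character escape and substitute the one character whose code point those hex digits spell out.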

EDIT2

Final code, in case anyone is interested: the first part replaces the badly encoded characters, and the second part strips trailing whitespace from every line.

    # --*-- encoding: iso-8859-1 --*--
    import fileinput
    import re

    CRLF = "\r\n"

    print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
    path = str(raw_input())
    path = path.replace("\\", "/")

    def unicodize(seg):
        if re.match(r'\\u[0-9a-f]{4}', seg):
            return seg.decode('unicode-escape')
        return seg.decode('utf-8')

    print "Replacing caracter badly encoded"
    with open(path,"r") as f:
        content = f.read()

    replaced = (unicodize(seg) for seg in re.split(r'(\\u[0-9a-f]{4})',content))

    with open(path, "w") as o:
        o.write(''.join(replaced).encode("utf-8"))

    print "Removing trailing whitespaces caracters"
    for line in fileinput.FileInput(path, inplace=1):
        if line != CRLF:
            line = line.rstrip()
            print line
        else:
            print CRLF
    fileinput.close()

    print "Done!"
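
For anyone on Python 3, the same two steps could be sketched roughly as follows. This is not the code above ported line by line: the function names are hypothetical, the pattern also accepts uppercase hex digits (an assumption, since the pattern above only matches lowercase), and line endings are normalised to `\n` instead of being kept as CRLF.

```python
import re

# Matches a literal \uXXXX escape; accepts upper- and lowercase hex digits
# (an assumption: the pattern in the Python 2 code only allows lowercase).
ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_text(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


def strip_trailing_whitespace(text):
    """Strip trailing whitespace from every line, joining lines with \\n."""
    return "\n".join(line.rstrip() for line in text.splitlines()) + "\n"


def clean_file(path):
    """Rewrite the UTF-8 file at `path` in place (hypothetical helper)."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        content = f.read()
    content = strip_trailing_whitespace(unescape_text(content))
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(content)
```

Calling `clean_file(path)` would rewrite the file in one pass. Note that this sketch does not combine surrogate pairs, so escapes for characters outside the Basic Multilingual Plane would need extra handling.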
