Channel: Active questions tagged utf-8 - Stack Overflow

UTF-8 Decode Error Reading CSV into Pandas Despite UTF-8 Encoding

I am trying to convert csv files into tsv files by doing the conversion in a pandas dataframe.

    import os

    import pandas as pd

    # input_dir and output_dir are defined elsewhere in the script
    for csv_file in os.listdir(input_dir):
        if csv_file.endswith('.csv'):
            print("working on " + csv_file)
            # Full path to the current CSV file
            csv_path = os.path.join(input_dir, csv_file)
            # Read the CSV file
            df = pd.read_csv(csv_path, encoding='utf-8')
            # Create corresponding TSV file name
            base_name = os.path.splitext(csv_file)[0]
            tsv_file = os.path.join(output_dir, f'{base_name}.tsv')
            # Convert and save as TSV file
            df.to_csv(tsv_file, sep='\t', index=False)
            print(f"File {csv_file} successfully converted to {tsv_file}")

I am pretty confident that all the CSV files were encoded using "UTF-8 with BOM". However, a couple of these files fail with the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 119: unexpected end of data

I then tried decoding with "Latin-1", which causes "ï»¿" to be the first 3 characters in the TSV file. From my understanding, this signifies that the file was encoded as UTF-8 with BOM. So why can't pandas then read the file correctly with UTF-8 encoding?
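The three stray characters can be reproduced without pandas: the UTF-8 BOM is the byte sequence EF BB BF, and Latin-1 maps every byte to exactly one character. A minimal sketch, using a made-up byte string (`sample` is hypothetical, not one of the actual files):

```python
import codecs

# Hypothetical sample: a UTF-8 BOM followed by a short CSV header.
sample = codecs.BOM_UTF8 + b"id,name\n"

as_latin1 = sample.decode("latin-1")      # each BOM byte becomes its own character
as_utf8 = sample.decode("utf-8")          # BOM is kept as the single character U+FEFF
as_utf8_sig = sample.decode("utf-8-sig")  # BOM is recognised and stripped

print(repr(as_latin1[:3]))
print(repr(as_utf8[0]))
print(repr(as_utf8_sig))
```

Latin-1 never raises a decode error because all 256 byte values are valid in it, so a successful Latin-1 read says nothing about whether the rest of the file is valid UTF-8.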

  • Note: utf-8-sig results in the same error as utf-8.
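One way to narrow this down is to decode the raw bytes directly and look at what surrounds the offending byte. The sketch below is only a diagnostic aid under assumptions: `find_decode_error` is a hypothetical helper, and `truncated` is made-up data in which a 3-byte character (which would start with 0xE3) is cut off after its first byte. Python reports exactly this situation as "unexpected end of data".

```python
def find_decode_error(raw: bytes, encoding: str = "utf-8"):
    """Return (position, reason, nearby bytes) for the first undecodable
    byte sequence, or None if the data decodes cleanly."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # Keep a few bytes on either side of the failure for inspection.
        context = raw[max(0, exc.start - 10): exc.end + 10]
        return exc.start, exc.reason, context
    return None


# Hypothetical data: the lead byte 0xE3 announces a 3-byte sequence,
# but the data ends before the two continuation bytes arrive.
truncated = b"id,name\n1," + b"\xe3"

result = find_decode_error(truncated)
result_sig = find_decode_error(truncated, "utf-8-sig")
print(result)
print(result_sig)
```

To apply it to a real file, pass the bytes read in binary mode, for example `find_decode_error(open(csv_path, "rb").read())`. If the whole file decodes cleanly this way, the bytes themselves are valid UTF-8 and the cause lies somewhere other than the file's encoding.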
