I am trying to convert CSV files into TSV files by doing the conversion through a pandas DataFrame.
import os
import pandas as pd

# input_dir and output_dir are defined earlier
for csv_file in os.listdir(input_dir):
    if csv_file.endswith('.csv'):
        print("working on " + csv_file)
        # Full path to the current CSV file
        csv_path = os.path.join(input_dir, csv_file)
        # Read the CSV file
        df = pd.read_csv(csv_path, encoding='utf-8')
        # Create corresponding TSV file name
        base_name = os.path.splitext(csv_file)[0]
        tsv_file = os.path.join(output_dir, f'{base_name}.tsv')
        # Convert and save as TSV file
        df.to_csv(tsv_file, sep='\t', index=False)
        print(f"File {csv_file} successfully converted to {tsv_file}")
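For context, one workaround I considered is a small helper that tries utf-8-sig first (which strips a BOM if present) and only then falls back to latin-1, so one problem file doesn't abort the whole batch. This is just a sketch; read_csv_any is my own name, not a pandas function:

```python
import pandas as pd

def read_csv_any(path, encodings=("utf-8-sig", "latin-1")):
    """Try each encoding in turn; return the first DataFrame that parses.

    utf-8-sig strips a leading BOM; latin-1 maps every byte to a
    character, so it never raises but may mangle non-ASCII text.
    """
    last_err = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_err = err
    raise last_err
```

The latin-1 fallback always "succeeds", which is exactly why it hides the real problem instead of raising.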
I am pretty confident that all the CSV files were encoded as "UTF-8 with BOM". However, a couple of these files fail with the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 119: unexpected end of data
I then tried "latin-1" decoding, which succeeds but puts "ï»¿" as the first 3 characters of the TSV file, which from my understanding signifies that the file was encoded with UTF-8-BOM. So why can't pandas correctly read the file with UTF-8 encoding?
- Note utf-8-sig results in the same error as utf-8
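To narrow this down I also wrote a small diagnostic that reads the raw bytes and reports where decoding first fails, so the offending byte sequence can be inspected directly (the function name and context size are my own choices):

```python
def find_decode_error(path, encoding="utf-8", context=16):
    """Return (offset, surrounding bytes) of the first undecodable byte,
    or None if the whole file decodes cleanly under `encoding`."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode(encoding)
        return None
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        # exc.start is the byte offset where decoding failed
        return exc.start, raw[start:exc.start + context]
```

Printing the returned byte window shows whether the 0xe3 is a truncated UTF-8 sequence (e.g. the file was cut mid-character) or a stray byte from another encoding.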