Channel: Active questions tagged utf-8 - Stack Overflow

A UTF-8 text file is failing to import via pandas with a UTF-8 encoding error

I've got a text file exported from SQL as UTF-8 with about 5.5 million rows. I'm then trying to read this file with pandas in Python, but I'm getting:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 135596: invalid continuation byte

How can I troubleshoot this? I loaded the file into Notepad++ and tried "Convert to UTF-8", but I got the same error. I tried stepping through with a debugger, but pandas parses in quite large chunks and I'm having trouble identifying exactly which character is causing it to choke. I tried reading the file as binary and inspecting position 135596, but I didn't see anything out of the ordinary.

Any suggestions on how to identify the issue in our data? At this point I'm considering doing a binary split search (split the data in half, identify which half gives an error, and keep splitting that way until I find it), but it's quite a lot of text.
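
As an alternative to bisecting by hand, the file can be scanned in binary mode and each line decoded on its own, so that a failure is reported with a line number and an absolute byte offset. This is only a minimal sketch: the function name `find_invalid_utf8` and its `context` and `max_hits` parameters are made up for illustration, and it assumes rows are separated by `\n`. One possible reason position 135596 looked normal is that the position in the error message may be relative to the buffer being decoded at the time rather than to the start of the file, but that is an assumption I have not verified against pandas itself.

```python
def find_invalid_utf8(path, context=20, max_hits=10):
    """Return (line_number, absolute_byte_offset, nearby_bytes) for lines
    that are not valid UTF-8. Hypothetical helper, not part of pandas."""
    hits = []
    offset = 0  # byte offset of the start of the current line within the file
    with open(path, "rb") as f:
        # Splitting on b"\n" is safe for UTF-8: byte 0x0A never occurs
        # inside a multi-byte sequence.
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # Only the first bad byte on each line is recorded.
                lo = max(exc.start - context, 0)
                snippet = raw[lo:exc.end + context]
                hits.append((line_number, offset + exc.start, snippet))
                if len(hits) >= max_hits:
                    break
            offset += len(raw)
    return hits
```

Calling `find_invalid_utf8("export.txt")` (the file name is a placeholder) and printing each tuple shows the raw bytes around every problem spot. For what it's worth, byte 0xED is "í" in Latin-1 and Windows-1252, so if the surrounding bytes look like ordinary text with an accented letter, the export may not really be UTF-8; in that case passing `encoding="cp1252"` to `pd.read_csv` might be worth a try.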
