I've got a text file exported from SQL as UTF-8 with about 5.5 million rows. I'm trying to read this file with pandas in Python, but I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 135596: invalid continuation byte
How can I troubleshoot this? I loaded the file into Notepad++ and tried "Convert to UTF-8", but I got the same results. I tried stepping through with a debugger, but pandas is parsing in quite large chunks and I'm having trouble identifying exactly which character is causing it to choke. I tried reading the file as binary and inspecting position 135596 but I didn't see anything out of the ordinary.
Any suggestions on how to identify the issue in our data? At this point I'm considering doing a binary split search (split the data in half, identify which half gives an error, and keep splitting that way until I find it), but it's quite a lot of text.
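A linear scan over the raw bytes may be simpler than a binary split search. The sketch below assumes the file uses `\n` line endings (safe for UTF-8, since multi-byte sequences never contain the byte 0x0A) and decodes one line at a time, recording where decoding fails. The name `find_invalid_utf8` and its `context` and `limit` parameters are illustrative, not part of pandas or the standard library. The position in the pandas error may be relative to the chunk being decoded rather than to the start of the file, which could explain why nothing looked unusual at byte 135596; the offsets reported here are relative to the start of the file.

```python
def find_invalid_utf8(path, context=20, limit=10):
    """Return up to `limit` tuples (line_number, file_offset, snippet)
    for lines in `path` that are not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    hits = []
    offset = 0  # byte offset of the start of the current line within the file
    with open(path, "rb") as f:
        # Iterating a binary file yields raw lines split on b"\n".
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start / exc.end index into `raw`, not into the file.
                start = max(exc.start - context, 0)
                snippet = raw[start:exc.end + context]
                hits.append((line_number, offset + exc.start, snippet))
                if len(hits) >= limit:
                    break
            offset += len(raw)
    return hits
```

Calling `find_invalid_utf8("export.txt")` (a placeholder filename) returns the line number, the absolute byte offset, and the surrounding raw bytes for each bad line, which can then be compared against the source rows. As a side note, byte 0xED is "í" in Latin-1 and Windows-1252, so one possibility worth checking is that some rows were written in one of those encodings rather than UTF-8.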