In Python 3.6 through 3.8 on Windows, surrogatepass is the default error handler for the filesystem encoding (UTF-8).
This can cause problems for users whose file paths contain bytes that are not valid UTF-8.
Why does Windows use surrogatepass instead of surrogateescape, as other platforms (Linux, macOS) do? surrogateescape can handle these bytes. For example:
>>> import sys
>>> sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()
('utf-8', 'surrogateescape')
>>>
>>> # This raises an error:
>>>
>>> b'C:\\Users\\me\\OneDrive\\\xe0\xcd\xa1\xca\xd2\xc3\\my.txt'.decode('utf-8', errors="surrogatepass")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 24: invalid continuation byte
>>> # Compared to:
>>> b'C:\\Users\\me\\OneDrive\\\xe0\xcd\xa1\xca\xd2\xc3\\my.txt'.decode('utf-8', errors="surrogateescape")
'C:\\Users\\me\\OneDrive\\\udce0͡\udcca\udcd2\udcc3\\my.txt'
Note: my guess is that this is necessary because the underlying NTFS filesystem stores names as UTF-16 rather than as null-terminated bytes, which imposes constraints on Python's filesystem encoding that are not present on Linux/macOS.
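
To make that guess concrete, here is a minimal sketch of how the two handlers differ. It uses only the standard str/bytes codec machinery and makes no claim about what CPython does internally on Windows. The file name with an unpaired surrogate and the helper roundtrips() are made up for illustration; the raw bytes are the ones from the path above.

```python
def roundtrips(text: str, errors: str) -> bool:
    """Return True if text survives a UTF-8 encode/decode with this handler."""
    try:
        encoded = text.encode("utf-8", errors=errors)
        return encoded.decode("utf-8", errors=errors) == text
    except UnicodeError:
        return False


# Hypothetical name holding an unpaired UTF-16 high surrogate, which a
# UTF-16 based filesystem API could hand back (assumption for illustration).
lone_surrogate_name = "bad\ud800name.txt"

# Bytes from the path in the question that are not valid UTF-8.
raw = b"\xe0\xcd\xa1\xca\xd2\xc3"
escaped = raw.decode("utf-8", errors="surrogateescape")

# surrogatepass can encode any lone surrogate, so the name round-trips.
print(roundtrips(lone_surrogate_name, "surrogatepass"))    # True

# surrogateescape only maps U+DC80..U+DCFF back to bytes, so U+D800 fails.
print(roundtrips(lone_surrogate_name, "surrogateescape"))  # False

# surrogateescape does round-trip arbitrary undecodable bytes.
print(escaped.encode("utf-8", errors="surrogateescape") == raw)  # True
```

If that reading is right, each handler protects a different kind of "invalid" name: surrogatepass preserves lone surrogates coming from UTF-16 names, while surrogateescape preserves arbitrary bytes coming from byte-oriented names.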