I'm having trouble manipulating a dataset in Python that contains non-UTF-8 characters. The strings are imported as binary, and I can't convert the binary columns to strings when a cell contains non-UTF-8 bytes.
A minimal working example of my issue is:
import polars as pl
import pandas as pd

pd_df = pd.DataFrame([[b"bob", b"value 2", 3], [b"jane", b"\xc4", 6]], columns=["a", "b", "c"])
df = pl.from_pandas(pd_df)
column_names = df.columns

# Loop through the column names
for col_name in column_names:
    # Check if the column has binary values
    if df[col_name].dtype == pl.Binary:
        # Convert the binary column to string format
        print(col_name)
        df = df.with_columns(pl.col(col_name).cast(pl.String))
This throws an error when converting column b, where the value b"\xc4" is not valid UTF-8. As a solution, I'm fine with converting any non-UTF-8 characters to blanks.
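To illustrate the result I'm after, here is a minimal sketch in plain Python rather than Polars. It assumes that silently dropping the invalid bytes is acceptable; `decode_lenient` and `raw_values` are hypothetical names used only for this illustration.

```python
# The values from columns a and b of the example above, as raw bytes.
raw_values = [b"bob", b"value 2", b"jane", b"\xc4"]


def decode_lenient(value: bytes) -> str:
    # errors="ignore" drops any bytes that are not valid UTF-8
    # instead of raising UnicodeDecodeError.
    return value.decode("utf-8", errors="ignore")


decoded = [decode_lenient(v) for v in raw_values]
print(decoded)  # ['bob', 'value 2', 'jane', '']
```

So the cell containing b"\xc4" would end up as an empty string, and the valid cells would be unchanged.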
I have tried many of the conversion suggestions I found online, but I can't get any of them to work.