I have some data stored inside a storage account in Azure.
I have created a datastore linking this storage account to the Azure Machine Learning workspace. I have created two data assets in the Azure ML workspace:
- One for the individual parquet file containing the data
- Another for the folder that holds the file.
I want to pull this data into a pandas dataframe in an Azure ML notebook. The folder will contain multiple files, and I want to create a single dataframe from all of them, so I need something that points to the folder and pulls all the data in that folder into one dataframe.
When I pull in the data for the individual file, I am able to populate the dataframe without any issue. However, when I try to do the same for the entire folder, I get errors.
This is the code I am using. It is generated by Azure itself when we go to the 'Consume' tab of the data asset.
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("folder_name", version="1")

path = {'folder': data_asset.path}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df
When I run this code, I get this error:
UserErrorException:
Error Code: ScriptExecution.StreamAccess.Unexpected
Native Error: Dataflow visit error: ExecutionError(StreamError(Unknown("stream did not contain valid UTF-8", Some(Error { kind: InvalidData, message: "stream did not contain valid UTF-8" }))))
VisitError(ExecutionError(StreamError(Unknown("stream did not contain valid UTF-8", Some(Error { kind: InvalidData, message: "stream did not contain valid UTF-8" })))))
=> Failed with execution error: error in streaming from input data sources
ExecutionError(StreamError(Unknown("stream did not contain valid UTF-8", Some(Error { kind: InvalidData, message: "stream did not contain valid UTF-8" }))))
Error Message: Got unexpected error: stream did not contain valid UTF-8. Error { kind: InvalidData, message: "stream did not contain valid UTF-8" }
The data contains text in several scripts, such as Chinese, Japanese and Hindi, but that does not cause any issue when I pull in the data from the single file.
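
One direction that may be worth trying, sketched here under assumptions rather than confirmed: the generated code calls mltable.from_delimited_files, which reads text formats such as CSV, while the files in the folder are parquet (a binary format), so a UTF-8 decoding error would be the expected outcome. The sketch below swaps in mltable.from_parquet_files with a glob pattern. It assumes every data file sits directly inside the folder and ends in .parquet; folder_glob and load_folder are hypothetical helper names, not part of the SDK.

```python
from typing import Any


def folder_glob(folder_uri: str, extension: str = "parquet") -> str:
    """Build a glob pattern matching files with `extension` directly inside the folder."""
    # Strip any trailing slash so the pattern never contains "//".
    return f"{folder_uri.rstrip('/')}/*.{extension}"


def load_folder(asset_name: str, version: str) -> Any:
    """Read all parquet files of a folder data asset into one pandas DataFrame."""
    # Imported here so folder_glob can be used without the Azure packages installed.
    import mltable
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient.from_config(credential=DefaultAzureCredential())
    data_asset = ml_client.data.get(asset_name, version=version)

    # Assumption: the files are parquet, so use the parquet reader
    # instead of from_delimited_files, which expects delimited text.
    pattern = folder_glob(data_asset.path)
    tbl = mltable.from_parquet_files(paths=[{"pattern": pattern}])
    return tbl.to_pandas_dataframe()
```

With the asset from the code above, this would be called as df = load_folder("folder_name", "1"). If the folder also holds non-parquet files, the glob keeps them out of the table.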