Channel: Active questions tagged utf-8 - Stack Overflow

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 139390: invalid continuation byte when scraping from Twitter API


I have a web scraper built with the Python package tweepy, which I regularly use to gather tweets for research. It has suddenly stopped working: it seems it can no longer decode all the characters.

# open and create a file to append the data to
csvFile = open('tweets.csv', 'a')
csvWriter = csv.writer(csvFile)

# use the csv file
# loop through the tweets variable and add contents to the CSV file
for tweet in tweets:
    text = tweet.full_text.strip()
    # convert the text to ascii ignoring all unicode characters, eg. emojis
    text_ascii = text.encode('ascii','ignore').decode()
    # split the text on whitespace and newlines into a list of words
    text_list = text_ascii.split()
    # iterate over the words, removing @ mentions or URLs
    text_list_filtered = [word for word in text_list if not (word.startswith('@') or word.startswith('http'))]
    # join the list back into a string
    text_filtered = ''.join(text_list_filtered)
    # decoding html escaped characters
    text_filtered = html.unescape(text_filtered)
    # write text to the CSV file
    csvWriter.writerow([tweet.created_at, tweet.place, text_filtered])
    print(tweet.created_at, tweet.place, text_filtered)
csvFile.close()

When I then try to read the file into a pandas DataFrame, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 139390: invalid continuation byte

The line that is giving me the error is this:

tweetsdf = pd.read_csv('tweets.csv')
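
For context on the message itself: in UTF-8, byte 0xe1 can only start a three-byte sequence, whereas in single-byte encodings such as Latin-1 or cp1252 it is the character á on its own. So a file containing a lone 0xe1 was probably not written as UTF-8. The snippet below is only a sketch that reproduces the same kind of error. The sample string is made up and is not taken from the actual tweets.csv.

```python
# Hypothetical sample text, used only to reproduce the error type
sample = "está bien"

# Encode with a single-byte encoding; "á" becomes the single byte 0xE1
raw = sample.encode("latin-1")

try:
    # Decoding those bytes as UTF-8 fails: 0xE1 announces a multi-byte
    # sequence, but the next byte (a space) is not a continuation byte
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
# 'utf-8' codec can't decode byte 0xe1 in position 3: invalid continuation byte
```

If this is what happened here, one plausible route for such a byte into the file is that html.unescape runs after the ASCII filtering, so an entity like &aacute; is turned back into á and then written using the platform's default encoding. That is an assumption and cannot be confirmed from the post alone.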

I have tried to change the following bit of code from this:

text_ascii = text.encode('ascii','ignore').decode()

to this:

text_ascii = text.encode('utf-8','ignore').decode()

But then I get the same problem when I try to collect the tweets from the API. What should I do?
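
One thing that might be worth trying, sketched below under assumptions: unescape the HTML entities before stripping non-ASCII characters, and open the CSV with an explicit encoding instead of relying on the platform default. The helper name clean_tweet_text, the sample row, and the temporary file are hypothetical and stand in for the real tweepy objects. The words are joined with a space here, which differs from the ''.join in the code above.

```python
import csv
import html
import os
import tempfile


def clean_tweet_text(text):
    """Hypothetical helper: unescape entities, drop non-ASCII, mentions, URLs."""
    # Unescape first, so characters produced from entities are filtered too
    text = html.unescape(text.strip())
    text_ascii = text.encode("ascii", "ignore").decode("ascii")
    words = [
        word for word in text_ascii.split()
        if not (word.startswith("@") or word.startswith("http"))
    ]
    return " ".join(words)


# Made-up row standing in for (tweet.created_at, tweet.place, tweet.full_text)
rows = [("2020-08-01", None, "@user caf&eacute; ma&ntilde;ana https://t.co/x ok")]

# A temporary file keeps the sketch from touching a real tweets.csv
path = os.path.join(tempfile.mkdtemp(), "tweets_demo.csv")

# An explicit encoding makes the bytes on disk independent of the OS locale;
# newline="" is what the csv module documentation recommends for file objects
with open(path, "a", encoding="utf-8", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for created_at, place, text in rows:
        writer.writerow([created_at, place, clean_tweet_text(text)])

# Read the file back with the same encoding to check the round trip
with open(path, encoding="utf-8", newline="") as csv_file:
    read_back = list(csv.reader(csv_file))

print(read_back)
```

For a file that has already been written with a different encoding, passing that encoding when reading, for example pd.read_csv('tweets.csv', encoding='latin-1'), may let pandas load it. Which encoding the existing file actually uses is a guess.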

