I have a web scraper built with the Python package tweepy that I regularly use to gather tweets for research. It has suddenly stopped working: it seems it can no longer decode all of the characters. This is the part that writes the tweets to a CSV file:
import csv
import html

# open and create a file to append the data to
csvFile = open('tweets.csv', 'a')
csvWriter = csv.writer(csvFile)  # use the csv file

# loop through the tweets variable and add contents to the CSV file
for tweet in tweets:
    text = tweet.full_text.strip()

    # convert the text to ascii, ignoring all unicode characters, e.g. emojis
    text_ascii = text.encode('ascii', 'ignore').decode()

    # split the text on whitespace and newlines into a list of words
    text_list = text_ascii.split()

    # iterate over the words, removing @ mentions and URLs
    text_list_filtered = [word for word in text_list
                          if not (word.startswith('@') or word.startswith('http'))]

    # join the list back into a string
    text_filtered = ' '.join(text_list_filtered)

    # decode html-escaped characters
    text_filtered = html.unescape(text_filtered)

    # write the text to the CSV file
    csvWriter.writerow([tweet.created_at, tweet.place, text_filtered])
    print(tweet.created_at, tweet.place, text_filtered)

csvFile.close()
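One possibly relevant detail: I don't pass an encoding to open(), so the file is written with whatever the platform default is. If it helps, that default can be checked with a quick snippet like this:

import locale
print(locale.getpreferredencoding(False))  # the encoding open() uses when none is specified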
When I try to read the CSV back in as a pandas DataFrame, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 139390: invalid continuation byte
The line that is giving me the error is this:
tweetsdf = pd.read_csv('tweets.csv')
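To narrow it down, I suppose the raw bytes around the offset in the traceback could be inspected directly, something like this (the slice bounds are just an example around position 139390):

with open('tweets.csv', 'rb') as f:  # read the raw bytes, no decoding
    data = f.read()
print(data[139380:139400])  # bytes around the offset reported in the error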
I have tried to change the following bit of code from this:
text_ascii = text.encode('ascii','ignore').decode()
to this:
text_ascii = text.encode('utf-8','ignore').decode()
But then I get the same problem when I try to collect the tweets from the API. What should I do?
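For reference, a quick check in the interpreter of what the two calls actually do to a non-ASCII string (the sample text here is just an example):

s = 'café 😀'  # example string with a non-ASCII letter and an emoji
print(s.encode('ascii', 'ignore').decode())  # 'caf ' - the non-ASCII characters are dropped
print(s.encode('utf-8', 'ignore').decode())  # 'café 😀' - round-trips back to the original text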