Channel: Active questions tagged utf-8 - Stack Overflow

How to read a file without specifying its encoding and extract the desired URLs with Python 3?

Environment: Python 3.
There are many files; some of them are encoded in GBK, others in UTF-8. I want to extract all the .jpg URLs with a regular expression.

For s.html, which is encoded in GBK, opening it with the default encoding fails:

tree = open("/tmp/s.html","r").read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 135: invalid start byte

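import re

# Opening with the correct encoding works
tree = open("/tmp/s.html", "r", encoding="gbk").read()
pat = r"http://.+\.jpg"  # raw string so that \. is passed to re unchanged
result = re.findall(pat, tree)
print(result)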

['http://somesite/2017/06/0_56.jpg']

It is a huge job to open every file with an explicitly specified encoding; I want a smart way to extract the .jpg URLs from all the files.
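One possible approach, sketched here under the assumption that every file is either UTF-8 or GBK: read the raw bytes, try each candidate encoding in order, and run the regular expression on whichever decoding succeeds. The helper names `read_text_any` and `extract_jpg_urls` are hypothetical, and the pattern is a non-greedy variant of the one above, not the original.

```python
import re
from pathlib import Path

# Variant of the pattern in the question: non-greedy and whitespace-free,
# so two URLs on the same line are not merged into one match.
JPG_URL = re.compile(r"http://\S+?\.jpg")


def read_text_any(path, encodings=("utf-8", "gbk")):
    """Decode a file by trying each candidate encoding in order (assumed list)."""
    data = Path(path).read_bytes()  # binary read: no codec is involved yet
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return data.decode(encodings[0], errors="replace")


def extract_jpg_urls(path):
    """Return every .jpg URL found in one file."""
    return JPG_URL.findall(read_text_any(path))
```

UTF-8 is tried first because its decoder is strict and rejects most GBK text, whereas the GBK decoder would accept many UTF-8 byte sequences and silently produce mojibake. To cover a whole directory, the function could be called on each path yielded by something like `Path("/tmp").glob("*.html")`.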

