Channel: Active questions tagged utf-8 - Stack Overflow

How to fix an Identity-H encoding error when parsing text from a Vietnamese IP Official Gazette PDF with Python?


I want to extract the text from a PDF. But when I use PyPDF2 or PyMuPDF on this PDF, I run into a problem: accented Vietnamese words come back as special characters. English text and unaccented words are extracted correctly.

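    import fitz  # PyMuPDF

    # pdf path
    pdf_file = 'CB410A3 - Copy.pdf'
    pdf = fitz.open(pdf_file)
    # Read page 8 (index 8; page indices are zero-based)
    a8 = pdf[8]
    # getText is the older camelCase name; newer PyMuPDF releases call it get_text
    text = a8.getText("text")
    text

(PyMuPDF code)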

Or

    # import the PyPDF2 module
    import PyPDF2

    # pdf path
    pdf_file = r'D:data\VN\CB410A3.pdf'
    # open the PDF file
    PDFfile = open(pdf_file, 'rb')
    PDFfilereader = PyPDF2.PdfFileReader(PDFfile)
    # provide the page number
    pages = PDFfilereader.getPage(8)
    x = pages.extractText()

Both return something like this:

' \nc«ng b¸o së h÷u c«ng nghiÖp sè 410 tËp a - QuyÓn 3 (05.2022) \n \n \n9 \ngia cÇm; ®å¨n s¸ng trªn c¬ së c¸; ®å¨n s¸ng trªn c¬ së h¶i s¶n; ®å¨n s¸ng trªn c¬ së thÞt; \n®å¨n s¸ng'

But I want it to return text like this:

[image of the expected output]

I tried decoding the result as UTF-8, but it didn't work. Can someone help me solve this problem? Thanks.
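One possible reason decoding as UTF-8 does not help: the extracted value is already a Python str, and the odd characters look like the raw character codes of a legacy Vietnamese font (something like TCVN3/ABC, where accented letters sit at Latin-1 positions). That is an assumption based only on the sample output above, not something confirmed for this PDF. Under that assumption, a minimal sketch of a workaround is a character-for-character replacement. The table below is partial and was inferred from the sample text itself; the names `LEGACY_TO_UNICODE` and `fix_legacy_vietnamese` are hypothetical.

```python
# Partial table inferred from the sample output in the question.
# Assumption: the PDF uses a TCVN3/ABC-style legacy font. Only characters
# that occur in the sample are listed; this is not a complete mapping.
LEGACY_TO_UNICODE = {
    "¨": "ă", "ª": "ê", "«": "ô", "¬": "ơ", "®": "đ",
    "¶": "ả", "¸": "á", "Ç": "ầ", "Ë": "ậ", "Ó": "ể",
    "Ö": "ệ", "Þ": "ị", "å": "ồ", "è": "ố", "ë": "ở", "÷": "ữ",
}

# str.translate expects a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(LEGACY_TO_UNICODE)


def fix_legacy_vietnamese(text: str) -> str:
    """Replace legacy-font characters with the Unicode letters they stand for."""
    return text.translate(_TABLE)


# First line of the sample output from the question.
sample = "c«ng b¸o së h÷u c«ng nghiÖp sè 410 tËp a - QuyÓn 3 (05.2022)"
print(fix_legacy_vietnamese(sample))
```

For the sample line this should print `công báo sở hữu công nghiệp số 410 tập a - Quyển 3 (05.2022)`. A blanket replacement like this is only safe on text known to come from such a font, because characters such as `è` are also valid in ordinary Unicode Vietnamese and would be changed too.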

Update:

Starting from January 2023, the Industrial Property Official Gazette PDFs published by ipvietnam will no longer have encoding issues that may cause errors during parsing.

