I want to extract text from a PDF. But when I use PyPDF2 or PyMuPDF on this PDF, I run into a problem: accented Vietnamese words come back as special characters. English words and unaccented words are extracted correctly.
PyMuPDF code:

    import fitz  # PyMuPDF

    # PDF path
    pdf_file = 'CB410A3 - Copy.pdf'
    pdf = fitz.open(pdf_file)

    # Read page 8
    a8 = pdf[8]
    text = a8.getText("text")
    text
Or
    # Import the PyPDF2 module
    import PyPDF2

    # PDF path
    pdf_file = r'D:data\VN\CB410A3.pdf'

    # Open the PDF file
    PDFfile = open(pdf_file, 'rb')
    PDFfilereader = PyPDF2.PdfFileReader(PDFfile)

    # Provide the page number
    pages = PDFfilereader.getPage(8)
    x = pages.extractText()
Both return something like:

    ' \nc«ng b¸o së h÷u c«ng nghiÖp sè 410 tËp a - QuyÓn 3 (05.2022) \n \n \n9 \ngia cÇm; ®å¨n s¸ng trªn c¬ së c¸; ®å¨n s¸ng trªn c¬ së h¶i s¶n; ®å¨n s¸ng trªn c¬ së thÞt; \n®å¨n s¸ng'

But I want it to return the text with the Vietnamese accents intact.
I tried decoding the results as UTF-8, but it didn't work. Can someone help me solve this problem? Thanks.
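A possible lead, offered as an assumption rather than a confirmed diagnosis: the garbled output looks consistent with the legacy TCVN3 (ABC) Vietnamese font encoding, where accented letters are stored under Latin-1 code points. If that is the cause, decoding as UTF-8 would not help, because the extracted value is already a Python `str`; the characters would need to be remapped instead. The sketch below uses a partial table inferred only from the sample output above, and `tcvn3_to_unicode` is a hypothetical helper name, not part of PyPDF2 or PyMuPDF.

```python
# Partial TCVN3 (ABC) -> Unicode table. ASSUMPTION: the PDF uses the TCVN3
# legacy font encoding. Only characters seen in the sample output are covered;
# a complete table would be needed for real documents.
TCVN3_SAMPLE_MAP = {
    "«": "ô", "¸": "á", "ë": "ở", "÷": "ữ", "Ö": "ệ", "è": "ố",
    "Ë": "ậ", "Ó": "ể", "Ç": "ầ", "®": "đ", "å": "ồ", "¨": "ă",
    "ª": "ê", "¬": "ơ", "¶": "ả", "Þ": "ị",
}

# Build the translation table once so repeated calls stay cheap.
_TABLE = str.maketrans(TCVN3_SAMPLE_MAP)


def tcvn3_to_unicode(text: str) -> str:
    """Hypothetical helper: remap TCVN3-style characters to Unicode letters."""
    return text.translate(_TABLE)


garbled = "c«ng b¸o së h÷u c«ng nghiÖp sè 410 tËp a - QuyÓn 3"
print(tcvn3_to_unicode(garbled))
```

Note that a blanket remap like this is only safe when the whole text is in the legacy encoding: characters such as `è` and `ë` are also valid in correctly encoded text and would be corrupted if the input were already proper Unicode.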
Update:
Starting from January 2023, the Industrial Property Official Gazette PDFs published by ipvietnam will no longer have encoding issues that may cause errors during parsing.