I'm trying to use LLMWhisperer for OCR of a document in a foreign language. The language uses special characters, but all of them can be represented in UTF-8. The LLMWhisperer 'playground' in the browser handles the OCR beautifully, but it can only process 4 pages at a time. My goal is to use the LLMWhisperer Python client to process the entire document at once. However, in every output I generate through Python, the special characters are replaced with incorrect symbols.
Given the quality of the output in the browser, I believe the problem is not with LLMWhisperer itself but with the steps I take afterwards to write the result of the request to a file. In addition, the whisper() command, which sends the OCR request and returns the result, has no options related to language or encoding.
I am an inexperienced coder and lost as to what I could be missing. Could anyone offer insight into how to adjust my strategy to preserve the special characters properly?
    from unstract.llmwhisperer.client import LLMWhispererClient

    client = LLMWhispererClient(base_url="https://llmwhisperer-api.unstract.com/v1", api_key="my-api-key")
    whisper = client.whisper(file_path="my-file-path", processing_mode="ocr", pages_to_extract="1")
    extracted_text = whisper["extracted_text"]
    with open("transcript.txt", "w", encoding='utf8') as file:
        file.write(extracted_text)

whisper() returns the result as a dictionary with the text in the "extracted_text" field.
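To narrow down where the characters go wrong, one option is to inspect the string before it is written and then check that the file round-trips. The following is only a minimal sketch using the standard library: the helper names (`describe_text`, `round_trips_utf8`, `misread_as`) are hypothetical, and the Czech sample string is a stand-in for `extracted_text`, not real LLMWhisperer output.

```python
import os
import tempfile
import unicodedata


def describe_text(text):
    """List the non-ASCII characters in a string with their code points and names."""
    non_ascii = sorted({ch for ch in text if ord(ch) > 127})
    return {
        # U+FFFD means something upstream already failed to decode the bytes.
        "has_replacement_char": "\ufffd" in text,
        "non_ascii": [
            (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in non_ascii
        ],
    }


def round_trips_utf8(text, path):
    """Write text as UTF-8, read it back as UTF-8, and report whether it is unchanged."""
    # newline="" disables newline translation so the comparison is exact.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8", newline="") as f:
        return f.read() == text


def misread_as(text, wrong_encoding="cp1252"):
    """Show how correctly written UTF-8 looks when opened with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


# Stand-in for extracted_text (assumption: not actual OCR output).
sample = "Příliš žluťoučký kůň"

for entry in describe_text(sample)["non_ascii"]:
    print(entry)

with tempfile.TemporaryDirectory() as tmp:
    print("round trip ok:", round_trips_utf8(sample, os.path.join(tmp, "transcript.txt")))

print("misread view:", misread_as(sample))
```

If `describe_text(extracted_text)` already shows wrong characters or U+FFFD, the text was damaged before the write step. If the string looks right and the round trip succeeds, the file itself is probably fine and the program used to view it may be assuming a different encoding, which would produce something like the "misread view" line.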