I'm following this OpenAI tutorial about fine-tuning.
I already generated the dataset with the openai tool. The problem is that the output's encoding (the inference result) mixes UTF-8 with non-UTF-8 characters.
The generated dataset looks like this:
{"prompt":"Usuario: Quién eres\\nAsistente:","completion":" Soy un Asistente\n"}{"prompt":"Usuario: Qué puedes hacer\\nAsistente:","completion":" Ayudarte con cualquier gestión o ofrecerte información sobre tu cuenta\n"}
For instance, if I ask "¿Cómo estás?" and there is a trained completion for that sentence, "Estoy bien, ¿y tú?", the inference often returns exactly the same text (which is good), but sometimes it adds badly encoded words: "Estoy bien, ¿y tú? CuÃ©ntame algo de ti", with "Ã©" instead of "é".
Sometimes it returns exactly the sentence it was trained on, with no encoding issues. I don't know whether the inference is taking the badly encoded characters from my model or from somewhere else.
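As far as I can tell, the bad characters match the classic UTF-8/Latin-1 mix-up. For example, in Python:

```python
# "é" encoded as UTF-8 gives the bytes C3 A9; decoding those two bytes as
# Latin-1 produces exactly the "Ã©" that shows up in the bad responses.
print("é".encode("utf-8").decode("latin-1"))  # -> Ã©
```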
What should I do? Should I encode the dataset in UTF-8? Or should I leave the dataset in UTF-8 and decode the badly encoded characters in the response?
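To illustrate the second option, this is the kind of post-processing I had in mind (just a sketch, assuming the bad characters really are UTF-8 bytes that were decoded as Latin-1; fix_mojibake is a name I made up):

```python
def fix_mojibake(text: str) -> str:
    """Reverse a Latin-1 misread of UTF-8, e.g. turn "Ã©" back into "é"."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # A string that is already correct (or that mixes good and bad
        # characters) fails the round trip, so leave it untouched.
        return text

print(fix_mojibake("CuÃ©ntame"))  # -> Cuéntame
```

But this feels like treating the symptom rather than the cause, which is why I'm asking whether the dataset encoding is the real problem.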
The OpenAI docs for fine-tuning don't include anything about encoding.