Explanation:
I've come across an edge case while writing my web app. I accept uploads of UTF-8 files, and I have a check in place to confirm that each file is UTF-8 encoded (or at least the best check possible; apparently there is no silver bullet, and I'm aware there are many other questions on Stack Overflow about that specific issue).
As a test, I took an ANSI-encoded file and converted it to UTF-8 in two separate tests: by converting it to UTF-8 in Notepad++, and by just decoding it as UTF-8 (even though it is ANSI) on the fly in C# using Encoding.UTF8.GetBytes(inputStream).
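For comparison, here is a minimal sketch of what a byte-level ANSI-to-UTF-8 conversion would look like. It assumes "ANSI" means a single-byte encoding such as ISO-8859-1 (my actual code page may differ, e.g. Windows-1252), and `AnsiToUtf8Sketch` and `ConvertLatin1ToUtf8` are hypothetical names, not part of my app. The point is that the bytes have to be decoded with the source encoding first and only then re-encoded as UTF-8.

```csharp
using System;
using System.Text;

public static class AnsiToUtf8Sketch
{
    // Hypothetical helper: re-encodes bytes from a single-byte legacy encoding as UTF-8.
    // ISO-8859-1 is an assumption standing in for "ANSI" here.
    public static byte[] ConvertLatin1ToUtf8(byte[] ansiBytes)
    {
        // Decode using the encoding the bytes were actually written in...
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string text = latin1.GetString(ansiBytes);

        // ...then encode the resulting string as UTF-8.
        return Encoding.UTF8.GetBytes(text);
    }
}
```

With this, the single ISO-8859-1 byte 0xA1 (the "¡" character) becomes the two-byte UTF-8 sequence 0xC2 0xA1, whereas a lone 0xA1 byte is not a valid UTF-8 sequence on its own.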
Where The Problem Arises:
Later on, I place the raw data of the file as one of the elements in an XML file. This is where the problem arises. It appears that a character has persisted from the ANSI file which (I assume) is not valid in UTF-8. When I try to load the XML using the following command...
XDocument xmlSample = XDocument.Load(outputPath);
I get this exception...
{"Invalid character in the given encoding. Line 10, position 14."}
Which looks like this in Visual Studio...
And like this in Notepad++...
Below is the character, copied and pasted.
From NPP: ¡
From Visual Studio String Viewer: �
Question:
How can I remove invalid characters from a UTF-8 encoded file, or at least discover them in a sane way so I can reject the file?
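To make the "discover and reject" option concrete, here is a rough sketch of the kind of check I have in mind. It assumes the whole upload is available as a byte array, and `Utf8Validator` and `IsValidUtf8` are hypothetical names of my own. It relies on the `UTF8Encoding` constructor's `throwOnInvalidBytes` flag, which makes decoding throw instead of silently substituting the replacement character. I'm not sure this is the best approach, so treat it as a starting point only.

```csharp
using System;
using System.Text;

public static class Utf8Validator
{
    // Strict decoder: no BOM on encode, and throw on any invalid byte sequence
    // instead of replacing it with U+FFFD.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns true only if every byte sequence decodes as UTF-8.
    public static bool IsValidUtf8(byte[] data)
    {
        try
        {
            StrictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // Thrown by the strict decoder when it hits an invalid sequence.
            return false;
        }
    }
}
```

For example, `Utf8Validator.IsValidUtf8(new byte[] { 0xC2, 0xA1 })` should accept a properly encoded "¡", while `Utf8Validator.IsValidUtf8(new byte[] { 0xA1 })` should reject the lone ANSI byte. Note that this only checks the encoding; it says nothing about whether the decoded characters are allowed in XML.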