Channel: Active questions tagged utf-8 - Stack Overflow

How do I remove invalid characters from UTF-8 encoded file?


Explanation:

I've come across an edge case while writing my web app. I accept uploaded UTF-8 files, and I have a check in place to confirm that each file is UTF-8 encoded. It is only a best-effort check: apparently there is no silver bullet, and I'm aware there are many other questions on Stack Overflow about that specific issue.

As a test, I took an ANSI-encoded file and converted it to UTF-8 in two separate tests: once by converting it to UTF-8 in Notepad++, and once by just decoding it as UTF-8 (even though it is ANSI) on the fly in C# using Encoding.UTF8.GetBytes(inputStream).
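
For comparison, here is a minimal sketch of what an actual ANSI-to-UTF-8 conversion could look like: decode the bytes with the source code page, then re-encode the resulting string as UTF-8. This is not the code used in the tests above. `TranscodeToUtf8` is a hypothetical helper name, and the sketch assumes the source encoding is known in advance (ISO-8859-1 is used in the usage note purely as an illustration; a Windows-1252 file would need that code page instead).

```csharp
using System;
using System.Text;

public static class AnsiToUtf8
{
    // Hypothetical helper: decode with the caller-supplied source encoding,
    // then re-encode the resulting text as UTF-8 bytes.
    public static byte[] TranscodeToUtf8(byte[] input, Encoding sourceEncoding)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (sourceEncoding == null) throw new ArgumentNullException(nameof(sourceEncoding));

        // Step 1: single-byte code page -> .NET string (UTF-16 in memory).
        string text = sourceEncoding.GetString(input);

        // Step 2: string -> UTF-8 bytes. A character such as U+00A1
        // becomes the two-byte sequence C2 A1 here.
        return Encoding.UTF8.GetBytes(text);
    }
}
```

Called as `AnsiToUtf8.TranscodeToUtf8(bytes, Encoding.GetEncoding("iso-8859-1"))`, a lone byte 0xA1 is re-encoded as a valid two-byte UTF-8 sequence. If the same byte is instead copied into a file that is only labelled as UTF-8, it stays a malformed sequence.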

Where The Problem Arises:

Later on, I place the raw data of the file in one of the elements of an XML file. This is where the problem arises. It appears that a character has persisted from the ANSI file which (I assume) is not valid in UTF-8. When I try to load the XML using the following command...

XDocument xmlSample = XDocument.Load(outputPath);

I get this exception...

{"Invalid character in the given encoding. Line 10, position 14."}

Which looks like this in Visual Studio...

VSImg

And like this in Notepad++...

NPPImg

Below is the character, copied and pasted.

From NPP: ¡ From Visual Studio String Viewer:

Question:

How can I remove invalid characters from a UTF-8 encoded file, or at least detect them in a sane way so I can reject the file?
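
One possible approach, sketched here under assumptions rather than offered as a definitive answer: .NET's `UTF8Encoding` can be constructed so that it throws on malformed byte sequences, which gives a way to reject a file, and a decoder replacement fallback with an empty string can be used to drop malformed sequences instead. The class and method names (`Utf8Check`, `IsValidUtf8`, `DecodeDroppingInvalid`) are hypothetical, and the sketch assumes the whole upload fits in memory as a byte array.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Strict UTF-8: throwOnInvalidBytes = true makes decoding throw
    // DecoderFallbackException on any malformed byte sequence.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                         throwOnInvalidBytes: true);

    // Returns true when every byte sequence in data is well-formed UTF-8.
    public static bool IsValidUtf8(byte[] data)
    {
        try
        {
            // GetCharCount walks the input without building a string.
            Strict.GetCharCount(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Decodes data as UTF-8 and silently drops malformed sequences by
    // replacing each one with an empty string. This loses information.
    public static string DecodeDroppingInvalid(byte[] data)
    {
        Encoding lenient = Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));
        return lenient.GetString(data);
    }
}
```

With the bytes 48 6F A1 6C 61 (a stray 0xA1 in the middle of "Hola"), `IsValidUtf8` reports the input as invalid and `DecodeDroppingInvalid` yields the text without the stray byte. Rejecting the upload is usually the safer of the two, since dropping bytes hides the fact that the file was never UTF-8 in the first place.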

