Channel: Active questions tagged utf-8 - Stack Overflow

Perl regex to replace invalid UTF-8 characters in string


I have the following code that can detect invalid UTF-8 byte sequences in a string (regex taken from https://www.w3.org/International/questions/qa-forms-utf-8 and the Stack Overflow question "Regex to detect invalid UTF-8 string"):

    use strict;
    use warnings;
    # Deliberately no "use utf8": the literals below must stay raw bytes for the
    # byte-level regex to apply (assumes this file is saved as UTF-8).

    my @Strings = ('Caractéristiques techniques', 'Test string 1');

    foreach my $ival (@Strings) {
        # Test the match result directly: a failed match returns a defined
        # empty string, so defined() cannot tell success from failure.
        my $TestA = $ival =~ /\A(
              [\x00-\x7F]                        # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*\z/x;

        if ($TestA) {
            print "$ival is valid utf-8!\n";
        } else {
            print "$ival is NOT valid utf-8!\n";
        }
    }

I now need some code to replace any invalid sequences found with a valid, user-defined UTF-8 character.

Logically, I need a regex that replaces everything EXCEPT the sequences matched by my validation regex, but I don't know how to do this.

I know that to remove all characters EXCEPT 'a to z' and 'A to Z' I can use $ival=~s/[^a-zA-Z]//g; but I don't know how to extend this concept to the regex in my code.
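One way to extend the "everything except" idea to multi-byte sequences is sketched below. A negated character class cannot express it, but an alternation can: at each position, try to match a run of well-formed sequences first, and fall back to consuming a single byte only when that fails. Any byte taken by the fallback is by definition invalid and gets replaced. The names `$valid_seq` and `replace_invalid_utf8` are made up for this illustration, and the sketch assumes the input is a byte string (not a decoded character string).

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, same alternatives as the validation regex.
my $valid_seq = qr/
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
/x;

# Hypothetical helper: returns a copy of $bytes in which every byte that is
# not part of a well-formed sequence is replaced by $replacement.
sub replace_invalid_utf8 {
    my ($bytes, $replacement) = @_;
    # First alternative: keep a run of valid sequences (captured in $1).
    # Second alternative: any single byte; only reached when the first fails.
    $bytes =~ s/((?:$valid_seq)+)|(?s:.)/defined $1 ? $1 : $replacement/ge;
    return $bytes;
}

# "\xFF" can never appear in UTF-8, so it is the only byte replaced here.
my $fixed = replace_invalid_utf8("Caract\xC3\xA9ristiques \xFF", "?");
print "$fixed\n";   # prints the bytes of "Caractéristiques ?"
```

The replacement can itself be multi-byte, for example "\xEF\xBF\xBD" (the UTF-8 encoding of U+FFFD), as long as it is passed as bytes. Note that a truncated multi-byte sequence yields one replacement per stray byte with this approach.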

Note: I do have some strings that contain invalid UTF-8 sequences, but they are not given here.
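Since the problem strings are not shown, the sketch below builds a hypothetical invalid sample by hand and uses a different technique from the regex above: the core Encode module. With its default CHECK behaviour, `decode` substitutes U+FFFD for malformed input, and that marker can then be swapped for a character of one's choosing before re-encoding. The helper name `scrub_utf8` and the sample bytes are assumptions for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode leniently, swap the U+FFFD markers for a
# user-chosen character, and return UTF-8 bytes again.
sub scrub_utf8 {
    my ($bytes, $replacement_char) = @_;
    # 'UTF-8' (with the hyphen) selects Encode's strict decoder; with no
    # CHECK argument, malformed input becomes U+FFFD instead of an error.
    my $chars = decode('UTF-8', $bytes);
    $chars =~ s/\x{FFFD}/$replacement_char/g;
    return encode('UTF-8', $chars);
}

# Hand-made sample: valid text with a stray 0xFF byte in the middle.
my $sample = "Caract\xC3\xA9ristiques \xFF techniques";
print scrub_utf8($sample, '?'), "\n";
```

One caveat of this variant: a U+FFFD that was legitimately present in the input is replaced as well, because after decoding it is indistinguishable from the markers Encode inserted.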

