I have the following code that detects invalid UTF-8 byte sequences in a string (regex taken from https://www.w3.org/International/questions/qa-forms-utf-8 and from the question "Regex to detect invalid UTF-8 string"):
    use strict;
    use warnings;
    # Note: no "use utf8"; the byte-level pattern needs the literals as raw
    # UTF-8 octets (this assumes the source file is saved as UTF-8).

    my @Strings = ('Caractéristiques techniques', 'Test string 1');

    foreach my $ival (@Strings) {
        # A failed match returns '' (which is defined), so test the match itself.
        if ($ival =~ /\A(
              [\x00-\x7F]                        # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
            | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*\z/x) {
            print "$ival is valid utf-8!\n";
        } else {
            print "$ival is NOT valid utf-8!\n";
        }
    }
I now need some code that replaces any invalid bytes it finds with a user-defined, valid UTF-8 character.
Logically, I need a regex that replaces everything EXCEPT the sequences matched by my validation regex, but I don't know how to do this.
I know that to replace all characters EXCEPT 'a to z' and 'A to Z' I can use $ival=~s/[^a-zA-Z]//g;
But I don't know how to extend this concept to the regex in my code.
Note: I do have some strings that contain invalid UTF-8 sequences, but they are not given here.
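
To make the goal concrete, here is a rough sketch of the behaviour I am after. It is not a tested solution, only one way the idea might be expressed. It assumes the input is a raw byte string (not decoded). The names `$valid_seq` and `replace_invalid_utf8` are made up for illustration. Instead of negating the pattern, it matches either one valid sequence (kept as-is) or one single stray byte (replaced):

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, using the same byte ranges as the
# validation pattern above.
my $valid_seq = qr/
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
/x;

# Hypothetical helper: takes raw bytes and a replacement string.
sub replace_invalid_utf8 {
    my ($bytes, $replacement) = @_;
    # At each position, consume either one valid sequence (captured in $1
    # and put back unchanged) or exactly one other byte (swapped for the
    # replacement). /s lets "." match a newline byte as well.
    $bytes =~ s/(?:($valid_seq)|.)/defined $1 ? $1 : $replacement/gse;
    return $bytes;
}

print replace_invalid_utf8("abc\xFFdef", "?"), "\n";   # prints "abc?def"
```

Here `\xFF` can never appear in valid UTF-8, so it is replaced, while a valid sequence such as `\xC3\xA9` (é) would pass through untouched. The replacement could equally be the UTF-8 bytes of any character, for example `"\xEF\xBF\xBD"` for U+FFFD.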