I want to parse text files with Perl to get the following information:
- line break type + code page/character set
I do not want a guess (like the Linux "file" command does); the whole file should really be parsed.
Encode::Guess does not help, because it does not distinguish the cases I need:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode::Guess;

    Encode::Guess->add_suspects(qw/latin1/);

    foreach my $filename (@ARGV) {
        # Read the whole file as raw bytes
        open(my $fh, '<:raw', $filename) or die "cannot open file $filename: $!";
        my $data = do { local $/; <$fh> };
        close($fh);

        # guess() returns a decoder object on success, an error string otherwise
        my $decoder = Encode::Guess->guess($data);
        print "$filename: ", (ref($decoder) ? $decoder->name : $decoder), "\n";
    }
Output:

    testfile_LF_ANSI.txt: iso-8859-1
    testfile_LF_ASCII.txt: ascii
    testfile_LF_UTF-16BE_BOM.txt: UTF-16
    testfile_LF_UTF-16LE_BOM.txt: UTF-16
    testfile_LF_UTF-8.txt: Encodings too ambiguous: utf8 or iso-8859-1
    testfile_LF_UTF-8_BOM.txt: utf8
    testfile_LF_UTF-8_only_ISO8859-1.txt: Encodings too ambiguous: iso-8859-1 or utf8
    testfile_LF_UTF-8_only_ISO8859-2.txt: Encodings too ambiguous: iso-8859-1 or utf8
    testfile_LF_UTF-8_only_ISO8859-5.txt: Encodings too ambiguous: iso-8859-1 or utf8
I deal with the following encodings:
- ASCII
- ISO-8859-1 (Latin1)
- ISO-8859-2 (Latin2)
- ISO-8859-5 (Cyrillic)
- UTF-8
and the following character sets (which I somehow need to define in Perl):
- MyLatin1 (subset of Latin1-characters I am using)
- MyLatin2 (subset of Latin2-characters I am using)
- MyCyrillic (subset of Cyrillic-characters I am using)
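One possible way to write such subsets down is as strings of the high bytes in use, from which lookup tables for both raw bytes and Unicode code points can be derived with the core Encode module. This is only a minimal sketch: the hash names (`%my_bytes`, `%encoding_of`, `%byte_set`, `%char_set`) and the byte values are hypothetical placeholders, not the real inventory of characters.

```perl
use strict;
use warnings;
use Encode qw(decode LEAVE_SRC);

# Hypothetical subsets: the high bytes (0x80-0xFF) in use per code page.
# These values are placeholders chosen for illustration only.
my %my_bytes = (
    MyLatin1   => "\xC4\xD6\xDC\xE4\xF6\xFC\xDF",  # German umlauts and sharp s
    MyLatin2   => "\xA1\xB1\xC6\xE6\xA3\xB3",      # some Polish letters
    MyCyrillic => pack('C*', 0xB0 .. 0xEF),        # basic Russian alphabet
);
my %encoding_of = (
    MyLatin1   => 'iso-8859-1',
    MyLatin2   => 'iso-8859-2',
    MyCyrillic => 'iso-8859-5',
);

# Two lookup tables per subset: allowed byte values and allowed code points.
my (%byte_set, %char_set);
for my $name (keys %my_bytes) {
    $byte_set{$name} = { map { $_ => 1 } unpack 'C*', $my_bytes{$name} };
    my $chars = decode($encoding_of{$name}, $my_bytes{$name}, LEAVE_SRC);
    $char_set{$name} = { map { ord($_) => 1 } split //, $chars };
}
```

With these tables, `$byte_set{MyLatin2}{0xA3}` answers whether a raw byte belongs to a subset, and `$char_set{MyLatin2}{0x0141}` answers the same question for a code point obtained from decoded UTF-8.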
The following tests are required:
- Line breaks
  (Mac "CR" is skipped on purpose as I do not need it, but it could certainly be added)
  1. no line break
  2. only "LF"
  3. only "CR LF"
  4. mixture of "LF" + "CR LF"
- Code page/character set
  1. \xEF\xBB\xBF -> UTF-8 BOM
  2. \xFF\xFE -> UTF-16LE BOM
  3. \xFE\xFF -> UTF-16BE BOM
  4. only ASCII characters -> ASCII
  5. UTF-8 validation (as not all sequences of bytes are valid UTF-8) -> valid UTF-8
  6. UTF-8 holding only ISO-8859-1 characters -> UTF-8 + subset of Latin1 characters
  7. UTF-8 holding only ISO-8859-2 characters -> UTF-8 + subset of Latin2 characters
  8. UTF-8 holding only ISO-8859-5 characters -> UTF-8 + subset of Cyrillic characters
  9. UTF-8 holding only ISO-8859-1/ISO-8859-2/ISO-8859-5 characters -> UTF-8 + mixture of (ISO-8859-1, ISO-8859-2, ISO-8859-5)
  10. ANSI guess against MyLatin1 (only if not (4 & 5)) -> most probably ISO-8859-1 (as only MyLatin1 bytes are used)
  11. ANSI guess against MyLatin2 (only if not (4 & 5)) -> most probably ISO-8859-2 (as only MyLatin2 bytes are used)
  12. ANSI guess against MyCyrillic (only if not (4 & 5)) -> most probably ISO-8859-5 (as only MyCyrillic bytes are used)
  13. if not (4 & 5 & 10 & 11 & 12) -> ANSI with bytes outside MyLatin1, MyLatin2 and MyCyrillic
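For reference, the line break tests and code page tests 1 to 5 from the list above could be sketched roughly as follows. The function names `linebreak_type` and `basic_encoding` are made up for this illustration, and the input is assumed to be the whole file read as raw bytes. The UTF-8 check relies on Encode's strict `UTF-8` decoder together with `FB_CROAK`, which dies on malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Classify line endings: count CR LF pairs and LFs not preceded by CR.
sub linebreak_type {
    my ($bytes) = @_;
    my $crlf = () = $bytes =~ /\r\n/g;
    my $lf   = () = $bytes =~ /(?<!\r)\n/g;
    return $crlf && $lf ? 'mixed LF + CR LF'
         : $crlf        ? 'CR LF'
         : $lf          ? 'LF'
         :                'none';
}

# Tests 1-5: BOMs first, then pure ASCII, then strict UTF-8 validation.
sub basic_encoding {
    my ($bytes) = @_;
    return 'UTF-8 BOM'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE BOM' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE BOM' if $bytes =~ /\A\xFE\xFF/;
    return 'ASCII'        if $bytes !~ /[^\x00-\x7F]/;
    # LEAVE_SRC keeps $bytes untouched; eval catches the croak on bad input.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'valid UTF-8' : 'not UTF-8';
}
```

Note that the line break count only makes sense for single-byte encodings and UTF-8; a UTF-16 file would have to be decoded before counting.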
Question
I know how to do the line break tests (1-4) and the code page/character set tests 1-4.
But does anyone have an idea how I can
- define my custom characters (ANSI bytes) for
  - MyLatin1
  - MyLatin2
  - MyCyrillic
- do the code page/character set checks 5-12?
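To make the intended checks concrete, here is a rough sketch of one way the subset tests could work, assuming BOM handling has already been done elsewhere. All names (`%subset`, `allowed_sets`, `classify_high`) and the byte values in the subsets are hypothetical. The idea: if the data decodes as strict UTF-8, compare its non-ASCII code points against each subset decoded from its code page; otherwise compare the raw high bytes against each subset's bytes.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical subsets: [ code page, high bytes in use ]. Placeholder values.
my %subset = (
    MyLatin1   => [ 'iso-8859-1', "\xC4\xD6\xDC\xE4\xF6\xFC\xDF" ],
    MyLatin2   => [ 'iso-8859-2', "\xA1\xB1\xC6\xE6\xA3\xB3" ],
    MyCyrillic => [ 'iso-8859-5', pack('C*', 0xB0 .. 0xEF) ],
);

# Allowed values per subset: code points if $as_chars is true, else byte values.
sub allowed_sets {
    my ($as_chars) = @_;
    my %allowed;
    for my $name (keys %subset) {
        my ($enc, $bytes) = @{ $subset{$name} };
        my $str = $as_chars ? decode($enc, $bytes, LEAVE_SRC) : $bytes;
        $allowed{$name} = { map { ord($_) => 1 } split //, $str };
    }
    return %allowed;
}

sub classify_high {
    my ($data) = @_;
    # Strict decode; eval yields undef when the bytes are not valid UTF-8.
    my $text    = eval { decode('UTF-8', $data, FB_CROAK | LEAVE_SRC) };
    my $is_utf8 = defined $text;
    $text = $data unless $is_utf8;

    # Distinct non-ASCII code points (UTF-8) or byte values (ANSI).
    my @seen = do { my %s; grep { !$s{$_}++ } map { ord($_) } $text =~ /([^\x00-\x7F])/g };
    return 'ASCII' unless @seen;

    my %allowed = allowed_sets($is_utf8);
    my (@covering, %in_any);
    for my $name (sort keys %allowed) {
        my @known = grep { $allowed{$name}{$_} } @seen;
        $in_any{$_} = 1 for @known;
        push @covering, $name if @known == @seen;
    }
    my $label = $is_utf8 ? 'UTF-8' : 'ANSI';
    return "$label: fits " . join(' or ', @covering) if @covering;
    return "$label: mixture of subsets" if $is_utf8 && keys(%in_any) == @seen;
    return "$label: characters outside all subsets";
}
```

For ANSI input the answer can only ever be "most probably": with these placeholder subsets the single byte `\xE4` fits both MyLatin1 and MyCyrillic, so several subsets may be reported at once.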