Parse file to identify CodePage/Characterset

I want to parse text files with Perl to get the following information:

LineBreak + Codepage/Characterset

This must not be a "guess" (like the Linux "file" command); the whole file should really be parsed.

Encode::Guess does not help, because it does not distinguish the cases I need:

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Encode::Guess;

    Encode::Guess->add_suspects(qw/latin1/);

    my $data;
    my $decoder;
    foreach my $filename (@ARGV) {
        open(my $fh, '<', $filename) or die "cannot open file $filename";
        {
            local $/;
            $data = <$fh>;
        }
        close($fh);
        $decoder = Encode::Guess->guess($data);
        ref($decoder) ? print "$filename: " . $decoder->name . $/ : print "$filename: $decoder $/" ;
    }

output:

    testfile_LF_ANSI.txt: iso-8859-1
    testfile_LF_ASCII.txt: ascii
    testfile_LF_UTF-16BE_BOM.txt: UTF-16
    testfile_LF_UTF-16LE_BOM.txt: UTF-16
    testfile_LF_UTF-8.txt: Encodings too ambiguous: utf8 or iso-8859-1
    testfile_LF_UTF-8_BOM.txt: utf8
    testfile_LF_UTF-8_only_ISO8859-1.txt: Encodings too ambiguous: iso-8859-1 or utf8
    testfile_LF_UTF-8_only_ISO8859-2.txt: Encodings too ambiguous: iso-8859-1 or utf8
    testfile_LF_UTF-8_only_ISO8859-5.txt: Encodings too ambiguous: iso-8859-1 or utf8

I deal with the following encodings:

ASCII
ISO-8859-1 (Latin1)
ISO-8859-2 (Latin2)
ISO-8859-5 (Cyrillic)
UTF-8

and the following character sets (I somehow need to define these in Perl):

MyLatin1 (subset of Latin1-characters I am using)
MyLatin2 (subset of Latin2-characters I am using)
MyCyrillic (subset of Cyrillic-characters I am using)
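One possible way to define such sets, as a minimal sketch: write the wanted characters as Unicode strings and let Encode derive the corresponding bytes for each codepage. The character lists below are invented placeholders, and `%my_chars`, `%codepage`, `%allowed_bytes` and `only_bytes_from` are hypothetical names, not part of any module. The sketch assumes the script itself is saved as UTF-8 and that ASCII bytes are always acceptable.

```perl
use strict;
use warnings;
use utf8;                      # the character lists below are UTF-8 source text
use Encode qw(encode);

# Placeholder subsets (assumption): replace with the characters actually in use.
my %my_chars = (
    MyLatin1   => 'äöüÄÖÜßéèàç',
    MyLatin2   => 'ąćęłńóśźżčšž',
    MyCyrillic => 'абвгдежзийклмнопрстуфхцчшщъыьэюя',
);
my %codepage = (
    MyLatin1   => 'iso-8859-1',
    MyLatin2   => 'iso-8859-2',
    MyCyrillic => 'iso-8859-5',
);

# For each set, build a lookup of allowed bytes: all of ASCII plus the
# bytes that the listed characters map to in the matching codepage.
my %allowed_bytes;
for my $set (keys %my_chars) {
    my %ok = map { (chr($_) => 1) } 0x00 .. 0x7F;
    # FB_CROAK dies if a listed character does not exist in that codepage.
    my $bytes = encode($codepage{$set}, $my_chars{$set},
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    $ok{$_} = 1 for split //, $bytes;
    $allowed_bytes{$set} = \%ok;
}

# Returns 1 if every byte of the raw string belongs to the named set, else 0.
sub only_bytes_from {
    my ($set, $raw) = @_;
    for my $byte (split //, $raw) {
        return 0 unless $allowed_bytes{$set}{$byte};
    }
    return 1;
}
```

With sets defined like this, an 8-bit file that is neither ASCII nor valid UTF-8 could be tested against each set in turn; if more than one set matches, the result stays ambiguous and can only be reported as such.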

The following tests are required:

Line break
(Mac "CR" is skipped on purpose as it is not needed, but it can certainly be added.)

    1. no line break
    2. only "LF"
    3. only "CR LF"
    4. mixture of "LF" + "CR LF"
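The four line break cases can be sketched as a small function that counts both styles over the whole content. `linebreak_style` is a hypothetical name, and the sketch assumes the file was read as raw bytes in an ASCII-compatible encoding (it would not work on UTF-16 data as is).

```perl
use strict;
use warnings;

# Classify the line break style of a raw byte string.
# Returns one of: 'none', 'LF', 'CRLF', 'mixed'.
sub linebreak_style {
    my ($bytes) = @_;
    my $crlf = () = $bytes =~ /\r\n/g;        # number of CR LF pairs
    my $lf   = () = $bytes =~ /(?<!\r)\n/g;   # number of LF not preceded by CR
    return 'none' if !$crlf && !$lf;
    return 'LF'   if !$crlf;
    return 'CRLF' if !$lf;
    return 'mixed';
}
```

A lone "CR" is ignored here, matching the decision above to skip the Mac style.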

Codepage/Characterset

    1. \xEF\xBB\xBF -> UTF-8 BOM
    2. \xFF\xFE -> UTF-16LE BOM
    3. \xFE\xFF -> UTF-16BE BOM
    4. only ASCII characters -> ASCII
    5. UTF-8 validation (as not all sequences of bytes are valid UTF-8) -> valid UTF-8
    6. UTF-8 holding only ISO-8859-1 characters -> UTF-8 + subset of Latin1 characters
    7. UTF-8 holding only ISO-8859-2 characters -> UTF-8 + subset of Latin2 characters
    8. UTF-8 holding only ISO-8859-5 characters -> UTF-8 + subset of Cyrillic characters
    9. UTF-8 holding only ISO-8859-1/ISO-8859-2/ISO-8859-5 characters -> UTF-8 + mixture of ISO-8859-1/ISO-8859-2/ISO-8859-5 characters
    10. ANSI guess against MyLatin1 (only if neither 4 nor 5 applies) -> most probably ISO-8859-1 (as only MyLatin1 bytes are used)
    11. ANSI guess against MyLatin2 (only if neither 4 nor 5 applies) -> most probably ISO-8859-2 (as only MyLatin2 bytes are used)
    12. ANSI guess against MyCyrillic (only if neither 4 nor 5 applies) -> most probably ISO-8859-5 (as only MyCyrillic bytes are used)
    13. if none of 4, 5, 10, 11 and 12 applies -> ANSI with bytes outside MyLatin1, MyLatin2 and MyCyrillic
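Checks 1 to 9 could be approached roughly as follows. This is only a sketch: `classify` and `%repertoire` are hypothetical names, the returned labels are made up for illustration, and the full upper half (0xA0 to 0xFF) of each ISO codepage is used as its repertoire instead of a custom subset. UTF-8 validity is tested with a strict `decode` that dies on malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Non-ASCII characters of each codepage, derived by decoding bytes 0xA0..0xFF.
my %repertoire;
for my $cp (qw(iso-8859-1 iso-8859-2 iso-8859-5)) {
    my $chars = decode($cp, join '', map { chr($_) } 0xA0 .. 0xFF);
    $repertoire{$cp} = { map { $_ => 1 } split //, $chars };
}

sub classify {
    my ($bytes) = @_;
    return 'UTF-8 BOM'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE BOM' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE BOM' if $bytes =~ /\A\xFE\xFF/;
    return 'ASCII'        if $bytes !~ /[^\x00-\x7F]/;

    # Strict decode: FB_CROAK makes decode() die on invalid UTF-8.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return 'not UTF-8' unless defined $text;

    # Keep every codepage whose repertoire covers all non-ASCII characters.
    my @non_ascii = grep { ord($_) > 0x7F } split //, $text;
    my @fits = grep {
        my $set = $repertoire{$_};
        !grep { !$set->{$_} } @non_ascii;
    } sort keys %repertoire;

    return @fits
        ? 'UTF-8, fits ' . join('/', @fits)
        : 'UTF-8, mixed or other characters';
}
```

Because many characters exist in more than one codepage (for example "ü" is in both ISO-8859-1 and ISO-8859-2), several codepages can fit the same file. Narrowing the repertoires down to the custom subsets would reduce, but not necessarily remove, that ambiguity.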

Question
I know how to do the line break tests (1-4) and the Codepage/Characterset tests (1-4).

But does anyone have an idea how I can

- define my custom characters (ANSI bytes) for
  - MyLatin1
  - MyLatin2
  - MyCyrillic
- do the Codepage/Characterset checks 5-12?
