Channel: Active questions tagged utf-8 - Stack Overflow

Parse file to identify CodePage/Characterset

I want to parse text files with Perl to get the following information:

  • LineBreak + Codepage/Characterset

This should not be a "guess" (as with the Linux "file" command); the full file should really be parsed.

Encode::Guess does not help, as it does not distinguish the cases I need.

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Encode::Guess;

    Encode::Guess->add_suspects(qw/latin1/);

    my $data;
    my $decoder;
    foreach my $filename (@ARGV) {
        open(my $fh, '<', $filename) or die "cannot open file $filename";
        {
            local $/;    # slurp mode: read the whole file at once
            $data = <$fh>;
        }
        close($fh);
        $decoder = Encode::Guess->guess($data);
        # guess() returns an object on success and an error string otherwise
        ref($decoder)
            ? print "$filename: " . $decoder->name . $/
            : print "$filename: $decoder $/";
    }

output:

    testfile_LF_ANSI.txt: iso-8859-1
    testfile_LF_ASCII.txt: ascii
    testfile_LF_UTF-16BE_BOM.txt: UTF-16
    testfile_LF_UTF-16LE_BOM.txt: UTF-16
    testfile_LF_UTF-8.txt: Encodings too ambiguous: utf8 or iso-8859-1
    testfile_LF_UTF-8_BOM.txt: utf8
    testfile_LF_UTF-8_only_ISO8859-1.txt: Encodings too ambiguous: iso-8859-1 or utf8
    testfile_LF_UTF-8_only_ISO8859-2.txt: Encodings too ambiguous: iso-8859-1 or utf8
    testfile_LF_UTF-8_only_ISO8859-5.txt: Encodings too ambiguous: iso-8859-1 or utf8

I deal with the following encodings:

  • ASCII
  • ISO-8859-1 (Latin1)
  • ISO-8859-2 (Latin2)
  • ISO-8859-5 (Cyrillic)
  • UTF-8

and the following character sets (which I somehow need to define in Perl):

  • MyLatin1 (subset of Latin1-characters I am using)
  • MyLatin2 (subset of Latin2-characters I am using)
  • MyCyrillic (subset of Cyrillic-characters I am using)
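
One possible way to define such sets, as a minimal sketch: list each set's characters once as Unicode code points, let Encode translate them into the bytes they occupy in the matching code page, and build a byte-level regex from that. The concrete characters below, the hash names and the helper ansi_candidates are all assumptions made up for illustration; the real repertoires are not given in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical repertoires: replace with the characters actually in use.
my %MY_CHARS = (
    MyLatin1   => "\x{E4}\x{F6}\x{FC}\x{DF}",                # a/o/u umlaut, sharp s
    MyLatin2   => "\x{142}\x{15B}\x{17E}\x{10D}",            # l-stroke, s-acute, z-caron, c-caron
    MyCyrillic => join('', map { chr($_) } 0x410 .. 0x44F),  # basic Russian alphabet
);
my %CODEPAGE = (
    MyLatin1   => 'iso-8859-1',
    MyLatin2   => 'iso-8859-2',
    MyCyrillic => 'iso-8859-5',
);

# Build one regex per set. It matches data consisting only of ASCII bytes
# plus the bytes that the set's characters occupy in its code page.
my %BYTE_RE;
for my $name (keys %MY_CHARS) {
    # FB_CROAK makes encode() die if a character is missing from the code page.
    my $bytes = encode($CODEPAGE{$name}, $MY_CHARS{$name},
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    my $class = join '', map { sprintf '\\x%02X', ord($_) } split //, $bytes;
    $BYTE_RE{$name} = qr/\A[\x00-\x7F${class}]*\z/;
}

# Return the names of all sets whose bytes could explain the raw data.
sub ansi_candidates {
    my ($data) = @_;
    return grep { $data =~ $BYTE_RE{$_} } sort keys %BYTE_RE;
}

print join(', ', ansi_candidates("Gr\xFC\xDFe")), "\n";
```

Because the same byte can belong to several sets, the function can return more than one name, which is why the result is only "most probably" a given code page. An empty list corresponds to bytes outside all three sets. Pure ASCII data matches every set, so this check only makes sense after the ASCII and UTF-8 tests have failed.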

The following tests are required:

  • Line break
    (Mac "CR" is skipped on purpose as it is not needed, but it can certainly be added)
    1. no line break
    2. only "LF"
    3. only "CR LF"
    4. mixture of "LF" + "CR LF"
  • Codepage/Characterset
    1. \xEF\xBB\xBF -> UTF-8 BOM
    2. \xFF\xFE -> UTF-16LE BOM
    3. \xFE\xFF -> UTF-16BE BOM
    4. only ASCII characters -> ASCII
    5. UTF-8 validation (as not all sequences of bytes are valid UTF-8) -> valid UTF-8
    6. UTF-8 holding only ISO-8859-1 characters -> UTF-8 + subset of Latin1 characters
    7. UTF-8 holding only ISO-8859-2 characters -> UTF-8 + subset of Latin2 characters
    8. UTF-8 holding only ISO-8859-5 characters -> UTF-8 + subset of Cyrillic characters
    9. UTF-8 holding only ISO-8859-1/ISO-8859-2/ISO-8859-5 characters -> UTF-8 + mixture of ISO-8859-1, ISO-8859-2 and ISO-8859-5 characters
    10. ANSI guess against MyLatin1 (only if neither 4 nor 5 matched) -> most probably ISO-8859-1 (as only MyLatin1 bytes are used)
    11. ANSI guess against MyLatin2 (only if neither 4 nor 5 matched) -> most probably ISO-8859-2 (as only MyLatin2 bytes are used)
    12. ANSI guess against MyCyrillic (only if neither 4 nor 5 matched) -> most probably ISO-8859-5 (as only MyCyrillic bytes are used)
    13. if none of 4, 5, 10, 11 and 12 matched -> ANSI with bytes outside MyLatin1, MyLatin2 and MyCyrillic
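
For reference, the line-break tests 1-4 and the Codepage/Characterset tests 1-4 could look roughly like the sketch below. It works on the raw bytes of the whole file. The function names linebreak_style and bom_or_ascii are invented for this example, and a lone "CR" is deliberately not treated as a line break, as in the list above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify the line-break style of a raw byte string (line-break tests 1-4).
sub linebreak_style {
    my ($bytes) = @_;
    my $crlf = () = $bytes =~ /\r\n/g;         # number of CR LF pairs
    my $lf   = () = $bytes =~ /(?<!\r)\n/g;    # number of LF not preceded by CR
    return 'none' if !$crlf && !$lf;
    return 'LF'   if !$crlf;
    return 'CRLF' if !$lf;
    return 'mixed';
}

# Detect a BOM or pure ASCII content (Codepage/Characterset tests 1-4).
# Returns undef when none of these four tests matches.
sub bom_or_ascii {
    my ($bytes) = @_;
    return 'UTF-8 BOM'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE BOM' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE BOM' if $bytes =~ /\A\xFE\xFF/;
    return 'ASCII'        if $bytes !~ /[^\x00-\x7F]/;
    return;
}

foreach my $filename (@ARGV) {
    # ':raw' keeps Perl from translating line endings or decoding anything
    open(my $fh, '<:raw', $filename) or die "cannot open file $filename: $!";
    my $data = do { local $/; <$fh> };
    $data = '' unless defined $data;           # empty file
    close($fh);
    my $charset = bom_or_ascii($data);
    printf "%s: linebreak=%s charset=%s\n", $filename,
        linebreak_style($data), defined $charset ? $charset : 'needs further tests';
}
```

Note that for UTF-16 files the byte-level line-break count is only a rough indicator, since every character there occupies at least two bytes.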

Question
I know how to do the line-break tests (1-4) and the Codepage/Characterset tests (1-4).

But does anyone have an idea how I can

  • define my custom characters (ANSI-bytes) for
    • MyLatin1
    • MyLatin2
    • MyCyrillic
  • do the Codepage/Characterset checks 5-12?
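
As a starting point for checks 5-9, here is a minimal sketch under a few assumptions. It validates by decoding with the strict "UTF-8" encoding name and FB_CROAK, and it treats "holding only ISO-8859-x characters" as "every decoded character can be encoded into that code page". The names fits_codepage and classify_utf8 and the returned hash layout are invented for this example. To test against a custom repertoire such as MyLatin1 instead of the full code page, the encode() call could be swapped for a match against a character class.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

my @CODEPAGES = qw(iso-8859-1 iso-8859-2 iso-8859-5);
# FB_CROAK: die on invalid input; LEAVE_SRC: do not modify the argument.
my $STRICT = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# True if every character of $text exists in code page $cp.
sub fits_codepage {
    my ($text, $cp) = @_;
    return defined(eval { encode($cp, $text, $STRICT) }) ? 1 : 0;
}

sub classify_utf8 {
    my ($bytes) = @_;

    # Test 5: 'UTF-8' (with hyphen) is the strict variant in Encode.
    my $text = eval { decode('UTF-8', $bytes, $STRICT) };
    return { valid => 0 } unless defined $text;

    $text =~ s/\A\x{FEFF}//;    # ignore a leading BOM

    # Tests 6-8: which single code pages cover all characters?
    my @fits = grep { fits_codepage($text, $_) } @CODEPAGES;

    # Test 9: no single code page fits, but each character fits at least one.
    my %seen;
    my @chars = grep { !$seen{$_}++ } split //, $text;
    my $mixed = @fits ? 0 : 1;
    for my $ch (@chars) {
        last unless $mixed;
        $mixed = 0 unless grep { fits_codepage($ch, $_) } @CODEPAGES;
    }

    return { valid => 1, fits => \@fits, mixed => $mixed };
}

my $result = classify_utf8(encode('UTF-8', "\x{142}\x{15B}"));
print "valid=$result->{valid} fits=@{ $result->{fits} } mixed=$result->{mixed}\n";
```

Several code pages can fit at once, because characters such as "ü" exist in both ISO-8859-1 and ISO-8859-2, and pure ASCII fits all three. If the data is not valid UTF-8, checks 10-12 then have to work on the raw bytes rather than on decoded characters.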
