Channel: Active questions tagged utf-8 - Stack Overflow

Passing UTF-8 arguments to commands in Perl on Windows

I am trying to build a template for Perl scripts so that they would do at least most of the basic things right with UTF-8 and would work equally well on Linux and Windows machines.

One thing in particular has eluded me for a while: passing UTF-8 strings as arguments to system commands. It seems to me that there is no way to prevent the arguments from being double-encoded as UTF-8 before they reach the shell. As I understand it, some layer ignores that the command and its arguments are already properly UTF-8 encoded, treats them as Latin-1 or something of the sort, and encodes them again as UTF-8. I could not find a way to cleanly avoid this layer of encoding.
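To make the suspected mechanism concrete, here is a minimal sketch of what I mean by double encoding. It assumes the intervening layer misreads the UTF-8 bytes as Latin-1; on Windows it may well be the active ANSI code page (for example Windows-1252) instead, which is my assumption and not something I have confirmed. The helper names `double_encode` and `undo_double_encode` are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Simulate the suspected extra layer: take the UTF-8 bytes of a string
# and misread each byte as one Latin-1 character (assumed encoding).
sub double_encode {
    my ($text) = @_;
    my $utf8_bytes = encode('UTF-8', $text);
    return decode('ISO-8859-1', $utf8_bytes);
}

# Reverse the damage: map each character back to a single byte,
# then decode those bytes as the UTF-8 they originally were.
sub undo_double_encode {
    my ($garbled) = @_;
    my $original_bytes = encode('ISO-8859-1', $garbled);
    return decode('UTF-8', $original_bytes);
}
```

With these helpers, `double_encode("\x{e9}")` (the letter é) yields the two characters U+00C3 and U+00A9, which is the familiar "Ã©" pattern. The round trip only works under the Latin-1 assumption, because every byte value maps to exactly one character there.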

Take this script:

    #!/usr/bin/perl
    use v5.14;
    use utf8;
    use feature 'unicode_strings';
    use feature 'fc';
    use open ':std', ':encoding(UTF-8)';
    use strict;
    use warnings;
    use warnings FATAL => 'utf8';
    use constant IS_WINDOWS => $^O eq 'MSWin32';

    # Set proper locale
    $ENV{'LC_ALL'} = 'C.UTF-8';

    # Set UTF-8 code page on Windows
    if (IS_WINDOWS) {
      system("chcp 65001 > nul 2>&1");
    };

    # Use Win32::Unicode::Process on Windows
    if (IS_WINDOWS) {
      eval {
        require Win32::Unicode::Process;
        Win32::Unicode::Process->import;
      };
      if ($@) {
        die "Could not load Win32::Unicode::Process: $@";
      };
    };

    # Show the empty directory
    print "---\n" . `ls -1 system*` . "---\n";

    my $utf = "test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽";

    # Works fine on Linux but not on Windows
    print "System (touch) exit code: " . system("touch system-$utf > touch-system.txt 2>&1") . "\n";
    print "System (echo) exit code: " . system("echo system-$utf > echo-system.txt 2>&1") . "\n";

    if (IS_WINDOWS) {
      # Works fine on Windows
      print "SystemW (touch) exit code: " . systemW("touch systemW-$utf > touch-systemW.txt 2>&1") . "\n";
      print "SystemW (echo) exit code: " . systemW("echo systemW-$utf > echo-systemW.txt 2>&1") . "\n";
    };

    # Show the directory with the new files
    print "---\n" . `ls -1 system*` . "---\n";

    exit;

On Linux, everything is fine: the file created with touch through system() has a UTF-8 encoded filename and the content of the file created with echo is correctly UTF-8 encoded.

Yet, I found no way to get the same code to behave correctly on Windows. There, the output of the script is this:

    ---
    ---
    System (touch) exit code: 0
    System (echo) exit code: 0
    SystemW (touch) exit code: 
    SystemW (echo) exit code: 
    ---
    system-test-теÑÑ‚-מבחן-परीकà¥à¤·à¤£-😊-ð“½ð“®ð“¼ð“½
    systemW-test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽
    ---

As the script shows, the only way I could make it work is to use Win32::Unicode::Process::systemW() to replace system(). The file systemW-test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽 is correctly named and the content of echo-systemW.txt is correctly encoded in UTF-8.
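The workaround can at least be hidden behind a single function so that the calling code stays the same on both platforms. The following is only a sketch: `run_command` is a hypothetical name of my own, and calling `systemW` by its fully qualified name assumes it is defined in the `Win32::Unicode::Process` package. Note also that, judging by the output above, the return value of `systemW` does not seem to follow the same convention as that of `system`.

```perl
use strict;
use warnings;
use Encode qw(encode);

use constant IS_WINDOWS => $^O eq 'MSWin32';

# Hypothetical wrapper: same call on every platform, different backend.
sub run_command {
    my ($command) = @_;

    if (IS_WINDOWS) {
        # Loaded lazily so the script still compiles where the module
        # is not installed (that is, everywhere except Windows).
        require Win32::Unicode::Process;
        return Win32::Unicode::Process::systemW($command);
    }

    # Elsewhere, hand the shell explicit UTF-8 bytes.
    return system(encode('UTF-8', $command));
}
```

A call such as `run_command("touch system-$utf")` would then replace both the `system()` and the `systemW()` lines in the script.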

My questions are these:

  1. Is there a way to avoid using systemW() and keep the code identical for Linux and Windows, while somehow removing the layer that double-encodes the system command? In other words, is systemW() the only good way to go?

  2. If this is the right way, I am not sure how to obtain similarly correct behaviour for backticks. They have the same problem as system(), but I have no idea how to capture the output of a command with systemW() other than redirecting it into a temporary file and reading that file afterwards (possible, of course, but maybe not great).
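For reference, the temporary-file approach mentioned in the second question could look roughly like this. It is a sketch under assumptions, not a tested Windows solution: `capture_via_tempfile` is a hypothetical helper name, the command's output is assumed to be UTF-8, and `systemW` is assumed to be callable by its fully qualified name in `Win32::Unicode::Process`.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

use constant IS_WINDOWS => $^O eq 'MSWin32';

# Hypothetical stand-in for backticks: run a command with its output
# redirected to a temporary file, then return the decoded contents.
sub capture_via_tempfile {
    my ($command) = @_;

    # Create the file, then close our handle so the child can write to it.
    my ($fh, $path) = tempfile(UNLINK => 1);
    close $fh;

    my $full_command = qq{$command > "$path" 2>&1};
    if (IS_WINDOWS) {
        require Win32::Unicode::Process;
        Win32::Unicode::Process::systemW($full_command);
    }
    else {
        system(encode('UTF-8', $full_command));
    }

    # Read raw bytes and decode explicitly (assumes UTF-8 output).
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;    # slurp the whole file
    my $bytes = <$in>;
    close $in;

    return decode('UTF-8', $bytes);
}
```

Usage would be `my $listing = capture_via_tempfile('ls -1 system*');` in place of the backticks. Like backticks with `2>&1`, it mixes standard error into the captured text, and it costs one temporary file per call, which is the part I would like to avoid.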

