I am trying to build a template for Perl scripts so that they do at least most of the basic things right with UTF-8 and work equally well on Linux and Windows machines.
One thing in particular escaped me for a while: passing UTF-8 strings as arguments to system commands. It seems to me that there is no way to avoid having the arguments double-UTF-8-encoded before they reach the shell. That is, as I understand it, there is a layer that ignores the fact that the command and its arguments are already properly UTF-8 encoded, takes them for Latin-1 or something of the sort, and encodes them as UTF-8 again. I could not find a way to cleanly avoid this extra layer of encoding.
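One workaround I have considered (a minimal sketch, Linux-oriented; the file name is made up) is to do the byte encoding myself with Encode and use the list form of system(), so that no shell and no implicit layer gets a chance to re-encode anything:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Encode the argument to raw UTF-8 bytes explicitly, then use the
# list form of system() so the bytes reach the command untouched.
my $name = "тест";
my @cmd  = ('touch', encode('UTF-8', "system-$name"));
system(@cmd) == 0 or die "touch failed: $?";
```

This avoids the double encoding on Linux, but I do not know whether it helps on Windows, where the console code page complicates things further.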
Take this script:
#!/usr/bin/perl
use v5.14;
use utf8;
use feature 'unicode_strings';
use feature 'fc';
use open ':std', ':encoding(UTF-8)';
use strict;
use warnings;
use warnings FATAL => 'utf8';

use constant IS_WINDOWS => $^O eq 'MSWin32';

# Set proper locale
$ENV{'LC_ALL'} = 'C.UTF-8';

# Set UTF-8 code page on Windows
if (IS_WINDOWS) {
    system("chcp 65001 > nul 2>&1");
}

# Use Win32::Unicode::Process on Windows
if (IS_WINDOWS) {
    eval {
        require Win32::Unicode::Process;
        Win32::Unicode::Process->import;
    };
    if ($@) {
        die "Could not load Win32::Unicode::Process: $@";
    }
}

# Show the empty directory
print "---\n" . `ls -1 system*` . "---\n";

my $utf = "test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽";

# Works fine on Linux but not on Windows
print "System (touch) exit code: " . system("touch system-$utf > touch-system.txt 2>&1") . "\n";
print "System (echo) exit code: " . system("echo system-$utf > echo-system.txt 2>&1") . "\n";

if (IS_WINDOWS) {
    # Works fine on Windows
    print "SystemW (touch) exit code: " . systemW("touch systemW-$utf > touch-systemW.txt 2>&1") . "\n";
    print "SystemW (echo) exit code: " . systemW("echo systemW-$utf > echo-systemW.txt 2>&1") . "\n";
}

# Show the directory with the new files
print "---\n" . `ls -1 system*` . "---\n";

exit;
On Linux, everything is fine: the file created with touch through system() has a UTF-8 encoded filename, and the content of the file created with echo is correctly UTF-8 encoded.
Yet, I found no way to get the same code to behave correctly on Windows. There, the output of the script is this:
---
---
System (touch) exit code: 0
System (echo) exit code: 0
SystemW (touch) exit code: 
SystemW (echo) exit code: 
---
system-test-теÑÑ‚-מבחן-परीकà¥à¤·à¤£-😊-ð“½ð“®ð“¼ð“½
systemW-test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽
---
As the script shows, the only way I could make it work was to replace system() with Win32::Unicode::Process::systemW(). The file systemW-test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽 is correctly named, and the content of echo-systemW.txt is correctly encoded in UTF-8.
My questions are these:

1. Is there a way to avoid using systemW() and keep the code identical for Linux and Windows, by somehow removing the layer that double-encodes the system command? In other words, is systemW() the only good way to go?

2. If this is the right way, I am not sure how to obtain similarly correct behaviour for backticks. They have the same problem as system(), but I have no idea how to capture the output of a command with systemW(), aside from piping it into a temporary file and reading that at the end (possible, of course, but maybe not great).
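For the backticks half of the question, the closest thing I know of (on Linux, at least; I do not know how it interacts with systemW on Windows) is a piped open with an explicit UTF-8 decode layer, which at least avoids the temporary file. A minimal sketch, using echo as a stand-in command:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

binmode STDOUT, ':encoding(UTF-8)';

# Capture a command's output through a pipe with an explicit UTF-8
# decode layer, instead of backticks. The list form bypasses the shell,
# and the argument is pre-encoded to bytes by hand.
open(my $fh, '-|:encoding(UTF-8)', 'echo', encode('UTF-8', 'тест'))
    or die "cannot start command: $!";
my @lines = <$fh>;
close($fh);

print $lines[0];
```

Whether this pattern can be combined with the Win32::Unicode::Process approach on Windows is exactly what I am unsure about.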