Good Perl style: how to convert UTF-8 C string literals to \xXX sequences

Time: 2021-01-04 00:14:17

[Python people: My question is at the very end :-)]

I want to use UTF-8 within C string literals for readability and easy maintenance. However, this is not universally portable. My solution is to create a file foo.c.in, which a small Perl script converts to foo.c so that it contains \xXX escape sequences instead of bytes greater than or equal to 0x80.

For simplicity, I assume that a C string starts and ends in the same line.

This is the Perl code I've created. If a byte >= 0x80 is found, the original string is also emitted as a comment.

use strict;
use warnings;

binmode STDIN, ':raw';
binmode STDOUT, ':raw';


sub utf8_to_esc
{
  my $string = shift;
  my $oldstring = $string;
  my $count = 0;
  $string =~ s/([\x80-\xFF])/$count++; sprintf("\\x%02X", ord($1))/eg;
  $string = '"' . $string . '"';
  $string .= " /* " . $oldstring . " */" if $count;
  return $string;
}

while (<>)
{
  s/"((?:[^"\\]++|\\.)*+)"/utf8_to_esc($1)/eg;
  print;
}

For example, the input

"fööbär"

gets converted to

"f\xC3\xB6\xC3\xB6b\xC3\xA4r" /* fööbär */

Finally, my question: I'm not very good at Perl, and I wonder whether it is possible to rewrite the code in a more elegant (or more 'Perlish') way. I would also appreciate it if someone could point to similar code written in Python.

3 answers

#1


4  

  1. I think it's best if you don't use :raw. You are processing text, so you should properly decode and encode. That will be far less error prone, and it will allow your parser to use predefined character classes if you so desire.

  2. You parse as if you expect backslashes in the literal, but then you completely ignore them when you escape. Because of that, you could end up with "...\\xC3\xA3...". Working with decoded text will also help here.

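To see point 2 in action, here is a small sketch in Python (my choice of language; `naive_escape` is a hypothetical name) that imitates the byte-wise substitution from the question on a literal body containing a backslash directly followed by ã:

```python
import re

def naive_escape(body: bytes) -> bytes:
    # Byte-wise escaping as in the question: each byte >= 0x80 becomes \xXX.
    return re.sub(rb"[\x80-\xff]", lambda m: b"\\x%02X" % m.group()[0], body)

# Literal body: a backslash directly followed by U+00E3 (UTF-8 bytes C3 A3).
body = "\\\u00e3".encode("utf-8")
print(naive_escape(body).decode("ascii"))  # prints \\xC3\xA3
```

A C compiler reads that output as an escaped backslash, the plain text xC3, and then the byte 0xA3, which is not what the source literal meant.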

So forget "perlish"; let's actually fix the bugs.

use open ':std', ':locale';

sub convert_char {
   my ($s) = @_;
   utf8::encode($s);
   $s = uc unpack 'H*', $s;
   $s =~ s/\G(..)/\\x$1/sg;
   return $s;
}

sub convert_literal {
   my $orig = my $s = substr($_[0], 1, -1);

   my $safe          = '\x20-\x7E';          # ASCII printables and space
   my $safe_no_slash = '\x20-\x5B\x5D-\x7E'; # ASCII printables and space, no \
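   $s =~ s{
      (?: \\? ( [^$safe] )
      |   ( (?: [$safe_no_slash] | \\[$safe] )+ )
      )
   }{
      defined($1) ? convert_char($1) : $2
   }egx;

   # s///g counts every match, including runs of safe characters,
   # so compare against the original to see whether anything changed.
   my $changed = $s ne $orig;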

   # XXX Assumes $orig doesn't contain "*/"
   return qq{"$s"} . ( $changed ? " /* $orig */" : '' );
}

while (<>) {
   s/(" (?:[^"\\]++|\\.)*+ ")/ convert_literal($1) /segx;
   print;
}
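
The same decode-first idea carries over to Python 3, where `str` is already decoded text. The following is a loose port of `convert_char` and `convert_literal`, not a verified equivalent of the Perl; the names are kept from the code above, and tracking changes with a flag is my own choice:

```python
import re

SAFE = r"\x20-\x7E"                    # ASCII printables and space
SAFE_NO_SLASH = r"\x20-\x5B\x5D-\x7E"  # same, minus the backslash

TOKEN = re.compile(
    rf"\\?([^{SAFE}])"                        # unsafe char, optional leading backslash
    rf"|((?:[{SAFE_NO_SLASH}]|\\[{SAFE}])+)"  # run of safe chars and ASCII escape pairs
)

def convert_char(ch: str) -> str:
    # One \xXX escape per UTF-8 byte of the character.
    return "".join(f"\\x{b:02X}" for b in ch.encode("utf-8"))

def convert_literal(literal: str) -> str:
    orig = literal[1:-1]  # strip the surrounding double quotes
    changed = False

    def repl(m):
        nonlocal changed
        if m.group(1) is not None:
            changed = True
            return convert_char(m.group(1))
        return m.group(2)

    body = TOKEN.sub(repl, orig)
    # Like the Perl version, this assumes orig does not contain "*/".
    return f'"{body}"' + (f" /* {orig} */" if changed else "")
```

Calling `convert_literal` on the literal `"fööbär"` should give the output shown in the question, while a pure-ASCII literal comes back unchanged and without a comment.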

#2


3  

Re: a more Perlish way.

You can use arbitrary delimiters for quote operators, so you can use string interpolation instead of explicit concatenation, which can look nicer. Also, counting the number of substitutions is unnecessary: a substitution in scalar context evaluates to the number of replacements made.

I would have written your (misnamed!) function as

use strict; use warnings;
use Carp;

sub escape_high_bytes {
  my ($orig) = @_;

  # Complain if the input is not a string of bytes.
  utf8::downgrade($orig, 1)
    or carp "Input must be binary data";

  if ((my $changed = $orig) =~ s/([\P{ASCII}\P{Print}])/sprintf '\\x%02X', ord $1/eg) {
    # TODO make sure $orig does not contain "*/"
    return qq("$changed" /* $orig */);
  } else {
    return qq("$orig");
  }
}

The (my $copy = $str) =~ s/foo/bar/ is the standard idiom to run a replace in a copy of a string. With 5.14, we could also use the /r modifier, but then we don't know whether the pattern matched, and we would have to resort to counting.

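For comparison, Python's `re.subn` returns the replacement count alongside the new string, which plays the role of the s/// return value here. A rough Python rendering of `escape_high_bytes`, assuming the input is already a byte string and using `[^\x20-\x7e]` as a byte-level stand-in for `[\P{ASCII}\P{Print}]`:

```python
import re

def escape_high_bytes(orig: bytes) -> bytes:
    # re.subn returns (new_bytes, number_of_replacements).
    changed, count = re.subn(
        rb"[^\x20-\x7e]",
        lambda m: b"\\x%02X" % m.group()[0],
        orig,
    )
    if count:
        # As in the Perl version, orig is assumed not to contain "*/".
        return b'"' + changed + b'" /* ' + orig + b" */"
    return b'"' + orig + b'"'
```

Because the count comes back with the result, no separate counter variable is needed.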

Please be aware that this function has nothing to do with Unicode or UTF-8. The utf8::downgrade($string, $fail_ok) makes sure that the string can be represented using single bytes. If this can't be done (and the second argument is true), then it returns a false value.

The regex operators \p{...} and the negation \P{...} match codepoints that have a certain Unicode property. E.g. \P{ASCII} matches all characters that are not in the range [\x00-\x7F], and \P{Print} matches all characters that are not printable, e.g. control codes like \x00, but not the space character.

Your while (<>) loop is arguably buggy: This does not necessarily iterate over STDIN. Rather, it iterates over the contents of the files listed in @ARGV (the command line arguments), or defaults to STDIN if that array is empty. Note that the :raw layer will not be declared for the files from @ARGV. Possible solutions:

  • You can use the open pragma to declare default layers for all filehandles.
  • You can read from STDIN explicitly with while (<STDIN>).
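
Python has the same "files named on the command line, else standard input" convention in its `fileinput` module. A minimal sketch, assuming binary mode is the closest counterpart of :raw; `read_lines` and the temporary demo file are illustrative only:

```python
import fileinput
import os
import tempfile

def read_lines(paths=None):
    # paths=None makes fileinput fall back to sys.argv[1:], and to stdin
    # when no file names are given. mode="rb" applies to every file.
    with fileinput.input(files=paths, mode="rb") as stream:
        for line in stream:
            yield line

# Demo with a throwaway file so the sketch does not depend on stdin.
with tempfile.NamedTemporaryFile("wb", suffix=".c.in", delete=False) as tmp:
    tmp.write(b'puts("f\xc3\xb6\xc3\xb6");\n')
lines = list(read_lines([tmp.name]))
os.unlink(tmp.name)
print(lines)
```

Unlike the Perl loop in the question, the mode is given once and covers both the named files and standard input.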

Do you know what is Perlish? Using modules. As it happens, String::Escape already implements much of the functionality you want.

#3


1  

Similar code written in Python

Python 2.7

import re
import sys

def utf8_to_esc(matched):
    s = matched.group(1)
    s2 = s.encode('string-escape')
    result = '"{}"'.format(s2)
    if s != s2:
        result += ' /* {} */'.format(s)
    return result

sys.stdout.writelines(re.sub(r'"([^"]+)"', utf8_to_esc, line) for line in sys.stdin)

Python 3.x

def utf8_to_esc(matched):
    ...
    s2 = s.encode('unicode-escape').decode('ascii')
    ...
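
Note that `unicode-escape` emits code-point escapes (ö becomes `\xf6`), whereas the output in the question uses the UTF-8 bytes (`\xC3\xB6`). A sketch of a Python 3 version that reproduces the question's output, keeping the structure of the 2.7 code above; `convert_line` is a hypothetical helper name:

```python
import re

def utf8_to_esc(matched):
    s = matched.group(1)
    # Replace each non-ASCII character by the \xXX escapes of its UTF-8 bytes.
    s2 = re.sub(
        r"[^\x00-\x7f]",
        lambda m: "".join("\\x%02X" % b for b in m.group().encode("utf-8")),
        s,
    )
    result = '"{}"'.format(s2)
    if s != s2:
        result += " /* {} */".format(s)
    return result

def convert_line(line):
    # Same simple literal pattern as the 2.7 code: no escaped quotes inside.
    return re.sub(r'"([^"]+)"', utf8_to_esc, line)
```

As a filter it can be driven the same way as before, e.g. by passing `convert_line(line) for line in sys.stdin` to `sys.stdout.writelines`. One C-level caveat worth checking in the generated file: a `\x` escape absorbs every following hex digit, so in `\xB6b` the `b` is read as part of the escape; splitting the literal (`"\xB6" "b"`) or using three-digit octal escapes avoids that.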
