How can I split a UTF-8 string by an escape sequence provided as a command-line argument in Python 3?

Time: 2021-06-28 00:06:26

I'm trying to split UTF-8 strings by a delimiter provided as a command-line argument in Python 3. The TAB character "\t" should be a valid option. Unfortunately, I haven't found a way to have the argument interpreted as an escape sequence. I wrote a little test script called "test.py":

# coding: utf8
import sys

print(sys.argv[1])

l1 = u"12345\tktktktk".split(sys.argv[1])
print(l1)

l2 = u"633\tbgt".split(sys.argv[1])
print(l2)

I tried to run the script as follows (in a Guake terminal on a Kubuntu Linux host):

  1. python3 test.py \t
  2. python3 test.py \t
  3. python3 test.py '\t'
  4. python3 test.py "\t"

None of these attempts worked. I also tried this with a larger file containing "real" (and unfortunately confidential) data, where for some strange reason the lines were split correctly in many (but by far not all) cases when using the first call.
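A likely explanation for the partial success of the first call: an unquoted backslash is consumed by the shell, so the script receives the plain letter t and splits on it. The sketch below uses Python's shlex module as an approximation of POSIX shell word splitting (an assumption; it is not Bash itself and does not understand $'...') to show what each quoting style would deliver in sys.argv[1].

```python
import shlex

# Command lines as typed; shlex.split approximates POSIX shell quoting rules.
commands = [
    r"python3 test.py \t",    # unquoted backslash
    r"python3 test.py '\t'",  # single quotes
    r'python3 test.py "\t"',  # double quotes
]

# Collect the delimiter each variant would hand to the script.
received = [shlex.split(cmd)[2] for cmd in commands]
for cmd, delim in zip(commands, received):
    print(f"{cmd:26} -> sys.argv[1] = {delim!r}")

# Splitting on the letter "t" only looks right when the data contains a "t".
print("12345\tktktktk".split(received[0]))
```

Under these rules none of the three variants passes a real tab character: the unquoted form passes the letter t, and both quoted forms pass a backslash followed by t.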

What is the correct way to make Python 3 interpret a command-line argument as an escape sequence rather than as a literal string?

2 Answers

#1


You can use $'...' quoting in the shell:

python3 test.py $'\t'

ANSI-C Quoting (Bash Reference Manual):

Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Backslash escape sequences, if present, are decoded as follows:

\a      alert (bell)
\b      backspace
\e, \E  an escape character (not ANSI C)
\f      form feed
\n      newline
\r      carriage return
\t      horizontal tab  <-
...

Output:

$ python3 test.py $'\t'

['12345', 'ktktktk']
['633', 'bgt']

From the Bash Hackers Wiki:

This is especially useful when you want to give special characters as arguments to some programs, like giving a newline to sed.

The resulting text is treated as if it was single-quoted. No further expansions happen.

The $'...' syntax comes from ksh93, but is portable to most modern shells including pdksh. A specification for it was accepted for SUS issue 7. There are still some stragglers, such as most ash variants including dash (except busybox built with "bash compatibility" features).

Or decode the escape sequence in Python:

import sys

# Interpret backslash escapes such as "\t" in the raw argument.
arg = bytes(sys.argv[1], "utf-8").decode("unicode_escape")

print(arg)

l1 = u"12345\tktktktk".split(arg)
print(l1)

l2 = u"633\tbgt".split(arg)
print(l2)

Output:

$ python3 test.py '\t'

['12345', 'ktktktk']
['633', 'bgt']
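One caveat with the Python snippet above: the unicode_escape codec decodes bytes as Latin-1, so a delimiter containing non-ASCII characters gets mangled when it is first encoded as UTF-8. A minimal sketch of a workaround follows; decode_escapes is a hypothetical helper name, and encoding through Latin-1 with backslashreplace is one common idiom rather than this answer's own method.

```python
def decode_escapes(raw: str) -> str:
    """Interpret backslash escapes in raw while keeping non-ASCII text intact.

    Hypothetical helper: characters outside Latin-1 are first turned into
    \\uXXXX escapes, which unicode_escape then converts back.
    """
    return raw.encode("latin-1", "backslashreplace").decode("unicode_escape")


# Encoding as UTF-8 first mangles non-ASCII input:
# the two UTF-8 bytes of "é" are each read as a Latin-1 character.
naive = bytes("é", "utf-8").decode("unicode_escape")
print(repr(naive))

# The helper keeps the character as typed and still decodes the escape.
print(repr(decode_escapes("é\\t")))
print("a→b→c".split(decode_escapes("→")))
```

For plain ASCII arguments such as '\t' both variants behave the same, so the extra step only matters when delimiters outside ASCII are expected.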

#2


At least in Bash on Linux, you need to press Ctrl+V followed by Tab to enter a literal tab character:

Example:

python utfsplit.py '<Ctrl+V><Tab>'

Your code otherwise works:

$ python3.4 utfsplit.py '       '

['12345', 'ktktktk']
['633', 'bgt']

NB: The tab character can't really be displayed here :)
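To check the claim that the script works once a real tab arrives, the shell can be taken out of the picture entirely. The sketch below (an illustration, not part of the answer) starts a child Python process with an argument list, so the tab character reaches sys.argv[1] unchanged; the inline child program is a stand-in for utfsplit.py.

```python
import subprocess
import sys

# Child program: show the received delimiter and the resulting split.
child = (
    "import sys; "
    "print(repr(sys.argv[1])); "
    "print('12345\\tktktktk'.split(sys.argv[1]))"
)

# Passing a list bypasses the shell, so no quoting rules apply to "\t".
result = subprocess.run(
    [sys.executable, "-c", child, "\t"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```

The first line printed by the child is the repr of a single tab character, which confirms that the delimiter problem lies in how the shell quotes the argument, not in the split call.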
