I'm a newbie, and I'm sure a similar question has been asked in the past, but I am having trouble finding/understanding an answer. Thank you in advance for being patient with me!
So I'm trying to write a script to read lines in a utf-8 encoded input file, compare portions of it to an optional command line argument passed in by the user, and if there's a match, to do some stuff to that line before printing it to an output file. I'm using codecs to open the files.
I'm using the argparse module to parse command line arguments right now. The lines in the file can be in all sorts of languages, hence the command line argument needs to also be utf-8.
For example:
A line from the file might look like this:
разъедают {. r ax z . j je . d ax1 . ju t .}
The script should be called from the command line with something like this:
>python myscript.py mytextfile.txt -grapheme ъ
Here's the part of my code that is supposed to do the processing. In this case, orth is some Cyrillic text and grapheme is a Cyrillic character.
def process_orth(orth, grapheme):
    grapheme = grapheme.decode(sys.stdin.encoding).encode('utf-8')
    if (grapheme in orth):
        print 'success, your grapheme was: ' + grapheme.encode('utf-8')
        return True
    else:
        print 'failure, your grapheme was: ' + grapheme.encode('utf-8')
        return False
Unfortunately, even though the grapheme is definitely there, the function returns false and prints a question mark instead of the grapheme:
failure, your grapheme was: ?
I've tried adding the following at the start of process_orth() as per the recommendation of some other post I read, but it didn't seem to work:
grapheme.decode(sys.stdin.encoding).encode('utf-8')
So my question is...
How do I pass utf-8 strings through the command line into a python script? Also, are there any extra quirks with this on Windows7 (and does having cygwin installed change anything)?
1 Answer
#1
3
If you are opening the input file using codecs.open() then you have unicode data, not encoded data. You would want to just decode grapheme, not encode it again to UTF-8:
grapheme = grapheme.decode(sys.stdin.encoding)
if grapheme in orth:
    print u'success, your grapheme was: ' + grapheme
    return True
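To make the decode-only idea concrete, here is a minimal sketch that runs on both Python 2 and Python 3. The helper name to_text and the extra encoding parameter are hypothetical additions for illustration, and the UTF-8 fallback is an assumption, not something the question guarantees.

```python
# -*- coding: utf-8 -*-
import sys


def to_text(value, encoding=None):
    """Return value as Unicode text, decoding it only if it is a byte string."""
    if isinstance(value, bytes):
        # Assumption: fall back to UTF-8 when no encoding is known
        # (sys.stdin.encoding can be None when input is piped).
        stdin_encoding = getattr(sys.stdin, 'encoding', None)
        return value.decode(encoding or stdin_encoding or 'utf-8')
    return value


def process_orth(orth, grapheme, encoding=None):
    """Report whether grapheme occurs in orth, which is already Unicode."""
    # Decode once; never re-encode, so both sides of `in` are Unicode text.
    grapheme = to_text(grapheme, encoding)
    return grapheme in orth
```

Calling process_orth(line, args.grapheme) with a line read through codecs.open() then compares Unicode with Unicode, which is the point of dropping the extra .encode('utf-8').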
Note that we print unicode as well; normally print will ensure that Unicode values are encoded again for your current codepage. This can still fail as Windows console printing is notoriously difficult; see http://wiki.python.org/moin/PrintFails.
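One plausible way a question mark can show up is an encoding step that substitutes characters the target codepage cannot represent. The sketch below only illustrates that mechanism; console_safe is a hypothetical helper, and whether this is what happened in your console is an assumption.

```python
# -*- coding: utf-8 -*-
def console_safe(text, encoding):
    """Round-trip text through encoding, replacing unencodable characters."""
    # 'replace' substitutes '?' for anything the codec cannot encode.
    return text.encode(encoding, 'replace').decode(encoding)


lossy = console_safe(u'ъ', 'ascii')   # the codec has no Cyrillic letters
intact = console_safe(u'ъ', 'utf-8')  # the codec can represent the letter
```

Here lossy is '?' while intact is still 'ъ', so a codepage without Cyrillic support loses the character before it ever reaches the screen.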
Unfortunately, sys.argv on Windows can apparently end up garbled, as Python uses a non-unicode aware system call. See "Read Unicode characters from command-line arguments in Python 2.x on Windows" for a unicode-aware alternative.
I see no reason for argparse to have any problems with Unicode input, but if it does, you can always take the unicode output from win32_unicode_argv() and encode it to UTF-8 before passing it to argparse.
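A different arrangement, sketched here as an assumption rather than as what the linked answer does, is to let argparse do the decoding through its type= callable. decode_arg and build_parser are hypothetical names, and the 'utf-8' default only holds if your shell really delivers UTF-8 bytes; on Windows it may well be another codepage.

```python
# -*- coding: utf-8 -*-
import argparse


def decode_arg(value, encoding='utf-8'):
    """Decode a byte-string argument; pass Unicode text through unchanged."""
    # Assumption: byte-string arguments are encoded in `encoding`.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('infile')
    # argparse calls decode_arg on the raw argument string.
    parser.add_argument('-grapheme', type=decode_arg, default=None)
    return parser


# Passing an explicit list stands in for sys.argv[1:] in this sketch.
args = build_parser().parse_args(['mytextfile.txt', '-grapheme', u'ъ'])
```

After parsing, args.grapheme is Unicode text on both Python 2 and Python 3, so it can be compared directly against lines read with codecs.open().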