Reading Unicode characters from command-line arguments in Python 2.x on Windows

I want my Python script to be able to read Unicode command line arguments in Windows. But it appears that sys.argv is a string encoded in some local encoding, rather than Unicode. How can I read the command line in full Unicode?

Example code: argv.py

import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)
print first_arg.encode("hex")
print open(first_arg)

On my PC set up for Japanese code page, I get:

C:\temp>argv.py "PC・ソフト申請書08.09.24.doc"
PC・ソフト申請書08.09.24.doc
<type 'str'>
50438145835c83748367905c90bf8f9130382e30392e32342e646f63
<open file 'PC・ソフト申請書08.09.24.doc', mode 'r' at 0x00917D90>

That's Shift-JIS encoded I believe, and it "works" for that filename. But it breaks for filenames with characters that aren't in the Shift-JIS character set—the final "open" call fails:

C:\temp>argv.py Jörgen.txt
Jorgen.txt
<type 'str'>
4a6f7267656e2e747874
Traceback (most recent call last):
  File "C:\temp\argv.py", line 7, in <module>
    print open(first_arg)
IOError: [Errno 2] No such file or directory: 'Jorgen.txt'
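
The limitation can be reproduced without Windows. The following is a minimal sketch that uses Python's cp932 codec as a stand-in for the Japanese code page (an assumption; the question only says "Shift-JIS"). It is written to run on both Python 2 and 3. Python's codec raises an error or substitutes '?', whereas the transcript above shows Windows substituting a plain 'o'. Either way, the original character is lost before the script ever sees it.

```python
# -*- coding: utf-8 -*-
# cp932 is assumed here as a stand-in for the Japanese Windows code page.
japanese_name = u"PC・ソフト申請書08.09.24.doc"
german_name = u"J\u00f6rgen.txt"  # "Jörgen.txt"

# The Japanese name survives a round trip through the code page.
round_trip = japanese_name.encode("cp932").decode("cp932")
print(round_trip == japanese_name)

# The o-umlaut has no cp932 representation, so strict encoding fails.
try:
    german_name.encode("cp932")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)

# With errors="replace" the character is silently turned into '?'.
lossy = german_name.encode("cp932", "replace")
print(lossy.decode("ascii"))  # J?rgen.txt
```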

Note—I'm talking about Python 2.x, not Python 3.0. I've found that Python 3.0 gives sys.argv as proper Unicode. But it's a bit early yet to transition to Python 3.0 (due to lack of 3rd party library support).

Update:

A few answers have said I should decode according to whatever the sys.argv is encoded in. The problem with that is that it's not full Unicode, so some characters are not representable.

Here's the use case that gives me grief: I have enabled drag-and-drop of files onto .py files in Windows Explorer. I have file names with all sorts of characters, including some not in the system default code page. My Python script doesn't get the right Unicode filenames passed to it via sys.argv in all cases, when the characters aren't representable in the current code page encoding.

There is certainly some Windows API to read the command line with full Unicode (and Python 3.0 does it). I assume the Python 2.x interpreter is not using it.

4 Solutions

#1

Here is a solution that is just what I'm looking for, making calls to the Windows GetCommandLineW and CommandLineToArgvW functions:
Get sys.argv with Unicode characters under Windows (from ActiveState)

But I've made several changes, to simplify its usage and better handle certain uses. Here is what I use:

win32_unicode_argv.py

"""
win32_unicode_argv.py

Importing this will replace sys.argv with a full Unicode form.
Windows only.

From this site, with adaptations:
      http://code.activestate.com/recipes/572200/

Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""


import sys

def win32_unicode_argv():
    """Uses kernel32.GetCommandLineW and shell32.CommandLineToArgvW to get
    sys.argv as a list of Unicode strings.

    Versions 2.x of Python don't support Unicode in sys.argv on
    Windows, with the underlying Windows API instead replacing characters
    that the current code page cannot represent (typically with '?').
    """

    from ctypes import POINTER, byref, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    # kernel32 exports stdcall functions, so load it through windll
    GetCommandLineW = windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

sys.argv = win32_unicode_argv()

Now, the way I use it is simply to do:

import sys
import win32_unicode_argv

and from then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
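
As a rough illustration of the optparse point, the sketch below feeds a list of Unicode strings to OptionParser. The argument list and the -o/--output option are hypothetical stand-ins for what sys.argv[1:] might hold after importing win32_unicode_argv; they are not taken from the original script. The code runs on both Python 2 and 3.

```python
from optparse import OptionParser

parser = OptionParser()
# Hypothetical option, used only for illustration.
parser.add_option("-o", "--output", dest="output", help="output file name")

# Hypothetical stand-in for sys.argv[1:] once it holds Unicode strings.
unicode_args = [u"-o", u"J\u00f6rgen.txt", u"PC\u30fb\u30bd\u30d5\u30c8.doc"]

options, positional = parser.parse_args(unicode_args)

# Both the option value and the positional argument stay Unicode.
print(repr(options.output))
print(repr(positional))
```

Passing the list explicitly to parse_args() keeps the example independent of the real command line; in a script that imports win32_unicode_argv first, calling parse_args() with no arguments reads sys.argv[1:] at call time.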

#2

Dealing with encodings is very confusing.

I believe that if you're inputting data via the command line, it will be encoded in whatever your system encoding is, not as Unicode. (Even copy/paste should do this.)

So it should be correct to decode into unicode using the system encoding:

import codecs
import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)

first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
print first_arg_unicode
print type(first_arg_unicode)

f = codecs.open(first_arg_unicode, 'r', 'utf-8')
unicode_text = f.read()
print type(unicode_text)
print unicode_text.encode(sys.getfilesystemencoding())

Running this:

Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"

will output:

PC・ソフト申請書08.09.24.txt
<type 'str'>
<type 'unicode'>
PC・ソフト申請書08.09.24.txt
<type 'unicode'>
?日本語

Here "PC・ソフト申請書08.09.24.txt" contained the text "日本語". (I saved the file as UTF-8 using Windows Notepad. I'm a little stumped as to why there's a '?' at the beginning when printing. Is it something to do with how Notepad saves UTF-8?)
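
The leading '?' is most likely the UTF-8 byte order mark (BOM) that Notepad writes at the start of files saved as UTF-8. Decoded with the plain utf-8 codec it becomes the character U+FEFF, which the console encoding cannot represent. The sketch below simulates such a file (the file name bom_demo.txt is made up for the demonstration) and shows that the utf-8-sig codec strips the mark. It runs on both Python 2 and 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Simulate a Notepad-style UTF-8 file: BOM bytes followed by the text.
path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"日本語".encode("utf-8"))

# The plain utf-8 codec keeps the BOM as a leading U+FEFF character.
with codecs.open(path, "r", "utf-8") as f:
    plain = f.read()

# The utf-8-sig codec drops a leading BOM if one is present.
with codecs.open(path, "r", "utf-8-sig") as f:
    stripped = f.read()

print(repr(plain))
print(repr(stripped))
```

Under that assumption, replacing 'utf-8' with 'utf-8-sig' in the codecs.open() call of the script above should make the stray '?' disappear.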

The string's 'decode' method or the unicode() builtin can be used to convert an encoded byte string into unicode.

unicode_str = utf8_str.decode('utf8')
unicode_str = unicode(utf8_str, 'utf8')

Also, if you're dealing with encoded files, you may want to use the codecs.open() function in place of the built-in open(). It allows you to define the encoding of the file, and will then use the given encoding to transparently decode the content to unicode.

So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be in unicode.

codecs.open: http://docs.python.org/library/codecs.html?#codecs.open

If I'm misunderstanding something, please let me know.

If you haven't already, I recommend reading Joel's article on Unicode and encoding: http://www.joelonsoftware.com/articles/Unicode.html

#3

Try this:

import sys
print repr(sys.argv[1].decode('UTF-8'))

Maybe you have to substitute CP437 or CP1252 for UTF-8. You should be able to infer the proper encoding name from the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP
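
Rather than hard-coding a code page, one option is to ask Python for the preferred encoding. The helper below is a hedged sketch: decode_arg is a made-up name, and it assumes that the arguments arrive in the encoding reported by locale.getpreferredencoding(), which on Windows reflects the ANSI code page rather than the OEM code page named in the registry key above. It cannot recover characters that the code page was unable to represent in the first place.

```python
import locale


def decode_arg(arg, encoding=None):
    """Return arg as a Unicode string (hypothetical helper).

    Byte strings are decoded with the given encoding, or with the
    locale's preferred encoding when none is given. Undecodable bytes
    become U+FFFD instead of raising an exception.
    """
    if not isinstance(arg, bytes):
        return arg  # already Unicode, nothing to do
    if encoding is None:
        encoding = locale.getpreferredencoding()
    return arg.decode(encoding, "replace")


# Shift-JIS bytes taken from the hex dump in the question.
print(repr(decode_arg(b"\x83\x5c\x83\x74\x83\x67", "cp932")))
```

On Python 2, bytes is an alias for str, so the isinstance check catches the items of sys.argv; a typical call would be [decode_arg(a) for a in sys.argv].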

#4

The command line might be in Windows encoding. Try decoding the arguments into unicode objects:

args = [unicode(x, "iso-8859-9") for x in sys.argv]
