Python's os.path chokes on Hebrew filenames

Time: 2022-08-24 08:57:58

I'm writing a script that has to move some files around, but unfortunately os.path doesn't seem to play well with internationalization. When I have files with Hebrew names, there are problems. Here's a screenshot of the contents of a directory:

alt text http://eli.thegreenplace.net/files/temp/hebfilenameshot.png

Now consider this code that goes over the files in this directory:

import os

# Byte-string path: os.listdir returns byte-string names (Python 2)
files = os.listdir('test_source')

for f in files:
    pf = os.path.join('test_source', f)
    print pf, os.path.exists(pf)

The output is:

test_source\ex True
test_source\joe True
test_source\mie.txt True
test_source\__()'''.txt True
test_source\????.txt False

Notice how os.path.exists thinks that the Hebrew-named file doesn't even exist? How can I fix this?

ActivePython 2.5.2 on Windows XP Home SP2

4 solutions

#1


Hmm, after some digging it appears that when supplying os.listdir a unicode string, this kinda works:

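import os

# Unicode path: os.listdir returns unicode names (Python 2)
files = os.listdir(u'test_source')

for f in files:
    pf = os.path.join(u'test_source', f)
    # Replace characters the console cannot show with '?'
    print pf.encode('ascii', 'replace'), os.path.exists(pf)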

===>

test_source\ex True
test_source\joe True
test_source\mie.txt True
test_source\__()'''.txt True
test_source\????.txt True

Some important observations here:

  • Windows XP (like all NT derivatives) stores all filenames in unicode

  • os.listdir (and similar functions, like os.walk) should be passed a unicode string in order to work correctly with unicode paths. Here's a quote from the aforementioned link:

os.listdir(), which returns filenames, raises an issue: should it return the Unicode version of filenames, or should it return 8-bit strings containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as an 8-bit string or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing an 8-bit path will return the 8-bit versions of the filenames.

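  • And lastly, print has to encode a unicode string for the console. When the console's encoding can't represent the Hebrew characters, printing the path directly fails, so the path is encoded to ascii with 'replace' first.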
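
For readers on Python 3, the observations above can be sketched roughly as follows. This is only an illustration, not the code from the answer: it assumes Python 3 (where str is already Unicode, so no u'' prefix is needed), and printable and list_with_existence are made-up helper names.

```python
import os


def printable(path, encoding="ascii"):
    """Return a copy of path that the given encoding can represent.

    Characters outside the encoding become '?'; the original is unchanged.
    """
    return path.encode(encoding, "replace").decode(encoding)


def list_with_existence(directory):
    """Return (printable_path, exists) pairs for every entry in directory.

    directory is a text (Unicode) path, so os.listdir returns text names.
    """
    results = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        # Test existence with the real name; only the display copy is lossy.
        results.append((printable(full_path), os.path.exists(full_path)))
    return results
```

The point of the split is that the lossy, '?'-filled copy is used for display only, while os.path.exists always receives the real name.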

#2


It looks like a Unicode vs ASCII issue: os.listdir is returning a list of 8-bit strings, in which the Hebrew characters have already been replaced by '?'.

Edit: I tried it on Python 3.0, also on XP SP2, and os.listdir simply omitted the Hebrew filenames from the listing altogether.

According to the docs, this means it was unable to decode them:

Note that when os.listdir() returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising UnicodeError.
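
In Python 3 the same split shows up as str versus bytes paths. A minimal sketch, assuming Python 3 and using a hypothetical helper name (list_both_ways), of how the type of the argument passed to os.listdir decides the type of the names that come back:

```python
import os


def list_both_ways(directory):
    """List directory twice: once with a text path, once with a bytes path."""
    # Text path in, text (str) names out.
    text_names = sorted(os.listdir(directory))
    # Bytes path in, bytes names out.
    byte_names = sorted(os.listdir(os.fsencode(directory)))
    return text_names, byte_names
```

If a bytes name has to be shown or joined with text later, os.fsdecode converts it back to str.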

#3


It works like a charm using Python 2.5.1 on OS X:

subdir/bar.txt True
subdir/foo.txt True
subdir/עִבְרִית.txt True

Maybe that means that this has to do with Windows XP somehow?

EDIT: I also tried with unicode strings, to try to mimic the Windows behaviour better:

import os

for f in os.listdir(u'subdir'):
    pf = os.path.join(u'subdir', f)
    print pf, os.path.exists(pf)

subdir/bar.txt True
subdir/foo.txt True
subdir/עִבְרִית.txt True

That is in Terminal (the stock OS X command-line app). Using IDLE it still worked but didn't print the filename correctly. To make sure it really is unicode there, I checked:

>>> os.listdir(u'listdir')[2]
u'\u05e2\u05b4\u05d1\u05b0\u05e8\u05b4\u05d9\u05ea.txt'

#4


A question mark is the more or less universal symbol displayed when a unicode character can't be represented in a specific encoding. Your terminal or interactive session under Windows is probably using ASCII or ISO-8859-1 or something similar. So the actual string is unicode, but it gets translated to ???? when printed to the terminal. That's why it works for PEZ, on OS X.
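
This substitution is easy to reproduce without touching the filesystem. The sketch below assumes Python 3; the sample filename and the helper show_in_encodings are made up for illustration. It pushes a Hebrew name through several encodings with the 'replace' error handler, so every character an encoding lacks comes out as '?':

```python
def show_in_encodings(name, encodings=("ascii", "latin-1", "cp1255", "utf-8")):
    """Map each encoding to how name looks after a lossy round trip through it."""
    shown = {}
    for encoding in encodings:
        # 'replace' substitutes '?' for every character the encoding lacks.
        shown[encoding] = name.encode(encoding, "replace").decode(encoding)
    return shown


if __name__ == "__main__":
    hebrew_name = "\u05e9\u05dc\u05d5\u05dd.txt"  # four Hebrew letters + ".txt"
    for encoding, text in show_in_encodings(hebrew_name).items():
        # ascii() keeps the output printable on any console.
        print(encoding, ascii(text))
```

ASCII and ISO-8859-1 (latin-1) have no Hebrew letters, so the name degrades to ????.txt there, while cp1255 (the Windows Hebrew code page) and UTF-8 keep it intact.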
