I'm writing a file manager and need to scan directories and deal with renaming files that may have multibyte characters. I'm working on it locally on Windows/Apache PHP 5.3.8, with the following file names in a directory:
- filename.jpg
- имяфайла.jpg
- file件name.jpg
- פילענאַמע.jpg
- 文件名.jpg
Testing on a live UNIX server worked fine. Testing locally on Windows using glob('./path/*') returns only the first one, filename.jpg.
Using scandir(), at least the correct number of files is returned, but I get names like ?????????.jpg (note: those are regular question marks, not the � character).
I'll end up needing to write a "search" feature to search recursively through the entire tree for filenames matching a pattern or with a certain file extension, and I assumed glob() would be the right tool for that, rather than scanning all the files and doing the pattern matching and array building in the application code. I'm open to alternative suggestions if need be.
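As one possible alternative to glob() for the recursive search described above, here is a minimal sketch built on PHP's SPL directory iterators. The function name findByExtension() is hypothetical and not part of the original post. Note the assumption: on a Windows build that returns question marks from scandir(), these iterators presumably go through the same filesystem layer, so this only addresses the search itself, not the encoding problem.

```php
<?php
// Hypothetical helper: recursively collect files under $root whose
// extension matches $extension (compared case-insensitively).
function findByExtension($root, $extension)
{
    $matches = array();

    // SKIP_DOTS leaves out the "." and ".." entries in every directory.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        // $file is an SplFileInfo object for each entry found.
        if ($file->isFile() && strcasecmp($file->getExtension(), $extension) === 0) {
            $matches[] = $file->getPathname();
        }
    }

    sort($matches); // stable, predictable order for display
    return $matches;
}
```

Calling findByExtension('./uploads', 'jpg') would return the matching paths from every subdirectory. Matching on a name pattern instead of an extension could be done by swapping the strcasecmp() test for fnmatch() or preg_match() on $file->getFilename().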
Assuming this was a common problem, I immediately searched Google and Stack Overflow and found nothing even related. Is this a Windows issue? PHP shortcoming? What's the solution: is there anything I can do?
Addendum: Not sure how related this is, but file_exists() is also returning FALSE for these files when I pass in the full absolute path (using Notepad++, the PHP file itself is UTF-8 encoded with no BOM). I'm certain the path is correct, as neighboring files without multibyte characters return TRUE.
EDIT: glob() can find a file named filename-äöü.jpg. Previously in my .htaccess file, I had AddDefaultCharset utf-8, which I didn't consider before. filename-äöü.jpg was printing as filename-���.jpg. The only effect removing that .htaccess line seemed to have was that the file name now prints normally.
I've deleted the .htaccess file completely, and this is my actual test script in its entirety (I changed a couple of file names from the original post):
print_r(scandir('./uploads/'));
print_r(glob('./uploads/*'));
Output locally on Windows:
Array
(
[0] => .
[1] => ..
[2] => ??? ?????.jpg
[3] => ???.jpg
[4] => ?????????.jpg
[5] => filename-äöü.jpg
[6] => filename.jpg
[7] => test?test.jpg
)
Array
(
[0] => ./uploads/filename-äöü.jpg
[1] => ./uploads/filename.jpg
)
Output on remote UNIX server:
Array
(
[0] => .
[1] => ..
[2] => filename-äöü.jpg
[3] => filename.jpg
[4] => test이test.jpg
[5] => имя файла.jpg
[6] => פילענאַמע.jpg
[7] => 文件名.jpg
)
Array
(
[0] => ./uploads/filename-äöü.jpg
[1] => ./uploads/filename.jpg
[2] => ./uploads/test이test.jpg
[3] => ./uploads/имя файла.jpg
[4] => ./uploads/פילענאַמע.jpg
[5] => ./uploads/文件名.jpg
)
Since this is a different server, the configuration could differ regardless of platform, so I'm not sure what to think, and I can't fully pin it on Windows yet (it could be my PHP installation, ini settings, or Apache config). Any ideas?
5 Answers
#1
7
It looks like the glob() function depends on how your copy of PHP was built and whether it was compiled against the Unicode-aware Win32 API (I don't believe the standard build is).
Cf. http://www.rooftopsolutions.nl/blog/filesystem-encoding-and-php
Excerpt from comments on the article:
Philippe Verdy 2010-09-26 8:53 am
The output from your PHP installation on Windows is easy to explain: you installed the wrong version of PHP, and used a version not compiled to use the Unicode version of the Win32 API. For this reason, the filesystem calls used by PHP will use the legacy "ANSI" API and so the C/C++ libraries linked with this version of PHP will first try to convert your UTF-8-encoded PHP string into the local "ANSI" codepage selected in the running environment (see the CHCP command before starting PHP from a command line window).
Your version of Windows is MOST PROBABLY NOT responsible for this weird thing. Actually, it is YOUR version of PHP that is not compiled correctly and that uses the legacy ANSI version of the Win32 API (for compatibility with the legacy 16-bit versions of Windows 95/98 whose filesystem support in the kernel actually had no direct support for Unicode, but used an internal conversion layer to convert Unicode to the local ANSI codepage before using the actual ANSI version of the API).
Recompile PHP using the compiler option to use the UNICODE version of the Win32 API (which should be the default today, and anyway always the default for PHP installed on a server that will NEVER be Windows 95 or Windows 98...)
Then Windows will be able to store UTF-16 encoded filenames (including on FAT32 volumes, even if, on these volumes, it will also generate an aliased short name in 8.3 format using the filesystem's default codepage, something that can be avoided in NTFS volumes).
Everything you describe is a problem of PHP (incorrect porting to Windows, or incorrect system version identification at runtime): reread the README files that come with the PHP sources and explain the compilation flags. I really think that the makefile on Windows should be able to configure and autodetect whether it really needs to use ONLY the ANSI version of the API. If you are compiling it for a server, make sure that the Configure script will effectively detect the full support of the UNICODE version of the Win32 API and will use it when compiling PHP and when selecting the runtime libraries to link.
I use PHP on Windows, correctly compiled, and I absolutely DON'T know the problems you cite in your article.
Let's forget now forever these non-UNICODE versions of the Win32 API (which inconsistently use the local ANSI codepage for the Windows graphical UI, and the OEM codepage for the filesystem APIs, the DOS/BIOS-compatible APIs, the Console APIs): these non-Unicode versions of the APIs are even MUCH slower and more costly than the Unicode versions of the APIs, because they are actually translating the codepage to Unicode before using the core Unicode APIs (the situation on Windows NT-based kernels is exactly the reverse of the situation on versions of Windows based on a virtual DOS extender, such as Windows 95/98/ME).
When you don't use the native version of the API, your API call will pass through a thunking layer that will transcode the strings between Unicode and one of the legacy ANSI or CHCP-selected OEM codepages, or the OEM codepage hinted on the filesystem: this requires additional temporary memory allocation within the non-native version of the Win32 API. This takes additional time to convert things before doing the actual work by calling the native API.
In summary: the PHP binary you install on Windows MUST be different depending on whether you compiled it for Windows 95/98/SE (or the old Win16s emulation layer for Windows 3.x, which had only very minimal support of UTF-8, just enough to support the subsets of Unicode used by the ANSI and OEM codepages selected when starting Windows from a DOS extender) or whether it was compiled for any other version of Windows based on the NT kernel.
The best proof that this is a problem of PHP and not Windows is that your weird results will NOT occur in other languages like C#, Javascript, VB, Perl, Ruby... PHP has a very bad history in tracking versions (and too many historical source code quirks and wrong assumptions that should be disabled today, and an inconsistent library that has inherited all those quirks initially made in old versions of PHP for old versions of Windows that are no longer even officially supported, by Microsoft or even by PHP itself!).
In other words: RTM! Or download and install a binary version of PHP for Windows precompiled with the correct settings: I really think that PHP should distribute Windows binaries already compiled by default for the Unicode version of the Win32 API, and using the Unicode version of the C/C++ libraries: internally the PHP code will convert its UTF-8 strings to UTF-16 before calling the Win32 API, and back from UTF-16 to UTF-8 when retrieving Win32 results, instead of converting PHP's internal UTF-8 strings to and from the local OEM codepage (for the filesystem calls) or the local ANSI codepage (for all other Win32 APIs, including the registry or process).
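To make the UTF-8 to UTF-16 round trip mentioned in the excerpt concrete, here is a small illustrative sketch using iconv(). The helper names are hypothetical. It only demonstrates the string conversion itself; it is not a workaround, because PHP's filesystem functions take byte strings and there is no way shown here to hand a UTF-16 string to the Win32 API from userland code.

```php
<?php
// Hypothetical helpers illustrating the conversion the excerpt describes.
function utf8ToUtf16le($name)
{
    return iconv('UTF-8', 'UTF-16LE', $name);
}

function utf16leToUtf8($name)
{
    return iconv('UTF-16LE', 'UTF-8', $name);
}

// Assumes this source file is saved as UTF-8, as in the question.
$utf8  = '文件名.jpg';
$utf16 = utf8ToUtf16le($utf8);

// Three CJK characters take 3 bytes each in UTF-8 but 2 bytes each in UTF-16.
printf("%d bytes as UTF-8, %d bytes as UTF-16LE\n", strlen($utf8), strlen($utf16));
```

No characters are lost in either direction, which is the point of the excerpt's argument: a conversion to a legacy codepage cannot make the same guarantee.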
#2
-1
I haven't touched PHP for 3 or 4 years now, but maybe this will help:
pathinfo() is locale aware, so for it to parse a path containing multibyte characters correctly, the matching locale must be set using the setlocale() function
And some direct links:
pathinfo - read the second note
setlocale
(I think your problem comes from scanning the directories, and not from the display code itself or from the headers, since Chrome or Firefox, if I remember correctly, can handle Unicode characters.)
#3
-1
PHP on Windows does not use the Unicode API yet, so you have to use the runtime encoding (whatever it is) to be able to deal with non-ASCII characters.
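A minimal sketch of what "use the runtime encoding" could look like in practice, assuming the Windows ANSI codepage is Windows-1252 (that codepage is an assumption, as is the helper name toAnsiPath(); the real codepage depends on the machine's regional settings). The function returns null when the name cannot be represented, which is exactly the case for the Cyrillic, Hebrew, and Chinese names in the question on a Western European system.

```php
<?php
// Hypothetical helper: convert a UTF-8 path to a legacy codepage, or
// return null if any character has no equivalent in that codepage.
function toAnsiPath($utf8Path, $codepage = 'CP1252')
{
    // iconv() reports unrepresentable characters with a notice and false;
    // the notice is suppressed because failure is handled below.
    $converted = @iconv('UTF-8', $codepage, $utf8Path);
    if ($converted === false) {
        return null;
    }

    // Round-trip check guards against iconv builds that silently
    // substitute characters instead of failing.
    $roundTrip = @iconv($codepage, 'UTF-8', $converted);
    return $roundTrip === $utf8Path ? $converted : null;
}
```

With this, something like file_exists(toAnsiPath($path)) might succeed for filename-äöü.jpg, which would be consistent with glob() finding that file but none of the others. Names outside the codepage stay unreachable with this approach.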
#4
-1
Starting with PHP 7.1, long and UTF-8 paths on Windows are supported directly in the core.
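A file manager that has to run on several setups could check for this at runtime. The sketch below is an illustration only: the function name is hypothetical, and treating every non-Windows platform as safe is an assumption based on the UNIX output shown in the question.

```php
<?php
// Hypothetical check: can UTF-8 path strings be passed straight to
// glob(), scandir(), file_exists() and friends?
function canPassUtf8PathsDirectly($osName, $versionId)
{
    // PHP_OS is "WINNT" or "WIN32" on Windows; comparing only the first
    // three characters avoids a false match on "Darwin".
    $isWindows = strtoupper(substr($osName, 0, 3)) === 'WIN';

    // 70100 is the PHP_VERSION_ID of PHP 7.1.0.
    return !$isWindows || $versionId >= 70100;
}

var_dump(canPassUtf8PathsDirectly(PHP_OS, PHP_VERSION_ID));
```

On the PHP 5.3.8 Windows setup from the question this would report false, signalling that a codepage conversion or an upgrade is needed before multibyte names can be handled.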
#5
-2
Try setting mb_internal_encoding() to "UTF-8" before using glob():
mb_internal_encoding("UTF-8");
print_r(glob('./uploads/*'));