PHP preg_replace not working as expected with a file name string

I'm trying to create a function that removes all non-English characters (except spaces, dots and hyphens) from a string. For this I tried using preg_replace, but the function produces strange results.

I have a file called "example-נידדל.jpg"

Here is what I'm getting when trying to sanitize the file name:

echo preg_replace('/[^A-Za-z0-9\.]/','','example-נידדל.jpg');

The above produces: example.jpg as expected.

But when I try to pull the file name from a $_FILES array after uploading it to the server I get:

echo preg_replace('/[^A-Za-z0-9\.]/','',$_FILES['file_upload']["name"]);

The above produces example-15041497149114911500.jpg

The numbers I'm getting are in fact the HTML numeric codes of the characters that were supposed to be removed; see the following character reference: http://realdev1.realise.com/rossa/phoneme/listCharactors.asp?start=1488&stop=1785&rows=297&page=1

I can't figure out why preg_replace doesn't work with file names.

Can anyone help?

Thanks,

Roy

2 solutions

#1

What about using mb_convert_encoding to convert the HTML entities back into UTF-8 before the preg_replace?

echo preg_replace('/[^A-Za-z0-9\.]/', '', mb_convert_encoding($_FILES['file_upload']["name"], 'UTF-8', 'HTML-ENTITIES'));
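
As a possible alternative, here is a minimal sketch that decodes the numeric character references with html_entity_decode instead, since the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated as of PHP 8.2. The helper name sanitize_upload_name and the hard-coded sample string are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (name chosen for this sketch, not from the original answer).
function sanitize_upload_name(string $name): string
{
    // Turn numeric character references such as &#1504; back into UTF-8 characters.
    $decoded = html_entity_decode($name, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Same pattern as in the question: keep only ASCII letters, digits and dots.
    return preg_replace('/[^A-Za-z0-9.]/', '', $decoded);
}

// Assumed sample input: the file name as it appears to arrive in $_FILES,
// with the Hebrew letters replaced by numeric character references.
echo sanitize_upload_name('example-&#1504;&#1497;&#1491;&#1491;&#1500;.jpg'), "\n";
// prints "example.jpg"
```

Without the decoding step, the pattern strips only the &, # and ; characters and leaves the digits behind, which would explain the run of numbers in the question. To also keep spaces and hyphens, as the question describes, the character class could be extended to [^A-Za-z0-9. -].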

#2

I would use a combination of regular expressions and iconv to transliterate it.

Update: Before transliterating/filtering, the filename may need to be URL-decoded:

$path = urldecode($path); // convert triplets to bytes.
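
To illustrate what that step does, here is a small sketch. It assumes the name arrives percent-encoded, which the question does not actually show; the sample string is made up for the example.

```php
<?php
// Assumed input: a percent-encoded file name (not taken from the question).
$encoded = 'example-%D7%A0%D7%99.jpg';

// urldecode() turns each %XX triplet back into the raw byte it stands for,
// so the two Hebrew letters become valid UTF-8 again.
$decoded = urldecode($encoded);

// The byte-oriented pattern from the question can now strip them.
$clean = preg_replace('/[^A-Za-z0-9.]/', '', $decoded);

echo $clean, "\n";
// prints "example.jpg"
```

Note that urldecode also turns a plus sign into a space; rawurldecode leaves plus signs alone, which may be preferable for file names.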

Here is a code example that does something very similar to what you are asking for:

function pathauto_cleanstring($string)
{
    $url = $string;
    $url = preg_replace('~[^\\pL0-9_]+~u', '-', $url); // substitutes anything but letters, numbers and '_' with separator
    $url = trim($url, "-");
    $url = iconv("utf-8", "us-ascii//TRANSLIT", $url); // TRANSLIT does the whole job
    $url = strtolower($url);
    $url = preg_replace('~[^-a-z0-9_]+~', '', $url); // keep only letters, numbers, '_' and separator
    return $url;
}

It expects your input to be UTF-8 encoded.
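
If the encoding of the incoming string is not guaranteed, one way to meet that expectation is to check it first. This is only a sketch: the helper name ensure_utf8 and the ISO-8859-1 fallback are assumptions, and the right fallback depends on where your strings come from. It needs the mbstring extension.

```php
<?php
// Hypothetical helper: returns the string unchanged if it is already valid UTF-8,
// otherwise converts it from an assumed fallback encoding.
function ensure_utf8(string $s, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }

    // The fallback is a guess; pass a different encoding if you know the source.
    return mb_convert_encoding($s, 'UTF-8', $fallback);
}

// "\xE9" is "é" in ISO-8859-1 and is not valid UTF-8 on its own.
var_dump(ensure_utf8("caf\xE9") === "caf\u{00E9}");
// prints bool(true)
```

The result could then be passed to pathauto_cleanstring, whose /u pattern and "utf-8" iconv source both rely on valid UTF-8.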
