I'm trying to create a function that removes all non-English characters (except spaces, dots and hyphens) from a string. For this I tried using preg_replace, but the function produces strange results.
I have a file called "example-נידדל.jpg".
Here is what I'm getting when trying to sanitize the file name:
echo preg_replace('/[^A-Za-z0-9\.]/','','example-נידדל.jpg');
The above produces example.jpg, as expected.
But when I pull the file name from the $_FILES array after uploading the file to the server, I get:
echo preg_replace('/[^A-Za-z0-9\.]/','',$_FILES['file_upload']["name"]);
The above produces example-15041497149114911500.jpg
The numbers I'm getting are in fact the HTML numeric character codes of the characters that were supposed to be removed; see the following character reference: http://realdev1.realise.com/rossa/phoneme/listCharactors.asp?start=1488&stop=1785&rows=297&page=1
I can't figure out why preg_replace doesn't work with uploaded file names.
Can anyone help?
Thanks,
Roy
2 Solutions
#1
2
What about using mb_convert_encoding to convert the HTML entities back into UTF-8 before the preg_replace?
echo preg_replace('/[^A-Za-z0-9\.]/', '', mb_convert_encoding($_FILES['file_upload']["name"], 'UTF-8', 'HTML-ENTITIES'));
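To make the cause visible: the digits in the output are what is left of numeric character references such as &#1504; once the "&", "#" and ";" have been stripped. Below is a minimal sketch of the same decode-then-strip idea using html_entity_decode, which may be preferable on PHP 8.2 and later, where mbstring's 'HTML-ENTITIES' pseudo-encoding is deprecated. The helper name sanitize_upload_name and the sample value of $raw are assumptions for illustration; the sample is inferred from the output shown in the question, not taken from a real upload.

```php
<?php
// Hypothetical helper: decode character references first, then strip
// everything except ASCII letters, digits and dots.
function sanitize_upload_name(string $name): string
{
    // Turn "&#1504;" and friends back into real UTF-8 characters.
    $decoded = html_entity_decode($name, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Without the /u flag the class works on bytes; every byte of a
    // multi-byte UTF-8 character falls outside it and is removed.
    return preg_replace('/[^A-Za-z0-9.]/', '', $decoded) ?? '';
}

// Assumed raw value of $_FILES['file_upload']['name'].
$raw = 'example-&#1504;&#1497;&#1491;&#1491;&#1500;.jpg';

// Stripping the raw value keeps the digits of each entity.
echo preg_replace('/[^A-Za-z0-9.]/', '', $raw), "\n";

// Decoding first removes the Hebrew letters as intended.
echo sanitize_upload_name($raw), "\n";
```

If spaces and hyphens should survive as described in the question, they would need to be added to the character class, for example '/[^A-Za-z0-9. -]/'.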
#2
1
I would use a combination of regular expressions and iconv to transliterate it.
Update: Before transliterating/filtering, the filename may need to be URL-decoded:
$path = urldecode($path); // convert triplets to bytes.
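As a small illustration of that step, assuming the name arrived percent-encoded (the sample value below is made up and is simply the UTF-8 bytes of the file name from the question written as %XX triplets):

```php
<?php
// Assumed percent-encoded input, for illustration only.
$path = 'example-%D7%A0%D7%99%D7%93%D7%93%D7%9C.jpg';

// urldecode() turns each %XX triplet into one byte (and "+" into a space;
// rawurldecode() leaves "+" alone).
$decoded = urldecode($path);

// An empty pattern with the /u flag only matches if the subject is valid UTF-8.
$isUtf8 = preg_match('//u', $decoded) === 1;

echo $decoded, "\n";
echo $isUtf8 ? "valid UTF-8\n" : "not valid UTF-8\n";
```

The validity check matters because the cleaning function that follows assumes UTF-8 input.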
Here is a code example from here that does something very similar to your question:
function pathauto_cleanstring($string)
{
    $url = $string;
    $url = preg_replace('~[^\\pL0-9_]+~u', '-', $url); // substitutes anything but letters, numbers and '_' with separator
    $url = trim($url, "-");
    $url = iconv("utf-8", "us-ascii//TRANSLIT", $url); // TRANSLIT does the whole job
    $url = strtolower($url);
    $url = preg_replace('~[^-a-z0-9_]+~', '', $url); // keep only letters, numbers, '_' and separator
    return $url;
}
It expects your input to be UTF-8 encoded.
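Note that the function above also turns the dot before the extension into a separator. For file names, one possible adaptation is to clean only the base name and re-attach the extension. The sketch below is not part of the original code: the wrapper name clean_filename is hypothetical, and it also guards against iconv() returning false, since what //TRANSLIT produces for characters without an ASCII equivalent depends on the platform and locale.

```php
<?php
// Hypothetical wrapper: same cleaning steps, applied to the base name only.
function clean_filename(string $name): string
{
    // Split at the last dot by hand so multi-byte names are not affected by locale.
    $dot  = strrpos($name, '.');
    $base = $dot === false ? $name : substr($name, 0, $dot);
    $ext  = $dot === false ? '' : substr($name, $dot + 1);

    // Anything but letters, digits and '_' becomes the separator (UTF-8 aware).
    $base = preg_replace('~[^\pL0-9_]+~u', '-', $base) ?? '';

    // Transliterate where possible; keep the previous value if iconv() fails.
    $ascii = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $base);
    if ($ascii !== false) {
        $base = $ascii;
    }

    // Lowercase, drop whatever is still not ASCII-safe, trim stray separators.
    $base = strtolower($base);
    $base = trim(preg_replace('~[^-a-z0-9_]+~', '', $base) ?? '', '-');
    $ext  = preg_replace('~[^a-z0-9]+~', '', strtolower($ext)) ?? '';

    return $ext === '' ? $base : $base . '.' . $ext;
}

echo clean_filename('example-נידדל.jpg'), "\n";
echo clean_filename('My File_01.JPG'), "\n";
```

With these inputs the sketch should print example.jpg and my-file_01.jpg. A name made up entirely of non-Latin letters would end up with an empty base name, so real code would want a fallback such as a generated name.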