Struggling with special characters (html_entity_decode, iconv, etc.)

Date: 2022-11-16 13:03:47

I've been struggling to get a bunch of characters converted down to plain UTF-8 so that I can store them in my database.

PHP's iconv fails on many characters, so I've been forced to build my own 'solution', which really isn't a solution if it doesn't work. iconv also fails almost completely on Windows, so developing with it is mostly fruitless: I have to 'dev' on the test server. And since iconv misses a ton of characters, it isn't very helpful at all.

Here's what my functions do:

function replace_accents($string) { 
  return str_replace( array('à','á','â','ã','ä', 'ç', 'è','é','ê','ë', 'ì','í','î','ï', 'ñ', 'ò','ó','ô','õ','ö', 'ù','ú','û','ü', 'ý','ÿ', 'À','Á','Â','Ã','Ä', 'Ç', 'È','É','Ê','Ë', 'Ì','Í','Î','Ï', 'Ñ', 'Ò','Ó','Ô','Õ','Ö', 'Ù','Ú','Û','Ü', 'Ý'), array('a','a','a','a','a', 'c', 'e','e','e','e', 'i','i','i','i', 'n', 'o','o','o','o','o', 'u','u','u','u', 'y','y', 'A','A','A','A','A', 'C', 'E','E','E','E', 'I','I','I','I', 'N', 'O','O','O','O','O', 'U','U','U','U', 'Y'), $string); 
} 


function replaceQuote($string){
    // Map curly and low quotation marks to a plain apostrophe.
    $replaceQuote = array('‘', '’', '“', '”', '‚', '„');
    return str_replace($replaceQuote, '\'', $string);
}

function replaceArray($string){
$replaceArray=array('—', '™','™','™','©', '®', '®','©',
                    '¡',
                    '¡',
                    '¢',
                    '¢',
                    '£',
                    '£',
                    '¤',
                    '¥',
                    '¥',
                '¦',
            '§',
                '§',
            '«',
            '«',
            '¬',
            '¬',
            '­',
            '¯',
            '¯',
        '²',
            '³',
            'µ',
            'µ',
            '¶',
            '¶',
            '·',
            '·',
            '¸',
            '¸',
            '¹',
        'º',
        'º','»',  '‹', '»','¼', '½','¾','♥', '☆', '☠', '░','▒','▓','█', '★',
'♪','♫','◄','▀','▄','►', '¤', '^', '☣', '…', '†', '‡', '.:','♣','Ξ','ξ','↠','⇒','→','↞','⇐','←',
'⇔','↔','™','♠','&loz','√','∩','&Cap','∴');
  return str_replace($replaceArray, '', $string);
  }

function special_replace($string){
   $replace_from=array('ƒ', 'Œ','œ','•', '–', '—','˜','š','Š','Ÿ','ÿ','ε',
   '€','α','Α','τ','Τ','θ','Θ');

   $replace_to=array('ƒ', 'Œ','œ','•','-','-','~','š','Š','Ÿ','ÿ','ε','€','α','Α','τ','Τ','θ','Θ');
 return str_replace($replace_from, $replace_to, $string);


}

function dbSlug($slugIt){
    $slugIt = html_entity_decode($slugIt);

    $slugIt = replaceArray($slugIt);
    $slugIt = replaceQuote($slugIt);
    $slugIt = special_replace($slugIt);

    //$slugIt=iconv('ISO-8859-1', 'UTF-8//TRANSLIT//IGNORE', $slugIt);
    $slugIt = replace_accents($slugIt);
    $slugIt = trim($slugIt);
    return $slugIt;
}

It may seem inefficient that the same character sometimes appears in more than one replace function, but I use these functions in multiple places in different ways, which is why the overlap exists.

Now, the problem is that every time I look at the data, I find ANOTHER special character that isn't caught by my labyrinth of finding and replacing/removing characters.

The currently offending character is what you'd think would be a rather harmless ' ', which ends up in the database as 'Â'. Not all spaces, mind you; it appears to affect only some of them (I haven't figured out why yet).
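One plausible source of the stray 'Â' (an assumption, not a confirmed diagnosis of this data): some of those spaces are non-breaking spaces (U+00A0), for example from a decoded &nbsp; entity. In UTF-8 that character is the two bytes C2 A0. If those bytes are treated as ISO-8859-1 and converted to UTF-8 again, as the commented-out iconv call in dbSlug would do, the C2 byte turns into a visible 'Â'. A minimal sketch, assuming the mbstring extension is available and PHP 7 or later for the \u{...} escape:

```php
<?php
// A non-breaking space, written with PHP 7's Unicode escape syntax.
$nbsp = "\u{00A0}";

// In UTF-8 it occupies two bytes: C2 A0.
$utf8Hex = bin2hex($nbsp);

// Treating those UTF-8 bytes as ISO-8859-1 and encoding them again
// yields two characters: U+00C2 ('Â') followed by U+00A0.
$doubleEncoded = mb_convert_encoding($nbsp, 'UTF-8', 'ISO-8859-1');
$doubleHex = bin2hex($doubleEncoded);

// If plain spaces are acceptable, one option is to normalise the
// non-breaking space before storing the text.
$normalised = str_replace($nbsp, ' ', "caf\u{00E9}" . $nbsp . "au lait");
```

Only the spaces that were non-breaking to begin with would be affected, which would match the observation that some spaces are fine and others are not.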

I've been at this for more than a week, and every time I go back and look, I've got more to add to the 'fix'.

I'm not asking how to remove 'Â'. I'm hoping for a way to maintain the integrity of the content/data and keep it searchable, without special characters that sometimes get really messed up when the data is moved around.

I would do

preg_replace("/[^a-zA-Z0-9,-\'-!&.etc]/", "", $data);

but I'm concerned that I would start mangling words wherever a special character that got missed is stripped out. I already had this experience when 'México' came out as 'Mxico', so that just doesn't work.
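The 'Mxico' result follows from the ASCII-only character class: without the u modifier, each byte of a UTF-8 'é' falls outside a-z and is stripped. One possible alternative, shown only as a sketch, is a Unicode-aware whitelist built on \p{L} (letters) and \p{N} (numbers). The function names and the exact set of allowed punctuation below are illustrative assumptions, not part of the original code.

```php
<?php
// ASCII-only whitelist: every byte outside a-z, A-Z, 0-9 is removed,
// including both bytes of a UTF-8 encoded 'é'.
function stripNonAscii(string $data): string
{
    $clean = preg_replace('/[^a-zA-Z0-9]/', '', $data);
    return $clean ?? '';
}

// Unicode-aware whitelist: the u modifier makes PCRE read the subject as
// UTF-8, so \p{L} and \p{N} match letters and digits in any script.
function keepLettersAndDigits(string $data): string
{
    $clean = preg_replace('/[^\p{L}\p{N}\s,\'!&.-]/u', '', $data);
    // preg_replace returns null when the input is not valid UTF-8.
    return $clean ?? '';
}
```

With this approach accented letters survive while symbols such as '™' are still removed. Input that is not valid UTF-8 comes back as an empty string here, which is a design choice of the sketch.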

The character encoding is supposed to be UTF-8, though I've tried changing the header to ISO-8859-1 before encoding, or not setting any encoding at all, and I always get the same result.

I'm sure what I've got is probably the worst possible way of doing this, but I haven't been able to find an effective solution. Any suggestions? My concern is that this is almost never-ending: I keep finding new characters that slip through my labyrinth of translation.

2 answers

#1


2  

  1. Save your PHP files as UTF-8.
  2. Upon connection, run SET NAMES 'UTF8';
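As a rough sketch of the second step with PDO: the charset parameter in a MySQL DSN selects the connection character set, which is the job SET NAMES does in the answer. The helper names, host, and database name are hypothetical, and 'utf8mb4' is an assumption on my part (the answer itself says 'UTF8').

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that declares the
// connection character set.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

// Hypothetical helper: opens the connection with exceptions enabled.
// Not called here, because it needs a reachable MySQL server.
function connectUtf8(string $host, string $dbName, string $user, string $password): PDO
{
    return new PDO(buildMysqlDsn($host, $dbName), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}
```

With mysqli, the equivalent call would be $mysqli->set_charset('utf8mb4') right after connecting.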

If you still need to replace characters, do the following:

$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));

EDIT:

$string = html_entity_decode(preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8')), ENT_COMPAT, 'UTF-8');
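To illustrate what the expression does: htmlentities turns each accented letter into a named entity such as &eacute;, the pattern keeps only the base letter(s) of those entities, and html_entity_decode restores everything the pattern did not touch. A minimal sketch wrapping the EDIT version in a function (the name stripAccents is hypothetical), assuming the source file itself is saved as UTF-8:

```php
<?php
// Hypothetical wrapper around the one-liner from the answer.
function stripAccents(string $string): string
{
    // 'é' becomes '&eacute;', 'Ñ' becomes '&Ntilde;', '©' becomes '&copy;'.
    $encoded = htmlentities($string, ENT_COMPAT, 'UTF-8');

    // Keep only the base letter(s) of accent-style entities:
    // '&eacute;' becomes 'e', '&oelig;' becomes 'oe'.
    $stripped = preg_replace(
        '~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i',
        '$1',
        $encoded
    );

    // Turn the remaining entities (for example '&copy;') back into characters.
    return html_entity_decode($stripped, ENT_COMPAT, 'UTF-8');
}
```

Unlike a hand-maintained replacement array, this covers every accented letter that has a named HTML entity, so a word such as 'México' comes out as 'Mexico' rather than 'Mxico'.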

#2


1  

You can use html_entity_decode($string, ENT_QUOTES, 'UTF-8').

I had problems with Spanish special characters, and this solved them.
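A small sketch of that call on a made-up Spanish sample string (the sample text is an assumption for illustration). ENT_QUOTES makes the function decode single-quote entities as well, and the third argument tells it to emit UTF-8:

```php
<?php
// Sample input with named and numeric entities (hypothetical data).
$encoded = 'Espa&ntilde;a dice &#039;ol&eacute;&#039;';

// Decode named entities and both quote styles into UTF-8 characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
```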
