PHP: convert foreign characters to a-z A-Z 0-9 [duplicate]

Time: 2022-01-30 03:49:32

Possible Duplicate:
PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string

I will be dealing with an external source that may give me data back like:

Tōkyō à á â ã

Is there a way to convert the fancy characters to standard a-z A-Z, like so?

Tokyo a a a a

If there are other characters that don't correspond to any letter of the alphabet, they can be ignored.

Is a large regex with all the from and to values the only way, or is there a simpler way to go about it?

1 Answer

#1


3  

Something like this (taken from the Symphony CMS project) should get you started:

$transliterations = array(

    // Alphabetical

    '/À/' => 'A',       '/Á/' => 'A',       '/Â/' => 'A',       '/Ã/' => 'A',       '/Ä/' => 'Ae',
    '/Å/' => 'A',       '/Ā/' => 'A',       '/Ą/' => 'A',       '/Ă/' => 'A',       '/Æ/' => 'Ae',
    '/Ç/' => 'C',       '/Ć/' => 'C',       '/Č/' => 'C',       '/Ĉ/' => 'C',       '/Ċ/' => 'C',
    '/Ď/' => 'D',       '/Đ/' => 'D',       '/Ð/' => 'D',       '/È/' => 'E',       '/É/' => 'E',
    '/Ê/' => 'E',       '/Ë/' => 'E',       '/Ē/' => 'E',       '/Ę/' => 'E',       '/Ě/' => 'E',
    '/Ĕ/' => 'E',       '/Ė/' => 'E',       '/Ĝ/' => 'G',       '/Ğ/' => 'G',       '/Ġ/' => 'G',
    '/Ģ/' => 'G',       '/Ĥ/' => 'H',       '/Ħ/' => 'H',       '/Ì/' => 'I',       '/Í/' => 'I',
    '/Î/' => 'I',       '/Ï/' => 'I',       '/Ī/' => 'I',       '/Ĩ/' => 'I',       '/Ĭ/' => 'I',
    '/Į/' => 'I',       '/İ/' => 'I',       '/Ĳ/' => 'Ij',      '/Ĵ/' => 'J',       '/Ķ/' => 'K',
    '/Ł/' => 'L',       '/Ľ/' => 'L',       '/Ĺ/' => 'L',       '/Ļ/' => 'L',       '/Ŀ/' => 'L',
    '/Ñ/' => 'N',       '/Ń/' => 'N',       '/Ň/' => 'N',       '/Ņ/' => 'N',       '/Ŋ/' => 'N',
    '/Ò/' => 'O',       '/Ó/' => 'O',       '/Ô/' => 'O',       '/Õ/' => 'O',       '/Ö/' => 'Oe',
    '/Ø/' => 'O',       '/Ō/' => 'O',       '/Ő/' => 'O',       '/Ŏ/' => 'O',       '/Œ/' => 'Oe',
    '/Ŕ/' => 'R',       '/Ř/' => 'R',       '/Ŗ/' => 'R',       '/Ś/' => 'S',       '/Š/' => 'S',
    '/Ş/' => 'S',       '/Ŝ/' => 'S',       '/Ș/' => 'S',       '/Ť/' => 'T',       '/Ţ/' => 'T',
    '/Ŧ/' => 'T',       '/Ț/' => 'T',       '/Ù/' => 'U',       '/Ú/' => 'U',       '/Û/' => 'U',
    '/Ü/' => 'Ue',      '/Ū/' => 'U',       '/Ů/' => 'U',       '/Ű/' => 'U',       '/Ŭ/' => 'U',
    '/Ũ/' => 'U',       '/Ų/' => 'U',       '/Ŵ/' => 'W',       '/Ý/' => 'Y',       '/Ŷ/' => 'Y',
    '/Ÿ/' => 'Y',       '/Y/' => 'Y',       '/Ź/' => 'Z',       '/Ž/' => 'Z',       '/Ż/' => 'Z',
    '/Þ/' => 'T',
    '/à/' => 'a',       '/á/' => 'a',       '/â/' => 'a',       '/ã/' => 'a',       '/ä/' => 'ae',
    '/å/' => 'a',       '/ā/' => 'a',       '/ą/' => 'a',       '/ă/' => 'a',       '/æ/' => 'ae',
    '/ç/' => 'c',       '/ć/' => 'c',       '/č/' => 'c',       '/ĉ/' => 'c',       '/ċ/' => 'c',
    '/ď/' => 'd',       '/đ/' => 'd',       '/ð/' => 'd',       '/è/' => 'e',       '/é/' => 'e',
    '/ê/' => 'e',       '/ë/' => 'e',       '/ē/' => 'e',       '/ę/' => 'e',       '/ě/' => 'e',
    '/ĕ/' => 'e',       '/ė/' => 'e',       '/ĝ/' => 'g',       '/ğ/' => 'g',       '/ġ/' => 'g',
    '/ģ/' => 'g',       '/ĥ/' => 'h',       '/ħ/' => 'h',       '/ì/' => 'i',       '/í/' => 'i',
    '/î/' => 'i',       '/ï/' => 'i',       '/ī/' => 'i',       '/ĩ/' => 'i',       '/ĭ/' => 'i',
    '/į/' => 'i',       '/ı/' => 'i',       '/ĳ/' => 'ij',      '/ĵ/' => 'j',       '/ķ/' => 'k',
    '/ł/' => 'l',       '/ľ/' => 'l',       '/ĺ/' => 'l',       '/ļ/' => 'l',       '/ŀ/' => 'l',
    '/ñ/' => 'n',       '/ń/' => 'n',       '/ň/' => 'n',       '/ņ/' => 'n',       '/ŋ/' => 'n',
    '/ò/' => 'o',       '/ó/' => 'o',       '/ô/' => 'o',       '/õ/' => 'o',       '/ö/' => 'oe',
    '/ø/' => 'o',       '/ō/' => 'o',       '/ő/' => 'o',       '/ŏ/' => 'o',       '/œ/' => 'oe',
    '/ŕ/' => 'r',       '/ř/' => 'r',       '/ŗ/' => 'r',       '/ś/' => 's',       '/š/' => 's',
    '/ş/' => 's',       '/ŝ/' => 's',       '/ș/' => 's',       '/ť/' => 't',       '/ţ/' => 't',
    '/ŧ/' => 't',       '/ț/' => 't',       '/ù/' => 'u',       '/ú/' => 'u',       '/û/' => 'u',
    '/ü/' => 'ue',      '/ū/' => 'u',       '/ů/' => 'u',       '/ű/' => 'u',       '/ŭ/' => 'u',
    '/ũ/' => 'u',       '/ų/' => 'u',       '/ŵ/' => 'w',       '/ý/' => 'y',       '/ŷ/' => 'y',
    '/ÿ/' => 'y',       '/y/' => 'y',       '/ź/' => 'z',       '/ž/' => 'z',       '/ż/' => 'z',
    '/þ/' => 't',       '/ß/' => 'ss',      '/ſ/' => 'ss',      '/ƒ/' => 'f',       '/ĸ/' => 'k',
    '/ŉ/' => 'n',

    // Symbolic

    '/\(/' => null,     '/\)/' => null,     '/,/' => null,
    '/–/' => '-',       '/-/' => '-',       '/„/' => '"',
    '/“/' => '"',       '/”/' => '"',       '/—/' => '-',
    '/¿/' => null,      '/‽/' => null,      '/¡/' => null,

    // Ampersands

    '/©/' => 'c',
    '/^&(?!&)$/' => 'and',
    '/^&(?!&)/' => 'and-',
    '/&(?!&)$/' => '-and',
    '/&(?!&)/' => '-and-',

);
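
The answer gives the table but not how to apply it. One way it might be used is sketched below, assuming the keys are passed to preg_replace() as patterns and the values as replacements. The transliterate() helper name is hypothetical (it is not from Symphony CMS), the table is shortened to a few entries so the example runs on its own, and the final strip of leftover characters is an assumption based on the question's remark that unmatched characters can be ignored.

```php
<?php
// Abbreviated excerpt of the table above; the full array is used the same way.
$transliterations = array(
    '/ō/' => 'o',  '/à/' => 'a',  '/á/' => 'a',  '/â/' => 'a',  '/ã/' => 'a',
    '/Ü/' => 'Ue', '/ß/' => 'ss', '/¿/' => null,
);

// Hypothetical helper, not part of Symphony CMS.
function transliterate(string $text, array $table): string
{
    // preg_replace() accepts parallel arrays of patterns and replacements
    // and applies them in order. Null replacements are cast to '' first.
    $patterns = array_keys($table);
    $replacements = array_map('strval', array_values($table));
    $result = preg_replace($patterns, $replacements, $text) ?? $text;

    // Assumption: anything still outside a-z A-Z 0-9, space and hyphen is dropped.
    return preg_replace('/[^A-Za-z0-9 -]/', '', $result) ?? $result;
}

echo transliterate('Tōkyō à á â ã', $transliterations), "\n"; // Tokyo a a a a
```

Because the patterns are plain literals, strtr($text, $map) with the slashes removed from the keys would likely work as well and avoids the regex engine; the regex form is kept here only to match the table as posted.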

You can also use iconv, but it isn't flawless: Ü, for example, is returned as "U, whereas it should be returned as Ue.
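
For comparison, a minimal sketch of the iconv route, assuming the input is UTF-8 and the iconv extension is loaded. The to_ascii_iconv() helper name is hypothetical, and falling back to the unchanged input on failure is a choice made for this example. The exact output depends on the iconv implementation and the current locale, so no expected output is shown.

```php
<?php
// Hypothetical helper: attempt ASCII transliteration with iconv.
function to_ascii_iconv(string $text): string
{
    if (!function_exists('iconv')) {
        return $text; // iconv extension not available
    }

    // //TRANSLIT asks iconv to approximate characters it cannot represent.
    // The result depends on the iconv implementation and the current locale.
    $converted = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    // iconv() returns false on failure; fall back to the unchanged input.
    return $converted === false ? $text : $converted;
}

echo to_ascii_iconv('Tōkyō Ü'), "\n"; // output varies by platform
```

If the German-style mappings (Ü to Ue, ß to ss) matter, the explicit table is the more predictable option, since iconv gives no control over individual replacements.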

