Regex to remove non-letter characters but keep accented letters

Date: 2021-11-15 20:23:23

I have strings in Spanish and other languages that may contain generic special characters such as (, ), and *, which I need to remove. The problem is that they may also contain language-specific characters such as ñ, á, ó, and í, which need to remain. I am trying to do this with a regular expression:

var desired = stringToReplace.replace(/[^\w\s]/gi, '');

Unfortunately, this removes all special characters, including the language-specific ones. I am not sure how to avoid that. Could someone suggest a solution?
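
For context, the behaviour described above follows from how JavaScript defines \w: it covers essentially only [A-Za-z0-9_], so accented letters fall into the negated class and are removed. A minimal sketch that reproduces the problem (the sample sentence is made up for illustration):

```javascript
// \w is essentially [A-Za-z0-9_], so every accented letter counts as "not a word character".
const input = "¿Cómo estás, señor (Pérez)?";
const stripped = input.replace(/[^\w\s]/gi, "");

console.log(stripped);        // "Cmo ests seor Prez"
console.log(/\w/.test("ñ"));  // false
```

The punctuation is gone, but so are ó, á, ñ, and é, which is exactly the unwanted effect.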

6 answers

#1


12  

I would suggest using Steven Levithan's excellent XRegExp library and its Unicode plug-in.

Here's an example that strips non-Latin word characters from a string: http://jsfiddle.net/b3awZ/1/

var regex = XRegExp("[^\\s\\p{Latin}]+", "g");
var str = "¿Me puedes decir la contraseña de la Wi-Fi?"
var replaced = XRegExp.replace(str, regex, "");

See also this answer by Steven Levithan himself:

Regular expression Spanish and Arabic words

#2


8  

Instead of whitelisting characters you accept, you could try blacklisting illegal characters:

var desired = stringToReplace.replace(/[-'`~!@#$%^&*()_|+=?;:",.<>{}\[\]\\\/]/g, '');
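
A small sketch of the blacklist in use may help show its main trade-off. The helper name stripBlacklisted and the sample sentence are assumptions made for illustration; the character list is the one from the answer.

```javascript
// Blacklist approach: remove only the characters that are explicitly listed.
const blacklist = /[-'`~!@#$%^&*()_|+=?;:",.<>{}\[\]\\\/]/g;

// Hypothetical helper name, used only for this illustration.
function stripBlacklisted(text) {
  return text.replace(blacklist, "");
}

const cleaned = stripBlacklisted("¿Me puedes decir la contraseña de la Wi-Fi?");
console.log(cleaned); // "¿Me puedes decir la contraseña de la WiFi"
```

Accented letters survive because they are not listed, but so does ¿, which shows that a blacklist only removes what you remembered to include.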

#3


7  

Note: this works only for 16-bit code points (the Basic Multilingual Plane), so this answer is incomplete.

Short answer

The character class for all Arabic digits and Latin letters is: [0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06].

To get a regex you can use, prepend /^ and append +$/. This matches strings that consist only of Latin letters and digits, such as "mérito" or "Schönheit".

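To match the characters that are neither digits nor letters so that you can remove them, write a ^ as the first character after the opening bracket [, then prepend / and append +/g (the g flag makes replace remove every match, not just the first).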
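
The two constructions just described can be sketched as follows. This is an illustration under assumptions: latinClass is abbreviated to the first few ranges of the character class above to keep the example short, and \s is added to the negated class so that spaces survive, which the answer itself does not do.

```javascript
// Abbreviated character class: ASCII letters and digits plus the first non-ASCII Latin ranges.
// For real use, paste the full class from the answer instead.
const latinClass = "0-9A-Za-z\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u02af";

// Validation: the whole string must consist of characters from the class.
const onlyLatin = new RegExp("^[" + latinClass + "]+$");

// Removal: negate the class and use the g flag; \s is an addition so whitespace is kept.
const notLatin = new RegExp("[^\\s" + latinClass + "]+", "g");

const kept = "¿Dónde está (la) contraseña?".replace(notLatin, "");

console.log(onlyLatin.test("mérito"));     // true
console.log(onlyLatin.test("Schönheit"));  // true
console.log(kept);                         // "Dónde está la contraseña"
```

Building the pattern with new RegExp keeps the long class in one string that both the validating and the removing regex can share.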

How did I find that out? Continue reading.

Long answer: use metaprogramming!

Because JavaScript regexes did not support Unicode properties (before ES2018 added \p{...} escapes), I wrote a Python program that iterates over the 16-bit range of Unicode and filters by Unicode name. It is difficult to get this right manually, so why not let the computer do the dirty, menial work?

import re
import sys
import unicodedata

def unicodeNameMatch(pattern, codepoint):
  # unicodedata.name() raises ValueError for code points that have no name.
  try:
    return re.match(pattern, unicodedata.name(chr(codepoint)), re.I)
  except ValueError:
    return None

def regexChr(codepoint):
  # Printable ASCII is emitted literally, everything else as a \uXXXX escape.
  return chr(codepoint) if 32 <= codepoint < 127 else "\\u%04x" % codepoint

names = sys.argv[1:]  # skip the script name itself
prev = None  # last matching code point of the range that is currently open

js_regex = ""
for codepoint in range(pow(2, 16)):  # Basic Multilingual Plane only
  if any(unicodeNameMatch(name, codepoint) for name in names):
    if prev is None: js_regex += regexChr(codepoint)
    prev = codepoint
  else:
    if not prev is None: js_regex += "-" + regexChr(prev)
    prev = None

print("[" + js_regex + "]")

Invoke it like this: python char_class.py latin digit and you get the character class mentioned above. It's an ugly character class, but you know for sure that you caught all characters whose names start with LATIN or DIGIT (re.match anchors the pattern at the beginning of the name).

Browse the Unicode Character Database to view the names of all Unicode characters. The name is in uppercase after the first semicolon; for example, for A it is the line

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

Try python char_class.py "latin small" and you get a character class for all Latin small letters.

Edit: There is a small misfeature (aka bug): single-character ranges such as \u271d-\u271d occur in the regex. The range end should be written only when it differs from the range start, which is the last thing appended to js_regex. Perhaps this fix helps: replace

if not prev is None: js_regex += "-" + regexChr(prev)

by

if prev is not None and not js_regex.endswith(regexChr(prev)): js_regex += "-" + regexChr(prev)

#4


1  

var desired = stringToReplace.replace(/(?=[\u0000-\u007F])[^\w\s]/g, '');

might do the trick.
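
One way to read the idea behind this answer is "remove a character only if it is in the ASCII range and is neither a word character nor whitespace". The sketch below expresses that with a lookahead; the helper name stripAsciiPunctuation and the sample sentence are assumptions made for illustration.

```javascript
// The lookahead restricts the match to ASCII; [^\w\s] then consumes one
// character that is neither a word character nor whitespace.
const asciiPunctuation = /(?=[\u0000-\u007F])[^\w\s]/g;

// Hypothetical helper name, used only for this illustration.
function stripAsciiPunctuation(text) {
  return text.replace(asciiPunctuation, "");
}

const result = stripAsciiPunctuation("¡Hola, (señor) Núñez!");
console.log(result); // "¡Hola señor Núñez"
```

Everything outside ASCII is left alone, so accented letters are kept, but non-ASCII punctuation such as ¡ is kept as well.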

See also this JavaScript + Unicode regexes question.

#5


1  

If you insist on whitelisting, here is the crudest way of doing it:

Test if string contains only letters (a-z + é ü ö ê å ø etc..)

It works by explicitly listing 'all' Unicode letter characters.

#6


0  

Unfortunately, JavaScript (before ES2018) does not support Unicode character properties, which would be just the right regex feature for you. If changing the language is an option for you, PHP (for example) can do this:

// The u modifier makes PCRE treat the pattern and the subject as UTF-8.
$desired = preg_replace('/[^\pL0-9_\s]/u', '', $str);

Where \pL matches any Unicode character that represents a letter (lower case, upper case, modified or unmodified).

If you have to stick with JavaScript and cannot use the library suggested by Tim Down, the only options are probably blacklisting or whitelisting. Your bounty mentions that blacklisting is not an option in your case, so you will probably have to include the special characters of the relevant languages manually:

var desired = stringToReplace.replace(/[^\w\sñáóí]/gi, '');

Or use their corresponding Unicode sequences:

var desired = stringToReplace.replace(/[^\w\s\u00F1\u00E1\u00F3\u00ED]/gi, '');

Then add any other characters you need to keep. Note that the case-insensitive modifier also works with Unicode escape sequences.
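
In engines that implement ES2018 Unicode property escapes, the manual whitelist may not be needed at all. The following is a sketch under that assumption; the helper name keepLettersAndDigits and the sample sentence are made up for illustration.

```javascript
// \p{L} matches any letter and \p{N} any number; the u flag enables property escapes.
const notLetterDigitOrSpace = /[^\p{L}\p{N}\s]/gu;

// Hypothetical helper name, used only for this illustration.
function keepLettersAndDigits(text) {
  return text.replace(notLetterDigitOrSpace, "");
}

const output = keepLettersAndDigits("¿Me puedes decir la contraseña de la Wi-Fi? (¡Sí!)");
console.log(output); // "Me puedes decir la contraseña de la WiFi Sí"
```

Unlike \w, this class does not keep the underscore, and accents stored as separate combining marks would be stripped unless \p{M} is added to the class or the string is first normalized with normalize("NFC").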
