I have strings in Spanish and other languages that may contain generic special characters such as (), *, etc., which I need to remove. The problem is that they may also contain language-specific characters like ñ, á, ó, í, etc., and those need to remain. So I am trying to do it with a regexp the following way:

var desired = stringToReplace.replace(/[^\w\s]/gi, '');

Unfortunately, it removes all special characters, including the language-related ones. I'm not sure how to avoid that. Maybe someone could suggest a solution?


6 Answers



I would suggest using Steven Levithan's excellent XRegExp library and its Unicode plug-in.

Here's an example that strips every character that is neither whitespace nor a Latin-script character from a string:


var regex = XRegExp("[^\\s\\p{Latin}]+", "g");
var str = "¿Me puedes decir la contraseña de la Wi-Fi?";
var replaced = XRegExp.replace(str, regex, "");

See also this answer by Steven Levithan himself:

Regular expression Spanish and Arabic words




Instead of whitelisting characters you accept, you could try blacklisting illegal characters:


var desired = stringToReplace.replace(/[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/gi, '')
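As a quick check of the blacklist approach, here is a minimal sketch. The helper name `stripPunctuation` and the sample string are made up for illustration; the character class is the one shown above.

```javascript
// Hypothetical helper: removes only the characters listed in the blacklist.
function stripPunctuation(input) {
  // Same character class as above; anything not listed is left untouched.
  return input.replace(/[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/g, '');
}

var sample = "¡Mañana (sí), *adiós*!";
var cleaned = stripPunctuation(sample);
console.log(cleaned); // "¡Mañana sí adiós"
```

The accented letters survive, but so does the inverted exclamation mark, because it is not in the list. Any character you forget to blacklist stays in the output.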



Note: this works only for 16-bit code points, so this answer is incomplete.


Short answer

The character class for all arabic digits and latin letters is: [0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06].

To get a regex you can use, prepend /^ and append +$/. This will match strings consisting of only latin letters and digits like "mérito" or "Schönheit".

To match non-digits or non-letter characters to remove them, write a ^ as first character after the opening bracket [ and prepend / and append +/.
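The two recipes can be sketched as follows. The names `LATIN_CLASS`, `onlyLatin` and `nonLatin` are illustrative. The `\s` in the negated class is an addition of this sketch so that whitespace is kept; it is not part of the class given above.

```javascript
// Character class body from the short answer, split across lines for readability.
var LATIN_CLASS =
  "0-9A-Za-z\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u02af" +
  "\\u1d00-\\u1d25\\u1d62-\\u1d65\\u1d6b-\\u1d77\\u1d79-\\u1d9a" +
  "\\u1e00-\\u1eff\\u2090-\\u2094\\u2184-\\u2184\\u2488-\\u2490" +
  "\\u271d-\\u271d\\u2c60-\\u2c7c\\u2c7e-\\u2c7f\\ua722-\\ua76f" +
  "\\ua771-\\ua787\\ua78b-\\ua78c\\ua7fb-\\ua7ff\\ufb00-\\ufb06";

// Validation: the whole string must consist of Latin letters and digits.
var onlyLatin = new RegExp("^[" + LATIN_CLASS + "]+$");

// Removal: match runs of anything that is neither whitespace nor in the class.
// (Assumption: \s is added here so that spaces between words survive.)
var nonLatin = new RegExp("[^\\s" + LATIN_CLASS + "]+", "g");

console.log(onlyLatin.test("mérito"));     // true
console.log(onlyLatin.test("Schönheit"));  // true
console.log(onlyLatin.test("¿qué?"));      // false

var stripped = "¿Dónde está (el) baño?".replace(nonLatin, "");
console.log(stripped); // "Dónde está el baño"
```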

How did I find that out? Continue reading.


Long answer: use metaprogramming!

Because Javascript does not have Unicode regexes, I wrote a Python program to iterate over the whole of Unicode and filter by Unicode name. It is difficult to get this right manually. Why not let the computer do the dirty and menial work?

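# Python 3. A newer Unicode database may yield a few more ranges than the class shown above.
import unicodedata
import re
import sys

def unicodeNameMatch(pattern, codepoint):
  # re.match anchors at the start of the name; unnamed code points raise ValueError.
  try:
    return re.match(pattern, unicodedata.name(chr(codepoint)), re.I)
  except ValueError:
    return None

def regexChr(codepoint):
  # Printable ASCII is emitted literally, everything else as a \uXXXX escape.
  return chr(codepoint) if 32 <= codepoint < 127 else "\\u%04x" % codepoint

names = sys.argv[1:]  # name patterns from the command line, script name excluded
prev = None

js_regex = ""
for codepoint in range(pow(2, 16)):
  if any([unicodeNameMatch(name, codepoint) for name in names]):
    # Matching code point: open a new range if none is open.
    if prev is None: js_regex += regexChr(codepoint)
    prev = codepoint
  else:
    # Non-matching code point: close the open range, if any.
    if not prev is None: js_regex += "-" + regexChr(prev)
    prev = None

print("[" + js_regex + "]")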

Invoke it like this: python char_class.py latin digit and you get the character class mentioned above. It's an ugly character class, but you know for sure that you caught all characters whose names start with LATIN or DIGIT.

Browse the Unicode Character Database to view the names of all Unicode characters. The name is the uppercase field after the first semicolon; for example, for A it's the line

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

Try python char_class.py "latin small" and you get a character class for all Latin small letters.

Edit: There is a small misfeature (aka bug) in that \u271d-\u271d occurs in the regex. Perhaps this fix helps: Replace


if not prev is None: js_regex += "-" + regexChr(prev)

with



if not prev is None and prev != codepoint: js_regex += "-" + regexChr(prev)
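
A degenerate range such as \u271d-\u271d shows up whenever the closing end of a run is written even though the run has a single member. One way to avoid it is to remember where each run started and only emit the "-end" part when the end differs from the start. A minimal sketch in JavaScript, assuming the matching code points are already available as a sorted array (`escapeCodePoint` and `toCharClass` are hypothetical names, not part of the script above):

```javascript
// Hypothetical helper: ASCII alphanumerics literally, everything else as \uXXXX.
function escapeCodePoint(cp) {
  var ch = String.fromCharCode(cp);
  if (/[0-9A-Za-z]/.test(ch)) return ch;
  return "\\u" + ("0000" + cp.toString(16)).slice(-4);
}

// Hypothetical helper: compresses a sorted array of BMP code points into a
// character class, writing "a-b" only for runs with more than one member.
function toCharClass(codepoints) {
  var parts = [];
  var start = null;
  var prev = null;

  function flush() {
    if (start === null) return;
    parts.push(escapeCodePoint(start));
    if (prev !== start) parts.push("-" + escapeCodePoint(prev));
  }

  for (var i = 0; i < codepoints.length; i++) {
    var cp = codepoints[i];
    if (prev !== null && cp === prev + 1) {
      prev = cp;          // still inside the current run
    } else {
      flush();            // close the previous run, if any
      start = prev = cp;  // open a new run
    }
  }
  flush();                // close the last run
  return "[" + parts.join("") + "]";
}

var cls = toCharClass([0x30, 0x31, 0x32, 0x41, 0x271d, 0x2c60, 0x2c61]);
console.log(cls); // [0-2A\u271d\u2c60-\u2c61]
```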



var desired = stringToReplace.replace(/[\u0000-\u007F][\W]/gi, '');

might do the trick.


See also this Javascript + Unicode regexes question.



If you must insist on whitelisting, here is the rawest way of doing it:


Test if string contains only letters (a-z + é ü ö ê å ø etc..)

It works by keeping track of 'all' unicode letter chars.




Unfortunately, JavaScript has historically not supported Unicode character properties (which would be just the right regex feature for you); they only arrived with ES2018 and require the u flag. If changing the language is an option for you, PHP (for example) can do this:


preg_replace("/[^\pL0-9_\s]/u", "", $str); // the u modifier makes PCRE treat pattern and subject as UTF-8

Where \pL matches any Unicode character that represents a letter (lower case, upper case, modified or unmodified).
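
For what it's worth, engines that implement ES2018 or later accept a similar pattern in JavaScript itself, provided the regex carries the u flag. A minimal sketch, assuming such an engine (the sample string is made up):

```javascript
// Assumes an engine with ES2018 Unicode property escapes; \p{L} needs the u flag.
var stringToReplace = "¿Dónde está (el) niño_1?";

// Keep letters of any script, ASCII digits, underscore and whitespace.
var desired = stringToReplace.replace(/[^\p{L}0-9_\s]/gu, '');
console.log(desired); // "Dónde está el niño_1"
```

Without the u flag, \p is not treated as a property escape, so the pattern would not behave as intended.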


If you have to stick with JavaScript and cannot use the library suggested by Tim Down, the only options are probably either blacklisting or whitelisting. But your bounty mentions that blacklisting is not actually an option in your case. So you will probably simply have to include the special characters from your relevant language manually. So you could simply do this:


var desired = stringToReplace.replace(/[^\w\sñáóí]/gi, '');

Or use their corresponding Unicode sequences:


var desired = stringToReplace.replace(/[^\w\s\u00F1\u00C1\u00F3\u00ED]/gi, '');

Then simply add all the ones you want to take care of. Note that the case-insensitive modifier also works with Unicode sequences.
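
If the list of extra letters grows, it may be easier to build the pattern from a string. A minimal sketch, where `stripExcept` is a hypothetical helper name and the list of letters is only an example for Spanish:

```javascript
// Hypothetical helper: keeps \w, whitespace and any extra letters passed in.
function stripExcept(input, extraLetters) {
  // Escape the few characters that are special inside a character class.
  var escaped = extraLetters.replace(/[\\\]\^\-]/g, '\\$&');
  var pattern = new RegExp('[^\\w\\s' + escaped + ']', 'gi');
  return input.replace(pattern, '');
}

var result = stripExcept("¿Qué día (hoy)? ¡Mañana, señor!", "ñáéíóúü");
console.log(result); // "Qué día hoy Mañana señor"
```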




