I'm having a little trouble with Python regular expressions.
What is a good way to remove all characters in a string that are not letters or numbers?
Thanks!
7 solutions
#1
20
[\w] matches (alphanumeric or underscore).
[\W] matches (not (alphanumeric or underscore)), which is equivalent to (not alphanumeric and not underscore).
You need [\W_] to remove ALL non-alphanumerics.
When using re.sub(), it is much more efficient to reduce the number of substitutions (which are expensive) by matching with [\W_]+ instead of removing one character at a time.
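The two points above can be checked with a short sketch. It assumes Python 3 (the snippets in this answer are Python 2), and the sample string and variable names are made up for illustration. re.subn() is used because it also reports how many substitutions were made.

```python
import re

sample = "snake_case, with-punctuation! 42"

# \W alone leaves the underscore in place, because "_" counts as a word character.
only_W = re.sub(r"\W", "", sample)

# [\W_] removes the underscore too, one character per substitution.
one_by_one = re.subn(r"[\W_]", "", sample)

# The trailing + makes each substitution consume a whole run of unwanted characters.
in_runs = re.subn(r"[\W_]+", "", sample)

print(only_W)      # snake_casewithpunctuation42
print(one_by_one)  # ('snakecasewithpunctuation42', 6)
print(in_runs)     # ('snakecasewithpunctuation42', 4)
```

Both [\W_] variants give the same string; the version with + simply gets there in fewer substitutions.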
Now all you need is to define alphanumerics:
str object, only ASCII A-Za-z0-9:
re.sub(r'[\W_]+', '', s)
str object, only locale-defined alphanumerics:
re.sub(r'[\W_]+', '', s, flags=re.LOCALE)
unicode object, all alphanumerics:
re.sub(ur'[\W_]+', u'', s, flags=re.UNICODE)
Examples for str object:
>>> import re, locale
>>> sall = ''.join(chr(i) for i in xrange(256))
>>> len(sall)
256
>>> re.sub('[\W_]+', '', sall)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> locale.setlocale(locale.LC_ALL, '')
'English_Australia.1252'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\
x9a\x9c\x9e\x9f\xaa\xb2\xb3\xb5\xb9\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\
xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\
xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\
xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
# above output wrapped at column 80
Unicode example:
>>> re.sub(ur'[\W_]+', u'', u'a_b A_Z \x80\xFF \u0404', flags=re.UNICODE)
u'abAZ\xff\u0404'
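For readers on Python 3, here is a rough equivalent of the str/unicode distinction above. This is a sketch under the assumption that you are working with Python 3 str objects, where patterns are Unicode-aware by default and re.ASCII narrows \w to [a-zA-Z0-9_]; the variable names are illustrative only.

```python
import re

s = "a_b A_Z \x80\xff \u0404"

# Python 3 str patterns are Unicode-aware by default (like re.UNICODE in Python 2).
unicode_alnum = re.sub(r"[\W_]+", "", s)

# re.ASCII restricts \w to [a-zA-Z0-9_], so only ASCII letters and digits survive.
ascii_alnum = re.sub(r"[\W_]+", "", s, flags=re.ASCII)

print(ascii(unicode_alnum))  # 'abAZ\xff\u0404'
print(ascii_alnum)           # abAZ
```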
#2
7
'\W' is the same as [^A-Za-z0-9_]; with the re.LOCALE flag set, accented characters that your locale defines as alphanumeric also count as word characters, so '\W' does not match them.
>>> re.sub('\W', '', 'text 1, 2, 3...')
'text123'
Maybe you want to keep the spaces or have all the words (and numbers):
>>> re.findall('\w+', 'my. text, --without-- (punctuation) 123')
['my', 'text', 'without', 'punctuation', '123']
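If the goal is a cleaned string rather than a list, the words found by findall can be joined back together. A minimal sketch, assuming Python 3 and assuming that a single space between words is what you want; note that \w+ still keeps underscores.

```python
import re

text = "my. text, --without-- (punctuation) 123"

# Collect runs of word characters, then rejoin them with single spaces.
words = re.findall(r"\w+", text)
cleaned = " ".join(words)

print(cleaned)  # my text without punctuation 123
```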
#3
4
In the character set matching rule [...] you can specify ^ as the first character to mean "not in".
import re
re.sub("[^0-9a-zA-Z]", # Anything except 0..9, a..z and A..Z
"", # replaced with nothing
"this is a test!!") # in this string
--> 'thisisatest'
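When the same negated class is applied to many strings, it can be compiled once and wrapped in a small helper. This is only a sketch for Python 3; NON_ALNUM and keep_ascii_alnum are hypothetical names, and the + quantifier is an addition to the pattern above so that runs are removed in one substitution.

```python
import re

# Negated character class: one or more characters that are NOT 0-9, a-z or A-Z.
NON_ALNUM = re.compile(r"[^0-9a-zA-Z]+")


def keep_ascii_alnum(text):
    """Return text with every non-ASCII-alphanumeric character removed."""
    return NON_ALNUM.sub("", text)


print(keep_ascii_alnum("this is a test!!"))  # thisisatest
print(keep_ascii_alnum("under_score 99%"))   # underscore99
```

Because the class lists the allowed characters explicitly, the underscore is removed as well, unlike with \W.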
#4
3
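You can also use the isalpha and isnumeric methods, as follows:
text = u'base, sample test;'  # unicode literal: Python 2 byte strings have no isnumeric()
# Yield only the letters and numeric characters of x
getVals = lambda x: (c for c in x if c.isalpha() or c.isnumeric())
# Clean each space-separated word, then rejoin the words with single spaces
' '.join(''.join(getVals(word)) for word in text.split(' '))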
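The same idea can be written without a regex at all using str.isalnum(), which is true whenever isalpha() or isnumeric() is true. A minimal sketch, assuming Python 3; keep_alnum is a hypothetical helper name.

```python
def keep_alnum(text):
    """Drop every character that is neither a letter nor a number."""
    # str.isalnum() is True for letters and digits, including non-ASCII ones.
    return "".join(c for c in text if c.isalnum())


text = "base, sample test;"

print(keep_alnum(text))                        # basesampletest
print([keep_alnum(w) for w in text.split()])   # ['base', 'sample', 'test']
```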
#5
3
There are other approaches you may also consider. For example, simply loop through the string and skip unwanted characters, assuming you want to delete all ASCII characters that are not letters or digits:
>>> import string
>>> newstring = [c for c in "a!1#b$2c%3\t\nx" if c in string.letters + string.digits]
>>> "".join(newstring)
'a1b2c3x'
Or use the string's translate method to map one character to another or to delete some characters, e.g.:
>>> todelete = [ chr(i) for i in range(256) if chr(i) not in string.letters + string.digits ]
>>> todelete = "".join(todelete)
>>> "a!1#b$2c%3\t\nx".translate(None, todelete)
'a1b2c3x'
This way you need to calculate the todelete list only once, or todelete can be hard-coded once and used everywhere you need to convert a string.
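The translate call above uses the Python 2 signature. On Python 3, str.translate takes a single mapping from code points to replacements, and mapping a code point to None deletes it. A sketch under that assumption; KEEP and DELETE_TABLE are illustrative names.

```python
import string

# Characters to keep: ASCII letters and digits (string.letters is Python 2 only).
KEEP = set(string.ascii_letters + string.digits)

# Build the deletion table once: map every other code point in 0..255 to None.
DELETE_TABLE = {i: None for i in range(256) if chr(i) not in KEEP}

result = "a!1#b$2c%3\t\nx".translate(DELETE_TABLE)
print(result)  # a1b2c3x
```

As in the original, only code points 0 to 255 are covered, so characters above that range pass through unchanged.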
#6
1
You can use the predefined character class \W in Python, which corresponds to the set [^a-zA-Z0-9_]. Then,
import re
s = 'Hello dutrow 123'
re.sub('\W', '', s)
--> 'Hellodutrow123'
#7
1
You need to be more specific:
- What about Unicode "letters", i.e. those with diacriticals?
- What about white space? (I assume this is what you DO want to delete along with punctuation.)
- When you say "letters", do you mean A-Z and a-z in ASCII only?
- When you say "numbers", do you mean 0-9 only? What about decimals, separators and exponents?
It gets complex quickly...
A great place to start is an interactive regex site, such as RegExr.
You can also get a Python-specific tool, Python Regex Tool.
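One way to make these choices explicit is to turn them into parameters. The following is only a sketch for Python 3; strip_non_alnum and its ascii_only and keep_whitespace parameters are hypothetical and not taken from any answer here.

```python
import re


def strip_non_alnum(text, ascii_only=True, keep_whitespace=False):
    """Remove characters that are not letters or digits under an explicit policy."""
    # With keep_whitespace, whitespace is excluded from the set of characters to delete.
    char_class = r"[^\w\s]|_" if keep_whitespace else r"[\W_]"
    # re.ASCII limits \w to [a-zA-Z0-9_]; without it, Unicode letters and digits count.
    flags = re.ASCII if ascii_only else 0
    return re.sub(char_class, "", text, flags=flags)


sample = "caf\u00e9 1,5e3"  # "café" followed by a decimal number with an exponent

print(strip_non_alnum(sample))                        # caf15e3
print(strip_non_alnum(sample, ascii_only=False))      # café15e3
print(strip_non_alnum(sample, keep_whitespace=True))  # caf 15e3
```

The last question in the list shows up directly: the number 1,5e3 collapses to 15e3 under every policy, so numeric formats need separate handling if they matter.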