I am trying to do what my title says. I have a list of about 30 thousand business addresses, and I'm trying to make each address as uniform as possible.
As far as removing weird symbols and characters goes, I have found three suggestions, but I don't understand how they are different.
If somebody can explain the difference, or provide insight into a better way to standardize address information, please and thank you!
address = re.sub(r'([^\s\w]|_)+', '', address)
address = re.sub('[^a-zA-Z0-9-_*.]', '', address)
address = re.sub(r'[^\w]', ' ', address)
3 Answers
#1
1
The first suggestion uses the \s and \w regex shorthand character classes.
\s means "match any whitespace". \w means "match any letter, digit or underscore".
This is used in a negated character class ([^\s\w]), which means "match anything that isn't whitespace or a word character". It is then combined using alternation (|) with _, which matches a literal underscore (needed because \w already covers the underscore), and the resulting group is given a + quantifier, which matches one or more times.
So what this says is: "Match any sequence of one or more characters that are not whitespace, letters or digits (underscores count as removable too) and remove it".
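To see why the |_ alternative matters, here is a minimal sketch comparing the first pattern with and without it. The sample address is made up for illustration.

```python
import re

# Hypothetical sample address, invented for this example.
address = "123_Main St., Suite #4"

# Without the |_ alternative the underscore survives, because \w
# includes the underscore and the negated class therefore skips it.
without_alt = re.sub(r'[^\s\w]+', '', address)

# With the alternative, underscores are removed along with punctuation.
with_alt = re.sub(r'([^\s\w]|_)+', '', address)

print(without_alt)  # 123_Main St Suite 4
print(with_alt)     # 123Main St Suite 4
```

Note that the removed characters are replaced by nothing, so "123_Main" collapses to "123Main" rather than "123 Main".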
The second option says: "Match any character which isn't a letter, number, hyphen, underscore, dot or asterisk and remove it". This is stated by that big negated character class (the stuff between the square brackets). Note that whitespace is not in the list, so spaces are removed as well.
The third option says "Take any character which is not a letter, digit or underscore and replace it with a space". It uses the \w class, which I explained above.
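Running the three options from the question on the same input makes the differences concrete. This is only a sketch; the sample address is an assumption, not taken from the asker's data.

```python
import re

# Hypothetical sample address, invented for this example.
address = "42 O'Neil Rd. #7-B"

# Option 1: delete punctuation and underscores, keep whitespace.
opt1 = re.sub(r'([^\s\w]|_)+', '', address)

# Option 2: delete everything except ASCII letters, digits, - _ * and .
# Spaces are not in the keep list, so they are deleted too.
opt2 = re.sub('[^a-zA-Z0-9-_*.]', '', address)

# Option 3: replace every non-word character with a single space.
opt3 = re.sub(r'[^\w]', ' ', address)

print(repr(opt1))  # '42 ONeil Rd 7B'
print(repr(opt2))  # '42ONeilRd.7-B'
print(repr(opt3))  # '42 O Neil Rd   7 B'
```

Option 3 leaves runs of spaces behind (three between "Rd" and "7" here), so it usually needs a follow-up step to collapse whitespace.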
All of the options use regular expressions to match character sequences with certain characteristics, together with the re.sub function, which substitutes the second argument for anything matched by the given regex.
You can read more about Regular Expressions in Python here.
#2
1
The class [^a-zA-Z0-9-_*.] enumerates exactly the characters to keep, and everything else is removed (though the literal - is better placed at the beginning or end of the character class, where it cannot be mistaken for a range operator).
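As a quick check of the hyphen remark, the sketch below compares the pattern as written with a variant that moves the hyphen to the end of the class. In Python's re module both should behave the same here, since the - after the 0-9 range is read as a literal; moving it is about readability. The sample string is made up.

```python
import re

# Pattern as written in the question: the hyphen sits between 0-9 and _.
original = re.compile(r'[^a-zA-Z0-9-_*.]')

# Same keep list with the hyphen moved to the end of the class.
hyphen_last = re.compile(r'[^a-zA-Z0-9_*.-]')

# Hypothetical sample, invented for this example.
sample = "Unit 5-A, 10 Elm_St. *rear* (back)"

result_original = original.sub('', sample)
result_hyphen_last = hyphen_last.sub('', sample)

print(result_original)     # Unit5-A10Elm_St.*rear*back
print(result_hyphen_last)  # Unit5-A10Elm_St.*rear*back
```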
\w is defined as "word character", which in traditional ASCII locales included A-Z and a-z as well as digits and the underscore, but with Unicode support it also matches accented characters, Cyrillic, Japanese ideographs, etc.
\s matches whitespace characters, which again with Unicode includes a number of extended characters such as the non-breaking space, numeric space, etc.
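A small sketch of what that means in practice for str patterns in Python 3, where Unicode matching is the default. The sample text is an assumption; it contains two accented letters and a non-breaking space (U+00A0).

```python
import re

# Hypothetical sample: "Café", a non-breaking space, "Münster", a space, "5".
text = "Caf\u00e9\u00a0M\u00fcnster 5"

# Default (Unicode) matching: accented letters count as word characters
# and the non-breaking space counts as whitespace.
unicode_words = re.findall(r'\w+', text)
unicode_split = re.split(r'\s+', text)

# With re.ASCII, \w and \s only cover the ASCII ranges.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)
ascii_split = re.split(r'\s+', text, flags=re.ASCII)

print(unicode_words)  # ['Café', 'Münster', '5']
print(ascii_words)    # ['Caf', 'M', 'nster', '5']
```

With re.ASCII the non-breaking space no longer splits the text, so ascii_split has two items instead of three.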
Which exactly to choose obviously depends on what you want to accomplish and what you mean by "special characters". Numbers are "symbols", all characters are "special", etc.
Here's a pertinent quotation from the Python re documentation:
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
\w
For Unicode (str) patterns:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
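The remark that the flag "affects the entire regular expression" can be illustrated roughly as follows. The sketch assumes you want ASCII-only word characters but still want to keep Unicode whitespace; the sample text is made up.

```python
import re

# Hypothetical sample: "Café", a non-breaking space, "Bar #3".
text = "Caf\u00e9\u00a0Bar #3"

# re.ASCII narrows both \w and \s, so the non-breaking space is
# no longer whitespace and gets deleted along with the accented letter.
flagged = re.sub(r'[^\w\s]', '', text, flags=re.ASCII)

# An explicit ASCII word class narrows only the word part;
# \s stays Unicode-aware and the non-breaking space is kept.
explicit = re.sub(r'[^a-zA-Z0-9_\s]', '', text)

# A bytes pattern always uses the ASCII definitions.
as_bytes = re.sub(rb'[^\w\s]', b'', text.encode('utf-8'))

print(repr(flagged))   # 'CafBar 3'
print(repr(explicit))  # 'Caf\xa0Bar 3'
print(as_bytes)        # b'CafBar 3'
```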
#3
0
How you read the re.sub function is like this (more docs):
re.sub(a, b, my_string) # replace any matches of regex a with b in my_string
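For a list of many addresses, one possible way to apply this is to compile the pattern once and reuse it. This is a sketch only; the pattern and the sample addresses are assumptions chosen for illustration.

```python
import re

# Compile once, reuse for every address in the list.
punctuation = re.compile(r'[^\w\s]')

# Hypothetical sample data, invented for this example.
addresses = ["1 Main St.", "22 Oak Ave, #5"]

# pattern.sub(repl, string) is the compiled form of re.sub(pattern, repl, string).
# It returns a new string; the original strings are left unchanged.
cleaned = [punctuation.sub('', a) for a in addresses]

print(cleaned)  # ['1 Main St', '22 Oak Ave 5']
```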
I would go with the second one. Regexes can be tricky, but this one says:
[^a-zA-Z0-9-_*.] # anything that's NOT a-z, A-Z, 0-9, -, _, * or .
Which seems like it's what you want. Whenever I'm using regexes, I use this site:
http://regexr.com/
You can put in some of your inputs, and make sure they are matching the right kinds of things before throwing them in your code!
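On the broader question of making addresses uniform, one possible approach is to chain a few substitutions: turn punctuation into spaces, collapse repeated whitespace, then normalize case. The function name normalize_address and the choice of steps are assumptions for illustration, not something proposed in the answers above, and whether dropping hyphens (as in "4-b") is acceptable depends on your data.

```python
import re

# Punctuation and underscores become a space so that tokens stay separated.
_NON_WORD = re.compile(r'[^\w\s]|_')
# Any run of whitespace collapses to one space.
_SPACES = re.compile(r'\s+')

def normalize_address(address: str) -> str:
    """Hypothetical helper: return a crude canonical form of an address."""
    text = _NON_WORD.sub(' ', address)
    text = _SPACES.sub(' ', text).strip()
    return text.upper()

print(normalize_address("  123 main st.,  Apt #4-b "))  # 123 MAIN ST APT 4 B
print(normalize_address("123 Main St Apt 4 B"))         # 123 MAIN ST APT 4 B
```

Two differently typed versions of the same address end up identical, which is the kind of uniformity the question is after.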