I am trying to do what my title says. I have a list of about 30 thousand business addresses, and I'm trying to make each address as uniform as possible.


As far as removing weird symbols and characters goes, I have found three suggestions, but I don't understand how they are different.


If somebody can explain the difference, or provide insight into a better way to standardize address information, please and thank you!


address = re.sub(r'([^\s\w]|_)+', '', address)

address = re.sub('[^a-zA-Z0-9-_*.]', '', address)

address = re.sub(r'[^\w]', ' ', address)

The first suggestion uses the \s and \w regex wildcards.


\s means "match any whitespace character". \w means "match any letter, digit, or underscore".


This is used in a negated character class ([^\s\w]), which means "match any character which isn't whitespace, a letter, a digit, or an underscore" (\w also covers the underscore). Because underscores would otherwise survive, the class is combined through alternation (|) with _, which matches a literal underscore, and the whole group is given a + quantifier, which matches one or more times.


So what this says is: "Match any sequence of one or more characters which are either underscores or anything other than whitespace, letters, and numbers, and remove it". In effect, only whitespace, letters, and numbers survive.


The second option says: "Match any character which isn't a letter, number, hyphen, underscore, dot or asterisk and remove it". This is stated by that big negated character class (the stuff between the square brackets). Note that the space is not in the list, so spaces are removed as well.


The third option says "Take any character which is not a letter, number, or underscore and replace it with a space". It uses the \w wildcard, which I have explained.
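To make the difference concrete, here is a small sketch that runs all three suggestions on one made-up address. The sample string and the variable names are my own, chosen only for illustration.

```python
import re

# Hypothetical sample address, not taken from the original data set.
sample = "123 Main_St., Suite #4-B"

# Option 1: remove punctuation and underscores, keep whitespace.
opt1 = re.sub(r'([^\s\w]|_)+', '', sample)

# Option 2: keep only ASCII letters, digits, '-', '_', '*' and '.'.
# Spaces are not in the list, so they are removed too.
opt2 = re.sub('[^a-zA-Z0-9-_*.]', '', sample)

# Option 3: replace each non-word character with a space.
# The underscore counts as a word character, so it stays.
opt3 = re.sub(r'[^\w]', ' ', sample)

print(repr(opt1))  # '123 MainSt Suite 4B'
print(repr(opt2))  # '123Main_St.Suite4-B'
print(repr(opt3))  # '123 Main_St   Suite  4 B'
```

For address data, the first option is probably the only one of the three that keeps the words separated without leaving runs of extra spaces behind.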


All of the options use Regular Expressions to match character sequences with certain characteristics, together with the re.sub function, which substitutes the replacement string (the second argument) for anything matched by the given regex.


You can read more about Regular Expressions in Python here.




The negated class [^a-zA-Z0-9-_*.] enumerates exactly the characters to keep; everything else is removed (though the literal - should be at the beginning or end of the character class, so that it cannot be read as part of a range).
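As a rough illustration of why the position of the hyphen matters, the sketch below compares the class as written with a version that puts the hyphen last, and with a version where the hyphen accidentally forms a range. The test strings are invented for this example.

```python
import re

# The class as written: Python treats the '-' after the 0-9 range as a literal.
original = re.compile('[^a-zA-Z0-9-_*.]')

# Same set of characters, with the hyphen last so it is clearly a literal.
explicit = re.compile(r'[^a-zA-Z0-9_*.-]')

text = "PO Box 12-A, Unit_7*"
kept_original = original.sub('', text)
kept_explicit = explicit.sub('', text)
print(kept_original)  # POBox12-AUnit_7*
print(kept_explicit)  # POBox12-AUnit_7*

# If the hyphen lands between two other characters, it becomes a range.
# Here '*-_' spans every character from '*' to '_', which includes ',' and '@'.
accidental = re.sub(r'[^a-zA-Z0-9*-_.]', '', 'a,b@c')
print(accidental)  # a,b@c  (nothing was removed)
```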


\w is defined as "word character" which in traditional ASCII locales included A-Z and a-z as well as digits and underscore, but with Unicode support, it matches accented characters, Cyrillics, Japanese ideographs, etc.


\s matches space characters, which again with Unicode includes a number of extended characters such as the non-breakable space, numeric space, etc.
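A short sketch of that Unicode behaviour, assuming Python 3 str patterns. The sample address is made up and contains accented letters and a non-breaking space.

```python
import re

# Hypothetical address with accented letters and a non-breaking space (U+00A0).
address = "Café Straße\u00a012"

# Default (Unicode) matching: é and ß are word characters and
# U+00A0 counts as whitespace, so nothing is removed.
unicode_result = re.sub(r'[^\w\s]', '', address)

# With re.ASCII, \w is only [a-zA-Z0-9_] and \s is only ASCII whitespace,
# so the accented letters and the non-breaking space are stripped.
ascii_result = re.sub(r'[^\w\s]', '', address, flags=re.ASCII)

print(unicode_result == address)  # True
print(ascii_result)               # Caf Strae12
```

If the addresses can contain non-English names, the ASCII variant silently damages them, so it is worth deciding up front which behaviour you want.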


Which exactly to choose obviously depends on what you want to accomplish and what you mean by "special characters". Numbers are "symbols", all characters are "special", etc.


Here's a pertinent quotation from the Python re documentation:



\s

For Unicode (str) patterns:


Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).


For 8-bit (bytes) patterns:


Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].



\w

For Unicode (str) patterns:


Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).


For 8-bit (bytes) patterns:


Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].




How you read the re.sub function is like this (more docs):


re.sub(a, b, my_string)  # replace any matches of regex a with b in my_string
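For example, here is a minimal sketch of two re.sub calls chained together: the third suggestion from the question, followed by a second pass that collapses the runs of spaces it leaves behind. The input string and the variable names are invented for illustration.

```python
import re

raw = "45 Oak Ave., #2"  # hypothetical input

# First pass: replace every non-word character with a space.
spaced = re.sub(r'[^\w]', ' ', raw)

# Second pass: collapse runs of whitespace into one space and trim the ends.
cleaned = re.sub(r'\s+', ' ', spaced).strip()

print(repr(spaced))   # '45 Oak Ave    2'
print(repr(cleaned))  # '45 Oak Ave 2'
```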

I would go with the second one. Regexes can be tricky, but this one says:


[^a-zA-Z0-9-_*.]   # anything that's NOT a-z, A-Z, 0-9, -, _, * or .

Which seems like it's what you want. Whenever I'm using regexes, I use this site:




You can put in some of your inputs, and make sure they are matching the right kinds of things before throwing them in your code!




