A better way to use re.sub

I'm cleaning a series of sources from a Twitter stream. Here is an example of the data:

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']


import re
for i in source:
    re.sub('<.*?>', '', re.sub(r'(<.*?>)(Twitter for)(\s+)', r'', i))

### This would be the expected output ###
'Android Tablets'
'Android'
'foursquare'
'web'
'iPhone'
'BlackBerry'

The code above is what I have; it does the job but looks awful. I was hoping there is a better way of doing this, whether with re.sub() or some other function that would be more appropriate.

5 solutions

#1

Here is some advice to improve your code:

Use regex compilation so the pattern is not re-processed each time you apply it,
use raw strings so Python does not interpret escapes in the regex string,
use a regex that matches anything but the closing tag character inside the tag,
you don't need to repeat the substitution, as sub() replaces every occurrence in the string by default.

Here's a simpler and better result:

>>> import re
>>> r = re.compile(r'<[^>]+>')
>>> for it in source:
...     r.sub('', it)
... 
'Twitter for Android Tablets'
'Twitter for  Android'
'foursquare'
'web'
'Twitter for iPhone'
'Twitter for BlackBerry'
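
The loop above strips the tags but leaves the "Twitter for" prefix that the question also wants removed. Below is a minimal sketch of one way to get the expected output, assuming the prefix is always the literal text "Twitter for" followed by whitespace. The names TAG_RE, PREFIX_RE and clean_source are made up for illustration and are not part of the original answer.

```python
import re

# Compiled once, reused for every item (hypothetical names).
TAG_RE = re.compile(r'<[^>]+>')              # any tag: '<', non-'>' characters, '>'
PREFIX_RE = re.compile(r'^Twitter for\s+')   # assumed literal prefix plus whitespace

def clean_source(raw):
    """Strip tags, then drop a leading 'Twitter for ' if present."""
    text = TAG_RE.sub('', raw)
    return PREFIX_RE.sub('', text)

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>',
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']

for item in source:
    print(clean_source(item))
```

Anchoring the prefix with ^ keeps the substitution from touching "Twitter for" if it ever appears later in a label.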

N.B.: the best solution for your use case would be @bakuriu's suggestion:

>>> for it in source:
...     it[it.index('>')+1:it.rindex('<')]
'Twitter for Android Tablets'
'Twitter for  Android'
'foursquare'
'Twitter for iPhone'
'Twitter for BlackBerry'

which adds no significant overhead and uses basic, fast string operations. But that solution takes only what is between the first '>' and the last '<' rather than removing the tags, which may have side effects if there are tags nested within the <a> and </a>. It also fails when there are no tags at all: str.index() raises ValueError on the 'web' string. A version that handles the no-tags case:

 >>> for it in source:
 ...     if '>' in it and '<' in it:
 ...         it[it.index('>')+1:it.rindex('<')]
 ...     else:
 ...         it
 'Twitter for Android Tablets'
 'Twitter for  Android'
 'foursquare'
 'web'
 'Twitter for iPhone'
 'Twitter for BlackBerry'
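
The same idea can be wrapped in a small function. This is only a sketch: the name text_between_tags is hypothetical, and it uses str.find() and str.rfind(), which return -1 instead of raising when the character is missing, so the no-tags case needs no try/except.

```python
def text_between_tags(raw):
    """Return the text between the first '>' and the last '<'.

    Falls back to the input unchanged when that span does not exist
    (for example the plain 'web' entry).
    """
    start = raw.find('>')    # -1 if there is no '>'
    end = raw.rfind('<')     # -1 if there is no '<'
    if start == -1 or end == -1 or end < start:
        return raw
    return raw[start + 1:end]

print(text_between_tags('<a href="http://foursquare.com" rel="nofollow">foursquare</a>'))
print(text_between_tags('web'))
```

As with the loop above, nested tags inside the <a> element would be kept in the result, so this only suits input as regular as the sample.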

#2

Just another alternative, using BeautifulSoup with Python's built-in html.parser:

>>> from bs4 import BeautifulSoup
>>> for link in source:
...     print(BeautifulSoup(link, 'html.parser').text.replace('Twitter for', '').strip())
... 
Android Tablets
Android
foursquare
web
iPhone
BlackBerry

#3

If you're doing a lot of these, use a library designed to handle (X)HTML. lxml works well but I'm more familiar with the BeautifulSoup wrapper.

from bs4 import BeautifulSoup

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
      '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
      '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
      '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
      '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']

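soup = BeautifulSoup('\n'.join(source), 'html.parser')  # name the parser explicitly
# The bare 'web' entry is not inside an <a> tag, so this loop skips it.
for tag in soup.find_all('a'):
    print(tag.text)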

This might be a little overkill for your use case, though.
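
If pulling in a third-party library feels like too much, a parser-based approach is also possible with the standard library alone. The sketch below swaps BeautifulSoup for Python's html.parser.HTMLParser; it is not the method from this answer, and the names TextCollector and strip_tags are made up for illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects only the text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for text outside of tags.
        self.chunks.append(data)

def strip_tags(raw):
    parser = TextCollector()
    parser.feed(raw)
    parser.close()   # flush any text still buffered
    return ''.join(parser.chunks)

print(strip_tags('<a href="http://foursquare.com" rel="nofollow">foursquare</a>'))
print(strip_tags('web'))
```

Unlike the find_all('a') loop, this keeps entries that contain no tags at all, because plain text is reported through handle_data as well.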

#4

One option, if the text really is in this consistent a format, is to just use string operations instead of regex:

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']

for i in source:
    # Note: an entry with no tags (e.g. 'web') comes out as an empty string here.
    print(i.partition('>')[-1].rpartition('<')[0])

This code finds the first '>' in the string, takes everything after it, finds the last '<' in what remains, and returns everything before that; i.e., it gives you the text between the first '>' and the last '<'.
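
To make the two steps concrete, here is a small walk-through on one of the sample strings. It is only an illustration; the variable names are arbitrary.

```python
raw = '<a href="http://foursquare.com" rel="nofollow">foursquare</a>'

# partition splits at the FIRST occurrence of the separator.
head, sep, tail = raw.partition('>')
# head: '<a href="http://foursquare.com" rel="nofollow"'
# tail: 'foursquare</a>'

# rpartition splits at the LAST occurrence of the separator.
inner, sep2, rest = tail.rpartition('<')
# inner: 'foursquare'
# rest:  '/a>'

print(inner)

# When the separator is missing, partition returns (string, '', ''),
# so a tag-free entry such as 'web' ends up as an empty string.
print(repr('web'.partition('>')[-1].rpartition('<')[0]))
```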

There's also the far more minimal version @Bakuriu put in a comment, which is probably better than mine!

#5

This looks less ugly to me and should work equally well:

import re
for i in source:
    print(re.sub(r'(<.*?>)|(Twitter for\s+)', '', i))
