如何使用漂亮的soup和re查找包含特定文本的特定类的span ?

how can I find all span's with a class of 'blue' that contain text in the format:

如何找到包含格式文本的“蓝色”类的所有span ?

04/18/13 7:29pm

which could therefore be:

因此可以:

04/18/13 7:29pm

or:

或者:

Posted on 04/18/13 7:29pm

in terms of constructing the logic to do this, this is what i have got so far:

在构建这样做的逻辑方面，这是我到目前为止所得到的:

new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result

I've been referring to https://*.com/a/7732827 and https://*.com/a/12229134 to try and figure out a way to do this, but the above is all i have got so far.

我一直在引用https://*.com/a/7732827和https://*.com/a/12229134来尝试找出一种方法，但是以上就是我到目前为止所得到的全部。

edit:

编辑:

to clarify the scenario, there are span's with:

为了阐明这一设想，span是:

<span class="blue">here is a lot of text that i don't need</span>

and

和

<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>

and note i only need 04/18/13 7:29pm not the rest of the content.

请注意，我只需要04/18/13:29pm，不需要其他内容。

edit 2:

编辑2:

I also tried:

我也试过:

pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result

and got error:

并得到了错误:

'TypeError: expected string or buffer'

3 个解决方案

#1

import re
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""

# parse the html
soup = BeautifulSoup(html_doc)

# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})

# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]

# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[a|p]m)', line)
    if m:
        found_dates.append(m.group(1))

# print the dates we collected
for date in found_dates:
    print(date)

output:

输出:

04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm

#2

This is a flexible regex that you can use:

这是一个灵活的regex，您可以使用:

"(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])"

Example:

例子:

>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">04/18/13 7:29pm</span>
<span class="blue">Posted on 15/18/2013 10:00AM</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
<span class="blue">Posted on 4/1/2013 17:09aM</span>
</body>
</html>
"""
>>> soup = BeautifulSoup(html)
>>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
>>> ok = [m.group(1)
      for line in lines
        for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])', line),)
          if m]
>>> ok
[u'04/18/13 7:29pm', u'04/19/13 7:30pm', u'04/18/13 7:29pm', u'15/18/2013 10:00AM', u'04/20/13 10:31pm', u'4/1/2013 17:09aM']
>>> for i in ok:
    print i

04/18/13 7:29pm
04/19/13 7:30pm
04/18/13 7:29pm
15/18/2013 10:00AM
04/20/13 10:31pm
4/1/2013 17:09aM

#3

This pattern seems to satisfy what you're looking for:

这种模式似乎可以满足你的需求:

>>> pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
>>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
>>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
('04/18/13 7:29pm',)

#1

import re
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""

# parse the html
soup = BeautifulSoup(html_doc)

# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})

# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]

# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[a|p]m)', line)
    if m:
        found_dates.append(m.group(1))

# print the dates we collected
for date in found_dates:
    print(date)