In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with their documentation and I just cannot parse it. Can somebody point me to the section where I should be able to translate this expression to a BeautifulSoup expression?
hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')
The above expression is from Scrapy. I'm trying to apply the regex re('\.a\w+') to the td class altRow to get the links from there.
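For context, here is roughly how that expression runs on the Scrapy side (a sketch of my setup, assuming the old HtmlXPathSelector API and fetching the page with urllib):

import urllib
from scrapy.selector import HtmlXPathSelector

# build a selector straight from the page source
html = urllib.urlopen("http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName=").read()
hxs = HtmlXPathSelector(text=html)
# take hrefs from links in the second altRow cell, then keep only those
# matching the regex: a slash, any character, 'a', then word characters
links = hxs.select('//td[@class="altRow"][2]/a/@href').re(r'/.a\w+')
print links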
I would also appreciate pointers to any other tutorials or documentation. I couldn't find any.
Thanks for your help.
Edit: I am looking at this page:
>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>
Yet, if you look at the page source, "/cabel" is there:
<td class="altRow" valign="middle" width="34%">
<a href='/cabel'>Abel, Christian</a>
For some reason, the search results are not visible to BeautifulSoup, but they are visible to XPath, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".
Edit: cobbal: It is still not working. But when I search this:
>>> soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>
it returns all the links with second character "a" but not the lawyer names. So for some reason those links (such as "/cabel") are not visible to BeautifulSoup. I don't understand why.
4 Answers
#1 (score 3)
I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:
from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib
# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()
# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes,
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")
# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))
# compose total matching pattern (add trailing tdStart to filter out
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart
# scan input HTML source for matching refs, and print out the text and
# href values
for ref, s, e in patt.scanString(html):
    print ref.text, ref.a.href
I extracted 914 references from your page, from Abel to Zupikova.
Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
#2 (score 6)
One option is to use lxml (I'm not familiar with BeautifulSoup, so I can't say how to do it there); it supports XPath by default.
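For example, a minimal lxml sketch of the same query (my own illustration, not from the answer; assumes lxml is installed):

# lxml evaluates the question's XPath directly; the regex filter is
# then applied in Python (illustration only)
import re
from lxml import html

tree = html.parse("http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName=")
hrefs = tree.xpath('//td[@class="altRow"][2]/a/@href')
print [h for h in hrefs if re.match(r'/.a\w+', h)]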
Edit (tested):
soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)
I used the docs at http://www.crummy.com/software/BeautifulSoup/documentation.html
soup should be a BeautifulSoup object:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
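Putting the pieces together, an end-to-end sketch (my own untested assembly of the snippets above; fetching with urllib is an assumption):

import re
import urllib
import BeautifulSoup

url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
html_string = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(html_string)
# second altRow cell, then its direct child links whose href matches
links = soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)
print [a['href'] for a in links]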
#3 (score 2)
I just answered this on the Beautiful Soup mailing list as a response to Zeynel's email to the list. Basically, there is an error in the web page that totally kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.
The thread is located at the Google Groups archive.
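If you want to see the effect yourself, one quick check is to count how many matching links each version recovers (my own illustration; run it once under each installed version):

import re
import urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
soup = BeautifulSoup(urllib.urlopen(url).read())
# under 3.0.x this finds the lawyer links; under 3.1 the parse stops
# early at the bad markup, so most of the tree (and the links) is gone
print len(soup.findAll('a', href=re.compile(r'/.a\w+')))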
#4 (score 1)
It seems that you are using BeautifulSoup 3.1.
I suggest reverting to BeautifulSoup 3.0.7 (because of this problem).
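A quick way to confirm which version is actually being imported (my own check; the version strings are just examples):

import BeautifulSoup
print BeautifulSoup.__version__  # e.g. '3.1.0.1' versus '3.0.7a'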
I just tested with 3.0.7 and got the results you expect:
>>> soup.findAll(href=re.compile(r'/cabel'))
[<a href="/cabel">Abel, Christian</a>]
Testing with BeautifulSoup 3.1 gets the results you are seeing. There is probably a malformed tag in the HTML, but I couldn't spot it at a quick glance.