How do I click a link that has javascript:__doPostBack in its href?

Posted: 2022-08-10 14:59:59

I am writing a screen-scraping script in Python with the 'mechanize' module, and I would like to use the mechanize.click_link() method on a link that has javascript:__doPostBack in its href. I believe the page I am trying to parse uses AJAX.

Note: mech is the mechanize.Browser()

>>> next_link.__class__.__name__
'Link'
>>> next_link
Link(base_url='http://www.citius.mj.pt/Portal/consultas/ConsultasDistribuicao.aspx', url="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$lnkNext','')", text='2', tag='a', attrs=[('id', 'ctl00_ContentPlaceHolder1_Pager1_lnkNext'), ('title', 'P\xc3\xa1gina seguinte: 2'), ('href', "javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$lnkNext','')")])
>>> req = mech.click_link(next_link)
>>> req
<urllib2.Request instance at 0x025BEE40>
>>> req.has_data()
False

I would like to retrieve the page source after clicking the link.

3 Answers

#1


7  

I don't use mechanize, but I do a lot of web scraping myself with Python.

When I run into a javascript function like __doPostBack, I do the following:
1. I open the web site in Firefox and use the HttpFox extension to see the parameters of the POST request the browser sends to the web server when I click the relevant link.
2. I then build the same request in Python, using urllib.parse.urlencode to build the query strings and POST data I need.
3. Sometimes the website uses cookies as well, so I just use Python's http.cookiejar.

I have used this technique successfully several times.
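
For reference, here is a rough sketch of that approach in Python 3. It is untested against this particular site: the URL and the ASP.NET field names (__EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE, __EVENTVALIDATION) come from the question and from the standard __doPostBack form fields, and the __VIEWSTATE/__EVENTVALIDATION values are placeholders you would extract from the page yourself after checking the real request in HttpFox.

import urllib.request
import urllib.parse
import http.cookiejar

URL = 'http://www.citius.mj.pt/Portal/consultas/ConsultasDistribuicao.aspx'

# Keep cookies across requests; ASP.NET sites usually need a session cookie.
cookies = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookies))

# GET the page first to pick up the session cookie and the hidden form fields.
page = opener.open(URL).read().decode('utf-8', errors='replace')
# ...parse __VIEWSTATE and __EVENTVALIDATION out of `page` here (with a regex or
# an HTML parser); the placeholders below stand in for those extracted values.
viewstate = 'EXTRACTED_FROM_PAGE'
eventvalidation = 'EXTRACTED_FROM_PAGE'

# Replicate the POST that __doPostBack('ctl00$ContentPlaceHolder1$Pager1$lnkNext', '')
# would have sent from the browser.
post_fields = {
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$Pager1$lnkNext',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': viewstate,
    '__EVENTVALIDATION': eventvalidation,
}
data = urllib.parse.urlencode(post_fields).encode('utf-8')
response = opener.open(URL, data)  # passing data makes this a POST
print(response.read().decode('utf-8', errors='replace'))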

#2


1  

I don't think mechanize supports Javascript; to scrape pages which intrinsically rely on Javascript execution for their functionality, you may need to use a different tool, such as Selenium RC.
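
Selenium RC has since been superseded by Selenium WebDriver, so a minimal sketch of that idea with the current WebDriver API might look like the following. The element id comes from the link in the question; the rest is generic and untested against this site.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # needs geckodriver available on the PATH
try:
    driver.get('http://www.citius.mj.pt/Portal/consultas/ConsultasDistribuicao.aspx')
    # Clicking the link runs __doPostBack in a real browser, so the postback
    # (including any AJAX it triggers) happens exactly as it would for a user.
    driver.find_element(By.ID, 'ctl00_ContentPlaceHolder1_Pager1_lnkNext').click()
    print(driver.page_source)  # page source after the postback
finally:
    driver.quit()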

#3


0  

>>> next_link.__class__.__name__
'Link'
>>> next_link
Link(base_url='http://www.citius.mj.pt/Portal/consultas/ConsultasDistribuicao.aspx', url="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$lnkNext','')", text='2', tag='a', attrs=[('id', 'ctl00_ContentPlaceHolder1_Pager1_lnkNext'), ('title', 'P\xc3\xa1gina seguinte: 2'), ('href', "javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$lnkNext','')")])
>>> req = mech.click_link(next_link)
>>> req
<urllib2.Request instance at 0x025BEE40>
>>> req.has_data()
False
