I want to fetch data from another URL, for which I am using urllib and Beautiful Soup. My data is inside a table tag (which I found using the Firefox console). But when I tried to fetch the table by its id, the result was None, so I guess this table must be added dynamically by some js code.
I have tried both parsers, 'lxml' and 'html5lib', but I still can't get the table data.
I have also tried one more thing:
import urllib
from bs4 import BeautifulSoup

web = urllib.urlopen("my url")
html = web.read()
soup = BeautifulSoup(html, 'lxml')
js = soup.find("script")  # first <script> tag on the page
ss = js.prettify()
print ss
Result:
<script type="text/javascript">
myPage = 'ETFs';
sectionId = 'liQuotes'; //section tab
breadCrumbId = 'qQuotes'; //page
is_dartSite = "quotes";
is_dartZone = "news";
propVar = "ETFs";
</script>
But now I don't know how I can get the values of these js variables.
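For illustration, simple string assignments like the ones in that script block can be pulled out with a regular expression. This is a minimal sketch run against the exact text shown above; it only handles `name = 'value';` style assignments:

```python
import re

# The <script> body shown above, copied verbatim.
script_text = """
myPage = 'ETFs';
sectionId = 'liQuotes'; //section tab
breadCrumbId = 'qQuotes'; //page
is_dartSite = "quotes";
is_dartZone = "news";
propVar = "ETFs";
"""

# Match  name = 'value';  or  name = "value";  assignments.
assignment = re.compile(r"(\w+)\s*=\s*['\"]([^'\"]*)['\"]\s*;")
variables = dict(assignment.findall(script_text))
print(variables['myPage'])  # ETFs
```

This is only useful if the value you need is literally present in the script source; values computed at runtime by js still require rendering the page (e.g. with Selenium).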
Now I have two options: either get the table content or get the js variables. Either one would fulfil my task, but unfortunately I don't know how to do either, so please tell me how I can solve one of these problems.
Thanks
2 Answers
#1
9
EDIT
This will do the trick, using the re module to extract the data and loading it as JSON:
import urllib
import json
import re
from bs4 import BeautifulSoup

web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
soup = BeautifulSoup(web.read(), 'lxml')
data = soup.find_all("script")[19].string  # hard-coded offset of the target <script>
p = re.compile(r'var table_body = (.*?);', re.DOTALL)
m = p.search(data)
stocks = json.loads(m.group(1))
>>> for stock in stocks:
... print stock
...
[u'ASPS', u'Altisource Portfolio Solutions S.A.', 116.96, 2.2, 1.92, 86635, u'N', u'N']
[u'AGNC', u'American Capital Agency Corp.', 23.76, 0.13, 0.55, 3184303, u'N', u'N']
.
.
.
[u'ZION', u'Zions Bancorporation', 29.79, 0.46, 1.57, 2154017, u'N', u'N']
The problem with this is that the script tag offset is hard-coded and there is no reliable way to locate it within the page. Changes to the page could break your code.
ORIGINAL answer
Rather than trying to screen-scrape the data, you can download a CSV representation of the same data from http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx?render=download.
Then use the Python csv module to parse and process it. Not only is this more convenient, it will also be a more resilient solution, because any changes to the HTML could easily break your screen-scraping code.
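A minimal sketch of that csv-based approach, run here on an inline sample instead of the live download (the column names are assumptions, not the exact NASDAQ header):

```python
import csv
import io  # on Python 2, use StringIO.StringIO and a unicode literal instead

# Sample rows in the shape of a render=download CSV.
sample = (
    '"Symbol","Name","Last","Change","Volume"\n'
    '"AMZN","Amazon.com, Inc.","329.67","6.1","5246300"\n'
    '"YHOO","Yahoo! Inc.","35.92","0.98","18705720"\n'
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)
# Turn each data row into a dict keyed by the header fields.
rows = [dict(zip(header, row)) for row in reader]
print(rows[0]['Symbol'])  # AMZN
```

For the real page you would pass `urllib.urlopen(...)` output to the reader instead of the sample string; the quoted commas inside company names are exactly why csv, not str.split, should do the parsing.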
Otherwise, if you look at the actual HTML, you will find that the data is available within the page in the following script tag:
<script type="text/javascript">var table_body = [["ATVI", "Activision Blizzard, Inc", 20.92, 0.21, 1.01, 6182877, .1, "N", "N"],
["ADBE", "Adobe Systems Incorporated", 66.91, 1.44, 2.2, 3629837, .6, "N", "N"],
["AKAM", "Akamai Technologies, Inc.", 57.47, 1.57, 2.81, 2697834, .3, "N", "N"],
["ALXN", "Alexion Pharmaceuticals, Inc.", 170.2, 0.7, 0.41, 659817, .1, "N", "N"],
["ALTR", "Altera Corporation", 33.82, -0.06, -0.18, 1928706, .0, "N", "N"],
["AMZN", "Amazon.com, Inc.", 329.67, 6.1, 1.89, 5246300, 2.5, "N", "N"],
....
["YHOO", "Yahoo! Inc.", 35.92, 0.98, 2.8, 18705720, .9, "N", "N"]];
#2
2
Just to add to @mhawke's answer: rather than hard-coding the offset of the script tag, you can loop through all the script tags and pick the one that matches your pattern:
import urllib
import json
import re
from bs4 import BeautifulSoup

web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
pattern = re.compile(r'var table_body = (.*?);', re.DOTALL)
soup = BeautifulSoup(web.read(), "lxml")
for script in soup.find_all('script'):
    match = pattern.search(str(script.string))
    if match:
        stocks = json.loads(match.group(1))
        print stocks
        break