Python 3, Web Scraping, and Javascript [Oh My]

Date: 2021-03-28 01:22:23

I have come to the point of entering the melee on web-scraping webpages using Javascript, with Python3. I am well aware that my boot may be making contact with a dead horse, but I feel like drawing my six-shooter anyway. It's a spaghetti western; be my gray hat?

::Backstory::

I am using Python 3.2.3.

I am interested in gathering historical stock//etf//mutual_fund price data for YTD, 1-yr, 3-yr, 5-yr, 10-yr... and/or similar timeframes for a user-defined stock, etf, or mutual fund. I set my sights on Morningstar.com, as they tend to provide as much data as possible without necessarily requiring a log-in; other folks such as finance.google.com &c tend to be inconsistent in what data they provide regarding stocks vs etfs vs mutual funds.

The trade-off in using Morningstar for this historical data, or "Trailing Total Returns" as they call it, is that for producing this data they use Javascript.

Here are some example links from Morningstar:

A Mutual Fund;

An ETF;

A Stock.

I am interested in the "Trailing Returns" portion, top row or so of numbers in the Javascript-produced chart.

::Attempted So Far::

I've confirmed that wget doesn't play with Javascript; even downloading all of the associated files [css, .js, &c] hasn't allowed me to locally render the javascript in browser or in script. Research here on Stack Overflow confirmed this. Am willing to be corrected here.

My research informed me that Mechanize doesn't exist for Python3. I tried anyway, and turned into Policeman Javert crying out "I knew it!" at the error message "module does not exist".

::I've Heard Of...::

->Selenium. However, my understanding is that this requires Thy Favorite Browser to actually open up a webpage, navigate around, and then not close because there's no "close this tab//window" command//option for Selenium. What if I//my_user want to get historical data for many etfs, stocks, and/or mutual funds? That's a lot of tabs//windows opening up in a browser which was not necessarily desired to be opened.

->httplib2. I think this is nice, but I'm doubtful if it will play with Javascript. Does it, for example using the .cache and get options?

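import httplib2
conn = httplib2.Http(".cache")  # cache responses in a local ".cache" directory
# request() returns a (response headers, body) pair; the u"" prefix is dropped
# because it is a syntax error on Python 3.2
response, content = conn.request("http://the_url", "GET")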

->Windmill. See 'Selenium'. I am, however, off-key enough to sing 'Man of La Mancha'.

->Google's webscraping code. Would an attempt at downloading a Javascript-laden page result in ... positive results?

I've read chatter about having to "emulate a browser without a browser". Sounds like Mechanize, but not for Python3 as I currently understand.

::My Question::

Any suggestions, pointers, solutions, or "look over here" directions?

Many thanks,

Miles, Dusty Desert Villager.

1 solution

#1


When a page loads data via javascript, it has to make requests to the server to get that data via the XMLHttpRequest function (XHR). You can see what requests they are making, and then make them yourself, using wget!

To find out which requests they are making, use the Web Inspector (Chrome and Safari) or Firebug (Firefox). Here's how to do it in Chrome:

wrench/tools/developer tools/Network (tab at the top of the tools)/XHR filter at the bottom.

Here's an example request they make in javascript

If you look closely at the XHR request URL, you'll notice that all trailing-returns requests have the same format:

http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=

You just need to specify t. For example:

http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VAW
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=INTC
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VHCOX
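Since only the t parameter changes, the URL for any ticker can be built in a couple of lines. Here is a minimal Python 3 sketch using only the standard library; the names BASE_URL and trailing_returns_url are made up for illustration, and upper-casing the ticker is my assumption, not something the endpoint is known to require.

```python
from urllib.parse import urlencode

# Endpoint quoted above; only the query string varies between requests.
BASE_URL = ("http://performance.morningstar.com/Performance/cef/"
            "trailing-total-returns.action")


def trailing_returns_url(ticker):
    """Return the trailing-returns URL for one ticker symbol.

    Hypothetical helper: trimming and upper-casing the symbol is an
    assumption made for convenience.
    """
    return BASE_URL + "?" + urlencode({"t": ticker.strip().upper()})


# Rebuild the three example URLs from the answer.
for symbol in ("VAW", "INTC", "VHCOX"):
    print(trailing_returns_url(symbol))
```

urlencode is used rather than plain string concatenation so that any unusual characters in a symbol get escaped.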

Now you can wget those URIs and parse out the data directly.
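If you'd rather stay inside Python 3 than shell out to wget, the same fetch-and-parse step can be sketched with the standard library alone. This assumes the endpoint returns an HTML fragment containing a table, which I have not verified; check the real response in the Network tab first. TableRows, parse_rows, and fetch are hypothetical names, and the sample string is made up so the parser can be tried offline.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def parse_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch(url):
    """Download one URL and decode it; UTF-8 is an assumption."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample markup, not real Morningstar output.
sample = ("<table><tr><th>Period</th><th>YTD</th></tr>"
          "<tr><td>Fund</td><td>1.23</td></tr></table>")
print(parse_rows(sample))
```

For a live run, pass one of the URLs above to fetch() and hand the result to parse_rows(). If the response turns out to be JSON or some other format, swap the parser accordingly.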
