All,
I'm looking to download stock data, either from Yahoo or Google, at 15-60 minute intervals for as much history as I can get. I've come up with a crude solution as follows:
library(RCurl)
tmp <- getURL('https://www.google.com/finance/getprices?i=900&p=1000d&f=d,o,h,l,c,v&df=cpct&q=AAPL')
tmp <- strsplit(tmp, '\n')[[1]]
tmp <- tmp[-(1:8)]                # drop the header block
tmp <- strsplit(tmp, ',')
tmp <- do.call('rbind', tmp)
tmp <- apply(tmp, 2, as.numeric)
tmp <- tmp[!apply(tmp, 1, function(x) any(is.na(x))), ]  # '!' (not '-') to drop rows with NAs
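For what it's worth, here is a sketch of a slightly tidier version of the same idea, wrapped in a function so it can be called per ticker. The helper name `getGooglePrices` is my own, and it assumes the same `getprices` URL format as above:

```r
library(RCurl)

# Sketch of a reusable ingest helper (my own name; assumes the same
# getprices URL format used in the question).
getGooglePrices <- function(ticker, interval = 900, period = '1000d') {
  url <- sprintf(
    'https://www.google.com/finance/getprices?i=%d&p=%s&f=d,o,h,l,c,v&df=cpct&q=%s',
    interval, period, ticker)
  raw   <- getURL(url)
  lines <- strsplit(raw, '\n')[[1]]
  body  <- lines[-(1:8)]                       # drop the header block
  dat   <- read.csv(text = paste(body, collapse = '\n'),
                    header = FALSE, stringsAsFactors = FALSE,
                    col.names = c('DATE', 'CLOSE', 'HIGH', 'LOW', 'OPEN', 'VOLUME'))
  dat[complete.cases(dat[-1]), ]               # keep rows with full numeric data
}

# e.g. aapl <- getGooglePrices('AAPL')
```

Using `read.csv` on the joined text avoids the manual `strsplit`/`rbind`/`as.numeric` dance and keeps the DATE column (which is not purely numeric) intact.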
Given the amount of data I'm looking to import, I worry that this could be computationally expensive. I also don't, for the life of me, understand how the timestamps are coded in Yahoo and Google.
So my question is twofold: what's a simple, elegant way to quickly ingest data for a series of stocks into R, and how do I interpret the timestamping in the Google/Yahoo files that I would be using?
4 Answers
#1
22
I will try to answer the timestamp question first. Please note this is my interpretation and I could be wrong.
Using the link in your example https://www.google.com/finance/getprices?i=900&p=1000d&f=d,o,h,l,c,v&df=cpct&q=AAPL I get the following data:
EXCHANGE%3DNASDAQ
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=900
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-300
a1357828200,528.5999,528.62,528.14,528.55,129259
1,522.63,528.72,522,528.6499,2054578
2,523.11,523.69,520.75,522.77,1422586
3,520.48,523.11,519.6501,523.09,1130409
4,518.28,520.579,517.86,520.34,1215466
5,518.8501,519.48,517.33,517.94,832100
6,518.685,520.22,518.63,518.85,565411
7,516.55,519.2,516.55,518.64,617281
...
...
Note the first value of the first column, a1357828200. My intuition was that this has something to do with POSIXct. Hence a quick check:
> as.POSIXct(1357828200, origin = '1970-01-01', tz='EST')
[1] "2013-01-10 14:30:00 EST"
So my intuition seems to be correct, but the time seems to be off. Now, we have one more piece of info in the data: TIMEZONE_OFFSET=-300. So if we offset our timestamps by this amount we should get:
> as.POSIXct(1357828200-300*60, origin = '1970-01-01', tz='EST')
[1] "2013-01-10 09:30:00 EST"
Note that I didn't know which day's data you had requested, but a quick check on Google Finance reveals that those were indeed the price levels on 10th Jan 2013.
The remaining values in the first column seem to be some sort of offset from the first row's value.
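Putting that interpretation together, here is a sketch of decoding the whole DATE column in R. The assumptions are mine, not documented by Google: values prefixed with 'a' are absolute Unix timestamps, and bare integers count INTERVAL-second periods since the most recent 'a' anchor; the offset arithmetic mirrors the `-300*60` adjustment above.

```r
# Sketch: decode Google's DATE column into POSIXct.
# Assumptions (my reading of the format, not documented behavior):
#  - values prefixed with 'a' are absolute Unix timestamps (anchors)
#  - bare integers are counts of INTERVAL-second periods since the
#    most recent anchor
decode_google_stamps <- function(date_col, interval = 900, tz_offset_min = -300) {
  is_anchor <- grepl('^a', date_col)
  num <- as.numeric(sub('^a', '', date_col))

  # carry the most recent anchor value forward down the column
  anchor <- ifelse(is_anchor, num, NA)
  for (i in seq_along(anchor))
    if (is.na(anchor[i]) && i > 1) anchor[i] <- anchor[i - 1]

  secs <- anchor + ifelse(is_anchor, 0, num) * interval
  as.POSIXct(secs + tz_offset_min * 60, origin = '1970-01-01', tz = 'EST')
}

# e.g. decode_google_stamps(c('a1357828200', '1', '2', '3'))
```

A file can contain several 'a' anchors (one per chunk of days), which is why the anchor is carried forward rather than taken once from the first row.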
#2
3
So downloading and standardizing the data ended up being much more of a bear than I figured it would: about 150 lines of code. The problem is that while Google provides the past 50 trading days of data for all exchange-traded stocks, the timestamps within those days are not standardized: an index of '1', for example, could refer to either the first or the second time increment on the first trading day in the data set. Even worse, stocks that trade at low volumes only have entries where a transaction was recorded. For a high-volume stock like AAPL that's no problem, but for low-volume small caps it means that your series will be missing much, if not the majority, of the data. This was problematic because I need all the stock series to lie neatly on top of each other for the analysis I'm doing.
Fortunately, there is still a general structure to the data. Using this link:
https://www.google.com/finance/getprices?i=1800&p=1000d&f=d,o,h,l,c,v&df=cpct&q=AAPL
and changing the stock ticker at the end will give you the past 50 trading days at half-hourly increments. POSIX timestamps, very helpfully decoded by @geektrader, appear in the timestamp column at three-week intervals. Though the timestamp indexes don't invariably correspond in a convenient 1:1 manner (I almost suspect this was intentional on Google's part), there is a pattern. For example, for the half-hourly series that I looked at, the first trading day of every three-week increment uniformly has timestamp indexes running in the 1:15 neighborhood. This could be 1:13, 1:14, 2:15; it all depends on the stock. I'm not sure what the 14th and 15th entries are: I suspect they are either daily summaries or after-hours trading info. The point is that there's no consistent pattern you can bank on. The first stamp in a trading day, sadly, does not always contain the opening data. Same thing for the last entry and the closing data. I found that the only way to know what actually represents the trading data is to compare the numbers to the series on Google Finance.

After days of futilely trying to figure out how to pry a 1:1 mapping pattern from the data, I settled on a "ballpark" strategy. I scraped AAPL's data (a very high-volume traded stock) and set its timestamp indexes within each trading day as the reference values for the entire market. All days had a minimum of 13 increments, corresponding to the 6.5-hour trading day, but some had 14 or 15. Where this was the case I just truncated by taking the first 13 indexes. From there I used a while loop to progress through the downloaded data of each stock ticker and compare its timestamp indexes within a given trading day to the AAPL timestamps. I kept the overlap, gap-filled the missing data, and cut out the non-overlapping portions.
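The core of that alignment step can be sketched roughly as follows. This is only the skeleton of the idea: `ref_times` and `stock_dat` are hypothetical stand-ins for the decoded reference grid and a ticker's data frame, and real code would also need the per-day index matching and special-case handling described above.

```r
# Rough sketch of the alignment idea: snap a ticker's bars onto a
# reference grid of AAPL timestamps, forward-filling gaps for thinly
# traded stocks. ref_times (POSIXct vector) and stock_dat (data frame
# with a 'time' column) are hypothetical stand-ins.
align_to_reference <- function(ref_times, stock_dat) {
  merged <- merge(data.frame(time = ref_times), stock_dat,
                  by = 'time', all.x = TRUE)   # keep only grid rows
  # gap-fill: last observation carried forward
  for (col in setdiff(names(merged), 'time')) {
    v <- merged[[col]]
    for (i in seq_along(v))
      if (is.na(v[i]) && i > 1) v[i] <- v[i - 1]
    merged[[col]] <- v
  }
  merged                                       # non-overlap is dropped by all.x
}
```

`merge(..., all.x = TRUE)` keeps exactly the reference rows, which both gap-fills and cuts the non-overlapping portions in one pass; packages like zoo (`na.locf`) do the forward fill more efficiently than the explicit loop here.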
It sounds like a simple fix, but for low-volume stocks with sparse transaction data there were literally dozens of special cases that I had to bake in, and lots of data to interpolate. I got some pretty bizarre results for some of these that I know are incorrect. For high-volume, mid- and large-cap stocks, however, the solution worked brilliantly: for the most part the series synced up very neatly with the AAPL data and matched their Google Finance profiles perfectly.
There's no way around the fact that this method introduces some error, and I still need to fine-tune it for sparse small-caps. That said, shifting a series by half an hour or gap-filling a single time increment introduces a very minor amount of error relative to the overall movement of the market and the stock. I am confident that the data set I have is "good enough" to let me get relevant answers to some questions that I have. Getting this stuff commercially costs literally thousands of dollars.
Thoughts or suggestions?
#3
1
Why not load the data from Quandl? E.g.
library(Quandl)
Quandl('YAHOO/AAPL')
Update: sorry, I have just realized that Quandl only fetches daily data, but I'll leave my answer here, as Quandl is really easy to query in similar cases.
#4
0
For the timezone offset, try:
as.POSIXct(1357828200, origin = '1970-01-01', tz=Sys.timezone(location = TRUE))
(The tz will automatically adjust according to your location)