I have Amazon price data for around 8.5k products covering the period from Feb. 1, 2015 to Oct. 31, 2015. Currently, each product's data is a dictionary whose keys are the number of days from the base date and whose values are the new price starting that day. For example, here the price is $10 from day 1, changes to $15 on day 45, then changes to $9 on day 173 and doesn't change after that.
{1:10,
45:15,
.
.
.
173:9}
What is the best way to store such a time series for easy manipulation in Python? I would like to perform a lot of aggregations and also query the price on a particular date. Lastly, I will be running some fixed-effects regressions, and I am unsure how best to store this time series so that the programming becomes comparatively simple. I could store it as a table with 273 columns (one per day) and 8.5k rows (one per product). I've been looking at the pandas module, which can help me do this, but is there a better way? Thanks!
2 answers
#1
3
You could use a dict of dicts, convert it into a pandas DataFrame, and use numpy for the calculations. The outer key would be the product and the inner dict would be the one you already have. It won't print in the format you suggested by default, but all you need to do is transpose it. A quick example:
import pandas as pd
d = {'Product1': {1:10, 45:15, 173:9}, 'Product2': {1:11, 100:50, 173:10}}
df = pd.DataFrame(d).T
print(df)
           1   45  100  173
Product1  10   15  NaN    9
Product2  11  NaN   50   10
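The transposed table above only has columns for the days on which some price changed, so looking up the price on an arbitrary day still needs one more step. A minimal sketch of one way to do it, assuming the window is days 1 to 273 (Feb. 1 to Oct. 31, 2015) and that a price stays in force until the next change; the names `changes`, `all_days` and `daily` are only illustrative:

```python
import pandas as pd

# Change points per product: day number -> new price starting that day.
d = {'Product1': {1: 10, 45: 15, 173: 9},
     'Product2': {1: 11, 100: 50, 173: 10}}

# Rows are products, columns are the days on which some price changed.
changes = pd.DataFrame(d).T

# Assumed observation window: day 1 through day 273.
all_days = list(range(1, 274))

# Add every missing day as a NaN column, then carry the last known
# price forward along each row.
daily = changes.reindex(columns=all_days).ffill(axis=1)

print(daily.loc['Product1', 44])   # 10.0 (day before the first change)
print(daily.loc['Product1', 45])   # 15.0 (day of the first change)
print(daily.loc['Product2', 120])  # 50.0 (between two changes)
```

This gives the 8.5k x 273 layout mentioned in the question, at the cost of storing every day explicitly.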
#2
1
With 8.5k products and 270+ days, I would recommend building the DataFrame this way:
import pandas as pd
price_dic = {1: 10, 2: 11, 3: 12, 5: 15}
# Wrap the dict views in list() so this also works on Python 3.
df = pd.DataFrame({'days': list(price_dic.keys()), 'price': list(price_dic.values())})
df['prod_name'] = "Knote"
df
Out[80]:
   days  price prod_name
0     1     10     Knote
1     2     11     Knote
2     3     12     Knote
3     5     15     Knote
df['Date'] = pd.to_datetime("Feb. 1, 2015") + pd.to_timedelta(df.days,'D')
df
Out[82]:
   days  price prod_name       Date
0     1     10     Knote 2015-02-02
1     2     11     Knote 2015-02-03
2     3     12     Knote 2015-02-04
3     5     15     Knote 2015-02-06
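Since the question also asks about querying the price on a particular date, here is a small sketch of how that could look for one product using `Series.asof`, which returns the last value at or before a given index label. It assumes the same convention as above (date = base date + day number); `base_date`, `changes` and `price_on` are hypothetical names introduced for illustration:

```python
import pandas as pd

base_date = pd.Timestamp('2015-02-01')  # assumed base date
price_dic = {1: 10, 2: 11, 3: 12, 5: 15}

# Series of change points, sorted by day, then re-indexed by calendar date.
changes = pd.Series(price_dic).sort_index()
changes.index = base_date + pd.to_timedelta(changes.index, unit='D')

def price_on(date):
    """Return the most recent price at or before `date`.

    Gives NaN if `date` is earlier than the first recorded change.
    """
    return changes.asof(pd.Timestamp(date))

print(price_on('2015-02-05'))  # 12: no change that day, so the Feb. 4 price holds
print(price_on('2015-02-06'))  # 15: the change recorded for day 5
```

This keeps only the change points in memory, so nothing has to be expanded to one row per day just to answer a lookup.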
Update:
Traversing the lists and building a final DataFrame with all the content:
Let's say you have a product list, a price list, and a start date list like the ones below. Then we could do:
product_list = [1001,1002,1003]
y_dict = [{1: 10, 2: 11, 3: 12, 5: 15},
{1: 10, 3: 11, 6: 12, 8: 15},
{1: 90, 2: 100, 7: 120, 9: 100}]
start_dt_list = ['Feb 05 2015','Feb 01 2015','Feb 06 2015']
fdf = pd.DataFrame(columns =['P_ID','Date','Price','Days'])
Out[73]:
Empty DataFrame
Columns: [P_ID, Date, Price, Days]
Index: []
for pid, j, st_dt in zip(product_list, y_dict, start_dt_list):
    # One small frame per product; list() keeps this working on Python 3.
    df = pd.DataFrame({'P_ID': pd.Series([pid] * len(j)),
                       'Date': pd.Series([pd.to_datetime(st_dt)] * len(j)),
                       'Price': pd.Series(list(j.values())),
                       'Days': pd.Series(list(j.keys()))})
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    fdf = pd.concat([fdf, df], ignore_index=True)
fdf.head(2)
Out[75]:
        Date  Days  P_ID  Price
0 2015-02-05     1  1001     10
1 2015-02-05     2  1001     11
fdf['Date'] = fdf['Date'] + pd.to_timedelta(fdf.Days,'D')
fdf
Out[77]:
         Date  Days  P_ID  Price
0  2015-02-06     1  1001     10
1  2015-02-07     2  1001     11
2  2015-02-08     3  1001     12
3  2015-02-10     5  1001     15
4  2015-02-09     8  1002     15
5  2015-02-02     1  1002     10
6  2015-02-04     3  1002     11
7  2015-02-07     6  1002     12
8  2015-02-07     1  1003     90
9  2015-02-08     2  1003    100
10 2015-02-15     9  1003    100
11 2015-02-13     7  1003    120
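The long table above holds one row per price change. For aggregations or a panel-style regression you would probably want one row per product per day instead. A rough sketch of that expansion, assuming a price stays in force until the next change and reusing the same sample lists; `frames`, `long_df`, `wide`, `daily` and `panel` are illustrative names, and the frames are collected in a list and concatenated once rather than grown inside the loop:

```python
import pandas as pd

product_list = [1001, 1002, 1003]
y_dict = [{1: 10, 2: 11, 3: 12, 5: 15},
          {1: 10, 3: 11, 6: 12, 8: 15},
          {1: 90, 2: 100, 7: 120, 9: 100}]
start_dt_list = ['Feb 05 2015', 'Feb 01 2015', 'Feb 06 2015']

# Build one small frame per product, then concatenate a single time.
frames = []
for pid, prices, st_dt in zip(product_list, y_dict, start_dt_list):
    days = pd.Series(list(prices.keys()))
    frames.append(pd.DataFrame({
        'P_ID': pid,
        'Date': pd.to_datetime(st_dt) + pd.to_timedelta(days, unit='D'),
        'Price': list(prices.values()),
    }))
long_df = pd.concat(frames, ignore_index=True)

# Wide table: one row per date with a change, one column per product.
wide = long_df.pivot(index='Date', columns='P_ID', values='Price')

# Fill in every calendar day and carry the last known price forward.
full_range = pd.date_range(wide.index.min(), wide.index.max(), freq='D')
daily = wide.reindex(full_range).ffill()

# Back to long form: one row per (date, product), skipping the days
# before a product's first recorded price.
panel = (daily.rename_axis('Date').reset_index()
              .melt(id_vars='Date', var_name='P_ID', value_name='Price')
              .dropna(subset=['Price']))
panel['P_ID'] = panel['P_ID'].astype(int)

# Example aggregation: average daily price per product.
mean_by_product = panel.groupby('P_ID')['Price'].mean()
print(mean_by_product)
```

A frame shaped like `panel` (entity, time, value) is the layout that fixed-effects estimators generally expect, though the exact input format depends on the regression library you choose.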