I have a Pandas DataFrame with two sets of dates: a DatetimeIndex for the index and a column named date2 containing datetime objects, plus a value and an id. For some ids I am missing the row where date2 is equal to the index. In that case I want to add the row and fill its values from the previous DatetimeIndex entry for the same id. The index (date1) represents the current point in time, and date2 represents the last date. Each df[df.id == id] can be treated as its own DataFrame, but the data is stored in one giant DataFrame of 500k rows.
Example: Given
date2 id value
index
2006-01-24 2006-01-26 3 3
2006-01-25 2006-01-26 1 1
2006-01-25 2006-01-26 2 2
2006-01-26 2006-01-26 2 2.1
2006-01-27 2006-02-26 4 4
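For anyone who wants to experiment with this, the example frame can be rebuilt roughly as follows. This is only a sketch of the printout above: the dtypes (float values, integer ids) and the index name "index" are assumptions read off the display, not something the question states.

```python
import pandas as pd

# Rebuild the example frame. The DatetimeIndex is named "index" as in the printout;
# dtypes are assumed from the display.
df = pd.DataFrame(
    {
        "date2": pd.to_datetime(
            ["2006-01-26", "2006-01-26", "2006-01-26", "2006-01-26", "2006-02-26"]
        ),
        "id": [3, 1, 2, 2, 4],
        "value": [3.0, 1.0, 2.0, 2.1, 4.0],
    },
    index=pd.DatetimeIndex(
        ["2006-01-24", "2006-01-25", "2006-01-25", "2006-01-26", "2006-01-27"],
        name="index",
    ),
)
print(df)
```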
In this example, we're missing an index == date2 row for id 1, id 3 and id 4 (id 2 already has one). I'd like to fill in each missing row with the value from the previous index entry for its id.
I'd like to return:
date2 id value
index
2006-01-24 2006-01-26 3 3
2006-01-25 2006-01-26 1 1
2006-01-25 2006-01-26 2 2
2006-01-26 2006-01-26 1 1 #<---- row added
2006-01-26 2006-01-26 2 2.1
2006-01-26 2006-01-26 3 3 #<---- row added
2006-01-27 2006-02-26 4 4
2006-02-26 2006-02-26 4 4 #<---- row added
2 Solutions
#1
I'm slightly reluctant to answer b/c it seems @chrisb may have successfully answered the original question, which later changed. However, Chris hasn't updated the answer in a few days and this answer does take a different approach so I'm going to +1 Chris's answer and add this one.
First, just create a new dataframe from the original with 'index'='date2'. This will be the basis for appending to the existing dataframe (note that 'index' is a column here, not an index):
# .copy() avoids a SettingWithCopyWarning on the assignments below
df2 = df[df['index'] != df['date2']].copy()
df2['index'] = df2['date2']
df2['value'] = np.nan
index date2 id value
0 2006-01-26 2006-01-26 3 NaN
1 2006-01-26 2006-01-26 1 NaN
2 2006-01-26 2006-01-26 2 NaN
4 2006-02-26 2006-02-26 4 NaN
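Note that the candidate frame above still contains a row for id 2, which already has an index == date2 row in the original data. As a quick sanity check before appending, something like the following sketch lists the (id, date2) pairs that really lack such a row. The names `has_final` and `missing` are made up for illustration, and the sample frame is retyped from the question with 'index' as a column.

```python
import pandas as pd

# Sample data retyped from the question, with 'index' as a regular column.
df = pd.DataFrame({
    "index": pd.to_datetime(
        ["2006-01-24", "2006-01-25", "2006-01-25", "2006-01-26", "2006-01-27"]
    ),
    "date2": pd.to_datetime(
        ["2006-01-26", "2006-01-26", "2006-01-26", "2006-01-26", "2006-02-26"]
    ),
    "id": [3, 1, 2, 2, 4],
    "value": [3.0, 1.0, 2.0, 2.1, 4.0],
})

# For each (id, date2) pair, check whether any row already has index == date2.
has_final = (df["index"] == df["date2"]).groupby([df["id"], df["date2"]]).any()

# Pairs with no such row are the ones that need a new row.
missing = has_final[~has_final].index.tolist()
print(missing)
```

On the sample data this should report ids 1, 3 and 4, which matches the rows marked as added in the question.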
Now, just append all of these, but drop the ones we don't need (if we already have an existing row with 'index'='date2', as for id=2 here):
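df3 = pd.concat([df, df2])   # DataFrame.append was removed in pandas 2.0
df3 = df3.drop_duplicates(['index','date2','id'])
df3 = df3.reset_index(drop=True).sort_values(['id','index','date2'])   # sort() is now sort_values()
df3['value'] = df3['value'].ffill()   # fillna(method='ffill') is deprecated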
index date2 id value
1 2006-01-25 2006-01-26 1 1.0
6 2006-01-26 2006-01-26 1 1.0
2 2006-01-25 2006-01-26 2 2.0
3 2006-01-26 2006-01-26 2 2.1
0 2006-01-24 2006-01-26 3 3.0
5 2006-01-26 2006-01-26 3 3.0
4 2006-01-27 2006-02-26 4 4.0
7 2006-02-26 2006-02-26 4 4.0
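Putting the steps of this answer together, a minimal sketch for current pandas versions might look like the function below. The name `add_final_rows` is hypothetical, and the per-id forward fill is a small change from the answer (which fills over the whole sorted frame); it is there so that a value can never carry over from one id to the next.

```python
import numpy as np
import pandas as pd


def add_final_rows(df):
    """Add an index == date2 row for every (id, date2) pair that lacks one."""
    # Candidate rows: copy rows whose index differs from date2 and move them to date2.
    extra = df[df["index"] != df["date2"]].copy()
    extra["index"] = extra["date2"]
    extra["value"] = np.nan
    # Originals come first, so drop_duplicates keeps real rows over candidates.
    out = pd.concat([df, extra], ignore_index=True)
    out = out.drop_duplicates(["index", "date2", "id"])
    out = out.sort_values(["id", "index", "date2"])
    # Forward-fill within each id only.
    out["value"] = out.groupby("id")["value"].ffill()
    return out.reset_index(drop=True)


# Sample data retyped from the question, with 'index' as a regular column.
df = pd.DataFrame({
    "index": pd.to_datetime(
        ["2006-01-24", "2006-01-25", "2006-01-25", "2006-01-26", "2006-01-27"]
    ),
    "date2": pd.to_datetime(
        ["2006-01-26", "2006-01-26", "2006-01-26", "2006-01-26", "2006-02-26"]
    ),
    "id": [3, 1, 2, 2, 4],
    "value": [3.0, 1.0, 2.0, 2.1, 4.0],
})

result = add_final_rows(df)
print(result)
```

On a 500k-row frame this stays vectorized, since there is no Python-level loop over ids.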
#2
This isn't very clean, but it is a possible solution. First, I moved the index into a column, date1:
In [228]: df
Out[228]:
date1 date2 id value
0 2006-01-24 2006-01-26 3 3.0
1 2006-01-25 2006-01-26 1 1.0
2 2006-01-25 2006-01-26 2 2.0
3 2006-01-26 2006-01-26 2 2.1
Then I grouped by each pair of dates, adding ids to those pairs that match. This involves breaking the DataFrame into a list of subframes and using concat to stick them back together.
In [229]: dfs = []
     ...: for (date1, date2), df_gb in df.groupby(['date1','date2']):
     ...:     if date1 == date2:
     ...:         to_add = list(set([1,2,3]) - set(df_gb['id']))
     ...:         new_rows = pd.DataFrame({'id': to_add, 'date1': date1, 'date2': date2, 'value': np.nan})
     ...:         df_gb = pd.concat([df_gb, new_rows], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
     ...:     dfs.append(df_gb)
In [231]: df = pd.concat(dfs, ignore_index=True)
In [232]: df
Out[232]:
date1 date2 id value
0 2006-01-24 2006-01-26 3 3.0
1 2006-01-25 2006-01-26 1 1.0
2 2006-01-25 2006-01-26 2 2.0
3 2006-01-26 2006-01-26 2 2.1
4 2006-01-26 2006-01-26 1 NaN
5 2006-01-26 2006-01-26 3 NaN
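The loop above hard-codes the id set [1,2,3]. One possible variation, sketched here under the assumption that the ids expected on a given date2 are exactly those that appear anywhere with that date2, is to derive the set from the data. The names `ids_by_date2`, `filler` and `filled` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data retyped from Out[228] above.
df = pd.DataFrame({
    "date1": pd.to_datetime(["2006-01-24", "2006-01-25", "2006-01-25", "2006-01-26"]),
    "date2": pd.to_datetime(["2006-01-26"] * 4),
    "id": [3, 1, 2, 2],
    "value": [3.0, 1.0, 2.0, 2.1],
})

# All ids that ever appear with a given date2 (replaces the hard-coded set).
ids_by_date2 = {d2: set(ids) for d2, ids in df.groupby("date2")["id"]}

dfs = []
for (date1, date2), df_gb in df.groupby(["date1", "date2"]):
    if date1 == date2:
        to_add = sorted(ids_by_date2[date2] - set(df_gb["id"]))
        if to_add:
            filler = pd.DataFrame(
                {"id": to_add, "date1": date1, "date2": date2, "value": np.nan}
            )
            df_gb = pd.concat([df_gb, filler], ignore_index=True)
    dfs.append(df_gb)

filled = pd.concat(dfs, ignore_index=True)
print(filled)
```

Like the original loop, this only adds rows for date pairs where at least one id already has date1 == date2.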
Finally, I sorted and filled the missing values.
In [233]: df = df.sort_values(['id', 'date1', 'date2'])  # sort() is now sort_values()
In [234]: df = df.ffill()  # fillna(method='ffill') is deprecated
In [236]: df.sort_values(['date1', 'date2'])
Out[236]:
date1 date2 id value
0 2006-01-24 2006-01-26 3 3.0
1 2006-01-25 2006-01-26 1 1.0
2 2006-01-25 2006-01-26 2 2.0
4 2006-01-26 2006-01-26 1 1.0
3 2006-01-26 2006-01-26 2 2.1
5 2006-01-26 2006-01-26 3 3.0
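One caveat with forward-filling over the whole sorted frame: if the first row of an id happens to be NaN, it picks up the last value of the previous id. That does not occur in the example here, but on a large frame it may be safer to fill within each id. The toy frame `demo` below is made up purely to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, already sorted by id, where id 2 starts with a missing value.
demo = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "value": [1.0, np.nan, np.nan, 5.0],
})

# Plain forward fill: the first row of id 2 inherits the value from id 1.
plain = demo["value"].ffill()

# Grouped forward fill: the first row of id 2 stays NaN.
grouped = demo.groupby("id")["value"].ffill()

print(pd.DataFrame({"id": demo["id"], "plain": plain, "grouped": grouped}))
```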