I have a pandas data frame mydf that has two columns, both of datetime type: mydate and mytime. I want to add three more columns: hour, weekday, and weeknum.
def getH(t):  # gives the hour
    return t.hour
def getW(d):  # gives the ISO week number
    return d.isocalendar()[1]
def getD(d):  # gives the weekday
    return d.weekday()  # 0 for Monday, 6 for Sunday
mydf["hour"] = mydf.apply(lambda row: getH(row["mytime"]), axis=1)
mydf["weekday"] = mydf.apply(lambda row: getD(row["mydate"]), axis=1)
mydf["weeknum"] = mydf.apply(lambda row: getW(row["mydate"]), axis=1)
The snippet works, but it is not computationally efficient because it loops through the data frame three times, once per new column. Is there a faster or more idiomatic way to do this, for example using zip or merge? If I write a single function that returns three elements, how should I apply it? To illustrate, the function would be:
def getHWd(d, t):
    return t.hour, d.isocalendar()[1], d.weekday()
4 Answers
#1
Here's one approach that does it with a single apply.
Say df looks like this:
In [64]: df
Out[64]:
mydate mytime
0 2011-01-01 2011-11-14
1 2011-01-02 2011-11-15
2 2011-01-03 2011-11-16
3 2011-01-04 2011-11-17
4 2011-01-05 2011-11-18
5 2011-01-06 2011-11-19
6 2011-01-07 2011-11-20
7 2011-01-08 2011-11-21
8 2011-01-09 2011-11-22
9 2011-01-10 2011-11-23
10 2011-01-11 2011-11-24
11 2011-01-12 2011-11-25
We'll move the lambda function onto its own line for readability and define it like this:
In [65]: lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                           x['mydate'].isocalendar()[1],
                                           x['mydate'].weekday()])
Then apply it and store the result in df[['hour', 'weeknum', 'weekday']]. The column names must follow the order in which lambdafunc returns the values (hour, week number, weekday):
In [66]: df[['hour', 'weeknum', 'weekday']] = df.apply(lambdafunc, axis=1)
The output looks like this:
In [67]: df
Out[67]:
mydate mytime hour weeknum weekday
0 2011-01-01 2011-11-14 0 52 5
1 2011-01-02 2011-11-15 0 52 6
2 2011-01-03 2011-11-16 0 1 0
3 2011-01-04 2011-11-17 0 1 1
4 2011-01-05 2011-11-18 0 1 2
5 2011-01-06 2011-11-19 0 1 3
6 2011-01-07 2011-11-20 0 1 4
7 2011-01-08 2011-11-21 0 1 5
8 2011-01-09 2011-11-22 0 1 6
9 2011-01-10 2011-11-23 0 2 0
10 2011-01-11 2011-11-24 0 2 1
11 2011-01-12 2011-11-25 0 2 2
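As a self-contained sketch of the same single-apply idea: instead of building a pd.Series for every row, the function can return a plain tuple and apply can expand it into columns with result_type="expand". The sample data and the name get_hwd below are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "mydate": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    "mytime": pd.to_datetime(["2011-11-14 08:30", "2011-11-15 13:00",
                              "2011-11-16 23:59"]),
})

def get_hwd(row):
    # Return (hour, ISO week number, weekday) for a single row.
    d, t = row["mydate"], row["mytime"]
    return t.hour, d.isocalendar()[1], d.weekday()

# result_type="expand" turns each returned tuple into columns 0, 1, 2.
expanded = df.apply(get_hwd, axis=1, result_type="expand")
# Name the columns in the same order as the tuple elements.
expanded.columns = ["hour", "weeknum", "weekday"]
df = df.join(expanded)
print(df)
```

This still calls a Python function once per row, so it mainly saves two of the three passes rather than removing the row-wise loop.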
#2
To complement John Galt's answer:
Depending on the task performed by lambdafunc, you may see some speedup by storing the result of apply in a new DataFrame and then joining it with the original:
lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                  x['mydate'].isocalendar()[1],
                                  x['mydate'].weekday()])
newcols = df.apply(lambdafunc, axis=1)
# Names must follow the order of the values returned by lambdafunc.
newcols.columns = ['hour', 'weeknum', 'weekday']
newdf = df.join(newcols)
Even if you do not see a speed improvement, I would recommend using join. It lets you avoid the (always annoying) SettingWithCopyWarning that may pop up when assigning directly to the columns:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
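This warning usually shows up when the frame being assigned to is itself a filtered slice of another frame. A minimal sketch of one common way around it, assuming a hypothetical parent frame full with a flag column keep (neither appears in the original question), is to take an explicit copy before adding columns:

```python
import pandas as pd

# Hypothetical parent frame; "keep" is an illustrative filter column.
full = pd.DataFrame({
    "mydate": pd.to_datetime(["2011-01-01", "2011-01-03", "2011-01-10"]),
    "keep": [True, False, True],
})

# .copy() makes the subset own its data, so adding a column to it
# is unambiguous and does not touch the parent frame.
subset = full[full["keep"]].copy()
subset["weekday"] = subset["mydate"].dt.weekday  # 0 = Monday, 6 = Sunday

print(subset)
```

The join approach above has a similar effect, because join returns a new DataFrame instead of modifying the one it is called on.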
#3
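def getWd(d):
    # returns (week number, weekday)
    return d.isocalendar()[1], d.weekday()
def getH(t):
    return t.hour
# A single value per row needs no zip; map returns the column directly.
mydf["hour"] = mydf["mytime"].map(getH)
# zip(*...) splits the (weeknum, weekday) tuples into two columns.
mydf["weeknum"], mydf["weekday"] = zip(*mydf["mydate"].map(getWd))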
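If the columns really have a datetime64 dtype, the same three values can probably be obtained without calling a Python function per row at all, by using the .dt accessor. The sketch below is an alternative to the map-based code rather than the answerer's method; the sample data is made up, and Series.dt.isocalendar() assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
mydf = pd.DataFrame({
    "mydate": pd.to_datetime(["2011-01-01", "2011-01-03", "2011-01-10"]),
    "mytime": pd.to_datetime(["2011-11-14 08:30", "2011-11-15 13:00",
                              "2011-11-16 23:59"]),
})

# Each line works on the whole column at once.
mydf["hour"] = mydf["mytime"].dt.hour
mydf["weekday"] = mydf["mydate"].dt.weekday  # 0 = Monday, 6 = Sunday
# isocalendar() returns a frame with year, week and day columns;
# astype(int) converts the nullable UInt32 week column to plain integers.
mydf["weeknum"] = mydf["mydate"].dt.isocalendar().week.astype(int)

print(mydf)
```

On large frames this tends to be much faster than apply or map, since the work happens in vectorized code instead of a Python loop.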
#4
You can do this somewhat more cleanly by having the function you apply return a pd.Series with named elements:
import pandas as pd
def process(row):
    # The dict keys become the names of the new columns.
    return pd.Series(dict(b=row["a"] * 2, c=row["a"] + 2))
my_df = pd.DataFrame(dict(a=range(10)))
new_df = my_df.join(my_df.apply(process, axis="columns"))
The result is:
a b c
0 0 0 2
1 1 2 3
2 2 4 4
3 3 6 5
4 4 8 6
5 5 10 7
6 6 12 8
7 7 14 9
8 8 16 10
9 9 18 11
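Applied to the columns from the question, the same pattern might look like the sketch below. The sample data is invented for illustration; the point is that naming the Series elements removes any dependence on the order in which values are returned.

```python
import pandas as pd

def process(row):
    # The keys of the dict become the new column names.
    d, t = row["mydate"], row["mytime"]
    return pd.Series({
        "hour": t.hour,
        "weeknum": d.isocalendar()[1],
        "weekday": d.weekday(),  # 0 = Monday, 6 = Sunday
    })

# Hypothetical sample data; the column names follow the question.
mydf = pd.DataFrame({
    "mydate": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    "mytime": pd.to_datetime(["2011-11-14 08:30", "2011-11-15 13:00",
                              "2011-11-16 23:59"]),
})

new_df = mydf.join(mydf.apply(process, axis="columns"))
print(new_df)
```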