I'm working with some data where I need to get the date of a previous occurrence. For example, say we're working with medical data. Each row is a unique visit from a patient, though the same patient can have multiple rows. Each row also records the type of visit: routine or emergency room.
I need to go through the data and, for each row, get the date on which the patient was previously admitted to the emergency room, prior to that visit. For example, I'd like to add a column previous_er_discharge_date as below:
visit_id  patient_id  discharge_date  visit_type  previous_er_discharge_date
1         abc         2014-05-05      in-patient  2014-05-01
2         abc         2014-05-01      emergency   NaT
3         def         2014-04-18      in-patient  NaT
4         def         2014-03-12      in-patient  2014-02-12
5         def         2014-02-12      emergency   NaT
I have something working, but it's very slow. I basically create a separate data frame of only ER visits, iterate through the main data frame, and check whether previous ER dates exist for that patient; if they do, I take the first one. (The data is sorted by discharge_date.) Here is a general representation of the code I have:
er_visits = main_data[main_data.visit_type == 'emergency']
prev_dates = []
for index, row in main_data.iterrows():
    dates = er_visits.discharge_date[(er_visits.patient_id == row.patient_id) &
                                     (er_visits.discharge_date < row.discharge_date)].values
    if len(dates) > 0:
        prev_dates.append(dates[0])
    else:
        prev_dates.append(pd.NaT)
The above code works, but it's slow, and I was hoping to get help finding faster ways to do this. The data set I'm working with has several hundred thousand rows, and this has to run every day.
Thanks!
2 Solutions
#1
12
In pandas, you basically want to avoid loops, as they kill performance.
Here's a DataFrame similar to yours (I was lazy about the dates, so they're ints; it's the same idea).
import pandas as pd

df = pd.DataFrame({
    'id': ['abc', 'abc', 'def', 'def', 'def'],
    'date': [505, 501, 418, 312, 212]})
And here's a function that, for each group, appends the previous date:
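def prev_dates(g):
    # Sort newest first; sort_values returns a new frame, so reassign it
    g = g.sort_values(by='date', ascending=False)
    # shift(-1) pulls the next row's (that is, the earlier) date up
    g['prev'] = g.date.shift(-1)
    return g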
So all that's needed is to connect things:
>>> df.groupby(df.id).apply(prev_dates)
   date   id  prev
0   505  abc   501
1   501  abc   NaN
2   418  def   312
3   312  def   212
4   212  def   NaN
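The same per-group result can probably be obtained without a custom function by calling shift on the grouped column directly, which avoids the overhead of apply. This is a minimal sketch, not the answerer's code; it assumes the toy frame above and that "previous" means the next-earlier date for the same id.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['abc', 'abc', 'def', 'def', 'def'],
    'date': [505, 501, 418, 312, 212],
})

# Sort so that, within each id, earlier dates come first.
df = df.sort_values(['id', 'date'])

# shift(1) within each group copies the previous (earlier) date onto each row;
# the earliest visit of every id has no predecessor and gets NaN.
df['prev'] = df.groupby('id')['date'].shift(1)

# Restore the original row order.
df = df.sort_index()
```

Because the shift is computed per group, a date never leaks from one id into the next.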
Edit
As noted by @julius below, sort(columns=...) has since been deprecated; use sort_values(by=...) instead.
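The function above returns the previous visit of any type, whereas the question asks for the previous emergency visit specifically. One way to get that without a Python loop is pd.merge_asof, a different technique from the one in this answer. The sketch below rebuilds the sample data from the question and assumes "previous" means the latest ER discharge strictly before the row's own discharge date for the same patient.

```python
import pandas as pd

main_data = pd.DataFrame({
    'visit_id': [1, 2, 3, 4, 5],
    'patient_id': ['abc', 'abc', 'def', 'def', 'def'],
    'discharge_date': pd.to_datetime(
        ['2014-05-05', '2014-05-01', '2014-04-18', '2014-03-12', '2014-02-12']),
    'visit_type': ['in-patient', 'emergency', 'in-patient',
                   'in-patient', 'emergency'],
})

# merge_asof requires both frames to be sorted by the "on" key.
visits = main_data.sort_values('discharge_date')

# ER visits only, with the date duplicated under the name of the new column.
er_visits = visits.loc[visits['visit_type'] == 'emergency',
                       ['patient_id', 'discharge_date']].copy()
er_visits['previous_er_discharge_date'] = er_visits['discharge_date']

# For each visit, take the last ER row of the same patient with a strictly
# earlier discharge_date (allow_exact_matches=False excludes the row itself).
result = pd.merge_asof(
    visits, er_visits,
    on='discharge_date', by='patient_id',
    direction='backward', allow_exact_matches=False,
).sort_values('visit_id').reset_index(drop=True)
```

Under this rule visit 3 receives 2014-02-12, since that ER discharge precedes it; the loop in the question applies the same strictly-earlier comparison.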
#2
0
What if you need to find all visits for that patient?
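import numpy as np

# Sort by ID first so each patient's rows are adjacent, then by Date
df = df.sort_values(['ID', 'Date'])
df['nextpatient'] = df['ID'].shift(-1)
df['nextvisit'] = np.where(df['ID'] == df['nextpatient'], 1, 0)
# Series.where leaves NaT wherever the next row is a different patient
df['nextdate'] = df['Date'].shift(-1).where(df['nextvisit'] == 1)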
That's my approach (typed on my phone, so it's not exact). I sort and then shift the unique ID. If the shifted ID matches the row's ID, I shift the date up as well. Then I create a column to measure the time between interactions, and another column for the reason for the visit, which is just another shift.
I wonder if this is a good approach too in terms of speed. I run it about weekly on a 5 million row data set.
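The sort-and-shift idea described above can be sketched end to end as follows. The sample rows and the column names ID, Date, Reason, next_date, next_reason and days_to_next are assumptions made for illustration, not taken from the poster's data.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    'ID': ['abc', 'def', 'abc', 'def', 'def'],
    'Date': pd.to_datetime(
        ['2014-05-05', '2014-04-18', '2014-05-01', '2014-03-12', '2014-02-12']),
    'Reason': ['in-patient', 'in-patient', 'emergency',
               'in-patient', 'emergency'],
})

# Sort so that each patient's visits are adjacent and in date order.
df = df.sort_values(['ID', 'Date']).reset_index(drop=True)

# A row has a "next visit" only if the following row is the same patient.
same_patient = df['ID'].eq(df['ID'].shift(-1))

# Shift the date and reason up by one row, blanking cross-patient matches.
df['next_date'] = df['Date'].shift(-1).where(same_patient)
df['next_reason'] = df['Reason'].shift(-1).where(same_patient)

# Time between consecutive interactions of the same patient, in days.
df['days_to_next'] = (df['next_date'] - df['Date']).dt.days
```

Everything here is a whole-column operation, so the cost should be dominated by the sort rather than by any per-row Python work.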