I have a relatively big dataframe that looks like this:
(I have uploaded the csv file here - ufile.io/526t4)
value
0 [[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],[182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]]
1 [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]
2 [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]
3 [[20,79,"D"]]
...
12352 [[25,36,"S"],[37,89,"C"],[90,115,"S"]]
12353 [[1,16,"D"],[17,407,"C"],[408,416,"D"]]
12354 [[16,21,"D"],[22,108,"C"],[109,123,"D"],[124,164,"C"],[165,421,"S"]]
12355 rows × 1 columns
I want to create a new column that holds, for each row, the sum of (end - start) over all "D" entries.
Using the first row as an example:
x = [[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],[182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]]
new_column_D = sum(y[1] - y[0] for y in x if y[2] == "D")  # to be applied to every row
For the first row, new_column_D would be 130, i.e. (92 - 1) + (393 - 354).
I have tried the following:
df['Column_D']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))
but I get the following message: IndexError: string index out of range
IndexError Traceback (most recent call last)
<ipython-input-7-f7f23d42d4e5> in <module>()
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))
~\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2549 else:
2550 values = self.asobject
-> 2551 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-7-f7f23d42d4e5> in <lambda>(x)
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))
<ipython-input-7-f7f23d42d4e5> in <listcomp>(.0)
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))
IndexError: string index out of range
1 Answer
You are very close. You can structure your calculation in a list comprehension. Then assign the list to a series.
You may feel you are vectorising a calculation by using pd.Series.apply, but this is not the case: apply is just a thinly veiled loop with some additional overhead.
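As a small illustration of that point, here is a sketch showing that apply and a plain list comprehension call the same Python function once per row and give the same result. The frame df_demo and the helper d_total are made-up names for this example, and the cells are assumed to already be real Python lists.

```python
import pandas as pd

# Hypothetical two-row frame whose cells are real Python lists (not strings).
df_demo = pd.DataFrame({"value": [
    [[1, 92, "D"], [93, 93, "C"], [354, 393, "D"]],
    [[18, 23, "D"], [24, 27, "C"]],
]})

def d_total(segments):
    """Sum end - start over the segments labelled "D"."""
    return sum(end - start for start, end, label in segments if label == "D")

# Both variants run d_total once per row; neither is vectorised.
via_apply = df_demo["value"].apply(d_total).tolist()
via_loop = [d_total(segments) for segments in df_demo["value"]]
```

Both lists come out identical, so the choice between them is about overhead and readability rather than about the result.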
import pandas as pd

# Sample frame: each cell is a list of [start, end, label] lists
df = pd.DataFrame({'value': [
    [[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],
     [182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]],
    [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]],
    [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]]})
# Overwrite 'value' with the per-row sum of (end - start) over the "D" entries
df['value'] = [sum(y[1] - y[0] for y in x if y[2] == "D") for x in df['value']]
print(df)
value
0 130
1 5
2 5
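One more thing worth checking: "string index out of range" suggests that, after reading the CSV, each cell is a string rather than a list, so iterating over it yields single characters and y[2] fails. Below is a minimal sketch under that assumption. The frame raw is a stand-in for the loaded file (I have not opened the uploaded CSV), and json.loads is used because the cells shown in the question look like valid JSON; ast.literal_eval would be an alternative.

```python
import json
import pandas as pd

# Stand-in for what pd.read_csv would return: each cell is a string (assumption).
raw = pd.DataFrame({"value": [
    '[[1,92,"D"],[93,93,"C"],[354,393,"D"]]',
    '[[20,79,"D"]]',
    '[[25,36,"S"],[37,89,"C"],[90,115,"S"]]',
]})

# Parse each JSON-like string into a real list of [start, end, label] lists.
parsed = [json.loads(cell) for cell in raw["value"]]

# Same calculation as above, now on real lists; rows without "D" give 0.
raw["Column_D"] = [
    sum(end - start for start, end, label in segments if label == "D")
    for segments in parsed
]
```

If the real column already holds lists, the parsing step is unnecessary and json.loads would raise a TypeError, so check type(df["value"].iloc[0]) first.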