I have a data frame, let's say "df". One of its columns is named "itemID". I would like to look up the row index for a given value in the "itemID" column, and to do it very fast.
When I do:
df[df['itemID']==X]
The performance is quite slow.
Is there a way to create something like a hash-index in order to do the above?
1 solution
#1
I believe you can use dask.
Docs say:
The following class of computations works well:
Trivially parallelizable operations (fast):
Row-wise selections: df[df.x > 0]
You can also check how to create Dask DataFrames.
Example
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                   'itemID': [1, 2, 4, 4]})
print(df)
    A  itemID
0  A0       1
1  A1       2
2  A2       4
3  A3       4

# Construct a Dask object from a pandas object
df_dask = dd.from_pandas(df, npartitions=3)

# Row-wise selection
print(df_dask[df_dask.itemID == 4].compute())
    A  itemID
2  A2       4
3  A3       4
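If what you want is closer to the hash-index idea from the question, here is a minimal sketch in plain pandas rather than Dask. It assumes the data frame fits in memory and that you do many lookups against the same frame, so building the mapping once pays off. The name `positions` is only illustrative; `GroupBy.indices` is the pandas attribute that returns a dict from each group key to the positional row indices of that group.

```python
import pandas as pd

df = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                   'itemID': [1, 2, 4, 4]})

# Build the lookup once: a dict mapping each itemID value to an
# array of positional row indices (the upfront cost is one pass over the column).
positions = df.groupby('itemID').indices

# Each later lookup is a dict access instead of a scan of the whole column.
rows = positions[4]      # positional indices of rows where itemID == 4
print(rows)

# Use the positions to fetch the matching rows when needed.
print(df.iloc[rows])
```

Whether this beats the boolean mask depends on how many lookups you do; for a single lookup the cost of building the dict will likely dominate.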