Given a DataFrame with multiple columns, how do we select values from specific columns by row to create a new Series?
df = pd.DataFrame({"A":[1,2,3,4],
"B":[10,20,30,40],
"C":[100,200,300,400]})
columns_to_select = ["B", "A", "A", "C"]
Goal: [10, 2, 3, 400]
One method that works is to use an apply statement.
df["cols"] = columns_to_select
df.apply(lambda x: x[x.cols], axis=1)
Unfortunately, this is not a vectorized operation and takes a long time on a large dataset. Any ideas would be appreciated.
2 Answers
#1
10
Pandas way:
In [22]: df['new'] = df.lookup(df.index, columns_to_select)
In [23]: df
Out[23]:
A B C new
0 1 10 100 10
1 2 20 200 2
2 3 30 300 3
3 4 40 400 400
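Note that `DataFrame.lookup` was deprecated in pandas 1.2 and removed in 2.0. On current pandas, a minimal sketch of an equivalent, using `Index.get_indexer` to map labels to positions and then NumPy advanced indexing (the same idea as the NumPy answer below), might look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [10, 20, 30, 40],
                   "C": [100, 200, 300, 400]})
columns_to_select = ["B", "A", "A", "C"]

# Translate column labels to integer positions, then pick one element per row.
col_idx = df.columns.get_indexer(columns_to_select)
new = pd.Series(df.to_numpy()[np.arange(len(df)), col_idx], index=df.index)
# new.tolist() -> [10, 2, 3, 400]
```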
#2
8
NumPy way
Here's a vectorized NumPy way using advanced indexing -
# Extract array data
In [10]: a = df.values
# Get integer based column IDs
In [11]: col_idx = np.searchsorted(df.columns, columns_to_select)
# Use NumPy's advanced indexing to extract relevant elem per row
In [12]: a[np.arange(len(col_idx)), col_idx]
Out[12]: array([ 10, 2, 3, 400])
If the column names of df are not sorted, we need to use the sorter argument with np.searchsorted. The code to extract col_idx for such a generic df would be:
# https://*.com/a/38489403/ @Divakar
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
So, col_idx would be obtained like so -
col_idx = column_index(df, columns_to_select)
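As a quick sanity check with a hypothetical df whose columns are deliberately out of sorted order, `column_index` still maps labels to their positional indices:

```python
import numpy as np
import pandas as pd

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

# Columns in non-sorted order: positions are C=0, A=1, B=2
df = pd.DataFrame({"C": [100, 200], "A": [1, 2], "B": [10, 20]})
col_idx = column_index(df, ["B", "A"])
# col_idx -> array([2, 1]), the positional indices of "B" and "A"
```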
Further optimization
Profiling it revealed that the bottleneck was processing strings with np.searchsorted, the usual NumPy weakness of not being so great with strings. So, to overcome that, and exploiting the special case of the column names being single letters, we can quickly convert them to numerals and then feed those to searchsorted for much faster processing.
Thus, an optimized version of getting the integer-based column IDs, for the case where the column names are single letters and sorted, would be -
def column_index_singlechar_sorted(df, query_cols):
    # np.fromstring on text is deprecated; build uint8 arrays from encoded bytes instead
    c0 = np.frombuffer(''.join(df.columns).encode(), dtype=np.uint8)
    c1 = np.frombuffer(''.join(query_cols).encode(), dtype=np.uint8)
    return np.searchsorted(c0, c1)
This gives us a modified version of the solution, like so -
a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])
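Putting the pieces together on the toy df from the question (columns are single letters and already sorted), an end-to-end check of the optimized path, using a frombuffer-based variant of the helper since np.fromstring is deprecated:

```python
import numpy as np
import pandas as pd

def column_index_singlechar_sorted(df, query_cols):
    # Encode single-letter column names as uint8 codes for fast searchsorted
    c0 = np.frombuffer(''.join(df.columns).encode(), dtype=np.uint8)
    c1 = np.frombuffer(''.join(query_cols).encode(), dtype=np.uint8)
    return np.searchsorted(c0, c1)

df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [10, 20, 30, 40],
                   "C": [100, 200, 300, 400]})
columns_to_select = ["B", "A", "A", "C"]

a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])
# out.tolist() -> [10, 2, 3, 400]
```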
Timings -
In [149]: # Setup df with 26 uppercase column letters and many rows
     ...: import string
     ...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))
     ...: s = list(string.ascii_uppercase[:df.shape[1]])  # string.uppercase is Python 2 only
     ...: df.columns = s
     ...: idx = np.random.randint(0,df.shape[1],len(df))
     ...: columns_to_select = np.take(s, idx).tolist()
# With df.lookup from @MaxU's soln
In [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))
10 loops, best of 3: 76.7 ms per loop
# With proposed one from this soln
In [151]: %%timeit
...: a = df.values
...: col_idx = column_index_singlechar_sorted(df, columns_to_select)
...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])
10 loops, best of 3: 59 ms per loop
Given that df.lookup solves the generic case, it's probably the better choice, but the other possible optimizations shown in this post could be handy as well!