How to find the value closest to an input number in a Pandas Series?

Time: 2021-06-24 12:02:31

I have seen:

These relate to vanilla Python, not pandas.

If I have the series:

ix   num  
0    1
1    6
2    4
3    5
4    2

And I input 3, how can I (efficiently) find:

  1. The index of 3 if it is found in the series
  2. The indexes of the values just below and above 3 if it is not found in the series.

I.e., with the above series {1,6,4,5,2} and input 3, I should get values (4,2) with indexes (2,4).

5 solutions

#1


You could use argsort(), like this:

Say, input = 3

In [198]: input = 3

In [199]: df.ix[(df['num']-input).abs().argsort()[:2]]
Out[199]:
   num
2    4
4    2

df_sort is the dataframe with 2 closest values.

In [200]: df_sort = df.ix[(df['num']-input).abs().argsort()[:2]]

For index,

In [201]: df_sort.index.tolist()
Out[201]: [2, 4]

For values,

In [202]: df_sort['num'].tolist()
Out[202]: [4, 2]

For reference, the df used in the solution above was

In [197]: df
Out[197]:
   num
0    1
1    6
2    4
3    5
4    2
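
The session above relies on .ix, which was removed in pandas 1.0. A minimal sketch of the same argsort idea for current pandas follows; the helper name n_closest and its parameters are made up for illustration, and a stable sort is requested so that ties keep their original order.

```python
import pandas as pd


def n_closest(df, col, target, n=2):
    """Return the n rows of df whose df[col] is closest to target."""
    # Absolute distance of every row from the target value.
    distance = (df[col] - target).abs().to_numpy()
    # Row positions ordered from nearest to farthest; "stable" keeps ties
    # in their original order.
    order = distance.argsort(kind="stable")
    # iloc selects by position, matching what argsort returns.
    return df.iloc[order[:n]]


df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
df_sort = n_closest(df, "num", 3)
print(df_sort.index.tolist())   # [2, 4]
print(df_sort["num"].tolist())  # [4, 2]
```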

#2


As an addition to John Galt's answer, I recommend using iloc, since it works even with an unsorted integer index: .ix looks at the index labels first, whereas iloc is purely positional. (.ix was also deprecated in pandas 0.20 and removed in pandas 1.0.)

df.iloc[(df['num']-input).abs().argsort()[:2]]
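
To illustrate why positional selection matters here, the sketch below gives the question's data a reversed integer index (an assumption made purely for the demonstration). argsort returns row positions, so feeding them to a label-based lookup such as loc silently picks different rows.

```python
import pandas as pd

# Same values as in the question, but with an unsorted (reversed) integer index.
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]}, index=[4, 3, 2, 1, 0])
target = 3

# Positions of the two rows nearest to the target: array([2, 4]).
positions = (df["num"] - target).abs().to_numpy().argsort(kind="stable")[:2]

by_position = df.iloc[positions]  # rows at positions 2 and 4
by_label = df.loc[positions]      # rows labelled 2 and 4, which are different rows

print(by_position["num"].tolist())  # [4, 2]
print(by_label["num"].tolist())     # [4, 1]
```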

#3


A disadvantage of the other algorithms discussed here is that they have to sort the entire list. This results in a complexity of ~N log(N).

However, it is possible to achieve the same results in ~N. This approach separates the dataframe into two subsets, one smaller and one larger than the desired value. The lower neighbour is then the largest value in the smaller dataframe, and vice versa for the upper neighbour.

This gives the following code snippet:

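def find_neighbours(value):
    # Relies on a DataFrame `df` with a 'num' column in the enclosing scope.
    exactmatch = df[df.num == value]
    if not exactmatch.empty:
        return exactmatch.index[0]
    else:
        # Largest value below and smallest value above `value`.
        # These calls raise an error if no value lies on that side.
        lowerneighbour_ind = df[df.num < value].num.idxmax()
        upperneighbour_ind = df[df.num > value].num.idxmin()
        return lowerneighbour_ind, upperneighbour_ind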

This approach is similar to using partition in pandas, which can be really useful when dealing with large datasets where complexity becomes an issue.
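
A self-contained sketch of this linear-time idea, applied to the series from the question, might look like the following. The name find_neighbours_in, the explicit series argument, and the use of None for a missing neighbour are illustrative choices, not part of the answer above.

```python
import pandas as pd


def find_neighbours_in(series, value):
    """Return the label of value, or the labels of its lower and upper neighbours."""
    exact = series[series == value]
    if not exact.empty:
        return exact.index[0]
    below = series[series < value]
    above = series[series > value]
    # Largest value below and smallest value above; None when that side is empty.
    lower = below.idxmax() if not below.empty else None
    upper = above.idxmin() if not above.empty else None
    return lower, upper


s = pd.Series([1, 6, 4, 5, 2])
print(find_neighbours_in(s, 3))  # labels of the values 2 and 4
print(find_neighbours_in(s, 4))  # label of the exact match
```

Each boolean mask and each idxmax/idxmin call is a single pass over the data, so no sorting is involved.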

#4


If your series is already sorted, you could use something like this.

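import pandas as pd

def closest(df, col, val, direction):
    # Number of rows with df[col] <= val; in a sorted column this is the
    # position of the first value greater than val.
    n = len(df[df[col] <= val])
    if direction < 0:
        # Step back one position to get the closest value <= val.
        n -= 1
    if n < 0 or n >= len(df):
        print('err - value outside range')
        return None
    # Positional lookup (.ix was removed in pandas 1.0).
    return df[col].iloc[n]

df = pd.DataFrame({'num': range(0, 10, 2)})
for find in range(-1, 2):
    lc = closest(df, 'num', find, -1)
    hc = closest(df, 'num', find, 1)
    print('Closest to {} is {}, lower and {}, higher.'.format(find, lc, hc))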


df:     num
    0   0
    1   2
    2   4
    3   6
    4   8
err - value outside range
Closest to -1 is None, lower and 0, higher.
Closest to 0 is 0, lower and 2, higher.
Closest to 1 is 0, lower and 2, higher.

#5


If the series is already sorted, an efficient way to find the indexes is to use bisect. An example:

idx = bisect_right(df['num'].values, 3)

So for the problem cited in the question, considering that the column "col" of the dataframe "df" is sorted:

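from bisect import bisect_right, bisect_left

def get_closests(df, col, val):
    # bisect_left: position of the first entry >= val;
    # bisect_right: position just past the last entry <= val.
    lower_idx = bisect_left(df[col].values, val)
    higher_idx = bisect_right(df[col].values, val)
    if higher_idx != lower_idx:
        # val is present: lower_idx is the position of its first occurrence.
        return lower_idx
    else:
        # val is absent: its neighbours sit just before and at the insertion point.
        # No bounds check: a val outside the column's range yields -1 or len(df).
        return lower_idx - 1, higher_idx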

This is an efficient (O(log N)) way to find the index of the specific value "val" in the dataframe column "col", or of its closest neighbours, but it requires the column to be sorted.
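
As a rough sketch of how this could be applied to the series from the question, the example below sorts the data first and maps the bisect positions back to index labels. The function name closest_labels and the choice to return None for a missing neighbour are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

import pandas as pd


def closest_labels(df, col, val):
    """Return the index label of val, or the labels of its two neighbours.

    Assumes df[col] is sorted in ascending order. A missing neighbour
    (val below the minimum or above the maximum) is reported as None.
    """
    values = df[col].to_numpy()
    left = bisect_left(values, val)
    right = bisect_right(values, val)
    if left != right:
        # val is present; left is the position of its first occurrence.
        return df.index[left]
    lower = df.index[left - 1] if left > 0 else None
    upper = df.index[left] if left < len(values) else None
    return lower, upper


# Sorting is the O(N log N) precondition; each lookup afterwards is O(log N).
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]}).sort_values("num")
print(closest_labels(df, "num", 3))  # labels of the values 2 and 4
print(closest_labels(df, "num", 5))  # label of the exact match
```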
