Slicing pandas data by MultiIndex level or sublevel

Time: 2022-07-28 16:17:34

Inspired by this answer, and by the lack of an easy answer to this question, I found myself writing a little syntactic sugar to make filtering by MultiIndex level easier.

import pandas as pd


def _filter_series(x, level_name, filter_by):
    """
    Filter a pd.Series or pd.DataFrame x by `filter_by` on the MultiIndex level
    `level_name`

    Uses `pd.Index.get_level_values()` in the background. `filter_by` is either
    a string or an iterable.
    """
    if isinstance(x, (pd.Series, pd.DataFrame)):
        # Wrap a single label so that isin() always receives a list-like
        if isinstance(filter_by, str):
            filter_by = [filter_by]

        # Boolean mask over the rows whose level value is in filter_by
        index = x.index.get_level_values(level_name).isin(filter_by)
        return x[index]
    else:
        print("Not a pandas object")
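
As a quick check of the idea, here is a minimal sketch of the same `get_level_values(...).isin(...)` filter applied directly to a small Series. The level names `letter` and `number` and the data are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Hypothetical two-level index; names and values are illustrative only.
index = pd.MultiIndex.from_product(
    [["a", "b", "c"], [1, 2]], names=["letter", "number"]
)
s = pd.Series(range(6), index=index)

# Boolean mask: True where the 'letter' level holds one of the wanted labels.
mask = s.index.get_level_values("letter").isin(["a", "c"])
filtered = s[mask]
print(filtered)
```

The mask has one entry per row, so the same expression should work unchanged on a DataFrame with the same index.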

But if I know the pandas development team (and I'm starting to, slowly!) there's already a nice way to do this, and I just don't know what it is yet!

Am I right?

3 answers

#1


4  

I actually upvoted joris's answer... but unfortunately the refactoring he mentions did not happen in 0.14 and is not happening in 0.17 either. So for the moment let me suggest a quick and dirty solution (obviously derived from Jeff's):

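import pandas as pd


def filter_by(df, constraints):
    """Filter MultiIndex by sublevels."""
    # One entry per index level: the constraint if given, else "take everything"
    indexer = [constraints[name] if name in constraints else slice(None)
               for name in df.index.names]
    # A Series (1-D) takes only the row indexer; a DataFrame gets it as the row slot
    return df.loc[tuple(indexer)] if len(df.shape) == 1 else df.loc[tuple(indexer),]

pd.Series.filter_by = filter_by
pd.DataFrame.filter_by = filter_by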

... to be used as

df.filter_by({'level_name' : value})

where value can indeed be a single value, but also a list, a slice...

(untested with Panels and higher dimension elements, but I do expect it to work)
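
To make the dictionary-of-constraints idea concrete, here is a small sketch that uses a standalone variant of the helper (no monkeypatching) on a toy frame. The frame, its level names, and the constraint values are assumptions chosen for illustration. The index is built already sorted, which multi-level slicing generally needs.

```python
import pandas as pd


def filter_by(obj, constraints):
    """Select rows whose MultiIndex levels match `constraints` (level name -> labels)."""
    # Unconstrained levels get slice(None), i.e. "keep everything on this level".
    indexer = tuple(constraints.get(name, slice(None)) for name in obj.index.names)
    # A Series takes the row indexer alone; a DataFrame also needs a column slice.
    return obj.loc[indexer] if obj.ndim == 1 else obj.loc[indexer, :]


# Illustrative data only: three levels, eight rows, already lexsorted.
index = pd.MultiIndex.from_product(
    [["A0", "A1"], ["B0", "B1"], ["C0", "C1"]], names=["first", "second", "third"]
)
df = pd.DataFrame({"value": range(8)}, index=index)

single = filter_by(df, {"second": ["B1"]})
combined = filter_by(df, {"first": ["A1"], "third": ["C1"]})
as_series = filter_by(df["value"], {"third": ["C0"]})
print(single)
print(combined)
```

Constraints on several levels are combined with a logical AND, because each level contributes its own slot in the indexer tuple.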

#2


3  

This is very easy using the new multi-index slicers in master/0.14 (releasing soon), see here

There is an open issue to make this syntactically easier (it's not hard to do), see here. For example, something like df.loc[{ 'third' : ['C1','C3'] }] seems reasonable to me.

Here's how you can do it (requires master/0.14):

In [2]: def mklbl(prefix,n):
   ...:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ...: 


In [11]: index = MultiIndex.from_product([mklbl('A',4),
mklbl('B',2),
mklbl('C',4),
mklbl('D',2)],names=['first','second','third','fourth'])

In [12]: columns = ['value']

In [13]: df = DataFrame(np.arange(len(index)*len(columns)).reshape((len(index),len(columns))),index=index,columns=columns).sortlevel()

In [14]: df
Out[14]: 
                           value
first second third fourth       
A0    B0     C0    D0          0
                   D1          1
             C1    D0          2
                   D1          3
             C2    D0          4
                   D1          5
             C3    D0          6
                   D1          7
      B1     C0    D0          8
                   D1          9
             C1    D0         10
                   D1         11
             C2    D0         12
                   D1         13
             C3    D0         14
                   D1         15
A1    B0     C0    D0         16
                   D1         17
             C1    D0         18
                   D1         19
             C2    D0         20
                   D1         21
             C3    D0         22
                   D1         23
      B1     C0    D0         24
                   D1         25
             C1    D0         26
                   D1         27
             C2    D0         28
                   D1         29
             C3    D0         30
                   D1         31
A2    B0     C0    D0         32
                   D1         33
             C1    D0         34
                   D1         35
             C2    D0         36
                   D1         37
             C3    D0         38
                   D1         39
      B1     C0    D0         40
                   D1         41
             C1    D0         42
                   D1         43
             C2    D0         44
                   D1         45
             C3    D0         46
                   D1         47
A3    B0     C0    D0         48
                   D1         49
             C1    D0         50
                   D1         51
             C2    D0         52
                   D1         53
             C3    D0         54
                   D1         55
      B1     C0    D0         56
                   D1         57
             C1    D0         58
                   D1         59
                             ...

[64 rows x 1 columns]

Create an indexer across all of the levels, selecting all entries

In [15]: indexer = [slice(None)]*len(df.index.names)

Make the level we care about only have the entries we care about

In [16]: indexer[df.index.names.index('third')] = ['C1','C3']

Select it (it's important that this is a tuple!)

In [18]: df.loc[tuple(indexer),:]
Out[18]: 
                           value
first second third fourth       
A0    B0     C1    D0          2
                   D1          3
             C3    D0          6
                   D1          7
      B1     C1    D0         10
                   D1         11
             C3    D0         14
                   D1         15
A1    B0     C1    D0         18
                   D1         19
             C3    D0         22
                   D1         23
      B1     C1    D0         26
                   D1         27
             C3    D0         30
                   D1         31
A2    B0     C1    D0         34
                   D1         35
             C3    D0         38
                   D1         39
      B1     C1    D0         42
                   D1         43
             C3    D0         46
                   D1         47
A3    B0     C1    D0         50
                   D1         51
             C3    D0         54
                   D1         55
      B1     C1    D0         58
                   D1         59
             C3    D0         62
                   D1         63

[32 rows x 1 columns]
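
For readers on a more recent pandas, the session above can be sketched as a plain script. Two things here are adaptations rather than part of the original answer: `sort_index()` stands in for `sortlevel()`, which newer releases no longer provide, and `pd.IndexSlice` is shown as a shorthand for writing the tuple of slices by hand.

```python
import numpy as np
import pandas as pd


def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]


index = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)],
    names=["first", "second", "third", "fourth"],
)
# sort_index() keeps the index lexsorted, which the slicers rely on.
df = pd.DataFrame({"value": np.arange(len(index))}, index=index).sort_index()

# Positional form: one slot per level, ':' meaning "all labels on this level".
idx = pd.IndexSlice
by_slicer = df.loc[idx[:, :, ["C1", "C3"], :], :]

# Name-based form, as in the session: look up the level position by its name.
indexer = [slice(None)] * df.index.nlevels
indexer[list(df.index.names).index("third")] = ["C1", "C3"]
by_name = df.loc[tuple(indexer), :]

print(by_slicer.head())
```

Both forms build the same tuple of per-level selectors, so they should return the same 32 rows.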

#3


1  

You have the filter method, which can do things like this. For example, with the example from the linked SO question:

In [188]: df.filter(like='0630', axis=0)
Out[188]: 
                      sales        cogs    net_pft
STK_ID RPT_Date                                   
876    20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
2254   20070630   501221000   289167000  118012200

The filter method is being refactored at the moment (in the upcoming 0.14), and a level keyword will be added (because right now you can run into problems if the same labels appear in different levels of the index).
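
Until a level-aware `filter` is available, one way to sidestep the ambiguity between levels is to build the mask from a single level. The sketch below is an assumption-laden illustration, not the `filter` API: the index labels and `sales` figures are invented, and the last row is included because its `STK_ID` contains the same digits as the date suffix being searched for.

```python
import pandas as pd

# Made-up labels and values, for illustration only.
index = pd.MultiIndex.from_tuples(
    [(1, 20060630), (1, 20061231), (2, 20070630), (10630, 20071231)],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame({"sales": [10, 20, 30, 40]}, index=index)

# Look only at the RPT_Date level, converted to strings for suffix matching.
dates = df.index.get_level_values("RPT_Date").astype(str)
mask = [label.endswith("0630") for label in dates]

# Rows whose report date ends in 0630; the STK_ID 10630 row is not picked up.
june = df.loc[mask]
print(june)
```

Because the mask is computed from `RPT_Date` alone, labels on other levels cannot cause accidental matches.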
