Cannot slice a pandas DataFrame (indexed by dates) using a date string

Time: 2022-09-27 22:56:31

I'm generating an empty dataframe with a series of dates as the index. Data will be added to the dataframe at a later point.

cbd=pd.date_range(start=pd.datetime(2017,01,02),end=pd.datetime(2017,01,30),period=1)

df = pd.DataFrame(data=None,columns=['Test1','Test2'],index=cbd)

df.head()
           Test1 Test2
2017-01-02   NaN   NaN
2017-01-03   NaN   NaN
2017-01-04   NaN   NaN
2017-01-05   NaN   NaN
2017-01-06   NaN   NaN

A few slicing methods don't seem to work. The following returns a KeyError:

df['2017-01-02']

However, either of the following works:

df['2017-01-02':'2017-01-02']
df.loc['2017-01-02']

What am I missing here? Why doesn't the first slice return a result?

3 Solutions

#1


11  

Dual behavior of [] in df[]

  • When you don't use : inside [], the value(s) inside it are treated as column label(s).

  • When you use : inside [], the value(s) inside it are treated as row label(s).

Why the dual nature?

Because most of the time people want to slice rows rather than columns. So the pandas developers decided that x, y in df[x:y] should correspond to rows, while x in df[x] or x, y in df[[x,y]] should correspond to column(s).

Example:

import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]],
                  index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
print(df)

Output:

   A  B  C
A  1  2  3
B  1  2  3
C  1  2  3

Now when you write df['B'], it could mean two things:

  • Take the index label B and give you the 2nd row: 1 2 3

                    OR

  • Take the column label B and give you the 2nd column: 2 2 2

So, to resolve this conflict and keep things unambiguous, df['B'] always means that you want the column 'B'. If there is no such column, it raises a KeyError.
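A minimal sketch of this rule, rebuilding the same A/B/C frame so it runs on its own. The variable names col_b, row_b and rows_a_to_b are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

col_b = df["B"]            # single label inside []: column lookup
row_b = df.loc["B"]        # single label inside .loc[]: row lookup
rows_a_to_b = df["A":"B"]  # slice inside []: row slicing, endpoints included

print(col_b.tolist())              # [2, 2, 2]
print(row_b.tolist())              # [1, 2, 3]
print(rows_a_to_b.index.tolist())  # ['A', 'B']
```

Even though 'B' is both a row label and a column label here, df['B'] never returns the row.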

Why does df['2017-01-02'] fail?

It searches for a column named '2017-01-02'. Because there is no such column, it raises a KeyError.
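If you want to check this yourself, here is a small sketch. It assumes a recent pandas version and a daily index like the one in the question; the variable caught is just an illustrative name:

```python
import pandas as pd

# Daily index like the one in the question (built from date strings).
cbd = pd.date_range(start="2017-01-02", end="2017-01-30")
df = pd.DataFrame(data=None, columns=["Test1", "Test2"], index=cbd)

# The date is a row label, not a column label.
print("2017-01-02" in df.columns)              # False
print(pd.Timestamp("2017-01-02") in df.index)  # True

# Plain [] with a single key is a column lookup, so it fails.
caught = None
try:
    df["2017-01-02"]
except KeyError as err:
    caught = err

print(type(caught).__name__)  # KeyError
```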

Why does df.loc['2017-01-02'] work then?

Because .loc[] has the syntax df.loc[row, column], and you can leave out the column if you like. In your case, it simply means df.loc[row].
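A short sketch of both forms of .loc on the question's frame, assuming a recent pandas version. The assignment at the end is my own addition, since the question mentions filling in data later; the value 1.5 is arbitrary:

```python
import pandas as pd

cbd = pd.date_range(start="2017-01-02", end="2017-01-30")
df = pd.DataFrame(data=None, columns=["Test1", "Test2"], index=cbd)

# Row only: returns a Series with one entry per column.
row = df.loc["2017-01-02"]

# Row and column: returns a single cell (NaN in the empty frame).
cell = df.loc["2017-01-02", "Test1"]

# The same row/column form can be used to fill in data later.
df.loc["2017-01-02", "Test1"] = 1.5

print(list(row.index))                  # ['Test1', 'Test2']
print(df.loc["2017-01-02", "Test1"])    # 1.5
```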

#2


4  

There is a difference, because the two expressions use different approaches.

Selecting a single row requires .loc, which is why this form fails:

df['2017-01-02']

Docs - partial string indexing:

Warning

The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one):

dft['2013-1-15 12:30:00']

To select a single row, use .loc

In [74]: dft.loc['2013-1-15 12:30:00']
Out[74]: 
A    0.193284
Name: 2013-01-15 12:30:00, dtype: float64

df['2017-01-02':'2017-01-02']

This is pure partial string indexing:

This type of slicing will work on a DataFrame with a DatetimeIndex as well. Since the partial string selection is a form of label slicing, the endpoints will be included. This would include matching times on an included date.
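To illustrate "matching times on an included date", here is a small sketch. The 12-hour index and the value column are my own assumptions for the example and are not taken from the docs:

```python
import pandas as pd

# Six timestamps, 12 hours apart, starting at 2017-01-01 00:00.
idx = pd.date_range(start="2017-01-01", periods=6, freq=pd.Timedelta(hours=12))
df = pd.DataFrame({"value": range(6)}, index=idx)

# Slicing by date strings keeps every timestamp that falls on 2017-01-02,
# because both endpoints of a label slice are included.
one_day = df["2017-01-02":"2017-01-02"]
same_via_loc = df.loc["2017-01-02":"2017-01-02"]

print(one_day["value"].tolist())  # [2, 3]
```

Both the 00:00 and the 12:00 rows of 2017-01-02 come back, and the .loc slice gives the same result.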

#3


1  

First I have updated your test data (just for info) as it returns an 'invalid token' error. Please see changes here:

cbd = pd.date_range(start='2017-01-02', end='2017-01-30')
df = pd.DataFrame(data=None, columns=['Test1', 'Test2'], index=cbd)

Now looking at the first row:

In[1]:

df.head(1)

Out[1]:
          Test1 Test2
2017-01-02  NaN NaN

And then trying the initial slicing approach yields this error:

In[2]:    

df['2017-01-02']

Out[2]:

KeyError: '2017-01-02'

Now try this using the column name:

In[3]:    

df.columns

Out[3]:

Index(['Test1', 'Test2'], dtype='object')

We try 'Test1':

In[4]:

df['Test1']

And get the NaN output from this column:

Out[4]:

2017-01-02    NaN
2017-01-03    NaN
2017-01-04    NaN
2017-01-05    NaN

So the df[...] form you are using looks up a column name, unless you pass a slice such as df['2017-01-02':'2017-01-02'].

The Pandas docs state "The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)".

So, as you correctly identified, DataFrame.loc is a label-based indexer, which yields the output you are looking for:

In[5]:

df.loc['2017-01-02']

Out[5]:

Test1    NaN
Test2    NaN
Name: 2017-01-02 00:00:00, dtype: object
