I'm generating an empty dataframe with a series of dates as the index. Data will be added to the dataframe at a later point.
cbd=pd.date_range(start=pd.datetime(2017,01,02),end=pd.datetime(2017,01,30),period=1)
df = pd.DataFrame(data=None,columns=['Test1','Test2'],index=cbd)
df.head()
           Test1 Test2
2017-01-02   NaN   NaN
2017-01-03   NaN   NaN
2017-01-04   NaN   NaN
2017-01-05   NaN   NaN
2017-01-06   NaN   NaN
A few slicing methods don't seem to work. The following returns a KeyError:
df['2017-01-02']
However, either of the following works:
df['2017-01-02':'2017-01-02']
df.loc['2017-01-02']
What am I missing here? Why doesn't the first slice return a result?
3 Answers
#1
11
Dual behavior of [] in df[]
- When you don't use : inside [], the value(s) inside it are treated as column(s).
- When you use : inside [], the value(s) inside it are treated as row(s).
Why the dual nature?
Because most of the time people want to slice the rows rather than the columns. So the decision was that x, y in df[x:y] should correspond to rows, while x in df[x] or x, y in df[[x,y]] should correspond to column(s).
Example:
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]],
                  index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
print(df)
Output:
   A  B  C
A  1  2  3
B  1  2  3
C  1  2  3
Now when you do df['B'], it can mean 2 things:
- Take the 2nd index label B and give you the 2nd row: 1 2 3
OR
- Take the 2nd column label B and give you the 2nd column: 2 2 2
So, in order to resolve this conflict and keep it unambiguous, df['B'] will always mean that you want the column 'B'. If there is no such column, it raises a KeyError.
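The rule can be checked on the example frame above. This is only a minimal sketch; the variable names col_b, cols_ac and rows_ab are illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]],
                  index=['A', 'B', 'C'], columns=['A', 'B', 'C'])

col_b = df['B']           # no colon: 'B' is looked up among the columns
cols_ac = df[['A', 'C']]  # list of labels: also looked up among the columns
rows_ab = df['A':'B']     # colon: label slice over the rows, end label included

print(col_b.tolist())            # [2, 2, 2]
print(cols_ac.columns.tolist())  # ['A', 'C']
print(rows_ab.index.tolist())    # ['A', 'B']
```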
Why does df['2017-01-02'] fail?
It searches for a column named '2017-01-02'. Because there is no such column, it raises an error.
Why does df.loc['2017-01-02'] work then?
Because .loc[] has the syntax df.loc[row, column], and you can leave out the column if you wish. In your case, it simply means df.loc[row].
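Applied to the frame from the question, the three forms behave as follows. This is a sketch under assumptions: it builds the index from date strings with freq='D' (one row per day) rather than the question's original date_range arguments, and the names raised, row and one_row are made up for illustration:

```python
import pandas as pd

cbd = pd.date_range(start='2017-01-02', end='2017-01-30', freq='D')
df = pd.DataFrame(columns=['Test1', 'Test2'], index=cbd)

try:
    df['2017-01-02']  # no colon: treated as a column label, which does not exist
    raised = False
except KeyError:
    raised = True

row = df.loc['2017-01-02']               # row lookup: a Series for that date
one_row = df['2017-01-02':'2017-01-02']  # colon: row slice, a one-row DataFrame

print(raised)         # True
print(row.name)       # 2017-01-02 00:00:00
print(one_row.shape)  # (1, 2)
```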
#2
4
There is a difference because the two forms use different approaches.
To select a single row, loc is necessary; this form does not select a row:
df['2017-01-02']
Docs - partial string indexing:
Warning
The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one):
dft['2013-1-15 12:30:00']
To select a single row, use .loc:
In [74]: dft.loc['2013-1-15 12:30:00']
Out[74]:
A 0.193284
Name: 2013-01-15 12:30:00, dtype: float64
df['2017-01-02':'2017-01-02']
This is pure partial string indexing:
This type of slicing will work on a DataFrame with a DatetimeIndex as well. Since the partial string selection is a form of label slicing, the endpoints will be included. This would include matching times on an included date.
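To see what "matching times on an included date" means, here is a small sketch with a hypothetical frame dft whose index carries intra-day timestamps. The timestamps and the values in column A are invented for illustration:

```python
import pandas as pd

idx = pd.to_datetime(['2017-01-01 18:00', '2017-01-02 00:00',
                      '2017-01-02 12:30', '2017-01-03 00:00'])
dft = pd.DataFrame({'A': [1, 2, 3, 4]}, index=idx)

# Both endpoints are dates, so every timestamp falling on 2017-01-02 is
# selected, while rows from the neighbouring days are left out.
same_day = dft['2017-01-02':'2017-01-02']

print(same_day['A'].tolist())  # [2, 3]
```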
#3
1
First, I have updated your test data (just for info), as it returns an 'invalid token' error. Please see the changes here:
# one index entry per calendar day
cbd = pd.date_range(start='2017-01-02', end='2017-01-30', freq='D')
df = pd.DataFrame(data=None, columns=['Test1', 'Test2'], index=cbd)
Now looking at the first row:
In[1]:
df.head(1)
Out[1]:
Test1 Test2
2017-01-02 NaN NaN
And then trying the initial slicing approach yields this error:
In[2]:
df['2017-01-02']
Out[2]:
KeyError: '2017-01-02'
Now try this using the column name:
In[3]:
df.columns
Out[3]:
Index(['Test1', 'Test2'], dtype='object')
We try 'Test1':
In[4]:
df['Test1']
And get the NaN output from this column:
Out[4]:
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 NaN
So the format you are using is designed to be used with a column name, unless you use the slice format df['2017-01-02':'2017-01-02'].
The pandas docs state: "The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)".
So, as you correctly identified, DataFrame.loc is a label-based indexer that yields the output you are looking for:
In[5]:
df.loc['2017-01-02']
Out[5]:
Test1 NaN
Test2 NaN
Name: 2017-01-02 00:00:00, dtype: object