Selecting columns in a pandas DataFrame

Time: 2021-05-20 22:57:55

I have data in different columns but I don't know how to extract it to save it in another variable.

index  a   b   c
1      2   3   4
2      3   4   5

How do I select 'b' and 'c' and save them into df1?

I tried

df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']

Neither seems to work.

9 answers

#1


784  

The column names (which are strings) cannot be sliced in the manner you tried.

Here you have a couple of options. If you know from context which variables you want to slice out, you can select just those columns by passing a list into the __getitem__ syntax (the []'s).

df1 = df[['a','b']]

Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:

df1 = df.iloc[:,0:2] # Remember that Python does not slice inclusive of the ending index.
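To tie the two options back to the question, here is a minimal sketch that pulls out 'b' and 'c' both ways. It assumes the frame from the question with 'index' stored as an ordinary column; the names by_name and by_position are just for illustration.

```python
import pandas as pd

# Assumed reconstruction of the frame in the question.
df = pd.DataFrame({"index": [1, 2], "a": [2, 3], "b": [3, 4], "c": [4, 5]})

# Select by name: pass a list of column labels inside the brackets.
by_name = df[["b", "c"]]

# Select by position: columns 2 and 3 (the end index 4 is excluded).
by_position = df.iloc[:, 2:4]

print(by_name)
print(by_position)
```

Both selections should give the same two-column frame here; the positional form only keeps working if the column order does not change.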

Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).

Sometimes, however, there are indexing conventions in pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This can happen with the second way of indexing, so you can append the copy() method to get a regular copy. When this happens, changing what you think is the sliced object can sometimes alter the original object. It is always good to be on the lookout for this.

df1 = df.iloc[:, 0:2].copy() # To avoid the case where changing df1 also changes df
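A minimal sketch of the defensive copy, assuming the two-row frame from the question with 'index' as an ordinary column. Whether an un-copied slice shares memory with df depends on the pandas version and its settings, so the sketch only shows the part that holds regardless: an explicit copy() is independent of the original.

```python
import pandas as pd

# Assumed reconstruction of the frame in the question.
df = pd.DataFrame({"index": [1, 2], "a": [2, 3], "b": [3, 4], "c": [4, 5]})

# Take the first two columns and detach them from df explicitly.
df1 = df.iloc[:, 0:2].copy()

# Change a value in the copy (row 0, second column, i.e. 'a').
df1.iloc[0, 1] = 99

print(df1.loc[0, "a"])  # value in the copy
print(df.loc[0, "a"])   # value in the original, untouched
```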

#2


65  

Assuming your column names (df.columns) are ['index','a','b','c'], then the data you want is in the 3rd & 4th columns. If you don't know their names when your script runs, you can do this:

newdf = df[df.columns[2:4]] # Remember, Python is 0-offset! The "3rd" entry is at slot 2.

As EMS points out in his answer, df.ix slices columns a bit more concisely, but the .columns slicing interface might be more natural because it uses the vanilla 1-D Python list indexing/slicing syntax. (Note that .ix was deprecated in pandas 0.20 and removed in 1.0, so use .loc or .iloc on current versions.)

WARN: 'index' is a bad name for a DataFrame column. That same label is also used for the real df.index attribute, an Index array. So your column is returned by df['index'] and the real DataFrame index is returned by df.index. An Index is a special kind of array optimized for lookup of its elements' values. For df.index, it's for looking up rows by their label. The df.columns attribute is also a pd.Index array, for looking up columns by their labels.
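The difference is easy to see in a small sketch. The frame below is an assumed stand-in for the one in the question, and the variable names are only for illustration.

```python
import pandas as pd

# Assumed frame with a column that happens to be called 'index'.
df = pd.DataFrame({"index": [1, 2], "a": [2, 3], "b": [3, 4], "c": [4, 5]})

col = df["index"]      # the column named 'index' (a Series holding 1, 2)
row_labels = df.index  # the real row index (default labels 0, 1)
col_labels = df.columns  # the column labels, also a pd.Index

print(list(col))
print(list(row_labels))
print(list(col_labels))
```

Note that attribute access (df.index) always gives the row index, so the column can only be reached with the bracket form.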

#3


50  

As of version 0.11.0, columns can be sliced in the manner you tried using the .loc indexer:

df.loc[:, 'C':'E']

returns columns C through E.

A demo on a randomly generated DataFrame:

import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)), 
                  columns=list('ABCDEF'), 
                  index=['R{}'.format(i) for i in range(100)])
df.head()

Out: 
     A   B   C   D   E   F
R0  99  78  61  16  73   8
R1  62  27  30  80   7  76
R2  15  53  80  27  44  77
R3  75  65  47  30  84  86
R4  18   9  41  62   1  82

To get the columns from C to E (note that unlike integer slicing, 'E' is included in the columns):

df.loc[:, 'C':'E']

Out: 
      C   D   E
R0   61  16  73
R1   30  80   7
R2   80  27  44
R3   47  30  84
R4   41  62   1
R5    5  58   0
...

The same works for selecting rows based on labels. Get the rows 'R6' to 'R10' from those columns:

df.loc['R6':'R10', 'C':'E']

Out: 
      C   D   E
R6   51  27  31
R7   83  19  18
R8   11  67  65
R9   78  27  29
R10   7  16  94

.loc also accepts a boolean array, so you can select the columns whose corresponding entry in the array is True. For example, df.columns.isin(list('BCD')) returns array([False, True, True, True, False, False], dtype=bool): True where the column name is in the list ['B', 'C', 'D'] and False otherwise.

df.loc[:, df.columns.isin(list('BCD'))]

Out: 
      B   C   D
R0   78  61  16
R1   27  30  80
R2   53  80  27
R3   65  47  30
R4    9  41  62
R5   78   5  58
...
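For a version that is easy to check by hand, here is a small sketch of the same two .loc techniques on a one-row frame. The frame and the names mask, selected and sliced are made up for illustration and are not part of the demo above.

```python
import pandas as pd

# Tiny stand-in frame with the same column labels as the demo.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list("ABCDEF"))

# Boolean mask over the columns: True where the label is B, C or D.
mask = df.columns.isin(list("BCD"))
selected = df.loc[:, mask]

# Label slice: both endpoints are included.
sliced = df.loc[:, "C":"E"]

print(mask)
print(selected)
print(sliced)
```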

#4


47  

In [39]: df
Out[39]: 
   index  a  b  c
0      1  2  3  4
1      2  3  4  5

In [40]: df1 = df[['b', 'c']]

In [41]: df1
Out[41]: 
   b  c
0  3  4
1  4  5

#5


32  

I realize this question is quite old, but in the latest version of pandas there is an easy way to do exactly this. You can list the column names (which are strings) you want, in whatever order you like, and pass them to the DataFrame constructor.

columns = ['b', 'c']
df1 = pd.DataFrame(df, columns=columns)
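One difference from plain bracket selection is worth a sketch. As far as I can tell, the constructor reindexes rather than looks up, so a name that is not in df comes back as a column of NaN, whereas df[[...]] raises a KeyError on current pandas versions. The frame and the missing name 'z' below are assumptions for illustration.

```python
import pandas as pd

# Assumed reconstruction of the frame in the question.
df = pd.DataFrame({"index": [1, 2], "a": [2, 3], "b": [3, 4], "c": [4, 5]})

# Existing names: behaves like df[['b', 'c']].
df1 = pd.DataFrame(df, columns=["b", "c"])

# A name that does not exist ('z') becomes an all-NaN column.
df2 = pd.DataFrame(df, columns=["b", "z"])

# The bracket form refuses the missing name instead.
try:
    df[["b", "z"]]
    missing_raises = False
except KeyError:
    missing_raises = True

print(df1)
print(df2)
print(missing_raises)
```

If a silent NaN column would hide a typo in your column list, the bracket form is probably the safer choice.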

#6


13  

You could provide a list of columns to be dropped and get back a DataFrame with only the columns you need, using the drop() method on a pandas DataFrame.

Just writing

colsToDrop = ['a']
df.drop(colsToDrop, axis=1)

would return a DataFrame with just the columns b and c.
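A short sketch of the drop approach, assuming a frame that has only the columns a, b and c (if 'index' were a real column it would have to be dropped too). The columns= keyword used here is equivalent to passing axis=1.

```python
import pandas as pd

# Assumed frame with just the three data columns.
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]})

cols_to_drop = ["a"]

# drop returns a new DataFrame; df itself is left unchanged.
df1 = df.drop(columns=cols_to_drop)

print(df1)
print(list(df.columns))
```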

The drop method is documented in the pandas API reference under DataFrame.drop.

#7


11  

I found this method to be very useful:

# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]
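Since surveys_df is not defined in this thread, here is a sketch with a made-up four-row frame standing in for it, to show what the two slices pick out.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; the data is invented.
surveys_df = pd.DataFrame({
    "index": [1, 2, 3, 4],
    "a": [2, 3, 4, 5],
    "b": [3, 4, 5, 6],
    "c": [4, 5, 6, 7],
})

# Rows at positions 0, 1, 2 and columns at positions 1, 2, 3.
subset = surveys_df.iloc[0:3, 1:4]

print(subset)
```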

#8


7  

Just use the following; it will select the b and c columns.

df1 = df[['b', 'c']]  # no need to create an empty DataFrame first

Then you can just display df1:

df1

#9


2  

If you want to get one element by row index and column name, you can do it just like df['b'][0]. It is as simple as you can imagine.

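Or you can use df.ix[0, 'b'], which mixes positional and label indexing. Note that .ix was deprecated in pandas 0.20 and removed in 1.0; on current versions, use df.loc (labels) or df.iloc (positions) instead.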
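For current pandas versions, a sketch of a few ways to read that single element. It assumes the default integer row index, so that row label 0 and row position 0 coincide; the variable names are only for illustration.

```python
import pandas as pd

# Assumed frame with the default 0, 1 row index.
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]})

chained = df["b"][0]          # column first, then row label 0
by_label = df.loc[0, "b"]     # row label and column label in one call
scalar = df.at[0, "b"]        # label-based access for a single value
by_position = df.iloc[0, df.columns.get_loc("b")]  # purely positional

print(chained, by_label, scalar, by_position)
```

The single-call forms (.loc, .at, .iloc) are generally preferable to chained indexing when you later want to assign to the element.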
