
Time: 2021-01-18 21:41:15

I've noticed three methods of selecting a column in a Pandas DataFrame:


First method of selecting a column using loc:


df_new = df.loc[:, 'col1']

Second method - seems simpler and faster:


df_new = df['col1']

Third method - most convenient:


df_new = df.col1

Is there a difference between these three methods? I don't think so, in which case I'd rather use the third method.


I'm mostly curious as to why there appear to be three methods for doing the same thing.


1 Answer



If you are selecting a single column, a list of columns, or a slice of rows, then there is no difference between [] and .loc. However, [] does not allow you to select a single row, a list of rows, or a slice of columns. More importantly, if your selection involves both rows and columns, assignment becomes problematic.


df[1:3]['A'] = 5

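This selects rows 1 and 2, then selects column 'A' of the returned object and assigns the value 5 to it. The problem is that the returned object might be a copy, so the assignment may not change the original DataFrame; pandas flags this with a SettingWithCopyWarning. The correct way to write this assignment is with a single .loc call. Because .loc slices by label and includes both endpoints, the equivalent for rows 1 and 2 (assuming the default integer index) is


df.loc[1:2, 'A'] = 5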

With .loc, you are guaranteed to modify the original DataFrame. It also allows you to slice columns (df.loc[:, 'C':'F']), select a single row (df.loc[5]), and select a list of rows (df.loc[[1, 2, 5]]).
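As a minimal sketch of the selections described above (the data and column names are made up for illustration, and a default integer index is assumed):

```python
import pandas as pd

# Hypothetical data for illustration; a default integer index is assumed.
df = pd.DataFrame(
    {"A": [0, 0, 0, 0, 0, 0], "C": [1, 2, 3, 4, 5, 6], "D": [7, 8, 9, 10, 11, 12]}
)

# [] with a slice selects rows by position; the end point is excluded.
by_position = df[1:3]             # rows 1 and 2

# A single .loc call resolves rows and column together, so df itself is modified.
# .loc slices by label and includes both endpoints, so 1:2 covers labels 1 and 2.
df.loc[1:2, "A"] = 5

col_slice = df.loc[:, "C":"D"]    # slice of columns (both endpoints included)
one_row = df.loc[5]               # single row, returned as a Series
some_rows = df.loc[[1, 2, 5]]     # list of rows, returned as a DataFrame
```

Note that the positional slice df[1:3] and the label slice df.loc[1:2] pick the same rows here only because the index is the default 0, 1, 2, ...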


Also note that [] and .loc were not added to the API at the same time: .loc was added much later as a more powerful and explicit indexer. See unutbu's answer for more detail.


Note: Getting columns with [] vs. . is a completely different topic. The . form is only there for convenience. It only allows accessing columns whose names are valid Python identifiers (i.e. they cannot contain spaces, they cannot start with a digit or be a number...). It cannot be used when the name conflicts with a Series/DataFrame method or attribute. It also cannot be used to create a column (e.g. the assignment df.a = 1 does not add a column if there is no column a; it only sets an attribute on the object). Other than that, . and [] are the same.
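The limitations of . access can be seen in a small sketch like the following (the column names are hypothetical, chosen so that each one triggers one of the cases above):

```python
import pandas as pd

# Hypothetical column names chosen to hit each limitation.
df = pd.DataFrame({"col1": [1, 2], "my col": [3, 4], "count": [5, 6]})

# Valid identifier with no name clash: both forms return the same column.
same = df.col1.equals(df["col1"])

# A name with a space cannot be written as an attribute; [] is required.
with_space = df["my col"]

# "count" clashes with DataFrame.count, so the attribute is the method.
count_is_method = callable(df.count)
count_column = df["count"]        # [] still reaches the column

# Attribute assignment for a missing name sets a plain Python attribute,
# not a column.
df.a = 1
has_a_column = "a" in df.columns

# [] assignment does create the column.
df["a"] = 1
```

For this reason [] is the safer default in reusable code, since it behaves the same way for any column name.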



