When deleting a column in a DataFrame I use:
del df['column_name']
And this works great. Why can't I use the following?
del df.column_name
As you can access the column/Series as df.column_name, I expect this to work.
12 answers
#1
451
It's difficult to make del df.column_name work simply as the result of syntactic limitations in Python. del df[name] gets translated to df.__delitem__(name) under the covers by Python.
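To make that translation concrete, here is a minimal sketch using a hypothetical Recorder class (it is not part of pandas) that only logs which special method Python calls for each form of del. It suggests that the bracket form reaches __delitem__ while the dotted form reaches __delattr__, so a class has to implement each one separately.

```python
class Recorder:
    """Hypothetical stand-in for a container; it only records how `del` reaches it."""

    def __init__(self):
        self.calls = []

    def __delitem__(self, key):
        # Reached by: del obj[key]
        self.calls.append(("__delitem__", key))

    def __delattr__(self, name):
        # Reached by: del obj.name
        self.calls.append(("__delattr__", name))


r = Recorder()
del r["column_name"]   # Python calls r.__delitem__("column_name")
del r.column_name      # Python calls r.__delattr__("column_name")
print(r.calls)
```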
#2
1507
The best way to do this in pandas is to use drop:
df = df.drop('column_name', axis=1)
where 1 is the axis number (0 for rows and 1 for columns). Pass axis by keyword: the positional form df.drop('column_name', 1) was deprecated in pandas 1.3 and no longer works in pandas 2.0.
To delete the column without having to reassign df you can do:
df.drop('column_name', axis=1, inplace=True)
Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:
df.drop(df.columns[[0, 1, 3]], axis=1) # df.columns is zero-based pd.Index
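As a small illustration of the calls above, the following sketch uses a made-up four-column frame (the column names a to d are assumptions for the example). It shows dropping by label, dropping by position through df.columns, and that drop leaves the original frame untouched unless the result is reassigned.

```python
import pandas as pd

# Made-up example data; the column names are placeholders
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# drop returns a new DataFrame; df itself is unchanged unless you reassign
by_label = df.drop("b", axis=1)

# df.columns[[0, 1, 3]] looks up the labels at positions 0, 1 and 3
by_position = df.drop(df.columns[[0, 1, 3]], axis=1)

print(list(by_label.columns))      # ['a', 'c', 'd']
print(list(by_position.columns))   # ['c']
print(list(df.columns))            # ['a', 'b', 'c', 'd']
```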
#3
173
Use:
columns = ['Col1', 'Col2', ...]
df.drop(columns, inplace=True, axis=1)
This will delete one or more columns in-place. Note that inplace=True was added in pandas v0.13 and won't work on older versions. You'd have to assign the result back in that case:
df = df.drop(columns, axis=1)
#4
75
Drop by index
Delete first, second and fourth columns:
df.drop(df.columns[[0,1,3]], axis=1, inplace=True)
Delete first column:
df.drop(df.columns[[0]], axis=1, inplace=True)
There is an optional parameter inplace so that the original DataFrame is modified directly instead of a new one being returned.
Popped
Column selection, addition, deletion
Delete column column-name:
df.pop('column-name')
Examples:
from pandas import DataFrame
# DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame
df = DataFrame.from_dict({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, orient='index', columns=['one', 'two', 'three'])
print(df)
   one  two  three
A    1    2      3
B    4    5      6
C    7    8      9
df.drop(df.columns[[0]], axis=1, inplace=True)
print(df)
   two  three
A    2      3
B    5      6
C    8      9
three = df.pop('three')
print(df)
   two
A    2
B    5
C    8
#5
54
The actual question posed, missed by most answers here, is:
Why can't I use del df.column_name?
First we need to understand the problem, which requires us to dive into Python magic methods.
As Wes points out in his answer, del df['column'] maps to the Python magic method df.__delitem__('column'), which is implemented in pandas to drop the column.
However, as pointed out in the link above about python magic methods:
In fact, del should almost never be used because of the precarious circumstances under which it is called; use it with caution!
You could argue that del df['column_name'] should not be used or encouraged, and thereby del df.column_name should not even be considered.
However, in theory, del df.column_name could be implemented to work in pandas using the magic method __delattr__. This does however introduce certain problems, problems which the del df['column_name'] implementation already has, but to a lesser degree.
Example Problem
What if I define a column in a dataframe called "dtypes" or "columns"?
Then assume I want to delete these columns.
del df.dtypes would leave the __delattr__ method unable to tell whether it should delete the "dtypes" attribute or the "dtypes" column.
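The ambiguity can be sketched with a toy class. ToyFrame below is hypothetical and far simpler than a real DataFrame: columns live in a plain dict and dtypes is an ordinary instance attribute. Its __delattr__ has to guess, and this sketch guesses "attribute first", which is one arbitrary choice among several.

```python
class ToyFrame:
    """Hypothetical miniature frame: columns live in a dict, plus one real attribute."""

    def __init__(self, data):
        self._data = dict(data)
        self.dtypes = "real attribute"   # stands in for DataFrame.dtypes

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails
        data = self.__dict__.get("_data", {})
        if name in data:
            return data[name]
        raise AttributeError(name)

    def __delitem__(self, key):
        del self._data[key]              # unambiguous: always a column

    def __delattr__(self, name):
        # Ambiguous: is `name` a column or an attribute? This sketch picks
        # "attribute first", which leaves a same-named column behind.
        if name in self.__dict__:
            object.__delattr__(self, name)
        elif name in self._data:
            del self._data[name]
        else:
            raise AttributeError(name)


tf = ToyFrame({"price": [1, 2], "dtypes": ["x", "y"]})
del tf.price                             # no clash: removes the column
del tf.dtypes                            # clash: removes the attribute only
remaining_after_dot = sorted(tf._data)
del tf["dtypes"]                         # bracket form removes the column
remaining_after_brackets = sorted(tf._data)
print(remaining_after_dot, remaining_after_brackets)
```

Whichever rule __delattr__ picks, one of the two readings of del tf.dtypes ends up surprising, while the bracket form never has to choose.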
Architectural questions behind this problem
- Is a dataframe a collection of columns?
- Is a dataframe a collection of rows?
- Is a column an attribute of a dataframe?
Pandas answers:
- Yes, in all ways
- No, but if you want it to be, you can use the .ix, .loc or .iloc methods (.ix was removed in pandas 1.0).
- Maybe, do you want to read data? Then yes, unless the name of the attribute is already taken by another attribute belonging to the dataframe. Do you want to modify data? Then no.
TLDR;
You cannot do del df.column_name because pandas has a quite wildly grown architecture that needs to be reconsidered in order for this kind of cognitive dissonance not to occur to its users.
Protip:
Don't use df.column_name. It may be pretty, but it causes cognitive dissonance.
Zen of Python quotes that fit in here:
There are multiple ways of deleting a column.
There should be one-- and preferably only one --obvious way to do it.
Columns are sometimes attributes but sometimes not.
Special cases aren't special enough to break the rules.
Does del df.dtypes delete the dtypes attribute or the dtypes column?
In the face of ambiguity, refuse the temptation to guess.
#6
37
From version 0.16.1 you can do:
df.drop(['column_name'], axis=1, inplace=True, errors='ignore')
#7
35
A nice addition is the ability to drop columns only if they exist. This way you can cover more use cases, and it will only drop the existing columns from the labels passed to it:
Simply add errors='ignore', for example:
df.drop(['col_name_1', 'col_name_2', ..., 'col_name_N'], inplace=True, axis=1, errors='ignore')
- This is new from pandas 0.16.1 onward. Documentation is here.
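To see the difference errors='ignore' makes, here is a short sketch on a made-up two-column frame (the labels "a", "b" and "missing" are placeholders). It contrasts the default behaviour, which raises KeyError for a label that is not present, with the lenient form.

```python
import pandas as pd

# Made-up example data; "missing" is deliberately not a column
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default behaviour (errors='raise'): a missing label raises KeyError
try:
    df.drop(["b", "missing"], axis=1)
    raised = False
except KeyError:
    raised = True

# errors='ignore' drops the labels that exist and skips the rest
result = df.drop(["b", "missing"], axis=1, errors="ignore")

print(raised)                 # True
print(list(result.columns))   # ['a']
```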
#8
24
It's good practice to always use the [] notation. One reason is that attribute notation (df.column_name) does not work for numbered indices:
In [1]: df = DataFrame([[1, 2, 3], [4, 5, 6]])
In [2]: df[1]
Out[2]:
0 2
1 5
Name: 1
In [3]: df.1
File "<ipython-input-3-e4803c0d1066>", line 1
df.1
^
SyntaxError: invalid syntax
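Numbered labels are not the only case where attribute notation falls short. The sketch below uses two made-up column names to illustrate two related pitfalls: a column whose name matches an existing DataFrame method, and a column whose name is not a valid Python identifier.

```python
import pandas as pd

# Made-up example data; both column names are chosen to cause trouble
df = pd.DataFrame({"count": [10, 20], "unit price": [1.5, 2.5]})

# 'count' is also a DataFrame method, so attribute access returns the method
print(callable(df.count))          # True: this is DataFrame.count, not the column
print(df["count"].tolist())        # [10, 20]

# 'unit price' is not a valid identifier, so only brackets can reach it
print(df["unit price"].tolist())   # [1.5, 2.5]
```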
#9
20
In pandas 0.16.1+ you can drop columns only if they exist per the solution posted by @eiTanLaVi. Prior to that version, you can achieve the same result via a conditional list comprehension:
df.drop([col for col in ['col_name_1','col_name_2',...,'col_name_N'] if col in df],
axis=1, inplace=True)
#10
11
TL;DR
A lot of effort to find a marginally more efficient solution. Difficult to justify the added complexity while sacrificing the simplicity of df.drop(dlst, axis=1, errors='ignore')
df.reindex_axis(np.setdiff1d(df.columns.values, dlst), 1)  # reindex_axis was removed in pandas 1.0; df.reindex(columns=...) is the replacement
Preamble
Deleting a column is semantically the same as selecting the other columns. I'll show a few additional methods to consider.
I'll also focus on the general solution of deleting multiple columns at once and allowing for the attempt to delete columns not present.
These solutions are general and will work for the simple case as well.
Setup
Consider the pd.DataFrame df and list to delete dlst:
df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), range(3))
dlst = list('HIJKLM')
df
   A  B  C  D  E  F  G  H  I   J
0  1  2  3  4  5  6  7  8  9  10
1  1  2  3  4  5  6  7  8  9  10
2  1  2  3  4  5  6  7  8  9  10
dlst
['H', 'I', 'J', 'K', 'L', 'M']
The result should look like:
df.drop(dlst, axis=1, errors='ignore')
   A  B  C  D  E  F  G
0  1  2  3  4  5  6  7
1  1  2  3  4  5  6  7
2  1  2  3  4  5  6  7
Since I'm equating deleting a column to selecting the other columns, I'll break it into two types:
- Label selection
- Boolean selection
Label Selection
We start by manufacturing the list/array of labels that represent the columns we want to keep and without the columns we want to delete.
- df.columns.difference(dlst)
  Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
- np.setdiff1d(df.columns.values, dlst)
  array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
- df.columns.drop(dlst, errors='ignore')
  Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
- list(set(df.columns.values.tolist()).difference(dlst))
  # does not preserve order
  ['E', 'D', 'B', 'F', 'G', 'A', 'C']
- [x for x in df.columns.values.tolist() if x not in dlst]
  ['A', 'B', 'C', 'D', 'E', 'F', 'G']
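The label-building options above differ in how they treat column order, which the alphabetically ordered setup frame hides. The following sketch uses a made-up frame with deliberately unsorted columns (the labels are assumptions for the example) to show that Index.difference returns sorted labels while Index.drop and the list comprehension keep the original order. np.setdiff1d also returns sorted values.

```python
import pandas as pd

# Made-up frame with columns in a non-alphabetical order
df = pd.DataFrame([[1, 2, 3, 4]], columns=["C", "A", "Z", "B"])
dlst = ["Z", "K"]   # 'K' is not a column

sorted_keep = df.columns.difference(dlst)               # sorts the labels
ordered_keep = df.columns.drop(dlst, errors="ignore")   # keeps original order
comp_keep = [c for c in df.columns if c not in dlst]    # keeps original order

print(list(sorted_keep))    # ['A', 'B', 'C']
print(list(ordered_keep))   # ['C', 'A', 'B']
print(comp_keep)            # ['C', 'A', 'B']
```

If the column order matters downstream, that is a reason to prefer one of the order-preserving forms regardless of timing.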
Columns from Labels
For the sake of comparing the selection process, assume:
cols = [x for x in df.columns.values.tolist() if x not in dlst]
Then we can evaluate
- df.loc[:, cols]
- df[cols]
- df.reindex(columns=cols)
- df.reindex_axis(cols, 1) (reindex_axis was removed in pandas 1.0)
Which all evaluate to:
   A  B  C  D  E  F  G
0  1  2  3  4  5  6  7
1  1  2  3  4  5  6  7
2  1  2  3  4  5  6  7
Boolean Slice
We can construct an array/list of booleans for slicing:
- ~df.columns.isin(dlst)
- ~np.in1d(df.columns.values, dlst)
- [x not in dlst for x in df.columns.values.tolist()]
- (df.columns.values[:, None] != dlst).all(1)
Columns from Boolean
For the sake of comparison:
bools = [x not in dlst for x in df.columns.values.tolist()]
- df.loc[:, bools]
Which all evaluate to:
   A  B  C  D  E  F  G
0  1  2  3  4  5  6  7
1  1  2  3  4  5  6  7
2  1  2  3  4  5  6  7
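A compact version of the boolean route, as a sketch on a made-up three-column frame (the labels are placeholders): build a mask with Index.isin, negate it, and pass it as the column selector to .loc. The comma after the row slice is what makes the mask apply to columns.

```python
import pandas as pd

# Made-up example data; 'K' is deliberately not a column
df = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "H"])
dlst = ["H", "K"]

mask = ~df.columns.isin(dlst)   # True for every column we want to keep
kept = df.loc[:, mask]          # rows ':', columns selected by the mask

print(mask.tolist())            # [True, True, False]
print(list(kept.columns))       # ['A', 'B']
```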
Robust Timing
Functions
setdiff1d = lambda df, dlst: np.setdiff1d(df.columns.values, dlst)
difference = lambda df, dlst: df.columns.difference(dlst)
columndrop = lambda df, dlst: df.columns.drop(dlst, errors='ignore')
setdifflst = lambda df, dlst: list(set(df.columns.values.tolist()).difference(dlst))
comprehension = lambda df, dlst: [x for x in df.columns.values.tolist() if x not in dlst]
loc = lambda df, cols: df.loc[:, cols]
slc = lambda df, cols: df[cols]
ridx = lambda df, cols: df.reindex(columns=cols)
ridxa = lambda df, cols: df.reindex_axis(cols, 1)  # needs pandas < 1.0; reindex_axis was removed in 1.0
isin = lambda df, dlst: ~df.columns.isin(dlst)
in1d = lambda df, dlst: ~np.in1d(df.columns.values, dlst)
comp = lambda df, dlst: [x not in dlst for x in df.columns.values.tolist()]
brod = lambda df, dlst: (df.columns.values[:, None] != dlst).all(1)
Testing
from timeit import timeit

res1 = pd.DataFrame(
    index=pd.MultiIndex.from_product([
        'loc slc ridx ridxa'.split(),
        'setdiff1d difference columndrop setdifflst comprehension'.split(),
    ], names=['Select', 'Label']),
    columns=[10, 30, 100, 300, 1000],
    dtype=float
)
res2 = pd.DataFrame(
    index=pd.MultiIndex.from_product([
        'loc'.split(),
        'isin in1d comp brod'.split(),
    ], names=['Select', 'Label']),
    columns=[10, 30, 100, 300, 1000],
    dtype=float
)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
res = pd.concat([res1, res2]).sort_index()
dres = pd.Series(index=res.columns, name='drop', dtype=float)
for j in res.columns:
    dlst = list(range(j))
    cols = list(range(j // 2, j + j // 2))
    d = pd.DataFrame(1, range(10), cols)
    # baseline: time plain drop for this column count
    dres.at[j] = timeit('d.drop(dlst, axis=1, errors="ignore")', 'from __main__ import d, dlst', number=100)
    for s, l in res.index:
        stmt = '{}(d, {}(d, dlst))'.format(s, l)
        setp = 'from __main__ import d, dlst, {}, {}'.format(s, l)
        res.at[(s, l), j] = timeit(stmt, setp, number=100)
rs = res / dres
rs
                            10        30       100       300       1000
Select Label
loc    brod           0.747373  0.861979  0.891144  1.284235   3.872157
       columndrop     1.193983  1.292843  1.396841  1.484429   1.335733
       comp           0.802036  0.732326  1.149397  3.473283  25.565922
       comprehension  1.463503  1.568395  1.866441  4.421639  26.552276
       difference     1.413010  1.460863  1.587594  1.568571   1.569735
       in1d           0.818502  0.844374  0.994093  1.042360   1.076255
       isin           1.008874  0.879706  1.021712  1.001119   0.964327
       setdiff1d      1.352828  1.274061  1.483380  1.459986   1.466575
       setdifflst     1.233332  1.444521  1.714199  1.797241   1.876425
ridx   columndrop     0.903013  0.832814  0.949234  0.976366   0.982888
       comprehension  0.777445  0.827151  1.108028  3.473164  25.528879
       difference     1.086859  1.081396  1.293132  1.173044   1.237613
       setdiff1d      0.946009  0.873169  0.900185  0.908194   1.036124
       setdifflst     0.732964  0.823218  0.819748  0.990315   1.050910
ridxa  columndrop     0.835254  0.774701  0.907105  0.908006   0.932754
       comprehension  0.697749  0.762556  1.215225  3.510226  25.041832
       difference     1.055099  1.010208  1.122005  1.119575   1.383065
       setdiff1d      0.760716  0.725386  0.849949  0.879425   0.946460
       setdifflst     0.710008  0.668108  0.778060  0.871766   0.939537
slc    columndrop     1.268191  1.521264  2.646687  1.919423   1.981091
       comprehension  0.856893  0.870365  1.290730  3.564219  26.208937
       difference     1.470095  1.747211  2.886581  2.254690   2.050536
       setdiff1d      1.098427  1.133476  1.466029  2.045965   3.123452
       setdifflst     0.833700  0.846652  1.013061  1.110352   1.287831
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
# one bar chart per selection method, all sharing the y axis
for i, (n, g) in enumerate([(n, g.xs(n)) for n, g in rs.groupby('Select')]):
    ax = axes[i // 2, i % 2]
    g.plot.bar(ax=ax, title=n)
    ax.legend_.remove()
fig.tight_layout()
This is relative to the time it takes to run df.drop(dlst, axis=1, errors='ignore'). It seems like after all that effort, we only improve performance modestly.
In fact the best solutions use reindex or reindex_axis on the hack list(set(df.columns.values.tolist()).difference(dlst)). A close second and still very marginally better than drop is np.setdiff1d.
# DataFrame.lookup was removed in pandas 2.0; this runs on older versions
rs.idxmin().pipe(
    lambda x: pd.DataFrame(
        dict(idx=x.values, val=rs.lookup(x.values, x.index)),
        x.index
    )
)
                      idx       val
10     (ridx, setdifflst)  0.653431
30    (ridxa, setdifflst)  0.746143
100   (ridxa, setdifflst)  0.816207
300    (ridx, setdifflst)  0.780157
1000  (ridxa, setdifflst)  0.861622
#11
10
Pandas 0.21+ Answer
Pandas version 0.21 has slightly changed the drop method to include both the index and columns parameters to match the signature of the rename and reindex methods.
df.drop(columns=['column_a', 'column_c'])
Personally, I prefer using the axis parameter to denote columns or index because it is the predominant keyword parameter used in nearly all pandas methods. But, now you have some added choices in version 0.21.
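As a quick check that the two spellings are interchangeable, here is a sketch on a made-up three-column frame (column_b is an assumed extra column) comparing the columns keyword with the axis form.

```python
import pandas as pd

# Made-up example data reusing the column names from the answer
df = pd.DataFrame({"column_a": [1], "column_b": [2], "column_c": [3]})

with_columns = df.drop(columns=["column_a", "column_c"])
with_axis = df.drop(["column_a", "column_c"], axis=1)

print(with_columns.equals(with_axis))   # True
print(list(with_columns.columns))       # ['column_b']
```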
#12
3
The dot syntax deletes a property in JavaScript, but it does not delete a DataFrame column in Python.
Python: del df['column_name']
JavaScript: delete df['column_name'] OR delete df.column_name