I'm loading a pandas DataFrame from Excel that contains many data types. Two particular columns should be floats, but occasionally a researcher entered a comment such as "not measured." I need to drop every row where the value in either of those two columns is not a number, while preserving the non-numeric data in the other columns. A simple use case looks like this (the real table has several thousand rows):
import pandas as pd
df = pd.DataFrame(dict(A = pd.Series([1,2,3,4,5]), B = pd.Series([96,33,45,'',8]), C = pd.Series([12,'Not measured',15,66,42]), D = pd.Series(['apples', 'oranges', 'peaches', 'plums', 'pears'])))
This produces the following table:
   A   B             C        D
0  1  96            12   apples
1  2  33  Not measured  oranges
2  3  45            15  peaches
3  4                66    plums
4  5   8            42    pears
I'm not clear how to get to this table:
   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears
I tried dropna, but the types are "object" since there are non-numeric entries. I can't convert the values to floats without either converting the whole table, or doing one series at a time which loses the relationship to the other data in the row. Perhaps there is something simple I'm not understanding?
1 solution
You can first select the subset of columns B and C, apply to_numeric with errors='coerce' so that non-numeric entries become NaN, and check that all values in each row are notnull. Then use boolean indexing:
print(df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1))
0     True
1    False
2     True
3    False
4     True
dtype: bool
print(df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)])
   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears
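Since the question says B and C should end up as floats, a variation on the same idea is to coerce the two columns on a copy and then call dropna with subset, which keeps each row intact. This is only a minimal sketch: the names num_cols, converted and clean are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [96, 33, 45, '', 8],
    'C': [12, 'Not measured', 15, 66, 42],
    'D': ['apples', 'oranges', 'peaches', 'plums', 'pears'],
})

num_cols = ['B', 'C']  # illustrative name: the columns that must be numeric

# Work on a copy so the original object columns stay untouched.
converted = df.copy()
for col in num_cols:
    # Anything that cannot be parsed as a number becomes NaN.
    converted[col] = pd.to_numeric(converted[col], errors='coerce')

# Drop rows with NaN in B or C only; A and D are not inspected.
clean = converted.dropna(subset=num_cols)
print(clean)
print(clean.dtypes)
```

Unlike the boolean-indexing version, B and C in clean are already float columns, so no second conversion step is needed.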
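The next solution uses str.isdigit with isnull, notnull and xor (^). It relies on str.isdigit returning NaN for values that are not strings, so the real numbers are the entries where the result is null. Note that the xor only gives the right mask when at most one of B and C is invalid in a given row: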
print(df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull())
0     True
1    False
2     True
3    False
4     True
dtype: bool
print(df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()])
   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears
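To see why isnull picks out the numbers here, it may help to look at what str.isdigit returns on a mixed column. The small series below is made up for illustration; it also shows that a number stored as text (such as '45') would not be treated as a real number by this trick.

```python
import pandas as pd

# Hypothetical mixed column: an int, an empty string, a digit string, a comment.
s = pd.Series([96, '', '45', 'Not measured'])

digits = s.str.isdigit()
print(digits)

# Non-string entries come back as NaN, so isnull() marks the real numbers.
is_real_number = digits.isnull()
print(is_real_number)
```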
But the solution using to_numeric on each column with notnull and isnull is the fastest (again combined with xor, so it assumes B and C are never both invalid in the same row):
print(df[pd.to_numeric(df['B'], errors='coerce').notnull()
         ^ pd.to_numeric(df['C'], errors='coerce').isnull()])
   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears
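If a row can have bad values in both B and C, the xor mask lets it through, because False ^ True is True. A sketch of the difference, using a small made-up frame and combining two notnull masks with & instead (the names b_ok, xor_mask and and_mask are illustrative):

```python
import pandas as pd

# Hypothetical data: the last row is invalid in BOTH B and C.
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [96, '', 'n/a'],
    'C': [12, 15, 'Not measured'],
})

b_ok = pd.to_numeric(df['B'], errors='coerce').notnull()
c_ok = pd.to_numeric(df['C'], errors='coerce').notnull()
c_bad = pd.to_numeric(df['C'], errors='coerce').isnull()

xor_mask = b_ok ^ c_bad   # the xor form: wrongly keeps the last row
and_mask = b_ok & c_ok    # keeps a row only if both columns are numeric

print(df[xor_mask])
print(df[and_mask])
```

On the sample data from the question both masks select the same rows, since no row there is invalid in both columns.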
Timings:
#len(df) = 5k
df = pd.concat([df]*1000).reset_index(drop=True)
In [611]: %timeit df[pd.to_numeric(df['B'], errors='coerce').notnull() ^ pd.to_numeric(df['C'], errors='coerce').isnull()]
1000 loops, best of 3: 1.88 ms per loop
In [612]: %timeit df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()]
100 loops, best of 3: 16.1 ms per loop
In [613]: %timeit df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]
The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.49 ms per loop
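The %timeit lines above need IPython. For a plain script, a rough equivalent can be sketched with the standard timeit module. The function names and the number/repeat values are arbitrary choices for illustration, and the figures will differ by machine and pandas version.

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [96, 33, 45, '', 8],
    'C': [12, 'Not measured', 15, 66, 42],
    'D': ['apples', 'oranges', 'peaches', 'plums', 'pears'],
})
df = pd.concat([df] * 1000).reset_index(drop=True)  # len(df) = 5k

def per_column():
    # to_numeric on each column, combined with xor as in the answer
    return df[pd.to_numeric(df['B'], errors='coerce').notnull()
              ^ pd.to_numeric(df['C'], errors='coerce').isnull()]

def apply_subset():
    # to_numeric applied to the B, C subset
    return df[df[['B', 'C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]

timings = {}
for func in (per_column, apply_subset):
    # Best of 3 runs of 20 calls each, reported per call.
    best = min(timeit.repeat(func, number=20, repeat=3)) / 20
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.2f} ms per loop")
```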