pandas:多列的to_numeric

时间:2021-10-10 23:01:02

I'm working with the following df:

我正在使用以下df:

c.sort_values('2005', ascending=False).head(3)
    GeoName     ComponentName   IndustryId  IndustryClassification  Description     2004    2005    2006    2007    2008    2009    2010    2011    2012    2013    2014
37926   Alabama     Real GDP by state   9   213     Support activities for mining   99  98  117     117     115     87  96  95  103     102     (NA)
37951   Alabama     Real GDP by state   34  42  Wholesale trade     9898    10613   10952   11034   11075   9722    9765    9703    9600    9884    10199
37932   Alabama     Real GDP by state   15  327     Nonmetallic mineral products manufacturing  980     968     940     1084    861     724     714     701     589     641     (NA)

I want to force numeric on all of the years:

我想在所有年份强制数字:

c['2014'] = pd.to_numeric(c['2014'], errors='coerce')

is there an easy way to do this or do I have to type them all out?

是否有一种简单的方法可以做到这一点,还是我必须全部输入?

4 个解决方案

#1


28  

UPDATE: you don't need to convert your values afterwards, you can do it on-the-fly when reading your CSV:

更新:之后您无需转换您的值,您可以在阅读CSV时即时执行此操作:

In [165]: df=pd.read_csv(url, index_col=0, na_values=['(NA)']).fillna(0)

In [166]: df.dtypes
Out[166]:
GeoName                    object
ComponentName              object
IndustryId                  int64
IndustryClassification     object
Description                object
2004                        int64
2005                        int64
2006                        int64
2007                        int64
2008                        int64
2009                        int64
2010                        int64
2011                        int64
2012                        int64
2013                        int64
2014                      float64
dtype: object

If you need to convert multiple columns to numeric dtypes - use the following technique:

如果需要将多个列转换为数字dtypes - 请使用以下技术:

Sample source DF:

样本来源DF:

In [271]: df
Out[271]:
     id    a  b  c  d  e    f
0  id_3  AAA  6  3  5  8    1
1  id_9    3  7  5  7  3  BBB
2  id_7    4  2  3  5  4    2
3  id_0    7  3  5  7  9    4
4  id_0    2  4  6  4  0    2

In [272]: df.dtypes
Out[272]:
id    object
a     object
b      int64
c      int64
d      int64
e      int64
f     object
dtype: object

Converting selected columns to numeric dtypes:

将所选列转换为数字dtypes:

In [273]: cols = df.columns.drop('id')

In [274]: df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

In [275]: df
Out[275]:
     id    a  b  c  d  e    f
0  id_3  NaN  6  3  5  8  1.0
1  id_9  3.0  7  5  7  3  NaN
2  id_7  4.0  2  3  5  4  2.0
3  id_0  7.0  3  5  7  9  4.0
4  id_0  2.0  4  6  4  0  2.0

In [276]: df.dtypes
Out[276]:
id     object
a     float64
b       int64
c       int64
d       int64
e       int64
f     float64
dtype: object

PS if you want to select all string (object) columns use the following simple trick:

PS如果要选择所有字符串(对象)列,请使用以下简单技巧:

cols = df.columns[df.dtypes.eq('object')]

#2


21  

another way is using apply, one liner:

另一种方法是使用apply,one liner:

cols = ['col1', 'col2', 'col3']
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)

#3


8  

You can use:

您可以使用:

print df.columns[5:]
Index([u'2004', u'2005', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011',
       u'2012', u'2013', u'2014'],
      dtype='object')

for col in  df.columns[5:]:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213   
37951  Alabama  Real GDP by state          34                      42   
37932  Alabama  Real GDP by state          15                     327   

                                      Description  2004   2005   2006   2007  \
37926               Support activities for mining    99     98    117    117   
37951                            Wholesale  trade  9898  10613  10952  11034   
37932  Nonmetallic mineral products manufacturing   980    968    940   1084   

        2008  2009  2010  2011  2012  2013     2014  
37926    115    87    96    95   103   102      NaN  
37951  11075  9722  9765  9703  9600  9884  10199.0  
37932    861   724   714   701   589   641      NaN  

Another solution with filter:

过滤器的另一个解决方

print df.filter(like='20')
       2004   2005   2006   2007   2008  2009  2010  2011  2012  2013   2014
37926    99     98    117    117    115    87    96    95   103   102   (NA)
37951  9898  10613  10952  11034  11075  9722  9765  9703  9600  9884  10199
37932   980    968    940   1084    861   724   714   701   589   641   (NA)

for col in  df.filter(like='20').columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213   
37951  Alabama  Real GDP by state          34                      42   
37932  Alabama  Real GDP by state          15                     327   

                                      Description  2004   2005   2006   2007  \
37926               Support activities for mining    99     98    117    117   
37951                            Wholesale  trade  9898  10613  10952  11034   
37932  Nonmetallic mineral products manufacturing   980    968    940   1084   

        2008  2009  2010  2011  2012  2013     2014  
37926    115    87    96    95   103   102      NaN  
37951  11075  9722  9765  9703  9600  9884  10199.0  
37932    861   724   714   701   589   641      NaN  

#4


-1  

If you are looking for a range of columns, you can try this:

如果您正在寻找一系列列,您可以尝试这样做:

df.iloc[7:] = df.iloc[7:].astype(float)

The examples above will convert type to be float, for all the columns begin with the 7th to the end. You of course can use different type or different range.

上面的示例将type转换为float,因为所有列都以7开头到结尾。你当然可以使用不同的类型或不同的范围。

I think this is useful when you have a big range of columns to convert and a lot of rows. It doesn't make you go over each row by yourself - I believe numpy do it more efficiently.

我认为当你有大量的列要转换和很多行时,这很有用。它不会让你自己超越每一行 - 我相信numpy会更有效地做到这一点。

This is useful only if you know that all the required columns contain numbers only - it will not change "bad values" (like string) to be NaN for you.

仅当您知道所有必需列仅包含数字时才有用 - 它不会将“错误值”(如字符串)更改为NaN。

#1


28  

UPDATE: you don't need to convert your values afterwards, you can do it on-the-fly when reading your CSV:

更新:之后您无需转换您的值,您可以在阅读CSV时即时执行此操作:

In [165]: df=pd.read_csv(url, index_col=0, na_values=['(NA)']).fillna(0)

In [166]: df.dtypes
Out[166]:
GeoName                    object
ComponentName              object
IndustryId                  int64
IndustryClassification     object
Description                object
2004                        int64
2005                        int64
2006                        int64
2007                        int64
2008                        int64
2009                        int64
2010                        int64
2011                        int64
2012                        int64
2013                        int64
2014                      float64
dtype: object

If you need to convert multiple columns to numeric dtypes - use the following technique:

如果需要将多个列转换为数字dtypes - 请使用以下技术:

Sample source DF:

样本来源DF:

In [271]: df
Out[271]:
     id    a  b  c  d  e    f
0  id_3  AAA  6  3  5  8    1
1  id_9    3  7  5  7  3  BBB
2  id_7    4  2  3  5  4    2
3  id_0    7  3  5  7  9    4
4  id_0    2  4  6  4  0    2

In [272]: df.dtypes
Out[272]:
id    object
a     object
b      int64
c      int64
d      int64
e      int64
f     object
dtype: object

Converting selected columns to numeric dtypes:

将所选列转换为数字dtypes:

In [273]: cols = df.columns.drop('id')

In [274]: df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

In [275]: df
Out[275]:
     id    a  b  c  d  e    f
0  id_3  NaN  6  3  5  8  1.0
1  id_9  3.0  7  5  7  3  NaN
2  id_7  4.0  2  3  5  4  2.0
3  id_0  7.0  3  5  7  9  4.0
4  id_0  2.0  4  6  4  0  2.0

In [276]: df.dtypes
Out[276]:
id     object
a     float64
b       int64
c       int64
d       int64
e       int64
f     float64
dtype: object

PS if you want to select all string (object) columns use the following simple trick:

PS如果要选择所有字符串(对象)列,请使用以下简单技巧:

cols = df.columns[df.dtypes.eq('object')]

#2


21  

another way is using apply, one liner:

另一种方法是使用apply,one liner:

cols = ['col1', 'col2', 'col3']
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)

#3


8  

You can use:

您可以使用:

print df.columns[5:]
Index([u'2004', u'2005', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011',
       u'2012', u'2013', u'2014'],
      dtype='object')

for col in  df.columns[5:]:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213   
37951  Alabama  Real GDP by state          34                      42   
37932  Alabama  Real GDP by state          15                     327   

                                      Description  2004   2005   2006   2007  \
37926               Support activities for mining    99     98    117    117   
37951                            Wholesale  trade  9898  10613  10952  11034   
37932  Nonmetallic mineral products manufacturing   980    968    940   1084   

        2008  2009  2010  2011  2012  2013     2014  
37926    115    87    96    95   103   102      NaN  
37951  11075  9722  9765  9703  9600  9884  10199.0  
37932    861   724   714   701   589   641      NaN  

Another solution with filter:

过滤器的另一个解决方

print df.filter(like='20')
       2004   2005   2006   2007   2008  2009  2010  2011  2012  2013   2014
37926    99     98    117    117    115    87    96    95   103   102   (NA)
37951  9898  10613  10952  11034  11075  9722  9765  9703  9600  9884  10199
37932   980    968    940   1084    861   724   714   701   589   641   (NA)

for col in  df.filter(like='20').columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213   
37951  Alabama  Real GDP by state          34                      42   
37932  Alabama  Real GDP by state          15                     327   

                                      Description  2004   2005   2006   2007  \
37926               Support activities for mining    99     98    117    117   
37951                            Wholesale  trade  9898  10613  10952  11034   
37932  Nonmetallic mineral products manufacturing   980    968    940   1084   

        2008  2009  2010  2011  2012  2013     2014  
37926    115    87    96    95   103   102      NaN  
37951  11075  9722  9765  9703  9600  9884  10199.0  
37932    861   724   714   701   589   641      NaN  

#4


-1  

If you are looking for a range of columns, you can try this:

如果您正在寻找一系列列,您可以尝试这样做:

df.iloc[7:] = df.iloc[7:].astype(float)

The examples above will convert type to be float, for all the columns begin with the 7th to the end. You of course can use different type or different range.

上面的示例将type转换为float,因为所有列都以7开头到结尾。你当然可以使用不同的类型或不同的范围。

I think this is useful when you have a big range of columns to convert and a lot of rows. It doesn't make you go over each row by yourself - I believe numpy do it more efficiently.

我认为当你有大量的列要转换和很多行时,这很有用。它不会让你自己超越每一行 - 我相信numpy会更有效地做到这一点。

This is useful only if you know that all the required columns contain numbers only - it will not change "bad values" (like string) to be NaN for you.

仅当您知道所有必需列仅包含数字时才有用 - 它不会将“错误值”(如字符串)更改为NaN。