I have a data frame df that looks like this:
   birth_year  person
0        1980       0
1        1981       1
2        1982       2
3        1983       3
4        1984       4
The birth_year column looks like numbers, but when I check the data type with df['birth_year'].dtype, the result is dtype('O').
So I thought it might actually be a string, and tried to convert it to numbers with df['birth_year'].astype('int'), but got an error:
UnicodeEncodeError: 'decimal' codec can't encode characters in position 0-3: invalid decimal Unicode string
After a little googling I came to understand (possibly wrongly) that there seem to be some invisible characters in it. When I access a value with df['birth_year'][0], I get 1980L rather than 1980.
So what exactly is the data type, and how can I convert it to integers? I read somewhere that if the returned data type is dtype('O'), it usually means it's a string, but that doesn't seem to be the case here.
1 Answer
#1
2
You can normally convert using df['birth_year'].astype(int), but it seems you have invalid values. Using df = df.convert_objects(convert_numeric=True) will coerce invalid values to NaN, which may or may not be what you want, as this changes the dtype to float64 rather than int64. (convert_objects has since been removed from pandas; pd.to_numeric(df['birth_year'], errors='coerce') performs the same coercion on a single column.)
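To illustrate the coercion described above, here is a minimal sketch using pd.to_numeric. The sample data, including the bad value '19x2', is made up for illustration and is not the asker's actual data.

```python
import pandas as pd

# Hypothetical sample: an object-dtype column containing one unparseable value.
df = pd.DataFrame({
    "birth_year": pd.Series(["1980", "1981", "19x2", "1983"], dtype=object),
    "person": [0, 1, 2, 3],
})

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising an exception.
years = pd.to_numeric(df["birth_year"], errors="coerce")

# Because NaN is a float, the resulting column is float64, not int64.
print(years.dtype)
print(years.isnull().sum(), "value(s) could not be converted")
```

If no value is coerced to NaN, the result can be cast to integers afterwards with .astype('int64').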
It's best to look at the invalid string values to determine why they failed to convert.
So you could do df[df['birth_year'].convert_objects(convert_numeric=True).isnull()] to get the rows that have invalid 'birth_year' values.
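The row lookup can be sketched as follows, assuming a current pandas where pd.to_numeric stands in for convert_objects. The data frame is again a made-up example; printing repr() of each offending value is one way to expose invisible or unexpected characters.

```python
import pandas as pd

# Hypothetical sample data with one value that is not a valid number.
df = pd.DataFrame({
    "birth_year": pd.Series(["1980", "1981", "19x2", "1983"], dtype=object),
    "person": [0, 1, 2, 3],
})

# Coerce to numbers; unparseable entries become NaN.
converted = pd.to_numeric(df["birth_year"], errors="coerce")

# Boolean mask selects the rows whose birth_year failed to convert.
bad_rows = df[converted.isnull()]
print(bad_rows)

# repr() shows the raw value, including any hidden characters.
for value in bad_rows["birth_year"]:
    print(type(value).__name__, repr(value))

# After deciding how to handle the bad rows (here they are simply dropped),
# the remaining values convert cleanly to integers.
valid = converted.notnull()
clean = df[valid].copy()
clean["birth_year"] = converted[valid].astype("int64")
print(clean.dtypes)
```

Dropping the bad rows is only one option; fixing the source values and re-running the conversion keeps every row.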