I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is that the id series has missing/empty values.
When I try to cast the id column to integer while reading the .csv, I get:
df = pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
Alternatively, I tried to convert the column type after reading, as below, but this time I get:
df = pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
How can I tackle this?
6 Answers
#1
79
The lack of a NaN representation in integer columns is a pandas "gotcha": NumPy integer dtypes have no way to encode a missing value.
The usual workaround is to simply use floats, which can represent missing values as NaN.
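A minimal sketch of both options, assuming a small inline CSV with a missing id. Note that pandas 0.24+ also offers the nullable "Int64" extension dtype, which keeps integers while allowing missing values:

```python
import io
import pandas as pd

csv = io.StringIO("id,name\n1,a\n,b\n3,c\n")

# Default behaviour: the id column is promoted to float64 so that
# the missing entry can be stored as NaN.
df = pd.read_csv(csv)
print(df["id"].dtype)  # float64

# pandas >= 0.24: the nullable "Int64" extension dtype (capital I)
# keeps integer values while representing the gap as <NA>.
csv.seek(0)
df = pd.read_csv(csv, dtype={"id": "Int64"})
print(df["id"].dtype)  # Int64
```

The capital-I "Int64" string is what selects the nullable extension dtype; lowercase "int64" would still raise on the missing value.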
#2
2
If you can modify your stored data, use a sentinel value for missing id. A common use case, inferred from the column name, is that id is an integer strictly greater than zero, so you could use 0 as the sentinel value and write:
if row['id']:
    regular_process(row)
else:
    special_process(row)
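A hedged sketch of that approach, filling missing ids with the sentinel 0 so the cast to int succeeds (regular_process and special_process stand in for your own handlers):

```python
import io
import pandas as pd

csv = io.StringIO("id,name\n1,a\n,b\n3,c\n")

# Replace missing ids with the sentinel 0, then the integer cast works.
df = pd.read_csv(csv)
df["id"] = df["id"].fillna(0).astype(int)

for _, row in df.iterrows():
    if row["id"]:   # non-zero: a real id -> regular_process(row)
        pass
    else:           # sentinel 0: originally missing -> special_process(row)
        pass

print(df["id"].tolist())  # [1, 0, 3]
```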
#3
0
Assume your DateColumn, formatted like 3312018.0, should be converted to the string 03/31/2018, and some records are missing or 0:
df['DateColumn'] = df['DateColumn'].fillna(0).astype(int)        # drop the trailing .0; missing -> 0
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))  # pad to 8 digits: MMDDYYYY
df.loc[df['DateColumn'] == '00000000', 'DateColumn'] = '01011980'  # default date for missing/0
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
#4
0
My use case is munging data prior to loading it into a DB table:
df[col] = df[col].fillna(-1)             # temporarily replace NaN with a sentinel
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)  # requires: import numpy as np
Remove NaNs, convert to int, convert to str, and then reinsert NaNs.
It's not pretty but it gets the job done!
#5
0
I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pandas' pd.read_csv in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:
import pandas as pd

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    assert 'dtype' not in kwargs
    df = pd.read_csv(file_path, **kwargs)
    for col, typ in custom_dtype.items():
        if fill_values is None or col not in fill_values:
            fill_val = -1
        else:
            fill_val = fill_values[col]
        df[col] = df[col].fillna(fill_val).astype(typ)
    return df
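A quick usage sketch, with the wrapper repeated so the example runs standalone, on an inline CSV whose second row is missing its id:

```python
import io
import pandas as pd

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    # Same wrapper as above, repeated here so this example is self-contained.
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    assert 'dtype' not in kwargs
    df = pd.read_csv(file_path, **kwargs)
    for col, typ in custom_dtype.items():
        fill_val = -1 if fill_values is None or col not in fill_values else fill_values[col]
        df[col] = df[col].fillna(fill_val).astype(typ)
    return df

# Missing id in the second row; fill it with 0 instead of the default -1.
csv = io.StringIO("id,name\n1,a\n,b\n3,c\n")
df = custom_read_csv(csv, custom_dtype={'id': int}, fill_values={'id': 0})
print(df['id'].tolist())  # [1, 0, 3]
```

read_csv accepts any file-like object, so the StringIO here stands in for a real file path.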
#6
-6
In my case, I edited the column format of the CSV, i.e. changed the format of the column from General to Number. Then I was able to change the type in pandas:
df = pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)