I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is that the id series has missing/empty values.
When I try to cast the id column to integer while reading the .csv, I get:
df = pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
Alternatively, I tried to convert the column type after reading, as below, but this time I get:
df = pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
How can I tackle this?
6 Answers
#1
79
The lack of a NaN representation in integer columns is a pandas "gotcha": NumPy integer dtypes have no way to encode a missing value.
The usual workaround is to simply use floats, which can represent missing values as NaN.
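A minimal sketch of both options, assuming a small inline CSV with a missing id. Note that pandas 0.24+ also offers the nullable "Int64" extension dtype, which keeps integers while allowing missing values:

```python
import io
import pandas as pd

csv = io.StringIO("id,name\n1,a\n,b\n3,c\n")

# Default behaviour: the id column is promoted to float64 so that
# the missing entry can be stored as NaN.
df = pd.read_csv(csv)
print(df["id"].dtype)  # float64

# pandas >= 0.24: the nullable "Int64" extension dtype (capital I)
# keeps integer values while representing the gap as <NA>.
csv.seek(0)
df = pd.read_csv(csv, dtype={"id": "Int64"})
print(df["id"].dtype)  # Int64
```

The capital-I "Int64" string is what selects the nullable extension dtype; lowercase "int64" would still raise on the missing value.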
#2
2
If you can modify your stored data, use a sentinel value for missing id. A common use case, inferred from the column name, is that id is an integer strictly greater than zero, so you could use 0 as the sentinel value and write:
if row['id']:
    regular_process(row)
else:
    special_process(row)
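A hedged sketch of that approach, filling missing ids with the sentinel 0 so the cast to int succeeds (regular_process and special_process stand in for your own handlers):

```python
import io
import pandas as pd

csv = io.StringIO("id,name\n1,a\n,b\n3,c\n")

# Replace missing ids with the sentinel 0, then the integer cast works.
df = pd.read_csv(csv)
df["id"] = df["id"].fillna(0).astype(int)

for _, row in df.iterrows():
    if row["id"]:   # non-zero: a real id -> regular_process(row)
        pass
    else:           # sentinel 0: originally missing -> special_process(row)
        pass

print(df["id"].tolist())  # [1, 0, 3]
```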
#3
0
Assume your DateColumn, formatted like 3312018.0, should be converted to the string 03/31/2018, and some records are missing or 0:
df['DateColumn'] = df['DateColumn'].fillna(0).astype(int)        # drop the trailing .0; missing -> 0
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))  # pad to 8 digits: MMDDYYYY
df.loc[df['DateColumn'] == '00000000', 'DateColumn'] = '01011980'  # default date for missing/0
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
#4
0
My use case is munging data prior to loading it into a DB table:
df[col] = df[col].fillna(-1)             # temporarily replace NaN with a sentinel
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)  # requires: import numpy as np
Remove NaNs, convert to int, convert to str, and then reinsert NaNs.
It's not pretty but it gets the job done!
#5
0
I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pandas' pd.read_csv in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:
import pandas as pd

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    assert 'dtype' not in kwargs
    df = pd.read_csv(file_path, **kwargs)
    for col, typ in custom_dtype.items():
        if fill_values is None or col not in fill_values:
            fill_val = -1
        else:
            fill_val = fill_values[col]
        df[col] = df[col].fillna(fill_val).astype(typ)
    return df
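A quick usage sketch, with the wrapper repeated so the example runs standalone, on an inline CSV whose second row is missing its id:

```python
import io
import pandas as pd

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    # Same wrapper as above, repeated here so this example is self-contained.
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    assert 'dtype' not in kwargs
    df = pd.read_csv(file_path, **kwargs)
    for col, typ in custom_dtype.items():
        fill_val = -1 if fill_values is None or col not in fill_values else fill_values[col]
        df[col] = df[col].fillna(fill_val).astype(typ)
    return df

# Missing id in the second row; fill it with 0 instead of the default -1.
csv = io.StringIO("id,name\n1,a\n,b\n3,c\n")
df = custom_read_csv(csv, custom_dtype={'id': int}, fill_values={'id': 0})
print(df['id'].tolist())  # [1, 0, 3]
```

read_csv accepts any file-like object, so the StringIO here stands in for a real file path.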
#6
-6
In my case, I edited the column format of the CSV, i.e. changed the format of the column from General to Number. Then I was able to change the type in pandas:
df = pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)