Pandas read_csv ignores column dtypes when I pass the skip_footer arg

Date: 2021-08-17 20:29:43

When I try to import a CSV file into a dataframe, pandas (0.13.1) ignores the dtype parameter. Is there a way to stop pandas from inferring the data type on its own?

I am merging several CSV files, and sometimes the CUSTOMER column contains letters, so pandas imports it as a string. When I try to merge two of the dataframes I get an error because I'm trying to merge columns of two different types. I need everything stored as strings.

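
For illustration, here is a minimal sketch of the mismatch with made-up data (the `orders`/`regions` names and values are hypothetical, not from the original files): one file's CUSTOMER key is inferred as int64 (the leading zero is lost), the other's is read as strings, and normalising both keys to zero-padded strings before merging avoids the type conflict.

```python
import pandas as pd

# Hypothetical data: one file's CUSTOMER key was inferred as int64
# (leading zero lost), the other's was read as strings.
orders = pd.DataFrame({"CUSTOMER": [3106, 3156], "ORDER NO": [253734, 290550]})
regions = pd.DataFrame({"CUSTOMER": ["03106", "A217"], "REGION": ["EAST", "WEST"]})

# Normalise both keys to zero-padded strings so the merge is well defined.
orders["CUSTOMER"] = orders["CUSTOMER"].astype(str).str.zfill(5)
merged = orders.merge(regions, on="CUSTOMER", how="inner")
```

After the cast, only "03106" matches; without it, the merge fails (or silently matches nothing) because int64 and object keys are incompatible.
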
Data snippet:

|WAREHOUSE|ERROR|CUSTOMER|ORDER NO|
|---------|-----|--------|--------|
|3615     |     |03106   |253734  |
|3615     |     |03156   |290550  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |

Import line:

df = pd.read_csv("SomeFile.csv", 
                 header=1,
                 skip_footer=1, 
                 usecols=[2, 3], 
                 dtype={'ORDER NO': str, 'CUSTOMER': str})

df.dtypes outputs this:

ORDER NO    int64
CUSTOMER    int64
dtype: object

2 Answers

#1 (17 votes)

Pandas 0.13.1 silently ignored the dtype argument because the C engine does not support skip_footer. This caused pandas to fall back to the Python engine, which in that version did not support dtype.

Solution? Use converters

df = pd.read_csv('SomeFile.csv', 
                 header=1,
                 skip_footer=1, 
                 usecols=[2, 3], 
                 converters={'CUSTOMER': str, 'ORDER NO': str},
                 engine='python')

Output:

In [1]: df.dtypes
Out[1]:
CUSTOMER    object
ORDER NO    object
dtype: object

In [2]: type(df['CUSTOMER'][0])
Out[2]: str

In [3]: df.head()
Out[3]:
  CUSTOMER ORDER NO
0    03106   253734
1    03156   290550
2    03175   262207
3    03175   262207
4    03175   262207

Leading zeros from the original file are preserved, and all data is stored as strings.

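
For what it's worth, on recent pandas versions the python engine does honour dtype, so the original call works once the argument is spelled skipfooter. A self-contained sketch, with inline stand-in data rather than the original SomeFile.csv:

```python
import io

import pandas as pd

# Inline stand-in for SomeFile.csv: a junk first row, the real header,
# two data rows, and a trailer row to be dropped by skipfooter.
csv_text = (
    "JUNK,JUNK,JUNK,JUNK\n"
    "WAREHOUSE,ERROR,CUSTOMER,ORDER NO\n"
    "3615,,03106,253734\n"
    "3615,,03156,290550\n"
    "TRAILER,,,\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    header=1,                 # second line holds the column names
    skipfooter=1,             # drop the trailer row (was skip_footer)
    usecols=[2, 3],
    dtype={"CUSTOMER": str, "ORDER NO": str},
    engine="python",          # skipfooter still requires the python engine
)
```

With dtype respected, CUSTOMER comes back as "03106" and "03156" with the leading zeros intact.
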
#2 (6 votes)

Unfortunately, neither converters nor newer pandas versions solve the more general problem of ensuring that read_csv never infers a float64 dtype. With pandas 0.15.2, the following example, using a CSV containing hexadecimal integers with NULL entries, shows that using converters for their intended purpose interferes with the dtype specification.

In [1]: df = pd.DataFrame(dict(a = ["0xff", "0xfe"], b = ["0xfd", None], c = [None, "0xfc"], d = [None, None]))
In [2]: df.to_csv("H:/tmp.csv", index = False)
In [3]: ef = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "abcd"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "abcd"})
In [4]: ef.dtypes.map(lambda x: x)
Out[4]:
a      int64
b    float64
c    float64
d     object
dtype: object

The specified dtype of object is only respected for the all-NULL column. In this case the float64 values could simply be converted back to integers, but by the pigeonhole principle, not every 64-bit integer can be represented exactly as a float64.

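
A quick check of that pigeonhole claim: 2**53 + 1 is the smallest positive integer that float64 cannot represent exactly, so a round-trip through float silently changes the value.

```python
# float64 has a 53-bit significand, so integers above 2**53 start
# colliding: 2**53 and 2**53 + 1 map to the same float.
n = 2 ** 53 + 1
assert float(n) == float(2 ** 53)       # two integers, one float
assert int(float(n)) != n               # the round trip loses the value
assert int(float(2 ** 53)) == 2 ** 53   # 2**53 itself survives intact
```
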
The best solution I have found for this more general case is to have pandas read the potentially problematic columns as strings, as already covered, and then convert only the slice of values that actually need conversion (rather than mapping the conversion over the whole column, which would again trigger an automatic dtype = float64 inference).

In [5]: ff = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "bc"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "ad"})
In [6]: ff.dtypes
Out[6]:
a     int64
b    object
c    object
d    object
dtype: object
In [7]: for c in "bc":
   .....:     ff.loc[~pd.isnull(ff[c]), c] = ff[c][~pd.isnull(ff[c])].map(lambda x: int(x, 16))
   .....:
In [8]: ff.dtypes
Out[8]:
a     int64
b    object
c    object
d    object
dtype: object
In [9]: [(ff[c][i], type(ff[c][i])) for c in ff.columns for i in ff.index]
Out[9]:
[(255, numpy.int64),
 (254, numpy.int64),
 (253L, long),
 (nan, float),
 (nan, float),
 (252L, long),
 (None, NoneType),
 (None, NoneType)]

As far as I have been able to determine, at least up to version 0.15.2, there is no way to avoid post-processing of string values in situations like this.

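
For readers on modern pandas: version 0.24 added nullable integer dtypes, which remove the float64 fallback for integer columns with missing values. A sketch under that assumption, with the hex data inlined rather than read from H:/tmp.csv:

```python
import io

import pandas as pd

# Same shape of data as the example above: hex integers with a NULL entry.
csv_text = "a,b\n0xff,0xfd\n0xfe,\n"
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Parse the hex strings, leaving missing cells alone, then cast to the
# nullable "Int64" dtype (pandas >= 0.24) instead of letting the column
# degrade to float64.
for col in df.columns:
    df[col] = (
        df[col]
        .map(lambda x: int(x, 16) if isinstance(x, str) else x)
        .astype("Int64")
    )
```

Both columns end up as true integers, with the missing cell held as pd.NA rather than a float NaN.
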