How to force pandas read_csv to use float32 for all float columns?

Because:

I don't need double precision.
My machine has limited memory and I want to process bigger datasets.
I need to pass the extracted data (as a matrix) to BLAS libraries, and single-precision BLAS calls are 2x faster than their double-precision equivalents.

Note that not all columns in the raw csv file have float types. I only need to set float32 as the default for float columns.

1 Solution

#1

Try:

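import numpy as np
import pandas as pd

filename = "data.csv"  # placeholder; replace with the path to your CSV file

# Sample 100 rows of data to determine dtypes.
df_test = pd.read_csv(filename, nrows=100)

# Collect the columns inferred as float64 and map each one to float32.
float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
float32_cols = {c: np.float32 for c in float_cols}

# Read the full file, overriding the dtype of the float columns only.
df = pd.read_csv(filename, engine='c', dtype=float32_cols)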

This first reads a sample of 100 rows of data (modify as required) to determine the type of each column.

It then creates a list of the columns whose dtype is 'float64', and uses a dictionary comprehension to build a dictionary with those columns as the keys and 'np.float32' as the value for each key.

Finally, it reads the whole file using the 'c' engine (required for assigning dtypes to columns in older pandas versions) and passes the float32_cols dictionary as the dtype argument. Columns not listed in the dictionary keep their inferred types.
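The steps above can be tried without a file on disk. The following is a minimal sketch, assuming an in-memory CSV with the same columns as the example output; the helper name `read_csv_float32` and its `make_buffer` parameter are hypothetical and not part of pandas. It takes a callable rather than a path because an in-memory buffer cannot be read twice.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk.
csv_text = (
    "int_col,float1,string_col,float2\n"
    "1,1.2,a,2.2\n"
    "2,1.3,b,3.3\n"
    "3,1.4,c,4.4\n"
)


def read_csv_float32(make_buffer, sample_rows=100):
    """Read a CSV, storing columns inferred as float64 in the sample as float32.

    make_buffer is a zero-argument callable returning a fresh file-like
    object (or a path), because the data is read twice.
    """
    # First pass: a small sample is enough to infer the column types.
    sample = pd.read_csv(make_buffer(), nrows=sample_rows)
    float_cols = sample.select_dtypes(include="float64").columns
    # Second pass: override the dtype of the float columns only.
    return pd.read_csv(make_buffer(), dtype={c: np.float32 for c in float_cols})


df32 = read_csv_float32(lambda: io.StringIO(csv_text))
print(df32.dtypes)
```

For a real file, the callable can simply return the path, for example `lambda: filename`.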

>>> df = pd.read_csv(filename, nrows=100)
>>> df
   int_col  float1 string_col  float2
0        1     1.2          a     2.2
1        2     1.3          b     3.3
2        3     1.4          c     4.4

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float64
string_col    3 non-null object
float2        3 non-null float64
dtypes: float64(2), int64(1), object(1)

>>> df32 = pd.read_csv(filename, engine='c', dtype={c: np.float32 for c in float_cols})
>>> df32.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float32
string_col    3 non-null object
float2        3 non-null float32
dtypes: float32(2), int64(1), object(1)
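One caveat of inferring types from a sample: a column that holds only integers in the sampled rows but contains a float further down is not in the dtype dictionary, so the full read stores it as float64. A minimal sketch of a follow-up step that converts any remaining float64 columns after loading (the helper name `downcast_floats` is hypothetical, and the tiny CSV is made up for illustration):

```python
import io

import numpy as np
import pandas as pd


def downcast_floats(df):
    """Return a copy of df with every float64 column converted to float32."""
    float64_cols = df.select_dtypes(include="float64").columns
    return df.astype({c: np.float32 for c in float64_cols})


# Column 'a' looks like an integer column until the third row.
csv_text = "a,b\n1,x\n2,y\n3.5,z\n"

sample = pd.read_csv(io.StringIO(csv_text), nrows=2)  # 'a' is inferred as int64
full = pd.read_csv(io.StringIO(csv_text))             # 'a' is float64 here
fixed = downcast_floats(full)                         # 'a' becomes float32

print(sample["a"].dtype, full["a"].dtype, fixed["a"].dtype)
```

Note that this conversion happens after the float64 data is already in memory, so it reduces the size of the final frame but not the peak memory used during the read.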
