Python pandas: How to specify data types when reading an Excel file?

I am importing an Excel file into a pandas dataframe with the pandas.read_excel() function.

One of the columns is the primary key of the table: it's all numbers, but it's stored as text (the little green triangle in the top left of the Excel cells confirms this).

However, when I import the file into a pandas dataframe, the column gets imported as a float. This means that, for example, '0614' becomes 614.

Is there a way to specify the datatype when importing a column? I understand this is possible when importing CSV files but couldn't find anything in the syntax of read_excel().

The only solution I can think of is to add an arbitrary letter at the beginning of the text in Excel (converting '0614' into 'A0614') to make sure the column is imported as text, and then chop off the 'A' in Python so I can match it to other tables I am importing from SQL.

4 solutions

#1

You just specify converters. I created an excel spreadsheet of the following structure:

names   ages
bob     05
tom     4
suzy    3

Where the "ages" column is formatted as strings. To load:

import pandas as pd

# Note: `sheet_name` replaced the older `sheetname` keyword in pandas 0.21.
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', header=0, converters={'names': str, 'ages': str})
>>> df
       names ages
   0   bob   05
   1   tom   4
   2   suzy  3
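
The round trip can be checked without a file on disk. The sketch below is illustrative rather than part of the answer: it assumes the openpyxl engine is installed, and the in-memory workbook built by the hypothetical `make_workbook` helper stands in for Book1.xlsx.

```python
import io

import pandas as pd


def make_workbook() -> io.BytesIO:
    """Build a small in-memory .xlsx whose 'ages' cells are stored as text."""
    buf = io.BytesIO()
    frame = pd.DataFrame({"names": ["bob", "tom", "suzy"], "ages": ["05", "4", "3"]})
    frame.to_excel(buf, sheet_name="Sheet1", index=False, engine="openpyxl")
    buf.seek(0)  # rewind so the reader starts at the beginning
    return buf


# Apply `str` to every cell of both columns while reading.
df = pd.read_excel(
    make_workbook(),
    sheet_name="Sheet1",
    header=0,
    converters={"names": str, "ages": str},
    engine="openpyxl",
)

print(df["ages"].tolist())
```

Because each cell is passed through `str` before any type inference happens, the leading zero in '05' should survive the import.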

#2

Starting with v0.20.0, the dtype keyword argument of read_excel() can be used to specify the data types to apply to the columns, just as with read_csv().

Using the converters and dtype arguments together on the same column name causes the latter to be shadowed and the former to take precedence.
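
A small experiment can make the precedence visible. This is only a sketch under assumptions: the `id` column, the `tag` converter and the in-memory workbook are made up for illustration, and openpyxl is assumed to be installed. pandas may also emit a warning when both arguments name the same column, so warnings are silenced here.

```python
import io
import warnings

import pandas as pd

# Hypothetical workbook with a single text column of zero-padded keys.
buf = io.BytesIO()
pd.DataFrame({"id": ["0614", "0042"]}).to_excel(buf, index=False, engine="openpyxl")
buf.seek(0)


def tag(cell) -> str:
    """Illustrative converter: prefix the cell so its effect is easy to spot."""
    return "K" + str(cell)


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide any "both converter and dtype" warning
    df = pd.read_excel(
        buf,
        converters={"id": tag},  # expected to win for the `id` column
        dtype={"id": str},       # expected to be ignored for the `id` column
        engine="openpyxl",
    )

print(df["id"].tolist())
```

If the values come back with the "K" prefix, the converter was applied and the dtype entry for that column was not.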

1) To keep pandas from interpreting the dtypes and instead pass the contents of the columns through as they were in the file, we can set this argument to str or object so that the data is not altered. (One such case is leading zeros in numbers, which would otherwise be lost.)

pd.read_excel('file_name.xlsx', dtype=str)            # (or) dtype=object

2) It also supports a dict mapping, where the keys are column names and the values are the data types to set for them. This is especially useful when you want to change the dtype for only a subset of the columns.

import numpy as np

# Assuming the data types of columns `a` and `b` are to be altered
pd.read_excel('file_name.xlsx', dtype={'a': np.float64, 'b': np.int32})
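
Applied to the leading-zero problem from the question, the dict form might look like the sketch below. The column names `id` and `amount` and the sample values are assumptions made for illustration, and the workbook is built in memory (openpyxl assumed installed) so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical sample data: `id` is a zero-padded key stored as text,
# `amount` is an ordinary numeric column.
source = pd.DataFrame({"id": ["0614", "0042"], "amount": [1.5, 2.25]})

buf = io.BytesIO()
source.to_excel(buf, index=False, engine="openpyxl")
buf.seek(0)

# Only `id` is forced to str; `amount` is left to normal type inference.
df = pd.read_excel(buf, dtype={"id": str}, engine="openpyxl")

print(df["id"].tolist())
print(df["amount"].tolist())
```

Restricting the mapping to the key column keeps the other columns numeric, which is usually what is wanted before matching the key against tables from another source.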

#3

The read_excel() function has a converters argument, which lets you apply functions to the input of certain columns. You can use this to keep them as strings. From the documentation:

Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.

Example code:

pandas.read_excel(my_file, converters = {my_str_column: str})
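
A converter does not have to be `str` itself; any one-argument function works. The sketch below is a hypothetical illustration, not something from the answer: `to_key`, the assumed key width of 4, and the sample workbook (which mixes a text cell with a numeric cell) are all made up, and openpyxl is assumed to be installed.

```python
import io

import pandas as pd

KEY_WIDTH = 4  # assumed width of the zero-padded key


def to_key(cell) -> str:
    """Return the cell content as a zero-padded string key."""
    # A numeric cell may arrive as a float such as 42.0; drop the ".0" first.
    if isinstance(cell, float) and cell.is_integer():
        cell = int(cell)
    return str(cell).strip().zfill(KEY_WIDTH)


# Hypothetical workbook: one key stored as text, one stored as a number.
source = pd.DataFrame({"id": ["0614", 42], "city": ["Oslo", "Lima"]})
buf = io.BytesIO()
source.to_excel(buf, index=False, engine="openpyxl")
buf.seek(0)

df = pd.read_excel(buf, converters={"id": to_key}, engine="openpyxl")

print(df["id"].tolist())
```

This variant can help when some keys were typed into Excel as numbers and have already lost their zeros, provided the key width really is fixed.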

#4

If you do not know the number and names of the columns in the dataframe, this method can be handy:

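import pandas as pd

# First pass: read the sheet once just to discover the column names.
column_list = list(pd.read_excel(file_name, sheet_name='Sheet1').columns)

# Second pass: read the same sheet again, converting every column to str.
converter = {col: str for col in column_list}
df_actual = pd.read_excel(file_name, sheet_name='Sheet1', converters=converter)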

where column_list is the list of your column names.
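
The two-pass idea can be tried end to end with an in-memory workbook. In this sketch the sample columns are invented, openpyxl is assumed to be installed, and `nrows=0` is used so that the first pass reads only the header row (this assumes a pandas version in which read_excel() accepts `nrows`). A single-pass alternative with `dtype=str` is included for comparison.

```python
import io

import pandas as pd

# Hypothetical workbook whose column names are treated as unknown up front.
source = pd.DataFrame({"id": ["0614", "0042"], "zip": ["02134", "90210"]})
buf = io.BytesIO()
source.to_excel(buf, sheet_name="Sheet1", index=False, engine="openpyxl")
payload = buf.getvalue()  # raw bytes, so each read gets a fresh buffer

# Pass 1: read only the header row to learn the column names.
header = pd.read_excel(io.BytesIO(payload), sheet_name="Sheet1", nrows=0, engine="openpyxl")
columns = list(header.columns)

# Pass 2: read the data with a str converter for every column.
by_converters = pd.read_excel(
    io.BytesIO(payload),
    sheet_name="Sheet1",
    converters={col: str for col in columns},
    engine="openpyxl",
)

# Alternative: one pass with dtype=str applied to all columns.
by_dtype = pd.read_excel(io.BytesIO(payload), sheet_name="Sheet1", dtype=str, engine="openpyxl")

print(columns)
print(by_converters["zip"].tolist())
print(by_dtype["zip"].tolist())
```

When `dtype=str` is available, the single-pass form avoids opening the file twice; the converter dict remains useful when only some of the discovered columns should be kept as text.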
