Strange skip when reading a specific column from a .csv with pandas

Time: 2022-02-06 23:07:28

1. Background

The .csv file I uploaded here is an example to help explain my problem.

This file contains the air quality data for all cities in China (each represented by a code) on a specific day.

For example, column 1001A represents one city, and the values in this column are the air pollutant concentrations corresponding to the type column.

2. My problem

If I want to get the AQI value for city 1014A at 20160205-00:00, I just need to use:

 import pandas as pd

 df = pd.read_csv("./this file")
 aqi = df["1014A"].iloc[0]

The result is 42. But when I open the same file in LibreOffice, it shows this:

[Image: the same file opened in LibreOffice]

It seems that pandas goes wrong when it reads column 1013A.

So I wanted to figure out what happens in column 1013A:

[Image: column 1013A]

pandas reads this column (which has finite values in it) as an all-NaN column. This happens many times in this file. It bothers me in the following ways:

  • Some columns that do contain data are treated as NaN columns in the pandas DataFrame.

  • The other columns are also indirectly affected by these erroneous NaN columns.

The column positions will be full of mistakes if this problem isn't solved.

Any advice would be appreciated!

1 Answer

#1


2  

Your csv has two commas in that position:

...19,20,24,19,22,24,29,,42,39...

This gets read as NaN by pandas.

It looks like in your version of LibreOffice it's skipped and uses the subsequent value (incorrectly).


In [11]: s = open("china_sites_20160205.csv").readlines()

In [12]: s[0].split(",")[13:18]
Out[12]: ['1011A', '1012A', '1013A', '1014A', '1015A']

In [13]: s[1].split(",")[13:18]
Out[13]: ['24', '29', '', '42', '39']
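
To make the behaviour above concrete, here is a minimal sketch that reproduces it on a tiny in-memory CSV. The two-line sample is made up for illustration: the column names and values are taken from the output above, while the leading `type` column and the overall layout are assumptions, not the real file. It only shows that an empty field between two commas becomes NaN in its own column and does not shift the neighbouring columns.

```python
import io

import pandas as pd

# Made-up miniature of the file: header names and values come from the
# session above; the 'type' column and layout are assumed for illustration.
raw = (
    "type,1011A,1012A,1013A,1014A,1015A\n"
    "AQI,24,29,,42,39\n"
)

df = pd.read_csv(io.StringIO(raw))

# The empty field between the two commas becomes NaN in column 1013A...
print(df["1013A"].isna().iloc[0])
# ...while the neighbouring column 1014A keeps its own value.
print(df["1014A"].iloc[0])

# List every column that contains no data at all.
empty_cols = df.columns[df.isna().all()].tolist()
print(empty_cols)
```

If columns reported this way are genuinely empty in the raw text, the NaN values reflect missing data in the file rather than a parsing error, and the column positions in the DataFrame still match the header.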
