The pandas read_csv() method interprets 'NA' as nan (not a number) instead of a valid string.
In the simple case below, note that the output in row 1, column 2 (zero-based count) is 'nan' instead of 'NA'.
sample.tsv (tab delimited)
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 1 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118
read_sample.py
import pandas as pd
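# Read the tab-delimited sample with default NA handling
df = pd.read_csv(
    'sample.tsv',
    sep='\t',
    encoding='utf-8',
)
# Print each row as a tuple; the first element is the index
for df_tuples in df.itertuples(index=True):
    print(df_tuples)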
output
(0, u'5d8b', u'N', u'P60490', 1, 146, 1, 146, 1, 146)
(1, u'5d8b', nan, u'P80377', 1, 126, 1, 126, 1, 126)
(2, u'5d8b', u'O', u'P60491', 1, 118, 1, 118, 1, 118)
Additional Information
Rewriting the file with quotes around the data in the 'CHAIN' column and then using the quotechar parameter (quotechar='\'') has the same result, and passing a dictionary of types via the dtype parameter (dtype=dict(valid_cols)) does not change the result.
An old answer to Prevent pandas from automatically inferring type in read_csv suggests first using a numpy record array to parse the file, but given the ability to now specify column dtypes, this shouldn't be necessary.
Note that itertuples() is used to preserve dtypes as described in the iterrows documentation: "To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns tuples of the values and which is generally faster as iterrows."
Example was tested on Python 2 and 3 with pandas version 0.16.2, 0.17.0, and 0.17.1.
Is there a way to capture a valid string 'NA' instead of it being converted to nan?
2 Answers
#1
25
You could use the parameters keep_default_na and na_values to set all NA values by hand (see the docs):
import pandas as pd
from io import StringIO
data = """
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 _ 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118
"""
df = pd.read_csv(StringIO(data), sep=' ', keep_default_na=False, na_values=['_'])
In [130]: df
Out[130]:
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
0 5d8b N P60490 1 146 1 146 1 146
1 5d8b NA P80377 NaN 126 1 126 1 126
2 5d8b O P60491 1 118 1 118 1 118
In [144]: df.CHAIN.apply(type)
Out[144]:
0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
Name: CHAIN, dtype: object
EDIT
All default NA values, from the na_values documentation:
The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan']. Although a 0-length string '' is not included in the default NaN values list, it is still treated as a missing value.
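If you want to keep most of the default markers and only stop 'NA' from being converted, one option is to pass the list quoted above with 'NA' removed. The following is a minimal sketch under that assumption: the inline data is a trimmed, tab-delimited stand-in for sample.tsv with a 'NULL' placed in RES_BEG for illustration, and the names default_na and na_markers are made up for this example. The default list may differ between pandas versions, so check it against the documentation for the version you run.

```python
import pandas as pd
from io import StringIO

# Trimmed, tab-delimited stand-in for sample.tsv (illustrative data only).
# 'NULL' in RES_BEG is a value we still want treated as missing.
data = (
    "PDB\tCHAIN\tSP_PRIMARY\tRES_BEG\tRES_END\n"
    "5d8b\tN\tP60490\t1\t146\n"
    "5d8b\tNA\tP80377\tNULL\t126\n"
    "5d8b\tO\tP60491\t1\t118\n"
)

# Default markers as quoted in the answer above.
default_na = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A', 'N/A',
              'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan']

# Drop 'NA' so it survives as an ordinary string.
na_markers = [v for v in default_na if v != 'NA']

df = pd.read_csv(
    StringIO(data),
    sep='\t',
    keep_default_na=False,  # turn off the built-in list
    na_values=na_markers,   # use the reduced list instead
)

chain_values = df['CHAIN'].tolist()
res_beg_missing = df['RES_BEG'].isna().tolist()
print(chain_values)
print(res_beg_missing)
```

With this setup the chain identifier 'NA' stays a string, while 'NULL' is still read as a missing value.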
#2
10
For me, the solution came from using the parameter na_filter=False:
df = pd.read_csv(file_, header=0, dtype=object, na_filter = False)
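As a self-contained illustration of this approach, the sketch below reads a small inline table instead of file_. The data is a made-up, trimmed version of the sample with one empty SP_PRIMARY field added. It also shows the trade-off: with na_filter=False pandas does no missing-value detection at all, so empty fields come back as empty strings rather than NaN.

```python
import pandas as pd
from io import StringIO

# Trimmed, tab-delimited stand-in for the sample (illustrative data only).
# The last row has an empty SP_PRIMARY field.
data = (
    "PDB\tCHAIN\tSP_PRIMARY\n"
    "5d8b\tN\tP60490\n"
    "5d8b\tNA\tP80377\n"
    "5d8b\tO\t\n"
)

# na_filter=False disables all NA detection; dtype=object keeps values as read.
df = pd.read_csv(StringIO(data), sep='\t', header=0, dtype=object, na_filter=False)

chain_values = df['CHAIN'].tolist()
sp_values = df['SP_PRIMARY'].tolist()
print(chain_values)
print(sp_values)
```

If the file also contains fields that really are missing, the keep_default_na / na_values combination from the first answer may be the better fit.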