Reading data from a text file with missing values

Date: 2021-08-13 09:46:40

I want to read data from a file that has many missing values, as in this example:

1,2,3,4,5
6,,,7,8
,,9,10,11

I am using the numpy.loadtxt function:

data = numpy.loadtxt('test.data', delimiter=',')

The problem is that the missing values break loadtxt (I get a "ValueError: could not convert string to float:", no doubt because of the two or more consecutive delimiters).

Is there a way to do this automatically, with loadtxt or another function, or do I have to bite the bullet and parse each line manually?

2 answers

#1


13  

I'd probably use genfromtxt:

>>> from numpy import genfromtxt
>>> genfromtxt("missing1.dat", delimiter=",")
array([[  1.,   2.,   3.,   4.,   5.],
       [  6.,  nan,  nan,   7.,   8.],
       [ nan,  nan,   9.,  10.,  11.]])

and then handle the NaNs however you like (replace them with another value, use a mask instead, etc.). Some of this can be done inline:

>>> genfromtxt("missing1.dat", delimiter=",", filling_values=99)
array([[  1.,   2.,   3.,   4.,   5.],
       [  6.,  99.,  99.,   7.,   8.],
       [ 99.,  99.,   9.,  10.,  11.]])
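
As a rough sketch of the "handle the NaNs" step, here are three common options. It assumes the question's sample data is held in an in-memory string instead of a file, and the variable names are only illustrative:

```python
import io

import numpy as np

# Same sample data as in the question, kept in memory so the sketch is self-contained.
raw = "1,2,3,4,5\n6,,,7,8\n,,9,10,11\n"

# Empty fields are read as nan.
data = np.genfromtxt(io.StringIO(raw), delimiter=",")

# Option 1: a boolean mask marking the missing cells.
missing = np.isnan(data)

# Option 2: a masked array, so reductions skip the missing cells.
masked = np.ma.masked_invalid(data)
col_means = masked.mean(axis=0)

# Option 3: NaN-aware reductions on the plain array.
row_sums = np.nansum(data, axis=1)

print(missing.sum())
print(col_means)
print(row_sums)
```

Which option fits best depends on what happens next: a mask keeps the original values intact, while filling_values (as above) replaces them at load time.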

#2


0  

Be careful with this: according to my test, cells containing text are not detected, only the numerical values are (the text cells come back as nan). So if you have a table with both strings and numbers, you will need some other approach.

My example:

upeak_names.txt:
id  name    Distance    name2   Distance2   name3   Distance3
upeak-3 NOC2L   -161    KLHL17  -1135   NOC2L   -162

>>> from numpy import genfromtxt
>>> table = genfromtxt('upeak_names.txt', delimiter="\t")
>>> table[2]  # text cells come back as nan
array([   nan,    nan,  -161.,    nan, -1135.,    nan,  -162.])
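
For a table that mixes text and numbers, one approach that appears to work is to let genfromtxt infer a type per column with dtype=None and take the column names from the header with names=True. The sketch below is not part of my original test. It assumes the header and the single row shown above are tab-separated and held in an in-memory string, and it wraps the result in np.atleast_1d because a single data row would otherwise give a 0-d array:

```python
import io

import numpy as np

# Header and row copied from the example above, assumed to be tab-separated.
raw = (
    "id\tname\tDistance\tname2\tDistance2\tname3\tDistance3\n"
    "upeak-3\tNOC2L\t-161\tKLHL17\t-1135\tNOC2L\t-162\n"
)

table = np.genfromtxt(
    io.StringIO(raw),
    delimiter="\t",
    names=True,        # take field names from the first line
    dtype=None,        # infer a separate type for each column
    encoding="utf-8",  # decode text columns as str rather than bytes
)

# With only one data row genfromtxt returns a 0-d structured array,
# so promote it to 1-d to allow normal row indexing.
table = np.atleast_1d(table)

print(table.dtype.names)
print(table["name"][0], table["Distance"][0])
```

The result is a structured array, so columns are accessed by name (for example table["Distance"]) instead of by position.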
