I want to read data from a file that has many missing values, as in this example:
1,2,3,4,5
6,,,7,8
,,9,10,11
I am using the numpy.loadtxt function:
data = numpy.loadtxt('test.data', delimiter=',')
The problem is that the missing values break loadtxt (I get a "ValueError: could not convert string to float:", no doubt because of the two or more consecutive delimiters).
Is there a way to do this automatically, with loadtxt or another function, or do I have to bite the bullet and parse each line manually?
2 answers
#1
13
I'd probably use genfromtxt:
>>> from numpy import genfromtxt
>>> genfromtxt("missing1.dat", delimiter=",")
array([[  1.,   2.,   3.,   4.,   5.],
       [  6.,  nan,  nan,   7.,   8.],
       [ nan,  nan,   9.,  10.,  11.]])
and then do whatever you need with the NaNs (replace them with some value, use a mask instead, etc.). Some of this can be done inline:
>>> genfromtxt("missing1.dat", delimiter=",", filling_values=99)
array([[  1.,   2.,   3.,   4.,   5.],
       [  6.,  99.,  99.,   7.,   8.],
       [ 99.,  99.,   9.,  10.,  11.]])
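For the "use a mask instead" option, here is a minimal sketch of what that could look like. It reads the question's sample data from an in-memory string (an assumption made so the snippet runs without a file on disk), and the variable names are only illustrative.

```python
import io
import numpy as np

# The sample data from the question, held in memory instead of a file.
raw = "1,2,3,4,5\n6,,,7,8\n,,9,10,11\n"

# Empty fields come back as NaN.
data = np.genfromtxt(io.StringIO(raw), delimiter=",")

# Boolean array marking where values were missing.
missing = np.isnan(data)

# Option 1: replace the NaNs after loading (here with 0.0).
filled = np.where(missing, 0.0, data)

# Option 2: wrap the data in a masked array so that reductions
# skip the missing entries instead of propagating NaN.
masked = np.ma.masked_invalid(data)
col_means = masked.mean(axis=0)
```

The masked array keeps the original shape, so column statistics such as `col_means` are computed from the present values only.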
#2
0
Be careful with genfromtxt: in my test it did not read the character (text) cells, only the numeric values, so a table that mixes strings and numbers needs some other approach.
My example:
upeak_names.txt:
id name Distance name2 Distance2 name3 Distance3
upeak-3 NOC2L -161 KLHL17 -1135 NOC2L -162
>>> from numpy import genfromtxt
>>> table = genfromtxt('upeak_names.txt', delimiter="\t")
>>> table[2,]
array([ nan, nan, -161., nan, -1135., nan, -162.])
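One possible way around this, as far as I know, is to let genfromtxt infer a type per column with `dtype=None` and take the field names from the header with `names=True`. The sketch below assumes the file is tab-separated as in the call above, and rebuilds the two lines shown here in an in-memory string so it can run on its own.

```python
import io
import numpy as np

# Header and the one data row shown above, assumed to be tab-separated.
raw = (
    "id\tname\tDistance\tname2\tDistance2\tname3\tDistance3\n"
    "upeak-3\tNOC2L\t-161\tKLHL17\t-1135\tNOC2L\t-162\n"
)

# dtype=None asks genfromtxt to guess each column's type separately,
# names=True uses the first line as field names, and encoding makes
# text columns come back as str rather than bytes.
table = np.genfromtxt(
    io.StringIO(raw),
    delimiter="\t",
    names=True,
    dtype=None,
    encoding="utf-8",
)

# With a single data row the result is 0-dimensional; make it 1-D
# so it can be indexed by row like a larger table.
table = np.atleast_1d(table)

first_name = table["name"][0]          # text column
first_distance = table["Distance"][0]  # integer column
```

The result is a structured array, so columns are accessed by name (`table["name"]`) rather than by position.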