How to use numpy.genfromtxt when the first column is a string and the remaining columns are numbers?

Time: 2022-12-27 09:11:21

Basically, I have a bunch of data where the first column is a string (label) and the remaining columns are numeric values. I run the following:

data = numpy.genfromtxt('data.txt', delimiter = ',')

This reads most of the data well, but the label column just gets 'nan'. How can I deal with this?

5 solutions

#1


48  

By default, np.genfromtxt uses dtype=float: that's why your string columns are converted to NaNs because, after all, they're Not A Number...

You can ask np.genfromtxt to try to guess the actual type of your columns by using dtype=None:

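>>> import numpy as np
>>> from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
>>> test = "a,1,2\nb,3,4"
>>> a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None)
>>> a  # output as originally posted (Python 2); on Python 3 the string field shows as b'a' or as a '<U1' string, depending on the NumPy version
array([('a', 1, 2), ('b', 3, 4)], dtype=[('f0', '|S1'), ('f1', '<i8'), ('f2', '<i8')])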

You can access the columns by using their name, like a['f0']...
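
To make the field access concrete, here is a minimal sketch that reuses the same two-row test string. The encoding="utf-8" argument is my own addition (an assumption, not part of the answer above) so that the label column comes back as ordinary strings on Python 3; the variable names are just for illustration.

```python
from io import StringIO

import numpy as np

test = "a,1,2\nb,3,4"

# encoding="utf-8" is an assumption added here so labels are str, not bytes.
a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None, encoding="utf-8")

# genfromtxt names the fields f0, f1, f2 when no names are supplied.
print(a.dtype.names)

labels = a["f0"]         # the string column
first_numbers = a["f1"]  # the first numeric column
print(labels)
print(first_numbers)

# Indexing with an integer returns one whole row (a record).
print(a[0])
```

Each field behaves like a normal one-dimensional array, so slicing and arithmetic on a["f1"] work as usual.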

Using dtype=None is a good trick if you don't know what your columns should be. If you already know what type they should have, you can give an explicit dtype. For example, in our test, we know that the first column is a string, the second an int, and we want the third to be a float. We would then use

>>> np.genfromtxt(StringIO(test), delimiter=",", dtype=("|S10", int, float))
array([('a', 1, 2.0), ('b', 3, 4.0)], 
      dtype=[('f0', '|S10'), ('f1', '<i8'), ('f2', '<f8')])

Using an explicit dtype is much more efficient than using dtype=None and is the recommended way.

In both cases (dtype=None or explicit, non-homogeneous dtype), you end up with a structured array.
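
If you would rather end up with a label array plus a plain 2-D float array than with a structured array, one possible way to split it is sketched below. The "U10,i8,f8" dtype string and encoding="utf-8" are my choices for a Python 3 session (assumptions, not from the answer), and the variable names are illustrative.

```python
from io import StringIO

import numpy as np

test = "a,1,2\nb,3,4"

# Explicit, non-homogeneous dtype: a 10-character string, an int, a float.
a = np.genfromtxt(StringIO(test), delimiter=",", dtype="U10,i8,f8",
                  encoding="utf-8")

# The first field holds the labels.
labels = a["f0"]

# Stack the remaining fields side by side into an ordinary float matrix.
numeric_names = a.dtype.names[1:]
values = np.column_stack([a[name] for name in numeric_names]).astype(float)

print(labels)
print(values)
print(values.shape)
```

The stacked array has one row per input line and one column per numeric field, so it can go straight into code that expects a regular ndarray.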

[Note: With dtype=None, the input is parsed a second time and the type of each column is upgraded to a more general type whenever a value does not fit: first we try a bool, then an int, then a float, then a complex, and we keep a string if all else fails. The implementation is rather clunky, actually. There have been some attempts to make the type guessing more efficient (using regexps), but nothing has stuck so far.]
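
To illustrate the order of attempts described in the note, here is a toy sketch in plain Python. It is not NumPy's actual implementation; guess_column_type, _to_bool and CANDIDATES are hypothetical names invented for this example, and the rule for what counts as a bool is my own simplification.

```python
def _to_bool(text):
    # Simplified assumption: only "true"/"false" (any case) count as booleans.
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(text)


# Converters are tried from the most specific to the most general.
CANDIDATES = [("bool", _to_bool), ("int", int), ("float", float),
              ("complex", complex)]


def guess_column_type(values):
    """Return the first type name whose converter accepts every value."""
    for name, convert in CANDIDATES:
        try:
            for text in values:
                convert(text)
        except ValueError:
            continue  # at least one value did not fit, try a wider type
        return name
    return "str"  # nothing numeric worked, keep the strings


print(guess_column_type(["1", "2"]))
print(guess_column_type(["1", "2.5"]))
print(guess_column_type(["a", "1"]))
```

Every column is checked against each candidate in turn, which is also why guessing costs more than passing an explicit dtype.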

#2


25  

If your data file is structured like this

col1, col2, col3
   1,    2,    3
  10,   20,   30
 100,  200,  300

then numpy.genfromtxt can interpret the first line as column headers using the names=True option. With this you can access the data very conveniently by providing the column header:

data = np.genfromtxt('data.txt', delimiter=',', names=True)
print(data['col1'])    # array([   1.,   10.,  100.])
print(data['col2'])    # array([   2.,   20.,  200.])
print(data['col3'])    # array([   3.,   30.,  300.])
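
If you want to try the names=True behaviour without creating a file, a self-contained sketch could look like this. The in-memory text stands in for data.txt (an assumption for the sake of a runnable example), and I dropped the padding spaces from the header to keep it simple.

```python
from io import StringIO

import numpy as np

# Stand-in for data.txt; the first line holds the column headers.
text = "col1,col2,col3\n1,2,3\n10,20,30\n100,200,300"

data = np.genfromtxt(StringIO(text), delimiter=",", names=True)

# The header entries become the field names of a structured array.
print(data.dtype.names)
print(data["col2"])

# Fields are plain float arrays, so the usual reductions work.
print(data["col3"].sum())
```

With no dtype given, every field is read as a float, which matches the values shown in the comments above.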

Since in your case the data is formed like this

row1,   1,  10, 100
row2,   2,  20, 200
row3,   3,  30, 300

you can achieve something similar using the following code snippet:

labels = np.genfromtxt('data.txt', delimiter=',', usecols=0, dtype=str)
raw_data = np.genfromtxt('data.txt', delimiter=',')[:,1:]
data = {label: row for label, row in zip(labels, raw_data)}

The first line reads the first column (the labels) into an array of strings. The second line reads all data from the file but discards the first column. The third line uses dictionary comprehension to create a dictionary that can be used very much like the structured array which numpy.genfromtxt creates using the names=True option:

print(data['row1'])    # array([   1.,   10.,  100.])
print(data['row2'])    # array([   2.,   20.,  200.])
print(data['row3'])    # array([   3.,   30.,  300.])
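
Here is a small self-contained variant of the same idea that runs without a file. It differs from the snippet above in one respect, which is my own choice: the second read uses usecols=(1, 2, 3) to skip the label column instead of reading it as NaN and slicing it off. The in-memory text stands in for data.txt.

```python
from io import StringIO

import numpy as np

# Stand-in for data.txt: a label followed by three numbers per row.
text = "row1,   1,  10, 100\nrow2,   2,  20, 200\nrow3,   3,  30, 300"

# First read: only the label column, as strings.
labels = np.genfromtxt(StringIO(text), delimiter=",", usecols=0, dtype=str)

# Second read: only the numeric columns, as floats.
values = np.genfromtxt(StringIO(text), delimiter=",", usecols=(1, 2, 3))

# Map each label to its row of numbers.
data = {str(label): row for label, row in zip(labels, values)}

print(data["row2"])
print(values.shape)
```

Reading only the columns you need avoids creating a throwaway NaN column, at the cost of having to know how many numeric columns there are.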

#3


5  

data=np.genfromtxt(csv_file, delimiter=',', dtype='unicode')

It works fine for me.
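
One thing to keep in mind with this approach is that every column, the numeric ones included, comes back as a string. A possible follow-up step is sketched below; I use dtype=str rather than 'unicode' here and an in-memory string instead of csv_file, both of which are my substitutions for the sake of a runnable example.

```python
from io import StringIO

import numpy as np

# Stand-in for csv_file.
text = "row1,1,10,100\nrow2,2,20,200"

# Everything is read as text, so nothing turns into NaN.
table = np.genfromtxt(StringIO(text), delimiter=",", dtype=str)

# Split off the labels and convert the rest to floats afterwards.
labels = table[:, 0]
values = table[:, 1:].astype(float)

print(labels)
print(values)
```

The conversion with astype(float) raises a ValueError if any remaining cell is not a valid number, which can be a useful sanity check.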

#4


0  

You can use numpy.recfromcsv(filename): the type of each column is determined automatically (as if you called np.genfromtxt() with dtype=None), and the delimiter defaults to ",". It's basically a shortcut for the np.genfromtxt(filename, delimiter=",", dtype=None) call that Pierre GM pointed to in his answer. Note that it also defaults to names=True, so the first line is read as column names, and it returns a record array.
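
In case recfromcsv is not available in your NumPy version, here is a sketch of a roughly equivalent genfromtxt call. The sample text with a header line, the field names, and encoding="utf-8" are all assumptions of mine for illustration; the result is a structured array rather than a record array.

```python
from io import StringIO

import numpy as np

# Hypothetical sample: a header line, then a label and two numbers per row.
text = "label,x,y\na,1,2.5\nb,3,4.5"

# delimiter, dtype=None and names=True spelled out explicitly.
rec = np.genfromtxt(StringIO(text), delimiter=",", dtype=None, names=True,
                    encoding="utf-8")

print(rec.dtype.names)
print(rec["label"])
print(rec["x"])
print(rec["y"])
```

Each column gets its own guessed type here: strings for the labels, integers for x and floats for y.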

#5


0  

For a dataset of this format:

CONFIG000   1080.65 1080.87 1068.76 1083.52 1084.96 1080.31 1081.75 1079.98
CONFIG001   414.6   421.76  418.93  415.53  415.23  416.12  420.54  415.42
CONFIG010   1091.43 1079.2  1086.61 1086.58 1091.14 1080.58 1076.64 1083.67
CONFIG011   391.31  392.96  391.24  392.21  391.94  392.18  391.96  391.66
CONFIG100   1067.08 1062.1  1061.02 1068.24 1066.74 1052.38 1062.31 1064.28
CONFIG101   371.63  378.36  370.36  371.74  370.67  376.24  378.15  371.56
CONFIG110   1060.88 1072.13 1076.01 1069.52 1069.04 1068.72 1064.79 1066.66
CONFIG111   350.08  350.69  352.1   350.19  352.28  353.46  351.83  350.94

This code works for my application:

import numpy

def ShowData(data, names):
    # Print each label followed by the values in its row, one per line.
    for name, row in zip(names, data):
        print(name + ": ")
        for value in row:
            print(value)
        print("")

def Main():
    print("The sample data is: ")
    fname = 'ANOVA.csv'
    # First pass: read everything as strings to get the labels and the shape.
    csv = numpy.genfromtxt(fname, dtype=str, delimiter=",")
    num_rows, num_cols = csv.shape
    names = csv[:, 0]
    # Second pass: read only the numeric columns (all but the first) as floats.
    data = numpy.genfromtxt(fname, usecols=range(1, num_cols), delimiter=",")
    print(names)
    print(str(num_rows) + "x" + str(num_cols))
    print(data)
    ShowData(data, names)

Main()

Python-2 output:

The sample data is:
['CONFIG000' 'CONFIG001' 'CONFIG010' 'CONFIG011' 'CONFIG100' 'CONFIG101'
 'CONFIG110' 'CONFIG111']
8x9
[[ 1080.65  1080.87  1068.76  1083.52  1084.96  1080.31  1081.75  1079.98]
 [  414.6    421.76   418.93   415.53   415.23   416.12   420.54   415.42]
 [ 1091.43  1079.2   1086.61  1086.58  1091.14  1080.58  1076.64  1083.67]
 [  391.31   392.96   391.24   392.21   391.94   392.18   391.96   391.66]
 [ 1067.08  1062.1   1061.02  1068.24  1066.74  1052.38  1062.31  1064.28]
 [  371.63   378.36   370.36   371.74   370.67   376.24   378.15   371.56]
 [ 1060.88  1072.13  1076.01  1069.52  1069.04  1068.72  1064.79  1066.66]
 [  350.08   350.69   352.1    350.19   352.28   353.46   351.83   350.94]]
CONFIG000:
1080.65
1080.87
1068.76
1083.52
1084.96
1080.31
1081.75
1079.98

CONFIG001:
414.6
421.76
418.93
415.53
415.23
416.12
420.54
415.42

CONFIG010:
1091.43
1079.2
1086.61
1086.58
1091.14
1080.58
1076.64
1083.67

CONFIG011:
391.31
392.96
391.24
392.21
391.94
392.18
391.96
391.66

CONFIG100:
1067.08
1062.1
1061.02
1068.24
1066.74
1052.38
1062.31
1064.28

CONFIG101:
371.63
378.36
370.36
371.74
370.67
376.24
378.15
371.56

CONFIG110:
1060.88
1072.13
1076.01
1069.52
1069.04
1068.72
1064.79
1066.66

CONFIG111:
350.08
350.69
352.1
350.19
352.28
353.46
351.83
350.94
