Using numpy.genfromtxt to read a csv file with strings containing commas

I am trying to read in a csv file with numpy.genfromtxt but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':

2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0

the code

np.genfromtxt('t.csv', delimiter=',')

produces the error:

ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)

The data structure I am looking for is:

array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']], 
      dtype='|S13')

Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?

4 Solutions

#1

You can use pandas (which is becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python) for this. Its read_csv can handle this. From the docs:

quotechar : string

The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.

The default value is ". An example:

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: s="""year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""

In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0

The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma-delimiter.

Apart from the powerful csv reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output in numpy you give is all strings, although you could use structured arrays).
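
For comparison, here is a minimal sketch of the structured-array alternative mentioned above, using only the csv module and numpy. The field names (year, city, value) and the 20-character string width are assumptions chosen for illustration; the sample text stands in for 't.csv'.

```python
import csv
import io
import numpy as np

# In-memory stand-in for 't.csv' from the question
text = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# Field names and the 20-character width are illustrative assumptions
dtype = [("year", "i4"), ("city", "U20"), ("value", "f8")]

# skipinitialspace=True lets the reader recognise a quote that follows ", "
reader = csv.reader(io.StringIO(text), skipinitialspace=True)
records = [(int(year), city, float(value)) for year, city, value in reader]

# Each column keeps its own type inside one structured array
table = np.array(records, dtype=dtype)

print(table["city"])
print(table["value"].mean())
```

Each column can then be addressed by name, so the numeric fields stay numeric instead of being stored as strings.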

#2

The problem is the additional comma inside the quoted field: np.genfromtxt does not handle quoting.

One simple solution is to read the file with csv.reader() from python's csv module into a list and then dump it into a numpy array if you like.
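
A minimal sketch of that approach, assuming the data from the question. An in-memory string stands in for 't.csv' here; with a real file you would pass open('t.csv', newline='') to csv.reader instead.

```python
import csv
import io
import numpy as np

# In-memory stand-in for 't.csv' from the question
text = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# skipinitialspace=True is needed because a space precedes each opening quote;
# without it csv.reader does not treat the quotes as field delimiters
rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))

# All fields are strings, so this becomes a 2-D array of strings
arr = np.array(rows)

print(arr.shape)
print(arr[1, 1])
```

The result is a plain string array, which matches the structure asked for in the question.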

If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.

That would go something like this:

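import csv
import numpy as np

# skipinitialspace=True lets csv.reader see the quote that follows ", "
with open('myfile.csv', newline='') as f:
    rows = ("\t".join(row) for row in csv.reader(f, skipinitialspace=True))
    data = np.genfromtxt(rows, delimiter="\t", dtype=str)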

This replaces only the delimiter commas with tabs on the fly; commas inside quoted fields are left alone.

#3

If you are using numpy, you probably want to work with a numpy.ndarray. This will give you one:

import pandas
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
data = pandas.read_csv('file.csv').to_numpy()

Pandas will handle the "Lexington, KY" case correctly (pass skipinitialspace=True if there are spaces after the delimiter, as in the question's data).

#4

Make a better function that combines the power of the standard csv module and Numpy's recfromcsv. For instance, the csv module has good control and customization of dialects, quotes, escape characters, etc., which you can add to the example below.

The example recfromcsv_mod function below reads a complicated CSV file, similar to what Microsoft Excel handles, which may contain commas within quoted fields. Internally, it uses a generator function that rewrites each row with tab delimiters.

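import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        # Python 3: the csv module needs a text-mode file opened with newline=''
        with open(fname, newline='') as fp:
            # skipinitialspace=True handles a space before an opening quote
            reader = csv.reader(fp, skipinitialspace=True)
            for row in reader:
                yield '\t'.join(row)
    # Note: newer NumPy (2.x) drops np.recfromcsv in favor of np.genfromtxt
    return np.recfromcsv(rewrite_csv_as_tab(fname), delimiter='\t', **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod('t.csv', case_sensitive=True)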
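
As a rough sketch of the dialect customization mentioned above, the generator can forward csv format parameters such as delimiter and quotechar. The helper name rows_as_tabs and the semicolon-delimited sample are made up for illustration, and np.genfromtxt is used here in place of recfromcsv.

```python
import csv
import io
import numpy as np

def rows_as_tabs(lines, **fmtparams):
    """Yield each parsed CSV row re-joined with tabs (hypothetical helper)."""
    for row in csv.reader(lines, **fmtparams):
        yield "\t".join(row)

# Semicolon-delimited sample with single-quoted fields, made up for illustration
sample = io.StringIO("2012; 'Louisville KY'; 3.5\n2011; 'Lexington; KY'; 4.0\n")

# Format parameters are passed straight through to csv.reader
data = np.genfromtxt(
    rows_as_tabs(sample, delimiter=";", quotechar="'", skipinitialspace=True),
    delimiter="\t",
    dtype=str,
)

print(data)
```

This assumes no field contains a tab character; if one might, pick another delimiter that cannot occur in the data.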
