I am trying to read a CSV file with numpy.genfromtxt, but some of the fields are strings that contain commas. The strings are quoted, but numpy does not treat the quotes as delimiting a single field. For example, with the data in 't.csv':
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0
the code
np.genfromtxt('t.csv', delimiter=',')
produces the error:
ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)
The data structure I am looking for is:
array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']],
      dtype='|S13')
Looking over the documentation, I don't see any option to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?
4 solutions
#1
19
You can use pandas (which is becoming the default library for working with dataframes (heterogeneous data) in scientific Python) for this. Its read_csv can handle this. From the docs:
quotechar : string
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
The default value is ". An example:
In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: s="""year, city, value
...: 2012, "Louisville KY", 3.5
...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0
The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma delimiter.
Apart from its powerful CSV reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output in numpy you give is all strings, although you could use structured arrays).
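For readers who want to stay in numpy, the structured-array option mentioned above could look roughly like the sketch below. It is an illustration only: the field names, the string width of 20, and the inlined sample data are assumptions made for this example, not part of the original question.

```python
import csv
import io

import numpy as np

# The question's sample rows, inlined so the sketch is self-contained.
sample = io.StringIO('2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n')

# Hypothetical field names and widths, chosen only for this illustration.
record_dtype = np.dtype([("year", "i4"), ("city", "U20"), ("value", "f8")])

# csv.reader handles the quoting; skipinitialspace=True drops the space
# after each comma so the opening quote is seen at the start of the field.
rows = [
    (int(year), city, float(value))
    for year, city, value in csv.reader(sample, skipinitialspace=True)
]

# A list of tuples plus a structured dtype gives one typed record per row.
records = np.array(rows, dtype=record_dtype)
```

Each column is then available by name with its own type, for example records["city"] or records["value"].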
#2
10
The problem is the additional comma: np.genfromtxt does not deal with that.
One simple solution is to read the file with csv.reader() from Python's csv module into a list and then dump it into a numpy array if you like.
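A minimal sketch of that approach, assuming the question's two sample rows (inlined here through io.StringIO instead of a real 't.csv') and assuming skipinitialspace=True is wanted because a space follows each comma:

```python
import csv
import io

import numpy as np

# Stand-in for open('t.csv', newline=''); the contents match the question.
sample = io.StringIO('2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n')

# Without skipinitialspace=True the quote would not be at the start of the
# field, and csv.reader would split "Lexington, KY" at its comma.
rows = list(csv.reader(sample, skipinitialspace=True))

# Every field is still a string, so this is a 2-D array of text.
arr = np.array(rows)
```

This yields the all-string layout the question asks for, with a unicode dtype on Python 3 rather than '|S13'.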
If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.
That would go something like this:
import csv
import numpy as np
np.genfromtxt(("\t".join(i) for i in csv.reader(open('myfile.csv'))), delimiter="\t")
This essentially replaces, on the fly, only the appropriate commas with tabs.
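Applied to the question's data, the same idea might look like the sketch below. The inlined sample, skipinitialspace=True, and dtype=str are assumptions added for illustration; it also assumes no field contains a tab.

```python
import csv
import io

import numpy as np

# Stand-in for the file object; the contents match the question's t.csv.
sample = io.StringIO('2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n')

# csv.reader resolves the quoting, then each parsed row is re-joined with
# tabs so that genfromtxt sees an unambiguous delimiter.
reader = csv.reader(sample, skipinitialspace=True)
tab_lines = ("\t".join(row) for row in reader)

# dtype=str keeps every field as text instead of parsing floats.
arr = np.genfromtxt(tab_lines, delimiter="\t", dtype=str)
```

Dropping dtype=str would make genfromtxt try to parse every column as a float, so the city column would come back as nan.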
#3
3
If you are using numpy, you probably want to work with a numpy.ndarray. This will give you one:
import pandas
data = pandas.read_csv('file.csv', skipinitialspace=True).to_numpy()
Pandas will handle the "Lexington, KY" case correctly, provided skipinitialspace=True is passed when a space follows each comma, as in the question's data.
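As a rough check against the question's file, which has no header row, the call could be adapted as in the sketch below. The inlined sample and the header=None argument are assumptions added here, not part of the answer above.

```python
import io

import pandas as pd

# Stand-in for 't.csv'; note that the question's data has no header line.
sample = io.StringIO('2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n')

# header=None keeps the first row as data; skipinitialspace=True lets the
# quote that follows ", " open a quoted field.
frame = pd.read_csv(sample, header=None, skipinitialspace=True)

# Mixed column types collapse to a single object-dtype ndarray.
data = frame.to_numpy()
```

Because the columns have different types, the resulting array has dtype object; the year stays an integer and the value stays a float.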
#4
1
Make a better function that combines the power of the standard csv module and NumPy's recfromcsv. For instance, the csv module has good control and customization of dialects, quotes, escape characters, etc., which you can add to the example below.
The example recfromcsv_mod function below reads in a complicated CSV file similar to what Microsoft Excel sees, which may contain commas within quoted fields. Internally, the function has a generator function that rewrites each row with tab delimiters.
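import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        # Python 3: open in text mode with newline='' as the csv docs recommend
        with open(fname, newline='') as fp:
            reader = csv.reader(fp)
            for row in reader:
                yield '\t'.join(row)
    # np.recfromcsv may be unavailable in NumPy 2.x; np.genfromtxt with
    # names=True and dtype=None is the usual substitute there
    return np.recfromcsv(rewrite_csv_as_tab(fname), delimiter='\t', **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod('t.csv', case_sensitive=True)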
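The csv-module customization mentioned above can be sketched with a registered dialect. The dialect name "spaced" and the inlined sample are hypothetical, chosen for illustration; the point is that the question's data needs skipinitialspace=True before the quotes are honored.

```python
import csv
import io

# Hypothetical dialect name; it only bundles the options this data needs.
csv.register_dialect("spaced", skipinitialspace=True, quotechar='"')

def rewrite_csv_as_tab(fp, dialect="spaced"):
    """Yield each CSV row of an open text file re-joined with tabs."""
    for row in csv.reader(fp, dialect=dialect):
        yield "\t".join(row)

# Stand-in for an opened 't.csv' with the question's contents.
sample = io.StringIO('2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n')
lines = list(rewrite_csv_as_tab(sample))
```

The tab-joined lines can then be handed to np.genfromtxt with delimiter='\t' in the same way as the generator inside recfromcsv_mod.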