I find myself parsing lots of data files (usually .csv or similar) using the csv reader and a for loop to iterate over every line. The data is usually a table of floats, so for example:
import csv

reader = csv.reader(open('somefile.csv'))
header = next(reader)
res_list = [list() for i in header]
for line in reader:
    for i in range(len(line)):
        res_list[i].append(float(line[i]))
result_dict = dict(zip(header, res_list))  # so we can refer by column title
This is an OK way to populate the data so that I get each column as a separate list. However, I would prefer that the default container for these lists of items (and nested lists) be numpy arrays, since 99 times out of 100 the numbers get pumped into various processing scripts/functions, and having the power of numpy arrays makes my life easier.
numpy's append(arr, item) doesn't append in place and would therefore require re-creating the array for every point in the table (which is slow and unnecessary). I could also iterate over the list of data columns and wrap each one in an array after I'm done (which is what I've been doing), but it isn't always clear-cut when I'm done parsing the file, and I may need to append more items to a column later anyway.
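To make the copy cost concrete, here is a small sketch contrasting the two: np.append allocates and returns a fresh array on every call, while list.append mutates in place with amortized constant cost.

```python
import numpy as np

arr = np.array([1.0, 2.0])
arr2 = np.append(arr, 3.0)  # allocates a new, longer array; arr is unchanged
lst = [1.0, 2.0]
lst.append(3.0)             # mutates the list in place, amortized O(1)
```

Appending n points via np.append therefore costs O(n^2) copying overall, versus O(n) for building a list and converting once at the end.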
I was wondering if there is some less boilerplate-heavy way (to use the overused word, more "pythonic") to process tables of data like this, or to populate arrays dynamically without copying the array every time.
(On another note: it's kind of annoying that people generally organize data by column while csv reads by row. If the reader incorporated a read_column argument (yes, I know it wouldn't be super efficient), I think many people would avoid boilerplate like the above to parse a csv data file.)
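One way to get column-wise access without a read_column argument is to transpose the parsed rows with zip(*rows). A minimal sketch, using hypothetical inline data in place of somefile.csv:

```python
import csv
import io

# Inline stand-in for somefile.csv (hypothetical data).
data = io.StringIO("a,b\n1.0,2.0\n3.0,4.0\n")
reader = csv.reader(data)
header = next(reader)

# zip(*reader) transposes the row iterator into one tuple per column.
columns = dict(zip(header, zip(*reader)))
floats = {name: [float(v) for v in col] for name, col in columns.items()}
```

This reads the whole file into memory at once, so it trades streaming for brevity.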
3 Answers
#1
7
There is numpy.loadtxt:
X = numpy.loadtxt('somefile.csv', delimiter=',')
Edit: for a list of numpy arrays,
X = [numpy.array(line.split(','), dtype=float)
     for line in open('somefile.csv', 'r')]
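If you also want to refer to columns by header name, as in the question, numpy.genfromtxt can parse the header row itself. A sketch, using hypothetical inline data in place of somefile.csv:

```python
import io
import numpy as np

# Inline stand-in for somefile.csv with a header row (hypothetical data).
data = io.StringIO("a,b\n1.0,2.0\n3.0,4.0\n")

# names=True reads the first row as field names, yielding a structured array.
X = np.genfromtxt(data, delimiter=',', names=True)
# X['a'] is the first column as a float array.
```

Each column is then indexable by title, much like the result_dict in the question.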
#2
2
I think it is difficult to improve very much on what you have. Python lists are relatively cheap to build and append to; NumPy arrays are more expensive to create and don't offer a .append() method at all. So your best bet is to build the lists as you already are doing, and then coerce to np.array() when the time comes.
A few small points:
- It is slightly faster to use [] to create a list than to call list(). This is such a tiny fraction of the program's runtime that you can feel free to ignore this point.
- When you don't actually use the loop index, you can use _ as the variable name to document this.
- It's usually better to iterate over a sequence than to find its length, build a range(), and then index into the sequence repeatedly. You can use enumerate() to get an index if you also need the index.
Put those together and I think this is a slightly improved version. But it is almost unchanged from your original, and I can't think of any really good improvements.
import csv

reader = csv.reader(open('somefile.csv'))
header = next(reader)
res_list = [[] for _ in header]
for row in reader:
    for i, val in enumerate(row):
        res_list[i].append(float(val))
# build dict so we can refer by column title
result_dict = dict((n, res_list[i]) for i, n in enumerate(header))
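The final coercion this answer mentions can then be a single pass over the finished dict. A sketch, with hypothetical column data standing in for the parsed result:

```python
import numpy as np

# Columns accumulated as plain lists, as in the loop above (hypothetical data).
result_dict = {"x": [1.0, 2.0], "y": [3.0, 4.0]}

# Coerce each column to an ndarray once parsing is done.
result_dict = {name: np.array(col) for name, col in result_dict.items()}
```

Each conversion is a single O(n) copy per column, done once rather than on every append.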
#3
2
To efficiently load data into a NumPy array, I like NumPy's fromiter function.
Its advantages in this context:
- stream-like loading,
- pre-specifying the data type of the result array, and
- pre-allocating the empty output array, which is then populated from the iterable's stream.

The first of these is inherent (fromiter only accepts data input in iterable form); the last two are managed through the second and third arguments passed to fromiter, dtype and count.
>>> import numpy as NP
>>> # create some data to load:
>>> import random
>>> source_iterable = (random.choice(range(100)) for c in range(20))
>>> target = NP.fromiter(source_iterable, dtype=NP.int8, count=20)
>>> target
array([85, 28, 37, 4, 23, 5, 47, 17, 78, 40, 28, 5, 69, 47, 15, 92,
41, 33, 33, 98], dtype=int8)
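In the context of the question, fromiter can consume a generator over one CSV column without building an intermediate list. A sketch, with hypothetical inline data standing in for somefile.csv:

```python
import csv
import io
import numpy as np

# Inline stand-in for somefile.csv (hypothetical data).
data = io.StringIO("a,b\n1.0,2.0\n3.0,4.0\n")
reader = csv.reader(data)
next(reader)  # skip the header row

# Stream the first column straight into a float array, one value at a time.
col_a = np.fromiter((float(row[0]) for row in reader), dtype=float)
```

Passing count as well (when the row count is known) lets fromiter pre-allocate the output instead of growing it.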
If you don't want to load your data using an iterable, you can still pre-allocate memory for your target array, using the NumPy functions empty and empty_like:
>>> source_vec = NP.random.rand(10)
>>> target = NP.empty_like(source_vec)
>>> target[:] = source_vec
>>> target
array([ 0.5472, 0.5085, 0.0803, 0.4757, 0.4831, 0.3054, 0.1024,
0.9073, 0.6863, 0.3575])
Alternatively, you can create an empty (pre-allocated) array by calling empty and passing in the shape you want. This function, in contrast with empty_like, lets you pass in the data type:
>>> target = NP.empty(shape=source_vec.shape, dtype=float)
>>> target
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> target[:] = source_vec
>>> target
array([ 0.5472, 0.5085, 0.0803, 0.4757, 0.4831, 0.3054, 0.1024,
0.9073, 0.6863, 0.3575])