How do I read a CSV file from a stream and process each row as it is written?

Time: 2021-11-23 09:04:13

I would like to read a CSV file from standard input and process each row as it arrives. My CSV-writing code outputs rows one by one, but my reader waits for the stream to be closed before it starts iterating over the rows. Is this a limitation of the csv module? Am I doing something wrong?

My reader code:

import csv
import sys
import time


reader = csv.reader(sys.stdin)
for row in reader:
    print "Read: (%s) %r" % (time.time(), row)

My writer code:

import csv
import sys
import time


writer = csv.writer(sys.stdout)
for i in range(8):
    writer.writerow(["R%d" % i, "$" * (i+1)])
    sys.stdout.flush()
    time.sleep(0.5)

Output of python test_writer.py | python test_reader.py:

Read: (1309597426.3) ['R0', '$']
Read: (1309597426.3) ['R1', '$$']
Read: (1309597426.3) ['R2', '$$$']
Read: (1309597426.3) ['R3', '$$$$']
Read: (1309597426.3) ['R4', '$$$$$']
Read: (1309597426.3) ['R5', '$$$$$$']
Read: (1309597426.3) ['R6', '$$$$$$$']
Read: (1309597426.3) ['R7', '$$$$$$$$']

As you can see, all the print statements are executed at the same time, but I expected a 500 ms gap between them.

3 solutions

#1


31  

As it says in the documentation,

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.

And you can see by looking at the implementation of the csv module (line 784) that csv.reader calls the next() method of the underlying iterator (via PyIter_Next).
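
To illustrate that csv.reader itself only pulls lines from its underlying iterator on demand, here is a minimal sketch written for Python 3. The helper counting_lines and the counter dictionary are hypothetical names introduced for this example only; they are not part of the csv module.

```python
import csv


def counting_lines(lines, counter):
    """Yield lines one by one, recording how many have been pulled so far."""
    for line in lines:
        counter["pulled"] += 1
        yield line


counter = {"pulled": 0}
source = counting_lines(["R0,$\r\n", "R1,$$\r\n", "R2,$$$\r\n"], counter)
reader = csv.reader(source)

# Asking for one row makes the reader call next() on the source
# only as often as needed to complete that row.
first_row = next(reader)
pulled_after_first = counter["pulled"]

# The remaining rows are pulled only when we iterate further.
remaining_rows = list(reader)
```

If the reader is lazy in this way, any delay seen in the question has to come from the iterator it is given (the file object), not from the csv module.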

So if you really want unbuffered reading of CSV files, you need to convert the file object (here sys.stdin) into an iterator whose next() method actually calls readline() instead. This can easily be done using the two-argument form of the iter function. So change the code in test_reader.py to something like this:

for row in csv.reader(iter(sys.stdin.readline, '')):
    print("Read: ({}) {!r}".format(time.time(), row))

For example,

$ python test_writer.py | python test_reader.py
Read: (1388776652.964925) ['R0', '$']
Read: (1388776653.466134) ['R1', '$$']
Read: (1388776653.967327) ['R2', '$$$']
Read: (1388776654.468532) ['R3', '$$$$']
[etc]
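
The same idea can be tried without a pipe. The sketch below, written for Python 3, wraps the two-argument iter() call in a small function and feeds it an in-memory stream. The names iter_rows and fake_stdin are assumptions made for this illustration; in the real script the stream would be sys.stdin.

```python
import csv
import io


def iter_rows(stream):
    """Return an iterator of CSV rows that reads the stream via readline()."""
    # iter(callable, sentinel) keeps calling stream.readline() until it
    # returns '' (the sentinel), which is what readline() gives at end of input.
    return csv.reader(iter(stream.readline, ""))


# Stand-in for sys.stdin so the example runs on its own.
fake_stdin = io.StringIO("R0,$\r\nR1,$$\r\n")
rows = list(iter_rows(fake_stdin))
```

Because each row is produced from a single readline() call, a row becomes available as soon as the writer has flushed that line.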

Can you explain why you need unbuffered reading of CSV files? There might be a better solution to whatever it is you are trying to do.

#2


1  

Maybe it's a limitation. Read this http://docs.python.org/using/cmdline.html#cmdoption-unittest-discover-u

Note that there is internal buffering in file.readlines() and File Objects (for line in sys.stdin) which is not influenced by this option. To work around this, you will want to use file.readline() inside a while 1: loop.

I modified test_reader.py as follows:

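import csv, sys, time

while True:
    line = sys.stdin.readline()
    if not line:  # readline() returns '' at end of input; stop instead of looping forever
        break
    print("Read: (%s) %r" % (time.time(), line))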

Output

python test_writer.py | python  test_reader.py
Read: (1309600865.84) 'R0,$\r\n'
Read: (1309600865.84) 'R1,$$\r\n'
Read: (1309600866.34) 'R2,$$$\r\n'
Read: (1309600866.84) 'R3,$$$$\r\n'
Read: (1309600867.34) 'R4,$$$$$\r\n'
Read: (1309600867.84) 'R5,$$$$$$\r\n'
Read: (1309600868.34) 'R6,$$$$$$$\r\n'
Read: (1309600868.84) 'R7,$$$$$$$$\r\n'
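
The loop above prints raw lines rather than parsed rows. One possible way to combine it with CSV parsing is sketched below for Python 3. It assumes every record fits on a single line (no newlines inside quoted fields); the function name rows_from_readline is hypothetical.

```python
import csv
import io


def rows_from_readline(stream):
    """Yield one parsed CSV row per line, as soon as readline() returns it."""
    while True:
        line = stream.readline()
        if not line:  # '' means end of input
            break
        # Assumption: one record per line, so each line can be parsed alone.
        yield next(csv.reader([line]))


# In-memory stand-in for sys.stdin so the sketch is self-contained.
parsed = list(rows_from_readline(io.StringIO('R0,$\r\nR1,"a,b"\r\n')))
```

If quoted fields may span several lines, passing iter(stream.readline, '') to a single csv.reader is the safer choice, since the reader can then request further lines to finish a record.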

#3


0  

You are flushing stdout, but not stdin.

sys.stdin also has a flush() method; try calling it after each line is read if you really want to disable the buffering.
