I need to split a very large text file

Date: 2023-01-06 21:36:05

I have a large text file (bigger than my RAM) and I need to process it line by line. If I read it in fixed-size chunks, say 4096 bytes at a time, I'm worried that a chunk boundary will fall in the middle of a line. How do I proceed?

4 answers

#1


3  

Here's what you can do:

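SIZE = 1024

with open('file.txt') as f:
    old = ''
    while True:
        chunk = f.read(SIZE)
        data = old + chunk
        if not data:
            break
        # (1)
        lines = data.splitlines()
        # Hold back an unfinished last line until the next read completes it.
        # At end of file (empty chunk) whatever is left is the final line.
        if chunk and not data.endswith('\n'):
            old = lines.pop()
        else:
            old = ''

        for line in lines:
            pass  # process stuff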
  (1) If you call data.splitlines(True), the newline characters are kept in the resulting list.
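
To make note (1) concrete, here is a small sketch of the two splitlines variants. The sample string and the variable names are made up for illustration.

```python
# A chunk whose last line ("third") has not been terminated yet.
text = "first\nsecond\nthird"

plain = text.splitlines()      # newline characters are dropped
kept = text.splitlines(True)   # newline characters are kept

print(plain)  # ['first', 'second', 'third']
print(kept)   # ['first\n', 'second\n', 'third']
```

With the newlines kept, an unfinished last line is the only element that does not end in '\n', which is another way to spot it.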

#2


2  

Read the file using a generator:

def read_file(file_path):
    with open(file_path, 'r') as lines:
        for line in lines:
            yield line

That way only the current line (plus Python's small internal read buffer) is held in memory at a time, and the file is still read in order.
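
A rough sketch of how the generator might be consumed. read_file is the function from this answer; longest_line_length and head are hypothetical helpers added only for illustration.

```python
from itertools import islice


def read_file(file_path):
    # Same generator as in the answer: yields one line at a time.
    with open(file_path, 'r') as lines:
        for line in lines:
            yield line


def longest_line_length(file_path):
    """Hypothetical consumer: one pass over the file, keeping only a running maximum."""
    return max((len(line.rstrip('\n')) for line in read_file(file_path)), default=0)


def head(file_path, n):
    """Hypothetical helper: return the first n lines without reading the rest of the file."""
    return [line.rstrip('\n') for line in islice(read_file(file_path), n)]
```

Since a file object is already an iterator over its lines, a plain "for line in f" inside the with block does the same thing; the generator mostly helps when you want to pass the lines on to other functions.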

#3


1  

This comes up a lot in audio coding, where files can be huge. The usual way, as I understand it, is to keep a memory buffer and work in two stages: read a block of arbitrary size (4096 bytes or whatever) into the buffer, then stream characters out of the buffer, reacting to the line endings. Because the buffer is in RAM, going through it character by character is fast. I'm not sure which data structure or call would be best for this in Python, though; I've only actually done it in C, where the buffer is just a block of RAM. The same approach should work.
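
For what it's worth, here is one way the two-stage buffer idea could look in Python. This is only a sketch, not the answerer's code: iter_lines and block_size are made-up names, the file is read as bytes, and lines are assumed to end in b"\n". Instead of walking the buffer character by character, it lets bytes.split find the line endings.

```python
def iter_lines(path, block_size=4096):
    """Yield each line of the file as bytes, without its trailing newline."""
    pending = b""  # start of a line whose ending has not been read yet
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)  # stage 1: fill the buffer
            if not block:
                break
            pending += block
            # Stage 2: everything before the last newline is complete lines;
            # the remainder is carried over to the next block.
            *complete, pending = pending.split(b"\n")
            for line in complete:
                yield line
    if pending:
        yield pending  # final line that had no trailing newline
```

Reading bytes keeps the block size exact; call line.decode() on each result if text is needed, which assumes the encoding never uses the newline byte inside a multi-byte character (true for UTF-8).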

#4


1  

On Linux, put this into a Python script, for example process.py:

import sys

for line in sys.stdin:
    line = line.rstrip('\n')  # drop the newline so the slices never include it
    # do something with the line, for example:
    output = line[:5] + line[10:15]
    sys.stdout.write("{}\n".format(output))

To run the script, use:

cat input_data | python process.py > output
