Reading and parsing a TSV file, then manipulating it for saving as CSV (*efficiently*)

Time: 2021-02-16 16:38:12

My source data is in a TSV file with 6 columns and more than 2 million rows.

Here's what I'm trying to accomplish:

  1. I need to read the data in 3 of the columns (3, 4, 5) of this source file.
  2. The fifth column is an integer. I need to use this integer value to duplicate a row entry, built from the data in the third and fourth columns, that many times.
  3. I want to write the output of #2 to an output file in CSV format.
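
The per-row step in the list above can be sketched in isolation as follows. This is only an illustration under assumptions: `expand_row` is a hypothetical helper name, the row is assumed to already be split into a list of six strings, and the example row is made up.

```python
def expand_row(row):
    """Return the [column 3, column 4] pair repeated int(column 5) times."""
    count = int(row[4])  # fifth column is at zero-based index 4
    # range() of zero or a negative number is empty, so such rows yield nothing
    return [row[2:4] for _ in range(count)]


# Hypothetical six-column row, already split on tabs
row = ["a", "b", "c", "d", "2", "f"]
print(expand_row(row))  # [['c', 'd'], ['c', 'd']]
```

Each returned pair corresponds to one line of the CSV output.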

Below is what I came up with.

My question: is this an efficient way to do it? It seems like it might be resource-intensive when run on 2 million rows.

First, I made a sample tab-separated file to work with and called it 'sample.txt'. It's basic and only has four rows:

Row1_Column1    Row1-Column2    Row1-Column3    Row1-Column4    2   Row1-Column6
Row2_Column1    Row2-Column2    Row2-Column3    Row2-Column4    3   Row2-Column6
Row3_Column1    Row3-Column2    Row3-Column3    Row3-Column4    1   Row3-Column6
Row4_Column1    Row4-Column2    Row4-Column3    Row4-Column4    2   Row4-Column6
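
For anyone reproducing this, the sample could be generated with a short Python 3 sketch like the one below. The listing above is displayed with runs of spaces, but the file itself is tab-separated, so the sketch writes real tab characters. `write_sample` and `COUNTS` are hypothetical names introduced here for illustration.

```python
import csv

# Fifth-column values taken from the four sample rows shown above
COUNTS = [2, 3, 1, 2]


def write_sample(path="sample.txt"):
    """Write the four-row, six-column sample as a tab-separated file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for i, count in enumerate(COUNTS, start=1):
            writer.writerow([
                f"Row{i}_Column1", f"Row{i}-Column2", f"Row{i}-Column3",
                f"Row{i}-Column4", count, f"Row{i}-Column6",
            ])
```

Calling `write_sample()` creates 'sample.txt' in the current directory.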

Then I have this code:

import csv 

with open('sample.txt','r') as tsv:
    AoA = [line.strip().split('\t') for line in tsv]

for a in AoA:
    count = int(a[4])
    while count > 0:
        with open('sample_new.csv','ab') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            csvwriter.writerow([a[2], a[3]])
        count = count - 1

1 Answer

#1


105  

You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.

import csv

with open('sample.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows([row[2:4] for _ in xrange(count)])

Or, using itertools.repeat() from the itertools module to do the repeating:

from itertools import repeat
import csv

with open('sample.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows(repeat(row[2:4], count))
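
The snippets above are written for Python 2 (`xrange`, binary file modes). A rough Python 3 sketch of the same streaming approach might look like the following; it assumes text-mode files opened with `newline=''`, as the Python 3 `csv` documentation recommends, and wraps the loop in a hypothetical function `expand_tsv_to_csv` that also returns the number of rows written.

```python
import csv
from itertools import repeat


def expand_tsv_to_csv(src_path, dst_path):
    """Stream a TSV file row by row, writing columns 3 and 4 to a CSV file
    as many times as the integer in column 5 says. Returns rows written."""
    written = 0
    # Both files stay open for the whole run; only one input row is held in memory
    with open(src_path, newline="") as tsvin, \
            open(dst_path, "w", newline="") as csvout:
        reader = csv.reader(tsvin, delimiter="\t")
        writer = csv.writer(csvout)
        for row in reader:
            count = int(row[4])
            if count > 0:
                writer.writerows(repeat(row[2:4], count))
                written += count
    return written
```

For the four-row sample, `expand_tsv_to_csv('sample.txt', 'new.csv')` would write 2 + 3 + 1 + 2 = 8 rows.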
