I'm pretty new to Python, and I have written a (probably very ugly) script that is supposed to randomly select a subset of sequences from a fastq-file. A fastq-file stores information in blocks of four rows each. The first row in each block starts with the character "@". The fastq file I use as my input file is 36 GB, containing about 14,000,000 lines.
I tried to rewrite an already existing script that used way too much memory, and I managed to reduce the memory usage a lot. But the script takes forever to run, and I don't see why.
import argparse
import random
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("infile", type=str, help="The name of the fastq input file.", default=sys.stdin)
parser.add_argument("outputfile", type=str, help="Name of the output file.")
parser.add_argument("-n", help="Number of sequences to sample", default=1)
args = parser.parse_args()

def sample():
    linesamples = []
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0, int(seqs)), int(args.n))
    # make a list of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.append(int(4*i+0))
        linesamples.append(int(4*i+1))
        linesamples.append(int(4*i+2))
        linesamples.append(int(4*i+3))
    # fetch lines from input file and write them to output file.
    for i, line in enumerate(infile):
        if i in linesamples:
            outputfile.write(line)
The grep step takes practically no time at all, but after more than 500 minutes the script still hasn't started writing to the output file. So I suppose it is one of the steps between grep and the final for-loop that takes so long. But I don't understand which step exactly, or what I can do to speed it up.
4 Answers
#1 (2 votes)
Depending on the size of linesamples, the if i in linesamples check will take a long time since you are searching through a list on each iteration through infile. You could convert this into a set to improve the lookup time. Also, enumerate is not very efficient - I have replaced it with a line_num counter which we increment in each iteration.
def sample():
    linesamples = set()
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0, int(seqs)), int(args.n))
    # make a set of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.add(int(4*i+0))
        linesamples.add(int(4*i+1))
        linesamples.add(int(4*i+2))
        linesamples.add(int(4*i+3))
    # fetch lines from input file and write them to output file.
    line_num = 0
    for line in infile:
        if line_num in linesamples:
            outputfile.write(line)
        line_num += 1
    outputfile.close()
#2 (1 vote)
You said that grep finishes running quite quickly, so in that case, instead of just using grep to count the occurrences of @, have grep output the byte offset of each @ character it sees (using the -b option for grep). Then use random.sample to pick whichever blocks you want. Once you've chosen the byte offsets you want, use infile.seek to go to each byte offset and print out four lines from there.
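Here is a minimal sketch of that approach. The function name and exact grep invocation are illustrative choices rather than anything fixed by the answer, and it assumes only record header lines start with "@"; in real fastq data a quality line can also begin with "@", so a robust version (and the original grep -c count) would need a stricter check.

import random
import subprocess

def sample_by_offset(infile_name, outputfile_name, n):
    # -b makes grep prefix each matching line with its byte offset, e.g. "0:@read1".
    # "^@" restricts the match to lines that start with "@" (assumed to be record headers).
    out = subprocess.check_output(["grep", "-b", "^@", infile_name]).decode()
    offsets = [int(line.split(":", 1)[0]) for line in out.splitlines() if line]
    # randomly pick n record start positions:
    chosen = random.sample(offsets, int(n))
    # binary mode so that seek() to an arbitrary byte offset is well defined
    with open(infile_name, "rb") as infile, open(outputfile_name, "wb") as outputfile:
        for offset in sorted(chosen):
            infile.seek(offset)          # jump straight to the start of the record
            for _ in range(4):           # a fastq record is four lines
                outputfile.write(infile.readline())

This never scans the whole 36 GB file in Python; grep does the scanning and the script only reads the handful of records it actually keeps.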
#3 (0 votes)
Try to parallelize your code. What I mean is this: you have 14,000,000 lines of input.
- Run your grep first to filter your lines, and write the result to filteredInput.txt
- Split filteredInput.txt into files of 10,000-100,000 lines each, e.g. filteredInput001.txt, filteredInput002.txt
- Run your code on these split files, writing the output to separate files such as output001.txt, output002.txt
- Merge your results as the final step.
Since your code is not working at all right now, you could also just run it on these filtered inputs. Your code can check for the existence of the filteredInput files, work out which step it was in, and resume from that step.
You can also run multiple Python processes this way (after step 1) from your shell, or use Python threads.
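A rough sketch of steps 3 and 4, assuming the chunk files already exist and each holds a whole number of four-line records. It uses multiprocessing rather than threads (CPython threads would not speed up this kind of CPU-bound scanning), and the file-name pattern and the sample_chunk helper are illustrative only; they do not reproduce the original sampling scheme exactly.

import glob
import multiprocessing
import random

def sample_chunk(chunk_name):
    # grab one random fastq record (4 lines) from a single chunk file
    with open(chunk_name) as chunk:
        lines = chunk.readlines()
    start = 4 * random.randrange(len(lines) // 4)
    out_name = chunk_name.replace("filteredInput", "output")
    with open(out_name, "w") as out:
        out.writelines(lines[start:start + 4])
    return out_name

if __name__ == "__main__":
    pool = multiprocessing.Pool()                     # one worker per CPU core
    chunks = sorted(glob.glob("filteredInput*.txt"))
    outputs = pool.map(sample_chunk, chunks)          # process chunks in parallel
    pool.close()
    pool.join()
    # merge the per-chunk outputs (step 4)
    with open("output_merged.txt", "w") as merged:
        for name in outputs:
            with open(name) as part:
                merged.write(part.read())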
#4 (0 votes)
You can use the reservoir sampling algorithm. With this algorithm you read through the data only once (no need to count the lines of the file in advance), so you can pipe data through your script. There is example Python code on the Wikipedia page.
There's also a C implementation for fastq sampling in Heng Li's seqtk.
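A minimal reservoir-sampling sketch (Algorithm R) over whole fastq records, reading from standard input, could look like the following; it assumes well-formed input with exactly four lines per record, and the argument handling is illustrative only.

import random
import sys

def reservoir_sample_fastq(stream, k):
    # keep a uniform random sample of k fastq records from a stream (Algorithm R)
    reservoir = []
    n = 0
    while True:
        record = [stream.readline() for _ in range(4)]   # one fastq record = four lines
        if not record[0]:                                # end of input
            break
        n += 1
        if len(reservoir) < k:
            reservoir.append(record)
        else:
            j = random.randrange(n)                      # 0 <= j < n
            if j < k:
                reservoir[j] = record                    # keep this record with probability k/n
    return reservoir

if __name__ == "__main__":
    k = int(sys.argv[1])
    for record in reservoir_sample_fastq(sys.stdin, k):
        sys.stdout.writelines(record)

Saving the sketch as, say, reservoir_sample.py, it can be fed data through a pipe along the lines of zcat reads.fastq.gz | python reservoir_sample.py 1000 > subset.fastq. The seqtk equivalent is roughly seqtk sample -s100 reads.fastq 1000 > subset.fastq (check seqtk's own help for the exact syntax).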