Hi all, I am looking for an efficient way to organize and filter certain types of text files.
Let's say I have 10,000,000 text files that are concatenated into larger chunks formatted like this:
@text_file_header
ID0001
some text
...
@text_file_header
ID0002
some text
...
@text_file_header
ID0003
some text
...
Now, I perform certain operations on those files so that I end up with 200 x 10,000,000 text files (in chunks); each text file now has "siblings":
@text_file_header
ID0001_1
some text
...
@text_file_header
ID0001_2
some text
...
@text_file_header
ID0001_3
some text
...
@text_file_header
ID0002_1
some text
...
@text_file_header
ID0002_2
some text
...
@text_file_header
ID0002_3
some text
However, for certain tasks I only need certain text files, and my main question is how I can extract them based on an "ID" in the text files (e.g., everything matching ID0001_*, ID0005_*, ID0006_*, and so on).
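One record-aware sketch of that extraction (only a sketch: it assumes the chunks keep the @text_file_header / ID layout shown above, and that the wanted base IDs are listed one per line in a hypothetical file ids.txt) would be an awk pass over the chunk files, since plain grep returns only the matching ID lines rather than the whole records:

gawk 'NR == FNR { want[$1]; next }                 # ids.txt: remember the wanted base IDs
      NF        { split($2, a, "_");               # $2 is the ID line, e.g. ID0001_2
                  if (a[1] in want) print RS $0 }  # re-attach the header and emit the record
     ' ids.txt RS="@text_file_header" FS="\n" all_chunk_*.txt > selection.txt

The two assignments between the file arguments switch the record and field separators only for the chunk files, so ids.txt is still read line by line.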
SQLite would be one option, and I already have an SQLite database with ID and file columns. However, the problem is that, due to time constraints, I need to run the computation that generates those 200 * 10,000,000 text files on a cluster, and the file I/O for SQLite would be too limiting right now.
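Just to make the SQLite route concrete, the lookup against such a table would be something like the following (the database, table, and column names here are guesses based on the description, not the real schema; GLOB is used so the underscore is matched literally):

sqlite3 files.db "SELECT file FROM files WHERE id GLOB 'ID0001_*' OR id GLOB 'ID0005_*';"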
My idea now was to split those files into 10,000,000 individual files, like so:
gawk -v RS="@text_file_header" 'NF{ f = "file" ++n ".txt"; print RS $0 > f; close(f) }' all_chunk_01.txt
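A small variation on that one-liner (again just a sketch, assuming the ID is always the first non-empty line of each record) names every output file after its ID instead of a running counter, which makes later per-ID selection much easier:

# write one file per record, named after its ID, e.g. ID0001_2.txt
gawk -v RS="@text_file_header" -v FS="\n" 'NF{ f = $2 ".txt"; print RS $0 > f; close(f) }' all_chunk_01.txt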
and after I have generated those 200 "siblings", I would do a cat in that folder based on the file IDs I am currently interested in. Let's say I need the corpus of 10,000 out of the 10,000,000 text files; I would cat them together into a single document that I need for the further processing steps. Now, my concern is whether it is a good idea at all to store 10,000,000 individual files in a single folder on a disk and perform the cat, or whether it would be better to grep the files out, based on an ID, from let's say 100 multi-record text files?
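For the first variant, a minimal sketch (assuming the split files are named after their IDs, as in the gawk variation above, and a hypothetical ids.txt listing the wanted base IDs one per line):

# concatenate all siblings of every wanted ID into one working document
while read -r id; do
    cat "${id}"_*.txt
done < ids.txt > subset.txt

The second variant would essentially be the awk record extraction sketched earlier, run over roughly 100 big chunk files instead of millions of small ones. Whether a single directory with 10,000,000 entries stays practical depends on the file system; directory listings and globs tend to get slow at that size, and a common workaround is to shard the files into subdirectories by an ID prefix, which leaves the loop above almost unchanged.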
1 Answer
#1
For example:
grep TextToFind FileWhereToFind
returns what you want.
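Applied to the IDs from the question, that would be something along these lines (the chunk file names are assumed):

grep "ID0001_" all_chunk_01.txt      # lines matching the ID
grep -l "ID0005_" all_chunk_*.txt    # which chunk files contain the ID at all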