I have a large text file (> 4 GB) in fixed-width format. I want to extract a subset of that file based on the content of specific columns. What would be the fastest way to do this?
For example, the file will have the following format:
Column width 1 = 3
Column width 2 = 3
Column width 3 = 2
Column width 4 = 2
Column width 5 = 1
Column width 6 = 2
Column width 7 = 2
Column width 8 = 2
Column width 9 = 2
And a line of the file might look like:
150-9912 17 7 1 0 0
If I wanted to filter on the value of column 2 (e.g. lines where column 2 == -99), what would be the most efficient way to do this? I have multiple files of about 4 GB each, with close to 10 million lines per file. I appreciate the help!
1 Solution
#1
Using GNU awk:
awk 'BEGIN{FIELDWIDTHS="3 3 2 2 1 2 2 2 2"} $2==-99' file
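FIELDWIDTHS is a gawk extension that splits each line at fixed character widths instead of on whitespace, so $2 is the three-character second column, and lines where it equals -99 are printed unchanged. That should get you well on the way.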
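If gawk is not available, a portable variant is to test the characters of column 2 directly with substr(), which any POSIX awk provides. The following is only a minimal sketch, not a benchmarked solution: sample.txt and its three rows are made up for the demo, and it assumes the value is stored exactly as the three characters -99 and that the data is plain ASCII (hence LC_ALL=C, which may also help speed).

```shell
# Build a tiny sample in the question's layout (widths 3 3 2 2 1 2 2 2 2).
# sample.txt is a hypothetical file name used only for this demo.
printf '%s\n' \
  '150-9912 17 7 1 0 0' \
  '151 4212 17 7 1 0 0' \
  '152-9913 27 8 1 0 0' > sample.txt

# Column 2 starts at character 4 (right after the 3-wide column 1) and is 3 wide.
# This is a string comparison, so the field must be exactly "-99".
# LC_ALL=C makes awk treat the input as plain bytes (assumes ASCII data).
result=$(LC_ALL=C awk 'substr($0, 4, 3) == "-99"' sample.txt)

# Print the matching lines unchanged.
printf '%s\n' "$result"
```

For several input files, the same filter can be given all of them at once (awk reads them in order), with the output redirected to a new file.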