How can I use command-line tools to quickly extract lines from a fixed-width text file based on the content of specific columns?

Date: 2021-12-25 19:37:55

I have a large text file (> 4 GB) in fixed-width format. I want to extract a subset of that file based on the content of specific columns. What would be the fastest way to do this?

For example, the file will have the following format:

Column width 1 = 3
Column width 2 = 3
Column width 3 = 2
Column width 4 = 2
Column width 5 = 1
Column width 6 = 2
Column width 7 = 2
Column width 8 = 2
Column width 9 = 2

And a line of the file might look like:

150-9912 17 7 1 0 0

If I wanted to search based on the values of column 2 (e.g. where value of column 2 == -99), what would be the most efficient way to do this? I have multiple files ~ 4GB in size with close to 10 million lines in each file. Appreciate the help!

1 Answer

#1


Using GNU awk:

awk 'BEGIN{FIELDWIDTHS="3 3 2 2 1 2 2 2 2"} $2==-99'

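FIELDWIDTHS is a gawk extension that splits each line into fields of the given character widths instead of splitting on whitespace, so $2 is characters 4 through 6. The pattern $2==-99 has no action, so awk prints every line whose second field equals -99. Pass the file name as the last argument, or pipe the data in on standard input.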
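If gawk is not available, the same filter can be sketched with POSIX awk's substr, since column 2 always occupies characters 4 through 6 (the 3 characters after the 3-character column 1). This is only an illustration: the sample records and the temporary file are made up to match the layout in the question, and the LC_ALL=C setting is an assumption that the data is plain ASCII; it may speed things up on large files but is not required.

```shell
# Hypothetical sample data: three 19-character records in the layout from the question.
sample=$(mktemp)
printf '%s\n' \
  '150-9912 17 7 1 0 0' \
  '151 1012 17 7 1 0 0' \
  '152-9911 27 3 0 1 0' > "$sample"

# Column 2 starts at character 4 and is 3 characters wide.
# substr() returns a string, so this is an exact string comparison with "-99".
# LC_ALL=C (an assumption: ASCII input) makes awk treat the input as bytes.
matches=$(LC_ALL=C awk 'substr($0, 4, 3) == "-99"' "$sample")

# Print the records that matched.
printf '%s\n' "$matches"

rm -f "$sample"
```

awk accepts several file names in one invocation, so multiple files can be filtered in a single pass by listing them all after the program.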
