Editing a text file with awk and creating a new file

Time: 2021-06-21 16:04:39

I have a text file like this small example:

ENSG00000001036 ENST00000002165 6   143832827   143832772
ENSG00000001461 ENST00000003912 1   24766730;24746130;24768628;24742394;24759703    24766662;24745781;24768545;24742293;24759594
ENSG00000004139 ENST00000003834 17      
ENSG00000001460 ENST00000003583 1   24740215;24727946   24740164;24727857

I want to edit the file and write the result to a new file. The first line is already in the desired form, and every other line should end up looking like it. Line 3 has no fields 4 and 5, so I want to remove such lines completely. In lines like 2 and 4 of the example, fields 4 and 5 each hold several values separated by ;. I want to split each such line into several lines, one per ;-separated value. For instance, line 2 will become 5 lines and line 4 will become 2 lines. The new lines keep the same 1st, 2nd and 3rd columns and differ only in columns 4 and 5. Here are the 2 lines that result from line 4:

ENSG00000001460 ENST00000003583 1   24740215    24740164
ENSG00000001460 ENST00000003583 1   24727946    24727857

As the 2 lines above show, the first ;-separated value of field 4 and of field 5 become fields 4 and 5 of the first new line, and the second values become fields 4 and 5 of the second new line. So the result for the small example would look like this:

ENSG00000001036 ENST00000002165 6   143832827   143832772
ENSG00000001461 ENST00000003912 1   24766730    24766662
ENSG00000001461 ENST00000003912 1   24746130    24745781
ENSG00000001461 ENST00000003912 1   24768628    24768545
ENSG00000001461 ENST00000003912 1   24742394    24742293
ENSG00000001461 ENST00000003912 1   24759703    24759594
ENSG00000001460 ENST00000003583 1   24740215    24740164
ENSG00000001460 ENST00000003583 1   24727946    24727857

I wrote a small awk command:

awk -F";" '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5}' coord.txt > new.txt

But I do not know how to apply the two conditions I mentioned (splitting lines and deleting incomplete lines). Do you know how to do that?

1 solution

#1



You can use this awk command, which uses split on the fourth and fifth fields with a semicolon as the separator:

awk 'NF==5{n=split($4, a, /;/); split($5, b, /;/);
for(i=1; i<=n; i++) print $1, $2, $3, a[i], b[i]}' file

ENSG00000001036 ENST00000002165 6 143832827 143832772
ENSG00000001461 ENST00000003912 1 24766730 24766662
ENSG00000001461 ENST00000003912 1 24746130 24745781
ENSG00000001461 ENST00000003912 1 24768628 24768545
ENSG00000001461 ENST00000003912 1 24742394 24742293
ENSG00000001461 ENST00000003912 1 24759703 24759594
ENSG00000001460 ENST00000003583 1 24740215 24740164
ENSG00000001460 ENST00000003583 1 24727946 24727857
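
A possible variation on the command above, offered as a sketch rather than a definitive version: the asker's own attempt prints tabs and redirects to a new file, so this sets OFS to a tab and writes to new.txt. It also skips rows where fields 4 and 5 hold different numbers of ;-separated values; that check is an assumption on my part and is not something the question asks for. The file names coord.txt and new.txt are taken from the question's attempt.

```shell
# Recreate the sample input from the question (whitespace-separated columns).
cat > coord.txt <<'EOF'
ENSG00000001036 ENST00000002165 6   143832827   143832772
ENSG00000001461 ENST00000003912 1   24766730;24746130;24768628;24742394;24759703    24766662;24745781;24768545;24742293;24759594
ENSG00000004139 ENST00000003834 17
ENSG00000001460 ENST00000003583 1   24740215;24727946   24740164;24727857
EOF

# OFS='\t' makes print join its arguments with tabs.
# NF == 5 drops rows that lack fields 4 and 5.
# The n != m check (an assumption, not from the question) skips rows whose
# fourth and fifth fields contain different numbers of values.
awk -v OFS='\t' 'NF == 5 {
    n = split($4, a, ";")
    m = split($5, b, ";")
    if (n != m) next
    for (i = 1; i <= n; i++) print $1, $2, $3, a[i], b[i]
}' coord.txt > new.txt
```

With the sample input, new.txt should contain the same rows as the output shown above, with tabs between the columns instead of single spaces.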
