I have a text file like this small example:
ENSG00000001036 ENST00000002165 6 143832827 143832772
ENSG00000001461 ENST00000003912 1 24766730;24746130;24768628;24742394;24759703 24766662;24745781;24768545;24742293;24759594
ENSG00000004139 ENST00000003834 17
ENSG00000001460 ENST00000003583 1 24740215;24727946 24740164;24727857
I want to edit the file and write the result to a new file. The first line is already fine, and every other line should end up looking like it. The 3rd line has no fields 4 and 5, so I want to remove such lines completely. Other lines, like lines 2 and 4 in the example, have fields 4 and 5 that are ;-separated. I want to split each of these lines into several lines, depending on the number of ;-separated parts. For instance, the 2nd line will be converted into 5 lines and the 4th line into 2 lines. The new lines keep the same 1st, 2nd and 3rd columns and differ only in columns 4 and 5. Here are the 2 lines that result from the 4th line:
ENSG00000001460 ENST00000003583 1 24740215 24740164
ENSG00000001460 ENST00000003583 1 24727946 24727857
As the 2 lines above show, the 1st ;-separated values of fields 4 and 5 become fields 4 and 5 of the 1st new line, and the 2nd values become fields 4 and 5 of the 2nd new line. So the result for the small example would look like this:
ENSG00000001036 ENST00000002165 6 143832827 143832772
ENSG00000001461 ENST00000003912 1 24766730 24766662
ENSG00000001461 ENST00000003912 1 24746130 24745781
ENSG00000001461 ENST00000003912 1 24768628 24768545
ENSG00000001461 ENST00000003912 1 24742394 24742293
ENSG00000001461 ENST00000003912 1 24759703 24759594
ENSG00000001460 ENST00000003583 1 24740215 24740164
ENSG00000001460 ENST00000003583 1 24727946 24727857
I wrote a small piece of code using awk:
awk -F";" '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5}' coord.txt > new.txt
But I do not know how to apply the 2 conditions I mentioned (splitting lines and deleting incomplete lines). Do you know how to do that?
1 solution
#1
You can use this awk command, which calls split on the fourth and fifth fields with a semicolon as the separator:
awk 'NF==5{n=split($4, a, /;/); split($5, b, /;/);
for(i=1; i<=n; i++) print $1, $2, $3, a[i], b[i]}' file
ENSG00000001036 ENST00000002165 6 143832827 143832772
ENSG00000001461 ENST00000003912 1 24766730 24766662
ENSG00000001461 ENST00000003912 1 24746130 24745781
ENSG00000001461 ENST00000003912 1 24768628 24768545
ENSG00000001461 ENST00000003912 1 24742394 24742293
ENSG00000001461 ENST00000003912 1 24759703 24759594
ENSG00000001460 ENST00000003583 1 24740215 24740164
ENSG00000001460 ENST00000003583 1 24727946 24727857
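If you want tab-separated output, as in your own attempt, and some protection against lines where fields 4 and 5 contain a different number of ;-separated values, a variant along these lines might work. This is only a sketch: the function name split_coords is made up for illustration, and skipping mismatched lines with a warning on stderr is an assumption about what you would want in that case.

```shell
# split_coords: hypothetical helper name, used here only for illustration.
# Reads the files given as arguments, or stdin when none are given.
split_coords() {
  awk -v OFS='\t' '
    NF == 5 {                      # lines without fields 4 and 5 are dropped
      n = split($4, a, ";")        # values of field 4
      m = split($5, b, ";")        # values of field 5
      if (n != m) {                # assumption: skip lines whose counts differ
        print "line " NR ": field 4 has " n " values, field 5 has " m > "/dev/stderr"
        next
      }
      for (i = 1; i <= n; i++)     # one output line per pair of values
        print $1, $2, $3, a[i], b[i]
    }' "$@"
}
```

With the file name from the question it would be called as split_coords coord.txt > new.txt.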