Unix: using a loop, awk and split to split FASTA

Time: 2021-11-15 21:43:55

I have a long list of data organised as below (INPUT). I want to split the data up so that I get an output as below (desired OUTPUT).

The code below first finds every line containing ">gi" and saves the line numbers of those lines in a list called B. Then, in a new file, it should replace each of those lines with a shortened version of the text that follows ">gi".

I figured the easiest way would be to split at "|", but this does not work: no separation happens if I replace " " with "|" in my code.

My code is below. It splits nicely at " " if I replace the "|" with " " in the INPUT. However, I run into trouble when I want the text between the [ ] brackets, which is NOT always there and is not always just 2 words:

B=$( grep -n ">gi" 1VAO_1DII_5fxe_all_hits_combined.txt | cut -d : -f 1)

 awk <1VAO_1DII_5fxe_all_hits_combined.txt >seqIDs_1VAO_1DII_5fxe_all_hits_combined.txt -v lines="$B" '
BEGIN {split(lines, a, " "); for (i in a) change[a[i]]=1}
NR in change {$0 = ">" $4}
1
'

Let me know if more explanation is needed!

INPUT:

 >gi|9955361|pdb|1E0Y|A:1-560 Chain A, Structure Of The D170sT457E DOUBLE MUTANT OF VANILLYL- Alcohol Oxidase
 MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVA

 >gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]
 MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVAPRNV

desired OUTPUT:

 >1E0Y
 MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVAPRNV

 >GAD99964.1 Byssochlamys spectabilis No. 5
 MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA

1 answer

This can be done in one step with GNU awk (the three-argument form of match() used here is a gawk extension):

awk -F'|' '/^>gi/{a=1;match($NF,/\[([^]]*)]/, b);print ">"$4" "b[1];next}a{print}!$0{a=0}' input > output

In a more readable way:

/^>gi/ {  # when the line starts with ">gi"
    a=1;  # set flag "a" to 1
    # capture the text between brackets in the last field, if any, into b[1]
    match($NF,"\\[([^]]*)]", b);
    print ">"$4" "b[1]; # print the shortened header
    next # skip to the next record
}

a { print } # while flag "a" is set, print the line

!$0 { a=0 } # when the line is empty, set "a" to 0 to stop printing
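
If gawk is not available, the same idea can be sketched with POSIX awk only, using RSTART and RLENGTH instead of the array argument to match(). This is a minimal sketch, not the code above: the function name shorten_fasta_headers is hypothetical, it assumes every record begins with a ">gi" header (so the "a" flag is dropped and all other lines pass through unchanged), and it omits the trailing space when a header has no bracketed part.

```shell
# Hypothetical helper: reads FASTA on stdin, writes shortened headers to stdout.
# Assumption: every header line starts with ">gi" and has the ID in field 4.
shorten_fasta_headers() {
    awk -F'|' '
    /^>gi/ {
        id = $4        # 4th "|"-separated field holds the ID
        desc = ""
        # POSIX match() sets RSTART and RLENGTH; strip the two brackets
        if (match($NF, /\[[^]]*\]/))
            desc = substr($NF, RSTART + 1, RLENGTH - 2)
        print (desc == "" ? ">" id : ">" id " " desc)
        next
    }
    { print }          # sequence and blank lines pass through unchanged
    '
}
```

With the same placeholder file names as above, it would be called as: shorten_fasta_headers < input > output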
