How to extract substrings from one file using position info from another file (loop, bash)

Time: 2022-09-17 23:41:26

I'm trying to write a script that loops over one file and extracts substrings from it, taking the cut positions from a second file. I'm working in bash on MobaXterm. The file cut_positions.txt is tab-delimited and lists name, start point, end point, length, and comment:

k141_20066  103484  104617  1133  phnW
k141_20841  13200   14324   1124  phnW
k141_23852  69      452     383   phnW
k141_32328  1       180     179   phnW

The other file, string_file.txt, holds the name (removing or adding the ">" in one of the files would be no problem) and the string. The original strings are much longer, up to 1,000,000 characters:

>k141_10671 CCTTCCCCCACACGCCGCTCTTCCGCTCTTGCTGGCC  
>k141_10707 AGGCGGTATCAGACCTTGCCGCAACACTAAGCCCAGTAACGCTGTCGCCCTTATATCTGA  
>k141_11190 CTTTTGTGACAGTGCAGGGCAATGGTGGATTTATCAGTATCGGGCAGAA  
>k141_1479  AGCCGACAGCAGCGCCGAGGGCACATAATCCGATGACACGATGTCCAAAAGATCCGCCTCGGC

Now I want to use the input from cut_positions.txt: the first column to match the right line, the second column as the start point of the substring, and the fourth column as its length. This should be done for every line in cut_positions.txt, with the results written to a new file, out.txt. As a first step I tried this (with my original data):

➤ grep ">k141_28027\b" test_out_one_line.txt | awk '{print substr($2,57251,69)}'
TCACTTGAGCGCAATTATTCGCTCTCCGGCGGCGTCAGCATCAGCCTGATCATGCGTCACCAAAAGTGT

This worked well as a manual approach. I also figured out how to access individual elements of cut_positions.txt (here the second column of the first row):

awk -F '\t' 'NR==1{print $2}' cut_positions.txt

But I can't figure out how to turn this into a loop, because I don't know how to connect the redirections, pipes, and other pieces I used for the small steps. Any help is much appreciated (and tell me if you need more sample data).

thanks crazysantaclaus


1 Answer

#1


2  

The following script should work for you:


cut.awk


# We read two files: pos.txt and strings.txt.
# NR equals FNR only while the first file (pos.txt)
# is being read.
NR==FNR{
    pos[">"$1]=$2 # Store the start point in array pos (indexed by ">" plus $1)
    len[">"$1]=$4 # Store the length in array len (same index)
    next # Skip the block below for pos.txt
}

# This runs on every line of strings.txt whose name is in pos
$1 in pos {
    # Extract a substring of $2 based on the position and length
    # stored above
    key=$1
    mod=substr($2,pos[key],len[key])
    $2=mod
    print # Print the modified line
}

Call it like this:


awk -f cut.awk pos.txt strings.txt
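With the file names from the question, the same call can redirect its result into out.txt. The following is only a sketch on made-up miniature inputs: the scratch directory and the sample contents are assumptions for illustration, and the two-line cut.awk is a condensed copy of the script above.

```shell
# Work in a scratch directory so no existing files are touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tiny stand-ins for the real inputs (assumed sample data):
# tab-delimited positions and ">name sequence" strings
printf 'foo\t2\t5\t3\tphnW\nbar\t4\t5\t1\tphnW\n' > cut_positions.txt
printf '>foo 123456\n>bar 123456\n>non 123456\n' > string_file.txt

# Condensed copy of cut.awk, written via a quoted here-document
cat > cut.awk <<'EOF'
NR==FNR { pos[">" $1] = $2; len[">" $1] = $4; next }
$1 in pos { $2 = substr($2, pos[$1], len[$1]); print }
EOF

# Positions file first, strings file second; result goes to out.txt
awk -f cut.awk cut_positions.txt string_file.txt > out.txt
cat out.txt
```

The order of the two file arguments matters, because the NR==FNR block has to see the positions before the strings are processed.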

One important point: substr() treats strings as starting at index 1, unlike most programming languages, where strings start at index 0. If the positions in pos.txt are 0-based, the substr() call must become:

mod=substr($2,pos[key]+1,len[key])
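The effect of the shift can be checked on a short string. In this sketch the variable names s, start0, one_based, and zero_based are made up for illustration; both calls select the same three characters.

```shell
s=123456

# 1-based start: position 2 is the character "2"
one_based=$(awk -v s="$s" 'BEGIN { print substr(s, 2, 3) }')

# 0-based start 1 points at the same character, so add 1 before calling substr()
zero_based=$(awk -v s="$s" -v start0=1 'BEGIN { print substr(s, start0 + 1, 3) }')

echo "$one_based $zero_based"
```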

I recommend testing it with simplified, meaningful versions of the input files:

pos.txt


foo  2  5  3    phnW  
bar  4  5  1    phnW
test 1  5  4    phnW

and strings.txt


>foo 123456  
>bar 123456
>non 123456

Output:


>foo 234
>bar 4
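For comparison, the explicit loop the question asks about could be sketched roughly as below. This is not the approach recommended above: it starts one awk process per line of pos.txt and rescans strings.txt each time, so it is likely to be much slower on large inputs. The scratch directory, the output name loop_out.txt, and the shell variable names are assumptions for illustration.

```shell
# Scratch directory with the simplified test files from above
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'foo\t2\t5\t3\tphnW\nbar\t4\t5\t1\tphnW\ntest\t1\t5\t4\tphnW\n' > pos.txt
printf '>foo 123456\n>bar 123456\n>non 123456\n' > strings.txt

tab=$(printf '\t')   # literal tab, used as the field separator for read

# Columns of pos.txt: name, start, end, length, comment
while IFS=$tab read -r name start end len comment; do
    # One awk call per position line; prints only the matching string line
    awk -v key=">$name" -v start="$start" -v len="$len" \
        '$1 == key { print $1, substr($2, start, len) }' strings.txt
done < pos.txt > loop_out.txt

cat loop_out.txt
```

On these test files the loop writes the same two lines as cut.awk. In general its output follows the order of pos.txt rather than the order of strings.txt.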
