How to extract sequencing data using Linux commands

Date: 2021-11-13 00:38:48

I would like to extract certain header lines together with the sequence data that follows each of them.

There is an ecoli.ffn file as follows:

$head ecoli.ffn
>ecoli16:g027092:GCF_000460315:gi|545267691|ref|NZ_KE701669.1|:551259-572036
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT
>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC

and an index.txt as follows:

$head index.txt
g000011
g000012

What I want to do is extract from ecoli.ffn the records whose IDs are listed in index.txt. The ideal output is:

>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC

How can I do this?


3 Solutions

#1


1  

Write a simple script, ecoli.sh, using awk:

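#!/bin/bash
# For each ID in index.txt, print the matching record:
# the header line plus every sequence line that follows it.
while read -r id
do
    # On a header line, set flag to whether field 2 equals the ID;
    # every line is printed while flag is set.
    awk -F: -v id="$id" '/^>/{flag = ($2 == id)} flag' ecoli.ffn
done < index.txt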

Then run this script in your shell.
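
The loop above rescans ecoli.ffn once per ID, which can get slow for a long ID list. A possible single-pass variant of the same flag idea is sketched below. The function name extract_records is hypothetical, and it assumes one ID per line and that the ID is always the second colon-separated field of the header, as in the sample data.

```shell
#!/bin/bash
# Sketch only: extract_records is a hypothetical helper name.
# Usage: extract_records ID_FILE FASTA_FILE
extract_records() {
    awk -F: '
        # First file (ID list): remember every non-empty line as an ID.
        NR == FNR { if ($0 != "") ids[$0]; next }
        # Header line of the FASTA file: decide whether this record is wanted.
        /^>/ { keep = ($2 in ids) }
        # Print headers and sequence lines while inside a wanted record.
        keep
    ' "$1" "$2"
}
```

Called as extract_records index.txt ecoli.ffn, it prints the matching records in the order they appear in ecoli.ffn, not in the order of index.txt.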

#2


1  

awk to the rescue!


$ awk -F: -v RS=">" 'NR==FNR{n=split($0,t,"\n");
                             for(i=1;i<n;i++) a[t[i]]; 
                             next} 
                     $2 in a{printf "%s", RS $0}' index.txt ecoli.ffn

>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGATCTGACAGCTGTTCTTACACTGGATTCAACC
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGATCTGACAGCTGTTCTTACACTGGATTCAACC

UPDATE: Note that this doesn't depend on how many lines there are for each record. For the updated input file, the same script gives this output:

$ awk -F: -v RS=">" 'NR==FNR{n=split($0,t,"\n");
                             for(i=1;i<n;i++) a[t[i]];
                             next}
                     $2 in a{printf "%s", RS $0}' index.txt ecoli.ffn

>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC
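
To see why the number of lines per record doesn't matter, it may help to look at what RS=">" does on its own: each record is everything between two ">" characters, so the header and all of its sequence lines arrive as one record, and with -F: the ID is still field 2. A small sketch that just lists the IDs found in a file (list_ids is a hypothetical helper name; it assumes ">" appears only at the start of header lines):

```shell
#!/bin/bash
# Sketch only: list_ids is a hypothetical helper name.
# Usage: list_ids FASTA_FILE
list_ids() {
    # RS=">" makes each FASTA entry a single record, regardless of line count.
    # NF > 1 skips the empty record that precedes the first ">".
    awk -F: -v RS=">" 'NF > 1 { print $2 }' "$1"
}
```

Running list_ids ecoli.ffn is also a quick way to check that the entries in index.txt are spelled exactly as they appear in the headers.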

#3


0  

This script filters a FASTA file by IDs given as a list or in a file, which seems to be what you are asking for here:

https://github.com/jorvis/biocode/blob/master/fasta/filter_fasta_by_ids.pl
