Searching for IDs across a pool of txt files

Time: 2022-12-24 16:05:42

I previously asked a one-to-one query search question; this one is a one-to-many file search. I have one query.txt file containing a thousand IDs.

query.txt
GABDI004191
GABDI007217
GABDI004196
GABDI008080
.....

I also have a group of files (file1.table, file2.table, up to file120.table) that contain the search hits of individual IDs against other sequences.

file1.table
GABDI004191 c23504_g1_i1    29.38   160 100 2   1   160 90  530 
GABDI004191 c20415_g1_i1    45.21   73  39  1   180 252 27  242 
GABDI004191 c17483_g1_i1    88.78   98  11  0   20  117 1   294 
GABDI008080 c1407_g1_i1 95.56   45  2   0   112 156 200 66  9e-25   
GABDI004196 c2892_g1_i1 35.44   79  50  1   37  115 237 4   7e-08

file2.table
GABDI007217 TR9707|c0_g1_i1 32.47   77  49  2   1   77  309 88  
GABDI004196 TR9163|c0_g1_i1 63.77   69  25  0   315 383 207 1   
GABDI007217 TR1165|c0_g1_i1 91.56   154 12  1   1   153 464 3   
GABDI004191 TR4933|c0_g1_i1 91.56   154 12  1   1   153 35  496 
GABDI008080 TR16029|c0_g1_i1    32.20   118 77  2   37  152 242 

For each ID, I need to extract every line in which it appears across all the .table files and store those lines in a separate file named after the ID. For example:

For ID GABDI008080, the output file GABDI008080.txt will contain the following:

GABDI008080 c1407_g1_i1 95.56   45  2   0   112 156 200 66  9e-25
GABDI008080 TR16029|c0_g1_i1    32.20   118 77  2   37  152 242 

And for ID GABDI004191, the output file GABDI004191.txt will contain the following:

GABDI004191 c23504_g1_i1    29.38   160 100 2   1   160 90  530 
GABDI004191 c20415_g1_i1    45.21   73  39  1   180 252 27  242 
GABDI004191 c17483_g1_i1    88.78   98  11  0   20  117 1   294
GABDI004191 TR4933|c0_g1_i1 91.56   154 12  1   1   153 35  496 

I have just started learning Python and Bash scripting. I tried the following Python code, but I got stuck.

#!/bin/python
import glob
with open('query.txt' , 'r') as query_file: #reading in IDs from query     file
   for id in query_file:
     for file in glob.glob("*.table"): 
        with open(file, 'r') as one_file: #opening individual files for  reading
           for line in one_file:
              if id in line: #trying to find IDs from each line in those files
                 idname=open(id +'.txt', 'w') #opening a file with the ID name where all found results for that ID is stored
                 idname.append(line)
                 idnam.close()

I would appreciate any help, whether using Awk, any shell script, or Python. Thanks.

3 answers

#1


2  

Using Bash you can do something like this:

while IFS= read -r i; do
  for f in file*.table; do
    grep "^$i " "$f" >> "${i}.txt"
  done
done < query.txt

Or even better, since you don't need to know which file each line came from (-h stops grep from prefixing every match with its file name when it is given several files):

while IFS= read -r i; do
  grep -h "^$i " file*.table >> "${i}.txt"
done < query.txt

#2


0  

I think this should work for you:

EDIT: corrected the code, as it was not working. It is now fully functional.

Explanation: first, I load all the codes from query.txt into an internal array; then I go through each file and print every line whose first field is in that array to a file named after the code.

$ ./awk.sh
GABDI004191
GABDI007217
GABDI004196
GABDI008080
$ cat query.txt
GABDI004191
GABDI007217
GABDI004196
GABDI008080
$ cat file1.table
GABDI004191 c23504_g1_i1    29.38   160 100 2   1   160 90  530
GABDI004191 c20415_g1_i1    45.21   73  39  1   180 252 27  242
GABDI004191 c17483_g1_i1    88.78   98  11  0   20  117 1   294
GABDI008080 c1407_g1_i1 95.56   45  2   0   112 156 200 66  9e-25
GABDI004196 c2892_g1_i1 35.44   79  50  1   37  115 237 4   7e-08
$ cat file2.table
GABDI007217 TR9707|c0_g1_i1 32.47   77  49  2   1   77  309 88
GABDI004196 TR9163|c0_g1_i1 63.77   69  25  0   315 383 207 1
GABDI007217 TR1165|c0_g1_i1 91.56   154 12  1   1   153 464 3
GABDI004191 TR4933|c0_g1_i1 91.56   154 12  1   1   153 35  496
GABDI008080 TR16029|c0_g1_i1    32.20   118 77  2   37  152 242
$ cat awk.sh
 awk  'BEGIN{
              while ((getline line < "query.txt" ) > 0)
              {codeList[line]=line
                print codeList[line]
                }
              close("query.txt" )
         }
        $1 in codeList { print $0 > ($1 ".txt") }
'  file*.table
$ ./awk.sh
GABDI004191
GABDI007217
GABDI004196
GABDI008080
$ ls *txt
GABDI004191.txt  GABDI004196.txt  GABDI008080.txt  query.txt
$ cat GABDI004191.txt
GABDI004191 c23504_g1_i1    29.38   160 100 2   1   160 90  530
GABDI004191 c20415_g1_i1    45.21   73  39  1   180 252 27  242
GABDI004191 c17483_g1_i1    88.78   98  11  0   20  117 1   294
$
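Since the question was originally attempted in Python, the same idea (load the IDs into a lookup structure, then make a single pass over the table files) might be sketched as below. This is only an illustration under assumptions: the function name split_hits_by_id and its parameters are made up for this example, and the ID is assumed to be the first whitespace-separated field of each line. It also sidesteps three problems in the attempt from the question: the ID read from query.txt keeps its trailing newline, opening the output with 'w' truncates it on every match, and file objects have no append method.

```python
from collections import defaultdict
from pathlib import Path


def split_hits_by_id(query_path, table_paths, out_dir="."):
    """Write every line whose first field is a queried ID to <out_dir>/<ID>.txt.

    Hypothetical helper for illustration; not part of any library.
    """
    # Strip the trailing newline from each ID and skip blank lines.
    with open(query_path) as query_file:
        ids = {line.strip() for line in query_file if line.strip()}

    # Collect matching lines in memory, keyed by ID.
    hits = defaultdict(list)
    for table_path in table_paths:
        with open(table_path) as table_file:
            for line in table_file:
                fields = line.split(None, 1)  # split on any run of whitespace
                if fields and fields[0] in ids:
                    # Make sure the last line of a file also ends with a newline.
                    hits[fields[0]].append(line if line.endswith("\n") else line + "\n")

    # One write per ID, so each output file is opened exactly once.
    for seq_id, lines in hits.items():
        Path(out_dir, f"{seq_id}.txt").write_text("".join(lines))
    return hits
```

From the directory holding the data, a call such as split_hits_by_id("query.txt", sorted(glob.glob("*.table"))) (after import glob) would be one way to use it. Buffering the hits in memory keeps the sketch short; for very large tables a streaming variant may be preferable.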

Hope it helps.

#3


0  

awk 'NR==FNR{ids[$0];next} $1 in ids{print > ($1".txt")}' query.txt *.table

If you get an error message about too many files being open concurrently, use GNU awk, which handles that for you internally. If that proves impossible, change > to >> and add close($1".txt") after the print, because once a file has been closed, reopening it with > would truncate it.
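For readers working in Python, as the question does, the same concern about open-file limits can be handled by keeping only one output file open at a time. The sketch below is a rough illustration, not a definitive implementation: the name stream_hits_by_id and its parameters are hypothetical, and the ID is assumed to be the first whitespace-separated field. Each output file is created fresh the first time its ID is seen in a run and appended to afterwards, which plays the role of closing the file after every print.

```python
from pathlib import Path


def stream_hits_by_id(query_path, table_paths, out_dir="."):
    """Append each matching line to <out_dir>/<ID>.txt, one open file at a time.

    Hypothetical helper for illustration; not part of any library.
    """
    with open(query_path) as query_file:
        ids = {line.strip() for line in query_file if line.strip()}

    started = set()  # IDs whose output file was already created in this run
    for table_path in table_paths:
        with open(table_path) as table_file:
            for line in table_file:
                fields = line.split(None, 1)  # split on any run of whitespace
                if not fields or fields[0] not in ids:
                    continue
                seq_id = fields[0]
                # Truncate on the first hit of this run, append on later hits.
                mode = "a" if seq_id in started else "w"
                with open(Path(out_dir, f"{seq_id}.txt"), mode) as out_file:
                    out_file.write(line if line.endswith("\n") else line + "\n")
                started.add(seq_id)
    return started
```

Reopening the output for every hit is slower than holding the handles open, so this trade-off probably only pays off when the number of distinct IDs approaches the per-process file limit.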

A shell is an environment from which to manipulate (e.g. create/move/destroy) files and processes and sequence calls to tools. The standard UNIX tool to manipulate text is awk so any time you need to manipulate text in UNIX you should write an awk script and just call it from your shell. Read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
