Split a text file by column value while keeping the header line (genome data)

I have a large genome data file (.txt) in the format below. I would like to split it by the chromosome column (chr1, chr2, ..., chrX, chrY, and so on), keeping the header line in every output file. How can I do this with Unix/Linux commands?

genome data

 variantId  chromosome   begin  end
    1            1          33223  34343
    2            2          44543  46444
    3            2          55566  59999 
    4            3          33445  55666

result

file.chr1.txt
variantId  chromosome   begin  end
1            1          33223  34343


file.chr2.txt
variantId  chromosome   begin  end
2            2          44543  46444
3            2          55566  59999 

file.chr3.txt
variantId  chromosome   begin  end
4            3          33445  55666

1 solution

#1

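Is this data for the human genome (i.e. the chromosome column is always one of 1-22, X or Y)? If so, how about this:

# Start one output file per chromosome label with the header line
for chr in $(seq 1 22) X Y
do
    head -n1 data.txt >chr$chr.txt
done
# Append every data row to the file named after its chromosome column
awk 'NR != 1 { print $0 >>("chr"$2".txt") }' data.txt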

(This is a second edit, based on @Sasha's comment above.)

Note that the parentheses around ("chr"$2".txt") are apparently not needed with GNU awk, but my OS X version of awk requires them.
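
If you would rather not list the chromosome labels up front, here is a rough single-pass sketch (not the approach above, just an alternative). It assumes the input is called data.txt, that the chromosome value is in column 2, and that you want the file.chrN.txt names from the question. The heredoc only recreates the sample rows so the snippet runs on its own. The output name is built in a variable (out, a name I made up), which sidesteps the question of parentheses after the redirection.

```shell
# Recreate the sample input from the question so the sketch is runnable as-is.
cat > data.txt <<'EOF'
variantId  chromosome   begin  end
1            1          33223  34343
2            2          44543  46444
3            2          55566  59999
4            3          33445  55666
EOF

# Single pass: remember the header, then start each output file with it
# the first time its chromosome value is seen.
awk '
NR == 1 { header = $0; next }      # keep the header line, do not route it
{
    out = "file.chr" $2 ".txt"     # output name taken from column 2 (assumed)
    if (!(out in seen)) {          # first row for this chromosome
        print header > out         # ">" truncates on first open only
        seen[out] = 1
    }
    print > out                    # later prints go to the already open file
}' data.txt
```

This keeps one file open per distinct chromosome value, which should be fine for a couple of dozen labels. With a very large number of distinct values some awk versions may hit an open-file limit, so treat it as a starting point.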
