I have a large genome data file (.txt) in the format below. I would like to split it by the chromosome column (chr1, chr2, ..., chrX, chrY, and so on),
keeping the header line in every split file. How can I do this with a Unix/Linux command?
genome data
variantId chromosome begin end
1 1 33223 34343
2 2 44543 46444
3 2 55566 59999
4 3 33445 55666
result
file.chr1.txt
variantId chromosome begin end
1 1 33223 34343
file.chr2.txt
variantId chromosome begin end
2 2 44543 46444
3 2 55566 59999
file.chr3.txt
variantId chromosome begin end
4 3 33445 55666
1 solution
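Is this data for the human genome (i.e. the chromosome column is always 1 to 22, X or Y)? If so, how about this:
for chr in $(seq 1 22) X Y
do
    # start each per-chromosome file with the header line
    head -n1 data.txt >"chr$chr.txt"
done
# append every data row to the file named after its chromosome (column 2)
awk 'NR != 1 { print $0 >>("chr"$2".txt") }' data.txt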
(This is a second edit, based on @Sasha's comment above.)
Note that the parentheses around ("chr"$2".txt") are apparently not needed with GNU awk, but they are required by my OS X version of awk.
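If the set of chromosome labels is not known in advance, the same idea can probably be done in a single awk pass that writes the header the first time each chromosome is seen. This is only a sketch under a few assumptions: the function name split_by_chr is made up for illustration, the input is whitespace-separated with the chromosome in column 2, and the output names follow the file.chrN.txt pattern from the question.

```shell
#!/bin/sh
# Sketch: split a table by its chromosome column (field 2), copying the
# header line into every output file. split_by_chr is a hypothetical helper
# name; output files are written to the current directory.
split_by_chr() {
    awk '
        NR == 1 { header = $0; next }      # remember the header, do not print it yet
        {
            out = "file.chr" $2 ".txt"
            if (!(out in seen)) {          # first row seen for this chromosome
                print header > out         # create (or truncate) the file with the header
                close(out)
                seen[out] = 1
            }
            if (out != prev) {             # chromosome changed: release the previous handle
                if (prev != "") close(prev)
                prev = out
            }
            print $0 >> out                # append the data row
        }
    ' "$1"
}
```

Call it as split_by_chr data.txt. Closing the previous file whenever the chromosome changes keeps only one output file open at a time, which should avoid the open-file limit that some awk versions have; on input that is not sorted by chromosome it still works, just with more reopening.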