利用kseq.h parse fasta/fastq 文件

在分析中经常需要统计fasta/fastq文件的序列数和碱基数，但是没有找到一些专门做这件事的小工具，可能是这个功能太简单了；

之前用自己写的perl的脚本统计这些信息，当fastq文件非常大时，就变的很慢；

今天在网上搜到kseq.h可以parse fasta/fastq文件，用C写的，速度很快；

http://lh3lh3.users.sourceforge.net/parsefastq.shtml

自己试了一下, 在这个基础上添加个小功能，命名为parse.c：

#include <zlib.h>

#include <stdio.h>

#include <string.h>

#include "kseq.h"

// STEP 1: declare the type of file handler and the read() function

KSEQ_INIT(gzFile, gzread)  

int main(int argc, char *argv[])

{

    gzFile fp;

    kseq_t *seq;

    long seqs = ;

    long bases = ;

    int l;

    if (argc == ) {

        fprintf(stderr, "Usage: %s <in.seq>\n", argv[]);

        return ;

    }

    fp = gzopen(argv[], "r"); // STEP 2: open the file handler

    seq = kseq_init(fp); // STEP 3: initialize seq

    while ((l = kseq_read(seq)) >= ) { // STEP 4: read sequence

        //printf("name: %s\n", seq->name.s);

        //if (seq->comment.l) printf("comment: %s\n", seq->comment.s);

        //printf("seq: %s\n", seq->seq.s);

        //if (seq->qual.l) printf("qual: %s\n", seq->qual.s);

        bases += strlen(seq->seq.s);

        seqs += ;

    }

    //printf("return value: %d\n", l);

    printf("reads: %ld\n", seqs);

    printf("bases: %ld\n", bases);

    kseq_destroy(seq); // STEP 5: destroy seq

    gzclose(fp); // STEP 6: close the file handler

    return ;

}

然后编译

gcc -o fastx_read_length -lz parse.c

因为调用zlib,读取压缩文件，所以编译时需要添加-lz 选项；

测试了一下可以跑通；感觉kseq.h功能好强大，支持fasta/fastq,支持gzip压缩文件