RNA-seq(3):学习笔记序列对比

主要参考网站：
chip详细分析流程
序列比对：Hisat2

1.需要建立一个index文件有两种方法。

为啥要这个index？需要把测序数据和这个参考基因组做对比，但是又不能直接和基因组做对比，不然哪儿跟哪儿可能区分不开，只能拿个简化版的注释文件做对比。

第一种，直接去HISAT2这个网站下载就好了。生信小白就是这么干的。这次采用的是mm10 j基因组，所以下载了这个，但是，wget 方式太慢，又没有迅雷会员，便用IDM软件下载，还算快吧，但也好久了。（15：00 -24:00)
在这里插入图片描述

第二种方式，自己下载基因组，自己用Hisat2软件构建index文件，但是我看不懂，更不会自己构建，算了，还是直接下载吧。

# 其实hisat2-buld在运行的时候也会自己寻找exons和splice_sites，但是先做的目的是为了提高运行效率
extract_exons.py gencode. >  &
extract_splice_sites.py gencode. > hg19.splice_sites.gtf &
# 建立index， 必须选项是基因组所在文件路径和输出的前缀
hisat2-build --ss hg19.splice_sites.gtf --exon  genome/hg19/ hg19

上述代码参考这个网站

解压：

tar -zxvf *.

2.下载index文件的时间，等着也是等着，可以把之前下载的SRA转化为FASTQ格式具体代码可以参考序列比对：Hisat2。
我之前直接下载了FASTQ格式，这一步便省略了。对于我来说，步骤越少越好。
但是接下来，序列对比，有两种思路。
第一种，bowtie2 & samtools
第二种，hisat2 & samttols
上边两个参考网站，分别对应了这两种因为今天没有下载好 genome的index文件，所以具体操作不知道，但是，理论上看，都可以，而且HISAT2 比对速度确实很快，一个样本转录组数据比对 2G 的核糖体 RNA 参考基因组约 25 分钟，bowtie2 需要 170 分钟（线程数都是 4）。因此我决定先按照序列比对：Hisat2第二种方式去做。
代码：

cd  /mnt/f/chip_seq/aligned
for ((i=5;i<=9;i++));do hisat2 -t -x /mnt/f/chip_seq/data/reference/index/mm10/genome -U /mnt/f/chip_seq/data/SRR62020${i}. -S SRR62020${i}.sam; done
#如果是双端测序，使用下面这个代码，去掉	U，更换成-1 -2
 for ((i=19;i<=23;i++));do hisat2 -t -x /mnt/d/biotech/chip/ref/mm10/geno
me -p 8 -1 /mnt/d/biotech/RNA/SRR53377${i}_1. -2 /mnt/d/biotech/RNA/SRR53377${i}_2. -S SRR53377${i}.sam;
 done

在对比的过程中，出现了这个错误代码：

Traceback (most recent call last):
  File "/home/leo/miniconda3/bin/hisat2_read_statistics.py", line 210, in <module>
    reads_stat(args.read_file, args.read_count)
  File "/home/leo/miniconda3/bin/hisat2_read_statistics.py", line 168, in reads_stat
    for id, seq in fstream:
  File "/home/leo/miniconda3/bin/hisat2_read_statistics.py", line 44, in parser_FQ
    if line[0] == '@':
IndexError: index out of range
Could not locate a HISAT2 index corresponding to basename "/mnt/f/chip_seq/data/reference/index/mm10/"

后来我仔细看了看，发现是，index定位不准确，没有加上genome 。

/mnt/f/chip_seq/data/reference/index/mm10/genome

但是，这些关于index out of range 的错误先不用管。可能是fastq文件里面有一些空格。但这些可以被允许，尽管是错误。
可以参考一下几个网址：
Question: HISAT2: an “IndexError: index out of range”
Index out of range?
Python-IndexError: list index out of range
python 报错不影响hisat2运行

最后的结果：

Traceback (most recent call last):
  File "/home/leo/miniconda3/bin/hisat2_read_statistics.py", line 210, in <module>
    reads_stat(args.read_file, args.read_count)
  File "/home/leo/miniconda3/bin/hisat2_read_statistics.py", line 168, in reads_stat
    for id, seq in fstream:
  File "/home/leo/miniconda3/bin/hisat2_read_statistics.py", line 44, in parser_FQ
    if line[0] == '@':
IndexError: index out of range
Time loading forward index: 00:00:26
Time loading reference: 00:00:04
Multiseed full-index search: 00:27:09
19687027 reads; of these:
  19687027 (100.00%) were unpaired; of these:
    8549808 (43.43%) aligned 0 times
    9537058 (48.44%) aligned exactly 1 time
    1600161 (8.13%) aligned >1 times
56.57% overall alignment rate
Time searching: 00:27:13
Overall time: 00:27:39

第一个用了27min.
随后对把sam文件转换成bam文件，代码如下：

 samtools sort ./SRR620204.sam  -o ring1B.bam

其实这样做会导致文件量很大，而且内部的数据没有很好的排序，可以如下操作：（当然，我这一次没这么做，直接就按照上边的代码写了下去）

# 首先将比对后的sam文件转换成bam文件
# 利用的是samtools的view选项，参数-S 输入sam文件；参数-b 指定输出的文件为bam；最后重定向写入bam文件
$ cd mnt/f/rna_seq/aligned
$ for ((i=56;i<=62;i++));do samtools view -S SRR35899${i}.sam -b > SRR35899${i}.bam;done
# 将所有的bam文件按默认的染色体位置进行排序
$ for ((i=56;i<=62;i++));do samtools sort SRR35899${i}.bam -o SRR35899${i}_sorted.bam;done
# 将所有的排序文件建立索引，索引文件.bai后缀
$ for ((i=56;i<=62;i++));do samtools index SRR35899${i}_sorted.bam;done

**
3.拿到bam文件，应该再进行一次质控**：
具体步骤参考序列比对：Hisat2
另一个关于bam文件加载入IGV和质控的网站。
质控结果如下：（摘自上述第二个网站）在这里插入图片描述

4.反正上述过程所用的时间很久，可以提前准备R里的软件
如果是ChIP-seq,需要下载 – chipseeker

source ("/")
biocLite("ChIPseeker")
biocLite(".")

然后又是很久。

整个帖子的过程我用了两天下午+晚上做完了。毕竟是新手可能会慢很多

秒客网

RNA-seq(3):学习笔记序列对比

相关文章

RNA-seq(3):学习笔记 序列对比

相关文章

RNA-seq(3):学习笔记序列对比