基因组测序模拟
一、摘要
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
二、材料和方法
1、硬件平台
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz
安装内存(RAM):16.0GB
2、系统平台
Windows 8.1,Ubuntu
3、软件平台
- art_454
- GenomeABC http://crdd.osdd.net/raghava/genomeabc/
- Python3.5
- Biopython
4、数据库资源
NCBI数据库:https://www.ncbi.nlm.nih.gov/
5、研究对象
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
6、方法
- art_454的使用
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。 - GenomeABC
进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。 - 编程模拟测序
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
三、结果
1、art_454的运行结果
无参数art_454运行,阅读帮助文档
图表 1无参数art_454运行
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
下图为模拟单端测序,程序运行过程及结果
图表 2 art454单端测序
图表 3 art454单端模拟结果
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
下图为模拟双端测序,程序运行过程及结果
图表 4 art454双端测序
图表 5 art454双端模拟结果
2、GenomeABC
下图为设置参数页面
下图为结果下载页面
图表 6 结果下载页面
3、编程模拟测序结果
拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
单端测序
图表 7 程序模拟单端测序
双端测序
图表 8 程序模拟双端测序
测序结果
图表 9 结果文件
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
测序结果统计表
测序方式 | 基因组大小(bp) | 片段长度区间 (bp) | N值 | 期望片段长度 | 克隆保留率 | 片段数量 | Reads长度范围(bp) | Reads总数量 | Reads总长度 | 覆盖度(m值) | 理论丢失率(e-m) | 覆盖率(1-e-m) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
单端 | 12157kb | 200-1000 | 10 | 600 | 0.95 | 107378 | 50-100 | 101968 | 7645.541kb | 0.62889 | 0.53318 | 0.46682 |
单端 | 12157kb | 200-1000 | 20 | 600 | 0.95 | 213722 | 50-100 | 202996 | 15227.882kb | 1.25259 | 0.28576 | 0.71424 |
双端 | 12157kb | 200-1000 | 10 | 600 | 0.95 | 106704 | 50-100 | 202770 | 15212.662kb | 1.25134 | 0.28612 | 0.71388 |
双端 | 12157kb | 200-1000 | 20 | 600 | 0.95 | 214212 | 50-100 | 407186 | 30534.265kb | 2.51164 | 0.08114 | 0.91886 |
四、讨论和结论
程序运行方法
在类的构造方法init()中,调整参数。
Averagefragmentlength为片段平均的长度;
minfragmentlength和maxfragmentlength是保留片段的范围;
cloneRetainprobability是克隆的保留率;
minreadslength和maxreadslength是测序reads的长度范围
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
附录
from Bio import SeqIO
from math import exp
import random
class Sequencing:
# N代表拷贝份数
def __init__(self)
self.fragmentList = []
self.readsID = 1
self.readsList = []
self.averagefragmentlength = 650
self.minfragmentlength = 500
self.maxfragmentlength = 800
self.cloneRetainprobability = 1
self.minreadslength = 50
self.maxreadslength = 150
self.N = 10
self.genomeLength = 0
self.allreadslength = 0
# 生成断裂点
def generatebreakpoint(self, seqlen, averageLength):
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
breakpoint.append(seqlen)
breakpoint.append(0)
# 把随机断裂点从小到大排序
breakpoint.sort()
return breakpoint
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
for i in range(len(breakpoint) - 1):
fragment = seq[breakpoint[i]:breakpoint[i + 1]]
if maxfragmentlength > len(fragment) > minfragmentlength:
self.fragmentList.append(fragment)
return self.fragmentList
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
def clonefragment(self, fragmentList, cloneRetainprobability):
clonedfragmentList = []
Lossprobability = [random.random() for _ in range(len(fragmentList))]
for i in range(len(fragmentList)):
if Lossprobability[i] <= cloneRetainprobability:
clonedfragmentList.append(fragmentList[i])
return clonedfragmentList
# 模拟单端测序,并修改reads的ID号
def singleread(self, clonedfragmentList):
for fragment in clonedfragmentList:
fragment.id = ""
fragment.name = ""
fragment.description = fragment.description[12:].split(",")[0]
fragment.description = str(self.readsID) + "." + fragment.description
self.readsID += 1
readslength = random.randint(self.minreadslength, self.maxreadslength)
self.allreadslength += readslength
self.readsList.append(fragment[:readslength])
def singlereadsequencing(self, genomedata, sequencingResult):
for seq_record in SeqIO.parse(genomedata, "fasta"):
seqlen = len(seq_record)
self.genomeLength += seqlen
for i in range(self.N):
# 生成断裂点
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
# 沿断裂点打断基因组
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
# 模拟克隆时的随机丢失情况
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
# 模拟单端测序
self.singleread(clonedfragmentList)
SeqIO.write(self.readsList, sequencingResult, "fasta")
def pairread(self, clonedfragmentList):
for fragment in clonedfragmentList:
fragment.id = ""
fragment.name = ""
description = fragment.description[12:].split(",")[0]
fragment.description = str(self.readsID) + "." + description
readslength = random.randint(self.minreadslength, self.maxreadslength)
self.allreadslength += readslength
self.readsList.append(fragment[:readslength])
readslength = random.randint(self.minreadslength, self.maxreadslength)
self.allreadslength += readslength
fragmentcomplement = fragment.reverse_complement()
fragmentcomplement.id = ""
fragmentcomplement.name = ""
fragmentcomplement.description = str(self.readsID) + "." + description
self.readsList.append(fragmentcomplement[:readslength])
self.readsID += 1
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
for seq_record in SeqIO.parse(genomedata, "fasta"):
seqlen = len(seq_record)
self.genomeLength += seqlen
for i in range(self.N):
# 生成断裂点
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
# 沿断裂点打断基因组
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
# 模拟克隆时的随机丢失情况
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
# 模拟双端测序
self.pairread(clonedfragmentList)
readsList_1 = [self.readsList[i] for i in range(len(self.readsList)) if i % 2 == 0]
readsList_2 = [self.readsList[i] for i in range(len(self.readsList)) if i % 2 == 1]
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
def resultsummary(self):
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
print("N值:" + str(self.N))
print("期望片段长度:" + str(self.averagefragmentlength))
print("克隆保留率:" + str(self.cloneRetainprobability))
print("片段数量:" + str(len(self.fragmentList)))
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
print("reads总数量:" + str(len(self.readsList)))
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
m = self.allreadslength / self.genomeLength
print("覆盖度(m值):" + str(round(m, 5)))
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
# -------------------------------------------主程序-------------------------------------------
# 模拟单端测序
sequencingObj = Sequencing()
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
sequencingObj.resultsummary()
# 模拟双端测序
sequencingObj = Sequencing()
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
sequencingObj.resultsummary()
from Bio import SeqIO
from math import exp
import random
class Sequencing:
# N代表拷贝份数
def __init__(self):
self.fragmentList = []
self.readsID = 1
self.readsList = []
self.averagefragmentlength = 650
self.minfragmentlength = 500
self.maxfragmentlength = 800
self.cloneRetainprobability = 1
self.minreadslength = 50
self.maxreadslength = 150
self.N = 10
self.genomeLength = 0
self.allreadslength = 0
# 生成断裂点
def generatebreakpoint(self, seqlen, averageLength):
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
breakpoint.append(seqlen)
breakpoint.append(0)
# 把随机断裂点从小到大排序
breakpoint.sort()
return breakpoint
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
for i in range(len(breakpoint) - 1):
fragment = seq[breakpoint[i]:breakpoint[i + 1]]
if maxfragmentlength > len(fragment) > minfragmentlength:
self.fragmentList.append(fragment)
return self.fragmentList
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
def clonefragment(self, fragmentList, cloneRetainprobability):
clonedfragmentList = []
Lossprobability = [random.random() for _ in range(len(fragmentList))]
for i in range(len(fragmentList)):
if Lossprobability[i] <= cloneRetainprobability:
clonedfragmentList.append(fragmentList[i])
return clonedfragmentList
# 模拟单端测序,并修改reads的ID号
def singleread(self, clonedfragmentList):
for fragment in clonedfragmentList:
fragment.id = ""
fragment.name = ""
fragment.description = fragment.description[12:].split(",")[0]
fragment.description = str(self.readsID) + "." + fragment.description
self.readsID += 1
readslength = random.randint(self.minreadslength, self.maxreadslength)
self.allreadslength += readslength
self.readsList.append(fragment[:readslength])
def singlereadsequencing(self, genomedata, sequencingResult):
for seq_record in SeqIO.parse(genomedata, "fasta"):
seqlen = len(seq_record)
self.genomeLength += seqlen
for i in range(self.N):
# 生成断裂点
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
# 沿断裂点打断基因组
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
# 模拟克隆时的随机丢失情况
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
# 模拟单端测序
self.singleread(clonedfragmentList)
SeqIO.write(self.readsList, sequencingResult, "fasta")
def pairread(self, clonedfragmentList):
for fragment in clonedfragmentList:
fragment.id = ""
fragment.name = ""
description = fragment.description[12:].split(",")[0]
fragment.description = str(self.readsID) + "." + description
readslength = random.randint(self.minreadslength, self.maxreadslength)
self.allreadslength += readslength
self.readsList.append(fragment[:readslength])
readslength = random.randint(self.minreadslength, self.maxreadslength)
self.allreadslength += readslength
fragmentcomplement = fragment.reverse_complement()
fragmentcomplement.id = ""
fragmentcomplement.name = ""
fragmentcomplement.description = str(self.readsID) + "." + description
self.readsList.append(fragmentcomplement[:readslength])
self.readsID += 1
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
for seq_record in SeqIO.parse(genomedata, "fasta"):
seqlen = len(seq_record)
self.genomeLength += seqlen
for i in range(self.N):
# 生成断裂点
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
# 沿断裂点打断基因组
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
# 模拟克隆时的随机丢失情况
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
# 模拟双端测序
self.pairread(clonedfragmentList)
readsList_1 = [self.readsList[i] for i in range(len(self.readsList)) if i % 2 == 0]
readsList_2 = [self.readsList[i] for i in range(len(self.readsList)) if i % 2 == 1]
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
def resultsummary(self):
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
print("N值:" + str(self.N))
print("期望片段长度:" + str(self.averagefragmentlength))
print("克隆保留率:" + str(self.cloneRetainprobability))
print("片段数量:" + str(len(self.fragmentList)))
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
print("reads总数量:" + str(len(self.readsList)))
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
m = self.allreadslength / self.genomeLength
print("覆盖度(m值):" + str(round(m, 5)))
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
# -------------------------------------------主程序-------------------------------------------
# 模拟单端测序
sequencingObj = Sequencing()
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
sequencingObj.resultsummary()
# 模拟双端测序
sequencingObj = Sequencing()
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
sequencingObj.resultsummary()