How can I plot a gene graph for a DNA sequence, say ATGCCGCTGCGC?

I need to generate a random walk from the DNA sequence of a virus, which is about 2k base pairs long. The sequence looks like "ATGCGTCGTAACGT". The path should move right for an A, left for a T, up for a G and down for a C. How can I do this in Matlab, Mathematica or SPSS?

6 solutions

#1

I did not previously know of Mark McClure's blog about Chaos Game representation of gene sequences, but it reminded me of an article by Jose Manuel Gutiérrez (The Mathematica Journal Vol 9 Issue 2), which also gives a chaos game algorithm for an IFS using (the four bases of) DNA sequences. A detailed description may be found here (the original article).

The method may be used to produce plots such as the following. Just for the hell of it, I've included (in the RHS panels) the plots generated with the corresponding complementary DNA strand (cDNA).

Mouse Mitochondrial DNA (LHS) and its complementary strand (cDNA) (RHS).

These plots were generated from GenBank Identifier gi|342520. The sequence contains 16295 bases.

(One of the examples used by Jose Manuel Gutiérrez. If anyone is interested, plots for the human equivalent may be generated from gi|1262342).

Human Beta Globin Region (LHS) and its cDNA (RHS)

Generated from gi|455025| (the example used by Mark McClure). The sequence contains 73308 bases.

These are pretty interesting plots! The (sometimes) fractal nature of such plots is known, but the symmetry obvious in the LHS vs RHS (cDNA) versions was very surprising (at least to me).

The nice thing is that such plots for any DNA sequence may be very easily generated by directly importing the sequence (from, say, GenBank), and then using the power of Mma.
All you need is the accession number! ('Unknown' nucleotides such as "R" may need to be zapped.) (I am using Mma v7.)

The Original Implementation (slightly modified) (by Jose Manuel Gutiérrez)

Important Update

On the advice of Mark McClure, I have changed Point/@Orbit[s, Union[s]] to Point@Orbit[s, Union[s]].

This speeds things up very considerably. See Mark's comment below.

Orbit[s_List, {a_, b_, c_, d_}] := 
  OrbitMap[s /. {a -> {0, 0}, b -> {0, 1}, c -> {1, 0}, 
     d -> {1, 1}}];
OrbitMap = 
  Compile[{{m, _Real, 2}}, FoldList[(#1 + #2)/2 &, {0, 0}, m]];
IFSPlot[s_List] := 
 Show[Graphics[{Hue[{2/3, 1, 1, .5}], AbsolutePointSize[2.5], 
    Point @ Orbit[s, Union[s]]}], AspectRatio -> Automatic, 
  PlotRange -> {{0, 1}, {0, 1}}, 
  GridLines -> {Range[0, 1, 1/2^3], Range[0, 1, 1/2^3]}]

This gives a blue plot. For green, change Hue[] to Hue[{1/3,1,1,.5}]
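
For readers who do not use Mathematica, the midpoint iteration behind Orbit and OrbitMap can be sketched in Python. This is only an illustration, not the article's code: the names CORNERS and cgr_orbit are made up here, and the corner assignment assumes the sequence contains exactly the four bases A, C, G, T, so that Union[s] sorts them alphabetically.

```python
from typing import List, Tuple

# Assumed corner assignment, mirroring Orbit[s, Union[s]] when the
# sequence contains exactly A, C, G, T (Union sorts them alphabetically).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 0.0), "T": (1.0, 1.0)}


def cgr_orbit(seq: str) -> List[Tuple[float, float]]:
    """Start at (0, 0) and move halfway toward the corner of each base in turn.

    The returned list includes the starting point, like FoldList does.
    """
    x, y = 0.0, 0.0
    points = [(x, y)]
    for base in seq:
        cx, cy = CORNERS[base]  # raises KeyError for anything but A, C, G, T
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_orbit("ATGC"):
        print(point)
```

Plotting the returned points as small dots in the unit square should give the same kind of picture as IFSPlot.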

The following code now generates the first plot (for mouse mitochondrial DNA)

IFSPlot[Flatten@
  Characters@
   Rest@Import[
     "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=342520&rettype=fasta&retmode=text", "Data"]]

To get a cDNA plot I used the following transformation rules (and also changed the Hue setting):

IFSPlot[    ....   "Data"] /. {"A" -> "T", "T" -> "A", "G" -> "C", 
   "C" -> "G"}]
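
A possible explanation for the LHS/RHS symmetry, offered as a sketch rather than as anything stated in the article: if the corners are assigned as A (0,0), C (0,1), G (1,0), T (1,1), then the complement rule swaps diagonally opposite corners. The orbit of the complemented sequence is then the original orbit rotated by 180 degrees about the centre (1/2, 1/2), apart from an offset that halves at every step. The Python below checks that relation exactly with rational arithmetic; the function names are hypothetical, and note that the rule complements the bases without reversing their order.

```python
from fractions import Fraction

# Assumed corner assignment (alphabetical order of the four bases).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def orbit(seq):
    """Chaos game orbit from (0, 0), computed with exact fractions."""
    x, y = Fraction(0), Fraction(0)
    points = [(x, y)]
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def complement(seq):
    """Replace each base by its complement, keeping the order."""
    return "".join(COMPLEMENT[base] for base in seq)


def symmetry_holds(seq):
    """Check q_n == (1, 1) - p_n - (1, 1) / 2**n at every step n,
    where p is the orbit of seq and q the orbit of its complement."""
    p = orbit(seq)
    q = orbit(complement(seq))
    for n, ((px, py), (qx, qy)) in enumerate(zip(p, q)):
        shift = Fraction(1, 2 ** n)
        if (qx, qy) != (1 - px - shift, 1 - py - shift):
            return False
    return True


if __name__ == "__main__":
    print(symmetry_holds("ATGCGTCGTAACGT"))
```

After a few dozen bases the offset is far smaller than a plotted point, so the two panels look like exact point reflections of each other.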

Thanks to Sjoerd C. de Vries and telefunkenvf14 for help in directly importing sequences from the NCBI site.

Splitting things up a bit, for the sake of clarity.

Import a Sequence

mouseMitoFasta=Import["http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=342520&rettype=fasta&retmode=text","Data"];

The method given for importing sequences in the original Mathematica J. article is dated.

A nice check

First@mouseMitoFasta

Output:

{>gi|342520|gb|J01420.1|MUSMTCG Mouse mitochondrion, complete genome}

Generation of the list of bases

mouseMitoBases=Flatten@Characters@Rest@mouseMitoFasta

Some more checks

{Length@mouseMitoBases, Union@mouseMitoBases,Tally@mouseMitoBases}

Output:

{16295,{A,C,G,T},{{G,2011},{T,4680},{A,5628},{C,3976}}}
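
The same import-and-check steps can be sketched in Python for FASTA text that has already been downloaded. This is a minimal sketch under stated assumptions: the text holds a single record, the helper names parse_fasta and keep_acgt are invented for illustration, and the sample record is made up rather than taken from GenBank. The filter is one way to drop 'unknown' nucleotides such as "R".

```python
from collections import Counter


def parse_fasta(text):
    """Split single-record FASTA text into (header line, sequence string)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0]
    sequence = "".join(lines[1:]).upper()
    return header, sequence


def keep_acgt(sequence):
    """Drop every character that is not A, C, G or T (e.g. ambiguity codes)."""
    return "".join(base for base in sequence if base in "ACGT")


if __name__ == "__main__":
    # Hypothetical miniature record, not real GenBank data.
    sample = ">example header\nATGCR\nGTCGT\n"
    header, raw = parse_fasta(sample)
    bases = keep_acgt(raw)
    # Counterparts of First@..., Length@..., Union@... and Tally@...
    print(header)
    print(len(bases), sorted(set(bases)), Counter(bases))
```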

The second set of plots was generated in a similar manner from gi|455025. Note that the sequence is long!

{73308,{A,C,G,T},{{G,14785},{A,22068},{T,22309},{C,14146}}}

One final example (containing 265922 bp), also showing fascinating 'fractal' symmetry. (These were generated with AbsolutePointSize[1] in IFSPlot).

The first line of the fasta file:

{>gi|328530803|gb|AFBL01000008.1| Actinomyces sp. oral taxon 170 str. F0386 A_spOraltaxon170F0386-1.0_Cont9.1, whole genome shotgun sequence}

The corresponding cDNA plot is again shown in blue on the RHS.

Finally, Mark's method also gives very beautiful plots (for example with gi|328530803), and may be downloaded as a notebook.

#2

Not that I really understand the "graph" you want, but here is one literal interpretation.

None of the following code is necessarily in its final form. I want to know if this is right before I try to refine anything.

rls = {"A" -> {1, 0}, "T" -> {-1, 0}, "G" -> {0, 1}, "C" -> {0, -1}};
Prepend[Characters@"ATGCGTCGTAACGT" /. rls, {0, 0}];
Graphics[Arrow /@ Partition[Accumulate@%, 2, 1]]

Prepend[Characters@"TCGAGTCGTGCTCA" /. rls, {0, 0}];
Graphics[Arrow /@ Partition[Accumulate@%, 2, 1]]

3D Options

i = 0;
Prepend[Characters@"ATGCGTCGTAACGT" /. rls, {0, 0}];
Graphics[{Hue[i++/Length@%], Arrow@#} & /@ 
  Partition[Accumulate@%, 2, 1]]

i = 0;
Prepend[Characters@"ATGCGTCGTAACGT" /. 
    rls /. {x_, y_} :> {x, y, 0.3}, {0, 0, 0}];
Graphics3D[{Hue[i++/Length@%], Arrow@#} & /@ 
  Partition[Accumulate@%, 2, 1]]

Now that I know what you want, here is a packaged version of the first function:

genePlot[s_String] :=
 Module[{rls},
  rls =
   {"A" -> { 1, 0},
    "T" -> {-1, 0},
    "G" -> {0,  1},
    "C" -> {0, -1}};
  Graphics[Arrow /@ Partition[#, 2, 1]] & @
   Accumulate @ Prepend[Characters[s] /. rls, {0, 0}]
]

Use it like this:

genePlot["ATGCGTCGTAACGT"]
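
As a cross-check of the step rule, here is a rough Python equivalent of the Accumulate step in genePlot. It is a sketch only: STEPS and dna_walk are illustrative names, the initial= keyword needs Python 3.8 or later, and characters other than A, T, G, C raise a KeyError instead of being skipped.

```python
from itertools import accumulate

# Step vectors from the question: A right, T left, G up, C down.
STEPS = {"A": (1, 0), "T": (-1, 0), "G": (0, 1), "C": (0, -1)}


def dna_walk(seq):
    """Return the vertices of the walk, starting at the origin."""
    steps = [STEPS[base] for base in seq]
    return list(
        accumulate(
            steps,
            lambda pos, step: (pos[0] + step[0], pos[1] + step[1]),
            initial=(0, 0),
        )
    )


if __name__ == "__main__":
    path = dna_walk("ATGCGTCGTAACGT")
    print(path[-1])  # final position of the walk
```

For the example sequence the walk has 15 vertices, and the x and y coordinates after the origin agree with the two rows of T in answer #5.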

#3

It sounds like you are talking about CGR, the so-called Chaos Game Representation of a gene sequence. I blogged about this a few months ago: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html

#4

You might also try something like this...

RandomDNAWalk[seq_, path_] := 
 RandomDNAWalk[StringDrop[seq, 1], 
  Join[path, getNextTurn[StringTake[seq, 1]]]];

RandomDNAWalk["", path_] := Accumulate[path];

getNextTurn["A"] := {{1, 0}};
getNextTurn["T"] := {{-1, 0}};
getNextTurn["G"] := {{0, 1}};
getNextTurn["C"] := {{0, -1}};

ListLinePlot[
 RandomDNAWalk[
  StringJoin[RandomChoice[{"A", "T", "C", "G"}, 2000]], {{0, 0}}]]

#5

Assuming that the sequence S has already been mapped*) to an integer array, the actual computation of movements is straightforward based on rules R:

R =
   1  -1   0   0
   0   0   1  -1
S =
   1   2   3   4   3   2   4   3   2   1   1   4   3   2
T= cumsum(R(:, S), 2)
T =
   1   0   0   0   0  -1  -1  -1  -2  -1   0   0   0  -1
   0   0   1   0   1   1   0   1   1   1   1   0   1   1

*) You need to elaborate more on the actual sequence. Is it represented as a single string, or perhaps as a cell array, or something else?
Edit:
Assuming your sequence is represented as a string, you can map it to the integer sequence S like this:

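r = zeros(1, 84);                 % lookup table indexed by character code ('T' is 84, the largest)
r(double('ATGC')) = [1 2 3 4];    % single quotes give a char array in both MATLAB and Octave
S = r(double('ATGCGTCGTAACGT'))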

And to plot it:

plot([0 T(1, :)], [0 T(2, :)], linespec)

where linespec is the desired line specification.

#6

This question seems to have been well answered already, but I thought I'd add that what you are describing has been previously published under the banner of DNA walks among a collection of numerical representation methods for DNA sequences, which are discussed in our preprint.

It turns out that DNA walks aren't very useful in practice, yet permit intuitive visualisation. I don't have it to hand, but I'd imagine my colleague would be more than happy to provide the Matlab code used to generate the below figure.
