对大型数据集执行操作

I have to perform some analysis on a PSL record which contains information on DNA sequence fragments. Basically I have to find entries that are from the same read in the same contig (these are both values in the PSL entry). The problem is the PSL records are large (10-30 Mb text documents). I wrote a program that works on short records and on the long records given enough time but it took way longer than specified. I was told the program shouldn't take more than ~15 seconds. Mine took over 15 minutes.

我必须对PSL记录进行一些分析，其中包含有关DNA序列片段的信息。基本上我必须在同一个重叠群中找到来自相同读数的条目（这些都是PSL条目中的值）。问题是PSL记录很大（10-30 Mb文本文档）。我写了一个程序，用于短记录和长记录，给予足够的时间，但它比指定的时间更长。我被告知该程序不应超过约15秒。我花了15分钟。

PSL records look like this:

PSL记录如下所示：

275 11  0   0   0   0   0   0   -   M02034:35:000000000-A7UU0:1:1101:19443:1992/2   286 0   286 NODE_406138_length_13407_cov_13.425076  13465   408 694 1   286,    0,  408,

171 5   0   0   0   0   0   0   +   M02034:35:000000000-A7UU0:1:1101:13497:2001/2   294 0   176 NODE_500869_length_34598_cov_30.643419  34656   34334   34510   1   176,    0,  34334,

188 14  0   10  0   0   0   0   +   M02034:35:000000000-A7UU0:1:1101:18225:2002/1   257 45  257 NODE_455027_length_12018_cov_13.759444  12076   11322   11534   1   212,    45, 11322,

My code looks like this:

我的代码如下所示：

import sys
class PSLreader :
    '''
    Class to provide reading of a file containing psl alignments
    formatted sequences:
    object instantiation:
    myPSLreader = PSLreader(<file name>):

    object attributes:
    fname: the initial file name

    methods:
    readPSL() : reads psl file, yielding those alignments that are within the first or last
                1000 nt

    readPSLpairs() : yields psl pairs that support a circular hypothesis 

    Author: David Bernick
    Date: May 12, 2013
    '''

    def __init__ (self, fname=''):
        '''contructor: saves attribute fname '''

        self.fname = fname

    def doOpen (self):
        if self.fname is '':
            return sys.stdin
        else:
            return open(self.fname)

    def readPSL (self):
        '''
        using filename given in init, returns each filtered psl records
        that contain alignments that are within the terminal 1000nt of
        the target. Incomplete psl records are discarded.
        If filename was not provided, stdin is used.

        This method selects for alignments that could may be part of a
        circle.

        Illumina pairs aligned to the top strand would have read1(+) and read2(-).
        For the bottoms trand, read1(-) and read2(+).

        For potential circularity,
        these are the conditions that can support circularity:
        read1(+) near the 3' *
        read1(-) near the 5' *
        read2(-) near the 5' *
        read2(+) near the 3' *

        so...
        any read(+) near the 3', or
        any read(-) near the 5'

        '''

        nearEnd = 1000   # this constant determines "near the end"
        with self.doOpen() as fileH:

            for line in fileH:
                pslList = line.split()
                if len(pslList) < 17:
                    continue
                tSize = int(pslList[14])
                tStart = int(pslList[15])
                strand = str(pslList[8])

                if strand.startswith('+') and (tSize - tStart > nearEnd):
                    continue
                elif strand.startswith('-') and (tStart > nearEnd):
                    continue

                yield line

    def readPSLpairs (self):
        read1 = []
        read2 = []

        for psl in self.readPSL():
            parsed_psl = psl.split()
            strand = parsed_psl[9][-1]
            if strand == '1':
                read1.append(parsed_psl)
            elif strand == '2':
                read2.append(parsed_psl)

        output = {}
        for psl1 in read1:
            name1 = psl1[9][:-1]
            contig1 = psl1[13]
            for psl2 in read2:
                name2 = psl2[9][:-1]
                contig2 = psl2[13]
                if  name1 == name2 and contig1 == contig2:
                    try:
                        output[contig1] += 1
                        break
                    except:
                        output[contig1] = 1
                        break

        print(output)


PSL_obj = PSLreader('EEV14-Vf.filtered.psl')
PSL_obj.readPSLpairs()

I was given some example code that looks like this:

我得到了一些示例代码，如下所示：

def doSomethingPairwise (a):
    for leftItem in a[1]:
        for rightItem in a[2]:
            if leftItem[1] is rightItem[1]:
                print (a)
thisStream = [['David', 'guitar', 1], ['David', 'guitar', 2],
['John', 'violin', 1], ['John', 'oboe', 2],
['Patrick', 'theremin', 1], ['Patrick', 'lute',2] ]
thisGroup = None
thisGroupList = [ [], [], [] ]

for name, instrument, num in thisStream:
    if name != thisGroup:

        doSomethingPairwise(thisGroupList)

        thisGroup = name
        thisGroupList = [ [], [], [] ]

    thisGroupList[num].append([name, instrument, num])
doSomethingPairwise(thisGroupList)

But when I tried to implement it my program still took a long time. Am I thinking about this the wrong way? I realize the nested loop is slow but I don't see an alternative.

但是当我试图实现它时，我的程序仍然需要很长时间。我是否以错误的方式思考这个问题？我意识到嵌套循环很慢但我没有看到替代方案。

Edit: I figured it out, the data was presorted which made my brute force solution very impractical and unnecessary.

编辑：我想通了，数据被预先分类，这使我的暴力解决方案非常不切实际和不必要。

1 个解决方案

#1

I hope help you, since, the question needs a best input example file

我希望能帮到你，因为这个问题需要一个最好的输入示例文件

#is better create PSLRecord class
class PSLRecord:
  def __init__(self, line):
    pslList = line.split()
    properties = ("matches", "misMatches", "repMatches", "nCount",
                 "qNumInsert", "qBaseInsert", "tNumInsert",
                 "tBaseInsert", "strand", "qName", "qSize", "qStart",
                 "qEnd", "tName", "tSize", "tStart", "tEnd", "blockCount",
                 "blockSizes", "qStarts", "tStarts")
    self.__dict__.update(dict(zip(properties, pslList)))

class PSLreader :
  def __init__ (self, fname=''):
    self.fname = fname

  def doOpen (self):
    if self.fname is '':
      return sys.stdin
    else:
      return open(self.fname)

  def readPSL (self):
    with self.doOpen() as fileH:
      for line in fileH:
        pslrc = PSLRecord(line)
        yield pslrc

  #return a dictionary with all psl records group by qName and tName
  def readPSLpairs (self):
    dictpsl = {}
    for pslrc in self.readPSL():
      #OP requirement, remove '1' or '2' char, in pslrc.qName[:-1]
      key = (pslrc.qName[:-1], pslrc.tName)
      if not key in dictpsl:
        dictpsl[key] = []
      dictpsl[key].append(pslrc)
    return dictpsl

#Function filter .... is better out and self-contained
def f_filter(pslrec, nearEnd = 1000):
  if (pslrec.strand.startswith('+') and  
     (int(pslrec.tSize) - int(pslrec.tStart) > nearEnd)):
    return False
  if (pslrec.strand.startswith('-') and 
     (int(pslrec.tStart) > nearEnd)):
    return False
  return True

PSL_obj = PSLreader('EEV14-Vf.filtered.psl')

#read dictionary of pairs
dictpsl = PSL_obj.readPSLpairs()

from itertools import product
#product from itertools
#(1) x (2,3) = (1,2),(1,3)

output = {}
for key, v in dictpsl.items():
  name, contig = key
  #i get filters aligns in principal strand
  strand_princ = [pslrec for pslrec in v if f_filter(pslrec) and
                 pslrec.qName[-1] == '1']
  #i get filters aligns in secondary strand
  strand_sec = [pslrec for pslrec in v if f_filter(pslrec) and
               pslrec.qName[-1] == '2']
  for pslrec_princ, pslrec_sec in product(strand_princ, strand_sec):
    #This For has fewer comparisons, since I was grouped before
    if not contig in output:
      output[contig] = 1
    output[contig] += 1

Note: 10-30 Mb isn't large file, if you ask me

注意：如果你问我，10-30 Mb不是大文件

#1