What is the best way to store set data in Python?

Time: 2022-03-07 09:16:58

I have a list of data in the following form:

[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]

The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.

Sometimes for the same id the descriptions can vary slightly, as follows:

IPI00110753

  • Tubulin alpha-1A chain
  • Tubulin alpha-1 chain
  • Alpha-tubulin 1
  • Alpha-tubulin isotype M-alpha-1

(Note that this example is taken from the uniprot protein database.)

I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.

I am currently solving this problem by using a dictionary type. However, I don't really like this solution because it uses a lot of memory (I have a lot of these IDs). This is only an intermediate listing of them: the IDs go through some additional processing before they are placed in the database, so I would like to keep my data structure smaller.

I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?

Thank you for reading my question,
Tim

Update

Based on some of the comments I received, let me clarify a little. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once to insert it into the database. However, down the line there may be additional annotation done before I insert into the database. Unfortunately, I don't know at this time whether that will happen.

Right now I am looking into storing this data in a structure that is not based on a hash table (i.e. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear, since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?

*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.

6 Answers

#1


1  

Sets don't have keys. The element is the key.

If you think you want keys, you have a mapping. More-or-less by definition.

Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.

Are you talking about a dictionary like this?

{ 'id1': [ ('description1a', 'type1'), ('description1b','type1') ], 
  'id2': [ ('description2', 'type2') ],
...
}

This sure seems minimal. ID's are only represented once.

Perhaps you have something like this?

{ 'id1': ( ('description1a', 'description1b' ), 'type1' ),
  'id2': ( ('description2',), 'type2' ),
...
}

I'm not sure you can find anything more compact unless you resort to using the struct module.

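For what it's worth, a rough sketch of what a struct-based layout might look like, assuming the ids are fixed-width (like the 11-character IPI accessions above) and the type can be reduced to a small integer code; the format and names here are illustrative assumptions, not something from the answer itself:

import struct

# Hypothetical fixed-width layout: an 11-byte id plus a 1-byte type code,
# i.e. 12 bytes per record instead of several Python objects.
RECORD = struct.Struct("11sB")

def pack_record(protein_id, type_code):
    return RECORD.pack(protein_id.encode("ascii"), type_code)

def unpack_record(data):
    raw_id, type_code = RECORD.unpack(data)
    return raw_id.decode("ascii"), type_code

packed = pack_record("IPI00110753", 1)
print(unpack_record(packed))   # ('IPI00110753', 1)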

#2


1  

I'm assuming the problem you are trying to solve by cutting down on memory use is the address space limit of your process. Additionally, you are looking for a data structure that allows fast insertion and reasonable sequential read-out.

Use fewer structures other than strings (str)

The question you ask is how to structure your data in one process so that it uses less memory. The one canonical answer to this (as long as you still need associative lookups) is to use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).

However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.

Alternative Solution

I would propose a different solution that does not involve changing your data structure into something that is harder to insert or interpret.

  • Split your information up into multiple processes, each holding whatever data structure is convenient for it.

  • Implement inter process communication with sockets such that processes might reside on other machines altogether.

  • Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles).

The advantages of the approach I outline are that

  • You get to use two or more cores on a machine fully for performance

  • You are not limited by the address space of one process, or even the physical memory of one machine

There are numerous packages and approaches to distributed processing, some of which are

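Whichever package ends up being used, a rough sketch of the partitioning idea itself might look like the following; it uses the standard-library multiprocessing module rather than sockets for brevity, and the sharding scheme and function names are illustrative assumptions rather than part of the answer above:

from multiprocessing import Pool

def dedupe_shard(records):
    # Each worker holds only its own shard's dict, so no single process
    # has to keep the whole id -> (description, type) mapping in memory.
    seen = {}
    for rec_id, description, rec_type in records:
        seen.setdefault(rec_id, (description, rec_type))
    return list(seen.items())

def shard_by_id(records, n_shards):
    # Route each record to a shard based on a hash of its id.
    shards = [[] for _ in range(n_shards)]
    for rec in records:
        shards[hash(rec[0]) % n_shards].append(rec)
    return shards

if __name__ == "__main__":
    records = [("IPI00110753", "Tubulin alpha-1A chain", "protein"),
               ("IPI00110753", "Alpha-tubulin 1", "protein")]
    with Pool(processes=2) as pool:
        results = pool.map(dedupe_shard, shard_by_id(records, 2))
    print([item for shard in results for item in shard])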

#3


1  

If you're doing an n-way merge with duplicate removal, the following may be what you're looking for.

This generator will merge any number of sources. Each source must be a sequence. The key must be in position 0. It yields the merged sequence one item at a time.

def merge(*sources):
    # Each source is a list of tuples whose sort key is in position 0.
    keyPos = 0
    for s in sources:
        s.sort()
    while any(len(s) > 0 for s in sources):
        # Pick the source whose current head has the smallest key.
        topEnum = enumerate([s[0][keyPos] if len(s) > 0 else None for s in sources])
        top = [t for t in topEnum if t[1] is not None]
        top.sort(key=lambda a: a[1])
        src, key = top[0]
        yield sources[src].pop(0)

This generator removes duplicates from a sequence.

def unique(sequence):
    # Collapse runs of items that share the same key (position 0);
    # assumes the input is already sorted by that key.
    keyPos = 0
    seqIter = iter(sequence)
    try:
        curr = next(seqIter)
    except StopIteration:
        return
    for item in seqIter:
        if item[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches here
            continue
        yield curr
        curr = item
    yield curr

Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.

for u in unique(merge(source1, source2, source3, ...)):
    print(u)

The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.

#4


0  

How about using a {id: (description, id_type)} dictionary? Or a {(id, id_type): description} dictionary, if (id, id_type) is the key?

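A minimal sketch of the first variant, collecting the varying descriptions under each id (the helper name is made up for illustration):

def build_index(records):
    # records: an iterable of (id, description, id_type) tuples
    index = {}
    for rec_id, description, rec_type in records:
        descriptions, _ = index.setdefault(rec_id, (set(), rec_type))
        descriptions.add(description)
    return index

index = build_index([
    ("IPI00110753", "Tubulin alpha-1A chain", "protein"),
    ("IPI00110753", "Alpha-tubulin 1", "protein"),
])
print(index["IPI00110753"])  # a set of descriptions plus the id_type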

#5


0  

Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented using dicts, but that has changed AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).

To use only a part of a tuple for the hashcode, you'd have to subclass tuple and override the hashcode method:

class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])

Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.

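One caveat worth adding: a set also uses __eq__ to decide whether two elements are duplicates, so for entries with the same id but different descriptions to collapse into one, equality would have to be overridden as well. A sketch of that variant:

class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])

    def __eq__(self, other):
        # Compare on the id only, so the set treats tuples that differ
        # just in their description as the same element.
        return self[0] == other[0]

s = {ProteinTuple("IPI00110753", "Tubulin alpha-1A chain", "protein"),
     ProteinTuple("IPI00110753", "Alpha-tubulin 1", "protein")}
print(len(s))  # 1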

I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.

#6


0  

It's still murky, but it sounds like you have several lists of [(id, description, type), ...].

The id's are unique within a list and consistent between lists.

You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.

For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.

This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.

source1.sort()
source2.sort()
result= []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append( source2.pop(0) )
    elif len(source2) == 0:
        result.append( source1.pop(0) )
    elif source1[0][0] < source2[0][0]:
        result.append( source1.pop(0) )
    elif source2[0][0] < source1[0][0]:
        result.append( source2.pop(0) )
    else:
        # keys are equal
        result.append( source1.pop(0) )
        # check for source2, to see if the description is different.

This assembles a union of two lists by sorting and merging. No mapping, no hash.

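If the intent of that last comment is to also consume source2's matching record when the keys are equal, so that each id ends up in the result only once, the completed loop might look something like this sketch (a guess at the idea, not necessarily what the author had in mind):

def merge_union(source1, source2):
    # Sort-and-merge as above, but the equal-keys branch also pops the
    # duplicate from source2 so each id appears once in the result.
    source1.sort()
    source2.sort()
    result = []
    while source1 or source2:
        if not source1:
            result.append(source2.pop(0))
        elif not source2:
            result.append(source1.pop(0))
        elif source1[0][0] < source2[0][0]:
            result.append(source1.pop(0))
        elif source2[0][0] < source1[0][0]:
            result.append(source2.pop(0))
        else:
            kept = source1.pop(0)
            source2.pop(0)  # drop the duplicate; descriptions may differ
            result.append(kept)
    return result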
