哪个数据结构按插入排序并具有快速“包含”检查?

时间:2021-01-27 16:58:25

I am looking for a data structure that preserves the order in which the elements were inserted and offers a fast "contains" predicate. I also need iterator and random access. Performance during insertion or deletion is not relevant. I am also willing to accept overhead in terms of memory consumption.

我正在寻找一种数据结构,它保留了元素的插入顺序,并提供了一个快速的“包含”谓词。我还需要迭代器和随机访问。插入或删除期间的性能无关紧要。我也愿意接受内存消耗方面的开销。

Background: I need to store a list of objects. The objects are instances of a class called Neuron and stored in a Layer. The Layer object has the following public interface:

背景:我需要存储一个对象列表。对象是名为Neuron的类的实例,并存储在Layer中。 Layer对象具有以下公共接口:

class Layer {
public:
    Neuron *neuronAt(const size_t &index) const;
    NeuronIterator begin();
    NeuronIterator end();
    bool contains(const Neuron *const &neuron) const;
    void addNeuron(Neuron *const &neuron);
};

The contains() method is called quite often when the software runs, I've asserted that using callgrind. I tried to circumvent some of the calls to contains(), but is still a hot spot. Now I hope to optimize exactly this method.

当软件运行时,通常会调用contains()方法,我断言使用callgrind。我试图规避对contains()的一些调用,但仍然是一个热点。现在我希望能够完全优化这种方法。

I thought of using std::set, using the template argument to provide my own comparator struct. But the Neuron class itself does not give its position in the Layer away. Additionally, I'd like to have *someNeuronIterator = anotherNeuron to work without screwing up the order.

我想过使用std :: set,使用template参数来提供我自己的比较器结构。但是神经元类本身并没有在Layer中找到它的位置。另外,我想让* someNeuronIterator = anotherNeuron在不搞砸订单的情况下工作。

Another idea was to use a plain old C array. Since I do not care about the performance of adding a new Neuron object, I thought I could make sure that the Neuron objects are always stored linear in memory. But that would invalidate the pointer I pass to addNeuron(); at least I'd have to change it to point to the new copy I created to keep things linear aligned. Right?

另一个想法是使用普通的旧C阵列。由于我不关心添加新的Neuron对象的性能,我想我可以确保Neuron对象始终在内存中保持线性。但这会使我传递给addNeuron()的指针无效;至少我必须改变它以指向我创建的新副本以保持线性对齐。对?

Another idea was to use two data structures in the Layer object: A vector/list for the order, and a map/hash for lookup. But that would contradict my wish for an iterator that allowed operator* without a const reference, wouldn't it?

另一个想法是在Layer对象中使用两个数据结构:顺序的向量/列表,以及用于查找的map / hash。但这与我对迭代器的期望相矛盾,迭代器允许运算符*没有const引用,不是吗?

I hope somebody can hint an idea for a data structure or a concept that would satisfy my needs, or at least give me an idea for an alternative. Thanks!

我希望有人可以暗示一个能够满足我需求的数据结构或概念,或者至少给我一个替代方案的想法。谢谢!

3 个解决方案

#1


If this contains check is really where you need the fastest execution, and assuming you can be a little intrusive with the source code, the fastest way to check if a Neuron belongs in a layer is to simply flag it when you insert it into a layer (ex: bit flag).

如果这包含检查确实是您需要最快执行的地方,并且假设您可能对源代码有点干扰,检查神经元是否属于某个层的最快方法是在将其插入图层时简单地标记它(例如:位标志)。

You have guaranteed O(1) checks at that point to see if a Neuron belongs in a layer and it's also fast at the micro-level.

你已经保证在那一点上进行O(1)检查,以查看神经元是否属于一个层,并且它在微观层面也是快速的。

If there can be numerous layer objects, this can get a little trickier, as you'll need a separate bit for each potential layer a neuron can belong to unless a Neuron can only belong in a single layer at once. This is reasonably manageable, however, if the number of layers are relatively fixed in size.

如果可能存在多个图层对象,这可能会变得有点棘手,因为除非神经元一次只能属于单个图层,否则您需要为神经元可以属于的每个潜在图层使用单独的位。然而,如果层的数量相对固定,则这是合理可管理的。

If the latter case and a Neuron can only belong to one layer at once, then all you need is a backpointer to Layer*. To see if a Neuron belongs in a layer, simply see if that backpointer points to the layer object.

如果后一种情况和一个神经元只能同时属于一个层,那么你所需要的只是一个对Layer *的反向指针。要查看神经元是否属于图层,只需查看该backpointer是否指向图层对象。

If a Neuron can belong to multiple layers at once, but not too many at one time, then you could store like a little array of backpointers like so:

如果一个神经元可以同时属于多个层,但不能同时属于多个层,那么你可以存储像一小部分的后置指针,如下所示:

struct Neuron
{
    ...
    Layer* layers[4]; // use whatever small size that usually fits the common case
    Layer* ptr;
    int num_layers;
};

Initialize ptr to point to layers if there are 4 or fewer layers to which the Neuron belongs. If there are more, allocate it on the free store. In the destructor, free the memory if ptr != layers. You can also optimize away num_layers if the common case is like 1 layer, in which case a null-terminated solution might work better. To see if a Neuron belongs to a layer, simply do a linear search through ptr. That's practically constant-time complexity with respect to the number of Neurons provided that they don't belong in a mass number of layers at once.

如果有4个或更少的神经元所属的层,则将ptr初始化为指向图层。如果还有更多,请在免费商店中分配。在析构函数中,如果ptr!= layers,则释放内存。如果常见情况类似于1层,您还可以优化远离num_layers,在这种情况下,以null结尾的解决方案可能会更好地工作。要查看神经元是否属于某个图层,只需通过ptr进行线性搜索即可。就神经元的数量而言,这实际上是恒定时间的复杂性,条件是它们不同时属于大量的层。

You can also use a vector here but you might reduce cache hits on those common case scenarios since it'll always put its contents in a separate block, even if the Neuron only belongs to like 1 or 2 layers.

你也可以在这里使用向量,但是你可以减少这些常见情况下的缓存命中,因为它总是将其内容放在一个单独的块中,即使Neuron只属于1或2层。

This might be a bit different from what you were looking for with a general-purpose, non-intrusive data structure, but if your performance needs are really skewed towards these kinds of set operations, an intrusive solution is going to be the fastest in general. It's not quite as pretty and couples your element to the container, but hey, if you need max performance...

这可能与您正在寻找的通用,非侵入式数据结构略有不同,但如果您的性能需求实际上倾向于这些类型的集合操作,那么侵入式解决方案将是最快的。它不是很漂亮,并且将你的元素与容器结合在一起,但是嘿,如果你需要最大的性能......

Another idea was to use a plain old C array. Since I do not care about the performance of adding a new Neuron object, I thought I could make sure that the Neuron objects are always stored linear in memory. But that would invalidate the pointer I pass to addNeuron(); [...]

另一个想法是使用普通的旧C阵列。由于我不关心添加新的Neuron对象的性能,我想我可以确保Neuron对象始终在内存中保持线性。但这会使我传递给addNeuron()的指针无效; [...]

Yes, but it won't invalidate indices. While not as convenient to use as pointers, if you're working with mass data like vertices of a mesh or particles of an emitter, it's common to use indices here to avoid the invalidation and possibly to save an extra 32-bits per entry on 64-bit systems.

是的,但它不会使指数无效。虽然不像指针那样方便使用,但是如果你正在处理像网格顶点或发射器粒子这样的海量数据,那么通常在这里使用索引来避免失效并且可能在每个条目上节省额外的32位。 64位系统。

Update

Given that Neurons only exist in one Layer at a time, I'd go with the back pointer approach. Seeing if a neuron belongs to a layer becomes a simple matter of checking if the back pointer points to the same layer.

鉴于神经元一次只存在于一个层中,我会采用后向指针方法。查看神经元是否属于某个层变得很简单,检查后向指针是否指向同一层。

Since there's an API involved, I'd suggest, just because it sounds like you're pushing around a lot of data and have already profiled it, that you focus on an interface which revolves around aggregates (layers, e.g.) rather than individual elements (neurons). It'll just leave you a lot of room to swap out underlying representations when your clients aren't performing operations at the individual scalar element-type interface.

由于涉及到API,我建议,只是因为它听起来像是在推动大量数据并且已经对其进行了分析,因此您专注于围绕聚合(层,例如)而不是单个元素的界面(神经元)。当您的客户端未在单个标量元素类型接口上执行操作时,它会给您留下很大的空间来换出底层表示。

With the O(1) contains implementation and the unordered requirement, I'd go with a simple contiguous structure like std::vector. However, you do expose yourself to potential invalidation on insertion.

由于O(1)包含实现和无序要求,我将使用像std :: vector这样的简单连续结构。但是,您确实会在插入时暴露自己潜在的失效。

Because of that, if you can, I'd suggest working with indices here. However, that become a little unwieldy since it requires your clients to store both a pointer to the layer in which a neuron belongs in addition to its index (though if you do this, the backpointer becomes unnecessary as the client is tracking where things belong).

因此,如果可以,我建议在这里使用索引。然而,这变得有点笨拙,因为它要求你的客户端存储一个指向神经元所属的层的指针以及它的索引(尽管如果这样做,后面的指针变得不必要,因为客户端正在跟踪事物所属的位置) 。

One way to mitigate this is to simply use something like std::vector<Neuron*> or ptr_vector if available. However, that can expose you to cache misses and heap overhead, and if you want to optimize that, this is where the fixed allocator comes in handy. However, that's a bit of a pain with alignment issues and a bit of a research topic, and so far it seems like your main goal is not to optimize insertion or sequential access quite as much as this contains check, so I'd start with the std::vector<Neuron*>.

减轻这种情况的一种方法是简单地使用std :: vector 或ptr_vector(如果可用)。但是,这可能会使您遇到缓存未命中和堆开销,如果您想要优化它,那么这就是固定分配器派上用场的地方。然而,对于协调问题和一些研究主题,这有点痛苦,到目前为止,似乎您的主要目标不是优化插入或顺序访问,因为这包含检查,所以我开始std :: vector

#2


You can get O(1) contains-check, O(1) insert and preserve insertion order. If you are using Java, looked at LinkedHashMap. If you are not using Java, look at LinkedHashMap and figure out a parallel data structure that does it or implement it yourself.

您可以获得O(1)包含检查,O(1)插入并保留插入顺序。如果您使用的是Java,请查看LinkedHashMap。如果您不使用Java,请查看LinkedHashMap并找出并行数据结构,或者自己实现它。

It's just a hashmap with a doubly linked list. The link list is to preserve order and the hashmap is to allow O(1) access. So when you insert an element, it makes an entry with the key, and the map will point to a node in the linked list where your data will reside. To look up, you go to the hash table to find the pointer directly to your linked list node (not the head), and get the value in O(1). To access them sequentially, you just traverse the linked list.

它只是一个带有双向链表的散列图。链接列表是为了保留顺序,而hashmap是允许O(1)访问。因此,当您插入元素时,它会使用键创建一个条目,并且映射将指向链接列表中您的数据所在的节点。要查找,请转到哈希表以直接找到指向链接列表节点(而不是头部)的指针,并在O(1)中获取值。要按顺序访问它们,只需遍历链接列表即可。

#3


A heap sounds like it could be useful to you. It's like a tree, but the newest element is always inserted at the top, and then works its way down based on its value, so there is a method to quickly check if it's there.

堆听起来像对你有用。它就像一棵树,但最新的元素总是插在顶部,然后根据它的值向下工作,所以有一种方法可以快速检查它是否在那里。

Otherwise, you could store a hash table (quick method to check if the neuron is contained in the table) with the key for the neuron, and values of: the neuron itself, and the timestep upon which the neuron was inserted (to check its chronological insertion time).

否则,你可以存储一个哈希表(检查表中是否包含神经元的快速方法)和神经元的密钥,以及:神经元本身的值,以及插入神经元的时间步长(检查它)按时间顺序插入时间)。

#1


If this contains check is really where you need the fastest execution, and assuming you can be a little intrusive with the source code, the fastest way to check if a Neuron belongs in a layer is to simply flag it when you insert it into a layer (ex: bit flag).

如果这包含检查确实是您需要最快执行的地方,并且假设您可能对源代码有点干扰,检查神经元是否属于某个层的最快方法是在将其插入图层时简单地标记它(例如:位标志)。

You have guaranteed O(1) checks at that point to see if a Neuron belongs in a layer and it's also fast at the micro-level.

你已经保证在那一点上进行O(1)检查,以查看神经元是否属于一个层,并且它在微观层面也是快速的。

If there can be numerous layer objects, this can get a little trickier, as you'll need a separate bit for each potential layer a neuron can belong to unless a Neuron can only belong in a single layer at once. This is reasonably manageable, however, if the number of layers are relatively fixed in size.

如果可能存在多个图层对象,这可能会变得有点棘手,因为除非神经元一次只能属于单个图层,否则您需要为神经元可以属于的每个潜在图层使用单独的位。然而,如果层的数量相对固定,则这是合理可管理的。

If the latter case and a Neuron can only belong to one layer at once, then all you need is a backpointer to Layer*. To see if a Neuron belongs in a layer, simply see if that backpointer points to the layer object.

如果后一种情况和一个神经元只能同时属于一个层,那么你所需要的只是一个对Layer *的反向指针。要查看神经元是否属于图层,只需查看该backpointer是否指向图层对象。

If a Neuron can belong to multiple layers at once, but not too many at one time, then you could store like a little array of backpointers like so:

如果一个神经元可以同时属于多个层,但不能同时属于多个层,那么你可以存储像一小部分的后置指针,如下所示:

struct Neuron
{
    ...
    Layer* layers[4]; // use whatever small size that usually fits the common case
    Layer* ptr;
    int num_layers;
};

Initialize ptr to point to layers if there are 4 or fewer layers to which the Neuron belongs. If there are more, allocate it on the free store. In the destructor, free the memory if ptr != layers. You can also optimize away num_layers if the common case is like 1 layer, in which case a null-terminated solution might work better. To see if a Neuron belongs to a layer, simply do a linear search through ptr. That's practically constant-time complexity with respect to the number of Neurons provided that they don't belong in a mass number of layers at once.

如果有4个或更少的神经元所属的层,则将ptr初始化为指向图层。如果还有更多,请在免费商店中分配。在析构函数中,如果ptr!= layers,则释放内存。如果常见情况类似于1层,您还可以优化远离num_layers,在这种情况下,以null结尾的解决方案可能会更好地工作。要查看神经元是否属于某个图层,只需通过ptr进行线性搜索即可。就神经元的数量而言,这实际上是恒定时间的复杂性,条件是它们不同时属于大量的层。

You can also use a vector here but you might reduce cache hits on those common case scenarios since it'll always put its contents in a separate block, even if the Neuron only belongs to like 1 or 2 layers.

你也可以在这里使用向量,但是你可以减少这些常见情况下的缓存命中,因为它总是将其内容放在一个单独的块中,即使Neuron只属于1或2层。

This might be a bit different from what you were looking for with a general-purpose, non-intrusive data structure, but if your performance needs are really skewed towards these kinds of set operations, an intrusive solution is going to be the fastest in general. It's not quite as pretty and couples your element to the container, but hey, if you need max performance...

这可能与您正在寻找的通用,非侵入式数据结构略有不同,但如果您的性能需求实际上倾向于这些类型的集合操作,那么侵入式解决方案将是最快的。它不是很漂亮,并且将你的元素与容器结合在一起,但是嘿,如果你需要最大的性能......

Another idea was to use a plain old C array. Since I do not care about the performance of adding a new Neuron object, I thought I could make sure that the Neuron objects are always stored linear in memory. But that would invalidate the pointer I pass to addNeuron(); [...]

另一个想法是使用普通的旧C阵列。由于我不关心添加新的Neuron对象的性能,我想我可以确保Neuron对象始终在内存中保持线性。但这会使我传递给addNeuron()的指针无效; [...]

Yes, but it won't invalidate indices. While not as convenient to use as pointers, if you're working with mass data like vertices of a mesh or particles of an emitter, it's common to use indices here to avoid the invalidation and possibly to save an extra 32-bits per entry on 64-bit systems.

是的,但它不会使指数无效。虽然不像指针那样方便使用,但是如果你正在处理像网格顶点或发射器粒子这样的海量数据,那么通常在这里使用索引来避免失效并且可能在每个条目上节省额外的32位。 64位系统。

Update

Given that Neurons only exist in one Layer at a time, I'd go with the back pointer approach. Seeing if a neuron belongs to a layer becomes a simple matter of checking if the back pointer points to the same layer.

鉴于神经元一次只存在于一个层中,我会采用后向指针方法。查看神经元是否属于某个层变得很简单,检查后向指针是否指向同一层。

Since there's an API involved, I'd suggest, just because it sounds like you're pushing around a lot of data and have already profiled it, that you focus on an interface which revolves around aggregates (layers, e.g.) rather than individual elements (neurons). It'll just leave you a lot of room to swap out underlying representations when your clients aren't performing operations at the individual scalar element-type interface.

由于涉及到API,我建议,只是因为它听起来像是在推动大量数据并且已经对其进行了分析,因此您专注于围绕聚合(层,例如)而不是单个元素的界面(神经元)。当您的客户端未在单个标量元素类型接口上执行操作时,它会给您留下很大的空间来换出底层表示。

With the O(1) contains implementation and the unordered requirement, I'd go with a simple contiguous structure like std::vector. However, you do expose yourself to potential invalidation on insertion.

由于O(1)包含实现和无序要求,我将使用像std :: vector这样的简单连续结构。但是,您确实会在插入时暴露自己潜在的失效。

Because of that, if you can, I'd suggest working with indices here. However, that become a little unwieldy since it requires your clients to store both a pointer to the layer in which a neuron belongs in addition to its index (though if you do this, the backpointer becomes unnecessary as the client is tracking where things belong).

因此,如果可以,我建议在这里使用索引。然而,这变得有点笨拙,因为它要求你的客户端存储一个指向神经元所属的层的指针以及它的索引(尽管如果这样做,后面的指针变得不必要,因为客户端正在跟踪事物所属的位置) 。

One way to mitigate this is to simply use something like std::vector<Neuron*> or ptr_vector if available. However, that can expose you to cache misses and heap overhead, and if you want to optimize that, this is where the fixed allocator comes in handy. However, that's a bit of a pain with alignment issues and a bit of a research topic, and so far it seems like your main goal is not to optimize insertion or sequential access quite as much as this contains check, so I'd start with the std::vector<Neuron*>.

减轻这种情况的一种方法是简单地使用std :: vector 或ptr_vector(如果可用)。但是,这可能会使您遇到缓存未命中和堆开销,如果您想要优化它,那么这就是固定分配器派上用场的地方。然而,对于协调问题和一些研究主题,这有点痛苦,到目前为止,似乎您的主要目标不是优化插入或顺序访问,因为这包含检查,所以我开始std :: vector

#2


You can get O(1) contains-check, O(1) insert and preserve insertion order. If you are using Java, looked at LinkedHashMap. If you are not using Java, look at LinkedHashMap and figure out a parallel data structure that does it or implement it yourself.

您可以获得O(1)包含检查,O(1)插入并保留插入顺序。如果您使用的是Java,请查看LinkedHashMap。如果您不使用Java,请查看LinkedHashMap并找出并行数据结构,或者自己实现它。

It's just a hashmap with a doubly linked list. The link list is to preserve order and the hashmap is to allow O(1) access. So when you insert an element, it makes an entry with the key, and the map will point to a node in the linked list where your data will reside. To look up, you go to the hash table to find the pointer directly to your linked list node (not the head), and get the value in O(1). To access them sequentially, you just traverse the linked list.

它只是一个带有双向链表的散列图。链接列表是为了保留顺序,而hashmap是允许O(1)访问。因此,当您插入元素时,它会使用键创建一个条目,并且映射将指向链接列表中您的数据所在的节点。要查找,请转到哈希表以直接找到指向链接列表节点(而不是头部)的指针,并在O(1)中获取值。要按顺序访问它们,只需遍历链接列表即可。

#3


A heap sounds like it could be useful to you. It's like a tree, but the newest element is always inserted at the top, and then works its way down based on its value, so there is a method to quickly check if it's there.

堆听起来像对你有用。它就像一棵树,但最新的元素总是插在顶部,然后根据它的值向下工作,所以有一种方法可以快速检查它是否在那里。

Otherwise, you could store a hash table (quick method to check if the neuron is contained in the table) with the key for the neuron, and values of: the neuron itself, and the timestep upon which the neuron was inserted (to check its chronological insertion time).

否则,你可以存储一个哈希表(检查表中是否包含神经元的快速方法)和神经元的密钥,以及:神经元本身的值,以及插入神经元的时间步长(检查它)按时间顺序插入时间)。