克隆HashSet 的有效方法?

时间:2022-08-11 08:53:09

A few days ago, I answered an interesting question on SO about HashSet<T>. A possible solution involved cloning the hashset, and in my answer I suggested to do something like this:

几天前,我回答了有关HashSet 的一个有趣的问题。一个可能的解决方案涉及克隆hashset,在我的回答中我建议做这样的事情:

HashSet<int> original = ...
HashSet<int> clone = new HashSet<int>(original);

Although this approach is quite straightforward, I suspect it's very inefficient: the constructor of the new HashSet<T> needs to separately add each item from the original hashset, and check if it isn't already present. This is clearly a waste of time: since the source collection is a ISet<T>, it is guaranteed not to contain duplicates. There should be a way to take advantage of that knowledge...

虽然这种方法非常简单,但我怀疑它效率很低:新HashSet 的构造函数需要单独添加原始hashset中的每个项目,并检查它是否已经存在。这显然是浪费时间:因为源集合是ISet ,所以保证不包含重复项。应该有办法利用这些知识......

Ideally, HashSet<T> should implement ICloneable, but unfortunately it's not the case. I also checked with Reflector to see if the HashSet<T> constructor did something specific if the source collection was a hashset, but it doesn't. It could probably be done by using reflection on private fields, but that would be an ugly hack...

理想情况下,HashSet 应该实现ICloneable,但遗憾的是情况并非如此。我还检查了Reflector,看看HashSet 构造函数是否做了特定的事情,如果源集合是一个哈希集,但事实并非如此。它可能可以通过在私有字段上使用反射来完成,但这将是一个丑陋的黑客......

So, did someone come up with a clever solution to clone a hashset more efficiently ?

那么,有人提出了一个更有效克隆哈希集的聪明解决方案吗?

(Note that this question is purely theoretical, I don't need to do that in a real program)

(请注意,这个问题纯粹是理论上的,我不需要在真实的程序中这样做)

5 个解决方案

#1


9  

If you really wanted the most efficient way to clone a HashSet<T>, you'd do the following (but possibly at the cost of maintainability)

如果您真的想要克隆HashSet 的最有效方法,那么您将执行以下操作(但可能以可维护性为代价)

  1. Use reflector or the debugger to figure out exactly what fields in HashSet<T> need to be copied. You may need to do this recursively for each field.
  2. 使用反射器或调试器确切地确定需要复制HashSet 中的哪些字段。您可能需要为每个字段递归执行此操作。
  3. Use Reflection.Emit or use expression trees to generate a method which does the necessary copying of all of the fields. May need to call other generated methods which copy the value of each field. We're using runtime code generation because it's the only way to directly access private fields.
  4. 使用Reflection.Emit或使用表达式树生成一个方法,该方法可以对所有字段进行必要的复制。可能需要调用其他生成的方法来复制每个字段的值。我们正在使用运行时代码生成,因为它是直接访问私有字段的唯一方法。
  5. Use FormatterServices.GetUninitializedObject(...) to instantiate a blank object. Use the method generated in step 2 to copy the original object to the new blank object.
  6. 使用FormatterServices.GetUninitializedObject(...)实例化一个空白对象。使用步骤2中生成的方法将原始对象复制到新的空白对象。

#2


2  

EDIT: After closer inspection this does not seems to be a good idea, with less then 60 elements in the original hashset the method below appears to be slower then just creating a new hashset.

编辑:仔细检查后,这似乎不是一个好主意,原始哈希集中少于60个元素,下面的方法似乎比创建新的哈希集慢。

DISCLAIMER: this seems to work but use at your own risk, if you are going to serialize the cloned hashsets you probably want to copy SerializationInfo m_siInfo.

免责声明:这似乎有效但使用风险自负,如果您要序列化克隆的哈希集,您可能要复制SerializationInfo m_siInfo。

I also faced this problem and took a stab at it, below you will find an extension method that uses FieldInfo.GetValue and SetValue to copy the required fields. It is faster than using HashSet(IEnumerable), how much depends on the amount of elements in the original hashset. For 1,000 elements the difference is about a factor 7. With 100,000 elements its about a factor 3.

我也遇到了这个问题并对其进行了尝试,下面你将找到一个扩展方法,它使用FieldInfo.GetValue和SetValue来复制必需的字段。它比使用HashSet(IEnumerable)更快,多少取决于原始hashset中的元素数量。对于1,000个元素,差异大约为7倍。对于100,000个元素,其大约为3倍。

There are other ways which may be even faster, but this has gotten rid of the bottleneck for me for now. I tried using expressiontrees and emitting but hit a roadblock, if I get those to work Ill update this post.

还有其他方法甚至可能更快,但这已经摆脱了我现在的瓶颈。我尝试使用表达式和发射,但遇到了障碍,如果我让那些工作我会更新这篇文章。

using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.Serialization;

public static class HashSetExtensions
{
    public static HashSet<T> Clone<T>(this HashSet<T> original)
    {
        var clone = (HashSet<T>)FormatterServices.GetUninitializedObject(typeof(HashSet<T>));
        Copy(Fields<T>.comparer, original, clone);

        if (original.Count == 0)
        {
            Fields<T>.freeList.SetValue(clone, -1);
        }
        else
        {
            Fields<T>.count.SetValue(clone, original.Count);
            Clone(Fields<T>.buckets, original, clone);
            Clone(Fields<T>.slots, original, clone);
            Copy(Fields<T>.freeList, original, clone);
            Copy(Fields<T>.lastIndex, original, clone);
            Copy(Fields<T>.version, original, clone);
        }

        return clone;
    }

    static void Copy<T>(FieldInfo field, HashSet<T> source, HashSet<T> target)
    {
        field.SetValue(target, field.GetValue(source));
    }

    static void Clone<T>(FieldInfo field, HashSet<T> source, HashSet<T> target)
    {
        field.SetValue(target, ((Array)field.GetValue(source)).Clone());
    }

    static class Fields<T>
    {
        public static readonly FieldInfo freeList = GetFieldInfo("m_freeList");
        public static readonly FieldInfo buckets = GetFieldInfo("m_buckets");
        public static readonly FieldInfo slots = GetFieldInfo("m_slots");
        public static readonly FieldInfo count = GetFieldInfo("m_count");
        public static readonly FieldInfo lastIndex = GetFieldInfo("m_lastIndex");
        public static readonly FieldInfo version = GetFieldInfo("m_version");
        public static readonly FieldInfo comparer = GetFieldInfo("m_comparer");

        static FieldInfo GetFieldInfo(string name)
        {
            return typeof(HashSet<T>).GetField(name, BindingFlags.Instance | BindingFlags.NonPublic);
        }
    }
}

#3


0  

Easy pattern which should won't work for many collections:

简单的模式应该不适用于许多集合:

Class cloneableDictionary(Of T, U)
    Inherits Dictionary(Of T, U)
    Function clone() As Dictionary(Of T, U)
        Return CType(Me.MemberwiseClone, cloneableDict(Of T, U))
    End Function
End Class

Unfortunately, I don't know that Microsoft did anything to prevent calling MemberwiseClone in places where it shouldn't be called (e.g. declaring something other than a method--like maybe a class--with the name MemberwiseClone) so I don't know how one can tell whether such an approach is likely to work.

不幸的是,我不知道微软做了什么阻止在不应该调用它的地方调用MemberwiseClone(例如声明一个方法以外的东西 - 比如可能是一个类 - 名为MemberwiseClone)所以我不这么做知道如何判断这种方法是否有效。

I think there's a fair reason for a standard collection not to support a public cloning method but only a protected one: it's possible that a class which derives from a collection might break severely if cloned, and if base class' cloning method is public there's no way to prevent an object of a derived class from being given to code that expects to clone it.

我认为标准集合不支持公共克隆方法但只有受保护的方法是有正当理由的:如果克隆的话,从集合派生的类可能会严重破坏,如果基类的克隆方法是公开的,那么就没有防止派生类的对象被赋予期望克隆它的代码的方法。

That having been said, it would have been nice if .net included cloneableDictionary and other such classes as standard types (though obviously not implemented essentially as above).

话虽如此,如果.net包含了cloneableDictionary和其他类似的类作为标准类型(虽然显然没有基本上如上所述实现),这将是很好的。

#4


-1  

O(n) clone is as good as it can get, theoretically, to clone two sets that won't share the same underlying data structure.

从理论上讲,O(n)克隆能够克隆两个不共享相同底层数据结构的集合。

Checking whether or not an element is in a HashSet should be a constant time (i.e. O(1)) operation.

检查元素是否在HashSet中应该是恒定时间(即O(1))操作。

So you could create a wrapper that would just wrap an existing HashSet and hold on to any new additions, but that seems pretty perverse.

所以你可以创建一个包装器,它只包装一个现有的HashSet并保持任何新的添加,但这看起来很不正常。

When you say 'efficient', you mean 'more efficient than the existing O(n) method' - I posit you can't actually get more efficient than O(n) without playing pretty serious semantic games about what 'clone' means.

当你说“高效”时,你的意思是“比现有的O(n)方法更有效” - 我认为如果不播放关于'克隆'意味着什么的非常严肃的语义游戏,你实际上不能比O(n)更有效率。

#5


-3  

Just a random thought. It might be silly.

只是一个随意的想法。这可能很愚蠢。

Since they did not implement ICloneable, and the constructor does not use the knowledge that the source is of the same type, I guess we're left with one option. Implementing the optimized version and adding it as an extension method to the type.

由于它们没有实现ICloneable,并且构造函数不使用源类型相同的知识,我想我们只留下一个选项。实现优化版本并将其作为扩展方法添加到类型中。

Something like:

就像是:

namespace ExtensionMethods
{
    public static class MyExtensions
    {
        public static HashSet<int> Clone(this HashSet<int> original)
        {
            HashSet<int> clone = new HashSet<int>();
            //your optimized code here 
            return clone;
        }
    }   
}

Then, your code from the question would look like this:

然后,问题中的代码如下所示:

HashSet<int> original = ...
HashSet<int> clone = HashSet<int>.Clone(original);

#1


9  

If you really wanted the most efficient way to clone a HashSet<T>, you'd do the following (but possibly at the cost of maintainability)

如果您真的想要克隆HashSet 的最有效方法,那么您将执行以下操作(但可能以可维护性为代价)

  1. Use reflector or the debugger to figure out exactly what fields in HashSet<T> need to be copied. You may need to do this recursively for each field.
  2. 使用反射器或调试器确切地确定需要复制HashSet 中的哪些字段。您可能需要为每个字段递归执行此操作。
  3. Use Reflection.Emit or use expression trees to generate a method which does the necessary copying of all of the fields. May need to call other generated methods which copy the value of each field. We're using runtime code generation because it's the only way to directly access private fields.
  4. 使用Reflection.Emit或使用表达式树生成一个方法,该方法可以对所有字段进行必要的复制。可能需要调用其他生成的方法来复制每个字段的值。我们正在使用运行时代码生成,因为它是直接访问私有字段的唯一方法。
  5. Use FormatterServices.GetUninitializedObject(...) to instantiate a blank object. Use the method generated in step 2 to copy the original object to the new blank object.
  6. 使用FormatterServices.GetUninitializedObject(...)实例化一个空白对象。使用步骤2中生成的方法将原始对象复制到新的空白对象。

#2


2  

EDIT: After closer inspection this does not seems to be a good idea, with less then 60 elements in the original hashset the method below appears to be slower then just creating a new hashset.

编辑:仔细检查后,这似乎不是一个好主意,原始哈希集中少于60个元素,下面的方法似乎比创建新的哈希集慢。

DISCLAIMER: this seems to work but use at your own risk, if you are going to serialize the cloned hashsets you probably want to copy SerializationInfo m_siInfo.

免责声明:这似乎有效但使用风险自负,如果您要序列化克隆的哈希集,您可能要复制SerializationInfo m_siInfo。

I also faced this problem and took a stab at it, below you will find an extension method that uses FieldInfo.GetValue and SetValue to copy the required fields. It is faster than using HashSet(IEnumerable), how much depends on the amount of elements in the original hashset. For 1,000 elements the difference is about a factor 7. With 100,000 elements its about a factor 3.

我也遇到了这个问题并对其进行了尝试,下面你将找到一个扩展方法,它使用FieldInfo.GetValue和SetValue来复制必需的字段。它比使用HashSet(IEnumerable)更快,多少取决于原始hashset中的元素数量。对于1,000个元素,差异大约为7倍。对于100,000个元素,其大约为3倍。

There are other ways which may be even faster, but this has gotten rid of the bottleneck for me for now. I tried using expressiontrees and emitting but hit a roadblock, if I get those to work Ill update this post.

还有其他方法甚至可能更快,但这已经摆脱了我现在的瓶颈。我尝试使用表达式和发射,但遇到了障碍,如果我让那些工作我会更新这篇文章。

using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.Serialization;

public static class HashSetExtensions
{
    public static HashSet<T> Clone<T>(this HashSet<T> original)
    {
        var clone = (HashSet<T>)FormatterServices.GetUninitializedObject(typeof(HashSet<T>));
        Copy(Fields<T>.comparer, original, clone);

        if (original.Count == 0)
        {
            Fields<T>.freeList.SetValue(clone, -1);
        }
        else
        {
            Fields<T>.count.SetValue(clone, original.Count);
            Clone(Fields<T>.buckets, original, clone);
            Clone(Fields<T>.slots, original, clone);
            Copy(Fields<T>.freeList, original, clone);
            Copy(Fields<T>.lastIndex, original, clone);
            Copy(Fields<T>.version, original, clone);
        }

        return clone;
    }

    static void Copy<T>(FieldInfo field, HashSet<T> source, HashSet<T> target)
    {
        field.SetValue(target, field.GetValue(source));
    }

    static void Clone<T>(FieldInfo field, HashSet<T> source, HashSet<T> target)
    {
        field.SetValue(target, ((Array)field.GetValue(source)).Clone());
    }

    static class Fields<T>
    {
        public static readonly FieldInfo freeList = GetFieldInfo("m_freeList");
        public static readonly FieldInfo buckets = GetFieldInfo("m_buckets");
        public static readonly FieldInfo slots = GetFieldInfo("m_slots");
        public static readonly FieldInfo count = GetFieldInfo("m_count");
        public static readonly FieldInfo lastIndex = GetFieldInfo("m_lastIndex");
        public static readonly FieldInfo version = GetFieldInfo("m_version");
        public static readonly FieldInfo comparer = GetFieldInfo("m_comparer");

        static FieldInfo GetFieldInfo(string name)
        {
            return typeof(HashSet<T>).GetField(name, BindingFlags.Instance | BindingFlags.NonPublic);
        }
    }
}

#3


0  

Easy pattern which should won't work for many collections:

简单的模式应该不适用于许多集合:

Class cloneableDictionary(Of T, U)
    Inherits Dictionary(Of T, U)
    Function clone() As Dictionary(Of T, U)
        Return CType(Me.MemberwiseClone, cloneableDict(Of T, U))
    End Function
End Class

Unfortunately, I don't know that Microsoft did anything to prevent calling MemberwiseClone in places where it shouldn't be called (e.g. declaring something other than a method--like maybe a class--with the name MemberwiseClone) so I don't know how one can tell whether such an approach is likely to work.

不幸的是,我不知道微软做了什么阻止在不应该调用它的地方调用MemberwiseClone(例如声明一个方法以外的东西 - 比如可能是一个类 - 名为MemberwiseClone)所以我不这么做知道如何判断这种方法是否有效。

I think there's a fair reason for a standard collection not to support a public cloning method but only a protected one: it's possible that a class which derives from a collection might break severely if cloned, and if base class' cloning method is public there's no way to prevent an object of a derived class from being given to code that expects to clone it.

我认为标准集合不支持公共克隆方法但只有受保护的方法是有正当理由的:如果克隆的话,从集合派生的类可能会严重破坏,如果基类的克隆方法是公开的,那么就没有防止派生类的对象被赋予期望克隆它的代码的方法。

That having been said, it would have been nice if .net included cloneableDictionary and other such classes as standard types (though obviously not implemented essentially as above).

话虽如此,如果.net包含了cloneableDictionary和其他类似的类作为标准类型(虽然显然没有基本上如上所述实现),这将是很好的。

#4


-1  

O(n) clone is as good as it can get, theoretically, to clone two sets that won't share the same underlying data structure.

从理论上讲,O(n)克隆能够克隆两个不共享相同底层数据结构的集合。

Checking whether or not an element is in a HashSet should be a constant time (i.e. O(1)) operation.

检查元素是否在HashSet中应该是恒定时间(即O(1))操作。

So you could create a wrapper that would just wrap an existing HashSet and hold on to any new additions, but that seems pretty perverse.

所以你可以创建一个包装器,它只包装一个现有的HashSet并保持任何新的添加,但这看起来很不正常。

When you say 'efficient', you mean 'more efficient than the existing O(n) method' - I posit you can't actually get more efficient than O(n) without playing pretty serious semantic games about what 'clone' means.

当你说“高效”时,你的意思是“比现有的O(n)方法更有效” - 我认为如果不播放关于'克隆'意味着什么的非常严肃的语义游戏,你实际上不能比O(n)更有效率。

#5


-3  

Just a random thought. It might be silly.

只是一个随意的想法。这可能很愚蠢。

Since they did not implement ICloneable, and the constructor does not use the knowledge that the source is of the same type, I guess we're left with one option. Implementing the optimized version and adding it as an extension method to the type.

由于它们没有实现ICloneable,并且构造函数不使用源类型相同的知识,我想我们只留下一个选项。实现优化版本并将其作为扩展方法添加到类型中。

Something like:

就像是:

namespace ExtensionMethods
{
    public static class MyExtensions
    {
        public static HashSet<int> Clone(this HashSet<int> original)
        {
            HashSet<int> clone = new HashSet<int>();
            //your optimized code here 
            return clone;
        }
    }   
}

Then, your code from the question would look like this:

然后,问题中的代码如下所示:

HashSet<int> original = ...
HashSet<int> clone = HashSet<int>.Clone(original);