高效地在。net中合并字符串数组，保持不同的值

I'm using .NET 3.5. I have two string arrays, which may share one or more values:

我用。net 3.5。我有两个字符串数组，它们可能共享一个或多个值:

string[] list1 = new string[] { "apple", "orange", "banana" };
string[] list2 = new string[] { "banana", "pear", "grape" };

I'd like a way to merge them into one array with no duplicate values:

我想要一种方法将它们合并到一个没有重复值的数组中:

{ "apple", "orange", "banana", "pear", "grape" }

I can do this with LINQ:

我可以用LINQ来做这个:

string[] result = list1.Concat(list2).Distinct().ToArray();

but I imagine that's not very efficient for large arrays.

但我认为这对大数组来说不是很有效。

Is there a better way?

有更好的方法吗?

6 个解决方案

#1

string[] result = list1.Union(list2).ToArray();

from msdn: "This method excludes duplicates from the return set. This is different behavior to the Concat(TSource) method, which returns all the elements in the input sequences including duplicates."

msdn:“这个方法从返回集中排除重复。这与Concat(TSource)方法不同，它返回输入序列中的所有元素，包括重复。”

#2

Why do you imagine that it would be inefficient? As far as I'm aware, both Concat and Distinct are evaluated lazily, using a HashSet behind the scenes for Distinct to keep track of the elements which have already been returned.

你为什么会认为它没有效率?据我所知，Concat和Distinct都是惰性评估的，它们使用了幕后的HashSet来跟踪已经返回的元素。

I'm not sure how you'd manage to make it more efficient than that in a general way :)

我不知道你会如何使它在一般情况下更有效率

EDIT: Distinct actually uses Set (an internal class) instead of HashSet, but the gist is still correct. This is a really good example of just how neat LINQ is. The simplest answer is pretty much as efficient as you can achieve without more domain knowledge.

编辑:Distinct实际上使用Set(内部类)而不是HashSet，但主旨仍然是正确的。这是一个很好的例子，说明LINQ是多么的整洁。最简单的答案几乎是你在没有更多领域知识的情况下所能达到的效率。

The effect is the equivalent of:

其效果相当于:

public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    HashSet<T> returned = new HashSet<T>();
    foreach (T element in first)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
    foreach (T element in second)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
}

#3

.NET 3.5 introduced the HashSet class which could do this:

.NET 3.5引入了可以这样做的HashSet类:

IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);

Not sure of performance, but it should beat the Linq example you gave.

不确定性能，但它应该优于您给出的Linq示例。

EDIT: I stand corrected. The lazy implementation of Concat and Distinct have a key memory AND speed advantage. Concat/Distinct is about 10% faster, and saves multiple copies of data.

编辑:我认错。Concat的惰性实现具有关键的内存和速度优势。Concat/ unique大约快10%，可以保存多个数据副本。

I confirmed through code:

我确认通过代码:

Setting up arrays of 3000000 strings overlapping by 300000
Starting Hashset...
HashSet: 00:00:02.8237616
Starting Concat/Distinct...
Concat/Distinct: 00:00:02.5629681

is the output of:

的输出:

        int num = 3000000;
        int num10Pct = (int)(num / 10);

        Console.WriteLine(String.Format("Setting up arrays of {0} strings overlapping by {1}", num, num10Pct));
        string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray();
        string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct).Select((a) => a.ToString()).ToArray();

        Console.WriteLine("Starting Hashset...");
        Stopwatch sw = new Stopwatch();
        sw.Start();
        string[] merged = new HashSet<string>(list1).Union(list2).ToArray();
        sw.Stop();
        Console.WriteLine("HashSet: " + sw.Elapsed);

        Console.WriteLine("Starting Concat/Distinct...");
        sw.Reset();
        sw.Start();
        string[] merged2 = list1.Concat(list2).Distinct().ToArray();
        sw.Stop();
        Console.WriteLine("Concat/Distinct: " + sw.Elapsed);

#4

Disclaimer This is premature optimization. For your example arrays, use the 3.5 extension methods. Until you know you have a performance problem in this region, you should use library code.

声明:这是不成熟的优化。对于示例数组，使用3.5扩展方法。在知道该区域存在性能问题之前，应该使用库代码。

If you can sort the arrays, or they're sorted when you get to that point in the code, you can use the following methods.

如果您可以对数组进行排序，或者当您到达代码的那个点时对它们进行排序，您可以使用以下方法。

These will pull one item from both, and produce the "lowest" item, then fetch a new item from the corresponding source, until both sources are exhausted. In the case where the current item fetched from the two sources are equal, it will produce the one from the first source, and skip them in both sources.

它们将从两者中提取一个项，并生成“最低”项，然后从相应的源获取一个新项，直到两个源都耗尽。如果从两个源获取的当前项是相等的，它将从第一个源生成一个，并在两个源中跳过它们。

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2)
{
    return Merge(source1, source2, Comparer<T>.Default);
}

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2, IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source1))
        throw new ArgumentNullException("source1");
    if (Object.ReferenceEquals(null, source2))
        throw new ArgumentNullException("source2");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T>
        enumerator1 = source1.GetEnumerator(),
        enumerator2 = source2.GetEnumerator())
    {
        Boolean more1 = enumerator1.MoveNext();
        Boolean more2 = enumerator2.MoveNext();

        while (more1 && more2)
        {
            Int32 comparisonResult = comparer.Compare(
                enumerator1.Current,
                enumerator2.Current);
            if (comparisonResult < 0)
            {
                // enumerator 1 has the "lowest" item
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
            }
            else if (comparisonResult > 0)
            {
                // enumerator 2 has the "lowest" item
                yield return enumerator2.Current;
                more2 = enumerator2.MoveNext();
            }
            else
            {
                // they're considered equivalent, only yield it once
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
                more2 = enumerator2.MoveNext();
            }
        }

        // Yield rest of values from non-exhausted source
        while (more1)
        {
            yield return enumerator1.Current;
            more1 = enumerator1.MoveNext();
        }
        while (more2)
        {
            yield return enumerator2.Current;
            more2 = enumerator2.MoveNext();
        }
    }
}

Note that if one of the sources contains duplicates, you might see duplicates in the output. If you want to remove these duplicates in the already sorted lists, use the following method:

注意，如果其中一个源包含副本，您可能会在输出中看到副本。如果您想在已排序的列表中删除这些副本，请使用以下方法:

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source)
{
    return CheapDistinct<T>(source, Comparer<T>.Default);
}

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source,
    IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source))
        throw new ArgumentNullException("source");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T> enumerator = source.GetEnumerator())
    {
        if (enumerator.MoveNext())
        {
            T item = enumerator.Current;

            // scan until different item found, then produce
            // the previous distinct item
            while (enumerator.MoveNext())
            {
                if (comparer.Compare(item, enumerator.Current) != 0)
                {
                    yield return item;
                    item = enumerator.Current;
                }
            }

            // produce last item that is left over from above loop
            yield return item;
        }
    }
}

Note that none of these will internally use a data structure to keep a copy of the data, so they will be cheap if the input is sorted. If you can't, or won't, guarantee that, you should use the 3.5 extension methods that you've already found.

注意，其中没有一个将在内部使用数据结构来保存数据的副本，因此如果输入被排序，它们将变得便宜。如果您不能，或者不能保证，您应该使用您已经找到的3.5扩展方法。

Here's example code that calls the above methods:

下面是调用上述方法的示例代码:

String[] list_1 = { "apple", "orange", "apple", "banana" };
String[] list_2 = { "banana", "pear", "grape" };

Array.Sort(list_1);
Array.Sort(list_2);

IEnumerable<String> items = Merge(
    CheapDistinct(list_1),
    CheapDistinct(list_2));
foreach (String item in items)
    Console.Out.WriteLine(item);

#5

Probably creating a hashtable with your values as keys (only adding those not already present) and then converting the keys to an array could be a viable solution.

可能创建一个以值作为键的哈希表(只添加那些还没有出现的键)，然后将键转换为数组可能是一个可行的解决方案。

#6

You don't know which approach is faster until you measure it. The LINQ way is elegant and easy to understand.

你不知道哪种方法更快，直到你测量它。LINQ的方式是优雅和容易理解的。

Another way is to implement an set as an hash array (Dictionary) and add all the elements of both the arrays to the set. Then use set.Keys.ToArray() method to create the resulting array.

另一种方法是将集合实现为散列数组(Dictionary)并将两个数组的所有元素添加到集合中，然后使用set. key . toarray()方法创建结果数组。

#1

string[] result = list1.Union(list2).ToArray();