Best way to find out whether an IEnumerable has unique values

Date: 2021-07-16 09:54:15

I have a lot of code in which I do something like this:

bool GetIsUnique<T>(IEnumerable<T> values)
{
    return values.Count() == values.Distinct().Count();
}

Is there a better, faster, nicer way to do this?

7 Answers

#1 (18 votes)

Your method needs to iterate through the sequence twice, with a few potential drawbacks:

  1. Iterating twice will be slower than iterating once for sequences of any significant size.

  2. Some sequences will throw an exception if you try to iterate them more than once; others might return different results for subsequent iterations.

  3. Your method uses Count(), which needs to iterate the entire sequence each time. There's no reason why you shouldn't break out early as soon as you know that there's a duplicate value.

The following method only needs to iterate through the sequence once, and will break out early as soon as any duplicate value is encountered:

bool GetIsUnique<T>(IEnumerable<T> values)
{
    var set = new HashSet<T>();

    foreach (T item in values)
    {
        // Add returns false if the item is already in the set
        if (!set.Add(item))
            return false;
    }
    return true;
}
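
A quick illustrative call (hypothetical inputs; assumes the method above is in scope):

var unique = GetIsUnique(new[] { 1, 2, 3 });       // true
var notUnique = GetIsUnique(new[] { 1, 2, 2, 3 }); // false: stops at the second 2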

#2 (21 votes)

I would make this a nice extension method:

public static bool IsUnique<T>(this IEnumerable<T> list)
{
    var hs = new HashSet<T>();
    // All short-circuits as soon as Add returns false, i.e. on the first duplicate
    return list.All(hs.Add);
}

This checks that all items can be added to a HashSet.
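
Usage is then a one-liner (illustrative; assumes the extension method above is in scope):

var allDistinct = new[] { "a", "b", "c" }.IsUnique(); // true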

#3 (5 votes)

I think it depends on what you want to do if there are non-unique values. @Jamiec's or @LukeH's answers are great and probably best for pure speed, but they can't tell you where the problems are.

You might also consider something like:

var group = values.GroupBy(x => x);
return group.Any(g => g.Count() > 1); // note: true when a duplicate exists, the inverse of GetIsUnique

On its own this is worse than the HashSet implementation. But if you keep that grouping around, you can find which elements are duplicated:

var group = values.GroupBy(x => x);
return group.Where(g => g.Count() > 1);

Or

var group = values.GroupBy(x => x);
return group.Where(g => g.Count() > 1).Select(g => g.Key);

Thinking about it with GroupBy lets you keep your options open for what to do next. But if all you care about is knowing whether all values are unique, I'd go with the HashSet.
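
If you want that as a reusable piece, a small helper along these lines (an illustrative sketch, not part of the original answer) packages up the GroupBy approach:

public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> values)
{
    // Group equal values, keep the groups with more than one member,
    // and return one representative (the key) per duplicated value
    return values.GroupBy(x => x)
                 .Where(g => g.Count() > 1)
                 .Select(g => g.Key);
}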

#4 (1 vote)

You would be doing two loops through the data for the above: once to get the count, and once to get the distinct count. That's especially bad if the first two items are identical! Try something like this:

bool GetIsUnique<T>(IEnumerable<T> values)
{
    HashSet<T> hashSet = new HashSet<T>();
    foreach (var value in values)
    {
        // Contains + Add means two hash lookups per item; Add alone would
        // suffice (it returns false for duplicates), but the behaviour is the same
        if (hashSet.Contains(value))
        {
            return false;
        }
        hashSet.Add(value);
    }
    return true;
}

This one will finish as soon as it finds a duplicate. Obviously it depends on the speed of the hash lookup, but given that Distinct uses a set internally I'd still expect it to be quicker.

#5 (0 votes)

Two basic rules:

  1. The easiest way to read and understand is almost always the best way to code something. That code is deliciously easy to read, so I'd say you're good to go.

  2. Performance ("faster") should only be a concern if you can prove that this method is what's slowing your program down, or if you're building a library that other people will use without having access to the source code.

The other methods are going to be faster (they short-circuit as soon as a repeated value is found, returning false), but I'd still stick with your version if it were my code.

#6 (0 votes)

A quick way of finding the first duplicate, if present, is:

public static bool TryFindFirstDuplicate<T>(this IEnumerable<T> source, out T duplicate)
{
    var set = new HashSet<T>();
    foreach (var item in source)
    {
        if (!set.Add(item))
        {
            duplicate = item;
            return true;
        }
    }
    duplicate = default(T);
    return false;
}
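
A call site might look like this (illustrative only; out var needs C# 7 or later):

var numbers = new[] { 3, 1, 4, 1, 5 };
if (numbers.TryFindFirstDuplicate(out var duplicate))
    Console.WriteLine($"First duplicate: {duplicate}"); // prints: First duplicate: 1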

#7 (0 votes)

I'm surprised no one has tested this yet:

Comparing Gluip's version from the question with JamieC's, LukeH's, and MrKWatkins's, all three answers are better than the asker's version.

All three answers are pretty comparable, but in most cases JamieC's is slightly faster.

When the data has no duplicates, or the duplicate is at the end of the IEnumerable, neither the size nor the algorithm made any major difference.

When the data has a duplicate near the middle or at the beginning, Gluip's version from the original question shows its slowness compared to the other three.

The time to check a set appears to scale linearly with the size of the set, meaning no algorithm is preferable for large or small sets.

Here's a test program that spits out a CSV which you can load into a spreadsheet, then sort and graph if you like:

Test Program:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace AreUniqueTest
{
class Program
{
    const int Iterations = 1000;

    enum DupeLocation
    {
        None,
        Early,
        Center,
        Late,
    }

    enum SetSize
    {
        Tiny = 10,
        Small = 100,
        Medium = 1000,
        Large = 10000,
        Huge = 100000,
    }

    static void Main()
    {
        Dictionary<string, Func<IEnumerable<int>, bool>> functions = new Dictionary<string, Func<IEnumerable<int>, bool>>
        {
            {"Gluip", GetIsUniqueGluip},
            {"LukeH", GetIsUniqueLukeH },
            {"Jamiec", GetIsUniqueJamiec },
            {"MrKWatkins", GetIsUniqueMrKWatkins }
        };

        var output = new StringBuilder();

        Console.WriteLine("Function,SetSize,DupeLocation,TotalTicks,AverageTicks");
        output.AppendLine("Function,SetSize,DupeLocation,TotalTicks,AverageTicks");

        foreach (SetSize size in Enum.GetValues(typeof(SetSize)))
        {
            var sizevalue = (int)size;
            foreach (DupeLocation location in Enum.GetValues(typeof(DupeLocation)))
            {
                var data = CreateTestData(sizevalue, location);
                foreach (string functionKey in functions.Keys)
                {
                    var ticks = RunSet(functions[functionKey], data, Iterations);
                    var avg = ticks / Iterations;
                    Console.WriteLine($"{functionKey},{sizevalue},{location},{ticks},{avg}");
                    output.AppendLine($"{functionKey},{sizevalue},{location},{ticks},{avg}");
                }
            }
        }

        File.WriteAllText("output.csv", output.ToString());
        Process.Start("output.csv");
    }

    static long RunSet<T>(Func<IEnumerable<T>, bool> getIsUnique, IEnumerable<T> values, int iterations)
    {
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            getIsUnique.Invoke(values);
        }
        stopwatch.Stop();
        return stopwatch.ElapsedTicks;
    }

    static bool GetIsUniqueLukeH<T>(IEnumerable<T> values)
    {
        var set = new HashSet<T>();

        foreach (T item in values)
        {
            if (!set.Add(item))
                return false;
        }
        return true;
    }

    static bool GetIsUniqueGluip<T>(IEnumerable<T> values)
    {
        return values.Count() == values.Distinct().Count();
    }

    static bool GetIsUniqueJamiec<T>(IEnumerable<T> list)
    {
        var hs = new HashSet<T>();
        return list.All(hs.Add);
    }

    static bool GetIsUniqueMrKWatkins<T>(IEnumerable<T> values)
    {
        HashSet<T> hashSet = new HashSet<T>();
        foreach (var value in values)
        {
            if (hashSet.Contains(value))
            {
                return false;
            }
            hashSet.Add(value);
        }
        return true;
    }

    static int[] CreateTestData(int size, DupeLocation location)
    {
        var result = new int[size];
        Parallel.For(0, size, i => { result[i] = i; });
        return SetDupe(result, location);
    }

    static int[] SetDupe(int[] values, DupeLocation location)
    {
        switch (location)
        {
            case DupeLocation.Early:
                values[1] = values[0];
                break;
            case DupeLocation.Center:
                var midpoint = values.Length / 2;
                values[midpoint] = values[midpoint + 1];
                break;
            case DupeLocation.Late:
                values[values.Length - 1] = values[values.Length - 2];
                break;
            // case DupeLocation.None: // do nothing.
        }
        return values;
    }
}
}
