优化常用的字母组合函数

时间:2020-12-17 16:08:55

I have written a function that determines whether two words are anagrams. Word A is an anagram of word B if you can build word B out of A just by rearranging the letters, e.g.:

我已经写了一个函数来决定两个单词是字谜。单词A是单词B的一个字母组合,如果你可以通过重新排列字母从A中构建单词B,例如:

lead is anagram of deal

This is my function:

这是我的功能:

bool is_anagram(std::string const & s1, std::string const & s2)
{
    auto check = [](std::string const & x)
    {
        std::map<char, unsigned> counter;
        for(auto const & c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}

This works fine, but as the number of words increases (and this function is used several million times within my application), it very soon became a major bottleneck of my application.

这工作得很好,但是随着单词数量的增加(这个函数在我的应用程序中使用了数百万次),它很快就成为我的应用程序的主要瓶颈。

Does anyone have an idea of how to speed this function up?

有人知道如何加速这个函数吗?

7 个解决方案

#1


43  

The map creation and your call to std::map::find within the iteration, is quite expensive.

在迭代中创建映射并调用std::map: find,是非常昂贵的。

In this case, you can use the fact that std::string behaves in many regards like a std::vector<char>, meaning that you can sort it using std::sort:

在本例中,您可以使用std::string在许多方面类似于std::vector ,这意味着您可以使用std::sort:

bool is_anagram(std::string s1, std::string s2)
{
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}

Instead of the two maps that you are creating, I am taking a copy of the strings (passing them by value instead of const reference) and sorting them, so

我没有创建两个映射,而是获取字符串的副本(通过值而不是const引用传递它们)并对它们进行排序

sort("lead") => "adel"
sort("deal") => "adel"

This change should already speed your algorithm up by quite a bit. One more thing that may greatly affect the performance if you tend to compare arbitrary words:

这个改变应该已经让你的算法提高了不少。还有一件事可能会对你的表现有很大的影响如果你比较任意的词:

bool is_anagram(std::string s1, std::string s2)
{
    if(s1.length() != s2.length())
        return false;
    /* as above */
}

If the length of the two strings differs, it obviously cannot be an anagram. std::string::length() is a very fast operation (it needs to store the string size anyway), so we save us the hustle of O(N*log(N)) from sorting the two strings.

如果两个字符串的长度不同,显然它不是一个字母组合。string:::length()是一个非常快速的操作(它需要存储字符串大小),所以我们省去了O(N*log(N))对两个字符串排序的麻烦。

#2


36  

You've got 26 characters, make one array of the size of 26, then iterate through both words and as you encouter characters in words, increment a corresponding array element for characters in the first word and decrement corresponding array element for charaacters in the second word. If the words are anagrams, all array elements will be 0 in the end. The complexity will be just O(N) where N is the length of words.

有26个字符,创建一个大小为26的数组,然后遍历两个单词,并在单词中封装字符时,对第一个单词中的字符增加相应的数组元素,对第二个单词中的字符减量相应的数组元素。如果单词是anagrams,那么所有数组元素最后都是0。复杂度将是O(N)其中N是单词的长度。

#3


33  

Here's a selection of functions that perform anagram analysis:

下面是一些执行分析的函数的选择:

#include <iostream>
#include <iomanip>
#include <map>
#include <cctype>
#include <string>
#include <algorithm>

using namespace std;

bool is_anagram(const std::string & s1, const std::string & s2)
{
    auto check = [](const std::string& x)
    {
        std::map<char, unsigned> counter;
        for(auto c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}


bool is_anagram1(const std::string & s1, const std::string & s2)
{
    auto check = [](const std::string& x)
    {
        std::map<char, unsigned> counter;
        for(auto c : x)
        {
        ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}


bool is_anagram2(std::string s1, std::string s2)
{
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}


bool is_anagram3(std::string s1, std::string s2)
{
    if (s1.length() != s2.length()) return false;
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}

bool is_anagram4(const std::string &s1, const std::string &s2)
{
    int arr[256] = {};
    if (s1.length() != s2.length()) return false;
    for(std::string::size_type i = 0; i < s1.length(); i++)
    {
    arr[(unsigned)s1[i]]++;
    arr[(unsigned)s2[i]]--;
    }
    for(auto i : arr)
    {
    if (i) return false;
    } 
    return true;
}

bool is_anagram5(const std::string &s1, const std::string &s2)
{
    int arr[26] = {};
    if (s1.length() != s2.length()) return false;

    for(std::string::size_type i = 0; i < s1.length(); i++)
    {
    arr[(unsigned)tolower(s1[i])-'a']++;
    arr[(unsigned)tolower(s2[i])-'a']--;
    }
    for(auto i : arr)
    {
    if (i) return false;
    } 
    return true;
}


static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}


template<typename T>
void bm(const char *name, T func, 
    const std::string &s1, const std::string &s2)
{
    unsigned long long time = rdtsc();
    bool res = func(s1, s2);
    time = rdtsc()-time;
    cout << "Function" << left << setw(15) << name 
     << " time: " << right << setw(8) << time 
     << " Res: " << res << endl;
}


#define BM(x) bm(#x, x, s1, s2)

int main()
{
    const std::string short1 = "lead";
    const std::string short2 = "deal";
    const std::string long1 = "leaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddeal";
    const std::string long2 = "dealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddeallead";

    cout << "Testing with short strings:" << endl;
    std::string s1 = short1;
    std::string s2 = short2;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);

    cout << "Testing with long strings:" << endl;
    s1 = long1;
    s2 = long2;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);

    cout << "Testing with inequal short string:" << endl;
    s1 = short1;
    s2 = short2;
    s1[s1.length()-1]++;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);

    cout << "Testing with inequal long string:" << endl;
    s1 = long1;
    s2 = long2;
    s1[s1.length()-1]++;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);
}

And the results:

结果:

Testing with short strings:
Functionis_anagram      time:    72938 Res: 1
Functionis_anagram1     time:    15694 Res: 1
Functionis_anagram2     time:    49780 Res: 1
Functionis_anagram3     time:    10424 Res: 1
Functionis_anagram4     time:     4272 Res: 1
Functionis_anagram5     time:    18653 Res: 1
Testing with long strings:
Functionis_anagram      time:    46067 Res: 1
Functionis_anagram1     time:    33597 Res: 1
Functionis_anagram2     time:    52721 Res: 1
Functionis_anagram3     time:    41570 Res: 1
Functionis_anagram4     time:     9027 Res: 1
Functionis_anagram5     time:    15062 Res: 1
Testing with inequal short string:
Functionis_anagram      time:    11096 Res: 0
Functionis_anagram1     time:    10115 Res: 0
Functionis_anagram2     time:    13022 Res: 0
Functionis_anagram3     time:     8066 Res: 0
Functionis_anagram4     time:     2963 Res: 0
Functionis_anagram5     time:     1916 Res: 0
Testing with inequal long string:
Functionis_anagram      time:    44246 Res: 0
Functionis_anagram1     time:    33961 Res: 0
Functionis_anagram2     time:    45004 Res: 0
Functionis_anagram3     time:    38896 Res: 0
Functionis_anagram4     time:     6455 Res: 0
Functionis_anagram5     time:    14603 Res: 0

It is quite clear that the straightforward checking with an array counting up/down based on the occurrence of each character is the winner. I took the original code and removed the find, and just used the fact that a map::operator[] will create a zero value if there is no entry there in is_anagram1, which shows some decent improvement too.

很明显,使用数组根据每个字符的出现进行向上/向下计数的直接检查是赢家。我使用了原始代码并删除了find,并使用了这样的事实:如果is_anagram1中没有条目,那么map:::operator[]将创建一个零值,这也显示了一些不错的改进。

Results are from MY machine, these may or may not reflect the results on other machines, but I doubt that the "winner vs loser" is going to be significantly different.

结果来自我的机器,这些结果可能会也可能不会反映在其他机器上的结果,但我怀疑“赢家vs输家”会有明显的不同。

Edit:

编辑:

Thought about the "find" operation, and decided to use it in different way:

思考“发现”操作,并决定以不同的方式使用:

bool is_anagram0(const std::string & s1, const std::string & s2)
{
    auto check = [](const std::string& x)
    {
        std::map<char, unsigned> counter;
        for(auto const &c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                it->second++;
        }
        return counter;
    };

    return check(s1) == check(s2);
}

Gives a little improvement over the original function, but not better enough to compete with the array solution that gives the best result.

对原来的函数做了一点改进,但还不足以与产生最佳结果的数组解决方案竞争。

#4


7  

In addition to all the other suggestions, if you are trying to find pairs of anagrams in a set, you will be calling is_anagram on the same arguments (although not the same pairs of arguments) repeatedly.

除了所有其他的建议之外,如果您试图在一个集合中找到一对anagrams,那么您将在相同的参数(尽管不是相同的参数对)中重复调用is_anagram。

Most solutions consist of two steps:

大多数解决方案包括两个步骤:

  1. map the string arguments to some other object (a sorted string, an array of length 26, a map as in your own solution)
  2. 将字符串参数映射到其他对象(排序的字符串,长度为26的数组,映射为您自己的解决方案)
  3. compare these objects to each other
  4. 将这些对象相互比较

If you have a set of N strings, and you want to find all pairs which are anagrams of each other, you will be calling is_anagram O(N^2) times.

如果你有一组N字符串,和你想找到所有对字谜的对方,你将调用is_anagram O(N ^ 2)倍。

You can save a lot by first computing the step 1 above for each of the N strings and then comparing the results.

您可以通过首先计算上面的步骤1来为每个N个字符串节省很多,然后比较结果。

#5


4  

I propose a solution which is sorting only one of the two strings:

我提出的解决方案是只对两个字符串中的一个进行排序:

 bool is_anagram(std::string const & s1, std::string s2)
 {
     if (s1.length() != s2.length()) return false;
     for (int i=0; i<s1.length(); ++i)
     {
         size_t pos = s2.find(s1[i], i);
         if (pos == std::string::npos)
         {
             return false;
         }
         std::swap(s2[i], s2[pos]);
     }
     return true;
 }

This solution could be advantageous in situations where you have one character in the one string which isn't in the other one - because if it doesn't find that character in the other string, it can short-cut the evaluation (in contrast to all other solutions shown here, where always both full strings are evaluated). Especially if this exclusive character on one side occurs early in a long string...

这个解决方案可以有利的情况下,你有一个字符在一个字符串并不是另一个,因为如果没有发现字符的字符串,它能走捷径的评价(相比其他解决方案所示,总是完整的字符串都是评估)。特别是如果一方的这个排他性字符出现在一个长字符串的早期……

One advantage over the sort solution is also required storage space - the sort solution requires the two strings to be copied, while in my solution only one copy is created. For shorter string lengths (depending on the int type used for counting, but at least up to 256 characters), it also requires less memory than the "count up/down" solution.

与排序解决方案相比,另一个优点是需要存储空间——排序解决方案需要复制两个字符串,而在我的解决方案中只创建一个副本。对于较短的字符串长度(取决于用于计数的int类型,但至少要有256个字符),它还需要比“向上/向下计数”解决方案更少的内存。

For longer strings which are anagrams, on the other hand, it starts to fall behind the "count up/down" solution.

另一方面,对于较长的字符串,它开始落在“count up/down”解决方案后面。

Analysis

分析

See the code & results below. As can easily be seen, my solution (is_anagram3) is quite fast for short anagrams (only beaten by the actually not fully functionally equivalent 26 character count up/down method), and is also fastest for the long non-anagram case with non-matching characters; but tends to get slower than the count up/down methods for longer strings which are anagrams, or for long non-anagrams which just differ by character counts.

参见下面的代码和结果。很容易看出,我的解决方案(is_anagram3)对于短字词来说是比较快的(仅被功能上并不完全等价的26个字符计数/下法打败),对于长非字词情况下不匹配的字符也是最快的;但是对于长串(字谜)的加和减的方法,或者对于长非字谜(字数不同)的方法,它们的速度会比加和减的慢。

Summary

总结

In the end, it would really depend on the expected input data as to what the ideal solution is:

最终,它将取决于理想的解决方案是什么:

  • For single calls, the "count up/down" solutions will usually give the best performance in many cases.
  • 对于单次调用,“计数/计数”解决方案通常会在许多情况下提供最佳性能。
  • Depending on the circumstances, (e.g. with what probability the strings contain different characters, as mentioned above), my solution might be faster
  • 根据不同的情况(如上面提到的字符串包含不同字符的概率),我的解决方案可能会更快
  • Not tested yet, but seems sure: For the case when you have many anagram checks to perform, and when you implement a cache for the sorted strings, the sorting and comparing solution will become the fastest.
  • 虽然还没有经过测试,但似乎可以肯定:对于需要执行许多anagram检查的情况,以及当为已排序的字符串实现缓存时,排序和比较解决方案将成为最快的。

Code:

代码:

#include <string>
#include <map>
#include <algorithm>

#include <sys/time.h>
#include <iostream>
#include <iomanip>

bool is_anagram(std::string const & s1, std::string const & s2)
{
    auto check = [](std::string const & x)
    {
        std::map<char, unsigned> counter;
        for(auto const & c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}

bool is_anagram2(std::string s1, std::string s2)
{
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}

bool is_anagram3(std::string const & s1, std::string s2)
{
    if (s1.length() != s2.length()) return false;

    for (int i=0; i<s1.length(); ++i)
    {
        size_t pos = s2.find(s1[i], i);
        if (pos == std::string::npos)
        {
            return false;
        }
        std::swap(s2[i], s2[pos]);
    }
    return true;
}

bool is_anagram4(std::string const & s1, std::string const & s2)
{
    if (s1.length() != s2.length()) return false;

    int count[26] = {0};
    for (int i=0; i<s1.length(); ++i)
    {
        count[s1[i]-'a']++;
        count[s2[i]-'a']--;
    }
    for (int i=0; i<26; ++i)
    {
        if (count[i] != 0) return false;
    }
    return true;
}

bool is_anagram5(std::string const & s1, std::string const & s2)
{
    if (s1.length() != s2.length()) return false;

    int count[256] = {0};
    for (int i=0; i<s1.length(); ++i)
    {
        count[s1[i]]++;
        count[s2[i]]--;
    }
    for (int i=0; i<256; ++i)
    {
        if (count[i] != 0) return false;
    }
    return true;
}

template<typename T>
bool test(const char* name, T func)
{
    if (!func("deal", "lead") || !func("lead", "deal") || !func("dealbreaker", "breakdealer") ||
        !func("dealbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealer") ||
        func("dealbreakerdealbreakerdealbreakerdealbreakera", "breakdealerbreakdealerbreakdealerbreakdealerb") ||
        func("dealxbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealerb") ||
        func("abcdefg", "tuvwxyz") ||
        func("lot", "bloat") || func("lead", "deala") ||
        func("lot", "bloat") || func("deala", "lead") ||
        func("abc", "abcd"))
    {
        std::cout << name << ": impl doesn't work" << std::endl;
        return false;
    }
    return true;
}

template<typename T>
void measure(const char* name, T func,
    std::string const & s1, std::string const & s2)
{
    timeval start,end;
    unsigned long secDiff;
    long usecDiff;

    gettimeofday(&start, 0);
    for (int i=0; i<100000; ++i)
    {
        func(s1, s2);
    }
    gettimeofday(&end, 0);
    secDiff = end.tv_sec - start.tv_sec;
    usecDiff = end.tv_usec - start.tv_usec;
    if (usecDiff < 0)
    {
        secDiff--;
        usecDiff += 1000000;
    }
 std::cout << name << ": " << secDiff << "."<< std::setw(3) << std::setfill('0') << (usecDiff/1000) << " s" << std::endl;
}

template <typename T>
void run(const char * funcName, T func)
{
    std::cout << funcName << std::endl;
    if (!test(funcName, func)) {
        return;
    }
    measure("short", func, "dealbreaker", "breakdealer");
    measure("short no anagram", func, "abcdefg", "tuvwxyz");
    measure("long", func, "dealbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealer");
    measure("long no anagram", func, "dealbreakerdealbreakerdealbreakerdealbreakera", "breakdealerbreakdealerbreakdealerbreakdealerb");
    measure("long no anagram nonmatching char", func, "dealxbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealerb");
}

int main(int argv, char**argc)
{
    run("is_anagram", is_anagram);
    run("is_anagram2", is_anagram2);
    run("is_anagram3", is_anagram3);
    run("is_anagram4", is_anagram4);
    run("is_anagram5", is_anagram5);
}

Output

输出

Measured on my slow Atom machine, results may vary a bit on different systems, but should in general give a good picture of the relative performances:

在我的慢速原子机器上测量,结果可能在不同的系统上有所不同,但总的来说应该能很好地反映出相关的性能:

is_anagram
short: 2.960 s
short no anagram: 2.154 s
long: 8.063 s
long no anagram: 8.092 s
long no anagram nonmatching char: 8.267 s
is_anagram2
short: 0.685 s
short no anagram: 0.333 s
long: 3.332 s
long no anagram: 3.557 s
long no anagram nonmatching char: 3.740 s
is_anagram3
short: 0.280 s
short no anagram: 0.022 s
long: 0.984 s
long no anagram: 0.994 s
long no anagram nonmatching char: 0.140 s
is_anagram4
short: 0.123 s
short no anagram: 0.071 s
long: 0.399 s
long no anagram: 0.389 s
long no anagram nonmatching char: 0.390 s
is_anagram5
short: 0.283 s
short no anagram: 0.145 s
long: 0.551 s
long no anagram: 0.454 s
long no anagram nonmatching char: 0.455 s

#6


4  

Assuming that most word pairs are not anagrams, the case you most need to optimise is the failure case.

假设大多数单词对都不是字谜,那么最需要优化的是失败的情况。

Obviously if the lengths are different then the strings cannot be anagrams, and the length is a property that is computed once when the string is created, making it a very efficient thing to compare as a fast-out.

显然,如果长度不同,那么字符串就不能是字谜,而长度是一个属性,它在创建字符串时计算一次,因此作为快速输出进行比较是非常有效的。

If you change your string class to include more of these properties you can improve the accuracy of the fast-out case. Rather than beginning the test function with:

如果您更改字符串类以包含更多的这些属性,您可以提高快出情况的准确性。而不是以:

if (s1.length() != s2.length()) return false;

You could use:

您可以使用:

if (s1.hash != s2.hash) return false;

and when you create the string (or after you modify it), you would compute a value for hash which is unique (or nearly unique) to all strings with that set of letters in arbitrary order.

当您创建字符串(或修改后)时,您将计算一个哈希值,该值对所有字符串都是唯一的(或几乎是唯一的),该值以任意顺序排列。

You only compute this hash when a string changes; not for every comparison that you do.

只有当字符串发生变化时才计算这个散列;不是你做的每一个比较。

A very simple way to compute the hash is:

计算散列的一种非常简单的方法是:

long sum = 0;
for (int i = 0; i < s.length(); i++)
    sum += s[i];
s.hash = sum;

this is quick to compute, but is prone to collisions.

这很容易计算,但容易发生碰撞。

A more advanced method:

一种更高级的方法:

static const unsigned int primetable[256] = { 1, 3, 5, 7, /* ... */ };

unsigned long prod = 1;
for (int i = 0; i < s.length(); i++)
    prod *= primetable[s[i]];
s.hash = prod;

Note that two is omitted from the list of primes. This ensures that prod is always co-prime with the possible range of unsigned long. This maximises the distribution of results in the face of numerical overflow.

注意,在素数列表中省略了两个素数。这确保了prod总是与可能的无符号长范围的联合启动。这将在遇到数值溢出时最大化结果的分布。

If the table is arranged to put small primes at frequent letter positions, then the cases where overflow occurs (which can lead to hash collisions) can be minimised. If there's no overflow then you have a perfect hash (for these purposes) and you would have absolute certainty both ways (return either true or false immediately) by just comparing hash.

如果将表安排为将小素数放在频繁的字母位置,则可以最小化发生溢出(这会导致哈希冲突)的情况。如果没有溢出,那么您就有了一个完美的散列(出于这些目的),通过比较散列,您可以通过两种方式(立即返回true或false)获得绝对的确定性。

Consequently, computing the hash like this works out much better:

因此,像这样计算哈希会更好:

static const unsigned int primetable[256] = {
    1, 1429, 1433, 1439, 1447, 1451, 1453, 1459, 1471, 1481, 1483, 1487, 1489, 1493, 1499, 1511, 1523, 1531, 1543, 1549, 1553, 1559, 1567, 1571, 1579, 1583, 1597, 1601, 1607, 1609, 1613, 1619,
    821, 823, 827, 829, 839, 853, 857, 107, 859, 863, 877, 881, 883, 109, 887, 907, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 911, 919, 929, 937, 941, 947,
    601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701, 709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787, 797, 809, 811,
    359, 11, 61, 31, 37, 3, 71, 43, 59, 7, 101, 79, 29, 53, 13, 23, 47, 103, 17, 5, 19, 41, 73, 83, 89, 67, 97, 367, 373, 379, 383, 389,
    1181, 1187, 1193, 1201, 1213, 1217, 1223, 1229, 1231, 1237, 1249, 1259, 1277, 1279, 1283, 1289, 1291, 1297, 1301, 1303, 1307, 1319, 1321, 1327, 1361, 1367, 1373, 1381, 1399, 1409, 1423, 1427,
    953, 967, 971, 977, 983, 991, 997, 1009, 1013, 1019, 1021, 1031, 1033, 1039, 1049, 1051, 1061, 1063, 1069, 1087, 1091, 1093, 1097, 1103, 1109, 1117, 1123, 1129, 1151, 1153, 1163, 1171,
    397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599,
    173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353
};

unsigned long prod = 1;
boolean overflow = false;
for (int i = 0; i < s.length(); i++)
{
    overflow |= (prod > ULONG_MAX / primetable[s[i]]);
    prod *= primetable[s[i]];
}
if (overflow)
    prod ^= 1;
s.hash = prod;

with fast-outs:

fast-outs:

if (s1.hash != s2.hash) return false;
if ((s1.hash & 1) != 0) return true;
if (s1.length() != s2.length()) return false;

The middle line is only safe to use if the character encoding is not multi-byte. If you are working with a multi-byte coding scheme then the hash will still eliminate most non-anagrams, but it will have a lot of false positives as some byte orderings cannot be ignored.

只有当字符编码不是多字节时,中间的行才是安全的。如果您使用的是多字节编码方案,那么散列仍然会消除大多数非字谜,但是由于某些字节排序不能被忽略,它会有很多误报。

Hacking Mats Petersson's test code to read from a dictionary, and trying this and the other algorithms on real English dictionary input we see roughly a factor of four improvement over the next best algorithm (it was a factor of ten, but I tweaked the other code):

从字典中读取的测试代码,以及在实际的英语字典输入中尝试这个和其他的算法,我们可以看到一个比下一个最好的算法有四个改进的因素(它是10的一个因素,但是我修改了其他代码):

Functionis_anagram      time:    218.9s hits: 93
Functionis_anagram1     time:      200s hits: 93
Functionis_anagram2     time:       40s hits: 93
Functionis_anagram3     time:     7.74s hits: 93
Functionis_anagram4     time:     2.65s hits: 93
Functionis_anagram5     time:      2.3s hits: 166
Functionis_anagram6     time:     0.56s hits: 166  <- This method

(the number of hits is different because the last two algorithms are both case-insensitive and my dictionary probably included acronyms colliding with natural words)

(点击的数量不同,因为最后两种算法都不区分大小写,我的字典里可能包含了与自然单词碰撞的首字母缩略词)


update: Although it's not what was asked, it was negligent of me to not point this out. I don't know if I didn't spot it or if I just got sick of typing or if I didn't want to make assumptions about the calling code...

更新:虽然这不是我的问题,但我没有指出这一点。我不知道我是否没有注意到它,或者我只是厌倦了打字,或者我不想对调用代码做出假设……

Once you've hashed all the words, a trivial way to minimise the number of tests is to sort the list by that hash. Then you can trivially limit comparisons to parts of the list that are likely matches (according to the hash). This can also make branch prediction more efficient, too.

一旦对所有单词进行了散列处理,最小化测试数量的一种简单方法就是使用散列对列表进行排序。然后可以对列表中可能匹配的部分(根据散列)进行简单的比较。这也可以使分支预测更有效。

I just tried changing the N^2 iteration I tested with originally (I'm sure that was deliberately inefficient) to iterate over neighbours in a sorted list. The sort() call dominated the timing, but was 200x faster than the fastest N^2 test, and the choice of comparison algorithm no longer made any meaningful difference to performance.

我试着改变N ^ 2迭代测试最初(我肯定是故意低效)在分类列表遍历的邻居。()调用时间主导,但200 x速度比最快的N ^ 2测试,和比较算法的选择不再做出任何有意义的性能差异。

Or you could just bin words by hash as you receive them.

或者,你可以在收到单词时,将它们按哈希方式存储。

#7


0  

What about this solution? It's in C# if you don't mind.

这个解决方案呢?如果你不介意的话,在c#里。

public bool isAnagram(string first, string second) {
    if(first == null || second == null)
        return false;
    if(first.Length != second.Length)
        return false;
    string characters = "abcd...zABCD...Z";
    int firstResult = 0;
    int secondResult = 0;
    foreach(char x in second) {
        if(first.IndexOf(x) == -1)
            return false;
        if(char == " ")
            secondResult += 0;
        else
            secondResult += characters.IndexOf(x);
    }
    foreach(char x in first) {
        if(char == " ")
            firstResult += 0;
        else
            firstResult += characters.IndexOf(x);
    }
    return firstResult == secondResult;

}

}

#1


43  

The map creation and your call to std::map::find within the iteration, is quite expensive.

在迭代中创建映射并调用std::map: find,是非常昂贵的。

In this case, you can use the fact that std::string behaves in many regards like a std::vector<char>, meaning that you can sort it using std::sort:

在本例中,您可以使用std::string在许多方面类似于std::vector ,这意味着您可以使用std::sort:

bool is_anagram(std::string s1, std::string s2)
{
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}

Instead of the two maps that you are creating, I am taking a copy of the strings (passing them by value instead of const reference) and sorting them, so

我没有创建两个映射,而是获取字符串的副本(通过值而不是const引用传递它们)并对它们进行排序

sort("lead") => "adel"
sort("deal") => "adel"

This change should already speed your algorithm up by quite a bit. One more thing that may greatly affect the performance if you tend to compare arbitrary words:

这个改变应该已经让你的算法提高了不少。还有一件事可能会对你的表现有很大的影响如果你比较任意的词:

bool is_anagram(std::string s1, std::string s2)
{
    if(s1.length() != s2.length())
        return false;
    /* as above */
}

If the length of the two strings differs, it obviously cannot be an anagram. std::string::length() is a very fast operation (it needs to store the string size anyway), so we save us the hustle of O(N*log(N)) from sorting the two strings.

如果两个字符串的长度不同,显然它不是一个字母组合。string:::length()是一个非常快速的操作(它需要存储字符串大小),所以我们省去了O(N*log(N))对两个字符串排序的麻烦。

#2


36  

You've got 26 characters, make one array of the size of 26, then iterate through both words and as you encouter characters in words, increment a corresponding array element for characters in the first word and decrement corresponding array element for charaacters in the second word. If the words are anagrams, all array elements will be 0 in the end. The complexity will be just O(N) where N is the length of words.

有26个字符,创建一个大小为26的数组,然后遍历两个单词,并在单词中封装字符时,对第一个单词中的字符增加相应的数组元素,对第二个单词中的字符减量相应的数组元素。如果单词是anagrams,那么所有数组元素最后都是0。复杂度将是O(N)其中N是单词的长度。

#3


33  

Here's a selection of functions that perform anagram analysis:

下面是一些执行分析的函数的选择:

#include <iostream>
#include <iomanip>
#include <map>
#include <cctype>
#include <string>
#include <algorithm>

using namespace std;

bool is_anagram(const std::string & s1, const std::string & s2)
{
    auto check = [](const std::string& x)
    {
        std::map<char, unsigned> counter;
        for(auto c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}


bool is_anagram1(const std::string & s1, const std::string & s2)
{
    auto check = [](const std::string& x)
    {
        std::map<char, unsigned> counter;
        for(auto c : x)
        {
        ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}


bool is_anagram2(std::string s1, std::string s2)
{
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}


bool is_anagram3(std::string s1, std::string s2)
{
    if (s1.length() != s2.length()) return false;
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}

bool is_anagram4(const std::string &s1, const std::string &s2)
{
    int arr[256] = {};
    if (s1.length() != s2.length()) return false;
    for(std::string::size_type i = 0; i < s1.length(); i++)
    {
    arr[(unsigned)s1[i]]++;
    arr[(unsigned)s2[i]]--;
    }
    for(auto i : arr)
    {
    if (i) return false;
    } 
    return true;
}

bool is_anagram5(const std::string &s1, const std::string &s2)
{
    int arr[26] = {};
    if (s1.length() != s2.length()) return false;

    for(std::string::size_type i = 0; i < s1.length(); i++)
    {
    arr[(unsigned)tolower(s1[i])-'a']++;
    arr[(unsigned)tolower(s2[i])-'a']--;
    }
    for(auto i : arr)
    {
    if (i) return false;
    } 
    return true;
}


static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}


template<typename T>
void bm(const char *name, T func, 
    const std::string &s1, const std::string &s2)
{
    unsigned long long time = rdtsc();
    bool res = func(s1, s2);
    time = rdtsc()-time;
    cout << "Function" << left << setw(15) << name 
     << " time: " << right << setw(8) << time 
     << " Res: " << res << endl;
}


#define BM(x) bm(#x, x, s1, s2)

int main()
{
    const std::string short1 = "lead";
    const std::string short2 = "deal";
    const std::string long1 = "leaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddeal";
    const std::string long2 = "dealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddeallead";

    cout << "Testing with short strings:" << endl;
    std::string s1 = short1;
    std::string s2 = short2;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);

    cout << "Testing with long strings:" << endl;
    s1 = long1;
    s2 = long2;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);

    cout << "Testing with inequal short string:" << endl;
    s1 = short1;
    s2 = short2;
    s1[s1.length()-1]++;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);

    cout << "Testing with inequal long string:" << endl;
    s1 = long1;
    s2 = long2;
    s1[s1.length()-1]++;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);
}

And the results:

结果:

Testing with short strings:
Functionis_anagram      time:    72938 Res: 1
Functionis_anagram1     time:    15694 Res: 1
Functionis_anagram2     time:    49780 Res: 1
Functionis_anagram3     time:    10424 Res: 1
Functionis_anagram4     time:     4272 Res: 1
Functionis_anagram5     time:    18653 Res: 1
Testing with long strings:
Functionis_anagram      time:    46067 Res: 1
Functionis_anagram1     time:    33597 Res: 1
Functionis_anagram2     time:    52721 Res: 1
Functionis_anagram3     time:    41570 Res: 1
Functionis_anagram4     time:     9027 Res: 1
Functionis_anagram5     time:    15062 Res: 1
Testing with inequal short string:
Functionis_anagram      time:    11096 Res: 0
Functionis_anagram1     time:    10115 Res: 0
Functionis_anagram2     time:    13022 Res: 0
Functionis_anagram3     time:     8066 Res: 0
Functionis_anagram4     time:     2963 Res: 0
Functionis_anagram5     time:     1916 Res: 0
Testing with inequal long string:
Functionis_anagram      time:    44246 Res: 0
Functionis_anagram1     time:    33961 Res: 0
Functionis_anagram2     time:    45004 Res: 0
Functionis_anagram3     time:    38896 Res: 0
Functionis_anagram4     time:     6455 Res: 0
Functionis_anagram5     time:    14603 Res: 0

It is quite clear that the straightforward checking with an array counting up/down based on the occurrence of each character is the winner. I took the original code and removed the find, and just used the fact that a map::operator[] will create a zero value if there is no entry there in is_anagram1, which shows some decent improvement too.

很明显,使用数组根据每个字符的出现进行向上/向下计数的直接检查是赢家。我使用了原始代码并删除了find,并使用了这样的事实:如果is_anagram1中没有条目,那么map:::operator[]将创建一个零值,这也显示了一些不错的改进。

Results are from MY machine, these may or may not reflect the results on other machines, but I doubt that the "winner vs loser" is going to be significantly different.

结果来自我的机器,这些结果可能会也可能不会反映在其他机器上的结果,但我怀疑“赢家vs输家”会有明显的不同。

Edit:

编辑:

Thought about the "find" operation, and decided to use it in different way:

思考“发现”操作,并决定以不同的方式使用:

bool is_anagram0(const std::string & s1, const std::string & s2)
{
    auto check = [](const std::string& x)
    {
        std::map<char, unsigned> counter;
        for(auto const &c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                it->second++;
        }
        return counter;
    };

    return check(s1) == check(s2);
}

Gives a little improvement over the original function, but not better enough to compete with the array solution that gives the best result.

对原来的函数做了一点改进,但还不足以与产生最佳结果的数组解决方案竞争。

#4


7  

In addition to all the other suggestions, if you are trying to find pairs of anagrams in a set, you will be calling is_anagram on the same arguments (although not the same pairs of arguments) repeatedly.

除了所有其他的建议之外,如果您试图在一个集合中找到一对anagrams,那么您将在相同的参数(尽管不是相同的参数对)中重复调用is_anagram。

Most solutions consist of two steps:

大多数解决方案包括两个步骤:

  1. map the string arguments to some other object (a sorted string, an array of length 26, a map as in your own solution)
  2. 将字符串参数映射到其他对象(排序的字符串,长度为26的数组,映射为您自己的解决方案)
  3. compare these objects to each other
  4. 将这些对象相互比较

If you have a set of N strings, and you want to find all pairs which are anagrams of each other, you will be calling is_anagram O(N^2) times.

如果你有一组N字符串,和你想找到所有对字谜的对方,你将调用is_anagram O(N ^ 2)倍。

You can save a lot by first computing the step 1 above for each of the N strings and then comparing the results.

您可以通过首先计算上面的步骤1来为每个N个字符串节省很多,然后比较结果。

#5


4  

I propose a solution which is sorting only one of the two strings:

我提出的解决方案是只对两个字符串中的一个进行排序:

 bool is_anagram(std::string const & s1, std::string s2)
 {
     if (s1.length() != s2.length()) return false;
     for (int i=0; i<s1.length(); ++i)
     {
         size_t pos = s2.find(s1[i], i);
         if (pos == std::string::npos)
         {
             return false;
         }
         std::swap(s2[i], s2[pos]);
     }
     return true;
 }

This solution could be advantageous in situations where you have one character in the one string which isn't in the other one - because if it doesn't find that character in the other string, it can short-cut the evaluation (in contrast to all other solutions shown here, where always both full strings are evaluated). Especially if this exclusive character on one side occurs early in a long string...

这个解决方案可以有利的情况下,你有一个字符在一个字符串并不是另一个,因为如果没有发现字符的字符串,它能走捷径的评价(相比其他解决方案所示,总是完整的字符串都是评估)。特别是如果一方的这个排他性字符出现在一个长字符串的早期……

One advantage over the sort solution is also required storage space - the sort solution requires the two strings to be copied, while in my solution only one copy is created. For shorter string lengths (depending on the int type used for counting, but at least up to 256 characters), it also requires less memory than the "count up/down" solution.

与排序解决方案相比,另一个优点是需要存储空间——排序解决方案需要复制两个字符串,而在我的解决方案中只创建一个副本。对于较短的字符串长度(取决于用于计数的int类型,但至少要有256个字符),它还需要比“向上/向下计数”解决方案更少的内存。

For longer strings which are anagrams, on the other hand, it starts to fall behind the "count up/down" solution.

另一方面,对于较长的字符串,它开始落在“count up/down”解决方案后面。

Analysis

分析

See the code & results below. As can easily be seen, my solution (is_anagram3) is quite fast for short anagrams (only beaten by the actually not fully functionally equivalent 26 character count up/down method), and is also fastest for the long non-anagram case with non-matching characters; but tends to get slower than the count up/down methods for longer strings which are anagrams, or for long non-anagrams which just differ by character counts.

参见下面的代码和结果。很容易看出,我的解决方案(is_anagram3)对于短字词来说是比较快的(仅被功能上并不完全等价的26个字符计数/下法打败),对于长非字词情况下不匹配的字符也是最快的;但是对于长串(字谜)的加和减的方法,或者对于长非字谜(字数不同)的方法,它们的速度会比加和减的慢。

Summary

总结

In the end, it would really depend on the expected input data as to what the ideal solution is:

最终,它将取决于理想的解决方案是什么:

  • For single calls, the "count up/down" solutions will usually give the best performance in many cases.
  • 对于单次调用,“计数/计数”解决方案通常会在许多情况下提供最佳性能。
  • Depending on the circumstances, (e.g. with what probability the strings contain different characters, as mentioned above), my solution might be faster
  • 根据不同的情况(如上面提到的字符串包含不同字符的概率),我的解决方案可能会更快
  • Not tested yet, but seems sure: For the case when you have many anagram checks to perform, and when you implement a cache for the sorted strings, the sorting and comparing solution will become the fastest.
  • 虽然还没有经过测试,但似乎可以肯定:对于需要执行许多anagram检查的情况,以及当为已排序的字符串实现缓存时,排序和比较解决方案将成为最快的。

Code:

代码:

#include <string>
#include <map>
#include <algorithm>

#include <sys/time.h>
#include <iostream>
#include <iomanip>

bool is_anagram(std::string const & s1, std::string const & s2)
{
    auto check = [](std::string const & x)
    {
        std::map<char, unsigned> counter;
        for(auto const & c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}

bool is_anagram2(std::string s1, std::string s2)
{
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}

bool is_anagram3(std::string const & s1, std::string s2)
{
    if (s1.length() != s2.length()) return false;

    for (int i=0; i<s1.length(); ++i)
    {
        size_t pos = s2.find(s1[i], i);
        if (pos == std::string::npos)
        {
            return false;
        }
        std::swap(s2[i], s2[pos]);
    }
    return true;
}

bool is_anagram4(std::string const & s1, std::string const & s2)
{
    if (s1.length() != s2.length()) return false;

    int count[26] = {0};
    for (int i=0; i<s1.length(); ++i)
    {
        count[s1[i]-'a']++;
        count[s2[i]-'a']--;
    }
    for (int i=0; i<26; ++i)
    {
        if (count[i] != 0) return false;
    }
    return true;
}

bool is_anagram5(std::string const & s1, std::string const & s2)
{
    if (s1.length() != s2.length()) return false;

    int count[256] = {0};
    for (int i=0; i<s1.length(); ++i)
    {
        count[s1[i]]++;
        count[s2[i]]--;
    }
    for (int i=0; i<256; ++i)
    {
        if (count[i] != 0) return false;
    }
    return true;
}

template<typename T>
bool test(const char* name, T func)
{
    if (!func("deal", "lead") || !func("lead", "deal") || !func("dealbreaker", "breakdealer") ||
        !func("dealbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealer") ||
        func("dealbreakerdealbreakerdealbreakerdealbreakera", "breakdealerbreakdealerbreakdealerbreakdealerb") ||
        func("dealxbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealerb") ||
        func("abcdefg", "tuvwxyz") ||
        func("lot", "bloat") || func("lead", "deala") ||
        func("lot", "bloat") || func("deala", "lead") ||
        func("abc", "abcd"))
    {
        std::cout << name << ": impl doesn't work" << std::endl;
        return false;
    }
    return true;
}

template<typename T>
void measure(const char* name, T func,
    std::string const & s1, std::string const & s2)
{
    timeval start,end;
    unsigned long secDiff;
    long usecDiff;

    gettimeofday(&start, 0);
    for (int i=0; i<100000; ++i)
    {
        func(s1, s2);
    }
    gettimeofday(&end, 0);
    secDiff = end.tv_sec - start.tv_sec;
    usecDiff = end.tv_usec - start.tv_usec;
    if (usecDiff < 0)
    {
        secDiff--;
        usecDiff += 1000000;
    }
 std::cout << name << ": " << secDiff << "."<< std::setw(3) << std::setfill('0') << (usecDiff/1000) << " s" << std::endl;
}

template <typename T>
void run(const char * funcName, T func)
{
    std::cout << funcName << std::endl;
    if (!test(funcName, func)) {
        return;
    }
    measure("short", func, "dealbreaker", "breakdealer");
    measure("short no anagram", func, "abcdefg", "tuvwxyz");
    measure("long", func, "dealbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealer");
    measure("long no anagram", func, "dealbreakerdealbreakerdealbreakerdealbreakera", "breakdealerbreakdealerbreakdealerbreakdealerb");
    measure("long no anagram nonmatching char", func, "dealxbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealerb");
}

int main(int argv, char**argc)
{
    run("is_anagram", is_anagram);
    run("is_anagram2", is_anagram2);
    run("is_anagram3", is_anagram3);
    run("is_anagram4", is_anagram4);
    run("is_anagram5", is_anagram5);
}

Output

输出

Measured on my slow Atom machine, results may vary a bit on different systems, but should in general give a good picture of the relative performances:

在我的慢速原子机器上测量,结果可能在不同的系统上有所不同,但总的来说应该能很好地反映出相关的性能:

is_anagram
short: 2.960 s
short no anagram: 2.154 s
long: 8.063 s
long no anagram: 8.092 s
long no anagram nonmatching char: 8.267 s
is_anagram2
short: 0.685 s
short no anagram: 0.333 s
long: 3.332 s
long no anagram: 3.557 s
long no anagram nonmatching char: 3.740 s
is_anagram3
short: 0.280 s
short no anagram: 0.022 s
long: 0.984 s
long no anagram: 0.994 s
long no anagram nonmatching char: 0.140 s
is_anagram4
short: 0.123 s
short no anagram: 0.071 s
long: 0.399 s
long no anagram: 0.389 s
long no anagram nonmatching char: 0.390 s
is_anagram5
short: 0.283 s
short no anagram: 0.145 s
long: 0.551 s
long no anagram: 0.454 s
long no anagram nonmatching char: 0.455 s

#6


4  

Assuming that most word pairs are not anagrams, the case you most need to optimise is the failure case.

假设大多数单词对都不是字谜,那么最需要优化的是失败的情况。

Obviously if the lengths are different then the strings cannot be anagrams, and the length is a property that is computed once when the string is created, making it a very efficient thing to compare as a fast-out.

显然,如果长度不同,那么字符串就不能是字谜,而长度是一个属性,它在创建字符串时计算一次,因此作为快速输出进行比较是非常有效的。

If you change your string class to include more of these properties you can improve the accuracy of the fast-out case. Rather than beginning the test function with:

如果您更改字符串类以包含更多的这些属性,您可以提高快出情况的准确性。而不是以:

if (s1.length() != s2.length()) return false;

You could use:

您可以使用:

if (s1.hash != s2.hash) return false;

and when you create the string (or after you modify it), you would compute a value for hash which is unique (or nearly unique) to all strings with that set of letters in arbitrary order.

当您创建字符串(或修改后)时,您将计算一个哈希值,该值对所有字符串都是唯一的(或几乎是唯一的),该值以任意顺序排列。

You only compute this hash when a string changes; not for every comparison that you do.

只有当字符串发生变化时才计算这个散列;不是你做的每一个比较。

A very simple way to compute the hash is:

计算散列的一种非常简单的方法是:

long sum = 0;
for (int i = 0; i < s.length(); i++)
    sum += s[i];
s.hash = sum;

this is quick to compute, but is prone to collisions.

这很容易计算,但容易发生碰撞。

A more advanced method:

一种更高级的方法:

static const unsigned int primetable[256] = { 1, 3, 5, 7, /* ... */ };

unsigned long prod = 1;
for (int i = 0; i < s.length(); i++)
    prod *= primetable[s[i]];
s.hash = prod;

Note that two is omitted from the list of primes. This ensures that prod is always co-prime with the possible range of unsigned long. This maximises the distribution of results in the face of numerical overflow.

注意,在素数列表中省略了两个素数。这确保了prod总是与可能的无符号长范围的联合启动。这将在遇到数值溢出时最大化结果的分布。

If the table is arranged to put small primes at frequent letter positions, then the cases where overflow occurs (which can lead to hash collisions) can be minimised. If there's no overflow then you have a perfect hash (for these purposes) and you would have absolute certainty both ways (return either true or false immediately) by just comparing hash.

如果将表安排为将小素数放在频繁的字母位置,则可以最小化发生溢出(这会导致哈希冲突)的情况。如果没有溢出,那么您就有了一个完美的散列(出于这些目的),通过比较散列,您可以通过两种方式(立即返回true或false)获得绝对的确定性。

Consequently, computing the hash like this works out much better:

因此,像这样计算哈希会更好:

static const unsigned int primetable[256] = {
    1, 1429, 1433, 1439, 1447, 1451, 1453, 1459, 1471, 1481, 1483, 1487, 1489, 1493, 1499, 1511, 1523, 1531, 1543, 1549, 1553, 1559, 1567, 1571, 1579, 1583, 1597, 1601, 1607, 1609, 1613, 1619,
    821, 823, 827, 829, 839, 853, 857, 107, 859, 863, 877, 881, 883, 109, 887, 907, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 911, 919, 929, 937, 941, 947,
    601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701, 709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787, 797, 809, 811,
    359, 11, 61, 31, 37, 3, 71, 43, 59, 7, 101, 79, 29, 53, 13, 23, 47, 103, 17, 5, 19, 41, 73, 83, 89, 67, 97, 367, 373, 379, 383, 389,
    1181, 1187, 1193, 1201, 1213, 1217, 1223, 1229, 1231, 1237, 1249, 1259, 1277, 1279, 1283, 1289, 1291, 1297, 1301, 1303, 1307, 1319, 1321, 1327, 1361, 1367, 1373, 1381, 1399, 1409, 1423, 1427,
    953, 967, 971, 977, 983, 991, 997, 1009, 1013, 1019, 1021, 1031, 1033, 1039, 1049, 1051, 1061, 1063, 1069, 1087, 1091, 1093, 1097, 1103, 1109, 1117, 1123, 1129, 1151, 1153, 1163, 1171,
    397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599,
    173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353
};

unsigned long prod = 1;
boolean overflow = false;
for (int i = 0; i < s.length(); i++)
{
    overflow |= (prod > ULONG_MAX / primetable[s[i]]);
    prod *= primetable[s[i]];
}
if (overflow)
    prod ^= 1;
s.hash = prod;

with fast-outs:

fast-outs:

if (s1.hash != s2.hash) return false;
if ((s1.hash & 1) != 0) return true;
if (s1.length() != s2.length()) return false;

The middle line is only safe to use if the character encoding is not multi-byte. If you are working with a multi-byte coding scheme then the hash will still eliminate most non-anagrams, but it will have a lot of false positives as some byte orderings cannot be ignored.

只有当字符编码不是多字节时,中间的行才是安全的。如果您使用的是多字节编码方案,那么散列仍然会消除大多数非字谜,但是由于某些字节排序不能被忽略,它会有很多误报。

Hacking Mats Petersson's test code to read from a dictionary, and trying this and the other algorithms on real English dictionary input we see roughly a factor of four improvement over the next best algorithm (it was a factor of ten, but I tweaked the other code):

从字典中读取的测试代码,以及在实际的英语字典输入中尝试这个和其他的算法,我们可以看到一个比下一个最好的算法有四个改进的因素(它是10的一个因素,但是我修改了其他代码):

Functionis_anagram      time:    218.9s hits: 93
Functionis_anagram1     time:      200s hits: 93
Functionis_anagram2     time:       40s hits: 93
Functionis_anagram3     time:     7.74s hits: 93
Functionis_anagram4     time:     2.65s hits: 93
Functionis_anagram5     time:      2.3s hits: 166
Functionis_anagram6     time:     0.56s hits: 166  <- This method

(the number of hits is different because the last two algorithms are both case-insensitive and my dictionary probably included acronyms colliding with natural words)

(点击的数量不同,因为最后两种算法都不区分大小写,我的字典里可能包含了与自然单词碰撞的首字母缩略词)


update: Although it's not what was asked, it was negligent of me to not point this out. I don't know if I didn't spot it or if I just got sick of typing or if I didn't want to make assumptions about the calling code...

更新:虽然这不是我的问题,但我没有指出这一点。我不知道我是否没有注意到它,或者我只是厌倦了打字,或者我不想对调用代码做出假设……

Once you've hashed all the words, a trivial way to minimise the number of tests is to sort the list by that hash. Then you can trivially limit comparisons to parts of the list that are likely matches (according to the hash). This can also make branch prediction more efficient, too.

一旦对所有单词进行了散列处理,最小化测试数量的一种简单方法就是使用散列对列表进行排序。然后可以对列表中可能匹配的部分(根据散列)进行简单的比较。这也可以使分支预测更有效。

I just tried changing the N^2 iteration I tested with originally (I'm sure that was deliberately inefficient) to iterate over neighbours in a sorted list. The sort() call dominated the timing, but was 200x faster than the fastest N^2 test, and the choice of comparison algorithm no longer made any meaningful difference to performance.

我试着改变N ^ 2迭代测试最初(我肯定是故意低效)在分类列表遍历的邻居。()调用时间主导,但200 x速度比最快的N ^ 2测试,和比较算法的选择不再做出任何有意义的性能差异。

Or you could just bin words by hash as you receive them.

或者,你可以在收到单词时,将它们按哈希方式存储。

#7


0  

What about this solution? It's in C# if you don't mind.

这个解决方案呢?如果你不介意的话,在c#里。

public bool isAnagram(string first, string second) {
    if(first == null || second == null)
        return false;
    if(first.Length != second.Length)
        return false;
    string characters = "abcd...zABCD...Z";
    int firstResult = 0;
    int secondResult = 0;
    foreach(char x in second) {
        if(first.IndexOf(x) == -1)
            return false;
        if(char == " ")
            secondResult += 0;
        else
            secondResult += characters.IndexOf(x);
    }
    foreach(char x in first) {
        if(char == " ")
            firstResult += 0;
        else
            firstResult += characters.IndexOf(x);
    }
    return firstResult == secondResult;

}

}