How do I create a HashCode in .NET (C#) that can be safely stored in a database?

Time: 2021-05-14 16:11:04

To quote from Guidelines and rules for GetHashCode by Eric Lippert:

Rule: Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains

Suppose you have a Customer object that has a bunch of fields like Name, Address, and so on. If you make two such objects with exactly the same data in two different processes, they do not have to return the same hash code. If you make such an object on Tuesday in one process, shut it down, and run the program again on Wednesday, the hash codes can be different.

This has bitten people in the past. The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don't store string hashes in databases and expect them to be the same forever, because they won't be.

So what is the correct way to create a HashCode of a string that I can store in a database?

(Please tell me I am not the first person to have left this bug in software I have written!)

3 solutions

#1


64  

It depends what properties you want that hash to have. For example, you could just write something like this:

public int HashString(string text)
{
    // TODO: Determine nullity policy.

    unchecked
    {
        int hash = 23;
        foreach (char c in text)
        {
            hash = hash * 31 + c;
        }
        return hash;
    }
}

So long as you document that that is how the hash is computed, that's valid. It's in no way cryptographically secure or anything like that, but you can persist it with no problems. Two strings which are absolutely equal in the ordinal sense (i.e. with no cultural equality etc applied, exactly character-by-character the same) will produce the same hash with this code.
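To make the "ordinal" point concrete, here is a minimal sketch that runs the same algorithm on a few inputs. The class name StableHashDemo, the Main method and the null check are my own assumptions for illustration (the original leaves the null policy open). The values in the comments follow directly from the arithmetic hash = hash * 31 + c, starting at 23.

```csharp
using System;

public static class StableHashDemo
{
    // Same algorithm as HashString above, made static so it can be called directly.
    public static int HashString(string text)
    {
        // Assumed null policy for this sketch: reject null outright.
        if (text == null) throw new ArgumentNullException(nameof(text));

        unchecked
        {
            int hash = 23;
            foreach (char c in text)
            {
                hash = hash * 31 + c;
            }
            return hash;
        }
    }

    public static void Main()
    {
        Console.WriteLine(HashString(""));   // 23
        Console.WriteLine(HashString("a"));  // 23 * 31 + 97 = 810
        Console.WriteLine(HashString("ab")); // 810 * 31 + 98 = 25208

        // Two spellings of "é" that look identical on screen but are
        // different char sequences, so they are not ordinally equal.
        string composed = "\u00E9";    // single code unit
        string decomposed = "e\u0301"; // 'e' followed by a combining acute accent

        Console.WriteLine(HashString(composed) == HashString(decomposed));             // False
        Console.WriteLine(HashString(composed) == HashString(decomposed.Normalize())); // True
    }
}
```

If strings that differ only in Unicode composition should hash to the same value, one option is to normalize them (for example with string.Normalize()) before hashing, and to document that step along with the hash itself.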

The problems come when you rely on undocumented hashing - i.e. something which obeys GetHashCode() but is in no way guaranteed to remain the same from version to version... like string.GetHashCode().

Writing and documenting your own hash like this is a bit like saying, "This sensitive information is hashed with MD5 (or whatever)". So long as it's a well-defined hash, that's fine.

EDIT: Other answers have suggested using cryptographic hashes such as SHA-1 or MD5. I would say that until we know there's a requirement for cryptographic security rather than just stability, there's no point in going through the rigmarole of converting the string to a byte array and hashing that. Of course if the hash is meant to be used for anything security-related, an industry-standard hash is exactly what you should be reaching for. But that wasn't mentioned anywhere in the question.

#2


6  

Here is a reimplementation of the way .NET currently calculates its string hash code on 64-bit systems. It does not use pointers as the real GetHashCode() does, so it will be slightly slower, but that also makes it more resilient to internal changes to string. It gives a more evenly distributed hash code than Jon Skeet's version, which may result in better lookup times in dictionaries.

public static class StringExtensionMethods
{
    public static int GetStableHashCode(this string str)
    {
        unchecked
        {
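            // Two interleaved accumulators, both seeded with 5381.
            int hash1 = 5381;
            int hash2 = hash1;

            // Consume two chars per iteration, stopping at the end of the
            // string or at the first '\0' character.
            for(int i = 0; i < str.Length && str[i] != '\0'; i += 2)
            {
                // (h << 5) + h is h * 33; even-indexed chars feed hash1.
                hash1 = ((hash1 << 5) + hash1) ^ str[i];
                if (i == str.Length - 1 || str[i+1] == '\0')
                    break;
                // Odd-indexed chars feed hash2.
                hash2 = ((hash2 << 5) + hash2) ^ str[i+1];
            }

            // Combine the two accumulators; overflow wraps because of unchecked.
            return hash1 + (hash2*1566083941);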
        }
    }
}
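
If you persist values from a method like this, it may be worth pinning a few known results in a test so that an accidental change to the algorithm gets caught. Below is a rough sketch of such a check. The class name StableHashCheck and the Main method are my own, and the method body is copied from the answer above only so that the snippet compiles on its own. The expected values were worked out by hand from the algorithm.

```csharp
using System;

public static class StableHashCheck
{
    // Copied from the answer above so this snippet is self-contained.
    public static int GetStableHashCode(this string str)
    {
        unchecked
        {
            int hash1 = 5381;
            int hash2 = hash1;

            for (int i = 0; i < str.Length && str[i] != '\0'; i += 2)
            {
                hash1 = ((hash1 << 5) + hash1) ^ str[i];
                if (i == str.Length - 1 || str[i + 1] == '\0')
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ str[i + 1];
            }

            return hash1 + (hash2 * 1566083941);
        }
    }

    public static void Main()
    {
        Console.WriteLine("".GetStableHashCode());  // 371857150
        Console.WriteLine("a".GetStableHashCode()); // 372029373

        // The loop stops at the first '\0', so anything after an embedded
        // null character does not contribute to the hash.
        Console.WriteLine("a\0b".GetStableHashCode() == "a".GetStableHashCode()); // True
    }
}
```

The last line shows a property to keep in mind: because the loop treats '\0' as a terminator, strings that differ only after an embedded null character collide. That is harmless for a hash (collisions are always possible), but it is part of the behaviour you would be committing to once the values are stored.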

#3


1  

The answer is to just write your own hashing function. You can find source for some by following links in the comments to the article you posted. Or you can use a built-in hash function that's originally intended for cryptography (MD5, SHA1, etc.) and just not use all of the bits.
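
As a sketch of the "built-in hash, but not all of the bits" idea: the snippet below hashes the UTF-8 bytes of the string and keeps the first four bytes of the digest as an int. I have swapped in SHA-256 rather than the MD5 or SHA1 mentioned above. The class name TruncatedCryptoHash, the method name HashToInt32, the choice of UTF-8 and the big-endian byte order are all assumptions made for this example. Whatever you pick, the encoding and byte order need to be documented, because they are part of the stored value's definition.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class TruncatedCryptoHash
{
    // Returns the first 32 bits of SHA-256(UTF-8(text)) as a signed int.
    public static int HashToInt32(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        // Fix the encoding explicitly; a different encoding gives different bytes
        // and therefore a different hash.
        byte[] bytes = Encoding.UTF8.GetBytes(text);

        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(bytes);

            // Assemble the first four digest bytes in big-endian order by hand,
            // so the result does not depend on the machine's endianness.
            return (digest[0] << 24) | (digest[1] << 16) | (digest[2] << 8) | digest[3];
        }
    }

    public static void Main()
    {
        // Printed as hex these are the leading bytes of the well-known
        // SHA-256 digests of "" and "abc".
        Console.WriteLine(HashToInt32("").ToString("X8"));    // E3B0C442
        Console.WriteLine(HashToInt32("abc").ToString("X8")); // BA7816BF
    }
}
```

Truncating to 32 bits throws away the collision resistance of the full digest, so this only gives you stability and good distribution, not security.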
