如何从唯一的字符串生成唯一的int ?

时间:2021-10-29 18:56:16

I have an object with a String that holds a unique id . (such as "ocx7gf" or "67hfs8") I need to supply it an implementation of int hascode() which will be unique obviously.

我有一个具有唯一id的字符串。(例如“ocx7gf”或“67hfs8”)我需要向它提供一个int hascode()的实现,它显然是唯一的。

how do i cast a string to a unique int in the easiest/fastest way?

如何以最简单/最快的方式将字符串转换为唯一的int类型?

10x.

10倍。

Edit - OK. I already know that String.hashcode is possible. But it is not recommended in any place. Actually' if any other method is not recommended - Should I use it or not if I have my object in a collection and I need the hashcode. should I concat it to another string to make it more successful?

编辑-好的。我已经知道这个弦了。hashcode是可能的。但在任何地方都不推荐。实际上,“如果不建议使用任何其他方法——如果我的对象在集合中,并且我需要hashcode,我是否应该使用它?”我应该用另一种方法使它更成功吗?

5 个解决方案

#1


19  

No, you don't need to have an implementation that returns a unique value, "obviously", as obviously the majority of implementations would be broken.

不,您不需要具有返回唯一值的实现,“显然”,显然大多数实现都将被破坏。

What you want to do, is to have a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, then just using the hashcode of the string itself would be best.

您想要做的是在位之间有良好的分布,特别是对于公共值(如果有任何值比其他值更常见)。除非对格式有特殊的了解,否则最好只使用字符串本身的hashcode。

With special knowledge of the limits of your id format, it may be possible to customise and result in better performance, though false assumptions are more likely to make things worse than better.

有了您的id格式的极限的特殊知识,可以定制并获得更好的性能,尽管错误的假设更可能使事情变得更糟而不是更好。

Edit: On good spread of bits.

编辑:在良好的传播位。

As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact upon performance, so we want collisions to be rare.

如本文和其他答案所述,完全惟一是不可能的,哈希冲突也是可能的。散列使用方法知道这一点,并且可以处理它,但是它确实会影响性能,所以我们希望很少发生冲突。

Further, hashes are generally re-hashed so our 32-bit number may end up being reduced to e.g. one in the range 0 to 22, and we want as good a distribution within that as possible to.

此外,散列通常被重新散列,所以我们的32位数字可能最终被简化为0到22之间的一个,我们希望尽可能好的分布在这个范围内。

We also want to balance this with not taking so long to compute our hash, that it becomes a bottleneck in itself. An imperfect balancing act.

我们还想平衡这一点,不要花太长时间来计算哈希,它本身就会成为瓶颈。一个不完美的平衡。

A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:

一个坏哈希方法的典型例子是对X, Y的坐标对这样做:

return X ^ Y;

While this does a perfectly good job of returning 2^32 possible values out of the 4^32 possible inputs, in real world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on) which all hash to zero, or matching pairs ({2,3} and {3, 2}) which will hash to the same number. We are likely better served by:

虽然这是一个完美的工作返回2 ^ 32可能值4 ^ 32可能的输入,在现实世界中使用很常见的X和Y坐标,是相等的({ 0 },{ 1 },{ 2,}等等),所有哈希为零,或配对({ 2,3 }和{ 3 2 })将散列到相同数量。我们最好的服务是:

return ((X << 16) | (x >> 16)) ^ Y;

Now, there is just as many possible values for which this is dreadful than for the former, but it tends to serve better in real-world cases.

现在,有很多可能的价值比前者更可怕,但在现实世界中它更有用。

Of course, there is a different job if you are writing a general-purpose class (no idea what possible inputs there are) or have a better idea of the purpose at hand. For example, if I was using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower-digits of the years, over the standard one. The writer of Date though can't work on such knowledge and has to try to cater for everyone.

当然,如果您正在编写一个通用类(不知道有什么可能的输入),或者对即将实现的目标有更好的理解,则有不同的工作。举个例子,如果我用日期对象,但是只知道他们都是日期(时间部分总是午夜),只有在几年之内,然后,我可能更喜欢一个定制的散列码,只用一天,月和年lower-digits,标准。然而,《约会》的作者不能研究这些知识,他必须设法迎合每个人。

Hence, If I for instance knew that a given string is always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, but it isn't clear from your question that it does) then I might use an algorithm that assigned a value from 0 to 35 (the 36 possible values for each character) to each character, and then walk through the string, each time multiplying the current value by 36 and adding the value of the next char.

因此,举例来说,如果我知道给定字符串总是由6不区分大小写字符范围[a - z]或[0 - 9](这似乎你的,从你的问题,但目前还不清楚它)然后我可以用一个算法,分配一个值从0到35(36可能值为每个字符),每个字符,然后遍历字符串,每次乘36和当前值添加下一个字符的值。

Assuming a good spread in the ids, this would be the way to go, especially if I made the order such that the lower-significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), hence surviving re-hashing to a smaller range well.

假设一个好的传播的id,这是路要走,特别是如果我做订单,这样lower-significant数字在我的散列中最频繁更改字符匹配id(如果这样的调用可以),因此幸存的主导一个较小的范围内。

However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (slower algorithm for little or even negative gain in hash quality).

然而,由于缺乏这种格式的知识,我不能确定地调用它,而且我很可能会使事情变得更糟(在哈希质量方面,算法速度更慢,甚至是负增长)。

One advantage you have is that since it's an ID in itself, then presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.

您拥有的一个优势是,由于它本身就是一个ID,因此假定其他不相等的对象都具有相同的ID,因此不需要检查其他属性。这并不总是持有。

#2


10  

You can't get a unique integer from a String of unlimited length. There are 4 billionish (2^32) unique integers, but an almost infinite number of unique strings.

你不能从无限长度的字符串中得到唯一的整数。有4 billionish(2 ^ 32)独特的整数,但一个几乎无限数量的独一无二的字符串。

String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.

hashcode()不会给你唯一的整数,但是它会尽量根据输入字符串给你不同的结果。

EDIT

编辑

Your edited question says that String.hashCode() is not recommended. This is not true, it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.

您编辑的问题表明不建议使用String.hashCode()。这是不正确的,除非你有特殊的理由不使用它。如果您有特殊原因,请提供详细信息。

#3


4  

Looks like you've got a base-36 number there (a-z + 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by using String.hashCode(), which does its best to be close to unique.

看起来你有一个base-36数字(a-z + 0-9)。为什么不把它转换成整数呢?方法(年代,36)?显然,如果有太多的唯一id,它就不能适合int类型,但是在这种情况下,对于唯一整数,您就不太幸运了,需要使用String.hashCode()来获取,它会尽力接近unique。

#4


3  

Unless your strings are limited in some way or your integers hold more bits than the strings you're trying to convert, you cannot guarantee the uniqueness.

除非你的字符串在某种程度上是受限的或者你的整数比你试图转换的字符串持有更多的比特,你不能保证它的唯一性。

Let's say you have a 32 bit integer and a 64-character character set for your strings. That means six bits per character. That will allow you to store five characters into an integer. More than that and it won't fit.

假设您有一个32位的整数和一个64位的字符串字符集。这意味着每个字符有6个字节。这将允许您将5个字符存储到一个整数中。超过这个,就不合适了。

#5


0  

One way to do it is assign each letter a value, and each place of the string it's own multiple ie a = 1, b = 2, and so on, then everything in the first digit (read left to right) would be multiplied by a prime number, the next the next prime number and so on, such that the final digit was multiplied by a prime larger than the number of possible subsets in that digit (26+1 for a space or 52+1 with capitols and so on for other supported characters). If the number is mapped back to the first digits (leftmost character) any number you generate from a unique string mapping back to 1 or 6 whatever the first letter will be, gives a unique value.

这样做的方法之一是分配一个值,每个字母和字符串的每个地方的多个ie = 1,b = 2,等等,然后一切都在第一位(从左至右)乘以一个质数,下一个下一个质数,这样最后的数字乘以一个'的数量大于可能的子集,数字(26 + 1空间或52 + 1与议会大厦等其他受支持的字符)。如果数字被映射回第一个数字(最左的字符),那么您从唯一的字符串映射回1或6所生成的任何数字,无论第一个字母是什么,都会给出一个唯一的值。

Dog might be 30,3(15),101(7) or 782, while God 33,3(15),101(4) or 482. More importantly than unique strings being generated they can be useful in generation if the original digit is kept, like 30(782) would be unique to some 12(782) for the purposes of differentiating like strings if you ever managed to go over the unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.

狗可能是30、3(15)、101(7)或782,而上帝是33、3(15)、101(4)或482。比生成唯一字符串更重要的是,如果保留原始数字,比如30(782)对于12(782)来说是唯一的,这样就可以像字符串一样进行区分,如果你想过唯一的可能性的话。狗永远是狗,但永远不会是猫或老鼠。

#1


19  

No, you don't need to have an implementation that returns a unique value, "obviously", as obviously the majority of implementations would be broken.

不,您不需要具有返回唯一值的实现,“显然”,显然大多数实现都将被破坏。

What you want to do, is to have a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, then just using the hashcode of the string itself would be best.

您想要做的是在位之间有良好的分布,特别是对于公共值(如果有任何值比其他值更常见)。除非对格式有特殊的了解,否则最好只使用字符串本身的hashcode。

With special knowledge of the limits of your id format, it may be possible to customise and result in better performance, though false assumptions are more likely to make things worse than better.

有了您的id格式的极限的特殊知识,可以定制并获得更好的性能,尽管错误的假设更可能使事情变得更糟而不是更好。

Edit: On good spread of bits.

编辑:在良好的传播位。

As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact upon performance, so we want collisions to be rare.

如本文和其他答案所述,完全惟一是不可能的,哈希冲突也是可能的。散列使用方法知道这一点,并且可以处理它,但是它确实会影响性能,所以我们希望很少发生冲突。

Further, hashes are generally re-hashed so our 32-bit number may end up being reduced to e.g. one in the range 0 to 22, and we want as good a distribution within that as possible to.

此外,散列通常被重新散列,所以我们的32位数字可能最终被简化为0到22之间的一个,我们希望尽可能好的分布在这个范围内。

We also want to balance this with not taking so long to compute our hash, that it becomes a bottleneck in itself. An imperfect balancing act.

我们还想平衡这一点,不要花太长时间来计算哈希,它本身就会成为瓶颈。一个不完美的平衡。

A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:

一个坏哈希方法的典型例子是对X, Y的坐标对这样做:

return X ^ Y;

While this does a perfectly good job of returning 2^32 possible values out of the 4^32 possible inputs, in real world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on) which all hash to zero, or matching pairs ({2,3} and {3, 2}) which will hash to the same number. We are likely better served by:

虽然这是一个完美的工作返回2 ^ 32可能值4 ^ 32可能的输入,在现实世界中使用很常见的X和Y坐标,是相等的({ 0 },{ 1 },{ 2,}等等),所有哈希为零,或配对({ 2,3 }和{ 3 2 })将散列到相同数量。我们最好的服务是:

return ((X << 16) | (x >> 16)) ^ Y;

Now, there is just as many possible values for which this is dreadful than for the former, but it tends to serve better in real-world cases.

现在,有很多可能的价值比前者更可怕,但在现实世界中它更有用。

Of course, there is a different job if you are writing a general-purpose class (no idea what possible inputs there are) or have a better idea of the purpose at hand. For example, if I was using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower-digits of the years, over the standard one. The writer of Date though can't work on such knowledge and has to try to cater for everyone.

当然,如果您正在编写一个通用类(不知道有什么可能的输入),或者对即将实现的目标有更好的理解,则有不同的工作。举个例子,如果我用日期对象,但是只知道他们都是日期(时间部分总是午夜),只有在几年之内,然后,我可能更喜欢一个定制的散列码,只用一天,月和年lower-digits,标准。然而,《约会》的作者不能研究这些知识,他必须设法迎合每个人。

Hence, If I for instance knew that a given string is always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, but it isn't clear from your question that it does) then I might use an algorithm that assigned a value from 0 to 35 (the 36 possible values for each character) to each character, and then walk through the string, each time multiplying the current value by 36 and adding the value of the next char.

因此,举例来说,如果我知道给定字符串总是由6不区分大小写字符范围[a - z]或[0 - 9](这似乎你的,从你的问题,但目前还不清楚它)然后我可以用一个算法,分配一个值从0到35(36可能值为每个字符),每个字符,然后遍历字符串,每次乘36和当前值添加下一个字符的值。

Assuming a good spread in the ids, this would be the way to go, especially if I made the order such that the lower-significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), hence surviving re-hashing to a smaller range well.

假设一个好的传播的id,这是路要走,特别是如果我做订单,这样lower-significant数字在我的散列中最频繁更改字符匹配id(如果这样的调用可以),因此幸存的主导一个较小的范围内。

However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (slower algorithm for little or even negative gain in hash quality).

然而,由于缺乏这种格式的知识,我不能确定地调用它,而且我很可能会使事情变得更糟(在哈希质量方面,算法速度更慢,甚至是负增长)。

One advantage you have is that since it's an ID in itself, then presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.

您拥有的一个优势是,由于它本身就是一个ID,因此假定其他不相等的对象都具有相同的ID,因此不需要检查其他属性。这并不总是持有。

#2


10  

You can't get a unique integer from a String of unlimited length. There are 4 billionish (2^32) unique integers, but an almost infinite number of unique strings.

你不能从无限长度的字符串中得到唯一的整数。有4 billionish(2 ^ 32)独特的整数,但一个几乎无限数量的独一无二的字符串。

String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.

hashcode()不会给你唯一的整数,但是它会尽量根据输入字符串给你不同的结果。

EDIT

编辑

Your edited question says that String.hashCode() is not recommended. This is not true, it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.

您编辑的问题表明不建议使用String.hashCode()。这是不正确的,除非你有特殊的理由不使用它。如果您有特殊原因,请提供详细信息。

#3


4  

Looks like you've got a base-36 number there (a-z + 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by using String.hashCode(), which does its best to be close to unique.

看起来你有一个base-36数字(a-z + 0-9)。为什么不把它转换成整数呢?方法(年代,36)?显然,如果有太多的唯一id,它就不能适合int类型,但是在这种情况下,对于唯一整数,您就不太幸运了,需要使用String.hashCode()来获取,它会尽力接近unique。

#4


3  

Unless your strings are limited in some way or your integers hold more bits than the strings you're trying to convert, you cannot guarantee the uniqueness.

除非你的字符串在某种程度上是受限的或者你的整数比你试图转换的字符串持有更多的比特,你不能保证它的唯一性。

Let's say you have a 32 bit integer and a 64-character character set for your strings. That means six bits per character. That will allow you to store five characters into an integer. More than that and it won't fit.

假设您有一个32位的整数和一个64位的字符串字符集。这意味着每个字符有6个字节。这将允许您将5个字符存储到一个整数中。超过这个,就不合适了。

#5


0  

One way to do it is assign each letter a value, and each place of the string it's own multiple ie a = 1, b = 2, and so on, then everything in the first digit (read left to right) would be multiplied by a prime number, the next the next prime number and so on, such that the final digit was multiplied by a prime larger than the number of possible subsets in that digit (26+1 for a space or 52+1 with capitols and so on for other supported characters). If the number is mapped back to the first digits (leftmost character) any number you generate from a unique string mapping back to 1 or 6 whatever the first letter will be, gives a unique value.

这样做的方法之一是分配一个值,每个字母和字符串的每个地方的多个ie = 1,b = 2,等等,然后一切都在第一位(从左至右)乘以一个质数,下一个下一个质数,这样最后的数字乘以一个'的数量大于可能的子集,数字(26 + 1空间或52 + 1与议会大厦等其他受支持的字符)。如果数字被映射回第一个数字(最左的字符),那么您从唯一的字符串映射回1或6所生成的任何数字,无论第一个字母是什么,都会给出一个唯一的值。

Dog might be 30,3(15),101(7) or 782, while God 33,3(15),101(4) or 482. More importantly than unique strings being generated they can be useful in generation if the original digit is kept, like 30(782) would be unique to some 12(782) for the purposes of differentiating like strings if you ever managed to go over the unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.

狗可能是30、3(15)、101(7)或782,而上帝是33、3(15)、101(4)或482。比生成唯一字符串更重要的是,如果保留原始数字,比如30(782)对于12(782)来说是唯一的,这样就可以像字符串一样进行区分,如果你想过唯一的可能性的话。狗永远是狗,但永远不会是猫或老鼠。