Different results from Murmur3 in Scala and Guava

Date: 2022-11-10 20:46:03

I am trying to generate hashes using the Murmur3 algorithm. The hashes are consistent across runs, but Scala and Guava return different values for the same input.

import com.google.common.hash.Hashing
import org.scalatest.FunSuite

import scala.util.hashing.MurmurHash3

class package$Test extends FunSuite {
  test("Generate hashes") {
    println(s"Seed = ${MurmurHash3.stringSeed}")
    val vs = Set("abc", "test", "bucket", 111.toString)
    vs.foreach { x =>
      println(s"[SCALA] Hash for $x = ${MurmurHash3.stringHash(x).abs % 1000}")
      // NB: the no-Charset hashString overload only exists in older Guava versions
      println(s"[GUAVA] Hash for $x = ${Hashing.murmur3_32().hashString(x).asInt().abs % 1000}")
      println(s"[GUAVA with seed] Hash for $x = ${Hashing.murmur3_32(MurmurHash3.stringSeed).hashString(x).asInt().abs % 1000}")
      println()
    }
  }
}


Seed = -137723950
[SCALA] Hash for abc = 174
[GUAVA] Hash for abc = 419
[GUAVA with seed] Hash for abc = 195

[SCALA] Hash for test = 588
[GUAVA] Hash for test = 292
[GUAVA with seed] Hash for test = 714

[SCALA] Hash for bucket = 413
[GUAVA] Hash for bucket = 22
[GUAVA with seed] Hash for bucket = 414

[SCALA] Hash for 111 = 250
[GUAVA] Hash for 111 = 317
[GUAVA with seed] Hash for 111 = 958

Why am I getting different hashes?

2 Answers

#1


It looks to me like Scala's hashString converts pairs of UTF-16 chars to ints differently than Guava's hashUnencodedChars does (the hashString overload with no Charset was renamed to hashUnencodedChars).

Scala:

val data = (str.charAt(i) << 16) + str.charAt(i + 1)

Guava:

int k1 = input.charAt(i - 1) | (input.charAt(i) << 16);

In Guava, the char at index i becomes the 16 least significant bits of the int, and the char at i + 1 becomes the 16 most significant bits. In the Scala implementation, that's reversed: the char at i is the most significant while the char at i + 1 is the least significant. (The fact that the Scala implementation uses + rather than | could also be significant, I imagine.)
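
To make the difference concrete, here is a minimal sketch (variable names are mine) that packs the same pair of chars both ways:

val a = 'a'
val b = 'b'
val scalaPacked = (a << 16) + b  // 'a' in the high 16 bits: 0x00610062
val guavaPacked = a | (b << 16)  // 'a' in the low 16 bits:  0x00620061
println(f"Scala-style: 0x$scalaPacked%08x, Guava-style: 0x$guavaPacked%08x")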

Note that the Guava implementation is equivalent to using ByteBuffer.putChar(c) twice to put two characters into a little-endian ByteBuffer, then using ByteBuffer.getInt() to get an int value back out. The Guava implementation is also equivalent to encoding the characters to bytes using UTF-16LE and hashing those bytes. The Scala implementation is not equivalent to encoding the string in any of the standard charsets that JVMs are required to support. In general, I'm not sure what precedent (if any) Scala has for doing it the way it does.
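
That equivalence is easy to check directly; a minimal sketch, assuming a Guava version recent enough to have hashUnencodedChars (15+):

import java.nio.charset.StandardCharsets
import com.google.common.hash.Hashing

val s = "abc"
// Hashing the raw chars should equal hashing the string's UTF-16LE bytes
val viaChars = Hashing.murmur3_32().hashUnencodedChars(s).asInt()
val viaBytes = Hashing.murmur3_32().hashBytes(s.getBytes(StandardCharsets.UTF_16LE)).asInt()
assert(viaChars == viaBytes)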

Edit:

The Scala implementation differs from the Guava implementation in another way as well: it passes the number of chars being hashed to the finalizeHash method, whereas Guava's implementation passes the number of bytes to the equivalent fmix method.
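
Putting both differences in one place, here is a sketch of Scala's stringHash loop rebuilt on the standard library's public mix, mixLast, and finalizeHash helpers (my reconstruction, not verbatim library source):

import scala.util.hashing.MurmurHash3

def scalaStringHash(str: String, seed: Int = MurmurHash3.stringSeed): Int = {
  var h = seed
  var i = 0
  while (i + 1 < str.length) {
    // char at i goes into the high 16 bits, char at i + 1 into the low 16
    h = MurmurHash3.mix(h, (str.charAt(i) << 16) + str.charAt(i + 1))
    i += 2
  }
  if (i < str.length) h = MurmurHash3.mixLast(h, str.charAt(i).toInt)
  MurmurHash3.finalizeHash(h, str.length) // finalized with the char count, not the byte count
}

assert(scalaStringHash("abc") == MurmurHash3.stringHash("abc"))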

#2


I believe hashString(x, StandardCharsets.UTF_16BE) should match Scala's behavior. Let us know.
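
A quick way to check this suggestion; a sketch (note that Scala's seed also has to be passed, as in the question's third println):

import java.nio.charset.StandardCharsets
import com.google.common.hash.Hashing
import scala.util.hashing.MurmurHash3

val x = "abc"
val scalaHash = MurmurHash3.stringHash(x)
val guavaHash = Hashing.murmur3_32(MurmurHash3.stringSeed)
  .hashString(x, StandardCharsets.UTF_16BE)
  .asInt()
println(s"Scala: $scalaHash, Guava over UTF-16BE: $guavaHash")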

(Also, please upgrade your Guava to something newer!)
