SQL中二进制字符串的汉明距离

时间:2021-10-01 19:15:12

I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied value, i.e. something like:

我的数据库中有一个表,我将SHA256哈希存储在BINARY(32)列中。我正在寻找一种方法来计算列中条目的汉明距离到提供的值,即:

SELECT * FROM table 
  ORDER BY HAMMINGDISTANCE(hash, UNHEX(<insert supplied sha256 hash here>)) ASC 
  LIMIT 10

(in case you're wondering, the Hamming distance of strings A and B is defined as BIT_COUNT(A^B), where ^ is the bitwise XOR operator and BIT_COUNT returns the number of 1s in the binary string).

(如果您想知道,字符串A和B的汉明距离定义为BIT_COUNT(A ^ B),其中^是按位XOR运算符,BIT_COUNT返回二进制字符串中的1的数量)。

Now, I know that both the ^ operator and BIT_COUNT function only work on INTEGERs and so I'd say that probably the only way to do it would be to break up the binary strings in substrings, cast each binary substring to integer, compute the Hamming distance substring-wise and then add them. The problem with this is that it sounds terribly complicated, not efficient and definitely not elegant. My question therefore is: could you suggest any better way? (please note that I'm on shared hosting and therefore I can't modify the DB server or load libraries)

现在,我知道^运算符和BIT_COUNT函数都只适用于INTEGER,所以我想说可能唯一的方法就是分解子字符串中的二进制字符串,将每个二进制子字符串转换为整数,计算汉明距离子串,然后添加它们。这个问题是它听起来非常复杂,效率不高,绝对不优雅。因此,我的问题是:你能提出更好的建议吗? (请注意我在共享主机上,因此我无法修改数据库服务器或加载库)

edit(1): Obviously loading the whole table in PHP and doing the computations there would be possible but I'd rather avoid it because this table will probably grow quite large.

编辑(1):显然在PHP中加载整个表并进行计算是可能的,但我宁愿避免它,因为这个表可能会变得非常大。

edit(2): The DB server is MySQL 5.1

编辑(2):数据库服务器是MySQL 5.1

edit(3): My answer below contains the code that I just described above.

编辑(3):我的答案包含我刚才描述的代码。

edit(4): I just found out that using 4 BIGINTs to store the hash instead of a BINARY(32) yields massive speed improvements (more than 100 times faster). See the comments to my answer below.

编辑(4):我刚刚发现使用4个BIGINT来存储哈希而不是BINARY(32)会产生大量的速度提升(速度提高100倍以上)。请参阅下面的评论。

2 个解决方案

#1


14  

It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only fast way to get decent performance is to split the content of the BINARY column in multiple BIGINT columns, each containing an 8-byte substring of the original data.

看起来将数据存储在BINARY列中是一种必然表现不佳的方法。获得良好性能的唯一快速方法是在多个BIGINT列中分割BINARY列的内容,每个列包含原始数据的8字节子字符串。

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

在我的情况下(32字节),这意味着使用4个BIGINT列并使用此函数:

CREATE FUNCTION HAMMINGDISTANCE(
  A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, 
  B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(A0 ^ B0) +
  BIT_COUNT(A1 ^ B1) +
  BIT_COUNT(A2 ^ B2) +
  BIT_COUNT(A3 ^ B3);

Using this approach, in my testing, is over 100 times faster than using the BINARY approach.

在我的测试中,使用这种方法比使用BINARY方法快100多倍。


FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

FWIW,这是我在解释问题时所暗示的代码。我们欢迎更好的方法来完成同样的事情(我特别不喜欢二进制>十六进制>十进制转换):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );

#2


1  

Interesting question, I've found a way to do this for a binary(3) that might work as well for a binary(32):

有趣的问题,我找到了一种方法来为二进制文件(3)执行此操作,这可能对二进制文件(32)也有效:

drop table if exists BinaryTest;
create table  BinaryTest (hash binary(3));
insert BinaryTest values (0xAAAAAA);

set @supplied = cast(0x888888 as binary);

select  length(replace(concat(
            bin(ascii(substr(hash,1,1)) ^ ascii(substr(@supplied,1,1))),
            bin(ascii(substr(hash,2,1)) ^ ascii(substr(@supplied,2,1))),
            bin(ascii(substr(hash,3,1)) ^ ascii(substr(@supplied,3,1)))
        ),'0',''))
from    BinaryTest;

The replace removes any all zeroes, and the length of remainder is the number of ones. (The conversion to binary omits leading zeroes, so counting the zeroes would not work.)

替换删除所有零,并且余数的长度是1的数量。 (转换为二进制会省略前导零,因此计算零将无效。)

This prints 6, which matches the number of ones in

这打印6,与中的数量相匹配

0xAAAAAA ^ 0x888888 = 0x222222 = 0b1000100010001000100010

#1


14  

It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only fast way to get decent performance is to split the content of the BINARY column in multiple BIGINT columns, each containing an 8-byte substring of the original data.

看起来将数据存储在BINARY列中是一种必然表现不佳的方法。获得良好性能的唯一快速方法是在多个BIGINT列中分割BINARY列的内容,每个列包含原始数据的8字节子字符串。

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

在我的情况下(32字节),这意味着使用4个BIGINT列并使用此函数:

CREATE FUNCTION HAMMINGDISTANCE(
  A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, 
  B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(A0 ^ B0) +
  BIT_COUNT(A1 ^ B1) +
  BIT_COUNT(A2 ^ B2) +
  BIT_COUNT(A3 ^ B3);

Using this approach, in my testing, is over 100 times faster than using the BINARY approach.

在我的测试中,使用这种方法比使用BINARY方法快100多倍。


FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

FWIW,这是我在解释问题时所暗示的代码。我们欢迎更好的方法来完成同样的事情(我特别不喜欢二进制>十六进制>十进制转换):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );

#2


1  

Interesting question, I've found a way to do this for a binary(3) that might work as well for a binary(32):

有趣的问题,我找到了一种方法来为二进制文件(3)执行此操作,这可能对二进制文件(32)也有效:

drop table if exists BinaryTest;
create table  BinaryTest (hash binary(3));
insert BinaryTest values (0xAAAAAA);

set @supplied = cast(0x888888 as binary);

select  length(replace(concat(
            bin(ascii(substr(hash,1,1)) ^ ascii(substr(@supplied,1,1))),
            bin(ascii(substr(hash,2,1)) ^ ascii(substr(@supplied,2,1))),
            bin(ascii(substr(hash,3,1)) ^ ascii(substr(@supplied,3,1)))
        ),'0',''))
from    BinaryTest;

The replace removes any all zeroes, and the length of remainder is the number of ones. (The conversion to binary omits leading zeroes, so counting the zeroes would not work.)

替换删除所有零,并且余数的长度是1的数量。 (转换为二进制会省略前导零,因此计算零将无效。)

This prints 6, which matches the number of ones in

这打印6,与中的数量相匹配

0xAAAAAA ^ 0x888888 = 0x222222 = 0b1000100010001000100010