Storing and indexing binary strings in a database

Date: 2021-06-29 17:00:37

A binary string as defined here is a fixed-size "array" of bits. I call them strings because there is no order on them (sorting or indexing them as numbers has no meaning), and each bit is independent of the others. Each such string is N bits long, with N in the hundreds.

I need to store these strings and, given a new binary string, query for its nearest neighbor using the Hamming distance as the distance metric.
There are specialized data structures (metric trees) for metric-based search (VP-trees, cover trees, M-trees), but I need to use a regular database (MongoDB in my case).

Is there some indexing function that can be applied to the binary strings so that the DB accesses only a subset of the records before performing the one-to-one Hamming distance match? Alternatively, how would it be possible to implement such a Hamming-based search on a standard DB?

2 Answers

#1


3  

The Hamming distance is a metric, so it satisfies the triangle inequality. For each bitstring in your database, you could store its Hamming distance to some predefined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.

So let's say

C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)

So for each B, you would store its corresponding distB.
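
As a rough illustration of what storing distB could look like, here is a minimal sketch. It assumes the bitstrings are held as Python integers, and the names `PIVOT`, `make_record`, and the `bits` field are made up for this example (only `hamming_dist` and `distB` come from the notation above). The returned dict is just the shape of a document you might insert; no database call is made here.

```python
# Hypothetical constant bitstring C (8 bits here to keep the example small).
PIVOT = 0b10110010


def hamming_dist(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def make_record(bits: int, n_bits: int) -> dict:
    """Build the document to store: the raw bits plus the distance to PIVOT."""
    return {
        # Raw bytes, so the database does not treat the value as a number.
        "bits": bits.to_bytes((n_bits + 7) // 8, "big"),
        # Precomputed once at insert time; this is the field you would index.
        "distB": hamming_dist(bits, PIVOT),
    }


record = make_record(0b10110000, n_bits=8)
```

The idea is that distB is computed once per record at write time, so a query only has to compute distS for the new bitstring.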

A lower bound for hamming(B,S) would then be abs(distB-distS), and an upper bound would be distB+distS. Both follow from the triangle inequality applied through C.

You can discard every B whose lower bound is higher than the lowest upper bound found across the database, since such a B cannot be the nearest neighbor.
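
The filtering step can be sketched as follows. This is only an in-memory stand-in, assuming integer bitstrings and a non-empty collection; `build_index` and `nearest` are hypothetical names, and the sorted list plays the role of a database index on distB. In a real database the comprehension would presumably become a range query on the stored distB field.

```python
def hamming_dist(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def build_index(bitstrings, pivot):
    """Sorted list of (distB, B) pairs; stands in for an index on distB."""
    return sorted((hamming_dist(b, pivot), b) for b in bitstrings)


def nearest(index, pivot, query):
    """Nearest neighbor of query, comparing only records that survive the filter."""
    dist_s = hamming_dist(query, pivot)
    # Lowest upper bound over all records: min(distB) + distS.
    # index is sorted by distB, so index[0][0] is the smallest distB.
    cutoff = index[0][0] + dist_s
    # Keep B only if its lower bound abs(distB - distS) does not exceed the cutoff.
    # Equivalent range condition: distS - cutoff <= distB <= distS + cutoff.
    survivors = [b for dist_b, b in index if abs(dist_b - dist_s) <= cutoff]
    # Exact one-to-one Hamming comparison on the survivors only.
    return min(survivors, key=lambda b: hamming_dist(b, query))


pivot = 0b00001111
data = [0b00000000, 0b11110000, 0b10101010, 0b00001110, 0b11111111]
best = nearest(build_index(data, pivot), pivot, 0b00001100)
```

How many records this removes depends heavily on the data and on the choice of C; with a single constant bitstring the cutoff can be loose, so treat this as a pre-filter rather than a complete index.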

I'm not 100% sure of the optimal way to choose C. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.

#2


2  

You could use an approach similar to Levenshtein automata, applied to bitstrings:

  1. Compute the first (lexicographically smallest) bitstring that would meet your Hamming-distance criteria.

  2. Fetch the first result from the database that's greater than or equal to that value.

  3. If the value is a match, output it and fetch the next result. Otherwise, compute the next value greater than it that is a match, and go to step 2.

This is equivalent to doing a merge join over your database and the list of possible matches, without having to generate every possible match. It'll reduce the search space, but it's still likely to require a significant number of queries.
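
One possible reading of these steps, as a sketch: it assumes the "criteria" are a maximum Hamming distance k from the query, that bitstrings are stored as equal-length strings of '0' and '1' (so lexicographic order is well defined), and that a sorted Python list stands in for the database's ordered index. The names `next_match` and `search` are made up for this example.

```python
from bisect import bisect_left


def hamming(a: str, b: str) -> int:
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def next_match(x: str, s: str, k: int):
    """Smallest bit string y >= x with hamming(y, s) <= k, or None if none exists."""
    if hamming(x, s) <= k:
        return x
    # prefix_cost[i] = mismatches between x[:i] and s[:i]
    prefix_cost = [0]
    for xb, sb in zip(x, s):
        prefix_cost.append(prefix_cost[-1] + (xb != sb))
    # Keep the longest possible prefix of x, then flip one '0' to '1'.
    for i in range(len(x) - 1, -1, -1):
        if x[i] != "0":
            continue
        budget = k - prefix_cost[i] - (s[i] != "1")
        if budget < 0:
            continue
        # Fill the rest with the smallest suffix that stays within the budget.
        suffix = []
        for sb in s[i + 1:]:
            if sb == "1" and budget > 0:
                suffix.append("0")  # spend one mismatch to get a smaller string
                budget -= 1
            else:
                suffix.append(sb)
        return x[:i] + "1" + "".join(suffix)
    return None


def search(sorted_keys, s: str, k: int):
    """All keys within distance k of s, skipping ranges that cannot match."""
    results = []
    first = next_match("0" * len(s), s, k)             # step 1
    idx = bisect_left(sorted_keys, first)              # step 2
    while idx < len(sorted_keys):
        found = sorted_keys[idx]
        if hamming(found, s) <= k:                     # step 3: a match
            results.append(found)
            idx += 1
        else:                                          # step 3: skip ahead
            target = next_match(found, s, k)
            if target is None:
                break
            idx = bisect_left(sorted_keys, target, idx + 1)
    return results


keys = sorted(format(v, "08b") for v in range(0, 256, 7))
matches = search(keys, "10110010", 2)
```

Each `bisect_left` call corresponds to one "first value greater than or equal to" lookup against the database, which is where the potentially large number of queries mentioned above comes from.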
