What index can be applied efficiently in this situation?

I ran into the following problem at work.

Given a data set, each item has D dimensions (attributes), and each dimension can take one of C values.
For example, take a data set THINGS(ID, owner, color, weight), where ID is the primary key:
the owner attribute can be alice, jack, or zuck;
the color attribute can be red, yellow, or green;
the weight attribute can be high, medium, or low.
In this data set, D=3 and C=3.

Now I want to run many queries like these, many times:
"Is there any item with owner=alice and color=red?"
"Is there any item with weight=low?"
"Is there any item with owner=alice and color=red and weight=high?"
I only need a "Yes" or "No" answer to each query.

I need to implement this from scratch, that is, without a database.
On a PC, I tried a bitmap index and an inverted index, but the data set will hold millions of items, with a dimensionality of 8~18 and a cardinality of 5~15 per dimension. As a result, the efficiency is not good enough.

Could you give me any suggestions to make it more efficient?
Thanks in advance!

1 solution

#1

You'd probably want a sorted dictionary for each dimension, where each KEY is one of the possible values for that dimension and the VALUE is the list of IDs of the items that have that value.

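# One dictionary per dimension: value -> list of IDs that have that value.
OWNER_DICTIONARY = {
    "Bob": [1, 5],
    "Jim": [2],
    "Sally": [3, 4],
    "Will": [],
}
COLOR_DICTIONARY = {
    "Blue": [5],
    "Green": [2],
    "Red": [],
    "Yellow": [1, 3, 4],
}
WEIGHT_DICTIONARY = {
    "High": [3, 5],
    "Low": [1, 2, 4],
}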

Then you simply use an INTERSECT on the VALUES (lists of IDs) that your dictionaries return for the query criteria. If the size of the intersection is greater than 0, you have a positive match.

Owner=Bob AND Weight=High

([1,5] INTERSECT [3,5]) = [5]
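
The lookup above can be sketched in Python roughly as follows. This is only a minimal illustration, not a tuned implementation: the sample rows are made up to match the dictionaries above, the names `ROWS`, `DIMENSIONS`, `build_index` and `matching_ids` are hypothetical, and sets are used instead of lists so that the intersection is a built-in operation.

```python
from collections import defaultdict

# Hypothetical sample rows consistent with the dictionaries above:
# (id, owner, color, weight)
ROWS = [
    (1, "Bob", "Yellow", "Low"),
    (2, "Jim", "Green", "Low"),
    (3, "Sally", "Yellow", "High"),
    (4, "Sally", "Yellow", "Low"),
    (5, "Bob", "Blue", "High"),
]
DIMENSIONS = ("owner", "color", "weight")


def build_index(rows):
    """Map each dimension to {value: set of IDs that have that value}."""
    index = {dim: defaultdict(set) for dim in DIMENSIONS}
    for item_id, *values in rows:
        for dim, value in zip(DIMENSIONS, values):
            index[dim][value].add(item_id)
    return index


def matching_ids(index, **criteria):
    """Intersect the ID sets of every dimension=value criterion."""
    id_sets = [index[dim].get(value, set()) for dim, value in criteria.items()]
    return set.intersection(*id_sets) if id_sets else set()


INDEX = build_index(ROWS)
print(matching_ids(INDEX, owner="Bob", weight="High"))  # {5}
```

A non-empty result means "Yes"; an empty set means "No".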

If one of the VALUES for your criteria (or one of the previous INTERSECTIONs) is empty ([]), you can short-circuit (return false) right away without having to evaluate any further.
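
A possible sketch of this short-circuiting yes/no check in Python is shown below. The `INDEX` literal and the function name `exists` are illustrative assumptions. Two choices here are additions not stated in the answer: intersecting the smallest ID sets first, which tends to empty the running result sooner, and treating a query with no criteria as "no match".

```python
# Hypothetical index: dimension -> value -> set of IDs (same data as above).
INDEX = {
    "owner": {"Bob": {1, 5}, "Jim": {2}, "Sally": {3, 4}, "Will": set()},
    "color": {"Blue": {5}, "Green": {2}, "Red": set(), "Yellow": {1, 3, 4}},
    "weight": {"High": {3, 5}, "Low": {1, 2, 4}},
}


def exists(index, criteria):
    """Return True if at least one ID satisfies every dimension=value pair."""
    id_sets = []
    for dim, value in criteria.items():
        ids = index.get(dim, {}).get(value)
        if not ids:
            # Unknown value or empty ID set: nothing can match, stop here.
            return False
        id_sets.append(ids)
    if not id_sets:
        # Assumption: a query without criteria is treated as "no match".
        return False

    # Assumption (not in the answer): start from the smallest sets.
    id_sets.sort(key=len)
    result = id_sets[0]
    for ids in id_sets[1:]:
        result = result & ids
        if not result:
            # A previous intersection is already empty: short-circuit.
            return False
    return True


print(exists(INDEX, {"owner": "Bob", "weight": "High"}))  # True
print(exists(INDEX, {"color": "Red"}))                    # False
```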

In database terms, you'd be putting a NON-CLUSTERED INDEX on each field/column and running:

EXISTS(SELECT ID FROM Table WHERE Col1=@Val1 AND Col2=@Val2 AND Col3=@Val3)
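
For comparison only, the database analogy can be tried out with Python's built-in `sqlite3` module. This is a rough sketch under assumptions: the table name `things`, the index names and the sample rows are hypothetical, and SQLite's ordinary secondary indexes created with `CREATE INDEX` stand in for the non-clustered indexes mentioned above.

```python
import sqlite3

# Hypothetical sample rows: (id, owner, color, weight)
ROWS = [
    (1, "Bob", "Yellow", "Low"),
    (2, "Jim", "Green", "Low"),
    (3, "Sally", "Yellow", "High"),
    (4, "Sally", "Yellow", "Low"),
    (5, "Bob", "Blue", "High"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE things ("
    "id INTEGER PRIMARY KEY, owner TEXT, color TEXT, weight TEXT)"
)
conn.executemany("INSERT INTO things VALUES (?, ?, ?, ?)", ROWS)

# One secondary index per column, mirroring one dictionary per dimension.
for col in ("owner", "color", "weight"):
    conn.execute(f"CREATE INDEX idx_things_{col} ON things ({col})")


def owner_weight_exists(conn, owner, weight):
    """Yes/no check for one fixed pair of columns, using EXISTS."""
    row = conn.execute(
        "SELECT EXISTS("
        "SELECT 1 FROM things WHERE owner = ? AND weight = ?)",
        (owner, weight),
    ).fetchone()
    return bool(row[0])


print(owner_weight_exists(conn, "Bob", "High"))  # True
print(owner_weight_exists(conn, "Jim", "High"))  # False
```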

EDIT: UNION -> INTERSECTION. Good catch, @ElKamina.
