What index can be applied efficiently in this situation?

Time: 2022-12-27 04:15:47


I ran into the following problem at work.

Given a data set, each item has D dimensions (attributes), and each dimension can take one of C values.
For example, take a data set THINGS(ID, owner, color, weight), where ID is the primary key:
the owner attribute can be alice, jack, or zuck;
the color attribute can be red, yellow, or green;
the weight attribute can be high, medium, or low.
In this data set, D=3 and C=3.

Now I want to run many queries, many times, such as:
"Is there any data with owner=alice and color=red?"
"Is there any data with weight=low?"
"Is there any data with owner=alice and color=red and weight=high?"
I only need a "Yes" or "No" answer to each query.

I need to do this natively, that is, without a database.
On a PC, I tried a bitmap and an inverted index, but the data set will hold on the order of a million items, with a dimensionality of 8~18 and a cardinality of 5~15. As a result, the efficiency is not good enough.

Could you give me any suggestions to make it more efficient?
Thanks in advance!

1 solution

#1


You'd probably want a sorted dictionary for each dimension, where each KEY is one of the possible values of that dimension and the VALUE is the list of IDs of the items that have that value.

OWNER_DICTIONARY = {
    "Bob": [1, 5],
    "Jim": [2],
    "Sally": [3, 4],
    "Will": []
}
COLOR_DICTIONARY = {
    "Blue": [5],
    "Green": [2],
    "Red": [],
    "Yellow": [1, 3, 4]
}
WEIGHT_DICTIONARY = {
    "High": [3, 5],
    "Low": [1, 2, 4]
}

Then you simply INTERSECT the VALUES (lists of IDs) that your dictionaries hold for the criteria in the query. If the intersection is non-empty, you have a positive match.

Owner=Bob AND Weight=High

([1,5] INTERSECT [3,5]) = [5]

If one of the VALUES for your criteria (or one of the previous INTERSECTIONs) is empty ([]), you can short circuit (return false) right away without having to evaluate further.
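
The per-dimension dictionaries, the intersection, and the short circuit could be sketched in Python roughly as follows. This is only a minimal illustration, not a tuned implementation: the names `ROWS`, `DIMENSIONS`, `build_index`, and `exists` are hypothetical, the sample rows are chosen to reproduce the dictionaries above, and intersecting the smallest ID set first is an added assumption that is not part of the original answer.

```python
from collections import defaultdict

# Hypothetical sample rows (id, owner, color, weight) that reproduce
# the example dictionaries from the answer.
ROWS = [
    (1, "Bob", "Yellow", "Low"),
    (2, "Jim", "Green", "Low"),
    (3, "Sally", "Yellow", "High"),
    (4, "Sally", "Yellow", "Low"),
    (5, "Bob", "Blue", "High"),
]
DIMENSIONS = ("owner", "color", "weight")


def build_index(rows, dimensions):
    """Build one dictionary per dimension: value -> set of item IDs."""
    index = {dim: defaultdict(set) for dim in dimensions}
    for item_id, *values in rows:
        for dim, value in zip(dimensions, values):
            index[dim][value].add(item_id)
    return index


def exists(index, criteria):
    """Return True if at least one item matches every dimension=value pair."""
    id_sets = []
    for dim, value in criteria.items():
        ids = index.get(dim, {}).get(value)
        if not ids:
            # Short circuit: a value with no IDs can never match.
            return False
        id_sets.append(ids)
    if not id_sets:
        raise ValueError("at least one criterion is required")

    # Assumption: starting from the smallest set keeps intermediates small.
    id_sets.sort(key=len)
    candidates = id_sets[0]
    for ids in id_sets[1:]:
        candidates = candidates & ids
        if not candidates:
            # Short circuit: an empty intersection stays empty.
            return False
    return True


INDEX = build_index(ROWS, DIMENSIONS)
print(exists(INDEX, {"owner": "Bob", "weight": "High"}))  # True (ID 5)
print(exists(INDEX, {"owner": "Jim", "weight": "High"}))  # False
print(exists(INDEX, {"color": "Red"}))                    # False
```

Sets are used here instead of lists so that each intersection is a single `&` operation; sorted lists with a merge-style intersection would be another way to realize the same idea.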

In database terms, you'd be putting a NON-CLUSTERED INDEX on each field/column and running:

EXISTS(SELECT ID FROM Table WHERE Col1=@Val1 AND Col2=@Val2 AND Col3=@Val3)
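
To make the database analogy concrete, here is a small sketch using Python's built-in `sqlite3` module. It is only meant to illustrate the idea, since the question rules out a database. The table name `things`, the index names, and the helper `exists_in_db` are hypothetical, and SQLite has no NON-CLUSTERED keyword: an ordinary `CREATE INDEX` is assumed here to play the same role of a secondary index per column.

```python
import sqlite3

# In-memory database with the same hypothetical sample rows as the dictionaries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE things ("
    "id INTEGER PRIMARY KEY, owner TEXT, color TEXT, weight TEXT)"
)
conn.executemany(
    "INSERT INTO things VALUES (?, ?, ?, ?)",
    [
        (1, "Bob", "Yellow", "Low"),
        (2, "Jim", "Green", "Low"),
        (3, "Sally", "Yellow", "High"),
        (4, "Sally", "Yellow", "Low"),
        (5, "Bob", "Blue", "High"),
    ],
)

# One secondary index per column, mirroring one dictionary per dimension.
for col in ("owner", "color", "weight"):
    conn.execute(f"CREATE INDEX idx_things_{col} ON things ({col})")


def exists_in_db(conn, owner, weight):
    """Return True if any row has the given owner and weight."""
    row = conn.execute(
        "SELECT EXISTS("
        "SELECT 1 FROM things WHERE owner = ? AND weight = ?)",
        (owner, weight),
    ).fetchone()
    return bool(row[0])


print(exists_in_db(conn, "Bob", "High"))  # True (ID 5)
print(exists_in_db(conn, "Jim", "High"))  # False
```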

EDIT: UNION -> INTERSECTION. Good catch, @ElKamina.
