I ran into the following problem at work.
Given a data set, each item has D dimensions (attributes), and each dimension can take one of C values.
For example, take a data set THINGS(ID, owner, color, weight), where ID is the primary key:
the owner attribute can be alice, jack, or zuck;
the color attribute can be red, yellow, or green;
the weight attribute can be high, medium, or low.
In this data set, D=3 and C=3.
Now I want to run many queries like these, many times:
"Is there any data with owner=alice and color=red?"
"Is there any data with weight=low?"
"Is there any data with owner=alice and color=red and weight=high?"
I only need a yes-or-no answer to each query.
I need to do this natively, that is, without a database.
On a PC, I tried bitmaps and an inverted index, but the data set will hold on the order of a million items, with a dimensionality of 8~18 and a cardinality of 5~15. As a result, the efficiency is not good enough.
Could you give me any suggestions to make this more efficient?
Thanks in advance!
1 Solution
#1
You'd probably want a sorted dictionary for each dimension, where the KEY is a possible value of the dimension and the VALUE is the list of IDs that have that value.
OWNER_DICTIONARY = {
Bob: [1,5],
Jim: [2],
Sally: [3,4],
Will: []
}
COLOR_DICTIONARY = {
Blue: [5],
Green: [2],
Red: [],
Yellow: [1,3,4]
}
WEIGHT_DICTIONARY = {
Low: [1,2,4],
High: [3,5]
}
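As a rough sketch of how these dictionaries could be built from raw rows, here is one way to do it in Python. The sample rows, the `build_index` helper, and the use of sets instead of lists are all assumptions made for illustration; the rows were chosen so that the result matches the dictionaries above.

```python
from collections import defaultdict

# Hypothetical sample rows (id, owner, color, weight), chosen to
# reproduce the example dictionaries shown above.
ROWS = [
    (1, "Bob", "Yellow", "Low"),
    (2, "Jim", "Green", "Low"),
    (3, "Sally", "Yellow", "High"),
    (4, "Sally", "Yellow", "Low"),
    (5, "Bob", "Blue", "High"),
]
DIMENSIONS = ("owner", "color", "weight")


def build_index(rows, dimensions):
    """Return {dimension: {value: set of IDs having that value}}."""
    index = {dim: defaultdict(set) for dim in dimensions}
    for row_id, *values in rows:
        for dim, value in zip(dimensions, values):
            index[dim][value].add(row_id)
    return index


INDEX = build_index(ROWS, DIMENSIONS)
print(sorted(INDEX["owner"]["Bob"]))
print(sorted(INDEX["weight"]["Low"]))
```

Values that never occur (such as Will or Red above) simply have no entry here, which a lookup can treat the same as an empty list.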
Then you simply use an INTERSECT on the VALUES (lists of IDs) of your dictionaries. If the intersection is non-empty, you have a positive match.
Owner=Bob AND Weight=High
([1,5] INTERSECT [3,5]) = [5]
If one of the VALUES for your criteria (or one of the previous INTERSECTIONs) is empty ([]), you can short circuit (return false) right away without having to evaluate further.
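The intersect-and-short-circuit idea could look roughly like this in Python. The `exists` function, the nested `INDEX` layout, and the choice to intersect the smallest ID sets first are assumptions of this sketch, not part of the original answer; the data mirrors the dictionaries above.

```python
# Hypothetical index with the same contents as the dictionaries above,
# storing sets of IDs so that intersection is cheap.
INDEX = {
    "owner": {"Bob": {1, 5}, "Jim": {2}, "Sally": {3, 4}, "Will": set()},
    "color": {"Blue": {5}, "Green": {2}, "Red": set(), "Yellow": {1, 3, 4}},
    "weight": {"Low": {1, 2, 4}, "High": {3, 5}},
}


def exists(index, **criteria):
    """Return True if at least one ID matches every dimension=value pair."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    id_sets = []
    for dim, value in criteria.items():
        ids = index[dim].get(value, set())
        if not ids:
            return False  # an empty VALUE can never match: short circuit
        id_sets.append(ids)
    # Start from the smallest set so intermediate results stay small.
    id_sets.sort(key=len)
    result = set(id_sets[0])
    for ids in id_sets[1:]:
        result &= ids
        if not result:
            return False  # an empty intermediate intersection: short circuit
    return True


print(exists(INDEX, owner="Bob", weight="High"))
print(exists(INDEX, owner="Bob", color="Red"))
```

Since only a yes-or-no answer is needed, the function never materializes more than the running intersection.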
In database terms, you'd be putting a NON-CLUSTERED INDEX on each field/column and doing
EXISTS(SELECT ID FROM Table WHERE Col1=@Val1 AND Col2=@Val2 AND Col3=@Val3)
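For comparison only, the database analogy can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `things` table, its sample rows, the index names, and the `exists_in_db` helper are invented for illustration, and SQLite's ordinary secondary indexes stand in for the non-clustered indexes mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE things ("
    "id INTEGER PRIMARY KEY, owner TEXT, color TEXT, weight TEXT)"
)
# Hypothetical sample rows matching the example dictionaries.
conn.executemany(
    "INSERT INTO things VALUES (?, ?, ?, ?)",
    [
        (1, "Bob", "Yellow", "Low"),
        (2, "Jim", "Green", "Low"),
        (3, "Sally", "Yellow", "High"),
        (4, "Sally", "Yellow", "Low"),
        (5, "Bob", "Blue", "High"),
    ],
)
# One secondary index per column, as the answer suggests.
for col in ("owner", "color", "weight"):
    conn.execute(f"CREATE INDEX idx_things_{col} ON things ({col})")


def exists_in_db(conn, owner, weight):
    """Return True if any row has the given owner and weight."""
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM things WHERE owner = ? AND weight = ?)",
        (owner, weight),
    ).fetchone()
    return bool(row[0])


print(exists_in_db(conn, "Bob", "High"))
print(exists_in_db(conn, "Jim", "High"))
```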
EDIT: UNION -> INTERSECTION, good catch @ElKamina