I'm going to build a platform for online content. This system will basically have two entities: Content and Tag. Tags are related to contents in a many-to-many fashion.
If I use a SQL database, it would be modeled like:
CONTENT 1-----* TC *-----1 TAG
Given this model, I need to make queries like:
1) Get content by id
2) Get content by one tag - "List all MATH* contents"
(*) MATH is a tag
3) Get content by multiple tags - "List all HARD* MATH* contents"
4) Filter the data above by Content attributes - "List all HARD* MATH* contents that were created last week"
1 and 2 are hardly a problem, but I believe 3 and 4 can get tricky.
In a relational world, for query (4), I could start from CONTENT and join with TC multiple times, like so:
select distinct c.* from CONTENT c, TC tc1, TC tc2
where tc1.content_id = c.id
and tc2.content_id = c.id
and tc1.tag_id = <math_tag_id>
and tc2.tag_id = <hard_tag_id>
and c.creation_date > <last_week>
But I'm not sure this would scale well when:
- TC has a lot of data
- I need to query the intersection of 4 to 8 tags
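One way to avoid adding a self-join per tag is to join TC once and count the matching tags per content row (sometimes called relational division). The sketch below tries this against an in-memory SQLite database. The lowercase table names, the `title` and `name` columns, the sample rows, and the helper names `build_demo_db` and `content_with_all_tags` are all assumptions made for illustration; how well this performs on a large TC table would still have to be measured on the real database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a CONTENT / TC / TAG layout (sample data is made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT, creation_date TEXT);
        CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE tc (
            content_id INTEGER REFERENCES content(id),
            tag_id INTEGER REFERENCES tag(id),
            PRIMARY KEY (tag_id, content_id)
        );
        INSERT INTO tag VALUES (1, 'MATH'), (2, 'HARD'), (3, 'EASY');
        INSERT INTO content VALUES
            (10, 'Old hard math', '2020-01-01'),
            (11, 'New hard math', '2024-06-10'),
            (12, 'New easy math', '2024-06-11');
        INSERT INTO tc (content_id, tag_id) VALUES
            (10, 1), (10, 2), (11, 1), (11, 2), (12, 1), (12, 3);
    """)
    return conn


def content_with_all_tags(conn, tag_names, created_after):
    """Return ids of content that carries every tag in tag_names and was created after a date."""
    tag_names = sorted(set(tag_names))  # duplicates would break the count check below
    placeholders = ",".join("?" for _ in tag_names)
    sql = f"""
        SELECT c.id
        FROM content c
        JOIN tc ON tc.content_id = c.id
        JOIN tag t ON t.id = tc.tag_id
        WHERE t.name IN ({placeholders})
          AND c.creation_date > ?
        GROUP BY c.id
        HAVING COUNT(DISTINCT t.id) = ?
        ORDER BY c.id
    """
    # ISO-8601 date strings compare correctly as plain text
    params = [*tag_names, created_after, len(tag_names)]
    return [row[0] for row in conn.execute(sql, params)]


conn = build_demo_db()
hard_math_recent = content_with_all_tags(conn, ["MATH", "HARD"], "2024-06-01")
```

The query text stays the same shape whether 2 or 8 tags are requested; only the `IN` list and the expected count change.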
Any thoughts on this?
In the NoSQL world, the only database I have worked with so far is BigTable. As far as I can tell, BigTable might not be the best choice for this problem. If I use the same "tables", for (3) I'd probably go with something like this (assume ndb + Python):
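from google.appengine.ext import ndb

# Note: IN matches TC rows carrying ANY of the listed tags (a union, not an intersection)
tcs = TC.query(
    TC.tag_key.IN([math_tag_key, hard_tag_key])
).fetch()
content_keys = [tc.content_key for tc in tcs]
distinct_content_keys = set(content_keys)  # eliminate repeated values
contents = ndb.get_multi(distinct_content_keys)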
But,
- I don't know how well this would perform when TC.tag_key.IN receives 4 to 8 tags (any thoughts on this?)
- I can't make query (4) because I can't join with CONTENT (BigTable doesn't do joins). The alternative would be replicating CONTENT's attributes in TC, which is a PITA. (is there a better way to do this in BigTable?)
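To make the trade-off in the two points above concrete, here is a minimal plain-Python sketch of the denormalized alternative: keep the tags next to each content record and maintain a tag-to-content index, so a multi-tag query becomes a set intersection followed by an attribute filter, with no join. This is not ndb or BigTable code; `TagIndex`, its methods, and the sample data are hypothetical names used only to illustrate the idea.

```python
from datetime import date


class TagIndex:
    """In-memory stand-in for a content store plus a tag -> content-id index."""

    def __init__(self):
        self.contents = {}  # content_id -> {"created": date, "tags": set}
        self.by_tag = {}    # tag name -> set of content_ids

    def add(self, content_id, created, tags):
        """Store a content record and register it under each of its tags."""
        self.contents[content_id] = {"created": created, "tags": set(tags)}
        for tag in tags:
            self.by_tag.setdefault(tag, set()).add(content_id)

    def find(self, tags, created_after=None):
        """Return sorted ids of content carrying all tags, optionally created after a date."""
        if not tags:
            return []
        # Start from the smallest id set so the intersection shrinks quickly
        postings = sorted((self.by_tag.get(t, set()) for t in tags), key=len)
        ids = set(postings[0])
        for posting in postings[1:]:
            ids &= posting
        if created_after is not None:
            ids = {i for i in ids if self.contents[i]["created"] > created_after}
        return sorted(ids)


index = TagIndex()
index.add(1, date(2024, 6, 10), ["MATH", "HARD"])
index.add(2, date(2024, 1, 1), ["MATH", "HARD"])
index.add(3, date(2024, 6, 11), ["MATH", "EASY"])
recent_hard_math = index.find(["HARD", "MATH"], created_after=date(2024, 6, 1))
```

The cost is the one already mentioned: tag data is stored twice, so every tag change has to update both the content record and the index.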
So, the bigger question here is: what database solves this problem best? I'm inclined to look into Graph databases to see how well they might solve this, but I think I need some expert opinions about it.
Is a graph DB really the way to go? Is Neo4J the best option?
1 solution
#1
The kind of problem you describe is one of the areas where graph databases excel compared to relational DBs. If the answer in a relational DB results in many joins (how many counts as "many" depends on the DB, but it starts to be an issue at maybe 8 and certainly by 16), then you should look at a graph DB.
In addition to Neo4J, you may want to look at Titan. Either way, consider whether you want something like Blueprints or Spring on top to help isolate you from the implementation specifics (though that can bring other problems if you really need high performance).