Performance of arbitrary queries with Neo4j

Time: 2022-12-12 18:03:22

I was reading a paper published by Neo4j a while ago: http://dist.neo4j.org/neo-technology-introduction.pdf

On the second-to-last page, the Drawbacks section states that Neo4j is not good for arbitrary queries.

Say I had user nodes with the following properties: NAME, AGE, GENDER

And the following relationships: LIKE (points to a Sports, Technology, etc. node) and FRIEND (points to another USER).

Is Neo4j not very efficient at querying something similar to:

Find FRIENDS (of a given node) that LIKE Sports, Tech, and Reading and are over the age of 21.

That means you must first find the FRIEND edges of USER1, then follow the LIKE edges of each friend and check whether the target node is called Sports, and also check whether that friend's age property is > 21.
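
To make the steps concrete, here is a minimal sketch of that traversal over an in-memory graph. It is not Neo4j code: the dictionaries, user ids, and the function name `friends_who_like` are all made up for illustration, and it assumes the query means friends who like all three topics.

```python
# Hypothetical in-memory graph; every id and value here is illustrative.
users = {
    "user1": {"name": "Ann", "age": 30, "gender": "F"},
    "user2": {"name": "Bob", "age": 25, "gender": "M"},
    "user3": {"name": "Cid", "age": 19, "gender": "M"},
    "user4": {"name": "Dee", "age": 40, "gender": "F"},
}
# FRIEND edges: user id -> list of friend ids
friend = {"user1": ["user2", "user3", "user4"]}
# LIKE edges: user id -> set of topic node names
like = {
    "user2": {"Sports", "Tech", "Reading"},
    "user3": {"Sports", "Tech", "Reading"},
    "user4": {"Sports"},
}


def friends_who_like(start, topics, min_age):
    """Follow FRIEND edges from start, then filter on age and LIKE targets."""
    wanted = set(topics)
    result = []
    for f in friend.get(start, []):
        old_enough = users[f]["age"] > min_age
        # Assumption: the friend must like ALL requested topics.
        likes_all = wanted <= like.get(f, set())
        if old_enough and likes_all:
            result.append(f)
    return result


matches = friends_who_like("user1", ["Sports", "Tech", "Reading"], 21)
print(matches)
```

The work done is proportional to the number of friends of the start node and their LIKE edges, not to the total number of users in the graph.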

Is this a poor data model to begin with, especially for a graph database? The reason for the LIKE relationship is so that you can also find all people who LIKE Sports.

What would be a better database choice for this: Redis, Cassandra, HBase, PostgreSQL? And why?

Does anyone have any empirical data regarding this?

1 answer

#1


21  

This is a general question about the nature of graph databases. Hopefully one of the neo4j devs will jump in here, but here is my understanding.

You can think of any database as being "naturally indexed" in a certain way. In a relational database, when you look up a record in storage, generally the next record is stored right next to it in storage. We might call this a "natural index" because if what you want to do is scan through a bunch of records, the relational structure is just fundamentally set up to make that perform really well.

Graph databases on the other hand are generally naturally indexed by relationships. (Neo4J devs, jump in if this needs refinement in terms of how neo4j does storage on disk). This means that in general, graph databases traverse relationships very quickly, but perform less well on mass/bulk queries.
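
As a rough illustration of what "naturally indexed by relationships" can mean, here is a toy model in which each node holds direct references to its neighbours. This is only a sketch and not Neo4j's actual on-disk format; the `adjacency` dictionary and the `within_hops` function are invented for the example.

```python
from collections import deque

# Toy model only: this is not how Neo4j stores data on disk.
# Each node keeps direct references to its outgoing neighbours.
adjacency = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
    "z": [],  # unrelated node; a traversal from "a" never touches it
}


def within_hops(start, max_hops):
    """Breadth-first walk that only visits nodes reachable from start."""
    seen = {start}
    frontier = deque([(start, 0)])
    visited_order = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                visited_order.append(nxt)
                frontier.append((nxt, depth + 1))
    return visited_order


two_hops = within_hops("a", 2)
print(two_hops)
```

The traversal cost depends on how many neighbours are followed, while a bulk query over every node would still have to touch all of `adjacency`.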

Now, we're only talking about relative performance. Here's an example of an RDBMS style query. I'd expect MySQL to blow away neo4j in performance on this query:

MATCH (n) WHERE n.name = 'Abe' RETURN n;

Note that this exploits no relationships at all, and forces the DB to scan ALL nodes. You could improve this by narrowing it down to a certain label, or by indexing on name, but in general, if you had a MySQL table of "people" with a "name" column, an RDBMS is going to kick ass on queries like this, and graph is going to do less well.
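
The difference between scanning everything and narrowing by label plus an index on name can be sketched in a few lines. This is a simplified model, not how Neo4j or MySQL implement indexes; the `nodes` list, `name_index`, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical node store: a plain list of property dicts.
nodes = [
    {"id": 0, "label": "Person", "name": "Abe"},
    {"id": 1, "label": "Person", "name": "Bea"},
    {"id": 2, "label": "Topic", "name": "Sports"},
    {"id": 3, "label": "Person", "name": "Abe"},
]


def scan_by_name(name):
    """Check every node: the cost grows with the total node count."""
    return [n["id"] for n in nodes if n["name"] == name]


# Index built once on (label, name); later lookups skip the scan.
name_index = defaultdict(list)
for n in nodes:
    name_index[(n["label"], n["name"])].append(n["id"])


def lookup_by_name(label, name):
    """Single dictionary lookup instead of visiting every node."""
    return name_index.get((label, name), [])


scanned = scan_by_name("Abe")
indexed = lookup_by_name("Person", "Abe")
print(scanned, indexed)
```

Both calls return the same ids; only the amount of data touched to get there differs.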

OK, so that's the downside. What's the upside? Let's take a look at this query:

MATCH (n)-[r:foo|bar*..5]->(m) RETURN m;

This is an entirely different beast. The real action of the query is in matching a variable-length path between n and m. How would we do this in relational? We might set up a "nodes" table and an "edges" table, then add a PK/FK relationship between them. You could then write an SQL query that recursively joins the two tables to traverse that "path". Believe me, I have tried this in SQL, and it requires wizard-level skill to express the "between 1 and 5 hops" part of that query. Also, an RDBMS will perform like a dog on this query, because it's not terribly selective, and the recursive query is quite expensive, doing all those repetitive joins.
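
For a sense of what the relational version looks like, here is a sketch using a recursive common table expression in SQLite through Python's `sqlite3` module. The table layout, sample rows, and the `HOPS_SQL` name are assumptions made for the example, and it returns the distinct nodes reachable in 1 to 5 hops rather than one row per path as the Cypher query would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edges (
        src  INTEGER REFERENCES nodes(id),
        dst  INTEGER REFERENCES nodes(id),
        type TEXT
    );
    INSERT INTO nodes VALUES
        (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e'),(6,'f'),(7,'g'),(8,'h');
    -- A chain 1->2->...->7 of foo/bar edges, plus one 'baz' edge to ignore.
    INSERT INTO edges VALUES
        (1,2,'foo'),(2,3,'bar'),(3,4,'foo'),(4,5,'bar'),
        (5,6,'foo'),(6,7,'bar'),(1,8,'baz');
""")

# Each recursive step is another join against edges, bounded at 5 hops.
HOPS_SQL = """
WITH RECURSIVE walk(node, depth) AS (
    SELECT dst, 1 FROM edges
     WHERE src = :start AND type IN ('foo', 'bar')
    UNION
    SELECT e.dst, w.depth + 1
      FROM walk AS w JOIN edges AS e ON e.src = w.node
     WHERE e.type IN ('foo', 'bar') AND w.depth < 5
)
SELECT DISTINCT node FROM walk ORDER BY node;
"""

reachable = [row[0] for row in conn.execute(HOPS_SQL, {"start": 1})]
print(reachable)
```

Node 7 is six hops away and node 8 is only reachable over a `baz` edge, so neither appears in the result. The depth bound also guarantees the recursion stops on cyclic data.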

On queries like this, neo4j is going to kick RDBMS's ass.

So -- on your question about arbitrary queries -- no system in the world is good at arbitrary queries, that is to say, all queries. Systems have strengths and weaknesses. Neo4j can execute arbitrary queries, but there's no guarantee that, for any given class of queries, it will perform better than some alternative. But that observation is general - the same is true of MySQL, MongoDB, and anything else you choose.

OK, so the bottom line, and some observations:

  1. Graph databases perform well on a class of queries where RDBMS (and others) perform poorly.
  2. Graph databases aren't tuned for high performance on mass/bulk queries like the example I provided. They can do them, and you can tune their performance to improve things there, but they're never going to be as good as an RDBMS.
  3. This is fundamentally because of how they're laid out, how they think about/store the data.
  4. So what should you do? If your problem consists of a lot of relationship/path traversal type problems, graph is a big win! (I.e., your data is a graph, and traversing relationships is important to you.) If your problem consists of scanning large collections of objects, then the relational model is probably a better fit.

Use tools in their area of strength. Don't use neo4j like a relational database, or it will perform about as well as if you tried to use a screwdriver to pound nails. :)
