Importing a large dataset into Neo4j is very slow

Time: 2022-09-11 18:03:57

I have a rather large dataset, ~68 million data points. The data is currently stored in MongoDB, and I have written a Java program that goes through the data, links data points together, and places them in the Neo4j database using Cypher commands. I ran this program overnight with a test set of data (~1.5 million points) and it worked. Now when I try to import the whole dataset, the program is extremely slow: it ran the whole weekend and only ~350,000 data points have made it in. Through some short testing, it seems like Neo4j is the bottleneck. It's been half an hour since I stopped the Java program, but Neo4j's CPU usage is at 100% and new nodes (from the Java program) are still being added. Is there any way to overcome this bottleneck? I've thought about multithreading, but since I'm trying to create a network, there are lots of dependencies and non-thread-safe operations being performed. Thanks for your help!


EDIT: The data I have is a list of users. The data that is contained is the user id and an array of the user's friends' ids. My Cypher queries look a little like this: "u:USER {id:" + currentID + "}) CREATE (u)-[:FRIENDS {ts:" + timeStamp + "}]->(u" + connectionID + ":USER {id:" + connectionID + "})". Sorry if this is really terrible; I'm pretty new to this.

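(For readability: the fragment above presumably assembles into a full statement along the lines of the sketch below. The leading MATCH clause and the parameter values are guesses for illustration, not taken from the post.)

public class QuerySketch {
    // A guess at the full statement the concatenation above builds; the leading
    // MATCH clause is not shown in the post, so treat it as an assumption.
    public static String friendQuery(long currentID, long connectionID, long timeStamp) {
        return "MATCH (u:USER {id:" + currentID + "}) "
             + "CREATE (u)-[:FRIENDS {ts:" + timeStamp + "}]->"
             + "(u" + connectionID + ":USER {id:" + connectionID + "})";
    }
}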

3 Solutions

#1


1  

You should first look at this:


neo4j import slowing down


If you still decide to DIY, there are a few things you should look out for. First, make sure you don't try to import all your data in one transaction, otherwise your code will spend most of its time suspended by the garbage collector. Second, ensure you have given plenty of memory to the Neo4j process (or to your application, if you're using an embedded instance of Neo4j). 68 million nodes is trivial for Neo4j, but if the Cypher you're generating is constantly looking things up, e.g. to create new relationships, then you'll run into severe paging issues if you don't allocate enough memory. Finally, if you are looking up nodes by properties (rather than by id), then you should be using labels and schema indexes:


http://neo4j.com/news/labels-and-schema-indexes-in-neo4j/
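Putting the batching and indexing advice together, a minimal sketch of what the import loop could look like against an embedded Neo4j 2.2+ instance is below; the class name, BATCH_SIZE, and the friendships source are placeholders, not code from the question or the linked posts.

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class FriendImporter {

    private static final int BATCH_SIZE = 10_000; // commit every N friendships, not all 68M in one transaction

    public static void importFriendships(GraphDatabaseService db, Iterable<long[]> friendships) {
        // One schema index so MERGE on :USER(id) is an index lookup, not a label scan.
        try (Transaction tx = db.beginTx()) {
            db.execute("CREATE INDEX ON :USER(id)");
            tx.success();
        }

        int count = 0;
        Transaction tx = db.beginTx();
        try {
            for (long[] pair : friendships) { // pair = {userId, friendId}; the ts property is omitted for brevity
                Map<String, Object> params = new HashMap<>();
                params.put("id1", pair[0]);
                params.put("id2", pair[1]);
                // Parameterized Cypher: no string concatenation, and query plans get reused.
                db.execute(
                        "MERGE (a:USER {id: {id1}}) " +
                        "MERGE (b:USER {id: {id2}}) " +
                        "MERGE (a)-[:FRIENDS]->(b)", params);

                if (++count % BATCH_SIZE == 0) { // keep each transaction small
                    tx.success();
                    tx.close();
                    tx = db.beginTx();
                }
            }
            tx.success();
        } finally {
            tx.close();
        }
    }
}

Using MERGE rather than CREATE for the :USER nodes avoids creating the same user twice when an id appears both as an owner and as a friend, which is easy to do with the concatenated CREATE shown in the question.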

#2


1  

Did you configure the neo4j.properties and neo4j-wrapper.conf files? It is highly recommended to adjust the values according to the amount of RAM available on your machine.


In conf/neo4j-wrapper.conf, for a 12GB RAM server I usually use:


wrapper.java.initmemory=8000
wrapper.java.maxmemory=8000

In conf/neo4j.properties I set:


dbms.pagecache.memory=8000 

See http://neo4j.com/blog/import-10m-stack-overflow-questions/ for a complete example that imports 10M nodes in a few minutes; it's a good starting point.


An SSD is also recommended to speed up the import.

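If the importer runs Neo4j embedded rather than talking to a standalone server, the page cache can also be configured programmatically. A sketch assuming Neo4j 2.2/2.3, with an example store path and size:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.factory.GraphDatabaseSettings;

public class EmbeddedConfig {
    public static GraphDatabaseService open() {
        return new GraphDatabaseFactory()
                .newEmbeddedDatabaseBuilder("data/graph.db")              // example path only
                .setConfig(GraphDatabaseSettings.pagecache_memory, "8g")  // counterpart of dbms.pagecache.memory
                .newGraphDatabase();
    }
}

The heap settings (wrapper.java.initmemory / wrapper.java.maxmemory, in MB) correspond to the -Xms / -Xmx flags of whichever JVM hosts the database, so in the embedded case they belong on the importer's own command line.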

#3


0  

One thing I learned when loading bulk data into a database was to switch off indexing temporarily on the destination table(s). Otherwise every new record added caused a separate update to the indexes, resulting in a lot of work on the disk. It was much quicker to re-index the whole table in a separate operation after the data load was complete. YMMV.

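In Neo4j terms this would mean dropping the schema index before the bulk load and recreating it afterwards, roughly as sketched below. Note that this only pays off if the load itself does not look nodes up by that property; a MERGE-based import without the index would fall back to label scans and get slower, not faster.

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class IndexToggle {

    public static void loadWithoutIndex(GraphDatabaseService db, Runnable bulkLoad) {
        try (Transaction tx = db.beginTx()) {
            db.execute("DROP INDEX ON :USER(id)");    // no per-write index maintenance during the load
            tx.success();
        }

        bulkLoad.run();                               // placeholder for the actual import

        try (Transaction tx = db.beginTx()) {
            db.execute("CREATE INDEX ON :USER(id)");  // rebuild the index once, after the data is in
            tx.success();
        }
    }
}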
