I am retrieving data from Neo4j using the Bolt driver in Python. The returned result should be stored as an RDD (or at least as a CSV). I can see the returned results, but I am unable to store them as an RDD, a DataFrame, or even a CSV.
Here is how I am seeing the result:
session = driver.session()  # driver: a Bolt driver created with neo4j GraphDatabase.driver(...)
result = session.run('MATCH (n) RETURN n.hobby, id(n)')
session.close()
How can I store this data in an RDD or a CSV file?
2 Answers
#1
I deleted the old post and reposted the same question, but I haven't received any pointers. So I am posting my approach in case it helps others.
# Imports needed by the snippet (sc is assumed to be an existing SparkContext)
from pandas import DataFrame
from pyspark.sql import SQLContext

# Run the query; keep the session open until the result has been consumed
session = driver.session()
result = session.run('MATCH (n:Hobby) RETURN n.hobby AS hobby, id(n) AS id LIMIT 10')

# Pull the keys (column names) from the first record without consuming it
keys = result.peek().keys()

# Read all the property values and store them in a list of rows
values = []
for record in result:
    rec = []
    for key in keys:
        rec.append(record[key])
    values.append(rec)
session.close()

# Convert the list of rows into a pandas DataFrame
df = DataFrame(values, columns=keys)
print(df)

# Convert the pandas DataFrame to a Spark DataFrame
sqlCtx = SQLContext(sc)
spark_df = sqlCtx.createDataFrame(df)
spark_df.show()

# Convert the Spark DataFrame to a Spark RDD of tuples (via the Spark DataFrame)
rdd = spark_df.rdd.map(tuple)
print(rdd.take(10))
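To cover the "or at least a CSV" part of the question, the intermediate DataFrames can also be written straight to disk. A minimal sketch; the output names hobbies.csv and hobbies_spark are just illustrations:

# Write the pandas DataFrame to CSV (index=False drops the row-index column)
df.to_csv('hobbies.csv', index=False)

# Or write the Spark DataFrame out as CSV (built-in writer in Spark 2.0+)
spark_df.write.csv('hobbies_spark', header=True)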
Any suggestions to improve the efficiency are highly appreciated.
#2
Instead of going from Python to Spark, why not use the Neo4j Spark connector? I think this would keep Python from becoming a bottleneck if you were moving a lot of data. You can put your Cypher query inside the Spark session and save the result as an RDD.
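For reference, usage from pyspark would look roughly like the sketch below. This is an assumption-laden illustration: it presumes a connector version that exposes a Spark DataSource, the connector jar on the Spark classpath (e.g. via --packages), and placeholder connection details; option names may differ between connector versions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('neo4j-read').getOrCreate()

# Hypothetical read: the URL, credentials, and option names are placeholders
df = (spark.read.format('org.neo4j.spark.DataSource')
      .option('url', 'bolt://localhost:7687')
      .option('authentication.basic.username', 'neo4j')
      .option('authentication.basic.password', 'secret')
      .option('query', 'MATCH (n:Hobby) RETURN n.hobby AS hobby, id(n) AS id')
      .load())

rdd = df.rdd.map(tuple)  # drop down to an RDD if one is specifically needed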
There has been talk on the Neo4j Slack group about a pyspark implementation, which will hopefully be available later this fall. I know the ability to query Neo4j from pyspark and sparkr would be very useful.