Reading Cassandra data into pandas with Python

Time: 2021-02-12 15:21:13

What is the proper and fastest way to read Cassandra data into pandas? I currently use the following code, but it's very slow...

import pandas as pd

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory

auth_provider = PlainTextAuthProvider(username=CASSANDRA_USER, password=CASSANDRA_PASS)
cluster = Cluster(contact_points=[CASSANDRA_HOST], port=CASSANDRA_PORT,
    auth_provider=auth_provider)

session = cluster.connect(CASSANDRA_DB)
session.row_factory = dict_factory

sql_query = "SELECT * FROM {}.{};".format(CASSANDRA_DB, CASSANDRA_TABLE)

df = pd.DataFrame()

for row in session.execute(sql_query):
    df = df.append(pd.DataFrame(row, index=[0]))

df = df.reset_index(drop=True).fillna(pd.np.nan)

Reading 1000 rows takes 1 minute, and I have a "bit more"... If I run the same query in, e.g., DBeaver, I get the full result (~40k rows) within a minute.

Thank you!!!

2 solutions

#1


18  

I got the answer on the official mailing list (it works perfectly):

Hi,

try to define your own pandas row factory:

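def pandas_factory(colnames, rows):
    # Build the whole DataFrame in one call instead of row by row.
    return pd.DataFrame(rows, columns=colnames)

session.row_factory = pandas_factory
# Disable paging so the full result set arrives as a single page.
session.default_fetch_size = None

query = "SELECT ..."
rslt = session.execute(query, timeout=None)
# _current_rows is a private attribute holding the row factory's output.
df = rslt._current_rows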

That's the way I do it, and it should be faster...

If you find a faster method, I'm interested :)

Michael
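
To see what the row factory above does with its inputs, here is a minimal sketch that calls it directly on made-up data, with no cluster involved. It assumes the driver hands the factory a sequence of column names plus a list of row tuples; the column names and values below are hypothetical.

```python
import pandas as pd


def pandas_factory(colnames, rows):
    # One DataFrame constructor call for all rows, rather than one per row.
    return pd.DataFrame(rows, columns=colnames)


# Hypothetical stand-ins for what the driver would pass to the factory.
sample_colnames = ["id", "name"]
sample_rows = [(1, "alice"), (2, "bob"), (3, "carol")]

df = pandas_factory(sample_colnames, sample_rows)
print(df)
```

As far as I understand the driver, the factory runs once per fetched page, which is why paging is switched off above: with paging left on, `_current_rows` would likely hold only the first page and the remaining pages would have to be fetched and concatenated separately.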

#2


3  

What I do (in Python 3) is:

query = "SELECT ..."
df = pd.DataFrame(list(session.execute(query)))
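
This presumably works because the driver's default row factory returns named tuples, and pandas takes the column names from their fields. A small sketch of that behaviour, using a hypothetical `Row` type and sample values in place of real query results:

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the named-tuple rows a query would return.
Row = namedtuple("Row", ["id", "name"])
rows = [Row(1, "alice"), Row(2, "bob")]

# pandas reads the column names from the named tuples' fields.
df = pd.DataFrame(list(rows))
print(df)
```

A list of dicts (as produced by `dict_factory` in the question) can be passed to `pd.DataFrame` the same way, so the per-row `append` loop is not needed in either case.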
