
Time: 2022-03-14 15:20:10

I am using R 2.14.1 and Cassandra 1.2.11. A separate program has written data to a single Cassandra table, and I cannot read that data back from R.

The Cassandra schema is defined like this:


create table chosen_samples (id bigint , temperature double, primary key(id))

I first tried the RCassandra package (http://www.rforge.net/RCassandra/):


> # install.packages("RCassandra")
> library(RCassandra)
> rc <- RC.connect(host ="", port = 9160L)
> RC.use(rc, "poc1_samples")
> cs <- RC.read.table(rc, c.family="chosen_samples")

The connection seems to succeed, but parsing the table into a data frame fails:


> cs
Error in data.frame(..dfd. = c("@\"ffffff", "@(<cc><cc><cc><cc><cc><cd>",  : 
  duplicate row.names: 

I have also tried the JDBC connector, as described here: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive

> # install.packages("RJDBC")
> library(RJDBC)
> cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", "/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar", "`")

But this one fails like this:


Error in .jfindClass(as.character(driverClass)[1]) : class not found

even though the path to the Java driver is correct:


$ ls /Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
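Checking that the file exists does not show whether the jar actually contains the driver class under that name. A small base-R check could look like the sketch below. It assumes the jar is an ordinary zip archive, and the helper names `class_to_entry` and `jar_has_class` are made up for illustration. Even when the class is present, loading can still fail if the driver's dependencies are missing from the classpath.

```r
# Convert a fully qualified Java class name to its path inside a jar.
class_to_entry <- function(class_name) {
  paste0(gsub(".", "/", class_name, fixed = TRUE), ".class")
}

# TRUE if the jar exists and contains the compiled class, FALSE otherwise.
jar_has_class <- function(jar_path, class_name) {
  if (!file.exists(jar_path)) return(FALSE)
  entries <- utils::unzip(jar_path, list = TRUE)$Name
  class_to_entry(class_name) %in% entries
}

jar_has_class("/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar",
              "org.apache.cassandra.cql.jdbc.CassandraDriver")
```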

5 answers



You have to download apache-cassandra-2.0.10-bin.tar.gz and cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar.


There is no need to install Cassandra on your local machine; just put the cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar files in the lib directory of the unzipped apache-cassandra-2.0.10-bin.tar.gz. Then you can use

 drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", 

This works on my Unix machine but not on my Windows machine. Hope that helps.
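One way to put every jar from that lib directory on the classpath is sketched below. This is an illustration, not necessarily the exact call used above: the helper names `collect_jars` and `make_cassandra_driver` and the example path are hypothetical, and the sketch assumes the RJDBC package is installed.

```r
# List every jar in a directory so all of them can go on the classpath.
collect_jars <- function(lib_dir) {
  list.files(lib_dir, pattern = "\\.jar$", full.names = TRUE)
}

# Build the RJDBC driver with the whole lib directory on the classpath.
# lib_dir is a hypothetical location; point it at your own extracted lib folder.
make_cassandra_driver <- function(lib_dir) {
  jars <- collect_jars(lib_dir)
  if (length(jars) == 0) stop("no jar files found in ", lib_dir)
  # Join the jars with the platform's classpath separator (":" or ";").
  class_path <- paste(jars, collapse = .Platform$path.sep)
  RJDBC::JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
              classPath = class_path, identifier.quote = "`")
}

# Example call (the path is illustrative only):
# drv <- make_cassandra_driver("~/apache-cassandra-2.0.10/lib")
```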




This question is old now, but since it's one of the top hits for R and Cassandra I thought I'd leave a simple solution here, as I found frustratingly little up-to-date support for what I thought would be a fairly common task.


Sparklyr makes this pretty easy to do from scratch now, as it exposes a Java context so the Spark-Cassandra-Connector can be used directly. I've wrapped up the bindings in this simple package, crassy, but you don't need it to use the connector.

I mostly made it to demystify the configuration needed to make sparklyr load the connector, and because the syntax for selecting a subset of columns is a little unwieldy.


Column selection and partition filtering are supported. These were the only features I thought were necessary for general Cassandra use cases, given CQL can't be submitted directly to the cluster.


I've not found a way to submit more general CQL queries that doesn't involve writing custom Scala; however, there's an example of how this can work here.
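For readers who want to skip the wrapper, the general shape of a direct read through sparklyr might look like the sketch below. It is not the crassy code. The function name `read_cassandra_table` is hypothetical, the connector coordinate in the example call is an assumption that has to match your Spark and Scala versions, and the sketch assumes sparklyr and a working Spark installation.

```r
# Read one Cassandra table into a Spark DataFrame through the connector.
# connector_pkg, host, keyspace and table are all supplied by the caller.
read_cassandra_table <- function(connector_pkg, host, keyspace, table,
                                 master = "local") {
  config <- sparklyr::spark_config()
  # Ask Spark to fetch the connector package at start-up.
  config$sparklyr.defaultPackages <- connector_pkg
  # Tell the connector where the Cassandra cluster lives.
  config[["spark.cassandra.connection.host"]] <- host
  sc <- sparklyr::spark_connect(master = master, config = config)
  sparklyr::spark_read_source(
    sc,
    name    = table,
    source  = "org.apache.spark.sql.cassandra",
    options = list(keyspace = keyspace, table = table)
  )
}

# Example call; the Maven coordinate is an assumption, not a tested value:
# cs <- read_cassandra_table(
#   "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0",
#   host = "127.0.0.1", keyspace = "poc1_samples", table = "chosen_samples")
```

The result is a Spark table reference, so the usual dplyr verbs can be used to pick columns or filter rows before collecting anything into R.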




Right, I found an (admittedly ugly) way: call Python from R, encode the missing values by hand, and reassign the data frame's column names in R, like this:


# install.packages("rPython")
# (don't forget to "pip install cql")
library(rPython)

python.exec("import sys")
# adding libraries from virtualenv 
python.exec("import cql")

python.exec("connection = cql.connect('', cql_version='3.0.0')")
python.exec("cursor = connection.cursor()")
python.exec("cursor.execute('use poc1_samples')")
python.exec("cursor.execute('select * from chosen_samples')")

# encode Python None as a sentinel string (rPython seems to just return nothing for None)
python.exec("rep = lambda x: '__NA__' if x is None else x")
python.exec("def getData(): return [rep(num) for line in cursor for num in line]")
data <- python.call("getData")
# ncol must equal the number of columns returned by the query
df <- as.data.frame(matrix(unlist(data), ncol = 15, byrow = TRUE), stringsAsFactors = FALSE)

# names() needs one entry per column; only 13 are listed here for 15 columns
names(df) <- c("temperature", "maxTemp", "minTemp",
"dewpoint", "elevation", "gust", "latitude", "longitude",
"maxwindspeed", "precipitation", "seelevelpressure", "visibility", "windspeed")

# decode the sentinel back into NA, element by element within each column
parsena <- function(x) ifelse(x == "__NA__", NA, x)
df <- as.data.frame(lapply(df, parsena), stringsAsFactors = FALSE)
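The reshape-and-decode step can be tried without any database. The sketch below uses made-up values for the two-column `chosen_samples` schema from the question; the sample data and the variable names are assumptions for illustration only.

```r
# Flat values as they might come back from python.call, row after row,
# with the sentinel standing in for missing cells (made-up sample data).
flat <- list(1, 20.5, 2, "__NA__", 3, 18.25)
col_names <- c("id", "temperature")

# One row per record; unlist() coerces everything to character here.
m <- matrix(unlist(flat), ncol = length(col_names), byrow = TRUE)
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- col_names

# Replace the sentinel in every column at once, then restore numeric types.
df[df == "__NA__"] <- NA
df[] <- lapply(df, type.convert, as.is = TRUE)
```

Deriving `ncol` from the length of the names vector keeps the two in step, which avoids a mismatch between the column count and the number of names.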

Does anybody have a better idea?




I had the same error message when executing Rscript with an RJDBC connection via a batch file (R 3.2.4, Teradata driver). Also, when run in RStudio, it worked fine on the second run but not on the first.

What helped was to explicitly call:





It is not enough to just download the driver; you also have to download its dependencies and put them on your Java classpath (macOS: /Library/Java/Extensions), as stated on the project's main page:

Include the Cassandra JDBC dependencies in your classpath: download dependencies

As for the RCassandra package, right now it's still too primitive compared to RJDBC.




