I am trying to use the spark-redshift Databricks package and cannot get the Redshift JDBC driver working correctly. I have downloaded the latest version from here and saved it to an S3 bucket.
This is how I am launching the PySpark shell:
MASTER=yarn-client IPYTHON=1 PYSPARK_PYTHON=/usr/bin/python27 /usr/lib/spark/bin/pyspark \
  --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-redshift_2.10:1.1.0 \
  --jars 's3://pathto/RedshiftJDBC42-1.2.1.1001.jar'
I am trying to read from Redshift as per the Databricks README:
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3n://path/for/temp/data") \
    .load()
But I get a configuration error:
Py4JJavaError: An error occurred while calling o46.load.
: java.lang.ClassNotFoundException: Could not load an Amazon Redshift JDBC driver; see the README for instructions on downloading and configuring the official Amazon driver.
The jar file seems to have been read, so I am not sure how it needs to be specified differently.
1 Answer
#1
Just updating this as I realized what I was doing wrong. I was referencing the jar file in an S3 bucket, but it needs to be available locally on the cluster.
aws s3 cp s3://pathto/RedshiftJDBC42-1.2.1.1001.jar /tmp/
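For completeness, a minimal sketch of the revised launch, assuming the jar has been copied to /tmp/ on the node where the shell is started (paths, package versions, and the jar filename are taken from the question; adjust them for your environment):

# copy the driver from S3 to the local filesystem, then reference the local path in --jars
aws s3 cp s3://pathto/RedshiftJDBC42-1.2.1.1001.jar /tmp/

MASTER=yarn-client IPYTHON=1 PYSPARK_PYTHON=/usr/bin/python27 /usr/lib/spark/bin/pyspark \
  --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-redshift_2.10:1.1.0 \
  --jars /tmp/RedshiftJDBC42-1.2.1.1001.jar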