I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.
Is an R reader available? Or is work being done on one?
If not, what would be the most expedient way to get there? Note: There are Java and C++ bindings: https://github.com/apache/parquet-mr
4 solutions
#1
25
If you're using Spark, this is now relatively simple with the release of Spark 1.4; see the sample code below, which uses the SparkR package, now part of the Apache Spark core framework.
# install the SparkR package
devtools::install_github('apache/spark', ref='master', subdir='R/pkg')
# load the SparkR package
library('SparkR')
# initialize sparkContext which starts a new Spark session
sc <- sparkR.init(master="local")
# initialize sqlContext
sq <- sparkRSQL.init(sc)
# load parquet file into a Spark data frame and coerce into R data frame
df <- collect(parquetFile(sq, "/path/to/filename"))
# terminate Spark session
sparkR.stop()
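Once collected, df is an ordinary R data frame, so the usual tools apply. A quick sanity check (the columns, of course, depend on your file):
# inspect the structure and first rows of the plain R data frame
str(df)
head(df)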
An expanded example is shown at https://gist.github.com/andyjudson/6aeff07bbe7e65edc665
I'm not aware of any other package that you could use if you weren't using Spark.
#2
5
As an alternative to SparkR, you could now use sparklyr:
# install.packages("sparklyr")
library(sparklyr)
# connect to a local Spark instance
sc <- spark_connect(master = "local")
# read the parquet directory into a Spark DataFrame, registered under the given table name
spark_tbl_handle <- spark_read_parquet(sc, "tbl_name_in_spark", "/path/to/parquetdir")
# pull the data into a regular R data frame
regular_df <- collect(spark_tbl_handle)
# close the Spark connection
spark_disconnect(sc)
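Because sparklyr tables work with dplyr, you can also push filtering down into Spark before collecting, so only the matching rows are transferred to R. A minimal sketch, assuming a hypothetical numeric column named value:
library(dplyr)
# the filter is translated to Spark SQL and runs inside Spark
regular_df <- spark_tbl_handle %>%
  filter(value > 100) %>%
  collect()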
#3
3
To read a parquet file from an Amazon S3 bucket, try using s3a instead of s3n. That worked for me when reading parquet files with EMR 1.4.0, RStudio, and Spark 1.5.0.
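With the Spark 1.x SparkR API from answer #1, that just means switching the URL scheme. A minimal sketch (the bucket and path are placeholders, and your cluster must already have S3 credentials configured):
# read a parquet file from S3 using the s3a scheme
df <- collect(parquetFile(sq, "s3a://your-bucket/path/to/parquet"))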
#4
0
Spark has been updated since then, and many functions have been deprecated or renamed.
Andy's answer above works for Spark v1.4, but on Spark v2.3 the following updated steps worked for me.
- Download the latest version of Apache Spark from https://spark.apache.org/downloads.html (point 3 in the link)
- Extract the .tgz file.
- Install the devtools package in RStudio:
install.packages('devtools')
- Open a terminal and follow these steps:
# This is the folder of the extracted Spark `.tgz` from point 1 above
export SPARK_HOME=extracted-spark-folder-path
cd $SPARK_HOME/R/lib/SparkR/
R -e "devtools::install('.')"
- Go back to RStudio:
# load the SparkR package
library(SparkR)
# initialize a sparkSession, which starts a new Spark session
sc <- sparkR.session(master="local")
# load the parquet file into a Spark data frame and coerce it into an R data frame
df <- collect(read.parquet('.parquet-file-path'))
# terminate the Spark session
sparkR.stop()
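As with the sparklyr example above, you can also trim the data inside Spark before collecting it. A minimal sketch using the SparkR 2.x API, again assuming a hypothetical numeric column named value and a placeholder path:
# read the file, filter inside Spark, then pull only the result into R
sdf <- read.parquet("/path/to/parquet")
df <- collect(filter(sdf, sdf$value > 100))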