I am starting to use Spark, and when loading data from cloud storage I often see code like the following:
my_sdf = spark.read.format("com.databricks.spark.csv").option("delimiter", ' ').load("s3n://myfolder/data/xyz.txt")
My question is the following: it looks like there are two data sets here. One is com.databricks.spark.csv, since it is a csv file, right? And the other data set is xyz.txt, since it is a txt file. So which data set does this command load? I experimented myself, and it seems that it is the xyz.txt data set that gets loaded. But then what does com.databricks.spark.csv do, especially since it is passed to format()? Is it telling Spark to load the data set xyz.txt using the same format as the data set com.databricks.spark.csv?
1 Answer
#1
In the code below:
my_sdf = spark.read.format("com.databricks.spark.csv").option("delimiter", ' ').load("s3n://myfolder/data/xyz.txt")
The data set is s3n://myfolder/data/xyz.txt.
format() names the format in which your data set s3n://myfolder/data/xyz.txt should be read.
PySpark before 2.0 does not ship a built-in csv format, so the Databricks package format com.databricks.spark.csv is required. If your input data is in another format such as parquet, orc, or json, then you pass parquet, orc, or json instead of com.databricks.spark.csv.
Basically, the format is the structure in which your data is saved.