What does `format()` do when loading data in PySpark?

Time: 2022-07-20 23:11:32

I am starting to use Spark, and when loading data from the cloud I often see code like the following:

my_sdf = spark.read.format("com.databricks.spark.csv").option("delimiter", ' ').load("s3n://myfolder/data/xyz.txt")

My question is this: it seems there are two data sets here. One is com.databricks.spark.csv, since it is a CSV file, right? The other is xyz.txt, since it is a txt file. So which data set does this command load? From my own experiments, it appears to be xyz.txt. But then what does com.databricks.spark.csv do, especially since it is passed to format()? Is it telling Spark to load the data set xyz.txt using the same format as the data set com.databricks.spark.csv?

1 Answer

#1



From the code below:

my_sdf = spark.read.format("com.databricks.spark.csv").option("delimiter", ' ').load("s3n://myfolder/data/xyz.txt")

The data set is s3n://myfolder/data/xyz.txt.

The format is the name of the data source that Spark should use to read the data set s3n://myfolder/data/xyz.txt.

PySpark < 1.6 does not ship a built-in CSV data source, so the Databricks package com.databricks.spark.csv is required. If your input data is in another format, such as parquet, orc, or json, pass "parquet", "orc", or "json" to format() instead of com.databricks.spark.csv.

Basically, the format describes the structure in which your data is saved.
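The .option("delimiter", ' ') call tells the chosen format how to split each line of xyz.txt into columns, here on spaces. As a rough illustration of what that option means, outside Spark, this is a sketch using Python's standard csv module with made-up sample data (the file contents below are hypothetical, not from the question):

```python
import csv
import io

# Stand-in for the contents of a space-delimited text file like xyz.txt.
raw = "alice 30 nyc\nbob 25 sf\n"

# delimiter=' ' plays the same role as .option("delimiter", ' ') in Spark:
# it tells the CSV parser that columns are separated by spaces.
rows = list(csv.reader(io.StringIO(raw), delimiter=' '))
print(rows)  # [['alice', '30', 'nyc'], ['bob', '25', 'sf']]
```

So format() names the parser (CSV, Parquet, ORC, JSON, ...), and option() passes parser-specific settings such as the delimiter.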
