I am working with data extracted from SFDC using the simple-salesforce package. I am scripting in Python 3 with Spark 1.5.2.
I created an RDD containing the following data:
[('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')]
[('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')]
[('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
...
This data is in an RDD called v_rdd.
My schema looks like this:
StructType(List(StructField(Id,StringType,true),StructField(PackSize,StringType,true),StructField(Name,StringType,true)))
I am trying to create DataFrame out of this RDD:
sqlDataFrame = sqlContext.createDataFrame(v_rdd, schema)
I print my DataFrame:
sqlDataFrame.show()
And get the following:
+--------------------+--------------------+--------------------+
|                  Id|            PackSize|                Name|
+--------------------+--------------------+--------------------+
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
+--------------------+--------------------+--------------------+
I am expecting to see the actual data, like this:
+----------------+--------+----+
|              Id|PackSize|Name|
+----------------+--------+----+
|a0w1a0000003xB1A|     1.0|   A|
|a0w1a0000003xAAI|     1.0|   B|
|a0w1a00000xB3AAI|    30.0|   C|
+----------------+--------+----+
Can you please help me identify what I am doing wrong here?
My Python script is long and I am not sure it would be convenient for people to sift through it, so I have posted only the parts I am having issues with.
Thanks a ton in advance!
1 Answer
#1
Hey, could you provide a working example next time? That would make this easier.
The way your RDD is structured is an unusual starting point for creating a DataFrame. This is how you create a DataFrame according to the Spark documentation:
>>> l = [('Alice', 1)]
>>> sqlContext.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]
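To see why your version fails, here is a plain-Python sketch (not from the Spark API, just illustrating the shape of the data): each of your records is a *list of (key, value) pairs*, so Spark receives nested sequences instead of one value per column, which is why the cells show up as `[Ljava.lang.Objec...` object arrays. Flattening one record looks like this:

```python
# One record as it currently exists in your RDD: a list of pairs
record = [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')]

# dict() turns the pairs into a mapping, from which a flat tuple of
# column values can be pulled; str() matches the StringType schema
row = dict(record)
flat = (row['Id'], str(row['PackSize']), row['Name'])
print(flat)  # ('a0w1a0000003xB1A', '1.0', 'A')
```

A flat tuple like `flat` is the per-row shape `createDataFrame` expects when you pass an explicit schema.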
So for your example, you can produce the desired output like this:
# Imports needed for the schema definition
from pyspark.sql.types import StructType, StructField, StringType

# Your data at the moment
data = sc.parallelize([
    [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
    [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')],
    [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
])

# Convert each record from a list of (key, value) pairs to a flat tuple;
# PackSize is cast to str because the schema declares it as StringType
data_converted = data.map(lambda x: (x[0][1], str(x[1][1]), x[2][1]))

# Define schema
schema = StructType([
    StructField("Id", StringType(), True),
    StructField("PackSize", StringType(), True),
    StructField("Name", StringType(), True)
])

# Create the DataFrame
DF = sqlContext.createDataFrame(data_converted, schema)

# Output
DF.show()
+----------------+--------+----+
|              Id|PackSize|Name|
+----------------+--------+----+
|a0w1a0000003xB1A|     1.0|   A|
|a0w1a0000003xAAI|     1.0|   B|
|a0w1a00000xB3AAI|    30.0|   C|
+----------------+--------+----+
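If the pairs are not guaranteed to arrive in the same order in every record, the positional lambda above is fragile. A dict-based helper (a sketch of my own, not part of the original answer; the function name `to_flat_row` is made up) pulls values by key instead of by position:

```python
def to_flat_row(pairs, fields=("Id", "PackSize", "Name")):
    """Flatten a list of (key, value) pairs into a tuple ordered by
    the schema fields, casting each value to str to match StringType."""
    d = dict(pairs)
    return tuple(str(d[f]) for f in fields)

# Field order in the input no longer matters:
print(to_flat_row([('Name', 'C'), ('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0)]))
# ('a0w1a00000xB3AAI', '30.0', 'C')
```

With this helper, the conversion step would become `data_converted = data.map(to_flat_row)`.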
Hope this helps!