I have a jsonfile to be parsed.The json format is like this :
我有一个json文件要解析.json格式是这样的:
{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}
I have to get every word in the file.How can I get the "major"
from an array and do I have to get the word of "province" using the method df.select("cv_parse.basic_info.location.province")
?
我必须得到文件中的每一个字。如何从数组中获取“主要”,我是否必须使用方法df.select(“cv_parse.basic_info.location.province”)获取“省”字样?
This is the result I want:
这是我想要的结果:
cv_id major degree birthyear state
001 English Bachelor 1984 New York
001 English Master 1984 New York
1 个解决方案
#1
0
This might not be the best way of doing it but you can give it a shot.
这可能不是最好的方法,但你可以试一试。
// import the implicits functions
import org.apache.spark.sql.functions._
import sqlContext.implicits._
//read the json file
val jsonDf = sqlContext.read.json("sample-data/sample.json")
jsonDf.printSchema
Your schema would be :
您的架构将是:
root
|-- cv_id: string (nullable = true)
|-- cv_parse: struct (nullable = true)
| |-- basic_info: struct (nullable = true)
| | |-- birthyear: string (nullable = true)
| | |-- location: struct (nullable = true)
| | | |-- state: string (nullable = true)
| |-- educations: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- degree: string (nullable = true)
| | | |-- major: string (nullable = true)
Now you need can have explode the educations
column
现在你需要爆炸教育专栏
val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
$"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")
explodedResult.printSchema
Now your schema would be
现在您的架构将是
root
|-- cv_id: string (nullable = true)
|-- col: struct (nullable = true)
| |-- degree: string (nullable = true)
| |-- major: string (nullable = true)
|-- birthyear: string (nullable = true)
|-- state: string (nullable = true)
Now you can select the columns
现在您可以选择列
explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show
+-----+---------+--------+--------+-------+
|cv_id|birthyear| state| degree| major|
+-----+---------+--------+--------+-------+
| 001| 1984|New York|Bachelor|English|
| 001| 1984|New York| Master |English|
+-----+---------+--------+--------+-------+
#1
0
This might not be the best way of doing it but you can give it a shot.
这可能不是最好的方法,但你可以试一试。
// import the implicits functions
import org.apache.spark.sql.functions._
import sqlContext.implicits._
//read the json file
val jsonDf = sqlContext.read.json("sample-data/sample.json")
jsonDf.printSchema
Your schema would be :
您的架构将是:
root
|-- cv_id: string (nullable = true)
|-- cv_parse: struct (nullable = true)
| |-- basic_info: struct (nullable = true)
| | |-- birthyear: string (nullable = true)
| | |-- location: struct (nullable = true)
| | | |-- state: string (nullable = true)
| |-- educations: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- degree: string (nullable = true)
| | | |-- major: string (nullable = true)
Now you need can have explode the educations
column
现在你需要爆炸教育专栏
val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
$"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")
explodedResult.printSchema
Now your schema would be
现在您的架构将是
root
|-- cv_id: string (nullable = true)
|-- col: struct (nullable = true)
| |-- degree: string (nullable = true)
| |-- major: string (nullable = true)
|-- birthyear: string (nullable = true)
|-- state: string (nullable = true)
Now you can select the columns
现在您可以选择列
explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show
+-----+---------+--------+--------+-------+
|cv_id|birthyear| state| degree| major|
+-----+---------+--------+--------+-------+
| 001| 1984|New York|Bachelor|English|
| 001| 1984|New York| Master |English|
+-----+---------+--------+--------+-------+