How to parse a JSON file using Spark

I have a JSON file to parse. The JSON format looks like this:

{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}

I need to extract every value in the file. How can I get "major" out of the array, and do I have to get "state" with df.select("cv_parse.basic_info.location.state")?

This is the result I want:

cv_id   major     degree     birthyear   state
001     English   Bachelor   1984        New York
001     English   Master     1984        New York

1 solution

#1

This might not be the best way of doing it but you can give it a shot.

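// import the SQL functions (for explode) and the implicits (for the $"..." column syntax)
// sqlContext here is the one that spark-shell provides in Spark 1.x
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// read the JSON file (one JSON object per line)
val jsonDf = sqlContext.read.json("sample-data/sample.json")

jsonDf.printSchema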

Your schema would be:

root
 |-- cv_id: string (nullable = true)
 |-- cv_parse: struct (nullable = true)
 |    |-- basic_info: struct (nullable = true)
 |    |    |-- birthyear: string (nullable = true)
 |    |    |-- location: struct (nullable = true)
 |    |    |    |-- state: string (nullable = true)
 |    |-- educations: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- degree: string (nullable = true)
 |    |    |    |-- major: string (nullable = true)
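Given that schema, a dot path is enough to reach a field nested inside structs. The same path through the educations array returns an array with one entry per element, which is why the array has to be exploded to get one row per education. The sketch below illustrates this under a few assumptions that are not in the original answer: Spark 2.2 or later, a local SparkSession, and the sample record inlined as a string instead of read from a file.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.2+ with a local session (in spark-shell, `spark` already exists).
val spark = SparkSession.builder().master("local[1]").appName("nested-select").getOrCreate()
import spark.implicits._

// The sample record from the question, inlined so the sketch needs no file.
val sample = Seq(
  """{"cv_id":"001","cv_parse":{"educations":[{"major":"English","degree":"Bachelor"},{"major":"English","degree":"Master "}],"basic_info":{"birthyear":"1984","location":{"state":"New York"}}}}"""
)
val df = spark.read.json(sample.toDS())

// A dot path through nested structs returns a plain string column.
val state = df.select("cv_parse.basic_info.location.state")

// The same kind of path through an array of structs returns an array of strings,
// one entry per array element, still on a single row.
val majors = df.select("cv_parse.educations.major")

state.printSchema()
majors.printSchema()
```

Selecting a path that does not exist in the schema fails at analysis time, so the path has to match the field names printed by printSchema.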

Now you need to explode the educations column, which produces one row per array element:

val explodedResult = jsonDf.select(
  $"cv_id",
  explode($"cv_parse.educations"),
  $"cv_parse.basic_info.birthyear",
  $"cv_parse.basic_info.location.state")

explodedResult.printSchema

Now your schema would be:

root
 |-- cv_id: string (nullable = true)
 |-- col: struct (nullable = true)
 |    |-- degree: string (nullable = true)
 |    |-- major: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- state: string (nullable = true)

Now you can select the columns:

explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show

+-----+---------+--------+--------+-------+
|cv_id|birthyear|   state|  degree|  major|
+-----+---------+--------+--------+-------+
|  001|     1984|New York|Bachelor|English|
|  001|     1984|New York| Master |English|
+-----+---------+--------+--------+-------+
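The steps above can also be chained into one expression. The following is a sketch rather than the answer's original code, and it makes some assumptions: Spark 2.2 or later with a SparkSession instead of sqlContext, the sample record inlined as a string, a hypothetical alias `education` for the exploded column, and `trim` applied to degree because the sample data contains "Master " with a trailing space. The columns are also reordered to match the layout requested in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, trim}

// Assumption: Spark 2.2+ with a local session (in spark-shell, `spark` already exists).
val spark = SparkSession.builder().master("local[1]").appName("flatten-cv").getOrCreate()
import spark.implicits._

// The sample record from the question, inlined so the sketch needs no file.
val sample = Seq(
  """{"cv_id":"001","cv_parse":{"educations":[{"major":"English","degree":"Bachelor"},{"major":"English","degree":"Master "}],"basic_info":{"birthyear":"1984","location":{"state":"New York"}}}}"""
)
val jsonDf = spark.read.json(sample.toDS())

val flat = jsonDf
  // One output row per element of the educations array; `education` is an illustrative alias.
  .select(
    $"cv_id",
    explode($"cv_parse.educations").as("education"),
    $"cv_parse.basic_info.birthyear",
    $"cv_parse.basic_info.location.state")
  // Pull the struct fields up to top-level columns, in the order the question asks for.
  .select(
    $"cv_id",
    $"education.major",
    trim($"education.degree").as("degree"), // strips the trailing space in "Master "
    $"birthyear",
    $"state")

flat.show()
```

Naming the exploded column avoids relying on the default name `col`, which makes the second select easier to read when more than one column is involved.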
