[Spark][python] Example of opening a JSON file as a DataFrame

Date: 2022-03-16 11:28:10

[Spark][python] Example of opening a JSON file as a DataFrame. The walkthrough below inspects a small JSON file, copies it into HDFS, and reads it into a DataFrame from the pyspark shell (Spark 1.x on CDH 5.7, judging from the log output):

[training@localhost ~]$ cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
[training@localhost ~]$
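
Note that people.json is in JSON Lines format: each line is one complete, self-contained JSON object, which is what Spark's JSON reader expects by default. Records may omit fields (Alice has no age, Diana has no pcode), and those fields simply come back as null in the DataFrame. Also note the stray "pcoe" key in Carla's record, presumably a typo for "pcode" left in the sample data; since Spark infers the schema from the data, it will surface as a separate pcoe column. A file like this can be produced with plain Python; a minimal sketch using a subset of the sample records:

import json

# Each record becomes one line in the output file (JSON Lines format).
people = [
    {"name": "Alice", "pcode": "94304"},
    {"name": "Diana", "age": 46},
]
with open("people.json", "w") as f:
    for record in people:
        f.write(json.dumps(record) + "\n")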

[training@localhost ~]$ hdfs dfs -put people.json

[training@localhost ~]$ hdfs dfs -cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}

In [1]: sqlContext = HiveContext(sc)

In [2]: peopleDF = sqlContext.read.json("people.json")

17/10/01 05:20:22 INFO hive.HiveContext: Initializing execution hive, version 1.1.0
17/10/01 05:20:22 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:22 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:23 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:23 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9_resources
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9/_tmp_space.db
17/10/01 05:20:23 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
17/10/01 05:20:23 INFO json.JSONRelation: Listing hdfs://localhost:8020/user/training/people.json on driver
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.1 KB, free 251.1 KB)
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 272.7 KB)
17/10/01 05:20:25 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42171 (size: 21.6 KB, free: 208.8 MB)
17/10/01 05:20:25 INFO spark.SparkContext: Created broadcast 0 from json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/01 05:20:26 INFO spark.SparkContext: Starting job: json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Got job 0 (json at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 277.1 KB)
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 279.5 KB)
17/10/01 05:20:26 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42171 (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:26 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/01 05:20:26 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/01 05:20:26 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/01 05:20:26 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/10/01 05:20:27 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2354 bytes result sent to driver
17/10/01 05:20:27 INFO scheduler.DAGScheduler: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2) finished in 0.715 s
17/10/01 05:20:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 667 ms on localhost (1/1)
17/10/01 05:20:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/10/01 05:20:27 INFO scheduler.DAGScheduler: Job 0 finished: json at NativeMethodAccessorImpl.java:-2, took 1.084685 s
17/10/01 05:20:27 INFO hive.HiveContext: default warehouse location is /user/hive/warehouse
17/10/01 05:20:28 INFO hive.HiveContext: Initializing metastore client version 1.1.0 using Spark classes.
17/10/01 05:20:28 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:42171 in memory (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:28 INFO spark.ContextCleaner: Cleaned accumulator 2
17/10/01 05:20:30 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:30 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:30 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/8c1eba54-7260-4314-abbf-7b7de85bdf0a_resources
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a/_tmp_space.db
17/10/01 05:20:30 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
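
Note from the logs above that read.json is not fully lazy: Spark runs a small job (Job 0) just to scan the file and infer the schema before returning. Supplying an explicit schema avoids that scan; a minimal sketch on the same Spark 1.x API used in this session (the stray pcoe field is deliberately left out):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Declaring the schema up front skips the schema-inference job.
schema = StructType([
    StructField("age",   LongType(),   True),
    StructField("name",  StringType(), True),
    StructField("pcode", StringType(), True),
])
peopleDF = sqlContext.read.json("people.json", schema=schema)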

In [3]: type(peopleDF)
Out[3]: pyspark.sql.dataframe.DataFrame

In [4]:
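
As a footnote: HiveContext is the Spark 1.x entry point (the CDH pyspark shell normally pre-creates a sqlContext for you), and on Spark 2.x and later the same example goes through SparkSession instead. A minimal standalone sketch, with an illustrative app name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json-example").getOrCreate()

# Relative paths resolve against the user's HDFS home directory,
# e.g. hdfs://localhost:8020/user/training/people.json in this session.
peopleDF = spark.read.json("people.json")

peopleDF.printSchema()   # inspect the inferred schema
peopleDF.show()          # records with a missing field show null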
