I have a lot of JSON files in my S3 bucket and I want to be able to read and query them. The problem is that they are pretty-printed: each file contains a single large dictionary, but it is not on one line. As per this thread, a dictionary in a JSON file should be on a single line, which is a limitation of Apache Spark, and my files are not structured that way.
My JSON schema looks like this -
{
  "dataset": [
    {
      "key1": [
        {
          "range": "range1",
          "value": 0.0
        },
        {
          "range": "range2",
          "value": 0.23
        }
      ]
    }, {..}, {..}
  ],
  "last_refreshed_time": "2016/09/08 15:05:31"
}
Here are my questions -
- Can I avoid converting these files to match the schema required by Apache Spark (one dictionary per line in a file) and still be able to read them?
- If not, what's the best way to do the conversion in Python? I have a bunch of these files for each day in the bucket, and the bucket is partitioned by day.
- Is there any other tool better suited to querying these files than Apache Spark? I'm on the AWS stack, so I can try out any suggested tool with a Zeppelin notebook.
1 solution
#1
You could use sc.wholeTextFiles(). Here is a related post.
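For example, here is a minimal sketch, assuming an existing SparkContext `sc` and a hypothetical bucket path; since wholeTextFiles() returns (path, content) pairs with each file's full text, the pretty-printed JSON can be parsed file by file:

import json

# Read each file whole, then parse it with the standard json module.
raw = sc.wholeTextFiles("s3a://my-bucket/2016/09/08/*.json")
records = raw.map(lambda kv: json.loads(kv[1]))

# e.g. flatten out the nested "dataset" entries from each file
datasets = records.flatMap(lambda doc: doc.get("dataset", []))
print(datasets.take(2))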
Alternatively, you could reformat your json using a simple function and load the generated file.
import json

def reformat_json(input_path, output_path):
    # Assumes the top-level JSON value is a list of records; rewrites the
    # pretty-printed input as newline-delimited JSON (one record per line).
    with open(input_path, 'r') as handle:
        jarr = json.load(handle)
    with open(output_path, 'w') as out:
        for entry in jarr:
            out.write(json.dumps(entry) + "\n")
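A hypothetical usage, assuming a SparkSession named `spark` and made-up file names, would then load the newline-delimited output directly:

reformat_json("dataset_pretty.json", "dataset_lines.json")

df = spark.read.json("dataset_lines.json")  # one JSON record per line
df.printSchema()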