I have a single JSON file of about 36 GB (coming from Wikidata) and I want to access it more efficiently. Currently I'm using RapidJSON's SAX-style API in C++, but parsing the whole file takes about 7415200 ms (= roughly 120 minutes) on my machine. I want to access the JSON objects inside this file by one of two primary keys ('name' or 'entity-key', i.e. 'Stack Overflow' or 'Q549037') which are stored inside each JSON object. That means that, in the worst case, I currently have to parse the whole file.
So I thought about two approaches:
- splitting the big file into billions of small files, with a filename that indicates the name/entity-key (i.e. Q549037.json / Stack_Overflow.json or Q549037#Stack_Overflow.json) -> not sure about the storage overhead
- building some kind of index from the primary keys to the ftell() position in the file. Building the index should take around 120 minutes (like parsing does now), but accessing should be faster afterwards (see the sketch after this list):
  - i.e. use something like two std::unordered_map (could run into memory problems again)
  - index files - create two files: one with entries sorted by name and one sorted by entity-key (creating these files will probably take much longer, because of the sorting)
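To make the second approach concrete, this is the rough kind of sketch I have in mind, assuming the dump keeps one JSON object per line (with plain '\n' line endings) and that the entity key shows up as "id":"Q..." somewhere on each line. extractEntityKey() is only a placeholder, and a second std::unordered_map keyed by name would work the same way:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

// Sketch only: build an in-memory index from entity-key -> byte offset.
// extractEntityKey() is a naive placeholder; a real version would use
// RapidJSON (or at least a more careful search) on each line.
static std::string extractEntityKey(const std::string& line) {
    const std::string marker = "\"id\":\"";          // assumed layout of each record
    auto pos = line.find(marker);
    if (pos == std::string::npos) return {};
    auto end = line.find('"', pos + marker.size());
    if (end == std::string::npos) return {};
    return line.substr(pos + marker.size(), end - pos - marker.size());
}

int main() {
    std::ifstream in("wikidata.json", std::ios::binary);      // placeholder path
    std::unordered_map<std::string, std::uint64_t> byEntityKey;

    std::uint64_t offset = 0;
    std::string line;
    while (std::getline(in, line)) {                // assumes one object per line
        std::string key = extractEntityKey(line);
        if (!key.empty()) byEntityKey[key] = offset;
        offset += line.size() + 1;                  // +1 for the '\n' delimiter
    }

    // Later: jump straight to a single record instead of re-parsing everything.
    auto it = byEntityKey.find("Q549037");
    if (it != byEntityKey.end()) {
        in.clear();
        in.seekg(static_cast<std::streamoff>(it->second));
        std::getline(in, line);
        std::cout << line << "\n";                  // hand this line to RapidJSON
    }
}

If the two maps turn out to be too large, the same key/offset pairs could be written to the sorted index files mentioned above instead.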
What is the best practice for a problem like this? Which approach should I follow? Any other ideas?
4 Answers
#1
3
I think the performance problem is not due to parsing. Using RapidJSON's SAX API should already give good performance and be memory friendly. If you need to access every value in the JSON, this may already be the best solution.
However, from the question description, it seems that reading all values at once is not your requirement. You want to read some (probably a small number of) values matching particular criteria (e.g., by primary keys). Reading/parsing everything is not suitable for that case.
You will need some indexing mechanism. Doing that with file positions may be possible. If the data at those positions is also valid JSON, you can seek there and stream it to RapidJSON to parse just that JSON value (RapidJSON can stop parsing once a complete JSON value has been parsed, via kParseStopWhenDoneFlag).
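A minimal sketch of that seek-and-parse step, assuming the byte offset comes from an index you built beforehand and that the record at that offset is a complete JSON object with an "id" member (the file name and offset value are placeholders):

#include <cstdio>
#include "rapidjson/document.h"
#include "rapidjson/filereadstream.h"

int main() {
    long long offset = 123456789LL;                 // placeholder: comes from your index

    std::FILE* fp = std::fopen("wikidata.json", "rb");   // placeholder path
    if (!fp) return 1;

#ifdef _WIN32
    _fseeki64(fp, offset, SEEK_SET);                // 64-bit seek on Windows
#else
    fseeko(fp, offset, SEEK_SET);                   // 64-bit seek on POSIX
#endif

    char buffer[65536];
    rapidjson::FileReadStream is(fp, buffer, sizeof(buffer));

    // kParseStopWhenDoneFlag tells RapidJSON to stop after one complete
    // JSON value instead of treating the rest of the file as an error.
    rapidjson::Document d;
    d.ParseStream<rapidjson::kParseStopWhenDoneFlag>(is);

    if (!d.HasParseError() && d.HasMember("id") && d["id"].IsString())
        std::printf("parsed entity %s\n", d["id"].GetString());

    std::fclose(fp);
    return 0;
}

If you prefer to stay with the SAX API, rapidjson::Reader::Parse accepts the same flag.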
Other options are converting the JSON into some kind of database - a SQL database, a key-value database, or a custom one. With the indexing facilities they provide, you can query the data fast. The conversion may take a long time, but later retrieval will perform well.
Note that JSON is an exchange format. It was not designed for fast individual queries on big data.
Update: Recently I found a project, semi-index, that may suit your needs.
#2
1
Write your own JSON parser, minimizing allocations and data movement. Also ditch multi-byte characters for straight ANSI. I once wrote an XML parser to parse 4 GB XML files. I tried MSXML and Xerces; both had minor memory leaks that, when used on that much data, would actually run out of memory. My parser would stop memory allocations once it reached the maximum nesting level.
#3
1
Your definition of the problem does not allow for a precise answer.
I wonder why you would want to stick to JSON in the first place. It is certainly not the best format for rapid access to big data.
If you're using your Wikidata data intensively, why not convert it into a more manageable format altogether?
It should be easy to automate a DB definition that matches the format of your entries, and convert the big lump of JSON into DB records once and for all.
You can stop DB conversion at any point you like (i.e. store each JSON block as plain text or refine it further).
In the minimal case, you'll end up with a DB table holding your records indexed by name and key.
Certainly less messy than using your file system as a database (by creating millions of files named after name+key) or writing dedicated code to seek to the records.
That will probably save you a lot of disk space too, since internal DB storage is usually more efficient than plain textual representation.
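As a rough sketch of that minimal case, using SQLite's C API (the table layout is invented here, and the conversion loop is reduced to a single hard-coded record):

#include <cstdio>
#include <sqlite3.h>

// Sketch of the minimal case: one table holding each entity's raw JSON,
// indexed by both entity key and name. Table/column names are invented.
int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("wikidata.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS entities ("
        "  entity_key TEXT PRIMARY KEY,"
        "  name       TEXT,"
        "  json       TEXT);"
        "CREATE INDEX IF NOT EXISTS idx_name ON entities(name);",
        nullptr, nullptr, nullptr);

    // In the real conversion this would loop over all parsed JSON objects;
    // one hard-coded record stands in for that loop here.
    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO entities(entity_key, name, json) VALUES (?,?,?);",
        -1, &ins, nullptr);
    sqlite3_bind_text(ins, 1, "Q549037", -1, SQLITE_STATIC);
    sqlite3_bind_text(ins, 2, "Stack Overflow", -1, SQLITE_STATIC);
    sqlite3_bind_text(ins, 3, "{...}", -1, SQLITE_STATIC);
    sqlite3_step(ins);
    sqlite3_finalize(ins);

    // Lookup by either key is now a cheap indexed query.
    sqlite3_stmt* q = nullptr;
    sqlite3_prepare_v2(db, "SELECT json FROM entities WHERE name = ?;", -1, &q, nullptr);
    sqlite3_bind_text(q, 1, "Stack Overflow", -1, SQLITE_STATIC);
    if (sqlite3_step(q) == SQLITE_ROW)
        std::printf("%s\n", reinterpret_cast<const char*>(sqlite3_column_text(q, 0)));
    sqlite3_finalize(q);

    sqlite3_close(db);
    return 0;
}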
#4
-1
I've done a bit of parsing of data out of Wikipedia. I'm particularly interested in extracting the equations, so I'm only interested in part of the file.
Firstly, if it's WikiMedia data you're interested in, it's much easier to get a Labs account. It takes about a day to set up and it will let you run much of the code on their machines, avoiding the need to download multiple gigabytes. With a Labs account you should be able to run code against a fairly up-to-date replica of the database, avoiding the need for JSON entirely.
I use a simple Python program to parse the data; it basically runs a few regexps on each line: one to find lines containing <title>...</title> so I know which Wikipedia article it is, and a few more to find the namespace and the maths tags. It can process a 160 MB file in 13 seconds, so it might be able to do the whole 36 GB in under an hour.
This code produces text files with only the data I'm interested in. If you're interested, the code is:
import sys
import re

# Pass -d to also dump the raw <title> lines as they are seen.
dump = len(sys.argv) > 1 and sys.argv[1] == '-d'

titleRE = re.compile('<title>(.*)</title>')
nsRE = re.compile('<ns>(.*)</ns>')
mathRE = re.compile('</?math(.*?)>')
pageEndRE = re.compile('</page>')

# (these counters and pageEndRE are not used below)
supOc = 0
supCc = 0
subOc = 0
subCc = 0
title = ""
attr = ""
ns = -1
inEqn = 0

for line in sys.stdin:
    # Remember the current article title.
    m = titleRE.search(line)
    if m:
        title = m.group(1)
        expression = ""
        if dump: print line
        inEqn = 0

    # Remember the namespace of the current page.
    m = nsRE.search(line)
    if m:
        ns = m.group(1)

    # Scan the line for <math ...> ... </math> pairs.
    start = 0
    pos = 0
    m = mathRE.search(line, pos)
    while m:
        if m.group().startswith('<math'):
            attr = m.group(1)
            start = m.end()
            pos = start
            expression = ""
            inEqn = 1
        if m.group() == '</math>':
            end = m.start()
            expression = ' '.join([expression, line[start:end]])
            print title, '\t', attr, '\t', expression.lstrip().replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&')
            pos = m.end()
            expression = ""
            start = 0
            inEqn = 0
        m = mathRE.search(line, pos)

    # An equation that continues on the next line.
    if start > 0:
        expression = line[start:].rstrip()
    elif inEqn:
        expression = ' '.join([expression, line.rstrip()])
Sorry if it's a bit cryptic, but it was not meant for public consumption. Sample output is:
Arithmetic mean a_1,\ldots,a_n.
Arithmetic mean A
Arithmetic mean A=\frac{1}{n}\sum_{i=1}^{n} a_i
Arithmetic mean \bar{x}
Each line has the name of the article and the LaTeX equation. This reduces the data I need to work with down to a more manageable 500k. I'm not sure if such a strategy would work for your application.
For the main enwiki data they split the XML dumps into 27 smaller files of roughly equal size. You might find a few reasonably sized files easier to work with than either one giant file or millions of tiny files. It might be easy to split by the first letter of the article title, giving fewer than a hundred files, each less than a gigabyte.
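Since you are working in C++ anyway, a rough sketch of that kind of split applied to your JSON dump could look like the following, assuming one JSON object per line; extractName() is only a placeholder for whatever field you decide to bucket on:

#include <cctype>
#include <fstream>
#include <map>
#include <string>

// Placeholder: pull the 'name' out of one JSON line. A real version would
// use RapidJSON; the "name" layout here is only an assumption.
static std::string extractName(const std::string& line) {
    const std::string marker = "\"name\":\"";
    auto pos = line.find(marker);
    if (pos == std::string::npos) return {};
    auto end = line.find('"', pos + marker.size());
    if (end == std::string::npos) return {};
    return line.substr(pos + marker.size(), end - pos - marker.size());
}

int main() {
    std::ifstream in("wikidata.json", std::ios::binary);    // placeholder path
    std::map<char, std::ofstream> buckets;                   // one output file per letter

    std::string line;
    while (std::getline(in, line)) {                          // assumes one object per line
        std::string name = extractName(line);
        char bucket = '_';                                    // catch-all bucket
        if (!name.empty() && std::isalpha(static_cast<unsigned char>(name[0])))
            bucket = static_cast<char>(std::toupper(static_cast<unsigned char>(name[0])));

        auto it = buckets.find(bucket);
        if (it == buckets.end())
            it = buckets.emplace(bucket,
                     std::ofstream(std::string("split_") + bucket + ".json")).first;
        it->second << line << '\n';
    }
}

With fewer than a hundred bucket files, each one stays small enough to parse on demand.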