How can I process a large JSON file as a stream in Ruby without consuming all the memory?

Time: 2022-12-24 09:19:39

I'm having trouble processing a huge JSON file in Ruby. What I'm looking for is a way to process it entry-by-entry without keeping too much data in memory.

I thought the yajl-ruby gem would do the job, but it consumes all my memory. I've also looked at the Yajl::FFI and JSON::Stream gems, but their documentation clearly states:

For larger documents we can use an IO object to stream it into the parser. We still need room for the parsed object, but the document itself is never fully read into memory.

Here's what I've done with Yajl:

file_stream = File.open(file, "r")
json = Yajl::Parser.parse(file_stream)
json.each do |entry|
    entry.do_something
end
file_stream.close

The memory usage keeps getting higher until the process is killed.

I don't see why Yajl keeps processed entries in memory. Can I somehow free them, or did I just misunderstand the capabilities of the Yajl parser?

If it cannot be done using Yajl, is there a way to do this in Ruby via any other library?

3 Answers

#1

Problem

json = Yajl::Parser.parse(file_stream)

When you invoke Yajl::Parser like this, the entire stream is loaded into memory to create your data structure. Don't do that.

Solution

Yajl provides Parser#parse_chunk, Parser#on_parse_complete, and other related methods that enable you to trigger parsing events on a stream without requiring that the whole IO stream be parsed at once. The README contains an example of how to use chunking instead.

The example given in the README is:

Or let's say you didn't have access to the IO object that contained JSON data, but instead only had access to chunks of it at a time. No problem!

(Assume we're in an EventMachine::Connection instance)

def post_init
  @parser = Yajl::Parser.new(:symbolize_keys => true)
end

def object_parsed(obj)
  puts "Sometimes one pays most for the things one gets for nothing. - Albert Einstein"
  puts obj.inspect
end

def connection_completed
  # once a full JSON object has been parsed from the stream
  # object_parsed will be called, and passed the constructed object
  @parser.on_parse_complete = method(:object_parsed)
end

def receive_data(data)
  # continue passing chunks
  @parser << data
end

Or if you don't need to stream it, it'll just return the built object from the parse when it's done. NOTE: if there are going to be multiple JSON strings in the input, you must specify a block or callback as this is how yajl-ruby will hand you (the caller) each object as it's parsed off the input.

obj = Yajl::Parser.parse(str_or_io)
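
For a plain file on disk, the same callback approach works without EventMachine. Here is a minimal sketch, not from the answer itself: it assumes the input is a stream of concatenated or newline-delimited JSON documents, and the file name and chunk size are illustrative placeholders:

require 'yajl'

parser = Yajl::Parser.new(:symbolize_keys => true)

# Called once per complete top-level JSON value; only that value is in memory.
parser.on_parse_complete = lambda do |entry|
  puts entry.inspect # replace with whatever per-entry processing you need
end

File.open('huge.json', 'r') do |io|
  # Feed the parser fixed-size chunks so the document is never fully read in.
  parser << io.read(8192) until io.eof?
end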

One way or another, you have to parse only a subset of your JSON data at a time. Otherwise, you are simply instantiating a giant Hash in memory, which is exactly the behavior you describe.

Without knowing what your data looks like and how your JSON objects are composed, it isn't possible to give a more detailed explanation than that; as a result, your mileage may vary. However, this should at least get you pointed in the right direction.

#2

Both @CodeGnome's and @A. Rager's answers helped me understand the solution.

I ended up creating the gem json-streamer, which offers a generic approach and spares you from manually defining callbacks for every scenario.
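
For illustration, here is roughly how it is used, based on the gem's README (treat it as a sketch: the file name and chunk size are placeholders, and the exact API may differ between versions):

require 'json/streamer'

File.open('huge.json', 'r') do |file|
  streamer = Json::Streamer.parser(file_io: file, chunk_size: 1024)

  # Yields each object found at the given nesting level as soon as it has
  # been fully parsed, without building the whole document in memory.
  streamer.get(nesting_level: 1) do |entry|
    puts entry.inspect
  end
end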

#3

Your options seem to be json-stream and yajl-ffi. Both have a pretty similar example (they're from the same author):

def post_init
  @parser = Yajl::FFI::Parser.new
  @parser.start_document { puts "start document" }
  @parser.end_document   { puts "end document" }
  @parser.start_object   { puts "start object" }
  @parser.end_object     { puts "end object" }
  @parser.start_array    { puts "start array" }
  @parser.end_array      { puts "end array" }
  @parser.key            {|k| puts "key: #{k}" }
  @parser.value          {|v| puts "value: #{v}" }
end

def receive_data(data)
  begin
    @parser << data
  rescue Yajl::FFI::ParserError => e
    close_connection
  end
end

There, he sets up callbacks for all of the data events the stream parser can encounter.

Given a JSON document that looks like this:

{
  "1": {
    "name": "fred",
    "color": "red",
    "dead": true
  },
  "2": {
    "name": "tony",
    "color": "six",
    "dead": true
  },
  ...
  "n": {
    "name": "erik",
    "color": "black",
    "dead": false
  }
}

One could stream-parse it with yajl-ffi like this:

require 'yajl/ffi'

def parse_dudes(file_io, chunk_size)
  parser = Yajl::FFI::Parser.new
  object_nesting_level = 0
  current_row = {}
  current_key = nil

  parser.start_object { object_nesting_level += 1 }
  parser.end_object do
    if object_nesting_level.eql? 2
      yield current_row # hand the fully collected record to the caller's block
      current_row = {}
    end
    object_nesting_level -= 1
  end

  parser.key do |k|
    if object_nesting_level.eql? 2
      current_key = k
    elsif object_nesting_level.eql? 1
      current_row["id"] = k
    end
  end

  parser.value { |v| current_row[current_key] = v }

  # Feed the parser fixed-size chunks; the whole file is never held in memory.
  parser << file_io.read(chunk_size) until file_io.eof?
end

File.open('dudes.json') do |f|
  parse_dudes f, 1024 do |dude|
    pp dude
  end
end
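
Either way, only the record currently being assembled (plus the parser's internal buffer) is held in memory at any one time, so the footprint stays flat no matter how large the file gets.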
