I have a very large JSON file containing an array. Is it possible to use jq to split this array into several smaller arrays of a fixed size? Suppose my input was this: [1,2,3,4,5,6,7,8,9,10], and I wanted to split it into 3-element chunks. The desired output from jq would be:
[1,2,3]
[4,5,6]
[7,8,9]
[10]
In reality, my input array has nearly three million elements, all UUIDs.
4 Answers
#1
2
The following stream-oriented definition of window/3, due to Cédric Connes (github:connesc), generalizes _nwise, and illustrates a "boxing technique" that circumvents the need to use an end-of-stream marker, and can therefore be used if the stream contains the non-JSON value nan. A definition of _nwise/1 in terms of window/3 is also included.
The first argument of window/3 is interpreted as a stream. $size is the window size and $step specifies the number of values to be skipped. For example,
window(1,2,3; 2; 1)
yields:
[1,2]
[2,3]
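For the chunking use case described in the question, $step would simply be set equal to $size; for example, one would expect

window(1,2,3,4,5; 2; 2)

to yield:

[1,2]
[3,4]
[5]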
window/3 and _nwise/1
def window(values; $size; $step):
  def checkparam(name; value):
    if (value | isnormal) and value > 0 and (value | floor) == value then .
    else error("window \(name) must be a positive integer")
    end;
  checkparam("size"; $size)
  | checkparam("step"; $step)
  # We need to detect the end of the loop in order to produce the terminal partial group (if any).
  # For that purpose, we introduce an artificial null sentinel, and wrap the input values into singleton arrays in order to distinguish them.
  | foreach ((values | [.]), null) as $item (
      {index: -1, items: [], ready: false};
      (.index + 1) as $index
      # Extract items that must be reused from the previous iteration
      | if (.ready | not) then .items
        elif $step >= $size or $item == null then []
        else .items[-($size - $step):]
        end
      # Append the current item unless it must be skipped
      | if ($index % $step) < $size then . + $item
        else .
        end
      | {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
      if .ready then .items else empty end
    );

def _nwise($n): window(.[]; $n; $n);
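Since window/3 consumes a stream, the large array need not be loaded whole: its elements can be piped in from a first streaming invocation of jq, as shown in the other answers on this page. A sketch, assuming the definitions above plus the driver line window(inputs; 3; 3) are saved in a file named window.jq (the filename is only illustrative):

$ jq -c --stream 'select(length==2)|.[1]' huge.json |
  jq -nc -f window.jq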
Source:
https://gist.github.com/connesc/d6b87cbacae13d4fd58763724049da58
#2
4
There is an (undocumented) builtin, _nwise, that meets the functional requirements:
$ jq -nc '[1,2,3,4,5,6,7,8,9,10] | _nwise(3)'
[1,2,3]
[4,5,6]
[7,8,9]
[10]
Also:
$ jq -nc '_nwise([1,2,3,4,5,6,7,8,9,10];3)'
[1,2,3]
[4,5,6]
[7,8,9]
[10]
Incidentally, _nwise can be used for both arrays and strings.
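For instance, applied to a string one would expect:

$ jq -nc '"0123456789" | _nwise(3)'
"012"
"345"
"678"
"9"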
(I believe it's undocumented because there was some doubt about an appropriate name.)
TCO-version
Unfortunately, the builtin version is carelessly defined, and will not perform well for large arrays. Here is an optimized version (it should be about as efficient as a non-recursive version):
def nwise($n):
  def _nwise:
    if length <= $n then . else .[0:$n] , (.[$n:]|_nwise) end;
  _nwise;
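To apply this to the original file, the definition could be saved in a file, say chunks.jq (a filename chosen here only for illustration), followed by the driver line nwise(3), and then invoked with:

$ jq -c -f chunks.jq huge.json

Note that this still reads the entire array into memory (see the resident-size figure below); the streaming approaches in the other answers avoid that.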
For an array of size 3 million, this is quite performant: 3.91s on an old Mac, 162746368 max resident size.
Notice that this version (using tail-call optimized recursion) is actually faster than the version of nwise/2 using foreach shown elsewhere on this page.
#3
1
The below is hackery, to be sure -- but memory-efficient hackery, even with an arbitrarily long list:
jq -c --stream 'select(length==2)|.[1]' <huge.json \
| jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])'
The first piece of the pipeline streams your input JSON file, emitting one line per element, assuming the array consists of atomic values (where [] and {} are here counted as atomic values). Because it runs in streaming mode, it doesn't need to hold the entire content in memory, even though the input is a single document.
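To see what the first stage emits, here is a sketch of running it on the small sample from the question (a bash here-string standing in for the large file):

$ jq -c --stream 'select(length==2)|.[1]' <<< '[1,2,3,4,5,6,7,8,9,10]'
1
2
3
4
5
6
7
8
9
10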
The second piece of the pipeline repeatedly reads up to three items and assembles them into a list.
This should avoid needing more than three pieces of data in memory at a time.
#4
1
If the array is too large to fit comfortably in memory, then I'd adopt the strategy suggested by @CharlesDuffy -- that is, stream the array elements into a second invocation of jq using a stream-oriented version of nwise, such as:
def nwise(stream; $n):
  foreach (stream, nan) as $x ([];
    if length == $n then [$x] else . + [$x] end;
    if (.[-1] | isnan) and length>1 then .[:-1]
    elif length == $n then .
    else empty
    end);
The "driver" for the above would be:
nwise(inputs; 3)
But please remember to use the -n command-line option.
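In other words, the file nwise.jq used in the pipeline below would presumably contain the definition of nwise(stream; $n) above followed by this single driver line:

nwise(inputs; 3)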
To create the stream from an arbitrary array:
$ jq -cn --stream '
    fromstream( inputs | (.[0] |= .[1:])
                | select(. != [[]]) )' huge.json
So the shell pipeline might look like this:
$ jq -cn --stream '
    fromstream( inputs | (.[0] |= .[1:])
                | select(. != [[]]) )' huge.json |
  jq -n -f nwise.jq
This approach is quite performant. For grouping a stream of 3 million items into groups of 3 using nwise/2, /usr/bin/time -lp for the second invocation of jq gives:
user 5.63
sys 0.04
1261568 maximum resident set size
Caveat: this definition uses nan as an end-of-stream marker. Since nan is not a JSON value, this cannot be a problem for handling JSON streams.