How can I use GNU sort on numeric data stored in binary format?

Date: 2021-04-30 20:12:35

Is there any way to use GNU Coreutils sort on 64-bit numbers stored in a binary file? If the file weren't binary, sort -n would be the solution, but I couldn't find any option that makes it work with binary data.

The file is quite large (~100 GB), and if possible I'd rather not make a text (non-binary) copy of it.

Sample of data:

$ xxd file
00292e0: 4036 1eb7 6888 d319 de6b 7402 9ca9 f116  @6..h....kt.....
00292f0: db68 7f05 199f 9d36 cf01 cb28 e49f 1116  .h.....6...(....
0029300: 0c7c 8b55 2963 ef0c 277a f2b0 38d7 2b19  .|.U)c..'z..8.+.
0029310: c83b 2614 4327 d838 820c 1bb8 444f 1731  .;&.C'.8....DO.1
0029320: 1695 cab3 cd12 092a 0691 d7e4 5fcc b01d  .......*...._...
0029330: b12b 7c1b a209 7c1c 568a 125c 541c d334  .+|...|.V..\T..4
0029340: 09a3 ecbc 8370 e205 9265 7759 a378 4e2f  .....p...ewY.xN/
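
The question does not say whether the numbers are little-endian or big-endian, or whether they are signed, and any sorting approach needs to know. As a small illustration, the first 16 bytes of the dump can be decoded both ways with Python's struct module (treating them as unsigned is an assumption made here):

```python
import struct

# First row of the xxd dump from the question: two 64-bit values.
row = bytes.fromhex("40361eb76888d319de6b74029ca9f116")

# Decode the same bytes under both byte orders (unsigned is an assumption).
decoded = {
    "little-endian": struct.unpack("<2Q", row),
    "big-endian": struct.unpack(">2Q", row),
}

for name, (first, second) in decoded.items():
    print(f"{name:>13}: {first:#018x} {second:#018x}")
```

Only with big-endian unsigned values does plain byte-wise comparison of the records agree with numeric order; any other layout needs the key bytes reinterpreted first.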

2 answers

#1


1  

The bsort utility does this.

It is a lightning-fast in-place radix sort written in C. One of the test cases during its development was a 100 GB file on a machine with 16 GB of RAM; it took about 22 seconds to sort.
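
To give a feel for what an in-place radix sort over fixed-width binary records involves, here is a toy sketch in Python. It is not bsort's code and says nothing about how bsort is implemented; the function name `radix_sort_records` and its parameters are invented for this illustration, and it assumes unsigned 64-bit keys (big-endian by default).

```python
import struct


def radix_sort_records(buf, width=8, little_endian=False, lo=0, hi=None, digit=0):
    """Sort fixed-width unsigned records in `buf` (a bytearray) in place.

    MSD radix sort: partition records lo..hi by their most significant
    unsorted byte, then recurse into each bucket on the next byte.
    """
    if hi is None:
        hi = len(buf) // width
    if hi - lo < 2 or digit >= width:
        return
    # Position of the current key byte inside a record.
    offset = width - 1 - digit if little_endian else digit

    # Count how many records fall into each of the 256 buckets.
    counts = [0] * 256
    for i in range(lo, hi):
        counts[buf[i * width + offset]] += 1

    # Turn the counts into bucket boundaries (in record indices).
    starts, pos = [], lo
    for c in counts:
        starts.append(pos)
        pos += c
    ends = [s + c for s, c in zip(starts, counts)]
    nxt = list(starts)  # next unsettled slot in each bucket

    # Swap records into their buckets without any auxiliary copy of the data.
    for b in range(256):
        while nxt[b] < ends[b]:
            i = nxt[b]
            d = buf[i * width + offset]
            if d == b:
                nxt[b] += 1  # record already sits in its own bucket
            else:
                j = nxt[d]
                nxt[d] += 1
                a, c = i * width, j * width
                buf[a:a + width], buf[c:c + width] = buf[c:c + width], buf[a:a + width]

    # Refine every bucket that holds more than one record.
    for b in range(256):
        if counts[b] > 1:
            radix_sort_records(buf, width, little_endian, starts[b], ends[b], digit + 1)


if __name__ == "__main__":
    data = bytearray(struct.pack(">4Q", 300, 5, 2**63, 5))
    radix_sort_records(data)
    print(struct.unpack(">4Q", data))
```

The same function should also accept a writable mmap of a file in place of the bytearray, although pure Python would be far too slow for a file of the size in the question.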

#2


0  

sort(1) will not help you here. For a small file you could convert the records into text lines and feed them to sort(1), but of course not for a 100 GB file.
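
For the small-file case, the conversion to lines and back could be sketched as follows. The helper names `dump_decimal` and `load_decimal` are made up for this example, and the `"<Q"` format (unsigned, little-endian) is an assumption about the data that the question does not confirm.

```python
import io
import struct


def dump_decimal(src, dst, fmt="<Q"):
    """Read fixed-size binary records from src and write one decimal per line."""
    size = struct.calcsize(fmt)
    while True:
        chunk = src.read(size)
        if len(chunk) < size:  # end of input (a trailing partial record is dropped)
            break
        (value,) = struct.unpack(fmt, chunk)
        dst.write(b"%d\n" % value)


def load_decimal(src, dst, fmt="<Q"):
    """Read decimal lines from src and write them back as binary records."""
    for line in src:
        text = line.decode("ascii").strip()
        if text:
            dst.write(struct.pack(fmt, int(text)))


if __name__ == "__main__":
    raw = struct.pack("<3Q", 300, 5, 2**63)
    as_text = io.BytesIO()
    dump_decimal(io.BytesIO(raw), as_text)
    print(as_text.getvalue().decode("ascii"), end="")
```

Wrapped in a small script that reads standard input and writes standard output, the text produced by `dump_decimal` could be piped through sort -n and then into `load_decimal`. Each 8-byte value becomes up to 21 bytes of text (20 digits plus a newline), which is why the approach scales poorly.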

The answer to this question on Serverfault links to a tool written to solve exactly your task. You can check the GitHub project there (it seems to be written in Go, so you will need to install a compiler if you decide to use it).

A quick search does not turn up any other popular tool for this task written in a more popular language (which surprises me a bit, as the task itself is just a merge sort that thousands of students implement each year in their CS courses, but that's off-topic).
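
For reference, the kind of external merge sort alluded to here might look roughly like the sketch below: sort chunks that fit in memory, write each as a temporary run, then merge the runs. The names `external_sort`, `read_values` and `chunk_records` are invented for this example, and the record layout (unsigned 64-bit, little-endian) is an assumption about the data.

```python
import heapq
import struct
import tempfile

RECORD = struct.Struct("<Q")  # assumed layout: unsigned 64-bit, little-endian


def read_values(stream):
    """Yield the integers stored in a binary stream, one record at a time."""
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            return
        yield RECORD.unpack(chunk)[0]


def external_sort(src_path, dst_path, chunk_records=1_000_000):
    """Sort the records of src_path into dst_path using sorted runs on disk."""
    runs = []
    try:
        # Phase 1: sort memory-sized chunks and store each one as a run.
        with open(src_path, "rb") as src:
            while True:
                block = src.read(chunk_records * RECORD.size)
                usable = len(block) - len(block) % RECORD.size
                if usable == 0:
                    break
                values = sorted(v for (v,) in RECORD.iter_unpack(block[:usable]))
                run = tempfile.TemporaryFile()
                run.write(struct.pack(f"<{len(values)}Q", *values))
                run.seek(0)
                runs.append(run)
        # Phase 2: k-way merge of all runs into the output file.
        with open(dst_path, "wb") as dst:
            for value in heapq.merge(*(read_values(run) for run in runs)):
                dst.write(RECORD.pack(value))
    finally:
        for run in runs:
            run.close()
```

Unlike an in-place sort, this needs scratch space roughly equal to the input size, and `chunk_records` has to be large enough that the number of simultaneously open runs stays below the system's open-file limit.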
