The UNIX sort command can sort a very large file like this:
sort large_file
How is the sort algorithm implemented?
How come it does not consume excessive memory?
7 Answers
#1
96
The Algorithmic details of UNIX Sort command says Unix Sort uses an external R-way merge sorting algorithm. The link goes into more detail, but in essence it divides the input into smaller portions (that fit into memory) and then merges the sorted portions together at the end.
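For a concrete picture of that split-sort-merge approach, here is a minimal sketch built from standard tools; the chunk size and file names are arbitrary placeholders, not what sort uses internally:
# split the input into pieces small enough to sort in memory,
# sort each piece, then merge the sorted runs in a single streaming pass
split -l 1000000 large_file chunk_
for f in chunk_*; do sort -o "$f" "$f"; done
sort -m chunk_* > large_file.sorted
rm chunk_*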
#2
33
The sort command stores working data in temporary disk files (usually in /tmp).
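If /tmp is too small or too slow for your data, you can point those temporary files somewhere else (the path below is just an example):
# -T selects the directory used for sort's temporary files
sort -T /mnt/bigdisk/tmp large_file > large_file.sorted
# setting TMPDIR in the environment has the same effect
TMPDIR=/mnt/bigdisk/tmp sort large_file > large_file.sorted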
#3
13
WARNING: This script starts one shell per chunk; for really large files, that could be hundreds.
Here is a script I wrote for this purpose. On a 4-processor machine it improved sort performance by 100%!
#! /bin/ksh
MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted
usage ()
{
echo Parallel sort
echo usage: psort file1 file2
echo Sorts text file file1 and stores the output in file2
echo Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines
echo and each chunk will be sorted in parallel
}
# test if we have two arguments on the command line
if [ $# != 2 ]
then
usage
exit 1
fi
# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
rm -f $SORTED_FILE
# Splitting $ORIGINAL_FILE into chunks ...
split -l $MAX_LINES_PER_CHUNK $ORIGINAL_FILE $CHUNK_FILE_PREFIX
for file in $CHUNK_FILE_PREFIX*
do
sort $file > $file.sorted &
done
wait
# Merging chunks to $SORTED_FILE ...
sort -m $SORTED_CHUNK_FILES > $SORTED_FILE
# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
See also: "Sorting large files faster with a shell script"
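Assuming the script above is saved as psort and made executable, a typical run looks like this:
chmod +x psort
./psort huge_input.txt huge_input.sorted.txt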
#4
11
I'm not familiar with the program, but I guess it is done by means of external sorting (most of the problem is held in temporary files, while a relatively small part of the problem is held in memory at a time). See Donald Knuth's The Art of Computer Programming, Vol. 3, Sorting and Searching, Section 5.4 for a very in-depth discussion of the subject.
#5
11
#!/bin/bash
usage ()
{
echo Parallel sort
echo usage: psort file1 file2
echo Sorts text file file1 and stores the output in file2
}
# test if we have two arguments on the command line
if [ $# != 2 ]
then
usage
exit 1
fi
# pv streams the input with a progress meter; the first parallel splits the
# stream into blocks and sorts each block into a temp file (--files); the
# second parallel merges all temp files in a single job (-Xj1) and removes them
pv $1 | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {} > $2
#6
4
Look carefully at the options of sort to speed up performance and understand their impact on your machine and problem. Key parameters on Ubuntu are:
- Location of temporary files: -T directory_name
- Amount of memory to use: -S N% (N% of all memory to use; the more the better, but avoid over-subscription that causes swapping to disk. You can use it like "-S 80%" to use 80% of available RAM, or "-S 2G" for 2 GB of RAM.)
The questioner asks "Why no high memory usage?" The answer comes from history: older Unix machines were small, so the default memory size is set small. Adjust this as big as possible for your workload to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25× the size of the file being sorted.
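Putting those options together, a tuned invocation with GNU coreutils sort might look like the following; the path and sizes are examples, and --parallel is GNU-specific:
# 80% of RAM, temporary files on a fast disk, 4 sorting threads
sort -S 80% -T /fastdisk/tmp --parallel=4 -o large_file.sorted large_file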
#7
-3
Memory should not be a problem - sort already takes care of that. If you want to make optimal use of your multi-core CPU, I have implemented this in a small script (similar to some you might find on the net, but simpler/cleaner than most of those ;)).
#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous threads before being merged.
#
# psort largefile.txt 20m 4
#
# by h.p.
split -b $2 $1 $1.part
suffix=sorttemp.`date +%s`
nthreads=$3
i=0
for fname in $1.part*
do
let i++
sort $fname > $fname.$suffix &
# wait for the current batch after every $nthreads background sorts
mres=$(($i % $nthreads))
test "$mres" -eq 0 && wait
done
wait
# merge the sorted chunks to stdout, then remove the temporary files
sort -m *.$suffix
rm $1.part*
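Note that the merged result is written to stdout, so a typical invocation (assuming the script is saved as psort and made executable) redirects it to a file:
./psort largefile.txt 20m 4 > largefile.sorted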