Multithreading C++ Out of Core Sotring for Massive Data|多线程C++的大规模数据外部排序

先说一下，这个其实是我为实现PantaRay或者是类似Dreamworks的Out of Core点云GI的技术储备，为大规模点云光线跟踪所准备的第一步。在实际的应用中，int类型会被64bit的uint64_t所代替，代表空间中的一个hash键。所有的代码全部使用STL+boost实现了足够高层次的抽象，读者完全可以根据自己的需要改写。

This is the first step to implement the PantaRay or the GI solution from Dreamworks about Out-Core point cloud sorting. Actually the int type in the code would be replaced by he uint64_t which indices a hash key in space. All fragments code are using the STL+Boost, user can modify the code by yourself.

我们先来准备测试数据。这个测试数据有尺寸大小的限制，就是在现在x86_64环境下malloc/new分配的单个数组有1G尺寸的限制，这样就意味着内排序一次操作的数据不可能大于1G，造成了测试上的限制，所以我只生成了一个尺寸大约962M的文件测试，包含了246324610个int。

First of all, let’s prepare the test data. But as we know, there is the 1G array size limitation in x86_64, so that we can only apply qsort or std::stable_sort to a < 1G array. For this test I generate a 962M file which contains the 246324610 integers.

如下程序生成测试数据，均匀分布的Mersenne Twister 19937序列。

The following program generates the test data, using the MT19937 uniform distribution.

#include <iostream>

#include <boost/random/mersenne_twister.hpp>

#include <boost/random/uniform_int_distribution.hpp>

int main(int argc, char *argv[])

{

    -- argc, ++ argv;

    if (argc != )

    {

        return ;

    }

    char * szPath = argv[];

    int iCount = atoi(argv[]);

    std::cout << szPath << " " << iCount << std::endl;

    boost::random::mt19937 cGen;

    boost::random::uniform_int_distribution<> cDist(, );

    FILE * pFile = fopen(szPath, "wb");

    if (pFile)

    {

        for (int i = ; i < iCount; ++ i)

        {

            int iRandom = cDist(cGen);

            fwrite(& iRandom, sizeof(int), , pFile);

        }

        fclose(pFile);

    }

    return ;

}

然后生成内排序的结果，储存为外部独立文件为了比较。

Generate the internal sorted result to verify the data.

int main(int argc, char * argv[])

{

    PlaySTL();

    -- argc, ++ argv;

    if (argc != )

    {

        return EXIT_FAILURE;

    }

    FILE * pOriginalFile = fopen(argv[], "rb");

    fseek(pOriginalFile, , SEEK_END);

    long lSize = ftell(pOriginalFile);

    fseek(pOriginalFile, , SEEK_SET);

    int iNumItems = lSize / ;

    int * pData = new int[iNumItems];

    fread(pData, sizeof(int), iNumItems, pOriginalFile);

    fclose(pOriginalFile);

    std::stable_sort(pData, pData + iNumItems, std::less<int>());

    FILE * pSortedFile = fopen(argv[], "wb");

    fwrite(pData, sizeof(int), iNumItems, pSortedFile);

    fclose(pSortedFile);

    delete [] pData;

    return EXIT_SUCCESS;

}

从设计的思路上，由于操作系统在磁盘IO上都是单线程的，每次只允许一个线程读写，所以把读取的工作部分都放在主线程中，排序线程为了让磁盘写入的时间占据总共处理的时间尽可能地小，所以尽可能的让一个工作线程处理更多的数据。

Because the disk access is synchronized at low-level IO, so that we will read the data in the main thread, the working thread process as much as data as possible to reduce the percent of time on disk writing.

先让我们定义一个名字叫做Job的类，顾名思义，代表一个计算任务，每个计算任务都有一个自己的索引，以及一堆乱序的整数int数据。

Let’s define a Job class, each Job has a index and unsorted data.

class Job

{

public:

    Job()

    :

    m_iIndex(),

    m_iNumItems()

    {

    }

    Job(int iIndex, int iNumItems, const boost::shared_array<int> & aData)

    :

    m_iIndex(iIndex),

    m_iNumItems(iNumItems),

    m_aData(aData)

    {

    }

    Job(const Job & cCopy)

    :

    m_iIndex(cCopy.m_iIndex),

    m_iNumItems(cCopy.m_iNumItems),

    m_aData(cCopy.m_aData)

    {

    }

public:

    int m_iIndex;

    int m_iNumItems;

    boost::shared_array<int> m_aData;

};

然后再来一个Context，负责存储用于计算的共享数据，比如工作队列，以及Mutex等为了同步所需要的对象。

Later the Context class, to keep the queue and mutex objects.

class Context

{

public:

    Context(int iNumSortingThread)

    :

    m_iNumSortingThread(iNumSortingThread),

    m_bHasMoreData(true)

    {

    }

public:

    int m_iNumSortingThread;

    bool m_bHasMoreData;

    boost::mutex m_cMutex;

    boost::condition_variable m_cEvent;

    std::list<Job > m_lJobQueue;

};

这里是工作线程，其中有工作代码的实现。当访问Context中的队列时必须要加锁，抓一个工作包出来，当作局部数据，接下来再排序和写出为Cache，末了尽可能贪婪的告诉主线程我们需要更多的数据，如果真的是没有任何数据了则直接退出。

Here is the working thread, it will get a Job from the queue, sort the data, and write out, at the end, tell the main thread it needs more data to process, if there is no more data it will return.

class SortingThread : public boost::thread

{

public:

    SortingThread(const boost::shared_ptr<Context> & pContext)

    :

    m_pContext(pContext),

    boost::thread(boost::bind(& SortingThread::Sort, this))

    {

    }

    void Sort()

    {

        while ()

        {

            if (! m_pContext->m_bHasMoreData)

            {

                if (! m_pContext->m_lJobQueue.size())

                {

                    break;

                }

            }

            Job cJob;

            {

                boost::unique_lock<boost::mutex> cLock(m_pContext->m_cMutex);

                if (m_pContext->m_lJobQueue.size())

                {

                    // Get a job.

                    //

                    cJob = m_pContext->m_lJobQueue.front();

                    m_pContext->m_lJobQueue.pop_front();

                }

            }

            if (cJob.m_iNumItems)

            {

                std::stable_sort(cJob.m_aData.get(), cJob.m_aData.get() + cJob.m_iNumItems, std::less<int>());

                // Write out the sorted data.

                //

                char aBuffer[];

                sprintf(aBuffer, "%.06d.tmp", cJob.m_iIndex);

                std::ofstream cOutput(aBuffer, std::ios_base::binary);

                cOutput.write(reinterpret_cast<const char *>(cJob.m_aData.get()), cJob.m_iNumItems * sizeof(int));

            }

            // Tell the main thread we need more data here.

            //

            m_pContext->m_cEvent.notify_one();

        }

    }

private:

    boost::shared_ptr<Context> m_pContext;

};

把所有的线程都放入线程池，这样就可以一股脑的执行了。

The simple thread pool.

class SortingThreadGroup : public boost::thread_group

{

public:

    SortingThreadGroup(const boost::shared_ptr<Context> & pContext)

    :

    m_pContext(pContext)

    {

        for (int i = ; i < m_pContext->m_iNumSortingThread; ++ i)

        {

            SortingThread * pSortingThread = new SortingThread(pContext);

            add_thread(pSortingThread);

        }

    }

private:

    boost::shared_ptr<Context> m_pContext;

};

主线程从外部文件读取数据填充Job对象，尽可能的把整个队列的数据控制在一定得范围内，这样内存的占用可以小一些，否则就失去了外排序的意义。

Main thread reads data from file, fills the Job, and keep the memory usage minimal.

bool Sort(const char * szPath, int iNumSortingThreads, int iNumLocalItems)

{

    try

    {

        // Calculate real size.

        //

        std::ifstream cUnSortedFile(szPath, std::ios_base::binary);

        boost::uintmax_t ullSize = boost::filesystem::file_size(szPath);

        boost::uintmax_t ullNumItems = ullSize / ;

        int iNumBatches = ullNumItems / iNumLocalItems;

        std::vector<int> vNumItemsPerBatch(iNumBatches, iNumLocalItems);

        int iNumRestItems = ullNumItems % iNumLocalItems;

        if (iNumRestItems)

        {

            vNumItemsPerBatch.push_back(iNumRestItems);

        }

        std::cout << "Number of Items   : " << ullNumItems << std::endl

                  << "Number of Batches : " << vNumItemsPerBatch.size() << std::endl;

        boost::shared_ptr<Context> pContext(new Context(iNumSortingThreads));

        boost::scoped_ptr<SortingThreadGroup> pSortingThreadGroup(new SortingThreadGroup(pContext));

        boost::timer::auto_cpu_timer cTimer;

        for (int i = ; i < vNumItemsPerBatch.size(); ++ i)

        {

            boost::shared_array<int> aData(new int[vNumItemsPerBatch[i]]);

            cUnSortedFile.read(reinterpret_cast<char *>(aData.get()), vNumItemsPerBatch[i] * sizeof(int));

            Job cJob(i, vNumItemsPerBatch[i], aData);

            //

            boost::unique_lock<boost::mutex> cLock(pContext->m_cMutex);

            if (pContext->m_lJobQueue.size() > iNumSortingThreads * )

            {

                pContext->m_cEvent.wait(cLock);

            }

            pContext->m_lJobQueue.push_back(cJob);

        }

        std::cout << std::endl;

        pContext->m_bHasMoreData = false;

        pSortingThreadGroup->join_all();

        return true;

    }

    catch(const std::exception & cE)

    {

        std::cerr << cE.what() << std::endl;

    }

    catch(...)

    {

        std::cerr << __LINE__ << std::endl;

    }

    return false;

}

第二遍就是k Way Merge Sorting了。这里的思路很简单，直接读取外部的一坨文件，以及维护一个队列，每次从活的最小数字的那一列输出候选者，然后读出下一个放入队列。如果文件读完了，则说明那一路文件流可以丢弃了，队列也相应的变小了。这里当然是单线程的。

The second pass is the single-threaded classical k-Way Merging Sorting.

bool Merge(const char * szPath, int iNumBatches)

{

    try

    {

        //TODO : There is the limitation about the max number of opened file in process.

        //

        std::vector<boost::shared_ptr<std::ifstream> > vTempFiles;

        for (int i = ; i < iNumBatches; ++ i)

        {

            char aBuffer[];

            sprintf(aBuffer, "%.06d.tmp", i);

            boost::shared_ptr<std::ifstream> pTempFile(new std::ifstream(aBuffer, std::ios_base::binary));

            assert(pTempFile->is_open());

            vTempFiles.push_back(pTempFile);

        }

        std::ofstream cSortedFile(szPath, std::ios_base::binary);

        if (! cSortedFile)

        {

            std::cerr << "Can't open " << szPath << " to write. " << std::endl;

            return false;

        }

        //

        boost::timer::auto_cpu_timer cTimer;

        std::vector<int> vCache;

        vCache.reserve( *  * );

        std::vector<int> vQueue;

        std::vector<boost::shared_ptr<std::ifstream> >::iterator iFile = vTempFiles.begin();

        for (; iFile != vTempFiles.end(); ++ iFile)

        {

            int iNumber = - ;

            if ((* iFile)->read(reinterpret_cast<char *>(& iNumber), sizeof(int)))

            {

                vQueue.push_back(iNumber);

            }

        }

        do

        {

            std::vector<int>::iterator iMinPos = std::min_element(vQueue.begin(), vQueue.end());

            vCache.push_back(* iMinPos);

            if (vCache.size() == vCache.capacity())

            {

                cSortedFile.write(reinterpret_cast<const char *>(& vCache[]), vCache.size() * sizeof(int));

                vCache.clear();

            }

            iFile = vTempFiles.begin() + (iMinPos - vQueue.begin());

            int iNumber = - ;

            if ((* iFile)->read(reinterpret_cast<char *>(& iNumber), sizeof(int)))

            {

                (* iMinPos) = iNumber;

            }

            else

            {

                vTempFiles.erase(iFile);

                vQueue.erase(iMinPos);

            }

        } while (vQueue.size());

        cSortedFile.write(reinterpret_cast<const char *>(& vCache[]), vCache.size() * sizeof(int));

        return true;

    }

    catch(const std::exception & cE)

    {

        std::cerr << cE.what() << std::endl;

    }

    catch(...)

    {

        std::cerr << __LINE__ << std::endl;

    }

    return false;

}

测试的环境为Xeon E5-2603@1.8G，4个硬件线程，测试设置的Job中的数据长度为80M，每次工作线程需要排序20M个int。西部数据的蓝盘，非SSD，也不是混合硬盘，纯机械硬盘。

Tested by single Xeon E5-2603 CPU at 1.8G with 4 hardware threads, each thread process 20M integers. Using WD blue disk, not SSD,.

第一个Sort遍的时间为19.111468s wall, 52.369536s user + 4.243227s system = 56.612763s CPU (296.2%)，CPU效率为296.2%/300% = 98.7%，几乎所有时间都在STL中的std::stable_sort里。

Sorting pass used total 19.11 seconds with 98.7% CPU usage.

第二个Merge遍的时间为33.082600s wall, 29.874191s user + 3.010819s system = 32.885011s CPU (99.4%)，主要还是都在磁盘写入和排序。当然这里可以为每个文件流构造一个Cache，也可以显著地提高性能，不过这里有一个问题，一旦牵涉到了Cache，则必然又有内存的占用提升，如果占用过大则又失去了Merge的意义。

这里读者可能有个问题，关于主线程中的不停new，其实从Vista开始Windows的内存分配其实已经是池化的，而且这里根本不是性能瓶颈，只有磁盘IO才是，所以这里可以不需要优化。至于架构上的提升其实也不大，因为这里不是传统的多读取者+单写入着（Multiple Reader+Single Writer）而是多读取者写入者+单写入者（Multiple Reader and Writer + Single Writer），所以在结构上和传统的消费者/生产者的多线程工作方式还是有些不同。未来会尝试Lock-Free的工作方式而不用Mutex，这个是以后的内容了。

The memory allocation in the main thread is not a bottleneck compared with the disk IO and sorting, and the memory allocation is based on pool since Vista, so here we might discard the optimization. Later the Lock-Free architecture might be implemented.

这里有全套代码。

Here is the full code.

 /**

  * Multithreading C++ Out of Core Sotring for Massive Data

  *

  * Copyright (c) 2013 Bo Zhou<Bo.Schwarzstein@gmail.com>

  * All rights reserved.

  * Redistribution and use in source and binary forms, with or without

  * modification, are permitted provided that the following conditions are met:

  *

  *     * Redistributions of source code must retain the above copyright

  *       notice, this list of conditions and the following disclaimer.

  *     * Redistributions in binary form must reproduce the above copyright

  *       notice, this list of conditions and the following disclaimer in the

  *       documentation and/or other materials provided with the distribution.

  *     * Neither the name of the University of California, Berkeley nor the

  *       names of its contributors may be used to endorse or promote products

  *       derived from this software without specific prior written permission.

  */

 #include <fstream>

 #include <list>

 #include <iostream>

 #include <queue>

 #include <boost/filesystem.hpp>

 #include <boost/smart_ptr.hpp>

 #include <boost/thread.hpp>

 #include <boost/timer/timer.hpp>

 class Job

 {

 public:

     Job()

     :

     m_iIndex(),

     m_iNumItems()

     {

     }

     Job(int iIndex, int iNumItems, const boost::shared_array<int> & aData)

     :

     m_iIndex(iIndex),

     m_iNumItems(iNumItems),

     m_aData(aData)

     {

     }

     Job(const Job & cCopy)

     :

     m_iIndex(cCopy.m_iIndex),

     m_iNumItems(cCopy.m_iNumItems),

     m_aData(cCopy.m_aData)

     {

     }

 public:

     int m_iIndex;

     int m_iNumItems;

     boost::shared_array<int> m_aData;

 };

 class Context

 {

 public:

     Context(int iNumSortingThread)

     :

     m_iNumSortingThread(iNumSortingThread),

     m_bHasMoreData(true)

     {

     }

 public:

     int m_iNumSortingThread;

     bool m_bHasMoreData;

     boost::mutex m_cMutex;

     boost::condition_variable m_cEvent;

     std::list<Job > m_lJobQueue;

 };

 class SortingThread : public boost::thread

 {

 public:

     SortingThread(const boost::shared_ptr<Context> & pContext)

     :

     m_pContext(pContext),

     boost::thread(boost::bind(& SortingThread::Sort, this))

     {

     }

     void Sort()

     {

         while ()

         {

             if (! m_pContext->m_bHasMoreData)

             {

                 if (! m_pContext->m_lJobQueue.size())

                 {

                     break;

                 }

             }

             Job cJob;

             {

                 boost::unique_lock<boost::mutex> cLock(m_pContext->m_cMutex);

                 if (m_pContext->m_lJobQueue.size())

                 {

                     // Get a job.

                     //

                     cJob = m_pContext->m_lJobQueue.front();

                     m_pContext->m_lJobQueue.pop_front();

                 }

             }

             if (cJob.m_iNumItems)

             {

                 std::stable_sort(cJob.m_aData.get(), cJob.m_aData.get() + cJob.m_iNumItems, std::less<int>());

                 // Write out the sorted data.

                 //

                 char aBuffer[];

                 sprintf(aBuffer, "%.06d.tmp", cJob.m_iIndex);

                 std::ofstream cOutput(aBuffer, std::ios_base::binary);

                 cOutput.write(reinterpret_cast<const char *>(cJob.m_aData.get()), cJob.m_iNumItems * sizeof(int));

             }

             // Tell the main thread we need more data here.

             //

             m_pContext->m_cEvent.notify_one();

         }

     }

 private:

     boost::shared_ptr<Context> m_pContext;

 };

 class SortingThreadGroup : public boost::thread_group

 {

 public:

     SortingThreadGroup(const boost::shared_ptr<Context> & pContext)

     :

     m_pContext(pContext)

     {

         for (int i = ; i < m_pContext->m_iNumSortingThread; ++ i)

         {

             SortingThread * pSortingThread = new SortingThread(pContext);

             add_thread(pSortingThread);

         }

     }

 private:

     boost::shared_ptr<Context> m_pContext;

 };

 ///////////////////////////////////////////////////////////////////////////////////////////////////

 bool Sort(const char * szPath, int iNumSortingThreads, int iNumLocalItems)

 {

     try

     {

         // Calculate real size.

         //

         std::ifstream cUnSortedFile(szPath, std::ios_base::binary);

         boost::uintmax_t ullSize = boost::filesystem::file_size(szPath);

         boost::uintmax_t ullNumItems = ullSize / ;

         int iNumBatches = ullNumItems / iNumLocalItems;

         std::vector<int> vNumItemsPerBatch(iNumBatches, iNumLocalItems);

         int iNumRestItems = ullNumItems % iNumLocalItems;

         if (iNumRestItems)

         {

             vNumItemsPerBatch.push_back(iNumRestItems);

         }

         std::cout << "Number of Items   : " << ullNumItems << std::endl

                   << "Number of Batches : " << vNumItemsPerBatch.size() << std::endl;

         boost::shared_ptr<Context> pContext(new Context(iNumSortingThreads));

         boost::scoped_ptr<SortingThreadGroup> pSortingThreadGroup(new SortingThreadGroup(pContext));

         boost::timer::auto_cpu_timer cTimer;

         for (int i = ; i < vNumItemsPerBatch.size(); ++ i)

         {

             boost::shared_array<int> aData(new int[vNumItemsPerBatch[i]]);

             cUnSortedFile.read(reinterpret_cast<char *>(aData.get()), vNumItemsPerBatch[i] * sizeof(int));

             Job cJob(i, vNumItemsPerBatch[i], aData);

             //

             boost::unique_lock<boost::mutex> cLock(pContext->m_cMutex);

             if (pContext->m_lJobQueue.size() > iNumSortingThreads * )

             {

                 pContext->m_cEvent.wait(cLock);

             }

             pContext->m_lJobQueue.push_back(cJob);

         }

         std::cout << std::endl;

         pContext->m_bHasMoreData = false;

         pSortingThreadGroup->join_all();

         return true;

     }

     catch(const std::exception & cE)

     {

         std::cerr << cE.what() << std::endl;

     }

     catch(...)

     {

         std::cerr << __LINE__ << std::endl;

     }

     return false;

 }

 ///////////////////////////////////////////////////////////////////////////////////////////////////

 bool Merge(const char * szPath, int iNumBatches)

 {

     try

     {

         //TODO : There is the limitation about the max number of opened file in process.

         //

         std::vector<boost::shared_ptr<std::ifstream> > vTempFiles;

         for (int i = ; i < iNumBatches; ++ i)

         {

             char aBuffer[];

             sprintf(aBuffer, "%.06d.tmp", i);

             boost::shared_ptr<std::ifstream> pTempFile(new std::ifstream(aBuffer, std::ios_base::binary));

             assert(pTempFile->is_open());

             vTempFiles.push_back(pTempFile);

         }

         std::ofstream cSortedFile(szPath, std::ios_base::binary);

         if (! cSortedFile)

         {

             std::cerr << "Can't open " << szPath << " to write. " << std::endl;

             return false;

         }

         //

         boost::timer::auto_cpu_timer cTimer;

         std::vector<int> vCache;

         vCache.reserve( *  * );

         std::vector<int> vQueue;

         std::vector<boost::shared_ptr<std::ifstream> >::iterator iFile = vTempFiles.begin();

         for (; iFile != vTempFiles.end(); ++ iFile)

         {

             int iNumber = - ;

             if ((* iFile)->read(reinterpret_cast<char *>(& iNumber), sizeof(int)))

             {

                 vQueue.push_back(iNumber);

             }

         }

         do

         {

             std::vector<int>::iterator iMinPos = std::min_element(vQueue.begin(), vQueue.end());

             vCache.push_back(* iMinPos);

             if (vCache.size() == vCache.capacity())

             {

                 cSortedFile.write(reinterpret_cast<const char *>(& vCache[]), vCache.size() * sizeof(int));

                 vCache.clear();

             }

             iFile = vTempFiles.begin() + (iMinPos - vQueue.begin());

             int iNumber = - ;

             if ((* iFile)->read(reinterpret_cast<char *>(& iNumber), sizeof(int)))

             {

                 (* iMinPos) = iNumber;

             }

             else

             {

                 vTempFiles.erase(iFile);

                 vQueue.erase(iMinPos);

             }

         } while (vQueue.size());

         cSortedFile.write(reinterpret_cast<const char *>(& vCache[]), vCache.size() * sizeof(int));

         return true;

     }

     catch(const std::exception & cE)

     {

         std::cerr << cE.what() << std::endl;

     }

     catch(...)

     {

         std::cerr << __LINE__ << std::endl;

     }

     return false;

 }

 int main(int argc, char * argv[])

 {

     int iRet = EXIT_FAILURE;

     //

     char * szPath = NULL;

     int iNumSortingThreads = ;

     int iNumLocalItems = ;

     int iNumBatches = ;

     //

     -- argc, ++ argv;

     if (argc == )

     {

         szPath = argv[];

         iNumSortingThreads = atoi(argv[]);

         iNumLocalItems = atoi(argv[]) *  * ;

         if (Sort(szPath, iNumSortingThreads, iNumLocalItems))

         {

             iRet = EXIT_SUCCESS;

         }

     }

     else if (argc == )

     {

         szPath = argv[];

         iNumBatches = atoi(argv[]);

         if (Merge(szPath, iNumBatches))

         {

             iRet = EXIT_SUCCESS;

         }

     }

     return iRet;

 }

View Full Code

秒客网

Multithreading C++ Out of Core Sotring for Massive Data|多线程C++的大规模数据外部排序

相关文章