Writing a huge array of longs to disk

Date: 2022-07-22 13:32:58

I need to write huge arrays of longs (up to 5GB) to disk. I tried using BinaryFormatter, but it seems to handle only arrays smaller than 2GB:

long[] array = data.ToArray();
FileStream fs = new FileStream(dst, FileMode.Create);
BinaryFormatter formatter = new BinaryFormatter();
try
{
    formatter.Serialize(fs, array);
}
catch (SerializationException e)
{
    Console.WriteLine("Failed to serialize. Reason: " + e.Message);
    throw;
}
finally
{
    fs.Close();
}

This code throws IndexOutOfRangeException for larger arrays.

I don't want to save it element by element, because that takes too much time. Is there any proper way to save such a large array?

Writing element by element:

using (BinaryWriter writer = new BinaryWriter(File.Open(dst, FileMode.Create)))
{
    foreach(long v in array)
    {
        writer.Write(v);
    }
} 

This is very slow.
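
For reference, one small change that may help is simply giving the FileStream a larger internal buffer (the 1 MB size below is an arbitrary guess, not a tuned value); it reduces the number of small writes reaching the OS, although the bulk approaches in the answers below should still be much faster:

using System.IO;

// Same element-by-element loop, but the FileStream gets a 1 MB internal buffer
// instead of the 4 KB default, so far fewer, larger writes reach the operating system.
using (FileStream fs = new FileStream(dst, FileMode.Create, FileAccess.Write, FileShare.Read, 1 << 20))
using (BinaryWriter writer = new BinaryWriter(fs))
{
    foreach (long v in array)
    {
        writer.Write(v);
    }
}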

2 Answers

#1 (score: 8)

OK, so maybe I got a little carried away with the MMF. Here's a simpler version, with a file stream only (I think this is what Scott Chamberlain suggested in the comments).

Timings (on a new system) for a 3GB array:

  1. MMF: ~50 seconds.
  2. FileStream: ~30 seconds.

Code:

long dataLen = 402653184; //3gb represented in 8 byte chunks
long[] data = new long[dataLen];
int elementSize = sizeof(long);

Stopwatch sw = Stopwatch.StartNew();
using (FileStream f = new FileStream(@"D:\Test.bin", FileMode.OpenOrCreate, FileAccess.Write, FileShare.Read, 32768))
{
    int offset = 0;
    int workBufferSize = 32768;
    byte[] workBuffer = new byte[workBufferSize];
    while (offset < dataLen)
    {
        Buffer.BlockCopy(data, offset, workBuffer, 0, workBufferSize);
        f.Write(workBuffer, 0, workBufferSize);

        //advance in the source array
        offset += workBufferSize / elementSize;
    }
}

Console.WriteLine(sw.Elapsed);

Old solution, MMF

I think you can try with a MemoryMappedFile. It took ~2 to ~2.5 minutes for a 3GB array on a relatively slow external drive.

What this solution involves:

  1. First, create an empty file.
  2. Create a memory mapped file over it, with a capacity of X bytes, where X is the array length in bytes. This automatically sets the physical length of the file on disk to that value.
  3. Dump the array to the file via a 32k x 8 byte wide accessor (you can change this; it's just what I tested with). So I'm writing the array in chunks of 32k elements.

Note that you will need to account for the case when the array length is not a multiple of chunkLength (see the adjusted sketch after the code below). For testing purposes, in my sample it is :).

See below:

//Just create an empty file
FileStream f = File.Create(@"D:\Test.bin");
f.Close();

long dataLen = 402653184; //3gb represented in 8 byte chunks
long[] data = new long[dataLen];
int elementSize = sizeof (long);

Stopwatch sw = Stopwatch.StartNew();

//Open the file, with a default capacity. This allows you to write over the initial capacity of the file
using (var mmf = MemoryMappedFile.CreateFromFile(@"D:\Test.bin", FileMode.Open, "longarray", data.LongLength * elementSize))
{
    long offset = 0;
    int chunkLength = 32768; 

    while (offset < dataLen)
    {
        using (var accessor = mmf.CreateViewAccessor(offset * elementSize, chunkLength * elementSize))
        {
            for (long i = offset; i != offset + chunkLength; ++i)
            {
                accessor.Write(i - offset, data[i]);
            }
        }

        offset += chunkLength;
    }
}

Console.WriteLine(sw.Elapsed);
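
For the case the note above mentions (an array length that is not a multiple of chunkLength), a minimal adjustment of the write loop might look like the sketch below. It reuses mmf, data, dataLen and elementSize from the snippet above, clamps the last chunk, and passes byte positions to the accessor, since CreateViewAccessor and accessor.Write both work in bytes. This is an illustrative sketch, not the code that was timed:

long offset = 0;
int chunkLength = 32768; //elements per chunk

while (offset < dataLen)
{
    //the last chunk may be shorter than chunkLength
    long thisChunk = Math.Min(chunkLength, dataLen - offset);

    using (var accessor = mmf.CreateViewAccessor(offset * elementSize, thisChunk * elementSize))
    {
        for (long i = 0; i < thisChunk; ++i)
        {
            //position within the view is a byte offset
            accessor.Write(i * elementSize, data[offset + i]);
        }
    }

    offset += thisChunk;
}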

#2 (score: 0)

I suggest that the code above is wrong. Shouldn't it be:

while (offset < dataLen*elementSize)
{
    Buffer.BlockCopy(data, offset, workBuffer, 0, workBufferSize);
    f.Write(workBuffer, 0, workBufferSize);

    //advance in the source array
    offset += workBufferSize;
}

The same applies in the memory mapped example.
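
Putting the unit fix together with the sizes in the question, here is one possible sketch that keeps everything indexed in elements, so no byte offset ever has to fit in an int (Buffer.BlockCopy's byte offset is an int, so it cannot reach the part of an array beyond roughly 2GB of bytes), and writes each slice directly by reinterpreting it as bytes. This is not the code from either answer; it assumes a runtime with Span support (.NET Core 2.1 or later / modern .NET), and the chunk size is an arbitrary choice:

using System;
using System.IO;
using System.Runtime.InteropServices;

// Sketch only: writes a long[] to disk in large chunks, with all offsets kept in
// element units. Even a 5GB long[] has fewer than int.MaxValue elements, so the
// indices stay valid without ever forming a byte offset larger than an int.
static void WriteLongs(string path, long[] data)
{
    const int chunkElements = 1 << 20; // 1M longs = 8 MB per write (arbitrary size)

    using (var f = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Read, 1 << 16))
    {
        for (int offset = 0; offset < data.Length; offset += chunkElements)
        {
            int count = Math.Min(chunkElements, data.Length - offset); // last chunk may be short

            // Reinterpret the slice as bytes and write it, with no intermediate byte[] copy.
            ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(data.AsSpan(offset, count));
            f.Write(bytes); // Span-based overload, available on .NET Core 2.1+ / .NET 5+
        }
    }
}

Holding a 5GB long[] in memory in the first place also requires a 64-bit process (and, on the .NET Framework, the gcAllowVeryLargeObjects setting).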
