Improving binary serialization performance for a large list of structs

Time: 2022-09-01 17:14:27

I have a structure holding 3D coordinates in 3 ints. In a test I've put together a List<> of 1 million random points and then used binary serialization to a memory stream.

The memory stream comes in at ~21 MB, which seems very inefficient, as 1,000,000 points * 3 coords * 4 bytes should come out at 12,000,000 bytes (about 11.4 MB) minimum.

It's also taking ~3 seconds on my test rig.

Any ideas for improving performance and/or size?

(I don't have to keep the ISerializable interface if it helps; I could write out directly to a memory stream.)

EDIT: Based on the answers below, I've put together a serialization showdown comparing BinaryFormatter, 'raw' BinaryWriter and Protobuf:

using System;
using System.Text;
using System.Collections.Generic;
using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
using System.IO;
using ProtoBuf;

namespace asp_heatmap.test
{
    [Serializable()] // For .NET BinaryFormatter
    [ProtoContract] // For Protobuf
    public class Coordinates : ISerializable
    {
        [Serializable()]
        [ProtoContract]
        public struct CoOrd
        {
            public CoOrd(int x, int y, int z)
            {
                this.x = x;
                this.y = y;
                this.z = z;
            }
            [ProtoMember(1)]            
            public int x;
            [ProtoMember(2)]
            public int y;
            [ProtoMember(3)]
            public int z;
        }

        internal Coordinates()
        {
        }

        [ProtoMember(1)]
        public List<CoOrd> Coords = new List<CoOrd>();

        public void SetupTestArray()
        {
            Random r = new Random();
            for (int i = 0; i < 1000000; i++)
            {
                Coords.Add(new CoOrd(r.Next(), r.Next(), r.Next()));
            }
        }

        #region Using Framework Binary Formatter Serialization

        void ISerializable.GetObjectData(SerializationInfo info, StreamingContext context)
        {
            info.AddValue("Coords", this.Coords);
        }

        internal Coordinates(SerializationInfo info, StreamingContext context)
        {
            this.Coords = (List<CoOrd>)info.GetValue("Coords", typeof(List<CoOrd>));
        }

        #endregion

        #region 'Raw' Binary Writer serialization

        public MemoryStream RawSerializeToStream()
        {
            MemoryStream stream = new MemoryStream(Coords.Count * 3 * 4 + 4);
            BinaryWriter writer = new BinaryWriter(stream);
            writer.Write(Coords.Count);
            foreach (CoOrd point in Coords)
            {
                writer.Write(point.x);
                writer.Write(point.y);
                writer.Write(point.z);
            }
            return stream;
        }

        public Coordinates(MemoryStream stream)
        {
            using (BinaryReader reader = new BinaryReader(stream))
            {
                int count = reader.ReadInt32();
                Coords = new List<CoOrd>(count);
                for (int i = 0; i < count; i++)                
                {
                    Coords.Add(new CoOrd(reader.ReadInt32(),reader.ReadInt32(),reader.ReadInt32()));
                }
            }        
        }
        #endregion
    }

    [TestClass]
    public class SerializationTest
    {
        [TestMethod]
        public void TestBinaryFormatter()
        {
            Coordinates c = new Coordinates();
            c.SetupTestArray();

            // Serialize to memory stream
            MemoryStream mStream = new MemoryStream();
            BinaryFormatter bformatter = new BinaryFormatter();
            bformatter.Serialize(mStream, c);
            Console.WriteLine("Length : {0}", mStream.Length);

            // Now Deserialize
            mStream.Position = 0;
            Coordinates c2 = (Coordinates)bformatter.Deserialize(mStream);
            Console.Write(c2.Coords.Count);

            mStream.Close();
        }

        [TestMethod]
        public void TestBinaryWriter()
        {
            Coordinates c = new Coordinates();
            c.SetupTestArray();

            MemoryStream mStream = c.RawSerializeToStream();
            Console.WriteLine("Length : {0}", mStream.Length);

            // Now Deserialize
            mStream.Position = 0;
            Coordinates c2 = new Coordinates(mStream);
            Console.Write(c2.Coords.Count);
        }

        [TestMethod]
        public void TestProtoBufV2()
        {
            Coordinates c = new Coordinates();
            c.SetupTestArray();

            MemoryStream mStream = new MemoryStream();
            ProtoBuf.Serializer.Serialize(mStream,c);
            Console.WriteLine("Length : {0}", mStream.Length);

            mStream.Position = 0;
            Coordinates c2 = ProtoBuf.Serializer.Deserialize<Coordinates>(mStream);
            Console.Write(c2.Coords.Count);
        }
    }
}

Results (note: ProtoBuf v2.0.0.423 beta):

Serializer       | Serialize | Serialize + Deserialize | Size
-----------------+-----------+-------------------------+--------
BinaryFormatter  | 2.89 s    | 26.00 s !!!             | 21.0 MB
ProtoBuf v2      | 0.52 s    | 0.83 s                  | 18.7 MB
Raw BinaryWriter | 0.27 s    | 0.36 s                  | 11.4 MB

Obviously this is just looking at speed/size and doesn't take into account anything else.
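
The test code above does not show how the timings were taken. A minimal sketch of one way to measure them, assuming a Stopwatch around each serialization call; SerializationTimer, Measure and the stand-in workload are hypothetical names for illustration, not part of the original tests:

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class SerializationTimer
{
    // Runs one serialization step and reports elapsed time and bytes written.
    // "serialize" is a hypothetical delegate that writes into the supplied stream
    // and must leave it open, because the length is read afterwards.
    public static (TimeSpan Elapsed, long Bytes) Measure(Action<MemoryStream> serialize)
    {
        using (var stream = new MemoryStream())
        {
            var watch = Stopwatch.StartNew();
            serialize(stream);
            watch.Stop();
            return (watch.Elapsed, stream.Length);
        }
    }

    public static void Main()
    {
        // Stand-in workload: a count followed by 1,000,000 points of three ints each.
        var result = Measure(s =>
        {
            var writer = new BinaryWriter(s);
            writer.Write(1000000);
            for (int i = 0; i < 1000000; i++)
            {
                writer.Write(i);
                writer.Write(i);
                writer.Write(i);
            }
            writer.Flush();
        });
        Console.WriteLine("{0:F2} s, {1:F1} MB",
            result.Elapsed.TotalSeconds, result.Bytes / (1024.0 * 1024.0));
    }
}
```

A single run like this includes JIT warm-up, so repeating the measurement a few times and discarding the first result would likely give steadier numbers.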

2 answers

#1


10  

Binary serialisation using BinaryFormatter includes type information in the bytes it generates. This takes up additional space. It's useful in cases where you don't know what structure of data to expect at the other end, for example.

In your case, you know what format the data has at both ends, and that doesn't sound like it'd change. So you can write a simple encode and decode method. Your CoOrd struct no longer needs to be serializable either.

I would use System.IO.BinaryReader and System.IO.BinaryWriter, then loop through each of your CoOrd instances and read/write the x, y, z values from/to the stream. Note that BinaryWriter.Write(int) always emits four bytes, so this format is a fixed 12 bytes per point plus a 4-byte count (about 11.4 MB for a million points); getting below that would need a variable-length encoding that spends fewer bytes on small values.

Something like this:

using (var writer = new BinaryWriter(stream)) {
    // write the number of items so we know how many to read out
    writer.Write(points.Count);
    // write three ints per point
    foreach (var point in points) {
        writer.Write(point.x);
        writer.Write(point.y);
        writer.Write(point.z);
    }
}

To read from the stream:

List<CoOrd> points;
using (var reader = new BinaryReader(stream)) {
    var count = reader.ReadInt32();
    points = new List<CoOrd>(count);
    for (int i = 0; i < count; i++) {
        var x = reader.ReadInt32();
        var y = reader.ReadInt32();
        var z = reader.ReadInt32();
        points.Add(new CoOrd(x, y, z));
    }
}
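
The two snippets above can be combined into a small round-trip check. This is only a sketch: RawPointFormat, Encode and Decode are hypothetical names, and the CoOrd struct is re-declared so the example is self-contained. It also shows that the size of this format depends only on the number of points, not on their values:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RawPointFormat
{
    // Stand-in for the CoOrd struct from the question.
    public struct CoOrd
    {
        public int x, y, z;
        public CoOrd(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }
    }

    // Layout: a 4-byte count, then 3 * 4 bytes per point.
    public static byte[] Encode(List<CoOrd> points)
    {
        using (var stream = new MemoryStream(4 + points.Count * 12))
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(points.Count);
            foreach (var p in points)
            {
                writer.Write(p.x);
                writer.Write(p.y);
                writer.Write(p.z);
            }
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static List<CoOrd> Decode(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            int count = reader.ReadInt32();
            var points = new List<CoOrd>(count);
            for (int i = 0; i < count; i++)
            {
                int x = reader.ReadInt32();
                int y = reader.ReadInt32();
                int z = reader.ReadInt32();
                points.Add(new CoOrd(x, y, z));
            }
            return points;
        }
    }

    public static void Main()
    {
        var points = new List<CoOrd> { new CoOrd(1, 2, 3), new CoOrd(int.MaxValue, -5, 0) };
        byte[] data = Encode(points);
        Console.WriteLine(data.Length);                          // 4 + 2 * 12 = 28
        Console.WriteLine(Decode(data)[1].x == int.MaxValue);    // True
    }
}
```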

#2


3  

For the simplicity of using a pre-built serializer, I recommend protobuf-net; here is protobuf-net v2, just adding some attributes:

[DataContract]
public class Coordinates
{
    [DataContract]
    public struct CoOrd
    {
        public CoOrd(int x, int y, int z)
        {
            this.x = x;
            this.y = y;
            this.z = z;
        }
        [DataMember(Order = 1)]
        int x;
        [DataMember(Order = 2)]
        int y;
        [DataMember(Order = 3)]
        int z;
    }
    [DataMember(Order = 1)]
    public List<CoOrd> Coords = new List<CoOrd>();

    public void SetupTestArray()
    {
        Random r = new Random(123456);
        for (int i = 0; i < 1000000; i++)
        {
            Coords.Add(new CoOrd(r.Next(10000), r.Next(10000), r.Next(10000)));
        }
    }
}

using:

ProtoBuf.Serializer.Serialize(mStream, c);

to serialize. This takes 10,960,823 bytes, but note that I tweaked SetupTestArray to limit the values to below 10,000, since by default protobuf-net uses "varint" encoding for integers, whose length depends on the magnitude of the value. The exact figure of 10k isn't important here (in fact I didn't check where the size "steps" are). If you prefer a fixed size (which handles any range equally):

        [ProtoMember(1, DataFormat = DataFormat.FixedSize)]
        int x;
        [ProtoMember(2, DataFormat = DataFormat.FixedSize)]
        int y;
        [ProtoMember(3, DataFormat = DataFormat.FixedSize)]
        int z;

This takes 16,998,640 bytes.
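
To make the size "steps" of varint encoding concrete, here is a minimal sketch of the length calculation for the base-128 varint described in the general Protocol Buffers wire format. It is not protobuf-net's own code; VarintSize and Length are hypothetical names, and field tags and length prefixes are left out:

```csharp
using System;

public static class VarintSize
{
    // Bytes needed to encode a non-negative value as a base-128 varint.
    // Each byte carries 7 payload bits, so the length grows at 2^7, 2^14, 2^21 and 2^28.
    public static int Length(uint value)
    {
        int bytes = 1;
        while (value >= 0x80)
        {
            value >>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void Main()
    {
        foreach (uint v in new uint[] { 127, 128, 9999, 16383, 16384, int.MaxValue })
        {
            Console.WriteLine("{0} -> {1} byte(s)", v, Length(v));
        }
    }
}
```

Values below 10,000 therefore need at most two payload bytes, while most values from an unrestricted r.Next() need five, which is more than the four bytes of a fixed-size int. That would be consistent with the varint output being smaller than the fixed-size output only when the values are small.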
