I hope this question isn't too “right field”, and I'll be upfront in saying I'm a newbie compared to many people on Stack Overflow...
I want to compare object representations of images, audio and text for an AI project I am working on. I'd like to convert all three inputs into a single data type and use a central comparison algorithm to determine statistically probable matches.
What are the “fastest” native .Net and SQL data types for making comparisons like this? In .Net what data type requires the least amount of conversions in the CLR? For SQL, what type can be “CRUD-ed” the fastest?
I was thinking bytes for .Net and integers for SQL, but integers pose a problem because they are a one-dimensional concept. Do you think the images and audio should be handled within the file system rather than SQL? I'm guessing so…
FWIW I'm building a robot from parts I bought at TrossenRobotics.com
5 solutions
#1
Personally, if you need to do frequent comparisons between large binary objects, I would hash the objects and compare the hashes.
If the hashes don't match, then you can be sure the objects don't match (which should be the majority of the cases).
If the hashes do match, you can then start a more lengthy routine to compare the actual objects.
This method alone should boost your performance quite a bit if you're comparing these objects frequently.
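The hash-first approach above could be sketched roughly as follows. This is only an illustration in Python rather than .NET; the choice of SHA-256 and the helper names `digest` and `objects_match` are assumptions, not something the answer specifies.

```python
import hashlib


def digest(data: bytes) -> bytes:
    """Return a fixed-size SHA-256 digest of the object's bytes."""
    return hashlib.sha256(data).digest()


def objects_match(a: bytes, digest_a: bytes, b: bytes, digest_b: bytes) -> bool:
    """Compare two objects, using precomputed digests as a cheap first check."""
    # Different digests prove the objects differ, so skip the full comparison.
    if digest_a != digest_b:
        return False
    # Equal digests only suggest a match; confirm with a full byte comparison.
    return a == b


# Made-up sample data standing in for large binary objects.
image_a = bytes(range(256)) * 1000
image_b = bytes(range(256)) * 1000
image_c = image_a[:-1] + b"\x00"  # differs from image_a in the last byte only

print(objects_match(image_a, digest(image_a), image_b, digest(image_b)))
print(objects_match(image_a, digest(image_a), image_c, digest(image_c)))
```

The gain comes from computing each digest once, when the object is stored, and reusing it for every later comparison. Hashing both objects from scratch on each comparison would cost more than comparing the bytes directly.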
#2
Speed of data types is a bit hard to measure. It makes a big difference whether you're using a 32-bit or a 64-bit operating system. Why? Because it determines the speed at which this data can be processed. In general, on a 32-bit system, all data types that fit inside 32 bits (int16, int32, char, byte, pointers) will be processed at the same speed. If you need lots of data to be processed, it's best to divide it into blocks of four bytes each for your CPU to process.
However, when you're writing data to disk, speed tends to depend on a lot more factors. If your disk device is on a USB port, all data gets serialized, so it is transferred byte after byte. In that case, size doesn't matter much, although the smallest data blocks would leave the smallest gaps. (In languages like Pascal you'd use a packed record for this kind of data to optimize streaming performance, while having the fields in your records aligned at multiples of 4 bytes for CPU performance.) Regular disks store data in bigger blocks. To increase reading/writing speed, you'd prefer to make your data structures as compact as possible. But for processing performance, having them aligned on 4-byte boundaries is more effective.
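To make the packed-versus-aligned trade-off concrete, here is a small Python sketch using the standard `struct` module. The record layout (a 1-byte flag followed by a 4-byte integer) is a made-up example, and the aligned size depends on the platform's native alignment rules.

```python
import struct

# Hypothetical record: one 1-byte flag followed by one 4-byte integer.
PACKED = struct.Struct("<bi")   # standard sizes, no padding between fields
ALIGNED = struct.Struct("@bi")  # native alignment, as a C compiler would lay it out

record = (1, 123456)
packed_bytes = PACKED.pack(*record)
aligned_bytes = ALIGNED.pack(*record)

# The packed layout is more compact for storage and streaming.
print("packed size: ", PACKED.size)
# The aligned layout adds padding so the integer starts on its natural boundary.
print("aligned size:", ALIGNED.size)
```

On common platforms the packed form takes 5 bytes and the aligned form 8, with the extra bytes being padding in front of the integer.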
Which reminds me that I once had a discussion with someone about using compression on an NTFS disk. I managed to prove that compressing an NTFS partition could actually improve the performance of a computer, since it had to read far fewer data blocks, even though it had to do more processing to decompress those blocks.
To improve performance, you just have to find the weakest (slowest) link and start there. Once it's optimized, there will be another weak link...
#3
Personally, I'd say you're best off using a byte array. You can easily read the file into a buffer, and from the buffer into the byte array, where you can do the comparison.
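As a rough illustration of the buffer-and-compare idea, the following Python sketch reads two files in fixed-size chunks and compares them as raw bytes. The function names and the 64 KiB chunk size are arbitrary choices made for this example; in .NET the equivalent would work on byte[] buffers.

```python
import os
import tempfile


def files_equal(path_a: str, path_b: str, chunk_size: int = 64 * 1024) -> bool:
    """Compare two files as raw bytes, one buffer-sized chunk at a time."""
    # Files of different sizes cannot be byte-for-byte equal.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files exhausted with no difference found
                return True


def write_temp(data: bytes) -> str:
    """Write data to a new temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path
```

Reading in chunks keeps memory use bounded for large image or audio files and lets the comparison stop at the first differing chunk.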
#4
As far as I recall, in terms of sheer performance, the Int32 type is among the faster data types of .NET. Can't say whether it is the most suitable in your application though.
#5
Before pulling anything into .NET, you should check the length of the data in SQL Server using the LEN function (or DATALENGTH, which returns the size in bytes and also works on the image type). If the lengths are different, you already know that the two objects are different. This should save bringing down lots of unnecessary data from SQL Server to your client application.
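The length check can be sketched like this. Note the assumptions: an in-memory SQLite database stands in for SQL Server, SQLite's `length()` stands in for the SQL Server length function, and the `media` table and its columns are hypothetical.

```python
import sqlite3

# SQLite in-memory database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, data BLOB NOT NULL)")
conn.executemany(
    "INSERT INTO media (id, data) VALUES (?, ?)",
    [(1, b"\x01" * 1000), (2, b"\x02" * 1000), (3, b"\x03" * 500)],
)


def same_length_ids(conn: sqlite3.Connection, candidate: bytes) -> list:
    """Return ids of rows whose blob length equals the candidate's length."""
    # Only ids come back here; the blobs themselves stay in the database.
    rows = conn.execute(
        "SELECT id FROM media WHERE length(data) = ? ORDER BY id",
        (len(candidate),),
    )
    return [row[0] for row in rows]


print(same_length_ids(conn, b"\x00" * 1000))
```

Only the rows that survive this filter need their binary data fetched for a full comparison.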
I would also recommend storing a hash code (in a separate column from the binary data) using the CHECKSUM function (http://msdn.microsoft.com/en-us/library/aa258245(SQL.80).aspx). This will only work if you are using SQL Server 2005 or above and are storing your data as varbinary(MAX). Once again, if the hash codes are different, the binary data is definitely different. If they are equal, the data may still differ, so a full comparison is needed to confirm a match.
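A stored hash column might be used along these lines. This is a sketch under stated assumptions: SQLite stands in for SQL Server, `zlib.crc32` stands in for the CHECKSUM function (the two produce different values), and the table, column, and function names are hypothetical.

```python
import sqlite3
import zlib

# SQLite in-memory database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media ("
    " id INTEGER PRIMARY KEY,"
    " data BLOB NOT NULL,"
    " data_checksum INTEGER NOT NULL)"  # hash stored separately from the blob
)
conn.execute("CREATE INDEX ix_media_checksum ON media (data_checksum)")


def insert_media(conn: sqlite3.Connection, data: bytes) -> int:
    """Store the blob together with its checksum and return the new row id."""
    cur = conn.execute(
        "INSERT INTO media (data, data_checksum) VALUES (?, ?)",
        (data, zlib.crc32(data)),
    )
    return cur.lastrowid


def find_exact_match(conn: sqlite3.Connection, candidate: bytes):
    """Return the id of a row whose blob equals candidate, or None."""
    # The indexed checksum narrows the search before any blob is compared.
    rows = conn.execute(
        "SELECT id, data FROM media WHERE data_checksum = ? ORDER BY id",
        (zlib.crc32(candidate),),
    )
    for row_id, data in rows:
        # Equal checksums can collide, so confirm with a full comparison.
        if data == candidate:
            return row_id
    return None
```

Because the checksum column is small and indexed, most non-matching rows are ruled out without reading their binary data at all.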
If you are using SQL Server 2000, you are stuck with the 'image' data type.
Both image and varbinary(MAX) map nicely to byte[] objects on the client. However, if you are using SQL Server 2008, you have the option of storing your data as FILESTREAM data, that is, varbinary(MAX) with the FILESTREAM attribute (http://blogs.msdn.com/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx).