Zero-copy with and without Scatter/Gather operations

Time: 2022-12-31 10:58:53

I just read an article that explains the zero-copy mechanism.

It talks about the difference between zero-copy with and without Scatter/Gather (SG) support.

For a NIC without SG support, the data copies are as follows:

[Figure: data copies for zero-copy when the NIC has no SG support]

For a NIC with SG support, the data copies are as follows:

[Figure: data copies for zero-copy when the NIC has SG support]

In short, zero-copy with SG support eliminates one CPU copy.

My question is: why would the data in the kernel buffer be scattered?

3 Answers

#1


12  

Because the Linux kernel's mapping / memory allocation facilities will, by default, create virtually-contiguous but possibly physically-disjoint memory regions.
That means the filesystem read that sendfile() performs internally lands in a buffer in kernel virtual memory, which the DMA code has to "transmogrify" (for lack of a better word) into something that the network card's DMA engine can grok.

Since DMA (often but not always) uses physical addresses, that means you either duplicate the data buffer (into a specially-allocated physically-contiguous region of memory, your socket buffer above), or else transfer it one physical page at a time.

If your DMA engine, on the other hand, is capable of aggregating multiple physically-disjoint memory regions into a single data transfer (that's called "scatter-gather"), then instead of copying the buffer you can simply pass a list of physical addresses (pointing to physically-contiguous sub-segments of the kernel buffer; those are your aggregate descriptors above). You then no longer need to start a separate DMA transfer for each physical page. This is usually faster, but whether it can be done depends on the capabilities of the DMA engine.
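
To make the "list of physical addresses" idea concrete, here is a rough user-space sketch in Python. It is only a simulation: `build_sg_list`, `PAGE_SIZE`, and the frame numbers are made up for illustration and are not kernel APIs. Given a virtually contiguous buffer whose pages map to arbitrary physical frames, it produces the (address, length) segments that a scatter-gather capable DMA engine could be handed, merging pages that happen to be physically adjacent.

```python
from typing import List, Tuple

PAGE_SIZE = 4096  # assumed page size, for illustration only


def build_sg_list(frames: List[int], offset: int, length: int) -> List[Tuple[int, int]]:
    """Return (physical_address, length) segments covering a virtual byte range.

    frames[i] is the (hypothetical) physical frame number backing virtual
    page i of the buffer; offset and length select bytes inside the buffer.
    """
    segments: List[Tuple[int, int]] = []
    pos, end = offset, offset + length
    while pos < end:
        page, in_page = divmod(pos, PAGE_SIZE)
        chunk = min(PAGE_SIZE - in_page, end - pos)
        phys = frames[page] * PAGE_SIZE + in_page
        if segments and segments[-1][0] + segments[-1][1] == phys:
            # Physically adjacent to the previous segment: extend it.
            segments[-1] = (segments[-1][0], segments[-1][1] + chunk)
        else:
            # Physical discontinuity: a new descriptor is needed.
            segments.append((phys, chunk))
        pos += chunk
    return segments
```

For example, `build_sg_list([7, 8, 3], 0, 3 * PAGE_SIZE)` returns `[(28672, 8192), (12288, 4096)]`: pages 0 and 1 are physically adjacent and merge into one segment, while page 2 lives elsewhere and needs its own entry. Without scatter-gather, those two segments would mean either a copy into one contiguous region or two separate transfers.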

#2


3  

Re: My question is that why data in kernel buffer could be scattered?

Because it already is scattered. The data queue in front of a TCP socket is not divided into the datagrams that will go out onto the network interface. Scatter/gather allows you to keep the data where it is and not have to copy it to make a flat buffer that is acceptable to the hardware.

With the gather feature, you can give the network card a datagram which is broken into pieces at different addresses in memory, which can be references to the original socket buffers. The card will read it from those locations and send it as a single unit.

Without gather (hardware requires simple, linear buffers) a datagram has to be prepared as a contiguously allocated byte string, and all the data which belongs to it has to be memcpy-d into place from the buffers that are queued for transmission on the socket.
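
The same trade-off exists at the system-call boundary, which makes for an easy user-space analogue. The sketch below is not NIC DMA; it only contrasts "memcpy everything into a flat buffer" with "hand over a list of buffers". `send_flat`, `send_gather`, and `recv_exact` are hypothetical helper names; `socket.sendmsg()` is the real (Unix-only) Python call that passes a list of buffers to the kernel.

```python
import socket


def send_flat(sock: socket.socket, pieces: list) -> int:
    """No gather: copy every piece into one contiguous buffer first."""
    flat = b"".join(pieces)  # this join is the extra copy
    sock.sendall(flat)
    return len(flat)


def send_gather(sock: socket.socket, pieces: list) -> int:
    """Gather: pass the buffers as a list and let the kernel read each in place."""
    views = [memoryview(p) for p in pieces]
    total = 0
    while views:
        sent = sock.sendmsg(views)
        total += sent
        # Drop buffers that were sent completely, trim one sent partially.
        while views and sent >= len(views[0]):
            sent -= len(views[0])
            views.pop(0)
        if views and sent:
            views[0] = views[0][sent:]
    return total


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes (or fewer if the peer closes early)."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            break
        data += chunk
    return data


if __name__ == "__main__":
    a, b = socket.socketpair()
    pieces = [b"HDR:", b"hello ", b"world"]
    send_gather(a, pieces)
    print(recv_exact(b, 15))
    a.close()
    b.close()
```

Either way the receiver sees the same byte stream; the difference is only whether the sender had to build a contiguous copy first.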

#3


2  

Because when you write to a socket, the headers of the packet are assembled in a different place from your user-data, so to be coalesced into a network packet, the device needs "gather" capability, at least to get the headers and data.

Also, to avoid the CPU having to read the data (and thus fill its cache with useless stuff it's never going to need again), the network card also needs to generate its own IP and TCP checksums (I'm assuming TCP here, because 99% of your bulk data transfers are going to be TCP). This is OK, because nowadays practically all network cards can.
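
To see why a software checksum defeats the point of zero-copy, here is a minimal sketch of the 16-bit ones'-complement sum described in RFC 1071, which is the arithmetic behind the IP and TCP checksums. `internet_checksum` is a name chosen for this example, and a real TCP checksum also covers a pseudo-header that is left out here. The thing to notice is the loop: every byte of the payload has to pass through the CPU.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Each 16-bit word is read by the CPU; this is the cost that
        # checksum offload moves onto the network card.
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF
```

With the sample bytes from RFC 1071, `internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))` gives `0x220d`.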

What I'm not sure about is how this all interacts with TCP_CORK.

Most protocols tend to have their own headers, so a hypothetical protocol looks like:

Client: Send request
Server: Send some metadata; send the file data

So we tend to have a server application assembling some headers in memory, issuing a write(), followed by a sendfile()-like operation. I suppose the headers still get copied into a kernel buffer in this case.
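
For what it's worth, the Linux tcp(7) man page describes TCP_CORK as useful for prepending headers before calling sendfile(2). A minimal sketch of that write-then-sendfile pattern, assuming a blocking TCP socket, might look like the following. `send_header_then_file` and `demo` are hypothetical names for this example; `socket.TCP_CORK` only exists on Linux, so the code skips corking where it is missing. The header still goes through an ordinary copy from user space, as supposed above; only the file body takes the sendfile path.

```python
import os
import socket
import tempfile


def send_header_then_file(sock: socket.socket, header: bytes, path: str) -> int:
    """Send header with a normal write, then the file body with sendfile()."""
    cork = getattr(socket, "TCP_CORK", None)  # Linux-only socket option
    if cork is not None:
        sock.setsockopt(socket.IPPROTO_TCP, cork, 1)  # hold back partial frames
    sent = 0
    try:
        sock.sendall(header)  # header bytes are copied from user space
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            while sent < size:
                n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
                if n == 0:
                    break
                sent += n
    finally:
        if cork is not None:
            sock.setsockopt(socket.IPPROTO_TCP, cork, 0)  # uncork: flush queue
    return len(header) + sent


def demo() -> bytes:
    """Loopback round trip; returns the bytes the receiving side saw."""
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        tf.write(b"x" * 10000)
    srv = socket.create_server(("127.0.0.1", 0))
    cli = socket.create_connection(srv.getsockname())
    conn, _ = srv.accept()
    try:
        send_header_then_file(cli, b"LEN 10000\n", tf.name)
        cli.close()  # signal end of stream to the reader
        received = b""
        while chunk := conn.recv(65536):
            received += chunk
        return received
    finally:
        cli.close()
        conn.close()
        srv.close()
        os.unlink(tf.name)


if __name__ == "__main__":
    print(len(demo()))
```

Whether the header and the first file pages actually end up in the same packet depends on the kernel and the NIC; the sketch only shows the call sequence.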
