Linux kernel device driver to DMA from a device into user-space memory

I want to get data from a DMA enabled, PCIe hardware device into user-space as quickly as possible.

Q: How do I combine "direct I/O to user-space with/and/via a DMA transfer"

Reading through LDD3, it seems that I need to perform a few different types of IO operations!?

dma_alloc_coherent gives me the physical address that I can pass to the hardware device. But I would then need to set up get_user_pages and perform a copy_to_user-type call when the transfer completes. This seems a waste: asking the device to DMA into kernel memory (acting as a buffer) and then transferring it again to user-space. LDD3 p453: /* Only now is it safe to access the buffer, copy to user, etc. */

What I ideally want is some memory that:

- I can use in user-space (maybe by asking the driver, via an ioctl call, to create a DMA-able memory buffer?)
- I can get a physical address from, to pass to the device, so that all user-space has to do is perform a read on the driver
- the read method would activate the DMA transfer, block waiting for the DMA-complete interrupt, and release the user-space read afterwards (user-space is then safe to use/read the memory).

Do I need single-page streaming mappings, and do I need to set up the mapping and map user-space buffers with get_user_pages and dma_map_page?

My code so far sets up get_user_pages at the given address from user-space (I call this the Direct I/O part). Then, dma_map_page with a page from get_user_pages. I give the device the return value from dma_map_page as the DMA physical transfer address.

I am using some kernel modules as reference: drivers_scsi_st.c and drivers-net-sh_eth.c. I would look at the infiniband code, but I can't find which part is the most basic!

Many thanks in advance.

6 Answers

#1

I'm actually working on exactly the same thing right now and I'm going the ioctl() route. The general idea is for user space to allocate the buffer which will be used for the DMA transfer and an ioctl() will be used to pass the size and address of this buffer to the device driver. The driver will then use scatter-gather lists along with the streaming DMA API to transfer data directly to and from the device and user-space buffer.

The implementation strategy I'm using is that the ioctl() in the driver enters a loop that DMAs the userspace buffer in chunks of 256 KB (the hardware-imposed limit, set by how many scatter/gather entries the device can handle). This is isolated inside a function that blocks until each transfer is complete (see below). When all bytes have been transferred, or the incremental transfer function returns an error, the ioctl() exits and returns to userspace.

Pseudo code for the ioctl()

/*serialize all DMA transfers to/from the device*/
if (mutex_lock_interruptible( &device_ptr->mtx ) )
    return -EINTR;

chunk_data = (unsigned long) user_space_addr;
ret = 0;
while (*transferred < total_bytes && !ret) {
    chunk_bytes = total_bytes - *transferred;
    if (chunk_bytes > HW_DMA_MAX)
        chunk_bytes = HW_DMA_MAX; /* 256 KB limit imposed by my device */
    /* transfer_chunk() adds chunk_bytes to *transferred on success */
    ret = transfer_chunk(device_ptr, chunk_data, chunk_bytes, transferred);
    chunk_data += chunk_bytes;
}

mutex_unlock(&device_ptr->mtx);
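
The chunking arithmetic in the loop above can be checked in user space before it goes anywhere near a driver. The following is only a minimal sketch, not the answer's actual code: next_chunk_bytes() and count_chunks() are hypothetical helper names, and HW_DMA_MAX is simply set to the 256 KB figure mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-transfer limit, standing in for the answer's HW_DMA_MAX. */
#define HW_DMA_MAX (256u * 1024u)

/* Size of the next chunk, given how much has been transferred so far. */
static size_t next_chunk_bytes(size_t total_bytes, size_t transferred)
{
    size_t remaining;

    assert(transferred <= total_bytes);
    remaining = total_bytes - transferred;
    return remaining > HW_DMA_MAX ? HW_DMA_MAX : remaining;
}

/* Walk the whole buffer the way the ioctl loop does and count the chunks. */
static size_t count_chunks(size_t total_bytes)
{
    size_t transferred = 0;
    size_t chunks = 0;

    while (transferred < total_bytes) {
        transferred += next_chunk_bytes(total_bytes, transferred);
        chunks++;
    }
    return chunks;
}
```

With these assumptions, a 1,000,000-byte buffer would be moved in four transfers: three of 262,144 bytes and a final one of 213,568 bytes.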

Pseudo code for incremental transfer function:

/*Assuming the userspace pointer is passed as an unsigned long, */
/*calculate the first,last, and number of pages being transferred via*/

first_page = (udata & PAGE_MASK) >> PAGE_SHIFT;
last_page = ((udata+nbytes-1) & PAGE_MASK) >> PAGE_SHIFT;
first_page_offset = udata & PAGE_MASK;
npages = last_page - first_page + 1;

/* Ensure that all userspace pages are locked in memory for the */
/* duration of the DMA transfer */

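down_read(&current->mm->mmap_sem);
ret = get_user_pages(current,
                     current->mm,
                     udata,
                     npages,
                     is_writing_to_userspace,
                     0,
                     pages_array,   /* struct page ** with room for npages entries */
                     NULL);
up_read(&current->mm->mmap_sem);
if (ret != npages) {
    /* error or partial pin: release any pinned pages (put_page()) and fail */
    return ret < 0 ? ret : -EFAULT;
}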

/* Map a scatter-gather list to point at the userspace pages */
sg_init_table(sglist, npages);

/*first (may also be the only page, so never more than nbytes)*/
sg_set_page(&sglist[0], pages_array[0],
    min_t(size_t, nbytes, PAGE_SIZE - first_page_offset), first_page_offset);

/*middle*/
for (i = 1; i < npages - 1; i++)
    sg_set_page(&sglist[i], pages_array[i], PAGE_SIZE, 0);

/*last*/
if (npages > 1) {
    sg_set_page(&sglist[npages-1], pages_array[npages-1],
        nbytes - (PAGE_SIZE - first_page_offset) - ((npages-2)*PAGE_SIZE), 0);
}

/* Do the hardware specific thing to give it the scatter-gather list
   and tell it to start the DMA transfer */

/* Wait for the DMA transfer to complete.
   The macro takes the wait queue head and a condition expression,
   not their addresses. */
ret = wait_event_interruptible_timeout(device_ptr->dma_wait,
         device_ptr->flag_dma_done, HZ*2);

if (ret == 0) {
    /* DMA operation timed out */
    return -ETIMEDOUT;
} else if (ret == -ERESTARTSYS) {
    /* DMA operation interrupted by signal */
    return ret;
} else {
    /* DMA success */
    *transferred += nbytes;
    return 0;
}

The interrupt handler is exceptionally brief:

/* Do hardware specific thing to make the device happy */

/* Wake the thread waiting for this DMA operation to complete */
device_ptr->flag_dma_done = 1;
wake_up_interruptible(&device_ptr->dma_wait);

Please note that this is just a general approach. I've been working on this driver for the last few weeks and have yet to actually test it... So please don't treat this pseudo code as gospel, and be sure to double-check all logic and parameters ;-).

#2

You basically have the right idea: in 2.1, you can just have userspace allocate any old memory. You do want it page-aligned, so posix_memalign() is a handy API to use.
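
On the user-space side, the page-aligned allocation could look roughly like the sketch below. It only shows the posix_memalign() call; alloc_page_aligned() is a hypothetical helper name, and how the address and size are then handed to the driver is left open.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate npages whole pages, aligned to the system page size.
 * Returns NULL on failure; the caller releases the buffer with free(). */
static void *alloc_page_aligned(size_t npages)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    if (page_size <= 0 || npages == 0)
        return NULL;
    if (npages > SIZE_MAX / (size_t)page_size)   /* overflow guard */
        return NULL;
    if (posix_memalign(&buf, (size_t)page_size,
                       npages * (size_t)page_size) != 0)
        return NULL;
    return buf;
}
```

The returned pointer, together with the buffer size, is what would be passed to the driver, for instance through an ioctl().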

Then have userspace pass in the userspace virtual address and size of this buffer somehow; ioctl() is a good quick and dirty way to do this. In the kernel, allocate an appropriately sized buffer array of struct page* -- user_buf_size/PAGE_SIZE entries -- and use get_user_pages() to get a list of struct page* for the userspace buffer.

Once you have that, you can allocate an array of struct scatterlist that is the same size as your page array and loop through the list of pages doing sg_set_page(). After the sg list is set up, you do dma_map_sg() on the array of scatterlist and then you can get the sg_dma_address and sg_dma_len for each entry in the scatterlist (note you have to use the return value of dma_map_sg() because you may end up with fewer mapped entries because things might get merged by the DMA mapping code).
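
To make the per-page bookkeeping concrete, here is a user-space sketch of how a buffer breaks down into one (offset, length) pair per page, which is the information each sg_set_page() call needs. It does not touch the kernel API: struct seg, split_into_pages() and the 4 KB EX_PAGE_SIZE are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define EX_PAGE_SIZE 4096UL   /* assumed page size for the illustration */

struct seg {
    unsigned long offset;     /* offset of the segment within its page */
    unsigned long length;     /* number of bytes in that page */
};

/* Fill segs[] with one entry per page touched by [udata, udata + nbytes).
 * Returns the number of segments, or 0 if nbytes is 0 or max_segs is
 * too small to hold them all. */
static size_t split_into_pages(unsigned long udata, unsigned long nbytes,
                               struct seg *segs, size_t max_segs)
{
    unsigned long offset = udata % EX_PAGE_SIZE;
    size_t n = 0;

    while (nbytes > 0) {
        unsigned long len = EX_PAGE_SIZE - offset;

        if (len > nbytes)
            len = nbytes;     /* last (or only) page is partially used */
        if (n == max_segs)
            return 0;
        segs[n].offset = offset;
        segs[n].length = len;
        n++;
        nbytes -= len;
        offset = 0;           /* every later page starts at offset 0 */
    }
    return n;
}
```

A buffer at 0x1234 of length 0x2000 splits into three segments: (0x234, 0xdcc), (0, 0x1000) and (0, 0x234). In the driver, dma_map_sg() may then merge neighbouring entries, which is why its return value has to be used.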

That gives you all the bus addresses to pass to your device, and then you can trigger the DMA and wait for it however you want. The read()-based scheme you have is probably fine.

You can refer to drivers/infiniband/core/umem.c, specifically ib_umem_get(), for some code that builds up this mapping, although the generality that that code needs to deal with may make it a bit confusing.

Alternatively, if your device doesn't handle scatter/gather lists too well and you want contiguous memory, you could use __get_free_pages() to allocate a physically contiguous buffer and use dma_map_page() on that. To give userspace access to that memory, your driver just needs to implement an mmap method instead of the ioctl as described above.

#3

At some point I wanted to allow a user-space application to allocate DMA buffers, have them mapped into user space, and get the physical address, so that it could control my device and do DMA transactions (bus mastering) entirely from user space, totally bypassing the Linux kernel. I used a slightly different approach, though. First I started with a minimal kernel module that initialized/probed the PCIe device and created a character device. That driver then allowed a user-space application to do two things:

- Map the PCIe device's I/O BAR into user space using the remap_pfn_range() function.
- Allocate and free DMA buffers, map them to user space, and pass a physical bus address to the user-space application.

Basically, it boils down to a custom implementation of the mmap() call (through file_operations). The one for the I/O BAR is easy:

struct vm_operations_struct a2gx_bar_vma_ops = {
};

static int a2gx_cdev_mmap_bar2(struct file *filp, struct vm_area_struct *vma)
{
    struct a2gx_dev *dev;
    size_t size;

    size = vma->vm_end - vma->vm_start;
    if (size != 134217728)
        return -EIO;

    dev = filp->private_data;
    vma->vm_ops = &a2gx_bar_vma_ops;
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    vma->vm_private_data = dev;

    if (remap_pfn_range(vma, vma->vm_start,
                        vmalloc_to_pfn(dev->bar2),
                        size, vma->vm_page_prot))
    {
        return -EAGAIN;
    }

    return 0;
}

And another one that allocates DMA buffers using pci_alloc_consistent() is a little bit more complicated:

static void a2gx_dma_vma_close(struct vm_area_struct *vma)
{
    struct a2gx_dma_buf *buf;
    struct a2gx_dev *dev;

    buf = vma->vm_private_data;
    dev = buf->priv_data;

    pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr, buf->dma_addr);
    buf->cpu_addr = NULL; /* Mark this buffer data structure as unused/free */
}

struct vm_operations_struct a2gx_dma_vma_ops = {
    .close = a2gx_dma_vma_close
};

static int a2gx_cdev_mmap_dma(struct file *filp, struct vm_area_struct *vma)
{
    struct a2gx_dev *dev;
    struct a2gx_dma_buf *buf;
    size_t size;
    unsigned int i;

    /* Obtain a pointer to our device structure and calculate the size
       of the requested DMA buffer */
    dev = filp->private_data;
    size = vma->vm_end - vma->vm_start;

    if (size < sizeof(unsigned long))
        return -EINVAL; /* Something fishy is happening */

    /* Find a structure where we can store extra information about this
       buffer to be able to release it later. */
    for (i = 0; i < A2GX_DMA_BUF_MAX; ++i) {
        buf = &dev->dma_buf[i];
        if (buf->cpu_addr == NULL)
            break;
    }

    if (buf->cpu_addr != NULL)
        return -ENOBUFS; /* Oops, hit the limit of allowed number of
                            allocated buffers. Change A2GX_DMA_BUF_MAX and
                            recompile? */

    /* Allocate consistent memory that can be used for DMA transactions */
    buf->cpu_addr = pci_alloc_consistent(dev->pci_dev, size, &buf->dma_addr);
    if (buf->cpu_addr == NULL)
        return -ENOMEM; /* Out of juice */

    /* There is no way to pass extra information to the user. And I am too lazy
       to implement this mmap() call using ioctl(). So we simply tell the user
       the bus address of this buffer by copying it to the allocated buffer
       itself. Hacks, hacks everywhere. */
    memcpy(buf->cpu_addr, &buf->dma_addr, sizeof(buf->dma_addr));

    buf->size = size;
    buf->priv_data = dev;
    vma->vm_ops = &a2gx_dma_vma_ops;
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    vma->vm_private_data = buf;

    /*
     * Map this DMA buffer into user space.
     */
    if (remap_pfn_range(vma, vma->vm_start,
                        vmalloc_to_pfn(buf->cpu_addr),
                        size, vma->vm_page_prot))
    {
        /* Out of luck, rollback... */
        pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr,
                            buf->dma_addr);
        buf->cpu_addr = NULL;
        return -EAGAIN;
    }

    return 0; /* All good! */
}

Once those are in place, user space application can pretty much do everything — control the device by reading/writing from/to I/O registers, allocate and free DMA buffers of arbitrary size, and have the device perform DMA transactions. The only missing part is interrupt-handling. I was doing polling in user space, burning my CPU, and had interrupts disabled.

Hope it helps. Good Luck!

#4

I'm getting confused with the direction to implement. I want to...

Consider the application when designing a driver.
What is the nature of data movement, frequency, size and what else might be going on in the system?

Is the traditional read/write API sufficient? Is direct mapping the device into user space OK? Is a reflective (semi-coherent) shared memory desirable?

Manually manipulating data (read/write) is a pretty good option if the data lends itself to being well understood. Using general-purpose VM and read/write with an inline copy may be sufficient. Directly mapping non-cacheable accesses to the peripheral is convenient, but can be clumsy. If the access is the relatively infrequent movement of large blocks, it may make sense to use regular memory and have the driver pin the pages, translate addresses, DMA, and release the pages. As an optimization, the pages (maybe huge) can be pre-pinned and translated; the driver can then recognize the prepared memory and avoid the complexities of dynamic translation. If there are lots of little I/O operations, having the driver run asynchronously makes sense. If elegance is important, the VM dirty page flag can be used to automatically identify what needs to be moved and a (meta_sync()) call can be used to flush pages. Perhaps a mixture of the above works...

Too often people don't look at the larger problem, before digging into the details. Often the simplest solutions are sufficient. A little effort constructing a behavioral model can help guide what API is preferable.

#5

first_page_offset = udata & PAGE_MASK;

It seems wrong. It should be either:

first_page_offset = udata & ~PAGE_MASK;

or

first_page_offset = udata & (PAGE_SIZE - 1)
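
The difference is easy to see with concrete numbers. The sketch below redefines the macros in user space the way the kernel conventionally does (PAGE_MASK keeps the page-number bits, so its complement keeps the in-page offset); the 4 KB page size, struct page_span and span_of() are assumptions for the illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative user-space stand-ins for the kernel macros (4 KB pages). */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))   /* keeps the page-number bits */

struct page_span {
    unsigned long first_page;        /* page index of the first byte */
    unsigned long last_page;         /* page index of the last byte */
    unsigned long first_page_offset; /* offset of udata within its page */
    unsigned long npages;            /* pages touched by the buffer */
};

static struct page_span span_of(unsigned long udata, unsigned long nbytes)
{
    struct page_span s;

    assert(nbytes > 0);
    s.first_page = udata >> PAGE_SHIFT;
    s.last_page = (udata + nbytes - 1) >> PAGE_SHIFT;
    s.first_page_offset = udata & ~PAGE_MASK;   /* low bits only */
    s.npages = s.last_page - s.first_page + 1;
    return s;
}
```

For udata = 0x1234 and nbytes = 0x2000 this gives first_page = 1, last_page = 3, npages = 3 and first_page_offset = 0x234, whereas udata & PAGE_MASK would yield 0x1000, the start of the page rather than the offset into it.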

#6

It is worth mentioning that a driver with scatter-gather DMA support and user-space memory allocation is the most efficient and has the highest performance. However, if we don't need high performance, or we want to develop a driver under some simplified conditions, we can use some tricks.

Give up the zero-copy design. This is worth considering when data throughput is not too high. In such a design, data can be copied to the user with copy_to_user(user_buffer, kernel_dma_buffer, count); user_buffer might be, for example, the buffer argument in a character device's read() system call implementation. We still need to take care of allocating kernel_dma_buffer. It might be memory obtained from a dma_alloc_coherent() call, for example.

Another trick is to limit system memory at boot time and then use the rest as a huge contiguous DMA buffer. This is especially useful during driver and FPGA DMA controller development and is not recommended in production environments. Let's say a PC has 32 GB of RAM. If we add mem=20G to the kernel boot parameters, we can use the remaining 12 GB as a huge contiguous DMA buffer. To map this memory to user space, simply implement mmap() as

remap_pfn_range(vma,
    vma->vm_start,
    (0x500000000 >> PAGE_SHIFT) + vma->vm_pgoff, 
    vma->vm_end - vma->vm_start,
    vma->vm_page_prot)

Of course, this 12 GB is completely ignored by the OS and can be used only by a process that has mapped it into its address space. We can try to avoid this by using the Contiguous Memory Allocator (CMA).
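
The constant 0x500000000 in the snippet above is just 20 GB written as a physical address. The sketch below spells out that arithmetic; it assumes a simplified, hole-free physical memory layout starting at address 0 and 4 KB pages, and the helper names are made up for the illustration. On a real machine the memory map should be checked (for example in /proc/iomem) before trusting these numbers.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12                 /* 4 KB pages assumed */
#define GiB (UINT64_C(1) << 30)

/* First physical address the kernel no longer uses when booted with
 * mem=<visible_gib>G (simplified: flat layout, no holes). */
static uint64_t reserved_base(uint64_t visible_gib)
{
    return visible_gib * GiB;
}

/* Bytes left over for the DMA buffer under the same simplification. */
static uint64_t reserved_bytes(uint64_t total_gib, uint64_t visible_gib)
{
    assert(visible_gib <= total_gib);
    return (total_gib - visible_gib) * GiB;
}

/* Page frame number to hand to remap_pfn_range(): the base pfn plus the
 * page offset requested through mmap() (vma->vm_pgoff in the driver). */
static uint64_t reserved_pfn(uint64_t visible_gib, uint64_t vm_pgoff)
{
    return (reserved_base(visible_gib) >> EX_PAGE_SHIFT) + vm_pgoff;
}
```

With 32 GB installed and mem=20G, reserved_base(20) is 0x500000000 and the base page frame number is 0x500000, which matches the value computed in the remap_pfn_range() call.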

Again, the above tricks will not replace a full scatter-gather, zero-copy DMA driver, but they are useful during development or on lower-performance platforms.
