Analysis of the Linux DirectIO Mechanism

Date: 2022-05-23 16:55:12

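DirectIO (the O_DIRECT flag, set on the file when it is opened) tells write to send data straight to the disk rather than leave it in the page cache, so that critical data has been handed to the disk even if the system crashes afterwards. For the general write path, see my earlier post on the Linux kernel file write path (<Linux内核写文件流程>); the DirectIO path picks up where that one leaves off.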
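From user space, the entry point into this path is an open(2) with O_DIRECT followed by a write from an aligned buffer. The following is a minimal sketch, not taken from the original post: the helper names dio_alloc and dio_write_file and the 4096-byte DIO_ALIGN are assumptions of mine, and the real alignment requirement depends on the device and filesystem.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096         /* assumed alignment; device dependent in practice */

/* Allocate a zero-filled buffer whose address is a multiple of DIO_ALIGN. */
static void *dio_alloc(size_t len)
{
	void *buf = NULL;

	assert(len % DIO_ALIGN == 0);
	if (posix_memalign(&buf, DIO_ALIGN, len) != 0)
		return NULL;
	assert((uintptr_t)buf % DIO_ALIGN == 0);
	memset(buf, 0, len);
	return buf;
}

/*
 * Write len bytes at offset 0 of path through an O_DIRECT descriptor.
 * Returns the number of bytes written, or -1 with errno set. EINVAL
 * usually means the filesystem does not support O_DIRECT.
 */
static ssize_t dio_write_file(const char *path, const void *buf, size_t len)
{
	ssize_t n;
	int saved;
	int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

	if (fd < 0)
		return -1;
	n = pwrite(fd, buf, len, 0);
	saved = errno;             /* keep pwrite's errno across close() */
	close(fd);
	errno = saved;
	return n;
}
```

A caller would allocate with dio_alloc(DIO_ALIGN), fill the buffer, and pass it to dio_write_file; an unaligned buffer or length is a common cause of EINVAL on such a descriptor.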

When the kernel reaches __generic_file_aio_write, it tests file->f_flags & O_DIRECT to decide whether to take the DirectIO branch:

	if (unlikely(file->f_flags & O_DIRECT)) {
		loff_t endbyte;
		ssize_t written_buffered;

		written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
							ppos, count, ocount);
		if (written < 0 || written == count)
			goto out;
		/*
		 * direct-io write to a hole: fall through to buffered I/O
		 * for completing the rest of the request.
		 */
		pos += written;
		count -= written;
		written_buffered = generic_file_buffered_write(iocb, iov,
						nr_segs, pos, ppos, count,
						written);
		/*
		 * If generic_file_buffered_write() returned a synchronous error
		 * then we want to return the number of bytes which were
		 * direct-written, or the error code if that was zero.  Note
		 * that this differs from normal direct-io semantics, which
		 * will return -EFOO even if some bytes were written.
		 */
		if (written_buffered < 0) {
			err = written_buffered;
			goto out;
		}

		/*
		 * We need to ensure that the page cache pages are written to
		 * disk and invalidated to preserve the expected O_DIRECT
		 * semantics.
		 */
		endbyte = pos + written_buffered - written - 1;
		err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
		if (err == 0) {
			written = written_buffered;
			invalidate_mapping_pages(mapping,
						 pos >> PAGE_CACHE_SHIFT,
						 endbyte >> PAGE_CACHE_SHIFT);
		} else {
			/*
			 * We don't know how much we wrote, so just return
			 * the number of bytes which were direct-written
			 */
		}
	}
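The return-value rules in this branch are easy to misread, so here is a small user-space model of them. It is only a sketch of the control flow quoted above, not kernel code: the function name dio_branch_result and its parameters are hypothetical stand-ins for the results of the direct write, the buffered fallback, and the flush.

```c
#include <assert.h>

/*
 * Model of the O_DIRECT branch's result (hypothetical helper).
 *   direct         - result of the direct write: bytes, or negative errno
 *   count          - bytes requested
 *   buffered_total - result of the buffered fallback: direct bytes plus
 *                    buffered bytes, or negative errno
 *   flush_err      - result of flushing the buffered range: 0 or negative errno
 */
static long dio_branch_result(long direct, long count,
			      long buffered_total, int flush_err)
{
	long written = direct;
	long err = 0;

	assert(count >= 0);
	if (written < 0 || written == count)
		return written;              /* error, or everything went direct */

	if (buffered_total < 0)
		err = buffered_total;        /* fallback failed synchronously */
	else if (flush_err == 0)
		written = buffered_total;    /* fallback data was flushed to disk */
	else
		err = flush_err;             /* report only the direct bytes */

	return written ? written : err;
}
```

For example, dio_branch_result(1024, 4096, 4096, 0) models a write where 1024 bytes went direct and the rest was completed through the page cache and flushed, so the caller sees 4096.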

Start with generic_file_direct_write. Three calls do the real work there: filemap_write_and_wait_range, invalidate_inode_pages2_range, and mapping->a_ops->direct_IO.

filemap_write_and_wait_range flushes the dirty pages under the mapping; underneath, __filemap_fdatawrite_range calls do_writepages to do it:

int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
	int ret;

	if (wbc->nr_to_write <= 0)
		return 0;
	if (mapping->a_ops->writepages)
		ret = mapping->a_ops->writepages(mapping, wbc);
	else
		ret = generic_writepages(mapping, wbc);
	return ret;
}

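filemap_write_and_wait_range returns 0 on success and a negative error code on failure. On a nonzero return, generic_file_direct_write bails out immediately and the two later calls never run. My understanding of why the flush comes first: data buffered earlier for this range has to reach the disk together with the direct write. Otherwise the direct_IO data would already be on disk while the older cached data was not, and a crash at that moment would leave the file system in a broken state.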
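The closest user-space counterpart to this "flush the dirty pages of a range and wait" step is sync_file_range(2). A rough sketch follows; the helper name flush_range is hypothetical, and the fallback to fdatasync(2) is my own choice for systems where sync_file_range is unavailable. Note that sync_file_range does not write file metadata, so it is an analogy to the kernel step rather than a durability guarantee.

```c
#define _GNU_SOURCE            /* exposes sync_file_range() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write back the dirty page-cache pages of fd in [offset, offset + len)
 * and wait for that writeback to finish. Falls back to fdatasync(),
 * which covers the whole file, if sync_file_range() fails.
 * Returns 0 on success, -1 on error.
 */
static int flush_range(int fd, off_t offset, off_t len)
{
	assert(offset >= 0 && len > 0);
	if (sync_file_range(fd, offset, len,
			    SYNC_FILE_RANGE_WAIT_BEFORE |
			    SYNC_FILE_RANGE_WRITE |
			    SYNC_FILE_RANGE_WAIT_AFTER) == 0)
		return 0;
	return fdatasync(fd);
}
```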

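If the flush succeeds and mapping->nrpages is nonzero, invalidate_inode_pages2_range runs next. It looks for cached pages covering the range that direct_IO is about to write and marks them invalid. The reason: data written through direct_IO is not cached, so a clean cached page for the same range would no longer match the disk once direct_IO completes, and a read served from that cache without any protection would return data that is not what is on disk. In the kernel version I am reading, if a page cannot be invalidated (-EBUSY), generic_file_direct_write returns 0 without calling direct_IO and the caller falls back to a buffered write; any other error is returned directly.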
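An application can do something similar from user space before issuing direct writes, by asking the kernel to drop the cached pages of a range with posix_fadvise(2) and POSIX_FADV_DONTNEED. This is a sketch under assumptions: drop_cached_range is a hypothetical helper, and the advice is only a hint, so the kernel may keep pages that are still in use.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to drop cached pages of fd in [offset, offset + len).
 * Dirty pages are not discarded by POSIX_FADV_DONTNEED, so write them
 * back first. Returns 0 on success, or a positive errno value on failure.
 */
static int drop_cached_range(int fd, off_t offset, off_t len)
{
	assert(offset >= 0 && len >= 0);
	if (fdatasync(fd) != 0)          /* write back dirty pages first */
		return errno;
	/* posix_fadvise() returns the error number directly, not -1 */
	return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```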

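Only now do we reach the real subject, mapping->a_ops->direct_IO. For ext3 it is defined in struct address_space_operations ext3_ordered_aops as ext3_direct_IO, whose core is __blockdev_direct_IO. direct_io_worker assembles the dio structure and then calls dio_bio_submit, which boils down to submit_bio(dio->rw, bio) handing the request to the I/O layer. Compared with other reads and writes, direct_io skips the buffer layer and does not rely on the intermediate threads pdflush and kjournald to flush data to the I/O layer periodically. Even so, the data is not necessarily on the disk at this point; direct_IO simply assumes that the device driver underneath does not add significant delay.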
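Because finishing the direct write does not by itself mean the data is on stable media, applications that need that guarantee usually follow the write with fdatasync(2) (or open the file with O_DSYNC as well). A minimal sketch, with write_durable as a hypothetical helper name; on an O_DIRECT descriptor the caller would additionally have to keep the buffer and length aligned, which this sketch does not check.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write all len bytes of buf to fd, then call fdatasync() so the data is
 * pushed to stable storage before returning.
 * Returns 0 on success, -1 on error with errno set.
 */
static int write_durable(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	assert(buf != NULL);
	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;        /* interrupted: retry */
			return -1;
		}
		if (n == 0)
			return -1;           /* no progress: give up */
		done += (size_t)n;
	}
	return fdatasync(fd);
}
```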

Once mapping->a_ops->direct_IO has finished, invalidate_inode_pages2_range is run a second time. The kernel comment gives the reason:

	/*
	 * Finally, try again to invalidate clean pages which might have been
	 * cached by non-direct readahead, or faulted in by get_user_pages()
	 * if the source of the write was an mmap'ed region of the file
	 * we're writing.  Either one is a pretty crazy thing to do,
	 * so we don't support it 100%.  If this invalidation
	 * fails, tough, the write still worked...
	 */

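When a system gets complex enough, it is hard to find a fully rigorous, formal guarantee for the whole process; sometimes a crude, repeat-it-to-be-safe approach is simple and effective.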

Now back to __generic_file_aio_write:

		written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
							ppos, count, ocount);
		if (written < 0 || written == count)
			goto out;
		/*
		 * direct-io write to a hole: fall through to buffered I/O
		 * for completing the rest of the request.
		 */
		pos += written;
		count -= written;
		written_buffered = generic_file_buffered_write(iocb, iov,
						nr_segs, pos, ppos, count,
						written);

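If generic_file_direct_write returns a non-negative value other than count, the remainder of the request goes through the buffered write generic_file_buffered_write. From the code, two cases lead here: direct_IO itself wrote less than requested (the "write to a hole" case named in the comment), or a cached page in the range could not be invalidated, in which case generic_file_direct_write returns 0 and the whole request is written through the page cache.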

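So what we see is that the so-called direct_IO does not fully guarantee that the buffer is bypassed; under certain conditions the write is still a buffered write. Where DirectIO is strictly required, these two cases have to be avoided by keeping the file's cache mapping under control.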

The small tool vmtouch is a simple and effective way to control the cache.
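To check whether a file still has pages in the cache before relying on DirectIO, one option is the mmap(2) plus mincore(2) combination, which is also the mechanism vmtouch builds on. The sketch below is my own illustration, not vmtouch code, and the helper name resident_pages is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count how many pages of the file at path are resident in memory.
 * Stores the file's total page count in *total. Returns the resident
 * count, or -1 on error (including an empty file, which cannot be mapped).
 */
static long resident_pages(const char *path, long *total)
{
	long page = sysconf(_SC_PAGESIZE);
	long resident = 0, npages, i;
	unsigned char *vec;
	struct stat st;
	void *map;
	int fd = open(path, O_RDONLY);

	assert(total != NULL);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) != 0 || st.st_size == 0) {
		close(fd);
		return -1;
	}
	npages = (st.st_size + page - 1) / page;
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);                       /* the mapping keeps the file referenced */
	if (map == MAP_FAILED)
		return -1;
	vec = malloc(npages);
	if (vec == NULL || mincore(map, st.st_size, vec) != 0) {
		free(vec);
		munmap(map, st.st_size);
		return -1;
	}
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;  /* bit 0 set: page is in memory */
	free(vec);
	munmap(map, st.st_size);
	*total = npages;
	return resident;
}
```

A nonzero result for a file that is about to be written with O_DIRECT is a hint that the invalidation step described earlier will have work to do.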


Analysis of the Linux DirectIO Mechanism, from OenHan. Link: http://oenhan.com/ext3-fs-directio