DirectIO is a write option, enabled with the O_DIRECT flag when the file is opened, that sends data straight to disk instead of leaving it in the page cache, so that critical data still reaches the disk even if the system fails. The general file write path was covered in the earlier article "Linux Kernel File Write Flow"; the DirectIO path continues from that flow.
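Before going into the kernel, here is a minimal userspace sketch of how O_DIRECT is typically used. The file name and the 4096-byte alignment are illustrative assumptions; the real requirement is that the buffer address, length and file offset are aligned to the logical block size of the underlying device and filesystem.

/* Minimal O_DIRECT usage sketch.  File name and 4096-byte alignment are
 * illustrative assumptions; buffer address, length and file offset must
 * all be aligned to the device's logical block size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;
    if (posix_memalign(&buf, 4096, 4096))   /* aligned buffer */
        return 1;
    memset(buf, 'A', 4096);
    if (write(fd, buf, 4096) != 4096)       /* goes around the page cache */
        return 1;
    free(buf);
    close(fd);
    return 0;
}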
When the kernel reaches __generic_file_aio_write, it checks file->f_flags & O_DIRECT and takes the DirectIO branch:
if (unlikely(file->f_flags & O_DIRECT)) {
    loff_t endbyte;
    ssize_t written_buffered;

    written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                        ppos, count, ocount);
    if (written < 0 || written == count)
        goto out;
    /*
     * direct-io write to a hole: fall through to buffered I/O
     * for completing the rest of the request.
     */
    pos += written;
    count -= written;
    written_buffered = generic_file_buffered_write(iocb, iov,
                                    nr_segs, pos, ppos, count,
                                    written);
    /*
     * If generic_file_buffered_write() retuned a synchronous error
     * then we want to return the number of bytes which were
     * direct-written, or the error code if that was zero.  Note
     * that this differs from normal direct-io semantics, which
     * will return -EFOO even if some bytes were written.
     */
    if (written_buffered < 0) {
        err = written_buffered;
        goto out;
    }

    /*
     * We need to ensure that the page cache pages are written to
     * disk and invalidated to preserve the expected O_DIRECT
     * semantics.
     */
    endbyte = pos + written_buffered - written - 1;
    err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
    if (err == 0) {
        written = written_buffered;
        invalidate_mapping_pages(mapping,
                                 pos >> PAGE_CACHE_SHIFT,
                                 endbyte >> PAGE_CACHE_SHIFT);
    } else {
        /*
         * We don't know how much we wrote, so just return
         * the number of bytes which were direct-written
         */
    }
}
Let's look at generic_file_direct_write first. The work is done mainly by three calls: filemap_write_and_wait_range, invalidate_inode_pages2_range and mapping->a_ops->direct_IO.
filemap_write_and_wait_range flushes the dirty pages under the mapping; the actual writeback is done by do_writepages, called from __filemap_fdatawrite_range:
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
    int ret;

    if (wbc->nr_to_write <= 0)
        return 0;
    if (mapping->a_ops->writepages)
        ret = mapping->a_ops->writepages(mapping, wbc);
    else
        ret = generic_writepages(mapping, wbc);
    return ret;
}
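For context, the surrounding function looks roughly like the following (a condensed sketch of mm/filemap.c from that kernel generation, not the exact source): it starts synchronous writeback for the byte range and then waits for it to complete.

/* Condensed sketch of filemap_write_and_wait_range(). */
int filemap_write_and_wait_range(struct address_space *mapping,
                                 loff_t lstart, loff_t lend)
{
    int err = 0;

    if (mapping->nrpages) {
        /* Kick off WB_SYNC_ALL writeback for the range ... */
        err = __filemap_fdatawrite_range(mapping, lstart, lend,
                                         WB_SYNC_ALL);
        /* ... and, unless the flush already hit -EIO, wait for it. */
        if (err != -EIO) {
            int err2 = filemap_fdatawait_range(mapping, lstart, lend);
            if (!err)
                err = err2;
        }
    }
    return err;
}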
filemap_write_and_wait_range returns 0 on success or a negative error code; if it fails, generic_file_direct_write bails out immediately and the remaining two calls are skipped. My understanding of why the flush is there: any dirty cached data overlapping the range has to go to disk together with the direct write, otherwise the directly written blocks could land on disk while earlier buffered data has not, and after a crash the file (or even the file system) would be inconsistent.
If the flush succeeds and mapping->nrpages shows the mapping has cached pages, invalidate_inode_pages2_range is called. Its job is to check whether there are cached pages covering the range about to be written by direct_IO and, if so, mark them invalid. The reason: direct_IO data is not cached, so if a clean cached page already covers the range, cache and disk would diverge once the direct write completes, and an unprotected cached read would no longer return what is on disk. If a page cannot be invalidated (it is still busy), generic_file_direct_write returns 0 without doing the direct write, so the caller falls back to a buffered write.
Only now do we reach the real subject, mapping->a_ops->direct_IO. For ext3 it is set in struct address_space_operations ext3_ordered_aops to ext3_direct_IO, whose core is __blockdev_direct_IO: direct_io_worker assembles the dio structure and its bios, and dio_bio_submit hands them to the block layer via submit_bio(dio->rw, bio). Compared with ordinary reads and writes, direct IO skips the buffer layer; it does not depend on pdflush or kjournald periodically flushing pages down to the IO layer. Even at this point the data is not necessarily on the platter yet; direct_IO simply assumes the IO device driver does not add significant delay.
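As a rough, condensed sketch of fs/ext3/inode.c (journal start/stop, orphan handling and error paths for writes are omitted), ext3_direct_IO essentially forwards to the blockdev_direct_IO helper, passing ext3_get_block to map file blocks to disk blocks:

/* Condensed sketch of ext3_direct_IO(); journal and error handling
 * are left out. */
static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
                              const struct iovec *iov, loff_t offset,
                              unsigned long nr_segs)
{
    struct file *file = iocb->ki_filp;
    struct inode *inode = file->f_mapping->host;

    /* blockdev_direct_IO() wraps __blockdev_direct_IO(): direct_io_worker()
     * builds the dio/bio structures and dio_bio_submit() pushes them to the
     * block layer with submit_bio(), using ext3_get_block() for mapping. */
    return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
                              iov, offset, nr_segs, ext3_get_block, NULL);
}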
Once mapping->a_ops->direct_IO has finished, invalidate_inode_pages2_range is run one more time. The kernel's own comment gives the reason:

/*
 * Finally, try again to invalidate clean pages which might have been
 * cached by non-direct readahead, or faulted in by get_user_pages()
 * if the source of the write was an mmap'ed region of the file
 * we're writing.  Either one is a pretty crazy thing to do,
 * so we don't support it 100%.  If this invalidation fails,
 * tough, the write still worked...
 */
When a system gets this complex, it is hard to find a fully rigorous, formal guarantee for every step; sometimes a crude but simple approach is effective enough.
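Putting the three steps together, the body of generic_file_direct_write in kernels of this generation looks roughly like this (a condensed sketch: the i_size/*ppos updates and several length checks are omitted, so treat it as an outline rather than the exact source):

/* Condensed sketch of generic_file_direct_write(); i_size and *ppos
 * updates, plus some length checks, are omitted. */
ssize_t generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
        unsigned long *nr_segs, loff_t pos, loff_t *ppos,
        size_t count, size_t ocount)
{
    struct address_space *mapping = iocb->ki_filp->f_mapping;
    pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
    ssize_t written;

    /* 1. Flush and wait on dirty pages overlapping the range. */
    written = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
    if (written)
        goto out;

    /* 2. Drop clean cached pages over the range; if a page is busy,
     *    return 0 so the caller falls back to a buffered write. */
    if (mapping->nrpages) {
        written = invalidate_inode_pages2_range(mapping,
                        pos >> PAGE_CACHE_SHIFT, end);
        if (written) {
            if (written == -EBUSY)
                return 0;
            goto out;
        }
    }

    /* 3. The actual direct IO (ext3_direct_IO for ext3). */
    written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);

    /* 4. Invalidate again: readahead or get_user_pages() may have
     *    pulled pages back in while the direct write was in flight. */
    if (mapping->nrpages)
        invalidate_inode_pages2_range(mapping,
                        pos >> PAGE_CACHE_SHIFT, end);
out:
    return written;
}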
Now step back out to __generic_file_aio_write:
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                    ppos, count, ocount);
if (written < 0 || written == count)
    goto out;
/*
 * direct-io write to a hole: fall through to buffered I/O
 * for completing the rest of the request.
 */
pos += written;
count -= written;
written_buffered = generic_file_buffered_write(iocb, iov,
                                nr_segs, pos, ppos, count,
                                written);
If the value returned by generic_file_direct_write is not the expected count, the remainder is rewritten through the cache with generic_file_buffered_write. As analysed above, that happens in two situations: the direct write runs into a hole in the file, so fewer bytes than requested are written, or a cached page covering the range cannot be invalidated and generic_file_direct_write returns 0.
The result is that so-called direct_IO does not fully guarantee bypassing the page cache; under these conditions it degrades into a buffered write. So when strict direct IO is required, an application has to avoid these two situations itself and keep control over what is cached or mapped.
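As a sketch of how an application might sidestep both fallbacks (the helper name, path handling and sizes here are purely illustrative): preallocate the file through a regular descriptor so the direct writes never hit a hole, then reopen it with O_DIRECT and keep other code paths from mmap'ing or buffer-writing the same range.

/* Illustrative helper: prepare a file for strict O_DIRECT writes.
 * Names and sizes are assumptions, not kernel or libc requirements. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int prepare_direct_file(const char *path, off_t size)
{
    /* 1. No holes: preallocate through a plain descriptor first.  glibc's
     *    posix_fallocate() fallback on ext3 issues small unaligned writes,
     *    which an O_DIRECT descriptor would reject with EINVAL. */
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (posix_fallocate(fd, 0, size)) {
        close(fd);
        return -1;
    }
    close(fd);

    /* 2. No competing mappings: reopen with O_DIRECT and make sure nothing
     *    else mmap()s or buffer-writes the same range, otherwise busy pages
     *    push generic_file_direct_write() back onto the buffered path. */
    return open(path, O_WRONLY | O_DIRECT);
}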
For controlling the page cache from the outside, the small utility vmtouch is simple and effective.
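For example, vmtouch -e <file> evicts a file's pages from memory. The in-program counterpart is posix_fadvise (a sketch; POSIX_FADV_DONTNEED is only a hint to the kernel, not a guarantee):

/* Sketch: drop a file's clean pages from the page cache.  Dirty pages
 * are not dropped until written back, hence the fsync() first. */
#include <fcntl.h>
#include <unistd.h>

int drop_file_cache(int fd)
{
    if (fsync(fd))      /* flush dirty pages so DONTNEED can evict them */
        return -1;
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}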
"Linux DirectIO Mechanism Analysis" is from OenHan: http://oenhan.com/ext3-fs-directio