Linux File System Change Monitoring Technology, Notifier Technology

Date: 2022-11-22 13:25:11

Contents

1. Why monitor the file system
2. hotplug
3. udev
4. fanotify (fscking all notification system)
5. inotify
6. Code example

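1. Why monitor the file system

In day-to-day work, people often need to know what has changed in certain files or folders, for example:

1. Being notified when a configuration file changes
2. Tracking changes to critical system files
3. Monitoring the overall usage of a disk partition
4. Cleaning up automatically after a system crash
5. Triggering a backup process automatically
6. Sending a notification when a file upload to a server finishes
7. Anti-virus software needs to monitor file changes on disk in real time and scan file contents

The usual approach is a polling-based notification mechanism, but polling only suits files that change frequently (because each poll, every x seconds, is then likely to find new I/O). In every other case it is very inefficient, and it sometimes misses certain kinds of change, for example when a file's modification time does not change. Data integrity systems such as Tripwire track file changes on a time schedule, but a time schedule is of no help when changes must be detected in real time.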
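To make the polling approach concrete, here is a minimal sketch of one polling round. The names `poll_state` and `poll_changed` are hypothetical, and comparing only the modification time and size is an assumption about what a simple poller would check.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Snapshot of the metadata a polling monitor compares between rounds. */
struct poll_state {
    bool   seen;   /* false until the first successful stat() */
    time_t mtime;  /* last observed modification time (seconds) */
    off_t  size;   /* last observed size in bytes */
};

/*
 * Hypothetical helper: returns 1 if `path` looks changed since the previous
 * call, 0 if it looks unchanged, -1 if stat() failed. The first successful
 * call only records a baseline and returns 0.
 */
int poll_changed(const char *path, struct poll_state *st)
{
    struct stat sb;

    if (stat(path, &sb) == -1)
        return -1;

    int changed = st->seen &&
                  (sb.st_mtime != st->mtime || sb.st_size != st->size);

    st->seen  = true;
    st->mtime = sb.st_mtime;
    st->size  = sb.st_size;
    return changed;
}
```

A caller would invoke `poll_changed()` every x seconds for every file of interest. A rewrite that keeps the same size within the same second goes unnoticed by this check, which is the kind of missed change described above.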

Relevant Link:

http://www.jiangmiao.org/blog/2179.html
http://www.infoq.com/cn/articles/inotify-linux-file-system-event-monitoring

2. hotplug

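Hotplug is a mechanism by which the kernel tells user-space applications about hot-plug device events, and desktop systems use it to manage devices effectively. Whenever a device is added to or removed from the system, a "hotplug event" is generated, which means the kernel invokes the user-space program /sbin/hotplug. This program is typically a very small bash script that simply hands execution on to a series of other programs under the /etc/hotplug.d/ directory tree. On most Linux distributions the script looks like this: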

DIR="/etc/hotplug.d"
for I in "${DIR}/$1/"*.hotplug "${DIR}/"default/*.hotplug ; do
    if [ -f $I ]; then
        test -x $I && $I $1 ;
    fi
done
exit 1

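The script looks for every program with a .hotplug suffix that might be interested in the event and invokes it, passing along a number of environment variables that the kernel has already set.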
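As an illustration of how a handler might consume those environment variables, here is a minimal sketch. The function name `describe_hotplug_event` is hypothetical; it assumes the kernel has set ACTION and DEVPATH, and that the subsystem name is what the script passes as `$1`.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper for a *.hotplug handler: formats the event described by
 * the kernel-provided environment into `out`. Returns 0 on success, -1 if
 * ACTION or DEVPATH is missing or the buffer is too small.
 */
int describe_hotplug_event(const char *subsystem, char *out, size_t outlen)
{
    const char *action  = getenv("ACTION");   /* e.g. "add" or "remove" */
    const char *devpath = getenv("DEVPATH");  /* sysfs path of the device */

    if (action == NULL || devpath == NULL)
        return -1;

    int n = snprintf(out, outlen, "%s: %s %s",
                     subsystem ? subsystem : "unknown", action, devpath);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A real handler would call this from `main()` with `argv[1]` as the subsystem and then log or act on the result.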

Relevant Link:

http://linux-hotplug.sourceforge.net/?selected=overview
http://oss.org.cn/kernel-book/ldd3/ch14s07.html

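3. udev

udev is the device manager for the Linux 2.6 kernel series. Its main job is to manage the device nodes under /dev. It also takes over the functions of devfs and hotplug, which means it handles /dev and all user-space actions when hardware is added or removed, including firmware loading.
In a traditional Linux system the device nodes under /dev are a set of static files, whereas udev dynamically provides nodes only for devices actually present in the system. devfs offered similar functionality, but udev has the following advantages:

1. udev supports persistent device naming that does not depend on the order in which devices are plugged in. The default udev setup provides persistent names for storage devices. A device can be identified by its
1) vid (vendor)
2) pid (device)
3) device name (model) and similar attributes
4) or by the corresponding attributes of its parent device
2. udev runs entirely in user space rather than in kernel space as devfs does. As a result, udev moves naming policy out of the kernel and can run arbitrary programs to derive a device's name from its attributes before the node is created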

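0x1: How it runs

udev is a generic kernel device manager. It runs as a daemon on Linux and listens for the uevents that the kernel sends (over a netlink socket) when a new device is initialized or a device is removed from the system.

The system provides a set of rules that match against the exported values of discoverable device events and attributes. A matching rule may name and create a device node and run configuration programs to set the device up. udev rules can match on properties such as the kernel subsystem, the kernel device name, the device's physical location, or its serial number. Rules can also ask an external program for information to name a device, or specify a custom name that always stays the same regardless of when the system discovers the device.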
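The uevents mentioned above reach user space as small text messages. The sketch below shows how a listener might pull one property out of such a message. It assumes the raw kernel format (a header like `add@/devices/...` followed by NUL-separated KEY=VALUE strings), and the helper name `uevent_get` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: looks up `key` in a raw uevent message of `len` bytes.
 * The message is assumed to be a sequence of NUL-terminated strings: a header
 * such as "add@/devices/..." followed by KEY=VALUE entries.
 * Returns a pointer to the value inside `msg`, or NULL if the key is absent.
 */
const char *uevent_get(const char *msg, size_t len, const char *key)
{
    size_t klen = strlen(key);
    size_t pos  = 0;

    while (pos < len) {
        const char *entry = msg + pos;
        size_t elen = strnlen(entry, len - pos);

        /* Accept only entries that are NUL-terminated inside the buffer. */
        if (elen > klen && entry[klen] == '=' &&
            strncmp(entry, key, klen) == 0 && pos + elen < len)
            return entry + klen + 1;

        pos += elen + 1;   /* skip the entry and its NUL terminator */
    }
    return NULL;
}
```

In a real listener the buffer would come from `recv()` on a socket created with `socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT)`; that part is omitted here to keep the sketch short.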

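0x2: System architecture

udev can be divided into three parts:

1. The libudev library: used to obtain device information
2. The udevd daemon: runs in user space and manages the virtual /dev
3. The udevadm management command: used to diagnose problems

The system receives the information that the kernel sends over the netlink socket.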

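0x3: Command format

1. BUS (bus), KERNEL (kernel name, e.g. sd*), ID (device ID, e.g. the bus ID), PLACE
2. SYSFS{filename} or ATTR{filename}
3. PROGRAM (invoke an external program), RESULT (match the result returned by PROGRAM), NAME
4. SYMLINK (symlink rule)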

Relevant Link:

http://zh.wikipedia.org/wiki/Udev
https://www.ibm.com/developerworks/cn/linux/l-cn-udev/
https://wiki.archlinux.org/index.php/Udev_(%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87)
https://www.suse.com/zh-cn/documentation/sles11/singlehtml/book_sle_admin/cha.udev.html

4. fanotify(fscking all notification system)

fanotify (fscking all notification and file access system) is a notifier, that is, a mechanism that generates notifications about file system changes. It was designed as the next-generation file system notification mechanism to succeed inotify.

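0x1: fanotify features: file system event notification

The most basic job of a notifier is to tell the monitoring program when the file system changes. In Linux history this service was first provided by dnotify and later taken over by inotify; fanotify provides notification as well, for the following events: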

1. FAN_ACCESS: File was accessed
2. FAN_MODIFY: File was modified
3. FAN_CLOSE_WRITE: Writable file closed
4. FAN_CLOSE_NOWRITE: Unwritable file closed
5. FAN_OPEN: File was opened
6. FAN_OPEN_PERM: File open in perm check
7. FAN_ACCESS_PERM: File accessed in perm check
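Each event arrives as a bit mask, so a listener typically decodes the mask before acting on it. A minimal sketch follows; the function name `fan_mask_to_string` is hypothetical, and the constants come from `<sys/fanotify.h>`.

```c
#include <stdio.h>
#include <string.h>
#include <sys/fanotify.h>

/*
 * Hypothetical helper: writes the names of the event bits set in `mask`
 * into `out`, separated by '|'. Unknown bits are ignored.
 */
void fan_mask_to_string(unsigned long long mask, char *out, size_t outlen)
{
    static const struct { unsigned long long bit; const char *name; } names[] = {
        { FAN_ACCESS,        "FAN_ACCESS" },
        { FAN_MODIFY,        "FAN_MODIFY" },
        { FAN_CLOSE_WRITE,   "FAN_CLOSE_WRITE" },
        { FAN_CLOSE_NOWRITE, "FAN_CLOSE_NOWRITE" },
        { FAN_OPEN,          "FAN_OPEN" },
        { FAN_OPEN_PERM,     "FAN_OPEN_PERM" },
        { FAN_ACCESS_PERM,   "FAN_ACCESS_PERM" },
    };
    size_t used = 0;

    if (outlen == 0)
        return;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (!(mask & names[i].bit))
            continue;
        int n = snprintf(out + used, outlen - used, "%s%s",
                         used ? "|" : "", names[i].name);
        if (n < 0 || (size_t)n >= outlen - used) {
            out[used] = '\0';   /* no room: drop the partial name and stop */
            break;
        }
        used += (size_t)n;
    }
}
```

The names are emitted in table order, not in the order the bits were combined by the caller.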

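0x2: fanotify features: whole-file-system monitoring

inotify uses a watch descriptor (wd) data structure to represent a monitored file or directory. Every file system object (file or directory) to be monitored needs its own wd object.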

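fanotify has three basic modes:

1. directed: like inotify, directed mode works directly on the inode of the monitored object and can watch only one object at a time, so monitoring a large number of targets is still cumbersome
2. per-mount: per-mount mode works on a mount point. For example, if the disk /dev/sda2 is mounted at /home, every file system change under /home can be monitored. This can be seen as another kind of global mode
3. global: global mode monitors the whole file system, and every change is reported to the listener. Anti-virus software works in this mode
/*
Note that
fanotify still does not support sub-tree monitoring. It does go one step beyond inotify, though, in that it can monitor the direct children of a directory. For example, when /home and its direct children are monitored, the files /home/foo1, /home/foo2 and so on are covered, but /home/pics/foo1 is not, because /home/pics/foo1 is not a direct child of /home
*/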

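0x3: fanotify features: access control (access decision)

An access decision means that when a file is accessed, the monitoring program not only receives the event notification but can also decide whether to allow the operation. Anti-virus software needs this: when you try to open a file that contains a virus, fanotify sends a notification to the anti-virus software acting as listener, which must then both determine whether the file about to be opened contains a virus and block the unsafe operation.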

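When an app needs to open a file and that file is being monitored by an AV program, the open operation triggers a fanotify notification. Before the VFS lets open return, fanotify first asks the AV program. If the AV program allows the access, the app's open call succeeds; otherwise it fails. This prevents applications from opening infected files.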
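The allow-or-deny answer is sent back by writing a small response structure to the fanotify descriptor. The sketch below assumes a listener that has already read a permission event; the helper names `make_permission_response` and `answer_permission_event` are hypothetical, while `struct fanotify_response`, `FAN_ALLOW` and `FAN_DENY` come from `<sys/fanotify.h>`.

```c
#include <sys/fanotify.h>
#include <unistd.h>

/* Build the verdict for the event whose metadata carried `event_fd`. */
struct fanotify_response make_permission_response(int event_fd, int allow)
{
    struct fanotify_response resp = {
        .fd       = event_fd,
        .response = allow ? FAN_ALLOW : FAN_DENY,
    };
    return resp;
}

/*
 * Hypothetical helper: answers a pending FAN_OPEN_PERM / FAN_ACCESS_PERM
 * event. `fan_fd` is the descriptor obtained from fanotify_init().
 * Returns 0 on success, -1 if the write failed.
 */
int answer_permission_event(int fan_fd, int event_fd, int allow)
{
    struct fanotify_response resp = make_permission_response(event_fd, allow);
    ssize_t n = write(fan_fd, &resp, sizeof resp);

    return n == (ssize_t)sizeof resp ? 0 : -1;
}
```

Until the listener answers, the application's open() stays blocked, so a real scanner would want to answer promptly and close `event_fd` afterwards.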

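0x4: fanotify features: listener groups

fanotify allows several listeners to monitor the same file system object at once. For example, anti-virus software V and desktop search software S may both monitor the directory /myDocument. When the file /myDocument/test is opened, fanotify notifies both V and S, in the order set by the listener group policy.

Consider hierarchical storage managers (HSM): the file system may hold only a stub file while the real content lives on a lower storage tier. When the stub is opened, fanotify should notify the HSM first so that it can load the real content into the stub, and only then notify the anti-virus software so that it scans the real content. Otherwise the anti-virus software might scan only the stub and the HSM might bring the virus in afterwards.
fanotify divides all listeners into three groups, in decreasing order of priority:

1. FAN_CLASS_PRE_CONTENT:
Listeners initialized with FAN_CLASS_PRE_CONTENT have the highest priority and are notified first. This class is meant for applications such as HSMs that need access to the file before other applications use its content
2. FAN_CLASS_CONTENT:
Next comes FAN_CLASS_CONTENT, which suits software such as anti-virus scanners that need to inspect file content
3. FAN_CLASS_NOTIF:
Processes using FAN_CLASS_NOTIF are notified last. This class is for pure notification software that does not need to access file content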

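0x5: fanotify features: listener PID

A process that monitors with inotify also triggers notifications when it operates on the monitored file itself. This sometimes causes problems, such as self-inflicted endless recursion of events:

/* watch the file /home/lm/loop */
inotify_add_watch(fd, "/home/lm/loop", IN_MODIFY | IN_OPEN | IN_CREATE | IN_DELETE);

for (;;)
{
    readInotifyEvent();
    if (event->mask & IN_OPEN)
        check_what_changed(event); /* check what was changed */
}

void check_what_changed(event)
{
    fd = open(event->name, O_RDWR); /* triggers another inotify notification */
    read(fd, buf, 128);
}
/* check_what_changed() has to call open() to see whether the file content changed, and that open() triggers an inotify notification too, so the code loops forever */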

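fanotify includes the PID of the process that triggered the event in each notification, so the problem above is easy to solve:

1. In check_what_changed, check the PID that caused the notification. If it is the monitoring program itself, ignore the notification and do not open the file again, which breaks the loop
2. In fact, a fanotify notification carries an open fd for the monitored file system object. The application can operate on the file directly through this fd without causing new notifications, which avoids the recursive events altogether. This is another improvement of fanotify over inotify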
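The PID check described above can be sketched as follows. The helper names `is_self_event` and `should_inspect` are hypothetical; `struct fanotify_event_metadata` and `FAN_NOFD` come from `<sys/fanotify.h>`.

```c
#include <stdbool.h>
#include <sys/fanotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: true if the event was caused by the calling process. */
bool is_self_event(const struct fanotify_event_metadata *ev)
{
    return (pid_t)ev->pid == getpid();
}

/* Returns true if the listener should inspect the file behind ev->fd. */
bool should_inspect(const struct fanotify_event_metadata *ev)
{
    if (ev->fd == FAN_NOFD)      /* no file attached, e.g. queue overflow */
        return false;
    return !is_self_event(ev);   /* skip events we triggered ourselves */
}
```

When `should_inspect()` returns true, the listener can read the content through `ev->fd` directly instead of calling open() on a path, and should close that descriptor when done.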

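0x6: fanotify features: decision cache

Anti-virus software has to scan every file that is about to be accessed, which hurts the user experience considerably. If a file is used frequently and has not been modified, it is best to scan it only on first access and skip the scan afterwards. This works like a cache: scanned files enter the cache, and when the same file is accessed again and is found in the cache, its content does not need to be scanned again.
fanotify supports this kind of cache, called ignore marks. The principle is simple: once an ignore mark is set on a file system object, later accesses to it no longer trigger the access control code, so access is always allowed.
Anti-virus software can use this feature as follows: when an application opens file A for the first time, fanotify notifies the anti-virus software (AV) to scan the file content. If the AV finds no virus, it allows this access and also sets an ignore mark on the file.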

When file A is accessed again, fanotify finds the ignore mark in the cache, so it no longer asks the AV software for an access decision and allows the request directly

When the file content is modified, fanotify clears the ignore mark automatically. By default the number of ignore marks is limited, but the user can lift the limit through an init flag
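Setting an ignore mark after a clean scan might look like the sketch below. The helper name `cache_clean_verdict` is hypothetical, and choosing FAN_OPEN_PERM as the ignored event is an assumption about what the scanner subscribed to. `fanotify_mark()`, `FAN_MARK_ADD` and `FAN_MARK_IGNORED_MASK` come from `<sys/fanotify.h>`.

```c
#include <stddef.h>
#include <sys/fanotify.h>

/*
 * Hypothetical helper: after a clean scan, stop asking about opens of the
 * file behind `event_fd` (the fd delivered with the fanotify event).
 * FAN_MARK_IGNORED_SURV_MODIFY is deliberately not set, so the kernel drops
 * the ignore mask again when the file is modified.
 * Returns 0 on success, -1 with errno set on failure.
 */
int cache_clean_verdict(int fan_fd, int event_fd)
{
    return fanotify_mark(fan_fd,
                         FAN_MARK_ADD | FAN_MARK_IGNORED_MASK,
                         FAN_OPEN_PERM,
                         event_fd,   /* with a NULL path, this fd names the object */
                         NULL);
}
```

A scanner would call this right after answering the permission event with an allow verdict, while `event_fd` is still open.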

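0x7: Drawbacks of fanotify

1. fanotify currently supports far fewer file system event types than inotify
Compared with inotify, fanotify supports far fewer file system events. In particular it does not support move, which makes it unusable for applications such as desktop search or real-time remote file system synchronization. When a file is moved from one directory to another, or renamed, fanotify produces no notification, so some applications that use inotify cannot migrate to fanotify
2. Like inotify, fanotify currently cannot monitor a sub-tree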

Relevant Link:

https://www.ibm.com/developerworks/cn/linux/l-cn-fanotify/
http://www.lanedo.com/filesystem-monitoring-linux-kernel/
http://www.lanedo.com/users/amorgado/fanotify/fanotify-example-access-control.c
http://www.lanedo.com/users/amorgado/fanotify/fanotify-example-mount.c
http://www.lanedo.com/users/amorgado/fanotify/fanotify-example.c

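5. inotify

inotify is a Linux kernel subsystem that extends the file system: it monitors the file system and reports changes to applications. It replaced the older dnotify module, which offered similar functionality.
inotify is mainly used for:

1. Desktop search software such as Beagle, which can re-index only the files that changed instead of inefficiently scanning the whole file system every few minutes. Because the operating system reports file changes instead of the application polling for them, software like Beagle can update its index within a second of a file changing
2. Refreshing directory views
3. Reloading configuration files
4. Tracking changes
5. Backup
6. Synchronization, uploads, and many other automated workflows

inotify has many advantages over the older dnotify module it replaced. With dnotify, a program had to open a file descriptor for every monitored directory, which could easily push a process toward the system's file descriptor limit and become a bottleneck. The file descriptors held by dnotify also kept the underlying devices busy, so removable devices could not be unmounted, which was a nuisance.
dnotify's coarse granularity is another weakness: it only lets programmers monitor changes at directory level, so they had to write extra code themselves to track finer-grained file system events.
inotify uses fewer file descriptors and works with the select() and poll() interfaces, which is better than the signal-based mechanism dnotify uses. This also makes inotify easier to integrate with existing select()- or poll()-based libraries such as GLib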

0x1: inotify监控事件类型

. IN_ACCESS: File was accessed (e.g., read(), execve()).
. IN_ATTRIB: Metadata changed—for example,
) permissions (e.g.,chmod())
) timestamps (e.g., utimensat())
) extended attributes (setxattr())
) link count (since Linux 2.6.25; e.g., for the target of link() and for unlink())
) user/group ID (e.g., chown()).
. IN_CLOSE_WRITE: File opened for writing was closed.
. IN_CLOSE_NOWRITE: File or directory not opened for writing was closed.
. IN_CREATE: File/directory created in watched directory, e.g.
) open() O_CREAT
) mkdir()
) link()
) symlink()
) bind() on a UNIX domain socket
. IN_DELETE: File/directory deleted from watched directory.
. IN_DELETE_SELF:
Watched file/directory was itself deleted. (This event also occurs if an object is moved to another filesystem, since mv() in effect copies the file to the other filesystem and then deletes it from the original filesystem.) In addition, an IN_IGNORED event will subsequently be generated for the watch descriptor.
. IN_MODIFY: File was modified (e.g., write(), truncate()).
. IN_MOVE_SELF: Watched file/directory was itself moved.
. IN_MOVED_FROM: Generated for the directory containing the old filename when a file is renamed.
. IN_MOVED_TO: Generated for the directory containing the new filename when a file is renamed.
. IN_OPEN: File or directory was opened.
//IN_ALL_EVENTS: macro is defined as a bit mask of all of the above events.
. IN_MOVE: Equates to IN_MOVED_FROM | IN_MOVED_TO.
. IN_CLOSE: Equates to IN_CLOSE_WRITE | IN_CLOSE_NOWRITE.
. IN_DONT_FOLLOW: Don't dereference pathname if it is a symbolic link.
. IN_EXCL_UNLINK: events are not generated for children after they have been unlinked from the watched directory.
. IN_MASK_ADD: If a watch instance already exists for the filesystem object corresponding to pathname, add (OR) the events in mask to the watch mask (instead of replacing the mask)
. IN_ONESHOT: Monitor the filesystem object corresponding to pathname for one event, then remove from watch list.
. IN_ONLYDIR: Only watch pathname if it is a directory
. IN_IGNORED: Watch was removed explicitly (inotify_rm_watch()) or automatically (file was deleted, or filesystem was unmounted)
. IN_ISDIR: Subject of this event is a directory.
. IN_Q_OVERFLOW: Event queue overflowed
. IN_UNMOUNT: Filesystem containing watched object was unmounted. In addition, an IN_IGNORED event will subsequently be generated for the watch descriptor.

0x2: Examples

. Suppose an application is watching the directory dir and the file dir/myfile for all events.  The examples below show some events that will be generated for these two objects.
fd = open("dir/myfile", O_RDWR);
Generates IN_OPEN events for both dir and dir/myfile.
read(fd, buf, count);
Generates IN_ACCESS events for both dir and dir/myfile.
write(fd, buf, count);
Generates IN_MODIFY events for both dir and dir/myfile.
fchmod(fd, mode);
Generates IN_ATTRIB events for both dir and dir/myfile.
close(fd);
Generates IN_CLOSE_WRITE events for both dir and dir/myfile.

. Suppose an application is watching the directories dir1 and dir2, and the file dir1/myfile. The following examples show some events that may be generated.
link("dir1/myfile", "dir2/new");
Generates an IN_ATTRIB event for myfile and an IN_CREATE event for dir2.
rename("dir1/myfile", "dir2/myfile");
Generates an IN_MOVED_FROM event for dir1, an IN_MOVED_TO event for dir2, and an IN_MOVE_SELF event for myfile. The IN_MOVED_FROM and IN_MOVED_TO events will have the same cookie value.

. Suppose that dir1/xx and dir2/yy are (the only) links to the same file, and an application is watching dir1, dir2, dir1/xx, and dir2/yy. Executing the following calls in the order given below will generate the following events:
unlink("dir2/yy");
Generates an IN_ATTRIB event for xx (because its link count changes) and an IN_DELETE event for dir2.
unlink("dir1/xx");
Generates IN_ATTRIB, IN_DELETE_SELF, and IN_IGNORED events for xx, and an IN_DELETE event for dir1.

. Suppose an application is watching the directory dir and (the empty) directory dir/subdir. The following examples show some events that may be generated.
mkdir("dir/new", mode);
Generates an IN_CREATE | IN_ISDIR event for dir.
rmdir("dir/subdir");
Generates IN_DELETE_SELF and IN_IGNORED events for subdir, and an IN_DELETE | IN_ISDIR event for dir.
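The rename example above relies on the shared cookie to tie the two halves of a move together. A minimal sketch of that check follows; the helper name `is_rename_pair` is hypothetical.

```c
#include <stdbool.h>
#include <sys/inotify.h>

/*
 * Hypothetical helper: true if `from` and `to` are the two halves of one
 * rename, i.e. an IN_MOVED_FROM and an IN_MOVED_TO carrying the same
 * non-zero cookie.
 */
bool is_rename_pair(const struct inotify_event *from,
                    const struct inotify_event *to)
{
    return (from->mask & IN_MOVED_FROM) &&
           (to->mask & IN_MOVED_TO) &&
           from->cookie != 0 &&
           from->cookie == to->cookie;
}
```

An application that sees an IN_MOVED_FROM with no matching IN_MOVED_TO can treat it as a move out of the watched set, for example into a directory it is not watching.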

0x3: Configuration interface: /proc interfaces

The following interfaces can be used to limit the amount of kernel memory consumed by inotify:

. /proc/sys/fs/inotify/max_queued_events
The value in this file is used when an application calls inotify_init() to set an upper limit on the number of events that can be queued to the corresponding inotify instance.
Events in excess of this limit are dropped, but an IN_Q_OVERFLOW event is always generated.
. /proc/sys/fs/inotify/max_user_instances
This specifies an upper limit on the number of inotify instances that can be created per real user ID.
. /proc/sys/fs/inotify/max_user_watches
This specifies an upper limit on the number of watches that can be created per real user ID.
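//Note in particular that inotify is limited in how much it can watch. For inotify, every directory is one "watch", and both Linux and Windows cap the number of watches because they consume memory. In theory, therefore, inotify cannot guarantee 100% coverage of all directories, unless file system changes are monitored from kernel mode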

0x4: Limitations and caveats

. The inotify API provides no information about the user or process that triggered the inotify event.  In particular, there is no easy way for a process that is monitoring events via inotify to distinguish events that it triggers itself from those that are triggered by other processes.

. Inotify reports only events that a user-space program triggers through the filesystem API.  As a result, it does not catch remote events that occur on network filesystems.  (Applications must fall back to polling the filesystem to catch such events.)  Furthermore, various pseudo-filesystems such as /proc, /sys, and /dev/pts are not monitorable with inotify.

. The inotify API does not report file accesses and modifications that may occur because of mmap(), msync(), and munmap().

. The inotify API identifies affected files by filename.  However, by the time an application processes an inotify event, the filename may already have been deleted or renamed. Any host-based file change monitor runs into this problem; one possible solution is to block the deletion.

. The inotify API identifies events via watch descriptors.  It is the application's responsibility to cache a mapping (if one is needed) between watch descriptors and pathnames.  Be aware that directory renamings may affect multiple cached pathnames.

. Inotify monitoring of directories is not recursive: to monitor subdirectories under a directory, additional watches must be created. This can take a significant amount of time for large directory trees.

. If monitoring an entire directory subtree, and a new subdirectory is created in that tree or an existing directory is renamed into that tree, be aware that by the time you create a watch for the new subdirectory, new files (and subdirectories) may already exist inside the subdirectory.  Therefore, you may want to scan the contents of the subdirectory immediately after adding the watch (and, if desired, recursively add watches for any subdirectories that it contains).

. Note that the event queue can overflow.  In this case, events are lost.  Robust applications should handle the possibility of lost events gracefully.  For example, it may be necessary to rebuild part or all of the application cache.  (One simple, but possibly expensive, approach is to close the inotify file descriptor, empty the cache, create a new inotify file descriptor, and then re-create watches and cache entries for the objects to be monitored.)
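The caveat about caching a mapping between watch descriptors and pathnames can be sketched as a small table. Everything here (`wd_cache`, the fixed sizes, the function names) is hypothetical; a real application would more likely use a hash map and dynamic strings.

```c
#include <stdio.h>
#include <string.h>

#define MAX_WATCHES  64
#define MAX_PATH_LEN 256

/* Fixed-size table mapping watch descriptors to the paths they were added for. */
struct wd_cache {
    int  wd[MAX_WATCHES];
    char path[MAX_WATCHES][MAX_PATH_LEN];
    int  count;
};

/* Remember that `wd` (as returned by inotify_add_watch) refers to `path`. */
int wd_cache_add(struct wd_cache *c, int wd, const char *path)
{
    if (c->count >= MAX_WATCHES || strlen(path) >= MAX_PATH_LEN)
        return -1;
    c->wd[c->count] = wd;
    strcpy(c->path[c->count], path);
    c->count++;
    return 0;
}

/* Return the cached path for `wd`, or NULL if it is unknown. */
const char *wd_cache_lookup(const struct wd_cache *c, int wd)
{
    for (int i = 0; i < c->count; i++)
        if (c->wd[i] == wd)
            return c->path[i];
    return NULL;
}

/*
 * A directory rename affects every cached path at or below it: rewrite the
 * `old_prefix` of each matching entry to `new_prefix`. Returns the number
 * of entries updated.
 */
int wd_cache_rename(struct wd_cache *c, const char *old_prefix,
                    const char *new_prefix)
{
    size_t olen = strlen(old_prefix);
    int updated = 0;

    for (int i = 0; i < c->count; i++) {
        char *p = c->path[i];
        char tmp[MAX_PATH_LEN];

        /* Match the directory itself or anything strictly below it. */
        if (strncmp(p, old_prefix, olen) != 0 ||
            (p[olen] != '\0' && p[olen] != '/'))
            continue;

        int n = snprintf(tmp, sizeof tmp, "%s%s", new_prefix, p + olen);
        if (n < 0 || (size_t)n >= sizeof tmp)
            continue;   /* new path would not fit: leave the entry untouched */
        strcpy(p, tmp);
        updated++;
    }
    return updated;
}
```

After a queue overflow the simplest recovery is to discard the whole table and rebuild it while re-creating the watches.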

0x5: Kernel implementation

In the kernel, each inotify instance corresponds to an inotify_device structure
/source/fs/notify/inotify/inotify_user.c

struct inotify_device
{
    /*
    wait queue for i/o
    wq is the wait queue; processes blocked in read() wait on it
    */
    wait_queue_head_t wq;
    struct mutex ev_mutex; /* protects event queue */
    struct mutex up_mutex; /* synchronizes watch updates */
    /*
    list of queued events
    events is the list of events that occurred on this inotify instance; every event watched by the instance is inserted into it when it occurs
    */
    struct list_head events;
    /*
    user who opened this dev
    user describes the user who created this inotify instance
    */
    struct user_struct *user;
    struct inotify_handle *ih; /* inotify handle */
    struct fasync_struct *fa; /* async notification */
    /*
    reference count
    count is the reference count
    */
    atomic_t count;
    /*
    size of the queue (bytes)
    queue_size is the size in bytes of this instance's event queue
    */
    unsigned int queue_size;
    /*
    number of pending events
    event_count is the number of events on the events list
    */
    unsigned int event_count;
    /*
    maximum number of events
    max_events is the maximum number of events allowed
    */
    unsigned int max_events;
};

Each watch corresponds to an inotify_watch structure
/source/linux/include/linux/inotify.h

struct inotify_watch
{
struct list_head h_list; /* entry in inotify_handle's list */
struct list_head i_list; /* entry in inode's list */
atomic_t count; /* reference count */
struct inotify_handle *ih; /* associated inotify handle */
struct inode *inode; /* associated inode */
__s32 wd; /* watch descriptor */
__u32 mask; /* event mask for this watch */
};

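The inotify_device structure is created when user space calls inotify_init() and released when the file descriptor returned by inotify_init() is closed.
Every directory or file corresponds to an inode structure in the kernel, and inotify adds two fields to it: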

struct inode
{
    ...
#ifdef CONFIG_INOTIFY
    /*
    watches on this inode
    inotify_watches is the list of watches on the monitored target. Each time the user calls inotify_add_watch(), the kernel creates an inotify_watch structure for the new watch and inserts it into the inotify_watches list of the target's inode
    */
    struct list_head inotify_watches;
    /*
    protects the watches list
    inotify_mutex synchronizes access to the inotify_watches list
    */
    struct mutex inotify_mutex;
#endif
    ...
}

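The key point about inotify's architecture is that file change monitoring needs support from both the kernel and the user-space application. The Linux kernel supports change notification natively at the file system layer: the inotify notification calls are inserted inline into the code paths of all file system operations.
When one of the monitored events occurs in the file system, the corresponding file system code explicitly calls fsnotify_* to report the event to inotify, where * stands for the event name. The current implementation includes:

1. fsnotify_move: a file is moved from one directory to another
2. fsnotify_nameremove: a file is removed from a directory
3. fsnotify_inoderemove: the watched object itself is deleted
4. fsnotify_create: a new file is created
5. fsnotify_mkdir: a new directory is created
6. fsnotify_access: a file is read
7. fsnotify_modify: a file is written
8. fsnotify_open: a file is opened
9. fsnotify_close: a file is closed
10. fsnotify_xattr: a file's extended attributes are modified
11. fsnotify_change: a file or its metadata is modified
12. inotify_unmount_inodes: the exception; it is called when a file system is unmounted, to report the umount event to inotify

All of the functions above end up calling inotify_inode_queue_event (inotify_unmount_inodes calls inotify_dev_queue_event directly)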
/source/fs/notify/inotify/inotify.c

/**
* inotify_inode_queue_event - queue an event to all watches on this inode
* @inode: inode event is originating from
* @mask: event mask describing this event
* @cookie: cookie for synchronization, or zero
* @name: filename, if any
* @n_inode: inode associated with name
*/
void inotify_inode_queue_event(struct inode *inode, u32 mask, u32 cookie, const char *name, struct inode *n_inode)
{
struct inotify_watch *watch, *next;

//check whether the inode is watched at all, by testing whether its inotify_watches list is empty
if (!inotify_inode_watched(inode))
return;

mutex_lock(&inode->inotify_mutex);
//walk the inotify_watches list of this inode to see whether any watch is interested in the current file operation event
list_for_each_entry_safe(watch, next, &inode->inotify_watches, i_list)
{
u32 watch_mask = watch->mask;
if (watch_mask & mask)
{
struct inotify_handle *ih= watch->ih;
mutex_lock(&ih->mutex);
if (watch_mask & IN_ONESHOT)
remove_watch_no_event(watch, ih);
ih->in_ops->handle_event(watch, watch->wd, mask, cookie, name, n_inode);
mutex_unlock(&ih->mutex);
}
}
mutex_unlock(&inode->inotify_mutex);
}
EXPORT_SYMBOL_GPL(inotify_inode_queue_event);

inotify delivers event notifications through a chain of groups, and all watches are attached to a group
/source/include/linux/fsnotify_backend.h

/*
* A group is a "thing" that wants to receive notification about filesystem
* events. The mask holds the subset of event types this group cares about.
* refcnt on a group is up to the implementor and at any moment if it goes 0
* everything will be cleaned up.
*/
struct fsnotify_group
{
/*
* global list of all groups receiving events from fsnotify.
* anchored by fsnotify_groups and protected by either fsnotify_grp_mutex
* or fsnotify_grp_srcu depending on write vs read.
*/
struct list_head group_list;

/*
* Defines all of the event types in which this group is interested.
* This mask is a bitwise OR of the FS_* events from above. Each time
* this mask changes for a group (if it changes) the correct functions
* must be called to update the global structures which indicate global
* interest in event types.
*/
__u32 mask;

/*
* How the refcnt is used is up to each group. When the refcnt hits 0
* fsnotify will clean up all of the resources associated with this group.
* As an example, the dnotify group will always have a refcnt=1 and that
* will never change. Inotify, on the other hand, has a group per
* inotify_init() and the refcnt will hit 0 only when that fd has been
* closed.
*/
atomic_t refcnt; /* things with interest in this group */
unsigned int group_num; /* simply prevents accidental group collision */

/*
how this group handles things
this is the member we care about most
*/
const struct fsnotify_ops *ops;

/* needed to send notification to userspace */
struct mutex notification_mutex; /* protect the notification_list */
struct list_head notification_list; /* list of event_holder this group needs to send to userspace */
wait_queue_head_t notification_waitq; /* read() on the notification file blocks on this waitq */
unsigned int q_len; /* events on the queue */
unsigned int max_events; /* maximum events allowed on the list */

/* stores all fastapth entries assoc with this group so they can be cleaned on unregister */
spinlock_t mark_lock; /* protect mark_entries list */
atomic_t num_marks; /* 1 for each mark entry and 1 for not being
* past the point of no return when freeing
* a group */
struct list_head mark_entries; /* all inode mark entries for this group */

/* prevents double list_del of group_list. protected by global fsnotify_grp_mutex */
bool on_group_list;

/* groups can define private fields here or use the void *private */
union {
void *private;
#ifdef CONFIG_INOTIFY_USER
struct inotify_group_private_data {
spinlock_t idr_lock;
struct idr idr;
u32 last_wd;
struct fasync_struct *fa; /* async notification */
struct user_struct *user;
} inotify_data;
#endif
};
};

The member to focus on is const struct fsnotify_ops *ops;

/*
* Each group much define these ops. The fsnotify infrastructure will call
* these operations for each relevant group.
*
* should_send_event - given a group, inode, and mask this function determines
* if the group is interested in this event.
* handle_event - main call for a group to handle an fs event
* free_group_priv - called when a group refcnt hits 0 to clean up the private union
* freeing-mark - this means that a mark has been flagged to die when everything
* finishes using it. The function is supplied with what must be a
* valid group and inode to use to clean up.
*/
struct fsnotify_ops
{
bool (*should_send_event)(struct fsnotify_group *group, struct inode *inode, __u32 mask);
int (*handle_event)(struct fsnotify_group *group, struct fsnotify_event *event);
void (*free_group_priv)(struct fsnotify_group *group);
void (*freeing_mark)(struct fsnotify_mark_entry *entry, struct fsnotify_group *group);
void (*free_event_priv)(struct fsnotify_event_private_data *priv);
};

0x6: How IN_CLOSE_WRITE event monitoring is implemented in the kernel

/source/fs/open.c

/*
* Careful here! We test whether the file pointer is NULL before
* releasing the fd. This ensures that one clone task can't release
* an fd while another clone is opening it.
*/
SYSCALL_DEFINE1(close, unsigned int, fd)
{
struct file * filp;
struct files_struct *files = current->files;
struct fdtable *fdt;
int retval;

spin_lock(&files->file_lock);
/*
get a pointer to the struct fdtable
\linux-2.6.32.63\include\linux\fdtable.h
#define files_fdtable(files) (rcu_dereference((files)->fdt))
*/
fdt = files_fdtable(files);
if (fd >= fdt->max_fds)
{
goto out_unlock;
}
//look up the struct file for the descriptor being closed
filp = fdt->fd[fd];
if (!filp)
{
goto out_unlock;
}
/*
set the corresponding element of fd_array[] to NULL
*/
rcu_assign_pointer(fdt->fd[fd], NULL);
FD_CLR(fd, fdt->close_on_exec);
/*
call __put_unused_fd to recycle the current fd, so that the next file to be opened can reuse it
static void __put_unused_fd(struct files_struct *files, unsigned int fd)
{
struct fdtable *fdt = files_fdtable(files);
__FD_CLR(fd, fdt->open_fds);
if (fd < files->next_fd)
{
files->next_fd = fd;
}
}
*/
__put_unused_fd(files, fd);
spin_unlock(&files->file_lock);
retval = filp_close(filp, files);

/* can't restart close syscall because file table entry was cleared */
if (unlikely(retval == -ERESTARTSYS || retval == -ERESTARTNOINTR || retval == -ERESTARTNOHAND || retval == -ERESTART_RESTARTBLOCK))
{
retval = -EINTR;
}

return retval;

out_unlock:
spin_unlock(&files->file_lock);
return -EBADF;
}
EXPORT_SYMBOL(sys_close);

retval = filp_close(filp, files);

/*
* "id" is the POSIX thread ID. We use the
* files pointer for this..
*/
int filp_close(struct file *filp, fl_owner_t id)
{
int retval = 0;

if (!file_count(filp))
{
printk(KERN_ERR "VFS: Close: file count is 0\n");
return 0;
}

if (filp->f_op && filp->f_op->flush)
{
retval = filp->f_op->flush(filp, id);
}

dnotify_flush(filp, id);
locks_remove_posix(filp, id);
fput(filp);
return retval;
}
EXPORT_SYMBOL(filp_close);

fput(filp);
/source/fs/file_table.c

void fput(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count))
__fput(file);
}
EXPORT_SYMBOL(fput);

/* __fput is called from task context when aio completion releases the last
* last use of a struct file *. Do not use otherwise.
*/
void __fput(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = dentry->d_inode;

might_sleep();

//inotify kernel notification point
fsnotify_close(file);
/*
* The function eventpoll_release() should be the first called
* in the file cleanup chain.
*/
eventpoll_release(file);
locks_remove_flock(file);

if (unlikely(file->f_flags & FASYNC)) {
if (file->f_op && file->f_op->fasync)
file->f_op->fasync(-1, file, 0);
}
if (file->f_op && file->f_op->release)
file->f_op->release(inode, file);
//LSM hook point
security_file_free(file);
ima_file_free(file);
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
cdev_put(inode->i_cdev);
fops_put(file->f_op);
put_pid(file->f_owner.pid);
file_kill(file);
if (file->f_mode & FMODE_WRITE)
drop_file_write_access(file);
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
file_free(file);
dput(dentry);
mntput(mnt);
}

fsnotify_close(file);
\linux-2.6.32.63\include\linux\fsnotify.h

/*
* fsnotify_close - file was closed
*/
static inline void fsnotify_close(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
fmode_t mode = file->f_mode;
//choose the event according to whether the file was open for writing
__u32 mask = (mode & FMODE_WRITE) ? FS_CLOSE_WRITE : FS_CLOSE_NOWRITE;

if (S_ISDIR(inode->i_mode))
mask |= FS_IN_ISDIR;

inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL, 0);
}

Relevant Link:

http://www.ibm.com/developerworks/cn/linux/l-inotifynew/

6. code example

#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Read all available inotify events from the file descriptor 'fd'.
   wd is the table of watch descriptors for the directories in argv.
   argc is the length of wd and argv.
   argv is the list of watched directories.
   Entry 0 of wd and argv is unused. */

static void
handle_events(int fd, int *wd, int argc, char* argv[])
{
    /* Some systems cannot read integer variables if they are not
       properly aligned. On other systems, incorrect alignment may
       decrease performance. Hence, the buffer used for reading from
       the inotify file descriptor should have the same alignment as
       struct inotify_event. */

    char buf[4096]
        __attribute__ ((aligned(__alignof__(struct inotify_event))));
    const struct inotify_event *event;
    int i;
    ssize_t len;
    char *ptr;

    /* Loop while events can be read from inotify file descriptor. */

    for (;;) {

        /* Read some events. */

        len = read(fd, buf, sizeof buf);
        if (len == -1 && errno != EAGAIN) {
            perror("read");
            exit(EXIT_FAILURE);
        }

        /* If the nonblocking read() found no events to read, then
           it returns -1 with errno set to EAGAIN. In that case,
           we exit the loop. */

        if (len <= 0)
            break;

        /* Loop over all events in the buffer */

        for (ptr = buf; ptr < buf + len;
                ptr += sizeof(struct inotify_event) + event->len) {

            event = (const struct inotify_event *) ptr;

            /* Print event type */

            if (event->mask & IN_OPEN)
                printf("IN_OPEN: ");
            if (event->mask & IN_CLOSE_NOWRITE)
                printf("IN_CLOSE_NOWRITE: ");
            if (event->mask & IN_CLOSE_WRITE)
                printf("IN_CLOSE_WRITE: ");

            /* Print the name of the watched directory */

            for (i = 1; i < argc; ++i) {
                if (wd[i] == event->wd) {
                    printf("%s/", argv[i]);
                    break;
                }
            }

            /* Print the name of the file */

            if (event->len)
                printf("%s", event->name);

            /* Print type of filesystem object */

            if (event->mask & IN_ISDIR)
                printf(" [directory]\n");
            else
                printf(" [file]\n");
        }
    }
}

int
main(int argc, char* argv[])
{
    char buf;
    int fd, i, poll_num;
    int *wd;
    nfds_t nfds;
    struct pollfd fds[2];

    if (argc < 2) {
        printf("Usage: %s PATH [PATH ...]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    printf("Press ENTER key to terminate.\n");

    /* Create the file descriptor for accessing the inotify API */

    fd = inotify_init1(IN_NONBLOCK);
    if (fd == -1) {
        perror("inotify_init1");
        exit(EXIT_FAILURE);
    }

    /* Allocate memory for watch descriptors */

    wd = calloc(argc, sizeof(int));
    if (wd == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    /* Mark directories for events
       - file was opened
       - file was closed */

    for (i = 1; i < argc; i++) {
        wd[i] = inotify_add_watch(fd, argv[i],
                                  IN_OPEN | IN_CLOSE);
        if (wd[i] == -1) {
            fprintf(stderr, "Cannot watch '%s'\n", argv[i]);
            perror("inotify_add_watch");
            exit(EXIT_FAILURE);
        }
    }

    /* Prepare for polling */

    nfds = 2;

    /* Console input */

    fds[0].fd = STDIN_FILENO;
    fds[0].events = POLLIN;

    /* Inotify input */

    fds[1].fd = fd;
    fds[1].events = POLLIN;

    /* Wait for events and/or terminal input */

    printf("Listening for events.\n");
    while (1) {
        poll_num = poll(fds, nfds, -1);
        if (poll_num == -1) {
            if (errno == EINTR)
                continue;
            perror("poll");
            exit(EXIT_FAILURE);
        }

        if (poll_num > 0) {

            if (fds[0].revents & POLLIN) {

                /* Console input is available. Empty stdin and quit */

                while (read(STDIN_FILENO, &buf, 1) > 0 && buf != '\n')
                    continue;
                break;
            }

            if (fds[1].revents & POLLIN) {

                /* Inotify events are available */

                handle_events(fd, wd, argc, argv);
            }
        }
    }

    printf("Listening for events stopped.\n");

    /* Close inotify file descriptor */

    close(fd);

    free(wd);
    exit(EXIT_SUCCESS);
}

Relevant Link:

http://linux.die.net/man/7/inotify
http://man7.org/linux/man-pages/man7/inotify.7.html
http://www.ibm.com/developerworks/cn/linux/l-inotifynew/

Copyright (c) 2015 LittleHann All rights reserved