I have a web server which saves cache files and keeps them for 7 days. The file names are md5 hashes, i.e. exactly 32 hex characters long, and are kept in a tree structure that looks like this:
00/
    00/
        00000ae9355e59a3d8a314a5470753d8
        .
        .
    01/
    .
    .
You get the idea.
My problem is that deleting old files is taking a really long time. I have a daily cron job that runs
find cache/ -mtime +7 -type f -delete
which takes more than half a day to complete. I worry about scalability and the effect this has on the performance of the server. Additionally, the cache directory is now a black hole in my system, trapping the occasional innocent du or find.
The standard solution to LRU cache is some sort of a heap. Is there a way to scale this to the filesystem level? Is there some other way to implement this in a way which makes it easier to manage?
Here are ideas I considered:
- Create 7 top-level directories, one for each weekday, and empty one directory every day. This multiplies the lookup time for a cache file by 7, makes overwrites really complicated, and I'm not sure what it will do to the deletion time.
- Save the files as blobs in a MySQL table with indexes on name and date. This seemed promising, but in practice it's always been much slower than FS. Maybe I'm not doing it right.
Any ideas?
5 Answers
#1
15
When you store a file, make a symbolic link to a second directory structure that is organized by date, not by name.
Retrieve your files using the "name" structure, delete them using the "date" structure.
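A minimal sketch of this scheme in shell. The directory names, the two-level fan-out, and the YYYY-MM-DD date format are assumptions for illustration, not part of the answer:

```shell
#!/bin/sh
# Store each cache file in the name tree and symlink it into a per-day
# directory; expiry then only has to walk one day's links.
CACHE=cache            # name tree: cache/ab/cd/<md5>
BYDATE=cache-by-date   # date tree: cache-by-date/YYYY-MM-DD/<md5>

store() {              # usage: store <md5>   (content on stdin)
    md5=$1
    d1=$(printf '%s' "$md5" | cut -c1-2)
    d2=$(printf '%s' "$md5" | cut -c3-4)
    day=$(date +%F)
    mkdir -p "$CACHE/$d1/$d2" "$BYDATE/$day"
    cat > "$CACHE/$d1/$d2/$md5"
    # relative target: the day directory sits two levels below the root
    ln -sf "../../$CACHE/$d1/$d2/$md5" "$BYDATE/$day/$md5"
}

expire_day() {         # usage: expire_day <YYYY-MM-DD>
    dir=$BYDATE/$1
    [ -d "$dir" ] || return 0
    for link in "$dir"/*; do
        # delete the real file each link points at, then drop the day dir
        if [ -L "$link" ]; then
            rm -f "$(readlink -f "$link")"
        fi
    done
    rm -rf "$dir"
}
```

A daily cron entry would then call something like `expire_day "$(date -d '8 days ago' +%F)"` (GNU date syntax) instead of scanning the whole tree with find.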
#2
4
Assuming this is ext2/3, have you tried enabling indexed directories? With a large number of files in any one directory, the lookup needed to delete something becomes painfully slow.
Use tune2fs -O dir_index to enable the dir_index feature (note the capital -O; lowercase -o sets default mount options), then run e2fsck -fD so that directories which already exist get indexed too.
When mounting the file system, use the noatime option, which stops the OS from updating access-time information for the directories (it still has to update them when they are modified).
Looking at the original post, you only have 2 levels of indirection to the files, which means the leaf directories can hold a huge number of files. Once they hold more than a million entries, searches and changes become terribly slow. An alternative is a deeper hierarchy of directories, which reduces the number of items in any one directory and therefore the cost of searching and updating it.
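Roughly, the commands involved look like this. The device /dev/sdb1 and mount point /var/cache/web are placeholders; tune2fs should be run on an unmounted (or read-only) filesystem:

```shell
tune2fs -O dir_index /dev/sdb1           # turn on hashed (HTree) directory lookups
e2fsck -fD /dev/sdb1                     # index and optimize directories that already exist
mount -o remount,noatime /var/cache/web  # or add 'noatime' to the entry in /etc/fstab
```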
#3
1
Reiserfs is relatively efficient at handling small files. Have you tried different Linux file systems? I'm not sure about its delete performance, but you can consider reformatting (mkfs) as a substitute for deleting files one by one. For example, you could create a separate file system (cache1, cache2, ...) for each weekday.
#4
1
How about this:
- Have another folder called, say, "ToDelete".
- When you add a new item, get today's date and look for a subfolder of "ToDelete" whose name reflects the current date.
- If it's not there, create it.
- In that day's folder, add a symbolic link to the item you've just created.
- Create a cron job that goes to the "ToDelete" subfolder for the right date and deletes all the files that are linked from it.
- Then delete the folder that contained all the links.
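The cron step above can be sketched like this, assuming the links live under ToDelete/YYYY-MM-DD/ (the layout and the purge_day name are made up for the example):

```shell
#!/bin/sh
# Delete one day's worth of cache: remove each link's target file,
# then the links and the day folder itself.
purge_day() {           # usage: purge_day ToDelete/<date>
    dir=$1
    [ -d "$dir" ] || return 0
    find "$dir" -type l -exec sh -c 'rm -f -- "$(readlink -f "$1")"' _ {} \;
    rm -rf "$dir"
}

# cron would run something like (GNU date syntax):
#   purge_day "ToDelete/$(date -d '7 days ago' +%F)"
```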
#5
0
How about a table in your database that uses the hash as the key? The other field would then be the name of the file. That way the files can be stored in a date-based layout for fast deletion, while the database is used to look up a file's location from its hash quickly.
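A sketch of this idea using the sqlite3 CLI (the table, column, and directory names are made up; the answer doesn't specify a particular database):

```shell
#!/bin/sh
# Files live under a per-day directory for cheap expiry; the table maps
# md5 -> path so lookups stay fast regardless of the on-disk layout.
DB=cache-index.db

cache_init() {
    sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS cache (
        md5  TEXT PRIMARY KEY,
        path TEXT NOT NULL);'
}

cache_put() {           # usage: cache_put <md5>   (content on stdin)
    day=$(date +%F)
    mkdir -p "store/$day"
    cat > "store/$day/$1"
    # md5 is pure hex, so interpolating it into the SQL is safe here
    sqlite3 "$DB" "INSERT OR REPLACE INTO cache VALUES ('$1', 'store/$day/$1');"
}

cache_path() {          # usage: cache_path <md5>  -> prints the stored path
    sqlite3 "$DB" "SELECT path FROM cache WHERE md5 = '$1';"
}
```

Expiry then becomes one ranged DELETE on the table plus an `rm -rf store/<day>`, instead of a full-tree find.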