What is the fastest and most efficient way of storing and fetching images when you have millions of users on a LAMP server?

Time: 2021-03-10 12:52:35

Here is the best method I have come up with so far and I would like to know if there is an even better method (I'm sure there is!) for storing and fetching millions of user images:

In order to keep the directory sizes down and avoid having to make any additional calls to the DB, I am using nested directories that are calculated based on the User's unique ID as follows:

$firstDir = './images';
$secondDir = floor($userID / 100000);
$thirdDir = floor(substr($userID, -5, 5) / 100);
$fourthDir = $userID;
$imgLocation = "$firstDir/$secondDir/$thirdDir/$fourthDir/1.jpg";

User IDs ($userID) range from 1 to the millions.

So if I have User ID 7654321, for example, that user's first pic will be stored in:

./images/76/543/7654321/1.jpg

For User ID 654321:

./images/6/543/654321/1.jpg

For User ID 54321 it would be:

./images/0/543/54321/1.jpg

For User ID 4321 it would be:

./images/0/43/4321/1.jpg

For User ID 321 it would be:

./images/0/3/321/1.jpg

For User ID 21 it would be:

./images/0/0/21/1.jpg

For User ID 1 it would be:

./images/0/0/1/1.jpg

This ensures that with up to 100,000,000 users, I will never have a directory with more than 1,000 sub-directories, so it seems to keep things clean and efficient.
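
As a minimal sketch, the scheme above can be wrapped in a single function. The function name `userImagePath` and its parameters are hypothetical, and it uses integer arithmetic (`intdiv` and `%`) rather than `substr`, which yields the same directories for the IDs listed above:

```php
<?php
// Hypothetical helper wrapping the numeric directory scheme from the question.
function userImagePath(int $userID, int $picNumber = 1, string $baseDir = './images'): string
{
    // Second level: one directory per block of 100,000 user IDs.
    $secondDir = intdiv($userID, 100000);
    // Third level: the last five digits of the ID, grouped in blocks of 100 users.
    $thirdDir = intdiv($userID % 100000, 100);
    // Fourth level: the user's own directory, holding 1.jpg, 2.jpg, ...
    return "$baseDir/$secondDir/$thirdDir/$userID/$picNumber.jpg";
}

echo userImagePath(7654321), PHP_EOL; // ./images/76/543/7654321/1.jpg
echo userImagePath(4321), PHP_EOL;    // ./images/0/43/4321/1.jpg
```

With IDs below 100,000,000 the second and third levels each stay within 0 to 999, and each third-level directory holds at most 100 user directories.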

I benchmarked this method against the following "hash" method, which uses the fastest hash function available in PHP (crc32). This "hash" method calculates the Second Directory as the first 3 characters of the hash of the User ID and the Third Directory as the next 3 characters, in order to distribute the files randomly but evenly, as follows:

$hash = crc32($userID);
$firstDir = './images';
$secondDir = substr($hash,0,3);
$thirdDir = substr($hash,3,3);
$fourthDir = $userID;
$imgLocation = "$firstDir/$secondDir/$thirdDir/$fourthDir/1.jpg";

However, this "hash" method is slower than the method I described earlier above, so it's no good.
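
Speed aside, one detail of the hash variant may be worth checking: `crc32()` returns an integer, so `substr()` works on its decimal digits. That decimal string varies in length, and on 64-bit builds the value is at most 4294967295, so ten-digit results can only start with 1 to 4 and the leading digits are not spread evenly. The following is a sketch only, assuming a zero-padded hexadecimal form is acceptable; the name `hashedImagePath` and the two-character directory width are illustrative choices, not part of the original scheme:

```php
<?php
// Hypothetical variant of the question's hash scheme using zero-padded hex digits.
function hashedImagePath(int $userID, int $picNumber = 1, string $baseDir = './images'): string
{
    // crc32() returns an integer; format it as exactly 8 hexadecimal characters.
    $hash = sprintf('%08x', crc32((string) $userID));
    $secondDir = substr($hash, 0, 2); // 256 possible values: 00..ff
    $thirdDir  = substr($hash, 2, 2); // 256 possible values: 00..ff
    return "$baseDir/$secondDir/$thirdDir/$userID/$picNumber.jpg";
}

echo hashedImagePath(7654321), PHP_EOL;
```

Each level then has a fixed width and at most 256 entries, at the cost of the same hashing overhead measured above.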

I then went one step further and found an even faster method of calculating the Third Directory in my original example (floor(substr($userID, -5, 5) / 100);) as follows:

$thirdDir = floor(substr($userID, -5, 3));

Now, this changes how/where the first 10,000 User IDs are stored, making some third directories have either 1 user sub-directory or 111 instead of 100, but it has the advantage of being faster since we do not have to divide by 100, so I think it is worth it in the long run.
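
To see where the two calculations diverge, a small comparison can help. This is a sketch; the function names `thirdDirOriginal` and `thirdDirShortcut` are hypothetical labels for the two expressions from the question:

```php
<?php
// Third directory as originally defined: last five digits divided by 100.
function thirdDirOriginal(int $userID): int
{
    return (int) floor((int) substr((string) $userID, -5, 5) / 100);
}

// Third directory using the shortcut: first three of the last five characters.
function thirdDirShortcut(int $userID): int
{
    return (int) substr((string) $userID, -5, 3);
}

// Print both results side by side for IDs of different lengths.
foreach ([7654321, 54321, 4321, 321, 21] as $id) {
    printf("%d: %d vs %d\n", $id, thirdDirOriginal($id), thirdDirShortcut($id));
}
```

The two agree for IDs with five or more digits and differ for shorter ones: for 4321 the original gives 43 while the shortcut gives 432, because `substr` starts at the beginning of the string when it is shorter than five characters.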

Once the directory structure is defined, here is how I plan on storing the actual individual images: if a user uploads a 2nd pic, for example, it would go in the same directory as their first pic, but it would be named 2.jpg. The default pic of the user would always just be 1.jpg, so if they decide to make their 2nd pic the default pic, 2.jpg would be renamed to 1.jpg and 1.jpg would be renamed 2.jpg.
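
Swapping two files by renaming needs a temporary name, since renaming 2.jpg to 1.jpg directly would overwrite the current default. A minimal sketch follows; the function name `makeDefaultPic` and the temporary file name `swap.tmp` are assumptions, and the three renames are not atomic as a group, so a concurrent request could briefly find 1.jpg missing:

```php
<?php
// Hypothetical helper: swap picture $n with the default picture (1.jpg) in $userDir.
function makeDefaultPic(string $userDir, int $n): bool
{
    $default = "$userDir/1.jpg";
    $other   = "$userDir/$n.jpg";
    $temp    = "$userDir/swap.tmp"; // assumed never to collide with a real picture name

    // Nothing to do if $n is already the default or either file is missing.
    if ($n === 1 || !is_file($default) || !is_file($other)) {
        return false;
    }

    // Three renames: move the default aside, promote the other, then restore.
    return rename($default, $temp)
        && rename($other, $default)
        && rename($temp, $other);
}
```

Because the URL of 1.jpg stays the same while its content changes, cached copies in browsers or proxies may keep showing the old default for a while.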

Last but not least, if I needed to store multiple sizes of the same image, I would store them as follows for User ID 1 (for example):

1024px:

./images/0/0/1/1024/1.jpg
./images/0/0/1/1024/2.jpg

640px:

./images/0/0/1/640/1.jpg
./images/0/0/1/640/2.jpg
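
The size-specific layout can be sketched as a path builder that only accepts known sizes. The function name `sizedImagePath` and the list of allowed sizes are assumptions taken from the two sizes shown above:

```php
<?php
// Hypothetical helper: path of one picture at one stored size.
function sizedImagePath(int $userID, int $size, int $picNumber = 1, string $baseDir = './images'): string
{
    $allowedSizes = [1024, 640]; // assumed: the two sizes from the example
    if (!in_array($size, $allowedSizes, true)) {
        throw new InvalidArgumentException("Unsupported size: $size");
    }
    // Same numeric directory scheme as for the unsized pictures.
    $secondDir = intdiv($userID, 100000);
    $thirdDir  = intdiv($userID % 100000, 100);
    return "$baseDir/$secondDir/$thirdDir/$userID/$size/$picNumber.jpg";
}

echo sizedImagePath(1, 1024, 1), PHP_EOL; // ./images/0/0/1/1024/1.jpg
echo sizedImagePath(1, 640, 2), PHP_EOL;  // ./images/0/0/1/640/2.jpg
```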

That's about it.

So, are there any flaws with this method? If so, could you please point them out?

Is there a better method? If so, could you please describe it?

Before I embark on implementing this, I want to make sure I have the best, fastest, and most efficient method for storing and retrieving images so that I don't have to change it again.

Thanks!

1 solution

#1


3  

Don't worry about the small speed differences in calculating the path; they don't matter. What matters is how well and uniformly the images are distributed across the directories, how short the generated path is, and how hard it is to deduce the naming convention (let's replace 1.jpg with 2.jpg... wow, it's working...).

For example, in your hash solution the path is based entirely on the user ID, which puts all pictures belonging to one user in the same directory.

Use the whole alphabet (lowercase and uppercase, if your FS supports it), not just digits. Check what other software does; good places to look at hashed directory names are Google Chrome, Mozilla, ... Short directory names are better: they are faster to look up and take up less space in your HTML documents.
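
One possible way to combine these points without a database lookup is to derive the path from a keyed hash, so it cannot be guessed from the user ID alone. This is a sketch under stated assumptions: the function name `obscuredImagePath`, the server-side `$secret`, the two-character directory width, and a case-sensitive file system are all illustrative choices, not something the answer prescribes:

```php
<?php
// Hypothetical helper; $secret is an assumed server-side key kept out of public code.
function obscuredImagePath(int $userID, int $picNumber, string $secret, string $baseDir = './images'): string
{
    // Raw HMAC bytes: the name cannot be derived from the ID without the secret.
    $mac = hash_hmac('sha256', "$userID:$picNumber", $secret, true);
    // URL-safe Base64 uses uppercase and lowercase letters, digits, '-' and '_'.
    $token = rtrim(strtr(base64_encode(substr($mac, 0, 12)), '+/', '-_'), '=');
    // 12 bytes encode to 16 characters: 2 + 2 for directories, 12 for the file name.
    return sprintf(
        '%s/%s/%s/%s.jpg',
        $baseDir,
        substr($token, 0, 2),
        substr($token, 2, 2),
        substr($token, 4)
    );
}

echo obscuredImagePath(7654321, 1, 'example-secret'), PHP_EOL;
```

Pictures of the same user land in different directories, and changing 1 to 2 in a URL no longer reveals another picture. The trade-off is that a user's files can no longer be found by listing a single directory.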
