What is the most efficient way to store nested categories (or hierarchical data) in Mongo?

Date: 2022-09-23 17:48:00

We have nested categories for several products (e.g., Sports -> Basketball -> Men's, Sports -> Tennis -> Women's) and are using Mongo instead of MySQL.

We know how to store nested categories in a SQL database like MySQL, but would appreciate any advice on what to do for Mongo. The operation we need to optimize for is quickly finding all products in one category or subcategory, which could be nested several layers below a root category (e.g., all products in the Men's Basketball category or all products in the Women's Tennis category).

This Mongo doc suggests one approach, but it says it doesn't work well when operations are needed for subtrees, which we need (since categories can reach multiple levels).

Any suggestions on the best way to efficiently store and search nested categories of arbitrary depth?

2 answers

#1


10  

The first thing you want to decide is exactly what kind of tree you will use.

The big thing to consider is your data and access patterns. You have already stated that 90% of all your work will be querying, and by the sound of it (e-commerce), updates will only be run by administrators, most likely rarely.

So you want a schema that lets you query quickly for children through a path, e.g. Sports -> Basketball -> Men's or Sports -> Tennis -> Women's, and that doesn't really need to scale for updates.

As you rightly pointed out, MongoDB does have a good documentation page for this: http://docs.mongodb.org/manual/tutorial/model-tree-structures/ in which 10gen lays out different models and schema methods for trees and describes the main upsides and downsides of each.

The one that should catch the eye if you are looking to query easily is materialised paths: http://docs.mongodb.org/manual/tutorial/model-tree-structures/#model-tree-structures-with-materialized-paths

This is a very interesting way to build trees: to query the example you gave above for "Womens" under "Tennis", you can simply use a prefix regex anchored with ^ (which can use the index: http://docs.mongodb.org/manual/reference/operator/regex/ ) like so:

db.products.find({category: /^Sports,Tennis,Womens(,|$)/})

to find all products listed under a certain path of your tree.
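As a minimal sketch of that prefix match, the plain JavaScript below builds the anchored regex from a list of path segments and runs it against in-memory sample data. The helper name `pathPrefixRegex` and the sample products are hypothetical, not part of the question. The `(,|$)` suffix is an assumption about what is wanted here: it matches the category itself and anything nested below it, but not a sibling whose name merely starts with the same letters.

```javascript
// Hypothetical helper: build an anchored prefix regex for a materialized path.
function pathPrefixRegex(segments) {
  // Escape regex metacharacters so a name such as "R&B (Live)" is matched literally.
  const escaped = segments.map((s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  // "(,|$)" accepts the category itself or a deeper path, but not "WomensJunior".
  return new RegExp("^" + escaped.join(",") + "(,|$)");
}

// In-memory stand-in for the products collection (made-up sample data).
const products = [
  { name: "Racket", category: "Sports,Tennis,Womens" },
  { name: "Skirt", category: "Sports,Tennis,Womens,Clothing" },
  { name: "Jersey", category: "Sports,Basketball,Mens" },
  { name: "Cap", category: "Sports,Tennis,WomensJunior" },
];

const re = pathPrefixRegex(["Sports", "Tennis", "Womens"]);
const matches = products.filter((p) => re.test(p.category)).map((p) => p.name);
console.log(matches);
```

The same regex object could be passed to the shell query as db.products.find({category: re}); escaping matters once category names contain characters such as "." or "(".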

Unfortunately, this model is really bad at updating: if you move a category or change its name you have to update all of its products, and there could be thousands of products under one category.

A better method would be to store a cat_id on the product and move the categories into a separate collection with the schema:

{
    _id: ObjectId(),
    name: 'Women\'s',
    path: 'Sports,Tennis,Womens',
    normed_name: 'all_special_chars_and_spaces_and_case_sensitive_letters_taken_out_like_this'
}

So now a rename or move only touches the categories collection, which should be much smaller and therefore cheaper to update. The exception is deleting a category: the products that reference it will still need touching.
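With the path moved off the product, finding all products in a subtree becomes a two-step lookup. The sketch below illustrates this with in-memory arrays standing in for the two collections; the ids, the sample data, and the function names are all hypothetical, and the shell queries mentioned in the comments are only one way this might be written.

```javascript
// Sample data standing in for the two collections; ids and names are made up.
const categories = [
  { _id: 1, name: "Tennis", path: "Sports,Tennis" },
  { _id: 2, name: "Women's", path: "Sports,Tennis,Womens" },
  { _id: 3, name: "Men's", path: "Sports,Basketball,Mens" },
];
const products = [
  { name: "Racket", cat_id: 2 },
  { name: "Net", cat_id: 1 },
  { name: "Jersey", cat_id: 3 },
];

// Step 1: resolve the subtree to category ids using the path prefix.
// In the shell this might be db.categories.find({path: /^Sports,Tennis(,|$)/}, {_id: 1}).
function categoryIdsUnder(path) {
  return categories
    .filter((c) => c.path === path || c.path.startsWith(path + ","))
    .map((c) => c._id);
}

// Step 2: fetch the products whose cat_id is in that set.
// In the shell this might be db.products.find({cat_id: {$in: ids}}).
function productsUnder(path) {
  const ids = new Set(categoryIdsUnder(path));
  return products.filter((p) => ids.has(p.cat_id)).map((p) => p.name);
}

console.log(productsUnder("Sports,Tennis"));
```

The extra round trip is the price paid for cheap renames; an index on cat_id would be needed for the second step to stay fast.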

So an example of changing "Tennis" to "Badmin":

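// find() returns the cursor to iterate over (update() does not);
// "(,|$)" matches the Tennis category itself as well as everything below it.
db.categories.find({path: /^Sports,Tennis(,|$)/}).forEach(function(doc){
    // Rewrite only the leading "Sports,Tennis" part of the path.
    doc.path = doc.path.replace(/^Sports,Tennis(?=,|$)/, "Sports,Badmin");
    db.categories.updateOne({_id: doc._id}, {$set: {path: doc.path}});
});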

Unfortunately, MongoDB provides no in-query document reflection at the moment, so you do have to pull the documents out client side, which is a little annoying. Hopefully it shouldn't result in too many categories being brought back.
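The client-side part of that rename can be isolated into a small pure function, which makes it easy to check before touching the database. This is only a sketch: `renamePrefix` and the sample documents are hypothetical, and it assumes comma-separated paths with no leading or trailing comma.

```javascript
// Hypothetical helper: return the new path if `path` equals `oldPrefix`
// or lies below it, otherwise null (the document is left alone).
function renamePrefix(path, oldPrefix, newPrefix) {
  if (path === oldPrefix) return newPrefix;
  if (path.startsWith(oldPrefix + ",")) {
    return newPrefix + path.slice(oldPrefix.length);
  }
  return null;
}

// Made-up category documents, including a near-miss that must not change.
const categories = [
  { _id: 1, path: "Sports,Tennis" },
  { _id: 2, path: "Sports,Tennis,Womens" },
  { _id: 3, path: "Sports,TennisShoes" },
];

// Collect one pending update per affected category.
const updates = [];
for (const doc of categories) {
  const next = renamePrefix(doc.path, "Sports,Tennis", "Sports,Badmin");
  if (next !== null) updates.push({ _id: doc._id, path: next });
}
console.log(updates);
```

Each entry in `updates` could then be written back with an updateOne on its _id, or batched together if the number of categories is large.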

And that is basically how it works. It is a bit of a pain to update, but I believe being able to query any path instantly using an index is a better fit for your scenario.

Of course, the added benefit is that this schema is compatible with nested set models (http://en.wikipedia.org/wiki/Nested_set_model), which I have found time and time again to be great for e-commerce sites. For example, Tennis might be under both "Sports" and "Leisure", and you want multiple paths depending on where the user came from.

The schema for materialised paths supports this easily: you just add another path to the category.
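One way to picture a category with more than one path is to store the paths as an array. The sketch below assumes a hypothetical `paths` array field (the schema above only shows a single `path` string) and uses in-memory sample data to show that a prefix lookup then succeeds if any one of the paths matches.

```javascript
// Made-up categories; the `paths` array field is an assumption for illustration.
const categories = [
  { _id: 1, name: "Tennis", paths: ["Sports,Tennis", "Leisure,Tennis"] },
  { _id: 2, name: "Basketball", paths: ["Sports,Basketball"] },
];

// A category is under `prefix` if at least one of its paths equals the
// prefix or continues below it.
function categoriesUnder(prefix) {
  return categories
    .filter((c) => c.paths.some((p) => p === prefix || p.startsWith(prefix + ",")))
    .map((c) => c.name);
}

console.log(categoriesUnder("Leisure"));
console.log(categoriesUnder("Sports"));
```

In MongoDB itself, a regex condition on an array field matches a document when any element of the array matches, so the equivalent prefix query would behave the same way.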

Hope it makes sense, quite a long one there.

#2


4  

If all categories are distinct, then think of them as tags. The hierarchy doesn't need to be encoded in the items because you don't need it when you query for items; the hierarchy is a presentational thing. Tag each item with all the categories in its path, so "Sport > Baseball > Shoes" could be saved as {..., categories: ["sport", "baseball", "shoes"], ...}. If you want all items in the "Sport" category, search for {categories: "sport"}; if you want just the shoes, search for {categories: "shoes"}.
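The tag lookup can be sketched in plain JavaScript with in-memory items. The sample items and the `itemsTagged` helper are hypothetical; the array membership test mirrors how a query such as {categories: "sport"} matches any document whose categories array contains that value.

```javascript
// Made-up items, each tagged with every category on its path.
const items = [
  { name: "Cleats", categories: ["sport", "baseball", "shoes"] },
  { name: "Bat", categories: ["sport", "baseball"] },
  { name: "Racket", categories: ["sport", "tennis"] },
];

// Return the names of all items carrying the given tag.
function itemsTagged(tag) {
  return items.filter((i) => i.categories.includes(tag)).map((i) => i.name);
}

console.log(itemsTagged("sport"));
console.log(itemsTagged("baseball"));
console.log(itemsTagged("shoes"));
```

An index on the categories field would let the database answer each of these lookups without scanning every item.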

This doesn't capture the hierarchy, but if you think about it, that doesn't matter. If the categories are distinct, the hierarchy doesn't help you when you query for items. There will be no other "baseball", so when you search for it you will only get things below the "baseball" level in the hierarchy.

My suggestion relies on categories being distinct, and I guess they aren't in your current model. However, there's no reason why you can't make them distinct. You've probably chosen to use the strings you display on the page as category names in the database. If you instead use symbolic names like "sport" or "womens_shoes" and use a lookup table to find the string to display on the page, you can easily make sure that they are distinct, because they don't have anything to do with what is displayed on the page. (This will also save you hours of work if the name of a category ever changes, and it will make translating the site easier if you ever need to do that.) So if you have two "Shoes" in the hierarchy (for example "Tennis > Women's > Shoes" and "Tennis > Men's > Shoes"), you can just add a qualifier to make them distinct (for example "womens_shoes" and "mens_shoes", or "tennis_womens_shoes"). The symbolic names are arbitrary and can be anything; you could even use numbers and just take the next number in the sequence every time you add a category.
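The lookup table can be as simple as a map from symbolic name to display string. In the sketch below the symbolic names, the labels, and the `displayPath` helper are all hypothetical; the point is only that two categories can share the label "Shoes" while staying distinct in the database.

```javascript
// Hypothetical lookup table: distinct symbolic names, possibly repeated labels.
const labels = {
  tennis: "Tennis",
  tennis_womens: "Women's",
  tennis_mens: "Men's",
  womens_shoes: "Shoes",
  mens_shoes: "Shoes",
};

// Turn an item's symbolic categories into the breadcrumb shown on the page.
// Unknown names fall back to the symbolic name itself.
function displayPath(symbolicNames) {
  return symbolicNames
    .map((n) => (Object.prototype.hasOwnProperty.call(labels, n) ? labels[n] : n))
    .join(" > ");
}

console.log(displayPath(["tennis", "tennis_womens", "womens_shoes"]));
console.log(displayPath(["tennis", "tennis_mens", "mens_shoes"]));
```

Renaming a category on the site then means changing one entry in `labels`, and a translated site would swap in a different table without touching the items.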
