I've heard of a few ways to implement tagging: using a mapping table between TagID and ItemID (makes sense to me, but does it scale?), adding a fixed number of possible TagID columns to the item table (seems like a bad idea), or keeping tags in a comma-separated text column (sounds crazy but could work). I've even heard someone recommend a sparse matrix, but then how do the tag names grow gracefully?
Am I missing a best practice for tags?
6 Answers
#1
Three tables (one for storing all items, one for all tags, and one for the relation between the two), properly indexed, with foreign keys set, running on a proper database, should work well and scale properly.
Table: Item
Columns: ItemID, Title, Content

Table: Tag
Columns: TagID, Title

Table: ItemTag
Columns: ItemID, TagID
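As a rough sketch of what "properly indexed, with foreign keys set" could look like, here is the three-table layout created in an in-memory SQLite database from Python. The column types, the UNIQUE constraint on Tag.Title, the cascade rules, and the index name ItemTag_by_tag are assumptions added for illustration; the answer only names the tables and columns.

```python
import sqlite3


def create_schema(conn):
    """Create the Item / Tag / ItemTag tables with keys and indexes."""
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE Item (
            ItemID  INTEGER PRIMARY KEY,
            Title   TEXT NOT NULL,
            Content TEXT
        );
        CREATE TABLE Tag (
            TagID INTEGER PRIMARY KEY,
            Title TEXT NOT NULL UNIQUE      -- assumption: one row per tag name
        );
        CREATE TABLE ItemTag (
            ItemID INTEGER NOT NULL REFERENCES Item(ItemID) ON DELETE CASCADE,
            TagID  INTEGER NOT NULL REFERENCES Tag(TagID)  ON DELETE CASCADE,
            PRIMARY KEY (ItemID, TagID)     -- also serves "tags for an item"
        );
        -- Reverse index for "items for a tag" lookups.
        CREATE INDEX ItemTag_by_tag ON ItemTag (TagID, ItemID);
    """)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The composite primary key covers lookups by item, and the extra index covers lookups by tag, so both directions of the relation can be answered from an index.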
#2
Normally I would agree with Yaakov Ellis, but in this special case there is another viable solution:
Use two tables:
Table: Item
Columns: ItemID, Title, Content
Indexes: ItemID

Table: Tag
Columns: ItemID, Title
Indexes: ItemID, Title
This has some major advantages:
First, it makes development much simpler: in the three-table solution, to insert or update an item you have to look up the Tag table to see if there are already entries. Then you have to join them with the new ones. This is no trivial task.
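To make the extra work in the three-table design concrete, here is a minimal sketch of the lookup-then-link step, written in Python against SQLite. The function name set_item_tags, the UNIQUE constraint on Tag.Title, and the use of SQLite's INSERT OR IGNORE are assumptions for illustration; other databases would need their own upsert syntax.

```python
import sqlite3


def set_item_tags(conn, item_id, titles):
    """Replace the tags of one item in the three-table design."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM ItemTag WHERE ItemID = ?", (item_id,))
        for title in set(titles):
            # Create the tag only if it does not exist yet (needs UNIQUE Title).
            conn.execute("INSERT OR IGNORE INTO Tag (Title) VALUES (?)", (title,))
            # Look up the TagID (new or existing) and link it to the item.
            conn.execute(
                "INSERT INTO ItemTag (ItemID, TagID) "
                "SELECT ?, TagID FROM Tag WHERE Title = ?",
                (item_id, title),
            )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Item (ItemID INTEGER PRIMARY KEY, Title TEXT, Content TEXT);
    CREATE TABLE Tag (TagID INTEGER PRIMARY KEY, Title TEXT NOT NULL UNIQUE);
    CREATE TABLE ItemTag (ItemID INTEGER, TagID INTEGER,
                          PRIMARY KEY (ItemID, TagID));
""")
conn.execute("INSERT INTO Item VALUES (1, 'First post', '')")
set_item_tags(conn, 1, ["sql", "tagging"])
set_item_tags(conn, 1, ["sql", "schema"])  # 'tagging' is unlinked, 'schema' added
```

Note that the unlinked tag stays behind in the Tag table; cleaning up unused tags is one more task this design adds. In the two-table design the same update is a DELETE plus plain INSERTs into Tag.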
Then it makes queries simpler (and perhaps faster). There are three major database queries you will run: output all tags for one item, draw a tag cloud, and select all items for one tag title.
All Tags for one Item:
3-Table:
SELECT Tag.Title FROM Tag JOIN ItemTag ON Tag.TagID = ItemTag.TagID WHERE ItemTag.ItemID = :id
2-Table:
SELECT Tag.Title FROM Tag WHERE Tag.ItemID = :id
Tag-Cloud:
3-Table:
SELECT Tag.Title, count(*) FROM Tag JOIN ItemTag ON Tag.TagID = ItemTag.TagID GROUP BY Tag.Title
2-Table:
SELECT Tag.Title, count(*) FROM Tag GROUP BY Tag.Title
Items for one Tag:
3-Table:
SELECT Item.* FROM Item JOIN ItemTag ON Item.ItemID = ItemTag.ItemID JOIN Tag ON ItemTag.TagID = Tag.TagID WHERE Tag.Title = :title
2-Table:
SELECT Item.* FROM Item JOIN Tag ON Item.ItemID = Tag.ItemID WHERE Tag.Title = :title
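The three two-table queries above can be tried end to end with a small Python and SQLite sketch. The sample rows, the helper function names, the composite primary key, and the index name Tag_by_title are assumptions made for the demo; the SQL itself follows the queries in this answer, with ORDER BY added only to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Item (ItemID INTEGER PRIMARY KEY, Title TEXT, Content TEXT);
    CREATE TABLE Tag (ItemID INTEGER NOT NULL, Title TEXT NOT NULL,
                      PRIMARY KEY (ItemID, Title));
    CREATE INDEX Tag_by_title ON Tag (Title, ItemID);
""")
conn.executemany("INSERT INTO Item VALUES (?, ?, ?)",
                 [(1, "Post A", ""), (2, "Post B", "")])
conn.executemany("INSERT INTO Tag VALUES (?, ?)",
                 [(1, "sql"), (1, "schema"), (2, "sql")])


def tags_for_item(item_id):
    """All tags for one item: a single-table lookup."""
    rows = conn.execute(
        "SELECT Tag.Title FROM Tag WHERE Tag.ItemID = :id ORDER BY Tag.Title",
        {"id": item_id})
    return [row[0] for row in rows]


def tag_cloud():
    """Tag title mapped to the number of items carrying it."""
    return dict(conn.execute(
        "SELECT Tag.Title, count(*) FROM Tag GROUP BY Tag.Title"))


def items_for_tag(title):
    """IDs of all items carrying one tag title: one join instead of two."""
    rows = conn.execute(
        "SELECT Item.ItemID FROM Item JOIN Tag ON Item.ItemID = Tag.ItemID "
        "WHERE Tag.Title = :title ORDER BY Item.ItemID",
        {"title": title})
    return [row[0] for row in rows]
```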
But there are some drawbacks, too: it could take more space in the database (which could lead to more disk operations, which are slower), and it's not normalized, which could lead to inconsistencies.
The size argument is not that strong: tags are by nature normally pretty small, so the size increase is not large. One could argue that a query on the tag title is much faster in a small table that contains each tag only once, and this is certainly true. But the savings from not having to join, and the fact that you can build a good index on the titles, could easily compensate for this. Of course, this depends heavily on the size of the database you are using.
The inconsistency argument is a little moot, too. Tags are free-text fields, and there is no expected operation like 'rename all tags "foo" to "bar"'.
So, tl;dr: I would go for the two-table solution. (In fact, I'm going to. I found this article while checking whether there are valid arguments against it.)
#3
If you are using a database that supports map-reduce, like CouchDB, storing tags in a plain text field or list field is indeed the best way. Example:
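tagcloud: {
  map: function(doc) {
    // emit one (tag, 1) row per tag on the document
    for (var i in doc.tags) {
      emit(doc.tags[i], 1);
    }
  },
  reduce: function(keys, values) {
    // sum() rather than values.length, so the count stays correct on rereduce
    return sum(values);
  }
}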
Running this with group=true will group the results by tag name, and even return a count of the number of times that tag was encountered. It's very similar to counting the occurrences of a word in text.
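For readers without a CouchDB instance at hand, the grouping behaviour described above can be imitated in plain Python. This is only an emulation of the idea (map emits one pair per tag, reduce sums the values per key), not CouchDB's API; the sample documents and the function names map_doc, reduce_values, and tag_cloud are made up for the sketch.

```python
from collections import defaultdict

# Hypothetical documents; the third one has no tags at all.
docs = [
    {"_id": "a", "tags": ["sql", "schema"]},
    {"_id": "b", "tags": ["sql"]},
    {"_id": "c"},
]


def map_doc(doc):
    """Emit one (tag, 1) pair per tag, like the map function above."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_values(values):
    """Sum the emitted values for one key."""
    return sum(values)


def tag_cloud(documents):
    """Group emitted pairs by key, then reduce each group (like group=true)."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}
```

Calling tag_cloud(docs) yields one count per distinct tag, which is the word-count pattern the answer refers to.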
#4
Use a single formatted text column[1] for storing the tags and use a capable full-text search engine to index it. Otherwise you will run into scaling problems when trying to implement boolean queries.
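To show what a boolean query means in practice, here is a sketch of an AND query ("items tagged with all of these titles") written against a plain mapping-table design in Python and SQLite. It illustrates the kind of query this answer is concerned about rather than the full-text approach it recommends; the table names follow the three-table answer above, and the function name and sample data are assumptions.

```python
import sqlite3


def items_with_all_tags(conn, titles):
    """Return IDs of items that carry every one of the given tag titles."""
    titles = sorted(set(titles))
    if not titles:
        return []
    marks = ", ".join("?" for _ in titles)
    sql = f"""
        SELECT ItemTag.ItemID
        FROM ItemTag JOIN Tag ON Tag.TagID = ItemTag.TagID
        WHERE Tag.Title IN ({marks})
        GROUP BY ItemTag.ItemID
        HAVING count(DISTINCT Tag.Title) = ?   -- every requested tag matched
        ORDER BY ItemTag.ItemID
    """
    return [row[0] for row in conn.execute(sql, (*titles, len(titles)))]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tag (TagID INTEGER PRIMARY KEY, Title TEXT NOT NULL UNIQUE);
    CREATE TABLE ItemTag (ItemID INTEGER, TagID INTEGER,
                          PRIMARY KEY (ItemID, TagID));
    INSERT INTO Tag VALUES (1, 'sql'), (2, 'schema'), (3, 'nosql');
    INSERT INTO ItemTag VALUES (1, 1), (1, 2), (2, 1), (3, 3);
""")
```

Every extra AND term widens the IN list and the grouping work, and mixing in OR and NOT terms needs further joins or subqueries, which is where a search engine's query syntax tends to be easier to work with.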
If you need details about the tags you have, you can either keep track of them in an incrementally maintained table or run a batch job to extract the information.
[1] Some RDBMS even provide a native array type which might be even better suited for storage by not needing a parsing step, but might cause problems with the full text search.
#5
I've always kept the tags in a separate table and then had a mapping table. Of course I've never done anything on a really large scale either.
Having a "tags" table and a map table makes it pretty trivial to generate tag clouds & such since you can easily put together SQL to get a list of tags with counts of how often each tag is used.
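A minimal sketch of such a tag-count query, run from Python against SQLite. The table names Tag and ItemTag are borrowed from the three-table answer above and the sample rows are invented; the LEFT JOIN is a choice made here so that tags with no items still show up with a count of zero.

```python
import sqlite3


def tag_counts(conn):
    """Return (title, uses) pairs, most used first, unused tags included."""
    return conn.execute("""
        SELECT Tag.Title, count(ItemTag.ItemID) AS uses
        FROM Tag LEFT JOIN ItemTag ON ItemTag.TagID = Tag.TagID
        GROUP BY Tag.TagID, Tag.Title
        ORDER BY uses DESC, Tag.Title
    """).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tag (TagID INTEGER PRIMARY KEY, Title TEXT NOT NULL UNIQUE);
    CREATE TABLE ItemTag (ItemID INTEGER, TagID INTEGER,
                          PRIMARY KEY (ItemID, TagID));
    INSERT INTO Tag VALUES (1, 'sql'), (2, 'schema'), (3, 'unused');
    INSERT INTO ItemTag VALUES (1, 1), (1, 2), (2, 1);
""")
```

The counts can then be mapped to font sizes or weights when rendering the cloud.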
#6
I would suggest the following design: Item table: ItemID, taglist1, taglist2.
This will be fast and makes saving and retrieving the data at the item level easy.
In parallel, build another table, Tags, keyed by tag. Do not make tag a unique identifier: if you run out of space in the second column, which holds, let's say, 100 items, create another row for the same tag.
Now, searching for the items for a tag will be super fast.