Recommended SQL database design for tags or tagging

Time: 2021-07-01 12:51:56

I've heard of a few ways to implement tagging: using a mapping table between TagID and ItemID (makes sense to me, but does it scale?), adding a fixed number of possible TagID columns to the Item table (seems like a bad idea), or keeping tags in a comma-separated text column (sounds crazy but could work). I've even heard someone recommend a sparse matrix, but then how do the tag names grow gracefully?

Am I missing a best practice for tags?

6 Solutions

#1


357  

Three tables (one for storing all items, one for all tags, and one for the relation between the two), properly indexed, with foreign keys set, running on a proper database, should work well and scale properly.

Table: Item
Columns: ItemID, Title, Content

Table: Tag
Columns: TagID, Title

Table: ItemTag
Columns: ItemID, TagID
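
As a rough sketch of what "properly indexed, with foreign keys set" could look like, here is the three-table layout created in an in-memory SQLite database through Python's sqlite3 module. The choice of SQLite, the column types, the UNIQUE constraint on Tag.Title, and the index name idx_itemtag_tag are assumptions for illustration; the answer only names the tables and columns.

```python
import sqlite3

# In-memory SQLite database; SQLite is an assumption, the answer names no engine.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.executescript("""
CREATE TABLE Item (
    ItemID  INTEGER PRIMARY KEY,
    Title   TEXT NOT NULL,
    Content TEXT
);
CREATE TABLE Tag (
    TagID INTEGER PRIMARY KEY,
    Title TEXT NOT NULL UNIQUE          -- one row per distinct tag name
);
CREATE TABLE ItemTag (
    ItemID INTEGER NOT NULL REFERENCES Item(ItemID) ON DELETE CASCADE,
    TagID  INTEGER NOT NULL REFERENCES Tag(TagID)   ON DELETE CASCADE,
    PRIMARY KEY (ItemID, TagID)         -- serves lookups by item
);
-- Reverse index so "all items for a tag" does not scan the mapping table.
CREATE INDEX idx_itemtag_tag ON ItemTag (TagID, ItemID);
""")

conn.execute("INSERT INTO Item (ItemID, Title, Content) VALUES (1, 'First post', 'Hello')")
conn.execute("INSERT INTO Tag (TagID, Title) VALUES (1, 'sql')")
conn.execute("INSERT INTO ItemTag (ItemID, TagID) VALUES (1, 1)")

# The foreign key rejects a mapping row that points at a missing tag.
try:
    conn.execute("INSERT INTO ItemTag (ItemID, TagID) VALUES (1, 999)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```

The composite primary key on ItemTag prevents tagging the same item twice with the same tag, and the second index covers the opposite lookup direction.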

#2


59  

Normally I would agree with Yaakov Ellis but in this special case there is another viable solution:

Use two tables:

Table: Item
Columns: ItemID, Title, Content
Indexes: ItemID

Table: Tag
Columns: ItemID, Title
Indexes: ItemId, Title

This has some major advantages:

First, it makes development much simpler: in the three-table solution, every insert or update of an item requires looking up the Tag table to see which tags already exist, creating the missing ones, and then linking them all to the item. This is no trivial task.
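
To make that extra work concrete, here is a minimal sketch of saving an item's tags in the three-table design, using SQLite through Python's sqlite3 module. The helper name set_tags, the UNIQUE constraint on Tag.Title, and the use of SQLite's INSERT OR IGNORE are assumptions for illustration, not something the answer prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item    (ItemID INTEGER PRIMARY KEY, Title TEXT, Content TEXT);
CREATE TABLE Tag     (TagID INTEGER PRIMARY KEY, Title TEXT NOT NULL UNIQUE);
CREATE TABLE ItemTag (ItemID INTEGER NOT NULL, TagID INTEGER NOT NULL,
                      PRIMARY KEY (ItemID, TagID));
""")

def set_tags(conn, item_id, titles):
    """Replace the tag set of one item (hypothetical helper)."""
    titles = sorted(set(titles))
    with conn:  # one transaction: commit on success, roll back on error
        # Create any tag that does not exist yet; existing ones are left alone.
        conn.executemany("INSERT OR IGNORE INTO Tag (Title) VALUES (?)",
                         [(t,) for t in titles])
        # Drop the old mapping rows, then map the item to the wanted tags.
        conn.execute("DELETE FROM ItemTag WHERE ItemID = ?", (item_id,))
        conn.executemany(
            "INSERT INTO ItemTag (ItemID, TagID) "
            "SELECT ?, TagID FROM Tag WHERE Title = ?",
            [(item_id, t) for t in titles])

conn.execute("INSERT INTO Item VALUES (1, 'First post', 'Hello')")
set_tags(conn, 1, ["sql", "design"])
set_tags(conn, 1, ["sql", "tagging"])   # update: 'design' unlinked, 'tagging' added

current = [row[0] for row in conn.execute(
    "SELECT Tag.Title FROM Tag JOIN ItemTag ON Tag.TagID = ItemTag.TagID "
    "WHERE ItemTag.ItemID = ? ORDER BY Tag.Title", (1,))]
```

Note that the unlinked tag 'design' stays behind in the Tag table; cleaning up such orphans is one more task the three-table design leaves to the application.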

Second, it makes queries simpler (and perhaps faster). There are three major database queries you will run: output all tags for one item, draw a tag cloud, and select all items for one tag title.

All Tags for one Item:

3-Table:

SELECT Tag.Title 
  FROM Tag 
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 WHERE ItemTag.ItemID = :id

2-Table:

SELECT Tag.Title
FROM Tag
WHERE Tag.ItemID = :id

Tag-Cloud:

3-Table:

SELECT Tag.Title, count(*)
  FROM Tag
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 GROUP BY Tag.Title

2-Table:

SELECT Tag.Title, count(*)
  FROM Tag
 GROUP BY Tag.Title

Items for one Tag:

3-Table:

SELECT Item.*
  FROM Item
  JOIN ItemTag ON Item.ItemID = ItemTag.ItemID
  JOIN Tag ON ItemTag.TagID = Tag.TagID
 WHERE Tag.Title = :title

2-Table:

SELECT Item.*
  FROM Item
  JOIN Tag ON Item.ItemID = Tag.ItemID
 WHERE Tag.Title = :title
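
The three two-table queries above can be tried end to end with a small sketch like the following, which uses SQLite through Python's sqlite3 module. The sample rows, the composite primary key on (ItemID, Title), and the index name idx_tag_title are assumptions; the answer only lists "ItemId, Title" as indexes without saying how they are combined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (ItemID INTEGER PRIMARY KEY, Title TEXT, Content TEXT);
CREATE TABLE Tag  (ItemID INTEGER NOT NULL REFERENCES Item(ItemID),
                   Title  TEXT    NOT NULL,
                   PRIMARY KEY (ItemID, Title));   -- tags for one item
CREATE INDEX idx_tag_title ON Tag (Title, ItemID); -- items for one tag, tag cloud
""")
conn.executemany("INSERT INTO Item VALUES (?, ?, ?)",
                 [(1, "First post", "Hello"), (2, "Second post", "World")])
conn.executemany("INSERT INTO Tag VALUES (?, ?)",
                 [(1, "sql"), (1, "design"), (2, "sql")])

# All tags for one item: no join needed.
tags_for_item = [r[0] for r in conn.execute(
    "SELECT Title FROM Tag WHERE ItemID = :id ORDER BY Title", {"id": 1})]

# Tag cloud: one grouped scan of the Tag table.
tag_cloud = conn.execute(
    "SELECT Title, count(*) FROM Tag GROUP BY Title ORDER BY Title").fetchall()

# Items for one tag: a single join instead of two.
items_for_tag = [r[0] for r in conn.execute(
    "SELECT Item.ItemID FROM Item JOIN Tag ON Item.ItemID = Tag.ItemID "
    "WHERE Tag.Title = :title ORDER BY Item.ItemID", {"title": "sql"})]
```

The ORDER BY clauses are only there to make the results deterministic; they are not part of the queries in the answer.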

But there are some drawbacks, too: it could take more space in the database (which could lead to more disk operations, which are slower), and it's not normalized, which could lead to inconsistencies.

The size argument is not that strong: tags are by nature pretty short, so the size increase is not large. One could argue that a lookup by tag title is much faster in a small table that contains each tag only once, and that is certainly true. But the savings from not having to join, together with the fact that you can build a good index on the title, could easily compensate for this. This of course depends heavily on the size of the database you are using.

The inconsistency argument is a little moot, too. Tags are free-text fields, and there is no expected operation like 'rename all tags "foo" to "bar"'.

So, TL;DR: I would go for the two-table solution. (In fact, I'm going to. I found this article while checking whether there are valid arguments against it.)

#3


37  

If you are using a database that supports map-reduce, like CouchDB, storing tags in a plain text field or list field is indeed the best way. Example:

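tagcloud: {
  map: function(doc){
    // emit one (tag, 1) row per tag on the document
    for(var i in doc.tags){
      emit(doc.tags[i], 1)
    }
  },
  reduce: function(keys, values){
    // sum() also stays correct when CouchDB re-reduces partial counts
    return sum(values)
  }
}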

Running this with group=true will group the results by tag name, and even return a count of the number of times that tag was encountered. It's very similar to counting the occurrences of a word in text.
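
For readers without a CouchDB instance at hand, the same counting logic can be sketched in plain Python. This is only an illustration of the map, group, and reduce steps, not CouchDB's API; the sample documents and the names map_doc and reduce_values are made up.

```python
from collections import defaultdict

docs = [  # made-up documents, each with a list field named "tags"
    {"_id": "a", "tags": ["sql", "design"]},
    {"_id": "b", "tags": ["sql"]},
    {"_id": "c", "tags": []},
]

def map_doc(doc):
    """Map step: emit one (tag, 1) pair per tag on the document."""
    for tag in doc.get("tags", []):
        yield tag, 1

def reduce_values(values):
    """Reduce step: add up the emitted 1s for a single key."""
    return sum(values)

# Grouping the emitted pairs by key mirrors what group=true requests.
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_doc(doc):
        grouped[key].append(value)

tag_cloud = {tag: reduce_values(values) for tag, values in grouped.items()}
```

Summing the values rather than counting them keeps the reduce step correct even if it is applied to already-reduced partial counts.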

#4


11  

Use a single formatted text column[1] for storing the tags, and use a capable full-text search engine to index it. Otherwise, you will run into scaling problems when trying to implement boolean queries.
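
For context, this is roughly what a boolean tag query looks like when tags live in a mapping table instead, sketched with SQLite through Python's sqlite3 module. The table layout, the sample data, and the query shape ("sql AND design AND NOT draft") are assumptions for illustration; the point is only that every extra boolean term adds joins or subqueries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tag     (TagID INTEGER PRIMARY KEY, Title TEXT NOT NULL UNIQUE);
CREATE TABLE ItemTag (ItemID INTEGER NOT NULL, TagID INTEGER NOT NULL,
                      PRIMARY KEY (ItemID, TagID));
INSERT INTO Tag VALUES (1, 'sql'), (2, 'design'), (3, 'draft');
INSERT INTO ItemTag VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (3, 3);
""")

# "sql AND design AND NOT draft":
# the AND part is a grouped count over the wanted tags,
# the NOT part is an anti-join against items carrying the unwanted tag.
rows = conn.execute("""
    SELECT it.ItemID
      FROM ItemTag AS it
      JOIN Tag AS t ON t.TagID = it.TagID
     WHERE t.Title IN ('sql', 'design')
       AND it.ItemID NOT IN (
           SELECT x.ItemID
             FROM ItemTag AS x
             JOIN Tag AS d ON d.TagID = x.TagID
            WHERE d.Title = 'draft')
     GROUP BY it.ItemID
    HAVING count(DISTINCT t.Title) = 2
     ORDER BY it.ItemID
""").fetchall()
matching_items = [r[0] for r in rows]
```

Whether this actually becomes a scaling problem depends on data volume and the database engine; the sketch only shows the query shape the answer is worried about.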

If you need details about the tags you have, you can either keep track of them in an incrementally maintained table or run a batch job to extract the information.

[1] Some RDBMSs even provide a native array type, which might be even better suited for storage because it needs no parsing step, but it might cause problems with the full-text search.

#5


8  

I've always kept the tags in a separate table and then had a mapping table. Of course I've never done anything on a really large scale either.

Having a "tags" table and a map table makes it pretty trivial to generate tag clouds & such since you can easily put together SQL to get a list of tags with counts of how often each tag is used.
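
A minimal sketch of such a tag-count query, using SQLite through Python's sqlite3 module. The table names tags and item_tag, their columns, and the sample rows are assumptions; the answer does not give a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags     (TagID INTEGER PRIMARY KEY, Title TEXT NOT NULL UNIQUE);
CREATE TABLE item_tag (ItemID INTEGER NOT NULL, TagID INTEGER NOT NULL,
                       PRIMARY KEY (ItemID, TagID));
INSERT INTO tags VALUES (1, 'sql'), (2, 'design'), (3, 'unused');
INSERT INTO item_tag VALUES (1, 1), (1, 2), (2, 1);
""")

# LEFT JOIN keeps tags that no item uses; count(m.ItemID) reports them as 0.
tag_counts = conn.execute("""
    SELECT t.Title, count(m.ItemID) AS uses
      FROM tags AS t
      LEFT JOIN item_tag AS m ON m.TagID = t.TagID
     GROUP BY t.TagID, t.Title
     ORDER BY uses DESC, t.Title
""").fetchall()
```

With a separate tags table, unused tags still show up in the result, which an inner join or a design without a tags table would not give you.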

#6


0  

I would suggest the following design: an Item table with columns Itemid, taglist1, taglist2. This will be fast and makes saving and retrieving the data at the item level easy.

In parallel, build another table, Tags, keyed by tag. Do not make tag a unique identifier: if you run out of space in the 2nd column, which holds, let's say, 100 items, create another row for the same tag.

Now, searching for the items for a tag will be super fast.

