Should I use a string table to make my database more efficient?

Let's say you have a database with a single table like...

---------------------------------------------
| Name    |  FavoriteFood                   |
---------------------------------------------
| Alice   | Pizza                           |
| Mark    | Sushi                           |
| Jack    | Pizza                           |
---------------------------------------------

Would it be more space-efficient to add a second table called "Strings" that stores each distinct string once, and change the FavoriteFood column to an index into that table? In the example above, "Pizza" appears to be stored twice; with the additional table, it would be stored only once. Of course, assume there are 1,000,000 rows and 1,000 unique strings rather than just 3 rows and 2 unique strings.

Edit: We don't know what the FavoriteFoods are beforehand: they are user-supplied. The programmatic interface to the string table would be something like...

String GetString(int ID) { return the string in the row whose row-id == ID; }

int GetID(String s) {
  if s exists, return row-id;
  else {
    Create new row;
    return new row id;
  }
}
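
For concreteness, here is a minimal runnable sketch of that interface in Python using the standard-library sqlite3 module. The table layout (a Strings table with ID and Value columns) and the function names get_id and get_string are assumptions made for illustration; the pseudocode above does not specify them.

```python
import sqlite3


def open_db():
    """Create an in-memory database holding the (assumed) Strings table."""
    conn = sqlite3.connect(":memory:")
    # UNIQUE makes the database reject duplicates and gives lookups by value an index.
    conn.execute(
        "CREATE TABLE Strings (ID INTEGER PRIMARY KEY, Value TEXT NOT NULL UNIQUE)"
    )
    return conn


def get_id(conn, s):
    """Return the row id for s, creating a new row if s is not stored yet."""
    row = conn.execute("SELECT ID FROM Strings WHERE Value = ?", (s,)).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO Strings (Value) VALUES (?)", (s,))
    return cur.lastrowid


def get_string(conn, string_id):
    """Return the string stored under string_id, or None if there is no such row."""
    row = conn.execute(
        "SELECT Value FROM Strings WHERE ID = ?", (string_id,)
    ).fetchone()
    return None if row is None else row[0]


conn = open_db()
pizza_id = get_id(conn, "Pizza")
sushi_id = get_id(conn, "Sushi")
```

Calling get_id(conn, "Pizza") a second time returns the same id instead of adding a row. This sketch ignores concurrent writers; with several connections the select-then-insert pair would need a transaction or an upsert.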

So the string table seems more efficient, but do modern databases already do this in the background, so that I can just use the simple one-table approach and still be efficient?

4 Solutions

#1

Think about what makes a good design for your problem domain rather than about efficiency (unless you expect tens of millions of rows or more).

A well-designed database should be in 3NF (third normal form). Denormalise only once you have identified a performance problem by measuring.

#2

What are you measuring efficiency by? Assuming there is no other data associated with each FavoriteFood (if there were, you would obviously want two tables), a one-table approach is probably more time-efficient, because the unnecessary join would incur extra processing cost. On the other hand, a two-table approach may be more space-efficient, since an index takes less space to store than a string, but that depends on how the particular database you're using optimizes storage of repeated strings.
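
To put rough numbers on the space argument, here is a back-of-the-envelope estimate in Python for the 1,000,000 rows and 1,000 unique strings from the question. The average string length (10 bytes) and the index width (4 bytes) are assumptions picked for illustration, and the estimate ignores per-row overhead, indexes, the Name column, and any compression the database applies.

```python
ROWS = 1_000_000          # from the question
UNIQUE_STRINGS = 1_000    # from the question
AVG_STRING_BYTES = 10     # assumption: average length of a FavoriteFood value
ID_BYTES = 4              # assumption: a 32-bit integer index

# One table: every row stores the full string.
one_table_bytes = ROWS * AVG_STRING_BYTES

# Two tables: every row stores an index, and each unique string is stored once
# together with its own index.
two_table_bytes = ROWS * ID_BYTES + UNIQUE_STRINGS * (ID_BYTES + AVG_STRING_BYTES)

print(one_table_bytes)   # 10000000
print(two_table_bytes)   # 4014000
```

Under these assumptions the two-table layout stores about 40% of the bytes. The saving shrinks as the strings get shorter and disappears once the average string is no longer than the index.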

#3

If you have a separate table to store the strings, updating the descriptions becomes easier. For example, if you need to change every "Pizza" to "Italian Pizza", a single row update does it. Another advantage is translations: you can use the other table to store translations of each string in different languages and select one based on the current language.

The problem with that approach is inserts: you need to insert into both tables and maintain the foreign key constraints, so it adds a bit of complexity to what was a simple table.
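
Both points can be seen in a small sqlite3 sketch in Python: the rename touches one row, while inserts have to respect the foreign key. The table and column names (People, Strings, FavoriteFoodID, Value) are assumptions for illustration, not something given in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE Strings (ID INTEGER PRIMARY KEY, Value TEXT NOT NULL UNIQUE);
    CREATE TABLE People (
        Name TEXT NOT NULL,
        FavoriteFoodID INTEGER NOT NULL REFERENCES Strings(ID)
    );
""")

# Inserts touch both tables: the string row has to exist before a person can point at it.
conn.executemany(
    "INSERT INTO Strings (ID, Value) VALUES (?, ?)", [(1, "Pizza"), (2, "Sushi")]
)
conn.executemany(
    "INSERT INTO People (Name, FavoriteFoodID) VALUES (?, ?)",
    [("Alice", 1), ("Mark", 2), ("Jack", 1)],
)

# A person row that points at a missing string is rejected by the constraint.
try:
    conn.execute(
        "INSERT INTO People (Name, FavoriteFoodID) VALUES (?, ?)", ("Eve", 99)
    )
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# Renaming "Pizza" changes exactly one row, no matter how many people like it.
cur = conn.execute(
    "UPDATE Strings SET Value = ? WHERE Value = ?", ("Italian Pizza", "Pizza")
)
rows_changed = cur.rowcount

favorites = conn.execute("""
    SELECT p.Name, s.Value
    FROM People AS p
    JOIN Strings AS s ON s.ID = p.FavoriteFoodID
    ORDER BY p.Name
""").fetchall()
```

With a single table, the same rename would have to rewrite every row whose FavoriteFood is "Pizza".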

#4

Pros of having a separate "Strings" table:

Likely less space, if strings repeat very frequently

Likely faster typical queries, because of less I/O

Cons:

You'll write more complex queries to achieve the same result

If the repetition factor is rather small, you'll get higher query execution time. To resolve each ID to a string (or back), the database server performs one lookup (seek operation) per ID, so each query that does this pays an additional factor of roughly log(Strings.Count()).

In practice, though, this is really efficient. For example, most full-text search engines use nearly this approach to store document-word maps.
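
To get a feel for the size of that log(Strings.Count()) factor, here is a rough calculation in Python for the numbers in the question. It assumes each lookup behaves like a binary search over the 1,000 strings (so the logarithm is base 2); this is only an illustration, since real database indexes are typically B-trees with a much larger fan-out and so need fewer steps per seek.

```python
import math

STRINGS_COUNT = 1_000      # unique strings, from the question
ROWS_RESOLVED = 1_000_000  # worst case: a query that resolves every row's ID

# Assumption: one seek costs about log2(n) comparisons, as in a binary search.
comparisons_per_lookup = math.ceil(math.log2(STRINGS_COUNT))
total_comparisons = ROWS_RESOLVED * comparisons_per_lookup

print(comparisons_per_lookup)  # 10
print(total_comparisons)       # 10000000
```

So under this assumption the extra cost is about ten comparisons per resolved ID, which is small next to the I/O saved when strings repeat often.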
