One big table or many small tables in a database?

Time: 2021-08-22 09:50:11

Say I want to create a typical to-do web app using a database like PostgreSQL. A user should be able to create todo-lists, and on these lists he should be able to make the actual todo-entries.

I regard the todo-list as an object with properties like owner, name, etc., and of course the actual todo-entries, which have their own properties like content, priority, date, and so on.

My idea was to create one table for all the todo-lists of all the users. In this table I would store all the attributes of each list. But the question that arises is how to store the todo-entries themselves. In an additional table, of course, but should I rather:

1. Create one big table for all the entries and have a field storing the id of the todo-list they belong to, like so:

todo-list: id, owner, ...
todo-entries: list.id, content, ...

which would give 2 tables in total. The todo-entries table could get very large, although entries expire, so the table grows only with more usage and not over time. Then we would write something like SELECT * FROM todo-entries WHERE todo-list-id=id, where id is the id of the list we are trying to retrieve.

OR

2. Create a todo-entries table on a per user basis.

todo-list: id, owner, ...
todo-entries-owner: list.id, content, ...

The number of entries tables depends on the number of users in the system. A query would be something like SELECT * FROM todo-entries-owner. The tables would be mid-sized, depending on how many entries each user makes in total.

OR

3. Create one todo-entries table for each todo-list and then store the generated table name in a field of the todo-list table. For instance, we could use the todo-list's unique id in the table name, like:

todo-list: id, owner, entries-list-name, ...    
todo-entries-id: content, ... //the id part is the id from the todo-list id field. 

In the third case we could potentially have quite a large number of tables, since a user might create many 'short' todo-lists. To retrieve a list we would then simply run something along the lines of SELECT * FROM todo-entries-id, where todo-entries-id would either be a field in the todo-list table or be built implicitly by concatenating 'todo-entries' with the todo-list's unique id. By the way: how do I do that? Should it be done in JS, or can it be done in PostgreSQL directly? And closely related: in a SELECT * FROM <tablename> statement, is it possible to use the value of some field of some other table as <tablename>? Something like SELECT * FROM todo-list(id).entries-list-name.

The three possibilities go from a few large tables to many small tables. My personal feeling is that the second or third solution is better, since I think they might scale better. But I'm not quite sure of that, and I would like to know what the 'typical' approach is.

I could go into more depth about what I think of each approach, but to get to the point of my question:

  • Which of the three possibilities should I go for? (Or is there something else? Does this have to do with normalization?)

Follow up:

  • What would the (PostgreSQL) statements then look like?

1 solution

#1


5  

The only viable option is the first. It is far easier to manage and will very likely be faster than the other options.

Imagine you have 1 million users, with an average of 3 to-do lists each and an average of 5 entries per list.

Scenario 1

In the first scenario you have three tables:

  • todo_users: 1 million records
  • todo_lists: 3 million records
  • todo_entries: 15 million records
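
As a rough sketch, the three tables could be declared like this. Only the column names id, username, userid and listid come from the queries in this answer; every other column, the types and the index names are assumptions for illustration, and the identity columns assume PostgreSQL 10 or later.

```sql
-- Sketch only: columns other than id, username, userid and listid are assumed.
CREATE TABLE todo_users (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    username text NOT NULL UNIQUE  -- UNIQUE also creates the index on username
);

CREATE TABLE todo_lists (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    userid bigint NOT NULL REFERENCES todo_users (id) ON DELETE CASCADE,
    name   text NOT NULL
);

CREATE TABLE todo_entries (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    listid   bigint NOT NULL REFERENCES todo_lists (id) ON DELETE CASCADE,
    content  text NOT NULL,
    priority integer,
    due_date date
);

-- PostgreSQL does not index foreign-key columns automatically,
-- so the lookups by userid and listid get their own indexes.
CREATE INDEX todo_lists_userid_idx   ON todo_lists (userid);
CREATE INDEX todo_entries_listid_idx ON todo_entries (listid);
```

The foreign keys are what make this the normalized design: an entry belongs to exactly one list, and a list to exactly one user.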

Such table sizes are no problem for PostgreSQL, and with the right indexes you will be able to retrieve any data in less than a second. That applies to simple queries; if your queries become more complex (like: get me the todo_entries for the longest todo_list of the top 15% of todo_users that have made fewer than 3 todo_lists in the 3-month period with the most todo_entries entered), they will obviously be slower, as they would be in the other scenarios. The queries are very straightforward:

-- Find user data based on username entered in the web site
-- An index on 'username' is essential here
SELECT * FROM todo_users WHERE username = ?;

-- Find to-do lists from a user whose userid has been retrieved with previous query
SELECT * FROM todo_lists WHERE userid = ?;

-- Find entries for a to-do list based on its list id
SELECT * FROM todo_entries WHERE listid = ?;
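
The ? in these queries is the kind of placeholder a client driver fills in. When you type a statement into psql yourself, PostgreSQL's own syntax uses positional parameters such as $1. A small sketch, using a hypothetical throwaway table so that it runs on its own:

```sql
BEGIN;

-- Hypothetical minimal table, dropped again by the ROLLBACK below.
CREATE TEMP TABLE todo_entries (
    id      integer PRIMARY KEY,
    listid  integer NOT NULL,
    content text NOT NULL
);
INSERT INTO todo_entries VALUES
    (1, 42, 'buy milk'),
    (2, 42, 'call mom'),
    (3, 7,  'pay rent');

-- $1 is the first positional parameter of the prepared statement.
PREPARE entries_for_list (integer) AS
    SELECT id, content FROM todo_entries WHERE listid = $1 ORDER BY id;

EXECUTE entries_for_list(42);  -- returns the rows with id 1 and 2

-- Prepared statements are not undone by ROLLBACK, so remove it explicitly.
DEALLOCATE entries_for_list;

ROLLBACK;
```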

You can also combine the three queries into one:

SELECT u.*, l.*, e.* -- or select appropriate columns from the three tables
FROM todo_users u
LEFT JOIN todo_lists l ON l.userid = u.id
LEFT JOIN todo_entries e ON e.listid = l.id
WHERE u.username = ?;

Using LEFT JOINs means that you will also get rows for users without lists and for lists without entries (the columns from the missing side will be NULL).
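
To see the effect of the LEFT JOIN, here is a small sketch that counts entries per list. The tables and sample rows are hypothetical and exist only inside the transaction.

```sql
BEGIN;

-- Hypothetical minimal tables and data, discarded by the ROLLBACK below.
CREATE TEMP TABLE todo_lists (id integer PRIMARY KEY, userid integer, name text);
CREATE TEMP TABLE todo_entries (id integer PRIMARY KEY, listid integer, content text);

INSERT INTO todo_lists VALUES (1, 100, 'groceries'), (2, 100, 'empty list');
INSERT INTO todo_entries VALUES (1, 1, 'buy milk'), (2, 1, 'buy eggs');

-- count(e.id) skips NULLs, so a list without entries reports 0.
-- With an inner JOIN the 'empty list' row would be missing entirely.
SELECT l.name, count(e.id) AS entry_count
FROM todo_lists l
LEFT JOIN todo_entries e ON e.listid = l.id
WHERE l.userid = 100
GROUP BY l.id, l.name
ORDER BY l.id;
-- groceries  | 2
-- empty list | 0

ROLLBACK;
```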

Inserting, updating and deleting records can be done with very similar statements, and similarly fast.
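
A minimal sketch of what those statements might look like. The done column, the sample values and the ON DELETE CASCADE rule are assumptions, not part of the schema described above.

```sql
BEGIN;

-- Hypothetical minimal tables, discarded by the ROLLBACK below.
CREATE TEMP TABLE todo_lists (
    id     integer PRIMARY KEY,
    userid integer NOT NULL,
    name   text NOT NULL
);
CREATE TEMP TABLE todo_entries (
    id      integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    listid  integer NOT NULL REFERENCES todo_lists (id) ON DELETE CASCADE,
    content text NOT NULL,
    done    boolean NOT NULL DEFAULT false
);

INSERT INTO todo_lists (id, userid, name) VALUES (1, 100, 'groceries');

-- Add two entries to list 1 and get their generated ids back.
INSERT INTO todo_entries (listid, content)
VALUES (1, 'buy milk'), (1, 'buy eggs')
RETURNING id;

-- Mark one entry as done.
UPDATE todo_entries SET done = true WHERE listid = 1 AND content = 'buy milk';

-- Delete a single entry ...
DELETE FROM todo_entries WHERE listid = 1 AND content = 'buy milk';

-- ... or the whole list; ON DELETE CASCADE removes its remaining entries.
DELETE FROM todo_lists WHERE id = 1;

ROLLBACK;
```

With the cascade in place, removing a list or a user is a single DELETE and needs no extra cleanup code.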

PostgreSQL stores data on "pages" (8 kB by default) and most pages will be filled, which is a good thing because reading and writing a page is very slow compared to other operations.
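
If you want to check this on your own server, the following sketch reports the page size actually in use (8192 bytes in a default build) and how many pages a table occupies. The table page_demo and its contents are made up for the example.

```sql
BEGIN;

-- Page (block) size of this server in bytes.
SHOW block_size;

-- Hypothetical table with 1000 short rows, discarded by the ROLLBACK below.
CREATE TEMP TABLE page_demo AS
    SELECT g AS id, 'entry ' || g AS content
    FROM generate_series(1, 1000) AS g;

-- Number of pages in the table's main data file.
SELECT pg_relation_size('page_demo') / current_setting('block_size')::bigint AS pages;

ROLLBACK;
```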

Scenario 2

In this scenario you need only two tables per user (todo_lists and todo_entries) but you need some mechanism to identify which tables to query.

  • 1 million todo_lists tables with a few records each
  • 1 million todo_entries tables with a few dozen records each

The only practical solution is to construct the full table names from a "basename" derived from the username or some other persistent authentication data from your web site. So something like this:

username = 'Jerry';
todo_list = username + '_lists';
todo_entries = username + '_entries';

And then you query with those table names. Most likely you will need a todo_users table anyway to store the personal data, usernames and passwords of your 1 million users.
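
Note that a table name cannot be passed as an ordinary query parameter in PostgreSQL, so querying a per-user table from inside the database needs dynamic SQL. A minimal sketch in PL/pgSQL, where the function name entries_for_user and the table layout are hypothetical:

```sql
BEGIN;

-- Hypothetical per-user table, named as in the snippet above.
CREATE TABLE "Jerry_entries" (listid integer, content text);
INSERT INTO "Jerry_entries" VALUES (1, 'buy milk');

CREATE FUNCTION entries_for_user(p_username text)
RETURNS TABLE (listid integer, content text)
LANGUAGE plpgsql AS $$
BEGIN
    -- %I quotes the value as an identifier, which guards against SQL injection.
    RETURN QUERY EXECUTE format(
        'SELECT t.listid, t.content FROM %I AS t',
        p_username || '_entries'
    );
END;
$$;

SELECT * FROM entries_for_user('Jerry');  -- returns the row (1, 'buy milk')

-- Undo the table and the function so the sketch leaves nothing behind.
ROLLBACK;
```

That every query has to go through string building like this is part of why this scenario is harder to manage than the first.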

In most cases the tables will be very small and PostgreSQL will not use any indexes (nor does it have to). It will have more trouble finding the appropriate tables, though, and you will most likely build your queries in code and then feed them to PostgreSQL, meaning that it cannot reuse a prepared query plan across users. A bigger problem is creating the tables for new users (todo_lists and todo_entries) or deleting obsolete lists or users. This typically requires behind-the-scenes housekeeping that you avoid with the previous scenario. And the biggest performance penalty will be that most pages hold very little content, so you waste disk space and a lot of time reading and writing those partially filled pages.

Scenario 3

This scenario is even worse than scenario 2. Don't do it, it's madness.

  • 3 million todo_entries tables with a few records each

So...

Stick with option 1. It is your only real option.
