Suppose I have a table that represents a "task queue" (tens of millions of records).
Tasks can be "queued" or "done".
Which performs better if we want to grab 10 tasks to process?
- A single table with a "flag" column (ENUM/BIT/TINYINT) marking each task as done or not, possibly with an index on that column
- Separate tables for queued and completed tasks, where each completed task is deleted from the queued table and inserted into the completed table
Note that at the beginning we have few or no completed tasks, but as processing goes on there will be millions of tasks already done.
1 answer
It probably doesn't matter, but if it were me, I would use one table. Here's my reasoning:
First and foremost, we must assume good indexes on this table, since they are what makes the lookup fast. With appropriate indexes, a query for queued tasks won't care whether the number of "done" tasks is 10 or 10 billion; the DBMS will only look at the queued ones.
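To make the indexing point concrete, here is a minimal sketch of the one-table layout. It uses SQLite through Python's standard `sqlite3` module purely as a stand-in, since the question does not name a DBMS. The table name `tasks`, the column names, the index name `idx_tasks_status_id`, and the helper `fetch_queued` are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tasks (
        id      INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        status  INTEGER NOT NULL DEFAULT 0  -- 0 = queued, 1 = done (assumed encoding)
    )
    """
)
# Composite index: equality lookup on status, then rows ordered by id.
conn.execute("CREATE INDEX idx_tasks_status_id ON tasks (status, id)")

# 1,000 tasks; only every 50th is still queued, the rest are already done.
conn.executemany(
    "INSERT INTO tasks (id, payload, status) VALUES (?, ?, ?)",
    [(i, f"job-{i}", 0 if i % 50 == 0 else 1) for i in range(1, 1001)],
)
conn.commit()


def fetch_queued(conn, limit=10):
    """Return the ids of the oldest `limit` queued tasks."""
    rows = conn.execute(
        "SELECT id FROM tasks WHERE status = 0 ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


batch = fetch_queued(conn)
print(batch)
```

Whether a given DBMS actually uses the index for this query is worth checking with its own plan inspection tool (for example `EXPLAIN` in most SQL databases) rather than taking on faith.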
Second, as a task moves from "queued" to "done", you're going to update its status. This requires a bit of index reorganization by the DBMS, but that's OK; DBMSs have been doing that with high efficiency for something like 30 years now.
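As a rough illustration of that status update, the sketch below flips a batch of tasks from queued to done with a single kind of statement and lets the DBMS maintain the index. It again assumes SQLite via Python's `sqlite3` as a stand-in; the schema, the `0`/`1` status encoding, and the helpers `complete_tasks` and `count_by_status` are hypothetical names, not anything from the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks ("
    " id INTEGER PRIMARY KEY,"
    " payload TEXT NOT NULL,"
    " status INTEGER NOT NULL DEFAULT 0)"  # 0 = queued, 1 = done (assumed encoding)
)
conn.execute("CREATE INDEX idx_tasks_status_id ON tasks (status, id)")
conn.executemany(
    "INSERT INTO tasks (id, payload) VALUES (?, ?)",
    [(i, f"job-{i}") for i in range(1, 26)],
)
conn.commit()


def complete_tasks(conn, task_ids):
    """Mark the given tasks as done; the index is updated by the DBMS."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "UPDATE tasks SET status = 1 WHERE id = ? AND status = 0",
            [(task_id,) for task_id in task_ids],
        )


def count_by_status(conn):
    """Return a {status: row_count} mapping."""
    return dict(conn.execute("SELECT status, COUNT(*) FROM tasks GROUP BY status"))


complete_tasks(conn, range(1, 11))
counts = count_by_status(conn)
print(counts)
```

The application code only states the new status; no rows are copied or deleted.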
If you were to split them out into separate tables, essentially the maintenance of moving a record from one place to another would be put on your code instead of in the DBMS index reorganization code. Which of those code bases is better tested and more performant? :)
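For comparison, a two-table version might look like the sketch below, where the application has to move each row itself and wrap the move in a transaction so a task is never lost or duplicated. This is only one possible way to do it, again with SQLite via Python's `sqlite3` as a stand-in; the table names `queued_tasks` and `completed_tasks` and the helper `move_to_completed` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE queued_tasks    (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
    CREATE TABLE completed_tasks (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
    """
)
conn.executemany(
    "INSERT INTO queued_tasks (id, payload) VALUES (?, ?)",
    [(i, f"job-{i}") for i in range(1, 26)],
)
conn.commit()


def move_to_completed(conn, task_id):
    """Copy one task to completed_tasks and delete it from queued_tasks atomically."""
    with conn:  # both statements commit together, or neither does
        conn.execute(
            "INSERT INTO completed_tasks (id, payload) "
            "SELECT id, payload FROM queued_tasks WHERE id = ?",
            (task_id,),
        )
        conn.execute("DELETE FROM queued_tasks WHERE id = ?", (task_id,))


for task_id in range(1, 11):
    move_to_completed(conn, task_id)

queued = conn.execute("SELECT COUNT(*) FROM queued_tasks").fetchone()[0]
completed = conn.execute("SELECT COUNT(*) FROM completed_tasks").fetchone()[0]
print(queued, completed)
```

Each completed task now costs an insert plus a delete, and both tables' indexes are touched, which is the extra bookkeeping the answer is pointing at.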
One final argument - if you put it all in one big table, further performance tweaking of the management of these tasks becomes a DBMS configuration issue, as opposed to a software development issue. That's a big win in my book. There's all sorts of crazy configuration stuff you can do to improve performance in any DBMS, including things like vertical and horizontal partitioning. Those things won't be options if the way you've distributed your data is via some scheme that's embedded in your software.
So, bottom line: I think the two-table approach will perform very similarly to the one-table approach once you take into account the extra work your code has to do to move records around. If you delete an "open" task from one table and stick it into a "done" table, keep in mind that the DBMS still has to update the "open" index on the source table. Because there's likely not going to be a big performance difference, you should use the one-table approach: it's less work for you, and it gives you more flexibility later (speed improvements via configuration rather than software).