I'm designing a very simple (in terms of functionality) but difficult (in terms of scalability) system where users can message each other. Think of it as a very simple chatting service. A user can insert a message through a php page. The message is short and has a recipient name.
On another php page, the user can view all the messages that were sent to him all at once, and then deletes them from the database. That's it. That's all the functionality needed for this system. How should I go about designing this (from a database/php point of view)?
So far I have the table like this:
- field1 -> message (varchar)
- field2 -> recipient (varchar)
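As a concrete starting point, here is a minimal sketch of that table (shown with SQLite from Python for illustration; the table, column, and index names are just placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        message   VARCHAR(255),
        recipient VARCHAR(64)
    )
""")
# The index on recipient is what makes the per-user pull feasible at all.
conn.execute("CREATE INDEX idx_recipient ON messages (recipient)")

conn.execute("INSERT INTO messages VALUES (?, ?)", ("hello", "John"))
rows = conn.execute(
    "SELECT message FROM messages WHERE recipient = ?", ("John",)
).fetchall()
print(rows)  # [('hello',)]
```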
Now for the SQL insert, I find that the time it takes is constant regardless of the number of rows in the database, so my send.php has a guaranteed return time, which is good.
But for pulling down messages, my pull.php will take longer as the number of rows increases! I find the SQL select (and delete) takes longer as the rows grow, and this is true even after I have added an index on the recipient field.
Now, if it were simply the case that users had to wait longer before their messages are pulled, that would be OK. But what I am worried about is that when each pull.php request takes a really long time to service, the PHP server will start to refuse connections to some requests. Or worse, the server might just die.
So the question is, how to design this such that it scales? Any tips/hints?
PS. Some estimates on numbers:
- The number of users starts at 50,000 and goes up.
- Each user on average has around 10 messages stored before the other end pulls them down.
- Each user sends around 10-20 messages a day.
UPDATE from reading the answers so far:
I just want to clarify that pulling down fewer messages in pull.php does not help. Even pulling just one message will take a long time when the table is huge. This is because the table has all the messages, so you have to do a select like this:
select message from DB where recipient = 'John'
Even if you change it to this, it doesn't help much:
select top 1 message from DB where recipient = 'John'
From the answers so far, it seems like the longer the table, the slower the select will be, O(n) or slightly better, with no way around it. If that is the case, how should I handle this from the PHP side? I don't want the PHP page to fail at the HTTP level, because the user will be confused and end up refreshing like mad, which makes it even worse.
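For what it's worth, you can check whether the index is actually being used before concluding the lookup is O(n). With SQLite, for example, `EXPLAIN QUERY PLAN` should report an index search rather than a full table scan (a rough sketch; your engine's tooling will differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (message TEXT, recipient TEXT)")
conn.execute("CREATE INDEX idx_recipient ON messages (recipient)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT message FROM messages WHERE recipient = ?",
    ("John",),
).fetchall()
# With the index in place the plan is a SEARCH (index seek), not a SCAN.
print(plan[0][-1])
```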
8 Answers
#1
The database design for this is simple, as you suggest. As far as it taking longer once the user has more messages, what you can do is just paginate the results. Show the first 10/50/100 or whatever makes sense and only pull those records. Generally speaking, your times shouldn't increase very much unless the volume of messages increases by an order of magnitude or more. You should be able to pull back 1000 short messages in way less than a second. Now it may take more time for the page to display at that point, but that's where the pagination should help.
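A hedged sketch of that pagination (LIMIT/OFFSET syntax as in MySQL/SQLite; the page size and names are arbitrary choices):

```python
import sqlite3

PAGE_SIZE = 50  # show 50 messages per page

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, message TEXT, recipient TEXT)"
)
conn.execute("CREATE INDEX idx_recipient ON messages (recipient)")
conn.executemany(
    "INSERT INTO messages (message, recipient) VALUES (?, ?)",
    [("msg %d" % i, "John") for i in range(200)],
)

def fetch_page(conn, recipient, page):
    # Pull only one page of records instead of the whole mailbox.
    return conn.execute(
        "SELECT message FROM messages WHERE recipient = ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (recipient, PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()

print(len(fetch_page(conn, "John", 0)))  # 50
```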
I would suggest, though, going through and thinking of future features, and building your database out a little more based on that. Adding more features to the software is easy; changing the database is comparatively harder.
#2
- Follow the rules of normalization. Try to reach third normal form; going further for this type of application probably isn't worth it. Keep your tables thin.
- Don't actually delete rows; just mark them as deleted with a bit flag. If you really need to remove them in some kind of maintenance/cleanup to reduce size, mark them as deleted and then create a cleanup process to archive or remove the records during low-usage hours.
- Integer values are easier for SQL Server to deal with than character values. So instead of WHERE recipient = 'John', use WHERE Recipient_ID = 23. You will gain this type of behavior when you normalize your database.
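A rough sketch of the mark-as-deleted approach (the column names and the integer recipient key are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        id           INTEGER PRIMARY KEY,
        message      TEXT,
        recipient_id INTEGER,
        deleted      INTEGER NOT NULL DEFAULT 0  -- bit flag instead of DELETE
    )
""")
conn.execute("INSERT INTO messages (message, recipient_id) VALUES ('hi', 23)")

# pull.php only flags messages as deleted, which is a cheap update:
conn.execute("UPDATE messages SET deleted = 1 WHERE recipient_id = 23")

# A separate off-hours cleanup job physically removes the flagged rows:
conn.execute("DELETE FROM messages WHERE deleted = 1")
print(conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0])  # 0
```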
#3
Don't use VARCHAR for your recipient. It's best to make a Recipient table with a primary key that is an integer (or bigint if you are expecting extremely large quantities of people).
Then when you do your select statements:
SELECT message FROM DB WHERE recipient = 52;
Retrieving rows will be much faster.
Plus, I believe MySQL indexes are B-Trees, which is O(log n) for most cases.
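A sketch of that normalization: a separate recipients table and an integer foreign key on the messages table (all names here are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE recipients (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(64) UNIQUE
    );
    CREATE TABLE messages (
        message      TEXT,
        recipient_id INTEGER REFERENCES recipients(id)
    );
    CREATE INDEX idx_msg_recipient ON messages (recipient_id);
""")
conn.execute("INSERT INTO recipients (id, name) VALUES (52, 'John')")
conn.execute("INSERT INTO messages VALUES ('hello', 52)")

# Resolve the name to an id once, then query by the integer key.
(rid,) = conn.execute(
    "SELECT id FROM recipients WHERE name = 'John'"
).fetchone()
rows = conn.execute(
    "SELECT message FROM messages WHERE recipient_id = ?", (rid,)
).fetchall()
print(rows)  # [('hello',)]
```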
#4
A database table without an index is called a heap. Querying a heap results in every row of the table being evaluated, even with a 'where' clause; the big-O complexity for a heap is O(n), with n being the number of rows in the table. Adding an index (and this really depends on the underlying aspects of your database engine) results in a complexity of O(log(n)) to find the matching row in the table, because the index is almost certainly implemented as some sort of B-tree. Adding rows to the table, even with an index present, is an O(1) operation.
> But for pulling down messages, my pull.php will take longer as the number of rows increase! I find the sql select (and delete) will take longer as the rows grow and this is true even after I have added an index for the recipient field.
UNLESS you are inserting into the middle of an index, at which point the database engine will need to shift rows down to accommodate. The same occurs when you delete from the index. Remember there is more than one kind of index. Be sure that the index you are using is not a clustered index as more data must be sifted through and moved with inserts and deletes.
FlySwat has given the best option available to you... do not use an RDBMS because your messages are not relational in a formal sense. You will get much better performance from a file system.
dbarker has also given correct answers. I do not know why he has been voted down 3 times, but I will vote him up at the risk that I may lose points. dbarker is referring to "Vertical Partitioning" and his suggestion is both acceptable and good. This isn't rocket surgery people.
My suggestion is to not implement this kind of functionality in your RDBMS; if you do, remember that select, update, insert, and delete all place locks on pages in your table. If you do go forward with putting this functionality into a database, then run your selects with a NOLOCK locking hint, if it is available on your platform, to increase concurrency. Additionally, if you have that many concurrent users, partition your tables vertically as dbarker suggested and place these database files on separate drives (not just separate volumes but separate hardware) to increase I/O concurrency.
#5
> So the question is, how to design this such that it scales? Any tips/hints?
Yes, you don't want to use a relational database for message queuing. What you are trying to do is not what a relational database is best designed for, and while you can do it, it's kind of like driving in a nail with a screwdriver.
Instead, look at one of the many open source message queues out there; the guys at SecondLife have a neat wiki where they reviewed a lot of them:
http://wiki.secondlife.com/wiki/Message_Queue_Evaluation_Notes
#6
This is an unavoidable problem: more messages, more time to find the requested ones. The only thing you can do is what you already did: add an index and turn the O(n) lookup time of a complete table scan into O(log(u) + m) for a clustered index lookup, where n is the total number of messages, u the number of users, and m the number of messages per user.
#7
Limit the number of rows that your pull.php will display at any one time.
The more data you transfer, the longer it will take to display the page, regardless of how great your DB is.
You must limit your data in the SQL query and return only the most recent N rows.
EDIT: Put an index on Recipient and it will speed things up. You'll need another column to distinguish rows if you want to take the top 50 or so, possibly SendDate or an auto-incrementing field. A clustered index will slow down inserts, so use a regular index there.
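One way to sketch that "most recent N" query, assuming an auto-incrementing id distinguishes rows; with a composite index leading on recipient, the top-N read can stop after N index entries instead of scanning the whole table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, message TEXT, recipient TEXT)"
)
# Composite (non-clustered) index: recipient first, then newest-first order.
conn.execute("CREATE INDEX idx_recip_id ON messages (recipient, id DESC)")
conn.executemany(
    "INSERT INTO messages (message, recipient) VALUES (?, ?)",
    [("msg %d" % i, "John") for i in range(100)],
)

recent = conn.execute(
    "SELECT message FROM messages WHERE recipient = ? "
    "ORDER BY id DESC LIMIT 50",
    ("John",),
).fetchall()
print(recent[0])  # ('msg 99',)
```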
#8
You could always have only one row per user and just concatenate messages together into one long record. If you're keeping messages for a long period of time, that isn't the best way to go, but it reduces your problem to a single find-and-concatenate at storage time and a single find at retrieve time. It's hard to say without more detail; part of what makes DB design hard is meeting all the goals of the system with good compromises. Without all the details, it's hard to give advice on the best compromise.
EDIT: I thought I was fairly clear on this, but evidently not: You would not do this unless you were blanking a reader's queue when he reads it. This is why I prompted for clarification.
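A rough sketch of the one-row-per-user idea, which only works because the queue is blanked on read (the delimiter and names are arbitrary; the upsert syntax shown requires SQLite 3.24+):

```python
import sqlite3

SEP = "\x1f"  # ASCII unit separator as the message delimiter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mailbox ("
    "recipient TEXT PRIMARY KEY, messages TEXT NOT NULL DEFAULT '')"
)

def send(conn, recipient, message):
    # Single-row upsert: append to the user's one record.
    conn.execute(
        "INSERT INTO mailbox (recipient, messages) VALUES (?, ?) "
        "ON CONFLICT(recipient) DO UPDATE SET messages = messages || ? || ?",
        (recipient, message, SEP, message),
    )

def pull(conn, recipient):
    # Single find at retrieve time, then blank the queue.
    row = conn.execute(
        "SELECT messages FROM mailbox WHERE recipient = ?", (recipient,)
    ).fetchone()
    conn.execute(
        "UPDATE mailbox SET messages = '' WHERE recipient = ?", (recipient,)
    )
    return row[0].split(SEP) if row and row[0] else []

send(conn, "John", "hi")
send(conn, "John", "bye")
print(pull(conn, "John"))  # ['hi', 'bye']
print(pull(conn, "John"))  # []
```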