I have a dataset of rows each with an 'odds' number between 1 and 100. I am looking to do it in the most efficient way possible. The odds do not necessarily add up to 100.
我有一个数据集,每一行都有一个“概率”在1到100之间。我希望以最有效的方式来做这件事。几率并不一定等于100。
I have had a few ideas.
我有一些想法。
a) Select the whole dataset and then add all the odds up and generate a random number between 1 and that number. Then loop through the dataset deducting the odds from the number until it is 0.
a)选择整个数据集,然后将所有的概率加起来,并在1和那个数之间生成一个随机数。然后遍历数据集,从数字中扣除概率,直到它为0。
I was hoping to minimize the impact on the database so I considered if I could only select the rows I needed.
我希望最小化对数据库的影响,因此我考虑是否只能选择需要的行。
b)
b)
SELECT * FROM table WHERE (100*RAND()) < odds
I considered LIMIT 0,1
我认为限制0,1
But then if items have the same probability only one of the will be returned
但是如果项目有相同的概率只有一个会被返回
Alternatively take the whole dataset and pick a random one from there... but then the odds are affected as it becomes a random with odds and then a random without odds thus the odds become tilted in favour of the higher odds (even more so).
或者取整个数据集从那里随机选一个…但是当它变成一个随机的有奇数的时候,概率就会受到影响,然后变成一个随机的没有奇数的时候,概率就会倾向于高的概率(甚至更多)。
I guess I could order by odds
ASC then take the whole dataset and then with PHP take a random out of the rows with the same odds as the first record (the lowest).
我想我可以按概率排序,然后取整个数据集,然后用PHP随机抽取出与第一个记录相同的概率(最低的)。
Seems like a clumsy solution.
看起来是一个笨拙的解决方案。
Does anyone have a superior solution? If not which one of the above is best?
有人有更好的解决方案吗?如果不是以上哪一个是最好的?
7 个解决方案
#1
3
Do some up-front work, add some columns to your table that help the selection. For example suppose you have these rows
做一些预先的工作,向表中添加一些有助于选择的列。例如,假设你有这些行
X 2
Y 3
Z 1
We add some cumulative values
我们添加一些累积值
Key Odds Start End
X 2 0 1 // range 0->1, 2 values == odds
Y 3 2 4 // range 2->4, 3 values == odds
Z 1 5 5 // range 5->5, 1 value == odds
Start and End are chosen as follows. The first row has a start of zero. Subsequent rows have a start one more than previous end. End is the (Start + Odds - 1).
开始和结束的选择如下。第一行从0开始。后面的行比前面的行多一个开始行。结束是(开始+赔率- 1)。
Now pick a random number R in the range 0 to Max(End)
现在选择一个随机数字R在0到最大值之间(结束)
Select * from T where R >= T.Start and R <= T.End
If the database is sufficiently clever we may we be able to use
如果这个数据库足够聪明,我们也许可以使用它
Select * from T where R >= T.Start and R <= (T.Start + T.Odds - 1)
I'm speculating that having an End column with an index may give the better performance. Also the Max(End) perhaps gets stashed somewhere and updated by a trigger when ncessary.
我猜测,有一个带有指数的结束列可能会带来更好的表现。同时,最大(端)也可能被隐藏在某处,并在未处理时被触发器更新。
Clearly there's some hassle in updating the Start/End. This may not be too bad if either
显然,更新开始/结束有一些麻烦。如果这两种情况都发生的话,这也不算太糟
- The table contents are stable
- 表内容是稳定的
- or insertions are in someway naturally ordered, so that each new row just continues from the old highest.
- 或者插入以某种自然的方式排序,因此每个新行都从旧的最高行继续。
#2
0
What if you took your code, and added an ORDER BY RAND()
and LIMIT 1
?
如果您取了您的代码,并添加了RAND()和LIMIT 1的订单,该怎么办?
SELECT * FROM table WHERE (100*RAND()) < odds ORDER BY RAND() LIMIT 1
This way, even if you have multiples of the same probability, it will always come back randomly ordered, then you just take the first entry.
这样,即使有相同概率的倍数,它总是随机返回,然后取第一个元素。
#3
0
select * from table
where id between 1 and 100 and ((id % 2) <> 0)
order by NewId()
#4
0
Hmm. Not entirely clear what result you want, so bear with me if this is a bit crazy. That being said, how about:
嗯。我不太清楚你想要什么结果,如果这有点疯狂,请接受我的建议。也就是说,
Make a new table. The table is a fixed data table, and looks like this:
创建一个新表。该表是一个固定的数据表,如下所示:
Odds
====
1
2
2
3
3
3
4
4
4
4
etc,
etc.
Then join from your dataset to that table on the odds column. You'll get as many rows back for each row in your table as the given odds of that row.
然后从数据集连接到odds列上的该表。对于表中的每一行,您将获得与该行给定的概率相同的行数。
Then just pick one of that set at random.
然后随便选一个。
#5
0
If you have an index on the odds column, and a primary key, this would be very efficient:
如果您在odds列上有一个索引和一个主键,这将非常有效:
SELECT id, odds FROM table WHERE odds > 0
The database wouldn't even have to read from the table, it would get everything it needed from the odds index.
数据库甚至不需要从表中读取数据,它将从概率索引中得到所需的所有信息。
Then, you'll select a random value between 1 and the number of rows returned.
然后,在返回的行数和1之间选择一个随机值。
Then select that row from the array of rows returned.
然后从返回的行数组中选择这一行。
Then, finally, select the whole target row:
然后,最后选择整个目标行:
SELECT * FROM table WHERE id = ?
This assures an even distribution between all rows with an odds value.
这保证了所有行之间具有奇数值的均匀分布。
Alternatively, put the odds in a different table, with an autoincrement primary key.
或者,使用自动递增主键将赔率放在不同的表中。
Odds
ID odds
1 4
2 9
3 56
4 12
Store the ID foreign key in the main table instead of the odds value, and index it.
将ID外键存储在主表中而不是odds值中,并对其进行索引。
First, get the max value. This never touches the database. It uses the index:
首先,获取最大值。这不会涉及到数据库。它使用指数:
SELECT MAX(ID) FROM Odds
Get a random value between 1 and the max.
取1到最大值之间的随机值。
Then select the record.
然后选择记录。
SELECT * FROM table
JOIN Odds ON Odds.ID = table.ID
WHERE Odds.ID >= ?
LIMIT 1
This will require some maintenance if you tend to delete Odds value or roll back inserts to keep the distribution even.
如果您倾向于删除优势值或回滚插入以保持分布均匀,这将需要一些维护。
There is a whole chapter on random selection in the book SQL Antipatterns.
在本书的SQL反模式中有一整章的随机选择。
#6
0
I didn't try it, but maybe something like this (with ? a random number from 0 to SUM(odds) - 1
)?
我没试过,但可能是这样的(有吗?)从0到SUM(赔率)- 1的随机数?
SET @prob := 0;
SELECT
T.*,
(@prob := @prob + T.odds) AS prob
FROM table T
WHERE prob > ?
LIMIT 1
This is basically the same as your idea a), but entirely within one (well, technically two if you count the variable set-up) SQL commands.
这基本上和你的想法是一样的),但是完全在一个(好吧,如果你计算变量设置)的SQL命令。
#7
0
A general solution, suitable for O(log(n)) updates, is something like this:
一个适用于O(log(n))更新的通用解决方案是这样的:
- Store objects as leaves of a (balanced) tree.
- 将对象存储为(平衡的)树的叶子。
- At each branch node, store the weights of all objects under it.
- 在每个分支节点上,存储它下面所有对象的权重。
- When adding, removing, or modifying nodes, update weights of their parents.
- 添加、删除或修改节点时,更新其父节点的权重。
Then pick a number between 0 and (total weight - 1) and navigate down the tree until you find the right object.
然后在0和(总重量- 1)之间选择一个数字,然后沿着树导航,直到找到正确的对象。
Since you don't care about the order of things in the tree, you can store them as an array of N pointers and N-1 numbers.
由于您不关心树中事物的顺序,您可以将它们存储为N个指针和N-1个数字的数组。
#1
3
Do some up-front work, add some columns to your table that help the selection. For example suppose you have these rows
做一些预先的工作,向表中添加一些有助于选择的列。例如,假设你有这些行
X 2
Y 3
Z 1
We add some cumulative values
我们添加一些累积值
Key Odds Start End
X 2 0 1 // range 0->1, 2 values == odds
Y 3 2 4 // range 2->4, 3 values == odds
Z 1 5 5 // range 5->5, 1 value == odds
Start and End are chosen as follows. The first row has a start of zero. Subsequent rows have a start one more than previous end. End is the (Start + Odds - 1).
开始和结束的选择如下。第一行从0开始。后面的行比前面的行多一个开始行。结束是(开始+赔率- 1)。
Now pick a random number R in the range 0 to Max(End)
现在选择一个随机数字R在0到最大值之间(结束)
Select * from T where R >= T.Start and R <= T.End
If the database is sufficiently clever we may we be able to use
如果这个数据库足够聪明,我们也许可以使用它
Select * from T where R >= T.Start and R <= (T.Start + T.Odds - 1)
I'm speculating that having an End column with an index may give the better performance. Also the Max(End) perhaps gets stashed somewhere and updated by a trigger when ncessary.
我猜测,有一个带有指数的结束列可能会带来更好的表现。同时,最大(端)也可能被隐藏在某处,并在未处理时被触发器更新。
Clearly there's some hassle in updating the Start/End. This may not be too bad if either
显然,更新开始/结束有一些麻烦。如果这两种情况都发生的话,这也不算太糟
- The table contents are stable
- 表内容是稳定的
- or insertions are in someway naturally ordered, so that each new row just continues from the old highest.
- 或者插入以某种自然的方式排序,因此每个新行都从旧的最高行继续。
#2
0
What if you took your code, and added an ORDER BY RAND()
and LIMIT 1
?
如果您取了您的代码,并添加了RAND()和LIMIT 1的订单,该怎么办?
SELECT * FROM table WHERE (100*RAND()) < odds ORDER BY RAND() LIMIT 1
This way, even if you have multiples of the same probability, it will always come back randomly ordered, then you just take the first entry.
这样,即使有相同概率的倍数,它总是随机返回,然后取第一个元素。
#3
0
select * from table
where id between 1 and 100 and ((id % 2) <> 0)
order by NewId()
#4
0
Hmm. Not entirely clear what result you want, so bear with me if this is a bit crazy. That being said, how about:
嗯。我不太清楚你想要什么结果,如果这有点疯狂,请接受我的建议。也就是说,
Make a new table. The table is a fixed data table, and looks like this:
创建一个新表。该表是一个固定的数据表,如下所示:
Odds
====
1
2
2
3
3
3
4
4
4
4
etc,
etc.
Then join from your dataset to that table on the odds column. You'll get as many rows back for each row in your table as the given odds of that row.
然后从数据集连接到odds列上的该表。对于表中的每一行,您将获得与该行给定的概率相同的行数。
Then just pick one of that set at random.
然后随便选一个。
#5
0
If you have an index on the odds column, and a primary key, this would be very efficient:
如果您在odds列上有一个索引和一个主键,这将非常有效:
SELECT id, odds FROM table WHERE odds > 0
The database wouldn't even have to read from the table, it would get everything it needed from the odds index.
数据库甚至不需要从表中读取数据,它将从概率索引中得到所需的所有信息。
Then, you'll select a random value between 1 and the number of rows returned.
然后,在返回的行数和1之间选择一个随机值。
Then select that row from the array of rows returned.
然后从返回的行数组中选择这一行。
Then, finally, select the whole target row:
然后,最后选择整个目标行:
SELECT * FROM table WHERE id = ?
This assures an even distribution between all rows with an odds value.
这保证了所有行之间具有奇数值的均匀分布。
Alternatively, put the odds in a different table, with an autoincrement primary key.
或者,使用自动递增主键将赔率放在不同的表中。
Odds
ID odds
1 4
2 9
3 56
4 12
Store the ID foreign key in the main table instead of the odds value, and index it.
将ID外键存储在主表中而不是odds值中,并对其进行索引。
First, get the max value. This never touches the database. It uses the index:
首先,获取最大值。这不会涉及到数据库。它使用指数:
SELECT MAX(ID) FROM Odds
Get a random value between 1 and the max.
取1到最大值之间的随机值。
Then select the record.
然后选择记录。
SELECT * FROM table
JOIN Odds ON Odds.ID = table.ID
WHERE Odds.ID >= ?
LIMIT 1
This will require some maintenance if you tend to delete Odds value or roll back inserts to keep the distribution even.
如果您倾向于删除优势值或回滚插入以保持分布均匀,这将需要一些维护。
There is a whole chapter on random selection in the book SQL Antipatterns.
在本书的SQL反模式中有一整章的随机选择。
#6
0
I didn't try it, but maybe something like this (with ? a random number from 0 to SUM(odds) - 1
)?
我没试过,但可能是这样的(有吗?)从0到SUM(赔率)- 1的随机数?
SET @prob := 0;
SELECT
T.*,
(@prob := @prob + T.odds) AS prob
FROM table T
WHERE prob > ?
LIMIT 1
This is basically the same as your idea a), but entirely within one (well, technically two if you count the variable set-up) SQL commands.
这基本上和你的想法是一样的),但是完全在一个(好吧,如果你计算变量设置)的SQL命令。
#7
0
A general solution, suitable for O(log(n)) updates, is something like this:
一个适用于O(log(n))更新的通用解决方案是这样的:
- Store objects as leaves of a (balanced) tree.
- 将对象存储为(平衡的)树的叶子。
- At each branch node, store the weights of all objects under it.
- 在每个分支节点上,存储它下面所有对象的权重。
- When adding, removing, or modifying nodes, update weights of their parents.
- 添加、删除或修改节点时,更新其父节点的权重。
Then pick a number between 0 and (total weight - 1) and navigate down the tree until you find the right object.
然后在0和(总重量- 1)之间选择一个数字,然后沿着树导航,直到找到正确的对象。
Since you don't care about the order of things in the tree, you can store them as an array of N pointers and N-1 numbers.
由于您不关心树中事物的顺序,您可以将它们存储为N个指针和N-1个数字的数组。