One-to-many relationship between table columns: grouping and finding combinations

Time: 2021-03-08 01:57:10

In sample table t0:

OrderID | ProductID
 0001      1254
 0001      1252
 0002      0038
 0003      1254
 0003      1252
 0003      1432
 0004      0038
 0004      1254
 0004      1252  

I need to find the most popular combination of two ProductIDs under one OrderID. The purpose is to decide which products are more likely to be sold together in one order, e.g. phone and handsfree. I think the logic is to group by OrderID, calculate every possible combination of ProductID pairs, count them per OrderID and select the TOP 2, but I really can't tell if it is doable.
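The logic described in the question can be checked outside the database first. The following is a minimal Python sketch, not part of the original thread: it groups the sample rows by OrderID, builds every unordered product pair per order, and counts how many orders contain each pair. The function name `count_pairs` and the choice to keep IDs as strings are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Sample rows from t0 as (OrderID, ProductID); IDs kept as strings to preserve leading zeros.
rows = [
    ("0001", "1254"), ("0001", "1252"),
    ("0002", "0038"),
    ("0003", "1254"), ("0003", "1252"), ("0003", "1432"),
    ("0004", "0038"), ("0004", "1254"), ("0004", "1252"),
]

def count_pairs(rows):
    """Count how many orders contain each unordered product pair (hypothetical helper)."""
    products_by_order = {}
    for order_id, product_id in rows:
        # A set ignores the same product appearing on several lines of one order.
        products_by_order.setdefault(order_id, set()).add(product_id)
    pair_counts = Counter()
    for products in products_by_order.values():
        # Sorting first makes (a, b) and (b, a) the same key.
        pair_counts.update(combinations(sorted(products), 2))
    return pair_counts

pair_counts = count_pairs(rows)
print(pair_counts.most_common(1))
```

On the sample data the pair (1252, 1254) appears in three orders and every other pair in one, so the task is doable; the open question is only how to express it in SQL.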

4 solutions

#1


2  

A "self-join" may be used, ensuring that one of the product IDs is greater than the other so that each "pair" of products is produced only once per order. Then it is simple to count:

Demo

CREATE TABLE OrderDetail
    ([OrderID] int, [ProductID] int)
;

INSERT INTO OrderDetail
    ([OrderID], [ProductID])
VALUES
    (0001, 1254), (0001, 1252), (0002, 0038), (0003, 1254), (0003, 1252), (0003, 1432), (0004, 0038), (0004, 1254), (0004, 1252)
;

Query 1:

select -- top(2)
      od1.ProductID, od2.ProductID, count(*) count_of
from OrderDetail od1
inner join OrderDetail od2 on od1.OrderID = od2.OrderID and od2.ProductID > od1.ProductID
group by
      od1.ProductID, od2.ProductID
order by
      count_of DESC

Results:

| ProductID | ProductID | count_of |
|-----------|-----------|----------|
|      1252 |      1254 |        3 |
|      1252 |      1432 |        1 |
|      1254 |      1432 |        1 |
|        38 |      1252 |        1 |
|        38 |      1254 |        1 |

----

With respect to displaying the "top 2" or whatever: you are likely to get "equal top" results, so I would suggest you use dense_rank(), and you may even want to "unpivot" the result so you have a single column of product IDs with their associated rank. How often you perform this and/or store the result I leave to you.

with ProductPairs as (
      select 
             p1, p2, count_pair
          , dense_rank() over(order by count_pair DESC) as ranked
      from (
            select
                  od1.ProductID p1, od2.ProductID p2, count(*) count_pair
            from OrderDetail od1
            inner join OrderDetail od2 on od1.OrderID = od2.OrderID and od2.ProductID > od1.ProductID
            group by
                  od1.ProductID, od2.ProductID
            ) d
      )
, RankedProducts as (
       select p1 as ProductID, ranked, count_pair
       from ProductPairs
       union all
       select p2 as ProductID, ranked, count_pair
       from ProductPairs
       )
select *
from RankedProducts
where ranked <= 2
order by ranked, ProductID
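To see what dense ranking does with ties, here is a small Python sketch (an illustration, not the answerer's code). It assigns dense ranks to the pair counts from the Results table above; the helper name `dense_rank` is hypothetical.

```python
def dense_rank(pair_counts):
    """Rank 1 goes to the highest count; tied counts share a rank and no rank is skipped."""
    distinct_counts = sorted(set(pair_counts.values()), reverse=True)
    rank_of_count = {count: rank for rank, count in enumerate(distinct_counts, start=1)}
    return {pair: rank_of_count[count] for pair, count in pair_counts.items()}

# Pair counts copied from the Results table of Query 1.
pair_counts = {
    (1252, 1254): 3,
    (1252, 1432): 1,
    (1254, 1432): 1,
    (38, 1252): 1,
    (38, 1254): 1,
}

ranks = dense_rank(pair_counts)
# Equivalent of the "where ranked <= 2" filter.
top_two_ranks = sorted(pair for pair, rank in ranks.items() if rank <= 2)
print(top_two_ranks)
```

Because four pairs tie on a count of 1, they all share rank 2, so filtering on rank <= 2 keeps all five pairs from the sample data rather than exactly two rows.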

#2


1  

Try using the following command:

-- orderID is left out of SELECT and GROUP BY so that each pair is counted across all orders
SELECT T1.productId, T2.productID, Count(*) as Occurence
FROM TBL T1 INNER JOIN TBL T2
ON T1.orderid = T2.orderid
WHERE T1.productid > T2.productId
GROUP BY T1.productId, T2.productID
ORDER BY Occurence DESC

SQL fiddle

#3


1  

  WITH products as (
       SELECT DISTINCT ProductID
       FROM orders
  ),  permutations as (
      SELECT p1.ProductID as pidA, 
             p2.ProductID as pidB
      FROM products p1
      JOIN products p2
        ON p1.ProductID < p2.ProductID
  ), check_frequency as (
      SELECT pidA, pidB, COUNT (o2.orderID) total_orders
      FROM permutations p
      LEFT JOIN orders o1
        ON p.pidA = o1.ProductID
      LEFT JOIN orders o2
        ON p.pidB = o2.ProductID
       AND o1.orderID = o2.orderID
      GROUP BY pidA, pidB
  )
  SELECT TOP 2 *
  FROM check_frequency
  ORDER BY total_orders DESC

#4


1  

The following query calculates the number of two-way combinations among all orders in Orderline:

SELECT SUM(numprods * (numprods - 1)/2) as numcombo2 
FROM ( SELECT orderid, COUNT(DISTINCT productid) as numprods
      FROM orderline ol 
      GROUP BY orderid ) o

Notice that this query counts distinct products rather than order lines, so orders with the same product on multiple lines do not affect the count. The number of two-way combinations is 185,791. This is useful because the number of combinations pretty much determines how quickly the query generating them runs. A single order with a large number of products can seriously degrade performance. For instance, if one order contains a thousand products, there would be about five hundred thousand two-way combinations in just that one order, versus 185,791 in all the orders data. As the number of products in the largest order increases, the number of combinations increases much faster.

The approach for calculating the combinations is to do a self-join on the Orderline table, with duplicate product pairs removed. The goal is to get all pairs of products subject to the conditions:

  • The two products in the pair are different
  • No two combinations have the same two products.

The first condition is easily met by filtering out any pairs where the two products are equal. The second condition is also easily met, by requiring that the first product id be smaller than the second product id. The following query generates all the combinations in a subquery and counts the number of orders containing each one:

SELECT p1, p2, COUNT(*) as numorders
FROM (SELECT op1.orderid, op1.productid as p1, op2.productid as p2
      FROM (SELECT DISTINCT orderid, productid FROM orderline) op1 JOIN
           (SELECT DISTINCT orderid, productid FROM orderline) op2
           ON op1.orderid = op2.orderid AND
              op1.productid < op2.productid
     ) combinations
GROUP BY p1, p2

Source: Data Analysis Using SQL and Excel
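The n(n-1)/2 arithmetic behind the combination count in this answer can be reproduced in a few lines. This is a rough Python sketch, not taken from the book; the function name `two_way_combinations` and the use of the question's four sample orders are assumptions for illustration.

```python
from math import comb

def two_way_combinations(products_per_order):
    """Sum n*(n-1)/2 over orders, where n is the number of distinct products in an order."""
    return sum(comb(n, 2) for n in products_per_order)

# Distinct-product counts for orders 0001..0004 in the question's sample table.
sample_total = two_way_combinations([2, 1, 3, 3])

# A single hypothetical order with 1,000 distinct products.
large_order_total = two_way_combinations([1000])

print(sample_total, large_order_total)
```

The four sample orders yield 7 pairs in total, while one 1,000-product order alone yields 499,500, which is the "about five hundred thousand" figure quoted in the answer and shows why the largest order dominates the cost of the self-join.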
