A question about SQL GROUP BY

Time: 2021-05-25 09:33:46

I have a table named visiting that looks like this:

id | visitor_id | visit_time 
-------------------------------------
 1 |          1 | 2009-01-06 08:45:02 
 2 |          1 | 2009-01-06 08:58:11
 3 |          1 | 2009-01-06 09:08:23 
 4 |          1 | 2009-01-06 21:55:23
 5 |          1 | 2009-01-06 22:03:35

I want to write a SQL query that counts how many times a user visits within one session (a session being a run of successive visits with less than 1 hour between them).

So, for the example data, I want to get the following result:

visitor_id | count
-------------------
         1 |     3
         1 |     2

BTW, I use PostgreSQL 8.3. Thanks!

UPDATE: updated the timestamps in the example data table. Sorry for the confusion.
UPDATE: I don't care much whether the solution is a single SQL query or uses a stored procedure, subquery, etc. I only care how to get it done :)

7 solutions

#1


The question is slightly ambiguous because you're assuming, or requiring, that each one-hour window starts at a set point. A natural query would also return a result record of (1,2) for the visits between 08:58 and 09:58. You would have to "tell" your query that, for some determinable reason, the windows start at visits 1 and 4; otherwise you'd get the natural result set:

visitor_id | count 
--------------------
         1 | 3
         1 | 2 <- extra result starting at visit 2
         1 | 1 <- extra result starting at visit 3
         1 | 2
         1 | 1 <- extra result starting at visit 5

That extra logic is going to be expensive and too complicated for my fragile mind this morning; somebody better than me at Postgres can probably solve this.

I would normally want to solve this by having a sessionkey column in the table that I could cheaply group by, for performance reasons, but I think there's also a logical problem. Deriving session info from timings seems dangerous to me because I don't believe that the user will definitely be logged out after an hour's activity. Most session systems work by expiring the session after a period of inactivity, i.e. it's very likely that a visit after 9:45 is going to be in the same session, because your hourly period is going to be reset at 9:08.
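The "natural result set" described in this answer can be reproduced with a short sketch. This is only an illustration in Python, not a PostgreSQL solution; the function name `visits_in_following_hour` and the in-memory `VISITS` list are made up for the example, and the rows are the ones from the question.

```python
from datetime import datetime, timedelta

# Example rows from the question: (id, visitor_id, visit_time).
VISITS = [
    (1, 1, datetime(2009, 1, 6, 8, 45, 2)),
    (2, 1, datetime(2009, 1, 6, 8, 58, 11)),
    (3, 1, datetime(2009, 1, 6, 9, 8, 23)),
    (4, 1, datetime(2009, 1, 6, 21, 55, 23)),
    (5, 1, datetime(2009, 1, 6, 22, 3, 35)),
]


def visits_in_following_hour(visits, window=timedelta(hours=1)):
    """For each visit, count the same visitor's visits in [t, t + window)."""
    result = []
    for _id, visitor, start in visits:
        count = sum(
            1
            for _, other_visitor, t in visits
            if other_visitor == visitor and start <= t < start + window
        )
        result.append((visitor, count))
    return result


natural = visits_in_following_hour(VISITS)
print(natural)  # one (visitor_id, count) pair per visit, not per session
```

Every visit opens its own window here, which is why the output has five rows instead of the two the question asks for.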

#2


The problem seems a little fuzzy.

It gets more complicated as id 3 is within an hour of id 1 and 2, but if the user had visited at 9:50 then that would have been within an hour of 2 but not 1.

You seem to be after a smoothed total - for a given visit, how many visits are within the following hour?

Perhaps you should be asking how many visits have a succeeding visit less than an hour distant? If a visit is less than an hour from the preceding one, then should it 'count'?

So what you probably want is the chains of visits in which each link is shorter than some arbitrary interval (so the hypothetical 9:50 visit would be included in the chain that starts with id 1).
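A minimal sketch of that chain idea, in Python rather than SQL and assuming a one-hour gap as the link limit. The name `session_counts` is hypothetical, and the 9:50 visit is the made-up one from this answer, not a row in the question's table.

```python
from datetime import datetime, timedelta


def session_counts(times, gap=timedelta(hours=1)):
    """Split one visitor's visit times into chains and return each chain's length.

    A new chain starts whenever a visit is `gap` or more after the previous one.
    """
    counts = []
    previous = None
    for t in sorted(times):
        if previous is None or t - previous >= gap:
            counts.append(1)   # first visit of a new chain
        else:
            counts[-1] += 1    # still linked to the current chain
        previous = t
    return counts


# Visitor 1's visits from the question.
visitor_1 = [
    datetime(2009, 1, 6, 8, 45, 2),
    datetime(2009, 1, 6, 8, 58, 11),
    datetime(2009, 1, 6, 9, 8, 23),
    datetime(2009, 1, 6, 21, 55, 23),
    datetime(2009, 1, 6, 22, 3, 35),
]

question_counts = session_counts(visitor_1)

# Adding the hypothetical 9:50 visit: it is more than an hour after id 1,
# but less than an hour after id 3, so it joins the first chain.
with_950 = session_counts(visitor_1 + [datetime(2009, 1, 6, 9, 50, 0)])

print(question_counts, with_950)
```

Because each visit is compared only with the one before it, a chain can run longer than an hour in total as long as no single gap reaches the limit.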

#3


no simple solution

There is no simple way to do this in a single SQL statement.
Below are 2 ideas: one uses a loop to count visits, the other changes the way the visiting table is populated.

loop solution

However, it can be done without too much trouble with a loop.
(I have tried to get the PostgreSQL syntax correct, but I'm no expert.)

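/* find entries where there is no previous entry for */
/* the same visitor within the previous hour; each   */
/* such entry starts a session and counts as visit 1 */

create temp table temp_table as
select v1.visitor_id,
       v1.visit_time as first_time,
       v1.visit_time as last_time,
       1             as visits
from   visiting v1
where  not exists ( select 1
                    from   visiting v2
                    where  v2.visitor_id = v1.visitor_id
                    and    v2.visit_time < v1.visit_time
                    and    v1.visit_time - interval '1 hour' < v2.visit_time
                  );

/* extend every session by its next visit until no     */
/* session grows any more (needs the plpgsql language) */

create or replace function count_session_visits() returns void as $$
begin
    loop
        update temp_table t
        set    visits    = t.visits + 1,
               last_time = v.visit_time
        from   visiting v
        where  v.visitor_id = t.visitor_id
        and    v.visit_time > t.last_time
        and    v.visit_time - interval '1 hour' < t.last_time
        and    not exists ( select 1
                            from   visiting v2
                            where  v2.visitor_id = t.visitor_id
                            and    v2.visit_time > t.last_time
                            and    v2.visit_time < v.visit_time
                          );
        exit when not found;
    end loop;
end;
$$ language plpgsql;

select count_session_visits();

/* get the result: */

select visitor_id,
       visits
from   temp_table
order by visitor_id, first_time;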

The idea here is to do this:

  • get all visits where there is no prior visit inside of an hour.
    • this identifies the sessions

  • loop, getting the next visit for each of these "first visits"
    • until there are no more "next visits"

  • now you can just read off the number of visits in each session.

best solution?

I suggest:

  • add a column to the visiting table: session_id int not null

  • change the process which makes the entries so that it checks whether the previous visit by the current visitor was less than an hour ago. If so, it sets session_id to the session id of that earlier visit. If not, it generates a new session_id.

  • you could put this logic in a trigger.

Then your original query can be solved by:

SELECT session_id, visitor_id, count(*)
FROM   visiting 
GROUP BY session_id, visitor_id
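The session_id bookkeeping suggested above could be sketched like this. It is a minimal in-memory illustration in Python, not a PostgreSQL trigger: `VisitLog`, `record`, `visits_per_session` and `SESSION_GAP` are hypothetical names, and the sketch assumes visits are recorded in chronological order.

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import count

SESSION_GAP = timedelta(hours=1)  # assumed inactivity threshold


class VisitLog:
    """In-memory stand-in for the visiting table with a session_id column."""

    def __init__(self):
        self.rows = []            # (session_id, visitor_id, visit_time)
        self._last = {}           # visitor_id -> (session_id, last visit_time)
        self._next_id = count(1)  # source of fresh session ids

    def record(self, visitor_id, visit_time):
        """Insert a visit, reusing the previous session_id if it is recent enough."""
        last = self._last.get(visitor_id)
        if last is not None and visit_time - last[1] < SESSION_GAP:
            session_id = last[0]
        else:
            session_id = next(self._next_id)
        self._last[visitor_id] = (session_id, visit_time)
        self.rows.append((session_id, visitor_id, visit_time))
        return session_id

    def visits_per_session(self):
        """Same grouping as GROUP BY session_id, visitor_id with count(*)."""
        return Counter((s, v) for s, v, _ in self.rows)


log = VisitLog()
for stamp in [
    datetime(2009, 1, 6, 8, 45, 2),
    datetime(2009, 1, 6, 8, 58, 11),
    datetime(2009, 1, 6, 9, 8, 23),
    datetime(2009, 1, 6, 21, 55, 23),
    datetime(2009, 1, 6, 22, 3, 35),
]:
    log.record(1, stamp)

per_session = dict(log.visits_per_session())
print(per_session)  # keys are (session_id, visitor_id)
```

The point of doing the work at insert time is that the reporting query stays a plain GROUP BY; the cost is that rows inserted out of order or backfilled later would need their session ids recomputed.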

Hope this helps. If I've made mistakes (I'm sure I have), leave a comment and I'll correct it.

#4


PostgreSQL 8.4 will have window functions; by then we can stop creating a temporary table just to simulate row numbers (for sequencing).

create table visit
(
visitor_id int not null,
visit_time timestamp not null
);

insert into visit(visitor_id, visit_time) 
values
(1, '2009-01-06 08:45:02'),
(2, '2009-02-06 08:58:11'),
(1, '2009-01-06 08:58:11'),
(1, '2009-01-06 09:08:23'),
(1, '2009-01-06 21:55:23'),
(2, '2009-02-06 08:59:11'),
(2, '2009-02-07 00:01:00'),
(1, '2009-01-06 22:03:35');

-- number the visits in visitor/time order; the serial column stands in for row numbers
create temp table temp_visit(visitor_id int not null, sequence serial not null, visit_time timestamp not null);
insert into temp_visit(visitor_id, visit_time) select visitor_id, visit_time from visit order by visitor_id, visit_time;

-- for each visitor, count the visits that follow the previous visit by less than 1 hour
select 
    reference.visitor_id, count(nullif(reference.visit_time - prev.visit_time < interval '1 hour',false))
from temp_visit reference
left join temp_visit prev 
on prev.visitor_id = reference.visitor_id and prev.sequence = reference.sequence - 1
group by reference.visitor_id;
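For comparison, here is a rough sketch of how window functions could produce one row per session. It runs the SQL through Python's `sqlite3` module purely so the example is self-contained; SQLite (3.25 or later is assumed) stands in for PostgreSQL 8.4+, and the `strftime('%s', ...)` arithmetic is SQLite-specific. In PostgreSQL the gap test would compare timestamps with `interval '1 hour'` instead. The names `QUERY`, `ROWS` and `sessions_with_window_functions` are made up for the example.

```python
import sqlite3

# Same sample rows as the insert statement above.
ROWS = [
    (1, "2009-01-06 08:45:02"),
    (2, "2009-02-06 08:58:11"),
    (1, "2009-01-06 08:58:11"),
    (1, "2009-01-06 09:08:23"),
    (1, "2009-01-06 21:55:23"),
    (2, "2009-02-06 08:59:11"),
    (2, "2009-02-07 00:01:00"),
    (1, "2009-01-06 22:03:35"),
]

QUERY = """
WITH gaps AS (
    SELECT visitor_id,
           visit_time,
           CAST(strftime('%s', visit_time) AS INTEGER)
             - CAST(strftime('%s', LAG(visit_time) OVER (
                   PARTITION BY visitor_id ORDER BY visit_time)) AS INTEGER)
             AS gap_seconds          -- NULL for a visitor's first visit
    FROM visit
),
numbered AS (
    SELECT visitor_id,
           -- running total of session starts = session number per visitor
           SUM(CASE WHEN gap_seconds IS NULL OR gap_seconds >= 3600
                    THEN 1 ELSE 0 END)
             OVER (PARTITION BY visitor_id ORDER BY visit_time) AS session_no
    FROM gaps
)
SELECT visitor_id, COUNT(*) AS visits
FROM numbered
GROUP BY visitor_id, session_no
ORDER BY visitor_id, session_no
"""


def sessions_with_window_functions(rows):
    """Load the rows into an in-memory table and count visits per session."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visit (visitor_id INTEGER NOT NULL, visit_time TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO visit VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


sessions = sessions_with_window_functions(ROWS)
print(sessions)  # (visitor_id, visits) per session
```

The two-step shape (flag session starts with `LAG`, then number sessions with a running `SUM`) is a common way to express gap-based grouping once window functions are available.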

#5


One or both of these may work? However, both will end up giving you more columns in the result than you are asking for.

SELECT visitor_id,
       date_part('year', visit_time),
       date_part('month', visit_time),
       date_part('day', visit_time),
       date_part('hour', visit_time),
       COUNT(*)
  FROM visiting
 GROUP BY 1, 2, 3, 4, 5;


SELECT visitor_id,
       EXTRACT(EPOCH FROM visit_time)-(EXTRACT(EPOCH FROM visit_time) % 3600),
       COUNT(*)
  FROM visiting
 GROUP BY 1, 2;
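Note that both queries bucket visits by clock hour rather than by the gap between visits. A small Python sketch of the same bucketing, using the question's updated timestamps for visitor 1, shows what that means in practice; `visits_per_clock_hour` is a hypothetical helper, not something from PostgreSQL.

```python
from collections import Counter
from datetime import datetime


def visits_per_clock_hour(times):
    """Count visits per calendar hour, like grouping on date_part('hour', ...)."""
    buckets = Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in times
    )
    return sorted(buckets.items())


visitor_1 = [
    datetime(2009, 1, 6, 8, 45, 2),
    datetime(2009, 1, 6, 8, 58, 11),
    datetime(2009, 1, 6, 9, 8, 23),
    datetime(2009, 1, 6, 21, 55, 23),
    datetime(2009, 1, 6, 22, 3, 35),
]

hourly = visits_per_clock_hour(visitor_1)
for bucket_start, visits in hourly:
    print(bucket_start, visits)
```

The 09:08 visit lands in a different bucket from the 08:45 and 08:58 visits even though it is only ten minutes after the latter, so hour buckets match the requested sessions only when a session happens not to cross an hour boundary.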

#6


This can't be done in a single SQL statement. The better option is to handle it in a stored procedure.

#7


If it were T-SQL, I would write something like:

SELECT  visitor_id, COUNT(id), 
        DATEPART(yy, visit_time), DATEPART(m, visit_time), 
        DATEPART(d, visit_time), DATEPART(hh, visit_time)
FROM visiting
GROUP BY
    visitor_id, 
    DATEPART(yy, visit_time), DATEPART(m, visit_time), 
    DATEPART(d, visit_time), DATEPART(hh, visit_time)

which gives me:

1   3   2009    1   6   8
1   2   2009    1   6   21

I do not know how, or if, you can write this in PostgreSQL though.
