I have a simple SQL table that looks like this:
CREATE TABLE msg (
    from_person character varying(10),
    from_location character varying(10),
    to_person character varying(10),
    to_location character varying(10),
    msglength integer,
    ts timestamp without time zone
);
For each row in the table, I want to find out whether a different 'from_person' and 'from_location' has interacted with the 'to_person' of the current row in the last 3 minutes.
For example, in the table above, for row #4, other than mary from Mumbai (current row), nancy from NYC and bob from Barcelona have also sent a message to charlie in the last 3 minutes, so the count is 2.
Similarly, for row #2, other than bob from Barcelona (current row), only nancy from NYC has sent a message to charlie in ca (current row), so the count is 1.
Example desired output:
0
1
0
2
I tried using a window function, but it seems that in the frame clause I can only specify a number of rows before and after, not a span of time.
3 solutions
#1
4
Every table in Postgres should have a primary key, even though Postgres does not enforce one. It would be great if you had a primary key defining the expected order of rows.
Example data:
create table msg (
    id int primary key,
    from_person text,
    to_person text,
    ts timestamp without time zone
);
insert into msg values
    (1, 'nancy', 'charlie', '2016-02-01 01:00:00'),
    (2, 'bob', 'charlie', '2016-02-01 01:00:00'),
    (3, 'charlie', 'nancy', '2016-02-01 01:00:01'),
    (4, 'mary', 'charlie', '2016-02-01 01:02:00');
The query:
select m1.id, count(m2)
from msg m1
left join msg m2
    on m2.id < m1.id
    and m2.to_person = m1.to_person
    and m2.ts >= m1.ts - '3m'::interval
group by 1
order by 1;
 id | count
----+-------
  1 |     0
  2 |     1
  3 |     0
  4 |     2
(4 rows)
In the absence of a primary key you can use the function row_number(), for example:
with msg_with_rn as (
    -- rn stands in for the missing primary key
    select *, row_number() over (order by ts, from_person desc) rn
    from msg
)
select m1.rn, count(m2)
from msg_with_rn m1
left join msg_with_rn m2
    on m2.rn < m1.rn
    and m2.to_person = m1.to_person
    and m2.ts >= m1.ts - '3m'::interval
group by 1
order by 1;
Note that I have used row_number() over (order by ts, from_person desc) to get the sequence of rows as you have presented in the question. Of course, you should decide yourself how to resolve ambiguities arising from equal values of the column ts (as in the first two rows).
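On the frame clause limitation mentioned in the question: PostgreSQL 11 and later accept a time offset in a RANGE frame, so a per-recipient count over the last 3 minutes can be sketched without a self-join. This is only a sketch against the example msg table above, and msgs_last_3min is a made-up alias. Unlike the join, it treats rows with the same ts as peers, so rows 1 and 2 each count the other.

```sql
-- Sketch, assuming PostgreSQL 11+ (RANGE frames with an interval offset).
-- Counts the other messages to the same recipient in the 3 minutes up to
-- and including the current row's ts, regardless of who sent them.
select m.id,
       count(*) over (
           partition by to_person
           order by ts
           range between interval '3 minutes' preceding and current row
           exclude current row
       ) as msgs_last_3min   -- hypothetical alias
from msg m
order by m.id;
```

A window frame cannot filter on the sender of the current row, so the self-join is still needed when messages from the same sender must be left out.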
#2
1
This should more or less do it. Depending on your requirements, you may need to modify the middle two conditions in the where clause:
select *,
    (select count(*) from msg m2
     where m2.to_person = m1.to_person
       and m2.from_person != m1.from_person
       and m2.from_location != m1.from_location
       and abs(EXTRACT(EPOCH FROM (m1.ts - m2.ts))) <= 3*60)
from msg m1;
#3
1
Building on your actual question, this would be a correct answer:
SELECT count(m2.to_person) AS ct_3min
FROM   msg m1
LEFT   JOIN msg m2
       ON  m2.to_person = m1.to_person
       AND (m2.from_person, m2.from_location) <> (m1.from_person, m1.from_location)
       AND m2.ts <= m1.ts  -- including same timestamp (?)
       AND m2.ts >= m1.ts - interval '3 min'
GROUP  BY m1.ctid
ORDER  BY m1.ctid;
Assuming to_person, from_person and from_location are all defined NOT NULL.
Returns:
1 -- !!
1
0
2
Note that the result is basically meaningless without additional columns: any unique combination of columns, ideally a PK. I return the rows in the current physical order, which can change at any time without warning. There is no natural order of rows in a relational table. Without an unambiguous ORDER BY clause, the order of result rows is unreliable.
According to your definition, the first two rows (in your displayed order) need to have the same result: 1, or 0 if you don't count the same timestamp. 0 for one and 1 for the other would be incorrect according to your definition.
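If repeated messages from the same sender should count only once, and the result should carry identifying columns as recommended above, a variation along these lines may be closer to the wording of the question. It is a sketch against the table from the question. The aliases senders and distinct_senders_3min are made up for illustration, and the NOT NULL assumption still applies.

```sql
-- Sketch: count distinct (from_person, from_location) pairs, other than the
-- current row's own sender, that wrote to the same to_person in the
-- 3 minutes up to and including the current row's ts.
SELECT m1.from_person, m1.from_location, m1.to_person, m1.ts
     , (SELECT count(*)
        FROM  (SELECT DISTINCT m2.from_person, m2.from_location
               FROM   msg m2
               WHERE  m2.to_person = m1.to_person
               AND    (m2.from_person, m2.from_location)
                   <> (m1.from_person, m1.from_location)
               AND    m2.ts <= m1.ts
               AND    m2.ts >= m1.ts - interval '3 min'
              ) senders                 -- hypothetical alias
       ) AS distinct_senders_3min       -- hypothetical alias
FROM   msg m1
-- Deterministic only if (ts, from_person, from_location) is unique.
ORDER  BY m1.ts, m1.from_person, m1.from_location;
```

The inner DISTINCT collapses several messages from one sender into a single row before counting, which the plain join with count() does not do.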
In the absence of any unique key, I am using the ctid as a poor man's surrogate key. More:
- In-order sequence generation
You should still have a primary key defined in your table, but it's by no means compulsory. That's not the only dubious detail in your table layout. You should probably operate with timestamp with time zone, have some NOT NULL constraints, and use only person_id columns referencing a person table in a properly normalized design. Something like:
CREATE TABLE msg (
  msg_id         serial PRIMARY KEY
, from_person_id integer NOT NULL REFERENCES person
, to_person_id   integer NOT NULL REFERENCES person
, msglength      integer
, ts             timestamp with time zone
);
Either way, relying on a surrogate PK for the purpose of your query would be plain wrong. The "next" msg_id does not even have to have a later timestamp. In a multi-user database a sequence does not guarantee anything of the sort.
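With the normalized layout above, the count can use msg_id purely to identify each row while every time comparison is made on ts. A minimal sketch, assuming the person table exists and that the sender is fully identified by from_person_id (that layout has no location columns):

```sql
-- Sketch: msg_id only identifies the row; ordering and the 3-minute
-- window both rely on ts, never on the sequence value.
SELECT m1.msg_id, m1.ts, count(m2.msg_id) AS ct_3min
FROM   msg m1
LEFT   JOIN msg m2
       ON  m2.to_person_id = m1.to_person_id
       AND m2.from_person_id <> m1.from_person_id
       AND m2.ts <= m1.ts
       AND m2.ts >= m1.ts - interval '3 min'
GROUP  BY m1.msg_id            -- PK, so m1.ts may appear in the SELECT list
ORDER  BY m1.ts, m1.msg_id;    -- msg_id only breaks ties between equal ts
```

Since ts is nullable in that layout, rows without a timestamp would get a count of 0 and sort last.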