Here is the database layout. I have a table with sparse sales over time, aggregated per day. If for an item I have 10 sales on the 01-01-2015, I will have an entry, but If I have 0, then I have no entry. Something like this.
这是数据库布局。我有一张表,随着时间的推移销售稀少,每天汇总。如果对于我在01-01-2015有10个销售的项目,我将有一个条目,但如果我有0,那么我没有条目。像这样的东西。
|--------------------------------------|
| day_of_year | year | sales | item_id |
|--------------------------------------|
| 01 | 2015 | 20 | A1 |
| 01 | 2015 | 11 | A2 |
| 07 | 2015 | 09 | A1 |
| ... | ... | ... | ... |
|--------------------------------------|
This is how I get a time series for 1 item.
这就是我获得1个项目的时间序列的方法。
SELECT doy, max(sales) FROM (
SELECT day_of_year AS doy,
sales AS sales
FROM myschema.entry_daily
WHERE item_id = theNameOfmyItem
AND year = 2015
AND day_of_year < 150
UNION
SELECT doy AS doy,
0 AS sales
FROM generate_series(1, 149) AS doy) as t
GROUP BY doy
ORDER BY doy;
And I currently loop with R making 1 query for every item. I then aggregate the results in a dataframe. But this is very slow. I would actually like to have only one query that would aggregate all the data in the following form.
我目前循环使用R为每个项目进行1次查询。然后我在数据框中汇总结果。但这很慢。我实际上只想有一个查询将汇总以下形式的所有数据。
|----------------------------------------------|
| item_id | 01 | 02 | 03 | 04 | 05 | ... | 149 |
|----------------------------------------------|
| A1 | 10 | 00 | 00 | 05 | 12 | ... | 11 |
| A2 | 11 | 00 | 30 | 01 | 15 | ... | 09 |
| A3 | 20 | 00 | 00 | 05 | 17 | ... | 20 |
| ... |
|----------------------------------------------|
Would this be possible? By the way I am using a Postgres database.
这可能吗?顺便说一句,我正在使用Postgres数据库。
3 个解决方案
#1
3
Solution 1. Simple query with an aggregate.
The simplest and fastest way to get the expected result. It is easy to parse the sales
column within a client program.
获得预期结果的最简单,最快捷的方法。在客户端程序中解析销售列很容易。
select item, string_agg(coalesce(sales, 0)::text, ',') sales
from (
select distinct item_id item, doy
from generate_series (1, 10) doy -- change 10 to given n
cross join entry_daily
) sub
left join entry_daily on item_id = item and day_of_year = doy
group by 1
order by 1;
item | sales
------+----------------------
A1 | 20,0,0,0,0,0,9,0,0,0
A2 | 11,0,0,0,0,0,0,0,0,0
(2 rows)
Solution 2. Dynamically created view.
Based on the solution 1 with array_agg()
instead of string_agg()
. The function creates a view with a given number of columns.
基于解决方案1,使用array_agg()而不是string_agg()。该函数创建具有给定列数的视图。
create or replace function create_items_view(view_name text, days int)
returns void language plpgsql as $$
declare
list text;
begin
select string_agg(format('s[%s] "%s"', i::text, i::text), ',')
into list
from generate_series(1, days) i;
execute(format($f$
drop view if exists %s;
create view %s as select item, %s
from (
select item, array_agg(coalesce(sales, 0)) s
from (
select distinct item_id item, doy
from generate_series (1, %s) doy
cross join entry_daily
) sub
left join entry_daily on item_id = item and day_of_year = doy
group by 1
order by 1
) q
$f$, view_name, view_name, list, days)
);
end $$;
Usage:
用法:
select create_items_view('items_view_10', 10);
select * from items_view_10;
item | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
------+----+---+---+---+---+---+---+---+---+----
A1 | 20 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 0
A2 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
(2 rows)
Solution 3. Crosstab.
Easy to use, but very uncomfortable with the greater number of columns due to the need to define the row format.
由于需要定义行格式,因此易于使用,但对于更多列非常不舒服。
create extension if not exists tablefunc;
select * from crosstab (
'select item_id, day_of_year, sales
from entry_daily
order by 1',
'select i from generate_series (1, 10) i'
) as ct
(item_id text, "1" int, "2" int, "3" int, "4" int, "5" int, "6" int, "7" int, "8" int, "9" int, "10" int);
item_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
---------+----+---+---+---+---+---+---+---+---+----
A1 | 20 | | | | | | 9 | | |
A2 | 11 | | | | | | | | |
(2 rows)
#2
1
First you need a table with all dates to fill the blank dates. 100 years of date mean 36,000 rows so no very big. Instead of calculate every time.
首先,您需要一个包含所有日期的表来填充空白日期。 100年的日期意味着36,000行所以不是很大。而不是每次都计算。
allDates:
allDates:
date_id
s_date
or created calculating the fields
或创建计算字段
date_id
s_date
doy = EXTRACT(DOY FROM s_date)
year = EXTRACT(YEAR FROM s_date)
Your base query will be SQL FIDDLE DEMO:
您的基本查询将是SQL FIDDLE DEMO:
SELECT
AD.year,
AD.doy,
allitems.item_id,
COALESCE(SUM(ED.sales), 0) as max_sales
FROM
(SELECT DISTINCT item_id
FROM entry_daily
) as allitems
CROSS JOIN alldates AD
LEFT JOIN entry_daily ED
ON ED.day_of_year = AD.doy
AND ED.year = AD.year
AND ED.item_id = allitems.item_id
WHERE AD.year = 2015
GROUP BY
AD.year, AD.doy, allitems.item_id
ORDER BY
AD.year, AD.doy, allitems.item_id
You will have this OUTPUT
你将有这个输出
| year | doy | item_id | max_sales |
|------|-----|---------|-----------|
| 2015 | 1 | A1 | 20 |
| 2015 | 1 | A2 | 11 |
| 2015 | 2 | A1 | 0 |
| 2015 | 2 | A2 | 0 |
| 2015 | 3 | A1 | 0 |
| 2015 | 3 | A2 | 0 |
| 2015 | 4 | A1 | 0 |
| 2015 | 4 | A2 | 0 |
| 2015 | 5 | A1 | 0 |
| 2015 | 5 | A2 | 0 |
| 2015 | 6 | A1 | 0 |
| 2015 | 6 | A2 | 0 |
| 2015 | 7 | A1 | 39 |
| 2015 | 7 | A2 | 0 |
| 2015 | 8 | A1 | 0 |
| 2015 | 8 | A2 | 0 |
| 2015 | 9 | A1 | 0 |
| 2015 | 9 | A2 | 0 |
| 2015 | 10 | A1 | 0 |
| 2015 | 10 | A2 | 0 |
Then you need install tablefunc
然后你需要安装tablefunc
and use crosstab to pivot this table SAMPLE
并使用交叉表来旋转此表SAMPLE
#3
0
Try this self-contained code where we have used 5 instead of 149 to keep the output short.
尝试这个自包含的代码,我们使用5而不是149来保持输出短。
In (1) we use a single SQL statement, as required, to generate all the series producing a long form result. Normally in relational databases long form rather than wide form is used and this form may be preferable but in case not we follow this with a conversion to wide form using the reshape2 package.
在(1)中,我们根据需要使用单个SQL语句来生成生成长格式结果的所有系列。通常在关系数据库中使用长形式而不是宽形式,并且这种形式可能是优选的,但是如果不是这样,我们使用reshape2包转换为宽形式。
In (2) we show how to replace the SQL statement with R code that uses the dplyr package.
在(2)中,我们展示了如何用使用dplyr包的R代码替换SQL语句。
1) PostgreSQL Regarding the SQL statement below, the innermost select generates a table 1, 2, ..., 5 whose column is day_of_year
which is cross joined with entry_daily
giving every combination of day_of_year with year and item and keeping only the distinct rows. This is then left joined with entry_daily
to pick up the sales numbers which we sum over.
1)PostgreSQL关于下面的SQL语句,最里面的select生成一个表1,2,...,5,其列是day_of_year,它与entry_daily交叉连接,给出day_of_year与year和item的每个组合,并仅保留不同的行。然后将其与entry_daily连接,以获取我们总结的销售数字。
Assuming you have set up postgreSQL to work with SQL as in FAQ#12 on the sqldf home page ( https://github.com/ggrothendieck/sqldf ) the following should illustrate it and is self contained code that you can just copy and paste into your session.
假设你已经设置了postgreSQL来使用SQL,就像在sqldf主页(https://github.com/ggrothendieck/sqldf)上的FAQ#12中一样,下面应该说明它并且是你可以复制和粘贴的自包含代码进入你的会话。
library(sqldf)
library(RPostgreSQL)
# input data
entry_daily <-
structure(list(day_of_year = c(1L, 1L, 7L), year = c(2015L, 2015L,
2015L), sales = c(20L, 11L, 9L), item_id = structure(c(1L, 2L,
1L), .Label = c("A1", "A2"), class = "factor")), .Names = c("day_of_year",
"year", "sales", "item_id"), class = "data.frame", row.names = c(NA,
-3L))
s <- sqldf("select A.item_id, A.year, A.day_of_year, sum(coalesce(B.sales, 0)) sales
from (select distinct x.day_of_year, y.year, y.item_id
from (select * from generate_series(1, 5) as day_of_year) as x
cross join entry_daily as y) as A
left join entry_daily as B
on A.year = B.year and A.day_of_year = B.day_of_year and
A.item_id = B.item_id
where A.year = 2015
group by A.item_id, A.year, A.day_of_year
order by A.item_id, A.year, A.day_of_year")
The output of the above query is this data.frame:
上述查询的输出是这个data.frame:
> s
item_id year day_of_year sales
1 A1 2015 1 20
2 A1 2015 2 0
3 A1 2015 3 0
4 A1 2015 4 0
5 A1 2015 5 0
6 A2 2015 1 11
7 A2 2015 2 0
8 A2 2015 3 0
9 A2 2015 4 0
10 A2 2015 5 0
If you really need it in wide form then we can do that in R using dcast
in the reshape2 package:
如果你真的需要它的广泛形式,那么我们可以使用reshape2包中的dcast在R中做到这一点:
library(reshape2)
dcast(s, item_id + year ~ day_of_year, value.var = "sales")
giving:
赠送:
item_id year 1 2 3 4 5
1 A1 2015 20 0 0 0 0
2 A2 2015 11 0 0 0 0
2) dplyr Note that as an alternative to the SQL statement this R code would compute s
:
2)dplyr请注意,作为SQL语句的替代,此R代码将计算s:
library(dplyr)
s2 <- expand.grid(item_id = unique(entry_daily$item_id),
year = 2015,
day_of_year = 1:5) %>%
left_join(entry_daily) %>%
group_by(item_id, year, day_of_year) %>%
summarize(sales = sum(sales, na.rm = TRUE)) %>%
ungroup() %>%
arrange(item_id, year, day_of_year)
giving:
赠送:
> s2
Joining by: c("item_id", "year", "day_of_year")
Source: local data frame [10 x 4]
Groups: item_id, year [?]
item_id year day_of_year sales
(fctr) (dbl) (int) (int)
1 A1 2015 1 20
2 A1 2015 2 0
3 A1 2015 3 0
4 A1 2015 4 0
5 A1 2015 5 0
6 A2 2015 1 11
7 A2 2015 2 0
8 A2 2015 3 0
9 A2 2015 4 0
10 A2 2015 5 0
Now optionally use the same dcast
as in (1).
现在可选择使用与(1)中相同的dcast。
#1
3
Solution 1. Simple query with an aggregate.
The simplest and fastest way to get the expected result. It is easy to parse the sales
column within a client program.
获得预期结果的最简单,最快捷的方法。在客户端程序中解析销售列很容易。
select item, string_agg(coalesce(sales, 0)::text, ',') sales
from (
select distinct item_id item, doy
from generate_series (1, 10) doy -- change 10 to given n
cross join entry_daily
) sub
left join entry_daily on item_id = item and day_of_year = doy
group by 1
order by 1;
item | sales
------+----------------------
A1 | 20,0,0,0,0,0,9,0,0,0
A2 | 11,0,0,0,0,0,0,0,0,0
(2 rows)
Solution 2. Dynamically created view.
Based on the solution 1 with array_agg()
instead of string_agg()
. The function creates a view with a given number of columns.
基于解决方案1,使用array_agg()而不是string_agg()。该函数创建具有给定列数的视图。
create or replace function create_items_view(view_name text, days int)
returns void language plpgsql as $$
declare
list text;
begin
select string_agg(format('s[%s] "%s"', i::text, i::text), ',')
into list
from generate_series(1, days) i;
execute(format($f$
drop view if exists %s;
create view %s as select item, %s
from (
select item, array_agg(coalesce(sales, 0)) s
from (
select distinct item_id item, doy
from generate_series (1, %s) doy
cross join entry_daily
) sub
left join entry_daily on item_id = item and day_of_year = doy
group by 1
order by 1
) q
$f$, view_name, view_name, list, days)
);
end $$;
Usage:
用法:
select create_items_view('items_view_10', 10);
select * from items_view_10;
item | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
------+----+---+---+---+---+---+---+---+---+----
A1 | 20 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 0
A2 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
(2 rows)
Solution 3. Crosstab.
Easy to use, but very uncomfortable with the greater number of columns due to the need to define the row format.
由于需要定义行格式,因此易于使用,但对于更多列非常不舒服。
create extension if not exists tablefunc;
select * from crosstab (
'select item_id, day_of_year, sales
from entry_daily
order by 1',
'select i from generate_series (1, 10) i'
) as ct
(item_id text, "1" int, "2" int, "3" int, "4" int, "5" int, "6" int, "7" int, "8" int, "9" int, "10" int);
item_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
---------+----+---+---+---+---+---+---+---+---+----
A1 | 20 | | | | | | 9 | | |
A2 | 11 | | | | | | | | |
(2 rows)
#2
1
First you need a table with all dates to fill the blank dates. 100 years of date mean 36,000 rows so no very big. Instead of calculate every time.
首先,您需要一个包含所有日期的表来填充空白日期。 100年的日期意味着36,000行所以不是很大。而不是每次都计算。
allDates:
allDates:
date_id
s_date
or created calculating the fields
或创建计算字段
date_id
s_date
doy = EXTRACT(DOY FROM s_date)
year = EXTRACT(YEAR FROM s_date)
Your base query will be SQL FIDDLE DEMO:
您的基本查询将是SQL FIDDLE DEMO:
SELECT
AD.year,
AD.doy,
allitems.item_id,
COALESCE(SUM(ED.sales), 0) as max_sales
FROM
(SELECT DISTINCT item_id
FROM entry_daily
) as allitems
CROSS JOIN alldates AD
LEFT JOIN entry_daily ED
ON ED.day_of_year = AD.doy
AND ED.year = AD.year
AND ED.item_id = allitems.item_id
WHERE AD.year = 2015
GROUP BY
AD.year, AD.doy, allitems.item_id
ORDER BY
AD.year, AD.doy, allitems.item_id
You will have this OUTPUT
你将有这个输出
| year | doy | item_id | max_sales |
|------|-----|---------|-----------|
| 2015 | 1 | A1 | 20 |
| 2015 | 1 | A2 | 11 |
| 2015 | 2 | A1 | 0 |
| 2015 | 2 | A2 | 0 |
| 2015 | 3 | A1 | 0 |
| 2015 | 3 | A2 | 0 |
| 2015 | 4 | A1 | 0 |
| 2015 | 4 | A2 | 0 |
| 2015 | 5 | A1 | 0 |
| 2015 | 5 | A2 | 0 |
| 2015 | 6 | A1 | 0 |
| 2015 | 6 | A2 | 0 |
| 2015 | 7 | A1 | 39 |
| 2015 | 7 | A2 | 0 |
| 2015 | 8 | A1 | 0 |
| 2015 | 8 | A2 | 0 |
| 2015 | 9 | A1 | 0 |
| 2015 | 9 | A2 | 0 |
| 2015 | 10 | A1 | 0 |
| 2015 | 10 | A2 | 0 |
Then you need install tablefunc
然后你需要安装tablefunc
and use crosstab to pivot this table SAMPLE
并使用交叉表来旋转此表SAMPLE
#3
0
Try this self-contained code where we have used 5 instead of 149 to keep the output short.
尝试这个自包含的代码,我们使用5而不是149来保持输出短。
In (1) we use a single SQL statement, as required, to generate all the series producing a long form result. Normally in relational databases long form rather than wide form is used and this form may be preferable but in case not we follow this with a conversion to wide form using the reshape2 package.
在(1)中,我们根据需要使用单个SQL语句来生成生成长格式结果的所有系列。通常在关系数据库中使用长形式而不是宽形式,并且这种形式可能是优选的,但是如果不是这样,我们使用reshape2包转换为宽形式。
In (2) we show how to replace the SQL statement with R code that uses the dplyr package.
在(2)中,我们展示了如何用使用dplyr包的R代码替换SQL语句。
1) PostgreSQL Regarding the SQL statement below, the innermost select generates a table 1, 2, ..., 5 whose column is day_of_year
which is cross joined with entry_daily
giving every combination of day_of_year with year and item and keeping only the distinct rows. This is then left joined with entry_daily
to pick up the sales numbers which we sum over.
1)PostgreSQL关于下面的SQL语句,最里面的select生成一个表1,2,...,5,其列是day_of_year,它与entry_daily交叉连接,给出day_of_year与year和item的每个组合,并仅保留不同的行。然后将其与entry_daily连接,以获取我们总结的销售数字。
Assuming you have set up postgreSQL to work with SQL as in FAQ#12 on the sqldf home page ( https://github.com/ggrothendieck/sqldf ) the following should illustrate it and is self contained code that you can just copy and paste into your session.
假设你已经设置了postgreSQL来使用SQL,就像在sqldf主页(https://github.com/ggrothendieck/sqldf)上的FAQ#12中一样,下面应该说明它并且是你可以复制和粘贴的自包含代码进入你的会话。
library(sqldf)
library(RPostgreSQL)
# input data
entry_daily <-
structure(list(day_of_year = c(1L, 1L, 7L), year = c(2015L, 2015L,
2015L), sales = c(20L, 11L, 9L), item_id = structure(c(1L, 2L,
1L), .Label = c("A1", "A2"), class = "factor")), .Names = c("day_of_year",
"year", "sales", "item_id"), class = "data.frame", row.names = c(NA,
-3L))
s <- sqldf("select A.item_id, A.year, A.day_of_year, sum(coalesce(B.sales, 0)) sales
from (select distinct x.day_of_year, y.year, y.item_id
from (select * from generate_series(1, 5) as day_of_year) as x
cross join entry_daily as y) as A
left join entry_daily as B
on A.year = B.year and A.day_of_year = B.day_of_year and
A.item_id = B.item_id
where A.year = 2015
group by A.item_id, A.year, A.day_of_year
order by A.item_id, A.year, A.day_of_year")
The output of the above query is this data.frame:
上述查询的输出是这个data.frame:
> s
item_id year day_of_year sales
1 A1 2015 1 20
2 A1 2015 2 0
3 A1 2015 3 0
4 A1 2015 4 0
5 A1 2015 5 0
6 A2 2015 1 11
7 A2 2015 2 0
8 A2 2015 3 0
9 A2 2015 4 0
10 A2 2015 5 0
If you really need it in wide form then we can do that in R using dcast
in the reshape2 package:
如果你真的需要它的广泛形式,那么我们可以使用reshape2包中的dcast在R中做到这一点:
library(reshape2)
dcast(s, item_id + year ~ day_of_year, value.var = "sales")
giving:
赠送:
item_id year 1 2 3 4 5
1 A1 2015 20 0 0 0 0
2 A2 2015 11 0 0 0 0
2) dplyr Note that as an alternative to the SQL statement this R code would compute s
:
2)dplyr请注意,作为SQL语句的替代,此R代码将计算s:
library(dplyr)
s2 <- expand.grid(item_id = unique(entry_daily$item_id),
year = 2015,
day_of_year = 1:5) %>%
left_join(entry_daily) %>%
group_by(item_id, year, day_of_year) %>%
summarize(sales = sum(sales, na.rm = TRUE)) %>%
ungroup() %>%
arrange(item_id, year, day_of_year)
giving:
赠送:
> s2
Joining by: c("item_id", "year", "day_of_year")
Source: local data frame [10 x 4]
Groups: item_id, year [?]
item_id year day_of_year sales
(fctr) (dbl) (int) (int)
1 A1 2015 1 20
2 A1 2015 2 0
3 A1 2015 3 0
4 A1 2015 4 0
5 A1 2015 5 0
6 A2 2015 1 11
7 A2 2015 2 0
8 A2 2015 3 0
9 A2 2015 4 0
10 A2 2015 5 0
Now optionally use the same dcast
as in (1).
现在可选择使用与(1)中相同的dcast。