Moving averages using analytic functions in T-SQL (SQL Server 2014)

Time: 2021-07-24 09:14:56

I need to calculate weekly and monthly moving averages per sensor per day for a large set of sample data based on some quality criteria. I have a working solution based on correlated sub-queries (or self joins), but I was wondering if using analytic functions is possible and would result in better performance?

Here is what I have right now (simplified):

CREATE TABLE Samples(
    SensorId int,
    SampleTime datetime,
    Value float,
    Quality float
);

WITH DailyAvg (SensorId, SampleDate, ValueSum, ValueCount)
AS
(
    SELECT
        SensorId,
        CAST(SampleTime AS DATE) AS SampleDate,
        SUM(Value) AS ValueSum,
        COUNT_BIG(Value) AS ValueCount
    FROM Samples
    WHERE Quality > 0.95
    GROUP BY SensorId, CAST(SampleTime AS DATE)
)
SELECT
    SensorId,
    SampleDate,
    ( SELECT SUM(d2.ValueSum) / SUM(d2.ValueCount) FROM DailyAvg AS d2 WHERE d2.SensorId = d1.SensorId AND d2.SampleDate BETWEEN DATEADD(DAY,   -7, d1.SampleDate) AND d1.SampleDate) AS AverageLastWeek,
    ( SELECT SUM(d2.ValueSum) / SUM(d2.ValueCount) FROM DailyAvg AS d2 WHERE d2.SensorId = d1.SensorId AND d2.SampleDate BETWEEN DATEADD(DAY,  -14, d1.SampleDate) AND d1.SampleDate) AS AverageLast2Weeks,
    ( SELECT SUM(d2.ValueSum) / SUM(d2.ValueCount) FROM DailyAvg AS d2 WHERE d2.SensorId = d1.SensorId AND d2.SampleDate BETWEEN DATEADD(MONTH, -1, d1.SampleDate) AND d1.SampleDate) AS AverageLastMonth
FROM DailyAvg d1
ORDER BY SensorId, SampleDate

I've tried replacing the sub-query for weekly average with the snippet below, but it obviously cannot handle days without any samples correctly. I thought of using RANGE or PARTITION BY expressions, but I cannot figure out how to specify the window frame to select the samples from e.g. "last week".

SUM(ValueSum) OVER(PARTITION BY SensorId ORDER BY SampleDate ROWS 7 PRECEDING) / SUM(ValueCount) OVER(PARTITION BY SensorId ORDER BY SampleDate ROWS 7 PRECEDING) AS AverageLastWeek
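
To see why a ROWS frame goes wrong when days are missing, a small sketch may help. The table variable @Daily and its three rows are made up for illustration; the only point is that ROWS 7 PRECEDING counts rows, not days. Note also that in SQL Server 2014 a RANGE frame accepts only UNBOUNDED PRECEDING/FOLLOWING and CURRENT ROW as bounds, so something like RANGE 7 PRECEDING is not available.

```sql
-- Hypothetical daily aggregates for a single sensor, with a gap after 2021-01-02.
DECLARE @Daily TABLE (SampleDate date, ValueSum float, ValueCount bigint);

INSERT INTO @Daily (SampleDate, ValueSum, ValueCount)
VALUES ('2021-01-01', 10.0, 1),
       ('2021-01-02', 20.0, 1),
       ('2021-01-20', 30.0, 1);

-- FrameStart is the earliest date inside the "7 rows back" frame;
-- FrameSpanDays shows how many calendar days that frame really covers.
SELECT
    SampleDate,
    MIN(SampleDate) OVER (ORDER BY SampleDate ROWS 7 PRECEDING) AS FrameStart,
    DATEDIFF(DAY,
             MIN(SampleDate) OVER (ORDER BY SampleDate ROWS 7 PRECEDING),
             SampleDate) AS FrameSpanDays
FROM @Daily
ORDER BY SampleDate;
```

For the 2021-01-20 row the frame reaches back to 2021-01-01, a span of 19 days rather than 7.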

I even considered "Quirky Update", but besides being messy I don't think it makes sense with this many days being averaged over.

3 solutions

#1


Check out the window functions LEAD and LAG. They read values from rows at a fixed offset after or before the current row, which is useful when working with moving windows over a result set.
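
As a hedged illustration of what LAG returns, the sketch below runs it against the Samples table from the question to find the previous sample date per sensor and the gap in days. LAG does not aggregate over a window by itself; here it only helps locate the missing days that trip up row-based frames. The CTE name SampleDays and the column aliases are made up.

```sql
-- One row per sensor and day that has at least one qualifying sample.
WITH SampleDays AS (
    SELECT DISTINCT
           SensorId,
           CAST(SampleTime AS DATE) AS SampleDate
    FROM Samples
    WHERE Quality > 0.95
)
SELECT
    SensorId,
    SampleDate,
    -- Previous day with samples for the same sensor (NULL for the first day).
    LAG(SampleDate) OVER (PARTITION BY SensorId ORDER BY SampleDate) AS PreviousSampleDate,
    -- A value greater than 1 means at least one day without samples in between.
    DATEDIFF(DAY,
             LAG(SampleDate) OVER (PARTITION BY SensorId ORDER BY SampleDate),
             SampleDate) AS DaysSincePrevious
FROM SampleDays
ORDER BY SensorId, SampleDate;
```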

#2


With my small randomized dataset, my query compiled faster and had fewer scans and fewer logical reads. It's hard to tell with such a small dataset, and I don't have your indexes, so try it for yourself. If anything, it's simpler than your query. The monthly average was the tricky part. If you want, you could make it like the others and use a fixed number of days, for example the average over the previous 30 days. What I put in is the average for the current month only. I have a feeling you could use your subquery for that one, but I didn't try that because it's late here.

Note: The ROWS ... PRECEDING frames assume that there is a row for each day with no gaps; otherwise your results will be skewed.

SELECT  SensorID,
        SampleDate,
        AVG(avg_VALUE) OVER (PARTITION BY SensorID,SampleDate) avg_per_date, --but I only have one row per date so that's why it's a whole number each time
        AVG(avg_VALUE) OVER (PARTITION BY SensorID ORDER BY SampleDate ROWS BETWEEN 7 PRECEDING AND CURRENT ROW)  AverageLastWeek,
        AVG(avg_VALUE) OVER (PARTITION BY SensorID ORDER BY SampleDate ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AverageLast2Weeks,
        AVG(avg_VALUE) OVER (PARTITION BY SensorID,YEAR(SampleDate),MONTH(SampleDate) ORDER BY SampleDate ROWS UNBOUNDED PRECEDING) AverageCurrentMonth --partition by year as well so the same month of different years is not mixed
        --AVG(avg_VALUE) OVER (PARTITION BY SensorID ORDER BY SampleDate ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AverageLast30Days --alternatively you could do this
FROM
(
    SELECT SensorId,CAST(SampleTime AS DATE) SampleDate,AVG(Value) Avg_value
    FROM Samples
    WHERE Quality > .95
    GROUP BY SensorId,CAST(SampleTime AS DATE)
) S
ORDER BY SensorID,SampleDate
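
One way to make the no-gaps assumption hold is to pad the daily data with empty days before applying the ROWS frames. The following is a sketch under stated assumptions, not a tested solution: it uses the Samples table from the question, builds a number list from sys.all_objects (any tally table would do), caps each sensor's date range at 10000 days, and uses 30 rows as a stand-in for "last month". It keeps the question's sum/count approach so that the averages stay weighted by sample count.

```sql
WITH DailyAvg AS (
    SELECT SensorId,
           CAST(SampleTime AS DATE) AS SampleDate,
           SUM(Value)       AS ValueSum,
           COUNT_BIG(Value) AS ValueCount
    FROM Samples
    WHERE Quality > 0.95
    GROUP BY SensorId, CAST(SampleTime AS DATE)
),
SensorRange AS (
    -- First and last day with qualifying samples, per sensor.
    SELECT SensorId, MIN(SampleDate) AS FirstDate, MAX(SampleDate) AS LastDate
    FROM DailyAvg
    GROUP BY SensorId
),
Numbers AS (
    -- Assumption: 10000 day offsets (roughly 27 years) are enough.
    SELECT TOP (10000)
           CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS int) AS n
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
    ORDER BY n
),
Dense AS (
    -- One row per sensor and calendar day; days without samples get sum 0 and count 0.
    SELECT r.SensorId,
           DATEADD(DAY, n.n, r.FirstDate) AS SampleDate,
           ISNULL(d.ValueSum, 0)   AS ValueSum,
           ISNULL(d.ValueCount, 0) AS ValueCount
    FROM SensorRange AS r
    JOIN Numbers AS n
      ON n.n <= DATEDIFF(DAY, r.FirstDate, r.LastDate)
    LEFT JOIN DailyAvg AS d
      ON d.SensorId = r.SensorId
     AND d.SampleDate = DATEADD(DAY, n.n, r.FirstDate)
),
Windowed AS (
    -- With one row per day, a row offset now equals a day offset.
    -- NULLIF avoids a divide-by-zero when a whole frame has no samples.
    SELECT SensorId, SampleDate, ValueCount,
           SUM(ValueSum) OVER (PARTITION BY SensorId ORDER BY SampleDate ROWS BETWEEN 7 PRECEDING AND CURRENT ROW)
             / NULLIF(SUM(ValueCount) OVER (PARTITION BY SensorId ORDER BY SampleDate ROWS BETWEEN 7 PRECEDING AND CURRENT ROW), 0) AS AverageLastWeek,
           SUM(ValueSum) OVER (PARTITION BY SensorId ORDER BY SampleDate ROWS BETWEEN 14 PRECEDING AND CURRENT ROW)
             / NULLIF(SUM(ValueCount) OVER (PARTITION BY SensorId ORDER BY SampleDate ROWS BETWEEN 14 PRECEDING AND CURRENT ROW), 0) AS AverageLast2Weeks,
           SUM(ValueSum) OVER (PARTITION BY SensorId ORDER BY SampleDate ROWS BETWEEN 30 PRECEDING AND CURRENT ROW)
             / NULLIF(SUM(ValueCount) OVER (PARTITION BY SensorId ORDER BY SampleDate ROWS BETWEEN 30 PRECEDING AND CURRENT ROW), 0) AS AverageLast30Days
    FROM Dense
)
SELECT SensorId, SampleDate, AverageLastWeek, AverageLast2Weeks, AverageLast30Days
FROM Windowed
WHERE ValueCount > 0   -- report only days that actually have samples, as the original query does
ORDER BY SensorId, SampleDate;
```

The filter on ValueCount sits in the outer query on purpose: a WHERE clause in the same SELECT as the window functions would remove the padded days before the frames are evaluated.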

If you have any questions or need anything else, let me know!

#3


Can you provide some sample data? This can be done with a recursive CTE. I can help once I have some test data to play with.
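
The answer does not say what role the recursive CTE would play. One plausible use, sketched here purely as an assumption, is generating a gap-free calendar that the samples can be joined to, so that days without samples show up explicitly. The date range in @FirstDate and @LastDate and the alias QualifiedSamples are hypothetical.

```sql
DECLARE @FirstDate date = '2021-01-01';  -- hypothetical range start
DECLARE @LastDate  date = '2021-03-31';  -- hypothetical range end

-- Recursive CTE: start at @FirstDate and add one day until @LastDate is reached.
WITH Calendar (CalendarDate) AS (
    SELECT @FirstDate
    UNION ALL
    SELECT DATEADD(DAY, 1, CalendarDate)
    FROM Calendar
    WHERE CalendarDate < @LastDate
)
SELECT c.CalendarDate,
       COUNT(s.SampleTime) AS QualifiedSamples   -- 0 for days without qualifying samples
FROM Calendar AS c
LEFT JOIN Samples AS s
       ON s.SampleTime >= c.CalendarDate
      AND s.SampleTime <  DATEADD(DAY, 1, c.CalendarDate)
      AND s.Quality > 0.95
GROUP BY c.CalendarDate
ORDER BY c.CalendarDate
OPTION (MAXRECURSION 0);  -- lift the default limit of 100 recursion levels for longer ranges
```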
