数据中心冷却的 safe-RL，基于对 action 的事后修正技术

一个总述
摘要
1 intro
2 related work
3 preliminaries
4 performance of reward shaping
5 the safari approach
6 performance evaluation

一个总述

关于本篇笔记：

论文题目：Toward Physics-Guided Safe Deep Reinforcement Learning for Green Data Center Cooling Control
- 在线看：https://ieeexplore.ieee.org/abstract/document/9797658
阅读笔记是 ~~量子速读法~~ 的产物，只能起辅助阅读的功效，无法替代论文原文。

关于论文：

motivation：减少 RL 试错过程中的 unsafe behavior。
基本思想：先 imitation learning，再在 on-line learning 时强行改可能 unsafe 的 action。
关键技术：1. 怎么判断 action 可能 unsafe，2. 怎么改 action。
- 1 再训一个 coarse model。
- 2 启发式寻找最小的改动，直到 coarse model 判断安全。
- 1 2 是结合在一起做的。
局限性：把整个房间设为同一个温度（使用 EnergyPlus 仿真）

摘要

声称自己提出的：safety-aware DRL framework，for single-hall data center cooling control。
技术路线：off-line imitation learning + online post-hoc rectification。
rectification：使用基于 historical safe operation traces 来 fit 的 thermal state transition model，还能外推 unsafe state。

1 intro

Post-hoc rectification：去修正 DRL 提供的 action，来保证 won't drive the system to the unsafe region，采用最小的修正（rectification）。
Safari（offline 模仿学习 + online 修正）的优势：拟合状态转换模型时的低开销和对数据的低需求（即只需要安全数据）。

2 related work

simplex 单纯形：进入到 unsafe region，再回退；post-hoc rec：主动修改 action 使其安全。

3 preliminaries

用 EnergyPlus 仿真，假设整个房间的温度是一样的（uniform distribution）。
设定点（action）好像是 mass flow rate 和 T_in。

4 performance of reward shaping

本章内容：MDP setting 和 RL（DDPG）。

state：

T_z 室内温度, T_in 空调温度, P_c ACU 功率, P_IT 负载, T_o 户外温度。
假设 P_IT 和 T_o 都是 Markovian（？）

reward：

1 goal 项：exp(ΔT²) 惩罚的超温 - P_DC（总功率 = P_IT + P_c）。
2 shaping 项：用于控制 T_z 在 T_L T_U 之间，两个 max(0, ΔT_{过高/过低} ) 相加。

（迄今为止我都不知道 ACU P_c 是怎么算的）

5 the safari approach

5.1 CMDP Formulation & Approach Overview

如 subsection 标题。定义了 constrained MDP 的问题。

5.2 off-line IL

发现模仿学习可以让前三天不犯错，但后面 RL 继续试错还是会犯错的。

5.3 Online Post-hoc Rectification

用以前的数据再 train 一个 coarse model：灰盒或黑盒。

LSTM：MAE 在 0.5℃。需要 unsafe 数据（exploratory data）。
safari1：
- 关于 KKT 条件：https://zhuanlan.zhihu.com/p/556832103

数据中心冷却的 safe-RL，基于对 action 的事后修正技术

- 好像就是解了一个 KKT 条件。让数学形式如此简单的关键，是它的 DC 状态方程简单。

safari2： heuristic 方法。
- For T_in(t) and f(t), we adopt their setpoints as their approximations. 又干了这种直接拿 setpoint 当真实值的事情。
- 然后直接把 Eq.(1) 积分了！
- 然后，为了减少拿 setpoint 近似 flow rate 和温度的影响，用这段时间的 T_z 的积分作为最终的 T_z 值。
safari3：
- 搞了一个所有 IT power trace 的上包络线( maximum ramp-up function)。
- 然后直接用这个作为输入 Q，重新 forward DRL 模型。

比较：

LSTM using unsafe data 的性能最好，其次是 transient 的 safari2。

6 performance evaluation

没有什么要说的。

秒客网

数据中心冷却的 safe-RL，基于对 action 的事后修正技术

一个总述

摘要

1 intro

2 related work

3 preliminaries

4 performance of reward shaping

5 the safari approach

5.1 CMDP Formulation & Approach Overview

5.2 off-line IL

5.3 Online Post-hoc Rectification

6 performance evaluation

相关文章