19.2 CPUFreq Drivers
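The CPUFreq subsystem lives under drivers/cpufreq. It adjusts CPU frequency and voltage dynamically at runtime, which is known as DVFS (Dynamic Voltage Frequency Scaling). The reason for scaling voltage and frequency at runtime is that the power consumption of a CMOS circuit is proportional to its frequency and to the square of its voltage ($P \propto fV^2$). Lowering voltage and frequency therefore reduces power.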
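The CPUFreq core layer is drivers/cpufreq/cpufreq.c. It gives every SoC-specific CPUFreq driver a single, uniform interface. It also implements a notifier mechanism that informs other modules whenever the CPUFreq policy or the CPU frequency changes. When the CPU's running frequency changes, the kernel's loops_per_jiffy value changes with it.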
init/main.c
unsigned long loops_per_jiffy = (1<<12);
EXPORT_SYMBOL(loops_per_jiffy);
linux/delay.h
extern unsigned long loops_per_jiffy;
19.2.1 Implementing a SoC CPUFreq Driver
A SoC-specific CPUFreq driver only has to supply its voltage/frequency table and perform the actual changes in hardware.
The CPUFreq core provides the following API with which a SoC registers its own CPUFreq driver:
linux/cpufreq.h
int cpufreq_register_driver(struct cpufreq_driver *driver_data);
The parameter is a pointer to a cpufreq_driver structure, which encapsulates the body of one SoC's CPUFreq driver. The structure is shown in Listing 19.1.
Listing 19.1 The cpufreq_driver structure
linux/cpufreq.h
struct cpufreq_driver {
char name[CPUFREQ_NAME_LEN]; /* name of the CPUFreq driver */
u8 flags; /* hint flags, e.g. CPUFREQ_CONST_LOOPS */
void *driver_data;
/* needed by all drivers */
int (*init)(struct cpufreq_policy *policy);
int (*verify)(struct cpufreq_policy *policy);
/* define one out of two */
int (*setpolicy)(struct cpufreq_policy *policy);
/*
* On failure, should always restore frequency to policy->restore_freq
* (i.e. old freq).
*/
int (*target)(struct cpufreq_policy *policy, unsigned int target_freq,
unsigned int relation); /* Deprecated */
int (*target_index)(struct cpufreq_policy *policy, unsigned int index);
/*
* Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
* unset.
*
* get_intermediate should return a stable intermediate frequency
* platform wants to switch to and target_intermediate() should set CPU
* to that frequency, before jumping to the frequency corresponding
* to 'index'. Core will take care of sending notifications and driver
* doesn't have to handle them in target_intermediate() or
* target_index().
*
* Drivers can return '0' from get_intermediate() in case they don't
* wish to switch to intermediate frequency for some target frequency.
* In that case core will directly call ->target_index().
*/
unsigned int (*get_intermediate)(struct cpufreq_policy *policy, unsigned int index);
int (*target_intermediate)(struct cpufreq_policy *policy, unsigned int index);
/* should be defined, if possible */
unsigned int (*get)(unsigned int cpu);
/* optional */
int (*bios_limit)(int cpu, unsigned int *limit);
int (*exit)(struct cpufreq_policy *policy);
void (*stop_cpu)(struct cpufreq_policy *policy);
int (*suspend)(struct cpufreq_policy *policy);
int (*resume)(struct cpufreq_policy *policy);
/* Will be called after the driver is fully initialized */
void (*ready)(struct cpufreq_policy *policy);
struct freq_attr **attr;
/* platform specific boost support code */
bool boost_supported;
bool boost_enabled;
int (*set_boost)(int state);
};
Notes:
flags: hint flags. If CPUFREQ_CONST_LOOPS is set, the kernel is told that loops_per_jiffy does not change when the CPU frequency changes.
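init() is a per-CPU initialization function pointer. It is called whenever a new CPU is registered with the system and receives a pointer to a cpufreq_policy. Inside init(), the driver can set the following fields: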
policy->cpuinfo.min_freq
policy->cpuinfo.max_freq
These give the minimum and maximum frequencies the CPU supports (in kHz).
policy->cpuinfo.transition_latency
This is the latency of a CPU frequency transition (in ns).
policy->cur
This is the CPU's current frequency (in kHz).
policy->policy
policy->governor
policy->min
policy->max
These define the CPU's default policy and the minimum and maximum CPU frequencies allowed under that default policy.
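The verify() member validates the user's CPUFreq policy settings and corrects them where needed. Whenever the user sets a new policy, it checks the new settings against the old and new policy and fixes any invalid values. Implementations commonly use the following helper: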
linux/cpufreq.h
static inline void cpufreq_verify_within_limits(struct cpufreq_policy *policy,
unsigned int min, unsigned int max)
{
if (policy->min < min)
policy->min = min;
if (policy->max < min)
policy->max = min;
if (policy->min > max)
policy->min = max;
if (policy->max > max)
policy->max = max;
if (policy->min > policy->max)
policy->min = policy->max;
return;
}
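The setpolicy() member takes a policy parameter with members such as policy->policy, policy->min and policy->max. CPUs that implement it can usually adjust their frequency on their own within a range (the limit, from policy->min to policy->max). Only a few drivers currently provide this member. Most CPUs do not implement it and provide only target(), whose parameter is a specific frequency.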
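The target() member sets the frequency to a specified value. It takes three arguments: policy, target_freq and relation. target_freq is the target frequency. The driver must set the real CPU frequency as close to target_freq as possible, and the frequency it sets must lie between policy->min and policy->max.

relation determines which nearby frequency is chosen:
- CPUFREQ_RELATION_L: the chosen frequency should be greater than or equal to target_freq.
- CPUFREQ_RELATION_H: the chosen frequency should be less than or equal to target_freq.
- CPUFREQ_RELATION_C: the helper shown later picks the closest frequency.

Listing 19.1 marks target() as deprecated. Newer drivers, such as the OMAP driver in Listing 19.2, implement target_index() instead, which receives the index of the chosen frequency-table entry.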
Table 19.1 summarizes how setpolicy() and target() differ in the CPUs they are intended for and in how they are called.
Table 19.1 Differences between setpolicy() and target(): intended CPUs and calling method
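Because of how the on-chip PLLs and dividers are related, ARM SoCs generally cannot adjust their frequency autonomously. Instead, the SoC's CPUFreq driver usually provides a frequency table, and the frequency changes only among the entries in that table. Such drivers therefore usually implement target().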
The CPUFreq core provides a set of frequency-table helper APIs.
linux/cpufreq.h
int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table);
drivers/cpufreq/freq_table.c
int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table)
{
struct cpufreq_frequency_table *pos;
unsigned int min_freq = ~0;
unsigned int max_freq = 0;
unsigned int freq;
/* iterate over every valid entry in the table */
cpufreq_for_each_valid_entry(pos, table) {
freq = pos->frequency;
if (!cpufreq_boost_enabled()
&& (pos->flags & CPUFREQ_BOOST_FREQ))
continue;
pr_debug("table entry %u: %u kHz\n", (int)(pos - table), freq);
if (freq < min_freq)
min_freq = freq;
if (freq > max_freq)
max_freq = freq;
}
policy->min = policy->cpuinfo.min_freq = min_freq;
policy->max = policy->cpuinfo.max_freq = max_freq;
if (policy->min == ~0)
return -EINVAL;
else
return 0;
}
EXPORT_SYMBOL_GPL(cpufreq_frequency_table_cpuinfo);
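cpufreq_frequency_table_cpuinfo() is a helper for the init() member of cpufreq_driver. It scans the table and skips boost frequencies unless boost is enabled. It sets policy->cpuinfo.min_freq and policy->cpuinfo.max_freq to the lowest and highest table frequencies, and sets policy->min and policy->max to the same values. It returns -EINVAL if the table has no usable entry.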
linux/cpufreq.h
int cpufreq_frequency_table_target(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table,
unsigned int target_freq,
unsigned int relation,
unsigned int *index);
drivers/cpufreq/freq_table.c
int cpufreq_frequency_table_target(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table,
unsigned int target_freq,
unsigned int relation,
unsigned int *index)
{
struct cpufreq_frequency_table optimal = {
.driver_data = ~0,
.frequency = 0,
};
struct cpufreq_frequency_table suboptimal = {
.driver_data = ~0,
.frequency = 0,
};
struct cpufreq_frequency_table *pos;
unsigned int freq, diff, i = 0;
pr_debug("request for target %u kHz (relation: %u) for cpu %u\n", target_freq, relation, policy->cpu);
switch (relation) {
case CPUFREQ_RELATION_H:
suboptimal.frequency = ~0;
break;
case CPUFREQ_RELATION_L:
case CPUFREQ_RELATION_C:
optimal.frequency = ~0;
break;
}
cpufreq_for_each_valid_entry(pos, table) {
freq = pos->frequency;
i = pos - table;
if ((freq < policy->min) || (freq > policy->max))
continue;
if (freq == target_freq) {
optimal.driver_data = i;
break;
}
switch (relation) {
case CPUFREQ_RELATION_H:
if (freq < target_freq) {
if (freq >= optimal.frequency) {
optimal.frequency = freq;
optimal.driver_data = i;
}
} else {
if (freq <= suboptimal.frequency) {
suboptimal.frequency = freq;
suboptimal.driver_data = i;
}
}
break;
case CPUFREQ_RELATION_L:
if (freq > target_freq) {
if (freq <= optimal.frequency) {
optimal.frequency = freq;
optimal.driver_data = i;
}
} else {
if (freq >= suboptimal.frequency) {
suboptimal.frequency = freq;
suboptimal.driver_data = i;
}
}
break;
case CPUFREQ_RELATION_C:
diff = abs(freq - target_freq);
if (diff < optimal.frequency ||
(diff == optimal.frequency &&
freq > table[optimal.driver_data].frequency)) {
optimal.frequency = diff;
optimal.driver_data = i;
}
break;
}
}
if (optimal.driver_data > i) {
if (suboptimal.driver_data > i)
return -EINVAL;
*index = suboptimal.driver_data;
} else
*index = optimal.driver_data;
pr_debug("target index is %u, freq is:%u kHz\n", *index,
table[*index].frequency);
return 0;
}
EXPORT_SYMBOL_GPL(cpufreq_frequency_table_target);
cpufreq_frequency_table_target() is a helper for the target() member of cpufreq_driver. It stores in *index the frequency-table index of the frequency to be set.
Listing 19.2 shows the core structure of one SoC CPUFreq driver instance, drivers/cpufreq/omap-cpufreq.c.
Listing 19.2 The OMAP CPUFreq driver
static int omap_target(struct cpufreq_policy *policy, unsigned int index)
{
int r, ret;
struct dev_pm_opp *opp;
unsigned long freq, volt = 0, volt_old = 0, tol = 0;
unsigned int old_freq, new_freq;
old_freq = policy->cur;
new_freq = freq_table[index].frequency; /* frequency of the selected table entry (kHz) */
freq = new_freq * 1000;
ret = clk_round_rate(policy->clk, freq);
if (IS_ERR_VALUE(ret)) {
dev_warn(mpu_dev,
"CPUfreq: Cannot find matching frequency for %lu\n",
freq);
return ret;
}
freq = ret;
if (mpu_reg) {
rcu_read_lock();
opp = dev_pm_opp_find_freq_ceil(mpu_dev, &freq);
if (IS_ERR(opp)) {
rcu_read_unlock();
dev_err(mpu_dev, "%s: unable to find MPU OPP for %d\n",
__func__, new_freq);
return -EINVAL;
}
volt = dev_pm_opp_get_voltage(opp);
rcu_read_unlock();
tol = volt * OPP_TOLERANCE / 100;
volt_old = regulator_get_voltage(mpu_reg);
}
dev_dbg(mpu_dev, "cpufreq-omap: %u MHz, %ld mV --> %u MHz, %ld mV\n",
old_freq / 1000, volt_old ? volt_old / 1000 : -1,
new_freq / 1000, volt ? volt / 1000 : -1);
/* scaling up? scale voltage before frequency */
if (mpu_reg && (new_freq > old_freq)) {
r = regulator_set_voltage(mpu_reg, volt - tol, volt + tol); /* set the voltage */
if (r < 0) {
dev_warn(mpu_dev, "%s: unable to scale voltage up.\n",
__func__);
return r;
}
}
ret = clk_set_rate(policy->clk, new_freq * 1000); /* set the frequency */
/* scaling down? scale voltage after frequency */
if (mpu_reg && (new_freq < old_freq)) {
r = regulator_set_voltage(mpu_reg, volt - tol, volt + tol);
if (r < 0) {
dev_warn(mpu_dev, "%s: unable to scale voltage down.\n",
__func__);
clk_set_rate(policy->clk, old_freq * 1000);
return r;
}
}
return ret;
}
static int omap_cpu_init(struct cpufreq_policy *policy)
{
int result;
policy->clk = clk_get(NULL, "cpufreq_ck");
if (IS_ERR(policy->clk))
return PTR_ERR(policy->clk);
if (!freq_table) {
/* create a cpufreq table from the device's OPPs */
result = dev_pm_opp_init_cpufreq_table(mpu_dev, &freq_table);
if (result) {
dev_err(mpu_dev,
"%s: cpu%d: failed creating freq table[%d]\n",
__func__, policy->cpu, result);
goto fail;
}
}
atomic_inc_return(&freq_table_users);
/* FIXME: what's the actual transition time? */
result = cpufreq_generic_init(policy, freq_table, 300 * 1000);
if (!result)
return 0;
freq_table_free();
fail:
clk_put(policy->clk);
return result;
}
static struct cpufreq_driver omap_driver = {
.flags = CPUFREQ_STICKY | CPUFREQ_NEED_INITIAL_FREQ_CHECK,
.verify = cpufreq_generic_frequency_table_verify,
.target_index = omap_target,
.get = cpufreq_generic_get,
.init = omap_cpu_init,
.exit = omap_cpu_exit,
.name = "omap",
.attr = cpufreq_generic_attr,
};
static int omap_cpufreq_probe(struct platform_device *pdev)
{
mpu_dev = get_cpu_device(0);
if (!mpu_dev) {
pr_warn("%s: unable to get the mpu device\n", __func__);
return -EINVAL;
}
mpu_reg = regulator_get(mpu_dev, "vcc");
if (IS_ERR(mpu_reg)) {
pr_warn("%s: unable to get MPU regulator\n", __func__);
mpu_reg = NULL;
} else {
/*
* Ensure physical regulator is present.
* (e.g. could be dummy regulator.)
*/
if (regulator_get_voltage(mpu_reg) < 0) {
pr_warn("%s: physical regulator not present for MPU\n",
__func__);
regulator_put(mpu_reg);
mpu_reg = NULL;
}
}
return cpufreq_register_driver(&omap_driver); /* register the cpufreq_driver instance */
}
static int omap_cpufreq_remove(struct platform_device *pdev)
{
return cpufreq_unregister_driver(&omap_driver); /* unregister the cpufreq_driver instance */
}
static struct platform_driver omap_cpufreq_platdrv = {
.driver = {
.name = "omap-cpufreq",
},
.probe = omap_cpufreq_probe,
.remove = omap_cpufreq_remove,
};
module_platform_driver(omap_cpufreq_platdrv);
19.2.2 CPUFreq Policies
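The SoC CPUFreq driver only sets up the CPU's frequency parameters and provides the means to change the frequency. It does not decide which frequency the CPU should run at. The criteria for choosing a frequency, and how the frequency changes, are determined entirely by the CPUFreq policy (governor). The available policies are listed in Table 19.2.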
Table 19.2 CPUFreq policies and how they are implemented
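Android adds an interactive policy intended for latency-sensitive UI interaction. While UI interaction is in progress, this policy raises the CPU frequency more aggressively and more promptly.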
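The system state and the CPUFreq policy together determine the target frequency. The CPUFreq core passes that target to the underlying SoC CPUFreq driver, which programs the hardware to complete the frequency change, as shown in Figure 19.2.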
Figure 19.2 CPUFreq, system load, policy and frequency scaling
User space normally configures CPUFreq through the /sys/devices/system/cpu/cpux/cpufreq/xxx nodes. For example, to set the CPU to 700 MHz using the userspace policy, run:
# echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo 700000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
19.2.3 CPUFreq Performance Testing and Tuning
Since Linux 3.1, the cpupower-utils tool set has been part of the kernel tree under tools/power/cpupower. Its cpufreq-bench tool helps engineers analyze how CPUFreq affects system performance.
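cpufreq-bench works by simulating an "idle → busy → idle → busy" workload, which triggers dynamic frequency changes. It runs the workload under policies such as ondemand, conservative or interactive. It then compares the time needed to finish the same computation against the time needed under the high-frequency performance policy, and reports the result as a ratio.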
After cross-compiling the tool, copy it into /usr/sbin/ on the target board's filesystem and run:
cpufreq-bench -l 50000 -s 100000 -x 50000 -y 100000 -g ondemand -r 5 -n 5 -v
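The tool prints a series of results. The "Round n" lines give the performance of the policy selected with -g (here ondemand) as a percentage of the performance policy. Suppose the values are: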
Round 1 - 39.74%
Round 2 - 36.35%
Round 3 - 47.91%
Round 4 - 54.22%
Round 5 - 58.64%
These results are clearly not ideal. Using Android's interactive policy on the same platform gives:
Round 1 - 72.95%
Round 2 - 87.20%
Round 3 - 91.21%
Round 4 - 94.10%
Round 5 - 94.93%
Note:
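As a rule of thumb, after CPUFreq is enabled to scale frequency and voltage dynamically, performance should reach about 90% of that under the high-performance performance policy.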
19.2.4 CPUFreq Notifications
The CPUFreq subsystem sends notifications in two situations: when the CPUFreq policy changes and when the CPU's running frequency changes.
A policy change sends three notifications:
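CPUFREQ_ADJUST:
Every registered notifier may modify the frequency range (policy->min and policy->max) according to hardware or thermal conditions.
CPUFREQ_INCOMPATIBLE:
Intended only for avoiding hardware errors; a notifier may modify the policy limits here.
CPUFREQ_NOTIFY:
Every registered notifier is informed that the new policy has been set.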
A frequency change sends two notifications:
CPUFREQ_PRECHANGE: the frequency is about to change
CPUFREQ_POSTCHANGE: the frequency change has completed
CPUFREQ_PRECHANGE and CPUFREQ_POSTCHANGE are sent as follows:
srcu_notifier_call_chain(&cpufreq_transition_notifier_list, CPUFREQ_PRECHANGE, freqs);
srcu_notifier_call_chain(&cpufreq_transition_notifier_list, CPUFREQ_POSTCHANGE, freqs);
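A module that cares about CPUFREQ_PRECHANGE or CPUFREQ_POSTCHANGE can watch for them through the Linux notifier mechanism. For example, drivers/video/sa1100fb.c must reconfigure its own hardware while the CPU frequency changes. It registers a notifier and handles CPUFREQ_PRECHANGE and CPUFREQ_POSTCHANGE differently, as shown in Listing 19.3.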
Listing 19.3 CPUFreq notifier
/*
* CPU clock speed change handler. We need to adjust the LCD timing
* parameters when the CPU clock is adjusted by the power management
* subsystem.
*/
static int
sa1100fb_freq_transition(struct notifier_block *nb, unsigned long val,
void *data)
{
struct sa1100fb_info *fbi = TO_INF(nb, freq_transition);
struct cpufreq_freqs *f = data;
u_int pcd;
switch (val) {
case CPUFREQ_PRECHANGE:
set_ctrlr_state(fbi, C_DISABLE_CLKCHANGE);
break;
case CPUFREQ_POSTCHANGE:
pcd = get_pcd(fbi->fb.var.pixclock, f->new);
fbi->reg_lccr3 = (fbi->reg_lccr3 & ~0xff) | LCCR3_PixClkDiv(pcd);
set_ctrlr_state(fbi, C_ENABLE_CLKCHANGE);
break;
}
return 0;
}
In addition, if the CPU frequency changes during system suspend/resume (sleep/wakeup), the CPUFreq subsystem also sends the CPUFREQ_SUSPENDCHANGE and CPUFREQ_RESUMECHANGE notifications.
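Besides CPUs, some non-CPU devices also support multiple operating frequencies and voltages, that is, they have multiple OPPs. An OPP set is the collection of <frequency, voltage> pairs that a domain supports. Kernels after Linux 3.2 also support DVFS for such non-CPU devices through a subsystem called Devfreq, which lives in drivers/devfreq.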