Recently I was punched in the face by a TSC issue on our latest platform.
It has blocked many projects that rely heavily on the TSC, such as
time-sensitive modules like audio and video.
The problem can be summarized as follows: the default TSC calibration algorithm is not
suitable on this new platform, so we need to use either the MSR or the CPUID calibration
algorithm instead. In the end we chose CPUID to solve our problem.
But every coin has two sides: although we solved the problem, we also introduced a new
performance regression on a server. Essentially, the fix introduced two methods of
TSC calibration: for cores earlier than SKYLAKE (including ATOM) we use MSR calibration,
and for platforms from SKYLAKE onward (inclusive) we use CPUID. We then found
a performance regression on a BDW server: the unixbench score dropped by 15%.
Checking the test results showed that with MSR calibration the TSC frequency is 5MHz
lower than with the original PIT calibration algorithm. And since the platform uses
the TSC as its clock source:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
the accuracy of the TSC might affect the unixbench result.
Of course we can change the clock source:
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
but first we have to figure out which result we should trust, the MSR or the PIT.
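The same check can be done from a test program before it trusts any timing result. The following is only a minimal sketch: the helper names read_sysfs_line and clocksource_is_tsc are hypothetical, and the sysfs path passed in is assumed to be the current_clocksource file shown above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a sysfs attribute into buf and strip the
 * trailing newline. Returns 0 on success, -1 on failure.
 * (Hypothetical helper, not part of any library.)
 */
int read_sysfs_line(const char *path, char *buf, size_t len)
{
	FILE *fp = fopen(path, "r");

	if (!fp)
		return -1;
	if (!fgets(buf, (int)len, fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Returns 1 if the file names "tsc", 0 if it names something else, -1 on error. */
int clocksource_is_tsc(const char *path)
{
	char name[64];

	if (read_sysfs_line(path, name, sizeof(name)) != 0)
		return -1;
	return strcmp(name, "tsc") == 0;
}
```

Calling clocksource_is_tsc("/sys/devices/system/clocksource/clocksource0/current_clocksource") on the machine above would be expected to report that the TSC is in use.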
I first decided to use clock_gettime with two different clock IDs:
CLOCK_REALTIME
CLOCK_MONOTONIC
to get the RTC time and the clocksource (TSC) time respectively, and compare them.
Why did I want to use the RTC? Because, as the man page says, the RTC
is set in the BIOS and is not affected by normal processes.
So it should be a very reliable clock, at least according to the man page.
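A minimal sketch of that first experiment, assuming a Linux system where clock_gettime supports both clock IDs. The helper names timespec_to_ns and realtime_minus_monotonic_ns are hypothetical. The idea is to sample the offset between the two clocks twice, some time apart, and see whether the offset changes.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Convert a timespec to a single nanosecond count. */
int64_t timespec_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/*
 * Store CLOCK_REALTIME minus CLOCK_MONOTONIC (in ns) in *out.
 * The two reads are not atomic, so the result carries a small
 * jitter from the time spent between the two calls.
 * Returns 0 on success, -1 if either clock could not be read.
 */
int realtime_minus_monotonic_ns(int64_t *out)
{
	struct timespec rt, mono;

	if (clock_gettime(CLOCK_REALTIME, &rt) != 0 ||
	    clock_gettime(CLOCK_MONOTONIC, &mono) != 0)
		return -1;
	*out = timespec_to_ns(&rt) - timespec_to_ns(&mono);
	return 0;
}
```

If the offset stays essentially constant between two samples, the two clocks are advancing at the same rate, which is a hint that they may share the same underlying counter.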
So what is CLOCK_REALTIME, and what is CLOCK_MONOTONIC? RTFSC!
linux/kernel/time/posix-timers.c:
posix_timers_register_clock(CLOCK_REALTIME, &clock_realtime);
posix_timers_register_clock(CLOCK_MONOTONIC, &clock_monotonic);

void posix_timers_register_clock(const clockid_t clock_id,
				 struct k_clock *new_clock)
{
	posix_clocks[clock_id] = *new_clock;
}

struct k_clock clock_realtime = {
	.clock_getres	= posix_get_hrtimer_res,
	.clock_get	= posix_clock_realtime_get,
	.clock_set	= posix_clock_realtime_set,
	.clock_adj	= posix_clock_realtime_adj,
	.nsleep		= common_nsleep,
	.nsleep_restart	= hrtimer_nanosleep_restart,
	.timer_create	= common_timer_create,
	.timer_set	= common_timer_set,
	.timer_get	= common_timer_get,
	.timer_del	= common_timer_del,
};

struct k_clock clock_monotonic = {
	.clock_getres	= posix_get_hrtimer_res,
	.clock_get	= posix_ktime_get_ts,
	.nsleep		= common_nsleep,
	.nsleep_restart	= hrtimer_nanosleep_restart,
	.timer_create	= common_timer_create,
	.timer_set	= common_timer_set,
	.timer_get	= common_timer_get,
	.timer_del	= common_timer_del,
};
Let's first look at real time get:
static int posix_clock_realtime_get(clockid_t which_clock, struct timespec *tp)
{
	ktime_get_real_ts(tp);
	return 0;
}
So it finally invokes:
int __getnstimeofday64(struct timespec64 *ts)
{
	struct timekeeper *tk = &tk_core.timekeeper;
	unsigned long seq;
	s64 nsecs = 0;

	do {
		seq = read_seqcount_begin(&tk_core.seq);

		ts->tv_sec = tk->xtime_sec;
		nsecs = timekeeping_get_ns(&tk->tkr_mono);

	} while (read_seqcount_retry(&tk_core.seq, seq));

	ts->tv_nsec = 0;
	timespec64_add_ns(ts, nsecs);

	/*
	 * Do not bail out early, in case there were callers still using
	 * the value, even in the face of the WARN_ON.
	 */
	if (unlikely(timekeeping_suspended))
		return -EAGAIN;
	return 0;
}
It first gets the current xtime_sec (which is updated in the tick handler), then gets
the offset in nanoseconds by subtracting cycle_last (also updated in the tick handler)
from the current clocksource cycle count, converts that cycle offset to nanoseconds,
and adds the result to xtime_sec to get the final current time. So, as we can see,
this method also uses the clock source, and since our clock source is the TSC, using
CLOCK_REALTIME is not acceptable. We have to find a cleaner way to check the actual
elapsed time. What we really want is to access the RTC directly, bypassing the
timekeeping subsystem, and there is a command that does just that, hwclock:
http://linux.die.net/man/8/hwclock
http://linux.die.net/man/4/rtc
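The cycle-to-nanosecond step described above is why a wrong calibration leaks into CLOCK_REALTIME. Below is a simplified sketch of a mult/shift conversion in the style the kernel uses. It omits the accumulated fractional nanoseconds that the real timekeeping code carries, and the function names are hypothetical. The 2 GHz base frequency used in the usage note is an assumption for illustration only; the post gives only the 5MHz difference.

```c
#include <stdint.h>

/*
 * Derive the multiplier for a counter running at khz kilohertz so that
 * ns = (cycles * mult) >> shift. Rounds to the nearest integer.
 */
uint32_t khz_to_mult(uint32_t khz, uint32_t shift)
{
	uint64_t tmp = ((uint64_t)1000000) << shift;

	tmp += khz / 2;
	return (uint32_t)(tmp / khz);
}

/*
 * Convert the cycles elapsed since cycle_last into nanoseconds.
 * mask handles counters narrower than 64 bits. The multiplication can
 * overflow for very large deltas; this sketch does not guard against that.
 */
uint64_t cycles_to_ns(uint64_t cycle_now, uint64_t cycle_last,
		      uint64_t mask, uint32_t mult, uint32_t shift)
{
	uint64_t delta = (cycle_now - cycle_last) & mask;

	return (delta * mult) >> shift;
}
```

With an assumed true frequency of 2,000,000 kHz and shift 24, 2,000,000 cycles convert to exactly 1,000,000 ns. If the same cycles are converted with a multiplier derived from 1,995,000 kHz (5MHz too low), the result is larger than 1,000,000 ns, so the system clock runs fast.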
Since the RTC is more reliable, we compare the RTC with the TSC-based time, both
on the previous implementation using PIT calibration and on the current, incorrect
implementation using MSR calibration.
Available clock sources:
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Then I tried clock_gettime(CLOCK_REALTIME). If I understand correctly,
CLOCK_REALTIME returns the wall time as (xtime + offset),
and the offset is based on the current clock_source.read_cycle, so I guess
CLOCK_REALTIME is not safe either. The 'hwclock' command, on the other hand,
should return the RTC wall time directly, while the 'date' command
returns the wall time via gettimeofday, which is based on the clock source.
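For completeness, here is a rough sketch of reading the RTC from a program without going through the timekeeping code, using the RTC_RD_TIME ioctl documented in rtc(4). The helper names read_rtc and rtc_to_epoch_utc are hypothetical, and the conversion assumes the RTC holds UTC, which may not be how the machine in this post is configured.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <fcntl.h>
#include <linux/rtc.h>
#include <string.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

/* Read the RTC fields from dev (e.g. /dev/rtc). Returns 0 or -1. */
int read_rtc(const char *dev, struct rtc_time *rt)
{
	int fd = open(dev, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = ioctl(fd, RTC_RD_TIME, rt);
	close(fd);
	return ret < 0 ? -1 : 0;
}

/*
 * Convert RTC fields to seconds since the epoch.
 * Assumption: the RTC is kept in UTC, so timegm() is used.
 */
time_t rtc_to_epoch_utc(const struct rtc_time *rt)
{
	struct tm tm;

	memset(&tm, 0, sizeof(tm));
	tm.tm_sec = rt->tm_sec;
	tm.tm_min = rt->tm_min;
	tm.tm_hour = rt->tm_hour;
	tm.tm_mday = rt->tm_mday;
	tm.tm_mon = rt->tm_mon;
	tm.tm_year = rt->tm_year;
	return timegm(&tm);
}
```

Reading /dev/rtc usually requires root, and the RTC only has one-second resolution, so this is useful for long intervals rather than fine-grained timing.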
So here is the result on a msr-calibration based kernel:
root@lkp-bdw-de1 ~# date
Tue Jul 5 15:46:55 CST 2016
root@lkp-bdw-de1 ~# hwclock -r
Tue Jul 5 00:46:16 2016 -0.640944 seconds
distance to rtc: msr_based = 15:00:49
And here is the result from a pit-calibration based kernel:
root@lkp-bdw-de1 ~# date;hwclock -r
Tue Jul 5 20:10:19 CST 2016
Tue Jul 5 05:09:38 2016 -0.391106 seconds
distance to rtc: pit_based = 15:00:41
So it looks like the two calibration methods result in
slightly different distances. The 15 hours come from
the difference in time-zone settings: I set my own time zone
to China, which is:
UTC+08:00 (China Standard Time)
So the RTC must have been set in the following time zone:
UTC−07:00 (MT) — Arizona, Colorado, Montana, New Mexico, Utah, Wyoming, parts of Idaho, Kansas, Nebraska, Oregon, North Dakota, South Dakota, and Texas
The offset for MSR is 4 seconds bigger than that
for PIT, so can we say this is what causes the performance regression
reported by unixbench? The answer is: not exactly.
As time elapses, the result of date may drift
further and further from the RTC time, so you can only
compare the MSR and PIT deltas over the same time interval,
or at the same uptime.
E.g., we can use the following command for verification:
date;hwclock -r;sleep xxx;date;hwclock -r
that is, wait for some time, then compare the hwclock delta with the date delta.
Or just wait until the uptime reaches the same value on both kernels and compare
which one has the bigger offset.
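The comparison of the two deltas can be written down as a small calculation. This is a sketch only: drift_ppm is a hypothetical helper, and all inputs are assumed to be timestamps in seconds taken from date and hwclock at the start and end of the same interval.

```c
#include <stdint.h>

/*
 * Drift of the system clock (date) against the RTC (hwclock) over one
 * interval, in parts per million. Positive means the system clock runs
 * fast relative to the RTC. Returns 0.0 if the RTC interval is not positive.
 */
double drift_ppm(int64_t sys_start, int64_t sys_end,
		 int64_t rtc_start, int64_t rtc_end)
{
	int64_t sys_delta = sys_end - sys_start;
	int64_t rtc_delta = rtc_end - rtc_start;

	if (rtc_delta <= 0)
		return 0.0;
	return (double)(sys_delta - rtc_delta) * 1e6 / (double)rtc_delta;
}
```

For example, if date advances 3609 seconds while hwclock advances 3600 seconds, the drift is 2500 ppm. That is the size of error a 5MHz miscalibration would produce on a TSC assumed to run at 2 GHz, a figure chosen here only for illustration. Because the RTC has one-second resolution, the interval needs to be long for the result to mean anything.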
Sometimes hwclock fails to read the RTC:
# hwclock -r
hwclock: ioctl(RTC_RD_TIME) to /dev/rtc to read the time failed: Invalid argument
In that case you can use the following commands to fix it:
# sudo hwclock --systohc -D --noadjfile --utc
# sudo hwclock --set --date "06/05/13 23:00:00"
# sudo hwclock --show
Wed 05 Jun 2013 23:00:13 UTC -0.369063 seconds
# sudo shutdown now