[Knowing What but Not Why-31] CLOCK_MONOTONIC and CLOCK_REALTIME

Time: 2021-01-31 04:12:28

Recently I was punched in the face by a TSC issue on our latest platform. It
has blocked many projects that rely heavily on the TSC, such as time-sensitive
modules for audio and video.

The problem can be summarized as follows: the existing TSC calibration
algorithm is not suitable for this new platform, so we need to use either the
MSR or the CPUID calibration algorithm instead. In the end we chose CPUID.


But every coin has two sides: although the fix solved the original problem, it
also introduced a performance regression on a server. The fix brought in two
methods of TSC calibration: for cores earlier than SKYLAKE (including ATOM) we
use MSR calibration, and for SKYLAKE and later platforms we use CPUID. We then
found a performance regression on a BDW server, where the unixbench score
dropped by 15%.

Checking the test results showed that with msr calibration the TSC frequency is
5MHz lower than with the original PIT calibration algorithm. And since the
platform uses the TSC as its clock source:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
tsc
the accuracy of the TSC might affect the unixbench result.

Of course we can change the clock source:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
tsc hpet acpi_pm
but first we have to figure out which result to trust: the msr one or the pit one?

I decided first to use clock_gettime with different parameters:

CLOCK_REALTIME

CLOCK_MONOTONIC
to get the rtc and clocksource (tsc) time respectively, and compare the two.

Why use the rtc? Because, as the man page says, the rtc is set in the BIOS and
is not affected by normal processes. So it should be a very reliable clock, at
least according to the man page.


So, what is CLOCK_REALTIME, and what is CLOCK_MONOTONIC? RTFSC!

linux/kernel/time/posix-timers.c:

posix_timers_register_clock(CLOCK_REALTIME, &clock_realtime);
posix_timers_register_clock(CLOCK_MONOTONIC, &clock_monotonic);

void posix_timers_register_clock(const clockid_t clock_id,
                                 struct k_clock *new_clock)
{
        posix_clocks[clock_id] = *new_clock;
}


struct k_clock clock_realtime = {
        .clock_getres   = posix_get_hrtimer_res,
        .clock_get      = posix_clock_realtime_get,
        .clock_set      = posix_clock_realtime_set,
        .clock_adj      = posix_clock_realtime_adj,
        .nsleep         = common_nsleep,
        .nsleep_restart = hrtimer_nanosleep_restart,
        .timer_create   = common_timer_create,
        .timer_set      = common_timer_set,
        .timer_get      = common_timer_get,
        .timer_del      = common_timer_del,
};

struct k_clock clock_monotonic = {
        .clock_getres   = posix_get_hrtimer_res,
        .clock_get      = posix_ktime_get_ts,
        .nsleep         = common_nsleep,
        .nsleep_restart = hrtimer_nanosleep_restart,
        .timer_create   = common_timer_create,
        .timer_set      = common_timer_set,
        .timer_get      = common_timer_get,
        .timer_del      = common_timer_del,
};

Let's first look at the CLOCK_REALTIME getter:

static int posix_clock_realtime_get(clockid_t which_clock, struct timespec *tp)
{
        ktime_get_real_ts(tp);
        return 0;
}
which eventually invokes:

int __getnstimeofday64(struct timespec64 *ts)
{
        struct timekeeper *tk = &tk_core.timekeeper;
        unsigned long seq;
        s64 nsecs = 0;

        do {
                seq = read_seqcount_begin(&tk_core.seq);

                ts->tv_sec = tk->xtime_sec; 
                nsecs = timekeeping_get_ns(&tk->tkr_mono);

        } while (read_seqcount_retry(&tk_core.seq, seq));

        ts->tv_nsec = 0;
        timespec64_add_ns(ts, nsecs);

        /*
         * Do not bail out early, in case there were callers still using
         * the value, even in the face of the WARN_ON.
         */
        if (unlikely(timekeeping_suspended))
                return -EAGAIN;
        return 0;
}

It first reads the current xtime_sec (which is updated in the tick handler), and
then computes the offset in nanoseconds: it subtracts cycle_last (also updated
in the tick handler) from the current cycle count of the clock source, converts
that cycle delta to nanoseconds, and adds the result to xtime_sec to get the
current time. So, as we can see, this method also uses the clock source, and
since our clock source is the TSC, CLOCK_REALTIME is not acceptable as a
reference for checking the TSC.

We need a more graceful way to check how much time has actually elapsed. What we
really want is to access the rtc directly, bypassing the timekeeping subsystem,
and there is a command for exactly that, hwclock:

http://linux.die.net/man/8/hwclock

http://linux.die.net/man/4/rtc


Since the rtc is more reliable, we compare the rtc with the tsc-based time on
both the previous implementation using pit calibration and the current,
incorrect implementation using msr calibration.

Available and current clock source:

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

I then considered clock_gettime(CLOCK_REALTIME). If I understand correctly,
CLOCK_REALTIME returns the wall time as (xtime + offset), and the offset is
based on the cycle count read from the current clock source, so I guess
CLOCK_REALTIME is not safe either. The 'hwclock' command, on the other hand,
should return the RTC wall time directly, whereas the 'date' command returns
the wall time via gettimeofday, which is based on the clock source.

 

So here is the result on a msr-calibration based kernel:

root@lkp-bdw-de1 ~# date
Tue Jul  5 15:46:55 CST 2016
root@lkp-bdw-de1 ~# hwclock -r
Tue Jul  5 00:46:16 2016  -0.640944 seconds

distance to rtc:  msr_based = 15:00:49

 

And here is the result from a pit-calibration based kernel:

root@lkp-bdw-de1 ~# date;hwclock -r
Tue Jul  5 20:10:19 CST 2016
Tue Jul  5 05:09:38 2016  -0.391106 seconds

distance to rtc:  pit_based = 15:00:41


So it looks like the two calibration methods result in slightly different
distances. The 15 hours come from the difference in time-zone settings. I set
my own time zone to China, which is:

UTC+08:00 (China Standard Time)

so the rtc must have been set in the following time zone:

UTC−07:00 (MT): Arizona, Colorado, Montana, New Mexico, Utah, Wyoming,
parts of Idaho, Kansas, Nebraska, Oregon, North Dakota, South Dakota,
and Texas

The offset with msr is 4 seconds bigger than the offset with pit, so can we say
that this may cause the performance regression reported by unixbench? The
answer is: not exactly. As time elapses, the result of date may drift further
and further from the rtc time, so the MSR and PIT deltas can only be compared
either over the same time interval or at the same uptime.


For example, we can use the following command for verification:

date;hwclock -r;sleep xxx;date;hwclock -r

that is, wait for some time and then compare the actual hwclock delta with the
date delta.

Or just wait until the uptime reaches the same value on both kernels, and
compare which one has the bigger offset.


Sometimes hwclock fails to read the rtc:

hwclock -r
hwclock: ioctl(RTC_RD_TIME) to /dev/rtc to read the time failed: Invalid argumen

in which case the following commands can be used to fix it:

# sudo hwclock --systohc -D --noadjfile --utc
# sudo hwclock --set --date "06/05/13 23:00:00"
# sudo hwclock --show
Wed 05 Jun 2013 23:00:13 UTC  -0.369063 seconds
# sudo shutdown now