Reading notes about High-resolution timer management on Linux

Time: 2022-01-06 20:39:26
Author: Honggang Yang(Joseph) <eagle.rtlinux@gmail.com>
Kernel Version: Linux 3.1.1
Last modified: 11-26-2011
==================================================================

REF:
       http://kerneldox.com
       Professional Linux Kernel Architecture
         独辟蹊径品内核Linux内核源码导读   李云华
       Kernel APIs, Part 3: Timers and lists in the 2.6 kernel:
                 http://www.ibm.com/developerworks/linux/library/l-timers-list/index.html
--------------------------------------------------------------------------------------------------------
Contents:
0. Overview of HRT(High-resolution Timer)
1. Data  structures
2. How to use HRT (high-resolution timers) in your modules
    2.1 APIs

    2.2 A simple demo

3. HRT implementation

    3.1 HRT initialization
    3.2 HRT in low-resolution mode
    3.3  High-Resolution Timers in High-Resolution Mode
    3.4 Periodic Tick Emulation   
    3.5 Switching to high-resolution timers
    3.6 High-Resolution Timers Operations
        3.6.1 hrtimer initialization
        3.6.2 add a hrtimer
        3.6.3 remove a hrtimer
4. HRT related system call

Appendix I:  Works have to be done, before we "can" switch to high-resolution mode
Appendix II:  What's the struct timerqueue_head's next member used for ?
Appendix III: How we build the relationship between the "Generic Time Subsystem" layer,
                            the "low resolution time subsystem" and "high-resolution timer system"    
Appendix IV:  Detail explanation of some important 'time' members

=====================================================================
Contents:

0. Overview of HRT(High-resolution Timer)

    
HRT is a second timing mechanism alongside low-resolution timers.
While low-resolution timers use jiffies as their fundamental unit of time,
HRTs use human time units, namely nanoseconds. One nanosecond is a precisely
defined time interval, whereas the length of one jiffy depends on the
kernel configuration.

Another fundamental difference distinguishes HRTs from low-resolution timers:
HRTs are kept time-ordered on a red-black tree.

Because low-resolution timers are implemented on top of the high-resolution mechanism,
partial support for high-resolution timers is built into the kernel even if
support for them is not explicitly enabled. In that case, however, the system can
only provide timers with low-resolution capabilities.


    

1. Data  structures

/**
 * struct hrtimer - the basic hrtimer structure
 * @node:   timerqueue node, which also manages node.expires,
 *      the absolute expiry time in the hrtimers internal
 *      representation. The time is related to the clock on
 *      which the timer is based. Is setup by adding
 *      slack to the _softexpires value. For non range timers
 *      identical to _softexpires.
 * @_softexpires: the absolute earliest expiry time of the hrtimer.
 *      The time which was given as expiry time when the timer
 *      was armed.
 * @function:   timer expiry callback function
 * @base:   pointer to the timer base (per cpu and per clock)
 * @state:  state information (See bit values above)
 * @start_site: timer statistics field to store the site where the timer
 *      was started
 * @start_comm: timer statistics field to store the name of the process which
 *      started the timer
 * @start_pid: timer statistics field to store the pid of the task which
 *      started the timer
 *
 * The hrtimer structure must be initialized by hrtimer_init()
 */
struct hrtimer {
    struct timerqueue_node      node;
    ktime_t             _softexpires;
    enum hrtimer_restart        (*function)(struct hrtimer *);
    struct hrtimer_clock_base   *base;
    unsigned long           state;
#ifdef CONFIG_TIMER_STATS
    int             start_pid;
    void                *start_site;
    char                start_comm[16];
#endif
};

struct timerqueue_node {
    struct rb_node node;
    ktime_t expires;
};

struct timerqueue_head {
    struct rb_root head;
    struct timerqueue_node *next;
};


//include/linux/time.h
/*
 * The IDs of the various system clocks (for POSIX.1b interval timers):
 */
#define CLOCK_REALTIME          0
#define CLOCK_MONOTONIC         1
#define CLOCK_PROCESS_CPUTIME_ID    2
#define CLOCK_THREAD_CPUTIME_ID     3
#define CLOCK_MONOTONIC_RAW     4
#define CLOCK_REALTIME_COARSE       5
#define CLOCK_MONOTONIC_COARSE      6
#define CLOCK_BOOTTIME          7
#define CLOCK_REALTIME_ALARM        8
#define CLOCK_BOOTTIME_ALARM        9

/*
 * The IDs of various hardware clocks:
 */
#define CLOCK_SGI_CYCLE         10
#define MAX_CLOCKS          16
#define CLOCKS_MASK         (CLOCK_REALTIME | CLOCK_MONOTONIC)
#define CLOCKS_MONO         CLOCK_MONOTONIC


enum  hrtimer_base_type {
    HRTIMER_BASE_MONOTONIC,//0
    HRTIMER_BASE_REALTIME,//1
    HRTIMER_BASE_BOOTTIME,//2
    HRTIMER_MAX_CLOCK_BASES,//3
};

/**
 * struct hrtimer_clock_base - the timer base for a specific clock
 * @cpu_base:       per cpu clock base
 * @index:      clock type index for per_cpu support when moving a
 *          timer to a base on another cpu.
 * @clockid:        clock id for per_cpu support
 * @active:     red black tree root node for the active timers
 * @resolution:     the resolution of the clock, in nanoseconds
 * @get_time:       function to retrieve the current time of the clock
 * @softirq_time:   the time when running the hrtimer queue in the softirq
 * @offset:     offset of this clock to the monotonic base
 */
struct hrtimer_clock_base {
    struct hrtimer_cpu_base *cpu_base;
    /*
     * @index is one of the members of enum hrtimer_base_type above.
     */
    int         index;
    
    /*
    * See the above "The IDs of the various system clocks"
    */
    clockid_t       clockid;
    
    struct timerqueue_head  active;
    ktime_t         resolution;
    ktime_t         (*get_time)(void);
    ktime_t         softirq_time;
    /*
     * When the real-time clock is adjusted, a discrepancy arises between the
     * expiration values of timers stored on the CLOCK_REALTIME clock base and
     * the current real time. The offset field fixes this by denoting the offset
     * by which the timers need to be corrected.
     */
    ktime_t         offset;
};
/*
 * struct hrtimer_cpu_base - the per cpu clock bases
 * @lock:       lock protecting the base and associated clock bases
 *          and timers
 * @active_bases:   Bitfield to mark bases with active timers (bit i set means
 *          that hrtimer_clock_base i has active timers)
 * @expires_next:   absolute time of the next event which was scheduled
 *          via clock_set_next_event()
 * @hres_active:    State of high resolution mode
 * @hang_detected:  The last hrtimer interrupt detected a hang
 * @nr_events:      Total number of hrtimer interrupt events
 * @nr_retries:     Total number of hrtimer interrupt retries
 * @nr_hangs:       Total number of hrtimer interrupt hangs
 * @max_hang_time:  Maximum time spent in hrtimer_interrupt
 * @clock_base:     array of clock bases for this cpu
 */
struct hrtimer_cpu_base {
    raw_spinlock_t          lock;
    unsigned long           active_bases;
#ifdef CONFIG_HIGH_RES_TIMERS
    ktime_t             expires_next;
    int             hres_active;
    int             hang_detected;
    unsigned long           nr_events;
    unsigned long           nr_retries;
    unsigned long           nr_hangs;
    ktime_t             max_hang_time;
#endif  
    struct hrtimer_clock_base   clock_base[HRTIMER_MAX_CLOCK_BASES];
};  


A common application for high-resolution timers is to put a task to sleep for
a specified short amount of time. The kernel provides another data structure for
this purpose.

/**
 * struct hrtimer_sleeper - simple sleeper structure
 * @timer:  embedded timer structure
 * @task:   task to wake up
 *
 * task is set to NULL, when the timer expires.
 */
struct hrtimer_sleeper {
    struct hrtimer timer;
    struct task_struct *task;
};

An hrtimer instance is bundled with a pointer to the task in question. The kernel
uses hrtimer_wakeup as the expiration function for sleepers. When the timer
expires, the hrtimer_sleeper can be derived from the hrtimer using the container_of
mechanism, and the associated task can be woken up.
 

 Figure Overview of data structures used to implement high-resolution timers

Reading notes about High-resolution timer managememt on linux


As you can see in the figure above, all timers are sorted by expiration time on a red-black tree.


 You can inspect each CPU's timers with:

# cat /proc/timer_list

 Timer List Version: v0.6
HRTIMER_MAX_CLOCK_BASES: 3
now at 827822434742 nsecs

cpu: 0
 clock 0:
  .base:       ffff88006fc0e7c0
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
active timers:
 #0: <ffff88006fc0e8b0>, tick_sched_timer, S:01, hrtimer_start_range_ns, swapper/0
 # expires at 827824000000-827824000000 nsecs [in 1565258 to 1565258 nsecs]
 #1: <ffff8800364c3a68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gnome-terminal/1996
 # expires at 827829624382-827829674382 nsecs [in 7189640 to 7239640 nsecs]
 #2: <ffff880056af5a68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gnome-settings-/1920
 # expires at 827937710301-827938522299 nsecs [in 115275559 to 116087557 nsecs]
 #3: <ffff88006c579e98>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gvfs-afc-volume/1939
 # expires at 828180909773-828180959773 nsecs [in 358475031 to 358525031 nsecs]
 #4: <ffff8800672c1938>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, ssh-agent/1895
 # expires at 828461568980-828471568978 nsecs [in 639134238 to 649134236 nsecs]
 #5: <ffff88005e1dba68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gnome-panel/1934
 # expires at 828937518959-828941515957 nsecs [in 1115084217 to 1119081215 nsecs]
 #6: <ffff880056afda68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, nautilus/1940
 # expires at 829937985438-829941983436 nsecs [in 2115550696 to 2119548694 nsecs]
 #7: <ffff880056a49a68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gnome-power-man/1923
 # expires at 829936999697-829946997695 nsecs [in 2114564955 to 2124562953 nsecs]
 #8: <ffff880056b5ba68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gnome-screensav/1964
 # expires at 1362125501344-1362225501344 nsecs [in 534303066602 to 534403066602 nsecs]
 #9: <ffff880036e7af30>, it_real_fn, S:01, hrtimer_start, exim4/1541
 # expires at 1820888903968-1820888903968 nsecs [in 993066469226 to 993066469226 nsecs]
 #10: <ffff88003799ba68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, udisks-daemon/1932
 # expires at 1960718700537-1960818700537 nsecs [in 1132896265795 to 1132996265795 nsecs]
 #11: <ffff88006c44fa68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, evolution-alarm/1951
 # expires at 1964548580962-1964648580962 nsecs [in 1136726146220 to 1136826146220 nsecs]
 clock 1:
  .base:       ffff88006fc0e800
  .index:      1
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset:     1322321011280707887 nsecs
active timers:
 #0: <ffff8800376cdd08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, firefox-bin/2023
 # expires at 1322321839115776000-1322321839115826000 nsecs [in 1322321011293341258 to 1322321011293391258 nsecs]
 #1: <ffff8800370e5d08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, rs:main Q:Reg/2133
 # expires at 1322321856132073269-1322321856132123269 nsecs [in 1322321028309638527 to 1322321028309688527 nsecs]
 #2: <ffff8800375f5d08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, firefox-bin/2135
 # expires at 1322321884339805000-1322321884339855000 nsecs [in 1322321056517370258 to 1322321056517420258 nsecs]
 #3: <ffff88003769fd08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, firefox-bin/2040
 # expires at 1322321913451563000-1322321913451613000 nsecs [in 1322321085629128258 to 1322321085629178258 nsecs]
 clock 2:
  .base:       ffff88006fc0e840
  .index:      2
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset:     0 nsecs
active timers:
  .expires_next   : 827824000000 nsecs
  .hres_active    : 1
  .nr_events      : 192927
  .nr_retries     : 35
  .nr_hangs       : 0
  .max_hang_time  : 0 nsecs
  .nohz_mode      : 2
  .idle_tick      : 827820000000 nsecs
  .tick_stopped   : 0
  .idle_jiffies   : 4295099250
  .idle_calls     : 369172
  .idle_sleeps    : 349736
  .idle_entrytime : 827821325451 nsecs
  .idle_waketime  : 827818918157 nsecs
  .idle_exittime  : 827819503358 nsecs
  .idle_sleeptime : 708697206176 nsecs
  .iowait_sleeptime: 19169014146 nsecs
  .last_jiffies   : 4295099251
  .next_jiffies   : 4295099252
  .idle_expires   : 827824000000 nsecs
jiffies: 4295099251

cpu: 1
 clock 0:
  .base:       ffff88006fc8e7c0
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
active timers:
 #0: <ffff88006fc8e8b0>, tick_sched_timer, S:01, hrtimer_start_range_ns, kworker/0:0/0
 # expires at 827824000000-827824000000 nsecs [in 1565258 to 1565258 nsecs]
 #1: <ffff8800364e3a68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, firefox-bin/2018
 # expires at 827828413671-827828463671 nsecs [in 5978929 to 6028929 nsecs]
 #2: <ffff880036b8b938>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, Xorg/1112
 # expires at 828319971714-828320471712 nsecs [in 497536972 to 498036970 nsecs]
 #3: <ffff8800566e3a68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, udisks-daemon/1933
 # expires at 828719653896-828721649894 nsecs [in 897219154 to 899215152 nsecs]
 #4: <ffff880037affa68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, update-notifier/1948
 # expires at 828788961298-828790812296 nsecs [in 966526556 to 968377554 nsecs]
 #5: <ffff88006d78d938>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, init/1
 # expires at 828972049498-828977049496 nsecs [in 1149614756 to 1154614754 nsecs]
 #6: <ffff880035b99a68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, avahi-daemon/1137
 # expires at 833410127235-833416065233 nsecs [in 5587692493 to 5593630491 nsecs]
 #7: <ffff880056407a68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, kerneloops/1554
 # expires at 833719990604-834719990604 nsecs [in 5897555862 to 6897555862 nsecs]
 #8: <ffff880036e2da68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gconfd-2/1912
 # expires at 848937945707-848967938704 nsecs [in 21115510965 to 21145503962 nsecs]
 #9: <ffff880056a93e98>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, cron/1234
 # expires at 870161662638-870161712638 nsecs [in 42339227896 to 42339277896 nsecs]
 #10: <ffff880037491938>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, dhclient/1133
 # expires at 1018719296200-1018819296200 nsecs [in 190896861458 to 190996861458 nsecs]
 #11: <ffff880056a61a68>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, nm-applet/1953
 # expires at 1064936764582-1065036764582 nsecs [in 237114329840 to 237214329840 nsecs]
 #12: <ffff880037947e98>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, atd/1123
 # expires at 3616356794857-3616356844857 nsecs [in 2788534360115 to 2788534410115 nsecs]
 #13: <ffff880035b7b938>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, rsyslogd/1037
 # expires at 86413770854885-86413870854885 nsecs [in 85585948420143 to 85586048420143 nsecs]
 clock 1:
  .base:       ffff88006fc8e800
  .index:      1
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset:     1322321011280707887 nsecs
active timers:
 #0: <ffff8800376b3d08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, firefox-bin/2024
 # expires at 1322321839111141000-1322321839111191000 nsecs [in 1322321011288706258 to 1322321011288756258 nsecs]
 #1: <ffff880067311d08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, firefox-bin/2092
 # expires at 1322322005111143000-1322322005111193000 nsecs [in 1322321177288708258 to 1322321177288758258 nsecs]
 #2: <ffff880036533d08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, firefox-bin/2093
 # expires at 1322322005111670000-1322322005111720000 nsecs [in 1322321177289235258 to 1322321177289285258 nsecs]
 clock 2:
  .base:       ffff88006fc8e840
  .index:      2
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset:     0 nsecs
active timers:
  .expires_next   : 827824000000 nsecs
  .hres_active    : 1
  .nr_events      : 204353
  .nr_retries     : 8
  .nr_hangs       : 0
  .max_hang_time  : 0 nsecs
  .nohz_mode      : 2
  .idle_tick      : 827820000000 nsecs
  .tick_stopped   : 0
  .idle_jiffies   : 4295099250
  .idle_calls     : 403455
  .idle_sleeps    : 361853
  .idle_entrytime : 827819784958 nsecs
  .idle_waketime  : 827819288247 nsecs
  .idle_exittime  : 827819784958 nsecs
  .idle_sleeptime : 697317489417 nsecs
  .iowait_sleeptime: 45458018569 nsecs
  .last_jiffies   : 4295099250
  .next_jiffies   : 4295099497
  .idle_expires   : 828804000000 nsecs
jiffies: 4295099251


Tick Device: mode:     1
Broadcast device
Clock Event Device: hpet
 max_delta_ns:   149983003520
 min_delta_ns:   13409
 mult:           61496115
 shift:          32
 mode:           3
 next_event:     9223372036854775807 nsecs
 set_next_event: hpet_legacy_next_event
 set_mode:       hpet_legacy_set_mode
 event_handler:  tick_handle_oneshot_broadcast
 retries:        0
tick_broadcast_mask: 00000000
tick_broadcast_oneshot_mask: 00000000


Tick Device: mode:     1
Per CPU device: 0
Clock Event Device: lapic
 max_delta_ns:   171802420480
 min_delta_ns:   1200
 mult:           53685926
 shift:          32
 mode:           3
 next_event:     827824000000 nsecs
 set_next_event: lapic_next_event
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt
 retries:        0

Tick Device: mode:     1
Per CPU device: 1
Clock Event Device: lapic
 max_delta_ns:   171802420480
 min_delta_ns:   1200
 mult:           53685926
 shift:          32
 mode:           3
 next_event:     827824000000 nsecs
 set_next_event: lapic_next_event
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt
 retries:        0




2. How to use HRT (high-resolution timers) in your modules

This part is from:
         http://www.ibm.com/developerworks/linux/library/l-timers-list/index.html

2.1 APIs

hrtimers APIs:
EXPORT_SYMBOL_GPL(ktime_add_ns);
EXPORT_SYMBOL_GPL(ktime_sub_ns);
EXPORT_SYMBOL_GPL(ktime_add_safe);
EXPORT_SYMBOL_GPL(hrtimer_init_on_stack);
EXPORT_SYMBOL_GPL(hrtimer_forward); +
EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);
EXPORT_SYMBOL_GPL(hrtimer_start); +
EXPORT_SYMBOL_GPL(hrtimer_try_to_cancel); +
EXPORT_SYMBOL_GPL(hrtimer_cancel); +
EXPORT_SYMBOL_GPL(hrtimer_get_remaining);
EXPORT_SYMBOL_GPL(hrtimer_init); +
EXPORT_SYMBOL_GPL(hrtimer_get_res);
EXPORT_SYMBOL_GPL(hrtimer_init_sleeper);
EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
EXPORT_SYMBOL_GPL(schedule_hrtimeout);

We explain only briefly how to use some of the functions listed above.
This part covers usage only, not the implementation details.
Those are covered in section 3 (HRT implementation).


-- setting a new hrtimer

    The process begins with the initialization of a timer through hrtimer_init.
    This call takes the timer, the clock definition, and the timer mode (absolute
    or relative). The clocks are defined in ./include/linux/time.h and represent
    the various clocks that the system supports (such as the real-time clock or
    a monotonic clock that simply represents time from a starting point, such as
    system boot). Once a timer has been initialized, it can be started with
    hrtimer_start. This call takes the expiration time (as a ktime_t) and the mode
    of the time value (absolute or relative).

     
        /*
         * Mode arguments of xxx_hrtimer functions:
         */
        enum hrtimer_mode {
            HRTIMER_MODE_ABS = 0x0,     /* Time value is absolute */
            HRTIMER_MODE_REL = 0x1,     /* Time value is relative to now */
            HRTIMER_MODE_PINNED = 0x02, /* Timer is bound to CPU */
            HRTIMER_MODE_ABS_PINNED = 0x02,
            HRTIMER_MODE_REL_PINNED = 0x03,
        };

        /**
         * hrtimer_init - initialize a timer to the given clock
         * @timer:  the timer to be initialized
         * @clock_id:   the clock to be used (clock ids are defined in include/linux/time.h)
         * @mode:   timer mode abs/rel
         */
        void hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
                  enum hrtimer_mode mode);

        /**
         * hrtimer_start - (re)start an hrtimer on the current CPU
         * @timer:  the timer to be added
         * @tim:    expiry time
         * @mode:   expiry mode: absolute (HRTIMER_MODE_ABS) or relative (HRTIMER_MODE_REL)
         *
         * Returns:
         *  0 on success
         *  1 when the timer was active
         */
        int
        hrtimer_start(struct hrtimer *timer, ktime_t tim, const enum hrtimer_mode mode);

-- cancelling a timer


Once an hrtimer has been started, it can be cancelled through a call to hrtimer_cancel
or hrtimer_try_to_cancel. Each function takes a pointer to the hrtimer
to be stopped. They differ in how they treat a timer whose callback is
currently running: hrtimer_cancel waits for the callback function to finish,
whereas hrtimer_try_to_cancel does not wait and returns -1 to report
that the timer could not be stopped.


/**
 * hrtimer_cancel - cancel a timer and wait for the handler to finish.
 * @timer:  the timer to be cancelled
 *
 * Returns:
 *  0 when the timer was not active
 *  1 when the timer was active
 */
int hrtimer_cancel(struct hrtimer *timer);

/**
 * hrtimer_try_to_cancel - try to deactivate a timer
 * @timer:  hrtimer to stop
 *
 * Returns:
 *  0 when the timer was not active
 *  1 when the timer was active
 * -1 when the timer is currently executing the callback function and
 *    cannot be stopped
 */
int hrtimer_try_to_cancel(struct hrtimer *timer);

-- restart a hrtimer

 Usually, the timer's callback returns HRTIMER_NORESTART when it has finished executing.
In this case, the timer simply disappears from the system. However, the timer can also choose to
be restarted. This requires two steps from the callback:
1> The result of the callback must be HRTIMER_RESTART.
2> The expiration of the timer must be set to a future point in time. The
    callback function can perform this manipulation because it gets a pointer
    to the hrtimer instance of the currently running timer as its function parameter.
    To simplify matters, the kernel provides an auxiliary function to forward
    the expiration time of a timer.
    <hrtimer.h>
    u64
    hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);

    This resets the timer so that it expires after now [usually now is set to the value returned by
    hrtimer_clock_base->get_time()]. The exact expiration time is determined by taking the
    old expiration time of the timer and adding interval as many times as needed for the new
    expiration time to lie past now. The function returns the number of times that interval had
    to be added to the expiration time to exceed now.
    Let us illustrate the behavior with an example. If the old expiration time is 5, now is 12,
    and interval is 2, then the new expiration time will be 13. The return value is 4 because
    13 = 5 + 4 × 2.

    

2.2 A simple demo


#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/jiffies.h>

MODULE_LICENSE("GPL");

/* Integer arithmetic only: floating point must not be used in kernel code. */
#define MS_TO_NS(x)    ((x) * 1000000UL)

static struct hrtimer hr_timer;

/* Expiry callback; returning HRTIMER_NORESTART makes this a one-shot timer. */
static enum hrtimer_restart my_hrtimer_callback( struct hrtimer *timer )
{
  printk( KERN_INFO "my_hrtimer_callback called (%lu).\n", jiffies );

  return HRTIMER_NORESTART;
}

int init_module( void )
{
  ktime_t ktime;
  unsigned long delay_in_ms = 200UL;

  printk( KERN_INFO "HR Timer module installing\n" );

  /* 0 seconds plus 200 ms expressed in nanoseconds */
  ktime = ktime_set( 0, MS_TO_NS(delay_in_ms) );

  hrtimer_init( &hr_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL );

  hr_timer.function = &my_hrtimer_callback;

  printk( KERN_INFO "Starting timer to fire in %lums (%lu)\n", delay_in_ms, jiffies );

  /* The expiry is relative to now, matching the mode given to hrtimer_init() */
  hrtimer_start( &hr_timer, ktime, HRTIMER_MODE_REL );

  return 0;
}

void cleanup_module( void )
{
  int ret;

  /* Waits for a running callback to finish; returns 1 if the timer was active */
  ret = hrtimer_cancel( &hr_timer );
  if (ret)
    printk( KERN_INFO "The timer was still in use...\n" );

  printk( KERN_INFO "HR Timer module uninstalling\n" );
}

There's much more to the hrtimer API than has been touched on here.
One interesting aspect of older kernels is the ability to define the execution
context of the callback function (softirq or hardirq context); note that the
struct hrtimer shown in section 1 has no field for such a callback mode. You can
learn more about the hrtimer API from the include file ./include/linux/hrtimer.h.

3. HRT implementation

    3.1 HRT initialization

    

        When the HRT subsystem is initialized, the clock queues are empty, so the
        initialization work is simple. It is done by hrtimers_init().
        

        Call Tree:
        start_kernel
                hrtimers_init
                            hrtimer_cpu_notify
                                        init_hrtimers_cpu
                            register_cpu_notifier(&hrtimers_nb)
                            open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq)

                            
          /*
           * The timer bases:
           *
           * There are more clockids than hrtimer bases. Thus, we index
           * into the timer bases by the hrtimer_base_type enum. When trying
           * to reach a base using a clockid, hrtimer_clockid_to_base()
           * is used to convert from clockid to the proper hrtimer_base_type.
           */
          DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
          {

              .clock_base =
              {
                  {
                      .index = HRTIMER_BASE_MONOTONIC,
                      .clockid = CLOCK_MONOTONIC,
                      .get_time = &ktime_get,
                      .resolution = KTIME_LOW_RES,
                  },
                  {
                      .index = HRTIMER_BASE_REALTIME,
                      .clockid = CLOCK_REALTIME,
                      .get_time = &ktime_get_real,
                      .resolution = KTIME_LOW_RES,
                  },
                  {
                      .index = HRTIMER_BASE_BOOTTIME,
                      .clockid = CLOCK_BOOTTIME,
                      .get_time = &ktime_get_boottime,
                      .resolution = KTIME_LOW_RES,
                  },
              }
          };

          static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
              [CLOCK_REALTIME]    = HRTIMER_BASE_REALTIME,
              [CLOCK_MONOTONIC]   = HRTIMER_BASE_MONOTONIC,
              [CLOCK_BOOTTIME]    = HRTIMER_BASE_BOOTTIME,
          };
                               
            static struct notifier_block __cpuinitdata hrtimers_nb = {
                .notifier_call = hrtimer_cpu_notify,
            };

            /*
             * hrtimers_init - Initialize the HRT infrastructure, register @hrtimers_nb,
             *       which handles HRT-related CPU events, and install the
             *       HRTIMER_SOFTIRQ handler.
             */
            void __init hrtimers_init(void)
            {
                /*
                 * At this point @hrtimers_nb has not been registered yet, so
                 * call the notifier manually. It calls init_hrtimers_cpu() to
                 * initialize the high-resolution timer infrastructure of the
                 * current CPU.
                 */
                hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
                          (void *)(long)smp_processor_id());
                /*
                 * Register hrtimers_nb so that later CPU events (CPU_UP_PREPARE,
                 * CPU_DYING, CPU_DEAD) are handled by hrtimer_cpu_notify().
                 */
                register_cpu_notifier(&hrtimers_nb);
            #ifdef CONFIG_HIGH_RES_TIMERS
                /* Install the HRTIMER_SOFTIRQ soft irq handler */
                open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq);
            #endif
            }

            static int __cpuinit hrtimer_cpu_notify(struct notifier_block *self,
                                unsigned long action, void *hcpu)
            {
                int scpu = (long)hcpu;

                switch (action) {

                case CPU_UP_PREPARE:
                case CPU_UP_PREPARE_FROZEN:
                    init_hrtimers_cpu(scpu);
                    break;

            #ifdef CONFIG_HOTPLUG_CPU
                case CPU_DYING:
                case CPU_DYING_FROZEN:
                    clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DYING, &scpu);
                    break;
                case CPU_DEAD:
                case CPU_DEAD_FROZEN:
                {
                    clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DEAD, &scpu);
                    migrate_hrtimers(scpu);
                    break;
                }
            #endif

                default:
                    break;
                }

                return NOTIFY_OK;
            }

            /*
             * Functions related to boot-time initialization:
             */
            static void __cpuinit init_hrtimers_cpu(int cpu)
            {
                struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
                int i;

                raw_spin_lock_init(&cpu_base->lock);

                for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
                    cpu_base->clock_base[i].cpu_base = cpu_base;
                    timerqueue_init_head(&cpu_base->clock_base[i].active);
                }

                hrtimer_init_hres(cpu_base);
            }

            /*
             * Initialize the high resolution related parts of cpu_base
             */
            static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
            {
                base->expires_next.tv64 = KTIME_MAX;
                base->hres_active = 0;
            }
                   

    3.2 HRT in low-resolution mode

    
     Recall that, in the Generic Time Subsystem, tick_setup_device() is used to
     set up a tick device. If the clock event device supports periodic events,
     tick_setup_periodic() installs tick_handle_periodic() as the handler function of
     the tick device. tick_handle_periodic() is called on the next event of the tick
     device and in turn calls tick_periodic(), which is responsible for handling the
     periodic tick on the CPU passed as its argument.
     
     REF:  Reading notes about Generic Time Subsystem implementation on linux
                                     http://blog.csdn.net/ganggexiongqi/article/details/7006252

        Call Tree:
         tick_handle_periodic    
                tick_periodic
                   
        Call Tree:
         tick_periodic  |  tick_nohz_handler |  tick_sched_timer
                update_process_times
                         run_local_timers
                                 hrtimer_run_queues
                                 raise_softirq(TIMER_SOFTIRQ)

                                 
        1286 void update_process_times(int user_tick)
        1287 {
        1288     struct task_struct *p = current;
        1289     int cpu = smp_processor_id();
        1290
        1291     /* Note: this timer irq context must be accounted for as well. */
        1292     account_process_tick(p, user_tick);
        1293     run_local_timers();  // ############
        1294     rcu_check_callbacks(cpu, user_tick);
        1295     printk_tick();
        1296 #ifdef CONFIG_IRQ_WORK
        1297     if (in_irq())
        1298         irq_work_run();
        1299 #endif
        1300     scheduler_tick();
        1301     run_posix_cpu_timers(p);
        1302 }   
                               
        1317 /*  
        1318  * Called by the local, per-CPU timer interrupt on SMP.
        1319  */     
        1320 void run_local_timers(void)
        1321 {
        1322     hrtimer_run_queues(); // ############
        1323     raise_softirq(TIMER_SOFTIRQ);
        1324 }     

     Call Tree:
     hrtimer_run_queues
            hrtimer_hres_active
            
            hrtimer_get_softirq_time
            

        1429 /*
        1430  * Called from hardirq context every jiffy
              *
              * Expired hrtimers are handled here as long as high-resolution
              * mode is not active on this CPU. In that case hrtimers only get
              * tick (jiffy) granularity, not high resolution.
        1431  */
        1432 void hrtimer_run_queues(void)
        1433 {  
        1434     struct timerqueue_node *node;
        1435     struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
        1436     struct hrtimer_clock_base *base;
        1437     int index, gettime = 1;
        1438
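                    /*
                    *  hrtimer_hres_active() returns nonzero once this CPU has switched
                    *  to high-resolution mode. Expiry is then handled by
                    *  hrtimer_interrupt(), so there is nothing to do here.
                    */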
        1439     if (hrtimer_hres_active())
        1440         return;
        1441
                    /*
                    * High-resolution mode is not active yet, so expired hrtimers
                    * have to be processed here, one clock base at a time.
                    */
        1442     for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) {
        1443         base = &cpu_base->clock_base[index];
        1444         if (!timerqueue_getnext(&base->active))
        1445             continue;
        1446            
        1447         if (gettime) {
                            /*
                            * Get the coarse grained time at the softirq based on xtime and
                            * wall_to_monotonic.
                            */
        1448             hrtimer_get_softirq_time(cpu_base);
        1449             gettime = 0;
        1450         }
        1451
        1452         raw_spin_lock(&cpu_base->lock);
        1453
        1454         while ((node = timerqueue_getnext(&base->active))) {
        1455             struct hrtimer *timer;
        1456
        1457             timer = container_of(node, struct hrtimer, node);
        1458             if (base->softirq_time.tv64 <=
        1459                     hrtimer_get_expires_tv64(timer))
        1460                 break;
        1461
                           /* Run the timer's callback; re-enqueue the timer if the callback asks for it. */
        1462             __run_hrtimer(timer, &base->softirq_time);
        1463         }
        1464         raw_spin_unlock(&cpu_base->lock);
        1465     }
        1466 }
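
        The practical consequence of the loop above is that, in low-resolution
        mode, a timer can be late by up to one tick period. The following is a
        minimal user-space sketch of that behaviour, not kernel code. The names
        model_timer, run_queue_at_tick and lateness_with_tick are hypothetical,
        and the tick period is a free parameter of the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct hrtimer: expiry plus the time it was run. */
struct model_timer {
    int64_t expires_ns;   /* requested expiry time */
    int64_t fired_at_ns;  /* -1 until the "callback" has run */
};

/*
 * Model of one hrtimer_run_queues() pass at tick time now_ns.
 * timers[] must be sorted by expires_ns, like the timerqueue.
 * Returns the number of timers that were run in this pass.
 */
static size_t run_queue_at_tick(struct model_timer *timers, size_t n,
                                int64_t now_ns)
{
    size_t ran = 0;

    for (size_t i = 0; i < n; i++) {
        if (timers[i].fired_at_ns >= 0)
            continue;                   /* already handled earlier */
        if (now_ns <= timers[i].expires_ns)
            break;                      /* same test as softirq_time <= expires */
        timers[i].fired_at_ns = now_ns;
        ran++;
    }
    return ran;
}

/* Drive ticks every tick_ns and report how late a single timer fires. */
static int64_t lateness_with_tick(int64_t expires_ns, int64_t tick_ns)
{
    struct model_timer t = { expires_ns, -1 };

    assert(tick_ns > 0);
    for (int64_t now = tick_ns; t.fired_at_ns < 0; now += tick_ns)
        run_queue_at_tick(&t, 1, now);
    return t.fired_at_ns - expires_ns;
}
```

        With an assumed 4 ms tick, a timer set to 4.1 ms is only run at the
        8 ms tick, i.e. 3.9 ms late.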

          98
          99 /*
         100  * Get the coarse grained time at the softirq based on xtime and
         101  * wall_to_monotonic.
         102  */
         103 static void hrtimer_get_softirq_time(struct hrtimer_cpu_base *base)
         104 {
         105     ktime_t xtim, mono, boot;
         106     struct timespec xts, tom, slp;
         107
         108     get_xtime_and_monotonic_and_sleep_offset(&xts, &tom, &slp);
         109
         110     xtim = timespec_to_ktime(xts);
         111     mono = ktime_add(xtim, timespec_to_ktime(tom));
         112     boot = ktime_add(mono, timespec_to_ktime(slp));
         113     base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
         114     base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
         115     base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
         116 }
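
        The arithmetic in hrtimer_get_softirq_time() can be tried out in user
        space. This is a sketch under assumptions: model_timespec,
        coarse_times and derive_base_times are hypothetical names, and plain
        64-bit nanosecond counts stand in for ktime_t.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for struct timespec. */
struct model_timespec {
    int64_t tv_sec;
    long    tv_nsec;
};

/* Coarse "now" for each of the three clock bases, in nanoseconds. */
struct coarse_times {
    int64_t realtime_ns;
    int64_t monotonic_ns;
    int64_t boottime_ns;
};

static int64_t model_ts_to_ns(struct model_timespec ts)
{
    return ts.tv_sec * MODEL_NSEC_PER_SEC + ts.tv_nsec;
}

/*
 * xts: wall time (xtime)
 * tom: wall_to_monotonic offset (normally negative)
 * slp: total time spent in suspend
 */
static struct coarse_times derive_base_times(struct model_timespec xts,
                                             struct model_timespec tom,
                                             struct model_timespec slp)
{
    struct coarse_times t;

    assert(xts.tv_nsec >= 0 && xts.tv_nsec < MODEL_NSEC_PER_SEC);
    t.realtime_ns  = model_ts_to_ns(xts);                   /* xtim */
    t.monotonic_ns = t.realtime_ns + model_ts_to_ns(tom);   /* mono = xtim + tom */
    t.boottime_ns  = t.monotonic_ns + model_ts_to_ns(slp);  /* boot = mono + slp */
    return t;
}
```

        Each base thus gets its own notion of "now", and a timer is always
        compared against the softirq_time of the base it was queued on.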

        /*
        * __run_hrtimer - run the timer's callback function and re-enqueue the
        *                 timer if the callback returns HRTIMER_RESTART.
        */
        1195 static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
        1196 {
        1197     struct hrtimer_clock_base *base = timer->base;
        1198     struct hrtimer_cpu_base *cpu_base = base->cpu_base;
        1199     enum hrtimer_restart (*fn)(struct hrtimer *);
        1200     int restart;
        1201
        1202     WARN_ON(!irqs_disabled());
        1203
        1204     debug_deactivate(timer);
        1205     __remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
        1206     timer_stats_account_hrtimer(timer);
        1207     fn = timer->function;
        1208
        1209     /*
        1210      * Because we run timers from hardirq context, there is no chance
        1211      * they get migrated to another cpu, therefore its safe to unlock
        1212      * the timer base.
        1213      */
        1214     raw_spin_unlock(&cpu_base->lock);
        1215     trace_hrtimer_expire_entry(timer, now);
        1216     restart = fn(timer);
        1217     trace_hrtimer_expire_exit(timer);
        1218     raw_spin_lock(&cpu_base->lock);
        1220     /*
        1221      * Note: We clear the CALLBACK bit after enqueue_hrtimer and
        1222      * we do not reprogramm the event hardware. Happens either in
        1223      * hrtimer_start_range_ns() or in hrtimer_interrupt()
        1224      */
        1225     if (restart != HRTIMER_NORESTART) {
        1226         BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
        1227         enqueue_hrtimer(timer, base);
        1228     }
        1229
        1230     WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
        1231
        1232     timer->state &= ~HRTIMER_STATE_CALLBACK;
        1233 }
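
        The state handling in __run_hrtimer() can be modelled in a few lines of
        user-space C. This is only a sketch: the model_* names are hypothetical,
        locking, tracing and the red-black tree are left out, and "enqueue" is
        reduced to setting a state bit.

```c
#include <assert.h>
#include <stdint.h>

enum model_restart { MODEL_NORESTART, MODEL_RESTART };

#define MODEL_STATE_INACTIVE 0x00u
#define MODEL_STATE_ENQUEUED 0x01u
#define MODEL_STATE_CALLBACK 0x02u

/* Hypothetical, stripped-down timer object. */
struct model_hrtimer {
    int64_t  expires_ns;
    int64_t  period_ns;   /* 0 means one-shot */
    unsigned state;
    int      runs;
    enum model_restart (*function)(struct model_hrtimer *);
};

/* Example callback: periodic timers push their expiry forward and restart. */
static enum model_restart model_callback(struct model_hrtimer *t)
{
    t->runs++;
    if (t->period_ns == 0)
        return MODEL_NORESTART;
    t->expires_ns += t->period_ns;
    return MODEL_RESTART;
}

/* Same order of steps as __run_hrtimer(): dequeue, call, maybe re-enqueue. */
static void model_run_hrtimer(struct model_hrtimer *t)
{
    assert(t->state & MODEL_STATE_ENQUEUED);

    /* Removed from the queue; only the CALLBACK bit is set while fn runs. */
    t->state = MODEL_STATE_CALLBACK;

    if (t->function(t) != MODEL_NORESTART)
        t->state |= MODEL_STATE_ENQUEUED;   /* enqueue_hrtimer() */

    /* The CALLBACK bit is cleared only after the possible re-enqueue. */
    t->state &= ~MODEL_STATE_CALLBACK;
}
```

        A periodic timer ends up ENQUEUED again with a later expiry, while a
        one-shot timer ends up INACTIVE.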
        

3.3  High-Resolution Timers in High-Resolution Mode


    Let us first assume that a high-resolution clock is up and running, and that
    the transition to high-resolution mode is completely finished.
    
    When the clock event device responsible for high-resolution timers raises an
    interrupt, hrtimer_interrupt() is called as the event handler. The function
    handles all expired hrtimers and reprograms the device for the next expiry.

    
    1237 /*
    1238  * High resolution timer interrupt
    1239  * Called with interrupts disabled
    1240  */
    1241 void hrtimer_interrupt(struct clock_event_device *dev)
    1242 {
    1243     struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
    1244     ktime_t expires_next, now, entry_time, delta;
    1245     int i, retries = 0;
    1246    
    1247     BUG_ON(!cpu_base->hres_active);
    1248     cpu_base->nr_events++;
    1249     dev->next_event.tv64 = KTIME_MAX;
    1250        
                /* Get current time.  */
    1251     entry_time = now = ktime_get();
    1252 retry:
    1253     expires_next.tv64 = KTIME_MAX;
    1254        
    1255     raw_spin_lock(&cpu_base->lock);
    1256     /*
    1257      * We set expires_next to KTIME_MAX here with cpu_base->lock
    1258      * held to prevent that a timer is enqueued in our queue via
    1259      * the migration code. This does not affect enqueueing of
    1260      * timers which run their callback and need to be requeued on
    1261      * this CPU.
    1262      */
    1263     cpu_base->expires_next.tv64 = KTIME_MAX;
    1264
    1265     for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
    1266         struct hrtimer_clock_base *base;
    1267         struct timerqueue_node *node;
    1268         ktime_t basenow;
    1269
    1270         if (!(cpu_base->active_bases & (1 << i)))
    1271             continue;
    1272
    1273         base = cpu_base->clock_base + i;
    1274         basenow = ktime_add(now, base->offset);
    1275
    1276         while ((node = timerqueue_getnext(&base->active))) {
    1277             struct hrtimer *timer;
    1278
    1279             timer = container_of(node, struct hrtimer, node);
    1280
    1281             /*
    1282              * The immediate goal for using the softexpires is
    1283              * minimizing wakeups, not running timers at the
    1284              * earliest interrupt after their soft expiration.
    1285              * This allows us to avoid using a Priority Search
    1286              * Tree, which can answer a stabbing querry for
    1287              * overlapping intervals and instead use the simple
    1288              * BST we already have.
    1289              * We don't add extra wakeups by delaying timers that
    1290              * are right-of a not yet expired timer, because that
    1291              * timer will have to trigger a wakeup anyway.
    1292              */
    1293
                        /* If the timer's soft expiration time lies in the future, processing of this base can stop */
    1294             if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
    1295                 ktime_t expires;
    1296
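                            /*
                            * base->offset is the offset of this clock base relative to
                            * the monotonic clock (zero for the monotonic base itself).
                            * Subtracting it converts the expiry back to the time base of
                            * @now, which is what the event device is programmed with.
                            */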
    1297                 expires = ktime_sub(hrtimer_get_expires(timer),
    1298                             base->offset);
                            /* Remember the earliest pending expiry time */
    1299                 if (expires.tv64 < expires_next.tv64)
    1300                     expires_next = expires;
    1301                 break;
    1302             }
    1303
                        /* Run the timer's callback; re-enqueue the timer if the callback asks for it. */
    1304             __run_hrtimer(timer, &basenow);
    1305         }
    1306     }
    1307
    1308     /*
    1309      * Store the new expiry value so the migration code can verify
    1310      * against it.
    1311      */
    1312     cpu_base->expires_next = expires_next;
    1313     raw_spin_unlock(&cpu_base->lock);
    1314
    1315     /* Reprogramming necessary ? */
    1316     if (expires_next.tv64 == KTIME_MAX ||
    1317         !tick_program_event(expires_next, 0)) {
    1318         cpu_base->hang_detected = 0;
    1319         return;
    1320     }
    1321
    1322     /*
    1323      * The next timer was already expired due to:
    1324      * - tracing
    1325      * - long lasting callbacks
    1326      * - being scheduled away when running in a VM
    1327      *
    1328      * We need to prevent that we loop forever in the hrtimer
    1329      * interrupt routine. We give it 3 attempts to avoid
    1330      * overreacting on some spurious event.
    1331      */
    1332     now = ktime_get();
    1333     cpu_base->nr_retries++;
    1334     if (++retries < 3)
    1335         goto retry;
    1336     /*
    1337      * Give the system a chance to do something else than looping
    1338      * here. We stored the entry time, so we know exactly how long
    1339      * we spent here. We schedule the next event this amount of
    1340      * time away.
    1341      */
    1342     cpu_base->nr_hangs++;
    1343     cpu_base->hang_detected = 1;
    1344     delta = ktime_sub(now, entry_time);
    1345     if (delta.tv64 > cpu_base->max_hang_time.tv64)
    1346         cpu_base->max_hang_time = delta;
    1347     /*
    1348      * Limit it to a sensible value as we enforce a longer
    1349      * delay. Give the CPU at least 100ms to catch up.
    1350      */
    1351     if (delta.tv64 > 100 * NSEC_PER_MSEC)
    1352         expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
    1353     else
    1354         expires_next = ktime_add(now, delta);
                /* Reprogram the related clockevent device */
    1355     tick_program_event(expires_next, 1);
    1356     printk_once(KERN_WARNING "hrtimer: interrupt took %llu ns\n",
    1357             ktime_to_ns(delta));
    1358 }
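
    The back-off at the end of hrtimer_interrupt() is plain arithmetic and can be
    sketched in user space. hang_backoff_expiry is a hypothetical name, and plain
    nanosecond counts stand in for ktime_t.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSEC_PER_MSEC 1000000LL

/*
 * After three failed retries, the next event is scheduled as far in the
 * future as the handler has already been running (delta), but never more
 * than 100 ms away.
 */
static int64_t hang_backoff_expiry(int64_t entry_ns, int64_t now_ns)
{
    int64_t delta = now_ns - entry_ns;

    assert(delta >= 0);
    if (delta > 100 * MODEL_NSEC_PER_MSEC)
        return now_ns + 100 * MODEL_NSEC_PER_MSEC;
    return now_ns + delta;
}
```

    If the handler has been busy for 2 ms, the next event is programmed 2 ms
    after the current time; if it has been busy for 250 ms, the delay is limited
    to 100 ms.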
      

 3.4 Periodic Tick Emulation   

       The clock event handler in high-resolution mode is hrtimer_interrupt().
        This implies that tick_handle_periodic() no longer provides the periodic tick,
        so equivalent functionality has to be built on top of high-resolution
        timers. The implementation is (nearly) identical with and without dynamic
        ticks.
        
        Essentially, tick_sched is a data structure that holds all relevant
        information about periodic ticks; one instance per CPU is provided by
        the per-CPU variable @tick_cpu_sched. tick_setup_sched_timer() is called
        to activate the tick emulation layer when the kernel switches to
        high-resolution mode. One high-resolution timer is installed per CPU. The
        required struct hrtimer instance is the sched_timer member of that CPU's
        tick_sched.


        /**
         * struct tick_sched - sched tick emulation and no idle tick control/stats
         * @sched_timer:    hrtimer to schedule the periodic tick in high
         *          resolution mode
         * @idle_tick:      Store the last idle tick expiry time when the tick
         *          timer is modified for idle sleeps. This is necessary
         *          to resume the tick timer operation in the timeline
         *          when the CPU returns from idle
         * @tick_stopped:   Indicator that the idle tick has been stopped
         * @idle_jiffies:   jiffies at the entry to idle for idle time accounting
         * @idle_calls:     Total number of idle calls
         * @idle_sleeps:    Number of idle calls, where the sched tick was stopped
         * @idle_entrytime: Time when the idle call was entered
         * @idle_waketime:  Time when the idle was interrupted
         * @idle_exittime:  Time when the idle state was left
         * @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
         * @iowait_sleeptime:   Sum of the time slept in idle with sched tick stopped, with IO outstanding
         * @sleep_length:   Duration of the current idle sleep
         * @do_timer_last:  CPU was the last one doing do_timer before going idle
         */
        struct tick_sched {
            struct hrtimer          sched_timer;
            unsigned long           check_clocks;
            enum tick_nohz_mode     nohz_mode;
            ktime_t             idle_tick;
            int             inidle;
            int             tick_stopped;
            unsigned long           idle_jiffies;
            unsigned long           idle_calls;
            unsigned long           idle_sleeps;
            int             idle_active;
            ktime_t             idle_entrytime;
            ktime_t             idle_waketime;
            ktime_t             idle_exittime;
            ktime_t             idle_sleeptime;
            ktime_t             iowait_sleeptime;
            ktime_t             sleep_length;
            unsigned long           last_jiffies;
            unsigned long           next_jiffies;
            ktime_t             idle_expires;
            int             do_timer_last;
        };

         /*
         * tick_sched_timer - update jiffies_64, increment the wall time,
         *          update the avenrun load, reset the software watchdog, manage
         *          process-specific time elements and re-arm the @timer
         * --------------------------------
         * We rearm the timer until we get disabled by the idle code.
         * Called with interrupts disabled and timer->base->cpu_base->lock held.
         */
        static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
        {
            struct tick_sched *ts =
                container_of(timer, struct tick_sched, sched_timer);
            struct pt_regs *regs = get_irq_regs();
             /* get the current time */
            ktime_t now = ktime_get();
            int cpu = smp_processor_id();

        #ifdef CONFIG_NO_HZ
            /*
             * Check if the do_timer duty was dropped. We don't care about
             * concurrency: This happens only when the cpu in charge went
             * into a long sleep. If two cpus happen to assign themself to
             * this duty, then the jiffies update is still serialized by
             * xtime_lock.
             */
            if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
                tick_do_timer_cpu = cpu;
        #endif
            /* Check, if the jiffies need an update */
            if (tick_do_timer_cpu == cpu)
             /*  update jiffies_64,  increment the wall time and
             *  update the avenrun load
             */
                tick_do_update_jiffies64(now);

            /*
             * Do not call, when we are not in irq context and have
             * no valid regs pointer
             */
            if (regs) {
                /*
                 * When we are idle and the tick is stopped, we have to touch
                 * the watchdog as we might not schedule for a really long
                 * time. This happens on complete idle SMP systems while
                 * waiting on the login prompt. We also increment the "start of
                 * idle" jiffy stamp so the idle accounting adjustment we do
                 * when we go busy again does not account too much ticks.
                 */
                if (ts->tick_stopped) {
                      /* reset the software watchdog */
                    touch_softlockup_watchdog();
                    ts->idle_jiffies++;
                }
                 /* Used to manage process-specific time elements */
                update_process_times(user_mode(regs));
                profile_tick(CPU_PROFILING);
            }
             /* Advance the expiry by whole tick_periods until it lies after @now */
            hrtimer_forward(timer, now, tick_period);

            return HRTIMER_RESTART;
        }
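
        tick_sched_timer() re-arms itself through hrtimer_forward(). The
        following user-space sketch shows the idea under simplifying
        assumptions: model_forward is a hypothetical name, times are plain
        nanosecond counts, and the kernel's clamping of the interval to the
        clock resolution is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Push *expires_ns forward by whole multiples of interval_ns until it lies
 * after now_ns. Returns the number of intervals added (the "overruns"),
 * or 0 if the timer has not expired yet.
 */
static uint64_t model_forward(int64_t *expires_ns, int64_t now_ns,
                              int64_t interval_ns)
{
    int64_t delta = now_ns - *expires_ns;
    uint64_t overruns;

    assert(interval_ns > 0);
    if (delta < 0)
        return 0;                       /* still in the future: leave it alone */

    /* Whole intervals that already passed, plus one to get beyond now. */
    overruns = (uint64_t)(delta / interval_ns) + 1;
    *expires_ns += interval_ns * (int64_t)overruns;
    return overruns;
}
```

        Because the new expiry is computed from the old expiry rather than from
        @now, the emulated tick stays on its original time grid even when a
        tick is handled late.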

        /*
         * tick_do_update_jiffies64 - update jiffies_64,  increment the wall time and
         *                                           update the avenrun load
         * -----------------
         * Must be called with interrupts disabled !
         */
        static void tick_do_update_jiffies64(ktime_t now)
        {
            unsigned long ticks = 0;
            ktime_t delta;

            /*
             * Do a quick check without holding xtime_lock:
             */
            delta = ktime_sub(now, last_jiffies_update);
            /* jiffies update is NOT needed */
            if (delta.tv64 < tick_period.tv64)
                return;

            /* Reevalute with xtime_lock held */
            write_seqlock(&xtime_lock);

            delta = ktime_sub(now, last_jiffies_update);
            /* jiffies update is needed */
            if (delta.tv64 >= tick_period.tv64) {

                delta = ktime_sub(delta, tick_period);
                 /* Remember the last updating time of jiffies64 */
                last_jiffies_update = ktime_add(last_jiffies_update,
                                tick_period);

                /* Slow path for long timeouts */
                /* This will happen when we missed some ticks */
                if (unlikely(delta.tv64 >= tick_period.tv64)) {
                    s64 incr = ktime_to_ns(tick_period);
                       /* ticks counts the additional periods; (ticks + 1) periods elapsed in total */
                    ticks = ktime_divns(delta, incr);
                      /* Remember the last updating time of jiffies64 */
                    last_jiffies_update = ktime_add_ns(last_jiffies_update,
                                       incr * ticks);
                }
                /* update jiffies_64,  increment the wall time and update the avenrun load */
                do_timer(++ticks);

                /* Keep the tick_next_period variable up to date */
                tick_next_period = ktime_add(last_jiffies_update, tick_period);
            }
            write_sequnlock(&xtime_lock);
        }
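
        The tick counting in tick_do_update_jiffies64() can be followed with a
        small user-space model. It is a sketch only: jiffies_model and
        model_update_jiffies are hypothetical names, xtime_lock is omitted, and
        do_timer() is reduced to adding to a counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for last_jiffies_update and jiffies_64. */
struct jiffies_model {
    int64_t  last_update_ns;
    uint64_t jiffies;
};

/* Returns the number of ticks accounted for in this call. */
static unsigned long model_update_jiffies(struct jiffies_model *m,
                                          int64_t now_ns,
                                          int64_t tick_period_ns)
{
    int64_t delta = now_ns - m->last_update_ns;
    unsigned long ticks = 0;

    assert(tick_period_ns > 0);
    if (delta < tick_period_ns)
        return 0;                           /* less than one period elapsed */

    /* Account for the first elapsed period. */
    delta -= tick_period_ns;
    m->last_update_ns += tick_period_ns;

    /* Slow path: further whole periods were missed. */
    if (delta >= tick_period_ns) {
        ticks = (unsigned long)(delta / tick_period_ns);
        m->last_update_ns += tick_period_ns * (int64_t)ticks;
    }

    ticks++;                                /* do_timer(++ticks) */
    m->jiffies += ticks;
    return ticks;
}
```

        With an assumed 4 ms tick period and 13 ms since the last update, three
        ticks are accounted for and last_update_ns advances to 12 ms; the
        remaining 1 ms is picked up by a later call.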


        /*
         * do_timer - update jiffies_64,  increment the wall time and update the avenrun load
         *--------------------------
         * The 64-bit jiffies value is not atomic - you MUST NOT read it
         * without sampling the sequence number in xtime_lock.
         * jiffies is defined in the linker script...
         */
        void do_timer(unsigned long ticks)
        {   
            jiffies_64 += ticks;
             /* Uses the current clocksource to increment the wall time */
            update_wall_time();
            /* update the avenrun load */
            calc_global_load(ticks);
        }

        /* Structure holding internal timekeeping values. */
        struct timekeeper {
            /* Current clocksource used for timekeeping. */
            struct clocksource *clock;
            /* The shift value of the current clocksource. */
            int shift;
            
            /* Number of clock cycles in one NTP interval. */
            cycle_t cycle_interval;
            /* Number of clock shifted nano seconds in one NTP interval. */
            u64 xtime_interval;
            /* shifted nano seconds left over when rounding cycle_interval */
            s64 xtime_remainder;
            /* Raw nano seconds accumulated per NTP interval. */
            u32 raw_interval;

            /* Clock shifted nano seconds remainder not stored in xtime.tv_nsec. */
            u64 xtime_nsec;
            /* Difference between accumulated time and NTP time in ntp
             * shifted nano seconds. */
            s64 ntp_error;
            /* Shift conversion between clock shifted nano seconds and
             * ntp shifted nano seconds. */
            int ntp_error_shift;
            /* NTP adjusted clock multiplier */
            u32 mult;
        };

        /**
         * update_wall_time - Uses the current clocksource to increment the wall time
         *
         * Called from the timer interrupt, must hold a write on xtime_lock.
         */
        static void update_wall_time(void)
        {
            struct clocksource *clock;
            cycle_t offset;
            int shift = 0, maxshift;

            /* Make sure we're fully resumed: */
            if (unlikely(timekeeping_suspended))
                return;

            clock = timekeeper.clock;
            /* Cycles elapsed since the last update (one fixed interval on GETTIMEOFFSET architectures) */
        #ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
            offset = timekeeper.cycle_interval;
        #else
            offset = (clock->read(clock) - clock->cycle_last) & clock->mask;
        #endif
            timekeeper.xtime_nsec = (s64)xtime.tv_nsec << timekeeper.shift;

            /*
             * With NO_HZ we may have to accumulate many cycle_intervals
             * (think "ticks") worth of time at once. To do this efficiently,
             * we calculate the largest doubling multiple of cycle_intervals
             * that is smaller then the offset. We then accumulate that
             * chunk in one go, and then try to consume the next smaller
             * doubled multiple.
             */
            shift = ilog2(offset) - ilog2(timekeeper.cycle_interval);
            shift = max(0, shift);
            /* Bound shift to one less then what overflows tick_length */
            maxshift = (8*sizeof(tick_length) - (ilog2(tick_length)+1)) - 1;
            shift = min(shift, maxshift);
            while (offset >= timekeeper.cycle_interval) {
                offset = logarithmic_accumulation(offset, shift);
                if(offset < timekeeper.cycle_interval<<shift)
                    shift--;
            }

            /* correct the clock when NTP error is too big */
            timekeeping_adjust(offset);

            /*
             * Since in the loop above, we accumulate any amount of time
             * in xtime_nsec over a second into xtime.tv_sec, its possible for
             * xtime_nsec to be fairly small after the loop. Further, if we're
             * slightly speeding the clocksource up in timekeeping_adjust(),
             * its possible the required corrective factor to xtime_nsec could
             * cause it to underflow.
             *
             * Now, we cannot simply roll the accumulated second back, since
             * the NTP subsystem has been notified via second_overflow. So
             * instead we push xtime_nsec forward by the amount we underflowed,
             * and add that amount into the error.
             *
             * We'll correct this error next time through this function, when
             * xtime_nsec is not as small.
             */
            if (unlikely((s64)timekeeper.xtime_nsec < 0)) {
                s64 neg = -(s64)timekeeper.xtime_nsec;
                timekeeper.xtime_nsec = 0;
                timekeeper.ntp_error += neg << timekeeper.ntp_error_shift;
            }


            /*
             * Store full nanoseconds into xtime after rounding it up and
             * add the remainder to the error difference.
             */
            xtime.tv_nsec = ((s64) timekeeper.xtime_nsec >> timekeeper.shift) + 1;
            timekeeper.xtime_nsec -= (s64) xtime.tv_nsec << timekeeper.shift;
            timekeeper.ntp_error += timekeeper.xtime_nsec <<
                        timekeeper.ntp_error_shift;

              /*
             * Finally, make sure that after the rounding
             * xtime.tv_nsec isn't larger then NSEC_PER_SEC
             */
            if (unlikely(xtime.tv_nsec >= NSEC_PER_SEC)) {
                xtime.tv_nsec -= NSEC_PER_SEC;
                xtime.tv_sec++;
                second_overflow();
            }

            /* check to see if there is a new clocksource to use */
            update_vsyscall(&xtime, &wall_to_monotonic, timekeeper.clock,
                        timekeeper.mult);
        }
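
        The "largest doubling multiple" loop in update_wall_time() is easier to
        follow in isolation. The sketch below is an assumption-laden model, not
        the kernel's logarithmic_accumulation(): model_ilog2 and
        model_accumulate are hypothetical names, and only whole intervals are
        counted, with no xtime, NTP or raw-time bookkeeping.

```c
#include <assert.h>
#include <stdint.h>

/* Floor of log2(v) for v > 0 (stand-in for the kernel's ilog2()). */
static int model_ilog2(uint64_t v)
{
    int r = -1;

    while (v) {
        v >>= 1;
        r++;
    }
    return r;
}

/*
 * Consume offset in chunks of (interval << shift), starting with the largest
 * power-of-two multiple that can fit and halving the chunk when it no longer
 * fits. Stores the number of whole intervals consumed in *intervals and
 * returns the remaining offset (always smaller than interval).
 */
static uint64_t model_accumulate(uint64_t offset, uint64_t interval,
                                 uint64_t *intervals)
{
    int shift;

    assert(interval > 0);
    *intervals = 0;
    if (offset < interval)
        return offset;

    shift = model_ilog2(offset) - model_ilog2(interval);
    if (shift < 0)
        shift = 0;

    while (offset >= interval) {
        if (offset >= (interval << shift)) {
            offset -= interval << shift;
            *intervals += (uint64_t)1 << shift;
        }
        if (offset < (interval << shift))
            shift--;
    }
    return offset;
}
```

        For an offset of 1050 cycles and an interval of 100 cycles, the model
        consumes chunks of 8 and 2 intervals and leaves a remainder of 50
        cycles, using far fewer iterations than a one-interval-at-a-time loop
        would need after a long NO_HZ sleep.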

        /*
         * calc_load - update the avenrun load estimates 10 ticks after the
         * CPUs have updated calc_load_tasks.
         */
        void calc_global_load(unsigned long ticks)
        {   
            long active;

            calc_global_nohz(ticks);

            if (time_before(jiffies, calc_load_update + 10))
                return;

            active = atomic_long_read(&calc_load_tasks);
            active = active > 0 ? active * FIXED_1 : 0;

            avenrun[0] = calc_load(avenrun[0], EXP_1, active);
            avenrun[1] = calc_load(avenrun[1], EXP_5, active);
            avenrun[2] = calc_load(avenrun[2], EXP_15, active);

            calc_load_update += LOAD_FREQ;
        }
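
        calc_load() itself is not listed above. The following sketch shows the
        fixed-point exponential average it is understood to compute. The
        constants (FSHIFT 11, EXP_1 1884, EXP_5 2014, EXP_15 2037) are quoted
        from memory of include/linux/sched.h and the rounding step is an
        assumption about this kernel version, so treat the block as an
        illustration rather than a copy of the source.

```c
#include <assert.h>

#define MODEL_FSHIFT  11                        /* bits of fractional precision */
#define MODEL_FIXED_1 (1UL << MODEL_FSHIFT)     /* 1.0 in fixed point */
#define MODEL_EXP_1   1884UL                    /* 1/exp(5s/1min) in fixed point */
#define MODEL_EXP_5   2014UL                    /* 1/exp(5s/5min) */
#define MODEL_EXP_15  2037UL                    /* 1/exp(5s/15min) */

/*
 * One step of the exponentially decaying average:
 *   load = load * exp + active * (1 - exp)
 * with all values scaled by MODEL_FIXED_1.
 */
static unsigned long model_calc_load(unsigned long load, unsigned long exp,
                                     unsigned long active)
{
    assert(exp <= MODEL_FIXED_1);
    load *= exp;
    load += active * (MODEL_FIXED_1 - exp);
    load += 1UL << (MODEL_FSHIFT - 1);          /* round to nearest (assumed) */
    return load >> MODEL_FSHIFT;
}
```

        Starting from a load of 0 with one runnable task (active =
        1 * MODEL_FIXED_1), a single 1-minute step yields 164, which is about
        0.08 after dividing by MODEL_FIXED_1. A load that already equals the
        number of active tasks stays unchanged.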
 

 

    3.5 Switching to high-resolution timers

    
     Recall the Call Tree in section 3.2.
        Call Tree:
         tick_handle_periodic    
                tick_periodic
                   
        Call Tree:
         tick_periodic  |  tick_nohz_handler |  tick_sched_timer
                update_process_times
                         run_local_timers
                                 hrtimer_run_queues
                                 raise_softirq(TIMER_SOFTIRQ) // This triggers the execution of
                                                                                                    // run_timer_softirq()
        Call Tree:
        run_timer_softirq
            hrtimer_run_pending
                hrtimer_switch_to_hres


        1304 /*  
        1305  * This function runs timers and the timer-tq in bottom half context.
        1306  */
        1307 static void run_timer_softirq(struct softirq_action *h)
        1308 {
        1309     struct tvec_base *base = __this_cpu_read(tvec_bases);
        1310
                    /*
                    * check in the softirq context, whether we can switch to highres
                    * and / or nohz mode.
                    */
        1311     hrtimer_run_pending(); // #################
        1312     
        1313     if (time_after_eq(jiffies, base->timer_jiffies))
                        /*
                        * run all expired dynamic timers (if any) on this CPU.
                        */
        1314         __run_timers(base);
        1315 }

        1405 /*
        1406  * Called from timer softirq every jiffy, expire hrtimers:
        1407  *  
        1408  * For HRT its the fall back code to run the softirq in the timer
        1409  * softirq context in case the hrtimer initialization failed or has
        1410  * not been done yet.
        1411  */
        1412 void hrtimer_run_pending(void)
        1413 {
        1414     if (hrtimer_hres_active())
        1415         return;
        1416
        1417     /*
        1418      * This _is_ ugly: We have to check in the softirq context,
        1419      * whether we can switch to highres and / or nohz mode. The
        1420      * clocksource switch happens in the timer interrupt with
        1421      * xtime_lock held. Notification from there only sets the
        1422      * check bit in the tick_oneshot code, otherwise we might
        1423      * deadlock vs. xtime_lock.
        1424      */
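                      /* NOTE: allow_nohz is !hrtimer_hres_enabled. With hrtimers enabled
                  * (allow_nohz == 0), tick_check_oneshot_change() returns 1 once all of
                  * its checks pass, and we switch to high-resolution mode. With hrtimers
                  * disabled it switches to low-res NOHZ mode itself and returns 0.
                  */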
        1425     if (tick_check_oneshot_change(!hrtimer_is_hres_enabled()))
        1426         hrtimer_switch_to_hres();
        1427 }

         509 /*
         510  * hrtimer_high_res_enabled - query, if the highres mode is enabled
         511  */
         512 static inline int hrtimer_is_hres_enabled(void)
         513 {
         514     return hrtimer_hres_enabled;
         515 }
        839 /**
        840  * Check, if a change happened, which makes oneshot possible.
        841  *
        842  * Called cyclic from the hrtimer softirq (driven by the timer
        843  * softirq) allow_nohz signals, that we can switch into low-res nohz
        844  * mode, because high resolution timers are disabled (either compile
        845  * or runtime).
        846  */  
        847 int tick_check_oneshot_change(int allow_nohz)
        848 {
        849     struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
        850
                   /* Bit 0 of check_clocks must have been set (a clock change was notified); it is cleared here */
        851     if (!test_and_clear_bit(0, &ts->check_clocks))
        852         return 0;
        853
                    /* Nothing to do if a NOHZ mode is already active on this CPU */
        854     if (ts->nohz_mode != NOHZ_MODE_INACTIVE)
        855         return 0;
        856
                    /*
                   * Check if timekeeping is suitable for hres.
                   * check for a oneshot capable event device
                   */
        857     if (!timekeeping_valid_for_hres() || !tick_is_oneshot_available())
        858         return 0;
        859
        860     if (!allow_nohz)
        861         return 1;
        862
        863     tick_nohz_switch_to_nohz();
        864     return 0;
        865 }  

         43 /*
         44  * Clock event features
         45  */     
         46 #define CLOCK_EVT_FEAT_PERIODIC     0x000001
         47 #define CLOCK_EVT_FEAT_ONESHOT      0x000002
         48 /*  
         49  * x86(64) specific misfeatures:
         50  *
         51  * - Clockevent source stops in C3 State and needs broadcast support.
         52  * - Local APIC timer is used as a dummy device.
         53  */
         54 #define CLOCK_EVT_FEAT_C3STOP       0x000004
         55 #define CLOCK_EVT_FEAT_DUMMY        0x000008

         46 /**
         47  * tick_is_oneshot_available - check for a oneshot capable event device
         48  */
         49 int tick_is_oneshot_available(void)
         50 {
         51     struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
         52
         53     if (!dev || !(dev->features & CLOCK_EVT_FEAT_ONESHOT))
         54         return 0;
         55     if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
         56         return 1;
                    /* Check whether the broadcast device supports oneshot. */
         57     return tick_broadcast_oneshot_available();
         58 }
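
The feature test in tick_is_oneshot_available() is a plain bit-mask check. The
following userspace sketch mirrors it; struct evt_dev and oneshot_available()
are hypothetical names, and the result of tick_broadcast_oneshot_available()
is passed in as a parameter instead of being computed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Values copied from the CLOCK_EVT_FEAT_* defines quoted above. */
#define CLOCK_EVT_FEAT_ONESHOT 0x000002
#define CLOCK_EVT_FEAT_C3STOP  0x000004

/* Hypothetical stand-in for struct clock_event_device. */
struct evt_dev {
    unsigned int features;
};

bool oneshot_available(const struct evt_dev *dev, bool broadcast_oneshot_ok)
{
    /* No device, or a device that cannot do oneshot: not available. */
    if (dev == NULL || !(dev->features & CLOCK_EVT_FEAT_ONESHOT))
        return false;
    /* The device keeps running in deep idle states: usable on its own. */
    if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
        return true;
    /* The device stops in C3, so the broadcast device decides. */
    return broadcast_oneshot_ok;
}
```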

hrtimer_switch_to_hres
        tick_init_highres
                tick_switch_to_oneshot(hrtimer_interrupt)
          ....
        tick_setup_sched_timer


         688 /*
         689  * Switch to high resolution mode
         690  */
         691 static int hrtimer_switch_to_hres(void)
         692 {       
         693     int i, cpu = smp_processor_id();
         694     struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu);
         695     unsigned long flags;
         696
         697     if (base->hres_active)
         698         return 1;
         699
         700     local_irq_save(flags);
         701     
         702     if (tick_init_highres()) {
         703         local_irq_restore(flags);
         704         printk(KERN_WARNING "Could not switch to high resolution "
         705                     "mode on CPU %d\n", cpu);
         706         return 0;
         707     }
         708     base->hres_active = 1;
         709     for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
         710         base->clock_base[i].resolution = KTIME_HIGH_RES;
         711
         712     tick_setup_sched_timer();
         713
         714     /* "Retrigger" the interrupt to get things going */
         715     retrigger_next_event(NULL);
         716     local_irq_restore(flags);
         717     return 1;
         718 }
         719

        176 /**  
        177  * tick_init_highres - switch to high resolution mode
        178  *   
        179  * Called with interrupts disabled.
        180  */
        181 int tick_init_highres(void)
        182 {
        183     return tick_switch_to_oneshot(hrtimer_interrupt);
        184 }

        126 /**
        127  * tick_switch_to_oneshot - switch to oneshot mode
        128  */
        129 int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *))
        130 {
        131     struct tick_device *td = &__get_cpu_var(tick_cpu_device);
        132     struct clock_event_device *dev = td->evtdev;
        133
        134     if (!dev || !(dev->features & CLOCK_EVT_FEAT_ONESHOT) ||
        135             !tick_device_is_functional(dev)) {
        136
        137         printk(KERN_INFO "Clockevents: "
        138                "could not switch to one-shot mode:");
        139         if (!dev) {
        140             printk(" no tick device\n");
        141         } else {
        142             if (!tick_device_is_functional(dev))
        143                 printk(" %s is not functional.\n", dev->name);
        144             else
        145                 printk(" %s does not support one-shot mode.\n",
        146                        dev->name);
        147         }
        148         return -EINVAL;
        149     }
        150
        151     td->mode = TICKDEV_MODE_ONESHOT;
        152     dev->event_handler = handler;
        153     clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
        154     tick_broadcast_switch_to_oneshot();
        155     return 0;
        156 }

        768 /**
        769  * tick_setup_sched_timer - setup the tick emulation timer
        770  */
        771 void tick_setup_sched_timer(void)
        772 {        
        773     struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
        774     ktime_t now = ktime_get();
        775
        776     /*
        777      * Emulate tick processing via per-CPU hrtimers:
        778      */
        779     hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
                   /* tick_sched_timer is used to simulate the "Periodic Tick" in high-resolution mode */
        780     ts->sched_timer.function = tick_sched_timer;
        781
        782     /* Get the next period (per cpu) */
        783     hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
        784
        785     for (;;) {
        786         hrtimer_forward(&ts->sched_timer, now, tick_period);
        787         hrtimer_start_expires(&ts->sched_timer,
        788                       HRTIMER_MODE_ABS_PINNED);
        789         /* Check, if the timer was already in the past */
        790         if (hrtimer_active(&ts->sched_timer))
        791             break;
        792         now = ktime_get();
        793     }
        794
        795 #ifdef CONFIG_NO_HZ
        796     if (tick_nohz_enabled) {
        797         ts->nohz_mode = NOHZ_MODE_HIGHRES;
        798         printk(KERN_INFO "Switched to NOHz mode on CPU #%d\n", smp_processor_id());
        799     }
        800 #endif
        801 }
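
The loop in tick_setup_sched_timer() relies on hrtimer_forward() to push the
expiry time past @now in whole multiples of tick_period. The userspace sketch
below shows that arithmetic on plain nanosecond counters. It is a simplified
model, not the kernel implementation: timer_forward() is a hypothetical name,
and interval is assumed to be greater than zero.

```c
#include <stdint.h>

/*
 * Advance *expires by whole intervals until it lies strictly after now.
 * Returns the number of intervals added (0 if *expires is still in the
 * future). Assumes interval > 0.
 */
uint64_t timer_forward(int64_t *expires, int64_t now, int64_t interval)
{
    int64_t delta = now - *expires;
    uint64_t periods;

    if (delta < 0)
        return 0;               /* not expired yet: nothing to do */

    /* Periods already missed, plus one to land in the future. */
    periods = (uint64_t)(delta / interval) + 1;
    *expires += (int64_t)periods * interval;
    return periods;
}
```

For example, with an expiry of 100, an interval of 100 and now = 350, the
expiry moves to 400 and three periods are reported.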

3.6 High-Resolution Timers Operations

    Section 2 showed how to use the hrtimer APIs. This section looks at
    how those APIs are implemented.

    
         30 /*
         31  * Mode arguments of xxx_hrtimer functions:
         32  */
         33 enum hrtimer_mode {
         34     HRTIMER_MODE_ABS = 0x0,     /* Time value is absolute */
         35     HRTIMER_MODE_REL = 0x1,     /* Time value is relative to now */
         36     HRTIMER_MODE_PINNED = 0x02, /* Timer is bound to CPU */
         37     HRTIMER_MODE_ABS_PINNED = 0x02,
         38     HRTIMER_MODE_REL_PINNED = 0x03,
         39 };
         40
         41 /*
         42  * Return values for the callback function
         43  */
         44 enum hrtimer_restart {
         45     HRTIMER_NORESTART,  /* Timer is not restarted */
         46     HRTIMER_RESTART,    /* Timer must be restarted */
         47 };
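
The mode values are bit flags: bit 0 selects relative time and bit 1 selects
CPU pinning, which is why HRTIMER_MODE_REL_PINNED is 0x03. A minimal sketch of
how they can be decoded follows; mode_is_relative() and mode_is_pinned() are
hypothetical helpers written for this note, not kernel functions.

```c
#include <stdbool.h>

/* Copied from the enum quoted above. */
enum hrtimer_mode {
    HRTIMER_MODE_ABS = 0x0,
    HRTIMER_MODE_REL = 0x1,
    HRTIMER_MODE_PINNED = 0x02,
    HRTIMER_MODE_ABS_PINNED = 0x02,
    HRTIMER_MODE_REL_PINNED = 0x03,
};

/* True when the expiry value is relative to "now". */
bool mode_is_relative(enum hrtimer_mode mode)
{
    return (mode & HRTIMER_MODE_REL) != 0;
}

/* True when the timer must stay on the CPU that started it. */
bool mode_is_pinned(enum hrtimer_mode mode)
{
    return (mode & HRTIMER_MODE_PINNED) != 0;
}
```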

    

3.6.1 hrtimer initialization


1161 /**
1162  * hrtimer_init - initialize a timer to the given clock
1163  * @timer:  the timer to be initialized
1164  * @clock_id:   the clock to be used
1165  * @mode:   timer mode abs/rel
1166  */
1167 void hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
1168           enum hrtimer_mode mode)
1169 {
1170     debug_init(timer, clock_id, mode);
1171     __hrtimer_init(timer, clock_id, mode);
1172 }
1173 EXPORT_SYMBOL_GPL(hrtimer_init);

1137 static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
1138                enum hrtimer_mode mode)
1139 {
1140     struct hrtimer_cpu_base *cpu_base;
1141     int base;
1142
1143     memset(timer, 0, sizeof(struct hrtimer));
1144
1145     cpu_base = &__raw_get_cpu_var(hrtimer_bases);
1146
1147     if (clock_id == CLOCK_REALTIME && mode != HRTIMER_MODE_ABS)
1148         clock_id = CLOCK_MONOTONIC;
1149
1150     base = hrtimer_clockid_to_base(clock_id);
1151     timer->base = &cpu_base->clock_base[base];
1152     timerqueue_init(&timer->node);
1153
1154 #ifdef CONFIG_TIMER_STATS
1155     timer->start_site = NULL;
1156     timer->start_pid = -1;
1157     memset(timer->start_comm, 0, TASK_COMM_LEN);
1158 #endif
1159 }
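
One detail of __hrtimer_init() is easy to miss: a timer on CLOCK_REALTIME that
is not in HRTIMER_MODE_ABS is silently moved to CLOCK_MONOTONIC. The sketch
below isolates that rule. The enum names and effective_clock() are hypothetical
stand-ins so that the example compiles outside the kernel.

```c
/* Hypothetical stand-ins for clockid_t values and enum hrtimer_mode. */
enum clk_id { CLK_REALTIME = 0, CLK_MONOTONIC = 1 };
enum tmode { MODE_ABS = 0, MODE_REL = 1 };

/* Returns the clock the timer is actually queued on. */
enum clk_id effective_clock(enum clk_id clock_id, enum tmode mode)
{
    /* Same test as in __hrtimer_init(): realtime and not absolute. */
    if (clock_id == CLK_REALTIME && mode != MODE_ABS)
        return CLK_MONOTONIC;
    return clock_id;
}
```

So only absolute timers are ever queued on the realtime clock base.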
    

3.6.2 add a hrtimer

1011
1012 /**
1013  * hrtimer_start - (re)start an hrtimer on the current CPU
1014  * @timer:  the timer to be added
1015  * @tim:    expiry time
1016  * @mode:   expiry mode: absolute (HRTIMER_ABS) or relative (HRTIMER_REL)
1017  *
1018  * Returns:
1019  *  0 on success
1020  *  1 when the timer was active
1021  */
1022 int
1023 hrtimer_start(struct hrtimer *timer, ktime_t tim, const enum hrtimer_mode mode)
1024 {
1025     return __hrtimer_start_range_ns(timer, tim, 0, mode, 1);
1026 }
1027 EXPORT_SYMBOL_GPL(hrtimer_start);

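        /*
        * @delta_ns: "slack" range added to @tim to form the hard expiry
        *            time (see Appendix IV, [2])
        * @wakeup:   passed through to hrtimer_enqueue_reprogram()
        */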
 944 int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 945         unsigned long delta_ns, const enum hrtimer_mode mode,
 946         int wakeup)
 947 {
 948     struct hrtimer_clock_base *base, *new_base;
 949     unsigned long flags;
 950     int ret, leftmost;
 951
 952     base = lock_hrtimer_base(timer, &flags);
 953
 954     /* Remove an active timer from the queue: */
 955     ret = remove_hrtimer(timer, base);
 956
 957     /* Switch the timer base, if necessary: */
            /*   If @timer's base does not belong to the current CPU,
          *   switch @timer to the current CPU's base. However, if @timer's
          *   state is HRTIMER_STATE_CALLBACK, which means its callback is
          *   'running', the switch is not done.
          */
 958     new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
 959
 960     if (mode & HRTIMER_MODE_REL) {
 961         tim = ktime_add_safe(tim, new_base->get_time());
 962         /*
 963          * CONFIG_TIME_LOW_RES is a temporary way for architectures
 964          * to signal that they simply return xtime in
 965          * do_gettimeoffset(). In this case we want to round up by
 966          * resolution when starting a relative timer, to avoid short
 967          * timeouts. This will go away with the GTOD framework.
 968          */
 969 #ifdef CONFIG_TIME_LOW_RES
 970         tim = ktime_add_safe(tim, base->resolution);
 971 #endif
 972     }
 973
 974     hrtimer_set_expires_range_ns(timer, tim, delta_ns);
 975
 976     timer_stats_hrtimer_set_start_info(timer);
 977
            /* leftmost is nonzero when @timer is now the earliest timer to expire */
 978     leftmost = enqueue_hrtimer(timer, new_base);
 979
 980     /*
 981      * Only allow reprogramming if the new base is on this CPU.
 982      * (it might still be on another CPU if the timer was pending)
 983      *
 984      * XXX send_remote_softirq() ?
 985      */
 986     if (leftmost && new_base->cpu_base == &__get_cpu_var(hrtimer_bases))
 987         hrtimer_enqueue_reprogram(timer, new_base, wakeup);
 988
 989     unlock_hrtimer_base(timer, &flags);
 990
 991     return ret;
 992 }
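
The time arithmetic in __hrtimer_start_range_ns() can be modeled in userspace
as follows. This is a sketch under stated assumptions: times are plain
non-negative nanosecond counters, add_safe() is a hypothetical saturating add
playing the role of ktime_add_safe(), and compute_expiry() is a hypothetical
helper that combines the relative-to-absolute conversion with the slack range
set by hrtimer_set_expires_range_ns().

```c
#include <stdint.h>

/* Saturating add: clamps to INT64_MAX instead of overflowing. */
int64_t add_safe(int64_t a, int64_t b)
{
    if (b > 0 && a > INT64_MAX - b)
        return INT64_MAX;
    return a + b;
}

struct expiry {
    int64_t soft;   /* earliest time the timer may fire */
    int64_t hard;   /* latest time the timer should fire */
};

/*
 * tim:         expiry value given by the caller
 * now:         current time of the timer's clock base
 * is_relative: nonzero if tim is relative to now
 * delta_ns:    slack range between soft and hard expiry
 */
struct expiry compute_expiry(int64_t tim, int64_t now, int is_relative,
                             int64_t delta_ns)
{
    struct expiry e;

    if (is_relative)
        tim = add_safe(tim, now);   /* convert to absolute time */
    e.soft = tim;
    e.hard = add_safe(tim, delta_ns);
    return e;
}
```

With delta_ns = 0, which is what hrtimer_start() passes, the soft and hard
expiry times are identical.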


3.6.3 remove a hrtimer


         911 /*
         912  * remove hrtimer, called with base lock held
         913  */
         914 static inline int
         915 remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
         916 {
                    /* Helper function to check, whether the timer is on one of the queues */
         917     if (hrtimer_is_queued(timer)) {
         918         unsigned long state;
         919         int reprogram;
         920
         921         /*
         922          * Remove the timer and force reprogramming when high
         923          * resolution mode is active and the timer is on the current
         924          * CPU. If we remove a timer on another CPU, reprogramming is
         925          * skipped. The interrupt event on this CPU is fired and
         926          * reprogramming happens in the interrupt handler. This is a
         927          * rare case and less expensive than a smp call.
         928          */
         929         debug_deactivate(timer);
         930         timer_stats_hrtimer_clear_start_info(timer);
         931         reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);
         932         /*
         933          * We must preserve the CALLBACK state flag here,
         934          * otherwise we could move the timer base in
         935          * switch_hrtimer_base.
         936          */
         937         state = timer->state & HRTIMER_STATE_CALLBACK;
         938         __remove_hrtimer(timer, base, state, reprogram);
         939         return 1;
         940     }
         941     return 0;
         942 }
          
         874 /*
         875  * __remove_hrtimer - internal function to remove a timer
         876  *
         877  * Caller must hold the base lock.
         878  *
         879  * High resolution timer mode reprograms the clock event device when the
         880  * timer is the one which expires next. The caller can disable this by setting
         881  * reprogram to zero. This is useful, when the context does a reprogramming
         882  * anyway (e.g. timer interrupt)
         883  */
         884 static void __remove_hrtimer(struct hrtimer *timer,
         885                  struct hrtimer_clock_base *base,
         886                  unsigned long newstate, int reprogram)
         887 {
         888     if (!(timer->state & HRTIMER_STATE_ENQUEUED))
         889         goto out;
         890
         891     if (&timer->node == timerqueue_getnext(&base->active)) {
         892 #ifdef CONFIG_HIGH_RES_TIMERS
         893         /* Reprogram the clock event device. if enabled */
         894         if (reprogram && hrtimer_hres_active()) {
         895             ktime_t expires;
         896
         897             expires = ktime_sub(hrtimer_get_expires(timer),
         898                         base->offset);
         899             if (base->cpu_base->expires_next.tv64 == expires.tv64)
         900                 hrtimer_force_reprogram(base->cpu_base, 1);
         901         }
         902 #endif
         903     }
         904     timerqueue_del(&base->active, &timer->node);
         905     if (!timerqueue_getnext(&base->active))
         906         base->cpu_base->active_bases &= ~(1 << base->index);
         907 out:
         908     timer->state = newstate;
         909 }
 
             525 /*
             526  * Reprogram the event source with checking both queues for the
             527  * next event
             528  * Called with interrupts disabled and base->lock held
             529  */
             530 static void
             531 hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
             532 {
             533     int i;
             534     struct hrtimer_clock_base *base = cpu_base->clock_base;
             535     ktime_t expires, expires_next;
             536
             537     expires_next.tv64 = KTIME_MAX;
             538
                        /* Find the earliest timer across all clock bases */
             539     for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
             540         struct hrtimer *timer;
             541         struct timerqueue_node *next;
             542
             543         next = timerqueue_getnext(&base->active);
             544         if (!next)
             545             continue;
             546         timer = container_of(next, struct hrtimer, node);
             547
             548         expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
             549         /*
             550          * clock_was_set() has changed base->offset so the
             551          * result might be negative. Fix it up to prevent a
             552          * false positive in clockevents_program_event()
             553          */
             554         if (expires.tv64 < 0)
             555             expires.tv64 = 0;
             556         if (expires.tv64 < expires_next.tv64)
             557             expires_next = expires;
             558     }
             560     if (skip_equal && expires_next.tv64 == cpu_base->expires_next.tv64)
             561         return;
             562
             563     cpu_base->expires_next.tv64 = expires_next.tv64;
             564
             565     if (cpu_base->expires_next.tv64 != KTIME_MAX)
             566         tick_program_event(cpu_base->expires_next, 1);
             567 }
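
The search done by hrtimer_force_reprogram() boils down to a minimum over the
first timer of every clock base, corrected by the base's offset and clamped at
zero. The userspace sketch below shows just that computation. struct
clock_base_model and next_event() are hypothetical names, and times are plain
nanosecond counters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one clock base. */
struct clock_base_model {
    int has_timer;          /* nonzero if the queue is not empty */
    int64_t first_expiry;   /* expiry of the earliest timer in the queue */
    int64_t offset;         /* offset of this clock base */
};

/* Returns the earliest corrected expiry, or INT64_MAX if nothing is queued. */
int64_t next_event(const struct clock_base_model *bases, size_t n)
{
    int64_t next = INT64_MAX;
    size_t i;

    for (i = 0; i < n; i++) {
        int64_t expires;

        if (!bases[i].has_timer)
            continue;
        expires = bases[i].first_expiry - bases[i].offset;
        if (expires < 0)
            expires = 0;    /* offset changed: never program a negative time */
        if (expires < next)
            next = expires;
    }
    return next;
}
```

A result of INT64_MAX corresponds to the KTIME_MAX case above, in which the
event device is not reprogrammed at all.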


4. HRT related system call

    
NOHZ mode switch: refer to section 3.5 (switching to high-resolution timers).
    

 
-------------------------------------------------------------------------------------------------------------------

Appendix I:  Works have to be done, before we "can" switch to high-resolution mode

 

          1> Prepare a clocksource that has the CLOCK_SOURCE_VALID_FOR_HRES flag set.

          
             From the call tree below, we can see that every time a new clocksource
            is registered with the system, clocksource_select() is called to
            select the best available clocksource for the timekeeping task.
             For details, refer to "Generic Time Subsystem implementation on linux":
                 http://blog.csdn.net/ganggexiongqi/article/details/7006252


         Call Tree:
         clocksource_register
                clocksource_max_deferment
                clocksource_enqueue
                clocksource_enqueue_watchdog
                clocksource_select
                            timekeeping_notify // The calls below are outside the scope of this note
                                    stop_machine(change_clocksource...)// change_clocksource() is called
                                                                            // to swap clocksources if a new one is available
                                    tick_clock_notify  

                                              
          24 /* Structure holding internal timekeeping values. */
          25 struct timekeeper {
          26     /* Current clocksource used for timekeeping. */
          27     struct clocksource *clock;
                  ...
                  };
        164 struct clocksource {
                ...
                /*
                * @flags:      flags describing special properties
                *
                */
        185     unsigned long flags;
                ...
                };
                
        197 /*
        198  * Clock source flags bits::
        199  */
        200 #define CLOCK_SOURCE_IS_CONTINUOUS      0x01
        201 #define CLOCK_SOURCE_MUST_VERIFY        0x02
        202
        203 #define CLOCK_SOURCE_WATCHDOG           0x10
        204 #define CLOCK_SOURCE_VALID_FOR_HRES     0x20
        205 #define CLOCK_SOURCE_UNSTABLE           0x40

        From the code above and the function timekeeping_valid_for_hres() below,
        we can see that the clocksource used by the timekeeper must have its
        CLOCK_SOURCE_VALID_FOR_HRES flag set.
        

         500 /**
         501  * timekeeping_valid_for_hres - Check if timekeeping is suitable for hres
         502  */
         503 int timekeeping_valid_for_hres(void)
         504 {
         505     unsigned long seq;
         506     int ret;
         507
         508     do {
         509         seq = read_seqbegin(&xtime_lock);
         510
         511         ret = timekeeper.clock->flags & CLOCK_SOURCE_VALID_FOR_HRES;
         512        
         513     } while (read_seqretry(&xtime_lock, seq));
         514    
         515     return ret;
         516 }
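
timekeeping_valid_for_hres() reads the flag inside a seqlock read loop: it
samples a sequence counter, reads the data, and retries if a writer was active
in between. The sketch below models that pattern in userspace with C11
atomics. It is an illustration only: struct seq_model and its functions are
hypothetical, the payload is made atomic to keep the example free of data
races, and the kernel's xtime_lock uses its own primitives.

```c
#include <stdatomic.h>

/* Hypothetical model: a sequence counter protecting one flags word. */
struct seq_model {
    atomic_uint seq;        /* odd while a write is in progress */
    atomic_ulong flags;     /* the protected data */
};

void seq_model_init(struct seq_model *s)
{
    atomic_init(&s->seq, 0);
    atomic_init(&s->flags, 0);
}

/* Writer: bump the counter before and after the update. */
void write_flags(struct seq_model *s, unsigned long flags)
{
    atomic_fetch_add(&s->seq, 1);   /* now odd: readers must retry */
    atomic_store(&s->flags, flags);
    atomic_fetch_add(&s->seq, 1);   /* even again: update is complete */
}

/* Reader: retry until a consistent snapshot was read. */
unsigned long read_flags(struct seq_model *s)
{
    unsigned int start;
    unsigned long value;

    do {
        start = atomic_load(&s->seq);
        value = atomic_load(&s->flags);
    } while ((start & 1u) != 0 || atomic_load(&s->seq) != start);
    return value;
}
```

The reader never blocks the writer. It only pays with a retry when an update
happens during the read.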

          2>  hrtimer_hres_enabled = 1

         From the following code, we can see that hrtimer_hres_enabled defaults
         to 1 when CONFIG_HIGH_RES_TIMERS is set, which means high-resolution
         mode is enabled by default. You can turn it off with the boot
         parameter "highres=off", which is registered through __setup():

         485 /* High resolution timer related functions */
         486 #ifdef CONFIG_HIGH_RES_TIMERS
         487
         488 /*
         489  * High resolution timer enabled ?
         490  */
         491 static int hrtimer_hres_enabled __read_mostly  = 1;
         492
         493 /*
         494  * Enable / Disable high resolution mode
         495  */
         496 static int __init setup_hrtimer_hres(char *str)
         497 {
         498     if (!strcmp(str, "off"))
         499         hrtimer_hres_enabled = 0;
         500     else if (!strcmp(str, "on"))
         501         hrtimer_hres_enabled = 1;
         502     else
         503         return 0;
         504     return 1;
         505 }
         506
         507 __setup("highres=", setup_hrtimer_hres);
         

        3>  A oneshot-capable clock event device

        
         46 /**
         47  * tick_is_oneshot_available - check for a oneshot capable event device
         48  */
         49 int tick_is_oneshot_available(void)
         50 {
         51     struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
         52
         53     if (!dev || !(dev->features & CLOCK_EVT_FEAT_ONESHOT))
         54         return 0;
         55     if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
         56         return 1;
         57     return tick_broadcast_oneshot_available();
         58 }
         59
         

Appendix II:  What's the struct timerqueue_head's next member used for ?

===========
My conclusion:
  It points to the node of the hrtimer with the earliest expiry time in the
  red-black tree managed by @hrtimer_clock_base, so that timerqueue_getnext()
  can return that timer without walking the tree.
  It can be used to calculate the delta to the next expiry event. For more
  information, refer to hrtimer_get_next_event().

145 struct hrtimer_clock_base {
            ...
149     struct timerqueue_head  active;
           ...
154 };  

 13 struct timerqueue_head {
 14     struct rb_root head;
 15     struct timerqueue_node *next;
 16 };
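
The listings in section 3.6.3 use timerqueue_getnext() to obtain the earliest
timer of a clock base, which is what the cached @next pointer provides. The
userspace sketch below shows the caching idea. It deliberately swaps the
red-black tree for an unsorted linked list to stay short, so tq_del() has to
rescan the list, which the kernel does not need to do. All tq_* names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct tq_node {
    int64_t expires;
    struct tq_node *link;       /* next element of the unsorted list */
};

struct tq_head {
    struct tq_node *list;       /* stand-in for the rb_root */
    struct tq_node *next;       /* cached node with the earliest expiry */
};

void tq_add(struct tq_head *h, struct tq_node *n)
{
    n->link = h->list;
    h->list = n;
    /* Update the cache only if the new node expires earlier. */
    if (h->next == NULL || n->expires < h->next->expires)
        h->next = n;
}

void tq_del(struct tq_head *h, struct tq_node *n)
{
    struct tq_node **pp = &h->list;
    struct tq_node *it;

    while (*pp != NULL && *pp != n)
        pp = &(*pp)->link;
    if (*pp == NULL)
        return;                 /* node is not queued */
    *pp = n->link;

    if (h->next != n)
        return;                 /* cache is still valid */
    /* The earliest node was removed: find the new earliest one. */
    h->next = NULL;
    for (it = h->list; it != NULL; it = it->link)
        if (h->next == NULL || it->expires < h->next->expires)
            h->next = it;
}

/* O(1) lookup of the earliest timer, like timerqueue_getnext(). */
struct tq_node *tq_getnext(const struct tq_head *h)
{
    return h->next;
}
```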



1096 /**
1097  * hrtimer_get_next_event - get the time until next expiry event
1098  *
1099  * Returns the delta to the next expiry event or KTIME_MAX if no timer
1100  * is pending.
1101  */
1102 ktime_t hrtimer_get_next_event(void)
1103 {
1104     struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
1105     struct hrtimer_clock_base *base = cpu_base->clock_base;
1106     ktime_t delta, mindelta = { .tv64 = KTIME_MAX };
1107     unsigned long flags;
1108     int i;
1109    
1110     raw_spin_lock_irqsave(&cpu_base->lock, flags);
1111    
1112     if (!hrtimer_hres_active()) {
1113         for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
1114             struct hrtimer *timer;
1115             struct timerqueue_node *next;
1116
1117             next = timerqueue_getnext(&base->active);
1118             if (!next)
1119                 continue;
1120    
1121             timer = container_of(next, struct hrtimer, node);
1122             delta.tv64 = hrtimer_get_expires_tv64(timer);
1123             delta = ktime_sub(delta, base->get_time());
1124             if (delta.tv64 < mindelta.tv64)
1125                 mindelta.tv64 = delta.tv64;
1126         }
1127     }
1128
1129     raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

Appendix III: How we build the relationship between the "Generic Time Subsystem" layer,
                            the "low resolution time subsystem" and "high-resolution timer system"

 
 hrtimer_switch_to_hres() does this.
 
  Two per-CPU variables are involved here:
  @tick_cpu_device is a per-CPU variable holding one instance of struct tick_device
        for each CPU in the system.
  @tick_cpu_sched is a per-CPU NOHZ control structure (struct tick_sched).
 
 [hrtimer_switch_to_hres() => tick_init_highres() => tick_switch_to_oneshot()]
  In tick_switch_to_oneshot(), the event handler of @tick_cpu_device.evtdev
  (a struct clock_event_device) is changed to hrtimer_interrupt() and the device's
  mode is changed to CLOCK_EVT_MODE_ONESHOT.
 
 [hrtimer_switch_to_hres() => tick_setup_sched_timer()]
  In tick_setup_sched_timer(), tick_sched_timer() is set as the new callback
  function of @tick_cpu_sched.sched_timer. tick_sched_timer() is used to simulate
  the periodic tick.
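
The handler switch described above is an assignment to a function pointer.
The userspace sketch below models it. struct evt_device, the two handlers and
fire_interrupt() are hypothetical stand-ins for struct clock_event_device, the
periodic tick handler, hrtimer_interrupt() and the hardware interrupt.

```c
#include <stddef.h>

enum evt_mode { EVT_MODE_PERIODIC, EVT_MODE_ONESHOT };

/* Hypothetical model of a clock event device. */
struct evt_device {
    enum evt_mode mode;
    void (*event_handler)(struct evt_device *dev);
    int periodic_calls;     /* how often the periodic handler ran */
    int hres_calls;         /* how often the high-resolution handler ran */
};

/* Stand-in for the periodic tick handler. */
void periodic_handler(struct evt_device *dev)
{
    dev->periodic_calls++;
}

/* Stand-in for hrtimer_interrupt(). */
void hres_handler(struct evt_device *dev)
{
    dev->hres_calls++;
}

/* Model of tick_switch_to_oneshot(): install handler, change the mode. */
int switch_to_oneshot(struct evt_device *dev,
                      void (*handler)(struct evt_device *))
{
    if (dev == NULL)
        return -1;
    dev->event_handler = handler;
    dev->mode = EVT_MODE_ONESHOT;
    return 0;
}

/* Model of a hardware timer interrupt: call whatever is installed. */
void fire_interrupt(struct evt_device *dev)
{
    dev->event_handler(dev);
}
```

After the switch, every interrupt of the device ends up in the
high-resolution handler, and the periodic tick has to be emulated by a timer
of its own.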


 49 struct tick_sched {
 50     struct hrtimer          sched_timer;
 51     unsigned long           check_clocks;
 52     enum tick_nohz_mode     nohz_mode;
          ...
         };
                      
 688 /*
 689  * Switch to high resolution mode
 690  */
 691 static int hrtimer_switch_to_hres(void)
 692 {              
 693     int i, cpu = smp_processor_id();
 694     struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu);
 695     unsigned long flags;
 696                        
 697     if (base->hres_active)
 698         return 1;
 699
 700     local_irq_save(flags);
 701
 702     if (tick_init_highres()) {
 703         local_irq_restore(flags);
 704         printk(KERN_WARNING "Could not switch to high resolution "
 705                     "mode on CPU %d\n", cpu);
 706         return 0;
 707     }
 708     base->hres_active = 1;
 709     for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
 710         base->clock_base[i].resolution = KTIME_HIGH_RES;
 711
 712     tick_setup_sched_timer();
 713
 714     /* "Retrigger" the interrupt to get things going */
 715     retrigger_next_event(NULL);
 716     local_irq_restore(flags);
 717     return 1;
 718 }

Appendix IV:  Detail explanation of some important 'time' members

=============
struct hrtimer {
    struct timerqueue_node      node;
    ktime_t             _softexpires;             --------------- [1]
    ...
};

struct timerqueue_node {
    struct rb_node node;
    ktime_t expires;   ---------------------------------------- [2]                         
};

struct hrtimer_cpu_base {
     ...
    ktime_t             expires_next;   ----------------------- [3]
     ...
    ktime_t             max_hang_time; -------------------- [4]
};

struct hrtimer_clock_base {
      ...
    ktime_t         softirq_time;  ----------------------------- [5]
    ktime_t         offset;    ------------------------------------ [6]
};

 

[1]: The hrtimer's soft expiry time. This is the value you pass to hrtimer_start() as
        @tim (converted to an absolute time if the mode is relative).
[2]: The hard expiry time, equal to ([1] + delta_ns), where @delta_ns is the "slack"
      range for the timer. Most of the time, [1] and [2] have the same value.

[3]: The absolute time of the next event that is due for expiration.
     It is used in tick_program_event() to reprogram @tick_cpu_device.evtdev.
     It is the smallest [2], less the clock base's [6], among all the hrtimers
     managed by this CPU.

[4]: Records the largest value of (@now - @entry_time) seen so far. @now is the
       current time and @entry_time is the time at which hrtimer_interrupt() was entered.
[5]: The time sampled when the hrtimer queues are run in low-resolution mode.
     hrtimer_run_queues() compares it against each timer's expiry time (see the
     excerpt below).
[6]: The offset by which the timers on this clock base need to be corrected.
     It changes when the clock is adjusted.
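
To see how _softexpires, node.expires and expires_next work together, here is a
rough userspace model of scanning one timer queue, loosely following the
hrtimer_interrupt() excerpt further down. It makes simplifying assumptions: a
single clock base with offset 0, timers already sorted by hard expiry, and
callbacks reduced to a counter. struct timer_model and scan_queue() are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct timer_model {
    int64_t soft;   /* models _softexpires */
    int64_t hard;   /* models node.expires */
};

/*
 * timers[] must be sorted by hard expiry. Returns how many timers have
 * expired at basenow and stores the time of the next event in
 * *expires_next (INT64_MAX if no timer is left).
 */
size_t scan_queue(const struct timer_model *timers, size_t n,
                  int64_t basenow, int64_t *expires_next)
{
    size_t fired = 0;
    size_t i;

    *expires_next = INT64_MAX;
    for (i = 0; i < n; i++) {
        if (basenow < timers[i].soft) {
            /* Not yet due: program the device for its hard expiry. */
            *expires_next = timers[i].hard;
            break;
        }
        fired++;    /* soft expiry reached: the callback would run here */
    }
    return fired;
}
```

A timer may therefore run as soon as its soft expiry time has passed, while
the event device is only programmed for the hard expiry time.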
     
     

HRT in low-resolution mode <<<<<


hrtimer_run_queues():
                    ...
1458             if (base->softirq_time.tv64 <=
1459                     hrtimer_get_expires_tv64(timer))
1460                 break;
                    ...
                    

High resolution timer <<<<<<


hrtimer_interrupt():
1274         basenow = ktime_add(now, base->offset);
...
1294             if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
1295                 ktime_t expires;
1296
1297                 expires = ktime_sub(hrtimer_get_expires(timer),
1298                             base->offset);
1299                 if (expires.tv64 < expires_next.tv64)
1300                     expires_next = expires;
1301                 break;