The Design and Implementation of the FreeBSD Operating System, Second Edition
Now available: The Design and Implementation of the FreeBSD Operating System (Second Edition)


[ source navigation ] [ diff markup ] [ identifier search ] [ freetext search ] [ file search ] [ list types ] [ track identifier ]

FreeBSD/Linux Kernel Cross Reference
sys/Documentation/hrtimers.txt

Version: -  FREEBSD  -  FREEBSD-13-STABLE  -  FREEBSD-13-0  -  FREEBSD-12-STABLE  -  FREEBSD-12-0  -  FREEBSD-11-STABLE  -  FREEBSD-11-0  -  FREEBSD-10-STABLE  -  FREEBSD-10-0  -  FREEBSD-9-STABLE  -  FREEBSD-9-0  -  FREEBSD-8-STABLE  -  FREEBSD-8-0  -  FREEBSD-7-STABLE  -  FREEBSD-7-0  -  FREEBSD-6-STABLE  -  FREEBSD-6-0  -  FREEBSD-5-STABLE  -  FREEBSD-5-0  -  FREEBSD-4-STABLE  -  FREEBSD-3-STABLE  -  FREEBSD22  -  l41  -  OPENBSD  -  linux-2.6  -  MK84  -  PLAN9  -  xnu-8792 
SearchContext: -  none  -  3  -  10 

    1 
    2 hrtimers - subsystem for high-resolution kernel timers
    3 ----------------------------------------------------
    4 
    5 This patch introduces a new subsystem for high-resolution kernel timers.
    6 
    7 One might ask the question: we already have a timer subsystem
    8 (kernel/timers.c), why do we need two timer subsystems? After a lot of
    9 back and forth trying to integrate high-resolution and high-precision
   10 features into the existing timer framework, and after testing various
   11 such high-resolution timer implementations in practice, we came to the
   12 conclusion that the timer wheel code is fundamentally not suitable for
   13 such an approach. We initially didn't believe this ('there must be a way
   14 to solve this'), and spent a considerable effort trying to integrate
   15 things into the timer wheel, but we failed. In hindsight, there are
   16 several reasons why such integration is hard/impossible:
   17 
   18 - the forced handling of low-resolution and high-resolution timers in
   19   the same way leads to a lot of compromises, macro magic and #ifdef
   20   mess. The timers.c code is very "tightly coded" around jiffies and
   21   32-bitness assumptions, and has been honed and micro-optimized for a
   22   relatively narrow use case (jiffies in a relatively narrow HZ range)
   23   for many years - and thus even small extensions to it easily break
   24   the wheel concept, leading to even worse compromises. The timer wheel
   25   code is very good and tight code, there's zero problems with it in its
   26   current usage - but it is simply not suitable to be extended for
   27   high-res timers.
   28 
   29 - the unpredictable [O(N)] overhead of cascading leads to delays which
   30   necessitate a more complex handling of high resolution timers, which
   31   in turn decreases robustness. Such a design still led to rather large
   32   timing inaccuracies. Cascading is a fundamental property of the timer
   33   wheel concept, it cannot be 'designed out' without unevitably
   34   degrading other portions of the timers.c code in an unacceptable way.
   35 
   36 - the implementation of the current posix-timer subsystem on top of
   37   the timer wheel has already introduced a quite complex handling of
   38   the required readjusting of absolute CLOCK_REALTIME timers at
   39   settimeofday or NTP time - further underlying our experience by
   40   example: that the timer wheel data structure is too rigid for high-res
   41   timers.
   42 
   43 - the timer wheel code is most optimal for use cases which can be
   44   identified as "timeouts". Such timeouts are usually set up to cover
   45   error conditions in various I/O paths, such as networking and block
   46   I/O. The vast majority of those timers never expire and are rarely
   47   recascaded because the expected correct event arrives in time so they
   48   can be removed from the timer wheel before any further processing of
   49   them becomes necessary. Thus the users of these timeouts can accept
   50   the granularity and precision tradeoffs of the timer wheel, and
   51   largely expect the timer subsystem to have near-zero overhead.
   52   Accurate timing for them is not a core purpose - in fact most of the
   53   timeout values used are ad-hoc. For them it is at most a necessary
   54   evil to guarantee the processing of actual timeout completions
   55   (because most of the timeouts are deleted before completion), which
   56   should thus be as cheap and unintrusive as possible.
   57 
   58 The primary users of precision timers are user-space applications that
   59 utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
   60 users like drivers and subsystems which require precise timed events
   61 (e.g. multimedia) can benefit from the availability of a separate
   62 high-resolution timer subsystem as well.
   63 
   64 While this subsystem does not offer high-resolution clock sources just
   65 yet, the hrtimer subsystem can be easily extended with high-resolution
   66 clock capabilities, and patches for that exist and are maturing quickly.
   67 The increasing demand for realtime and multimedia applications along
   68 with other potential users for precise timers gives another reason to
   69 separate the "timeout" and "precise timer" subsystems.
   70 
   71 Another potential benefit is that such a separation allows even more
   72 special-purpose optimization of the existing timer wheel for the low
   73 resolution and low precision use cases - once the precision-sensitive
   74 APIs are separated from the timer wheel and are migrated over to
   75 hrtimers. E.g. we could decrease the frequency of the timeout subsystem
   76 from 250 Hz to 100 HZ (or even smaller).
   77 
   78 hrtimer subsystem implementation details
   79 ----------------------------------------
   80 
   81 the basic design considerations were:
   82 
   83 - simplicity
   84 
   85 - data structure not bound to jiffies or any other granularity. All the
   86   kernel logic works at 64-bit nanoseconds resolution - no compromises.
   87 
   88 - simplification of existing, timing related kernel code
   89 
   90 another basic requirement was the immediate enqueueing and ordering of
   91 timers at activation time. After looking at several possible solutions
   92 such as radix trees and hashes, we chose the red black tree as the basic
   93 data structure. Rbtrees are available as a library in the kernel and are
   94 used in various performance-critical areas of e.g. memory management and
   95 file systems. The rbtree is solely used for time sorted ordering, while
   96 a separate list is used to give the expiry code fast access to the
   97 queued timers, without having to walk the rbtree.
   98 
   99 (This separate list is also useful for later when we'll introduce
  100 high-resolution clocks, where we need separate pending and expired
  101 queues while keeping the time-order intact.)
  102 
  103 Time-ordered enqueueing is not purely for the purposes of
  104 high-resolution clocks though, it also simplifies the handling of
  105 absolute timers based on a low-resolution CLOCK_REALTIME. The existing
  106 implementation needed to keep an extra list of all armed absolute
  107 CLOCK_REALTIME timers along with complex locking. In case of
  108 settimeofday and NTP, all the timers (!) had to be dequeued, the
  109 time-changing code had to fix them up one by one, and all of them had to
  110 be enqueued again. The time-ordered enqueueing and the storage of the
  111 expiry time in absolute time units removes all this complex and poorly
  112 scaling code from the posix-timer implementation - the clock can simply
  113 be set without having to touch the rbtree. This also makes the handling
  114 of posix-timers simpler in general.
  115 
  116 The locking and per-CPU behavior of hrtimers was mostly taken from the
  117 existing timer wheel code, as it is mature and well suited. Sharing code
  118 was not really a win, due to the different data structures. Also, the
  119 hrtimer functions now have clearer behavior and clearer names - such as
  120 hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
  121 equivalent to del_timer() and del_timer_sync()] - so there's no direct
  122 1:1 mapping between them on the algorithmical level, and thus no real
  123 potential for code sharing either.
  124 
  125 Basic data types: every time value, absolute or relative, is in a
  126 special nanosecond-resolution type: ktime_t. The kernel-internal
  127 representation of ktime_t values and operations is implemented via
  128 macros and inline functions, and can be switched between a "hybrid
  129 union" type and a plain "scalar" 64bit nanoseconds representation (at
  130 compile time). The hybrid union type optimizes time conversions on 32bit
  131 CPUs. This build-time-selectable ktime_t storage format was implemented
  132 to avoid the performance impact of 64-bit multiplications and divisions
  133 on 32bit CPUs. Such operations are frequently necessary to convert
  134 between the storage formats provided by kernel and userspace interfaces
  135 and the internal time format. (See include/linux/ktime.h for further
  136 details.)
  137 
  138 hrtimers - rounding of timer values
  139 -----------------------------------
  140 
  141 the hrtimer code will round timer events to lower-resolution clocks
  142 because it has to. Otherwise it will do no artificial rounding at all.
  143 
  144 one question is, what resolution value should be returned to the user by
  145 the clock_getres() interface. This will return whatever real resolution
  146 a given clock has - be it low-res, high-res, or artificially-low-res.
  147 
  148 hrtimers - testing and verification
  149 ----------------------------------
  150 
  151 We used the high-resolution clock subsystem ontop of hrtimers to verify
  152 the hrtimer implementation details in praxis, and we also ran the posix
  153 timer tests in order to ensure specification compliance. We also ran
  154 tests on low-resolution clocks.
  155 
  156 The hrtimer patch converts the following kernel functionality to use
  157 hrtimers:
  158 
  159  - nanosleep
  160  - itimers
  161  - posix-timers
  162 
  163 The conversion of nanosleep and posix-timers enabled the unification of
  164 nanosleep and clock_nanosleep.
  165 
  166 The code was successfully compiled for the following platforms:
  167 
  168  i386, x86_64, ARM, PPC, PPC64, IA64
  169 
  170 The code was run-tested on the following platforms:
  171 
  172  i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
  173 
  174 hrtimers were also integrated into the -rt tree, along with a
  175 hrtimers-based high-resolution clock implementation, so the hrtimers
  176 code got a healthy amount of testing and use in practice.
  177 
  178         Thomas Gleixner, Ingo Molnar

Cache object: b5237bdbdd9705c1a0a4635bcb2580f0


[ source navigation ] [ diff markup ] [ identifier search ] [ freetext search ] [ file search ] [ list types ] [ track identifier ]


This page is part of the FreeBSD/Linux Linux Kernel Cross-Reference, and was automatically generated using a modified version of the LXR engine.