The Design and Implementation of the FreeBSD Operating System, Second Edition
Now available: The Design and Implementation of the FreeBSD Operating System (Second Edition)


[ source navigation ] [ diff markup ] [ identifier search ] [ freetext search ] [ file search ] [ list types ] [ track identifier ]

FreeBSD/Linux Kernel Cross Reference
sys/Documentation/futex-requeue-pi.txt

Version: -  FREEBSD  -  FREEBSD-13-STABLE  -  FREEBSD-13-0  -  FREEBSD-12-STABLE  -  FREEBSD-12-0  -  FREEBSD-11-STABLE  -  FREEBSD-11-0  -  FREEBSD-10-STABLE  -  FREEBSD-10-0  -  FREEBSD-9-STABLE  -  FREEBSD-9-0  -  FREEBSD-8-STABLE  -  FREEBSD-8-0  -  FREEBSD-7-STABLE  -  FREEBSD-7-0  -  FREEBSD-6-STABLE  -  FREEBSD-6-0  -  FREEBSD-5-STABLE  -  FREEBSD-5-0  -  FREEBSD-4-STABLE  -  FREEBSD-3-STABLE  -  FREEBSD22  -  l41  -  OPENBSD  -  linux-2.6  -  MK84  -  PLAN9  -  xnu-8792 
SearchContext: -  none  -  3  -  10 

    1 Futex Requeue PI
    2 ----------------
    3 
    4 Requeueing of tasks from a non-PI futex to a PI futex requires
    5 special handling in order to ensure the underlying rt_mutex is never
    6 left without an owner if it has waiters; doing so would break the PI
    7 boosting logic [see rt-mutex-desgin.txt] For the purposes of
    8 brevity, this action will be referred to as "requeue_pi" throughout
    9 this document.  Priority inheritance is abbreviated throughout as
   10 "PI".
   11 
   12 Motivation
   13 ----------
   14 
   15 Without requeue_pi, the glibc implementation of
   16 pthread_cond_broadcast() must resort to waking all the tasks waiting
   17 on a pthread_condvar and letting them try to sort out which task
   18 gets to run first in classic thundering-herd formation.  An ideal
   19 implementation would wake the highest-priority waiter, and leave the
   20 rest to the natural wakeup inherent in unlocking the mutex
   21 associated with the condvar.
   22 
   23 Consider the simplified glibc calls:
   24 
   25 /* caller must lock mutex */
   26 pthread_cond_wait(cond, mutex)
   27 {
   28         lock(cond->__data.__lock);
   29         unlock(mutex);
   30         do {
   31            unlock(cond->__data.__lock);
   32            futex_wait(cond->__data.__futex);
   33            lock(cond->__data.__lock);
   34         } while(...)
   35         unlock(cond->__data.__lock);
   36         lock(mutex);
   37 }
   38 
   39 pthread_cond_broadcast(cond)
   40 {
   41         lock(cond->__data.__lock);
   42         unlock(cond->__data.__lock);
   43         futex_requeue(cond->data.__futex, cond->mutex);
   44 }
   45 
   46 Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
   47 has waiters. Note that pthread_cond_wait() attempts to lock the
   48 mutex only after it has returned to user space.  This will leave the
   49 underlying rt_mutex with waiters, and no owner, breaking the
   50 previously mentioned PI-boosting algorithms.
   51 
   52 In order to support PI-aware pthread_condvar's, the kernel needs to
   53 be able to requeue tasks to PI futexes.  This support implies that
   54 upon a successful futex_wait system call, the caller would return to
   55 user space already holding the PI futex.  The glibc implementation
   56 would be modified as follows:
   57 
   58 
   59 /* caller must lock mutex */
   60 pthread_cond_wait_pi(cond, mutex)
   61 {
   62         lock(cond->__data.__lock);
   63         unlock(mutex);
   64         do {
   65            unlock(cond->__data.__lock);
   66            futex_wait_requeue_pi(cond->__data.__futex);
   67            lock(cond->__data.__lock);
   68         } while(...)
   69         unlock(cond->__data.__lock);
   70         /* the kernel acquired the the mutex for us */
   71 }
   72 
   73 pthread_cond_broadcast_pi(cond)
   74 {
   75         lock(cond->__data.__lock);
   76         unlock(cond->__data.__lock);
   77         futex_requeue_pi(cond->data.__futex, cond->mutex);
   78 }
   79 
   80 The actual glibc implementation will likely test for PI and make the
   81 necessary changes inside the existing calls rather than creating new
   82 calls for the PI cases.  Similar changes are needed for
   83 pthread_cond_timedwait() and pthread_cond_signal().
   84 
   85 Implementation
   86 --------------
   87 
   88 In order to ensure the rt_mutex has an owner if it has waiters, it
   89 is necessary for both the requeue code, as well as the waiting code,
   90 to be able to acquire the rt_mutex before returning to user space.
   91 The requeue code cannot simply wake the waiter and leave it to
   92 acquire the rt_mutex as it would open a race window between the
   93 requeue call returning to user space and the waiter waking and
   94 starting to run.  This is especially true in the uncontended case.
   95 
   96 The solution involves two new rt_mutex helper routines,
   97 rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
   98 allow the requeue code to acquire an uncontended rt_mutex on behalf
   99 of the waiter and to enqueue the waiter on a contended rt_mutex.
  100 Two new system calls provide the kernel<->user interface to
  101 requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_CMP_PI.
  102 
  103 FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
  104 and pthread_cond_timedwait()) to block on the initial futex and wait
  105 to be requeued to a PI-aware futex.  The implementation is the
  106 result of a high-speed collision between futex_wait() and
  107 futex_lock_pi(), with some extra logic to check for the additional
  108 wake-up scenarios.
  109 
  110 FUTEX_REQUEUE_CMP_PI is called by the waker
  111 (pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
  112 possibly wake the waiting tasks. Internally, this system call is
  113 still handled by futex_requeue (by passing requeue_pi=1).  Before
  114 requeueing, futex_requeue() attempts to acquire the requeue target
  115 PI futex on behalf of the top waiter.  If it can, this waiter is
  116 woken.  futex_requeue() then proceeds to requeue the remaining
  117 nr_wake+nr_requeue tasks to the PI futex, calling
  118 rt_mutex_start_proxy_lock() prior to each requeue to prepare the
  119 task as a waiter on the underlying rt_mutex.  It is possible that
  120 the lock can be acquired at this stage as well, if so, the next
  121 waiter is woken to finish the acquisition of the lock.
  122 
  123 FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
  124 their sum is all that really matters.  futex_requeue() will wake or
  125 requeue up to nr_wake + nr_requeue tasks.  It will wake only as many
  126 tasks as it can acquire the lock for, which in the majority of cases
  127 should be 0 as good programming practice dictates that the caller of
  128 either pthread_cond_broadcast() or pthread_cond_signal() acquire the
  129 mutex prior to making the call. FUTEX_REQUEUE_PI requires that
  130 nr_wake=1.  nr_requeue should be INT_MAX for broadcast and 0 for
  131 signal.

Cache object: 8c00b1b20da0383baa52c18facd9f26a


[ source navigation ] [ diff markup ] [ identifier search ] [ freetext search ] [ file search ] [ list types ] [ track identifier ]


This page is part of the FreeBSD/Linux Linux Kernel Cross-Reference, and was automatically generated using a modified version of the LXR engine.