sys/Documentation/workqueue.txt

Concurrency Managed Workqueue (cmwq)

September, 2010         Tejun Heo <tj@kernel.org>
                        Florian Mickler <florian@mickler.org>

CONTENTS

1. Introduction
2. Why cmwq?
3. The Design
4. Application Programming Interface (API)
5. Example Execution Scenarios
6. Guidelines
7. Debugging


1. Introduction

There are many cases where an asynchronous process execution context
is needed and the workqueue (wq) API is the most commonly used
mechanism for such cases.

When such an asynchronous execution context is needed, a work item
describing which function to execute is put on a queue.  An
independent thread serves as the asynchronous execution context.  The
queue is called workqueue and the thread is called worker.

While there are work items on the workqueue the worker executes the
functions associated with the work items one after the other.  When
there is no work item left on the workqueue the worker becomes idle.
When a new work item gets queued, the worker begins executing again.


2. Why cmwq?

In the original wq implementation, a multi-threaded (MT) wq had one
worker thread per CPU and a single-threaded (ST) wq had one worker
thread system-wide.  A single MT wq needed to keep around the same
number of workers as the number of CPUs.  The kernel grew a lot of MT
wq users over the years and with the number of CPU cores continuously
rising, some systems saturated the default 32k PID space just booting
up.

Although MT wq wasted a lot of resources, the level of concurrency
provided was unsatisfactory.  The limitation was common to both ST and
MT wq, albeit less severe on MT.  Each wq maintained its own separate
worker pool.  An MT wq could provide only one execution context per
CPU while an ST wq provided one for the whole system.  Work items had
to compete for those very limited execution contexts, leading to
various problems including proneness to deadlocks around the single
execution context.

The tension between the provided level of concurrency and resource
usage also forced its users to make unnecessary tradeoffs, like libata
choosing to use ST wq for polling PIOs and accepting an unnecessary
limitation that no two polling PIOs can progress at the same time.  As
MT wq doesn't provide much better concurrency, users which require a
higher level of concurrency, like async or fscache, had to implement
their own thread pool.

Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
focus on the following goals.

* Maintain compatibility with the original workqueue API.

* Use per-CPU unified worker pools shared by all wq to provide a
  flexible level of concurrency on demand without wasting a lot of
  resources.

* Automatically regulate worker pool and level of concurrency so that
  the API users don't need to worry about such details.


3. The Design

In order to ease the asynchronous execution of functions a new
abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the function
that is to be executed asynchronously.  Whenever a driver or subsystem
wants a function to be executed asynchronously it has to set up a work
item pointing to that function and queue that work item on a
workqueue.

Special purpose threads, called worker threads, execute the functions
off of the queue, one after the other.  If no work is queued, the
worker threads become idle.  These worker threads are managed in
so-called thread-pools.
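
The relationship between a work item, a queue and a worker can be
sketched in plain userland C.  This is only a model: the names
model_work, model_wq, model_queue_work() and model_worker_run() are
hypothetical and are not the kernel's types or functions, and there
is no real thread here, just one pass of a worker draining the queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland model; these are not the kernel's types. */
struct model_work {
    void (*func)(struct model_work *work); /* function to run later */
    struct model_work *next;               /* FIFO link */
    int id;                                /* demo payload */
};

struct model_wq {
    struct model_work *head, *tail;
};

/* Demo work function: records the order in which items executed. */
static int exec_log[8];
static int exec_count;

static void record_id(struct model_work *work)
{
    if (exec_count < 8)
        exec_log[exec_count++] = work->id;
}

/* Set up a work item pointing at func and append it to the queue. */
static void model_queue_work(struct model_wq *wq, struct model_work *work,
                             void (*func)(struct model_work *), int id)
{
    assert(func != NULL);
    work->func = func;
    work->id = id;
    work->next = NULL;
    if (wq->tail)
        wq->tail->next = work;
    else
        wq->head = work;
    wq->tail = work;
}

/*
 * One pass of a worker: run queued functions one after the other
 * until the queue is empty, then return (the worker goes idle).
 * Returns the number of work items executed.
 */
static int model_worker_run(struct model_wq *wq)
{
    int n = 0;

    while (wq->head) {
        struct model_work *work = wq->head;

        wq->head = work->next;
        if (!wq->head)
            wq->tail = NULL;
        work->func(work);
        n++;
    }
    return n;
}
```

Queueing three items with record_id() and then calling
model_worker_run() executes them in queueing order; a second call
finds nothing to do, which corresponds to the idle worker.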

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
which manages thread-pools and processes the queued work items.

The backend is called gcwq.  There is one gcwq for each possible CPU
and one gcwq to serve work items queued on unbound workqueues.  Each
gcwq has two thread-pools - one for normal work items and the other
for high priority ones.

Subsystems and drivers can create and queue work items through special
workqueue API functions as they see fit.  They can influence some
aspects of the way the work items are executed by setting flags on the
workqueue they are putting the work item on.  These flags include
things like CPU locality, reentrancy, concurrency limits, priority and
more.  To get a detailed overview refer to the API description of
alloc_workqueue() below.

When a work item is queued to a workqueue, the target gcwq and
thread-pool are determined according to the queue parameters and
workqueue attributes, and the work item is appended to the shared
worklist of the thread-pool.  For example, unless specifically
overridden, a work item of a bound workqueue will be queued on the
worklist of either the normal or the highpri thread-pool of the gcwq
that is associated with the CPU the issuer is running on.
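
As a rough illustration of that selection step, the following
userland sketch maps workqueue flags and the issuing CPU to a gcwq
index and a thread-pool.  The flag values, the struct and the
function name are assumptions made up for this model (the real
kernel flag values and data structures differ), and the unbound gcwq
is simply given the index nr_cpus.

```c
#include <assert.h>

/* Hypothetical flag bits for this model; not the kernel's values. */
#define MODEL_WQ_UNBOUND  (1u << 0)
#define MODEL_WQ_HIGHPRI  (1u << 1)

enum model_pool { MODEL_POOL_NORMAL = 0, MODEL_POOL_HIGHPRI = 1 };

struct model_target {
    int gcwq;             /* 0..nr_cpus-1: per-CPU gcwq; nr_cpus: unbound */
    enum model_pool pool; /* which of the gcwq's two thread-pools */
};

/* Pick the gcwq and thread-pool a newly queued work item lands on. */
static struct model_target model_select_target(unsigned int flags,
                                               int issuer_cpu, int nr_cpus)
{
    struct model_target t;

    assert(nr_cpus > 0 && issuer_cpu >= 0 && issuer_cpu < nr_cpus);
    /* Bound wq: the issuer's CPU.  Unbound wq: the single unbound gcwq. */
    t.gcwq = (flags & MODEL_WQ_UNBOUND) ? nr_cpus : issuer_cpu;
    t.pool = (flags & MODEL_WQ_HIGHPRI) ? MODEL_POOL_HIGHPRI
                                        : MODEL_POOL_NORMAL;
    return t;
}
```

The sketch leaves out the "specifically overridden" case, where the
caller names a target CPU instead of using the one it runs on.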

For any worker pool implementation, managing the concurrency level
(how many execution contexts are active) is an important issue.  cmwq
tries to keep the concurrency at a minimal but sufficient level.
Minimal to save resources and sufficient in that the system is used at
its full capacity.

Each thread-pool bound to an actual CPU implements concurrency
management by hooking into the scheduler.  The thread-pool is notified
whenever an active worker wakes up or sleeps and keeps track of the
number of the currently runnable workers.  Generally, work items are
not expected to hog a CPU and consume many cycles.  That means
maintaining just enough concurrency to prevent work processing from
stalling should be optimal.  As long as there are one or more runnable
workers on the CPU, the thread-pool doesn't start execution of a new
work item, but when the last running worker goes to sleep, it
immediately schedules a new worker so that the CPU doesn't sit idle
while there are pending work items.  This allows using a minimal
number of workers without losing execution bandwidth.
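
The rule described above can be written down as a small state model.
The following is a userland sketch, not the kernel's implementation:
the struct and the model_* functions are hypothetical names, and the
scheduler hooks are reduced to two plain function calls.

```c
#include <assert.h>

struct model_pool_state {
    int nr_running; /* workers currently runnable on this CPU */
    int nr_pending; /* work items waiting on the shared worklist */
};

/* Start another worker only if work is pending and nobody is runnable. */
static int model_need_worker(const struct model_pool_state *p)
{
    return p->nr_pending > 0 && p->nr_running == 0;
}

/*
 * Scheduler hook: a worker is about to sleep.
 * Returns 1 if another worker should be started right away.
 */
static int model_worker_sleeping(struct model_pool_state *p)
{
    assert(p->nr_running > 0);
    p->nr_running--;
    return model_need_worker(p);
}

/* Scheduler hook: a sleeping worker woke up and is runnable again. */
static void model_worker_waking(struct model_pool_state *p)
{
    p->nr_running++;
}

/* A worker takes the next pending work item and starts running it. */
static void model_start_work(struct model_pool_state *p)
{
    assert(p->nr_pending > 0);
    p->nr_pending--;
    p->nr_running++;
}
```

With two pending items, the first one starts immediately; the second
starts only once model_worker_sleeping() reports that the last
runnable worker has gone to sleep.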

Keeping idle workers around costs nothing other than the memory space
for kthreads, so cmwq holds onto idle ones for a while before killing
them.

For an unbound wq, the above concurrency management doesn't apply and
the thread-pools for the pseudo unbound CPU try to start executing all
work items as soon as possible.  The responsibility of regulating
concurrency level is on the users.  There is also a flag to mark a
bound wq to ignore the concurrency management.  Please refer to the
API section for details.

The forward progress guarantee relies on workers being created when
more execution contexts are necessary, which in turn is guaranteed
through the use of rescue workers.  All work items which might be used
on code paths that handle memory reclaim are required to be queued on
wq's that have a rescue-worker reserved for execution under memory
pressure.  Otherwise it is possible that the thread-pool deadlocks
waiting for execution contexts to free up.


4. Application Programming Interface (API)

alloc_workqueue() allocates a wq.  The original create_*workqueue()
functions are deprecated and scheduled for removal.  alloc_workqueue()
takes three arguments - @name, @flags and @max_active.  @name is the
name of the wq and is also used as the name of the rescuer thread if
there is one.

A wq no longer manages execution resources but serves as a domain for
forward progress guarantee, flush and work item attributes.  @flags
and @max_active control how work items are assigned execution
resources, scheduled and executed.

@flags:

  WQ_NON_REENTRANT

        By default, a wq guarantees non-reentrance only on the same
        CPU.  A work item may not be executed concurrently on the same
        CPU by multiple workers but is allowed to be executed
        concurrently on multiple CPUs.  This flag makes sure
        non-reentrance is enforced across all CPUs.  Work items queued
        to a non-reentrant wq are guaranteed to be executed by at most
        one worker system-wide at any given time.

  WQ_UNBOUND

        Work items queued to an unbound wq are served by a special
        gcwq which hosts workers which are not bound to any specific
        CPU.  This makes the wq behave as a simple execution context
        provider without concurrency management.  The unbound gcwq
        tries to start execution of work items as soon as possible.
        Unbound wq sacrifices locality but is useful for the following
        cases.

        * Wide fluctuation in the concurrency level requirement is
          expected and using bound wq may end up creating a large
          number of mostly unused workers across different CPUs as the
          issuer hops through different CPUs.

        * Long running CPU intensive workloads which can be better
          managed by the system scheduler.

  WQ_FREEZABLE

        A freezable wq participates in the freeze phase of the system
        suspend operations.  Work items on the wq are drained and no
        new work item starts execution until thawed.

  WQ_MEM_RECLAIM

        All wq which might be used in the memory reclaim paths _MUST_
        have this flag set.  The wq is guaranteed to have at least one
        execution context regardless of memory pressure.

  WQ_HIGHPRI

        Work items of a highpri wq are queued to the highpri
        thread-pool of the target gcwq.  Highpri thread-pools are
        served by worker threads with elevated nice level.

        Note that normal and highpri thread-pools don't interact with
        each other.  Each maintains its own separate pool of workers
        and implements concurrency management among its workers.

  WQ_CPU_INTENSIVE

        Work items of a CPU intensive wq do not contribute to the
        concurrency level.  In other words, runnable CPU intensive
        work items will not prevent other work items in the same
        thread-pool from starting execution.  This is useful for bound
        work items which are expected to hog CPU cycles so that their
        execution is regulated by the system scheduler.

        Although CPU intensive work items don't contribute to the
        concurrency level, start of their executions is still
        regulated by the concurrency management and runnable
        non-CPU-intensive work items can delay execution of CPU
        intensive work items.

        This flag is meaningless for unbound wq.

@max_active:

@max_active determines the maximum number of execution contexts per
CPU which can be assigned to the work items of a wq.  For example,
with @max_active of 16, at most 16 work items of the wq can be
executing at the same time per CPU.

Currently, for a bound wq, the maximum limit for @max_active is 512
and the default value used when 0 is specified is 256.  For an unbound
wq, the limit is the higher of 512 and 4 * num_possible_cpus().  These
values are chosen sufficiently high such that they are not the
limiting factor while providing protection in runaway cases.
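
The limits above amount to a small piece of arithmetic, sketched here
as a userland function.  The constants 512 and 256 come from the text;
the function and macro names are hypothetical.  Two points are
assumptions of this sketch rather than statements from the text: a
value above the limit is clamped to the limit, and 0 selects the
default of 256 for an unbound wq as well.

```c
#include <assert.h>

#define MODEL_MAX_ACTIVE 512 /* bound wq limit, from the text */
#define MODEL_DFL_ACTIVE 256 /* used when 0 is specified, from the text */

/* Unbound limit: the higher of 512 and 4 * number of possible CPUs. */
static int model_unbound_limit(int nr_possible_cpus)
{
    int scaled = 4 * nr_possible_cpus;

    return scaled > MODEL_MAX_ACTIVE ? scaled : MODEL_MAX_ACTIVE;
}

/*
 * Effective @max_active for a wq.  Assumption: values above the
 * limit are clamped, and 0 means "use the default" for both kinds.
 */
static int model_effective_max_active(int max_active, int unbound,
                                      int nr_possible_cpus)
{
    int limit;

    assert(max_active >= 0 && nr_possible_cpus > 0);
    limit = unbound ? model_unbound_limit(nr_possible_cpus)
                    : MODEL_MAX_ACTIVE;
    if (max_active == 0)
        return MODEL_DFL_ACTIVE;
    return max_active > limit ? limit : max_active;
}
```

For example, on a machine with 256 possible CPUs the unbound limit
under this model is 1024, so a request for 1000 is kept as is, while
the same request on a bound wq is cut to 512.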

The number of active work items of a wq is usually regulated by the
users of the wq, more specifically, by how many work items the users
may queue at the same time.  Unless there is a specific need for
throttling the number of active work items, specifying '0' is
recommended.

Some users depend on the strict execution ordering of ST wq.  The
combination of @max_active of 1 and WQ_UNBOUND is used to achieve this
behavior.  Work items on such wq are always queued to the unbound gcwq
and only one work item can be active at any given time thus achieving
the same ordering property as ST wq.


5. Example Execution Scenarios

The following example execution scenarios try to illustrate how cmwq
behaves under different configurations.

 Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU.
 w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms
 again before finishing.  w1 and w2 burn CPU for 5ms then sleep for
 10ms.

Ignoring all other tasks, works and processing overhead, and assuming
simple FIFO scheduling, the following is one highly simplified version
of possible sequences of events with the original wq.

 TIME IN MSECS  EVENT
 0              w0 starts and burns CPU
 5              w0 sleeps
 15             w0 wakes up and burns CPU
 20             w0 finishes
 20             w1 starts and burns CPU
 25             w1 sleeps
 35             w1 wakes up and finishes
 35             w2 starts and burns CPU
 40             w2 sleeps
 50             w2 wakes up and finishes

And with cmwq with @max_active >= 3,

 TIME IN MSECS  EVENT
 0              w0 starts and burns CPU
 5              w0 sleeps
 5              w1 starts and burns CPU
 10             w1 sleeps
 10             w2 starts and burns CPU
 15             w2 sleeps
 15             w0 wakes up and burns CPU
 20             w0 finishes
 20             w1 wakes up and finishes
 25             w2 wakes up and finishes

If @max_active == 2,

 TIME IN MSECS  EVENT
 0              w0 starts and burns CPU
 5              w0 sleeps
 5              w1 starts and burns CPU
 10             w1 sleeps
 15             w0 wakes up and burns CPU
 20             w0 finishes
 20             w1 wakes up and finishes
 20             w2 starts and burns CPU
 25             w2 sleeps
 35             w2 wakes up and finishes
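
The three timelines above can be reproduced with a small tick-based
simulation.  This is a userland sketch under the same simplifying
assumptions as the tables (one CPU, FIFO order, no overhead); all
names are hypothetical.  The original wq is modelled as max_active
of 1, which is an assumption of the sketch: it gives the same
one-at-a-time behaviour on a single CPU.

```c
#include <assert.h>

enum model_state { M_PENDING, M_RUNNABLE, M_SLEEPING, M_DONE };

struct model_item {
    int burn1, sleep, burn2; /* ms; burn2 == 0 means no second burn */
    enum model_state st;
    int second;              /* set once the second burn has started */
    int left;                /* ms left in the current burn or sleep */
    int finish;              /* completion time in ms, -1 until done */
};

/*
 * Simulate n work items queued in FIFO order on one CPU, in 1 ms
 * ticks.  A pending item starts only when no worker is runnable and
 * fewer than max_active items are in flight.
 */
static void model_simulate(struct model_item *w, int n, int max_active)
{
    int active = 0, done = 0, t, i, run;

    assert(max_active >= 1);
    for (i = 0; i < n; i++) {
        assert(w[i].burn1 > 0 && w[i].sleep >= 0 && w[i].burn2 >= 0);
        w[i].st = M_PENDING;
        w[i].second = 0;
        w[i].finish = -1;
    }

    for (t = 0; done < n; t++) {
        /* Wake sleepers whose sleep has elapsed. */
        for (i = 0; i < n; i++) {
            if (w[i].st != M_SLEEPING || w[i].left > 0)
                continue;
            if (w[i].burn2 > 0) {
                w[i].st = M_RUNNABLE;
                w[i].second = 1;
                w[i].left = w[i].burn2;
            } else {
                w[i].st = M_DONE;
                w[i].finish = t;
                active--;
                done++;
            }
        }

        /* Run the first runnable item, if any. */
        run = -1;
        for (i = 0; i < n && run < 0; i++)
            if (w[i].st == M_RUNNABLE)
                run = i;

        /* Nobody runnable: start the next pending item if allowed. */
        for (i = 0; i < n && run < 0 && active < max_active; i++) {
            if (w[i].st == M_PENDING) {
                w[i].st = M_RUNNABLE;
                w[i].left = w[i].burn1;
                active++;
                run = i;
            }
        }

        /* Advance time by one tick: sleepers first, then the runner. */
        for (i = 0; i < n; i++)
            if (w[i].st == M_SLEEPING)
                w[i].left--;
        if (run >= 0 && --w[run].left == 0) {
            if (w[run].second) {
                w[run].st = M_DONE;
                w[run].finish = t + 1;
                active--;
                done++;
            } else {
                w[run].st = M_SLEEPING;
                w[run].left = w[run].sleep;
            }
        }
    }
}
```

Describing w0 as {5, 10, 5} and w1, w2 as {5, 10, 0}, the finish
times are 20, 20 and 25 ms for max_active of 3, 20, 20 and 35 ms for
max_active of 2, and 20, 35 and 50 ms for max_active of 1, matching
the tables.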

Now, let's assume w1 and w2 are queued to a different wq q1 which has
WQ_CPU_INTENSIVE set,

 TIME IN MSECS  EVENT
 0              w0 starts and burns CPU
 5              w0 sleeps
 5              w1 and w2 start and burn CPU
 10             w1 sleeps
 15             w2 sleeps
 15             w0 wakes up and burns CPU
 20             w0 finishes
 20             w1 wakes up and finishes
 25             w2 wakes up and finishes


6. Guidelines

* Do not forget to use WQ_MEM_RECLAIM if a wq may process work items
  which are used during memory reclaim.  Each wq with WQ_MEM_RECLAIM
  set has an execution context reserved for it.  If there is
  dependency among multiple work items used during memory reclaim,
  they should be queued to separate wq each with WQ_MEM_RECLAIM.

* Unless strict ordering is required, there is no need to use ST wq.

* Unless there is a specific need, using 0 for @max_active is
  recommended.  In most use cases, concurrency level usually stays
  well under the default limit.

* A wq serves as a domain for forward progress guarantee
  (WQ_MEM_RECLAIM), flush and work item attributes.  Work items which
  are not involved in memory reclaim and don't need to be flushed as a
  part of a group of work items, and don't require any special
  attribute, can use one of the system wq.  There is no difference in
  execution characteristics between using a dedicated wq and a system
  wq.

* Unless work items are expected to consume a huge amount of CPU
  cycles, using a bound wq is usually beneficial due to the increased
  level of locality in wq operations and work item execution.


7. Debugging

Because the work functions are executed by generic worker threads
there are a few tricks needed to shed some light on misbehaving
workqueue users.

Worker threads show up in the process list as:

root      5671  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/0:1]
root      5672  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/1:2]
root      5673  0.0  0.0      0     0 ?        S    12:12   0:00 [kworker/0:0]
root      5674  0.0  0.0      0     0 ?        S    12:13   0:00 [kworker/1:0]

If kworkers are going crazy (using too much CPU), there are two types
of possible problems:

        1. Something being scheduled in rapid succession
        2. A single work item that consumes lots of CPU cycles

The first one can be tracked using tracing:

        $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
        $ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
        (wait a few secs)
        ^C

If something is busy looping on work queueing, it would be dominating
the output and the offender can be determined with the work item
function.
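
When out.txt is large, pulling the work item function out of each
line makes the dominant offender easier to spot.  The helper below is
a userland sketch with a hypothetical name.  It assumes that each
workqueue_queue_work trace line carries a "function=NAME" field; the
exact line layout depends on the kernel version, so check a few lines
of the real output first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of the "function=" field of one trace line into buf.
 * Returns 1 on success, 0 if the field is missing or does not fit.
 * Assumption: the field value ends at the next whitespace character.
 */
static int model_trace_function(const char *line, char *buf, size_t size)
{
    static const char key[] = "function=";
    const char *p = strstr(line, key);
    size_t len;

    assert(buf != NULL && size > 0);
    if (!p)
        return 0;
    p += sizeof(key) - 1;           /* skip past "function=" */
    len = strcspn(p, " \t\n");      /* length of the field value */
    if (len == 0 || len >= size)
        return 0;
    memcpy(buf, p, len);
    buf[len] = '\0';
    return 1;
}
```

Feeding every line of out.txt through this helper and counting the
resulting names shows which work function is being queued most often.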

For the second type of problems it should be possible to just check
the stack trace of the offending worker thread.

        $ cat /proc/THE_OFFENDING_KWORKER/stack

The work item's function should be trivially visible in the stack
trace.
