FreeBSD/Linux Kernel Cross Reference
sys/Documentation/sched-nice-design.txt

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that the nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!):


                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
 -*----------------------------------*-----> [nice level]
 -20               |                +19
                   |
                   |

So if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)
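
As a rough illustration of the graph above, the old rule can be sketched
as two linear segments that meet at nice 0. This is only a sketch under
stated assumptions, not the O(1) scheduler's actual code: the names
jiffy_ms() and old_timeslice_ms() are hypothetical, the 100 msecs at
nice 0 and the 1 jiffy at nice +19 are taken from the text, and the
800 msecs at nice -20 is an assumed value used only to make the negative
side steeper than the positive side.

```c
/* Illustrative constants, see the assumptions in the lead-in. */
#define NICE0_SLICE_MS		100	/* from the graph: 100 msecs at nice 0 */
#define NICE_M20_SLICE_MS	800	/* ASSUMPTION: timeslice at nice -20 */

/* Length of one jiffy in msecs for a given HZ (integer division). */
int jiffy_ms(int hz)
{
	return 1000 / hz;
}

/*
 * Hypothetical helper: timeslice in msecs for a nice level in
 * [-20, +19], using one linear segment per side of nice 0.
 */
int old_timeslice_ms(int nice, int hz)
{
	int min_ms = jiffy_ms(hz);	/* nice +19 gets exactly 1 jiffy */

	if (nice >= 0)
		return NICE0_SLICE_MS -
		       (NICE0_SLICE_MS - min_ms) * nice / 19;

	/* Steeper segment for negative nice levels. */
	return NICE0_SLICE_MS +
	       (NICE_M20_SLICE_MS - NICE0_SLICE_MS) * (-nice) / 20;
}
```

With HZ=1000 this gives 1 msec at nice +19, and with HZ=100 it gives
10 msecs, which is the HZ sensitivity of the old rule.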

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling, which would thrash the cache, etc. (Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number-crunching apps at nice +19.)

So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization; we only got complaints about it (still) being
too _strong_ :-)
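
The 5% figure can be reproduced with simple arithmetic. The sketch below
assumes (the text does not spell this out) that the nice +19 task
competes with a single CPU-bound nice 0 task that has a 100-msec
timeslice, and that both tasks run for their full timeslice in every
round. The name cpu_share_permille() is hypothetical.

```c
/*
 * Share of the CPU, in units of 1/1000 and rounded to nearest, that a
 * task with timeslice slice_ms gets when it alternates with one other
 * task whose timeslice is other_slice_ms.
 */
int cpu_share_permille(int slice_ms, int other_slice_ms)
{
	int total = slice_ms + other_slice_ms;

	return (1000 * slice_ms + total / 2) / total;
}
```

Under these assumptions cpu_share_permille(5, 100) is 48, i.e. roughly
5%, while a 10-msec timeslice (1 jiffy at HZ=100) against the same
competitor gives 91, i.e. roughly 9%.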

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design-level
coupling to timeslices and granularity it was not really viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative":

   int nice(int inc);

   asmlinkage long sys_nice(int increment)

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.
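
A minimal sketch of using the relative glibc API from a program follows.
The wrapper name renice_self_by() is hypothetical. The only non-obvious
part is error handling: nice() returns the new nice level, and -1 is
itself a valid nice level, so errno has to be cleared before the call
and checked afterwards.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical wrapper around nice(): apply a relative increment to the
 * calling process and report the resulting nice level.
 * Returns 0 on success, -1 on error (errno is set by nice()).
 */
int renice_self_by(int inc, int *new_nice)
{
	int ret;

	errno = 0;		/* -1 is also a legitimate return value */
	ret = nice(inc);
	if (ret == -1 && errno != 0)
		return -1;	/* e.g. EPERM for an unprivileged decrease */

	*new_nice = ret;
	return 0;
}
```

Calling renice_self_by(1, &level) makes the caller one step nicer than
it was, whatever its starting level; the caller never passes an absolute
nice level.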

With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of positive nice levels not being
"punchy" enough), the scheduler was decoupled from 'time slice' and HZ
concepts (and granularity was made a separate concept from nice levels)
and thus it was possible to implement better and more consistent nice
+19 support: with the new scheduler nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes a call to nice(1) have the same CPU utilization
effect on tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (One will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.
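
The "multiplicative" rule can be sketched as follows. The per-level
factor of 1.25 and the base weight of 1024 at nice 0 are assumptions
made for this sketch: the text only gives the resulting 55%/45% split,
which a factor of about 1.25 reproduces. nice_weight() and cpu_split()
are hypothetical names, not scheduler functions.

```c
#define NICE0_WEIGHT	1024.0	/* ASSUMPTION: arbitrary weight at nice 0 */
#define NICE_FACTOR	1.25	/* ASSUMPTION: weight ratio per nice level */

/* Weight shrinks by NICE_FACTOR for every step towards nice +19. */
double nice_weight(int nice)
{
	double w = NICE0_WEIGHT;
	int i;

	for (i = 0; i < nice; i++)	/* positive nice: lighter task */
		w /= NICE_FACTOR;
	for (i = 0; i > nice; i--)	/* negative nice: heavier task */
		w *= NICE_FACTOR;

	return w;
}

/* Fraction of the CPU task a gets when it competes only with task b. */
double cpu_split(int nice_a, int nice_b)
{
	double wa = nice_weight(nice_a);
	double wb = nice_weight(nice_b);

	return wa / (wa + wb);
}
```

Because only the ratio of the two weights matters, cpu_split(10, 11)
and cpu_split(-5, -4) both come out at about 0.556 under these
assumptions, close to the 55%/45% split quoted above, and
cpu_split(19, 0) is about 0.014, in line with the 1.5% figure for
nice +19.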

The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.

This page is part of the FreeBSD/Linux Kernel Cross-Reference, and was automatically generated using a modified version of the LXR engine.