                   Goals, Design and Implementation of the
                      new ultra-scalable O(1) scheduler


  This is an edited version of an email Ingo Molnar sent to
  lkml on 4 Jan 2002.  It describes the goals, design, and
  implementation of Ingo's new ultra-scalable O(1) scheduler.
  Last Updated: 18 April 2002.


Goal
====

The main goal of the new scheduler is to keep all the good things we know
and love about the current Linux scheduler:

 - good interactive performance even during high load: if the user
   types or clicks then the system must react instantly and must execute
   the user tasks smoothly, even during considerable background load.

 - good scheduling/wakeup performance with 1-2 runnable processes.

 - fairness: no process should go without a timeslice for an
   unreasonable amount of time, and no process should get an unjustly
   high amount of CPU time.

 - priorities: less important tasks can be started with lower priority,
   more important tasks with higher priority.

 - SMP efficiency: no CPU should stay idle if there is work to do.

 - SMP affinity: processes which run on one CPU should stay affine to
   that CPU. Processes should not bounce between CPUs too frequently.

 - plus additional scheduler features: RT scheduling, CPU binding.

and the goal is also to add a few new things:

 - fully O(1) scheduling. Are you tired of the recalculation loop
   blowing the L1 cache away every now and then? Do you think the goodness
   loop is taking a bit too long to finish if there are lots of runnable
   processes? This new scheduler takes no prisoners: wakeup(), schedule()
   and the timer interrupt are all O(1) algorithms. There is no
   recalculation loop. There is no goodness loop either.

 - 'perfect' SMP scalability. With the new scheduler there is no 'big'
   runqueue_lock anymore - it's all per-CPU runqueues and locks - two
   tasks on two separate CPUs can wake up, schedule and context-switch
   completely in parallel, without any interlocking. All
   scheduling-relevant data is structured for maximum scalability.

 - better SMP affinity. The old scheduler has a particular weakness that
   causes random bouncing of tasks between CPUs when higher-priority or
   interactive tasks are present; this was observed and reported by many
   people. The reason is that the timeslice recalculation loop first needs
   every currently running task to consume its timeslice. But when this
   happens on e.g. an 8-way system, this property starves an increasing
   number of CPUs from executing any process. Once the last task that
   still has a timeslice left has finished using it up, the recalculation
   loop is triggered and the other CPUs can start executing tasks again -
   after having idled around for a number of timer ticks. The more CPUs,
   the worse this effect.

   Furthermore, this same effect is what causes the bouncing as well:
   whenever there is such a 'timeslice squeeze' of the global runqueue,
   idle processors start executing tasks which are not affine to that
   CPU (because the affine tasks have already used up their timeslices).

   The new scheduler solves this problem by distributing timeslices on a
   per-CPU basis, without having any global synchronization or
   recalculation.

 - batch scheduling. A significant proportion of compute-intensive tasks
   benefit from batch scheduling, where timeslices are long and processes
   are round-robin scheduled. The new scheduler does such batch scheduling
   of the lowest-priority tasks - so nice +19 jobs will get
   'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
   in essence SCHED_IDLE, from an interactiveness point of view.

 - handle extreme loads more smoothly, without breakdown and scheduling
   storms.

 - O(1) RT scheduling. For those RT folks who are paranoid about the
   O(nr_running) property of the goodness loop and the recalculation loop.

 - run fork()ed children before the parent. Andrea pointed out the
   advantages of this a few months ago, but patches for this feature
   do not work with the old scheduler as well as they should, because
   idle processes often steal the new child before the fork()ing CPU
   gets to execute it.


Design
======

The core of the new scheduler contains the following mechanisms:

 - *two* priority-ordered 'priority arrays' per CPU. There is an 'active'
   array and an 'expired' array. The active array contains all tasks that
   are affine to this CPU and have timeslices left. The expired array
   contains all tasks which have used up their timeslices - but this array
   is kept sorted as well. The active and expired arrays are not accessed
   directly; they are accessed through two pointers in the per-CPU runqueue
   structure. Once all active tasks have used up their timeslices, we
   'switch' the two pointers, and from then on the ready-to-go (formerly
   expired) array is the active array - and the now-empty old active array
   serves as the new collector for expired tasks.

 - there is a 64-bit bitmap cache for array indices. Finding the highest
   priority task is thus a matter of two x86 BSFL bit-search instructions.

The split-array solution enables us to have an arbitrary number of active
and expired tasks, and the recalculation of timeslices can be done
immediately when the timeslice expires. Because the arrays are always
accessed through the pointers in the runqueue, switching the two arrays
can be done very quickly.

This is a hybrid priority-list approach coupled with round-robin
scheduling and the array-switch method of distributing timeslices.
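
As a rough illustration of the mechanism described above, here is a small
user-level C sketch of the per-CPU runqueue, the two priority arrays, the
priority bitmap and the O(1) array switch. It is only a sketch under
assumed names and types (prio_array, runqueue, MAX_PRIO and so on are made
up for the example), not the actual kernel code:

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_PRIO 64                     /* one bit per priority level */

    struct task {
            int prio;                       /* 0 is the highest priority  */
            struct task *next;              /* next task at this priority */
    };

    struct prio_array {
            uint64_t bitmap;                /* bit p set: queue[p] non-empty */
            struct task *queue[MAX_PRIO];   /* tasks at each priority level  */
            unsigned int nr_active;         /* total tasks in this array     */
    };

    struct runqueue {
            /* a per-CPU lock lives here - there is no global runqueue_lock */
            struct prio_array *active;      /* tasks with timeslice left     */
            struct prio_array *expired;     /* tasks out of timeslice        */
            struct prio_array arrays[2];
    };

    /* assumes the runqueue structure itself is zero-initialized */
    static void runqueue_init(struct runqueue *rq)
    {
            rq->active  = &rq->arrays[0];
            rq->expired = &rq->arrays[1];
    }

    static void enqueue_task(struct prio_array *array, struct task *t)
    {
            t->next = array->queue[t->prio];
            array->queue[t->prio] = t;
            array->bitmap |= 1ULL << t->prio;
            array->nr_active++;
    }

    static struct task *dequeue_highest(struct prio_array *array)
    {
            /* the bit search replaces any 'goodness loop' over all tasks */
            int idx = __builtin_ctzll(array->bitmap);
            struct task *t = array->queue[idx];

            array->queue[idx] = t->next;
            if (array->queue[idx] == NULL)
                    array->bitmap &= ~(1ULL << idx);
            array->nr_active--;
            return t;
    }

    static struct task *pick_next_task(struct runqueue *rq)
    {
            struct prio_array *array = rq->active;

            if (array->nr_active == 0) {
                    /* every active task used its slice: O(1) array switch */
                    rq->active = rq->expired;
                    rq->expired = array;
                    array = rq->active;
            }
            if (array->nr_active == 0)
                    return NULL;            /* nothing runnable on this CPU */
            return dequeue_highest(array);
    }

Every operation here is constant-time: enqueueing, dequeueing and the
array switch all touch a fixed number of fields, which is the point of
the design.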

 - there is a per-task 'load estimator'.

One of the toughest things to get right is good interactive feel during
heavy system load. While playing with various scheduler variants I found
that the best interactive feel is achieved not by 'boosting' interactive
tasks, but by 'punishing' tasks that want to use more CPU time than there
is available. This method is also much easier to do in an O(1) fashion.

To establish the actual 'load' a task contributes to the system, a
complex-looking but pretty accurate method is used: there is a 4-entry
'history' ringbuffer of the task's activities during the last 4 seconds.
This ringbuffer is operated without much overhead. The entries give the
scheduler a pretty accurate load-history of the task: whether it has used
up more or less CPU time during the past N seconds. [The size '4' and the
interval of 4x 1 second were found through lots of experimentation - this
part is flexible and can be changed in both directions.]

The penalty a task gets for generating more load than the CPU can handle
is a priority decrease - there is a maximum amount to this penalty
relative to the task's static priority, so even fully CPU-bound tasks will
observe each other's priorities, and will share the CPU accordingly.
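
As an illustration only, the load estimator and the bounded penalty can be
sketched roughly like this; the field names, the penalty formula and the
MAX_PENALTY bound are assumptions made up for the example, not the values
the scheduler actually uses:

    #define HISTORY_SLOTS   4               /* 4 x 1 second of history    */
    #define MAX_PENALTY     5               /* cap on the dynamic penalty */

    struct load_history {
            unsigned int used_ms[HISTORY_SLOTS];  /* CPU time used per slot */
            unsigned int slot;                    /* current ring position  */
    };

    /* called once per second: start accounting into the next slot */
    static void history_advance(struct load_history *h)
    {
            h->slot = (h->slot + 1) % HISTORY_SLOTS;
            h->used_ms[h->slot] = 0;
    }

    /* called from the timer tick while the task is running */
    static void history_charge(struct load_history *h, unsigned int ms)
    {
            h->used_ms[h->slot] += ms;
    }

    /*
     * Tasks that used more CPU over the sampled window than their fair
     * share get pushed down in priority, but never by more than
     * MAX_PENALTY, so CPU-bound tasks still respect each other's static
     * (nice-based) priorities.  A larger value means a lower priority.
     */
    static int effective_prio(int static_prio, const struct load_history *h,
                              unsigned int fair_share_ms)
    {
            unsigned int total = 0;
            int penalty, i;

            for (i = 0; i < HISTORY_SLOTS; i++)
                    total += h->used_ms[i];

            penalty = (int)(total / (fair_share_ms ? fair_share_ms : 1)) - 1;
            if (penalty < 0)
                    penalty = 0;
            if (penalty > MAX_PENALTY)
                    penalty = MAX_PENALTY;

            return static_prio + penalty;
    }

Note that there is only a penalty and never a boost here, matching the
'punish the CPU hogs' approach described above; an interactive task that
sleeps most of the time accumulates little history and keeps its static
priority.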

The SMP load-balancer can be extended/switched with additional parallel
computing and cache hierarchy concepts: NUMA scheduling and multi-core
CPUs can be supported easily by changing the load-balancer. Right now
it's tuned for my SMP systems.

I skipped the prev->mm == next->mm advantage - no workload I know of shows
any sensitivity to this. It can be added back by sacrificing O(1)
schedule() [the current and the one-lower priority list can be searched
for a that->mm == current->mm condition], but this costs a fair number of
cycles during a number of important workloads, so I wanted to avoid it as
much as possible.

- the SMP idle-task startup code was still racy and the new scheduler
triggered this. So I streamlined the idle-setup code a bit. We do not call
into schedule() before all processors have started up fully and all idle
threads are in place.

- the patch also cleans up a number of aspects of sched.c - it moves code
into other areas of the kernel where it's appropriate, and simplifies
certain code paths and data constructs. As a result, the new scheduler's
code is smaller than the old one's.

        Ingo
