sys/common/fs/zfs/arc.c

    1 /*
    2  * CDDL HEADER START
    3  *
    4  * The contents of this file are subject to the terms of the
    5  * Common Development and Distribution License (the "License").
    6  * You may not use this file except in compliance with the License.
    7  *
    8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
    9  * or http://www.opensolaris.org/os/licensing.
   10  * See the License for the specific language governing permissions
   11  * and limitations under the License.
   12  *
   13  * When distributing Covered Code, include this CDDL HEADER in each
   14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
   15  * If applicable, add the following below this CDDL HEADER, with the
   16  * fields enclosed by brackets "[]" replaced with your own identifying
   17  * information: Portions Copyright [yyyy] [name of copyright owner]
   18  *
   19  * CDDL HEADER END
   20  */
   21 /*
   22  * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
   23  * Use is subject to license terms.
   24  */
   25 
   26 /*
   27  * DVA-based Adjustable Replacement Cache
   28  *
   29  * While much of the theory of operation used here is
   30  * based on the self-tuning, low overhead replacement cache
   31  * presented by Megiddo and Modha at FAST 2003, there are some
   32  * significant differences:
   33  *
   34  * 1. The Megiddo and Modha model assumes any page is evictable.
   35  * Pages in its cache cannot be "locked" into memory.  This makes
   36  * the eviction algorithm simple: evict the last page in the list.
   37  * This also makes the performance characteristics easy to reason
   38  * about.  Our cache is not so simple.  At any given moment, some
   39  * subset of the blocks in the cache are un-evictable because we
   40  * have handed out a reference to them.  Blocks are only evictable
   41  * when there are no external references active.  This makes
   42  * eviction far more problematic:  we choose to evict the evictable
   43  * blocks that are the "lowest" in the list.
   44  *
   45  * There are times when it is not possible to evict the requested
   46  * space.  In these circumstances we are unable to adjust the cache
   47  * size.  To prevent the cache growing unbounded at these times we
   48  * implement a "cache throttle" that slows the flow of new data
   49  * into the cache until we can make space available.
   50  *
   51  * 2. The Megiddo and Modha model assumes a fixed cache size.
   52  * Pages are evicted when the cache is full and there is a cache
   53  * miss.  Our model has a variable sized cache.  It grows with
   54  * high use, but also tries to react to memory pressure from the
   55  * operating system: decreasing its size when system memory is
   56  * tight.
   57  *
   58  * 3. The Megiddo and Modha model assumes a fixed page size. All
   59  * elements of the cache are therefore exactly the same size.  So
   60  * when adjusting the cache size following a cache miss, it is simply
   61  * a matter of choosing a single page to evict.  In our model, we
   62  * have variable sized cache blocks (ranging from 512 bytes to
   63  * 128K bytes).  We therefore choose a set of blocks to evict to make
   64  * space for a cache miss that approximates as closely as possible
   65  * the space used by the new block.
   66  *
   67  * See also:  "ARC: A Self-Tuning, Low Overhead Replacement Cache"
   68  * by N. Megiddo & D. Modha, FAST 2003
   69  */
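/*
 * Illustrative sketch, not part of arc.c: one way to picture the eviction
 * rule from point 1 and point 3 above.  Blocks with active references are
 * skipped, and unreferenced blocks are evicted from the tail of the list
 * until the requested number of bytes has been freed.  The type
 * sketch_block_t and the function sketch_evict() are hypothetical names
 * introduced here; the real code walks list_t lists under the state mutex.
 */
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached block; not the arc_buf_hdr_t layout. */
typedef struct sketch_block {
        uint64_t size;          /* block size in bytes */
        int      refs;          /* external references; > 0 is un-evictable */
        int      evicted;       /* set to 1 once the sketch evicts the block */
} sketch_block_t;

/*
 * Walk the array from its tail (index n - 1, the "lowest" entry) toward
 * the head, evicting unreferenced blocks until at least `needed` bytes
 * have been freed.  Returns the number of bytes actually freed, which can
 * be less than `needed` when too many blocks are referenced.
 */
static uint64_t
sketch_evict(sketch_block_t *list, size_t n, uint64_t needed)
{
        uint64_t freed = 0;
        size_t i;

        assert(list != NULL || n == 0);

        for (i = n; i > 0 && freed < needed; i--) {
                sketch_block_t *b = &list[i - 1];

                if (b->refs > 0 || b->evicted)
                        continue;       /* skip blocks we cannot evict */
                b->evicted = 1;
                freed += b->size;
        }
        return (freed);
}
```
/*
 * When sketch_evict() returns less than `needed`, the caller is in the
 * situation the comment above describes: space could not be made, so new
 * data has to be throttled instead.
 */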
   70 
   71 /*
   72  * The locking model:
   73  *
   74  * A new reference to a cache buffer can be obtained in two
   75  * ways: 1) via a hash table lookup using the DVA as a key,
   76  * or 2) via one of the ARC lists.  The arc_read() interface
   77  * uses method 1, while the internal arc algorithms for
   78  * adjusting the cache use method 2.  We therefore provide two
   79  * types of locks: 1) the hash table lock array, and 2) the
   80  * arc list locks.
   81  *
   82  * Buffers do not have their own mutexes, rather they rely on the
   83  * hash table mutexes for the bulk of their protection (i.e. most
   84  * fields in the arc_buf_hdr_t are protected by these mutexes).
   85  *
   86  * buf_hash_find() returns the appropriate mutex (held) when it
   87  * locates the requested buffer in the hash table.  It returns
   88  * NULL for the mutex if the buffer was not in the table.
   89  *
   90  * buf_hash_remove() expects the appropriate hash mutex to be
   91  * already held before it is invoked.
   92  *
   93  * Each arc state also has a mutex which is used to protect the
   94  * buffer list associated with the state.  When attempting to
   95  * obtain a hash table lock while holding an arc list lock you
   96  * must use mutex_tryenter() to avoid deadlock.  Also note that
   97  * the active state mutex must be held before the ghost state mutex.
   98  *
   99  * Arc buffers may have an associated eviction callback function.
  100  * This function will be invoked prior to removing the buffer (e.g.
  101  * in arc_do_user_evicts()).  Note however that the data associated
  102  * with the buffer may be evicted prior to the callback.  The callback
  103  * must be made with *no locks held* (to prevent deadlock).  Additionally,
  104  * the users of callbacks must ensure that their private data is
  105  * protected from simultaneous callbacks from arc_buf_evict()
  106  * and arc_do_user_evicts().
  107  *
  108  * Note that the majority of the performance stats are manipulated
  109  * with atomic operations.
  110  *
  111  * The L2ARC uses the l2arc_buflist_mtx global mutex for the following:
  112  *
  113  *      - L2ARC buflist creation
  114  *      - L2ARC buflist eviction
  115  *      - L2ARC write completion, which walks L2ARC buflists
  116  *      - ARC header destruction, as it removes from L2ARC buflists
  117  *      - ARC header release, as it removes from L2ARC buflists
  118  */
  119 
  120 #include <sys/spa.h>
  121 #include <sys/zio.h>
  122 #include <sys/zio_checksum.h>
  123 #include <sys/zfs_context.h>
  124 #include <sys/arc.h>
  125 #include <sys/refcount.h>
  126 #include <sys/vdev.h>
  127 #include <sys/vdev_impl.h>
  128 #ifdef _KERNEL
  129 #include <sys/vmsystm.h>
  130 #include <vm/anon.h>
  131 #include <sys/fs/swapnode.h>
  132 #include <sys/dnlc.h>
  133 #endif
  134 #include <sys/callb.h>
  135 #include <sys/kstat.h>
  136 
  137 static kmutex_t         arc_reclaim_thr_lock;
  138 static kcondvar_t       arc_reclaim_thr_cv;     /* used to signal reclaim thr */
  139 static uint8_t          arc_thread_exit;
  140 
  141 extern int zfs_write_limit_shift;
  142 extern uint64_t zfs_write_limit_max;
  143 extern kmutex_t zfs_write_limit_lock;
  144 
  145 #define ARC_REDUCE_DNLC_PERCENT 3
  146 uint_t arc_reduce_dnlc_percent = ARC_REDUCE_DNLC_PERCENT;
  147 
  148 typedef enum arc_reclaim_strategy {
  149         ARC_RECLAIM_AGGR,               /* Aggressive reclaim strategy */
  150         ARC_RECLAIM_CONS                /* Conservative reclaim strategy */
  151 } arc_reclaim_strategy_t;
  152 
  153 /* number of seconds before growing cache again */
  154 static int              arc_grow_retry = 60;
  155 
  156 /* shift of arc_c for calculating both min and max arc_p */
  157 static int              arc_p_min_shift = 4;
  158 
  159 /* log2(fraction of arc to reclaim) */
  160 static int              arc_shrink_shift = 5;
  161 
  162 /*
  163  * minimum lifespan of a prefetch block in clock ticks
  164  * (initialized in arc_init())
  165  */
  166 static int              arc_min_prefetch_lifespan;
  167 
  168 static int arc_dead;
  169 
  170 /*
  171  * The arc has filled available memory and has now warmed up.
  172  */
  173 static boolean_t arc_warm;
  174 
  175 /*
  176  * These tunables are for performance analysis.
  177  */
  178 uint64_t zfs_arc_max;
  179 uint64_t zfs_arc_min;
  180 uint64_t zfs_arc_meta_limit = 0;
  181 int zfs_mdcomp_disable = 0;
  182 int zfs_arc_grow_retry = 0;
  183 int zfs_arc_shrink_shift = 0;
  184 int zfs_arc_p_min_shift = 0;
  185 
  186 /*
  187  * Note that buffers can be in one of 6 states:
  188  *      ARC_anon        - anonymous (discussed below)
  189  *      ARC_mru         - recently used, currently cached
  190  *      ARC_mru_ghost   - recently used, no longer in cache
  191  *      ARC_mfu         - frequently used, currently cached
  192  *      ARC_mfu_ghost   - frequently used, no longer in cache
  193  *      ARC_l2c_only    - exists in L2ARC but not other states
  194  * When there are no active references to a buffer, it is
  195  * linked onto a list in one of these arc states.  These are
  196  * the only buffers that can be evicted or deleted.  Within each
  197  * state there are multiple lists, one for meta-data and one for
  198  * non-meta-data.  Meta-data (indirect blocks, blocks of dnodes,
  199  * etc.) is tracked separately so that it can be managed more
  200  * explicitly: favored over data, limited explicitly.
  201  *
  202  * Anonymous buffers are buffers that are not associated with
  203  * a DVA.  These are buffers that hold dirty block copies
  204  * before they are written to stable storage.  By definition,
  205  * they are "ref'd" and are considered part of arc_mru
  206  * that cannot be freed.  Generally, they will acquire a DVA
  207  * as they are written and migrate onto the arc_mru list.
  208  *
  209  * The ARC_l2c_only state is for buffers that are in the second
  210  * level ARC but no longer in any of the ARC_m* lists.  The second
  211  * level ARC itself may also contain buffers that are in any of
  212  * the ARC_m* states - meaning that a buffer can exist in two
  213  * places.  The reason for the ARC_l2c_only state is to keep the
  214  * buffer header in the hash table, so that reads that hit the
  215  * second level ARC benefit from these fast lookups.
  216  */
  217 
  218 typedef struct arc_state {
  219         list_t  arcs_list[ARC_BUFC_NUMTYPES];   /* list of evictable buffers */
  220         uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
  221         uint64_t arcs_size;     /* total amount of data in this state */
  222         kmutex_t arcs_mtx;
  223 } arc_state_t;
  224 
  225 /* The 6 states: */
  226 static arc_state_t ARC_anon;
  227 static arc_state_t ARC_mru;
  228 static arc_state_t ARC_mru_ghost;
  229 static arc_state_t ARC_mfu;
  230 static arc_state_t ARC_mfu_ghost;
  231 static arc_state_t ARC_l2c_only;
  232 
  233 typedef struct arc_stats {
  234         kstat_named_t arcstat_hits;
  235         kstat_named_t arcstat_misses;
  236         kstat_named_t arcstat_demand_data_hits;
  237         kstat_named_t arcstat_demand_data_misses;
  238         kstat_named_t arcstat_demand_metadata_hits;
  239         kstat_named_t arcstat_demand_metadata_misses;
  240         kstat_named_t arcstat_prefetch_data_hits;
  241         kstat_named_t arcstat_prefetch_data_misses;
  242         kstat_named_t arcstat_prefetch_metadata_hits;
  243         kstat_named_t arcstat_prefetch_metadata_misses;
  244         kstat_named_t arcstat_mru_hits;
  245         kstat_named_t arcstat_mru_ghost_hits;
  246         kstat_named_t arcstat_mfu_hits;
  247         kstat_named_t arcstat_mfu_ghost_hits;
  248         kstat_named_t arcstat_deleted;
  249         kstat_named_t arcstat_recycle_miss;
  250         kstat_named_t arcstat_mutex_miss;
  251         kstat_named_t arcstat_evict_skip;
  252         kstat_named_t arcstat_evict_l2_cached;
  253         kstat_named_t arcstat_evict_l2_eligible;
  254         kstat_named_t arcstat_evict_l2_ineligible;
  255         kstat_named_t arcstat_hash_elements;
  256         kstat_named_t arcstat_hash_elements_max;
  257         kstat_named_t arcstat_hash_collisions;
  258         kstat_named_t arcstat_hash_chains;
  259         kstat_named_t arcstat_hash_chain_max;
  260         kstat_named_t arcstat_p;
  261         kstat_named_t arcstat_c;
  262         kstat_named_t arcstat_c_min;
  263         kstat_named_t arcstat_c_max;
  264         kstat_named_t arcstat_size;
  265         kstat_named_t arcstat_hdr_size;
  266         kstat_named_t arcstat_data_size;
  267         kstat_named_t arcstat_other_size;
  268         kstat_named_t arcstat_l2_hits;
  269         kstat_named_t arcstat_l2_misses;
  270         kstat_named_t arcstat_l2_feeds;
  271         kstat_named_t arcstat_l2_rw_clash;
  272         kstat_named_t arcstat_l2_read_bytes;
  273         kstat_named_t arcstat_l2_write_bytes;
  274         kstat_named_t arcstat_l2_writes_sent;
  275         kstat_named_t arcstat_l2_writes_done;
  276         kstat_named_t arcstat_l2_writes_error;
  277         kstat_named_t arcstat_l2_writes_hdr_miss;
  278         kstat_named_t arcstat_l2_evict_lock_retry;
  279         kstat_named_t arcstat_l2_evict_reading;
  280         kstat_named_t arcstat_l2_free_on_write;
  281         kstat_named_t arcstat_l2_abort_lowmem;
  282         kstat_named_t arcstat_l2_cksum_bad;
  283         kstat_named_t arcstat_l2_io_error;
  284         kstat_named_t arcstat_l2_size;
  285         kstat_named_t arcstat_l2_hdr_size;
  286         kstat_named_t arcstat_memory_throttle_count;
  287 } arc_stats_t;
  288 
  289 static arc_stats_t arc_stats = {
  290         { "hits",                       KSTAT_DATA_UINT64 },
  291         { "misses",                     KSTAT_DATA_UINT64 },
  292         { "demand_data_hits",           KSTAT_DATA_UINT64 },
  293         { "demand_data_misses",         KSTAT_DATA_UINT64 },
  294         { "demand_metadata_hits",       KSTAT_DATA_UINT64 },
  295         { "demand_metadata_misses",     KSTAT_DATA_UINT64 },
  296         { "prefetch_data_hits",         KSTAT_DATA_UINT64 },
  297         { "prefetch_data_misses",       KSTAT_DATA_UINT64 },
  298         { "prefetch_metadata_hits",     KSTAT_DATA_UINT64 },
  299         { "prefetch_metadata_misses",   KSTAT_DATA_UINT64 },
  300         { "mru_hits",                   KSTAT_DATA_UINT64 },
  301         { "mru_ghost_hits",             KSTAT_DATA_UINT64 },
  302         { "mfu_hits",                   KSTAT_DATA_UINT64 },
  303         { "mfu_ghost_hits",             KSTAT_DATA_UINT64 },
  304         { "deleted",                    KSTAT_DATA_UINT64 },
  305         { "recycle_miss",               KSTAT_DATA_UINT64 },
  306         { "mutex_miss",                 KSTAT_DATA_UINT64 },
  307         { "evict_skip",                 KSTAT_DATA_UINT64 },
  308         { "evict_l2_cached",            KSTAT_DATA_UINT64 },
  309         { "evict_l2_eligible",          KSTAT_DATA_UINT64 },
  310         { "evict_l2_ineligible",        KSTAT_DATA_UINT64 },
  311         { "hash_elements",              KSTAT_DATA_UINT64 },
  312         { "hash_elements_max",          KSTAT_DATA_UINT64 },
  313         { "hash_collisions",            KSTAT_DATA_UINT64 },
  314         { "hash_chains",                KSTAT_DATA_UINT64 },
  315         { "hash_chain_max",             KSTAT_DATA_UINT64 },
  316         { "p",                          KSTAT_DATA_UINT64 },
  317         { "c",                          KSTAT_DATA_UINT64 },
  318         { "c_min",                      KSTAT_DATA_UINT64 },
  319         { "c_max",                      KSTAT_DATA_UINT64 },
  320         { "size",                       KSTAT_DATA_UINT64 },
  321         { "hdr_size",                   KSTAT_DATA_UINT64 },
  322         { "data_size",                  KSTAT_DATA_UINT64 },
  323         { "other_size",                 KSTAT_DATA_UINT64 },
  324         { "l2_hits",                    KSTAT_DATA_UINT64 },
  325         { "l2_misses",                  KSTAT_DATA_UINT64 },
  326         { "l2_feeds",                   KSTAT_DATA_UINT64 },
  327         { "l2_rw_clash",                KSTAT_DATA_UINT64 },
  328         { "l2_read_bytes",              KSTAT_DATA_UINT64 },
  329         { "l2_write_bytes",             KSTAT_DATA_UINT64 },
  330         { "l2_writes_sent",             KSTAT_DATA_UINT64 },
  331         { "l2_writes_done",             KSTAT_DATA_UINT64 },
  332         { "l2_writes_error",            KSTAT_DATA_UINT64 },
  333         { "l2_writes_hdr_miss",         KSTAT_DATA_UINT64 },
  334         { "l2_evict_lock_retry",        KSTAT_DATA_UINT64 },
  335         { "l2_evict_reading",           KSTAT_DATA_UINT64 },
  336         { "l2_free_on_write",           KSTAT_DATA_UINT64 },
  337         { "l2_abort_lowmem",            KSTAT_DATA_UINT64 },
  338         { "l2_cksum_bad",               KSTAT_DATA_UINT64 },
  339         { "l2_io_error",                KSTAT_DATA_UINT64 },
  340         { "l2_size",                    KSTAT_DATA_UINT64 },
  341         { "l2_hdr_size",                KSTAT_DATA_UINT64 },
  342         { "memory_throttle_count",      KSTAT_DATA_UINT64 }
  343 };
  344 
  345 #define ARCSTAT(stat)   (arc_stats.stat.value.ui64)
  346 
  347 #define ARCSTAT_INCR(stat, val) \
  348         atomic_add_64(&arc_stats.stat.value.ui64, (val));
  349 
  350 #define ARCSTAT_BUMP(stat)      ARCSTAT_INCR(stat, 1)
  351 #define ARCSTAT_BUMPDOWN(stat)  ARCSTAT_INCR(stat, -1)
  352 
  353 #define ARCSTAT_MAX(stat, val) {                                        \
  354         uint64_t m;                                                     \
  355         while ((val) > (m = arc_stats.stat.value.ui64) &&               \
  356             (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
  357                 continue;                                               \
  358 }
  359 
  360 #define ARCSTAT_MAXSTAT(stat) \
  361         ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
  362 
  363 /*
  364  * We define a macro to allow ARC hits/misses to be easily broken down by
  365  * two separate conditions, giving a total of four different subtypes for
  366  * each of hits and misses (so eight statistics total).
  367  */
  368 #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
  369         if (cond1) {                                                    \
  370                 if (cond2) {                                            \
  371                         ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
  372                 } else {                                                \
  373                         ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
  374                 }                                                       \
  375         } else {                                                        \
  376                 if (cond2) {                                            \
  377                         ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
  378                 } else {                                                \
  379                         ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
  380                 }                                                       \
  381         }
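/*
 * Illustrative sketch, not part of arc.c: how the token pasting in
 * ARCSTAT_CONDSTAT selects one of four counters.  The counter struct,
 * SKETCH_BUMP, SKETCH_CONDSTAT and sketch_record_hit() are hypothetical
 * names, and the counters are plain (non-atomic) integers to keep the
 * example small.  The do/while (0) wrapper is a choice made for this
 * sketch so the macro behaves as a single statement.
 */
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters named the way the pasted tokens expect. */
static struct {
        uint64_t demand_data_hits;
        uint64_t demand_metadata_hits;
        uint64_t prefetch_data_hits;
        uint64_t prefetch_metadata_hits;
} sketch_stats;

#define SKETCH_BUMP(name)       (sketch_stats.name++)

#define SKETCH_CONDSTAT(c1, s1, ns1, c2, s2, ns2, stat)                 \
        do {                                                            \
                if (c1) {                                               \
                        if (c2)                                         \
                                SKETCH_BUMP(s1##_##s2##_##stat);        \
                        else                                            \
                                SKETCH_BUMP(s1##_##ns2##_##stat);       \
                } else {                                                \
                        if (c2)                                         \
                                SKETCH_BUMP(ns1##_##s2##_##stat);       \
                        else                                            \
                                SKETCH_BUMP(ns1##_##ns2##_##stat);      \
                }                                                       \
        } while (0)

/* Count one hit, classified as demand/prefetch and data/metadata. */
static void
sketch_record_hit(int is_prefetch, int is_metadata)
{
        assert(is_prefetch == 0 || is_prefetch == 1);
        assert(is_metadata == 0 || is_metadata == 1);

        SKETCH_CONDSTAT(!is_prefetch, demand, prefetch,
            !is_metadata, data, metadata, hits);
}
```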
  382 
  383 kstat_t                 *arc_ksp;
  384 static arc_state_t      *arc_anon;
  385 static arc_state_t      *arc_mru;
  386 static arc_state_t      *arc_mru_ghost;
  387 static arc_state_t      *arc_mfu;
  388 static arc_state_t      *arc_mfu_ghost;
  389 static arc_state_t      *arc_l2c_only;
  390 
  391 /*
  392  * There are several ARC variables that are critical to export as kstats --
  393  * but we don't want to have to grovel around in the kstat whenever we wish to
  394  * manipulate them.  For these variables, we therefore define them to be in
  395  * terms of the statistic variable.  This assures that we are not introducing
  396  * the possibility of inconsistency by having shadow copies of the variables,
  397  * while still allowing the code to be readable.
  398  */
  399 #define arc_size        ARCSTAT(arcstat_size)   /* actual total arc size */
  400 #define arc_p           ARCSTAT(arcstat_p)      /* target size of MRU */
  401 #define arc_c           ARCSTAT(arcstat_c)      /* target size of cache */
  402 #define arc_c_min       ARCSTAT(arcstat_c_min)  /* min target cache size */
  403 #define arc_c_max       ARCSTAT(arcstat_c_max)  /* max target cache size */
  404 
  405 static int              arc_no_grow;    /* Don't try to grow cache size */
  406 static uint64_t         arc_tempreserve;
  407 static uint64_t         arc_loaned_bytes;
  408 static uint64_t         arc_meta_used;
  409 static uint64_t         arc_meta_limit;
  410 static uint64_t         arc_meta_max = 0;
  411 
  412 typedef struct l2arc_buf_hdr l2arc_buf_hdr_t;
  413 
  414 typedef struct arc_callback arc_callback_t;
  415 
  416 struct arc_callback {
  417         void                    *acb_private;
  418         arc_done_func_t         *acb_done;
  419         arc_buf_t               *acb_buf;
  420         zio_t                   *acb_zio_dummy;
  421         arc_callback_t          *acb_next;
  422 };
  423 
  424 typedef struct arc_write_callback arc_write_callback_t;
  425 
  426 struct arc_write_callback {
  427         void            *awcb_private;
  428         arc_done_func_t *awcb_ready;
  429         arc_done_func_t *awcb_done;
  430         arc_buf_t       *awcb_buf;
  431 };
  432 
  433 struct arc_buf_hdr {
  434         /* protected by hash lock */
  435         dva_t                   b_dva;
  436         uint64_t                b_birth;
  437         uint64_t                b_cksum0;
  438 
  439         kmutex_t                b_freeze_lock;
  440         zio_cksum_t             *b_freeze_cksum;
  441 
  442         arc_buf_hdr_t           *b_hash_next;
  443         arc_buf_t               *b_buf;
  444         uint32_t                b_flags;
  445         uint32_t                b_datacnt;
  446 
  447         arc_callback_t          *b_acb;
  448         kcondvar_t              b_cv;
  449 
  450         /* immutable */
  451         arc_buf_contents_t      b_type;
  452         uint64_t                b_size;
  453         uint64_t                b_spa;
  454 
  455         /* protected by arc state mutex */
  456         arc_state_t             *b_state;
  457         list_node_t             b_arc_node;
  458 
  459         /* updated atomically */
  460         clock_t                 b_arc_access;
  461 
  462         /* self protecting */
  463         refcount_t              b_refcnt;
  464 
  465         l2arc_buf_hdr_t         *b_l2hdr;
  466         list_node_t             b_l2node;
  467 };
  468 
  469 static arc_buf_t *arc_eviction_list;
  470 static kmutex_t arc_eviction_mtx;
  471 static arc_buf_hdr_t arc_eviction_hdr;
  472 static void arc_get_data_buf(arc_buf_t *buf);
  473 static void arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock);
  474 static int arc_evict_needed(arc_buf_contents_t type);
  475 static void arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes);
  476 
  477 static boolean_t l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *ab);
  478 
  479 #define GHOST_STATE(state)      \
  480         ((state) == arc_mru_ghost || (state) == arc_mfu_ghost ||        \
  481         (state) == arc_l2c_only)
  482 
  483 /*
  484  * Private ARC flags.  These flags are private ARC only flags that will show up
  485  * in b_flags in the arc_buf_hdr_t.  Some flags are publicly declared, and can
  486  * be passed in as arc_flags in things like arc_read.  However, these flags
  487  * should never be passed and should only be set by ARC code.  When adding new
  488  * public flags, make sure not to smash the private ones.
  489  */
  490 
  491 #define ARC_IN_HASH_TABLE       (1 << 9)        /* this buffer is hashed */
  492 #define ARC_IO_IN_PROGRESS      (1 << 10)       /* I/O in progress for buf */
  493 #define ARC_IO_ERROR            (1 << 11)       /* I/O failed for buf */
  494 #define ARC_FREED_IN_READ       (1 << 12)       /* buf freed while in read */
  495 #define ARC_BUF_AVAILABLE       (1 << 13)       /* block not in active use */
  496 #define ARC_INDIRECT            (1 << 14)       /* this is an indirect block */
  497 #define ARC_FREE_IN_PROGRESS    (1 << 15)       /* hdr about to be freed */
  498 #define ARC_L2_WRITING          (1 << 16)       /* L2ARC write in progress */
  499 #define ARC_L2_EVICTED          (1 << 17)       /* evicted during I/O */
  500 #define ARC_L2_WRITE_HEAD       (1 << 18)       /* head of write list */
  501 #define ARC_STORED              (1 << 19)       /* has been store()d to */
  502 
  503 #define HDR_IN_HASH_TABLE(hdr)  ((hdr)->b_flags & ARC_IN_HASH_TABLE)
  504 #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS)
  505 #define HDR_IO_ERROR(hdr)       ((hdr)->b_flags & ARC_IO_ERROR)
  506 #define HDR_PREFETCH(hdr)       ((hdr)->b_flags & ARC_PREFETCH)
  507 #define HDR_FREED_IN_READ(hdr)  ((hdr)->b_flags & ARC_FREED_IN_READ)
  508 #define HDR_BUF_AVAILABLE(hdr)  ((hdr)->b_flags & ARC_BUF_AVAILABLE)
  509 #define HDR_FREE_IN_PROGRESS(hdr)       ((hdr)->b_flags & ARC_FREE_IN_PROGRESS)
  510 #define HDR_L2CACHE(hdr)        ((hdr)->b_flags & ARC_L2CACHE)
  511 #define HDR_L2_READING(hdr)     ((hdr)->b_flags & ARC_IO_IN_PROGRESS && \
  512                                     (hdr)->b_l2hdr != NULL)
  513 #define HDR_L2_WRITING(hdr)     ((hdr)->b_flags & ARC_L2_WRITING)
  514 #define HDR_L2_EVICTED(hdr)     ((hdr)->b_flags & ARC_L2_EVICTED)
  515 #define HDR_L2_WRITE_HEAD(hdr)  ((hdr)->b_flags & ARC_L2_WRITE_HEAD)
  516 
  517 /*
  518  * Other sizes
  519  */
  520 
  521 #define HDR_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
  522 #define L2HDR_SIZE ((int64_t)sizeof (l2arc_buf_hdr_t))
  523 
  524 /*
  525  * Hash table routines
  526  */
  527 
  528 #define HT_LOCK_PAD     64
  529 
  530 struct ht_lock {
  531         kmutex_t        ht_lock;
  532 #ifdef _KERNEL
  533         unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
  534 #endif
  535 };
  536 
  537 #define BUF_LOCKS 256
  538 typedef struct buf_hash_table {
  539         uint64_t ht_mask;
  540         arc_buf_hdr_t **ht_table;
  541         struct ht_lock ht_locks[BUF_LOCKS];
  542 } buf_hash_table_t;
  543 
  544 static buf_hash_table_t buf_hash_table;
  545 
  546 #define BUF_HASH_INDEX(spa, dva, birth) \
  547         (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
  548 #define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
  549 #define BUF_HASH_LOCK(idx)      (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
  550 #define HDR_LOCK(buf) \
  551         (BUF_HASH_LOCK(BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth)))
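/*
 * Illustrative sketch, not part of arc.c: the lock-striping arithmetic
 * behind BUF_HASH_LOCK_NTRY.  Many hash buckets share a fixed array of
 * locks, and a bucket index is mapped to a lock with a mask.  The padded
 * struct mirrors ht_lock; the padding presumably keeps each lock on its
 * own cache line, which is an assumption of this sketch and is not stated
 * in the source.  All sketch_* and SKETCH_* names are hypothetical, and an
 * int stands in for kmutex_t.
 */
```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_LOCKS    256     /* must be a power of two, like BUF_LOCKS */
#define SKETCH_PAD      64      /* same value as HT_LOCK_PAD */

/* A lock padded out to SKETCH_PAD bytes. */
typedef struct sketch_ht_lock {
        int             lock;
        unsigned char   pad[SKETCH_PAD - sizeof (int)];
} sketch_ht_lock_t;

_Static_assert(sizeof (sketch_ht_lock_t) == SKETCH_PAD,
    "padded lock must occupy exactly SKETCH_PAD bytes");

/* Map a hash-table bucket index to the index of the lock guarding it. */
static unsigned
sketch_lock_index(uint64_t bucket)
{
        /* The mask acts as a modulo only when the count is a power of two. */
        assert((SKETCH_LOCKS & (SKETCH_LOCKS - 1)) == 0);
        return ((unsigned)(bucket & (SKETCH_LOCKS - 1)));
}
```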
  552 
  553 uint64_t zfs_crc64_table[256];
  554 
  555 /*
  556  * Level 2 ARC
  557  */
  558 
  559 #define L2ARC_WRITE_SIZE        (8 * 1024 * 1024)       /* initial write max */
  560 #define L2ARC_HEADROOM          2               /* num of writes */
  561 #define L2ARC_FEED_SECS         1               /* caching interval secs */
  562 #define L2ARC_FEED_MIN_MS       200             /* min caching interval ms */
  563 
  564 #define l2arc_writes_sent       ARCSTAT(arcstat_l2_writes_sent)
  565 #define l2arc_writes_done       ARCSTAT(arcstat_l2_writes_done)
  566 
  567 /*
  568  * L2ARC Performance Tunables
  569  */
  570 uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;    /* default max write size */
  571 uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;  /* extra write during warmup */
  572 uint64_t l2arc_headroom = L2ARC_HEADROOM;       /* number of dev writes */
  573 uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;     /* interval seconds */
  574 uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
  575 boolean_t l2arc_noprefetch = B_TRUE;            /* don't cache prefetch bufs */
  576 boolean_t l2arc_feed_again = B_TRUE;            /* turbo warmup */
  577 boolean_t l2arc_norw = B_TRUE;                  /* no reads during writes */
  578 
  579 /*
  580  * L2ARC Internals
  581  */
  582 typedef struct l2arc_dev {
  583         vdev_t                  *l2ad_vdev;     /* vdev */
  584         spa_t                   *l2ad_spa;      /* spa */
  585         uint64_t                l2ad_hand;      /* next write location */
  586         uint64_t                l2ad_write;     /* desired write size, bytes */
  587         uint64_t                l2ad_boost;     /* warmup write boost, bytes */
  588         uint64_t                l2ad_start;     /* first addr on device */
  589         uint64_t                l2ad_end;       /* last addr on device */
  590         uint64_t                l2ad_evict;     /* last addr eviction reached */
  591         boolean_t               l2ad_first;     /* first sweep through */
  592         boolean_t               l2ad_writing;   /* currently writing */
  593         list_t                  *l2ad_buflist;  /* buffer list */
  594         list_node_t             l2ad_node;      /* device list node */
  595 } l2arc_dev_t;
  596 
  597 static list_t L2ARC_dev_list;                   /* device list */
  598 static list_t *l2arc_dev_list;                  /* device list pointer */
  599 static kmutex_t l2arc_dev_mtx;                  /* device list mutex */
  600 static l2arc_dev_t *l2arc_dev_last;             /* last device used */
  601 static kmutex_t l2arc_buflist_mtx;              /* mutex for all buflists */
  602 static list_t L2ARC_free_on_write;              /* free after write buf list */
  603 static list_t *l2arc_free_on_write;             /* free after write list ptr */
  604 static kmutex_t l2arc_free_on_write_mtx;        /* mutex for list */
  605 static uint64_t l2arc_ndev;                     /* number of devices */
  606 
  607 typedef struct l2arc_read_callback {
  608         arc_buf_t       *l2rcb_buf;             /* read buffer */
  609         spa_t           *l2rcb_spa;             /* spa */
  610         blkptr_t        l2rcb_bp;               /* original blkptr */
  611         zbookmark_t     l2rcb_zb;               /* original bookmark */
  612         int             l2rcb_flags;            /* original flags */
  613 } l2arc_read_callback_t;
  614 
  615 typedef struct l2arc_write_callback {
  616         l2arc_dev_t     *l2wcb_dev;             /* device info */
  617         arc_buf_hdr_t   *l2wcb_head;            /* head of write buflist */
  618 } l2arc_write_callback_t;
  619 
  620 struct l2arc_buf_hdr {
  621         /* protected by arc_buf_hdr mutex */
  622         l2arc_dev_t     *b_dev;                 /* L2ARC device */
  623         uint64_t        b_daddr;                /* disk address, offset byte */
  624 };
  625 
  626 typedef struct l2arc_data_free {
  627         /* protected by l2arc_free_on_write_mtx */
  628         void            *l2df_data;
  629         size_t          l2df_size;
  630         void            (*l2df_func)(void *, size_t);
  631         list_node_t     l2df_list_node;
  632 } l2arc_data_free_t;
  633 
  634 static kmutex_t l2arc_feed_thr_lock;
  635 static kcondvar_t l2arc_feed_thr_cv;
  636 static uint8_t l2arc_thread_exit;
  637 
  638 static void l2arc_read_done(zio_t *zio);
  639 static void l2arc_hdr_stat_add(void);
  640 static void l2arc_hdr_stat_remove(void);
  641 
  642 static uint64_t
  643 buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
  644 {
  645         uint8_t *vdva = (uint8_t *)dva;
  646         uint64_t crc = -1ULL;
  647         int i;
  648 
  649         ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
  650 
  651         for (i = 0; i < sizeof (dva_t); i++)
  652                 crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
  653 
  654         crc ^= (spa>>8) ^ birth;
  655 
  656         return (crc);
  657 }
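
/*
 * Illustrative sketch, not part of arc.c: a userland version of the
 * table-driven CRC-64 hash above, together with the table construction
 * that buf_init() performs.  The demo_* names and the two-word layout of
 * demo_dva_t are assumptions made for this example, and the polynomial
 * value is assumed to be the reflected ECMA-182 constant.  Note that the
 * final step folds in (spa >> 8), so the low 8 bits of spa do not affect
 * the result.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value: the reflected ECMA-182 polynomial. */
#define DEMO_CRC64_POLY 0xC96C5795D7870F42ULL

/* Hypothetical stand-in for dva_t: two 64-bit words. */
typedef struct demo_dva {
	uint64_t dva_word[2];
} demo_dva_t;

static uint64_t demo_crc64_table[256];

/* Build the lookup table one bit at a time, as the loop in buf_init() does. */
void
demo_crc64_init(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t ct = (uint64_t)i;

		for (int j = 8; j > 0; j--)
			ct = (ct >> 1) ^ (-(ct & 1) & DEMO_CRC64_POLY);
		demo_crc64_table[i] = ct;
	}
	/* Same sanity check that buf_hash() asserts. */
	assert(demo_crc64_table[128] == DEMO_CRC64_POLY);
}

/* CRC the DVA bytes, then fold in the pool id and birth txg. */
uint64_t
demo_buf_hash(uint64_t spa, const demo_dva_t *dva, uint64_t birth)
{
	const uint8_t *vdva = (const uint8_t *)dva;
	uint64_t crc = -1ULL;

	for (size_t i = 0; i < sizeof (demo_dva_t); i++)
		crc = (crc >> 8) ^ demo_crc64_table[(crc ^ vdva[i]) & 0xFF];

	crc ^= (spa >> 8) ^ birth;

	return (crc);
}
```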
  658 
  659 #define BUF_EMPTY(buf)                                          \
  660         ((buf)->b_dva.dva_word[0] == 0 &&                       \
  661         (buf)->b_dva.dva_word[1] == 0 &&                        \
  662         (buf)->b_birth == 0)
  663 
  664 #define BUF_EQUAL(spa, dva, birth, buf)                         \
  665         (((buf)->b_dva.dva_word[0] == (dva)->dva_word[0]) &&    \
  666         ((buf)->b_dva.dva_word[1] == (dva)->dva_word[1]) &&     \
  667         ((buf)->b_birth == (birth)) && ((buf)->b_spa == (spa)))
  668 
  669 static arc_buf_hdr_t *
  670 buf_hash_find(uint64_t spa, const dva_t *dva, uint64_t birth, kmutex_t **lockp)
  671 {
  672         uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
  673         kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
  674         arc_buf_hdr_t *buf;
  675 
  676         mutex_enter(hash_lock);
  677         for (buf = buf_hash_table.ht_table[idx]; buf != NULL;
  678             buf = buf->b_hash_next) {
  679                 if (BUF_EQUAL(spa, dva, birth, buf)) {
  680                         *lockp = hash_lock;
  681                         return (buf);
  682                 }
  683         }
  684         mutex_exit(hash_lock);
  685         *lockp = NULL;
  686         return (NULL);
  687 }
  688 
  689 /*
  690  * Insert an entry into the hash table.  If there is already an element
  691  * equal to elem in the hash table, then the already existing element
  692  * will be returned and the new element will not be inserted.
  693  * Otherwise returns NULL.
  694  */
  695 static arc_buf_hdr_t *
  696 buf_hash_insert(arc_buf_hdr_t *buf, kmutex_t **lockp)
  697 {
  698         uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
  699         kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
  700         arc_buf_hdr_t *fbuf;
  701         uint32_t i;
  702 
  703         ASSERT(!HDR_IN_HASH_TABLE(buf));
  704         *lockp = hash_lock;
  705         mutex_enter(hash_lock);
  706         for (fbuf = buf_hash_table.ht_table[idx], i = 0; fbuf != NULL;
  707             fbuf = fbuf->b_hash_next, i++) {
  708                 if (BUF_EQUAL(buf->b_spa, &buf->b_dva, buf->b_birth, fbuf))
  709                         return (fbuf);
  710         }
  711 
  712         buf->b_hash_next = buf_hash_table.ht_table[idx];
  713         buf_hash_table.ht_table[idx] = buf;
  714         buf->b_flags |= ARC_IN_HASH_TABLE;
  715 
  716         /* collect some hash table performance data */
  717         if (i > 0) {
  718                 ARCSTAT_BUMP(arcstat_hash_collisions);
  719                 if (i == 1)
  720                         ARCSTAT_BUMP(arcstat_hash_chains);
  721 
  722                 ARCSTAT_MAX(arcstat_hash_chain_max, i);
  723         }
  724 
  725         ARCSTAT_BUMP(arcstat_hash_elements);
  726         ARCSTAT_MAXSTAT(arcstat_hash_elements);
  727 
  728         return (NULL);
  729 }
  730 
  731 static void
  732 buf_hash_remove(arc_buf_hdr_t *buf)
  733 {
  734         arc_buf_hdr_t *fbuf, **bufp;
  735         uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
  736 
  737         ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
  738         ASSERT(HDR_IN_HASH_TABLE(buf));
  739 
  740         bufp = &buf_hash_table.ht_table[idx];
  741         while ((fbuf = *bufp) != buf) {
  742                 ASSERT(fbuf != NULL);
  743                 bufp = &fbuf->b_hash_next;
  744         }
  745         *bufp = buf->b_hash_next;
  746         buf->b_hash_next = NULL;
  747         buf->b_flags &= ~ARC_IN_HASH_TABLE;
  748 
  749         /* collect some hash table performance data */
  750         ARCSTAT_BUMPDOWN(arcstat_hash_elements);
  751 
  752         if (buf_hash_table.ht_table[idx] &&
  753             buf_hash_table.ht_table[idx]->b_hash_next == NULL)
  754                 ARCSTAT_BUMPDOWN(arcstat_hash_chains);
  755 }
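
/*
 * Illustrative sketch, not part of arc.c: the chained-bucket insert and
 * remove pattern used by buf_hash_insert() and buf_hash_remove(), reduced
 * to a single-threaded userland example.  The demo_* names, the integer
 * key (standing in for the spa/dva/birth triple) and the fixed table size
 * are assumptions; the hash locks and the arcstat counters are omitted.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HT_SIZE 8			/* must be a power of two */

typedef struct demo_hdr {
	uint64_t key;			/* stands in for (spa, dva, birth) */
	struct demo_hdr *hash_next;	/* chain link, like b_hash_next */
} demo_hdr_t;

static demo_hdr_t *demo_table[DEMO_HT_SIZE];

/*
 * Insert hdr at the head of its chain.  If an entry with the same key
 * already exists, return it and leave the table unchanged; otherwise
 * return NULL.
 */
demo_hdr_t *
demo_hash_insert(demo_hdr_t *hdr)
{
	uint64_t idx = hdr->key & (DEMO_HT_SIZE - 1);
	demo_hdr_t *f;

	for (f = demo_table[idx]; f != NULL; f = f->hash_next) {
		if (f->key == hdr->key)
			return (f);
	}

	hdr->hash_next = demo_table[idx];
	demo_table[idx] = hdr;

	return (NULL);
}

/*
 * Unlink hdr by walking a pointer to the link that references it, so the
 * chain head and interior nodes are handled by the same code path.
 */
void
demo_hash_remove(demo_hdr_t *hdr)
{
	demo_hdr_t **hp = &demo_table[hdr->key & (DEMO_HT_SIZE - 1)];

	while (*hp != hdr) {
		assert(*hp != NULL);	/* hdr must be in the table */
		hp = &(*hp)->hash_next;
	}
	*hp = hdr->hash_next;
	hdr->hash_next = NULL;
}
```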
  756 
  757 /*
  758  * Global data structures and functions for the buf kmem cache.
  759  */
  760 static kmem_cache_t *hdr_cache;
  761 static kmem_cache_t *buf_cache;
  762 
  763 static void
  764 buf_fini(void)
  765 {
  766         int i;
  767 
  768         kmem_free(buf_hash_table.ht_table,
  769             (buf_hash_table.ht_mask + 1) * sizeof (void *));
  770         for (i = 0; i < BUF_LOCKS; i++)
  771                 mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
  772         kmem_cache_destroy(hdr_cache);
  773         kmem_cache_destroy(buf_cache);
  774 }
  775 
  776 /*
  777  * Constructor callback - called when the cache is empty
  778  * and a new buf is requested.
  779  */
  780 /* ARGSUSED */
  781 static int
  782 hdr_cons(void *vbuf, void *unused, int kmflag)
  783 {
  784         arc_buf_hdr_t *buf = vbuf;
  785 
  786         bzero(buf, sizeof (arc_buf_hdr_t));
  787         refcount_create(&buf->b_refcnt);
  788         cv_init(&buf->b_cv, NULL, CV_DEFAULT, NULL);
  789         mutex_init(&buf->b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
  790         arc_space_consume(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
  791 
  792         return (0);
  793 }
  794 
  795 /* ARGSUSED */
  796 static int
  797 buf_cons(void *vbuf, void *unused, int kmflag)
  798 {
  799         arc_buf_t *buf = vbuf;
  800 
  801         bzero(buf, sizeof (arc_buf_t));
  802         rw_init(&buf->b_lock, NULL, RW_DEFAULT, NULL);
  803         arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
  804 
  805         return (0);
  806 }
  807 
  808 /*
  809  * Destructor callback - called when a cached buf is
  810  * no longer required.
  811  */
  812 /* ARGSUSED */
  813 static void
  814 hdr_dest(void *vbuf, void *unused)
  815 {
  816         arc_buf_hdr_t *buf = vbuf;
  817 
  818         refcount_destroy(&buf->b_refcnt);
  819         cv_destroy(&buf->b_cv);
  820         mutex_destroy(&buf->b_freeze_lock);
  821         arc_space_return(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
  822 }
  823 
  824 /* ARGSUSED */
  825 static void
  826 buf_dest(void *vbuf, void *unused)
  827 {
  828         arc_buf_t *buf = vbuf;
  829 
  830         rw_destroy(&buf->b_lock);
  831         arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
  832 }
  833 
  834 /*
  835  * Reclaim callback -- invoked when memory is low.
  836  */
  837 /* ARGSUSED */
  838 static void
  839 hdr_recl(void *unused)
  840 {
  841         dprintf("hdr_recl called\n");
  842         /*
  843          * umem calls the reclaim func when we destroy the buf cache,
  844          * which is after we do arc_fini().
  845          */
  846         if (!arc_dead)
  847                 cv_signal(&arc_reclaim_thr_cv);
  848 }
  849 
  850 static void
  851 buf_init(void)
  852 {
  853         uint64_t *ct;
  854         uint64_t hsize = 1ULL << 12;
  855         int i, j;
  856 
  857         /*
  858          * The hash table is big enough to fill all of physical memory
  859          * with an average 64K block size.  The table will take up
  860          * totalmem*sizeof(void*)/64K (e.g. 128KB/GB with 8-byte pointers).
  861          */
  862         while (hsize * 65536 < physmem * PAGESIZE)
  863                 hsize <<= 1;
  864 retry:
  865         buf_hash_table.ht_mask = hsize - 1;
  866         buf_hash_table.ht_table =
  867             kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
  868         if (buf_hash_table.ht_table == NULL) {
  869                 ASSERT(hsize > (1ULL << 8));
  870                 hsize >>= 1;
  871                 goto retry;
  872         }
  873 
  874         hdr_cache = kmem_cache_create("arc_buf_hdr_t", sizeof (arc_buf_hdr_t),
  875             0, hdr_cons, hdr_dest, hdr_recl, NULL, NULL, 0);
  876         buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
  877             0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
  878 
  879         for (i = 0; i < 256; i++)
  880                 for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
  881                         *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
  882 
  883         for (i = 0; i < BUF_LOCKS; i++) {
  884                 mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
  885                     NULL, MUTEX_DEFAULT, NULL);
  886         }
  887 }
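
/*
 * Illustrative sketch, not part of arc.c: the hash table sizing rule from
 * buf_init(), written as a pure function so the arithmetic in the comment
 * there can be checked.  The demo_* names are assumptions; the real code
 * derives the memory size from physmem * PAGESIZE and halves hsize when
 * the allocation fails.  For 1 GB of memory this gives 16384 buckets,
 * which is 128 KB of table with 8-byte pointers.
 */

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest power-of-two bucket count, starting at 4096, that provides
 * one bucket per 64K of memory.
 */
uint64_t
demo_hash_size(uint64_t mem_bytes)
{
	uint64_t hsize = 1ULL << 12;

	/* Keep hsize * 65536 from overflowing 64 bits. */
	assert(mem_bytes <= (1ULL << 62));

	while (hsize * 65536 < mem_bytes)
		hsize <<= 1;

	return (hsize);
}

/* Bytes taken by the bucket array for a given pointer size. */
uint64_t
demo_hash_table_bytes(uint64_t mem_bytes, uint64_t ptr_size)
{
	return (demo_hash_size(mem_bytes) * ptr_size);
}
```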
  888 
  889 #define ARC_MINTIME     (hz>>4) /* 62 ms */
  890 
  891 static void
  892 arc_cksum_verify(arc_buf_t *buf)
  893 {
  894         zio_cksum_t zc;
  895 
  896         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
  897                 return;
  898 
  899         mutex_enter(&buf->b_hdr->b_freeze_lock);
  900         if (buf->b_hdr->b_freeze_cksum == NULL ||
  901             (buf->b_hdr->b_flags & ARC_IO_ERROR)) {
  902                 mutex_exit(&buf->b_hdr->b_freeze_lock);
  903                 return;
  904         }
  905         fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
  906         if (!ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc))
  907                 panic("buffer modified while frozen!");
  908         mutex_exit(&buf->b_hdr->b_freeze_lock);
  909 }
  910 
  911 static int
  912 arc_cksum_equal(arc_buf_t *buf)
  913 {
  914         zio_cksum_t zc;
  915         int equal;
  916 
  917         mutex_enter(&buf->b_hdr->b_freeze_lock);
  918         fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
  919         equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);
  920         mutex_exit(&buf->b_hdr->b_freeze_lock);
  921 
  922         return (equal);
  923 }
  924 
  925 static void
  926 arc_cksum_compute(arc_buf_t *buf, boolean_t force)
  927 {
  928         if (!force && !(zfs_flags & ZFS_DEBUG_MODIFY))
  929                 return;
  930 
  931         mutex_enter(&buf->b_hdr->b_freeze_lock);
  932         if (buf->b_hdr->b_freeze_cksum != NULL) {
  933                 mutex_exit(&buf->b_hdr->b_freeze_lock);
  934                 return;
  935         }
  936         buf->b_hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
  937         fletcher_2_native(buf->b_data, buf->b_hdr->b_size,
  938             buf->b_hdr->b_freeze_cksum);
  939         mutex_exit(&buf->b_hdr->b_freeze_lock);
  940 }
  941 
  942 void
  943 arc_buf_thaw(arc_buf_t *buf)
  944 {
  945         if (zfs_flags & ZFS_DEBUG_MODIFY) {
  946                 if (buf->b_hdr->b_state != arc_anon)
  947                         panic("modifying non-anon buffer!");
  948                 if (buf->b_hdr->b_flags & ARC_IO_IN_PROGRESS)
  949                         panic("modifying buffer while i/o in progress!");
  950                 arc_cksum_verify(buf);
  951         }
  952 
  953         mutex_enter(&buf->b_hdr->b_freeze_lock);
  954         if (buf->b_hdr->b_freeze_cksum != NULL) {
  955                 kmem_free(buf->b_hdr->b_freeze_cksum, sizeof (zio_cksum_t));
  956                 buf->b_hdr->b_freeze_cksum = NULL;
  957         }
  958         mutex_exit(&buf->b_hdr->b_freeze_lock);
  959 }
  960 
  961 void
  962 arc_buf_freeze(arc_buf_t *buf)
  963 {
  964         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
  965                 return;
  966 
  967         ASSERT(buf->b_hdr->b_freeze_cksum != NULL ||
  968             buf->b_hdr->b_state == arc_anon);
  969         arc_cksum_compute(buf, B_FALSE);
  970 }
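
/*
 * Illustrative sketch, not part of arc.c: the freeze/verify/thaw idea
 * behind arc_buf_freeze(), arc_cksum_verify() and arc_buf_thaw(), as a
 * single-threaded userland example.  The demo_* types and functions are
 * assumptions, b_freeze_lock is omitted, and the checksum routine is a
 * Fletcher-2 style sum written for this example rather than the kernel's
 * fletcher_2_native().  The real code panics on a mismatch; this version
 * returns 0 instead.
 */

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct demo_cksum {
	uint64_t word[4];
} demo_cksum_t;

typedef struct demo_buf {
	uint64_t *data;			/* payload, a multiple of 16 bytes */
	size_t size;			/* payload size in bytes */
	demo_cksum_t *freeze_cksum;	/* NULL while the buffer is thawed */
} demo_buf_t;

/* Fletcher-2 style running sums over pairs of 64-bit words. */
void
demo_fletcher_2(const uint64_t *ip, size_t size, demo_cksum_t *zc)
{
	const uint64_t *ipend = ip + (size / sizeof (uint64_t));
	uint64_t a0 = 0, b0 = 0, a1 = 0, b1 = 0;

	for (; ip < ipend; ip += 2) {
		a0 += ip[0];
		a1 += ip[1];
		b0 += a0;
		b1 += a1;
	}
	zc->word[0] = a0;
	zc->word[1] = a1;
	zc->word[2] = b0;
	zc->word[3] = b1;
}

/* Record a checksum of the current contents unless one already exists. */
void
demo_freeze(demo_buf_t *buf)
{
	if (buf->freeze_cksum != NULL)
		return;
	buf->freeze_cksum = malloc(sizeof (demo_cksum_t));
	assert(buf->freeze_cksum != NULL);
	demo_fletcher_2(buf->data, buf->size, buf->freeze_cksum);
}

/* Returns 1 if the buffer is thawed or still matches its checksum. */
int
demo_verify(const demo_buf_t *buf)
{
	demo_cksum_t zc;

	if (buf->freeze_cksum == NULL)
		return (1);
	demo_fletcher_2(buf->data, buf->size, &zc);
	return (memcmp(&zc, buf->freeze_cksum, sizeof (zc)) == 0);
}

/* Discard the recorded checksum so the buffer may be modified again. */
void
demo_thaw(demo_buf_t *buf)
{
	free(buf->freeze_cksum);
	buf->freeze_cksum = NULL;
}
```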
  971 
  972 static void
  973 add_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
  974 {
  975         ASSERT(MUTEX_HELD(hash_lock));
  976 
  977         if ((refcount_add(&ab->b_refcnt, tag) == 1) &&
  978             (ab->b_state != arc_anon)) {
  979                 uint64_t delta = ab->b_size * ab->b_datacnt;
  980                 list_t *list = &ab->b_state->arcs_list[ab->b_type];
  981                 uint64_t *size = &ab->b_state->arcs_lsize[ab->b_type];
  982 
  983                 ASSERT(!MUTEX_HELD(&ab->b_state->arcs_mtx));
  984                 mutex_enter(&ab->b_state->arcs_mtx);
  985                 ASSERT(list_link_active(&ab->b_arc_node));
  986                 list_remove(list, ab);
  987                 if (GHOST_STATE(ab->b_state)) {
  988                         ASSERT3U(ab->b_datacnt, ==, 0);
  989                         ASSERT3P(ab->b_buf, ==, NULL);
  990                         delta = ab->b_size;
  991                 }
  992                 ASSERT(delta > 0);
  993                 ASSERT3U(*size, >=, delta);
  994                 atomic_add_64(size, -delta);
  995                 mutex_exit(&ab->b_state->arcs_mtx);
  996                 /* remove the prefetch flag if we get a reference */
  997                 if (ab->b_flags & ARC_PREFETCH)
  998                         ab->b_flags &= ~ARC_PREFETCH;
  999         }
 1000 }
 1001 
 1002 static int
 1003 remove_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
 1004 {
 1005         int cnt;
 1006         arc_state_t *state = ab->b_state;
 1007 
 1008         ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
 1009         ASSERT(!GHOST_STATE(state));
 1010 
 1011         if (((cnt = refcount_remove(&ab->b_refcnt, tag)) == 0) &&
 1012             (state != arc_anon)) {
 1013                 uint64_t *size = &state->arcs_lsize[ab->b_type];
 1014 
 1015                 ASSERT(!MUTEX_HELD(&state->arcs_mtx));
 1016                 mutex_enter(&state->arcs_mtx);
 1017                 ASSERT(!list_link_active(&ab->b_arc_node));
 1018                 list_insert_head(&state->arcs_list[ab->b_type], ab);
 1019                 ASSERT(ab->b_datacnt > 0);
 1020                 atomic_add_64(size, ab->b_size * ab->b_datacnt);
 1021                 mutex_exit(&state->arcs_mtx);
 1022         }
 1023         return (cnt);
 1024 }
 1025 
 1026 /*
 1027  * Move the supplied buffer to the indicated state.  The mutex
 1028  * for the buffer must be held by the caller.
 1029  */
 1030 static void
 1031 arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *ab, kmutex_t *hash_lock)
 1032 {
 1033         arc_state_t *old_state = ab->b_state;
 1034         int64_t refcnt = refcount_count(&ab->b_refcnt);
 1035         uint64_t from_delta, to_delta;
 1036 
 1037         ASSERT(MUTEX_HELD(hash_lock));
 1038         ASSERT(new_state != old_state);
 1039         ASSERT(refcnt == 0 || ab->b_datacnt > 0);
 1040         ASSERT(ab->b_datacnt == 0 || !GHOST_STATE(new_state));
 1041 
 1042         from_delta = to_delta = ab->b_datacnt * ab->b_size;
 1043 
 1044         /*
 1045          * If this buffer is evictable, transfer it from the
 1046          * old state list to the new state list.
 1047          */
 1048         if (refcnt == 0) {
 1049                 if (old_state != arc_anon) {
 1050                         int use_mutex = !MUTEX_HELD(&old_state->arcs_mtx);
 1051                         uint64_t *size = &old_state->arcs_lsize[ab->b_type];
 1052 
 1053                         if (use_mutex)
 1054                                 mutex_enter(&old_state->arcs_mtx);
 1055 
 1056                         ASSERT(list_link_active(&ab->b_arc_node));
 1057                         list_remove(&old_state->arcs_list[ab->b_type], ab);
 1058 
 1059                         /*
 1060                          * If prefetching out of the ghost cache,
  1061                          * we will have a non-zero datacnt.
 1062                          */
 1063                         if (GHOST_STATE(old_state) && ab->b_datacnt == 0) {
 1064                                 /* ghost elements have a ghost size */
 1065                                 ASSERT(ab->b_buf == NULL);
 1066                                 from_delta = ab->b_size;
 1067                         }
 1068                         ASSERT3U(*size, >=, from_delta);
 1069                         atomic_add_64(size, -from_delta);
 1070 
 1071                         if (use_mutex)
 1072                                 mutex_exit(&old_state->arcs_mtx);
 1073                 }
 1074                 if (new_state != arc_anon) {
 1075                         int use_mutex = !MUTEX_HELD(&new_state->arcs_mtx);
 1076                         uint64_t *size = &new_state->arcs_lsize[ab->b_type];
 1077 
 1078                         if (use_mutex)
 1079                                 mutex_enter(&new_state->arcs_mtx);
 1080 
 1081                         list_insert_head(&new_state->arcs_list[ab->b_type], ab);
 1082 
 1083                         /* ghost elements have a ghost size */
 1084                         if (GHOST_STATE(new_state)) {
 1085                                 ASSERT(ab->b_datacnt == 0);
 1086                                 ASSERT(ab->b_buf == NULL);
 1087                                 to_delta = ab->b_size;
 1088                         }
 1089                         atomic_add_64(size, to_delta);
 1090 
 1091                         if (use_mutex)
 1092                                 mutex_exit(&new_state->arcs_mtx);
 1093                 }
 1094         }
 1095 
 1096         ASSERT(!BUF_EMPTY(ab));
 1097         if (new_state == arc_anon) {
 1098                 buf_hash_remove(ab);
 1099         }
 1100 
 1101         /* adjust state sizes */
 1102         if (to_delta)
 1103                 atomic_add_64(&new_state->arcs_size, to_delta);
 1104         if (from_delta) {
 1105                 ASSERT3U(old_state->arcs_size, >=, from_delta);
 1106                 atomic_add_64(&old_state->arcs_size, -from_delta);
 1107         }
 1108         ab->b_state = new_state;
 1109 
 1110         /* adjust l2arc hdr stats */
 1111         if (new_state == arc_l2c_only)
 1112                 l2arc_hdr_stat_add();
 1113         else if (old_state == arc_l2c_only)
 1114                 l2arc_hdr_stat_remove();
 1115 }
 1116 
 1117 void
 1118 arc_space_consume(uint64_t space, arc_space_type_t type)
 1119 {
 1120         ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
 1121 
 1122         switch (type) {
 1123         case ARC_SPACE_DATA:
 1124                 ARCSTAT_INCR(arcstat_data_size, space);
 1125                 break;
 1126         case ARC_SPACE_OTHER:
 1127                 ARCSTAT_INCR(arcstat_other_size, space);
 1128                 break;
 1129         case ARC_SPACE_HDRS:
 1130                 ARCSTAT_INCR(arcstat_hdr_size, space);
 1131                 break;
 1132         case ARC_SPACE_L2HDRS:
 1133                 ARCSTAT_INCR(arcstat_l2_hdr_size, space);
 1134                 break;
 1135         }
 1136 
 1137         atomic_add_64(&arc_meta_used, space);
 1138         atomic_add_64(&arc_size, space);
 1139 }
 1140 
 1141 void
 1142 arc_space_return(uint64_t space, arc_space_type_t type)
 1143 {
 1144         ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
 1145 
 1146         switch (type) {
 1147         case ARC_SPACE_DATA:
 1148                 ARCSTAT_INCR(arcstat_data_size, -space);
 1149                 break;
 1150         case ARC_SPACE_OTHER:
 1151                 ARCSTAT_INCR(arcstat_other_size, -space);
 1152                 break;
 1153         case ARC_SPACE_HDRS:
 1154                 ARCSTAT_INCR(arcstat_hdr_size, -space);
 1155                 break;
 1156         case ARC_SPACE_L2HDRS:
 1157                 ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
 1158                 break;
 1159         }
 1160 
 1161         ASSERT(arc_meta_used >= space);
 1162         if (arc_meta_max < arc_meta_used)
 1163                 arc_meta_max = arc_meta_used;
 1164         atomic_add_64(&arc_meta_used, -space);
 1165         ASSERT(arc_size >= space);
 1166         atomic_add_64(&arc_size, -space);
 1167 }
 1168 
 1169 void *
 1170 arc_data_buf_alloc(uint64_t size)
 1171 {
 1172         if (arc_evict_needed(ARC_BUFC_DATA))
 1173                 cv_signal(&arc_reclaim_thr_cv);
 1174         atomic_add_64(&arc_size, size);
 1175         return (zio_data_buf_alloc(size));
 1176 }
 1177 
 1178 void
 1179 arc_data_buf_free(void *buf, uint64_t size)
 1180 {
 1181         zio_data_buf_free(buf, size);
 1182         ASSERT(arc_size >= size);
 1183         atomic_add_64(&arc_size, -size);
 1184 }
 1185 
 1186 arc_buf_t *
 1187 arc_buf_alloc(spa_t *spa, int size, void *tag, arc_buf_contents_t type)
 1188 {
 1189         arc_buf_hdr_t *hdr;
 1190         arc_buf_t *buf;
 1191 
 1192         ASSERT3U(size, >, 0);
 1193         hdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
 1194         ASSERT(BUF_EMPTY(hdr));
 1195         hdr->b_size = size;
 1196         hdr->b_type = type;
 1197         hdr->b_spa = spa_guid(spa);
 1198         hdr->b_state = arc_anon;
 1199         hdr->b_arc_access = 0;
 1200         buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
 1201         buf->b_hdr = hdr;
 1202         buf->b_data = NULL;
 1203         buf->b_efunc = NULL;
 1204         buf->b_private = NULL;
 1205         buf->b_next = NULL;
 1206         hdr->b_buf = buf;
 1207         arc_get_data_buf(buf);
 1208         hdr->b_datacnt = 1;
 1209         hdr->b_flags = 0;
 1210         ASSERT(refcount_is_zero(&hdr->b_refcnt));
 1211         (void) refcount_add(&hdr->b_refcnt, tag);
 1212 
 1213         return (buf);
 1214 }
 1215 
 1216 static char *arc_onloan_tag = "onloan";
 1217 
 1218 /*
 1219  * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
 1220  * flight data by arc_tempreserve_space() until they are "returned". Loaned
 1221  * buffers must be returned to the arc before they can be used by the DMU or
 1222  * freed.
 1223  */
 1224 arc_buf_t *
 1225 arc_loan_buf(spa_t *spa, int size)
 1226 {
 1227         arc_buf_t *buf;
 1228 
 1229         buf = arc_buf_alloc(spa, size, arc_onloan_tag, ARC_BUFC_DATA);
 1230 
 1231         atomic_add_64(&arc_loaned_bytes, size);
 1232         return (buf);
 1233 }
 1234 
 1235 /*
 1236  * Return a loaned arc buffer to the arc.
 1237  */
 1238 void
 1239 arc_return_buf(arc_buf_t *buf, void *tag)
 1240 {
 1241         arc_buf_hdr_t *hdr = buf->b_hdr;
 1242 
 1243         ASSERT(hdr->b_state == arc_anon);
 1244         ASSERT(buf->b_data != NULL);
 1245         VERIFY(refcount_remove(&hdr->b_refcnt, arc_onloan_tag) == 0);
 1246         VERIFY(refcount_add(&hdr->b_refcnt, tag) == 1);
 1247 
 1248         atomic_add_64(&arc_loaned_bytes, -hdr->b_size);
 1249 }
 1250 
 1251 static arc_buf_t *
 1252 arc_buf_clone(arc_buf_t *from)
 1253 {
 1254         arc_buf_t *buf;
 1255         arc_buf_hdr_t *hdr = from->b_hdr;
 1256         uint64_t size = hdr->b_size;
 1257 
 1258         buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
 1259         buf->b_hdr = hdr;
 1260         buf->b_data = NULL;
 1261         buf->b_efunc = NULL;
 1262         buf->b_private = NULL;
 1263         buf->b_next = hdr->b_buf;
 1264         hdr->b_buf = buf;
 1265         arc_get_data_buf(buf);
 1266         bcopy(from->b_data, buf->b_data, size);
 1267         hdr->b_datacnt += 1;
 1268         return (buf);
 1269 }
 1270 
 1271 void
 1272 arc_buf_add_ref(arc_buf_t *buf, void* tag)
 1273 {
 1274         arc_buf_hdr_t *hdr;
 1275         kmutex_t *hash_lock;
 1276 
 1277         /*
 1278          * Check to see if this buffer is evicted.  Callers
 1279          * must verify b_data != NULL to know if the add_ref
 1280          * was successful.
 1281          */
 1282         rw_enter(&buf->b_lock, RW_READER);
 1283         if (buf->b_data == NULL) {
 1284                 rw_exit(&buf->b_lock);
 1285                 return;
 1286         }
 1287         hdr = buf->b_hdr;
 1288         ASSERT(hdr != NULL);
 1289         hash_lock = HDR_LOCK(hdr);
 1290         mutex_enter(hash_lock);
 1291         rw_exit(&buf->b_lock);
 1292 
 1293         ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
 1294         add_reference(hdr, hash_lock, tag);
 1295         DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
 1296         arc_access(hdr, hash_lock);
 1297         mutex_exit(hash_lock);
 1298         ARCSTAT_BUMP(arcstat_hits);
 1299         ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
 1300             demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
 1301             data, metadata, hits);
 1302 }
 1303 
 1304 /*
 1305  * Free the arc data buffer.  If it is an l2arc write in progress,
 1306  * the buffer is placed on l2arc_free_on_write to be freed later.
 1307  */
 1308 static void
 1309 arc_buf_data_free(arc_buf_hdr_t *hdr, void (*free_func)(void *, size_t),
 1310     void *data, size_t size)
 1311 {
 1312         if (HDR_L2_WRITING(hdr)) {
 1313                 l2arc_data_free_t *df;
 1314                 df = kmem_alloc(sizeof (l2arc_data_free_t), KM_SLEEP);
 1315                 df->l2df_data = data;
 1316                 df->l2df_size = size;
 1317                 df->l2df_func = free_func;
 1318                 mutex_enter(&l2arc_free_on_write_mtx);
 1319                 list_insert_head(l2arc_free_on_write, df);
 1320                 mutex_exit(&l2arc_free_on_write_mtx);
 1321                 ARCSTAT_BUMP(arcstat_l2_free_on_write);
 1322         } else {
 1323                 free_func(data, size);
 1324         }
 1325 }
 1326 
 1327 static void
 1328 arc_buf_destroy(arc_buf_t *buf, boolean_t recycle, boolean_t all)
 1329 {
 1330         arc_buf_t **bufp;
 1331 
 1332         /* free up data associated with the buf */
 1333         if (buf->b_data) {
 1334                 arc_state_t *state = buf->b_hdr->b_state;
 1335                 uint64_t size = buf->b_hdr->b_size;
 1336                 arc_buf_contents_t type = buf->b_hdr->b_type;
 1337 
 1338                 arc_cksum_verify(buf);
 1339                 if (!recycle) {
 1340                         if (type == ARC_BUFC_METADATA) {
 1341                                 arc_buf_data_free(buf->b_hdr, zio_buf_free,
 1342                                     buf->b_data, size);
 1343                                 arc_space_return(size, ARC_SPACE_DATA);
 1344                         } else {
 1345                                 ASSERT(type == ARC_BUFC_DATA);
 1346                                 arc_buf_data_free(buf->b_hdr,
 1347                                     zio_data_buf_free, buf->b_data, size);
 1348                                 ARCSTAT_INCR(arcstat_data_size, -size);
 1349                                 atomic_add_64(&arc_size, -size);
 1350                         }
 1351                 }
 1352                 if (list_link_active(&buf->b_hdr->b_arc_node)) {
 1353                         uint64_t *cnt = &state->arcs_lsize[type];
 1354 
 1355                         ASSERT(refcount_is_zero(&buf->b_hdr->b_refcnt));
 1356                         ASSERT(state != arc_anon);
 1357 
 1358                         ASSERT3U(*cnt, >=, size);
 1359                         atomic_add_64(cnt, -size);
 1360                 }
 1361                 ASSERT3U(state->arcs_size, >=, size);
 1362                 atomic_add_64(&state->arcs_size, -size);
 1363                 buf->b_data = NULL;
 1364                 ASSERT(buf->b_hdr->b_datacnt > 0);
 1365                 buf->b_hdr->b_datacnt -= 1;
 1366         }
 1367 
 1368         /* only remove the buf if requested */
 1369         if (!all)
 1370                 return;
 1371 
 1372         /* remove the buf from the hdr list */
 1373         for (bufp = &buf->b_hdr->b_buf; *bufp != buf; bufp = &(*bufp)->b_next)
 1374                 continue;
 1375         *bufp = buf->b_next;
 1376 
 1377         ASSERT(buf->b_efunc == NULL);
 1378 
 1379         /* clean up the buf */
 1380         buf->b_hdr = NULL;
 1381         kmem_cache_free(buf_cache, buf);
 1382 }
 1383 
 1384 static void
 1385 arc_hdr_destroy(arc_buf_hdr_t *hdr)
 1386 {
 1387         ASSERT(refcount_is_zero(&hdr->b_refcnt));
 1388         ASSERT3P(hdr->b_state, ==, arc_anon);
 1389         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 1390         ASSERT(!(hdr->b_flags & ARC_STORED));
 1391 
 1392         if (hdr->b_l2hdr != NULL) {
 1393                 if (!MUTEX_HELD(&l2arc_buflist_mtx)) {
 1394                         /*
 1395                          * To prevent arc_free() and l2arc_evict() from
 1396                          * attempting to free the same buffer at the same time,
 1397                          * a FREE_IN_PROGRESS flag is given to arc_free() to
 1398                          * give it priority.  l2arc_evict() can't destroy this
 1399                          * header while we are waiting on l2arc_buflist_mtx.
 1400                          *
 1401                          * The hdr may be removed from l2ad_buflist before we
 1402                          * grab l2arc_buflist_mtx, so b_l2hdr is rechecked.
 1403                          */
 1404                         mutex_enter(&l2arc_buflist_mtx);
 1405                         if (hdr->b_l2hdr != NULL) {
 1406                                 list_remove(hdr->b_l2hdr->b_dev->l2ad_buflist,
 1407                                     hdr);
 1408                         }
 1409                         mutex_exit(&l2arc_buflist_mtx);
 1410                 } else {
 1411                         list_remove(hdr->b_l2hdr->b_dev->l2ad_buflist, hdr);
 1412                 }
 1413                 ARCSTAT_INCR(arcstat_l2_size, -hdr->b_size);
 1414                 kmem_free(hdr->b_l2hdr, sizeof (l2arc_buf_hdr_t));
 1415                 if (hdr->b_state == arc_l2c_only)
 1416                         l2arc_hdr_stat_remove();
 1417                 hdr->b_l2hdr = NULL;
 1418         }
 1419 
 1420         if (!BUF_EMPTY(hdr)) {
 1421                 ASSERT(!HDR_IN_HASH_TABLE(hdr));
 1422                 bzero(&hdr->b_dva, sizeof (dva_t));
 1423                 hdr->b_birth = 0;
 1424                 hdr->b_cksum0 = 0;
 1425         }
 1426         while (hdr->b_buf) {
 1427                 arc_buf_t *buf = hdr->b_buf;
 1428 
 1429                 if (buf->b_efunc) {
 1430                         mutex_enter(&arc_eviction_mtx);
 1431                         rw_enter(&buf->b_lock, RW_WRITER);
 1432                         ASSERT(buf->b_hdr != NULL);
 1433                         arc_buf_destroy(hdr->b_buf, FALSE, FALSE);
 1434                         hdr->b_buf = buf->b_next;
 1435                         buf->b_hdr = &arc_eviction_hdr;
 1436                         buf->b_next = arc_eviction_list;
 1437                         arc_eviction_list = buf;
 1438                         rw_exit(&buf->b_lock);
 1439                         mutex_exit(&arc_eviction_mtx);
 1440                 } else {
 1441                         arc_buf_destroy(hdr->b_buf, FALSE, TRUE);
 1442                 }
 1443         }
 1444         if (hdr->b_freeze_cksum != NULL) {
 1445                 kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
 1446                 hdr->b_freeze_cksum = NULL;
 1447         }
 1448 
 1449         ASSERT(!list_link_active(&hdr->b_arc_node));
 1450         ASSERT3P(hdr->b_hash_next, ==, NULL);
 1451         ASSERT3P(hdr->b_acb, ==, NULL);
 1452         kmem_cache_free(hdr_cache, hdr);
 1453 }
 1454 
 1455 void
 1456 arc_buf_free(arc_buf_t *buf, void *tag)
 1457 {
 1458         arc_buf_hdr_t *hdr = buf->b_hdr;
 1459         int hashed = hdr->b_state != arc_anon;
 1460 
 1461         ASSERT(buf->b_efunc == NULL);
 1462         ASSERT(buf->b_data != NULL);
 1463 
 1464         if (hashed) {
 1465                 kmutex_t *hash_lock = HDR_LOCK(hdr);
 1466 
 1467                 mutex_enter(hash_lock);
 1468                 (void) remove_reference(hdr, hash_lock, tag);
 1469                 if (hdr->b_datacnt > 1)
 1470                         arc_buf_destroy(buf, FALSE, TRUE);
 1471                 else
 1472                         hdr->b_flags |= ARC_BUF_AVAILABLE;
 1473                 mutex_exit(hash_lock);
 1474         } else if (HDR_IO_IN_PROGRESS(hdr)) {
 1475                 int destroy_hdr;
 1476                 /*
 1477                  * We are in the middle of an async write.  Don't destroy
 1478                  * this buffer unless the write completes before we finish
 1479                  * decrementing the reference count.
 1480                  */
 1481                 mutex_enter(&arc_eviction_mtx);
 1482                 (void) remove_reference(hdr, NULL, tag);
 1483                 ASSERT(refcount_is_zero(&hdr->b_refcnt));
 1484                 destroy_hdr = !HDR_IO_IN_PROGRESS(hdr);
 1485                 mutex_exit(&arc_eviction_mtx);
 1486                 if (destroy_hdr)
 1487                         arc_hdr_destroy(hdr);
 1488         } else {
 1489                 if (remove_reference(hdr, NULL, tag) > 0) {
 1490                         ASSERT(HDR_IO_ERROR(hdr));
 1491                         arc_buf_destroy(buf, FALSE, TRUE);
 1492                 } else {
 1493                         arc_hdr_destroy(hdr);
 1494                 }
 1495         }
 1496 }
 1497 
 1498 int
 1499 arc_buf_remove_ref(arc_buf_t *buf, void *tag)
 1500 {
 1501         arc_buf_hdr_t *hdr = buf->b_hdr;
 1502         kmutex_t *hash_lock = HDR_LOCK(hdr);
 1503         int no_callback = (buf->b_efunc == NULL);
 1504 
 1505         if (hdr->b_state == arc_anon) {
 1506                 arc_buf_free(buf, tag);
 1507                 return (no_callback);
 1508         }
 1509 
 1510         mutex_enter(hash_lock);
 1511         ASSERT(hdr->b_state != arc_anon);
 1512         ASSERT(buf->b_data != NULL);
 1513 
 1514         (void) remove_reference(hdr, hash_lock, tag);
 1515         if (hdr->b_datacnt > 1) {
 1516                 if (no_callback)
 1517                         arc_buf_destroy(buf, FALSE, TRUE);
 1518         } else if (no_callback) {
 1519                 ASSERT(hdr->b_buf == buf && buf->b_next == NULL);
 1520                 hdr->b_flags |= ARC_BUF_AVAILABLE;
 1521         }
 1522         ASSERT(no_callback || hdr->b_datacnt > 1 ||
 1523             refcount_is_zero(&hdr->b_refcnt));
 1524         mutex_exit(hash_lock);
 1525         return (no_callback);
 1526 }
 1527 
 1528 int
 1529 arc_buf_size(arc_buf_t *buf)
 1530 {
 1531         return (buf->b_hdr->b_size);
 1532 }
 1533 
 1534 /*
 1535  * Evict buffers from list until we've removed the specified number of
 1536  * bytes.  Move the removed buffers to the appropriate evict state.
 1537  * If the recycle flag is set, then attempt to "recycle" a buffer:
 1538  * - look for a buffer to evict that is `bytes' long.
 1539  * - return the data block from this buffer rather than freeing it.
 1540  * This flag is used by callers that are trying to make space for a
 1541  * new buffer in a full arc cache.
 1542  *
 1543  * This function makes a "best effort".  It skips over any buffers
 1544  * it can't get a hash_lock on, and so may not catch all candidates.
 1545  * It may also return without evicting as much space as requested.
 1546  */
 1547 static void *
 1548 arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle,
 1549     arc_buf_contents_t type)
 1550 {
 1551         arc_state_t *evicted_state;
 1552         uint64_t bytes_evicted = 0, skipped = 0, missed = 0;
 1553         arc_buf_hdr_t *ab, *ab_prev = NULL;
 1554         list_t *list = &state->arcs_list[type];
 1555         kmutex_t *hash_lock;
 1556         boolean_t have_lock;
 1557         void *stolen = NULL;
 1558 
 1559         ASSERT(state == arc_mru || state == arc_mfu);
 1560 
 1561         evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
 1562 
 1563         mutex_enter(&state->arcs_mtx);
 1564         mutex_enter(&evicted_state->arcs_mtx);
 1565 
 1566         for (ab = list_tail(list); ab; ab = ab_prev) {
 1567                 ab_prev = list_prev(list, ab);
 1568                 /* prefetch buffers have a minimum lifespan */
 1569                 if (HDR_IO_IN_PROGRESS(ab) ||
 1570                     (spa && ab->b_spa != spa) ||
 1571                     (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
 1572                     lbolt - ab->b_arc_access < arc_min_prefetch_lifespan)) {
 1573                         skipped++;
 1574                         continue;
 1575                 }
 1576                 /* "lookahead" for better eviction candidate */
 1577                 if (recycle && ab->b_size != bytes &&
 1578                     ab_prev && ab_prev->b_size == bytes)
 1579                         continue;
 1580                 hash_lock = HDR_LOCK(ab);
 1581                 have_lock = MUTEX_HELD(hash_lock);
 1582                 if (have_lock || mutex_tryenter(hash_lock)) {
 1583                         ASSERT3U(refcount_count(&ab->b_refcnt), ==, 0);
 1584                         ASSERT(ab->b_datacnt > 0);
 1585                         while (ab->b_buf) {
 1586                                 arc_buf_t *buf = ab->b_buf;
 1587                                 if (!rw_tryenter(&buf->b_lock, RW_WRITER)) {
 1588                                         missed += 1;
 1589                                         break;
 1590                                 }
 1591                                 if (buf->b_data) {
 1592                                         bytes_evicted += ab->b_size;
 1593                                         if (recycle && ab->b_type == type &&
 1594                                             ab->b_size == bytes &&
 1595                                             !HDR_L2_WRITING(ab)) {
 1596                                                 stolen = buf->b_data;
 1597                                                 recycle = FALSE;
 1598                                         }
 1599                                 }
 1600                                 if (buf->b_efunc) {
 1601                                         mutex_enter(&arc_eviction_mtx);
 1602                                         arc_buf_destroy(buf,
 1603                                             buf->b_data == stolen, FALSE);
 1604                                         ab->b_buf = buf->b_next;
 1605                                         buf->b_hdr = &arc_eviction_hdr;
 1606                                         buf->b_next = arc_eviction_list;
 1607                                         arc_eviction_list = buf;
 1608                                         mutex_exit(&arc_eviction_mtx);
 1609                                         rw_exit(&buf->b_lock);
 1610                                 } else {
 1611                                         rw_exit(&buf->b_lock);
 1612                                         arc_buf_destroy(buf,
 1613                                             buf->b_data == stolen, TRUE);
 1614                                 }
 1615                         }
 1616 
 1617                         if (ab->b_l2hdr) {
 1618                                 ARCSTAT_INCR(arcstat_evict_l2_cached,
 1619                                     ab->b_size);
 1620                         } else {
 1621                                 if (l2arc_write_eligible(ab->b_spa, ab)) {
 1622                                         ARCSTAT_INCR(arcstat_evict_l2_eligible,
 1623                                             ab->b_size);
 1624                                 } else {
 1625                                         ARCSTAT_INCR(
 1626                                             arcstat_evict_l2_ineligible,
 1627                                             ab->b_size);
 1628                                 }
 1629                         }
 1630 
 1631                         if (ab->b_datacnt == 0) {
 1632                                 arc_change_state(evicted_state, ab, hash_lock);
 1633                                 ASSERT(HDR_IN_HASH_TABLE(ab));
 1634                                 ab->b_flags |= ARC_IN_HASH_TABLE;
 1635                                 ab->b_flags &= ~ARC_BUF_AVAILABLE;
 1636                                 DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, ab);
 1637                         }
 1638                         if (!have_lock)
 1639                                 mutex_exit(hash_lock);
 1640                         if (bytes >= 0 && bytes_evicted >= bytes)
 1641                                 break;
 1642                 } else {
 1643                         missed += 1;
 1644                 }
 1645         }
 1646 
 1647         mutex_exit(&evicted_state->arcs_mtx);
 1648         mutex_exit(&state->arcs_mtx);
 1649 
 1650         if (bytes_evicted < bytes)
 1651                 dprintf("only evicted %lld bytes from %p",
 1652                     (longlong_t)bytes_evicted, state);
 1653 
 1654         if (skipped)
 1655                 ARCSTAT_INCR(arcstat_evict_skip, skipped);
 1656 
 1657         if (missed)
 1658                 ARCSTAT_INCR(arcstat_mutex_miss, missed);
 1659 
 1660         /*
 1661          * We have just evicted some data into the ghost state; make
 1662          * sure we also adjust the ghost state size if necessary.
 1663          */
 1664         if (arc_no_grow &&
 1665             arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size > arc_c) {
 1666                 int64_t mru_over = arc_anon->arcs_size + arc_mru->arcs_size +
 1667                     arc_mru_ghost->arcs_size - arc_c;
 1668 
 1669                 if (mru_over > 0 && arc_mru_ghost->arcs_lsize[type] > 0) {
 1670                         int64_t todelete =
 1671                             MIN(arc_mru_ghost->arcs_lsize[type], mru_over);
 1672                         arc_evict_ghost(arc_mru_ghost, 0, todelete);
 1673                 } else if (arc_mfu_ghost->arcs_lsize[type] > 0) {
 1674                         int64_t todelete = MIN(arc_mfu_ghost->arcs_lsize[type],
 1675                             arc_mru_ghost->arcs_size +
 1676                             arc_mfu_ghost->arcs_size - arc_c);
 1677                         arc_evict_ghost(arc_mfu_ghost, 0, todelete);
 1678                 }
 1679         }
 1680 
 1681         return (stolen);
 1682 }
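
The "lookahead" step in arc_evict() can be looked at in isolation. The following is a minimal sketch under stated assumptions, not the kernel code: the eviction list is modelled as a plain array of buffer sizes scanned from the tail, and `pick_recycle_candidate` is a hypothetical name introduced here. Locking, the skip conditions, and the hash table are all left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: scan sizes[] from the tail (index n - 1) toward
 * the head and return the index of the first entry the lookahead rule
 * does not pass over, or -1 if the list is empty.
 *
 * An entry is passed over when its size differs from `want` but the
 * entry just before it (closer to the head) is exactly `want` bytes.
 */
static int
pick_recycle_candidate(const uint64_t *sizes, size_t n, uint64_t want)
{
	size_t i;

	assert(sizes != NULL || n == 0);

	for (i = n; i > 0; i--) {
		size_t cur = i - 1;

		if (sizes[cur] != want && cur > 0 && sizes[cur - 1] == want)
			continue;	/* the previous entry is a better fit */
		return ((int)cur);
	}
	return (-1);
}
```

With sizes {4096, 8192, 16384} and a request for 8192 bytes, the tail entry is passed over and index 1 is chosen. When no neighbour matches, the tail itself is returned, which mirrors the "best effort" wording in the function comment.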
 1683 
 1684 /*
 1685  * Remove buffers from list until we've removed the specified number of
 1686  * bytes.  Destroy the buffers that are removed.
 1687  */
 1688 static void
 1689 arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes)
 1690 {
 1691         arc_buf_hdr_t *ab, *ab_prev;
 1692         list_t *list = &state->arcs_list[ARC_BUFC_DATA];
 1693         kmutex_t *hash_lock;
 1694         uint64_t bytes_deleted = 0;
 1695         uint64_t bufs_skipped = 0;
 1696 
 1697         ASSERT(GHOST_STATE(state));
 1698 top:
 1699         mutex_enter(&state->arcs_mtx);
 1700         for (ab = list_tail(list); ab; ab = ab_prev) {
 1701                 ab_prev = list_prev(list, ab);
 1702                 if (spa && ab->b_spa != spa)
 1703                         continue;
 1704                 hash_lock = HDR_LOCK(ab);
 1705                 if (mutex_tryenter(hash_lock)) {
 1706                         ASSERT(!HDR_IO_IN_PROGRESS(ab));
 1707                         ASSERT(ab->b_buf == NULL);
 1708                         ARCSTAT_BUMP(arcstat_deleted);
 1709                         bytes_deleted += ab->b_size;
 1710 
 1711                         if (ab->b_l2hdr != NULL) {
 1712                                 /*
 1713                                  * This buffer is cached on the 2nd Level ARC;
 1714                                  * don't destroy the header.
 1715                                  */
 1716                                 arc_change_state(arc_l2c_only, ab, hash_lock);
 1717                                 mutex_exit(hash_lock);
 1718                         } else {
 1719                                 arc_change_state(arc_anon, ab, hash_lock);
 1720                                 mutex_exit(hash_lock);
 1721                                 arc_hdr_destroy(ab);
 1722                         }
 1723 
 1724                         DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, ab);
 1725                         if (bytes >= 0 && bytes_deleted >= bytes)
 1726                                 break;
 1727                 } else {
 1728                         if (bytes < 0) {
 1729                                 mutex_exit(&state->arcs_mtx);
 1730                                 mutex_enter(hash_lock);
 1731                                 mutex_exit(hash_lock);
 1732                                 goto top;
 1733                         }
 1734                         bufs_skipped += 1;
 1735                 }
 1736         }
 1737         mutex_exit(&state->arcs_mtx);
 1738 
 1739         if (list == &state->arcs_list[ARC_BUFC_DATA] &&
 1740             (bytes < 0 || bytes_deleted < bytes)) {
 1741                 list = &state->arcs_list[ARC_BUFC_METADATA];
 1742                 goto top;
 1743         }
 1744 
 1745         if (bufs_skipped) {
 1746                 ARCSTAT_INCR(arcstat_mutex_miss, bufs_skipped);
 1747                 ASSERT(bytes >= 0);
 1748         }
 1749 
 1750         if (bytes_deleted < bytes)
 1751                 dprintf("only deleted %lld bytes from %p",
 1752                     (longlong_t)bytes_deleted, state);
 1753 }
 1754 
 1755 static void
 1756 arc_adjust(void)
 1757 {
 1758         int64_t adjustment, delta;
 1759 
 1760         /*
 1761          * Adjust MRU size
 1762          */
 1763 
 1764         adjustment = MIN(arc_size - arc_c,
 1765             arc_anon->arcs_size + arc_mru->arcs_size + arc_meta_used - arc_p);
 1766 
 1767         if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
 1768                 delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
 1769                 (void) arc_evict(arc_mru, 0, delta, FALSE, ARC_BUFC_DATA);
 1770                 adjustment -= delta;
 1771         }
 1772 
 1773         if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) {
 1774                 delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment);
 1775                 (void) arc_evict(arc_mru, 0, delta, FALSE,
 1776                     ARC_BUFC_METADATA);
 1777         }
 1778 
 1779         /*
 1780          * Adjust MFU size
 1781          */
 1782 
 1783         adjustment = arc_size - arc_c;
 1784 
 1785         if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_DATA] > 0) {
 1786                 delta = MIN(adjustment, arc_mfu->arcs_lsize[ARC_BUFC_DATA]);
 1787                 (void) arc_evict(arc_mfu, 0, delta, FALSE, ARC_BUFC_DATA);
 1788                 adjustment -= delta;
 1789         }
 1790 
 1791         if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) {
 1792                 delta = MIN(adjustment,
 1793                     arc_mfu->arcs_lsize[ARC_BUFC_METADATA]);
 1794                 (void) arc_evict(arc_mfu, 0, delta, FALSE,
 1795                     ARC_BUFC_METADATA);
 1796         }
 1797 
 1798         /*
 1799          * Adjust ghost lists
 1800          */
 1801 
 1802         adjustment = arc_mru->arcs_size + arc_mru_ghost->arcs_size - arc_c;
 1803 
 1804         if (adjustment > 0 && arc_mru_ghost->arcs_size > 0) {
 1805                 delta = MIN(arc_mru_ghost->arcs_size, adjustment);
 1806                 arc_evict_ghost(arc_mru_ghost, 0, delta);
 1807         }
 1808 
 1809         adjustment =
 1810             arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size - arc_c;
 1811 
 1812         if (adjustment > 0 && arc_mfu_ghost->arcs_size > 0) {
 1813                 delta = MIN(arc_mfu_ghost->arcs_size, adjustment);
 1814                 arc_evict_ghost(arc_mfu_ghost, 0, delta);
 1815         }
 1816 }
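
The arithmetic of the "Adjust MRU size" step in arc_adjust() can be sketched on its own. This is an illustrative sketch only: `struct arc_snapshot` and `mru_eviction_plan` are hypothetical names, the counters are taken as a signed snapshot rather than read from live state, and the result is only the amount that would be requested from arc_evict(), which may evict less.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))

/* Hypothetical snapshot of the counters the MRU step reads. */
struct arc_snapshot {
	int64_t size;		/* arc_size */
	int64_t c;		/* arc_c, target cache size */
	int64_t p;		/* arc_p, target MRU size */
	int64_t anon;		/* arc_anon->arcs_size */
	int64_t mru;		/* arc_mru->arcs_size */
	int64_t meta_used;	/* arc_meta_used */
	int64_t mru_data;	/* arc_mru->arcs_lsize[ARC_BUFC_DATA] */
	int64_t mru_meta;	/* arc_mru->arcs_lsize[ARC_BUFC_METADATA] */
};

/*
 * Compute how many bytes the MRU step would ask to remove from the
 * data list first and then from the metadata list.
 */
static void
mru_eviction_plan(const struct arc_snapshot *s, int64_t *from_data,
    int64_t *from_meta)
{
	int64_t adjustment, delta;

	assert(s != NULL && from_data != NULL && from_meta != NULL);
	*from_data = *from_meta = 0;

	adjustment = MIN(s->size - s->c,
	    s->anon + s->mru + s->meta_used - s->p);

	if (adjustment > 0 && s->mru_data > 0) {
		delta = MIN(s->mru_data, adjustment);
		*from_data = delta;
		adjustment -= delta;
	}
	if (adjustment > 0 && s->mru_meta > 0)
		*from_meta = MIN(s->mru_meta, adjustment);
}
```

Data buffers are drained before metadata, so metadata is touched only when the data list alone cannot cover the overage.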
 1817 
 1818 static void
 1819 arc_do_user_evicts(void)
 1820 {
 1821         mutex_enter(&arc_eviction_mtx);
 1822         while (arc_eviction_list != NULL) {
 1823                 arc_buf_t *buf = arc_eviction_list;
 1824                 arc_eviction_list = buf->b_next;
 1825                 rw_enter(&buf->b_lock, RW_WRITER);
 1826                 buf->b_hdr = NULL;
 1827                 rw_exit(&buf->b_lock);
 1828                 mutex_exit(&arc_eviction_mtx);
 1829 
 1830                 if (buf->b_efunc != NULL)
 1831                         VERIFY(buf->b_efunc(buf) == 0);
 1832 
 1833                 buf->b_efunc = NULL;
 1834                 buf->b_private = NULL;
 1835                 kmem_cache_free(buf_cache, buf);
 1836                 mutex_enter(&arc_eviction_mtx);
 1837         }
 1838         mutex_exit(&arc_eviction_mtx);
 1839 }
 1840 
 1841 /*
 1842  * Flush all *evictable* data from the cache for the given spa.
 1843  * NOTE: this will not touch "active" (i.e. referenced) data.
 1844  */
 1845 void
 1846 arc_flush(spa_t *spa)
 1847 {
 1848         uint64_t guid = 0;
 1849 
 1850         if (spa)
 1851                 guid = spa_guid(spa);
 1852 
 1853         while (list_head(&arc_mru->arcs_list[ARC_BUFC_DATA])) {
 1854                 (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_DATA);
 1855                 if (spa)
 1856                         break;
 1857         }
 1858         while (list_head(&arc_mru->arcs_list[ARC_BUFC_METADATA])) {
 1859                 (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_METADATA);
 1860                 if (spa)
 1861                         break;
 1862         }
 1863         while (list_head(&arc_mfu->arcs_list[ARC_BUFC_DATA])) {
 1864                 (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_DATA);
 1865                 if (spa)
 1866                         break;
 1867         }
 1868         while (list_head(&arc_mfu->arcs_list[ARC_BUFC_METADATA])) {
 1869                 (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_METADATA);
 1870                 if (spa)
 1871                         break;
 1872         }
 1873 
 1874         arc_evict_ghost(arc_mru_ghost, guid, -1);
 1875         arc_evict_ghost(arc_mfu_ghost, guid, -1);
 1876 
 1877         mutex_enter(&arc_reclaim_thr_lock);
 1878         arc_do_user_evicts();
 1879         mutex_exit(&arc_reclaim_thr_lock);
 1880         ASSERT(spa || arc_eviction_list == NULL);
 1881 }
 1882 
 1883 void
 1884 arc_shrink(void)
 1885 {
 1886         if (arc_c > arc_c_min) {
 1887                 uint64_t to_free;
 1888 
 1889 #ifdef _KERNEL
 1890                 to_free = MAX(arc_c >> arc_shrink_shift, ptob(needfree));
 1891 #else
 1892                 to_free = arc_c >> arc_shrink_shift;
 1893 #endif
 1894                 if (arc_c > arc_c_min + to_free)
 1895                         atomic_add_64(&arc_c, -to_free);
 1896                 else
 1897                         arc_c = arc_c_min;
 1898 
 1899                 atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
 1900                 if (arc_c > arc_size)
 1901                         arc_c = MAX(arc_size, arc_c_min);
 1902                 if (arc_p > arc_c)
 1903                         arc_p = (arc_c >> 1);
 1904                 ASSERT(arc_c >= arc_c_min);
 1905                 ASSERT((int64_t)arc_p >= 0);
 1906         }
 1907 
 1908         if (arc_size > arc_c)
 1909                 arc_adjust();
 1910 }
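
The way arc_shrink() lowers the target size can be sketched as a pure function. This is a hedged illustration, not the kernel routine: `shrink_target` is a hypothetical name, the atomics are dropped, and the matching adjustment of arc_p is left out. The `shift` argument stands in for arc_shrink_shift and `need_bytes` for ptob(needfree).

```c
#include <assert.h>
#include <stdint.h>

#define	MAX(a, b)	((a) > (b) ? (a) : (b))

/*
 * Return the new target size after one shrink step.  Pass 0 for
 * need_bytes to model the userland (!_KERNEL) variant.
 */
static uint64_t
shrink_target(uint64_t c, uint64_t c_min, uint64_t size, int shift,
    uint64_t need_bytes)
{
	uint64_t to_free;

	assert(shift >= 0 && shift < 64);
	if (c <= c_min)
		return (c);		/* already at the floor */

	to_free = MAX(c >> shift, need_bytes);
	if (c > c_min + to_free)
		c -= to_free;
	else
		c = c_min;

	/* Never leave the target above what is actually cached. */
	if (c > size)
		c = MAX(size, c_min);
	return (c);
}
```

For example, with a target of 3200, a floor of 1000, 4000 bytes cached, and a shift of 5, one step frees 3200 >> 5 = 100 and the new target is 3100.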
 1911 
 1912 static int
 1913 arc_reclaim_needed(void)
 1914 {
 1915         uint64_t extra;
 1916 
 1917 #ifdef _KERNEL
 1918 
 1919         if (needfree)
 1920                 return (1);
 1921 
 1922         /*
 1923          * take 'desfree' extra pages, so we reclaim sooner, rather than later
 1924          */
 1925         extra = desfree;
 1926 
 1927         /*
 1928          * check that we're out of range of the pageout scanner.  It starts to
 1929          * schedule paging if freemem is less than lotsfree plus needfree.
 1930          * lotsfree is the high-water mark for pageout, and needfree is the
 1931          * number of needed free pages.  We add extra pages here to make sure
 1932          * the scanner doesn't start up while we're freeing memory.
 1933          */
 1934         if (freemem < lotsfree + needfree + extra)
 1935                 return (1);
 1936 
 1937         /*
 1938          * check to make sure that swapfs has enough space so that anon
 1939          * reservations can still succeed. anon_resvmem() checks that the
 1940          * availrmem is greater than swapfs_minfree, and the number of reserved
 1941          * swap pages.  We also add a bit of extra here just to prevent
 1942          * circumstances from getting really dire.
 1943          */
 1944         if (availrmem < swapfs_minfree + swapfs_reserve + extra)
 1945                 return (1);
 1946 
 1947 #if defined(__i386)
 1948         /*
 1949          * If we're on an i386 platform, it's possible that we'll exhaust the
 1950          * kernel heap space before we ever run out of available physical
 1951          * memory.  Most checks of the size of the heap_area compare against
 1952          * tune.t_minarmem, which is the minimum available real memory that we
 1953          * can have in the system.  However, this is generally fixed at 25 pages
 1954          * which is so low that it's useless.  In this comparison, we seek to
 1955          * calculate the total heap-size, and reclaim if more than 3/4ths of the
 1956          * heap is allocated.  (Or, in the calculation, if less than 1/4th is
 1957          * free)
 1958          */
 1959         if (btop(vmem_size(heap_arena, VMEM_FREE)) <
 1960             (btop(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC)) >> 2))
 1961                 return (1);
 1962 #endif
 1963 
 1964 #else
 1965         if (spa_get_random(100) == 0)
 1966                 return (1);
 1967 #endif
 1968         return (0);
 1969 }
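
The two platform-independent checks in the _KERNEL branch of arc_reclaim_needed() can be sketched against a snapshot of the VM counters. This is a sketch under assumptions: `struct vm_snapshot` and `reclaim_needed` are hypothetical names, every field is a page count, and the i386 heap check is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the VM counters, all in pages. */
struct vm_snapshot {
	uint64_t needfree;
	uint64_t desfree;
	uint64_t lotsfree;
	uint64_t freemem;
	uint64_t availrmem;
	uint64_t swapfs_minfree;
	uint64_t swapfs_reserve;
};

static bool
reclaim_needed(const struct vm_snapshot *vm)
{
	uint64_t extra;

	assert(vm != NULL);
	if (vm->needfree != 0)
		return (true);	/* pages are already being waited for */

	extra = vm->desfree;	/* headroom so that reclaim starts early */

	/* Stay clear of the range where the pageout scanner starts. */
	if (vm->freemem < vm->lotsfree + vm->needfree + extra)
		return (true);

	/* Keep enough memory for anon reservations to succeed. */
	if (vm->availrmem < vm->swapfs_minfree + vm->swapfs_reserve + extra)
		return (true);

	return (false);
}
```

With lotsfree at 200 pages and desfree at 100, reclaim is reported as needed as soon as freemem drops below 300, before the scanner's own threshold is reached.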
 1970 
 1971 static void
 1972 arc_kmem_reap_now(arc_reclaim_strategy_t strat)
 1973 {
 1974         size_t                  i;
 1975         kmem_cache_t            *prev_cache = NULL;
 1976         kmem_cache_t            *prev_data_cache = NULL;
 1977         extern kmem_cache_t     *zio_buf_cache[];
 1978         extern kmem_cache_t     *zio_data_buf_cache[];
 1979 
 1980 #ifdef _KERNEL
 1981         if (arc_meta_used >= arc_meta_limit) {
 1982                 /*
 1983                  * We are exceeding our meta-data cache limit.
 1984                  * Purge some DNLC entries to release holds on meta-data.
 1985                  */
 1986                 dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
 1987         }
 1988 #if defined(__i386)
 1989         /*
 1990          * Reclaim unused memory from all kmem caches.
 1991          */
 1992         kmem_reap();
 1993 #endif
 1994 #endif
 1995 
 1996         /*
 1997          * An aggressive reclamation will shrink the cache size as well as
 1998          * reap free buffers from the arc kmem caches.
 1999          */
 2000         if (strat == ARC_RECLAIM_AGGR)
 2001                 arc_shrink();
 2002 
 2003         for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
 2004                 if (zio_buf_cache[i] != prev_cache) {
 2005                         prev_cache = zio_buf_cache[i];
 2006                         kmem_cache_reap_now(zio_buf_cache[i]);
 2007                 }
 2008                 if (zio_data_buf_cache[i] != prev_data_cache) {
 2009                         prev_data_cache = zio_data_buf_cache[i];
 2010                         kmem_cache_reap_now(zio_data_buf_cache[i]);
 2011                 }
 2012         }
 2013         kmem_cache_reap_now(buf_cache);
 2014         kmem_cache_reap_now(hdr_cache);
 2015 }
 2016 
 2017 static void
 2018 arc_reclaim_thread(void)
 2019 {
 2020         clock_t                 growtime = 0;
 2021         arc_reclaim_strategy_t  last_reclaim = ARC_RECLAIM_CONS;
 2022         callb_cpr_t             cpr;
 2023 
 2024         CALLB_CPR_INIT(&cpr, &arc_reclaim_thr_lock, callb_generic_cpr, FTAG);
 2025 
 2026         mutex_enter(&arc_reclaim_thr_lock);
 2027         while (arc_thread_exit == 0) {
 2028                 if (arc_reclaim_needed()) {
 2029 
 2030                         if (arc_no_grow) {
 2031                                 if (last_reclaim == ARC_RECLAIM_CONS) {
 2032                                         last_reclaim = ARC_RECLAIM_AGGR;
 2033                                 } else {
 2034                                         last_reclaim = ARC_RECLAIM_CONS;
 2035                                 }
 2036                         } else {
 2037                                 arc_no_grow = TRUE;
 2038                                 last_reclaim = ARC_RECLAIM_AGGR;
 2039                                 membar_producer();
 2040                         }
 2041 
 2042                         /* reset the growth delay for every reclaim */
 2043                         growtime = lbolt + (arc_grow_retry * hz);
 2044 
 2045                         arc_kmem_reap_now(last_reclaim);
 2046                         arc_warm = B_TRUE;
 2047 
 2048                 } else if (arc_no_grow && lbolt >= growtime) {
 2049                         arc_no_grow = FALSE;
 2050                 }
 2051 
 2052                 if (2 * arc_c < arc_size +
 2053                     arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size)
 2054                         arc_adjust();
 2055 
 2056                 if (arc_eviction_list != NULL)
 2057                         arc_do_user_evicts();
 2058 
 2059                 /* block until needed, or one second, whichever is shorter */
 2060                 CALLB_CPR_SAFE_BEGIN(&cpr);
 2061                 (void) cv_timedwait(&arc_reclaim_thr_cv,
 2062                     &arc_reclaim_thr_lock, (lbolt + hz));
 2063                 CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_thr_lock);
 2064         }
 2065 
 2066         arc_thread_exit = 0;
 2067         cv_broadcast(&arc_reclaim_thr_cv);
 2068         CALLB_CPR_EXIT(&cpr);           /* drops arc_reclaim_thr_lock */
 2069         thread_exit();
 2070 }
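
The strategy selection inside the arc_reclaim_thread() loop is a small state machine, sketched here in isolation. The names `reclaim_strategy_t`, `RECLAIM_AGGR`, `RECLAIM_CONS`, and `next_strategy` are hypothetical stand-ins for the kernel's types; the memory barrier, the growth timer, and the condition-variable wait are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { RECLAIM_AGGR, RECLAIM_CONS } reclaim_strategy_t;

/*
 * Pick the strategy for this reclaim pass.  *no_grow mirrors
 * arc_no_grow and is set on the first pass of a reclaim episode.
 */
static reclaim_strategy_t
next_strategy(bool *no_grow, reclaim_strategy_t last)
{
	assert(no_grow != NULL);

	if (*no_grow) {
		/* Already reclaiming: alternate between the two modes. */
		return (last == RECLAIM_CONS ? RECLAIM_AGGR : RECLAIM_CONS);
	}

	/* First pass after a quiet period: stop growth, be aggressive. */
	*no_grow = true;
	return (RECLAIM_AGGR);
}
```

Under sustained pressure the sequence is aggressive, conservative, aggressive, and so on, so the cache target is shrunk on every other pass rather than on every pass.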
 2071 
 2072 /*
 2073  * Adapt arc info given the number of bytes we are trying to add and
 2074  * the state that we are coming from.  This function is only called
 2075  * when we are adding new content to the cache.
 2076  */
 2077 static void
 2078 arc_adapt(int bytes, arc_state_t *state)
 2079 {
 2080         int mult;
 2081         uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
 2082 
 2083         if (state == arc_l2c_only)
 2084                 return;
 2085 
 2086         ASSERT(bytes > 0);
 2087         /*
 2088          * Adapt the target size of the MRU list:
 2089          *      - if we just hit in the MRU ghost list, then increase
 2090          *        the target size of the MRU list.
 2091          *      - if we just hit in the MFU ghost list, then increase
 2092          *        the target size of the MFU list by decreasing the
 2093          *        target size of the MRU list.
 2094          */
 2095         if (state == arc_mru_ghost) {
 2096                 mult = ((arc_mru_ghost->arcs_size >= arc_mfu_ghost->arcs_size) ?
 2097                     1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size));
 2098 
 2099                 arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
 2100         } else if (state == arc_mfu_ghost) {
 2101                 uint64_t delta;
 2102 
 2103                 mult = ((arc_mfu_ghost->arcs_size >= arc_mru_ghost->arcs_size) ?
 2104                     1 : (arc_mru_ghost->arcs_size/arc_mfu_ghost->arcs_size));
 2105 
 2106                 delta = MIN(bytes * mult, arc_p);
 2107                 arc_p = MAX(arc_p_min, arc_p - delta);
 2108         }
 2109         ASSERT((int64_t)arc_p >= 0);
 2110 
 2111         if (arc_reclaim_needed()) {
 2112                 cv_signal(&arc_reclaim_thr_cv);
 2113                 return;
 2114         }
 2115 
 2116         if (arc_no_grow)
 2117                 return;
 2118 
 2119         if (arc_c >= arc_c_max)
 2120                 return;
 2121 
 2122         /*
 2123          * If we're within (2 * maxblocksize) bytes of the target
 2124          * cache size, increment the target cache size
 2125          */
 2126         if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
 2127                 atomic_add_64(&arc_c, (int64_t)bytes);
 2128                 if (arc_c > arc_c_max)
 2129                         arc_c = arc_c_max;
 2130                 else if (state == arc_anon)
 2131                         atomic_add_64(&arc_p, (int64_t)bytes);
 2132                 if (arc_p > arc_c)
 2133                         arc_p = arc_c;
 2134         }
 2135         ASSERT((int64_t)arc_p >= 0);
 2136 }
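
The ghost-hit adaptation of the MRU target in arc_adapt() can be sketched as a pure function. This is an illustration under assumptions, not the kernel code: `adapt_p` and `ghost_hit_t` are hypothetical names, the ghost list sizes are passed in as plain numbers, and the later growth of the overall target size is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))
#define	MAX(a, b)	((a) > (b) ? (a) : (b))

typedef enum { HIT_MRU_GHOST, HIT_MFU_GHOST } ghost_hit_t;

/*
 * Return the new MRU target after a hit of `bytes` in one ghost list.
 * The step is scaled up when the list that was hit is the smaller one.
 */
static uint64_t
adapt_p(uint64_t p, uint64_t c, uint64_t p_min, uint64_t bytes,
    uint64_t mru_ghost, uint64_t mfu_ghost, ghost_hit_t hit)
{
	uint64_t mult, delta;

	assert(bytes > 0 && c >= p_min);

	if (hit == HIT_MRU_GHOST) {
		assert(mru_ghost > 0);
		mult = (mru_ghost >= mfu_ghost) ? 1 : mfu_ghost / mru_ghost;
		/* Grow the MRU target, capped at c - p_min. */
		return (MIN(c - p_min, p + bytes * mult));
	}

	assert(mfu_ghost > 0);
	mult = (mfu_ghost >= mru_ghost) ? 1 : mru_ghost / mfu_ghost;
	/* Shrink the MRU target, floored at p_min. */
	delta = MIN(bytes * mult, p);
	return (MAX(p_min, p - delta));
}
```

With an MRU ghost list of 100 and an MFU ghost list of 300, a 10-byte MRU ghost hit moves the target by 30, while the same hit in the MFU ghost list moves it back by only 10.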
 2137 
 2138 /*
 2139  * Check if the cache has reached its limits and eviction is required
 2140  * prior to insert.
 2141  */
 2142 static int
 2143 arc_evict_needed(arc_buf_contents_t type)
 2144 {
 2145         if (type == ARC_BUFC_METADATA && arc_meta_used >= arc_meta_limit)
 2146                 return (1);
 2147 
 2148 #ifdef _KERNEL
 2149         /*
 2150          * If zio data pages are being allocated out of a separate heap segment,
 2151          * then enforce that the size of available vmem for this area remains
 2152          * above about 1/32nd free.
 2153          */
 2154         if (type == ARC_BUFC_DATA && zio_arena != NULL &&
 2155             vmem_size(zio_arena, VMEM_FREE) <
 2156             (vmem_size(zio_arena, VMEM_ALLOC) >> 5))
 2157                 return (1);
 2158 #endif
 2159 
 2160         if (arc_reclaim_needed())
 2161                 return (1);
 2162 
 2163         return (arc_size > arc_c);
 2164 }
 2165 
 2166 /*
 2167  * The buffer, supplied as the first argument, needs a data block.
 2168  * So, if we are at cache max, determine which cache should be victimized.
 2169  * We have the following cases:
 2170  *
 2171  * 1. Insert for MRU, p > sizeof(arc_anon + arc_mru) ->
 2172  * In this situation if we're out of space, but the resident size of the MFU is
 2173  * under the limit, victimize the MFU cache to satisfy this insertion request.
 2174  *
 2175  * 2. Insert for MRU, p <= sizeof(arc_anon + arc_mru) ->
 2176  * Here, we've used up all of the available space for the MRU, so we need to
 2177  * evict from our own cache instead.  Evict from the set of resident MRU
 2178  * entries.
 2179  *
 2180  * 3. Insert for MFU (c - p) > sizeof(arc_mfu) ->
 2181  * c minus p represents the MFU space in the cache, since p is the size of the
 2182  * cache that is dedicated to the MRU.  In this situation there's still space on
 2183  * the MFU side, so the MRU side needs to be victimized.
 2184  *
 2185  * 4. Insert for MFU (c - p) <= sizeof(arc_mfu) ->
 2186  * MFU's resident set is consuming more space than it has been allotted.  In
 2187  * this situation, we must victimize our own cache, the MFU, for this insertion.
 2188  */
 2189 static void
 2190 arc_get_data_buf(arc_buf_t *buf)
 2191 {
 2192         arc_state_t             *state = buf->b_hdr->b_state;
 2193         uint64_t                size = buf->b_hdr->b_size;
 2194         arc_buf_contents_t      type = buf->b_hdr->b_type;
 2195 
 2196         arc_adapt(size, state);
 2197 
 2198         /*
 2199          * We have not yet reached cache maximum size,
 2200          * just allocate a new buffer.
 2201          */
 2202         if (!arc_evict_needed(type)) {
 2203                 if (type == ARC_BUFC_METADATA) {
 2204                         buf->b_data = zio_buf_alloc(size);
 2205                         arc_space_consume(size, ARC_SPACE_DATA);
 2206                 } else {
 2207                         ASSERT(type == ARC_BUFC_DATA);
 2208                         buf->b_data = zio_data_buf_alloc(size);
 2209                         ARCSTAT_INCR(arcstat_data_size, size);
 2210                         atomic_add_64(&arc_size, size);
 2211                 }
 2212                 goto out;
 2213         }
 2214 
 2215         /*
 2216          * If we are prefetching from the mfu ghost list, this buffer
 2217          * will end up on the mru list; so steal space from there.
 2218          */
 2219         if (state == arc_mfu_ghost)
 2220                 state = buf->b_hdr->b_flags & ARC_PREFETCH ? arc_mru : arc_mfu;
 2221         else if (state == arc_mru_ghost)
 2222                 state = arc_mru;
 2223 
 2224         if (state == arc_mru || state == arc_anon) {
 2225                 uint64_t mru_used = arc_anon->arcs_size + arc_mru->arcs_size;
 2226                 state = (arc_mfu->arcs_lsize[type] >= size &&
 2227                     arc_p > mru_used) ? arc_mfu : arc_mru;
 2228         } else {
 2229                 /* MFU cases */
 2230                 uint64_t mfu_space = arc_c - arc_p;
 2231                 state =  (arc_mru->arcs_lsize[type] >= size &&
 2232                     mfu_space > arc_mfu->arcs_size) ? arc_mru : arc_mfu;
 2233         }
 2234         if ((buf->b_data = arc_evict(state, NULL, size, TRUE, type)) == NULL) {
 2235                 if (type == ARC_BUFC_METADATA) {
 2236                         buf->b_data = zio_buf_alloc(size);
 2237                         arc_space_consume(size, ARC_SPACE_DATA);
 2238                 } else {
 2239                         ASSERT(type == ARC_BUFC_DATA);
 2240                         buf->b_data = zio_data_buf_alloc(size);
 2241                         ARCSTAT_INCR(arcstat_data_size, size);
 2242                         atomic_add_64(&arc_size, size);
 2243                 }
 2244                 ARCSTAT_BUMP(arcstat_recycle_miss);
 2245         }
 2246         ASSERT(buf->b_data != NULL);
 2247 out:
 2248         /*
 2249          * Update the state size.  Note that ghost states have a
 2250          * "ghost size" and so don't need to be updated.
 2251          */
 2252         if (!GHOST_STATE(buf->b_hdr->b_state)) {
 2253                 arc_buf_hdr_t *hdr = buf->b_hdr;
 2254 
 2255                 atomic_add_64(&hdr->b_state->arcs_size, size);
 2256                 if (list_link_active(&hdr->b_arc_node)) {
 2257                         ASSERT(refcount_is_zero(&hdr->b_refcnt));
 2258                         atomic_add_64(&hdr->b_state->arcs_lsize[type], size);
 2259                 }
 2260                 /*
 2261                  * If we are growing the cache, and we are adding anonymous
 2262                  * data, and we have outgrown arc_p, update arc_p
 2263                  */
 2264                 if (arc_size < arc_c && hdr->b_state == arc_anon &&
 2265                     arc_anon->arcs_size + arc_mru->arcs_size > arc_p)
 2266                         arc_p = MIN(arc_c, arc_p + size);
 2267         }
 2268 }
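
/*
 * Illustrative sketch only; not part of arc.c.  It restates, under
 * simplifying assumptions, the choice arc_get_data_buf() makes about
 * which resident list to recycle space from once eviction is needed
 * (cases 1-4 in the comment above).  The sketch_* names and the
 * flattened struct are hypothetical stand-ins for the arc_state_t
 * fields and the arc_c / arc_p globals; the remapping of ghost states
 * to MRU or MFU is assumed to have been applied by the caller.
 */
#include <stdint.h>

enum sketch_list { SKETCH_MRU, SKETCH_MFU };

struct sketch_sizes {
        uint64_t c;             /* target cache size (arc_c) */
        uint64_t p;             /* share of c given to the MRU (arc_p) */
        uint64_t anon_size;     /* bytes held in the anonymous state */
        uint64_t mru_size;      /* bytes resident in the MRU state */
        uint64_t mfu_size;      /* bytes resident in the MFU state */
        uint64_t mru_lsize;     /* evictable MRU bytes of this type */
        uint64_t mfu_lsize;     /* evictable MFU bytes of this type */
};

/* Return the list that should give up `size' bytes for the insertion. */
static enum sketch_list
sketch_pick_victim(const struct sketch_sizes *s, enum sketch_list target,
    uint64_t size)
{
        if (target == SKETCH_MRU) {
                /* MRU still under its share: take from the MFU if it can pay */
                uint64_t mru_used = s->anon_size + s->mru_size;

                return ((s->mfu_lsize >= size && s->p > mru_used) ?
                    SKETCH_MFU : SKETCH_MRU);
        } else {
                /* MFU still under its share: take from the MRU if it can pay */
                uint64_t mfu_space = s->c - s->p;

                return ((s->mru_lsize >= size && mfu_space > s->mfu_size) ?
                    SKETCH_MRU : SKETCH_MFU);
        }
}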
 2269 
 2270 /*
 2271  * This routine is called whenever a buffer is accessed.
 2272  * NOTE: the hash lock is dropped in this function.
 2273  */
 2274 static void
 2275 arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock)
 2276 {
 2277         ASSERT(MUTEX_HELD(hash_lock));
 2278 
 2279         if (buf->b_state == arc_anon) {
 2280                 /*
 2281                  * This buffer is not in the cache, and does not
 2282                  * appear in our "ghost" list.  Add the new buffer
 2283                  * to the MRU state.
 2284                  */
 2285 
 2286                 ASSERT(buf->b_arc_access == 0);
 2287                 buf->b_arc_access = lbolt;
 2288                 DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
 2289                 arc_change_state(arc_mru, buf, hash_lock);
 2290 
 2291         } else if (buf->b_state == arc_mru) {
 2292                 /*
 2293                  * If this buffer is here because of a prefetch, then either:
 2294                  * - clear the flag if this is a "referencing" read
 2295                  *   (any subsequent access will bump this into the MFU state).
 2296                  * or
 2297                  * - move the buffer to the head of the list if this is
 2298                  *   another prefetch (to make it less likely to be evicted).
 2299                  */
 2300                 if ((buf->b_flags & ARC_PREFETCH) != 0) {
 2301                         if (refcount_count(&buf->b_refcnt) == 0) {
 2302                                 ASSERT(list_link_active(&buf->b_arc_node));
 2303                         } else {
 2304                                 buf->b_flags &= ~ARC_PREFETCH;
 2305                                 ARCSTAT_BUMP(arcstat_mru_hits);
 2306                         }
 2307                         buf->b_arc_access = lbolt;
 2308                         return;
 2309                 }
 2310 
 2311                 /*
 2312                  * This buffer has been "accessed" only once so far,
 2313                  * but it is still in the cache. Move it to the MFU
 2314                  * state.
 2315                  */
 2316                 if (lbolt > buf->b_arc_access + ARC_MINTIME) {
 2317                         /*
 2318                          * More than ARC_MINTIME ticks have passed since we
 2319                          * instantiated this buffer.  Move it to the
 2320                          * most frequently used state.
 2321                          */
 2322                         buf->b_arc_access = lbolt;
 2323                         DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
 2324                         arc_change_state(arc_mfu, buf, hash_lock);
 2325                 }
 2326                 ARCSTAT_BUMP(arcstat_mru_hits);
 2327         } else if (buf->b_state == arc_mru_ghost) {
 2328                 arc_state_t     *new_state;
 2329                 /*
 2330                  * This buffer was "accessed" recently but has been evicted
 2331                  * from the cache.  Move it to the MFU state, or to the MRU
 2332                  * state if this access is a prefetch.
 2333                  */
 2334 
 2335                 if (buf->b_flags & ARC_PREFETCH) {
 2336                         new_state = arc_mru;
 2337                         if (refcount_count(&buf->b_refcnt) > 0)
 2338                                 buf->b_flags &= ~ARC_PREFETCH;
 2339                         DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
 2340                 } else {
 2341                         new_state = arc_mfu;
 2342                         DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
 2343                 }
 2344 
 2345                 buf->b_arc_access = lbolt;
 2346                 arc_change_state(new_state, buf, hash_lock);
 2347 
 2348                 ARCSTAT_BUMP(arcstat_mru_ghost_hits);
 2349         } else if (buf->b_state == arc_mfu) {
 2350                 /*
 2351                  * This buffer has been accessed more than once and is
 2352                  * still in the cache.  Keep it in the MFU state.
 2353                  *
 2354                  * NOTE: an add_reference() that occurred when we did
 2355                  * the arc_read() will have kicked this off the list.
 2356                  * If it was a prefetch, we will explicitly move it to
 2357                  * the head of the list now.
 2358                  */
 2359                 if ((buf->b_flags & ARC_PREFETCH) != 0) {
 2360                         ASSERT(refcount_count(&buf->b_refcnt) == 0);
 2361                         ASSERT(list_link_active(&buf->b_arc_node));
 2362                 }
 2363                 ARCSTAT_BUMP(arcstat_mfu_hits);
 2364                 buf->b_arc_access = lbolt;
 2365         } else if (buf->b_state == arc_mfu_ghost) {
 2366                 arc_state_t     *new_state = arc_mfu;
 2367                 /*
 2368                  * This buffer has been accessed more than once but has
 2369                  * been evicted from the cache.  Move it back to the
 2370                  * MFU state.
 2371                  */
 2372 
 2373                 if (buf->b_flags & ARC_PREFETCH) {
 2374                         /*
 2375                          * This is a prefetch access...
 2376                          * move this block back to the MRU state.
 2377                          */
 2378                         ASSERT3U(refcount_count(&buf->b_refcnt), ==, 0);
 2379                         new_state = arc_mru;
 2380                 }
 2381 
 2382                 buf->b_arc_access = lbolt;
 2383                 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
 2384                 arc_change_state(new_state, buf, hash_lock);
 2385 
 2386                 ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
 2387         } else if (buf->b_state == arc_l2c_only) {
 2388                 /*
 2389                  * This buffer is on the 2nd Level ARC.
 2390                  */
 2391 
 2392                 buf->b_arc_access = lbolt;
 2393                 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
 2394                 arc_change_state(arc_mfu, buf, hash_lock);
 2395         } else {
 2396                 ASSERT(!"invalid arc state");
 2397         }
 2398 }
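
/*
 * Illustrative sketch only; not part of arc.c.  It condenses the state
 * transitions performed by arc_access() into a pure function.  The
 * sketch_* names are hypothetical; `prefetch' stands for the
 * ARC_PREFETCH flag on the header and `past_mintime' for the test
 * lbolt > b_arc_access + ARC_MINTIME.  Timestamp updates, statistics
 * and reference-count handling are deliberately left out.
 */
enum sketch_state {
        SK_ANON, SK_MRU, SK_MRU_GHOST, SK_MFU, SK_MFU_GHOST, SK_L2C_ONLY
};

static enum sketch_state
sketch_next_state(enum sketch_state cur, int prefetch, int past_mintime)
{
        switch (cur) {
        case SK_ANON:
                /* first access: enter the cache on the MRU side */
                return (SK_MRU);
        case SK_MRU:
                /* prefetched buffers stay put; others need a later access */
                if (prefetch)
                        return (SK_MRU);
                return (past_mintime ? SK_MFU : SK_MRU);
        case SK_MRU_GHOST:
        case SK_MFU_GHOST:
                /* ghost hit: a prefetch goes to MRU, a demand read to MFU */
                return (prefetch ? SK_MRU : SK_MFU);
        case SK_MFU:
        case SK_L2C_ONLY:
                return (SK_MFU);
        }
        return (cur);
}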
 2399 
 2400 /* a generic arc_done_func_t which you can use */
 2401 /* ARGSUSED */
 2402 void
 2403 arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
 2404 {
 2405         bcopy(buf->b_data, arg, buf->b_hdr->b_size);
 2406         VERIFY(arc_buf_remove_ref(buf, arg) == 1);
 2407 }
 2408 
 2409 /* a generic arc_done_func_t */
 2410 void
 2411 arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
 2412 {
 2413         arc_buf_t **bufp = arg;
 2414         if (zio && zio->io_error) {
 2415                 VERIFY(arc_buf_remove_ref(buf, arg) == 1);
 2416                 *bufp = NULL;
 2417         } else {
 2418                 *bufp = buf;
 2419         }
 2420 }
 2421 
 2422 static void
 2423 arc_read_done(zio_t *zio)
 2424 {
 2425         arc_buf_hdr_t   *hdr, *found;
 2426         arc_buf_t       *buf;
 2427         arc_buf_t       *abuf;  /* buffer we're assigning to callback */
 2428         kmutex_t        *hash_lock;
 2429         arc_callback_t  *callback_list, *acb;
 2430         int             freeable = FALSE;
 2431 
 2432         buf = zio->io_private;
 2433         hdr = buf->b_hdr;
 2434 
 2435         /*
 2436          * The hdr was inserted into hash-table and removed from lists
 2437          * prior to starting I/O.  We should find this header, since
 2438          * it's in the hash table, and it should be legit since it's
 2439          * not possible to evict it during the I/O.  The only possible
 2440          * reason for it not to be found is if we were freed during the
 2441          * read.
 2442          */
 2443         found = buf_hash_find(hdr->b_spa, &hdr->b_dva, hdr->b_birth,
 2444             &hash_lock);
 2445 
 2446         ASSERT((found == NULL && HDR_FREED_IN_READ(hdr) && hash_lock == NULL) ||
 2447             (found == hdr && DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
 2448             (found == hdr && HDR_L2_READING(hdr)));
 2449 
 2450         hdr->b_flags &= ~ARC_L2_EVICTED;
 2451         if (l2arc_noprefetch && (hdr->b_flags & ARC_PREFETCH))
 2452                 hdr->b_flags &= ~ARC_L2CACHE;
 2453 
 2454         /* byteswap if necessary */
 2455         callback_list = hdr->b_acb;
 2456         ASSERT(callback_list != NULL);
 2457         if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
 2458                 arc_byteswap_func_t *func = BP_GET_LEVEL(zio->io_bp) > 0 ?
 2459                     byteswap_uint64_array :
 2460                     dmu_ot[BP_GET_TYPE(zio->io_bp)].ot_byteswap;
 2461                 func(buf->b_data, hdr->b_size);
 2462         }
 2463 
 2464         arc_cksum_compute(buf, B_FALSE);
 2465 
 2466         /* create copies of the data buffer for the callers */
 2467         abuf = buf;
 2468         for (acb = callback_list; acb; acb = acb->acb_next) {
 2469                 if (acb->acb_done) {
 2470                         if (abuf == NULL)
 2471                                 abuf = arc_buf_clone(buf);
 2472                         acb->acb_buf = abuf;
 2473                         abuf = NULL;
 2474                 }
 2475         }
 2476         hdr->b_acb = NULL;
 2477         hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
 2478         ASSERT(!HDR_BUF_AVAILABLE(hdr));
 2479         if (abuf == buf)
 2480                 hdr->b_flags |= ARC_BUF_AVAILABLE;
 2481 
 2482         ASSERT(refcount_is_zero(&hdr->b_refcnt) || callback_list != NULL);
 2483 
 2484         if (zio->io_error != 0) {
 2485                 hdr->b_flags |= ARC_IO_ERROR;
 2486                 if (hdr->b_state != arc_anon)
 2487                         arc_change_state(arc_anon, hdr, hash_lock);
 2488                 if (HDR_IN_HASH_TABLE(hdr))
 2489                         buf_hash_remove(hdr);
 2490                 freeable = refcount_is_zero(&hdr->b_refcnt);
 2491         }
 2492 
 2493         /*
 2494          * Broadcast before we drop the hash_lock to avoid the possibility
 2495          * that the hdr (and hence the cv) might be freed before we get to
 2496          * the cv_broadcast().
 2497          */
 2498         cv_broadcast(&hdr->b_cv);
 2499 
 2500         if (hash_lock) {
 2501                 /*
 2502                  * Only call arc_access on anonymous buffers.  This is because
 2503                  * if we've issued an I/O for an evicted buffer, we've already
 2504                  * called arc_access (to prevent any simultaneous readers from
 2505                  * getting confused).
 2506                  */
 2507                 if (zio->io_error == 0 && hdr->b_state == arc_anon)
 2508                         arc_access(hdr, hash_lock);
 2509                 mutex_exit(hash_lock);
 2510         } else {
 2511                 /*
 2512                  * This block was freed while we waited for the read to
 2513                  * complete.  It has been removed from the hash table and
 2514                  * moved to the anonymous state (so that it won't show up
 2515                  * in the cache).
 2516                  */
 2517                 ASSERT3P(hdr->b_state, ==, arc_anon);
 2518                 freeable = refcount_is_zero(&hdr->b_refcnt);
 2519         }
 2520 
 2521         /* execute each callback and free its structure */
 2522         while ((acb = callback_list) != NULL) {
 2523                 if (acb->acb_done)
 2524                         acb->acb_done(zio, acb->acb_buf, acb->acb_private);
 2525 
 2526                 if (acb->acb_zio_dummy != NULL) {
 2527                         acb->acb_zio_dummy->io_error = zio->io_error;
 2528                         zio_nowait(acb->acb_zio_dummy);
 2529                 }
 2530 
 2531                 callback_list = acb->acb_next;
 2532                 kmem_free(acb, sizeof (arc_callback_t));
 2533         }
 2534 
 2535         if (freeable)
 2536                 arc_hdr_destroy(hdr);
 2537 }
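
/*
 * Illustrative sketch only; not part of arc.c.  It models how
 * arc_read_done() hands out buffers: the first callback with a "done"
 * function receives the original buffer, each later one receives a
 * clone, and callbacks without a "done" function receive nothing.
 * Buffers are represented by small integers (1 = original, 2 and up =
 * clones); the sketch_* names and this numbering are hypothetical.
 */
#include <stddef.h>

struct sketch_cb {
        struct sketch_cb *next;
        int has_done;   /* nonzero if the caller supplied a "done" func */
        int buf_id;     /* 0 = none, 1 = original, >1 = clone */
};

/*
 * Returns nonzero if nobody claimed the original buffer, which is the
 * case in which arc_read_done() sets ARC_BUF_AVAILABLE.
 */
static int
sketch_assign_bufs(struct sketch_cb *list)
{
        struct sketch_cb *cb;
        int next_clone = 2;
        int abuf = 1;           /* the original is still unclaimed */

        for (cb = list; cb != NULL; cb = cb->next) {
                if (cb->has_done) {
                        if (abuf == 0)
                                abuf = next_clone++;    /* make a clone */
                        cb->buf_id = abuf;
                        abuf = 0;
                }
        }
        return (abuf == 1);
}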
 2538 
 2539 /*
 2540  * "Read" the block at the specified DVA (in bp) via the
 2541  * cache.  If the block is found in the cache, invoke the provided
 2542  * callback immediately and return.  Note that the `zio' parameter
 2543  * in the callback will be NULL in this case, since no IO was
 2544  * required.  If the block is not in the cache pass the read request
 2545  * on to the spa with a substitute callback function, so that the
 2546  * requested block will be added to the cache.
 2547  *
 2548  * If a read request arrives for a block that has a read in-progress,
 2549  * either wait for the in-progress read to complete (and return the
 2550  * results); or, if this is a read with a "done" func, add a record
 2551  * to the read to invoke the "done" func when the read completes,
 2552  * and return; or just return.
 2553  *
 2554  * arc_read_done() will invoke all the requested "done" functions
 2555  * for readers of this block.
 2556  *
 2557  * Normal callers should use arc_read and pass the arc buffer and offset
 2558  * for the bp.  But if you know you don't need locking, you can use
 2559  * arc_read_nolock.
 2560  */
 2561 int
 2562 arc_read(zio_t *pio, spa_t *spa, blkptr_t *bp, arc_buf_t *pbuf,
 2563     arc_done_func_t *done, void *private, int priority, int zio_flags,
 2564     uint32_t *arc_flags, const zbookmark_t *zb)
 2565 {
 2566         int err;
 2567 
 2568         ASSERT(!refcount_is_zero(&pbuf->b_hdr->b_refcnt));
 2569         ASSERT3U((char *)bp - (char *)pbuf->b_data, <, pbuf->b_hdr->b_size);
 2570         rw_enter(&pbuf->b_lock, RW_READER);
 2571 
 2572         err = arc_read_nolock(pio, spa, bp, done, private, priority,
 2573             zio_flags, arc_flags, zb);
 2574         rw_exit(&pbuf->b_lock);
 2575 
 2576         return (err);
 2577 }
 2578 
 2579 int
 2580 arc_read_nolock(zio_t *pio, spa_t *spa, blkptr_t *bp,
 2581     arc_done_func_t *done, void *private, int priority, int zio_flags,
 2582     uint32_t *arc_flags, const zbookmark_t *zb)
 2583 {
 2584         arc_buf_hdr_t *hdr;
 2585         arc_buf_t *buf;
 2586         kmutex_t *hash_lock;
 2587         zio_t *rzio;
 2588         uint64_t guid = spa_guid(spa);
 2589 
 2590 top:
 2591         hdr = buf_hash_find(guid, BP_IDENTITY(bp), bp->blk_birth, &hash_lock);
 2592         if (hdr && hdr->b_datacnt > 0) {
 2593 
 2594                 *arc_flags |= ARC_CACHED;
 2595 
 2596                 if (HDR_IO_IN_PROGRESS(hdr)) {
 2597 
 2598                         if (*arc_flags & ARC_WAIT) {
 2599                                 cv_wait(&hdr->b_cv, hash_lock);
 2600                                 mutex_exit(hash_lock);
 2601                                 goto top;
 2602                         }
 2603                         ASSERT(*arc_flags & ARC_NOWAIT);
 2604 
 2605                         if (done) {
 2606                                 arc_callback_t  *acb = NULL;
 2607 
 2608                                 acb = kmem_zalloc(sizeof (arc_callback_t),
 2609                                     KM_SLEEP);
 2610                                 acb->acb_done = done;
 2611                                 acb->acb_private = private;
 2612                                 if (pio != NULL)
 2613                                         acb->acb_zio_dummy = zio_null(pio,
 2614                                             spa, NULL, NULL, NULL, zio_flags);
 2615 
 2616                                 ASSERT(acb->acb_done != NULL);
 2617                                 acb->acb_next = hdr->b_acb;
 2618                                 hdr->b_acb = acb;
 2619                                 add_reference(hdr, hash_lock, private);
 2620                                 mutex_exit(hash_lock);
 2621                                 return (0);
 2622                         }
 2623                         mutex_exit(hash_lock);
 2624                         return (0);
 2625                 }
 2626 
 2627                 ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
 2628 
 2629                 if (done) {
 2630                         add_reference(hdr, hash_lock, private);
 2631                         /*
 2632                          * If this block is already in use, create a new
 2633                          * copy of the data so that we will be guaranteed
 2634                          * that arc_release() will always succeed.
 2635                          */
 2636                         buf = hdr->b_buf;
 2637                         ASSERT(buf);
 2638                         ASSERT(buf->b_data);
 2639                         if (HDR_BUF_AVAILABLE(hdr)) {
 2640                                 ASSERT(buf->b_efunc == NULL);
 2641                                 hdr->b_flags &= ~ARC_BUF_AVAILABLE;
 2642                         } else {
 2643                                 buf = arc_buf_clone(buf);
 2644                         }
 2645                 } else if (*arc_flags & ARC_PREFETCH &&
 2646                     refcount_count(&hdr->b_refcnt) == 0) {
 2647                         hdr->b_flags |= ARC_PREFETCH;
 2648                 }
 2649                 DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
 2650                 arc_access(hdr, hash_lock);
 2651                 if (*arc_flags & ARC_L2CACHE)
 2652                         hdr->b_flags |= ARC_L2CACHE;
 2653                 mutex_exit(hash_lock);
 2654                 ARCSTAT_BUMP(arcstat_hits);
 2655                 ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
 2656                     demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
 2657                     data, metadata, hits);
 2658 
 2659                 if (done)
 2660                         done(NULL, buf, private);
 2661         } else {
 2662                 uint64_t size = BP_GET_LSIZE(bp);
 2663                 arc_callback_t  *acb;
 2664                 vdev_t *vd = NULL;
 2665                 uint64_t addr;
 2666                 boolean_t devw = B_FALSE;
 2667 
 2668                 if (hdr == NULL) {
 2669                         /* this block is not in the cache */
 2670                         arc_buf_hdr_t   *exists;
 2671                         arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
 2672                         buf = arc_buf_alloc(spa, size, private, type);
 2673                         hdr = buf->b_hdr;
 2674                         hdr->b_dva = *BP_IDENTITY(bp);
 2675                         hdr->b_birth = bp->blk_birth;
 2676                         hdr->b_cksum0 = bp->blk_cksum.zc_word[0];
 2677                         exists = buf_hash_insert(hdr, &hash_lock);
 2678                         if (exists) {
 2679                                 /* somebody beat us to the hash insert */
 2680                                 mutex_exit(hash_lock);
 2681                                 bzero(&hdr->b_dva, sizeof (dva_t));
 2682                                 hdr->b_birth = 0;
 2683                                 hdr->b_cksum0 = 0;
 2684                                 (void) arc_buf_remove_ref(buf, private);
 2685                                 goto top; /* restart the IO request */
 2686                         }
 2687                         /* if this is a prefetch, we don't have a reference */
 2688                         if (*arc_flags & ARC_PREFETCH) {
 2689                                 (void) remove_reference(hdr, hash_lock,
 2690                                     private);
 2691                                 hdr->b_flags |= ARC_PREFETCH;
 2692                         }
 2693                         if (*arc_flags & ARC_L2CACHE)
 2694                                 hdr->b_flags |= ARC_L2CACHE;
 2695                         if (BP_GET_LEVEL(bp) > 0)
 2696                                 hdr->b_flags |= ARC_INDIRECT;
 2697                 } else {
 2698                         /* this block is in the ghost cache */
 2699                         ASSERT(GHOST_STATE(hdr->b_state));
 2700                         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 2701                         ASSERT3U(refcount_count(&hdr->b_refcnt), ==, 0);
 2702                         ASSERT(hdr->b_buf == NULL);
 2703 
 2704                         /* if this is a prefetch, we don't have a reference */
 2705                         if (*arc_flags & ARC_PREFETCH)
 2706                                 hdr->b_flags |= ARC_PREFETCH;
 2707                         else
 2708                                 add_reference(hdr, hash_lock, private);
 2709                         if (*arc_flags & ARC_L2CACHE)
 2710                                 hdr->b_flags |= ARC_L2CACHE;
 2711                         buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
 2712                         buf->b_hdr = hdr;
 2713                         buf->b_data = NULL;
 2714                         buf->b_efunc = NULL;
 2715                         buf->b_private = NULL;
 2716                         buf->b_next = NULL;
 2717                         hdr->b_buf = buf;
 2718                         arc_get_data_buf(buf);
 2719                         ASSERT(hdr->b_datacnt == 0);
 2720                         hdr->b_datacnt = 1;
 2721 
 2722                 }
 2723 
 2724                 acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
 2725                 acb->acb_done = done;
 2726                 acb->acb_private = private;
 2727 
 2728                 ASSERT(hdr->b_acb == NULL);
 2729                 hdr->b_acb = acb;
 2730                 hdr->b_flags |= ARC_IO_IN_PROGRESS;
 2731 
 2732                 /*
 2733                  * If the buffer has been evicted, migrate it to a present state
 2734                  * before issuing the I/O.  Once we drop the hash-table lock,
 2735                  * the header will be marked as I/O in progress and have an
 2736                  * attached buffer.  At this point, anybody who finds this
 2737                  * buffer ought to notice that it's legit but has a pending I/O.
 2738                  */
 2739 
 2740                 if (GHOST_STATE(hdr->b_state))
 2741                         arc_access(hdr, hash_lock);
 2742 
 2743                 if (HDR_L2CACHE(hdr) && hdr->b_l2hdr != NULL &&
 2744                     (vd = hdr->b_l2hdr->b_dev->l2ad_vdev) != NULL) {
 2745                         devw = hdr->b_l2hdr->b_dev->l2ad_writing;
 2746                         addr = hdr->b_l2hdr->b_daddr;
 2747                         /*
 2748                          * Lock out device removal.
 2749                          */
 2750                         if (vdev_is_dead(vd) ||
 2751                             !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
 2752                                 vd = NULL;
 2753                 }
 2754 
 2755                 mutex_exit(hash_lock);
 2756 
 2757                 ASSERT3U(hdr->b_size, ==, size);
 2758                 DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
 2759                     uint64_t, size, zbookmark_t *, zb);
 2760                 ARCSTAT_BUMP(arcstat_misses);
 2761                 ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
 2762                     demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
 2763                     data, metadata, misses);
 2764 
 2765                 if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
 2766                         /*
 2767                          * Read from the L2ARC if the following are true:
 2768                          * 1. The L2ARC vdev was previously cached.
 2769                          * 2. This buffer still has L2ARC metadata.
 2770                          * 3. This buffer isn't currently writing to the L2ARC.
 2771                          * 4. The L2ARC entry wasn't evicted, which may
 2772                          *    also have invalidated the vdev.
 2773                          * 5. It is not a prefetch with l2arc_noprefetch set.
 2774                          */
 2775                         if (hdr->b_l2hdr != NULL &&
 2776                             !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
 2777                             !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
 2778                                 l2arc_read_callback_t *cb;
 2779 
 2780                                 DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
 2781                                 ARCSTAT_BUMP(arcstat_l2_hits);
 2782 
 2783                                 cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
 2784                                     KM_SLEEP);
 2785                                 cb->l2rcb_buf = buf;
 2786                                 cb->l2rcb_spa = spa;
 2787                                 cb->l2rcb_bp = *bp;
 2788                                 cb->l2rcb_zb = *zb;
 2789                                 cb->l2rcb_flags = zio_flags;
 2790 
 2791                                 /*
 2792                                  * l2arc read.  The SCL_L2ARC lock will be
 2793                                  * released by l2arc_read_done().
 2794                                  */
 2795                                 rzio = zio_read_phys(pio, vd, addr, size,
 2796                                     buf->b_data, ZIO_CHECKSUM_OFF,
 2797                                     l2arc_read_done, cb, priority, zio_flags |
 2798                                     ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
 2799                                     ZIO_FLAG_DONT_PROPAGATE |
 2800                                     ZIO_FLAG_DONT_RETRY, B_FALSE);
 2801                                 DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
 2802                                     zio_t *, rzio);
 2803                                 ARCSTAT_INCR(arcstat_l2_read_bytes, size);
 2804 
 2805                                 if (*arc_flags & ARC_NOWAIT) {
 2806                                         zio_nowait(rzio);
 2807                                         return (0);
 2808                                 }
 2809 
 2810                                 ASSERT(*arc_flags & ARC_WAIT);
 2811                                 if (zio_wait(rzio) == 0)
 2812                                         return (0);
 2813 
 2814                                 /* l2arc read error; goto zio_read() */
 2815                         } else {
 2816                                 DTRACE_PROBE1(l2arc__miss,
 2817                                     arc_buf_hdr_t *, hdr);
 2818                                 ARCSTAT_BUMP(arcstat_l2_misses);
 2819                                 if (HDR_L2_WRITING(hdr))
 2820                                         ARCSTAT_BUMP(arcstat_l2_rw_clash);
 2821                                 spa_config_exit(spa, SCL_L2ARC, vd);
 2822                         }
 2823                 } else {
 2824                         if (vd != NULL)
 2825                                 spa_config_exit(spa, SCL_L2ARC, vd);
 2826                         if (l2arc_ndev != 0) {
 2827                                 DTRACE_PROBE1(l2arc__miss,
 2828                                     arc_buf_hdr_t *, hdr);
 2829                                 ARCSTAT_BUMP(arcstat_l2_misses);
 2830                         }
 2831                 }
 2832 
 2833                 rzio = zio_read(pio, spa, bp, buf->b_data, size,
 2834                     arc_read_done, buf, priority, zio_flags, zb);
 2835 
 2836                 if (*arc_flags & ARC_WAIT)
 2837                         return (zio_wait(rzio));
 2838 
 2839                 ASSERT(*arc_flags & ARC_NOWAIT);
 2840                 zio_nowait(rzio);
 2841         }
 2842         return (0);
 2843 }
 2844 
 2845 void
 2846 arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private)
 2847 {
 2848         ASSERT(buf->b_hdr != NULL);
 2849         ASSERT(buf->b_hdr->b_state != arc_anon);
 2850         ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt) || func == NULL);
 2851         buf->b_efunc = func;
 2852         buf->b_private = private;
 2853 }
 2854 
 2855 /*
 2856  * This is used by the DMU to let the ARC know that a buffer is
 2857  * being evicted, so the ARC should clean up.  If this arc buf
 2858  * is not yet in the evicted state, it will be put there.
 2859  */
 2860 int
 2861 arc_buf_evict(arc_buf_t *buf)
 2862 {
 2863         arc_buf_hdr_t *hdr;
 2864         kmutex_t *hash_lock;
 2865         arc_buf_t **bufp;
 2866 
 2867         rw_enter(&buf->b_lock, RW_WRITER);
 2868         hdr = buf->b_hdr;
 2869         if (hdr == NULL) {
 2870                 /*
 2871                  * We are in arc_do_user_evicts().
 2872                  */
 2873                 ASSERT(buf->b_data == NULL);
 2874                 rw_exit(&buf->b_lock);
 2875                 return (0);
 2876         } else if (buf->b_data == NULL) {
 2877                 arc_buf_t copy = *buf; /* structure assignment */
 2878                 /*
 2879                  * We are on the eviction list; process this buffer now
 2880                  * but let arc_do_user_evicts() do the reaping.
 2881                  */
 2882                 buf->b_efunc = NULL;
 2883                 rw_exit(&buf->b_lock);
 2884                 VERIFY(copy.b_efunc(&copy) == 0);
 2885                 return (1);
 2886         }
 2887         hash_lock = HDR_LOCK(hdr);
 2888         mutex_enter(hash_lock);
 2889 
 2890         ASSERT(buf->b_hdr == hdr);
 2891         ASSERT3U(refcount_count(&hdr->b_refcnt), <, hdr->b_datacnt);
 2892         ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
 2893 
 2894         /*
 2895          * Pull this buffer off of the hdr
 2896          */
 2897         bufp = &hdr->b_buf;
 2898         while (*bufp != buf)
 2899                 bufp = &(*bufp)->b_next;
 2900         *bufp = buf->b_next;
 2901 
 2902         ASSERT(buf->b_data != NULL);
 2903         arc_buf_destroy(buf, FALSE, FALSE);
 2904 
 2905         if (hdr->b_datacnt == 0) {
 2906                 arc_state_t *old_state = hdr->b_state;
 2907                 arc_state_t *evicted_state;
 2908 
 2909                 ASSERT(refcount_is_zero(&hdr->b_refcnt));
 2910 
 2911                 evicted_state =
 2912                     (old_state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
 2913 
 2914                 mutex_enter(&old_state->arcs_mtx);
 2915                 mutex_enter(&evicted_state->arcs_mtx);
 2916 
 2917                 arc_change_state(evicted_state, hdr, hash_lock);
 2918                 ASSERT(HDR_IN_HASH_TABLE(hdr));
 2919                 hdr->b_flags |= ARC_IN_HASH_TABLE;
 2920                 hdr->b_flags &= ~ARC_BUF_AVAILABLE;
 2921 
 2922                 mutex_exit(&evicted_state->arcs_mtx);
 2923                 mutex_exit(&old_state->arcs_mtx);
 2924         }
 2925         mutex_exit(hash_lock);
 2926         rw_exit(&buf->b_lock);
 2927 
 2928         VERIFY(buf->b_efunc(buf) == 0);
 2929         buf->b_efunc = NULL;
 2930         buf->b_private = NULL;
 2931         buf->b_hdr = NULL;
 2932         kmem_cache_free(buf_cache, buf);
 2933         return (1);
 2934 }
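/*
 * Illustrative sketch, not part of arc.c.  arc_buf_evict() and
 * arc_release() both unlink a buf from its hdr with a
 * pointer-to-pointer walk, which avoids a special case for the list
 * head.  The node_t type and the list_unlink() name are hypothetical
 * stand-ins for arc_buf_t and the open-coded loop; as in the original,
 * the caller is assumed to guarantee that the target is on the list.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for arc_buf_t; only the link field matters. */
typedef struct node {
        int id;
        struct node *next;
} node_t;

/*
 * Unlink target from the singly linked list rooted at *headp.
 * pp always points at the pointer that references the current node,
 * so removing the head and removing an interior node are the same
 * assignment.
 */
static void
list_unlink(node_t **headp, node_t *target)
{
        node_t **pp = headp;

        while (*pp != target)
                pp = &(*pp)->next;
        *pp = target->next;
        target->next = NULL;
}
```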
 2935 
 2936 /*
 2937  * Release this buffer from the cache.  This must be done
 2938  * after a read and prior to modifying the buffer contents.
 2939  * If the buffer has more than one reference, we must make
 2940  * a new hdr for the buffer.
 2941  */
 2942 void
 2943 arc_release(arc_buf_t *buf, void *tag)
 2944 {
 2945         arc_buf_hdr_t *hdr;
 2946         kmutex_t *hash_lock;
 2947         l2arc_buf_hdr_t *l2hdr;
 2948         uint64_t buf_size;
 2949         boolean_t released = B_FALSE;
 2950 
 2951         rw_enter(&buf->b_lock, RW_WRITER);
 2952         hdr = buf->b_hdr;
 2953 
 2954         /* this buffer is not on any list */
 2955         ASSERT(refcount_count(&hdr->b_refcnt) > 0);
 2956         ASSERT(!(hdr->b_flags & ARC_STORED));
 2957 
 2958         if (hdr->b_state == arc_anon) {
 2959                 /* this buffer is already released */
 2960                 ASSERT3U(refcount_count(&hdr->b_refcnt), ==, 1);
 2961                 ASSERT(BUF_EMPTY(hdr));
 2962                 ASSERT(buf->b_efunc == NULL);
 2963                 arc_buf_thaw(buf);
 2964                 rw_exit(&buf->b_lock);
 2965                 released = B_TRUE;
 2966         } else {
 2967                 hash_lock = HDR_LOCK(hdr);
 2968                 mutex_enter(hash_lock);
 2969         }
 2970 
 2971         l2hdr = hdr->b_l2hdr;
 2972         if (l2hdr) {
 2973                 mutex_enter(&l2arc_buflist_mtx);
 2974                 hdr->b_l2hdr = NULL;
 2975                 buf_size = hdr->b_size;
 2976         }
 2977 
 2978         if (released)
 2979                 goto out;
 2980 
 2981         /*
 2982          * Do we have more than one buf?
 2983          */
 2984         if (hdr->b_datacnt > 1) {
 2985                 arc_buf_hdr_t *nhdr;
 2986                 arc_buf_t **bufp;
 2987                 uint64_t blksz = hdr->b_size;
 2988                 uint64_t spa = hdr->b_spa;
 2989                 arc_buf_contents_t type = hdr->b_type;
 2990                 uint32_t flags = hdr->b_flags;
 2991 
 2992                 ASSERT(hdr->b_buf != buf || buf->b_next != NULL);
 2993                 /*
 2994                  * Pull the data off of this buf and attach it to
 2995                  * a new anonymous buf.
 2996                  */
 2997                 (void) remove_reference(hdr, hash_lock, tag);
 2998                 bufp = &hdr->b_buf;
 2999                 while (*bufp != buf)
 3000                         bufp = &(*bufp)->b_next;
 3001                 *bufp = (*bufp)->b_next;
 3002                 buf->b_next = NULL;
 3003 
 3004                 ASSERT3U(hdr->b_state->arcs_size, >=, hdr->b_size);
 3005                 atomic_add_64(&hdr->b_state->arcs_size, -hdr->b_size);
 3006                 if (refcount_is_zero(&hdr->b_refcnt)) {
 3007                         uint64_t *size = &hdr->b_state->arcs_lsize[hdr->b_type];
 3008                         ASSERT3U(*size, >=, hdr->b_size);
 3009                         atomic_add_64(size, -hdr->b_size);
 3010                 }
 3011                 hdr->b_datacnt -= 1;
 3012                 arc_cksum_verify(buf);
 3013 
 3014                 mutex_exit(hash_lock);
 3015 
 3016                 nhdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
 3017                 nhdr->b_size = blksz;
 3018                 nhdr->b_spa = spa;
 3019                 nhdr->b_type = type;
 3020                 nhdr->b_buf = buf;
 3021                 nhdr->b_state = arc_anon;
 3022                 nhdr->b_arc_access = 0;
 3023                 nhdr->b_flags = flags & ARC_L2_WRITING;
 3024                 nhdr->b_l2hdr = NULL;
 3025                 nhdr->b_datacnt = 1;
 3026                 nhdr->b_freeze_cksum = NULL;
 3027                 (void) refcount_add(&nhdr->b_refcnt, tag);
 3028                 buf->b_hdr = nhdr;
 3029                 rw_exit(&buf->b_lock);
 3030                 atomic_add_64(&arc_anon->arcs_size, blksz);
 3031         } else {
 3032                 rw_exit(&buf->b_lock);
 3033                 ASSERT(refcount_count(&hdr->b_refcnt) == 1);
 3034                 ASSERT(!list_link_active(&hdr->b_arc_node));
 3035                 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 3036                 arc_change_state(arc_anon, hdr, hash_lock);
 3037                 hdr->b_arc_access = 0;
 3038                 mutex_exit(hash_lock);
 3039 
 3040                 bzero(&hdr->b_dva, sizeof (dva_t));
 3041                 hdr->b_birth = 0;
 3042                 hdr->b_cksum0 = 0;
 3043                 arc_buf_thaw(buf);
 3044         }
 3045         buf->b_efunc = NULL;
 3046         buf->b_private = NULL;
 3047 
 3048 out:
 3049         if (l2hdr) {
 3050                 list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
 3051                 kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
 3052                 ARCSTAT_INCR(arcstat_l2_size, -buf_size);
 3053                 mutex_exit(&l2arc_buflist_mtx);
 3054         }
 3055 }
 3056 
      /*
       * Return nonzero if this buffer holds data and its header is in the
       * anonymous state, i.e. it has been released from the cache (see
       * arc_release()) or was never cached.
       */
 3057 int
 3058 arc_released(arc_buf_t *buf)
 3059 {
 3060         int released;
 3061 
 3062         rw_enter(&buf->b_lock, RW_READER);
 3063         released = (buf->b_data != NULL && buf->b_hdr->b_state == arc_anon);
 3064         rw_exit(&buf->b_lock);
 3065         return (released);
 3066 }
 3067 
      /*
       * Return nonzero if an eviction callback is registered on this buffer.
       */
 3068 int
 3069 arc_has_callback(arc_buf_t *buf)
 3070 {
 3071         int callback;
 3072 
 3073         rw_enter(&buf->b_lock, RW_READER);
 3074         callback = (buf->b_efunc != NULL);
 3075         rw_exit(&buf->b_lock);
 3076         return (callback);
 3077 }
 3078 
 3079 #ifdef ZFS_DEBUG
 3080 int
 3081 arc_referenced(arc_buf_t *buf)
 3082 {
 3083         int referenced;
 3084 
 3085         rw_enter(&buf->b_lock, RW_READER);
 3086         referenced = (refcount_count(&buf->b_hdr->b_refcnt));
 3087         rw_exit(&buf->b_lock);
 3088         return (referenced);
 3089 }
 3090 #endif
 3091 
 3092 static void
 3093 arc_write_ready(zio_t *zio)
 3094 {
 3095         arc_write_callback_t *callback = zio->io_private;
 3096         arc_buf_t *buf = callback->awcb_buf;
 3097         arc_buf_hdr_t *hdr = buf->b_hdr;
 3098 
 3099         ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt));
 3100         callback->awcb_ready(zio, buf, callback->awcb_private);
 3101 
 3102         /*
 3103          * If the IO is already in progress, then this is a re-write
 3104          * attempt, so we need to thaw and re-compute the cksum.
 3105          * It is the responsibility of the callback to handle the
 3106          * accounting for any re-write attempt.
 3107          */
 3108         if (HDR_IO_IN_PROGRESS(hdr)) {
 3109                 mutex_enter(&hdr->b_freeze_lock);
 3110                 if (hdr->b_freeze_cksum != NULL) {
 3111                         kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
 3112                         hdr->b_freeze_cksum = NULL;
 3113                 }
 3114                 mutex_exit(&hdr->b_freeze_lock);
 3115         }
 3116         arc_cksum_compute(buf, B_FALSE);
 3117         hdr->b_flags |= ARC_IO_IN_PROGRESS;
 3118 }
 3119 
 3120 static void
 3121 arc_write_done(zio_t *zio)
 3122 {
 3123         arc_write_callback_t *callback = zio->io_private;
 3124         arc_buf_t *buf = callback->awcb_buf;
 3125         arc_buf_hdr_t *hdr = buf->b_hdr;
 3126 
 3127         hdr->b_acb = NULL;
 3128 
 3129         hdr->b_dva = *BP_IDENTITY(zio->io_bp);
 3130         hdr->b_birth = zio->io_bp->blk_birth;
 3131         hdr->b_cksum0 = zio->io_bp->blk_cksum.zc_word[0];
 3132         /*
 3133          * If the block to be written was all-zero, we may have
 3134          * compressed it away.  In this case no write was performed
 3135          * so there will be no dva/birth-date/checksum.  The buffer
 3136          * must therefore remain anonymous (and uncached).
 3137          */
 3138         if (!BUF_EMPTY(hdr)) {
 3139                 arc_buf_hdr_t *exists;
 3140                 kmutex_t *hash_lock;
 3141 
 3142                 arc_cksum_verify(buf);
 3143 
 3144                 exists = buf_hash_insert(hdr, &hash_lock);
 3145                 if (exists) {
 3146                         /*
 3147                          * This can only happen if we overwrite for
 3148                          * sync-to-convergence, because we remove
 3149                          * buffers from the hash table when we arc_free().
 3150                          */
 3151                         if (!(zio->io_flags & ZIO_FLAG_IO_REWRITE) ||
 3152                             !DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig),
 3153                             BP_IDENTITY(zio->io_bp)) ||
 3154                             zio->io_bp_orig.blk_birth !=
 3155                             zio->io_bp->blk_birth) {
 3156                                 panic("bad overwrite, hdr=%p exists=%p",
 3157                                     (void *)hdr, (void *)exists);
 3158                         }
 3159 
 3160                         ASSERT(refcount_is_zero(&exists->b_refcnt));
 3161                         arc_change_state(arc_anon, exists, hash_lock);
 3162                         mutex_exit(hash_lock);
 3163                         arc_hdr_destroy(exists);
 3164                         exists = buf_hash_insert(hdr, &hash_lock);
 3165                         ASSERT3P(exists, ==, NULL);
 3166                 }
 3167                 hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
 3168                 /* if it's not anon, we are doing a scrub */
 3169                 if (hdr->b_state == arc_anon)
 3170                         arc_access(hdr, hash_lock);
 3171                 mutex_exit(hash_lock);
 3172         } else if (callback->awcb_done == NULL) {
 3173                 int destroy_hdr;
 3174                 /*
 3175                  * This is an anonymous buffer with no user callback,
 3176                  * destroy it if there are no active references.
 3177                  */
 3178                 mutex_enter(&arc_eviction_mtx);
 3179                 destroy_hdr = refcount_is_zero(&hdr->b_refcnt);
 3180                 hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
 3181                 mutex_exit(&arc_eviction_mtx);
 3182                 if (destroy_hdr)
 3183                         arc_hdr_destroy(hdr);
 3184         } else {
 3185                 hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
 3186         }
 3187         hdr->b_flags &= ~ARC_STORED;
 3188 
 3189         if (callback->awcb_done) {
 3190                 ASSERT(!refcount_is_zero(&hdr->b_refcnt));
 3191                 callback->awcb_done(zio, buf, callback->awcb_private);
 3192         }
 3193 
 3194         kmem_free(callback, sizeof (arc_write_callback_t));
 3195 }
 3196 
 3197 void
 3198 write_policy(spa_t *spa, const writeprops_t *wp, zio_prop_t *zp)
 3199 {
 3200         boolean_t ismd = (wp->wp_level > 0 || dmu_ot[wp->wp_type].ot_metadata);
 3201 
 3202         /* Determine checksum setting */
 3203         if (ismd) {
 3204                 /*
 3205                  * Metadata always gets checksummed.  If the data
 3206                  * checksum is multi-bit correctable, and it's not a
 3207                  * ZBT-style checksum, then it's suitable for metadata
 3208                  * as well.  Otherwise, the metadata checksum defaults
 3209                  * to fletcher4.
 3210                  */
 3211                 if (zio_checksum_table[wp->wp_oschecksum].ci_correctable &&
 3212                     !zio_checksum_table[wp->wp_oschecksum].ci_zbt)
 3213                         zp->zp_checksum = wp->wp_oschecksum;
 3214                 else
 3215                         zp->zp_checksum = ZIO_CHECKSUM_FLETCHER_4;
 3216         } else {
 3217                 zp->zp_checksum = zio_checksum_select(wp->wp_dnchecksum,
 3218                     wp->wp_oschecksum);
 3219         }
 3220 
 3221         /* Determine compression setting */
 3222         if (ismd) {
 3223                 /*
 3224                  * XXX -- we should design a compression algorithm
 3225                  * that specializes in arrays of bps.
 3226                  */
 3227                 zp->zp_compress = zfs_mdcomp_disable ? ZIO_COMPRESS_EMPTY :
 3228                     ZIO_COMPRESS_LZJB;
 3229         } else {
 3230                 zp->zp_compress = zio_compress_select(wp->wp_dncompress,
 3231                     wp->wp_oscompress);
 3232         }
 3233 
 3234         zp->zp_type = wp->wp_type;
 3235         zp->zp_level = wp->wp_level;
 3236         zp->zp_ndvas = MIN(wp->wp_copies + ismd, spa_max_replication(spa));
 3237 }
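/*
 * Illustrative sketch, not part of arc.c.  The last line of
 * write_policy() gives metadata one more copy than the dataset's
 * "copies" setting, capped by the pool's replication limit.  The
 * helper name ndvas_for() is hypothetical, and the limit is passed in
 * as a plain integer instead of calling spa_max_replication().
 */

```c
#include <assert.h>

#define MIN(a, b)       ((a) < (b) ? (a) : (b))

/*
 * Number of DVAs to allocate for a block: the requested copies, plus
 * one extra for metadata, but never more than the pool allows.
 */
static int
ndvas_for(int copies, int is_metadata, int max_replication)
{
        return (MIN(copies + (is_metadata ? 1 : 0), max_replication));
}
```

/*
 * For example, with a limit of 3 (an assumed value), copies=1 yields
 * 1 DVA for data and 2 for metadata, while copies=3 yields 3 for both.
 */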
 3238 
 3239 zio_t *
 3240 arc_write(zio_t *pio, spa_t *spa, const writeprops_t *wp,
 3241     boolean_t l2arc, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
 3242     arc_done_func_t *ready, arc_done_func_t *done, void *private, int priority,
 3243     int zio_flags, const zbookmark_t *zb)
 3244 {
 3245         arc_buf_hdr_t *hdr = buf->b_hdr;
 3246         arc_write_callback_t *callback;
 3247         zio_t *zio;
 3248         zio_prop_t zp;
 3249 
 3250         ASSERT(ready != NULL);
 3251         ASSERT(!HDR_IO_ERROR(hdr));
 3252         ASSERT((hdr->b_flags & ARC_IO_IN_PROGRESS) == 0);
 3253         ASSERT(hdr->b_acb == 0);
 3254         if (l2arc)
 3255                 hdr->b_flags |= ARC_L2CACHE;
 3256         callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
 3257         callback->awcb_ready = ready;
 3258         callback->awcb_done = done;
 3259         callback->awcb_private = private;
 3260         callback->awcb_buf = buf;
 3261 
 3262         write_policy(spa, wp, &zp);
 3263         zio = zio_write(pio, spa, txg, bp, buf->b_data, hdr->b_size, &zp,
 3264             arc_write_ready, arc_write_done, callback, priority, zio_flags, zb);
 3265 
 3266         return (zio);
 3267 }
 3268 
 3269 int
 3270 arc_free(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
 3271     zio_done_func_t *done, void *private, uint32_t arc_flags)
 3272 {
 3273         arc_buf_hdr_t *ab;
 3274         kmutex_t *hash_lock;
 3275         zio_t   *zio;
 3276         uint64_t guid = spa_guid(spa);
 3277 
 3278         /*
 3279          * If this buffer is in the cache, release it, so it
 3280          * can be re-used.
 3281          */
 3282         ab = buf_hash_find(guid, BP_IDENTITY(bp), bp->blk_birth, &hash_lock);
 3283         if (ab != NULL) {
 3284                 /*
 3285                  * The checksum of blocks to free is not always
 3286                  * preserved (e.g., on the deadlist).  However, if it is
 3287                  * nonzero, it should match what we have in the cache.
 3288                  */
 3289                 ASSERT(bp->blk_cksum.zc_word[0] == 0 ||
 3290                     bp->blk_cksum.zc_word[0] == ab->b_cksum0 ||
 3291                     bp->blk_fill == BLK_FILL_ALREADY_FREED);
 3292 
 3293                 if (ab->b_state != arc_anon)
 3294                         arc_change_state(arc_anon, ab, hash_lock);
 3295                 if (HDR_IO_IN_PROGRESS(ab)) {
 3296                         /*
 3297                          * This should only happen when we prefetch.
 3298                          */
 3299                         ASSERT(ab->b_flags & ARC_PREFETCH);
 3300                         ASSERT3U(ab->b_datacnt, ==, 1);
 3301                         ab->b_flags |= ARC_FREED_IN_READ;
 3302                         if (HDR_IN_HASH_TABLE(ab))
 3303                                 buf_hash_remove(ab);
 3304                         ab->b_arc_access = 0;
 3305                         bzero(&ab->b_dva, sizeof (dva_t));
 3306                         ab->b_birth = 0;
 3307                         ab->b_cksum0 = 0;
 3308                         ab->b_buf->b_efunc = NULL;
 3309                         ab->b_buf->b_private = NULL;
 3310                         mutex_exit(hash_lock);
 3311                 } else if (refcount_is_zero(&ab->b_refcnt)) {
 3312                         ab->b_flags |= ARC_FREE_IN_PROGRESS;
 3313                         mutex_exit(hash_lock);
 3314                         arc_hdr_destroy(ab);
 3315                         ARCSTAT_BUMP(arcstat_deleted);
 3316                 } else {
 3317                         /*
 3318                          * We still have an active reference on this
 3319                          * buffer.  This can happen, e.g., from
 3320                          * dbuf_unoverride().
 3321                          */
 3322                         ASSERT(!HDR_IN_HASH_TABLE(ab));
 3323                         ab->b_arc_access = 0;
 3324                         bzero(&ab->b_dva, sizeof (dva_t));
 3325                         ab->b_birth = 0;
 3326                         ab->b_cksum0 = 0;
 3327                         ab->b_buf->b_efunc = NULL;
 3328                         ab->b_buf->b_private = NULL;
 3329                         mutex_exit(hash_lock);
 3330                 }
 3331         }
 3332 
 3333         zio = zio_free(pio, spa, txg, bp, done, private, ZIO_FLAG_MUSTSUCCEED);
 3334 
 3335         if (arc_flags & ARC_WAIT)
 3336                 return (zio_wait(zio));
 3337 
 3338         ASSERT(arc_flags & ARC_NOWAIT);
 3339         zio_nowait(zio);
 3340 
 3341         return (0);
 3342 }
 3343 
 3344 static int
 3345 arc_memory_throttle(uint64_t reserve, uint64_t inflight_data, uint64_t txg)
 3346 {
 3347 #ifdef _KERNEL
 3348         uint64_t available_memory = ptob(freemem);
 3349         static uint64_t page_load = 0;
 3350         static uint64_t last_txg = 0;
 3351 
 3352 #if defined(__i386)
 3353         available_memory =
 3354             MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
 3355 #endif
 3356         if (available_memory >= zfs_write_limit_max)
 3357                 return (0);
 3358 
 3359         if (txg > last_txg) {
 3360                 last_txg = txg;
 3361                 page_load = 0;
 3362         }
 3363         /*
 3364          * If we are in pageout, we know that memory is already tight and
 3365          * the ARC is already going to be evicting, so we just want to
 3366          * continue to let page writes occur as quickly as possible.
 3367          */
 3368         if (curproc == proc_pageout) {
 3369                 if (page_load > MAX(ptob(minfree), available_memory) / 4)
 3370                         return (ERESTART);
 3371                 /* Note: reserve is inflated, so we deflate */
 3372                 page_load += reserve / 8;
 3373                 return (0);
 3374         } else if (page_load > 0 && arc_reclaim_needed()) {
 3375                 /* memory is low, delay before restarting */
 3376                 ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
 3377                 return (EAGAIN);
 3378         }
 3379         page_load = 0;
 3380 
 3381         if (arc_size > arc_c_min) {
 3382                 uint64_t evictable_memory =
 3383                     arc_mru->arcs_lsize[ARC_BUFC_DATA] +
 3384                     arc_mru->arcs_lsize[ARC_BUFC_METADATA] +
 3385                     arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
 3386                     arc_mfu->arcs_lsize[ARC_BUFC_METADATA];
 3387                 available_memory += MIN(evictable_memory, arc_size - arc_c_min);
 3388         }
 3389 
 3390         if (inflight_data > available_memory / 4) {
 3391                 ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
 3392                 return (ERESTART);
 3393         }
 3394 #endif
 3395         return (0);
 3396 }
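/*
 * Illustrative sketch, not part of arc.c.  It models only the final
 * check in arc_memory_throttle(): evictable ARC memory above arc_c_min
 * is credited to the free-memory figure, and the caller is throttled
 * when in-flight dirty data exceeds a quarter of the total.  The
 * pageout branch and the early return against zfs_write_limit_max are
 * deliberately left out, and would_throttle() is a hypothetical name.
 */

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if a write should be throttled, 0 otherwise.  All sizes
 * are in the same unit (bytes in the kernel code).
 */
static int
would_throttle(uint64_t available, uint64_t evictable,
    uint64_t arc_size, uint64_t arc_c_min, uint64_t inflight)
{
        if (arc_size > arc_c_min) {
                /* Only memory above the ARC floor can be reclaimed. */
                uint64_t slack = arc_size - arc_c_min;

                available += (evictable < slack) ? evictable : slack;
        }
        return (inflight > available / 4);
}
```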
 3397 
 3398 void
 3399 arc_tempreserve_clear(uint64_t reserve)
 3400 {
 3401         atomic_add_64(&arc_tempreserve, -reserve);
 3402         ASSERT((int64_t)arc_tempreserve >= 0);
 3403 }
 3404 
 3405 int
 3406 arc_tempreserve_space(uint64_t reserve, uint64_t txg)
 3407 {
 3408         int error;
 3409         uint64_t anon_size;
 3410 
 3411 #ifdef ZFS_DEBUG
 3412         /*
 3413          * Once in a while, fail for no reason.  Everything should cope.
 3414          */
 3415         if (spa_get_random(10000) == 0) {
 3416                 dprintf("forcing random failure\n");
 3417                 return (ERESTART);
 3418         }
 3419 #endif
 3420         if (reserve > arc_c/4 && !arc_no_grow)
 3421                 arc_c = MIN(arc_c_max, reserve * 4);
 3422         if (reserve > arc_c)
 3423                 return (ENOMEM);
 3424 
 3425         /*
 3426          * Don't count loaned bufs as in flight dirty data to prevent long
 3427          * network delays from blocking transactions that are ready to be
 3428          * assigned to a txg.
 3429          */
 3430         anon_size = MAX((int64_t)(arc_anon->arcs_size - arc_loaned_bytes), 0);
 3431 
 3432         /*
 3433          * Writes will, almost always, require additional memory allocations
 3434          * in order to compress/encrypt/etc the data.  We therefore need to
 3435          * make sure that there is sufficient available memory for this.
 3436          */
 3437         if ((error = arc_memory_throttle(reserve, anon_size, txg)) != 0)
 3438                 return (error);
 3439 
 3440         /*
 3441          * Throttle writes when the amount of dirty data in the cache
 3442          * gets too large.  We try to keep the cache less than half full
 3443          * of dirty blocks so that our sync times don't grow too large.
 3444          * Note: if two requests come in concurrently, we might let them
 3445          * both succeed, when one of them should fail.  Not a huge deal.
 3446          */
 3447 
 3448         if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
 3449             anon_size > arc_c / 4) {
 3450                 dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
 3451                     "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
 3452                     arc_tempreserve>>10,
 3453                     arc_anon->arcs_lsize[ARC_BUFC_METADATA]>>10,
 3454                     arc_anon->arcs_lsize[ARC_BUFC_DATA]>>10,
 3455                     reserve>>10, arc_c>>10);
 3456                 return (ERESTART);
 3457         }
 3458         atomic_add_64(&arc_tempreserve, reserve);
 3459         return (0);
 3460 }
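/*
 * Illustrative sketch, not part of arc.c.  It isolates the dirty-data
 * throttle at the end of arc_tempreserve_space(): a reservation is
 * refused only when the projected dirty total would pass half of
 * arc_c AND anonymous (dirty) data alone already exceeds a quarter of
 * it.  dirty_throttled() is a hypothetical name; the globals are
 * passed as parameters so the predicate can be exercised in isolation.
 */

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if the reservation should fail with ERESTART in the
 * original code, 0 if it would be admitted.
 */
static int
dirty_throttled(uint64_t reserve, uint64_t tempreserve,
    uint64_t anon_size, uint64_t arc_c)
{
        return (reserve + tempreserve + anon_size > arc_c / 2 &&
            anon_size > arc_c / 4);
}
```

/*
 * With arc_c = 1000: reserve=100, tempreserve=200, anon_size=300 is
 * refused (600 > 500 and 300 > 250), while the same request with
 * anon_size=250 is admitted because the second condition fails.
 */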
 3461 
 3462 void
 3463 arc_init(void)
 3464 {
 3465         mutex_init(&arc_reclaim_thr_lock, NULL, MUTEX_DEFAULT, NULL);
 3466         cv_init(&arc_reclaim_thr_cv, NULL, CV_DEFAULT, NULL);
 3467 
 3468         /* Convert seconds to clock ticks */
 3469         arc_min_prefetch_lifespan = 1 * hz;
 3470 
 3471         /* Start out with 1/8 of all memory */
 3472         arc_c = physmem * PAGESIZE / 8;
 3473 
 3474 #ifdef _KERNEL
 3475         /*
 3476          * On architectures where the physical memory can be larger
 3477          * than the addressable space (intel in 32-bit mode), we may
 3478          * need to limit the cache to 1/8 of VM size.
 3479          */
 3480         arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
 3481 #endif
 3482 
 3483         /* set min cache to 1/32 of all memory, or 64MB, whichever is more */
 3484         arc_c_min = MAX(arc_c / 4, 64<<20);
 3485         /* set max to 3/4 of all memory, or all but 1GB, whichever is more */
 3486         if (arc_c * 8 >= 1<<30)
 3487                 arc_c_max = (arc_c * 8) - (1<<30);
 3488         else
 3489                 arc_c_max = arc_c_min;
 3490         arc_c_max = MAX(arc_c * 6, arc_c_max);
 3491 
 3492         /*
 3493          * Allow the tunables to override our calculations if they are
 3494          * reasonable (i.e., over 64MB)
 3495          */
 3496         if (zfs_arc_max > 64<<20 && zfs_arc_max < physmem * PAGESIZE)
 3497                 arc_c_max = zfs_arc_max;
 3498         if (zfs_arc_min > 64<<20 && zfs_arc_min <= arc_c_max)
 3499                 arc_c_min = zfs_arc_min;
 3500 
 3501         arc_c = arc_c_max;
 3502         arc_p = (arc_c >> 1);
 3503 
 3504         /* limit meta-data to 1/4 of the arc capacity */
 3505         arc_meta_limit = arc_c_max / 4;
 3506 
 3507         /* Allow the tunable to override if it is reasonable */
 3508         if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
 3509                 arc_meta_limit = zfs_arc_meta_limit;
 3510 
 3511         if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
 3512                 arc_c_min = arc_meta_limit / 2;
 3513 
 3514         if (zfs_arc_grow_retry > 0)
 3515                 arc_grow_retry = zfs_arc_grow_retry;
 3516 
 3517         if (zfs_arc_shrink_shift > 0)
 3518                 arc_shrink_shift = zfs_arc_shrink_shift;
 3519 
 3520         if (zfs_arc_p_min_shift > 0)
 3521                 arc_p_min_shift = zfs_arc_p_min_shift;
 3522 
 3523         /* if kmem_flags are set, let's try to use less memory */
 3524         if (kmem_debugging())
 3525                 arc_c = arc_c / 2;
 3526         if (arc_c < arc_c_min)
 3527                 arc_c = arc_c_min;
 3528 
 3529         arc_anon = &ARC_anon;
 3530         arc_mru = &ARC_mru;
 3531         arc_mru_ghost = &ARC_mru_ghost;
 3532         arc_mfu = &ARC_mfu;
 3533         arc_mfu_ghost = &ARC_mfu_ghost;
 3534         arc_l2c_only = &ARC_l2c_only;
 3535         arc_size = 0;
 3536 
 3537         mutex_init(&arc_anon->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
 3538         mutex_init(&arc_mru->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
 3539         mutex_init(&arc_mru_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
 3540         mutex_init(&arc_mfu->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
 3541         mutex_init(&arc_mfu_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
 3542         mutex_init(&arc_l2c_only->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
 3543 
 3544         list_create(&arc_mru->arcs_list[ARC_BUFC_METADATA],
 3545             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3546         list_create(&arc_mru->arcs_list[ARC_BUFC_DATA],
 3547             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3548         list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],
 3549             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3550         list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],
 3551             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3552         list_create(&arc_mfu->arcs_list[ARC_BUFC_METADATA],
 3553             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3554         list_create(&arc_mfu->arcs_list[ARC_BUFC_DATA],
 3555             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3556         list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],
 3557             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3558         list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],
 3559             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3560         list_create(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],
 3561             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3562         list_create(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],
 3563             sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
 3564 
 3565         buf_init();
 3566 
 3567         arc_thread_exit = 0;
 3568         arc_eviction_list = NULL;
 3569         mutex_init(&arc_eviction_mtx, NULL, MUTEX_DEFAULT, NULL);
 3570         bzero(&arc_eviction_hdr, sizeof (arc_buf_hdr_t));
 3571 
 3572         arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
 3573             sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
 3574 
 3575         if (arc_ksp != NULL) {
 3576                 arc_ksp->ks_data = &arc_stats;
 3577                 kstat_install(arc_ksp);
 3578         }
 3579 
 3580         (void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0,
 3581             TS_RUN, minclsyspri);
 3582 
 3583         arc_dead = FALSE;
 3584         arc_warm = B_FALSE;
 3585 
 3586         if (zfs_write_limit_max == 0)
 3587                 zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
 3588         else
 3589                 zfs_write_limit_shift = 0;
 3590         mutex_init(&zfs_write_limit_lock, NULL, MUTEX_DEFAULT, NULL);
 3591 }
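/*
 * Illustrative sketch, not part of arc.c.  It reproduces the default
 * sizing arithmetic at the top of arc_init() so the resulting limits
 * can be checked by hand.  The kernel heap clamp and the zfs_arc_min,
 * zfs_arc_max and zfs_arc_meta_limit tunables are left out, and the
 * arc_limits_t type and arc_default_limits() name are hypothetical.
 */

```c
#include <assert.h>
#include <stdint.h>

#define MAX(a, b)       ((a) > (b) ? (a) : (b))

typedef struct arc_limits {
        uint64_t c_min;         /* floor for the ARC target size */
        uint64_t c_max;         /* ceiling for the ARC target size */
} arc_limits_t;

static arc_limits_t
arc_default_limits(uint64_t physmem_bytes)
{
        arc_limits_t l;
        uint64_t c = physmem_bytes / 8;         /* 1/8 of all memory */

        /* min: 1/32 of all memory, or 64MB, whichever is more */
        l.c_min = MAX(c / 4, 64ULL << 20);

        /* max: 3/4 of all memory, or all but 1GB, whichever is more */
        if (c * 8 >= (1ULL << 30))
                l.c_max = c * 8 - (1ULL << 30);
        else
                l.c_max = l.c_min;
        l.c_max = MAX(c * 6, l.c_max);
        return (l);
}
```

/*
 * For 8GB of memory this gives c_min = 256MB and c_max = 7GB (all but
 * 1GB); for 2GB it gives c_min = 64MB and c_max = 1.5GB (3/4 of
 * memory).
 */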
 3592 
 3593 void
 3594 arc_fini(void)
 3595 {
 3596         mutex_enter(&arc_reclaim_thr_lock);
 3597         arc_thread_exit = 1;
 3598         while (arc_thread_exit != 0)
 3599                 cv_wait(&arc_reclaim_thr_cv, &arc_reclaim_thr_lock);
 3600         mutex_exit(&arc_reclaim_thr_lock);
 3601 
 3602         arc_flush(NULL);
 3603 
 3604         arc_dead = TRUE;
 3605 
 3606         if (arc_ksp != NULL) {
 3607                 kstat_delete(arc_ksp);
 3608                 arc_ksp = NULL;
 3609         }
 3610 
 3611         mutex_destroy(&arc_eviction_mtx);
 3612         mutex_destroy(&arc_reclaim_thr_lock);
 3613         cv_destroy(&arc_reclaim_thr_cv);
 3614 
 3615         list_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]);
 3616         list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
 3617         list_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]);
 3618         list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
 3619         list_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]);
 3620         list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
 3621         list_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]);
 3622         list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
 3623 
 3624         mutex_destroy(&arc_anon->arcs_mtx);
 3625         mutex_destroy(&arc_mru->arcs_mtx);
 3626         mutex_destroy(&arc_mru_ghost->arcs_mtx);
 3627         mutex_destroy(&arc_mfu->arcs_mtx);
 3628         mutex_destroy(&arc_mfu_ghost->arcs_mtx);
 3629         mutex_destroy(&arc_l2c_only->arcs_mtx);
 3630 
 3631         mutex_destroy(&zfs_write_limit_lock);
 3632 
 3633         buf_fini();
 3634 
 3635         ASSERT(arc_loaned_bytes == 0);
 3636 }
 3637 
 3638 /*
 3639  * Level 2 ARC
 3640  *
 3641  * The level 2 ARC (L2ARC) is a cache layer between main memory and disk.
 3642  * It uses dedicated storage devices to hold cached data, which are populated
 3643  * using large infrequent writes.  The main role of this cache is to boost
 3644  * the performance of random read workloads.  The intended L2ARC devices
 3645  * include short-stroked disks, solid state disks, and other media with
 3646  * substantially faster read latency than disk.
 3647  *
 3648  *                 +-----------------------+
 3649  *                 |         ARC           |
 3650  *                 +-----------------------+
 3651  *                    |         ^     ^
 3652  *                    |         |     |
 3653  *      l2arc_feed_thread()    arc_read()
 3654  *                    |         |     |
 3655  *                    |  l2arc read   |
 3656  *                    V         |     |
 3657  *               +---------------+    |
 3658  *               |     L2ARC     |    |
 3659  *               +---------------+    |
 3660  *                   |    ^           |
 3661  *          l2arc_write() |           |
 3662  *                   |    |           |
 3663  *                   V    |           |
 3664  *                 +-------+      +-------+
 3665  *                 | vdev  |      | vdev  |
 3666  *                 | cache |      | cache |
 3667  *                 +-------+      +-------+
 3668  *                 +=========+     .-----.
 3669  *                 :  L2ARC  :    |-_____-|
 3670  *                 : devices :    | Disks |
 3671  *                 +=========+    `-_____-'
 3672  *
 3673  * Read requests are satisfied from the following sources, in order:
 3674  *
 3675  *      1) ARC
 3676  *      2) vdev cache of L2ARC devices
 3677  *      3) L2ARC devices
 3678  *      4) vdev cache of disks
 3679  *      5) disks
 3680  *
 3681  * Some L2ARC device types exhibit extremely slow write performance.
 3682  * To accommodate this, there are some significant differences between
 3683  * the L2ARC and traditional cache design:
 3684  *
 3685  * 1. There is no eviction path from the ARC to the L2ARC.  Evictions from
 3686  * the ARC behave as usual, freeing buffers and placing headers on ghost
 3687  * lists.  The ARC does not send buffers to the L2ARC during eviction as
 3688  * this would add inflated write latencies for all ARC memory pressure.
 3689  *
 3690  * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
 3691  * It does this by periodically scanning buffers from the eviction-end of
 3692  * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
 3693  * not already there.  It scans until a headroom of buffers is satisfied,
 3694  * which itself is a buffer for ARC eviction.  The thread that does this is
 3695  * l2arc_feed_thread(), illustrated below; example sizes are included to
 3696  * provide a better sense of ratio than this diagram:
 3697  *
 3698  *             head -->                        tail
 3699  *              +---------------------+----------+
 3700  *      ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
 3701  *              +---------------------+----------+   |   o L2ARC eligible
 3702  *      ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
 3703  *              +---------------------+----------+   |
 3704  *                   15.9 Gbytes      ^ 32 Mbytes    |
 3705  *                                 headroom          |
 3706  *                                            l2arc_feed_thread()
 3707  *                                                   |
 3708  *                       l2arc write hand <--[oooo]--'
 3709  *                               |           8 Mbyte
 3710  *                               |          write max
 3711  *                               V
 3712  *                +==============================+
 3713  *      L2ARC dev |####|#|###|###|    |####| ... |
 3714  *                +==============================+
 3715  *                           32 Gbytes
 3716  *
 3717  * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
 3718  * evicted, then the L2ARC has cached a buffer much sooner than it probably
 3719  * needed to, potentially wasting L2ARC device bandwidth and storage.  It is
 3720  * safe to say that this is an uncommon case, since buffers at the end of
 3721  * the ARC lists have moved there due to inactivity.
 3722  *
 3723  * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
 3724  * then the L2ARC simply misses copying some buffers.  This serves as a
 3725  * pressure valve to prevent heavy read workloads from both stalling the ARC
 3726  * with waits and clogging the L2ARC with writes.  This also helps prevent
 3727  * the potential for the L2ARC to churn if it attempts to cache content too
 3728  * quickly, such as during backups of the entire pool.
 3729  *
 3730  * 5. After system boot and before the ARC has filled main memory, there are
 3731  * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
 3732  * lists can remain mostly static.  Instead of searching from the tail of these
 3733  * lists as pictured, the l2arc_feed_thread() will search from the list heads
 3734  * for eligible buffers, greatly increasing its chance of finding them.
 3735  *
 3736  * The L2ARC device write speed is also boosted during this time so that
 3737  * the L2ARC warms up faster.  Since there have been no ARC evictions yet,
 3738  * there are no L2ARC reads, and no fear of degrading read performance
 3739  * through increased writes.
 3740  *
 3741  * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
 3742  * the vdev queue can aggregate them into larger and fewer writes.  Each
 3743  * device is written to in a rotor fashion, sweeping writes through
 3744  * available space then repeating.
 3745  *
 3746  * 7. The L2ARC does not store dirty content.  It never needs to flush
 3747  * write buffers back to disk-based storage.
 3748  *
 3749  * 8. If an ARC buffer is written (and dirtied) which also exists in the
 3750  * L2ARC, the now-stale L2ARC buffer is immediately dropped.
 3751  *
 3752  * The performance of the L2ARC can be tweaked by a number of tunables, which
 3753  * may be necessary for different workloads:
 3754  *
 3755  *      l2arc_write_max         max write bytes per interval
 3756  *      l2arc_write_boost       extra write bytes during device warmup
 3757  *      l2arc_noprefetch        skip caching prefetched buffers
 3758  *      l2arc_headroom          number of max device writes to precache
 3759  *      l2arc_feed_secs         seconds between L2ARC writing
 3760  *
 3761  * Tunables may be removed or added as future performance improvements are
 3762  * integrated, and also may become zpool properties.
 3763  *
 3764  * There are three key functions that control how the L2ARC warms up:
 3765  *
 3766  *      l2arc_write_eligible()  check if a buffer is eligible to cache
 3767  *      l2arc_write_size()      calculate how much to write
 3768  *      l2arc_write_interval()  calculate sleep delay between writes
 3769  *
 3770  * These three functions determine what to write, how much, and how quickly
 3771  * to send writes.
 3772  */
 3773 
 3774 static boolean_t
 3775 l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *ab)
 3776 {
 3777         /*
 3778          * A buffer is *not* eligible for the L2ARC if it:
 3779          * 1. belongs to a different spa.
 3780          * 2. is already cached on the L2ARC.
 3781          * 3. has an I/O in progress (it may be an incomplete read).
 3782          * 4. is flagged not eligible (zfs property).
 3783          */
 3784         if (ab->b_spa != spa_guid || ab->b_l2hdr != NULL ||
 3785             HDR_IO_IN_PROGRESS(ab) || !HDR_L2CACHE(ab))
 3786                 return (B_FALSE);
 3787 
 3788         return (B_TRUE);
 3789 }
 3790 
 3791 static uint64_t
 3792 l2arc_write_size(l2arc_dev_t *dev)
 3793 {
 3794         uint64_t size;
 3795 
 3796         size = dev->l2ad_write;
 3797 
 3798         if (arc_warm == B_FALSE)
 3799                 size += dev->l2ad_boost;
 3800 
 3801         return (size);
 3802 
 3803 }
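
/*
 * A minimal standalone sketch, not part of arc.c, of how the per-interval
 * write size and the scan headroom relate.  The demo_* names are
 * hypothetical.  The 8 Mbyte write size and the 32 Mbyte headroom come
 * from the diagram in the Level 2 ARC comment; the multiplier of 4 is
 * only inferred from that ratio, and the 8 Mbyte boost is an assumed
 * example value.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the l2ad_write and l2ad_boost device fields. */
struct demo_l2dev {
	uint64_t write;		/* bytes to write per feed interval */
	uint64_t boost;		/* extra bytes while the ARC is still cold */
};

/* Mirrors l2arc_write_size(): the boost applies until the ARC is warm. */
static uint64_t
demo_write_size(const struct demo_l2dev *dev, bool arc_is_warm)
{
	assert(dev != NULL);
	return (arc_is_warm ? dev->write : dev->write + dev->boost);
}

/* Mirrors "headroom = target_sz * l2arc_headroom" in l2arc_write_buffers(). */
static uint64_t
demo_headroom(uint64_t target_sz, uint64_t headroom_mult)
{
	return (target_sz * headroom_mult);
}
```

/*
 * With these example values a warm ARC yields an 8 Mbyte write and a
 * 32 Mbyte scan window; during warmup the write size doubles, and the
 * headroom grows with it because it is derived from the write size.
 */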
 3804 
 3805 static clock_t
 3806 l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
 3807 {
 3808         clock_t interval, next;
 3809 
 3810         /*
 3811          * If the ARC lists are busy, increase our write rate; if the
 3812          * lists are stale, idle back.  This is achieved by checking
 3813          * how much we previously wrote - if it was more than half of
 3814          * what we wanted, schedule the next write much sooner.
 3815          */
 3816         if (l2arc_feed_again && wrote > (wanted / 2))
 3817                 interval = (hz * l2arc_feed_min_ms) / 1000;
 3818         else
 3819                 interval = hz * l2arc_feed_secs;
 3820 
 3821         next = MAX(lbolt, MIN(lbolt + interval, began + interval));
 3822 
 3823         return (next);
 3824 }
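
/*
 * A minimal standalone sketch, not part of arc.c, of the scheduling rule
 * in l2arc_write_interval().  The tunables and the current tick (lbolt in
 * the kernel) are passed in explicitly so the arithmetic can be run in
 * userland.  The struct, the demo_* names and any values a caller
 * supplies are assumptions for illustration.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	DEMO_MAX(a, b)	((a) > (b) ? (a) : (b))
#define	DEMO_MIN(a, b)	((a) < (b) ? (a) : (b))

/* Hypothetical bundle of the tunables the kernel reads as globals. */
struct demo_feed_cfg {
	int64_t hz;		/* clock ticks per second */
	int64_t feed_secs;	/* normal delay between writes, seconds */
	int64_t feed_min_ms;	/* shortened delay when busy, milliseconds */
	int feed_again;		/* nonzero enables the shortened delay */
};

/* Return the tick at which the next feed cycle should start. */
static int64_t
demo_write_interval(const struct demo_feed_cfg *cfg, int64_t now,
    int64_t began, uint64_t wanted, uint64_t wrote)
{
	int64_t interval;

	assert(cfg != NULL && cfg->hz > 0);

	/* More than half of the target was written: feed again sooner. */
	if (cfg->feed_again && wrote > (wanted / 2))
		interval = (cfg->hz * cfg->feed_min_ms) / 1000;
	else
		interval = cfg->hz * cfg->feed_secs;

	/* Measure from the start of the last cycle, but never in the past. */
	return (DEMO_MAX(now, DEMO_MIN(now + interval, began + interval)));
}
```

/*
 * The MIN() measures the interval from when the previous cycle began
 * rather than from when it finished, and the MAX() clamps the result to
 * the present, so a cycle that overran its interval is followed
 * immediately by the next one.
 */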
 3825 
 3826 static void
 3827 l2arc_hdr_stat_add(void)
 3828 {
 3829         ARCSTAT_INCR(arcstat_l2_hdr_size, HDR_SIZE + L2HDR_SIZE);
 3830         ARCSTAT_INCR(arcstat_hdr_size, -HDR_SIZE);
 3831 }
 3832 
 3833 static void
 3834 l2arc_hdr_stat_remove(void)
 3835 {
 3836         ARCSTAT_INCR(arcstat_l2_hdr_size, -(HDR_SIZE + L2HDR_SIZE));
 3837         ARCSTAT_INCR(arcstat_hdr_size, HDR_SIZE);
 3838 }
 3839 
 3840 /*
 3841  * Cycle through L2ARC devices.  This is how L2ARC load balances.
 3842  * If a device is returned, this also returns holding the spa config lock.
 3843  */
 3844 static l2arc_dev_t *
 3845 l2arc_dev_get_next(void)
 3846 {
 3847         l2arc_dev_t *first, *next = NULL;
 3848 
 3849         /*
 3850          * Lock out the removal of spas (spa_namespace_lock), then removal
 3851          * of cache devices (l2arc_dev_mtx).  Once a device has been selected,
 3852          * both locks will be dropped and a spa config lock held instead.
 3853          */
 3854         mutex_enter(&spa_namespace_lock);
 3855         mutex_enter(&l2arc_dev_mtx);
 3856 
 3857         /* if there are no vdevs, there is nothing to do */
 3858         if (l2arc_ndev == 0)
 3859                 goto out;
 3860 
 3861         first = NULL;
 3862         next = l2arc_dev_last;
 3863         do {
 3864                 /* loop around the list looking for a non-faulted vdev */
 3865                 if (next == NULL) {
 3866                         next = list_head(l2arc_dev_list);
 3867                 } else {
 3868                         next = list_next(l2arc_dev_list, next);
 3869                         if (next == NULL)
 3870                                 next = list_head(l2arc_dev_list);
 3871                 }
 3872 
 3873                 /* if we have come back to the start, bail out */
 3874                 if (first == NULL)
 3875                         first = next;
 3876                 else if (next == first)
 3877                         break;
 3878 
 3879         } while (vdev_is_dead(next->l2ad_vdev));
 3880 
 3881         /* if we were unable to find any usable vdevs, return NULL */
 3882         if (vdev_is_dead(next->l2ad_vdev))
 3883                 next = NULL;
 3884 
 3885         l2arc_dev_last = next;
 3886 
 3887 out:
 3888         mutex_exit(&l2arc_dev_mtx);
 3889 
 3890         /*
 3891          * Grab the config lock to prevent the 'next' device from being
 3892          * removed while we are writing to it.
 3893          */
 3894         if (next != NULL)
 3895                 spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
 3896         mutex_exit(&spa_namespace_lock);
 3897 
 3898         return (next);
 3899 }
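
/*
 * A minimal standalone sketch, not part of arc.c, of the rotor used by
 * l2arc_dev_get_next().  It assumes the device list is flattened into an
 * array of "dead" flags and leaves out all locking; demo_dev_get_next()
 * and its parameters are hypothetical names.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the index of the next live device after 'last', wrapping around
 * the array, or -1 if every device is dead.  Pass last = -1 when no
 * device has been selected yet.
 */
static int
demo_dev_get_next(const bool *dead, int ndev, int last)
{
	assert(ndev >= 0 && last >= -1 && last < ndev);

	/* Visit each device once, starting just past the previous pick. */
	for (int step = 1; step <= ndev; step++) {
		int idx = (last + step) % ndev;

		if (!dead[idx])
			return (idx);
	}

	return (-1);
}
```

/*
 * As in the kernel loop, the previously selected device is the last
 * candidate to be reconsidered, so a single healthy device is returned on
 * every call while several healthy devices take turns.
 */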
 3900 
 3901 /*
 3902  * Free buffers that were tagged for destruction.
 3903  */
 3904 static void
 3905 l2arc_do_free_on_write(void)
 3906 {
 3907         list_t *buflist;
 3908         l2arc_data_free_t *df, *df_prev;
 3909 
 3910         mutex_enter(&l2arc_free_on_write_mtx);
 3911         buflist = l2arc_free_on_write;
 3912 
 3913         for (df = list_tail(buflist); df; df = df_prev) {
 3914                 df_prev = list_prev(buflist, df);
 3915                 ASSERT(df->l2df_data != NULL);
 3916                 ASSERT(df->l2df_func != NULL);
 3917                 df->l2df_func(df->l2df_data, df->l2df_size);
 3918                 list_remove(buflist, df);
 3919                 kmem_free(df, sizeof (l2arc_data_free_t));
 3920         }
 3921 
 3922         mutex_exit(&l2arc_free_on_write_mtx);
 3923 }
 3924 
 3925 /*
 3926  * A write to a cache device has completed.  Update all headers to allow
 3927  * reads from these buffers to begin.
 3928  */
 3929 static void
 3930 l2arc_write_done(zio_t *zio)
 3931 {
 3932         l2arc_write_callback_t *cb;
 3933         l2arc_dev_t *dev;
 3934         list_t *buflist;
 3935         arc_buf_hdr_t *head, *ab, *ab_prev;
 3936         l2arc_buf_hdr_t *abl2;
 3937         kmutex_t *hash_lock;
 3938 
 3939         cb = zio->io_private;
 3940         ASSERT(cb != NULL);
 3941         dev = cb->l2wcb_dev;
 3942         ASSERT(dev != NULL);
 3943         head = cb->l2wcb_head;
 3944         ASSERT(head != NULL);
 3945         buflist = dev->l2ad_buflist;
 3946         ASSERT(buflist != NULL);
 3947         DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
 3948             l2arc_write_callback_t *, cb);
 3949 
 3950         if (zio->io_error != 0)
 3951                 ARCSTAT_BUMP(arcstat_l2_writes_error);
 3952 
 3953         mutex_enter(&l2arc_buflist_mtx);
 3954 
 3955         /*
 3956          * All writes completed, or an error was hit.
 3957          */
 3958         for (ab = list_prev(buflist, head); ab; ab = ab_prev) {
 3959                 ab_prev = list_prev(buflist, ab);
 3960 
 3961                 hash_lock = HDR_LOCK(ab);
 3962                 if (!mutex_tryenter(hash_lock)) {
 3963                         /*
 3964                          * This buffer misses out.  It may be in a stage
 3965                          * of eviction.  Its ARC_L2_WRITING flag will be
 3966                          * left set, denying reads to this buffer.
 3967                          */
 3968                         ARCSTAT_BUMP(arcstat_l2_writes_hdr_miss);
 3969                         continue;
 3970                 }
 3971 
 3972                 if (zio->io_error != 0) {
 3973                         /*
 3974                          * Error - drop L2ARC entry.
 3975                          */
 3976                         list_remove(buflist, ab);
 3977                         abl2 = ab->b_l2hdr;
 3978                         ab->b_l2hdr = NULL;
 3979                         kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
 3980                         ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
 3981                 }
 3982 
 3983                 /*
 3984                  * Allow ARC to begin reads to this L2ARC entry.
 3985                  */
 3986                 ab->b_flags &= ~ARC_L2_WRITING;
 3987 
 3988                 mutex_exit(hash_lock);
 3989         }
 3990 
 3991         atomic_inc_64(&l2arc_writes_done);
 3992         list_remove(buflist, head);
 3993         kmem_cache_free(hdr_cache, head);
 3994         mutex_exit(&l2arc_buflist_mtx);
 3995 
 3996         l2arc_do_free_on_write();
 3997 
 3998         kmem_free(cb, sizeof (l2arc_write_callback_t));
 3999 }
 4000 
 4001 /*
 4002  * A read to a cache device completed.  Validate buffer contents before
 4003  * handing over to the regular ARC routines.
 4004  */
 4005 static void
 4006 l2arc_read_done(zio_t *zio)
 4007 {
 4008         l2arc_read_callback_t *cb;
 4009         arc_buf_hdr_t *hdr;
 4010         arc_buf_t *buf;
 4011         kmutex_t *hash_lock;
 4012         int equal;
 4013 
 4014         ASSERT(zio->io_vd != NULL);
 4015         ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
 4016 
 4017         spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
 4018 
 4019         cb = zio->io_private;
 4020         ASSERT(cb != NULL);
 4021         buf = cb->l2rcb_buf;
 4022         ASSERT(buf != NULL);
 4023         hdr = buf->b_hdr;
 4024         ASSERT(hdr != NULL);
 4025 
 4026         hash_lock = HDR_LOCK(hdr);
 4027         mutex_enter(hash_lock);
 4028 
 4029         /*
 4030          * Check this survived the L2ARC journey.
 4031          */
 4032         equal = arc_cksum_equal(buf);
 4033         if (equal && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
 4034                 mutex_exit(hash_lock);
 4035                 zio->io_private = buf;
 4036                 zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
 4037                 zio->io_bp = &zio->io_bp_copy;  /* XXX fix in L2ARC 2.0 */
 4038                 arc_read_done(zio);
 4039         } else {
 4040                 mutex_exit(hash_lock);
 4041                 /*
 4042                  * Buffer didn't survive caching.  Increment stats and
 4043                  * reissue to the original storage device.
 4044                  */
 4045                 if (zio->io_error != 0) {
 4046                         ARCSTAT_BUMP(arcstat_l2_io_error);
 4047                 } else {
 4048                         zio->io_error = EIO;
 4049                 }
 4050                 if (!equal)
 4051                         ARCSTAT_BUMP(arcstat_l2_cksum_bad);
 4052 
 4053                 /*
 4054                  * If there's no waiter, issue an async i/o to the primary
 4055                  * storage now.  If there *is* a waiter, the caller must
 4056                  * issue the i/o in a context where it's OK to block.
 4057                  */
 4058                 if (zio->io_waiter == NULL) {
 4059                         zio_t *pio = zio_unique_parent(zio);
 4060 
 4061                         ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
 4062 
 4063                         zio_nowait(zio_read(pio, cb->l2rcb_spa, &cb->l2rcb_bp,
 4064                             buf->b_data, zio->io_size, arc_read_done, buf,
 4065                             zio->io_priority, cb->l2rcb_flags, &cb->l2rcb_zb));
 4066                 }
 4067         }
 4068 
 4069         kmem_free(cb, sizeof (l2arc_read_callback_t));
 4070 }
 4071 
 4072 /*
 4073  * This is the list priority from which the L2ARC will search for buffers to
 4074  * cache.  This is used within loops (0..3) to cycle through lists in the
 4075  * desired order.  This order can have a significant effect on cache
 4076  * performance.
 4077  *
 4078  * Currently the metadata lists are hit first, MFU then MRU, followed by
 4079  * the data lists.  This function returns a locked list, and also returns
 4080  * the lock pointer.
 4081  */
 4082 static list_t *
 4083 l2arc_list_locked(int list_num, kmutex_t **lock)
 4084 {
 4085         list_t *list;
 4086 
 4087         ASSERT(list_num >= 0 && list_num <= 3);
 4088 
 4089         switch (list_num) {
 4090         case 0:
 4091                 list = &arc_mfu->arcs_list[ARC_BUFC_METADATA];
 4092                 *lock = &arc_mfu->arcs_mtx;
 4093                 break;
 4094         case 1:
 4095                 list = &arc_mru->arcs_list[ARC_BUFC_METADATA];
 4096                 *lock = &arc_mru->arcs_mtx;
 4097                 break;
 4098         case 2:
 4099                 list = &arc_mfu->arcs_list[ARC_BUFC_DATA];
 4100                 *lock = &arc_mfu->arcs_mtx;
 4101                 break;
 4102         case 3:
 4103                 list = &arc_mru->arcs_list[ARC_BUFC_DATA];
 4104                 *lock = &arc_mru->arcs_mtx;
 4105                 break;
 4106         }
 4107 
 4108         ASSERT(!(MUTEX_HELD(*lock)));
 4109         mutex_enter(*lock);
 4110         return (list);
 4111 }
 4112 
 4113 /*
 4114  * Evict buffers from the device write hand to the distance specified in
 4115  * bytes.  This distance may span populated buffers, it may span nothing.
 4116  * This is clearing a region on the L2ARC device ready for writing.
 4117  * If the 'all' boolean is set, every buffer is evicted.
 4118  */
 4119 static void
 4120 l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
 4121 {
 4122         list_t *buflist;
 4123         l2arc_buf_hdr_t *abl2;
 4124         arc_buf_hdr_t *ab, *ab_prev;
 4125         kmutex_t *hash_lock;
 4126         uint64_t taddr;
 4127 
 4128         buflist = dev->l2ad_buflist;
 4129 
 4130         if (buflist == NULL)
 4131                 return;
 4132 
 4133         if (!all && dev->l2ad_first) {
 4134                 /*
 4135                  * This is the first sweep through the device.  There is
 4136                  * nothing to evict.
 4137                  */
 4138                 return;
 4139         }
 4140 
 4141         if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
 4142                 /*
 4143                  * When nearing the end of the device, evict to the end
 4144                  * before the device write hand jumps to the start.
 4145                  */
 4146                 taddr = dev->l2ad_end;
 4147         } else {
 4148                 taddr = dev->l2ad_hand + distance;
 4149         }
 4150         DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
 4151             uint64_t, taddr, boolean_t, all);
 4152 
 4153 top:
 4154         mutex_enter(&l2arc_buflist_mtx);
 4155         for (ab = list_tail(buflist); ab; ab = ab_prev) {
 4156                 ab_prev = list_prev(buflist, ab);
 4157 
 4158                 hash_lock = HDR_LOCK(ab);
 4159                 if (!mutex_tryenter(hash_lock)) {
 4160                         /*
 4161                          * Missed the hash lock.  Retry.
 4162                          */
 4163                         ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
 4164                         mutex_exit(&l2arc_buflist_mtx);
 4165                         mutex_enter(hash_lock);
 4166                         mutex_exit(hash_lock);
 4167                         goto top;
 4168                 }
 4169 
 4170                 if (HDR_L2_WRITE_HEAD(ab)) {
 4171                         /*
 4172                          * We hit a write head node.  Leave it for
 4173                          * l2arc_write_done().
 4174                          */
 4175                         list_remove(buflist, ab);
 4176                         mutex_exit(hash_lock);
 4177                         continue;
 4178                 }
 4179 
 4180                 if (!all && ab->b_l2hdr != NULL &&
 4181                     (ab->b_l2hdr->b_daddr > taddr ||
 4182                     ab->b_l2hdr->b_daddr < dev->l2ad_hand)) {
 4183                         /*
 4184                          * We've evicted to the target address,
 4185                          * or the end of the device.
 4186                          */
 4187                         mutex_exit(hash_lock);
 4188                         break;
 4189                 }
 4190 
 4191                 if (HDR_FREE_IN_PROGRESS(ab)) {
 4192                         /*
 4193                          * Already on the path to destruction.
 4194                          */
 4195                         mutex_exit(hash_lock);
 4196                         continue;
 4197                 }
 4198 
 4199                 if (ab->b_state == arc_l2c_only) {
 4200                         ASSERT(!HDR_L2_READING(ab));
 4201                         /*
 4202                          * This doesn't exist in the ARC.  Destroy.
 4203                          * arc_hdr_destroy() will call list_remove()
 4204                          * and decrement arcstat_l2_size.
 4205                          */
 4206                         arc_change_state(arc_anon, ab, hash_lock);
 4207                         arc_hdr_destroy(ab);
 4208                 } else {
 4209                         /*
 4210                          * Invalidate issued or about to be issued
 4211                          * reads, since we may be about to write
 4212                          * over this location.
 4213                          */
 4214                         if (HDR_L2_READING(ab)) {
 4215                                 ARCSTAT_BUMP(arcstat_l2_evict_reading);
 4216                                 ab->b_flags |= ARC_L2_EVICTED;
 4217                         }
 4218 
 4219                         /*
 4220                          * Tell ARC this no longer exists in L2ARC.
 4221                          */
 4222                         if (ab->b_l2hdr != NULL) {
 4223                                 abl2 = ab->b_l2hdr;
 4224                                 ab->b_l2hdr = NULL;
 4225                                 kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
 4226                                 ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
 4227                         }
 4228                         list_remove(buflist, ab);
 4229 
 4230                         /*
 4231                          * This may have been leftover after a
 4232                          * failed write.
 4233                          */
 4234                         ab->b_flags &= ~ARC_L2_WRITING;
 4235                 }
 4236                 mutex_exit(hash_lock);
 4237         }
 4238         mutex_exit(&l2arc_buflist_mtx);
 4239 
 4240         spa_l2cache_space_update(dev->l2ad_vdev, 0, -(taddr - dev->l2ad_evict));
 4241         dev->l2ad_evict = taddr;
 4242 }
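
/*
 * A minimal standalone sketch, not part of arc.c, of how l2arc_evict()
 * chooses its target address.  demo_evict_target() is a hypothetical
 * name, and the precondition on 'end' is an assumption added so the
 * unsigned subtraction cannot wrap in this sketch.
 */

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the device address up to which buffers should be evicted, given
 * the write hand, the end of the device and the distance to clear.
 */
static uint64_t
demo_evict_target(uint64_t hand, uint64_t end, uint64_t distance)
{
	assert(end >= 2 * distance);

	/* Near the end: clear through to the end before the hand wraps. */
	if (hand >= end - (2 * distance))
		return (end);

	return (hand + distance);
}
```

/*
 * Evicting to the end once the hand is within two distances of it means
 * the region the writer will skip when it jumps back to the start has
 * already been cleared.
 */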
 4243 
 4244 /*
 4245  * Find and write ARC buffers to the L2ARC device.
 4246  *
 4247  * An ARC_L2_WRITING flag is set so that the L2ARC buffers are not valid
 4248  * for reading until they have completed writing.
 4249  */
 4250 static uint64_t
 4251 l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
 4252 {
 4253         arc_buf_hdr_t *ab, *ab_prev, *head;
 4254         l2arc_buf_hdr_t *hdrl2;
 4255         list_t *list;
 4256         uint64_t passed_sz, write_sz, buf_sz, headroom;
 4257         void *buf_data;
 4258         kmutex_t *hash_lock, *list_lock;
 4259         boolean_t have_lock, full;
 4260         l2arc_write_callback_t *cb;
 4261         zio_t *pio, *wzio;
 4262         uint64_t guid = spa_guid(spa);
 4263 
 4264         ASSERT(dev->l2ad_vdev != NULL);
 4265 
 4266         pio = NULL;
 4267         write_sz = 0;
 4268         full = B_FALSE;
 4269         head = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
 4270         head->b_flags |= ARC_L2_WRITE_HEAD;
 4271 
 4272         /*
 4273          * Copy buffers for L2ARC writing.
 4274          */
 4275         mutex_enter(&l2arc_buflist_mtx);
 4276         for (int try = 0; try <= 3; try++) {
 4277                 list = l2arc_list_locked(try, &list_lock);
 4278                 passed_sz = 0;
 4279 
 4280                 /*
 4281                  * L2ARC fast warmup.
 4282                  *
 4283                  * Until the ARC is warm and starts to evict, read from the
 4284                  * head of the ARC lists rather than the tail.
 4285                  */
 4286                 headroom = target_sz * l2arc_headroom;
 4287                 if (arc_warm == B_FALSE)
 4288                         ab = list_head(list);
 4289                 else
 4290                         ab = list_tail(list);
 4291 
 4292                 for (; ab; ab = ab_prev) {
 4293                         if (arc_warm == B_FALSE)
 4294                                 ab_prev = list_next(list, ab);
 4295                         else
 4296                                 ab_prev = list_prev(list, ab);
 4297 
 4298                         hash_lock = HDR_LOCK(ab);
 4299                         have_lock = MUTEX_HELD(hash_lock);
 4300                         if (!have_lock && !mutex_tryenter(hash_lock)) {
 4301                                 /*
 4302                                  * Skip this buffer rather than waiting.
 4303                                  */
 4304                                 continue;
 4305                         }
 4306 
 4307                         passed_sz += ab->b_size;
 4308                         if (passed_sz > headroom) {
 4309                                 /*
 4310                                  * Searched too far.
 4311                                  */
 4312                                 mutex_exit(hash_lock);
 4313                                 break;
 4314                         }
 4315 
 4316                         if (!l2arc_write_eligible(guid, ab)) {
 4317                                 mutex_exit(hash_lock);
 4318                                 continue;
 4319                         }
 4320 
 4321                         if ((write_sz + ab->b_size) > target_sz) {
 4322                                 full = B_TRUE;
 4323                                 mutex_exit(hash_lock);
 4324                                 break;
 4325                         }
 4326 
 4327                         if (pio == NULL) {
 4328                                 /*
 4329                                  * Insert a dummy header on the buflist so
 4330                                  * l2arc_write_done() can find where the
 4331                                  * write buffers begin without searching.
 4332                                  */
 4333                                 list_insert_head(dev->l2ad_buflist, head);
 4334 
 4335                                 cb = kmem_alloc(
 4336                                     sizeof (l2arc_write_callback_t), KM_SLEEP);
 4337                                 cb->l2wcb_dev = dev;
 4338                                 cb->l2wcb_head = head;
 4339                                 pio = zio_root(spa, l2arc_write_done, cb,
 4340                                     ZIO_FLAG_CANFAIL);
 4341                         }
 4342 
 4343                         /*
 4344                          * Create and add a new L2ARC header.
 4345                          */
 4346                         hdrl2 = kmem_zalloc(sizeof (l2arc_buf_hdr_t), KM_SLEEP);
 4347                         hdrl2->b_dev = dev;
 4348                         hdrl2->b_daddr = dev->l2ad_hand;
 4349 
 4350                         ab->b_flags |= ARC_L2_WRITING;
 4351                         ab->b_l2hdr = hdrl2;
 4352                         list_insert_head(dev->l2ad_buflist, ab);
 4353                         buf_data = ab->b_buf->b_data;
 4354                         buf_sz = ab->b_size;
 4355 
 4356                         /*
                         * Compute and store the buffer cksum before
                         * writing.  On debug the cksum is verified first.
                         */
                        arc_cksum_verify(ab->b_buf);
                        arc_cksum_compute(ab->b_buf, B_TRUE);

                        mutex_exit(hash_lock);

                        wzio = zio_write_phys(pio, dev->l2ad_vdev,
                            dev->l2ad_hand, buf_sz, buf_data, ZIO_CHECKSUM_OFF,
                            NULL, NULL, ZIO_PRIORITY_ASYNC_WRITE,
                            ZIO_FLAG_CANFAIL, B_FALSE);

                        DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
                            zio_t *, wzio);
                        (void) zio_nowait(wzio);

                        /*
                         * Keep the clock hand suitably device-aligned.
                         */
                        buf_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);

                        write_sz += buf_sz;
                        dev->l2ad_hand += buf_sz;
                }

                mutex_exit(list_lock);

                if (full == B_TRUE)
                        break;
        }
        mutex_exit(&l2arc_buflist_mtx);

        if (pio == NULL) {
                ASSERT3U(write_sz, ==, 0);
                kmem_cache_free(hdr_cache, head);
                return (0);
        }

        ASSERT3U(write_sz, <=, target_sz);
        ARCSTAT_BUMP(arcstat_l2_writes_sent);
        ARCSTAT_INCR(arcstat_l2_write_bytes, write_sz);
        ARCSTAT_INCR(arcstat_l2_size, write_sz);
        spa_l2cache_space_update(dev->l2ad_vdev, 0, write_sz);

        /*
         * Bump device hand to the device start if it is approaching the end.
         * l2arc_evict() will already have evicted ahead for this case.
         */
        if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
                spa_l2cache_space_update(dev->l2ad_vdev, 0,
                    dev->l2ad_end - dev->l2ad_hand);
                dev->l2ad_hand = dev->l2ad_start;
                dev->l2ad_evict = dev->l2ad_start;
                dev->l2ad_first = B_FALSE;
        }

        dev->l2ad_writing = B_TRUE;
        (void) zio_wait(pio);
        dev->l2ad_writing = B_FALSE;

        return (write_sz);
}

/*
 * This thread feeds the L2ARC at regular intervals.  This is the beating
 * heart of the L2ARC.
 */
static void
l2arc_feed_thread(void)
{
        callb_cpr_t cpr;
        l2arc_dev_t *dev;
        spa_t *spa;
        uint64_t size, wrote;
        clock_t begin, next = lbolt;

        CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);

        mutex_enter(&l2arc_feed_thr_lock);

        while (l2arc_thread_exit == 0) {
                CALLB_CPR_SAFE_BEGIN(&cpr);
                (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
                    next);
                CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
                next = lbolt + hz;

                /*
                 * Quick check for L2ARC devices.
                 */
                mutex_enter(&l2arc_dev_mtx);
                if (l2arc_ndev == 0) {
                        mutex_exit(&l2arc_dev_mtx);
                        continue;
                }
                mutex_exit(&l2arc_dev_mtx);
                begin = lbolt;

                /*
                 * This selects the next l2arc device to write to, and in
                 * doing so the next spa to feed from: dev->l2ad_spa.  This
                 * will return NULL if there are now no l2arc devices or if
                 * they are all faulted.
                 *
                 * If a device is returned, its spa's config lock is also
                 * held to prevent device removal.  l2arc_dev_get_next()
                 * will grab and release l2arc_dev_mtx.
                 */
                if ((dev = l2arc_dev_get_next()) == NULL)
                        continue;

                spa = dev->l2ad_spa;
                ASSERT(spa != NULL);

                /*
                 * Avoid contributing to memory pressure.
                 */
                if (arc_reclaim_needed()) {
                        ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
                        spa_config_exit(spa, SCL_L2ARC, dev);
                        continue;
                }

                ARCSTAT_BUMP(arcstat_l2_feeds);

                size = l2arc_write_size(dev);

                /*
                 * Evict L2ARC buffers that will be overwritten.
                 */
                l2arc_evict(dev, size, B_FALSE);

                /*
                 * Write ARC buffers.
                 */
                wrote = l2arc_write_buffers(spa, dev, size);

                /*
                 * Calculate interval between writes.
                 */
                next = l2arc_write_interval(begin, size, wrote);
                spa_config_exit(spa, SCL_L2ARC, dev);
        }

        l2arc_thread_exit = 0;
        cv_broadcast(&l2arc_feed_thr_cv);
        CALLB_CPR_EXIT(&cpr);           /* drops l2arc_feed_thr_lock */
        thread_exit();
}

/*
 * Return B_TRUE if the given vdev is already on the L2ARC device list.
 */
boolean_t
l2arc_vdev_present(vdev_t *vd)
{
        l2arc_dev_t *dev;

        mutex_enter(&l2arc_dev_mtx);
        for (dev = list_head(l2arc_dev_list); dev != NULL;
            dev = list_next(l2arc_dev_list, dev)) {
                if (dev->l2ad_vdev == vd)
                        break;
        }
        mutex_exit(&l2arc_dev_mtx);

        return (dev != NULL);
}

/*
 * Add a vdev for use by the L2ARC.  By this point the spa has already
 * validated the vdev and opened it.
 */
void
l2arc_add_vdev(spa_t *spa, vdev_t *vd)
{
        l2arc_dev_t *adddev;

        ASSERT(!l2arc_vdev_present(vd));

        /*
         * Create a new l2arc device entry.
         */
        adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
        adddev->l2ad_spa = spa;
        adddev->l2ad_vdev = vd;
        adddev->l2ad_write = l2arc_write_max;
        adddev->l2ad_boost = l2arc_write_boost;
        adddev->l2ad_start = VDEV_LABEL_START_SIZE;
        adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
        adddev->l2ad_hand = adddev->l2ad_start;
        adddev->l2ad_evict = adddev->l2ad_start;
        adddev->l2ad_first = B_TRUE;
        adddev->l2ad_writing = B_FALSE;
        ASSERT3U(adddev->l2ad_write, >, 0);

        /*
         * This is a list of all ARC buffers that are still valid on the
         * device.
         */
        adddev->l2ad_buflist = kmem_zalloc(sizeof (list_t), KM_SLEEP);
        list_create(adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
            offsetof(arc_buf_hdr_t, b_l2node));

        spa_l2cache_space_update(vd, adddev->l2ad_end - adddev->l2ad_hand, 0);

        /*
         * Add device to global list
         */
        mutex_enter(&l2arc_dev_mtx);
        list_insert_head(l2arc_dev_list, adddev);
        atomic_inc_64(&l2arc_ndev);
        mutex_exit(&l2arc_dev_mtx);
}

/*
 * Remove a vdev from the L2ARC.
 */
void
l2arc_remove_vdev(vdev_t *vd)
{
        l2arc_dev_t *dev, *nextdev, *remdev = NULL;

        /*
         * Find the device by vdev
         */
        mutex_enter(&l2arc_dev_mtx);
        for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
                nextdev = list_next(l2arc_dev_list, dev);
                if (vd == dev->l2ad_vdev) {
                        remdev = dev;
                        break;
                }
        }
        ASSERT(remdev != NULL);

        /*
         * Remove device from global list
         */
        list_remove(l2arc_dev_list, remdev);
        l2arc_dev_last = NULL;          /* may have been invalidated */
        atomic_dec_64(&l2arc_ndev);
        mutex_exit(&l2arc_dev_mtx);

        /*
         * Clear all buflists and ARC references.  L2ARC device flush.
         */
        l2arc_evict(remdev, 0, B_TRUE);
        list_destroy(remdev->l2ad_buflist);
        kmem_free(remdev->l2ad_buflist, sizeof (list_t));
        kmem_free(remdev, sizeof (l2arc_dev_t));
}

/*
 * Initialize L2ARC global state: counters, locks, the feed thread's
 * condition variable, and the device and free-on-write lists.
 */
void
l2arc_init(void)
{
        l2arc_thread_exit = 0;
        l2arc_ndev = 0;
        l2arc_writes_sent = 0;
        l2arc_writes_done = 0;

        mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
        cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
        mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
        mutex_init(&l2arc_buflist_mtx, NULL, MUTEX_DEFAULT, NULL);
        mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);

        l2arc_dev_list = &L2ARC_dev_list;
        l2arc_free_on_write = &L2ARC_free_on_write;
        list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
            offsetof(l2arc_dev_t, l2ad_node));
        list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
            offsetof(l2arc_data_free_t, l2df_list_node));
}

void
l2arc_fini(void)
{
        /*
         * This is called from dmu_fini(), which is called from spa_fini().
         * Because of this, we can assume that all l2arc devices have
         * already been removed when the pools themselves were removed.
         */

        l2arc_do_free_on_write();

        mutex_destroy(&l2arc_feed_thr_lock);
        cv_destroy(&l2arc_feed_thr_cv);
        mutex_destroy(&l2arc_dev_mtx);
        mutex_destroy(&l2arc_buflist_mtx);
        mutex_destroy(&l2arc_free_on_write_mtx);

        list_destroy(l2arc_dev_list);
        list_destroy(l2arc_free_on_write);
}

/*
 * Start the L2ARC feed thread.  Nothing is started if spa_mode_global
 * does not include FWRITE.
 */
void
l2arc_start(void)
{
        if (!(spa_mode_global & FWRITE))
                return;

        (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
            TS_RUN, minclsyspri);
}

/*
 * Ask the feed thread to exit, then wait until it acknowledges by
 * resetting l2arc_thread_exit to 0.
 */
void
l2arc_stop(void)
{
        if (!(spa_mode_global & FWRITE))
                return;

        mutex_enter(&l2arc_feed_thr_lock);
        cv_signal(&l2arc_feed_thr_cv);  /* kick thread out of startup */
        l2arc_thread_exit = 1;
        while (l2arc_thread_exit != 0)
                cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
        mutex_exit(&l2arc_feed_thr_lock);
}
