FreeBSD/Linux Kernel Cross Reference
sys/contrib/openzfs/module/zfs/arc.c


    1 /*
    2  * CDDL HEADER START
    3  *
    4  * The contents of this file are subject to the terms of the
    5  * Common Development and Distribution License (the "License").
    6  * You may not use this file except in compliance with the License.
    7  *
    8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
    9  * or https://opensource.org/licenses/CDDL-1.0.
   10  * See the License for the specific language governing permissions
   11  * and limitations under the License.
   12  *
   13  * When distributing Covered Code, include this CDDL HEADER in each
   14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
   15  * If applicable, add the following below this CDDL HEADER, with the
   16  * fields enclosed by brackets "[]" replaced with your own identifying
   17  * information: Portions Copyright [yyyy] [name of copyright owner]
   18  *
   19  * CDDL HEADER END
   20  */
   21 /*
   22  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
   23  * Copyright (c) 2018, Joyent, Inc.
   24  * Copyright (c) 2011, 2020, Delphix. All rights reserved.
   25  * Copyright (c) 2014, Saso Kiselkov. All rights reserved.
   26  * Copyright (c) 2017, Nexenta Systems, Inc.  All rights reserved.
   27  * Copyright (c) 2019, loli10K <ezomori.nozomu@gmail.com>. All rights reserved.
   28  * Copyright (c) 2020, George Amanakis. All rights reserved.
   29  * Copyright (c) 2019, Klara Inc.
   30  * Copyright (c) 2019, Allan Jude
   31  * Copyright (c) 2020, The FreeBSD Foundation [1]
   32  *
   33  * [1] Portions of this software were developed by Allan Jude
   34  *     under sponsorship from the FreeBSD Foundation.
   35  */
   36 
   37 /*
   38  * DVA-based Adjustable Replacement Cache
   39  *
   40  * While much of the theory of operation used here is
   41  * based on the self-tuning, low overhead replacement cache
   42  * presented by Megiddo and Modha at FAST 2003, there are some
   43  * significant differences:
   44  *
   45  * 1. The Megiddo and Modha model assumes any page is evictable.
   46  * Pages in its cache cannot be "locked" into memory.  This makes
   47  * the eviction algorithm simple: evict the last page in the list.
    48  * This also makes the performance characteristics easy to reason
   49  * about.  Our cache is not so simple.  At any given moment, some
   50  * subset of the blocks in the cache are un-evictable because we
   51  * have handed out a reference to them.  Blocks are only evictable
   52  * when there are no external references active.  This makes
   53  * eviction far more problematic:  we choose to evict the evictable
   54  * blocks that are the "lowest" in the list.
   55  *
   56  * There are times when it is not possible to evict the requested
   57  * space.  In these circumstances we are unable to adjust the cache
   58  * size.  To prevent the cache growing unbounded at these times we
   59  * implement a "cache throttle" that slows the flow of new data
   60  * into the cache until we can make space available.
   61  *
   62  * 2. The Megiddo and Modha model assumes a fixed cache size.
   63  * Pages are evicted when the cache is full and there is a cache
   64  * miss.  Our model has a variable sized cache.  It grows with
   65  * high use, but also tries to react to memory pressure from the
   66  * operating system: decreasing its size when system memory is
   67  * tight.
   68  *
   69  * 3. The Megiddo and Modha model assumes a fixed page size. All
   70  * elements of the cache are therefore exactly the same size.  So
    71  * when adjusting the cache size following a cache miss, it's simply
   72  * a matter of choosing a single page to evict.  In our model, we
   73  * have variable sized cache blocks (ranging from 512 bytes to
   74  * 128K bytes).  We therefore choose a set of blocks to evict to make
   75  * space for a cache miss that approximates as closely as possible
   76  * the space used by the new block.
   77  *
   78  * See also:  "ARC: A Self-Tuning, Low Overhead Replacement Cache"
   79  * by N. Megiddo & D. Modha, FAST 2003
   80  */
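
/*
 * Illustrative sketch, not part of arc.c: one way to picture point 3
 * above.  With variable-sized cache blocks, eviction walks the "lowest"
 * evictable entries and keeps freeing until the bytes reclaimed
 * approximate the space the new block needs.  The example_blk_t type
 * and example_evict_space() function below are hypothetical and exist
 * only for this sketch.
 */
typedef struct example_blk {
	struct example_blk	*eb_next;	/* next-lowest block in the list */
	uint64_t		eb_size;	/* block size, 512 bytes .. 128K */
	uint64_t		eb_refcnt;	/* external references held */
} example_blk_t;

static uint64_t
example_evict_space(example_blk_t *lowest, uint64_t needed)
{
	uint64_t freed = 0;
	example_blk_t *eb;

	/* Walk from the lowest entry; referenced blocks are un-evictable. */
	for (eb = lowest; eb != NULL && freed < needed; eb = eb->eb_next) {
		if (eb->eb_refcnt != 0)
			continue;
		freed += eb->eb_size;	/* pretend this block was evicted */
	}
	return (freed);	/* may fall short; the cache throttle then kicks in */
}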
   81 
   82 /*
   83  * The locking model:
   84  *
   85  * A new reference to a cache buffer can be obtained in two
   86  * ways: 1) via a hash table lookup using the DVA as a key,
   87  * or 2) via one of the ARC lists.  The arc_read() interface
   88  * uses method 1, while the internal ARC algorithms for
   89  * adjusting the cache use method 2.  We therefore provide two
   90  * types of locks: 1) the hash table lock array, and 2) the
   91  * ARC list locks.
   92  *
   93  * Buffers do not have their own mutexes, rather they rely on the
   94  * hash table mutexes for the bulk of their protection (i.e. most
   95  * fields in the arc_buf_hdr_t are protected by these mutexes).
   96  *
   97  * buf_hash_find() returns the appropriate mutex (held) when it
   98  * locates the requested buffer in the hash table.  It returns
   99  * NULL for the mutex if the buffer was not in the table.
  100  *
  101  * buf_hash_remove() expects the appropriate hash mutex to be
  102  * already held before it is invoked.
  103  *
  104  * Each ARC state also has a mutex which is used to protect the
  105  * buffer list associated with the state.  When attempting to
  106  * obtain a hash table lock while holding an ARC list lock you
   107  * must use mutex_tryenter() to avoid deadlock.  Also note that
  108  * the active state mutex must be held before the ghost state mutex.
  109  *
   110  * It is also possible to register a callback which is run when the
  111  * arc_meta_limit is reached and no buffers can be safely evicted.  In
  112  * this case the arc user should drop a reference on some arc buffers so
  113  * they can be reclaimed and the arc_meta_limit honored.  For example,
   114  * when using the ZPL each dentry holds a reference on a znode.  These
  115  * dentries must be pruned before the arc buffer holding the znode can
  116  * be safely evicted.
  117  *
  118  * Note that the majority of the performance stats are manipulated
  119  * with atomic operations.
  120  *
  121  * The L2ARC uses the l2ad_mtx on each vdev for the following:
  122  *
  123  *      - L2ARC buflist creation
  124  *      - L2ARC buflist eviction
  125  *      - L2ARC write completion, which walks L2ARC buflists
  126  *      - ARC header destruction, as it removes from L2ARC buflists
  127  *      - ARC header release, as it removes from L2ARC buflists
  128  */
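
/*
 * Illustrative sketch, not part of arc.c: the lock ordering described
 * above.  While an ARC list (state) lock is held, the hash table lock
 * may only be taken with mutex_tryenter(); on contention the entry is
 * skipped rather than risking a deadlock against a thread that already
 * holds the hash lock and wants the list lock.  example_walk_one() and
 * its arguments are hypothetical.
 */
static void
example_walk_one(kmutex_t *list_lock, kmutex_t *hash_lock)
{
	mutex_enter(list_lock);
	if (mutex_tryenter(hash_lock)) {
		/* Both locks held in the safe order; do the work here. */
		mutex_exit(hash_lock);
	} else {
		/* Contended: skip this header (e.g. count a mutex_miss). */
	}
	mutex_exit(list_lock);
}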
  129 
  130 /*
  131  * ARC operation:
  132  *
  133  * Every block that is in the ARC is tracked by an arc_buf_hdr_t structure.
  134  * This structure can point either to a block that is still in the cache or to
  135  * one that is only accessible in an L2 ARC device, or it can provide
  136  * information about a block that was recently evicted. If a block is
  137  * only accessible in the L2ARC, then the arc_buf_hdr_t only has enough
  138  * information to retrieve it from the L2ARC device. This information is
  139  * stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block
  140  * that is in this state cannot access the data directly.
  141  *
  142  * Blocks that are actively being referenced or have not been evicted
  143  * are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within
  144  * the arc_buf_hdr_t that will point to the data block in memory. A block can
  145  * only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC
  146  * caches data in two ways -- in a list of ARC buffers (arc_buf_t) and
  147  * also in the arc_buf_hdr_t's private physical data block pointer (b_pabd).
  148  *
  149  * The L1ARC's data pointer may or may not be uncompressed. The ARC has the
  150  * ability to store the physical data (b_pabd) associated with the DVA of the
  151  * arc_buf_hdr_t. Since the b_pabd is a copy of the on-disk physical block,
  152  * it will match its on-disk compression characteristics. This behavior can be
  153  * disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the
  154  * compressed ARC functionality is disabled, the b_pabd will point to an
  155  * uncompressed version of the on-disk data.
  156  *
  157  * Data in the L1ARC is not accessed by consumers of the ARC directly. Each
  158  * arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it.
  159  * Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC
  160  * consumer. The ARC will provide references to this data and will keep it
  161  * cached until it is no longer in use. The ARC caches only the L1ARC's physical
  162  * data block and will evict any arc_buf_t that is no longer referenced. The
  163  * amount of memory consumed by the arc_buf_ts' data buffers can be seen via the
  164  * "overhead_size" kstat.
  165  *
  166  * Depending on the consumer, an arc_buf_t can be requested in uncompressed or
  167  * compressed form. The typical case is that consumers will want uncompressed
  168  * data, and when that happens a new data buffer is allocated where the data is
  169  * decompressed for them to use. Currently the only consumer who wants
  170  * compressed arc_buf_t's is "zfs send", when it streams data exactly as it
  171  * exists on disk. When this happens, the arc_buf_t's data buffer is shared
  172  * with the arc_buf_hdr_t.
  173  *
  174  * Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The
  175  * first one is owned by a compressed send consumer (and therefore references
  176  * the same compressed data buffer as the arc_buf_hdr_t) and the second could be
  177  * used by any other consumer (and has its own uncompressed copy of the data
  178  * buffer).
  179  *
  180  *   arc_buf_hdr_t
  181  *   +-----------+
  182  *   | fields    |
  183  *   | common to |
  184  *   | L1- and   |
  185  *   | L2ARC     |
  186  *   +-----------+
  187  *   | l2arc_buf_hdr_t
  188  *   |           |
  189  *   +-----------+
  190  *   | l1arc_buf_hdr_t
  191  *   |           |              arc_buf_t
  192  *   | b_buf     +------------>+-----------+      arc_buf_t
  193  *   | b_pabd    +-+           |b_next     +---->+-----------+
  194  *   +-----------+ |           |-----------|     |b_next     +-->NULL
  195  *                 |           |b_comp = T |     +-----------+
  196  *                 |           |b_data     +-+   |b_comp = F |
  197  *                 |           +-----------+ |   |b_data     +-+
  198  *                 +->+------+               |   +-----------+ |
  199  *        compressed  |      |               |                 |
  200  *           data     |      |<--------------+                 | uncompressed
  201  *                    +------+          compressed,            |     data
  202  *                                        shared               +-->+------+
  203  *                                         data                    |      |
  204  *                                                                 |      |
  205  *                                                                 +------+
  206  *
  207  * When a consumer reads a block, the ARC must first look to see if the
  208  * arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new
  209  * arc_buf_t and either copies uncompressed data into a new data buffer from an
  210  * existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a
  211  * new data buffer, or shares the hdr's b_pabd buffer, depending on whether the
  212  * hdr is compressed and the desired compression characteristics of the
  213  * arc_buf_t consumer. If the arc_buf_t ends up sharing data with the
  214  * arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be
  215  * the last buffer in the hdr's b_buf list, however a shared compressed buf can
  216  * be anywhere in the hdr's list.
  217  *
  218  * The diagram below shows an example of an uncompressed ARC hdr that is
  219  * sharing its data with an arc_buf_t (note that the shared uncompressed buf is
  220  * the last element in the buf list):
  221  *
  222  *                arc_buf_hdr_t
  223  *                +-----------+
  224  *                |           |
  225  *                |           |
  226  *                |           |
  227  *                +-----------+
  228  * l2arc_buf_hdr_t|           |
  229  *                |           |
  230  *                +-----------+
  231  * l1arc_buf_hdr_t|           |
  232  *                |           |                 arc_buf_t    (shared)
  233  *                |    b_buf  +------------>+---------+      arc_buf_t
  234  *                |           |             |b_next   +---->+---------+
  235  *                |  b_pabd   +-+           |---------|     |b_next   +-->NULL
  236  *                +-----------+ |           |         |     +---------+
  237  *                              |           |b_data   +-+   |         |
  238  *                              |           +---------+ |   |b_data   +-+
  239  *                              +->+------+             |   +---------+ |
  240  *                                 |      |             |               |
  241  *                   uncompressed  |      |             |               |
  242  *                        data     +------+             |               |
  243  *                                    ^                 +->+------+     |
  244  *                                    |       uncompressed |      |     |
  245  *                                    |           data     |      |     |
  246  *                                    |                    +------+     |
  247  *                                    +---------------------------------+
  248  *
  249  * Writing to the ARC requires that the ARC first discard the hdr's b_pabd
  250  * since the physical block is about to be rewritten. The new data contents
  251  * will be contained in the arc_buf_t. As the I/O pipeline performs the write,
  252  * it may compress the data before writing it to disk. The ARC will be called
  253  * with the transformed data and will memcpy the transformed on-disk block into
  254  * a newly allocated b_pabd. Writes are always done into buffers which have
  255  * either been loaned (and hence are new and don't have other readers) or
  256  * buffers which have been released (and hence have their own hdr, if there
  257  * were originally other readers of the buf's original hdr). This ensures that
  258  * the ARC only needs to update a single buf and its hdr after a write occurs.
  259  *
  260  * When the L2ARC is in use, it will also take advantage of the b_pabd. The
  261  * L2ARC will always write the contents of b_pabd to the L2ARC. This means
   262  * that when compressed ARC is enabled, the L2ARC blocks are identical
  263  * to the on-disk block in the main data pool. This provides a significant
  264  * advantage since the ARC can leverage the bp's checksum when reading from the
  265  * L2ARC to determine if the contents are valid. However, if the compressed
  266  * ARC is disabled, then the L2ARC's block must be transformed to look
  267  * like the physical block in the main data pool before comparing the
  268  * checksum and determining its validity.
  269  *
  270  * The L1ARC has a slightly different system for storing encrypted data.
  271  * Raw (encrypted + possibly compressed) data has a few subtle differences from
  272  * data that is just compressed. The biggest difference is that it is not
  273  * possible to decrypt encrypted data (or vice-versa) if the keys aren't loaded.
  274  * The other difference is that encryption cannot be treated as a suggestion.
  275  * If a caller would prefer compressed data, but they actually wind up with
   276  * uncompressed data, the worst thing that could happen is there might be a
  277  * performance hit. If the caller requests encrypted data, however, we must be
  278  * sure they actually get it or else secret information could be leaked. Raw
  279  * data is stored in hdr->b_crypt_hdr.b_rabd. An encrypted header, therefore,
  280  * may have both an encrypted version and a decrypted version of its data at
  281  * once. When a caller needs a raw arc_buf_t, it is allocated and the data is
  282  * copied out of this header. To avoid complications with b_pabd, raw buffers
  283  * cannot be shared.
  284  */
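
/*
 * Illustrative sketch, not part of arc.c: a distilled version of the
 * buffer-sharing rules described above.  A buf may share the hdr's
 * b_pabd when the caller wants the data in exactly the form the hdr
 * holds it, and an uncompressed shared buf must additionally be the
 * last buf on the hdr's b_buf list.  Raw (encrypted) buffers are never
 * shared.  example_can_share() and its boolean arguments are
 * hypothetical stand-ins for state the real code reads from the hdr
 * and buf.
 */
static boolean_t
example_can_share(boolean_t hdr_compressed, boolean_t buf_wants_compressed,
    boolean_t buf_is_last, boolean_t buf_is_raw)
{
	if (buf_is_raw)
		return (B_FALSE);	/* raw bufs are always copied out */
	if (hdr_compressed && buf_wants_compressed)
		return (B_TRUE);	/* share the compressed b_pabd */
	if (!hdr_compressed && !buf_wants_compressed && buf_is_last)
		return (B_TRUE);	/* share; must be last in b_buf */
	return (B_FALSE);		/* otherwise allocate and copy */
}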
  285 
  286 #include <sys/spa.h>
  287 #include <sys/zio.h>
  288 #include <sys/spa_impl.h>
  289 #include <sys/zio_compress.h>
  290 #include <sys/zio_checksum.h>
  291 #include <sys/zfs_context.h>
  292 #include <sys/arc.h>
  293 #include <sys/zfs_refcount.h>
  294 #include <sys/vdev.h>
  295 #include <sys/vdev_impl.h>
  296 #include <sys/dsl_pool.h>
  297 #include <sys/multilist.h>
  298 #include <sys/abd.h>
  299 #include <sys/zil.h>
  300 #include <sys/fm/fs/zfs.h>
  301 #include <sys/callb.h>
  302 #include <sys/kstat.h>
  303 #include <sys/zthr.h>
  304 #include <zfs_fletcher.h>
  305 #include <sys/arc_impl.h>
  306 #include <sys/trace_zfs.h>
  307 #include <sys/aggsum.h>
  308 #include <sys/wmsum.h>
  309 #include <cityhash.h>
  310 #include <sys/vdev_trim.h>
  311 #include <sys/zfs_racct.h>
  312 #include <sys/zstd/zstd.h>
  313 
  314 #ifndef _KERNEL
  315 /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
  316 boolean_t arc_watch = B_FALSE;
  317 #endif
  318 
  319 /*
   320  * This thread's job is to keep enough free memory in the system by
  321  * calling arc_kmem_reap_soon() plus arc_reduce_target_size(), which improves
  322  * arc_available_memory().
  323  */
  324 static zthr_t *arc_reap_zthr;
  325 
  326 /*
  327  * This thread's job is to keep arc_size under arc_c, by calling
  328  * arc_evict(), which improves arc_is_overflowing().
  329  */
  330 static zthr_t *arc_evict_zthr;
  331 static arc_buf_hdr_t **arc_state_evict_markers;
  332 static int arc_state_evict_marker_count;
  333 
  334 static kmutex_t arc_evict_lock;
  335 static boolean_t arc_evict_needed = B_FALSE;
  336 static clock_t arc_last_uncached_flush;
  337 
  338 /*
  339  * Count of bytes evicted since boot.
  340  */
  341 static uint64_t arc_evict_count;
  342 
  343 /*
  344  * List of arc_evict_waiter_t's, representing threads waiting for the
  345  * arc_evict_count to reach specific values.
  346  */
  347 static list_t arc_evict_waiters;
  348 
  349 /*
  350  * When arc_is_overflowing(), arc_get_data_impl() waits for this percent of
  351  * the requested amount of data to be evicted.  For example, by default for
  352  * every 2KB that's evicted, 1KB of it may be "reused" by a new allocation.
  353  * Since this is above 100%, it ensures that progress is made towards getting
  354  * arc_size under arc_c.  Since this is finite, it ensures that allocations
  355  * can still happen, even during the potentially long time that arc_size is
  356  * more than arc_c.
  357  */
  358 static uint_t zfs_arc_eviction_pct = 200;
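
/*
 * Illustrative sketch, not part of arc.c: with the default of 200%, a
 * thread that needs `size` bytes while the ARC is overflowing waits for
 * roughly twice that many bytes to be evicted, so eviction outpaces
 * allocation and arc_size can drift back under arc_c.
 * example_eviction_goal() is hypothetical.
 */
static uint64_t
example_eviction_goal(uint64_t size)
{
	/* e.g. size = 1024 and the default 200%  =>  wait for 2048 bytes */
	return (size * zfs_arc_eviction_pct / 100);
}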
  359 
  360 /*
  361  * The number of headers to evict in arc_evict_state_impl() before
  362  * dropping the sublist lock and evicting from another sublist. A lower
  363  * value means we're more likely to evict the "correct" header (i.e. the
  364  * oldest header in the arc state), but comes with higher overhead
  365  * (i.e. more invocations of arc_evict_state_impl()).
  366  */
  367 static uint_t zfs_arc_evict_batch_limit = 10;
  368 
  369 /* number of seconds before growing cache again */
  370 uint_t arc_grow_retry = 5;
  371 
  372 /*
  373  * Minimum time between calls to arc_kmem_reap_soon().
  374  */
  375 static const int arc_kmem_cache_reap_retry_ms = 1000;
  376 
  377 /* shift of arc_c for calculating overflow limit in arc_get_data_impl */
  378 static int zfs_arc_overflow_shift = 8;
  379 
  380 /* shift of arc_c for calculating both min and max arc_p */
  381 static uint_t arc_p_min_shift = 4;
  382 
  383 /* log2(fraction of arc to reclaim) */
  384 uint_t arc_shrink_shift = 7;
  385 
  386 /* percent of pagecache to reclaim arc to */
  387 #ifdef _KERNEL
  388 uint_t zfs_arc_pc_percent = 0;
  389 #endif
  390 
  391 /*
  392  * log2(fraction of ARC which must be free to allow growing).
  393  * I.e. If there is less than arc_c >> arc_no_grow_shift free memory,
  394  * when reading a new block into the ARC, we will evict an equal-sized block
  395  * from the ARC.
  396  *
  397  * This must be less than arc_shrink_shift, so that when we shrink the ARC,
  398  * we will still not allow it to grow.
  399  */
  400 uint_t          arc_no_grow_shift = 5;
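
/*
 * Illustrative sketch, not part of arc.c: the growth check implied by
 * the comment above.  With arc_c = 4GB and arc_no_grow_shift = 5, at
 * least 4GB >> 5 = 128MB of memory must be free before the ARC may
 * grow; below that, reading a new block evicts an equal-sized one.
 * example_arc_may_grow() and its arguments are hypothetical.
 */
static boolean_t
example_arc_may_grow(uint64_t arc_c_now, int64_t free_memory)
{
	return (free_memory >= (int64_t)(arc_c_now >> arc_no_grow_shift));
}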
  401 
  402 
  403 /*
  404  * minimum lifespan of a prefetch block in clock ticks
  405  * (initialized in arc_init())
  406  */
  407 static uint_t           arc_min_prefetch_ms;
  408 static uint_t           arc_min_prescient_prefetch_ms;
  409 
  410 /*
  411  * If this percent of memory is free, don't throttle.
  412  */
  413 uint_t arc_lotsfree_percent = 10;
  414 
  415 /*
  416  * The arc has filled available memory and has now warmed up.
  417  */
  418 boolean_t arc_warm;
  419 
  420 /*
  421  * These tunables are for performance analysis.
  422  */
  423 uint64_t zfs_arc_max = 0;
  424 uint64_t zfs_arc_min = 0;
  425 uint64_t zfs_arc_meta_limit = 0;
  426 uint64_t zfs_arc_meta_min = 0;
  427 static uint64_t zfs_arc_dnode_limit = 0;
  428 static uint_t zfs_arc_dnode_reduce_percent = 10;
  429 static uint_t zfs_arc_grow_retry = 0;
  430 static uint_t zfs_arc_shrink_shift = 0;
  431 static uint_t zfs_arc_p_min_shift = 0;
  432 uint_t zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
  433 
  434 /*
  435  * ARC dirty data constraints for arc_tempreserve_space() throttle:
  436  * * total dirty data limit
  437  * * anon block dirty limit
  438  * * each pool's anon allowance
  439  */
  440 static const unsigned long zfs_arc_dirty_limit_percent = 50;
  441 static const unsigned long zfs_arc_anon_limit_percent = 25;
  442 static const unsigned long zfs_arc_pool_dirty_percent = 20;
  443 
  444 /*
  445  * Enable or disable compressed arc buffers.
  446  */
  447 int zfs_compressed_arc_enabled = B_TRUE;
  448 
  449 /*
  450  * ARC will evict meta buffers that exceed arc_meta_limit. This
   451  * tunable makes arc_meta_limit adjustable for different workloads.
  452  */
  453 static uint64_t zfs_arc_meta_limit_percent = 75;
  454 
  455 /*
  456  * Percentage that can be consumed by dnodes of ARC meta buffers.
  457  */
  458 static uint_t zfs_arc_dnode_limit_percent = 10;
  459 
  460 /*
  461  * These tunables are Linux-specific
  462  */
  463 static uint64_t zfs_arc_sys_free = 0;
  464 static uint_t zfs_arc_min_prefetch_ms = 0;
  465 static uint_t zfs_arc_min_prescient_prefetch_ms = 0;
  466 static int zfs_arc_p_dampener_disable = 1;
  467 static uint_t zfs_arc_meta_prune = 10000;
  468 static uint_t zfs_arc_meta_strategy = ARC_STRATEGY_META_BALANCED;
  469 static uint_t zfs_arc_meta_adjust_restarts = 4096;
  470 static uint_t zfs_arc_lotsfree_percent = 10;
  471 
  472 /*
  473  * Number of arc_prune threads
  474  */
  475 static int zfs_arc_prune_task_threads = 1;
  476 
  477 /* The 7 states: */
  478 arc_state_t ARC_anon;
  479 arc_state_t ARC_mru;
  480 arc_state_t ARC_mru_ghost;
  481 arc_state_t ARC_mfu;
  482 arc_state_t ARC_mfu_ghost;
  483 arc_state_t ARC_l2c_only;
  484 arc_state_t ARC_uncached;
  485 
  486 arc_stats_t arc_stats = {
  487         { "hits",                       KSTAT_DATA_UINT64 },
  488         { "iohits",                     KSTAT_DATA_UINT64 },
  489         { "misses",                     KSTAT_DATA_UINT64 },
  490         { "demand_data_hits",           KSTAT_DATA_UINT64 },
  491         { "demand_data_iohits",         KSTAT_DATA_UINT64 },
  492         { "demand_data_misses",         KSTAT_DATA_UINT64 },
  493         { "demand_metadata_hits",       KSTAT_DATA_UINT64 },
  494         { "demand_metadata_iohits",     KSTAT_DATA_UINT64 },
  495         { "demand_metadata_misses",     KSTAT_DATA_UINT64 },
  496         { "prefetch_data_hits",         KSTAT_DATA_UINT64 },
  497         { "prefetch_data_iohits",       KSTAT_DATA_UINT64 },
  498         { "prefetch_data_misses",       KSTAT_DATA_UINT64 },
  499         { "prefetch_metadata_hits",     KSTAT_DATA_UINT64 },
  500         { "prefetch_metadata_iohits",   KSTAT_DATA_UINT64 },
  501         { "prefetch_metadata_misses",   KSTAT_DATA_UINT64 },
  502         { "mru_hits",                   KSTAT_DATA_UINT64 },
  503         { "mru_ghost_hits",             KSTAT_DATA_UINT64 },
  504         { "mfu_hits",                   KSTAT_DATA_UINT64 },
  505         { "mfu_ghost_hits",             KSTAT_DATA_UINT64 },
  506         { "uncached_hits",              KSTAT_DATA_UINT64 },
  507         { "deleted",                    KSTAT_DATA_UINT64 },
  508         { "mutex_miss",                 KSTAT_DATA_UINT64 },
  509         { "access_skip",                KSTAT_DATA_UINT64 },
  510         { "evict_skip",                 KSTAT_DATA_UINT64 },
  511         { "evict_not_enough",           KSTAT_DATA_UINT64 },
  512         { "evict_l2_cached",            KSTAT_DATA_UINT64 },
  513         { "evict_l2_eligible",          KSTAT_DATA_UINT64 },
  514         { "evict_l2_eligible_mfu",      KSTAT_DATA_UINT64 },
  515         { "evict_l2_eligible_mru",      KSTAT_DATA_UINT64 },
  516         { "evict_l2_ineligible",        KSTAT_DATA_UINT64 },
  517         { "evict_l2_skip",              KSTAT_DATA_UINT64 },
  518         { "hash_elements",              KSTAT_DATA_UINT64 },
  519         { "hash_elements_max",          KSTAT_DATA_UINT64 },
  520         { "hash_collisions",            KSTAT_DATA_UINT64 },
  521         { "hash_chains",                KSTAT_DATA_UINT64 },
  522         { "hash_chain_max",             KSTAT_DATA_UINT64 },
  523         { "p",                          KSTAT_DATA_UINT64 },
  524         { "c",                          KSTAT_DATA_UINT64 },
  525         { "c_min",                      KSTAT_DATA_UINT64 },
  526         { "c_max",                      KSTAT_DATA_UINT64 },
  527         { "size",                       KSTAT_DATA_UINT64 },
  528         { "compressed_size",            KSTAT_DATA_UINT64 },
  529         { "uncompressed_size",          KSTAT_DATA_UINT64 },
  530         { "overhead_size",              KSTAT_DATA_UINT64 },
  531         { "hdr_size",                   KSTAT_DATA_UINT64 },
  532         { "data_size",                  KSTAT_DATA_UINT64 },
  533         { "metadata_size",              KSTAT_DATA_UINT64 },
  534         { "dbuf_size",                  KSTAT_DATA_UINT64 },
  535         { "dnode_size",                 KSTAT_DATA_UINT64 },
  536         { "bonus_size",                 KSTAT_DATA_UINT64 },
  537 #if defined(COMPAT_FREEBSD11)
  538         { "other_size",                 KSTAT_DATA_UINT64 },
  539 #endif
  540         { "anon_size",                  KSTAT_DATA_UINT64 },
  541         { "anon_evictable_data",        KSTAT_DATA_UINT64 },
  542         { "anon_evictable_metadata",    KSTAT_DATA_UINT64 },
  543         { "mru_size",                   KSTAT_DATA_UINT64 },
  544         { "mru_evictable_data",         KSTAT_DATA_UINT64 },
  545         { "mru_evictable_metadata",     KSTAT_DATA_UINT64 },
  546         { "mru_ghost_size",             KSTAT_DATA_UINT64 },
  547         { "mru_ghost_evictable_data",   KSTAT_DATA_UINT64 },
  548         { "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
  549         { "mfu_size",                   KSTAT_DATA_UINT64 },
  550         { "mfu_evictable_data",         KSTAT_DATA_UINT64 },
  551         { "mfu_evictable_metadata",     KSTAT_DATA_UINT64 },
  552         { "mfu_ghost_size",             KSTAT_DATA_UINT64 },
  553         { "mfu_ghost_evictable_data",   KSTAT_DATA_UINT64 },
  554         { "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
  555         { "uncached_size",              KSTAT_DATA_UINT64 },
  556         { "uncached_evictable_data",    KSTAT_DATA_UINT64 },
  557         { "uncached_evictable_metadata", KSTAT_DATA_UINT64 },
  558         { "l2_hits",                    KSTAT_DATA_UINT64 },
  559         { "l2_misses",                  KSTAT_DATA_UINT64 },
  560         { "l2_prefetch_asize",          KSTAT_DATA_UINT64 },
  561         { "l2_mru_asize",               KSTAT_DATA_UINT64 },
  562         { "l2_mfu_asize",               KSTAT_DATA_UINT64 },
  563         { "l2_bufc_data_asize",         KSTAT_DATA_UINT64 },
  564         { "l2_bufc_metadata_asize",     KSTAT_DATA_UINT64 },
  565         { "l2_feeds",                   KSTAT_DATA_UINT64 },
  566         { "l2_rw_clash",                KSTAT_DATA_UINT64 },
  567         { "l2_read_bytes",              KSTAT_DATA_UINT64 },
  568         { "l2_write_bytes",             KSTAT_DATA_UINT64 },
  569         { "l2_writes_sent",             KSTAT_DATA_UINT64 },
  570         { "l2_writes_done",             KSTAT_DATA_UINT64 },
  571         { "l2_writes_error",            KSTAT_DATA_UINT64 },
  572         { "l2_writes_lock_retry",       KSTAT_DATA_UINT64 },
  573         { "l2_evict_lock_retry",        KSTAT_DATA_UINT64 },
  574         { "l2_evict_reading",           KSTAT_DATA_UINT64 },
  575         { "l2_evict_l1cached",          KSTAT_DATA_UINT64 },
  576         { "l2_free_on_write",           KSTAT_DATA_UINT64 },
  577         { "l2_abort_lowmem",            KSTAT_DATA_UINT64 },
  578         { "l2_cksum_bad",               KSTAT_DATA_UINT64 },
  579         { "l2_io_error",                KSTAT_DATA_UINT64 },
  580         { "l2_size",                    KSTAT_DATA_UINT64 },
  581         { "l2_asize",                   KSTAT_DATA_UINT64 },
  582         { "l2_hdr_size",                KSTAT_DATA_UINT64 },
  583         { "l2_log_blk_writes",          KSTAT_DATA_UINT64 },
  584         { "l2_log_blk_avg_asize",       KSTAT_DATA_UINT64 },
  585         { "l2_log_blk_asize",           KSTAT_DATA_UINT64 },
  586         { "l2_log_blk_count",           KSTAT_DATA_UINT64 },
  587         { "l2_data_to_meta_ratio",      KSTAT_DATA_UINT64 },
  588         { "l2_rebuild_success",         KSTAT_DATA_UINT64 },
  589         { "l2_rebuild_unsupported",     KSTAT_DATA_UINT64 },
  590         { "l2_rebuild_io_errors",       KSTAT_DATA_UINT64 },
  591         { "l2_rebuild_dh_errors",       KSTAT_DATA_UINT64 },
  592         { "l2_rebuild_cksum_lb_errors", KSTAT_DATA_UINT64 },
  593         { "l2_rebuild_lowmem",          KSTAT_DATA_UINT64 },
  594         { "l2_rebuild_size",            KSTAT_DATA_UINT64 },
  595         { "l2_rebuild_asize",           KSTAT_DATA_UINT64 },
  596         { "l2_rebuild_bufs",            KSTAT_DATA_UINT64 },
  597         { "l2_rebuild_bufs_precached",  KSTAT_DATA_UINT64 },
  598         { "l2_rebuild_log_blks",        KSTAT_DATA_UINT64 },
  599         { "memory_throttle_count",      KSTAT_DATA_UINT64 },
  600         { "memory_direct_count",        KSTAT_DATA_UINT64 },
  601         { "memory_indirect_count",      KSTAT_DATA_UINT64 },
  602         { "memory_all_bytes",           KSTAT_DATA_UINT64 },
  603         { "memory_free_bytes",          KSTAT_DATA_UINT64 },
  604         { "memory_available_bytes",     KSTAT_DATA_INT64 },
  605         { "arc_no_grow",                KSTAT_DATA_UINT64 },
  606         { "arc_tempreserve",            KSTAT_DATA_UINT64 },
  607         { "arc_loaned_bytes",           KSTAT_DATA_UINT64 },
  608         { "arc_prune",                  KSTAT_DATA_UINT64 },
  609         { "arc_meta_used",              KSTAT_DATA_UINT64 },
  610         { "arc_meta_limit",             KSTAT_DATA_UINT64 },
  611         { "arc_dnode_limit",            KSTAT_DATA_UINT64 },
  612         { "arc_meta_max",               KSTAT_DATA_UINT64 },
  613         { "arc_meta_min",               KSTAT_DATA_UINT64 },
  614         { "async_upgrade_sync",         KSTAT_DATA_UINT64 },
  615         { "predictive_prefetch", KSTAT_DATA_UINT64 },
  616         { "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
  617         { "demand_iohit_predictive_prefetch", KSTAT_DATA_UINT64 },
  618         { "prescient_prefetch", KSTAT_DATA_UINT64 },
  619         { "demand_hit_prescient_prefetch", KSTAT_DATA_UINT64 },
  620         { "demand_iohit_prescient_prefetch", KSTAT_DATA_UINT64 },
  621         { "arc_need_free",              KSTAT_DATA_UINT64 },
  622         { "arc_sys_free",               KSTAT_DATA_UINT64 },
  623         { "arc_raw_size",               KSTAT_DATA_UINT64 },
  624         { "cached_only_in_progress",    KSTAT_DATA_UINT64 },
  625         { "abd_chunk_waste_size",       KSTAT_DATA_UINT64 },
  626 };
  627 
  628 arc_sums_t arc_sums;
  629 
  630 #define ARCSTAT_MAX(stat, val) {                                        \
  631         uint64_t m;                                                     \
  632         while ((val) > (m = arc_stats.stat.value.ui64) &&               \
  633             (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
  634                 continue;                                               \
  635 }
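
/*
 * Illustrative note, not part of arc.c: ARCSTAT_MAX() is a lock-free
 * "raise to at least val" update.  It re-reads the current value and
 * retries the compare-and-swap until either the stored value is already
 * >= val or the CAS succeeds, so concurrent updaters can never lower a
 * recorded maximum.  It is used later in this file, for example:
 *
 *	ARCSTAT_MAX(arcstat_hash_chain_max, i);
 *	ARCSTAT_MAX(arcstat_hash_elements_max, he);
 */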
  636 
  637 /*
  638  * We define a macro to allow ARC hits/misses to be easily broken down by
  639  * two separate conditions, giving a total of four different subtypes for
  640  * each of hits and misses (so eight statistics total).
  641  */
  642 #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
  643         if (cond1) {                                                    \
  644                 if (cond2) {                                            \
  645                         ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
  646                 } else {                                                \
  647                         ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
  648                 }                                                       \
  649         } else {                                                        \
  650                 if (cond2) {                                            \
  651                         ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
  652                 } else {                                                \
  653                         ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
  654                 }                                                       \
  655         }
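
/*
 * Illustrative note, not part of arc.c: an invocation such as
 *
 *	ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr), demand, prefetch,
 *	    !HDR_ISTYPE_METADATA(hdr), data, metadata, hits);
 *
 * (shown here only as an example of the pattern) selects exactly one of
 * arcstat_demand_data_hits, arcstat_demand_metadata_hits,
 * arcstat_prefetch_data_hits or arcstat_prefetch_metadata_hits and
 * bumps it, depending on which of the two conditions hold.
 */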
  656 
  657 /*
  658  * This macro allows us to use kstats as floating averages. Each time we
  659  * update this kstat, we first factor it and the update value by
  660  * ARCSTAT_AVG_FACTOR to shrink the new value's contribution to the overall
  661  * average. This macro assumes that integer loads and stores are atomic, but
  662  * is not safe for multiple writers updating the kstat in parallel (only the
  663  * last writer's update will remain).
  664  */
  665 #define ARCSTAT_F_AVG_FACTOR    3
  666 #define ARCSTAT_F_AVG(stat, value) \
  667         do { \
  668                 uint64_t x = ARCSTAT(stat); \
  669                 x = x - x / ARCSTAT_F_AVG_FACTOR + \
  670                     (value) / ARCSTAT_F_AVG_FACTOR; \
  671                 ARCSTAT(stat) = x; \
  672         } while (0)
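
/*
 * Illustrative note, not part of arc.c: with ARCSTAT_F_AVG_FACTOR of 3
 * this is an exponential moving average that weights the new sample by
 * 1/3.  For example, if the kstat currently holds 300 and the new value
 * is 600:
 *
 *	x = 300 - 300/3 + 600/3 = 300 - 100 + 200 = 400
 *
 * so the stored average moves a third of the way toward the new sample.
 */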
  673 
  674 static kstat_t                  *arc_ksp;
  675 
  676 /*
  677  * There are several ARC variables that are critical to export as kstats --
  678  * but we don't want to have to grovel around in the kstat whenever we wish to
  679  * manipulate them.  For these variables, we therefore define them to be in
  680  * terms of the statistic variable.  This assures that we are not introducing
  681  * the possibility of inconsistency by having shadow copies of the variables,
  682  * while still allowing the code to be readable.
  683  */
  684 #define arc_tempreserve ARCSTAT(arcstat_tempreserve)
  685 #define arc_loaned_bytes        ARCSTAT(arcstat_loaned_bytes)
  686 #define arc_meta_limit  ARCSTAT(arcstat_meta_limit) /* max size for metadata */
  687 /* max size for dnodes */
  688 #define arc_dnode_size_limit    ARCSTAT(arcstat_dnode_limit)
  689 #define arc_meta_min    ARCSTAT(arcstat_meta_min) /* min size for metadata */
  690 #define arc_need_free   ARCSTAT(arcstat_need_free) /* waiting to be evicted */
  691 
  692 hrtime_t arc_growtime;
  693 list_t arc_prune_list;
  694 kmutex_t arc_prune_mtx;
  695 taskq_t *arc_prune_taskq;
  696 
  697 #define GHOST_STATE(state)      \
  698         ((state) == arc_mru_ghost || (state) == arc_mfu_ghost ||        \
  699         (state) == arc_l2c_only)
  700 
  701 #define HDR_IN_HASH_TABLE(hdr)  ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)
  702 #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)
  703 #define HDR_IO_ERROR(hdr)       ((hdr)->b_flags & ARC_FLAG_IO_ERROR)
  704 #define HDR_PREFETCH(hdr)       ((hdr)->b_flags & ARC_FLAG_PREFETCH)
  705 #define HDR_PRESCIENT_PREFETCH(hdr)     \
  706         ((hdr)->b_flags & ARC_FLAG_PRESCIENT_PREFETCH)
  707 #define HDR_COMPRESSION_ENABLED(hdr)    \
  708         ((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)
  709 
  710 #define HDR_L2CACHE(hdr)        ((hdr)->b_flags & ARC_FLAG_L2CACHE)
  711 #define HDR_UNCACHED(hdr)       ((hdr)->b_flags & ARC_FLAG_UNCACHED)
  712 #define HDR_L2_READING(hdr)     \
  713         (((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) &&  \
  714         ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
  715 #define HDR_L2_WRITING(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_WRITING)
  716 #define HDR_L2_EVICTED(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
  717 #define HDR_L2_WRITE_HEAD(hdr)  ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
  718 #define HDR_PROTECTED(hdr)      ((hdr)->b_flags & ARC_FLAG_PROTECTED)
  719 #define HDR_NOAUTH(hdr)         ((hdr)->b_flags & ARC_FLAG_NOAUTH)
  720 #define HDR_SHARED_DATA(hdr)    ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)
  721 
  722 #define HDR_ISTYPE_METADATA(hdr)        \
  723         ((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
  724 #define HDR_ISTYPE_DATA(hdr)    (!HDR_ISTYPE_METADATA(hdr))
  725 
  726 #define HDR_HAS_L1HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
  727 #define HDR_HAS_L2HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
  728 #define HDR_HAS_RABD(hdr)       \
  729         (HDR_HAS_L1HDR(hdr) && HDR_PROTECTED(hdr) &&    \
  730         (hdr)->b_crypt_hdr.b_rabd != NULL)
  731 #define HDR_ENCRYPTED(hdr)      \
  732         (HDR_PROTECTED(hdr) && DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))
  733 #define HDR_AUTHENTICATED(hdr)  \
  734         (HDR_PROTECTED(hdr) && !DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))
  735 
  736 /* For storing compression mode in b_flags */
  737 #define HDR_COMPRESS_OFFSET     (highbit64(ARC_FLAG_COMPRESS_0) - 1)
  738 
  739 #define HDR_GET_COMPRESS(hdr)   ((enum zio_compress)BF32_GET((hdr)->b_flags, \
  740         HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
  741 #define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
  742         HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));
  743 
  744 #define ARC_BUF_LAST(buf)       ((buf)->b_next == NULL)
  745 #define ARC_BUF_SHARED(buf)     ((buf)->b_flags & ARC_BUF_FLAG_SHARED)
  746 #define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)
  747 #define ARC_BUF_ENCRYPTED(buf)  ((buf)->b_flags & ARC_BUF_FLAG_ENCRYPTED)
  748 
  749 /*
  750  * Other sizes
  751  */
  752 
  753 #define HDR_FULL_CRYPT_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
  754 #define HDR_FULL_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_crypt_hdr))
  755 #define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))
  756 
  757 /*
  758  * Hash table routines
  759  */
  760 
  761 #define BUF_LOCKS 2048
  762 typedef struct buf_hash_table {
  763         uint64_t ht_mask;
  764         arc_buf_hdr_t **ht_table;
  765         kmutex_t ht_locks[BUF_LOCKS] ____cacheline_aligned;
  766 } buf_hash_table_t;
  767 
  768 static buf_hash_table_t buf_hash_table;
  769 
  770 #define BUF_HASH_INDEX(spa, dva, birth) \
  771         (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
  772 #define BUF_HASH_LOCK(idx)      (&buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
  773 #define HDR_LOCK(hdr) \
  774         (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
  775 
  776 uint64_t zfs_crc64_table[256];
  777 
  778 /*
  779  * Level 2 ARC
  780  */
  781 
  782 #define L2ARC_WRITE_SIZE        (8 * 1024 * 1024)       /* initial write max */
  783 #define L2ARC_HEADROOM          2                       /* num of writes */
  784 
  785 /*
  786  * If we discover during ARC scan any buffers to be compressed, we boost
  787  * our headroom for the next scanning cycle by this percentage multiple.
  788  */
  789 #define L2ARC_HEADROOM_BOOST    200
  790 #define L2ARC_FEED_SECS         1               /* caching interval secs */
  791 #define L2ARC_FEED_MIN_MS       200             /* min caching interval ms */
  792 
  793 /*
  794  * We can feed L2ARC from two states of ARC buffers, mru and mfu,
   795  * and each of the states has two types: data and metadata.
  796  */
  797 #define L2ARC_FEED_TYPES        4
  798 
  799 /* L2ARC Performance Tunables */
  800 uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;    /* def max write size */
  801 uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;  /* extra warmup write */
  802 uint64_t l2arc_headroom = L2ARC_HEADROOM;       /* # of dev writes */
  803 uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
  804 uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;     /* interval seconds */
  805 uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval msecs */
  806 int l2arc_noprefetch = B_TRUE;                  /* don't cache prefetch bufs */
  807 int l2arc_feed_again = B_TRUE;                  /* turbo warmup */
  808 int l2arc_norw = B_FALSE;                       /* no reads during writes */
  809 static uint_t l2arc_meta_percent = 33;  /* limit on headers size */
  810 
  811 /*
  812  * L2ARC Internals
  813  */
  814 static list_t L2ARC_dev_list;                   /* device list */
  815 static list_t *l2arc_dev_list;                  /* device list pointer */
  816 static kmutex_t l2arc_dev_mtx;                  /* device list mutex */
  817 static l2arc_dev_t *l2arc_dev_last;             /* last device used */
  818 static list_t L2ARC_free_on_write;              /* free after write buf list */
  819 static list_t *l2arc_free_on_write;             /* free after write list ptr */
  820 static kmutex_t l2arc_free_on_write_mtx;        /* mutex for list */
  821 static uint64_t l2arc_ndev;                     /* number of devices */
  822 
  823 typedef struct l2arc_read_callback {
  824         arc_buf_hdr_t           *l2rcb_hdr;             /* read header */
  825         blkptr_t                l2rcb_bp;               /* original blkptr */
  826         zbookmark_phys_t        l2rcb_zb;               /* original bookmark */
  827         int                     l2rcb_flags;            /* original flags */
  828         abd_t                   *l2rcb_abd;             /* temporary buffer */
  829 } l2arc_read_callback_t;
  830 
  831 typedef struct l2arc_data_free {
  832         /* protected by l2arc_free_on_write_mtx */
  833         abd_t           *l2df_abd;
  834         size_t          l2df_size;
  835         arc_buf_contents_t l2df_type;
  836         list_node_t     l2df_list_node;
  837 } l2arc_data_free_t;
  838 
  839 typedef enum arc_fill_flags {
  840         ARC_FILL_LOCKED         = 1 << 0, /* hdr lock is held */
  841         ARC_FILL_COMPRESSED     = 1 << 1, /* fill with compressed data */
  842         ARC_FILL_ENCRYPTED      = 1 << 2, /* fill with encrypted data */
  843         ARC_FILL_NOAUTH         = 1 << 3, /* don't attempt to authenticate */
  844         ARC_FILL_IN_PLACE       = 1 << 4  /* fill in place (special case) */
  845 } arc_fill_flags_t;
  846 
  847 typedef enum arc_ovf_level {
  848         ARC_OVF_NONE,                   /* ARC within target size. */
  849         ARC_OVF_SOME,                   /* ARC is slightly overflowed. */
  850         ARC_OVF_SEVERE                  /* ARC is severely overflowed. */
  851 } arc_ovf_level_t;
  852 
  853 static kmutex_t l2arc_feed_thr_lock;
  854 static kcondvar_t l2arc_feed_thr_cv;
  855 static uint8_t l2arc_thread_exit;
  856 
  857 static kmutex_t l2arc_rebuild_thr_lock;
  858 static kcondvar_t l2arc_rebuild_thr_cv;
  859 
  860 enum arc_hdr_alloc_flags {
  861         ARC_HDR_ALLOC_RDATA = 0x1,
  862         ARC_HDR_DO_ADAPT = 0x2,
  863         ARC_HDR_USE_RESERVE = 0x4,
  864         ARC_HDR_ALLOC_LINEAR = 0x8,
  865 };
  866 
  867 
  868 static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, const void *, int);
  869 static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, const void *);
  870 static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, const void *, int);
  871 static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, const void *);
  872 static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, const void *);
  873 static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size,
  874     const void *tag);
  875 static void arc_hdr_free_abd(arc_buf_hdr_t *, boolean_t);
  876 static void arc_hdr_alloc_abd(arc_buf_hdr_t *, int);
  877 static void arc_hdr_destroy(arc_buf_hdr_t *);
  878 static void arc_access(arc_buf_hdr_t *, arc_flags_t, boolean_t);
  879 static void arc_buf_watch(arc_buf_t *);
  880 static void arc_change_state(arc_state_t *, arc_buf_hdr_t *);
  881 
  882 static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
  883 static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
  884 static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
  885 static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
  886 
  887 static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
  888 static void l2arc_read_done(zio_t *);
  889 static void l2arc_do_free_on_write(void);
  890 static void l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr,
  891     boolean_t state_only);
  892 
  893 #define l2arc_hdr_arcstats_increment(hdr) \
  894         l2arc_hdr_arcstats_update((hdr), B_TRUE, B_FALSE)
  895 #define l2arc_hdr_arcstats_decrement(hdr) \
  896         l2arc_hdr_arcstats_update((hdr), B_FALSE, B_FALSE)
  897 #define l2arc_hdr_arcstats_increment_state(hdr) \
  898         l2arc_hdr_arcstats_update((hdr), B_TRUE, B_TRUE)
  899 #define l2arc_hdr_arcstats_decrement_state(hdr) \
  900         l2arc_hdr_arcstats_update((hdr), B_FALSE, B_TRUE)
  901 
  902 /*
   903  * l2arc_exclude_special : A ZFS module parameter that controls whether buffers
   904  *              present on special vdevs are eligible for caching in L2ARC. If
  905  *              set to 1, exclude dbufs on special vdevs from being cached to
  906  *              L2ARC.
  907  */
  908 int l2arc_exclude_special = 0;
  909 
  910 /*
  911  * l2arc_mfuonly : A ZFS module parameter that controls whether only MFU
  912  *              metadata and data are cached from ARC into L2ARC.
  913  */
  914 static int l2arc_mfuonly = 0;
  915 
  916 /*
  917  * L2ARC TRIM
  918  * l2arc_trim_ahead : A ZFS module parameter that controls how much ahead of
  919  *              the current write size (l2arc_write_max) we should TRIM if we
  920  *              have filled the device. It is defined as a percentage of the
  921  *              write size. If set to 100 we trim twice the space required to
  922  *              accommodate upcoming writes. A minimum of 64MB will be trimmed.
  923  *              It also enables TRIM of the whole L2ARC device upon creation or
  924  *              addition to an existing pool or if the header of the device is
  925  *              invalid upon importing a pool or onlining a cache device. The
  926  *              default is 0, which disables TRIM on L2ARC altogether as it can
  927  *              put significant stress on the underlying storage devices. This
   928  *              will vary depending on how well the specific device handles
  929  *              these commands.
  930  */
  931 static uint64_t l2arc_trim_ahead = 0;
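
/*
 * Illustrative note, not part of arc.c, reading the description above:
 * with l2arc_trim_ahead = 100 and the default l2arc_write_max of 8MB,
 * the target would be the write size plus 100% of it, i.e. 16MB, but
 * the stated 64MB floor means at least 64MB is trimmed.  The exact
 * bookkeeping lives in the L2ARC write/evict code, not here.
 */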
  932 
  933 /*
  934  * Performance tuning of L2ARC persistence:
  935  *
  936  * l2arc_rebuild_enabled : A ZFS module parameter that controls whether adding
  937  *              an L2ARC device (either at pool import or later) will attempt
  938  *              to rebuild L2ARC buffer contents.
  939  * l2arc_rebuild_blocks_min_l2size : A ZFS module parameter that controls
  940  *              whether log blocks are written to the L2ARC device. If the L2ARC
  941  *              device is less than 1GB, the amount of data l2arc_evict()
  942  *              evicts is significant compared to the amount of restored L2ARC
  943  *              data. In this case do not write log blocks in L2ARC in order
  944  *              not to waste space.
  945  */
  946 static int l2arc_rebuild_enabled = B_TRUE;
  947 static uint64_t l2arc_rebuild_blocks_min_l2size = 1024 * 1024 * 1024;
  948 
  949 /* L2ARC persistence rebuild control routines. */
  950 void l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen);
  951 static __attribute__((noreturn)) void l2arc_dev_rebuild_thread(void *arg);
  952 static int l2arc_rebuild(l2arc_dev_t *dev);
  953 
  954 /* L2ARC persistence read I/O routines. */
  955 static int l2arc_dev_hdr_read(l2arc_dev_t *dev);
  956 static int l2arc_log_blk_read(l2arc_dev_t *dev,
  957     const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,
  958     l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
  959     zio_t *this_io, zio_t **next_io);
  960 static zio_t *l2arc_log_blk_fetch(vdev_t *vd,
  961     const l2arc_log_blkptr_t *lp, l2arc_log_blk_phys_t *lb);
  962 static void l2arc_log_blk_fetch_abort(zio_t *zio);
  963 
  964 /* L2ARC persistence block restoration routines. */
  965 static void l2arc_log_blk_restore(l2arc_dev_t *dev,
  966     const l2arc_log_blk_phys_t *lb, uint64_t lb_asize);
  967 static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,
  968     l2arc_dev_t *dev);
  969 
  970 /* L2ARC persistence write I/O routines. */
  971 static void l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
  972     l2arc_write_callback_t *cb);
  973 
  974 /* L2ARC persistence auxiliary routines. */
  975 boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,
  976     const l2arc_log_blkptr_t *lbp);
  977 static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,
  978     const arc_buf_hdr_t *ab);
  979 boolean_t l2arc_range_check_overlap(uint64_t bottom,
  980     uint64_t top, uint64_t check);
  981 static void l2arc_blk_fetch_done(zio_t *zio);
  982 static inline uint64_t
  983     l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev);
  984 
  985 /*
  986  * We use Cityhash for this. It's fast, and has good hash properties without
  987  * requiring any large static buffers.
  988  */
  989 static uint64_t
  990 buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
  991 {
  992         return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));
  993 }
  994 
  995 #define HDR_EMPTY(hdr)                                          \
  996         ((hdr)->b_dva.dva_word[0] == 0 &&                       \
  997         (hdr)->b_dva.dva_word[1] == 0)
  998 
  999 #define HDR_EMPTY_OR_LOCKED(hdr)                                \
 1000         (HDR_EMPTY(hdr) || MUTEX_HELD(HDR_LOCK(hdr)))
 1001 
 1002 #define HDR_EQUAL(spa, dva, birth, hdr)                         \
 1003         ((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) &&     \
 1004         ((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) &&     \
 1005         ((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)
 1006 
 1007 static void
 1008 buf_discard_identity(arc_buf_hdr_t *hdr)
 1009 {
 1010         hdr->b_dva.dva_word[0] = 0;
 1011         hdr->b_dva.dva_word[1] = 0;
 1012         hdr->b_birth = 0;
 1013 }
 1014 
 1015 static arc_buf_hdr_t *
 1016 buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
 1017 {
 1018         const dva_t *dva = BP_IDENTITY(bp);
 1019         uint64_t birth = BP_PHYSICAL_BIRTH(bp);
 1020         uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
 1021         kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
 1022         arc_buf_hdr_t *hdr;
 1023 
 1024         mutex_enter(hash_lock);
 1025         for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
 1026             hdr = hdr->b_hash_next) {
 1027                 if (HDR_EQUAL(spa, dva, birth, hdr)) {
 1028                         *lockp = hash_lock;
 1029                         return (hdr);
 1030                 }
 1031         }
 1032         mutex_exit(hash_lock);
 1033         *lockp = NULL;
 1034         return (NULL);
 1035 }
 1036 
 1037 /*
 1038  * Insert an entry into the hash table.  If there is already an element
 1039  * equal to elem in the hash table, then the already existing element
 1040  * will be returned and the new element will not be inserted.
 1041  * Otherwise returns NULL.
 1042  * If lockp == NULL, the caller is assumed to already hold the hash lock.
 1043  */
 1044 static arc_buf_hdr_t *
 1045 buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)
 1046 {
 1047         uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
 1048         kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
 1049         arc_buf_hdr_t *fhdr;
 1050         uint32_t i;
 1051 
 1052         ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));
 1053         ASSERT(hdr->b_birth != 0);
 1054         ASSERT(!HDR_IN_HASH_TABLE(hdr));
 1055 
 1056         if (lockp != NULL) {
 1057                 *lockp = hash_lock;
 1058                 mutex_enter(hash_lock);
 1059         } else {
 1060                 ASSERT(MUTEX_HELD(hash_lock));
 1061         }
 1062 
 1063         for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
 1064             fhdr = fhdr->b_hash_next, i++) {
 1065                 if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
 1066                         return (fhdr);
 1067         }
 1068 
 1069         hdr->b_hash_next = buf_hash_table.ht_table[idx];
 1070         buf_hash_table.ht_table[idx] = hdr;
 1071         arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
 1072 
 1073         /* collect some hash table performance data */
 1074         if (i > 0) {
 1075                 ARCSTAT_BUMP(arcstat_hash_collisions);
 1076                 if (i == 1)
 1077                         ARCSTAT_BUMP(arcstat_hash_chains);
 1078 
 1079                 ARCSTAT_MAX(arcstat_hash_chain_max, i);
 1080         }
 1081         uint64_t he = atomic_inc_64_nv(
 1082             &arc_stats.arcstat_hash_elements.value.ui64);
 1083         ARCSTAT_MAX(arcstat_hash_elements_max, he);
 1084 
 1085         return (NULL);
 1086 }
 1087 
 1088 static void
 1089 buf_hash_remove(arc_buf_hdr_t *hdr)
 1090 {
 1091         arc_buf_hdr_t *fhdr, **hdrp;
 1092         uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
 1093 
 1094         ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
 1095         ASSERT(HDR_IN_HASH_TABLE(hdr));
 1096 
 1097         hdrp = &buf_hash_table.ht_table[idx];
 1098         while ((fhdr = *hdrp) != hdr) {
 1099                 ASSERT3P(fhdr, !=, NULL);
 1100                 hdrp = &fhdr->b_hash_next;
 1101         }
 1102         *hdrp = hdr->b_hash_next;
 1103         hdr->b_hash_next = NULL;
 1104         arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
 1105 
 1106         /* collect some hash table performance data */
 1107         atomic_dec_64(&arc_stats.arcstat_hash_elements.value.ui64);
 1108 
 1109         if (buf_hash_table.ht_table[idx] &&
 1110             buf_hash_table.ht_table[idx]->b_hash_next == NULL)
 1111                 ARCSTAT_BUMPDOWN(arcstat_hash_chains);
 1112 }
 1113 
 1114 /*
 1115  * Global data structures and functions for the buf kmem cache.
 1116  */
 1117 
 1118 static kmem_cache_t *hdr_full_cache;
 1119 static kmem_cache_t *hdr_full_crypt_cache;
 1120 static kmem_cache_t *hdr_l2only_cache;
 1121 static kmem_cache_t *buf_cache;
 1122 
 1123 static void
 1124 buf_fini(void)
 1125 {
 1126 #if defined(_KERNEL)
 1127         /*
 1128          * Large allocations which do not require contiguous pages
 1129          * should be freed using vmem_free() in the Linux kernel
 1130          */
 1131         vmem_free(buf_hash_table.ht_table,
 1132             (buf_hash_table.ht_mask + 1) * sizeof (void *));
 1133 #else
 1134         kmem_free(buf_hash_table.ht_table,
 1135             (buf_hash_table.ht_mask + 1) * sizeof (void *));
 1136 #endif
 1137         for (int i = 0; i < BUF_LOCKS; i++)
 1138                 mutex_destroy(BUF_HASH_LOCK(i));
 1139         kmem_cache_destroy(hdr_full_cache);
 1140         kmem_cache_destroy(hdr_full_crypt_cache);
 1141         kmem_cache_destroy(hdr_l2only_cache);
 1142         kmem_cache_destroy(buf_cache);
 1143 }
 1144 
 1145 /*
 1146  * Constructor callback - called when the cache is empty
 1147  * and a new buf is requested.
 1148  */
 1149 static int
 1150 hdr_full_cons(void *vbuf, void *unused, int kmflag)
 1151 {
 1152         (void) unused, (void) kmflag;
 1153         arc_buf_hdr_t *hdr = vbuf;
 1154 
 1155         memset(hdr, 0, HDR_FULL_SIZE);
 1156         hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
 1157         cv_init(&hdr->b_l1hdr.b_cv, NULL, CV_DEFAULT, NULL);
 1158         zfs_refcount_create(&hdr->b_l1hdr.b_refcnt);
 1159 #ifdef ZFS_DEBUG
 1160         mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
 1161 #endif
 1162         multilist_link_init(&hdr->b_l1hdr.b_arc_node);
 1163         list_link_init(&hdr->b_l2hdr.b_l2node);
 1164         arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS);
 1165 
 1166         return (0);
 1167 }
 1168 
 1169 static int
 1170 hdr_full_crypt_cons(void *vbuf, void *unused, int kmflag)
 1171 {
 1172         (void) unused;
 1173         arc_buf_hdr_t *hdr = vbuf;
 1174 
 1175         hdr_full_cons(vbuf, unused, kmflag);
 1176         memset(&hdr->b_crypt_hdr, 0, sizeof (hdr->b_crypt_hdr));
 1177         arc_space_consume(sizeof (hdr->b_crypt_hdr), ARC_SPACE_HDRS);
 1178 
 1179         return (0);
 1180 }
 1181 
 1182 static int
 1183 hdr_l2only_cons(void *vbuf, void *unused, int kmflag)
 1184 {
 1185         (void) unused, (void) kmflag;
 1186         arc_buf_hdr_t *hdr = vbuf;
 1187 
 1188         memset(hdr, 0, HDR_L2ONLY_SIZE);
 1189         arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
 1190 
 1191         return (0);
 1192 }
 1193 
 1194 static int
 1195 buf_cons(void *vbuf, void *unused, int kmflag)
 1196 {
 1197         (void) unused, (void) kmflag;
 1198         arc_buf_t *buf = vbuf;
 1199 
 1200         memset(buf, 0, sizeof (arc_buf_t));
 1201         arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
 1202 
 1203         return (0);
 1204 }
 1205 
 1206 /*
 1207  * Destructor callback - called when a cached buf is
 1208  * no longer required.
 1209  */
 1210 static void
 1211 hdr_full_dest(void *vbuf, void *unused)
 1212 {
 1213         (void) unused;
 1214         arc_buf_hdr_t *hdr = vbuf;
 1215 
 1216         ASSERT(HDR_EMPTY(hdr));
 1217         cv_destroy(&hdr->b_l1hdr.b_cv);
 1218         zfs_refcount_destroy(&hdr->b_l1hdr.b_refcnt);
 1219 #ifdef ZFS_DEBUG
 1220         mutex_destroy(&hdr->b_l1hdr.b_freeze_lock);
 1221 #endif
 1222         ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
 1223         arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS);
 1224 }
 1225 
 1226 static void
 1227 hdr_full_crypt_dest(void *vbuf, void *unused)
 1228 {
 1229         (void) vbuf, (void) unused;
 1230 
 1231         hdr_full_dest(vbuf, unused);
 1232         arc_space_return(sizeof (((arc_buf_hdr_t *)NULL)->b_crypt_hdr),
 1233             ARC_SPACE_HDRS);
 1234 }
 1235 
 1236 static void
 1237 hdr_l2only_dest(void *vbuf, void *unused)
 1238 {
 1239         (void) unused;
 1240         arc_buf_hdr_t *hdr = vbuf;
 1241 
 1242         ASSERT(HDR_EMPTY(hdr));
 1243         arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
 1244 }
 1245 
 1246 static void
 1247 buf_dest(void *vbuf, void *unused)
 1248 {
 1249         (void) unused;
 1250         (void) vbuf;
 1251 
 1252         arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
 1253 }
 1254 
 1255 static void
 1256 buf_init(void)
 1257 {
 1258         uint64_t *ct = NULL;
 1259         uint64_t hsize = 1ULL << 12;
 1260         int i, j;
 1261 
 1262         /*
 1263          * The hash table is sized to index all of physical memory,
 1264          * assuming an average block size of zfs_arc_average_blocksize (default 8K).
 1265          * By default, the table will take up
 1266          * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
 1267          */
 1268         while (hsize * zfs_arc_average_blocksize < arc_all_memory())
 1269                 hsize <<= 1;
 1270 retry:
 1271         buf_hash_table.ht_mask = hsize - 1;
 1272 #if defined(_KERNEL)
 1273         /*
 1274          * Large allocations which do not require contiguous pages
 1275          * should be allocated using vmem_alloc() in the Linux kernel
 1276          */
 1277         buf_hash_table.ht_table =
 1278             vmem_zalloc(hsize * sizeof (void*), KM_SLEEP);
 1279 #else
 1280         buf_hash_table.ht_table =
 1281             kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
 1282 #endif
 1283         if (buf_hash_table.ht_table == NULL) {
 1284                 ASSERT(hsize > (1ULL << 8));
 1285                 hsize >>= 1;
 1286                 goto retry;
 1287         }
 1288 
 1289         hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
 1290             0, hdr_full_cons, hdr_full_dest, NULL, NULL, NULL, 0);
 1291         hdr_full_crypt_cache = kmem_cache_create("arc_buf_hdr_t_full_crypt",
 1292             HDR_FULL_CRYPT_SIZE, 0, hdr_full_crypt_cons, hdr_full_crypt_dest,
 1293             NULL, NULL, NULL, 0);
 1294         hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
 1295             HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, NULL,
 1296             NULL, NULL, 0);
 1297         buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
 1298             0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
 1299 
 1300         for (i = 0; i < 256; i++)
 1301                 for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
 1302                         *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
 1303 
 1304         for (i = 0; i < BUF_LOCKS; i++)
 1305                 mutex_init(BUF_HASH_LOCK(i), NULL, MUTEX_DEFAULT, NULL);
 1306 }
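
      /*
       * Worked sizing example for the hash table allocated in buf_init()
       * above (illustrative, not part of the upstream file): with 64 GiB
       * of physical memory and zfs_arc_average_blocksize at its 8K
       * default, the sizing loop grows hsize until hsize * 8K covers
       * 64 GiB, i.e. to 2^23 buckets.  The bucket array is then
       * 2^23 * sizeof (void *) = 64 MiB on a 64-bit system, matching the
       * "1MB per GB" estimate in the comment above.
       */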
 1307 
 1308 #define ARC_MINTIME     (hz>>4) /* 62 ms */
 1309 
 1310 /*
 1311  * This is the size that the buf occupies in memory. If the buf is compressed,
 1312  * it will correspond to the compressed size. You should use this method of
 1313  * getting the buf size unless you explicitly need the logical size.
 1314  */
 1315 uint64_t
 1316 arc_buf_size(arc_buf_t *buf)
 1317 {
 1318         return (ARC_BUF_COMPRESSED(buf) ?
 1319             HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
 1320 }
 1321 
 1322 uint64_t
 1323 arc_buf_lsize(arc_buf_t *buf)
 1324 {
 1325         return (HDR_GET_LSIZE(buf->b_hdr));
 1326 }
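
      /*
       * Illustrative sketch (not part of the upstream file): for a
       * compressed buf, arc_buf_size() reports the in-memory (physical)
       * size while arc_buf_lsize() reports the logical size; for an
       * uncompressed buf the two are equal.  The helper name is an
       * assumption for illustration only.
       */
      static uint64_t
      example_buf_memory_savings(arc_buf_t *buf)
      {
              /* Bytes saved in memory by keeping this buf compressed. */
              return (arc_buf_lsize(buf) - arc_buf_size(buf));
      }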
 1327 
 1328 /*
 1329  * This function will return B_TRUE if the buffer is encrypted in memory.
 1330  * This buffer can be decrypted by calling arc_untransform().
 1331  */
 1332 boolean_t
 1333 arc_is_encrypted(arc_buf_t *buf)
 1334 {
 1335         return (ARC_BUF_ENCRYPTED(buf) != 0);
 1336 }
 1337 
 1338 /*
 1339  * Returns B_TRUE if the buffer represents data that has not had its MAC
 1340  * verified yet.
 1341  */
 1342 boolean_t
 1343 arc_is_unauthenticated(arc_buf_t *buf)
 1344 {
 1345         return (HDR_NOAUTH(buf->b_hdr) != 0);
 1346 }
 1347 
 1348 void
 1349 arc_get_raw_params(arc_buf_t *buf, boolean_t *byteorder, uint8_t *salt,
 1350     uint8_t *iv, uint8_t *mac)
 1351 {
 1352         arc_buf_hdr_t *hdr = buf->b_hdr;
 1353 
 1354         ASSERT(HDR_PROTECTED(hdr));
 1355 
 1356         memcpy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);
 1357         memcpy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);
 1358         memcpy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);
 1359         *byteorder = (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?
 1360             ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;
 1361 }
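
      /*
       * Illustrative sketch (not part of the upstream file): retrieving
       * the raw crypto parameters of a protected buffer into caller-owned
       * storage.  The array lengths follow the ZIO_DATA_*_LEN constants
       * used above; the helper name is an assumption for illustration
       * only.
       */
      static void
      example_get_crypt_params(arc_buf_t *buf)
      {
              boolean_t byteorder;
              uint8_t salt[ZIO_DATA_SALT_LEN];
              uint8_t iv[ZIO_DATA_IV_LEN];
              uint8_t mac[ZIO_DATA_MAC_LEN];

              arc_get_raw_params(buf, &byteorder, salt, iv, mac);
              /* salt, iv and mac now hold the buffer's stored parameters. */
              (void) byteorder;
      }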
 1362 
 1363 /*
 1364  * Indicates how this buffer is compressed in memory. If it is not compressed
 1365  * the value will be ZIO_COMPRESS_OFF. It can be made normally readable with
 1366  * arc_untransform() as long as it is also unencrypted.
 1367  */
 1368 enum zio_compress
 1369 arc_get_compression(arc_buf_t *buf)
 1370 {
 1371         return (ARC_BUF_COMPRESSED(buf) ?
 1372             HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF);
 1373 }
 1374 
 1375 /*
 1376  * Return the compression algorithm used to store this data in the ARC. If ARC
 1377  * compression is enabled or this is an encrypted block, this will be the same
 1378  * as what's used to store it on-disk. Otherwise, this will be ZIO_COMPRESS_OFF.
 1379  */
 1380 static inline enum zio_compress
 1381 arc_hdr_get_compress(arc_buf_hdr_t *hdr)
 1382 {
 1383         return (HDR_COMPRESSION_ENABLED(hdr) ?
 1384             HDR_GET_COMPRESS(hdr) : ZIO_COMPRESS_OFF);
 1385 }
 1386 
 1387 uint8_t
 1388 arc_get_complevel(arc_buf_t *buf)
 1389 {
 1390         return (buf->b_hdr->b_complevel);
 1391 }
 1392 
 1393 static inline boolean_t
 1394 arc_buf_is_shared(arc_buf_t *buf)
 1395 {
 1396         boolean_t shared = (buf->b_data != NULL &&
 1397             buf->b_hdr->b_l1hdr.b_pabd != NULL &&
 1398             abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) &&
 1399             buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd));
 1400         IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr));
 1401         IMPLY(shared, ARC_BUF_SHARED(buf));
 1402         IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));
 1403 
 1404         /*
 1405          * It would be nice to assert arc_can_share() too, but the "hdr isn't
 1406          * already being shared" requirement prevents us from doing that.
 1407          */
 1408 
 1409         return (shared);
 1410 }
 1411 
 1412 /*
 1413  * Free the checksum associated with this header. If there is no checksum, this
 1414  * is a no-op.
 1415  */
 1416 static inline void
 1417 arc_cksum_free(arc_buf_hdr_t *hdr)
 1418 {
 1419 #ifdef ZFS_DEBUG
 1420         ASSERT(HDR_HAS_L1HDR(hdr));
 1421 
 1422         mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
 1423         if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
 1424                 kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
 1425                 hdr->b_l1hdr.b_freeze_cksum = NULL;
 1426         }
 1427         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
 1428 #endif
 1429 }
 1430 
 1431 /*
 1432  * Return true iff at least one of the bufs on hdr is not compressed.
 1433  * Encrypted buffers count as compressed.
 1434  */
 1435 static boolean_t
 1436 arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
 1437 {
 1438         ASSERT(hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY_OR_LOCKED(hdr));
 1439 
 1440         for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {
 1441                 if (!ARC_BUF_COMPRESSED(b)) {
 1442                         return (B_TRUE);
 1443                 }
 1444         }
 1445         return (B_FALSE);
 1446 }
 1447 
 1448 
 1449 /*
 1450  * If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data
 1451  * matches the checksum that is stored in the hdr. If there is no checksum,
 1452  * or if the buf is compressed, this is a no-op.
 1453  */
 1454 static void
 1455 arc_cksum_verify(arc_buf_t *buf)
 1456 {
 1457 #ifdef ZFS_DEBUG
 1458         arc_buf_hdr_t *hdr = buf->b_hdr;
 1459         zio_cksum_t zc;
 1460 
 1461         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
 1462                 return;
 1463 
 1464         if (ARC_BUF_COMPRESSED(buf))
 1465                 return;
 1466 
 1467         ASSERT(HDR_HAS_L1HDR(hdr));
 1468 
 1469         mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
 1470 
 1471         if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
 1472                 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
 1473                 return;
 1474         }
 1475 
 1476         fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
 1477         if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
 1478                 panic("buffer modified while frozen!");
 1479         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
 1480 #endif
 1481 }
 1482 
 1483 /*
 1484  * This function makes the assumption that data stored in the L2ARC
 1485  * will be transformed exactly as it is in the main pool. Because of
 1486  * this we can verify the checksum against the reading process's bp.
 1487  */
 1488 static boolean_t
 1489 arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
 1490 {
 1491         ASSERT(!BP_IS_EMBEDDED(zio->io_bp));
 1492         VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));
 1493 
 1494         /*
 1495          * Block pointers always store the checksum for the logical data.
 1496          * If the block pointer has the gang bit set, then the checksum
 1497          * it represents is for the reconstituted data and not for an
 1498          * individual gang member. The zio pipeline, however, must be able to
 1499          * determine the checksum of each of the gang constituents so it
 1500          * treats the checksum comparison differently than what we need
 1501          * for l2arc blocks. This prevents us from using the
 1502          * zio_checksum_error() interface directly. Instead we must call the
 1503          * zio_checksum_error_impl() so that we can ensure the checksum is
 1504          * generated using the correct checksum algorithm and accounts for the
 1505          * logical I/O size and not just a gang fragment.
 1506          */
 1507         return (zio_checksum_error_impl(zio->io_spa, zio->io_bp,
 1508             BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size,
 1509             zio->io_offset, NULL) == 0);
 1510 }
 1511 
 1512 /*
 1513  * Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a
 1514  * checksum and attaches it to the buf's hdr so that we can ensure that the buf
 1515  * isn't modified later on. If buf is compressed or there is already a checksum
 1516  * on the hdr, this is a no-op (we only checksum uncompressed bufs).
 1517  */
 1518 static void
 1519 arc_cksum_compute(arc_buf_t *buf)
 1520 {
 1521         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
 1522                 return;
 1523 
 1524 #ifdef ZFS_DEBUG
 1525         arc_buf_hdr_t *hdr = buf->b_hdr;
 1526         ASSERT(HDR_HAS_L1HDR(hdr));
 1527         mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
 1528         if (hdr->b_l1hdr.b_freeze_cksum != NULL || ARC_BUF_COMPRESSED(buf)) {
 1529                 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
 1530                 return;
 1531         }
 1532 
 1533         ASSERT(!ARC_BUF_ENCRYPTED(buf));
 1534         ASSERT(!ARC_BUF_COMPRESSED(buf));
 1535         hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
 1536             KM_SLEEP);
 1537         fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
 1538             hdr->b_l1hdr.b_freeze_cksum);
 1539         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
 1540 #endif
 1541         arc_buf_watch(buf);
 1542 }
 1543 
 1544 #ifndef _KERNEL
 1545 void
 1546 arc_buf_sigsegv(int sig, siginfo_t *si, void *unused)
 1547 {
 1548         (void) sig, (void) unused;
 1549         panic("Got SIGSEGV at address: 0x%lx\n", (long)si->si_addr);
 1550 }
 1551 #endif
 1552 
 1553 static void
 1554 arc_buf_unwatch(arc_buf_t *buf)
 1555 {
 1556 #ifndef _KERNEL
 1557         if (arc_watch) {
 1558                 ASSERT0(mprotect(buf->b_data, arc_buf_size(buf),
 1559                     PROT_READ | PROT_WRITE));
 1560         }
 1561 #else
 1562         (void) buf;
 1563 #endif
 1564 }
 1565 
 1566 static void
 1567 arc_buf_watch(arc_buf_t *buf)
 1568 {
 1569 #ifndef _KERNEL
 1570         if (arc_watch)
 1571                 ASSERT0(mprotect(buf->b_data, arc_buf_size(buf),
 1572                     PROT_READ));
 1573 #else
 1574         (void) buf;
 1575 #endif
 1576 }
 1577 
 1578 static arc_buf_contents_t
 1579 arc_buf_type(arc_buf_hdr_t *hdr)
 1580 {
 1581         arc_buf_contents_t type;
 1582         if (HDR_ISTYPE_METADATA(hdr)) {
 1583                 type = ARC_BUFC_METADATA;
 1584         } else {
 1585                 type = ARC_BUFC_DATA;
 1586         }
 1587         VERIFY3U(hdr->b_type, ==, type);
 1588         return (type);
 1589 }
 1590 
 1591 boolean_t
 1592 arc_is_metadata(arc_buf_t *buf)
 1593 {
 1594         return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);
 1595 }
 1596 
 1597 static uint32_t
 1598 arc_bufc_to_flags(arc_buf_contents_t type)
 1599 {
 1600         switch (type) {
 1601         case ARC_BUFC_DATA:
 1602                 /* metadata field is 0 if buffer contains normal data */
 1603                 return (0);
 1604         case ARC_BUFC_METADATA:
 1605                 return (ARC_FLAG_BUFC_METADATA);
 1606         default:
 1607                 break;
 1608         }
 1609         panic("undefined ARC buffer type!");
 1610         return ((uint32_t)-1);
 1611 }
 1612 
 1613 void
 1614 arc_buf_thaw(arc_buf_t *buf)
 1615 {
 1616         arc_buf_hdr_t *hdr = buf->b_hdr;
 1617 
 1618         ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
 1619         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 1620 
 1621         arc_cksum_verify(buf);
 1622 
 1623         /*
 1624          * Compressed buffers do not manipulate the b_freeze_cksum.
 1625          */
 1626         if (ARC_BUF_COMPRESSED(buf))
 1627                 return;
 1628 
 1629         ASSERT(HDR_HAS_L1HDR(hdr));
 1630         arc_cksum_free(hdr);
 1631         arc_buf_unwatch(buf);
 1632 }
 1633 
 1634 void
 1635 arc_buf_freeze(arc_buf_t *buf)
 1636 {
 1637         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
 1638                 return;
 1639 
 1640         if (ARC_BUF_COMPRESSED(buf))
 1641                 return;
 1642 
 1643         ASSERT(HDR_HAS_L1HDR(buf->b_hdr));
 1644         arc_cksum_compute(buf);
 1645 }
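
      /*
       * Illustrative sketch (not part of the upstream file): the
       * ZFS_DEBUG_MODIFY freeze/thaw lifecycle.  An anonymous buffer that
       * is about to be modified must be thawed first (freeing the freeze
       * checksum) and frozen again afterwards, otherwise
       * arc_cksum_verify() panics with "buffer modified while frozen!".
       * The helper name and parameters are assumptions for illustration
       * only; len must not exceed arc_buf_size(buf).
       */
      static void
      example_modify_anon_buf(arc_buf_t *buf, const void *src, uint64_t len)
      {
              arc_buf_thaw(buf);              /* frees the freeze checksum */
              memcpy(buf->b_data, src, len);  /* mutate the contents */
              arc_buf_freeze(buf);            /* recompute the checksum */
      }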
 1646 
 1647 /*
 1648  * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
 1649  * the following functions should be used to ensure that the flags are
 1650  * updated in a thread-safe way. When manipulating the flags either
 1651  * the hash_lock must be held or the hdr must be undiscoverable. This
 1652  * ensures that we're not racing with any other threads when updating
 1653  * the flags.
 1654  */
 1655 static inline void
 1656 arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
 1657 {
 1658         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 1659         hdr->b_flags |= flags;
 1660 }
 1661 
 1662 static inline void
 1663 arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
 1664 {
 1665         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 1666         hdr->b_flags &= ~flags;
 1667 }
 1668 
 1669 /*
 1670  * Setting the compression bits in the arc_buf_hdr_t's b_flags is
 1671  * done in a special way since we have to clear and set bits
 1672  * at the same time. Consumers that wish to set the compression bits
 1673  * must use this function to ensure that the flags are updated in
 1674  * a thread-safe manner.
 1675  */
 1676 static void
 1677 arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
 1678 {
 1679         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 1680 
 1681         /*
 1682          * Holes and embedded blocks will always have a psize of 0, so
 1683          * we ignore the blkptr's compression setting and mark such
 1684          * headers as uncompressed.
 1685          */
 1686         if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
 1687                 arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
 1688                 ASSERT(!HDR_COMPRESSION_ENABLED(hdr));
 1689         } else {
 1690                 arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
 1691                 ASSERT(HDR_COMPRESSION_ENABLED(hdr));
 1692         }
 1693 
 1694         HDR_SET_COMPRESS(hdr, cmp);
 1695         ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp);
 1696 }
 1697 
 1698 /*
 1699  * Looks for another buf on the same hdr which has the data decompressed, copies
 1700  * from it, and returns true. If no such buf exists, returns false.
 1701  */
 1702 static boolean_t
 1703 arc_buf_try_copy_decompressed_data(arc_buf_t *buf)
 1704 {
 1705         arc_buf_hdr_t *hdr = buf->b_hdr;
 1706         boolean_t copied = B_FALSE;
 1707 
 1708         ASSERT(HDR_HAS_L1HDR(hdr));
 1709         ASSERT3P(buf->b_data, !=, NULL);
 1710         ASSERT(!ARC_BUF_COMPRESSED(buf));
 1711 
 1712         for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
 1713             from = from->b_next) {
 1714                 /* can't use our own data buffer */
 1715                 if (from == buf) {
 1716                         continue;
 1717                 }
 1718 
 1719                 if (!ARC_BUF_COMPRESSED(from)) {
 1720                         memcpy(buf->b_data, from->b_data, arc_buf_size(buf));
 1721                         copied = B_TRUE;
 1722                         break;
 1723                 }
 1724         }
 1725 
 1726 #ifdef ZFS_DEBUG
 1727         /*
 1728          * There were no decompressed bufs, so there should not be a
 1729          * checksum on the hdr either.
 1730          */
 1731         if (zfs_flags & ZFS_DEBUG_MODIFY)
 1732                 EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
 1733 #endif
 1734 
 1735         return (copied);
 1736 }
 1737 
 1738 /*
 1739  * Allocates an ARC buf header that's in an evicted & L2-cached state.
 1740  * This is used during l2arc reconstruction to make empty ARC buffers
 1741  * which circumvent the regular disk->arc->l2arc path and instead come
 1742  * into being in the reverse order, i.e. l2arc->arc.
 1743  */
 1744 static arc_buf_hdr_t *
 1745 arc_buf_alloc_l2only(size_t size, arc_buf_contents_t type, l2arc_dev_t *dev,
 1746     dva_t dva, uint64_t daddr, int32_t psize, uint64_t birth,
 1747     enum zio_compress compress, uint8_t complevel, boolean_t protected,
 1748     boolean_t prefetch, arc_state_type_t arcs_state)
 1749 {
 1750         arc_buf_hdr_t   *hdr;
 1751 
 1752         ASSERT(size != 0);
 1753         hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP);
 1754         hdr->b_birth = birth;
 1755         hdr->b_type = type;
 1756         hdr->b_flags = 0;
 1757         arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
 1758         HDR_SET_LSIZE(hdr, size);
 1759         HDR_SET_PSIZE(hdr, psize);
 1760         arc_hdr_set_compress(hdr, compress);
 1761         hdr->b_complevel = complevel;
 1762         if (protected)
 1763                 arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);
 1764         if (prefetch)
 1765                 arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
 1766         hdr->b_spa = spa_load_guid(dev->l2ad_vdev->vdev_spa);
 1767 
 1768         hdr->b_dva = dva;
 1769 
 1770         hdr->b_l2hdr.b_dev = dev;
 1771         hdr->b_l2hdr.b_daddr = daddr;
 1772         hdr->b_l2hdr.b_arcs_state = arcs_state;
 1773 
 1774         return (hdr);
 1775 }
 1776 
 1777 /*
 1778  * Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t.
 1779  */
 1780 static uint64_t
 1781 arc_hdr_size(arc_buf_hdr_t *hdr)
 1782 {
 1783         uint64_t size;
 1784 
 1785         if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&
 1786             HDR_GET_PSIZE(hdr) > 0) {
 1787                 size = HDR_GET_PSIZE(hdr);
 1788         } else {
 1789                 ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0);
 1790                 size = HDR_GET_LSIZE(hdr);
 1791         }
 1792         return (size);
 1793 }
 1794 
 1795 static int
 1796 arc_hdr_authenticate(arc_buf_hdr_t *hdr, spa_t *spa, uint64_t dsobj)
 1797 {
 1798         int ret;
 1799         uint64_t csize;
 1800         uint64_t lsize = HDR_GET_LSIZE(hdr);
 1801         uint64_t psize = HDR_GET_PSIZE(hdr);
 1802         void *tmpbuf = NULL;
 1803         abd_t *abd = hdr->b_l1hdr.b_pabd;
 1804 
 1805         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 1806         ASSERT(HDR_AUTHENTICATED(hdr));
 1807         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
 1808 
 1809         /*
 1810          * The MAC is calculated on the compressed data that is stored on disk.
 1811          * However, if compressed arc is disabled we will only have the
 1812          * decompressed data available to us now. Compress it into a temporary
 1813          * abd so we can verify the MAC. The performance overhead of this will
 1814          * be relatively low, since most objects in an encrypted objset will
 1815          * be encrypted (instead of authenticated) anyway.
 1816          */
 1817         if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
 1818             !HDR_COMPRESSION_ENABLED(hdr)) {
 1819                 tmpbuf = zio_buf_alloc(lsize);
 1820                 abd = abd_get_from_buf(tmpbuf, lsize);
 1821                 abd_take_ownership_of_buf(abd, B_TRUE);
 1822                 csize = zio_compress_data(HDR_GET_COMPRESS(hdr),
 1823                     hdr->b_l1hdr.b_pabd, tmpbuf, lsize, hdr->b_complevel);
 1824                 ASSERT3U(csize, <=, psize);
 1825                 abd_zero_off(abd, csize, psize - csize);
 1826         }
 1827 
 1828         /*
 1829          * Authentication is best effort. We authenticate whenever the key is
 1830          * available. If we succeed we clear ARC_FLAG_NOAUTH.
 1831          */
 1832         if (hdr->b_crypt_hdr.b_ot == DMU_OT_OBJSET) {
 1833                 ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
 1834                 ASSERT3U(lsize, ==, psize);
 1835                 ret = spa_do_crypt_objset_mac_abd(B_FALSE, spa, dsobj, abd,
 1836                     psize, hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);
 1837         } else {
 1838                 ret = spa_do_crypt_mac_abd(B_FALSE, spa, dsobj, abd, psize,
 1839                     hdr->b_crypt_hdr.b_mac);
 1840         }
 1841 
 1842         if (ret == 0)
 1843                 arc_hdr_clear_flags(hdr, ARC_FLAG_NOAUTH);
 1844         else if (ret != ENOENT)
 1845                 goto error;
 1846 
 1847         if (tmpbuf != NULL)
 1848                 abd_free(abd);
 1849 
 1850         return (0);
 1851 
 1852 error:
 1853         if (tmpbuf != NULL)
 1854                 abd_free(abd);
 1855 
 1856         return (ret);
 1857 }
 1858 
 1859 /*
 1860  * This function will take a header that only has raw encrypted data in
 1861  * b_crypt_hdr.b_rabd and decrypt it into a new buffer which is stored in
 1862  * b_l1hdr.b_pabd. If designated in the header flags, this function will
 1863  * also decompress the data.
 1864  */
 1865 static int
 1866 arc_hdr_decrypt(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb)
 1867 {
 1868         int ret;
 1869         abd_t *cabd = NULL;
 1870         void *tmp = NULL;
 1871         boolean_t no_crypt = B_FALSE;
 1872         boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);
 1873 
 1874         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 1875         ASSERT(HDR_ENCRYPTED(hdr));
 1876 
 1877         arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT);
 1878 
 1879         ret = spa_do_crypt_abd(B_FALSE, spa, zb, hdr->b_crypt_hdr.b_ot,
 1880             B_FALSE, bswap, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv,
 1881             hdr->b_crypt_hdr.b_mac, HDR_GET_PSIZE(hdr), hdr->b_l1hdr.b_pabd,
 1882             hdr->b_crypt_hdr.b_rabd, &no_crypt);
 1883         if (ret != 0)
 1884                 goto error;
 1885 
 1886         if (no_crypt) {
 1887                 abd_copy(hdr->b_l1hdr.b_pabd, hdr->b_crypt_hdr.b_rabd,
 1888                     HDR_GET_PSIZE(hdr));
 1889         }
 1890 
 1891         /*
 1892          * If ARC compression is disabled for this header but the b_pabd
 1893          * is still compressed after decryption, we need to decompress
 1894          * the newly decrypted data.
 1895          */
 1896         if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
 1897             !HDR_COMPRESSION_ENABLED(hdr)) {
 1898                 /*
 1899                  * We want to make sure that we are correctly honoring the
 1900                  * zfs_abd_scatter_enabled setting, so we allocate an abd here
 1901                  * and then loan a buffer from it, rather than allocating a
 1902                  * linear buffer and wrapping it in an abd later.
 1903                  */
 1904                 cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr,
 1905                     ARC_HDR_DO_ADAPT);
 1906                 tmp = abd_borrow_buf(cabd, arc_hdr_size(hdr));
 1907 
 1908                 ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),
 1909                     hdr->b_l1hdr.b_pabd, tmp, HDR_GET_PSIZE(hdr),
 1910                     HDR_GET_LSIZE(hdr), &hdr->b_complevel);
 1911                 if (ret != 0) {
 1912                         abd_return_buf(cabd, tmp, arc_hdr_size(hdr));
 1913                         goto error;
 1914                 }
 1915 
 1916                 abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr));
 1917                 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
 1918                     arc_hdr_size(hdr), hdr);
 1919                 hdr->b_l1hdr.b_pabd = cabd;
 1920         }
 1921 
 1922         return (0);
 1923 
 1924 error:
 1925         arc_hdr_free_abd(hdr, B_FALSE);
 1926         if (cabd != NULL)
 1927                 arc_free_data_buf(hdr, cabd, arc_hdr_size(hdr), hdr);
 1928 
 1929         return (ret);
 1930 }
 1931 
 1932 /*
 1933  * This function is called during arc_buf_fill() to prepare the header's
 1934  * abd plaintext pointer for use. This involves authenticating protected
 1935  * data and decrypting encrypted data into the plaintext abd.
 1936  */
 1937 static int
 1938 arc_fill_hdr_crypt(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, spa_t *spa,
 1939     const zbookmark_phys_t *zb, boolean_t noauth)
 1940 {
 1941         int ret;
 1942 
 1943         ASSERT(HDR_PROTECTED(hdr));
 1944 
 1945         if (hash_lock != NULL)
 1946                 mutex_enter(hash_lock);
 1947 
 1948         if (HDR_NOAUTH(hdr) && !noauth) {
 1949                 /*
 1950                  * The caller requested authenticated data but our data has
 1951                  * not been authenticated yet. Verify the MAC now if we can.
 1952                  */
 1953                 ret = arc_hdr_authenticate(hdr, spa, zb->zb_objset);
 1954                 if (ret != 0)
 1955                         goto error;
 1956         } else if (HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd == NULL) {
 1957                 /*
 1958                  * If we only have the encrypted version of the data, but the
 1959                  * unencrypted version was requested we take this opportunity
 1960                  * to store the decrypted version in the header for future use.
 1961                  */
 1962                 ret = arc_hdr_decrypt(hdr, spa, zb);
 1963                 if (ret != 0)
 1964                         goto error;
 1965         }
 1966 
 1967         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
 1968 
 1969         if (hash_lock != NULL)
 1970                 mutex_exit(hash_lock);
 1971 
 1972         return (0);
 1973 
 1974 error:
 1975         if (hash_lock != NULL)
 1976                 mutex_exit(hash_lock);
 1977 
 1978         return (ret);
 1979 }
 1980 
 1981 /*
 1982  * This function is used by the dbuf code to decrypt bonus buffers in place.
 1983  * The dbuf code itself doesn't have any locking for decrypting a shared dnode
 1984  * block, so we use the hash lock here to protect against concurrent calls to
 1985  * arc_buf_fill().
 1986  */
 1987 static void
 1988 arc_buf_untransform_in_place(arc_buf_t *buf)
 1989 {
 1990         arc_buf_hdr_t *hdr = buf->b_hdr;
 1991 
 1992         ASSERT(HDR_ENCRYPTED(hdr));
 1993         ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);
 1994         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 1995         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
 1996 
 1997         zio_crypt_copy_dnode_bonus(hdr->b_l1hdr.b_pabd, buf->b_data,
 1998             arc_buf_size(buf));
 1999         buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;
 2000         buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
 2001         hdr->b_crypt_hdr.b_ebufcnt -= 1;
 2002 }
 2003 
 2004 /*
 2005  * Given a buf that has a data buffer attached to it, this function will
 2006  * efficiently fill the buf with data of the specified compression setting from
 2007  * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
 2008  * are already sharing a data buf, no copy is performed.
 2009  *
 2010  * If the buf is marked as compressed but uncompressed data was requested, this
 2011  * will allocate a new data buffer for the buf, remove that flag, and fill the
 2012  * buf with uncompressed data. You can't request a compressed buf on a hdr with
 2013  * uncompressed data, and (since we haven't added support for it yet) if you
 2014  * want compressed data your buf must already be marked as compressed and have
 2015  * the correct-sized data buffer.
 2016  */
 2017 static int
 2018 arc_buf_fill(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,
 2019     arc_fill_flags_t flags)
 2020 {
 2021         int error = 0;
 2022         arc_buf_hdr_t *hdr = buf->b_hdr;
 2023         boolean_t hdr_compressed =
 2024             (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);
 2025         boolean_t compressed = (flags & ARC_FILL_COMPRESSED) != 0;
 2026         boolean_t encrypted = (flags & ARC_FILL_ENCRYPTED) != 0;
 2027         dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap;
 2028         kmutex_t *hash_lock = (flags & ARC_FILL_LOCKED) ? NULL : HDR_LOCK(hdr);
 2029 
 2030         ASSERT3P(buf->b_data, !=, NULL);
 2031         IMPLY(compressed, hdr_compressed || ARC_BUF_ENCRYPTED(buf));
 2032         IMPLY(compressed, ARC_BUF_COMPRESSED(buf));
 2033         IMPLY(encrypted, HDR_ENCRYPTED(hdr));
 2034         IMPLY(encrypted, ARC_BUF_ENCRYPTED(buf));
 2035         IMPLY(encrypted, ARC_BUF_COMPRESSED(buf));
 2036         IMPLY(encrypted, !ARC_BUF_SHARED(buf));
 2037 
 2038         /*
 2039          * If the caller wanted encrypted data we just need to copy it from
 2040          * b_rabd and potentially byteswap it. We won't be able to do any
 2041          * further transforms on it.
 2042          */
 2043         if (encrypted) {
 2044                 ASSERT(HDR_HAS_RABD(hdr));
 2045                 abd_copy_to_buf(buf->b_data, hdr->b_crypt_hdr.b_rabd,
 2046                     HDR_GET_PSIZE(hdr));
 2047                 goto byteswap;
 2048         }
 2049 
 2050         /*
 2051          * Adjust encrypted and authenticated headers to accommodate
 2052          * the request if needed. Dnode blocks (ARC_FILL_IN_PLACE) are
 2053          * allowed to fail decryption when keys are not loaded, without
 2054          * the header being marked as an IO error.
 2055          */
 2056         if (HDR_PROTECTED(hdr)) {
 2057                 error = arc_fill_hdr_crypt(hdr, hash_lock, spa,
 2058                     zb, !!(flags & ARC_FILL_NOAUTH));
 2059                 if (error == EACCES && (flags & ARC_FILL_IN_PLACE) != 0) {
 2060                         return (error);
 2061                 } else if (error != 0) {
 2062                         if (hash_lock != NULL)
 2063                                 mutex_enter(hash_lock);
 2064                         arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
 2065                         if (hash_lock != NULL)
 2066                                 mutex_exit(hash_lock);
 2067                         return (error);
 2068                 }
 2069         }
 2070 
 2071         /*
 2072          * There is a special case here for dnode blocks which are
 2073          * decrypting their bonus buffers. These blocks may request to
 2074          * be decrypted in-place. This is necessary because there may
 2075          * be many dnodes pointing into this buffer and there is
 2076          * currently no method to synchronize replacing the backing
 2077          * b_data buffer and updating all of the pointers. Here we use
 2078          * the hash lock to ensure there are no races. If the need
 2079          * arises for other types to be decrypted in-place, they must
 2080          * add handling here as well.
 2081          */
 2082         if ((flags & ARC_FILL_IN_PLACE) != 0) {
 2083                 ASSERT(!hdr_compressed);
 2084                 ASSERT(!compressed);
 2085                 ASSERT(!encrypted);
 2086 
 2087                 if (HDR_ENCRYPTED(hdr) && ARC_BUF_ENCRYPTED(buf)) {
 2088                         ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);
 2089 
 2090                         if (hash_lock != NULL)
 2091                                 mutex_enter(hash_lock);
 2092                         arc_buf_untransform_in_place(buf);
 2093                         if (hash_lock != NULL)
 2094                                 mutex_exit(hash_lock);
 2095 
 2096                         /* Compute the hdr's checksum if necessary */
 2097                         arc_cksum_compute(buf);
 2098                 }
 2099 
 2100                 return (0);
 2101         }
 2102 
 2103         if (hdr_compressed == compressed) {
 2104                 if (!arc_buf_is_shared(buf)) {
 2105                         abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd,
 2106                             arc_buf_size(buf));
 2107                 }
 2108         } else {
 2109                 ASSERT(hdr_compressed);
 2110                 ASSERT(!compressed);
 2111 
 2112                 /*
 2113                  * If the buf is sharing its data with the hdr, unlink it and
 2114                  * allocate a new data buffer for the buf.
 2115                  */
 2116                 if (arc_buf_is_shared(buf)) {
 2117                         ASSERT(ARC_BUF_COMPRESSED(buf));
 2118 
 2119                         /* We need to give the buf its own b_data */
 2120                         buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
 2121                         buf->b_data =
 2122                             arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
 2123                         arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
 2124 
 2125                         /* Previously overhead was 0; just add new overhead */
 2126                         ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr));
 2127                 } else if (ARC_BUF_COMPRESSED(buf)) {
 2128                         /* We need to reallocate the buf's b_data */
 2129                         arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr),
 2130                             buf);
 2131                         buf->b_data =
 2132                             arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
 2133 
 2134                         /* We increased the size of b_data; update overhead */
 2135                         ARCSTAT_INCR(arcstat_overhead_size,
 2136                             HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));
 2137                 }
 2138 
 2139                 /*
 2140                  * Regardless of the buf's previous compression settings, it
 2141                  * should not be compressed at the end of this function.
 2142                  */
 2143                 buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
 2144 
 2145                 /*
 2146                  * Try copying the data from another buf which already has a
 2147                  * decompressed version. If that's not possible, it's time to
 2148                  * bite the bullet and decompress the data from the hdr.
 2149                  */
 2150                 if (arc_buf_try_copy_decompressed_data(buf)) {
 2151                         /* Skip byteswapping and checksumming (already done) */
 2152                         return (0);
 2153                 } else {
 2154                         error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
 2155                             hdr->b_l1hdr.b_pabd, buf->b_data,
 2156                             HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr),
 2157                             &hdr->b_complevel);
 2158 
 2159                         /*
 2160                          * Absent hardware errors or software bugs, this should
 2161                          * be impossible, but log it anyway so we can debug it.
 2162                          */
 2163                         if (error != 0) {
 2164                                 zfs_dbgmsg(
 2165                                     "hdr %px, compress %d, psize %d, lsize %d",
 2166                                     hdr, arc_hdr_get_compress(hdr),
 2167                                     HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
 2168                                 if (hash_lock != NULL)
 2169                                         mutex_enter(hash_lock);
 2170                                 arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
 2171                                 if (hash_lock != NULL)
 2172                                         mutex_exit(hash_lock);
 2173                                 return (SET_ERROR(EIO));
 2174                         }
 2175                 }
 2176         }
 2177 
 2178 byteswap:
 2179         /* Byteswap the buf's data if necessary */
 2180         if (bswap != DMU_BSWAP_NUMFUNCS) {
 2181                 ASSERT(!HDR_SHARED_DATA(hdr));
 2182                 ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS);
 2183                 dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr));
 2184         }
 2185 
 2186         /* Compute the hdr's checksum if necessary */
 2187         arc_cksum_compute(buf);
 2188 
 2189         return (0);
 2190 }
 2191 
 2192 /*
 2193  * If this function is being called to decrypt an encrypted buffer or verify an
 2194  * authenticated one, the key must be loaded and a mapping must be made
 2195  * available in the keystore via spa_keystore_create_mapping() or one of its
 2196  * callers.
 2197  */
 2198 int
 2199 arc_untransform(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,
 2200     boolean_t in_place)
 2201 {
 2202         int ret;
 2203         arc_fill_flags_t flags = 0;
 2204 
 2205         if (in_place)
 2206                 flags |= ARC_FILL_IN_PLACE;
 2207 
 2208         ret = arc_buf_fill(buf, spa, zb, flags);
 2209         if (ret == ECKSUM) {
 2210                 /*
 2211                  * Convert authentication and decryption errors to EIO
 2212                  * (and generate an ereport) before leaving the ARC.
 2213                  */
 2214                 ret = SET_ERROR(EIO);
 2215                 spa_log_error(spa, zb);
 2216                 (void) zfs_ereport_post(FM_EREPORT_ZFS_AUTHENTICATION,
 2217                     spa, NULL, zb, NULL, 0);
 2218         }
 2219 
 2220         return (ret);
 2221 }
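
      /*
       * Illustrative sketch (not part of the upstream file): decrypting an
       * encrypted dnode/bonus buffer in place, as the dbuf code does.  The
       * caller is assumed to have made the dataset's key available in the
       * keystore (see the comment above arc_untransform()); the helper
       * name is an assumption for illustration only.
       */
      static int
      example_decrypt_in_place(arc_buf_t *buf, spa_t *spa,
          const zbookmark_phys_t *zb)
      {
              if (!arc_is_encrypted(buf))
                      return (0);

              /* B_TRUE selects the in-place (ARC_FILL_IN_PLACE) path. */
              return (arc_untransform(buf, spa, zb, B_TRUE));
      }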
 2222 
 2223 /*
 2224  * Increment the amount of evictable space in the arc_state_t's refcount.
 2225  * We account for the space used by the hdr and the arc buf individually
 2226  * so that we can add and remove them from the refcount individually.
 2227  */
 2228 static void
 2229 arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state)
 2230 {
 2231         arc_buf_contents_t type = arc_buf_type(hdr);
 2232 
 2233         ASSERT(HDR_HAS_L1HDR(hdr));
 2234 
 2235         if (GHOST_STATE(state)) {
 2236                 ASSERT0(hdr->b_l1hdr.b_bufcnt);
 2237                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
 2238                 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 2239                 ASSERT(!HDR_HAS_RABD(hdr));
 2240                 (void) zfs_refcount_add_many(&state->arcs_esize[type],
 2241                     HDR_GET_LSIZE(hdr), hdr);
 2242                 return;
 2243         }
 2244 
 2245         if (hdr->b_l1hdr.b_pabd != NULL) {
 2246                 (void) zfs_refcount_add_many(&state->arcs_esize[type],
 2247                     arc_hdr_size(hdr), hdr);
 2248         }
 2249         if (HDR_HAS_RABD(hdr)) {
 2250                 (void) zfs_refcount_add_many(&state->arcs_esize[type],
 2251                     HDR_GET_PSIZE(hdr), hdr);
 2252         }
 2253 
 2254         for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
 2255             buf = buf->b_next) {
 2256                 if (arc_buf_is_shared(buf))
 2257                         continue;
 2258                 (void) zfs_refcount_add_many(&state->arcs_esize[type],
 2259                     arc_buf_size(buf), buf);
 2260         }
 2261 }
 2262 
 2263 /*
 2264  * Decrement the amount of evictable space in the arc_state_t's refcount.
 2265  * We account for the space used by the hdr and the arc buf individually
 2266  * so that we can add and remove them from the refcount individually.
 2267  */
 2268 static void
 2269 arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state)
 2270 {
 2271         arc_buf_contents_t type = arc_buf_type(hdr);
 2272 
 2273         ASSERT(HDR_HAS_L1HDR(hdr));
 2274 
 2275         if (GHOST_STATE(state)) {
 2276                 ASSERT0(hdr->b_l1hdr.b_bufcnt);
 2277                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
 2278                 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 2279                 ASSERT(!HDR_HAS_RABD(hdr));
 2280                 (void) zfs_refcount_remove_many(&state->arcs_esize[type],
 2281                     HDR_GET_LSIZE(hdr), hdr);
 2282                 return;
 2283         }
 2284 
 2285         if (hdr->b_l1hdr.b_pabd != NULL) {
 2286                 (void) zfs_refcount_remove_many(&state->arcs_esize[type],
 2287                     arc_hdr_size(hdr), hdr);
 2288         }
 2289         if (HDR_HAS_RABD(hdr)) {
 2290                 (void) zfs_refcount_remove_many(&state->arcs_esize[type],
 2291                     HDR_GET_PSIZE(hdr), hdr);
 2292         }
 2293 
 2294         for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
 2295             buf = buf->b_next) {
 2296                 if (arc_buf_is_shared(buf))
 2297                         continue;
 2298                 (void) zfs_refcount_remove_many(&state->arcs_esize[type],
 2299                     arc_buf_size(buf), buf);
 2300         }
 2301 }
 2302 
 2303 /*
 2304  * Add a reference to this hdr indicating that someone is actively
 2305  * referencing that memory. When the refcount transitions from 0 to 1,
 2306  * we remove it from the respective arc_state_t list to indicate that
 2307  * it is not evictable.
 2308  */
 2309 static void
 2310 add_reference(arc_buf_hdr_t *hdr, const void *tag)
 2311 {
 2312         arc_state_t *state = hdr->b_l1hdr.b_state;
 2313 
 2314         ASSERT(HDR_HAS_L1HDR(hdr));
 2315         if (!HDR_EMPTY(hdr) && !MUTEX_HELD(HDR_LOCK(hdr))) {
 2316                 ASSERT(state == arc_anon);
 2317                 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
 2318                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
 2319         }
 2320 
 2321         if ((zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) &&
 2322             state != arc_anon && state != arc_l2c_only) {
 2323                 /* We don't use the L2-only state list. */
 2324                 multilist_remove(&state->arcs_list[arc_buf_type(hdr)], hdr);
 2325                 arc_evictable_space_decrement(hdr, state);
 2326         }
 2327 }
 2328 
 2329 /*
 2330  * Remove a reference from this hdr. When the reference transitions from
 2331  * 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's
 2332  * list making it eligible for eviction.
 2333  */
 2334 static int
 2335 remove_reference(arc_buf_hdr_t *hdr, const void *tag)
 2336 {
 2337         int cnt;
 2338         arc_state_t *state = hdr->b_l1hdr.b_state;
 2339 
 2340         ASSERT(HDR_HAS_L1HDR(hdr));
 2341         ASSERT(state == arc_anon || MUTEX_HELD(HDR_LOCK(hdr)));
 2342         ASSERT(!GHOST_STATE(state));    /* arc_l2c_only counts as a ghost. */
 2343 
 2344         if ((cnt = zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) != 0)
 2345                 return (cnt);
 2346 
 2347         if (state == arc_anon) {
 2348                 arc_hdr_destroy(hdr);
 2349                 return (0);
 2350         }
 2351         if (state == arc_uncached && !HDR_PREFETCH(hdr)) {
 2352                 arc_change_state(arc_anon, hdr);
 2353                 arc_hdr_destroy(hdr);
 2354                 return (0);
 2355         }
 2356         multilist_insert(&state->arcs_list[arc_buf_type(hdr)], hdr);
 2357         arc_evictable_space_increment(hdr, state);
 2358         return (0);
 2359 }
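
      /*
       * Illustrative sketch (not part of the upstream file): the
       * hold/release pattern around a hashed header.  Taking the first
       * reference makes the header un-evictable (add_reference() pulls it
       * off its state's multilist); dropping the last reference either
       * re-inserts it on the eviction list or destroys it, depending on
       * its state.  The helper name and tag handling are assumptions for
       * illustration only.
       */
      static void
      example_hold_and_release(arc_buf_hdr_t *hdr, const void *tag)
      {
              kmutex_t *hash_lock = HDR_LOCK(hdr);

              mutex_enter(hash_lock);
              add_reference(hdr, tag);
              mutex_exit(hash_lock);

              /* ... the caller works with the header's data here ... */

              mutex_enter(hash_lock);
              (void) remove_reference(hdr, tag);      /* may destroy hdr */
              mutex_exit(hash_lock);
      }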
 2360 
 2361 /*
 2362  * Returns detailed information about a specific arc buffer.  When the
 2363  * state_index argument is set the function will calculate the arc header
 2364  * list position for its arc state.  Since this requires a linear traversal,
 2365  * callers are strongly encouraged not to do this.  However, it can be helpful
 2366  * for targeted analysis so the functionality is provided.
 2367  */
 2368 void
 2369 arc_buf_info(arc_buf_t *ab, arc_buf_info_t *abi, int state_index)
 2370 {
 2371         (void) state_index;
 2372         arc_buf_hdr_t *hdr = ab->b_hdr;
 2373         l1arc_buf_hdr_t *l1hdr = NULL;
 2374         l2arc_buf_hdr_t *l2hdr = NULL;
 2375         arc_state_t *state = NULL;
 2376 
 2377         memset(abi, 0, sizeof (arc_buf_info_t));
 2378 
 2379         if (hdr == NULL)
 2380                 return;
 2381 
 2382         abi->abi_flags = hdr->b_flags;
 2383 
 2384         if (HDR_HAS_L1HDR(hdr)) {
 2385                 l1hdr = &hdr->b_l1hdr;
 2386                 state = l1hdr->b_state;
 2387         }
 2388         if (HDR_HAS_L2HDR(hdr))
 2389                 l2hdr = &hdr->b_l2hdr;
 2390 
 2391         if (l1hdr) {
 2392                 abi->abi_bufcnt = l1hdr->b_bufcnt;
 2393                 abi->abi_access = l1hdr->b_arc_access;
 2394                 abi->abi_mru_hits = l1hdr->b_mru_hits;
 2395                 abi->abi_mru_ghost_hits = l1hdr->b_mru_ghost_hits;
 2396                 abi->abi_mfu_hits = l1hdr->b_mfu_hits;
 2397                 abi->abi_mfu_ghost_hits = l1hdr->b_mfu_ghost_hits;
 2398                 abi->abi_holds = zfs_refcount_count(&l1hdr->b_refcnt);
 2399         }
 2400 
 2401         if (l2hdr) {
 2402                 abi->abi_l2arc_dattr = l2hdr->b_daddr;
 2403                 abi->abi_l2arc_hits = l2hdr->b_hits;
 2404         }
 2405 
 2406         abi->abi_state_type = state ? state->arcs_state : ARC_STATE_ANON;
 2407         abi->abi_state_contents = arc_buf_type(hdr);
 2408         abi->abi_size = arc_hdr_size(hdr);
 2409 }
 2410 
 2411 /*
 2412  * Move the supplied buffer to the indicated state. The hash lock
 2413  * for the buffer must be held by the caller.
 2414  */
 2415 static void
 2416 arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr)
 2417 {
 2418         arc_state_t *old_state;
 2419         int64_t refcnt;
 2420         uint32_t bufcnt;
 2421         boolean_t update_old, update_new;
 2422         arc_buf_contents_t buftype = arc_buf_type(hdr);
 2423 
 2424         /*
 2425          * We almost always have an L1 hdr here, since we call arc_hdr_realloc()
 2426          * in arc_read() when bringing a buffer out of the L2ARC.  However, the
 2427          * L1 hdr doesn't always exist when we change state to arc_anon before
 2428          * destroying a header, in which case reallocating to add the L1 hdr is
 2429          * pointless.
 2430          */
 2431         if (HDR_HAS_L1HDR(hdr)) {
 2432                 old_state = hdr->b_l1hdr.b_state;
 2433                 refcnt = zfs_refcount_count(&hdr->b_l1hdr.b_refcnt);
 2434                 bufcnt = hdr->b_l1hdr.b_bufcnt;
 2435                 update_old = (bufcnt > 0 || hdr->b_l1hdr.b_pabd != NULL ||
 2436                     HDR_HAS_RABD(hdr));
 2437 
 2438                 IMPLY(GHOST_STATE(old_state), bufcnt == 0);
 2439                 IMPLY(GHOST_STATE(new_state), bufcnt == 0);
 2440                 IMPLY(GHOST_STATE(old_state), hdr->b_l1hdr.b_buf == NULL);
 2441                 IMPLY(GHOST_STATE(new_state), hdr->b_l1hdr.b_buf == NULL);
 2442                 IMPLY(old_state == arc_anon, bufcnt <= 1);
 2443         } else {
 2444                 old_state = arc_l2c_only;
 2445                 refcnt = 0;
 2446                 bufcnt = 0;
 2447                 update_old = B_FALSE;
 2448         }
 2449         update_new = update_old;
 2450         if (GHOST_STATE(old_state))
 2451                 update_old = B_TRUE;
 2452         if (GHOST_STATE(new_state))
 2453                 update_new = B_TRUE;
 2454 
 2455         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
 2456         ASSERT3P(new_state, !=, old_state);
 2457 
 2458         /*
 2459          * If this buffer is evictable, transfer it from the
 2460          * old state list to the new state list.
 2461          */
 2462         if (refcnt == 0) {
 2463                 if (old_state != arc_anon && old_state != arc_l2c_only) {
 2464                         ASSERT(HDR_HAS_L1HDR(hdr));
 2465                         /* remove_reference() saves on insert. */
 2466                         if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
 2467                                 multilist_remove(&old_state->arcs_list[buftype],
 2468                                     hdr);
 2469                                 arc_evictable_space_decrement(hdr, old_state);
 2470                         }
 2471                 }
 2472                 if (new_state != arc_anon && new_state != arc_l2c_only) {
 2473                         /*
 2474                          * An L1 header always exists here, since if we're
 2475                          * moving to some L1-cached state (i.e. not l2c_only or
 2476                          * anonymous), we realloc the header to add an L1hdr
 2477                          * beforehand.
 2478                          */
 2479                         ASSERT(HDR_HAS_L1HDR(hdr));
 2480                         multilist_insert(&new_state->arcs_list[buftype], hdr);
 2481                         arc_evictable_space_increment(hdr, new_state);
 2482                 }
 2483         }
 2484 
 2485         ASSERT(!HDR_EMPTY(hdr));
 2486         if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))
 2487                 buf_hash_remove(hdr);
 2488 
 2489         /* adjust state sizes (ignore arc_l2c_only) */
 2490 
 2491         if (update_new && new_state != arc_l2c_only) {
 2492                 ASSERT(HDR_HAS_L1HDR(hdr));
 2493                 if (GHOST_STATE(new_state)) {
 2494                         ASSERT0(bufcnt);
 2495 
 2496                         /*
 2497                          * When moving a header to a ghost state, we first
 2498                          * remove all arc buffers. Thus, we'll have a
 2499                          * bufcnt of zero, and no arc buffer to use for
 2500                          * the reference. As a result, we use the arc
 2501                          * header pointer for the reference.
 2502                          */
 2503                         (void) zfs_refcount_add_many(&new_state->arcs_size,
 2504                             HDR_GET_LSIZE(hdr), hdr);
 2505                         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 2506                         ASSERT(!HDR_HAS_RABD(hdr));
 2507                 } else {
 2508                         uint32_t buffers = 0;
 2509 
 2510                         /*
 2511                          * Each individual buffer holds a unique reference,
 2512                          * thus we must remove each of these references one
 2513                          * at a time.
 2514                          */
 2515                         for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
 2516                             buf = buf->b_next) {
 2517                                 ASSERT3U(bufcnt, !=, 0);
 2518                                 buffers++;
 2519 
 2520                                 /*
 2521                                  * When the arc_buf_t is sharing the data
 2522                                  * block with the hdr, the owner of the
 2523                                  * reference belongs to the hdr. Only
 2524                                  * add to the refcount if the arc_buf_t is
 2525                                  * not shared.
 2526                                  */
 2527                                 if (arc_buf_is_shared(buf))
 2528                                         continue;
 2529 
 2530                                 (void) zfs_refcount_add_many(
 2531                                     &new_state->arcs_size,
 2532                                     arc_buf_size(buf), buf);
 2533                         }
 2534                         ASSERT3U(bufcnt, ==, buffers);
 2535 
 2536                         if (hdr->b_l1hdr.b_pabd != NULL) {
 2537                                 (void) zfs_refcount_add_many(
 2538                                     &new_state->arcs_size,
 2539                                     arc_hdr_size(hdr), hdr);
 2540                         }
 2541 
 2542                         if (HDR_HAS_RABD(hdr)) {
 2543                                 (void) zfs_refcount_add_many(
 2544                                     &new_state->arcs_size,
 2545                                     HDR_GET_PSIZE(hdr), hdr);
 2546                         }
 2547                 }
 2548         }
 2549 
 2550         if (update_old && old_state != arc_l2c_only) {
 2551                 ASSERT(HDR_HAS_L1HDR(hdr));
 2552                 if (GHOST_STATE(old_state)) {
 2553                         ASSERT0(bufcnt);
 2554                         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 2555                         ASSERT(!HDR_HAS_RABD(hdr));
 2556 
 2557                         /*
 2558                          * When moving a header off of a ghost state,
 2559                          * the header will not contain any arc buffers.
 2560                          * We use the arc header pointer for the reference
 2561                          * which is exactly what we did when we put the
 2562                          * header on the ghost state.
 2563                          */
 2564 
 2565                         (void) zfs_refcount_remove_many(&old_state->arcs_size,
 2566                             HDR_GET_LSIZE(hdr), hdr);
 2567                 } else {
 2568                         uint32_t buffers = 0;
 2569 
 2570                         /*
 2571                          * Each individual buffer holds a unique reference,
 2572                          * thus we must remove each of these references one
 2573                          * at a time.
 2574                          */
 2575                         for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
 2576                             buf = buf->b_next) {
 2577                                 ASSERT3U(bufcnt, !=, 0);
 2578                                 buffers++;
 2579 
 2580                                 /*
 2581                                  * When the arc_buf_t is sharing the data
 2582                                  * block with the hdr, the reference is
 2583                                  * owned by the hdr. Only
 2584                                  * add to the refcount if the arc_buf_t is
 2585                                  * not shared.
 2586                                  */
 2587                                 if (arc_buf_is_shared(buf))
 2588                                         continue;
 2589 
 2590                                 (void) zfs_refcount_remove_many(
 2591                                     &old_state->arcs_size, arc_buf_size(buf),
 2592                                     buf);
 2593                         }
 2594                         ASSERT3U(bufcnt, ==, buffers);
 2595                         ASSERT(hdr->b_l1hdr.b_pabd != NULL ||
 2596                             HDR_HAS_RABD(hdr));
 2597 
 2598                         if (hdr->b_l1hdr.b_pabd != NULL) {
 2599                                 (void) zfs_refcount_remove_many(
 2600                                     &old_state->arcs_size, arc_hdr_size(hdr),
 2601                                     hdr);
 2602                         }
 2603 
 2604                         if (HDR_HAS_RABD(hdr)) {
 2605                                 (void) zfs_refcount_remove_many(
 2606                                     &old_state->arcs_size, HDR_GET_PSIZE(hdr),
 2607                                     hdr);
 2608                         }
 2609                 }
 2610         }
 2611 
 2612         if (HDR_HAS_L1HDR(hdr)) {
 2613                 hdr->b_l1hdr.b_state = new_state;
 2614 
 2615                 if (HDR_HAS_L2HDR(hdr) && new_state != arc_l2c_only) {
 2616                         l2arc_hdr_arcstats_decrement_state(hdr);
 2617                         hdr->b_l2hdr.b_arcs_state = new_state->arcs_state;
 2618                         l2arc_hdr_arcstats_increment_state(hdr);
 2619                 }
 2620         }
 2621 }
 2622 
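      /*
       * To summarize the accounting just performed: while a header sits in a
       * ghost state it has no arc_buf_t's, so the header pointer itself holds
       * a single arcs_size reference covering HDR_GET_LSIZE() bytes.  In a
       * non-ghost state every non-shared arc_buf_t holds its own reference
       * for arc_buf_size() bytes, and the header holds additional references
       * for b_pabd (arc_hdr_size()) and, when present, b_rabd
       * (HDR_GET_PSIZE()).  The code above moves exactly these references
       * between old_state and new_state.
       */
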
 2623 void
 2624 arc_space_consume(uint64_t space, arc_space_type_t type)
 2625 {
 2626         ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
 2627 
 2628         switch (type) {
 2629         default:
 2630                 break;
 2631         case ARC_SPACE_DATA:
 2632                 ARCSTAT_INCR(arcstat_data_size, space);
 2633                 break;
 2634         case ARC_SPACE_META:
 2635                 ARCSTAT_INCR(arcstat_metadata_size, space);
 2636                 break;
 2637         case ARC_SPACE_BONUS:
 2638                 ARCSTAT_INCR(arcstat_bonus_size, space);
 2639                 break;
 2640         case ARC_SPACE_DNODE:
 2641                 aggsum_add(&arc_sums.arcstat_dnode_size, space);
 2642                 break;
 2643         case ARC_SPACE_DBUF:
 2644                 ARCSTAT_INCR(arcstat_dbuf_size, space);
 2645                 break;
 2646         case ARC_SPACE_HDRS:
 2647                 ARCSTAT_INCR(arcstat_hdr_size, space);
 2648                 break;
 2649         case ARC_SPACE_L2HDRS:
 2650                 aggsum_add(&arc_sums.arcstat_l2_hdr_size, space);
 2651                 break;
 2652         case ARC_SPACE_ABD_CHUNK_WASTE:
 2653                 /*
 2654                  * Note: this includes space wasted by all scatter ABD's, not
 2655                  * just those allocated by the ARC.  But the vast majority of
 2656                  * scatter ABD's come from the ARC, because other users are
 2657                  * very short-lived.
 2658                  */
 2659                 ARCSTAT_INCR(arcstat_abd_chunk_waste_size, space);
 2660                 break;
 2661         }
 2662 
 2663         if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE)
 2664                 aggsum_add(&arc_sums.arcstat_meta_used, space);
 2665 
 2666         aggsum_add(&arc_sums.arcstat_size, space);
 2667 }
 2668 
 2669 void
 2670 arc_space_return(uint64_t space, arc_space_type_t type)
 2671 {
 2672         ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
 2673 
 2674         switch (type) {
 2675         default:
 2676                 break;
 2677         case ARC_SPACE_DATA:
 2678                 ARCSTAT_INCR(arcstat_data_size, -space);
 2679                 break;
 2680         case ARC_SPACE_META:
 2681                 ARCSTAT_INCR(arcstat_metadata_size, -space);
 2682                 break;
 2683         case ARC_SPACE_BONUS:
 2684                 ARCSTAT_INCR(arcstat_bonus_size, -space);
 2685                 break;
 2686         case ARC_SPACE_DNODE:
 2687                 aggsum_add(&arc_sums.arcstat_dnode_size, -space);
 2688                 break;
 2689         case ARC_SPACE_DBUF:
 2690                 ARCSTAT_INCR(arcstat_dbuf_size, -space);
 2691                 break;
 2692         case ARC_SPACE_HDRS:
 2693                 ARCSTAT_INCR(arcstat_hdr_size, -space);
 2694                 break;
 2695         case ARC_SPACE_L2HDRS:
 2696                 aggsum_add(&arc_sums.arcstat_l2_hdr_size, -space);
 2697                 break;
 2698         case ARC_SPACE_ABD_CHUNK_WASTE:
 2699                 ARCSTAT_INCR(arcstat_abd_chunk_waste_size, -space);
 2700                 break;
 2701         }
 2702 
 2703         if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE) {
 2704                 ASSERT(aggsum_compare(&arc_sums.arcstat_meta_used,
 2705                     space) >= 0);
 2706                 ARCSTAT_MAX(arcstat_meta_max,
 2707                     aggsum_upper_bound(&arc_sums.arcstat_meta_used));
 2708                 aggsum_add(&arc_sums.arcstat_meta_used, -space);
 2709         }
 2710 
 2711         ASSERT(aggsum_compare(&arc_sums.arcstat_size, space) >= 0);
 2712         aggsum_add(&arc_sums.arcstat_size, -space);
 2713 }
 2714 
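      /*
       * A minimal usage sketch for the two accounting entry points above,
       * assuming a hypothetical consumer (the my_meta_t type below is made up
       * for illustration and is not part of this file):
       */
      #if 0   /* illustrative sketch, never compiled */
      typedef struct my_meta {
              uint64_t        mm_payload[8];
      } my_meta_t;

      static my_meta_t *
      my_meta_alloc(void)
      {
              my_meta_t *m = kmem_zalloc(sizeof (*m), KM_SLEEP);

              /* Charge the allocation against the ARC's metadata accounting. */
              arc_space_consume(sizeof (*m), ARC_SPACE_META);
              return (m);
      }

      static void
      my_meta_free(my_meta_t *m)
      {
              /* Return exactly what was consumed, with the same space type. */
              arc_space_return(sizeof (*m), ARC_SPACE_META);
              kmem_free(m, sizeof (*m));
      }
      #endif
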
 2715 /*
 2716  * Given a hdr and a buf, returns whether that buf can share its b_data buffer
 2717  * with the hdr's b_pabd.
 2718  */
 2719 static boolean_t
 2720 arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
 2721 {
 2722         /*
 2723          * The criteria for sharing a hdr's data are:
 2724          * 1. the buffer is not encrypted
 2725          * 2. the hdr's compression matches the buf's compression
 2726          * 3. the hdr doesn't need to be byteswapped
 2727          * 4. the hdr isn't already being shared
 2728          * 5. the buf is either compressed or it is the last buf in the hdr list
 2729          *
 2730          * Criterion #5 maintains the invariant that a shared uncompressed
 2731          * buf must be the final buf in the hdr's b_buf list. Reading this, you
 2732          * might ask, "if a compressed buf is allocated first, won't that be the
 2733          * last thing in the list?", but in that case it's impossible to create
 2734          * a shared uncompressed buf anyway (because the hdr must be compressed
 2735          * to have the compressed buf). You might also think that #3 is
 2736          * sufficient to make this guarantee, however it's possible
 2737          * (specifically in the rare L2ARC write race mentioned in
 2738          * arc_buf_alloc_impl()) there will be an existing uncompressed buf that
 2739          * is shareable, but wasn't at the time of its allocation. Rather than
 2740          * allow a new shared uncompressed buf to be created and then shuffle
 2741          * the list around to make it the last element, this simply disallows
 2742          * sharing if the new buf isn't the first to be added.
 2743          */
 2744         ASSERT3P(buf->b_hdr, ==, hdr);
 2745         boolean_t hdr_compressed =
 2746             arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF;
 2747         boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0;
 2748         return (!ARC_BUF_ENCRYPTED(buf) &&
 2749             buf_compressed == hdr_compressed &&
 2750             hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
 2751             !HDR_SHARED_DATA(hdr) &&
 2752             (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
 2753 }
 2754 
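      /*
       * A concrete illustration of the criteria above (hypothetical
       * scenarios): if a hdr holds LZ4-compressed data and the caller asks
       * for an uncompressed buf, criterion #2 fails and the buf gets its own
       * private b_data copy.  If the caller instead asks for a compressed buf
       * (and the hdr is not byteswapped, not already shared, and the buf is
       * not encrypted), arc_can_share() returns B_TRUE and
       * arc_buf_alloc_impl() below may point the buf's b_data directly at the
       * hdr's b_pabd.
       */
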
 2755 /*
 2756  * Allocate a buf for this hdr. If you care about the data that's in the hdr,
 2757  * or if you want a compressed buffer, pass those flags in. Returns 0 if the
 2758  * copy was made successfully, or an error code otherwise.
 2759  */
 2760 static int
 2761 arc_buf_alloc_impl(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb,
 2762     const void *tag, boolean_t encrypted, boolean_t compressed,
 2763     boolean_t noauth, boolean_t fill, arc_buf_t **ret)
 2764 {
 2765         arc_buf_t *buf;
 2766         arc_fill_flags_t flags = ARC_FILL_LOCKED;
 2767 
 2768         ASSERT(HDR_HAS_L1HDR(hdr));
 2769         ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
 2770         VERIFY(hdr->b_type == ARC_BUFC_DATA ||
 2771             hdr->b_type == ARC_BUFC_METADATA);
 2772         ASSERT3P(ret, !=, NULL);
 2773         ASSERT3P(*ret, ==, NULL);
 2774         IMPLY(encrypted, compressed);
 2775 
 2776         buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
 2777         buf->b_hdr = hdr;
 2778         buf->b_data = NULL;
 2779         buf->b_next = hdr->b_l1hdr.b_buf;
 2780         buf->b_flags = 0;
 2781 
 2782         add_reference(hdr, tag);
 2783 
 2784         /*
 2785          * We're about to change the hdr's b_flags. We must either
 2786          * hold the hash_lock or be undiscoverable.
 2787          */
 2788         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 2789 
 2790         /*
 2791          * Only honor requests for compressed bufs if the hdr is actually
 2792          * compressed. This must be overridden if the buffer is encrypted since
 2793          * encrypted buffers cannot be decompressed.
 2794          */
 2795         if (encrypted) {
 2796                 buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
 2797                 buf->b_flags |= ARC_BUF_FLAG_ENCRYPTED;
 2798                 flags |= ARC_FILL_COMPRESSED | ARC_FILL_ENCRYPTED;
 2799         } else if (compressed &&
 2800             arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {
 2801                 buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
 2802                 flags |= ARC_FILL_COMPRESSED;
 2803         }
 2804 
 2805         if (noauth) {
 2806                 ASSERT0(encrypted);
 2807                 flags |= ARC_FILL_NOAUTH;
 2808         }
 2809 
 2810         /*
 2811          * If the hdr's data can be shared then we share the data buffer and
 2812          * set the appropriate bit in the hdr's b_flags to indicate the hdr is
 2813          * sharing its b_pabd with the arc_buf_t. Otherwise, we allocate a new
 2814          * buffer to store the buf's data.
 2815          *
 2816          * There are two additional restrictions here because we're sharing
 2817          * hdr -> buf instead of the usual buf -> hdr. First, the hdr can't be
 2818          * actively involved in an L2ARC write, because if this buf is used by
 2819          * an arc_write() then the hdr's data buffer will be released when the
 2820          * write completes, even though the L2ARC write might still be using it.
 2821          * Second, the hdr's ABD must be linear so that the buf's user doesn't
 2822          * need to be ABD-aware.  It must be allocated via
 2823          * zio_[data_]buf_alloc(), not as a page, because we need to be able
 2824          * to abd_release_ownership_of_buf(), which isn't allowed on "linear
 2825          * page" buffers because the ABD code needs to handle freeing them
 2826          * specially.
 2827          */
 2828         boolean_t can_share = arc_can_share(hdr, buf) &&
 2829             !HDR_L2_WRITING(hdr) &&
 2830             hdr->b_l1hdr.b_pabd != NULL &&
 2831             abd_is_linear(hdr->b_l1hdr.b_pabd) &&
 2832             !abd_is_linear_page(hdr->b_l1hdr.b_pabd);
 2833 
 2834         /* Set up b_data and sharing */
 2835         if (can_share) {
 2836                 buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd);
 2837                 buf->b_flags |= ARC_BUF_FLAG_SHARED;
 2838                 arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
 2839         } else {
 2840                 buf->b_data =
 2841                     arc_get_data_buf(hdr, arc_buf_size(buf), buf);
 2842                 ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
 2843         }
 2844         VERIFY3P(buf->b_data, !=, NULL);
 2845 
 2846         hdr->b_l1hdr.b_buf = buf;
 2847         hdr->b_l1hdr.b_bufcnt += 1;
 2848         if (encrypted)
 2849                 hdr->b_crypt_hdr.b_ebufcnt += 1;
 2850 
 2851         /*
 2852          * If the user wants the data from the hdr, we need to either copy or
 2853          * decompress the data.
 2854          */
 2855         if (fill) {
 2856                 ASSERT3P(zb, !=, NULL);
 2857                 return (arc_buf_fill(buf, spa, zb, flags));
 2858         }
 2859 
 2860         return (0);
 2861 }
 2862 
 2863 static const char *arc_onloan_tag = "onloan";
 2864 
 2865 static inline void
 2866 arc_loaned_bytes_update(int64_t delta)
 2867 {
 2868         atomic_add_64(&arc_loaned_bytes, delta);
 2869 
 2870         /* assert that it did not wrap around */
 2871         ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
 2872 }
 2873 
 2874 /*
 2875  * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
 2876  * flight data by arc_tempreserve_space() until they are "returned". Loaned
 2877  * buffers must be returned to the arc before they can be used by the DMU or
 2878  * freed.
 2879  */
 2880 arc_buf_t *
 2881 arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
 2882 {
 2883         arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
 2884             is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);
 2885 
 2886         arc_loaned_bytes_update(arc_buf_size(buf));
 2887 
 2888         return (buf);
 2889 }
 2890 
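      /*
       * A minimal sketch of the loaned-buffer lifecycle, assuming a
       * hypothetical caller (illustrative only; it mirrors the tag swap that
       * arc_return_buf() performs):
       */
      #if 0   /* illustrative sketch, never compiled */
      static void
      example_loan_cycle(spa_t *spa)
      {
              /* Borrow an anonymous 4K data buffer; it is tagged "onloan". */
              arc_buf_t *buf = arc_loan_buf(spa, B_FALSE, 4096);

              /* Fill it with whatever the caller intends to write. */
              memset(buf->b_data, 0, 4096);

              /*
               * Return the buffer before it may be used by the DMU or freed;
               * arc_return_buf() swaps the onloan tag for the caller's tag.
               */
              arc_return_buf(buf, FTAG);
              arc_buf_destroy(buf, FTAG);
      }
      #endif
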
 2891 arc_buf_t *
 2892 arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,
 2893     enum zio_compress compression_type, uint8_t complevel)
 2894 {
 2895         arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag,
 2896             psize, lsize, compression_type, complevel);
 2897 
 2898         arc_loaned_bytes_update(arc_buf_size(buf));
 2899 
 2900         return (buf);
 2901 }
 2902 
 2903 arc_buf_t *
 2904 arc_loan_raw_buf(spa_t *spa, uint64_t dsobj, boolean_t byteorder,
 2905     const uint8_t *salt, const uint8_t *iv, const uint8_t *mac,
 2906     dmu_object_type_t ot, uint64_t psize, uint64_t lsize,
 2907     enum zio_compress compression_type, uint8_t complevel)
 2908 {
 2909         arc_buf_t *buf = arc_alloc_raw_buf(spa, arc_onloan_tag, dsobj,
 2910             byteorder, salt, iv, mac, ot, psize, lsize, compression_type,
 2911             complevel);
 2912 
 2913         atomic_add_64(&arc_loaned_bytes, psize);
 2914         return (buf);
 2915 }
 2916 
 2917 
 2918 /*
 2919  * Return a loaned arc buffer to the arc.
 2920  */
 2921 void
 2922 arc_return_buf(arc_buf_t *buf, const void *tag)
 2923 {
 2924         arc_buf_hdr_t *hdr = buf->b_hdr;
 2925 
 2926         ASSERT3P(buf->b_data, !=, NULL);
 2927         ASSERT(HDR_HAS_L1HDR(hdr));
 2928         (void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag);
 2929         (void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
 2930 
 2931         arc_loaned_bytes_update(-arc_buf_size(buf));
 2932 }
 2933 
 2934 /* Detach an arc_buf from a dbuf (tag) */
 2935 void
 2936 arc_loan_inuse_buf(arc_buf_t *buf, const void *tag)
 2937 {
 2938         arc_buf_hdr_t *hdr = buf->b_hdr;
 2939 
 2940         ASSERT3P(buf->b_data, !=, NULL);
 2941         ASSERT(HDR_HAS_L1HDR(hdr));
 2942         (void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
 2943         (void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag);
 2944 
 2945         arc_loaned_bytes_update(arc_buf_size(buf));
 2946 }
 2947 
 2948 static void
 2949 l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type)
 2950 {
 2951         l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP);
 2952 
 2953         df->l2df_abd = abd;
 2954         df->l2df_size = size;
 2955         df->l2df_type = type;
 2956         mutex_enter(&l2arc_free_on_write_mtx);
 2957         list_insert_head(l2arc_free_on_write, df);
 2958         mutex_exit(&l2arc_free_on_write_mtx);
 2959 }
 2960 
 2961 static void
 2962 arc_hdr_free_on_write(arc_buf_hdr_t *hdr, boolean_t free_rdata)
 2963 {
 2964         arc_state_t *state = hdr->b_l1hdr.b_state;
 2965         arc_buf_contents_t type = arc_buf_type(hdr);
 2966         uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);
 2967 
 2968         /* protected by hash lock, if in the hash table */
 2969         if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
 2970                 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
 2971                 ASSERT(state != arc_anon && state != arc_l2c_only);
 2972 
 2973                 (void) zfs_refcount_remove_many(&state->arcs_esize[type],
 2974                     size, hdr);
 2975         }
 2976         (void) zfs_refcount_remove_many(&state->arcs_size, size, hdr);
 2977         if (type == ARC_BUFC_METADATA) {
 2978                 arc_space_return(size, ARC_SPACE_META);
 2979         } else {
 2980                 ASSERT(type == ARC_BUFC_DATA);
 2981                 arc_space_return(size, ARC_SPACE_DATA);
 2982         }
 2983 
 2984         if (free_rdata) {
 2985                 l2arc_free_abd_on_write(hdr->b_crypt_hdr.b_rabd, size, type);
 2986         } else {
 2987                 l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);
 2988         }
 2989 }
 2990 
 2991 /*
 2992  * Share the arc_buf_t's data with the hdr. Whenever we are sharing the
 2993  * data buffer, we transfer the refcount ownership to the hdr and update
 2994  * the appropriate kstats.
 2995  */
 2996 static void
 2997 arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
 2998 {
 2999         ASSERT(arc_can_share(hdr, buf));
 3000         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 3001         ASSERT(!ARC_BUF_ENCRYPTED(buf));
 3002         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 3003 
 3004         /*
 3005          * Start sharing the data buffer. We transfer the
 3006          * refcount ownership to the hdr since it always owns
 3007          * the refcount whenever an arc_buf_t is shared.
 3008          */
 3009         zfs_refcount_transfer_ownership_many(&hdr->b_l1hdr.b_state->arcs_size,
 3010             arc_hdr_size(hdr), buf, hdr);
 3011         hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));
 3012         abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,
 3013             HDR_ISTYPE_METADATA(hdr));
 3014         arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
 3015         buf->b_flags |= ARC_BUF_FLAG_SHARED;
 3016 
 3017         /*
 3018          * Since we've transferred ownership to the hdr we need
 3019          * to increment its compressed and uncompressed kstats and
 3020          * decrement the overhead size.
 3021          */
 3022         ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
 3023         ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
 3024         ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));
 3025 }
 3026 
 3027 static void
 3028 arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
 3029 {
 3030         ASSERT(arc_buf_is_shared(buf));
 3031         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
 3032         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 3033 
 3034         /*
 3035          * We are no longer sharing this buffer so we need
 3036          * to transfer its ownership to the rightful owner.
 3037          */
 3038         zfs_refcount_transfer_ownership_many(&hdr->b_l1hdr.b_state->arcs_size,
 3039             arc_hdr_size(hdr), hdr, buf);
 3040         arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
 3041         abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd);
 3042         abd_free(hdr->b_l1hdr.b_pabd);
 3043         hdr->b_l1hdr.b_pabd = NULL;
 3044         buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
 3045 
 3046         /*
 3047          * Since the buffer is no longer shared between
 3048          * the arc buf and the hdr, count it as overhead.
 3049          */
 3050         ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
 3051         ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
 3052         ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
 3053 }
 3054 
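      /*
       * Note that arc_share_buf() and arc_unshare_buf() are exact inverses
       * with respect to the accounting above: sharing moves the arcs_size
       * reference tag from the buf to the hdr and shifts those bytes from
       * "overhead" into the compressed/uncompressed totals, while unsharing
       * does the reverse.  Keeping the two paths symmetric is what lets a
       * hdr drop or recreate its b_pabd without the kstats drifting.
       */
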
 3055 /*
 3056  * Remove an arc_buf_t from the hdr's buf list and return the last
 3057  * arc_buf_t on the list. If no buffers remain on the list then return
 3058  * NULL.
 3059  */
 3060 static arc_buf_t *
 3061 arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf)
 3062 {
 3063         ASSERT(HDR_HAS_L1HDR(hdr));
 3064         ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 3065 
 3066         arc_buf_t **bufp = &hdr->b_l1hdr.b_buf;
 3067         arc_buf_t *lastbuf = NULL;
 3068 
 3069         /*
 3070          * Remove the buf from the hdr list and locate the last
 3071          * remaining buffer on the list.
 3072          */
 3073         while (*bufp != NULL) {
 3074                 if (*bufp == buf)
 3075                         *bufp = buf->b_next;
 3076 
 3077                 /*
 3078                  * If we've removed a buffer in the middle of
 3079                  * the list, then update lastbuf and advance
 3080                  * bufp.
 3081                  */
 3082                 if (*bufp != NULL) {
 3083                         lastbuf = *bufp;
 3084                         bufp = &(*bufp)->b_next;
 3085                 }
 3086         }
 3087         buf->b_next = NULL;
 3088         ASSERT3P(lastbuf, !=, buf);
 3089         IMPLY(hdr->b_l1hdr.b_bufcnt > 0, lastbuf != NULL);
 3090         IMPLY(hdr->b_l1hdr.b_bufcnt > 0, hdr->b_l1hdr.b_buf != NULL);
 3091         IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf));
 3092 
 3093         return (lastbuf);
 3094 }
 3095 
 3096 /*
 3097  * Free up buf->b_data and pull the arc_buf_t off of the arc_buf_hdr_t's
 3098  * list and free it.
 3099  */
 3100 static void
 3101 arc_buf_destroy_impl(arc_buf_t *buf)
 3102 {
 3103         arc_buf_hdr_t *hdr = buf->b_hdr;
 3104 
 3105         /*
 3106          * Free up the data associated with the buf but only if we're not
 3107          * sharing this with the hdr. If we are sharing it with the hdr, the
 3108          * hdr is responsible for doing the free.
 3109          */
 3110         if (buf->b_data != NULL) {
 3111                 /*
 3112                  * We're about to change the hdr's b_flags. We must either
 3113                  * hold the hash_lock or be undiscoverable.
 3114                  */
 3115                 ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
 3116 
 3117                 arc_cksum_verify(buf);
 3118                 arc_buf_unwatch(buf);
 3119 
 3120                 if (arc_buf_is_shared(buf)) {
 3121                         arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
 3122                 } else {
 3123                         uint64_t size = arc_buf_size(buf);
 3124                         arc_free_data_buf(hdr, buf->b_data, size, buf);
 3125                         ARCSTAT_INCR(arcstat_overhead_size, -size);
 3126                 }
 3127                 buf->b_data = NULL;
 3128 
 3129                 ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
 3130                 hdr->b_l1hdr.b_bufcnt -= 1;
 3131 
 3132                 if (ARC_BUF_ENCRYPTED(buf)) {
 3133                         hdr->b_crypt_hdr.b_ebufcnt -= 1;
 3134 
 3135                         /*
 3136                          * If we have no more encrypted buffers and we've
 3137                          * already gotten a copy of the decrypted data, we can
 3138                          * free b_rabd to save some space.
 3139                          */
 3140                         if (hdr->b_crypt_hdr.b_ebufcnt == 0 &&
 3141                             HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd != NULL &&
 3142                             !HDR_IO_IN_PROGRESS(hdr)) {
 3143                                 arc_hdr_free_abd(hdr, B_TRUE);
 3144                         }
 3145                 }
 3146         }
 3147 
 3148         arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);
 3149 
 3150         if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {
 3151                 /*
 3152                  * If the current arc_buf_t is sharing its data buffer with the
 3153                  * hdr, then reassign the hdr's b_pabd to share it with the new
 3154                  * buffer at the end of the list. The shared buffer is always
 3155                  * the last one on the hdr's buffer list.
 3156                  *
 3157                  * There is an equivalent case for compressed bufs, but since
 3158                  * they aren't guaranteed to be the last buf in the list and
 3159                  * that is an exceedingly rare case, we just allow that space to be
 3160                  * wasted temporarily. We must also be careful not to share
 3161                  * encrypted buffers, since they cannot be shared.
 3162                  */
 3163                 if (lastbuf != NULL && !ARC_BUF_ENCRYPTED(lastbuf)) {
 3164                         /* Only one buf can be shared at once */
 3165                         VERIFY(!arc_buf_is_shared(lastbuf));
 3166                         /* hdr is uncompressed so can't have compressed buf */
 3167                         VERIFY(!ARC_BUF_COMPRESSED(lastbuf));
 3168 
 3169                         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
 3170                         arc_hdr_free_abd(hdr, B_FALSE);
 3171 
 3172                         /*
 3173                          * We must setup a new shared block between the
 3174                          * last buffer and the hdr. The data would have
 3175                          * been allocated by the arc buf so we need to transfer
 3176                          * ownership to the hdr since it's now being shared.
 3177                          */
 3178                         arc_share_buf(hdr, lastbuf);
 3179                 }
 3180         } else if (HDR_SHARED_DATA(hdr)) {
 3181                 /*
 3182                  * Uncompressed shared buffers are always at the end
 3183                  * of the list. Compressed buffers don't have the
 3184                  * same requirements. This makes it hard to
 3185                  * simply assert that the lastbuf is shared so
 3186                  * we rely on the hdr's compression flags to determine
 3187                  * if we have a compressed, shared buffer.
 3188                  */
 3189                 ASSERT3P(lastbuf, !=, NULL);
 3190                 ASSERT(arc_buf_is_shared(lastbuf) ||
 3191                     arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);
 3192         }
 3193 
 3194         /*
 3195          * Free the checksum if we're removing the last uncompressed buf from
 3196          * this hdr.
 3197          */
 3198         if (!arc_hdr_has_uncompressed_buf(hdr)) {
 3199                 arc_cksum_free(hdr);
 3200         }
 3201 
 3202         /* clean up the buf */
 3203         buf->b_hdr = NULL;
 3204         kmem_cache_free(buf_cache, buf);
 3205 }
 3206 
 3207 static void
 3208 arc_hdr_alloc_abd(arc_buf_hdr_t *hdr, int alloc_flags)
 3209 {
 3210         uint64_t size;
 3211         boolean_t alloc_rdata = ((alloc_flags & ARC_HDR_ALLOC_RDATA) != 0);
 3212 
 3213         ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
 3214         ASSERT(HDR_HAS_L1HDR(hdr));
 3215         ASSERT(!HDR_SHARED_DATA(hdr) || alloc_rdata);
 3216         IMPLY(alloc_rdata, HDR_PROTECTED(hdr));
 3217 
 3218         if (alloc_rdata) {
 3219                 size = HDR_GET_PSIZE(hdr);
 3220                 ASSERT3P(hdr->b_crypt_hdr.b_rabd, ==, NULL);
 3221                 hdr->b_crypt_hdr.b_rabd = arc_get_data_abd(hdr, size, hdr,
 3222                     alloc_flags);
 3223                 ASSERT3P(hdr->b_crypt_hdr.b_rabd, !=, NULL);
 3224                 ARCSTAT_INCR(arcstat_raw_size, size);
 3225         } else {
 3226                 size = arc_hdr_size(hdr);
 3227                 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 3228                 hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, size, hdr,
 3229                     alloc_flags);
 3230                 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
 3231         }
 3232 
 3233         ARCSTAT_INCR(arcstat_compressed_size, size);
 3234         ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
 3235 }
 3236 
 3237 static void
 3238 arc_hdr_free_abd(arc_buf_hdr_t *hdr, boolean_t free_rdata)
 3239 {
 3240         uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);
 3241 
 3242         ASSERT(HDR_HAS_L1HDR(hdr));
 3243         ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));
 3244         IMPLY(free_rdata, HDR_HAS_RABD(hdr));
 3245 
 3246         /*
 3247          * If the hdr is currently being written to the l2arc then
 3248          * we defer freeing the data by adding it to the l2arc_free_on_write
 3249          * list. The l2arc will free the data once it's finished
 3250          * writing it to the l2arc device.
 3251          */
 3252         if (HDR_L2_WRITING(hdr)) {
 3253                 arc_hdr_free_on_write(hdr, free_rdata);
 3254                 ARCSTAT_BUMP(arcstat_l2_free_on_write);
 3255         } else if (free_rdata) {
 3256                 arc_free_data_abd(hdr, hdr->b_crypt_hdr.b_rabd, size, hdr);
 3257         } else {
 3258                 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, size, hdr);
 3259         }
 3260 
 3261         if (free_rdata) {
 3262                 hdr->b_crypt_hdr.b_rabd = NULL;
 3263                 ARCSTAT_INCR(arcstat_raw_size, -size);
 3264         } else {
 3265                 hdr->b_l1hdr.b_pabd = NULL;
 3266         }
 3267 
 3268         if (hdr->b_l1hdr.b_pabd == NULL && !HDR_HAS_RABD(hdr))
 3269                 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
 3270 
 3271         ARCSTAT_INCR(arcstat_compressed_size, -size);
 3272         ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
 3273 }
 3274 
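      /*
       * A note on the two ABDs managed above: b_pabd holds the decrypted data
       * the ARC normally serves (compressed or uncompressed) and is sized by
       * arc_hdr_size(), while b_rabd holds the raw, still-encrypted on-disk
       * bytes and is always sized by HDR_GET_PSIZE().  A protected header may
       * carry either or both; once the last encrypted buf is gone and a
       * decrypted copy exists, b_rabd is dropped (see arc_buf_destroy_impl()
       * above).
       */
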
 3275 /*
 3276  * Allocate an empty anonymous ARC header.  The header will get its identity
 3277  * assigned and buffers attached later as part of read or write operations.
 3278  *
 3279  * In case of a read, arc_read() assigns the header its identity (b_dva +
 3280  * b_birth), inserts it into the ARC hash to make it globally visible, and
 3281  * allocates a physical (b_pabd) or raw (b_rabd) ABD buffer to read into from
 3282  * disk.  On disk read completion arc_read_done() allocates ARC buffer(s) as
 3283  * needed, potentially sharing one of them with the physical ABD buffer.
 3284  *
 3285  * In case of a write, arc_alloc_buf() allocates an ARC buffer to be filled
 3286  * with data.  Then, after compression and/or encryption, arc_write_ready()
 3287  * allocates and fills (or potentially shares) the physical (b_pabd) or raw
 3288  * (b_rabd) ABD buffer.  On disk write completion arc_write_done() assigns the
 3289  * header its new identity (b_dva + b_birth) and inserts it into the ARC hash.
 3290  *
 3291  * In case of a partial overwrite, the old data is read first as described.
 3292  * Then arc_release() either allocates a new anonymous ARC header and moves
 3293  * the ARC buffer to it, or reuses the old ARC header by discarding its
 3294  * identity and removing it from the ARC hash.  After buffer modification the
 3295  * normal write process follows as described.
 3296  */
 3297 static arc_buf_hdr_t *
 3298 arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
 3299     boolean_t protected, enum zio_compress compression_type, uint8_t complevel,
 3300     arc_buf_contents_t type)
 3301 {
 3302         arc_buf_hdr_t *hdr;
 3303 
 3304         VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);
 3305         if (protected) {
 3306                 hdr = kmem_cache_alloc(hdr_full_crypt_cache, KM_PUSHPAGE);
 3307         } else {
 3308                 hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
 3309         }
 3310 
 3311         ASSERT(HDR_EMPTY(hdr));
 3312 #ifdef ZFS_DEBUG
 3313         ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
 3314 #endif
 3315         HDR_SET_PSIZE(hdr, psize);
 3316         HDR_SET_LSIZE(hdr, lsize);
 3317         hdr->b_spa = spa;
 3318         hdr->b_type = type;
 3319         hdr->b_flags = 0;
 3320         arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
 3321         arc_hdr_set_compress(hdr, compression_type);
 3322         hdr->b_complevel = complevel;
 3323         if (protected)
 3324                 arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);
 3325 
 3326         hdr->b_l1hdr.b_state = arc_anon;
 3327         hdr->b_l1hdr.b_arc_access = 0;
 3328         hdr->b_l1hdr.b_mru_hits = 0;
 3329         hdr->b_l1hdr.b_mru_ghost_hits = 0;
 3330         hdr->b_l1hdr.b_mfu_hits = 0;
 3331         hdr->b_l1hdr.b_mfu_ghost_hits = 0;
 3332         hdr->b_l1hdr.b_bufcnt = 0;
 3333         hdr->b_l1hdr.b_buf = NULL;
 3334 
 3335         ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
 3336 
 3337         return (hdr);
 3338 }
 3339 
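      /*
       * To make the lifecycle described above concrete, a simplified
       * read-path timeline (ordering is approximate and error paths are
       * omitted):
       *
       *   hdr = arc_hdr_alloc(...);          anonymous, no identity, no bufs
       *   <assign b_dva/b_birth>; buf_hash_insert(hdr, ...);  now visible
       *   arc_hdr_alloc_abd(hdr, ...);       b_pabd (or b_rabd) to read into
       *   <zio reads the block from disk into the ABD>
       *   arc_read_done(): arc_buf_alloc_impl(hdr, ...) creates the
       *       arc_buf_t's, possibly sharing b_pabd with the last one
       */
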
 3340 /*
 3341  * Transition between the two allocation states for the arc_buf_hdr struct.
 3342  * The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without
 3343  * (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller
 3344  * version is used when a cache buffer is only in the L2ARC in order to reduce
 3345  * memory usage.
 3346  */
 3347 static arc_buf_hdr_t *
 3348 arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new)
 3349 {
 3350         ASSERT(HDR_HAS_L2HDR(hdr));
 3351 
 3352         arc_buf_hdr_t *nhdr;
 3353         l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
 3354 
 3355         ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) ||
 3356             (old == hdr_l2only_cache && new == hdr_full_cache));
 3357 
 3358         /*
 3359          * If the caller wanted a new full header and the header is to be
 3360          * encrypted, we will actually allocate the header from the full crypt
 3361          * cache instead. The same applies to freeing from the old cache.
 3362          */
 3363         if (HDR_PROTECTED(hdr) && new == hdr_full_cache)
 3364                 new = hdr_full_crypt_cache;
 3365         if (HDR_PROTECTED(hdr) && old == hdr_full_cache)
 3366                 old = hdr_full_crypt_cache;
 3367 
 3368         nhdr = kmem_cache_alloc(new, KM_PUSHPAGE);
 3369 
 3370         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
 3371         buf_hash_remove(hdr);
 3372 
 3373         memcpy(nhdr, hdr, HDR_L2ONLY_SIZE);
 3374 
 3375         if (new == hdr_full_cache || new == hdr_full_crypt_cache) {
 3376                 arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);
 3377                 /*
 3378                  * arc_access and arc_change_state need to be aware that a
 3379                  * header has just come out of L2ARC, so we set its state to
 3380                  * l2c_only even though it's about to change.
 3381                  */
 3382                 nhdr->b_l1hdr.b_state = arc_l2c_only;
 3383 
 3384                 /* Verify previous threads set these to NULL before freeing */
 3385                 ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
 3386                 ASSERT(!HDR_HAS_RABD(hdr));
 3387         } else {
 3388                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
 3389                 ASSERT0(hdr->b_l1hdr.b_bufcnt);
 3390 #ifdef ZFS_DEBUG
 3391                 ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
 3392 #endif
 3393 
 3394                 /*
 3395                  * If we've reached here, we must have been called from
 3396                  * arc_evict_hdr(), as such we should have already been
 3397                  * removed from any ghost list we were previously on
 3398                  * (which protects us from racing with arc_evict_state),
 3399                  * thus no locking is needed during this check.
 3400                  */
 3401                 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
 3402 
 3403                 /*
 3404                  * A buffer must not be moved into the arc_l2c_only
 3405                  * state if it's not finished being written out to the
 3406                  * l2arc device. Otherwise, the b_l1hdr.b_pabd field
 3407                  * might still be accessed even though it has been removed.
 3408                  */
 3409                 VERIFY(!HDR_L2_WRITING(hdr));
 3410                 VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 3411                 ASSERT(!HDR_HAS_RABD(hdr));
 3412 
 3413                 arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR);
 3414         }
 3415         /*
 3416          * The header has been reallocated so we need to re-insert it into any
 3417          * lists it was on.
 3418          */
 3419         (void) buf_hash_insert(nhdr, NULL);
 3420 
 3421         ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node));
 3422 
 3423         mutex_enter(&dev->l2ad_mtx);
 3424 
 3425         /*
 3426          * We must place the realloc'ed header back into the list at
 3427          * the same spot. Otherwise, if it's placed earlier in the list,
 3428          * l2arc_write_buffers() could find it during the function's
 3429          * write phase, and try to write it out to the l2arc.
 3430          */
 3431         list_insert_after(&dev->l2ad_buflist, hdr, nhdr);
 3432         list_remove(&dev->l2ad_buflist, hdr);
 3433 
 3434         mutex_exit(&dev->l2ad_mtx);
 3435 
 3436         /*
 3437          * Since we're using the pointer address as the tag when
 3438          * incrementing and decrementing the l2ad_alloc refcount, we
 3439          * must remove the old pointer (that we're about to destroy) and
 3440          * add the new pointer to the refcount. Otherwise we'd remove
 3441          * the wrong pointer address when calling arc_hdr_destroy() later.
 3442          */
 3443 
 3444         (void) zfs_refcount_remove_many(&dev->l2ad_alloc,
 3445             arc_hdr_size(hdr), hdr);
 3446         (void) zfs_refcount_add_many(&dev->l2ad_alloc,
 3447             arc_hdr_size(nhdr), nhdr);
 3448 
 3449         buf_discard_identity(hdr);
 3450         kmem_cache_free(old, hdr);
 3451 
 3452         return (nhdr);
 3453 }
 3454 
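      /*
       * In practice the two directions of this conversion correspond to:
       *
       *   full -> l2only:  arc_evict_hdr() demotes a header whose data now
       *       survives only on the L2ARC device, shedding the L1 fields to
       *       save memory.
       *   l2only -> full:  a later read finds the header in the hash table
       *       and promotes it back so that L1 state (buffers, refcounts,
       *       b_pabd) can be attached again.
       */
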
 3455 /*
 3456  * This function allows an L1 header to be reallocated as a crypt
 3457  * header and vice versa. If we are going to a crypt header, the
 3458  * new fields will be zeroed out.
 3459  */
 3460 static arc_buf_hdr_t *
 3461 arc_hdr_realloc_crypt(arc_buf_hdr_t *hdr, boolean_t need_crypt)
 3462 {
 3463         arc_buf_hdr_t *nhdr;
 3464         arc_buf_t *buf;
 3465         kmem_cache_t *ncache, *ocache;
 3466 
 3467         /*
 3468          * This function requires that hdr is in the arc_anon state.
 3469          * Therefore it won't have any L2ARC data for us to worry
 3470          * about copying.
 3471          */
 3472         ASSERT(HDR_HAS_L1HDR(hdr));
 3473         ASSERT(!HDR_HAS_L2HDR(hdr));
 3474         ASSERT3U(!!HDR_PROTECTED(hdr), !=, need_crypt);
 3475         ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
 3476         ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
 3477         ASSERT(!list_link_active(&hdr->b_l2hdr.b_l2node));
 3478         ASSERT3P(hdr->b_hash_next, ==, NULL);
 3479 
 3480         if (need_crypt) {
 3481                 ncache = hdr_full_crypt_cache;
 3482                 ocache = hdr_full_cache;
 3483         } else {
 3484                 ncache = hdr_full_cache;
 3485                 ocache = hdr_full_crypt_cache;
 3486         }
 3487 
 3488         nhdr = kmem_cache_alloc(ncache, KM_PUSHPAGE);
 3489 
 3490         /*
 3491          * Copy all members that aren't locks or condvars to the new header.
 3492          * No lists are pointing to us (as we asserted above), so we don't
 3493          * need to worry about the list nodes.
 3494          */
 3495         nhdr->b_dva = hdr->b_dva;
 3496         nhdr->b_birth = hdr->b_birth;
 3497         nhdr->b_type = hdr->b_type;
 3498         nhdr->b_flags = hdr->b_flags;
 3499         nhdr->b_psize = hdr->b_psize;
 3500         nhdr->b_lsize = hdr->b_lsize;
 3501         nhdr->b_spa = hdr->b_spa;
 3502 #ifdef ZFS_DEBUG
 3503         nhdr->b_l1hdr.b_freeze_cksum = hdr->b_l1hdr.b_freeze_cksum;
 3504 #endif
 3505         nhdr->b_l1hdr.b_bufcnt = hdr->b_l1hdr.b_bufcnt;
 3506         nhdr->b_l1hdr.b_byteswap = hdr->b_l1hdr.b_byteswap;
 3507         nhdr->b_l1hdr.b_state = hdr->b_l1hdr.b_state;
 3508         nhdr->b_l1hdr.b_arc_access = hdr->b_l1hdr.b_arc_access;
 3509         nhdr->b_l1hdr.b_mru_hits = hdr->b_l1hdr.b_mru_hits;
 3510         nhdr->b_l1hdr.b_mru_ghost_hits = hdr->b_l1hdr.b_mru_ghost_hits;
 3511         nhdr->b_l1hdr.b_mfu_hits = hdr->b_l1hdr.b_mfu_hits;
 3512         nhdr->b_l1hdr.b_mfu_ghost_hits = hdr->b_l1hdr.b_mfu_ghost_hits;
 3513         nhdr->b_l1hdr.b_acb = hdr->b_l1hdr.b_acb;
 3514         nhdr->b_l1hdr.b_pabd = hdr->b_l1hdr.b_pabd;
 3515 
 3516         /*
 3517          * This zfs_refcount_add() exists only to ensure that the individual
 3518          * arc buffers always point to a header that is referenced, avoiding
 3519          * a small race condition that could trigger ASSERTs.
 3520          */
 3521         (void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, FTAG);
 3522         nhdr->b_l1hdr.b_buf = hdr->b_l1hdr.b_buf;
 3523         for (buf = nhdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next)
 3524                 buf->b_hdr = nhdr;
 3525 
 3526         zfs_refcount_transfer(&nhdr->b_l1hdr.b_refcnt, &hdr->b_l1hdr.b_refcnt);
 3527         (void) zfs_refcount_remove(&nhdr->b_l1hdr.b_refcnt, FTAG);
 3528         ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt));
 3529 
 3530         if (need_crypt) {
 3531                 arc_hdr_set_flags(nhdr, ARC_FLAG_PROTECTED);
 3532         } else {
 3533                 arc_hdr_clear_flags(nhdr, ARC_FLAG_PROTECTED);
 3534         }
 3535 
 3536         /* unset all members of the original hdr */
 3537         memset(&hdr->b_dva, 0, sizeof (dva_t));
 3538         hdr->b_birth = 0;
 3539         hdr->b_type = ARC_BUFC_INVALID;
 3540         hdr->b_flags = 0;
 3541         hdr->b_psize = 0;
 3542         hdr->b_lsize = 0;
 3543         hdr->b_spa = 0;
 3544 #ifdef ZFS_DEBUG
 3545         hdr->b_l1hdr.b_freeze_cksum = NULL;
 3546 #endif
 3547         hdr->b_l1hdr.b_buf = NULL;
 3548         hdr->b_l1hdr.b_bufcnt = 0;
 3549         hdr->b_l1hdr.b_byteswap = 0;
 3550         hdr->b_l1hdr.b_state = NULL;
 3551         hdr->b_l1hdr.b_arc_access = 0;
 3552         hdr->b_l1hdr.b_mru_hits = 0;
 3553         hdr->b_l1hdr.b_mru_ghost_hits = 0;
 3554         hdr->b_l1hdr.b_mfu_hits = 0;
 3555         hdr->b_l1hdr.b_mfu_ghost_hits = 0;
 3556         hdr->b_l1hdr.b_acb = NULL;
 3557         hdr->b_l1hdr.b_pabd = NULL;
 3558 
 3559         if (ocache == hdr_full_crypt_cache) {
 3560                 ASSERT(!HDR_HAS_RABD(hdr));
 3561                 hdr->b_crypt_hdr.b_ot = DMU_OT_NONE;
 3562                 hdr->b_crypt_hdr.b_ebufcnt = 0;
 3563                 hdr->b_crypt_hdr.b_dsobj = 0;
 3564                 memset(hdr->b_crypt_hdr.b_salt, 0, ZIO_DATA_SALT_LEN);
 3565                 memset(hdr->b_crypt_hdr.b_iv, 0, ZIO_DATA_IV_LEN);
 3566                 memset(hdr->b_crypt_hdr.b_mac, 0, ZIO_DATA_MAC_LEN);
 3567         }
 3568 
 3569         buf_discard_identity(hdr);
 3570         kmem_cache_free(ocache, hdr);
 3571 
 3572         return (nhdr);
 3573 }
 3574 
 3575 /*
 3576  * This function is used by the send / receive code to convert a newly
 3577  * allocated arc_buf_t to one that is suitable for a raw encrypted write. It
 3578  * is also used to allow the root objset block to be updated without altering
 3579  * its embedded MACs. Both block types will always be uncompressed so we do not
 3580  * have to worry about compression type or psize.
 3581  */
 3582 void
 3583 arc_convert_to_raw(arc_buf_t *buf, uint64_t dsobj, boolean_t byteorder,
 3584     dmu_object_type_t ot, const uint8_t *salt, const uint8_t *iv,
 3585     const uint8_t *mac)
 3586 {
 3587         arc_buf_hdr_t *hdr = buf->b_hdr;
 3588 
 3589         ASSERT(ot == DMU_OT_DNODE || ot == DMU_OT_OBJSET);
 3590         ASSERT(HDR_HAS_L1HDR(hdr));
 3591         ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
 3592 
 3593         buf->b_flags |= (ARC_BUF_FLAG_COMPRESSED | ARC_BUF_FLAG_ENCRYPTED);
 3594         if (!HDR_PROTECTED(hdr))
 3595                 hdr = arc_hdr_realloc_crypt(hdr, B_TRUE);
 3596         hdr->b_crypt_hdr.b_dsobj = dsobj;
 3597         hdr->b_crypt_hdr.b_ot = ot;
 3598         hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?
 3599             DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);
 3600         if (!arc_hdr_has_uncompressed_buf(hdr))
 3601                 arc_cksum_free(hdr);
 3602 
 3603         if (salt != NULL)
 3604                 memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN);
 3605         if (iv != NULL)
 3606                 memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN);
 3607         if (mac != NULL)
 3608                 memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN);
 3609 }
 3610 
 3611 /*
 3612  * Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller.
 3613  * The buf is returned thawed since we expect the consumer to modify it.
 3614  */
 3615 arc_buf_t *
 3616 arc_alloc_buf(spa_t *spa, const void *tag, arc_buf_contents_t type,
 3617     int32_t size)
 3618 {
 3619         arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,
 3620             B_FALSE, ZIO_COMPRESS_OFF, 0, type);
 3621 
 3622         arc_buf_t *buf = NULL;
 3623         VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE, B_FALSE,
 3624             B_FALSE, B_FALSE, &buf));
 3625         arc_buf_thaw(buf);
 3626 
 3627         return (buf);
 3628 }
 3629 
 3630 /*
 3631  * Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this
 3632  * for bufs containing metadata.
 3633  */
 3634 arc_buf_t *
 3635 arc_alloc_compressed_buf(spa_t *spa, const void *tag, uint64_t psize,
 3636     uint64_t lsize, enum zio_compress compression_type, uint8_t complevel)
 3637 {
 3638         ASSERT3U(lsize, >, 0);
 3639         ASSERT3U(lsize, >=, psize);
 3640         ASSERT3U(compression_type, >, ZIO_COMPRESS_OFF);
 3641         ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);
 3642 
 3643         arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
 3644             B_FALSE, compression_type, complevel, ARC_BUFC_DATA);
 3645 
 3646         arc_buf_t *buf = NULL;
 3647         VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE,
 3648             B_TRUE, B_FALSE, B_FALSE, &buf));
 3649         arc_buf_thaw(buf);
 3650 
 3651         /*
 3652          * To ensure that the hdr has the correct data in it if we call
 3653          * arc_untransform() on this buf before it's been written to disk,
 3654          * it's easiest if we just set up sharing between the buf and the hdr.
 3655          */
 3656         arc_share_buf(hdr, buf);
 3657 
 3658         return (buf);
 3659 }
 3660 
 3661 arc_buf_t *
 3662 arc_alloc_raw_buf(spa_t *spa, const void *tag, uint64_t dsobj,
 3663     boolean_t byteorder, const uint8_t *salt, const uint8_t *iv,
 3664     const uint8_t *mac, dmu_object_type_t ot, uint64_t psize, uint64_t lsize,
 3665     enum zio_compress compression_type, uint8_t complevel)
 3666 {
 3667         arc_buf_hdr_t *hdr;
 3668         arc_buf_t *buf;
 3669         arc_buf_contents_t type = DMU_OT_IS_METADATA(ot) ?
 3670             ARC_BUFC_METADATA : ARC_BUFC_DATA;
 3671 
 3672         ASSERT3U(lsize, >, 0);
 3673         ASSERT3U(lsize, >=, psize);
 3674         ASSERT3U(compression_type, >=, ZIO_COMPRESS_OFF);
 3675         ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);
 3676 
 3677         hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, B_TRUE,
 3678             compression_type, complevel, type);
 3679 
 3680         hdr->b_crypt_hdr.b_dsobj = dsobj;
 3681         hdr->b_crypt_hdr.b_ot = ot;
 3682         hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?
 3683             DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);
 3684         memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN);
 3685         memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN);
 3686         memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN);
 3687 
 3688         /*
 3689          * This buffer will be considered encrypted even if the ot is not an
 3690          * encrypted type. It will become authenticated instead in
 3691          * arc_write_ready().
 3692          */
 3693         buf = NULL;
 3694         VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_TRUE, B_TRUE,
 3695             B_FALSE, B_FALSE, &buf));
 3696         arc_buf_thaw(buf);
 3697 
 3698         return (buf);
 3699 }
 3700 
 3701 static void
 3702 l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr,
 3703     boolean_t state_only)
 3704 {
 3705         l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
 3706         l2arc_dev_t *dev = l2hdr->b_dev;
 3707         uint64_t lsize = HDR_GET_LSIZE(hdr);
 3708         uint64_t psize = HDR_GET_PSIZE(hdr);
 3709         uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
 3710         arc_buf_contents_t type = hdr->b_type;
 3711         int64_t lsize_s;
 3712         int64_t psize_s;
 3713         int64_t asize_s;
 3714 
 3715         if (incr) {
 3716                 lsize_s = lsize;
 3717                 psize_s = psize;
 3718                 asize_s = asize;
 3719         } else {
 3720                 lsize_s = -lsize;
 3721                 psize_s = -psize;
 3722                 asize_s = -asize;
 3723         }
 3724 
 3725         /* If the buffer is a prefetch, count it as such. */
 3726         if (HDR_PREFETCH(hdr)) {
 3727                 ARCSTAT_INCR(arcstat_l2_prefetch_asize, asize_s);
 3728         } else {
 3729                 /*
 3730                  * We use the value stored in the L2 header upon initial
 3731                  * caching in L2ARC. This value will be updated in case
 3732                  * an MRU/MRU_ghost buffer transitions to MFU, but the L2ARC
 3733                  * metadata (log entry) cannot currently be updated. Having
 3734                  * the ARC state in the L2 header solves the problem of a
 3735                  * possibly absent L1 header (apparent in buffers restored
 3736                  * from persistent L2ARC).
 3737                  */
 3738                 switch (hdr->b_l2hdr.b_arcs_state) {
 3739                         case ARC_STATE_MRU_GHOST:
 3740                         case ARC_STATE_MRU:
 3741                                 ARCSTAT_INCR(arcstat_l2_mru_asize, asize_s);
 3742                                 break;
 3743                         case ARC_STATE_MFU_GHOST:
 3744                         case ARC_STATE_MFU:
 3745                                 ARCSTAT_INCR(arcstat_l2_mfu_asize, asize_s);
 3746                                 break;
 3747                         default:
 3748                                 break;
 3749                 }
 3750         }
 3751 
 3752         if (state_only)
 3753                 return;
 3754 
 3755         ARCSTAT_INCR(arcstat_l2_psize, psize_s);
 3756         ARCSTAT_INCR(arcstat_l2_lsize, lsize_s);
 3757 
 3758         switch (type) {
 3759                 case ARC_BUFC_DATA:
 3760                         ARCSTAT_INCR(arcstat_l2_bufc_data_asize, asize_s);
 3761                         break;
 3762                 case ARC_BUFC_METADATA:
 3763                         ARCSTAT_INCR(arcstat_l2_bufc_metadata_asize, asize_s);
 3764                         break;
 3765                 default:
 3766                         break;
 3767         }
 3768 }
 3769 
 3770 
 3771 static void
 3772 arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)
 3773 {
 3774         l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
 3775         l2arc_dev_t *dev = l2hdr->b_dev;
 3776         uint64_t psize = HDR_GET_PSIZE(hdr);
 3777         uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
 3778 
 3779         ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
 3780         ASSERT(HDR_HAS_L2HDR(hdr));
 3781 
 3782         list_remove(&dev->l2ad_buflist, hdr);
 3783 
 3784         l2arc_hdr_arcstats_decrement(hdr);
 3785         vdev_space_update(dev->l2ad_vdev, -asize, 0, 0);
 3786 
 3787         (void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr),
 3788             hdr);
 3789         arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
 3790 }
 3791 
 3792 static void
 3793 arc_hdr_destroy(arc_buf_hdr_t *hdr)
 3794 {
 3795         if (HDR_HAS_L1HDR(hdr)) {
 3796                 ASSERT(hdr->b_l1hdr.b_buf == NULL ||
 3797                     hdr->b_l1hdr.b_bufcnt > 0);
 3798                 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
 3799                 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
 3800         }
 3801         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 3802         ASSERT(!HDR_IN_HASH_TABLE(hdr));
 3803 
 3804         if (HDR_HAS_L2HDR(hdr)) {
 3805                 l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
 3806                 boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);
 3807 
 3808                 if (!buflist_held)
 3809                         mutex_enter(&dev->l2ad_mtx);
 3810 
 3811                 /*
 3812                  * Even though we checked this conditional above, we
 3813                  * need to check this again now that we have the
 3814                  * l2ad_mtx. This is because we could be racing with
 3815                  * another thread calling l2arc_evict() which might have
 3816                  * destroyed this header's L2 portion as we were waiting
 3817                  * to acquire the l2ad_mtx. If that happens, we don't
 3818                  * want to re-destroy the header's L2 portion.
 3819                  */
 3820                 if (HDR_HAS_L2HDR(hdr)) {
 3821 
 3822                         if (!HDR_EMPTY(hdr))
 3823                                 buf_discard_identity(hdr);
 3824 
 3825                         arc_hdr_l2hdr_destroy(hdr);
 3826                 }
 3827 
 3828                 if (!buflist_held)
 3829                         mutex_exit(&dev->l2ad_mtx);
 3830         }
 3831 
 3832         /*
 3833          * The header's identity can only be safely discarded once it is no
 3834          * longer discoverable.  This requires removing it from the hash table
 3835          * and the l2arc header list.  After this point the hash lock can not
 3836          * be used to protect the header.
 3837          */
 3838         if (!HDR_EMPTY(hdr))
 3839                 buf_discard_identity(hdr);
 3840 
 3841         if (HDR_HAS_L1HDR(hdr)) {
 3842                 arc_cksum_free(hdr);
 3843 
 3844                 while (hdr->b_l1hdr.b_buf != NULL)
 3845                         arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);
 3846 
 3847                 if (hdr->b_l1hdr.b_pabd != NULL)
 3848                         arc_hdr_free_abd(hdr, B_FALSE);
 3849 
 3850                 if (HDR_HAS_RABD(hdr))
 3851                         arc_hdr_free_abd(hdr, B_TRUE);
 3852         }
 3853 
 3854         ASSERT3P(hdr->b_hash_next, ==, NULL);
 3855         if (HDR_HAS_L1HDR(hdr)) {
 3856                 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
 3857                 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
 3858 #ifdef ZFS_DEBUG
 3859                 ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
 3860 #endif
 3861 
 3862                 if (!HDR_PROTECTED(hdr)) {
 3863                         kmem_cache_free(hdr_full_cache, hdr);
 3864                 } else {
 3865                         kmem_cache_free(hdr_full_crypt_cache, hdr);
 3866                 }
 3867         } else {
 3868                 kmem_cache_free(hdr_l2only_cache, hdr);
 3869         }
 3870 }
 3871 
 3872 void
 3873 arc_buf_destroy(arc_buf_t *buf, const void *tag)
 3874 {
 3875         arc_buf_hdr_t *hdr = buf->b_hdr;
 3876 
 3877         if (hdr->b_l1hdr.b_state == arc_anon) {
 3878                 ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
 3879                 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 3880                 VERIFY0(remove_reference(hdr, tag));
 3881                 return;
 3882         }
 3883 
 3884         kmutex_t *hash_lock = HDR_LOCK(hdr);
 3885         mutex_enter(hash_lock);
 3886 
 3887         ASSERT3P(hdr, ==, buf->b_hdr);
 3888         ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
 3889         ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
 3890         ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon);
 3891         ASSERT3P(buf->b_data, !=, NULL);
 3892 
 3893         arc_buf_destroy_impl(buf);
 3894         (void) remove_reference(hdr, tag);
 3895         mutex_exit(hash_lock);
 3896 }
 3897 
 3898 /*
 3899  * Evict the arc_buf_hdr that is provided as a parameter. The resultant
 3900  * state of the header is dependent on its state prior to entering this
 3901  * function. The following transitions are possible:
 3902  *
 3903  *    - arc_mru -> arc_mru_ghost
 3904  *    - arc_mfu -> arc_mfu_ghost
 3905  *    - arc_mru_ghost -> arc_l2c_only
 3906  *    - arc_mru_ghost -> deleted
 3907  *    - arc_mfu_ghost -> arc_l2c_only
 3908  *    - arc_mfu_ghost -> deleted
 3909  *    - arc_uncached -> deleted
 3910  *
 3911  * Return total size of evicted data buffers for eviction progress tracking.
 3912  * When evicting from ghost states, return logical buffer size to make eviction
 3913  * progress at the same (or at least comparable) rate as from non-ghost states.
 3914  *
 3915  * Return *real_evicted for actual ARC size reduction to wake up threads
 3916  * waiting for it.  For non-ghost states it includes size of evicted data
 3917  * buffers (the headers are not freed there).  For ghost states it includes
 3918  * only the evicted headers size.
 3919  */
 3920 static int64_t
 3921 arc_evict_hdr(arc_buf_hdr_t *hdr, uint64_t *real_evicted)
 3922 {
 3923         arc_state_t *evicted_state, *state;
 3924         int64_t bytes_evicted = 0;
 3925         uint_t min_lifetime = HDR_PRESCIENT_PREFETCH(hdr) ?
 3926             arc_min_prescient_prefetch_ms : arc_min_prefetch_ms;
 3927 
 3928         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
 3929         ASSERT(HDR_HAS_L1HDR(hdr));
 3930         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 3931         ASSERT0(hdr->b_l1hdr.b_bufcnt);
 3932         ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
 3933         ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt));
 3934 
 3935         *real_evicted = 0;
 3936         state = hdr->b_l1hdr.b_state;
 3937         if (GHOST_STATE(state)) {
 3938 
 3939                 /*
 3940                  * l2arc_write_buffers() relies on a header's L1 portion
 3941                  * (i.e. its b_pabd field) during its write phase.
 3942                  * Thus, we cannot push a header onto the arc_l2c_only
 3943                  * state (removing its L1 piece) until the header is
 3944                  * done being written to the l2arc.
 3945                  */
 3946                 if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {
 3947                         ARCSTAT_BUMP(arcstat_evict_l2_skip);
 3948                         return (bytes_evicted);
 3949                 }
 3950 
 3951                 ARCSTAT_BUMP(arcstat_deleted);
 3952                 bytes_evicted += HDR_GET_LSIZE(hdr);
 3953 
 3954                 DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr);
 3955 
 3956                 if (HDR_HAS_L2HDR(hdr)) {
 3957                         ASSERT(hdr->b_l1hdr.b_pabd == NULL);
 3958                         ASSERT(!HDR_HAS_RABD(hdr));
 3959                         /*
 3960                          * This buffer is cached on the 2nd Level ARC;
 3961                          * don't destroy the header.
 3962                          */
 3963                         arc_change_state(arc_l2c_only, hdr);
 3964                         /*
 3965                          * dropping from L1+L2 cached to L2-only,
 3966                          * realloc to remove the L1 header.
 3967                          */
 3968                         (void) arc_hdr_realloc(hdr, hdr_full_cache,
 3969                             hdr_l2only_cache);
 3970                         *real_evicted += HDR_FULL_SIZE - HDR_L2ONLY_SIZE;
 3971                 } else {
 3972                         arc_change_state(arc_anon, hdr);
 3973                         arc_hdr_destroy(hdr);
 3974                         *real_evicted += HDR_FULL_SIZE;
 3975                 }
 3976                 return (bytes_evicted);
 3977         }
 3978 
 3979         ASSERT(state == arc_mru || state == arc_mfu || state == arc_uncached);
 3980         evicted_state = (state == arc_uncached) ? arc_anon :
 3981             ((state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost);
 3982 
 3983         /* prefetch buffers have a minimum lifespan */
 3984         if ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&
 3985             ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access <
 3986             MSEC_TO_TICK(min_lifetime)) {
 3987                 ARCSTAT_BUMP(arcstat_evict_skip);
 3988                 return (bytes_evicted);
 3989         }
 3990 
 3991         if (HDR_HAS_L2HDR(hdr)) {
 3992                 ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr));
 3993         } else {
 3994                 if (l2arc_write_eligible(hdr->b_spa, hdr)) {
 3995                         ARCSTAT_INCR(arcstat_evict_l2_eligible,
 3996                             HDR_GET_LSIZE(hdr));
 3997 
 3998                         switch (state->arcs_state) {
 3999                                 case ARC_STATE_MRU:
 4000                                         ARCSTAT_INCR(
 4001                                             arcstat_evict_l2_eligible_mru,
 4002                                             HDR_GET_LSIZE(hdr));
 4003                                         break;
 4004                                 case ARC_STATE_MFU:
 4005                                         ARCSTAT_INCR(
 4006                                             arcstat_evict_l2_eligible_mfu,
 4007                                             HDR_GET_LSIZE(hdr));
 4008                                         break;
 4009                                 default:
 4010                                         break;
 4011                         }
 4012                 } else {
 4013                         ARCSTAT_INCR(arcstat_evict_l2_ineligible,
 4014                             HDR_GET_LSIZE(hdr));
 4015                 }
 4016         }
 4017 
 4018         bytes_evicted += arc_hdr_size(hdr);
 4019         *real_evicted += arc_hdr_size(hdr);
 4020 
 4021         /*
 4022          * If this hdr is being evicted and has a compressed buffer then we
 4023          * discard it here before we change states.  This ensures that the
 4024          * accounting is updated correctly in arc_free_data_impl().
 4025          */
 4026         if (hdr->b_l1hdr.b_pabd != NULL)
 4027                 arc_hdr_free_abd(hdr, B_FALSE);
 4028 
 4029         if (HDR_HAS_RABD(hdr))
 4030                 arc_hdr_free_abd(hdr, B_TRUE);
 4031 
 4032         arc_change_state(evicted_state, hdr);
 4033         DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr);
 4034         if (evicted_state == arc_anon) {
 4035                 arc_hdr_destroy(hdr);
 4036                 *real_evicted += HDR_FULL_SIZE;
 4037         } else {
 4038                 ASSERT(HDR_IN_HASH_TABLE(hdr));
 4039         }
 4040 
 4041         return (bytes_evicted);
 4042 }
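
      /*
       * Illustrative reading of the return contract documented above,
       * using hypothetical sizes: for a 128K logical block cached
       * compressed at 32K, evicting its header from arc_mru adds 32K
       * (its arc_hdr_size()) to both bytes_evicted and *real_evicted.
       * When the corresponding ghost-state header is evicted later,
       * bytes_evicted grows by the 128K logical size, so ghost and
       * non-ghost eviction progress at comparable rates, while
       * *real_evicted only grows by the in-core header memory actually
       * released (HDR_FULL_SIZE, or HDR_FULL_SIZE - HDR_L2ONLY_SIZE
       * when the header is kept around for the L2ARC).
       */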
 4043 
 4044 static void
 4045 arc_set_need_free(void)
 4046 {
 4047         ASSERT(MUTEX_HELD(&arc_evict_lock));
 4048         int64_t remaining = arc_free_memory() - arc_sys_free / 2;
 4049         arc_evict_waiter_t *aw = list_tail(&arc_evict_waiters);
 4050         if (aw == NULL) {
 4051                 arc_need_free = MAX(-remaining, 0);
 4052         } else {
 4053                 arc_need_free =
 4054                     MAX(-remaining, (int64_t)(aw->aew_count - arc_evict_count));
 4055         }
 4056 }
 4057 
 4058 static uint64_t
 4059 arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,
 4060     uint64_t spa, uint64_t bytes)
 4061 {
 4062         multilist_sublist_t *mls;
 4063         uint64_t bytes_evicted = 0, real_evicted = 0;
 4064         arc_buf_hdr_t *hdr;
 4065         kmutex_t *hash_lock;
 4066         uint_t evict_count = zfs_arc_evict_batch_limit;
 4067 
 4068         ASSERT3P(marker, !=, NULL);
 4069 
 4070         mls = multilist_sublist_lock(ml, idx);
 4071 
 4072         for (hdr = multilist_sublist_prev(mls, marker); likely(hdr != NULL);
 4073             hdr = multilist_sublist_prev(mls, marker)) {
 4074                 if ((evict_count == 0) || (bytes_evicted >= bytes))
 4075                         break;
 4076 
 4077                 /*
 4078                  * To keep our iteration location, move the marker
 4079                  * forward. Since we're not holding hdr's hash lock, we
 4080                  * must be very careful and not remove 'hdr' from the
 4081                  * sublist. Otherwise, other consumers might mistake the
 4082                  * 'hdr' as not being on a sublist when they call the
 4083                  * multilist_link_active() function (they all rely on
 4084                  * the hash lock protecting concurrent insertions and
 4085                  * removals). multilist_sublist_move_forward() was
 4086                  * specifically implemented to ensure this is the case
 4087                  * (only 'marker' will be removed and re-inserted).
 4088                  */
 4089                 multilist_sublist_move_forward(mls, marker);
 4090 
 4091                 /*
 4092                  * The only case where the b_spa field should ever be
 4093                  * zero is for the marker headers inserted by
 4094                  * arc_evict_state(). It's possible for multiple threads
 4095                  * to be calling arc_evict_state() concurrently (e.g.
 4096                  * dsl_pool_close() and zio_inject_fault()), so we must
 4097                  * skip any markers we see from these other threads.
 4098                  */
 4099                 if (hdr->b_spa == 0)
 4100                         continue;
 4101 
 4102                 /* we're only interested in evicting buffers of a certain spa */
 4103                 if (spa != 0 && hdr->b_spa != spa) {
 4104                         ARCSTAT_BUMP(arcstat_evict_skip);
 4105                         continue;
 4106                 }
 4107 
 4108                 hash_lock = HDR_LOCK(hdr);
 4109 
 4110                 /*
 4111                  * We aren't calling this function from any code path
 4112                  * that would already be holding a hash lock, so we're
 4113                  * asserting on this assumption to be defensive in case
 4114                  * this ever changes. Without this check, it would be
 4115                  * possible to incorrectly increment arcstat_mutex_miss
 4116                  * below (e.g. if the code changed such that we called
 4117                  * this function with a hash lock held).
 4118                  */
 4119                 ASSERT(!MUTEX_HELD(hash_lock));
 4120 
 4121                 if (mutex_tryenter(hash_lock)) {
 4122                         uint64_t revicted;
 4123                         uint64_t evicted = arc_evict_hdr(hdr, &revicted);
 4124                         mutex_exit(hash_lock);
 4125 
 4126                         bytes_evicted += evicted;
 4127                         real_evicted += revicted;
 4128 
 4129                         /*
 4130                          * If evicted is zero, arc_evict_hdr() must have
 4131                          * decided to skip this header, don't increment
 4132                          * evict_count in this case.
 4133                          */
 4134                         if (evicted != 0)
 4135                                 evict_count--;
 4136 
 4137                 } else {
 4138                         ARCSTAT_BUMP(arcstat_mutex_miss);
 4139                 }
 4140         }
 4141 
 4142         multilist_sublist_unlock(mls);
 4143 
 4144         /*
 4145          * Increment the count of evicted bytes, and wake up any threads that
 4146          * are waiting for the count to reach this value.  Since the list is
 4147          * ordered by ascending aew_count, we pop off the beginning of the
 4148          * list until we reach the end, or a waiter that's past the current
 4149          * "count".  Doing this outside the loop reduces the number of times
 4150          * we need to acquire the global arc_evict_lock.
 4151          *
 4152          * Only wake when there's sufficient free memory in the system
 4153          * (specifically, arc_sys_free/2, which by default is a bit more than
 4154          * 1/64th of RAM).  See the comments in arc_wait_for_eviction().
 4155          */
 4156         mutex_enter(&arc_evict_lock);
 4157         arc_evict_count += real_evicted;
 4158 
 4159         if (arc_free_memory() > arc_sys_free / 2) {
 4160                 arc_evict_waiter_t *aw;
 4161                 while ((aw = list_head(&arc_evict_waiters)) != NULL &&
 4162                     aw->aew_count <= arc_evict_count) {
 4163                         list_remove(&arc_evict_waiters, aw);
 4164                         cv_broadcast(&aw->aew_cv);
 4165                 }
 4166         }
 4167         arc_set_need_free();
 4168         mutex_exit(&arc_evict_lock);
 4169 
 4170         /*
 4171          * If the ARC size is reduced from arc_c_max to arc_c_min (especially
 4172          * if the average cached block is small), eviction can be on-CPU for
 4173          * many seconds.  To ensure that other threads that may be bound to
 4174          * this CPU are able to make progress, make a voluntary preemption
 4175          * call here.
 4176          */
 4177         kpreempt(KPREEMPT_SYNC);
 4178 
 4179         return (bytes_evicted);
 4180 }
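
      /*
       * Worked example of the wakeup logic above, with hypothetical
       * counts: if arc_evict_waiters holds waiters registered at
       * aew_count 100 and 250, and this pass advances arc_evict_count
       * from 150 to 180, only the first waiter is removed and
       * signalled; the second stays queued until a later pass pushes
       * arc_evict_count past 250 (and free memory is above
       * arc_sys_free / 2).
       */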
 4181 
 4182 /*
 4183  * Allocate an array of buffer headers used as placeholders during arc state
 4184  * eviction.
 4185  */
 4186 static arc_buf_hdr_t **
 4187 arc_state_alloc_markers(int count)
 4188 {
 4189         arc_buf_hdr_t **markers;
 4190 
 4191         markers = kmem_zalloc(sizeof (*markers) * count, KM_SLEEP);
 4192         for (int i = 0; i < count; i++) {
 4193                 markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
 4194 
 4195                 /*
 4196                  * A b_spa of 0 is used to indicate that this header is
 4197                  * a marker. This fact is used in arc_evict_type() and
 4198                  * arc_evict_state_impl().
 4199                  */
 4200                 markers[i]->b_spa = 0;
 4201 
 4202         }
 4203         return (markers);
 4204 }
 4205 
 4206 static void
 4207 arc_state_free_markers(arc_buf_hdr_t **markers, int count)
 4208 {
 4209         for (int i = 0; i < count; i++)
 4210                 kmem_cache_free(hdr_full_cache, markers[i]);
 4211         kmem_free(markers, sizeof (*markers) * count);
 4212 }
 4213 
 4214 /*
 4215  * Evict buffers from the given arc state, until we've removed the
 4216  * specified number of bytes. Move the removed buffers to the
 4217  * appropriate evict state.
 4218  *
 4219  * This function makes a "best effort". It skips over any buffers
 4220  * it can't get a hash_lock on, and so, may not catch all candidates.
 4221  * It may also return without evicting as much space as requested.
 4222  *
 4223  * If bytes is specified using the special value ARC_EVICT_ALL, this
 4224  * will evict all available (i.e. unlocked and evictable) buffers from
 4225  * the given arc state; which is used by arc_flush().
 4226  */
 4227 static uint64_t
 4228 arc_evict_state(arc_state_t *state, uint64_t spa, uint64_t bytes,
 4229     arc_buf_contents_t type)
 4230 {
 4231         uint64_t total_evicted = 0;
 4232         multilist_t *ml = &state->arcs_list[type];
 4233         int num_sublists;
 4234         arc_buf_hdr_t **markers;
 4235 
 4236         num_sublists = multilist_get_num_sublists(ml);
 4237 
 4238         /*
 4239          * If we've tried to evict from each sublist, made some
 4240          * progress, but still have not hit the target number of bytes
 4241          * to evict, we want to keep trying. The markers allow us to
 4242          * pick up where we left off for each individual sublist, rather
 4243          * than starting from the tail each time.
 4244          */
 4245         if (zthr_iscurthread(arc_evict_zthr)) {
 4246                 markers = arc_state_evict_markers;
 4247                 ASSERT3S(num_sublists, <=, arc_state_evict_marker_count);
 4248         } else {
 4249                 markers = arc_state_alloc_markers(num_sublists);
 4250         }
 4251         for (int i = 0; i < num_sublists; i++) {
 4252                 multilist_sublist_t *mls;
 4253 
 4254                 mls = multilist_sublist_lock(ml, i);
 4255                 multilist_sublist_insert_tail(mls, markers[i]);
 4256                 multilist_sublist_unlock(mls);
 4257         }
 4258 
 4259         /*
 4260          * While we haven't hit our target number of bytes to evict, or
 4261          * we're evicting all available buffers.
 4262          */
 4263         while (total_evicted < bytes) {
 4264                 int sublist_idx = multilist_get_random_index(ml);
 4265                 uint64_t scan_evicted = 0;
 4266 
 4267                 /*
 4268                  * Try to reduce pinned dnodes with a floor of arc_dnode_limit.
 4269                  * Request that 10% of the LRUs be scanned by the superblock
 4270                  * shrinker.
 4271                  */
 4272                 if (type == ARC_BUFC_DATA && aggsum_compare(
 4273                     &arc_sums.arcstat_dnode_size, arc_dnode_size_limit) > 0) {
 4274                         arc_prune_async((aggsum_upper_bound(
 4275                             &arc_sums.arcstat_dnode_size) -
 4276                             arc_dnode_size_limit) / sizeof (dnode_t) /
 4277                             zfs_arc_dnode_reduce_percent);
 4278                 }
 4279 
 4280                 /*
 4281                  * Start eviction using a randomly selected sublist,
 4282                  * this is to try and evenly balance eviction across all
 4283                  * sublists. Always starting at the same sublist
 4284                  * (e.g. index 0) would cause evictions to favor certain
 4285                  * sublists over others.
 4286                  */
 4287                 for (int i = 0; i < num_sublists; i++) {
 4288                         uint64_t bytes_remaining;
 4289                         uint64_t bytes_evicted;
 4290 
 4291                         if (total_evicted < bytes)
 4292                                 bytes_remaining = bytes - total_evicted;
 4293                         else
 4294                                 break;
 4295 
 4296                         bytes_evicted = arc_evict_state_impl(ml, sublist_idx,
 4297                             markers[sublist_idx], spa, bytes_remaining);
 4298 
 4299                         scan_evicted += bytes_evicted;
 4300                         total_evicted += bytes_evicted;
 4301 
 4302                         /* we've reached the end, wrap to the beginning */
 4303                         if (++sublist_idx >= num_sublists)
 4304                                 sublist_idx = 0;
 4305                 }
 4306 
 4307                 /*
 4308                  * If we didn't evict anything during this scan, we have
 4309                  * no reason to believe we'll evict more during another
 4310                  * scan, so break the loop.
 4311                  */
 4312                 if (scan_evicted == 0) {
 4313                         /* This isn't possible, let's make that obvious */
 4314                         ASSERT3S(bytes, !=, 0);
 4315 
 4316                         /*
 4317                          * When bytes is ARC_EVICT_ALL, the only way to
 4318                          * break the loop is when scan_evicted is zero.
 4319                          * In that case, we actually have evicted enough,
 4320                          * so we don't want to increment the kstat.
 4321                          */
 4322                         if (bytes != ARC_EVICT_ALL) {
 4323                                 ASSERT3S(total_evicted, <, bytes);
 4324                                 ARCSTAT_BUMP(arcstat_evict_not_enough);
 4325                         }
 4326 
 4327                         break;
 4328                 }
 4329         }
 4330 
 4331         for (int i = 0; i < num_sublists; i++) {
 4332                 multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
 4333                 multilist_sublist_remove(mls, markers[i]);
 4334                 multilist_sublist_unlock(mls);
 4335         }
 4336         if (markers != arc_state_evict_markers)
 4337                 arc_state_free_markers(markers, num_sublists);
 4338 
 4339         return (total_evicted);
 4340 }
 4341 
 4342 /*
 4343  * Flush all "evictable" data of the given type from the arc state
 4344  * specified. This will not evict any "active" buffers (i.e. referenced).
 4345  *
 4346  * When 'retry' is set to B_FALSE, the function will make a single pass
 4347  * over the state and evict any buffers that it can. Since it doesn't
 4348  * continually retry the eviction, it might end up leaving some buffers
 4349  * in the ARC due to lock misses.
 4350  *
 4351  * When 'retry' is set to B_TRUE, the function will continually retry the
 4352  * eviction until *all* evictable buffers have been removed from the
 4353  * state. As a result, if concurrent insertions into the state are
 4354  * allowed (e.g. if the ARC isn't shutting down), this function might
 4355  * wind up in an infinite loop, continually trying to evict buffers.
 4356  */
 4357 static uint64_t
 4358 arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type,
 4359     boolean_t retry)
 4360 {
 4361         uint64_t evicted = 0;
 4362 
 4363         while (zfs_refcount_count(&state->arcs_esize[type]) != 0) {
 4364                 evicted += arc_evict_state(state, spa, ARC_EVICT_ALL, type);
 4365 
 4366                 if (!retry)
 4367                         break;
 4368         }
 4369 
 4370         return (evicted);
 4371 }
 4372 
 4373 /*
 4374  * Evict the specified number of bytes from the state specified,
 4375  * restricting eviction to the spa and type given. This function
 4376  * prevents us from trying to evict more from a state's list than
 4377  * is "evictable", and to skip evicting altogether when passed a
 4378  * negative value for "bytes". In contrast, arc_evict_state() will
 4379  * evict everything it can, when passed a negative value for "bytes".
 4380  */
 4381 static uint64_t
 4382 arc_evict_impl(arc_state_t *state, uint64_t spa, int64_t bytes,
 4383     arc_buf_contents_t type)
 4384 {
 4385         uint64_t delta;
 4386 
 4387         if (bytes > 0 && zfs_refcount_count(&state->arcs_esize[type]) > 0) {
 4388                 delta = MIN(zfs_refcount_count(&state->arcs_esize[type]),
 4389                     bytes);
 4390                 return (arc_evict_state(state, spa, delta, type));
 4391         }
 4392 
 4393         return (0);
 4394 }
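
      /*
       * Example of the clamping above, with hypothetical values: if the
       * state has 10MB of evictable buffers of the given type, a
       * request for bytes = 64MB is trimmed to
       * delta = MIN(10MB, 64MB) = 10MB before calling arc_evict_state(),
       * while a negative bytes value skips eviction entirely and
       * returns 0 instead of being reinterpreted as a huge unsigned
       * request.
       */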
 4395 
 4396 /*
 4397  * The goal of this function is to evict enough meta data buffers from the
 4398  * ARC in order to enforce the arc_meta_limit.  Achieving this is slightly
 4399  * more complicated than it appears because it is common for data buffers
 4400  * to have holds on meta data buffers.  In addition, dnode meta data buffers
 4401  * will be held by the dnodes in the block preventing them from being freed.
 4402  * This means we can't simply traverse the ARC and expect to always find
 4403  * enough unheld meta data buffer to release.
 4404  *
 4405  * Therefore, this function has been updated to make alternating passes
 4406  * over the ARC releasing data buffers and then newly unheld meta data
 4407  * buffers.  This ensures forward progress is maintained and meta_used
 4408  * will decrease.  Normally this is sufficient, but if required the ARC
 4409  * will call the registered prune callbacks causing dentries and inodes to
 4410  * be dropped from the VFS cache.  This will make dnode meta data buffers
 4411  * available for reclaim.
 4412  */
 4413 static uint64_t
 4414 arc_evict_meta_balanced(uint64_t meta_used)
 4415 {
 4416         int64_t delta, adjustmnt;
 4417         uint64_t total_evicted = 0, prune = 0;
 4418         arc_buf_contents_t type = ARC_BUFC_DATA;
 4419         uint_t restarts = zfs_arc_meta_adjust_restarts;
 4420 
 4421 restart:
 4422         /*
 4423          * This differs slightly from the way we evict from the mru in
 4424          * arc_evict because we don't have a "target" value (i.e. no
 4425          * "meta" arc_p). As a result, I think we can completely
 4426          * cannibalize the metadata in the MRU before we evict the
 4427          * metadata from the MFU. I think we probably need to implement a
 4428          * "metadata arc_p" value to do this properly.
 4429          */
 4430         adjustmnt = meta_used - arc_meta_limit;
 4431 
 4432         if (adjustmnt > 0 &&
 4433             zfs_refcount_count(&arc_mru->arcs_esize[type]) > 0) {
 4434                 delta = MIN(zfs_refcount_count(&arc_mru->arcs_esize[type]),
 4435                     adjustmnt);
 4436                 total_evicted += arc_evict_impl(arc_mru, 0, delta, type);
 4437                 adjustmnt -= delta;
 4438         }
 4439 
 4440         /*
 4441          * We can't afford to recalculate adjustmnt here. If we do,
 4442          * new metadata buffers can sneak into the MRU or ANON lists,
 4443          * thus penalizing the MFU metadata. Although the fudge factor is
 4444          * small, it has been empirically shown to be significant for
 4445          * certain workloads (e.g. creating many empty directories). As
 4446          * such, we use the original calculation for adjustmnt, and
 4447          * simply decrement the amount of data evicted from the MRU.
 4448          */
 4449 
 4450         if (adjustmnt > 0 &&
 4451             zfs_refcount_count(&arc_mfu->arcs_esize[type]) > 0) {
 4452                 delta = MIN(zfs_refcount_count(&arc_mfu->arcs_esize[type]),
 4453                     adjustmnt);
 4454                 total_evicted += arc_evict_impl(arc_mfu, 0, delta, type);
 4455         }
 4456 
 4457         adjustmnt = meta_used - arc_meta_limit;
 4458 
 4459         if (adjustmnt > 0 &&
 4460             zfs_refcount_count(&arc_mru_ghost->arcs_esize[type]) > 0) {
 4461                 delta = MIN(adjustmnt,
 4462                     zfs_refcount_count(&arc_mru_ghost->arcs_esize[type]));
 4463                 total_evicted += arc_evict_impl(arc_mru_ghost, 0, delta, type);
 4464                 adjustmnt -= delta;
 4465         }
 4466 
 4467         if (adjustmnt > 0 &&
 4468             zfs_refcount_count(&arc_mfu_ghost->arcs_esize[type]) > 0) {
 4469                 delta = MIN(adjustmnt,
 4470                     zfs_refcount_count(&arc_mfu_ghost->arcs_esize[type]));
 4471                 total_evicted += arc_evict_impl(arc_mfu_ghost, 0, delta, type);
 4472         }
 4473 
 4474         /*
 4475          * If after attempting to make the requested adjustment to the ARC
 4476          * the meta limit is still being exceeded then request that the
 4477          * higher layers drop some cached objects which have holds on ARC
 4478          * meta buffers.  Requests to the upper layers will be made with
 4479          * increasingly large scan sizes until the ARC is below the limit.
 4480          */
 4481         if (meta_used > arc_meta_limit || arc_available_memory() < 0) {
 4482                 if (type == ARC_BUFC_DATA) {
 4483                         type = ARC_BUFC_METADATA;
 4484                 } else {
 4485                         type = ARC_BUFC_DATA;
 4486 
 4487                         if (zfs_arc_meta_prune) {
 4488                                 prune += zfs_arc_meta_prune;
 4489                                 arc_prune_async(prune);
 4490                         }
 4491                 }
 4492 
 4493                 if (restarts > 0) {
 4494                         restarts--;
 4495                         goto restart;
 4496                 }
 4497         }
 4498         return (total_evicted);
 4499 }
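
      /*
       * Sketch of one pass of the loop above, with hypothetical
       * numbers: if meta_used = 600MB and arc_meta_limit = 512MB, then
       * adjustmnt = 88MB.  The first pass (type == ARC_BUFC_DATA)
       * requests up to 88MB from the MRU list, any remainder from the
       * MFU list, and then repeats the calculation against the two
       * ghost lists.  If meta_used is still over the limit, the next
       * restart switches to ARC_BUFC_METADATA, and the one after that
       * switches back to data while growing the accumulated
       * arc_prune_async() request, until the limit is met or the
       * allowed restarts are exhausted.
       */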
 4500 
 4501 /*
 4502  * Evict metadata buffers from the cache, such that arcstat_meta_used is
 4503  * capped by the arc_meta_limit tunable.
 4504  */
 4505 static uint64_t
 4506 arc_evict_meta_only(uint64_t meta_used)
 4507 {
 4508         uint64_t total_evicted = 0;
 4509         int64_t target;
 4510 
 4511         /*
 4512          * If we're over the meta limit, we want to evict enough
 4513          * metadata to get back under the meta limit. We don't want to
 4514          * evict so much that we drop the MRU below arc_p, though. If
 4515          * we're over the meta limit more than we're over arc_p, we
 4516          * evict some from the MRU here, and some from the MFU below.
 4517          */
 4518         target = MIN((int64_t)(meta_used - arc_meta_limit),
 4519             (int64_t)(zfs_refcount_count(&arc_anon->arcs_size) +
 4520             zfs_refcount_count(&arc_mru->arcs_size) - arc_p));
 4521 
 4522         total_evicted += arc_evict_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
 4523 
 4524         /*
 4525          * Similar to the above, we want to evict enough bytes to get us
 4526          * below the meta limit, but not so much as to drop us below the
 4527          * space allotted to the MFU (which is defined as arc_c - arc_p).
 4528          */
 4529         target = MIN((int64_t)(meta_used - arc_meta_limit),
 4530             (int64_t)(zfs_refcount_count(&arc_mfu->arcs_size) -
 4531             (arc_c - arc_p)));
 4532 
 4533         total_evicted += arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
 4534 
 4535         return (total_evicted);
 4536 }
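
      /*
       * Worked example of the MRU target above, with hypothetical
       * sizes: if meta_used = 600MB and arc_meta_limit = 512MB we are
       * 88MB over the limit; with anon + mru = 3GB and arc_p = 2GB
       * there is 1GB of headroom above arc_p, so
       * target = MIN(88MB, 1GB) = 88MB of metadata is requested from
       * the MRU.  Had the headroom been only 32MB, the MRU pass would
       * stop there, and the MFU pass would independently request up to
       * the same 88MB, bounded by how far the MFU is above its own
       * allotment of arc_c - arc_p.
       */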
 4537 
 4538 static uint64_t
 4539 arc_evict_meta(uint64_t meta_used)
 4540 {
 4541         if (zfs_arc_meta_strategy == ARC_STRATEGY_META_ONLY)
 4542                 return (arc_evict_meta_only(meta_used));
 4543         else
 4544                 return (arc_evict_meta_balanced(meta_used));
 4545 }
 4546 
 4547 /*
 4548  * Return the type of the oldest buffer in the given arc state
 4549  *
 4550  * This function will select a random sublist of type ARC_BUFC_DATA and
 4551  * a random sublist of type ARC_BUFC_METADATA. The tail of each sublist
 4552  * is compared, and the type which contains the "older" buffer will be
 4553  * returned.
 4554  */
 4555 static arc_buf_contents_t
 4556 arc_evict_type(arc_state_t *state)
 4557 {
 4558         multilist_t *data_ml = &state->arcs_list[ARC_BUFC_DATA];
 4559         multilist_t *meta_ml = &state->arcs_list[ARC_BUFC_METADATA];
 4560         int data_idx = multilist_get_random_index(data_ml);
 4561         int meta_idx = multilist_get_random_index(meta_ml);
 4562         multilist_sublist_t *data_mls;
 4563         multilist_sublist_t *meta_mls;
 4564         arc_buf_contents_t type;
 4565         arc_buf_hdr_t *data_hdr;
 4566         arc_buf_hdr_t *meta_hdr;
 4567 
 4568         /*
 4569          * We keep the sublist lock until we're finished, to prevent
 4570          * the headers from being destroyed via arc_evict_state().
 4571          */
 4572         data_mls = multilist_sublist_lock(data_ml, data_idx);
 4573         meta_mls = multilist_sublist_lock(meta_ml, meta_idx);
 4574 
 4575         /*
 4576          * These two loops are to ensure we skip any markers that
 4577          * might be at the tail of the lists due to arc_evict_state().
 4578          */
 4579 
 4580         for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL;
 4581             data_hdr = multilist_sublist_prev(data_mls, data_hdr)) {
 4582                 if (data_hdr->b_spa != 0)
 4583                         break;
 4584         }
 4585 
 4586         for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL;
 4587             meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) {
 4588                 if (meta_hdr->b_spa != 0)
 4589                         break;
 4590         }
 4591 
 4592         if (data_hdr == NULL && meta_hdr == NULL) {
 4593                 type = ARC_BUFC_DATA;
 4594         } else if (data_hdr == NULL) {
 4595                 ASSERT3P(meta_hdr, !=, NULL);
 4596                 type = ARC_BUFC_METADATA;
 4597         } else if (meta_hdr == NULL) {
 4598                 ASSERT3P(data_hdr, !=, NULL);
 4599                 type = ARC_BUFC_DATA;
 4600         } else {
 4601                 ASSERT3P(data_hdr, !=, NULL);
 4602                 ASSERT3P(meta_hdr, !=, NULL);
 4603 
 4604                 /* The headers can't be on the sublist without an L1 header */
 4605                 ASSERT(HDR_HAS_L1HDR(data_hdr));
 4606                 ASSERT(HDR_HAS_L1HDR(meta_hdr));
 4607 
 4608                 if (data_hdr->b_l1hdr.b_arc_access <
 4609                     meta_hdr->b_l1hdr.b_arc_access) {
 4610                         type = ARC_BUFC_DATA;
 4611                 } else {
 4612                         type = ARC_BUFC_METADATA;
 4613                 }
 4614         }
 4615 
 4616         multilist_sublist_unlock(meta_mls);
 4617         multilist_sublist_unlock(data_mls);
 4618 
 4619         return (type);
 4620 }
 4621 
 4622 /*
 4623  * Evict buffers from the cache, such that arcstat_size is capped by arc_c.
 4624  */
 4625 static uint64_t
 4626 arc_evict(void)
 4627 {
 4628         uint64_t total_evicted = 0;
 4629         uint64_t bytes;
 4630         int64_t target;
 4631         uint64_t asize = aggsum_value(&arc_sums.arcstat_size);
 4632         uint64_t ameta = aggsum_value(&arc_sums.arcstat_meta_used);
 4633 
 4634         /*
 4635          * If we're over arc_meta_limit, we want to correct that before
 4636          * potentially evicting data buffers below.
 4637          */
 4638         total_evicted += arc_evict_meta(ameta);
 4639 
 4640         /*
 4641          * Adjust MRU size
 4642          *
 4643          * If we're over the target cache size, we want to evict enough
 4644          * from the list to get back to our target size. We don't want
 4645          * to evict too much from the MRU, such that it drops below
 4646          * arc_p. So, if we're over our target cache size more than
 4647          * the MRU is over arc_p, we'll evict enough to get back to
 4648          * arc_p here, and then evict more from the MFU below.
 4649          */
 4650         target = MIN((int64_t)(asize - arc_c),
 4651             (int64_t)(zfs_refcount_count(&arc_anon->arcs_size) +
 4652             zfs_refcount_count(&arc_mru->arcs_size) + ameta - arc_p));
 4653 
 4654         /*
 4655          * If we're below arc_meta_min, always prefer to evict data.
 4656          * Otherwise, try to satisfy the requested number of bytes to
 4657          * evict from the type which contains older buffers; in an
 4658          * effort to keep newer buffers in the cache regardless of their
 4659          * type. If we cannot satisfy the number of bytes from this
 4660          * type, spill over into the next type.
 4661          */
 4662         if (arc_evict_type(arc_mru) == ARC_BUFC_METADATA &&
 4663             ameta > arc_meta_min) {
 4664                 bytes = arc_evict_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
 4665                 total_evicted += bytes;
 4666 
 4667                 /*
 4668                  * If we couldn't evict our target number of bytes from
 4669                  * metadata, we try to get the rest from data.
 4670                  */
 4671                 target -= bytes;
 4672 
 4673                 total_evicted +=
 4674                     arc_evict_impl(arc_mru, 0, target, ARC_BUFC_DATA);
 4675         } else {
 4676                 bytes = arc_evict_impl(arc_mru, 0, target, ARC_BUFC_DATA);
 4677                 total_evicted += bytes;
 4678 
 4679                 /*
 4680                  * If we couldn't evict our target number of bytes from
 4681                  * data, we try to get the rest from metadata.
 4682                  */
 4683                 target -= bytes;
 4684 
 4685                 total_evicted +=
 4686                     arc_evict_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
 4687         }
 4688 
 4689         /*
 4690          * Re-sum ARC stats after the first round of evictions.
 4691          */
 4692         asize = aggsum_value(&arc_sums.arcstat_size);
 4693         ameta = aggsum_value(&arc_sums.arcstat_meta_used);
 4694 
 4695 
 4696         /*
 4697          * Adjust MFU size
 4698          *
 4699          * Now that we've tried to evict enough from the MRU to get its
 4700          * size back to arc_p, if we're still above the target cache
 4701          * size, we evict the rest from the MFU.
 4702          */
 4703         target = asize - arc_c;
 4704 
 4705         if (arc_evict_type(arc_mfu) == ARC_BUFC_METADATA &&
 4706             ameta > arc_meta_min) {
 4707                 bytes = arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
 4708                 total_evicted += bytes;
 4709 
 4710                 /*
 4711                  * If we couldn't evict our target number of bytes from
 4712                  * metadata, we try to get the rest from data.
 4713                  */
 4714                 target -= bytes;
 4715 
 4716                 total_evicted +=
 4717                     arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
 4718         } else {
 4719                 bytes = arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
 4720                 total_evicted += bytes;
 4721 
 4722                 /*
 4723                  * If we couldn't evict our target number of bytes from
 4724                  * data, we try to get the rest from metadata.
 4725                  */
 4726                 target -= bytes;
 4727 
 4728                 total_evicted +=
 4729                     arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
 4730         }
 4731 
 4732         /*
 4733          * Adjust ghost lists
 4734          *
 4735          * In addition to the above, the ARC also defines target values
 4736          * for the ghost lists. The sum of the mru list and mru ghost
 4737          * list should never exceed the target size of the cache, and
 4738          * the sum of the mru list, mfu list, mru ghost list, and mfu
 4739          * ghost list should never exceed twice the target size of the
 4740          * cache. The following logic enforces these limits on the ghost
 4741          * caches, and evicts from them as needed.
 4742          */
 4743         target = zfs_refcount_count(&arc_mru->arcs_size) +
 4744             zfs_refcount_count(&arc_mru_ghost->arcs_size) - arc_c;
 4745 
 4746         bytes = arc_evict_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
 4747         total_evicted += bytes;
 4748 
 4749         target -= bytes;
 4750 
 4751         total_evicted +=
 4752             arc_evict_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
 4753 
 4754         /*
 4755          * We assume the sum of the mru list and mfu list is less than
 4756          * or equal to arc_c (we enforced this above), which means we
 4757          * can use the simpler of the two equations below:
 4758          *
 4759          *      mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
 4760          *                  mru ghost + mfu ghost <= arc_c
 4761          */
 4762         target = zfs_refcount_count(&arc_mru_ghost->arcs_size) +
 4763             zfs_refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;
 4764 
 4765         bytes = arc_evict_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
 4766         total_evicted += bytes;
 4767 
 4768         target -= bytes;
 4769 
 4770         total_evicted +=
 4771             arc_evict_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
 4772 
 4773         return (total_evicted);
 4774 }
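
      /*
       * Worked example of the ghost-list targets above, with
       * hypothetical sizes and arc_c = 4GB: if mru = 3GB and
       * mru ghost = 2GB, their sum exceeds arc_c by 1GB, so up to 1GB
       * is evicted from the mru ghost list (data first, then
       * metadata).  If mru ghost + mfu ghost then still totals 5GB,
       * another 1GB is trimmed from the mfu ghost list to restore
       * mru ghost + mfu ghost <= arc_c.
       */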
 4775 
 4776 void
 4777 arc_flush(spa_t *spa, boolean_t retry)
 4778 {
 4779         uint64_t guid = 0;
 4780 
 4781         /*
 4782          * If retry is B_TRUE, a spa must not be specified since we have
 4783          * no good way to determine if all of a spa's buffers have been
 4784          * evicted from an arc state.
 4785          */
 4786         ASSERT(!retry || spa == NULL);
 4787 
 4788         if (spa != NULL)
 4789                 guid = spa_load_guid(spa);
 4790 
 4791         (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
 4792         (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);
 4793 
 4794         (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
 4795         (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);
 4796 
 4797         (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
 4798         (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);
 4799 
 4800         (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
 4801         (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);
 4802 
 4803         (void) arc_flush_state(arc_uncached, guid, ARC_BUFC_DATA, retry);
 4804         (void) arc_flush_state(arc_uncached, guid, ARC_BUFC_METADATA, retry);
 4805 }
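
      /*
       * Usage sketch implied by the assertion above (illustrative, not
       * an exhaustive list of callers): a per-pool flush must be a
       * best-effort single pass, while a global flush may retry until
       * every evictable buffer is gone:
       *
       *      arc_flush(spa, B_FALSE);        flush one pool, no retry
       *      arc_flush(NULL, B_TRUE);        flush everything, retry
       */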
 4806 
 4807 void
 4808 arc_reduce_target_size(int64_t to_free)
 4809 {
 4810         uint64_t asize = aggsum_value(&arc_sums.arcstat_size);
 4811 
 4812         /*
 4813          * All callers want the ARC to actually evict (at least) this much
 4814          * memory.  Therefore we reduce from the lower of the current size and
 4815          * the target size.  This way, even if arc_c is much higher than
 4816          * arc_size (as can be the case after many calls to arc_freed()), we will
 4817          * immediately have arc_c < arc_size and therefore the arc_evict_zthr
 4818          * will evict.
 4819          */
 4820         uint64_t c = MIN(arc_c, asize);
 4821 
 4822         if (c > to_free && c - to_free > arc_c_min) {
 4823                 arc_c = c - to_free;
 4824                 atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
 4825                 if (arc_p > arc_c)
 4826                         arc_p = (arc_c >> 1);
 4827                 ASSERT(arc_c >= arc_c_min);
 4828                 ASSERT((int64_t)arc_p >= 0);
 4829         } else {
 4830                 arc_c = arc_c_min;
 4831         }
 4832 
 4833         if (asize > arc_c) {
 4834                 /* See comment in arc_evict_cb_check() on why lock+flag */
 4835                 mutex_enter(&arc_evict_lock);
 4836                 arc_evict_needed = B_TRUE;
 4837                 mutex_exit(&arc_evict_lock);
 4838                 zthr_wakeup(arc_evict_zthr);
 4839         }
 4840 }
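
      /*
       * Worked example of the reduction above, with hypothetical
       * sizes: if arc_c = 8GB, arc_size = 6GB and to_free = 1GB, the
       * reduction starts from c = MIN(8GB, 6GB) = 6GB, so arc_c drops
       * to 5GB (assuming that is still above arc_c_min).  arc_size is
       * now above arc_c, so arc_evict_needed is set and the
       * arc_evict_zthr is woken to evict the requested amount.
       */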
 4841 
 4842 /*
 4843  * Determine if the system is under memory pressure and is asking
 4844  * to reclaim memory. A return value of B_TRUE indicates that the system
 4845  * is under memory pressure and that the arc should adjust accordingly.
 4846  */
 4847 boolean_t
 4848 arc_reclaim_needed(void)
 4849 {
 4850         return (arc_available_memory() < 0);
 4851 }
 4852 
 4853 void
 4854 arc_kmem_reap_soon(void)
 4855 {
 4856         size_t                  i;
 4857         kmem_cache_t            *prev_cache = NULL;
 4858         kmem_cache_t            *prev_data_cache = NULL;
 4859 
 4860 #ifdef _KERNEL
 4861         if ((aggsum_compare(&arc_sums.arcstat_meta_used,
 4862             arc_meta_limit) >= 0) && zfs_arc_meta_prune) {
 4863                 /*
 4864                  * We are exceeding our meta-data cache limit.
 4865                  * Prune some entries to release holds on meta-data.
 4866                  */
 4867                 arc_prune_async(zfs_arc_meta_prune);
 4868         }
 4869 #if defined(_ILP32)
 4870         /*
 4871          * Reclaim unused memory from all kmem caches.
 4872          */
 4873         kmem_reap();
 4874 #endif
 4875 #endif
 4876 
 4877         for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
 4878 #if defined(_ILP32)
 4879                 /* reach upper limit of cache size on 32-bit */
 4880                 if (zio_buf_cache[i] == NULL)
 4881                         break;
 4882 #endif
 4883                 if (zio_buf_cache[i] != prev_cache) {
 4884                         prev_cache = zio_buf_cache[i];
 4885                         kmem_cache_reap_now(zio_buf_cache[i]);
 4886                 }
 4887                 if (zio_data_buf_cache[i] != prev_data_cache) {
 4888                         prev_data_cache = zio_data_buf_cache[i];
 4889                         kmem_cache_reap_now(zio_data_buf_cache[i]);
 4890                 }
 4891         }
 4892         kmem_cache_reap_now(buf_cache);
 4893         kmem_cache_reap_now(hdr_full_cache);
 4894         kmem_cache_reap_now(hdr_l2only_cache);
 4895         kmem_cache_reap_now(zfs_btree_leaf_cache);
 4896         abd_cache_reap_now();
 4897 }
 4898 
 4899 static boolean_t
 4900 arc_evict_cb_check(void *arg, zthr_t *zthr)
 4901 {
 4902         (void) arg, (void) zthr;
 4903 
 4904 #ifdef ZFS_DEBUG
 4905         /*
 4906          * This is necessary in order to keep the kstat information
 4907          * up to date for tools that display kstat data such as the
 4908          * mdb ::arc dcmd and the Linux crash utility.  These tools
 4909          * typically do not call kstat's update function, but simply
 4910          * dump out stats from the most recent update.  Without
 4911          * this call, these commands may show stale stats for the
 4912          * anon, mru, mru_ghost, mfu, and mfu_ghost lists.  Even
 4913          * with this call, the data might be out of date if the
 4914          * evict thread hasn't been woken recently; but that should
 4915          * suffice.  The arc_state_t structures can be queried
 4916          * directly if more accurate information is needed.
 4917          */
 4918         if (arc_ksp != NULL)
 4919                 arc_ksp->ks_update(arc_ksp, KSTAT_READ);
 4920 #endif
 4921 
 4922         /*
 4923          * We have to rely on arc_wait_for_eviction() to tell us when to
 4924          * evict, rather than checking if we are overflowing here, so that we
 4925          * are sure to not leave arc_wait_for_eviction() waiting on aew_cv.
 4926          * If we have become "not overflowing" since arc_wait_for_eviction()
 4927          * checked, we need to wake it up.  We could broadcast the CV here,
 4928          * but arc_wait_for_eviction() may have not yet gone to sleep.  We
 4929          * would need to use a mutex to ensure that this function doesn't
 4930          * broadcast until arc_wait_for_eviction() has gone to sleep (e.g.
 4931          * the arc_evict_lock).  However, the lock ordering of such a lock
 4932          * would necessarily be incorrect with respect to the zthr_lock,
 4933          * which is held before this function is called, and is held by
 4934          * arc_wait_for_eviction() when it calls zthr_wakeup().
 4935          */
 4936         if (arc_evict_needed)
 4937                 return (B_TRUE);
 4938 
 4939         /*
 4940          * If we have buffers in uncached state, evict them periodically.
 4941          */
 4942         return ((zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_DATA]) +
 4943             zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]) &&
 4944             ddi_get_lbolt() - arc_last_uncached_flush >
 4945             MSEC_TO_TICK(arc_min_prefetch_ms / 2)));
 4946 }
 4947 
 4948 /*
 4949  * Keep arc_size under arc_c by running arc_evict which evicts data
 4950  * from the ARC.
 4951  */
 4952 static void
 4953 arc_evict_cb(void *arg, zthr_t *zthr)
 4954 {
 4955         (void) arg, (void) zthr;
 4956 
 4957         uint64_t evicted = 0;
 4958         fstrans_cookie_t cookie = spl_fstrans_mark();
 4959 
 4960         /* Always try to evict from uncached state. */
 4961         arc_last_uncached_flush = ddi_get_lbolt();
 4962         evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_DATA, B_FALSE);
 4963         evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_METADATA, B_FALSE);
 4964 
 4965         /* Evict from other states only if told to. */
 4966         if (arc_evict_needed)
 4967                 evicted += arc_evict();
 4968 
 4969         /*
 4970          * If evicted is zero, we couldn't evict anything
 4971          * via arc_evict(). This could be due to hash lock
 4972          * collisions, but more likely due to the majority of
 4973          * arc buffers being unevictable. Therefore, even if
 4974          * arc_size is above arc_c, another pass is unlikely to
 4975          * be helpful and could potentially cause us to enter an
 4976          * infinite loop.  Additionally, zthr_iscancelled() is
 4977          * checked here so that if the arc is shutting down, the
 4978          * broadcast will wake any remaining arc evict waiters.
 4979          */
 4980         mutex_enter(&arc_evict_lock);
 4981         arc_evict_needed = !zthr_iscancelled(arc_evict_zthr) &&
 4982             evicted > 0 && aggsum_compare(&arc_sums.arcstat_size, arc_c) > 0;
 4983         if (!arc_evict_needed) {
 4984                 /*
 4985                  * We're either no longer overflowing, or we
 4986                  * can't evict anything more, so we should wake
 4987                  * arc_get_data_impl() sooner.
 4988                  */
 4989                 arc_evict_waiter_t *aw;
 4990                 while ((aw = list_remove_head(&arc_evict_waiters)) != NULL) {
 4991                         cv_broadcast(&aw->aew_cv);
 4992                 }
 4993                 arc_set_need_free();
 4994         }
 4995         mutex_exit(&arc_evict_lock);
 4996         spl_fstrans_unmark(cookie);
 4997 }
 4998 
 4999 static boolean_t
 5000 arc_reap_cb_check(void *arg, zthr_t *zthr)
 5001 {
 5002         (void) arg, (void) zthr;
 5003 
 5004         int64_t free_memory = arc_available_memory();
 5005         static int reap_cb_check_counter = 0;
 5006 
 5007         /*
 5008          * If a kmem reap is already active, don't schedule more.  We must
 5009          * check for this because kmem_cache_reap_soon() won't actually
 5010          * block on the cache being reaped (this is to prevent callers from
 5011          * becoming implicitly blocked by a system-wide kmem reap -- which,
 5012          * on a system with many, many full magazines, can take minutes).
 5013          */
 5014         if (!kmem_cache_reap_active() && free_memory < 0) {
 5015 
 5016                 arc_no_grow = B_TRUE;
 5017                 arc_warm = B_TRUE;
 5018                 /*
 5019                  * Wait at least zfs_grow_retry (default 5) seconds
 5020                  * before considering growing.
 5021                  */
 5022                 arc_growtime = gethrtime() + SEC2NSEC(arc_grow_retry);
 5023                 return (B_TRUE);
 5024         } else if (free_memory < arc_c >> arc_no_grow_shift) {
 5025                 arc_no_grow = B_TRUE;
 5026         } else if (gethrtime() >= arc_growtime) {
 5027                 arc_no_grow = B_FALSE;
 5028         }
 5029 
 5030         /*
 5031          * Called unconditionally every 60 seconds to reclaim unused
 5032          * zstd compression and decompression context. This is done
 5033          * here to avoid the need for an independent thread.
 5034          */
 5035         if (!((reap_cb_check_counter++) % 60))
 5036                 zfs_zstd_cache_reap_now();
 5037 
 5038         return (B_FALSE);
 5039 }
 5040 
 5041 /*
 5042  * Keep enough free memory in the system by reaping the ARC's kmem
 5043  * caches.  To cause more slabs to be reapable, we may reduce the
 5044  * target size of the cache (arc_c), causing the arc_evict_cb()
 5045  * to free more buffers.
 5046  */
 5047 static void
 5048 arc_reap_cb(void *arg, zthr_t *zthr)
 5049 {
 5050         (void) arg, (void) zthr;
 5051 
 5052         int64_t free_memory;
 5053         fstrans_cookie_t cookie = spl_fstrans_mark();
 5054 
 5055         /*
 5056          * Kick off asynchronous kmem_reap()'s of all our caches.
 5057          */
 5058         arc_kmem_reap_soon();
 5059 
 5060         /*
 5061          * Wait at least arc_kmem_cache_reap_retry_ms between
 5062          * arc_kmem_reap_soon() calls. Without this check it is possible to
 5063          * end up in a situation where we spend lots of time reaping
 5064          * caches, while we're near arc_c_min.  Waiting here also gives the
 5065          * subsequent free memory check a chance of finding that the
 5066          * asynchronous reap has already freed enough memory, and we don't
 5067          * need to call arc_reduce_target_size().
 5068          */
 5069         delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000);
 5070 
 5071         /*
 5072          * Reduce the target size as needed to maintain the amount of free
 5073          * memory in the system at a fraction of the arc_size (1/128th by
 5074          * default).  If oversubscribed (free_memory < 0) then reduce the
 5075          * target arc_size by the deficit amount plus the fractional
 5076          * amount.  If free memory is positive but less than the fractional
 5077          * amount, reduce by what is needed to hit the fractional amount.
 5078          */
 5079         free_memory = arc_available_memory();
 5080 
 5081         int64_t can_free = arc_c - arc_c_min;
 5082         if (can_free > 0) {
 5083                 int64_t to_free = (can_free >> arc_shrink_shift) - free_memory;
 5084                 if (to_free > 0)
 5085                         arc_reduce_target_size(to_free);
 5086         }
 5087         spl_fstrans_unmark(cookie);
 5088 }
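
      /*
       * Worked example of the target reduction above, with hypothetical
       * values: if arc_c = 8GB and arc_c_min = 1GB, then can_free = 7GB
       * and, with the default 1/128 fraction from arc_shrink_shift, the
       * fractional amount is roughly 56MB.  If the system is
       * oversubscribed by 100MB (free_memory = -100MB), then
       * to_free = 56MB - (-100MB) = 156MB is passed to
       * arc_reduce_target_size(); if instead free_memory = 20MB, only
       * the remaining ~36MB needed to reach the fraction is requested.
       */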
 5089 
 5090 #ifdef _KERNEL
 5091 /*
 5092  * Determine the amount of memory eligible for eviction contained in the
 5093  * ARC. All clean data reported by the ghost lists can always be safely
 5094  * evicted. Due to arc_c_min, the same does not hold for all clean data
 5095  * contained by the regular mru and mfu lists.
 5096  *
 5097  * In the case of the regular mru and mfu lists, we need to report as
 5098  * much clean data as possible, such that evicting that same reported
 5099  * data will not bring arc_size below arc_c_min. Thus, in certain
 5100  * circumstances, the total amount of clean data in the mru and mfu
 5101  * lists might not actually be evictable.
 5102  *
 5103  * The following two distinct cases are accounted for:
 5104  *
 5105  * 1. The sum of the amount of dirty data contained by both the mru and
 5106  *    mfu lists, plus the ARC's other accounting (e.g. the anon list),
 5107  *    is greater than or equal to arc_c_min.
 5108  *    (i.e. amount of dirty data >= arc_c_min)
 5109  *
 5110  *    This is the easy case; all clean data contained by the mru and mfu
 5111  *    lists is evictable. Evicting all clean data can only drop arc_size
 5112  *    to the amount of dirty data, which is greater than arc_c_min.
 5113  *
 5114  * 2. The sum of the amount of dirty data contained by both the mru and
 5115  *    mfu lists, plus the ARC's other accounting (e.g. the anon list),
 5116  *    is less than arc_c_min.
 5117  *    (i.e. arc_c_min > amount of dirty data)
 5118  *
 5119  *    2.1. arc_size is greater than or equal to arc_c_min.
 5120  *         (i.e. arc_size >= arc_c_min > amount of dirty data)
 5121  *
 5122  *         In this case, not all clean data from the regular mru and mfu
 5123  *         lists is actually evictable; we must leave enough clean data
 5124  *         to keep arc_size above arc_c_min. Thus, the maximum amount of
 5125  *         evictable data from the two lists combined is exactly the
 5126  *         difference between arc_size and arc_c_min.
 5127  *
 5128  *    2.2. arc_size is less than arc_c_min
 5129  *         (i.e. arc_c_min > arc_size > amount of dirty data)
 5130  *
 5131  *         In this case, none of the data contained in the mru and mfu
 5132  *         lists is evictable, even if it's clean. Since arc_size is
 5133  *         already below arc_c_min, evicting any more would only
 5134  *         increase this negative difference.
 5135  */
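
/*
 * A minimal sketch (not part of the ARC implementation) of the case
 * analysis above.  The clean/dirty totals and current arc_size are supplied
 * by the caller purely for illustration; "arc_evictable_sketch" is a
 * hypothetical helper name.
 */
static inline uint64_t
arc_evictable_sketch(uint64_t clean, uint64_t dirty, uint64_t arc_size,
    uint64_t c_min)
{
        /* Case 1: dirty data alone keeps arc_size at or above arc_c_min. */
        if (dirty >= c_min)
                return (clean);
        /* Case 2.2: arc_size is already below arc_c_min; nothing to evict. */
        if (arc_size < c_min)
                return (0);
        /* Case 2.1: only the excess over arc_c_min may be reported. */
        return (MIN(clean, arc_size - c_min));
}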
 5136 
 5137 #endif /* _KERNEL */
 5138 
 5139 /*
 5140  * Adapt arc info given the number of bytes we are trying to add and
 5141  * the state that we are coming from.  This function is only called
 5142  * when we are adding new content to the cache.
 5143  */
 5144 static void
 5145 arc_adapt(int bytes, arc_state_t *state)
 5146 {
 5147         int mult;
 5148         uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
 5149         int64_t mrug_size = zfs_refcount_count(&arc_mru_ghost->arcs_size);
 5150         int64_t mfug_size = zfs_refcount_count(&arc_mfu_ghost->arcs_size);
 5151 
 5152         ASSERT(bytes > 0);
 5153         /*
 5154          * Adapt the target size of the MRU list:
 5155          *      - if we just hit in the MRU ghost list, then increase
 5156          *        the target size of the MRU list.
 5157          *      - if we just hit in the MFU ghost list, then increase
 5158          *        the target size of the MFU list by decreasing the
 5159          *        target size of the MRU list.
 5160          */
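        /*
         * Illustrative example with assumed sizes: if mrug_size = 100 MiB
         * and mfug_size = 400 MiB, an MRU ghost hit uses mult = 4, so a
         * 128 KiB access grows arc_p by 512 KiB (capped at arc_c -
         * arc_p_min).  With the sizes reversed, an MFU ghost hit would
         * shrink arc_p by the same amount (never below arc_p_min).
         */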
 5161         if (state == arc_mru_ghost) {
 5162                 mult = (mrug_size >= mfug_size) ? 1 : (mfug_size / mrug_size);
 5163                 if (!zfs_arc_p_dampener_disable)
 5164                         mult = MIN(mult, 10); /* avoid wild arc_p adjustment */
 5165 
 5166                 arc_p = MIN(arc_c - arc_p_min, arc_p + (uint64_t)bytes * mult);
 5167         } else if (state == arc_mfu_ghost) {
 5168                 uint64_t delta;
 5169 
 5170                 mult = (mfug_size >= mrug_size) ? 1 : (mrug_size / mfug_size);
 5171                 if (!zfs_arc_p_dampener_disable)
 5172                         mult = MIN(mult, 10);
 5173 
 5174                 delta = MIN(bytes * mult, arc_p);
 5175                 arc_p = MAX(arc_p_min, arc_p - delta);
 5176         }
 5177         ASSERT((int64_t)arc_p >= 0);
 5178 
 5179         /*
 5180          * Wake reap thread if we do not have any available memory
 5181          */
 5182         if (arc_reclaim_needed()) {
 5183                 zthr_wakeup(arc_reap_zthr);
 5184                 return;
 5185         }
 5186 
 5187         if (arc_no_grow)
 5188                 return;
 5189 
 5190         if (arc_c >= arc_c_max)
 5191                 return;
 5192 
 5193         /*
 5194          * If we're within (2 * maxblocksize) bytes of the target
 5195          * cache size, increment the target cache size
 5196          */
 5197         ASSERT3U(arc_c, >=, 2ULL << SPA_MAXBLOCKSHIFT);
 5198         if (aggsum_upper_bound(&arc_sums.arcstat_size) >=
 5199             arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
 5200                 atomic_add_64(&arc_c, (int64_t)bytes);
 5201                 if (arc_c > arc_c_max)
 5202                         arc_c = arc_c_max;
 5203                 else if (state == arc_anon && arc_p < arc_c >> 1)
 5204                         atomic_add_64(&arc_p, (int64_t)bytes);
 5205                 if (arc_p > arc_c)
 5206                         arc_p = arc_c;
 5207         }
 5208         ASSERT((int64_t)arc_p >= 0);
 5209 }
 5210 
 5211 /*
 5212  * Check if arc_size has grown past our upper threshold, determined by
 5213  * zfs_arc_overflow_shift.
 5214  */
 5215 static arc_ovf_level_t
 5216 arc_is_overflowing(boolean_t use_reserve)
 5217 {
 5218         /* Always allow at least one block of overflow */
 5219         int64_t overflow = MAX(SPA_MAXBLOCKSIZE,
 5220             arc_c >> zfs_arc_overflow_shift);
 5221 
 5222         /*
 5223          * We just compare the lower bound here for performance reasons. Our
 5224          * primary goals are to make sure that the arc never grows without
 5225          * bound, and that it can reach its maximum size. This check
 5226          * accomplishes both goals. The maximum amount we could run over by is
 5227          * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block
 5228          * in the ARC. In practice, that's in the tens of MB, which is low
 5229          * enough to be safe.
 5230          */
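        /*
         * Illustrative numbers (assumed): with arc_c = 8 GiB and
         * zfs_arc_overflow_shift = 8, overflow = MAX(16 MiB, 32 MiB) =
         * 32 MiB.  Sizes up to arc_c + 16 MiB are ARC_OVF_NONE; with
         * use_reserve, up to arc_c + 48 MiB are ARC_OVF_SOME and anything
         * beyond is ARC_OVF_SEVERE (without the reserve the SOME band is
         * half as wide).
         */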
 5231         int64_t over = aggsum_lower_bound(&arc_sums.arcstat_size) -
 5232             arc_c - overflow / 2;
 5233         if (!use_reserve)
 5234                 overflow /= 2;
 5235         return (over < 0 ? ARC_OVF_NONE :
 5236             over < overflow ? ARC_OVF_SOME : ARC_OVF_SEVERE);
 5237 }
 5238 
 5239 static abd_t *
 5240 arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, const void *tag,
 5241     int alloc_flags)
 5242 {
 5243         arc_buf_contents_t type = arc_buf_type(hdr);
 5244 
 5245         arc_get_data_impl(hdr, size, tag, alloc_flags);
 5246         if (alloc_flags & ARC_HDR_ALLOC_LINEAR)
 5247                 return (abd_alloc_linear(size, type == ARC_BUFC_METADATA));
 5248         else
 5249                 return (abd_alloc(size, type == ARC_BUFC_METADATA));
 5250 }
 5251 
 5252 static void *
 5253 arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, const void *tag)
 5254 {
 5255         arc_buf_contents_t type = arc_buf_type(hdr);
 5256 
 5257         arc_get_data_impl(hdr, size, tag, ARC_HDR_DO_ADAPT);
 5258         if (type == ARC_BUFC_METADATA) {
 5259                 return (zio_buf_alloc(size));
 5260         } else {
 5261                 ASSERT(type == ARC_BUFC_DATA);
 5262                 return (zio_data_buf_alloc(size));
 5263         }
 5264 }
 5265 
 5266 /*
 5267  * Wait for the specified amount of data (in bytes) to be evicted from the
 5268  * ARC, and for there to be sufficient free memory in the system.  Waiting for
 5269  * eviction ensures that the memory used by the ARC decreases.  Waiting for
 5270  * free memory ensures that the system won't run out of free pages, regardless
 5271  * of ARC behavior and settings.  See arc_lowmem_init().
 5272  */
 5273 void
 5274 arc_wait_for_eviction(uint64_t amount, boolean_t use_reserve)
 5275 {
 5276         switch (arc_is_overflowing(use_reserve)) {
 5277         case ARC_OVF_NONE:
 5278                 return;
 5279         case ARC_OVF_SOME:
 5280                 /*
 5281                  * This is a bit racy without taking arc_evict_lock, but the
 5282                  * worst that can happen is that we either call zthr_wakeup() an
 5283                  * extra time due to a race with another thread here, or that the
 5284                  * flag we set gets cleared by arc_evict_cb(), which is unlikely
 5285                  * given the big hysteresis, and is also unimportant since at this
 5286                  * level of overflow the eviction is purely advisory.  At the same
 5287                  * time, taking the global lock here on every call without waiting
 5288                  * for the actual eviction would create significant lock contention.
 5289                  */
 5290                 if (!arc_evict_needed) {
 5291                         arc_evict_needed = B_TRUE;
 5292                         zthr_wakeup(arc_evict_zthr);
 5293                 }
 5294                 return;
 5295         case ARC_OVF_SEVERE:
 5296         default:
 5297         {
 5298                 arc_evict_waiter_t aw;
 5299                 list_link_init(&aw.aew_node);
 5300                 cv_init(&aw.aew_cv, NULL, CV_DEFAULT, NULL);
 5301 
 5302                 uint64_t last_count = 0;
 5303                 mutex_enter(&arc_evict_lock);
 5304                 if (!list_is_empty(&arc_evict_waiters)) {
 5305                         arc_evict_waiter_t *last =
 5306                             list_tail(&arc_evict_waiters);
 5307                         last_count = last->aew_count;
 5308                 } else if (!arc_evict_needed) {
 5309                         arc_evict_needed = B_TRUE;
 5310                         zthr_wakeup(arc_evict_zthr);
 5311                 }
 5312                 /*
 5313                  * Note, the last waiter's count may be less than
 5314                  * arc_evict_count if we are low on memory, in which
 5315                  * case arc_evict_state_impl() may have deferred
 5316                  * wakeups (but still incremented arc_evict_count).
 5317                  */
 5318                 aw.aew_count = MAX(last_count, arc_evict_count) + amount;
 5319 
 5320                 list_insert_tail(&arc_evict_waiters, &aw);
 5321 
 5322                 arc_set_need_free();
 5323 
 5324                 DTRACE_PROBE3(arc__wait__for__eviction,
 5325                     uint64_t, amount,
 5326                     uint64_t, arc_evict_count,
 5327                     uint64_t, aw.aew_count);
 5328 
 5329                 /*
 5330                  * We will be woken up either when arc_evict_count reaches
 5331                  * aew_count, or when the ARC is no longer overflowing and
 5332                  * eviction completes.
 5333          * In the case of a "false" wakeup, we will still be on the list.
 5334                  */
 5335                 do {
 5336                         cv_wait(&aw.aew_cv, &arc_evict_lock);
 5337                 } while (list_link_active(&aw.aew_node));
 5338                 mutex_exit(&arc_evict_lock);
 5339 
 5340                 cv_destroy(&aw.aew_cv);
 5341         }
 5342         }
 5343 }
 5344 
 5345 /*
 5346  * Allocate a block and return it to the caller. If we are hitting the
 5347  * hard limit for the cache size, we must sleep, waiting for the eviction
 5348  * thread to catch up. If we're past the target size but below the hard
 5349  * limit, we'll only signal the reclaim thread and continue on.
 5350  */
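/*
 * In terms of arc_is_overflowing() above, ARC_OVF_SEVERE corresponds to the
 * "sleep until eviction catches up" case and ARC_OVF_SOME to the "signal the
 * eviction thread and continue" case; see arc_wait_for_eviction().
 */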
 5351 static void
 5352 arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag,
 5353     int alloc_flags)
 5354 {
 5355         arc_state_t *state = hdr->b_l1hdr.b_state;
 5356         arc_buf_contents_t type = arc_buf_type(hdr);
 5357 
 5358         if (alloc_flags & ARC_HDR_DO_ADAPT)
 5359                 arc_adapt(size, state);
 5360 
 5361         /*
 5362          * If arc_size is currently overflowing, we must be adding data
 5363          * faster than we are evicting.  To ensure we don't compound the
 5364          * problem by adding more data and forcing arc_size to grow even
 5365          * further past its target size, we wait for the eviction thread to
 5366          * make some progress.  We also wait for there to be sufficient free
 5367          * memory in the system, as measured by arc_free_memory().
 5368          *
 5369          * Specifically, we wait for zfs_arc_eviction_pct percent of the
 5370          * requested size to be evicted.  This should be more than 100%, to
 5371          * ensure that progress is also made towards getting arc_size
 5372          * under arc_c.  See the comment above zfs_arc_eviction_pct.
 5373          */
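        /*
         * For example (assumed tunable value), with zfs_arc_eviction_pct =
         * 200, a 128 KiB allocation waits for 256 KiB to be evicted.
         */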
 5374         arc_wait_for_eviction(size * zfs_arc_eviction_pct / 100,
 5375             alloc_flags & ARC_HDR_USE_RESERVE);
 5376 
 5377         VERIFY3U(hdr->b_type, ==, type);
 5378         if (type == ARC_BUFC_METADATA) {
 5379                 arc_space_consume(size, ARC_SPACE_META);
 5380         } else {
 5381                 arc_space_consume(size, ARC_SPACE_DATA);
 5382         }
 5383 
 5384         /*
 5385          * Update the state size.  Note that ghost states have a
 5386          * "ghost size" and so don't need to be updated.
 5387          */
 5388         if (!GHOST_STATE(state)) {
 5389 
 5390                 (void) zfs_refcount_add_many(&state->arcs_size, size, tag);
 5391 
 5392                 /*
 5393                  * If this is reached via arc_read, the link is
 5394                  * protected by the hash lock. If reached via
 5395                  * arc_buf_alloc, the header should not be accessed by
 5396                  * any other thread. And, if reached via arc_read_done,
 5397                  * the hash lock will protect it if it's found in the
 5398                  * hash table; otherwise no other thread should be
 5399                  * trying to [add|remove]_reference it.
 5400                  */
 5401                 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
 5402                         ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
 5403                         (void) zfs_refcount_add_many(&state->arcs_esize[type],
 5404                             size, tag);
 5405                 }
 5406 
 5407                 /*
 5408                  * If we are growing the cache, and we are adding anonymous
 5409                  * data, and we have outgrown arc_p, update arc_p
 5410                  */
 5411                 if (aggsum_upper_bound(&arc_sums.arcstat_size) < arc_c &&
 5412                     hdr->b_l1hdr.b_state == arc_anon &&
 5413                     (zfs_refcount_count(&arc_anon->arcs_size) +
 5414                     zfs_refcount_count(&arc_mru->arcs_size) > arc_p &&
 5415                     arc_p < arc_c >> 1))
 5416                         arc_p = MIN(arc_c, arc_p + size);
 5417         }
 5418 }
 5419 
 5420 static void
 5421 arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size,
 5422     const void *tag)
 5423 {
 5424         arc_free_data_impl(hdr, size, tag);
 5425         abd_free(abd);
 5426 }
 5427 
 5428 static void
 5429 arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, const void *tag)
 5430 {
 5431         arc_buf_contents_t type = arc_buf_type(hdr);
 5432 
 5433         arc_free_data_impl(hdr, size, tag);
 5434         if (type == ARC_BUFC_METADATA) {
 5435                 zio_buf_free(buf, size);
 5436         } else {
 5437                 ASSERT(type == ARC_BUFC_DATA);
 5438                 zio_data_buf_free(buf, size);
 5439         }
 5440 }
 5441 
 5442 /*
 5443  * Free the arc data buffer.
 5444  */
 5445 static void
 5446 arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag)
 5447 {
 5448         arc_state_t *state = hdr->b_l1hdr.b_state;
 5449         arc_buf_contents_t type = arc_buf_type(hdr);
 5450 
 5451         /* protected by hash lock, if in the hash table */
 5452         if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
 5453                 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
 5454                 ASSERT(state != arc_anon && state != arc_l2c_only);
 5455 
 5456                 (void) zfs_refcount_remove_many(&state->arcs_esize[type],
 5457                     size, tag);
 5458         }
 5459         (void) zfs_refcount_remove_many(&state->arcs_size, size, tag);
 5460 
 5461         VERIFY3U(hdr->b_type, ==, type);
 5462         if (type == ARC_BUFC_METADATA) {
 5463                 arc_space_return(size, ARC_SPACE_META);
 5464         } else {
 5465                 ASSERT(type == ARC_BUFC_DATA);
 5466                 arc_space_return(size, ARC_SPACE_DATA);
 5467         }
 5468 }
 5469 
 5470 /*
 5471  * This routine is called whenever a buffer is accessed.
 5472  */
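/*
 * Summary of the transitions performed below (descriptive only):
 *   anon       -> mru, or uncached for buffers marked HDR_UNCACHED
 *   mru        -> mfu once more than ARC_MINTIME has passed since the last
 *                 hit (no move for prefetches or reads still in progress)
 *   mru_ghost  -> mru if the previous access was a prefetch, otherwise mfu
 *   mfu        -> stays in mfu
 *   mfu_ghost  -> mfu
 *   uncached   -> stays in uncached
 *   l2c_only   -> mru
 */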
 5473 static void
 5474 arc_access(arc_buf_hdr_t *hdr, arc_flags_t arc_flags, boolean_t hit)
 5475 {
 5476         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
 5477         ASSERT(HDR_HAS_L1HDR(hdr));
 5478 
 5479         /*
 5480          * Update buffer prefetch status.
 5481          */
 5482         boolean_t was_prefetch = HDR_PREFETCH(hdr);
 5483         boolean_t now_prefetch = arc_flags & ARC_FLAG_PREFETCH;
 5484         if (was_prefetch != now_prefetch) {
 5485                 if (was_prefetch) {
 5486                         ARCSTAT_CONDSTAT(hit, demand_hit, demand_iohit,
 5487                             HDR_PRESCIENT_PREFETCH(hdr), prescient, predictive,
 5488                             prefetch);
 5489                 }
 5490                 if (HDR_HAS_L2HDR(hdr))
 5491                         l2arc_hdr_arcstats_decrement_state(hdr);
 5492                 if (was_prefetch) {
 5493                         arc_hdr_clear_flags(hdr,
 5494                             ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH);
 5495                 } else {
 5496                         arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
 5497                 }
 5498                 if (HDR_HAS_L2HDR(hdr))
 5499                         l2arc_hdr_arcstats_increment_state(hdr);
 5500         }
 5501         if (now_prefetch) {
 5502                 if (arc_flags & ARC_FLAG_PRESCIENT_PREFETCH) {
 5503                         arc_hdr_set_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH);
 5504                         ARCSTAT_BUMP(arcstat_prescient_prefetch);
 5505                 } else {
 5506                         ARCSTAT_BUMP(arcstat_predictive_prefetch);
 5507                 }
 5508         }
 5509         if (arc_flags & ARC_FLAG_L2CACHE)
 5510                 arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
 5511 
 5512         clock_t now = ddi_get_lbolt();
 5513         if (hdr->b_l1hdr.b_state == arc_anon) {
 5514                 arc_state_t     *new_state;
 5515                 /*
 5516                  * This buffer is not in the cache, and does not appear in
 5517                  * our "ghost" lists.  Add it to the MRU or uncached state.
 5518                  */
 5519                 ASSERT0(hdr->b_l1hdr.b_arc_access);
 5520                 hdr->b_l1hdr.b_arc_access = now;
 5521                 if (HDR_UNCACHED(hdr)) {
 5522                         new_state = arc_uncached;
 5523                         DTRACE_PROBE1(new_state__uncached, arc_buf_hdr_t *,
 5524                             hdr);
 5525                 } else {
 5526                         new_state = arc_mru;
 5527                         DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
 5528                 }
 5529                 arc_change_state(new_state, hdr);
 5530         } else if (hdr->b_l1hdr.b_state == arc_mru) {
 5531                 /*
 5532                  * This buffer has been accessed once recently and either
 5533                  * its read is still in progress or it is in the cache.
 5534                  */
 5535                 if (HDR_IO_IN_PROGRESS(hdr)) {
 5536                         hdr->b_l1hdr.b_arc_access = now;
 5537                         return;
 5538                 }
 5539                 hdr->b_l1hdr.b_mru_hits++;
 5540                 ARCSTAT_BUMP(arcstat_mru_hits);
 5541 
 5542                 /*
 5543                  * If the previous access was a prefetch, then it already
 5544                  * handled possible promotion, so nothing more to do for now.
 5545                  */
 5546                 if (was_prefetch) {
 5547                         hdr->b_l1hdr.b_arc_access = now;
 5548                         return;
 5549                 }
 5550 
 5551                 /*
 5552                  * If more than ARC_MINTIME have passed from the previous
 5553                  * hit, promote the buffer to the MFU state.
 5554                  */
 5555                 if (ddi_time_after(now, hdr->b_l1hdr.b_arc_access +
 5556                     ARC_MINTIME)) {
 5557                         hdr->b_l1hdr.b_arc_access = now;
 5558                         DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
 5559                         arc_change_state(arc_mfu, hdr);
 5560                 }
 5561         } else if (hdr->b_l1hdr.b_state == arc_mru_ghost) {
 5562                 arc_state_t     *new_state;
 5563                 /*
 5564                  * This buffer has been accessed once recently, but was
 5565                  * evicted from the cache.  Had the MRU been bigger, this
 5566                  * would have been an MRU hit, so handle it the same way,
 5567                  * except that we don't need to check the previous access time.
 5568                  */
 5569                 hdr->b_l1hdr.b_mru_ghost_hits++;
 5570                 ARCSTAT_BUMP(arcstat_mru_ghost_hits);
 5571                 hdr->b_l1hdr.b_arc_access = now;
 5572                 if (was_prefetch) {
 5573                         new_state = arc_mru;
 5574                         DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
 5575                 } else {
 5576                         new_state = arc_mfu;
 5577                         DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
 5578                 }
 5579                 arc_change_state(new_state, hdr);
 5580         } else if (hdr->b_l1hdr.b_state == arc_mfu) {
 5581                 /*
 5582                  * This buffer has been accessed more than once and is either
 5583                  * still in the cache or being restored from one of the ghost lists.
 5584                  */
 5585                 if (!HDR_IO_IN_PROGRESS(hdr)) {
 5586                         hdr->b_l1hdr.b_mfu_hits++;
 5587                         ARCSTAT_BUMP(arcstat_mfu_hits);
 5588                 }
 5589                 hdr->b_l1hdr.b_arc_access = now;
 5590         } else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) {
 5591                 /*
 5592                  * This buffer has been accessed more than once recently, but
 5593                  * has been evicted from the cache.  Had the MFU been bigger,
 5594                  * it would have stayed in the cache, so move it back to the MFU state.
 5595                  */
 5596                 hdr->b_l1hdr.b_mfu_ghost_hits++;
 5597                 ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
 5598                 hdr->b_l1hdr.b_arc_access = now;
 5599                 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
 5600                 arc_change_state(arc_mfu, hdr);
 5601         } else if (hdr->b_l1hdr.b_state == arc_uncached) {
 5602                 /*
 5603                  * This buffer is uncacheable, but we got a hit.  Probably
 5604                  * a demand read after prefetch.  Nothing more to do here.
 5605                  */
 5606                 if (!HDR_IO_IN_PROGRESS(hdr))
 5607                         ARCSTAT_BUMP(arcstat_uncached_hits);
 5608                 hdr->b_l1hdr.b_arc_access = now;
 5609         } else if (hdr->b_l1hdr.b_state == arc_l2c_only) {
 5610                 /*
 5611                  * This buffer is on the 2nd Level ARC and was not accessed
 5612                  * for a long time, so treat it as new and put it into the MRU.
 5613                  */
 5614                 hdr->b_l1hdr.b_arc_access = now;
 5615                 DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
 5616                 arc_change_state(arc_mru, hdr);
 5617         } else {
 5618                 cmn_err(CE_PANIC, "invalid arc state 0x%p",
 5619                     hdr->b_l1hdr.b_state);
 5620         }
 5621 }
 5622 
 5623 /*
 5624  * This routine is called by dbuf_hold() to update the arc_access() state
 5625  * which otherwise would be skipped for entries in the dbuf cache.
 5626  */
 5627 void
 5628 arc_buf_access(arc_buf_t *buf)
 5629 {
 5630         arc_buf_hdr_t *hdr = buf->b_hdr;
 5631 
 5632         /*
 5633          * Avoid taking the hash_lock when possible as an optimization.
 5634          * The header must be checked again under the hash_lock in order
 5635          * to handle the case where it is concurrently being released.
 5636          */
 5637         if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr))
 5638                 return;
 5639 
 5640         kmutex_t *hash_lock = HDR_LOCK(hdr);
 5641         mutex_enter(hash_lock);
 5642 
 5643         if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
 5644                 mutex_exit(hash_lock);
 5645                 ARCSTAT_BUMP(arcstat_access_skip);
 5646                 return;
 5647         }
 5648 
 5649         ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
 5650             hdr->b_l1hdr.b_state == arc_mfu ||
 5651             hdr->b_l1hdr.b_state == arc_uncached);
 5652 
 5653         DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
 5654         arc_access(hdr, 0, B_TRUE);
 5655         mutex_exit(hash_lock);
 5656 
 5657         ARCSTAT_BUMP(arcstat_hits);
 5658         ARCSTAT_CONDSTAT(B_TRUE /* demand */, demand, prefetch,
 5659             !HDR_ISTYPE_METADATA(hdr), data, metadata, hits);
 5660 }
 5661 
 5662 /* a generic arc_read_done_func_t which you can use */
 5663 void
 5664 arc_bcopy_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,
 5665     arc_buf_t *buf, void *arg)
 5666 {
 5667         (void) zio, (void) zb, (void) bp;
 5668 
 5669         if (buf == NULL)
 5670                 return;
 5671 
 5672         memcpy(arg, buf->b_data, arc_buf_size(buf));
 5673         arc_buf_destroy(buf, arg);
 5674 }
 5675 
 5676 /* a generic arc_read_done_func_t */
 5677 void
 5678 arc_getbuf_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,
 5679     arc_buf_t *buf, void *arg)
 5680 {
 5681         (void) zb, (void) bp;
 5682         arc_buf_t **bufp = arg;
 5683 
 5684         if (buf == NULL) {
 5685                 ASSERT(zio == NULL || zio->io_error != 0);
 5686                 *bufp = NULL;
 5687         } else {
 5688                 ASSERT(zio == NULL || zio->io_error == 0);
 5689                 *bufp = buf;
 5690                 ASSERT(buf->b_data != NULL);
 5691         }
 5692 }
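
/*
 * A minimal usage sketch (not part of arc.c) showing how a caller might
 * combine arc_getbuf_func() with a synchronous arc_read().  Assumptions:
 * the caller holds valid spa/bp/zb pointers, error handling is elided, and
 * "arc_read_sync_sketch" is a hypothetical name.
 */
static int
arc_read_sync_sketch(spa_t *spa, const blkptr_t *bp,
    const zbookmark_phys_t *zb, arc_buf_t **bufp)
{
        arc_flags_t flags = ARC_FLAG_WAIT;

        /* No parent zio; a plain synchronous demand read. */
        int error = arc_read(NULL, spa, bp, arc_getbuf_func, bufp,
            ZIO_PRIORITY_SYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
        if (error == 0 && *bufp != NULL) {
                /* ... consume (*bufp)->b_data here ... */
                arc_buf_destroy(*bufp, bufp);
                *bufp = NULL;
        }
        return (error);
}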
 5693 
 5694 static void
 5695 arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp)
 5696 {
 5697         if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {
 5698                 ASSERT3U(HDR_GET_PSIZE(hdr), ==, 0);
 5699                 ASSERT3U(arc_hdr_get_compress(hdr), ==, ZIO_COMPRESS_OFF);
 5700         } else {
 5701                 if (HDR_COMPRESSION_ENABLED(hdr)) {
 5702                         ASSERT3U(arc_hdr_get_compress(hdr), ==,
 5703                             BP_GET_COMPRESS(bp));
 5704                 }
 5705                 ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));
 5706                 ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp));
 5707                 ASSERT3U(!!HDR_PROTECTED(hdr), ==, BP_IS_PROTECTED(bp));
 5708         }
 5709 }
 5710 
 5711 static void
 5712 arc_read_done(zio_t *zio)
 5713 {
 5714         blkptr_t        *bp = zio->io_bp;
 5715         arc_buf_hdr_t   *hdr = zio->io_private;
 5716         kmutex_t        *hash_lock = NULL;
 5717         arc_callback_t  *callback_list;
 5718         arc_callback_t  *acb;
 5719 
 5720         /*
 5721          * The hdr was inserted into hash-table and removed from lists
 5722          * prior to starting I/O.  We should find this header, since
 5723          * it's in the hash table, and it should be legit since it's
 5724          * not possible to evict it during the I/O.  The only possible
 5725          * reason for it not to be found is if we were freed during the
 5726          * read.
 5727          */
 5728         if (HDR_IN_HASH_TABLE(hdr)) {
 5729                 arc_buf_hdr_t *found;
 5730 
 5731                 ASSERT3U(hdr->b_birth, ==, BP_PHYSICAL_BIRTH(zio->io_bp));
 5732                 ASSERT3U(hdr->b_dva.dva_word[0], ==,
 5733                     BP_IDENTITY(zio->io_bp)->dva_word[0]);
 5734                 ASSERT3U(hdr->b_dva.dva_word[1], ==,
 5735                     BP_IDENTITY(zio->io_bp)->dva_word[1]);
 5736 
 5737                 found = buf_hash_find(hdr->b_spa, zio->io_bp, &hash_lock);
 5738 
 5739                 ASSERT((found == hdr &&
 5740                     DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
 5741                     (found == hdr && HDR_L2_READING(hdr)));
 5742                 ASSERT3P(hash_lock, !=, NULL);
 5743         }
 5744 
 5745         if (BP_IS_PROTECTED(bp)) {
 5746                 hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);
 5747                 hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;
 5748                 zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,
 5749                     hdr->b_crypt_hdr.b_iv);
 5750 
 5751                 if (zio->io_error == 0) {
 5752                         if (BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG) {
 5753                                 void *tmpbuf;
 5754 
 5755                                 tmpbuf = abd_borrow_buf_copy(zio->io_abd,
 5756                                     sizeof (zil_chain_t));
 5757                                 zio_crypt_decode_mac_zil(tmpbuf,
 5758                                     hdr->b_crypt_hdr.b_mac);
 5759                                 abd_return_buf(zio->io_abd, tmpbuf,
 5760                                     sizeof (zil_chain_t));
 5761                         } else {
 5762                                 zio_crypt_decode_mac_bp(bp,
 5763                                     hdr->b_crypt_hdr.b_mac);
 5764                         }
 5765                 }
 5766         }
 5767 
 5768         if (zio->io_error == 0) {
 5769                 /* byteswap if necessary */
 5770                 if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
 5771                         if (BP_GET_LEVEL(zio->io_bp) > 0) {
 5772                                 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;
 5773                         } else {
 5774                                 hdr->b_l1hdr.b_byteswap =
 5775                                     DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));
 5776                         }
 5777                 } else {
 5778                         hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
 5779                 }
 5780                 if (!HDR_L2_READING(hdr)) {
 5781                         hdr->b_complevel = zio->io_prop.zp_complevel;
 5782                 }
 5783         }
 5784 
 5785         arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED);
 5786         if (l2arc_noprefetch && HDR_PREFETCH(hdr))
 5787                 arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);
 5788 
 5789         callback_list = hdr->b_l1hdr.b_acb;
 5790         ASSERT3P(callback_list, !=, NULL);
 5791         hdr->b_l1hdr.b_acb = NULL;
 5792 
 5793         /*
 5794          * If a read request has a callback (i.e. acb_done is not NULL), then we
 5795          * make a buf containing the data according to the parameters which were
 5796          * passed in. The implementation of arc_buf_alloc_impl() ensures that we
 5797          * aren't needlessly decompressing the data multiple times.
 5798          */
 5799         int callback_cnt = 0;
 5800         for (acb = callback_list; acb != NULL; acb = acb->acb_next) {
 5801 
 5802                 /* Remember the last one so the loop below calls them in original order. */
 5803                 callback_list = acb;
 5804 
 5805                 if (!acb->acb_done || acb->acb_nobuf)
 5806                         continue;
 5807 
 5808                 callback_cnt++;
 5809 
 5810                 if (zio->io_error != 0)
 5811                         continue;
 5812 
 5813                 int error = arc_buf_alloc_impl(hdr, zio->io_spa,
 5814                     &acb->acb_zb, acb->acb_private, acb->acb_encrypted,
 5815                     acb->acb_compressed, acb->acb_noauth, B_TRUE,
 5816                     &acb->acb_buf);
 5817 
 5818                 /*
 5819                  * Assert non-speculative zios didn't fail because an
 5820                  * encryption key wasn't loaded
 5821                  */
 5822                 ASSERT((zio->io_flags & ZIO_FLAG_SPECULATIVE) ||
 5823                     error != EACCES);
 5824 
 5825                 /*
 5826                  * If we failed to decrypt, report an error now (as the zio
 5827                  * layer would have done if it had done the transforms).
 5828                  */
 5829                 if (error == ECKSUM) {
 5830                         ASSERT(BP_IS_PROTECTED(bp));
 5831                         error = SET_ERROR(EIO);
 5832                         if ((zio->io_flags & ZIO_FLAG_SPECULATIVE) == 0) {
 5833                                 spa_log_error(zio->io_spa, &acb->acb_zb);
 5834                                 (void) zfs_ereport_post(
 5835                                     FM_EREPORT_ZFS_AUTHENTICATION,
 5836                                     zio->io_spa, NULL, &acb->acb_zb, zio, 0);
 5837                         }
 5838                 }
 5839 
 5840                 if (error != 0) {
 5841                         /*
 5842                          * Decompression or decryption failed.  Set
 5843                          * io_error so that when we call acb_done
 5844                          * (below), we will indicate that the read
 5845                          * failed. Note that in the unusual case
 5846                          * where one callback is compressed and another
 5847                          * uncompressed, we will mark all of them
 5848                          * as failed, even though the uncompressed
 5849                          * one can't actually fail.  In this case,
 5850                          * the hdr will not be anonymous, because
 5851                          * if there are multiple callbacks, it's
 5852                          * because multiple threads found the same
 5853                          * arc buf in the hash table.
 5854                          */
 5855                         zio->io_error = error;
 5856                 }
 5857         }
 5858 
 5859         /*
 5860          * If there are multiple callbacks, we must have the hash lock,
 5861          * because the only way for multiple threads to find this hdr is
 5862          * in the hash table.  This ensures that if there are multiple
 5863          * callbacks, the hdr is not anonymous.  If it were anonymous,
 5864          * we couldn't use arc_buf_destroy() in the error case below.
 5865          */
 5866         ASSERT(callback_cnt < 2 || hash_lock != NULL);
 5867 
 5868         if (zio->io_error == 0) {
 5869                 arc_hdr_verify(hdr, zio->io_bp);
 5870         } else {
 5871                 arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
 5872                 if (hdr->b_l1hdr.b_state != arc_anon)
 5873                         arc_change_state(arc_anon, hdr);
 5874                 if (HDR_IN_HASH_TABLE(hdr))
 5875                         buf_hash_remove(hdr);
 5876         }
 5877 
 5878         /*
 5879          * Broadcast before we drop the hash_lock to avoid the possibility
 5880          * that the hdr (and hence the cv) might be freed before we get to
 5881          * the cv_broadcast().
 5882          */
 5883         cv_broadcast(&hdr->b_l1hdr.b_cv);
 5884 
 5885         arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
 5886         (void) remove_reference(hdr, hdr);
 5887 
 5888         if (hash_lock != NULL)
 5889                 mutex_exit(hash_lock);
 5890 
 5891         /* execute each callback and free its structure */
 5892         while ((acb = callback_list) != NULL) {
 5893                 if (acb->acb_done != NULL) {
 5894                         if (zio->io_error != 0 && acb->acb_buf != NULL) {
 5895                                 /*
 5896                                  * If arc_buf_alloc_impl() fails during
 5897                                  * decompression, the buf will still be
 5898                                  * allocated, and needs to be freed here.
 5899                                  */
 5900                                 arc_buf_destroy(acb->acb_buf,
 5901                                     acb->acb_private);
 5902                                 acb->acb_buf = NULL;
 5903                         }
 5904                         acb->acb_done(zio, &zio->io_bookmark, zio->io_bp,
 5905                             acb->acb_buf, acb->acb_private);
 5906                 }
 5907 
 5908                 if (acb->acb_zio_dummy != NULL) {
 5909                         acb->acb_zio_dummy->io_error = zio->io_error;
 5910                         zio_nowait(acb->acb_zio_dummy);
 5911                 }
 5912 
 5913                 callback_list = acb->acb_prev;
 5914                 if (acb->acb_wait) {
 5915                         mutex_enter(&acb->acb_wait_lock);
 5916                         acb->acb_wait_error = zio->io_error;
 5917                         acb->acb_wait = B_FALSE;
 5918                         cv_signal(&acb->acb_wait_cv);
 5919                         mutex_exit(&acb->acb_wait_lock);
 5920                         /* acb will be freed by the waiting thread. */
 5921                 } else {
 5922                         kmem_free(acb, sizeof (arc_callback_t));
 5923                 }
 5924         }
 5925 }
 5926 
 5927 /*
 5928  * "Read" the block at the specified DVA (in bp) via the
 5929  * cache.  If the block is found in the cache, invoke the provided
 5930  * callback immediately and return.  Note that the `zio' parameter
 5931  * in the callback will be NULL in this case, since no IO was
 5932  * required.  If the block is not in the cache pass the read request
 5933  * on to the spa with a substitute callback function, so that the
 5934  * requested block will be added to the cache.
 5935  *
 5936  * If a read request arrives for a block that has a read in-progress,
 5937  * either wait for the in-progress read to complete (and return the
 5938  * results); or, if this is a read with a "done" func, add a record
 5939  * to the read to invoke the "done" func when the read completes,
 5940  * and return; or just return.
 5941  *
 5942  * arc_read_done() will invoke all the requested "done" functions
 5943  * for readers of this block.
 5944  */
 5945 int
 5946 arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
 5947     arc_read_done_func_t *done, void *private, zio_priority_t priority,
 5948     int zio_flags, arc_flags_t *arc_flags, const zbookmark_phys_t *zb)
 5949 {
 5950         arc_buf_hdr_t *hdr = NULL;
 5951         kmutex_t *hash_lock = NULL;
 5952         zio_t *rzio;
 5953         uint64_t guid = spa_load_guid(spa);
 5954         boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW_COMPRESS) != 0;
 5955         boolean_t encrypted_read = BP_IS_ENCRYPTED(bp) &&
 5956             (zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;
 5957         boolean_t noauth_read = BP_IS_AUTHENTICATED(bp) &&
 5958             (zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;
 5959         boolean_t embedded_bp = !!BP_IS_EMBEDDED(bp);
 5960         boolean_t no_buf = *arc_flags & ARC_FLAG_NO_BUF;
 5961         int rc = 0;
 5962 
 5963         ASSERT(!embedded_bp ||
 5964             BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);
 5965         ASSERT(!BP_IS_HOLE(bp));
 5966         ASSERT(!BP_IS_REDACTED(bp));
 5967 
 5968         /*
 5969          * Normally SPL_FSTRANS will already be set since kernel threads which
 5970          * expect to call the DMU interfaces will set it when created.  System
 5971          * calls are similarly handled by setting/cleaning the bit in the
 5972          * registered callback (module/os/.../zfs/zpl_*).
 5973          *
 5974          * External consumers such as Lustre which call the exported DMU
 5975          * interfaces may not have set SPL_FSTRANS.  To avoid a deadlock
 5976          * on the hash_lock always set and clear the bit.
 5977          * on the hash_lock, always set and clear the bit.
 5978         fstrans_cookie_t cookie = spl_fstrans_mark();
 5979 top:
 5980         /*
 5981          * Verify the block pointer contents are reasonable.  This should
 5982          * always be the case since the blkptr is protected by a checksum.
 5983          * However, if there is damage it's desirable to detect this early
 5984          * and treat it as a checksum error.  This allows an alternate blkptr
 5985          * to be tried when one is available (e.g. ditto blocks).
 5986          */
 5987         if (!zfs_blkptr_verify(spa, bp, zio_flags & ZIO_FLAG_CONFIG_WRITER,
 5988             BLK_VERIFY_LOG)) {
 5989                 rc = SET_ERROR(ECKSUM);
 5990                 goto out;
 5991         }
 5992 
 5993         if (!embedded_bp) {
 5994                 /*
 5995                  * Embedded BP's have no DVA and require no I/O to "read".
 5996                  * Create an anonymous arc buf to back it.
 5997                  */
 5998                 hdr = buf_hash_find(guid, bp, &hash_lock);
 5999         }
 6000 
 6001         /*
 6002          * Determine if we have an L1 cache hit or a cache miss. For simplicity
 6003          * we maintain encrypted data separately from compressed / uncompressed
 6004          * data. If the user is requesting raw encrypted data and we don't have
 6005          * that in the header we will read from disk to guarantee that we can
 6006          * get it even if the encryption keys aren't loaded.
 6007          */
 6008         if (hdr != NULL && HDR_HAS_L1HDR(hdr) && (HDR_HAS_RABD(hdr) ||
 6009             (hdr->b_l1hdr.b_pabd != NULL && !encrypted_read))) {
 6010                 boolean_t is_data = !HDR_ISTYPE_METADATA(hdr);
 6011                 arc_buf_t *buf = NULL;
 6012 
 6013                 if (HDR_IO_IN_PROGRESS(hdr)) {
 6014                         if (*arc_flags & ARC_FLAG_CACHED_ONLY) {
 6015                                 mutex_exit(hash_lock);
 6016                                 ARCSTAT_BUMP(arcstat_cached_only_in_progress);
 6017                                 rc = SET_ERROR(ENOENT);
 6018                                 goto out;
 6019                         }
 6020 
 6021                         zio_t *head_zio = hdr->b_l1hdr.b_acb->acb_zio_head;
 6022                         ASSERT3P(head_zio, !=, NULL);
 6023                         if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) &&
 6024                             priority == ZIO_PRIORITY_SYNC_READ) {
 6025                                 /*
 6026                                  * This is a sync read that needs to wait for
 6027                                  * an in-flight async read. Request that the
 6028                                  * zio have its priority upgraded.
 6029                                  */
 6030                                 zio_change_priority(head_zio, priority);
 6031                                 DTRACE_PROBE1(arc__async__upgrade__sync,
 6032                                     arc_buf_hdr_t *, hdr);
 6033                                 ARCSTAT_BUMP(arcstat_async_upgrade_sync);
 6034                         }
 6035 
 6036                         DTRACE_PROBE1(arc__iohit, arc_buf_hdr_t *, hdr);
 6037                         arc_access(hdr, *arc_flags, B_FALSE);
 6038 
 6039                         /*
 6040                          * If there are multiple threads reading the same block
 6041                          * and that block is not yet in the ARC, then only one
 6042                          * thread will do the physical I/O and all other
 6043                          * threads will wait until that I/O completes.
 6044                          * Synchronous reads use the acb_wait_cv whereas nowait
 6045                          * reads register a callback. Both are signalled/called
 6046                          * in arc_read_done.
 6047                          *
 6048                          * Errors of the physical I/O may need to be propagated.
 6049                          * Synchronous read errors are returned here from
 6050                          * arc_read_done via acb_wait_error.  Nowait reads
 6051                          * attach the acb_zio_dummy zio to pio and
 6052                          * arc_read_done propagates the physical I/O's io_error
 6053                          * to acb_zio_dummy, and thereby to pio.
 6054                          */
 6055                         arc_callback_t *acb = NULL;
 6056                         if (done || pio || *arc_flags & ARC_FLAG_WAIT) {
 6057                                 acb = kmem_zalloc(sizeof (arc_callback_t),
 6058                                     KM_SLEEP);
 6059                                 acb->acb_done = done;
 6060                                 acb->acb_private = private;
 6061                                 acb->acb_compressed = compressed_read;
 6062                                 acb->acb_encrypted = encrypted_read;
 6063                                 acb->acb_noauth = noauth_read;
 6064                                 acb->acb_nobuf = no_buf;
 6065                                 if (*arc_flags & ARC_FLAG_WAIT) {
 6066                                         acb->acb_wait = B_TRUE;
 6067                                         mutex_init(&acb->acb_wait_lock, NULL,
 6068                                             MUTEX_DEFAULT, NULL);
 6069                                         cv_init(&acb->acb_wait_cv, NULL,
 6070                                             CV_DEFAULT, NULL);
 6071                                 }
 6072                                 acb->acb_zb = *zb;
 6073                                 if (pio != NULL) {
 6074                                         acb->acb_zio_dummy = zio_null(pio,
 6075                                             spa, NULL, NULL, NULL, zio_flags);
 6076                                 }
 6077                                 acb->acb_zio_head = head_zio;
 6078                                 acb->acb_next = hdr->b_l1hdr.b_acb;
 6079                                 if (hdr->b_l1hdr.b_acb)
 6080                                         hdr->b_l1hdr.b_acb->acb_prev = acb;
 6081                                 hdr->b_l1hdr.b_acb = acb;
 6082                         }
 6083                         mutex_exit(hash_lock);
 6084 
 6085                         ARCSTAT_BUMP(arcstat_iohits);
 6086                         ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),
 6087                             demand, prefetch, is_data, data, metadata, iohits);
 6088 
 6089                         if (*arc_flags & ARC_FLAG_WAIT) {
 6090                                 mutex_enter(&acb->acb_wait_lock);
 6091                                 while (acb->acb_wait) {
 6092                                         cv_wait(&acb->acb_wait_cv,
 6093                                             &acb->acb_wait_lock);
 6094                                 }
 6095                                 rc = acb->acb_wait_error;
 6096                                 mutex_exit(&acb->acb_wait_lock);
 6097                                 mutex_destroy(&acb->acb_wait_lock);
 6098                                 cv_destroy(&acb->acb_wait_cv);
 6099                                 kmem_free(acb, sizeof (arc_callback_t));
 6100                         }
 6101                         goto out;
 6102                 }
 6103 
 6104                 ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
 6105                     hdr->b_l1hdr.b_state == arc_mfu ||
 6106                     hdr->b_l1hdr.b_state == arc_uncached);
 6107 
 6108                 DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
 6109                 arc_access(hdr, *arc_flags, B_TRUE);
 6110 
 6111                 if (done && !no_buf) {
 6112                         ASSERT(!embedded_bp || !BP_IS_HOLE(bp));
 6113 
 6114                         /* Get a buf with the desired data in it. */
 6115                         rc = arc_buf_alloc_impl(hdr, spa, zb, private,
 6116                             encrypted_read, compressed_read, noauth_read,
 6117                             B_TRUE, &buf);
 6118                         if (rc == ECKSUM) {
 6119                                 /*
 6120                                  * Convert authentication and decryption errors
 6121                                  * to EIO (and generate an ereport if needed)
 6122                                  * before leaving the ARC.
 6123                                  */
 6124                                 rc = SET_ERROR(EIO);
 6125                                 if ((zio_flags & ZIO_FLAG_SPECULATIVE) == 0) {
 6126                                         spa_log_error(spa, zb);
 6127                                         (void) zfs_ereport_post(
 6128                                             FM_EREPORT_ZFS_AUTHENTICATION,
 6129                                             spa, NULL, zb, NULL, 0);
 6130                                 }
 6131                         }
 6132                         if (rc != 0) {
 6133                                 arc_buf_destroy_impl(buf);
 6134                                 buf = NULL;
 6135                                 (void) remove_reference(hdr, private);
 6136                         }
 6137 
 6138                         /* assert any errors weren't due to unloaded keys */
 6139                         ASSERT((zio_flags & ZIO_FLAG_SPECULATIVE) ||
 6140                             rc != EACCES);
 6141                 }
 6142                 mutex_exit(hash_lock);
 6143                 ARCSTAT_BUMP(arcstat_hits);
 6144                 ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),
 6145                     demand, prefetch, is_data, data, metadata, hits);
 6146                 *arc_flags |= ARC_FLAG_CACHED;
 6147 
 6148                 if (done)
 6149                         done(NULL, zb, bp, buf, private);
 6150         } else {
 6151                 uint64_t lsize = BP_GET_LSIZE(bp);
 6152                 uint64_t psize = BP_GET_PSIZE(bp);
 6153                 arc_callback_t *acb;
 6154                 vdev_t *vd = NULL;
 6155                 uint64_t addr = 0;
 6156                 boolean_t devw = B_FALSE;
 6157                 uint64_t size;
 6158                 abd_t *hdr_abd;
 6159                 int alloc_flags = encrypted_read ? ARC_HDR_ALLOC_RDATA : 0;
 6160 
 6161                 if (*arc_flags & ARC_FLAG_CACHED_ONLY) {
 6162                         rc = SET_ERROR(ENOENT);
 6163                         if (hash_lock != NULL)
 6164                                 mutex_exit(hash_lock);
 6165                         goto out;
 6166                 }
 6167 
 6168                 if (hdr == NULL) {
 6169                         /*
 6170                          * This block is not in the cache or it has
 6171                          * embedded data.
 6172                          */
 6173                         arc_buf_hdr_t *exists = NULL;
 6174                         arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
 6175                         hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
 6176                             BP_IS_PROTECTED(bp), BP_GET_COMPRESS(bp), 0, type);
 6177 
 6178                         if (!embedded_bp) {
 6179                                 hdr->b_dva = *BP_IDENTITY(bp);
 6180                                 hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
 6181                                 exists = buf_hash_insert(hdr, &hash_lock);
 6182                         }
 6183                         if (exists != NULL) {
 6184                                 /* somebody beat us to the hash insert */
 6185                                 mutex_exit(hash_lock);
 6186                                 buf_discard_identity(hdr);
 6187                                 arc_hdr_destroy(hdr);
 6188                                 goto top; /* restart the IO request */
 6189                         }
 6190                 } else {
 6191                         /*
 6192                          * This block is in the ghost cache or encrypted data
 6193                          * was requested and we didn't have it. If it was
 6194                          * L2-only (and thus didn't have an L1 hdr),
 6195                          * we realloc the header to add an L1 hdr.
 6196                          */
 6197                         if (!HDR_HAS_L1HDR(hdr)) {
 6198                                 hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
 6199                                     hdr_full_cache);
 6200                         }
 6201 
 6202                         if (GHOST_STATE(hdr->b_l1hdr.b_state)) {
 6203                                 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 6204                                 ASSERT(!HDR_HAS_RABD(hdr));
 6205                                 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 6206                                 ASSERT0(zfs_refcount_count(
 6207                                     &hdr->b_l1hdr.b_refcnt));
 6208                                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
 6209 #ifdef ZFS_DEBUG
 6210                                 ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
 6211 #endif
 6212                         } else if (HDR_IO_IN_PROGRESS(hdr)) {
 6213                                 /*
 6214                                  * If this header already had an IO in progress
 6215                                  * and we are performing another IO to fetch
 6216                                  * encrypted data we must wait until the first
 6217                                  * IO completes so as not to confuse
 6218                                  * arc_read_done(). This should be very rare
 6219                                  * and so the performance impact shouldn't
 6220                                  * matter.
 6221                                  */
 6222                                 cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
 6223                                 mutex_exit(hash_lock);
 6224                                 goto top;
 6225                         }
 6226                 }
 6227                 if (*arc_flags & ARC_FLAG_UNCACHED) {
 6228                         arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED);
 6229                         if (!encrypted_read)
 6230                                 alloc_flags |= ARC_HDR_ALLOC_LINEAR;
 6231                 }
 6232 
 6233                 /*
 6234                  * Call arc_adapt() explicitly before arc_access() to allow
 6235                  * its logic to balance MRU/MFU based on the original state.
 6236                  */
 6237                 arc_adapt(arc_hdr_size(hdr), hdr->b_l1hdr.b_state);
 6238                 /*
 6239                  * Take an additional reference for IO_IN_PROGRESS.  It stops
 6240                  * arc_access() from putting this header, which has no buffers
 6241                  * and thus no other references but is clearly not evictable,
 6242                  * onto the evictable list of the MRU or MFU state.
 6243                  */
 6244                 add_reference(hdr, hdr);
 6245                 if (!embedded_bp)
 6246                         arc_access(hdr, *arc_flags, B_FALSE);
 6247                 arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
 6248                 arc_hdr_alloc_abd(hdr, alloc_flags);
 6249                 if (encrypted_read) {
 6250                         ASSERT(HDR_HAS_RABD(hdr));
 6251                         size = HDR_GET_PSIZE(hdr);
 6252                         hdr_abd = hdr->b_crypt_hdr.b_rabd;
 6253                         zio_flags |= ZIO_FLAG_RAW;
 6254                 } else {
 6255                         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
 6256                         size = arc_hdr_size(hdr);
 6257                         hdr_abd = hdr->b_l1hdr.b_pabd;
 6258 
 6259                         if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {
 6260                                 zio_flags |= ZIO_FLAG_RAW_COMPRESS;
 6261                         }
 6262 
 6263                         /*
 6264                          * For authenticated bp's, we do not ask the ZIO layer
 6265                          * to authenticate them since this will cause the entire
 6266                          * IO to fail if the key isn't loaded. Instead, we
 6267                          * defer authentication until arc_buf_fill(), which will
 6268                          * verify the data when the key is available.
 6269                          */
 6270                         if (BP_IS_AUTHENTICATED(bp))
 6271                                 zio_flags |= ZIO_FLAG_RAW_ENCRYPT;
 6272                 }
 6273 
 6274                 if (BP_IS_AUTHENTICATED(bp))
 6275                         arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);
 6276                 if (BP_GET_LEVEL(bp) > 0)
 6277                         arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);
 6278                 ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));
 6279 
 6280                 acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
 6281                 acb->acb_done = done;
 6282                 acb->acb_private = private;
 6283                 acb->acb_compressed = compressed_read;
 6284                 acb->acb_encrypted = encrypted_read;
 6285                 acb->acb_noauth = noauth_read;
 6286                 acb->acb_zb = *zb;
 6287 
 6288                 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
 6289                 hdr->b_l1hdr.b_acb = acb;
 6290 
 6291                 if (HDR_HAS_L2HDR(hdr) &&
 6292                     (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
 6293                         devw = hdr->b_l2hdr.b_dev->l2ad_writing;
 6294                         addr = hdr->b_l2hdr.b_daddr;
 6295                         /*
 6296                          * Lock out L2ARC device removal.
 6297                          */
 6298                         if (vdev_is_dead(vd) ||
 6299                             !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
 6300                                 vd = NULL;
 6301                 }
 6302 
 6303                 /*
 6304                  * We count both async reads and scrub IOs as asynchronous so
 6305                  * that both can be upgraded in the event of a cache hit while
 6306                  * the read IO is still in-flight.
 6307                  */
 6308                 if (priority == ZIO_PRIORITY_ASYNC_READ ||
 6309                     priority == ZIO_PRIORITY_SCRUB)
 6310                         arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
 6311                 else
 6312                         arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
 6313 
 6314                 /*
 6315                  * At this point, we have a level 1 cache miss or a blkptr
 6316                  * with embedded data.  Try again in L2ARC if possible.
 6317                  */
 6318                 ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);
 6319 
 6320                 /*
 6321                  * Skip ARC stat bump for block pointers with embedded
 6322                  * data. The data are read from the blkptr itself via
 6323                  * decode_embedded_bp_compressed().
 6324                  */
 6325                 if (!embedded_bp) {
 6326                         DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr,
 6327                             blkptr_t *, bp, uint64_t, lsize,
 6328                             zbookmark_phys_t *, zb);
 6329                         ARCSTAT_BUMP(arcstat_misses);
 6330                         ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),
 6331                             demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data,
 6332                             metadata, misses);
 6333                         zfs_racct_read(size, 1);
 6334                 }
 6335 
 6336                 /* Check if the spa even has l2 configured */
 6337                 const boolean_t spa_has_l2 = l2arc_ndev != 0 &&
 6338                     spa->spa_l2cache.sav_count > 0;
 6339 
 6340                 if (vd != NULL && spa_has_l2 && !(l2arc_norw && devw)) {
 6341                         /*
 6342                          * Read from the L2ARC if the following are true:
 6343                          * 1. The L2ARC vdev was previously cached.
 6344                          * 2. This buffer still has L2ARC metadata.
 6345                          * 3. This buffer isn't currently writing to the L2ARC.
 6346                          * 4. The L2ARC entry wasn't evicted, which may
 6347                          *    also have invalidated the vdev.
 6348                          * 5. This isn't a prefetch, or l2arc_noprefetch is 0.
 6349                          */
 6350                         if (HDR_HAS_L2HDR(hdr) &&
 6351                             !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
 6352                             !(l2arc_noprefetch &&
 6353                             (*arc_flags & ARC_FLAG_PREFETCH))) {
 6354                                 l2arc_read_callback_t *cb;
 6355                                 abd_t *abd;
 6356                                 uint64_t asize;
 6357 
 6358                                 DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
 6359                                 ARCSTAT_BUMP(arcstat_l2_hits);
 6360                                 hdr->b_l2hdr.b_hits++;
 6361 
 6362                                 cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
 6363                                     KM_SLEEP);
 6364                                 cb->l2rcb_hdr = hdr;
 6365                                 cb->l2rcb_bp = *bp;
 6366                                 cb->l2rcb_zb = *zb;
 6367                                 cb->l2rcb_flags = zio_flags;
 6368 
 6369                                 /*
 6370                                  * When Compressed ARC is disabled, but the
 6371                                  * L2ARC block is compressed, arc_hdr_size()
 6372                                  * will have returned LSIZE rather than PSIZE.
 6373                                  */
 6374                                 if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
 6375                                     !HDR_COMPRESSION_ENABLED(hdr) &&
 6376                                     HDR_GET_PSIZE(hdr) != 0) {
 6377                                         size = HDR_GET_PSIZE(hdr);
 6378                                 }
 6379 
 6380                                 asize = vdev_psize_to_asize(vd, size);
 6381                                 if (asize != size) {
 6382                                         abd = abd_alloc_for_io(asize,
 6383                                             HDR_ISTYPE_METADATA(hdr));
 6384                                         cb->l2rcb_abd = abd;
 6385                                 } else {
 6386                                         abd = hdr_abd;
 6387                                 }
 6388 
 6389                                 ASSERT(addr >= VDEV_LABEL_START_SIZE &&
 6390                                     addr + asize <= vd->vdev_psize -
 6391                                     VDEV_LABEL_END_SIZE);
 6392 
 6393                                 /*
 6394                                  * l2arc read.  The SCL_L2ARC lock will be
 6395                                  * released by l2arc_read_done().
 6396                                  * Issue a null zio if the underlying buffer
 6397                                  * was squashed to zero size by compression.
 6398                                  */
 6399                                 ASSERT3U(arc_hdr_get_compress(hdr), !=,
 6400                                     ZIO_COMPRESS_EMPTY);
 6401                                 rzio = zio_read_phys(pio, vd, addr,
 6402                                     asize, abd,
 6403                                     ZIO_CHECKSUM_OFF,
 6404                                     l2arc_read_done, cb, priority,
 6405                                     zio_flags | ZIO_FLAG_DONT_CACHE |
 6406                                     ZIO_FLAG_CANFAIL |
 6407                                     ZIO_FLAG_DONT_PROPAGATE |
 6408                                     ZIO_FLAG_DONT_RETRY, B_FALSE);
 6409                                 acb->acb_zio_head = rzio;
 6410 
 6411                                 if (hash_lock != NULL)
 6412                                         mutex_exit(hash_lock);
 6413 
 6414                                 DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
 6415                                     zio_t *, rzio);
 6416                                 ARCSTAT_INCR(arcstat_l2_read_bytes,
 6417                                     HDR_GET_PSIZE(hdr));
 6418 
 6419                                 if (*arc_flags & ARC_FLAG_NOWAIT) {
 6420                                         zio_nowait(rzio);
 6421                                         goto out;
 6422                                 }
 6423 
 6424                                 ASSERT(*arc_flags & ARC_FLAG_WAIT);
 6425                                 if (zio_wait(rzio) == 0)
 6426                                         goto out;
 6427 
 6428                                 /* l2arc read error; goto zio_read() */
 6429                                 if (hash_lock != NULL)
 6430                                         mutex_enter(hash_lock);
 6431                         } else {
 6432                                 DTRACE_PROBE1(l2arc__miss,
 6433                                     arc_buf_hdr_t *, hdr);
 6434                                 ARCSTAT_BUMP(arcstat_l2_misses);
 6435                                 if (HDR_L2_WRITING(hdr))
 6436                                         ARCSTAT_BUMP(arcstat_l2_rw_clash);
 6437                                 spa_config_exit(spa, SCL_L2ARC, vd);
 6438                         }
 6439                 } else {
 6440                         if (vd != NULL)
 6441                                 spa_config_exit(spa, SCL_L2ARC, vd);
 6442 
 6443                         /*
 6444                          * Only a spa with l2 should contribute to l2
 6445                          * miss stats, including the case of a faulted
 6446                          * cache device, which also counts as a miss.
 6447                          */
 6448                         if (spa_has_l2) {
 6449                                 /*
 6450                                  * Skip ARC stat bump for block pointers with
 6451                                  * embedded data. The data are read from the
 6452                                  * blkptr itself via
 6453                                  * decode_embedded_bp_compressed().
 6454                                  */
 6455                                 if (!embedded_bp) {
 6456                                         DTRACE_PROBE1(l2arc__miss,
 6457                                             arc_buf_hdr_t *, hdr);
 6458                                         ARCSTAT_BUMP(arcstat_l2_misses);
 6459                                 }
 6460                         }
 6461                 }
 6462 
 6463                 rzio = zio_read(pio, spa, bp, hdr_abd, size,
 6464                     arc_read_done, hdr, priority, zio_flags, zb);
 6465                 acb->acb_zio_head = rzio;
 6466 
 6467                 if (hash_lock != NULL)
 6468                         mutex_exit(hash_lock);
 6469 
 6470                 if (*arc_flags & ARC_FLAG_WAIT) {
 6471                         rc = zio_wait(rzio);
 6472                         goto out;
 6473                 }
 6474 
 6475                 ASSERT(*arc_flags & ARC_FLAG_NOWAIT);
 6476                 zio_nowait(rzio);
 6477         }
 6478 
 6479 out:
 6480         /* embedded bps don't actually go to disk */
 6481         if (!embedded_bp)
 6482                 spa_read_history_add(spa, zb, *arc_flags);
 6483         spl_fstrans_unmark(cookie);
 6484         return (rc);
 6485 }
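
/*
 * Editor's note: the sketch below is illustrative only and is not part of
 * arc.c.  Assuming the arc_read() prototype used elsewhere in OpenZFS
 * (parent zio, spa, bp, done callback, private arg, priority, zio flags,
 * arc flags, bookmark), a simple blocking demand read by a hypothetical
 * caller might look roughly like this ("my_read_done" and "my_arg" are
 * made-up names):
 *
 *	arc_flags_t aflags = ARC_FLAG_WAIT;
 *	zbookmark_phys_t zb;
 *
 *	SET_BOOKMARK(&zb, objset, object, level, blkid);
 *	int err = arc_read(NULL, spa, bp, my_read_done, my_arg,
 *	    ZIO_PRIORITY_SYNC_READ, ZIO_FLAG_CANFAIL, &aflags, &zb);
 *
 * On a hit the done callback fires before arc_read() returns; on a miss
 * the caller blocks in zio_wait() inside arc_read() because ARC_FLAG_WAIT
 * was passed (see the tail of the miss path above).
 */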
 6486 
 6487 arc_prune_t *
 6488 arc_add_prune_callback(arc_prune_func_t *func, void *private)
 6489 {
 6490         arc_prune_t *p;
 6491 
 6492         p = kmem_alloc(sizeof (*p), KM_SLEEP);
 6493         p->p_pfunc = func;
 6494         p->p_private = private;
 6495         list_link_init(&p->p_node);
 6496         zfs_refcount_create(&p->p_refcnt);
 6497 
 6498         mutex_enter(&arc_prune_mtx);
 6499         zfs_refcount_add(&p->p_refcnt, &arc_prune_list);
 6500         list_insert_head(&arc_prune_list, p);
 6501         mutex_exit(&arc_prune_mtx);
 6502 
 6503         return (p);
 6504 }
 6505 
 6506 void
 6507 arc_remove_prune_callback(arc_prune_t *p)
 6508 {
 6509         boolean_t wait = B_FALSE;
 6510         mutex_enter(&arc_prune_mtx);
 6511         list_remove(&arc_prune_list, p);
 6512         if (zfs_refcount_remove(&p->p_refcnt, &arc_prune_list) > 0)
 6513                 wait = B_TRUE;
 6514         mutex_exit(&arc_prune_mtx);
 6515 
 6516         /* wait for arc_prune_task to finish */
 6517         if (wait)
 6518                 taskq_wait_outstanding(arc_prune_taskq, 0);
 6519         ASSERT0(zfs_refcount_count(&p->p_refcnt));
 6520         zfs_refcount_destroy(&p->p_refcnt);
 6521         kmem_free(p, sizeof (*p));
 6522 }
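
/*
 * Editor's note: an illustrative sketch, not part of arc.c.  A consumer
 * that caches metadata (e.g. a filesystem layer holding dnode or dentry
 * references) can register a prune callback so the ARC can ask it to drop
 * objects under memory pressure.  Assuming arc_prune_func_t takes the
 * number of objects to scan plus the private pointer, usage might look
 * roughly like this ("my_prune" and "my_state" are hypothetical):
 *
 *	static void
 *	my_prune(int64_t nr_to_scan, void *arg)
 *	{
 *		// release up to nr_to_scan cached objects held by arg
 *	}
 *
 *	arc_prune_t *p = arc_add_prune_callback(my_prune, my_state);
 *	// ... lifetime of the consumer ...
 *	arc_remove_prune_callback(p);	// waits for in-flight prune tasks
 */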
 6523 
 6524 /*
 6525  * Notify the arc that a block was freed, and thus will never be used again.
 6526  */
 6527 void
 6528 arc_freed(spa_t *spa, const blkptr_t *bp)
 6529 {
 6530         arc_buf_hdr_t *hdr;
 6531         kmutex_t *hash_lock;
 6532         uint64_t guid = spa_load_guid(spa);
 6533 
 6534         ASSERT(!BP_IS_EMBEDDED(bp));
 6535 
 6536         hdr = buf_hash_find(guid, bp, &hash_lock);
 6537         if (hdr == NULL)
 6538                 return;
 6539 
 6540         /*
 6541          * We might be trying to free a block that is still doing I/O
 6542          * (i.e. prefetch) or has some other reference, for example because
 6543          * it is part of a dedup-ed, dmu_sync-ed write. The dmu_sync()
 6544          * function would have written the new block to its final resting
 6545          * place on disk but without the dedup flag set. This would have
 6546          * left the hdr in the MRU
 6547          * state and discoverable. When the txg finally syncs it detects that
 6548          * the block was overridden in open context and issues an override I/O.
 6549          * Since this is a dedup block, the override I/O will determine if the
 6550          * block is already in the DDT. If so, then it will replace the io_bp
 6551          * with the bp from the DDT and allow the I/O to finish. When the I/O
 6552          * reaches the done callback, dbuf_write_override_done, it will
 6553          * check to see if the io_bp and io_bp_override are identical.
 6554          * If they are not, then it indicates that the bp was replaced with
 6555          * the bp in the DDT and the override bp is freed. This allows
 6556          * us to arrive here with a reference on a block that is being
 6557          * freed. So if we have an I/O in progress, or a reference to
 6558          * this hdr, then we don't destroy the hdr.
 6559          */
 6560         if (!HDR_HAS_L1HDR(hdr) ||
 6561             zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {
 6562                 arc_change_state(arc_anon, hdr);
 6563                 arc_hdr_destroy(hdr);
 6564                 mutex_exit(hash_lock);
 6565         } else {
 6566                 mutex_exit(hash_lock);
 6567         }
 6568 
 6569 }
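
/*
 * Editor's note: an illustrative sketch, not part of arc.c.  The free path
 * simply notifies the ARC that this block pointer will never be read
 * again, roughly:
 *
 *	arc_freed(spa, bp);	// drops the cached hdr if nothing holds it
 */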
 6570 
 6571 /*
 6572  * Release this buffer from the cache, making it an anonymous buffer.  This
 6573  * must be done after a read and prior to modifying the buffer contents.
 6574  * If the buffer has more than one reference, we must make
 6575  * a new hdr for the buffer.
 6576  */
 6577 void
 6578 arc_release(arc_buf_t *buf, const void *tag)
 6579 {
 6580         arc_buf_hdr_t *hdr = buf->b_hdr;
 6581 
 6582         /*
 6583          * It would be nice to assert that if it's DMU metadata (level >
 6584          * 0 || it's the dnode file), then it must be syncing context.
 6585          * But we don't know that information at this level.
 6586          */
 6587 
 6588         ASSERT(HDR_HAS_L1HDR(hdr));
 6589 
 6590         /*
 6591          * We don't grab the hash lock prior to this check, because if
 6592          * the buffer's header is in the arc_anon state, it won't be
 6593          * linked into the hash table.
 6594          */
 6595         if (hdr->b_l1hdr.b_state == arc_anon) {
 6596                 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 6597                 ASSERT(!HDR_IN_HASH_TABLE(hdr));
 6598                 ASSERT(!HDR_HAS_L2HDR(hdr));
 6599 
 6600                 ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
 6601                 ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1);
 6602                 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
 6603 
 6604                 hdr->b_l1hdr.b_arc_access = 0;
 6605 
 6606                 /*
 6607                  * If the buf is being overridden then it may already
 6608                  * have a hdr that is not empty.
 6609                  */
 6610                 buf_discard_identity(hdr);
 6611                 arc_buf_thaw(buf);
 6612 
 6613                 return;
 6614         }
 6615 
 6616         kmutex_t *hash_lock = HDR_LOCK(hdr);
 6617         mutex_enter(hash_lock);
 6618 
 6619         /*
 6620          * This assignment is only valid as long as the hash_lock is
 6621          * held; we must be careful not to reference state or the
 6622          * b_state field after dropping the lock.
 6623          */
 6624         arc_state_t *state = hdr->b_l1hdr.b_state;
 6625         ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
 6626         ASSERT3P(state, !=, arc_anon);
 6627 
 6628         /* this buffer is not on any list */
 6629         ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0);
 6630 
 6631         if (HDR_HAS_L2HDR(hdr)) {
 6632                 mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx);
 6633 
 6634                 /*
 6635                  * We have to recheck this conditional again now that
 6636                  * we're holding the l2ad_mtx to prevent a race with
 6637                  * another thread which might be concurrently calling
 6638                  * l2arc_evict(). In that case, l2arc_evict() might have
 6639                  * destroyed the header's L2 portion as we were waiting
 6640                  * to acquire the l2ad_mtx.
 6641                  */
 6642                 if (HDR_HAS_L2HDR(hdr))
 6643                         arc_hdr_l2hdr_destroy(hdr);
 6644 
 6645                 mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx);
 6646         }
 6647 
 6648         /*
 6649          * Do we have more than one buf?
 6650          */
 6651         if (hdr->b_l1hdr.b_bufcnt > 1) {
 6652                 arc_buf_hdr_t *nhdr;
 6653                 uint64_t spa = hdr->b_spa;
 6654                 uint64_t psize = HDR_GET_PSIZE(hdr);
 6655                 uint64_t lsize = HDR_GET_LSIZE(hdr);
 6656                 boolean_t protected = HDR_PROTECTED(hdr);
 6657                 enum zio_compress compress = arc_hdr_get_compress(hdr);
 6658                 arc_buf_contents_t type = arc_buf_type(hdr);
 6659                 VERIFY3U(hdr->b_type, ==, type);
 6660 
 6661                 ASSERT(hdr->b_l1hdr.b_buf != buf || buf->b_next != NULL);
 6662                 VERIFY3S(remove_reference(hdr, tag), >, 0);
 6663 
 6664                 if (arc_buf_is_shared(buf) && !ARC_BUF_COMPRESSED(buf)) {
 6665                         ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
 6666                         ASSERT(ARC_BUF_LAST(buf));
 6667                 }
 6668 
 6669                 /*
 6670                  * Pull the data off of this hdr and attach it to
 6671                  * a new anonymous hdr. Also find the last buffer
 6672                  * in the hdr's buffer list.
 6673                  */
 6674                 arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);
 6675                 ASSERT3P(lastbuf, !=, NULL);
 6676 
 6677                 /*
 6678                  * If the current arc_buf_t and the hdr are sharing their data
 6679                  * buffer, then we must stop sharing that block.
 6680                  */
 6681                 if (arc_buf_is_shared(buf)) {
 6682                         ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
 6683                         VERIFY(!arc_buf_is_shared(lastbuf));
 6684 
 6685                         /*
 6686                          * First, sever the block sharing relationship between
 6687                          * buf and the arc_buf_hdr_t.
 6688                          */
 6689                         arc_unshare_buf(hdr, buf);
 6690 
 6691                         /*
 6692                          * Now we need to recreate the hdr's b_pabd. Since we
 6693                          * have lastbuf handy, we try to share with it, but if
 6694                          * we can't then we allocate a new b_pabd and copy the
 6695                          * data from buf into it.
 6696                          */
 6697                         if (arc_can_share(hdr, lastbuf)) {
 6698                                 arc_share_buf(hdr, lastbuf);
 6699                         } else {
 6700                                 arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT);
 6701                                 abd_copy_from_buf(hdr->b_l1hdr.b_pabd,
 6702                                     buf->b_data, psize);
 6703                         }
 6704                         VERIFY3P(lastbuf->b_data, !=, NULL);
 6705                 } else if (HDR_SHARED_DATA(hdr)) {
 6706                         /*
 6707                          * Uncompressed shared buffers are always at the end
 6708                          * of the list. Compressed buffers don't have the
 6709                          * same requirements. This makes it hard to
 6710                          * simply assert that the lastbuf is shared so
 6711                          * we rely on the hdr's compression flags to determine
 6712                          * if we have a compressed, shared buffer.
 6713                          */
 6714                         ASSERT(arc_buf_is_shared(lastbuf) ||
 6715                             arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);
 6716                         ASSERT(!ARC_BUF_SHARED(buf));
 6717                 }
 6718 
 6719                 ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));
 6720                 ASSERT3P(state, !=, arc_l2c_only);
 6721 
 6722                 (void) zfs_refcount_remove_many(&state->arcs_size,
 6723                     arc_buf_size(buf), buf);
 6724 
 6725                 if (zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {
 6726                         ASSERT3P(state, !=, arc_l2c_only);
 6727                         (void) zfs_refcount_remove_many(
 6728                             &state->arcs_esize[type],
 6729                             arc_buf_size(buf), buf);
 6730                 }
 6731 
 6732                 hdr->b_l1hdr.b_bufcnt -= 1;
 6733                 if (ARC_BUF_ENCRYPTED(buf))
 6734                         hdr->b_crypt_hdr.b_ebufcnt -= 1;
 6735 
 6736                 arc_cksum_verify(buf);
 6737                 arc_buf_unwatch(buf);
 6738 
 6739                 /* if this is the last uncompressed buf, free the checksum */
 6740                 if (!arc_hdr_has_uncompressed_buf(hdr))
 6741                         arc_cksum_free(hdr);
 6742 
 6743                 mutex_exit(hash_lock);
 6744 
 6745                 nhdr = arc_hdr_alloc(spa, psize, lsize, protected,
 6746                     compress, hdr->b_complevel, type);
 6747                 ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
 6748                 ASSERT0(nhdr->b_l1hdr.b_bufcnt);
 6749                 ASSERT0(zfs_refcount_count(&nhdr->b_l1hdr.b_refcnt));
 6750                 VERIFY3U(nhdr->b_type, ==, type);
 6751                 ASSERT(!HDR_SHARED_DATA(nhdr));
 6752 
 6753                 nhdr->b_l1hdr.b_buf = buf;
 6754                 nhdr->b_l1hdr.b_bufcnt = 1;
 6755                 if (ARC_BUF_ENCRYPTED(buf))
 6756                         nhdr->b_crypt_hdr.b_ebufcnt = 1;
 6757                 (void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
 6758                 buf->b_hdr = nhdr;
 6759 
 6760                 (void) zfs_refcount_add_many(&arc_anon->arcs_size,
 6761                     arc_buf_size(buf), buf);
 6762         } else {
 6763                 ASSERT(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
 6764                 /* protected by hash lock, or hdr is on arc_anon */
 6765                 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
 6766                 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 6767                 hdr->b_l1hdr.b_mru_hits = 0;
 6768                 hdr->b_l1hdr.b_mru_ghost_hits = 0;
 6769                 hdr->b_l1hdr.b_mfu_hits = 0;
 6770                 hdr->b_l1hdr.b_mfu_ghost_hits = 0;
 6771                 arc_change_state(arc_anon, hdr);
 6772                 hdr->b_l1hdr.b_arc_access = 0;
 6773 
 6774                 mutex_exit(hash_lock);
 6775                 buf_discard_identity(hdr);
 6776                 arc_buf_thaw(buf);
 6777         }
 6778 }
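
/*
 * Editor's note: an illustrative sketch, not part of arc.c.  A caller that
 * wants to modify a buffer it obtained from the ARC must first release it
 * so the dirty copy is no longer shared with other readers; roughly:
 *
 *	arc_release(buf, tag);		// buffer becomes anonymous
 *	ASSERT(arc_released(buf));
 *	// it is now safe to modify buf->b_data in place; the new contents
 *	// are later pushed to disk via arc_write().
 */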
 6779 
 6780 int
 6781 arc_released(arc_buf_t *buf)
 6782 {
 6783         return (buf->b_data != NULL &&
 6784             buf->b_hdr->b_l1hdr.b_state == arc_anon);
 6785 }
 6786 
 6787 #ifdef ZFS_DEBUG
 6788 int
 6789 arc_referenced(arc_buf_t *buf)
 6790 {
 6791         return (zfs_refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt));
 6792 }
 6793 #endif
 6794 
 6795 static void
 6796 arc_write_ready(zio_t *zio)
 6797 {
 6798         arc_write_callback_t *callback = zio->io_private;
 6799         arc_buf_t *buf = callback->awcb_buf;
 6800         arc_buf_hdr_t *hdr = buf->b_hdr;
 6801         blkptr_t *bp = zio->io_bp;
 6802         uint64_t psize = BP_IS_HOLE(bp) ? 0 : BP_GET_PSIZE(bp);
 6803         fstrans_cookie_t cookie = spl_fstrans_mark();
 6804 
 6805         ASSERT(HDR_HAS_L1HDR(hdr));
 6806         ASSERT(!zfs_refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt));
 6807         ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
 6808 
 6809         /*
 6810          * If we're reexecuting this zio because the pool suspended, then
 6811          * clean up any state that was previously set the first time the
 6812          * callback was invoked.
 6813          */
 6814         if (zio->io_flags & ZIO_FLAG_REEXECUTED) {
 6815                 arc_cksum_free(hdr);
 6816                 arc_buf_unwatch(buf);
 6817                 if (hdr->b_l1hdr.b_pabd != NULL) {
 6818                         if (arc_buf_is_shared(buf)) {
 6819                                 arc_unshare_buf(hdr, buf);
 6820                         } else {
 6821                                 arc_hdr_free_abd(hdr, B_FALSE);
 6822                         }
 6823                 }
 6824 
 6825                 if (HDR_HAS_RABD(hdr))
 6826                         arc_hdr_free_abd(hdr, B_TRUE);
 6827         }
 6828         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 6829         ASSERT(!HDR_HAS_RABD(hdr));
 6830         ASSERT(!HDR_SHARED_DATA(hdr));
 6831         ASSERT(!arc_buf_is_shared(buf));
 6832 
 6833         callback->awcb_ready(zio, buf, callback->awcb_private);
 6834 
 6835         if (HDR_IO_IN_PROGRESS(hdr)) {
 6836                 ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED);
 6837         } else {
 6838                 arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
 6839                 add_reference(hdr, hdr); /* For IO_IN_PROGRESS. */
 6840         }
 6841 
 6842         if (BP_IS_PROTECTED(bp) != !!HDR_PROTECTED(hdr))
 6843                 hdr = arc_hdr_realloc_crypt(hdr, BP_IS_PROTECTED(bp));
 6844 
 6845         if (BP_IS_PROTECTED(bp)) {
 6846                 /* ZIL blocks are written through zio_rewrite */
 6847                 ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG);
 6848                 ASSERT(HDR_PROTECTED(hdr));
 6849 
 6850                 if (BP_SHOULD_BYTESWAP(bp)) {
 6851                         if (BP_GET_LEVEL(bp) > 0) {
 6852                                 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;
 6853                         } else {
 6854                                 hdr->b_l1hdr.b_byteswap =
 6855                                     DMU_OT_BYTESWAP(BP_GET_TYPE(bp));
 6856                         }
 6857                 } else {
 6858                         hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
 6859                 }
 6860 
 6861                 hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);
 6862                 hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;
 6863                 zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,
 6864                     hdr->b_crypt_hdr.b_iv);
 6865                 zio_crypt_decode_mac_bp(bp, hdr->b_crypt_hdr.b_mac);
 6866         }
 6867 
 6868         /*
 6869          * If this block was written for raw encryption but the zio layer
 6870          * ended up only authenticating it, adjust the buffer flags now.
 6871          */
 6872         if (BP_IS_AUTHENTICATED(bp) && ARC_BUF_ENCRYPTED(buf)) {
 6873                 arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);
 6874                 buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;
 6875                 if (BP_GET_COMPRESS(bp) == ZIO_COMPRESS_OFF)
 6876                         buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
 6877         } else if (BP_IS_HOLE(bp) && ARC_BUF_ENCRYPTED(buf)) {
 6878                 buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;
 6879                 buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
 6880         }
 6881 
 6882         /* this must be done after the buffer flags are adjusted */
 6883         arc_cksum_compute(buf);
 6884 
 6885         enum zio_compress compress;
 6886         if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {
 6887                 compress = ZIO_COMPRESS_OFF;
 6888         } else {
 6889                 ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));
 6890                 compress = BP_GET_COMPRESS(bp);
 6891         }
 6892         HDR_SET_PSIZE(hdr, psize);
 6893         arc_hdr_set_compress(hdr, compress);
 6894         hdr->b_complevel = zio->io_prop.zp_complevel;
 6895 
 6896         if (zio->io_error != 0 || psize == 0)
 6897                 goto out;
 6898 
 6899         /*
 6900          * Fill the hdr with data. If the buffer is encrypted we have no choice
 6901          * but to copy the data into b_rabd. If the hdr is compressed, the data
 6902          * we want is available from the zio, otherwise we can take it from
 6903          * the buf.
 6904          *
 6905          * We might be able to share the buf's data with the hdr here. However,
 6906          * doing so would cause the ARC to be full of linear ABDs if we write a
 6907          * lot of shareable data. As a compromise, we check whether scattered
 6908          * ABDs are allowed, and assume that if they are then the user wants
 6909          * the ARC to be primarily filled with them regardless of the data being
 6910          * written. Therefore, if they're allowed then we allocate one and copy
 6911          * the data into it; otherwise, we share the data directly if we can.
 6912          */
 6913         if (ARC_BUF_ENCRYPTED(buf)) {
 6914                 ASSERT3U(psize, >, 0);
 6915                 ASSERT(ARC_BUF_COMPRESSED(buf));
 6916                 arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT | ARC_HDR_ALLOC_RDATA |
 6917                     ARC_HDR_USE_RESERVE);
 6918                 abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);
 6919         } else if (!(HDR_UNCACHED(hdr) ||
 6920             abd_size_alloc_linear(arc_buf_size(buf))) ||
 6921             !arc_can_share(hdr, buf)) {
 6922                 /*
 6923                  * Ideally, we would always copy the io_abd into b_pabd, but the
 6924                  * user may have disabled compressed ARC, thus we must check the
 6925                  * hdr's compression setting rather than the io_bp's.
 6926                  */
 6927                 if (BP_IS_ENCRYPTED(bp)) {
 6928                         ASSERT3U(psize, >, 0);
 6929                         arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT |
 6930                             ARC_HDR_ALLOC_RDATA | ARC_HDR_USE_RESERVE);
 6931                         abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);
 6932                 } else if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&
 6933                     !ARC_BUF_COMPRESSED(buf)) {
 6934                         ASSERT3U(psize, >, 0);
 6935                         arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT |
 6936                             ARC_HDR_USE_RESERVE);
 6937                         abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize);
 6938                 } else {
 6939                         ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr));
 6940                         arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT |
 6941                             ARC_HDR_USE_RESERVE);
 6942                         abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data,
 6943                             arc_buf_size(buf));
 6944                 }
 6945         } else {
 6946                 ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd));
 6947                 ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf));
 6948                 ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
 6949 
 6950                 arc_share_buf(hdr, buf);
 6951         }
 6952 
 6953 out:
 6954         arc_hdr_verify(hdr, bp);
 6955         spl_fstrans_unmark(cookie);
 6956 }
 6957 
 6958 static void
 6959 arc_write_children_ready(zio_t *zio)
 6960 {
 6961         arc_write_callback_t *callback = zio->io_private;
 6962         arc_buf_t *buf = callback->awcb_buf;
 6963 
 6964         callback->awcb_children_ready(zio, buf, callback->awcb_private);
 6965 }
 6966 
 6967 /*
 6968  * The SPA calls this callback for each physical write that happens on behalf
 6969  * of a logical write.  See the comment in dbuf_write_physdone() for details.
 6970  */
 6971 static void
 6972 arc_write_physdone(zio_t *zio)
 6973 {
 6974         arc_write_callback_t *cb = zio->io_private;
 6975         if (cb->awcb_physdone != NULL)
 6976                 cb->awcb_physdone(zio, cb->awcb_buf, cb->awcb_private);
 6977 }
 6978 
 6979 static void
 6980 arc_write_done(zio_t *zio)
 6981 {
 6982         arc_write_callback_t *callback = zio->io_private;
 6983         arc_buf_t *buf = callback->awcb_buf;
 6984         arc_buf_hdr_t *hdr = buf->b_hdr;
 6985 
 6986         ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
 6987 
 6988         if (zio->io_error == 0) {
 6989                 arc_hdr_verify(hdr, zio->io_bp);
 6990 
 6991                 if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
 6992                         buf_discard_identity(hdr);
 6993                 } else {
 6994                         hdr->b_dva = *BP_IDENTITY(zio->io_bp);
 6995                         hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp);
 6996                 }
 6997         } else {
 6998                 ASSERT(HDR_EMPTY(hdr));
 6999         }
 7000 
 7001         /*
 7002          * If the block to be written was all-zero or compressed enough to be
 7003          * embedded in the BP, no write was performed so there will be no
 7004          * dva/birth/checksum.  The buffer must therefore remain anonymous
 7005          * (and uncached).
 7006          */
 7007         if (!HDR_EMPTY(hdr)) {
 7008                 arc_buf_hdr_t *exists;
 7009                 kmutex_t *hash_lock;
 7010 
 7011                 ASSERT3U(zio->io_error, ==, 0);
 7012 
 7013                 arc_cksum_verify(buf);
 7014 
 7015                 exists = buf_hash_insert(hdr, &hash_lock);
 7016                 if (exists != NULL) {
 7017                         /*
 7018                          * This can only happen if we overwrite for
 7019                          * sync-to-convergence, because we remove
 7020                          * buffers from the hash table when we arc_free().
 7021                          */
 7022                         if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
 7023                                 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
 7024                                         panic("bad overwrite, hdr=%p exists=%p",
 7025                                             (void *)hdr, (void *)exists);
 7026                                 ASSERT(zfs_refcount_is_zero(
 7027                                     &exists->b_l1hdr.b_refcnt));
 7028                                 arc_change_state(arc_anon, exists);
 7029                                 arc_hdr_destroy(exists);
 7030                                 mutex_exit(hash_lock);
 7031                                 exists = buf_hash_insert(hdr, &hash_lock);
 7032                                 ASSERT3P(exists, ==, NULL);
 7033                         } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
 7034                                 /* nopwrite */
 7035                                 ASSERT(zio->io_prop.zp_nopwrite);
 7036                                 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
 7037                                         panic("bad nopwrite, hdr=%p exists=%p",
 7038                                             (void *)hdr, (void *)exists);
 7039                         } else {
 7040                                 /* Dedup */
 7041                                 ASSERT(hdr->b_l1hdr.b_bufcnt == 1);
 7042                                 ASSERT(hdr->b_l1hdr.b_state == arc_anon);
 7043                                 ASSERT(BP_GET_DEDUP(zio->io_bp));
 7044                                 ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
 7045                         }
 7046                 }
 7047                 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
 7048                 VERIFY3S(remove_reference(hdr, hdr), >, 0);
 7049                 /* if it's not anon, we are doing a scrub */
 7050                 if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)
 7051                         arc_access(hdr, 0, B_FALSE);
 7052                 mutex_exit(hash_lock);
 7053         } else {
 7054                 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
 7055                 VERIFY3S(remove_reference(hdr, hdr), >, 0);
 7056         }
 7057 
 7058         callback->awcb_done(zio, buf, callback->awcb_private);
 7059 
 7060         abd_free(zio->io_abd);
 7061         kmem_free(callback, sizeof (arc_write_callback_t));
 7062 }
 7063 
 7064 zio_t *
 7065 arc_write(zio_t *pio, spa_t *spa, uint64_t txg,
 7066     blkptr_t *bp, arc_buf_t *buf, boolean_t uncached, boolean_t l2arc,
 7067     const zio_prop_t *zp, arc_write_done_func_t *ready,
 7068     arc_write_done_func_t *children_ready, arc_write_done_func_t *physdone,
 7069     arc_write_done_func_t *done, void *private, zio_priority_t priority,
 7070     int zio_flags, const zbookmark_phys_t *zb)
 7071 {
 7072         arc_buf_hdr_t *hdr = buf->b_hdr;
 7073         arc_write_callback_t *callback;
 7074         zio_t *zio;
 7075         zio_prop_t localprop = *zp;
 7076 
 7077         ASSERT3P(ready, !=, NULL);
 7078         ASSERT3P(done, !=, NULL);
 7079         ASSERT(!HDR_IO_ERROR(hdr));
 7080         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
 7081         ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
 7082         ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
 7083         if (uncached)
 7084                 arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED);
 7085         else if (l2arc)
 7086                 arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
 7087 
 7088         if (ARC_BUF_ENCRYPTED(buf)) {
 7089                 ASSERT(ARC_BUF_COMPRESSED(buf));
 7090                 localprop.zp_encrypt = B_TRUE;
 7091                 localprop.zp_compress = HDR_GET_COMPRESS(hdr);
 7092                 localprop.zp_complevel = hdr->b_complevel;
 7093                 localprop.zp_byteorder =
 7094                     (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?
 7095                     ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;
 7096                 memcpy(localprop.zp_salt, hdr->b_crypt_hdr.b_salt,
 7097                     ZIO_DATA_SALT_LEN);
 7098                 memcpy(localprop.zp_iv, hdr->b_crypt_hdr.b_iv,
 7099                     ZIO_DATA_IV_LEN);
 7100                 memcpy(localprop.zp_mac, hdr->b_crypt_hdr.b_mac,
 7101                     ZIO_DATA_MAC_LEN);
 7102                 if (DMU_OT_IS_ENCRYPTED(localprop.zp_type)) {
 7103                         localprop.zp_nopwrite = B_FALSE;
 7104                         localprop.zp_copies =
 7105                             MIN(localprop.zp_copies, SPA_DVAS_PER_BP - 1);
 7106                 }
 7107                 zio_flags |= ZIO_FLAG_RAW;
 7108         } else if (ARC_BUF_COMPRESSED(buf)) {
 7109                 ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf));
 7110                 localprop.zp_compress = HDR_GET_COMPRESS(hdr);
 7111                 localprop.zp_complevel = hdr->b_complevel;
 7112                 zio_flags |= ZIO_FLAG_RAW_COMPRESS;
 7113         }
 7114         callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
 7115         callback->awcb_ready = ready;
 7116         callback->awcb_children_ready = children_ready;
 7117         callback->awcb_physdone = physdone;
 7118         callback->awcb_done = done;
 7119         callback->awcb_private = private;
 7120         callback->awcb_buf = buf;
 7121 
 7122         /*
 7123          * The hdr's b_pabd is now stale; free it now. A new data block
 7124          * will be allocated when the zio pipeline calls arc_write_ready().
 7125          */
 7126         if (hdr->b_l1hdr.b_pabd != NULL) {
 7127                 /*
 7128                  * If the buf is currently sharing the data block with
 7129                  * the hdr then we need to break that relationship here.
 7130                  * The hdr will remain with a NULL data pointer and the
 7131                  * buf will take sole ownership of the block.
 7132                  */
 7133                 if (arc_buf_is_shared(buf)) {
 7134                         arc_unshare_buf(hdr, buf);
 7135                 } else {
 7136                         arc_hdr_free_abd(hdr, B_FALSE);
 7137                 }
 7138                 VERIFY3P(buf->b_data, !=, NULL);
 7139         }
 7140 
 7141         if (HDR_HAS_RABD(hdr))
 7142                 arc_hdr_free_abd(hdr, B_TRUE);
 7143 
 7144         if (!(zio_flags & ZIO_FLAG_RAW))
 7145                 arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);
 7146 
 7147         ASSERT(!arc_buf_is_shared(buf));
 7148         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
 7149 
 7150         zio = zio_write(pio, spa, txg, bp,
 7151             abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
 7152             HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
 7153             (children_ready != NULL) ? arc_write_children_ready : NULL,
 7154             arc_write_physdone, arc_write_done, callback,
 7155             priority, zio_flags, zb);
 7156 
 7157         return (zio);
 7158 }
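
/*
 * Editor's note: an illustrative sketch, not part of arc.c.  Writing a
 * previously released buffer back out with arc_write() might look roughly
 * like this ("my_ready", "my_done", and "my_arg" are hypothetical
 * callbacks and context):
 *
 *	zio_t *wzio = arc_write(pio, spa, txg, bp, buf,
 *	    B_FALSE,			// uncached
 *	    B_TRUE,			// eligible for L2ARC
 *	    &zp, my_ready, NULL, NULL, my_done, my_arg,
 *	    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, &zb);
 *	zio_nowait(wzio);
 *
 * The ready callback is invoked from arc_write_ready() above once the
 * block's physical size and compression are known; the done callback is
 * invoked from arc_write_done() after the write completes.
 */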
 7159 
 7160 void
 7161 arc_tempreserve_clear(uint64_t reserve)
 7162 {
 7163         atomic_add_64(&arc_tempreserve, -reserve);
 7164         ASSERT((int64_t)arc_tempreserve >= 0);
 7165 }
 7166 
 7167 int
 7168 arc_tempreserve_space(spa_t *spa, uint64_t reserve, uint64_t txg)
 7169 {
 7170         int error;
 7171         uint64_t anon_size;
 7172 
 7173         if (!arc_no_grow &&
 7174             reserve > arc_c/4 &&
 7175             reserve * 4 > (2ULL << SPA_MAXBLOCKSHIFT))
 7176                 arc_c = MIN(arc_c_max, reserve * 4);
 7177 
 7178         /*
 7179          * Throttle when the calculated memory footprint for the TXG
 7180          * exceeds the target ARC size.
 7181          */
 7182         if (reserve > arc_c) {
 7183                 DMU_TX_STAT_BUMP(dmu_tx_memory_reserve);
 7184                 return (SET_ERROR(ERESTART));
 7185         }
 7186 
 7187         /*
 7188          * Don't count loaned bufs as in flight dirty data to prevent long
 7189          * network delays from blocking transactions that are ready to be
 7190          * assigned to a txg.
 7191          */
 7192 
 7193         /* assert that it has not wrapped around */
 7194         ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
 7195 
 7196         anon_size = MAX((int64_t)(zfs_refcount_count(&arc_anon->arcs_size) -
 7197             arc_loaned_bytes), 0);
 7198 
 7199         /*
 7200          * Writes will, almost always, require additional memory allocations
 7201          * in order to compress/encrypt/etc the data.  We therefore need to
 7202          * make sure that there is sufficient available memory for this.
 7203          */
 7204         error = arc_memory_throttle(spa, reserve, txg);
 7205         if (error != 0)
 7206                 return (error);
 7207 
 7208         /*
 7209          * Throttle writes when the amount of dirty data in the cache
 7210          * gets too large.  We try to keep the cache less than half full
 7211          * of dirty blocks so that our sync times don't grow too large.
 7212          *
 7213          * In the case of one pool being built on another pool, we want
 7214          * to make sure we don't end up throttling the lower (backing)
 7215          * pool when the upper pool is the majority contributor to dirty
 7216          * data. To ensure we make forward progress during throttling, we
 7217          * also check the current pool's net dirty data and only throttle
 7218          * if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
 7219          * data in the cache.
 7220          *
 7221          * Note: if two requests come in concurrently, we might let them
 7222          * both succeed, when one of them should fail.  Not a huge deal.
 7223          */
 7224         uint64_t total_dirty = reserve + arc_tempreserve + anon_size;
 7225         uint64_t spa_dirty_anon = spa_dirty_data(spa);
 7226         uint64_t rarc_c = arc_warm ? arc_c : arc_c_max;
 7227         if (total_dirty > rarc_c * zfs_arc_dirty_limit_percent / 100 &&
 7228             anon_size > rarc_c * zfs_arc_anon_limit_percent / 100 &&
 7229             spa_dirty_anon > anon_size * zfs_arc_pool_dirty_percent / 100) {
 7230 #ifdef ZFS_DEBUG
 7231                 uint64_t meta_esize = zfs_refcount_count(
 7232                     &arc_anon->arcs_esize[ARC_BUFC_METADATA]);
 7233                 uint64_t data_esize =
 7234                     zfs_refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
 7235                 dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
 7236                     "anon_data=%lluK tempreserve=%lluK rarc_c=%lluK\n",
 7237                     (u_longlong_t)arc_tempreserve >> 10,
 7238                     (u_longlong_t)meta_esize >> 10,
 7239                     (u_longlong_t)data_esize >> 10,
 7240                     (u_longlong_t)reserve >> 10,
 7241                     (u_longlong_t)rarc_c >> 10);
 7242 #endif
 7243                 DMU_TX_STAT_BUMP(dmu_tx_dirty_throttle);
 7244                 return (SET_ERROR(ERESTART));
 7245         }
 7246         atomic_add_64(&arc_tempreserve, reserve);
 7247         return (0);
 7248 }
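
/*
 * Editor's note: an illustrative sketch, not part of arc.c.  A writer that
 * is about to dirty "nbytes" of anonymous data for transaction group "txg"
 * reserves the space up front and clears the reservation once the dirty
 * data has been accounted for elsewhere; roughly:
 *
 *	error = arc_tempreserve_space(spa, nbytes, txg);
 *	if (error == ERESTART) {
 *		// too much dirty or anonymous data: back off and retry
 *	} else if (error == 0) {
 *		// ... proceed with the write ...
 *		arc_tempreserve_clear(nbytes);
 *	}
 */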
 7249 
 7250 static void
 7251 arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
 7252     kstat_named_t *evict_data, kstat_named_t *evict_metadata)
 7253 {
 7254         size->value.ui64 = zfs_refcount_count(&state->arcs_size);
 7255         evict_data->value.ui64 =
 7256             zfs_refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
 7257         evict_metadata->value.ui64 =
 7258             zfs_refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
 7259 }
 7260 
 7261 static int
 7262 arc_kstat_update(kstat_t *ksp, int rw)
 7263 {
 7264         arc_stats_t *as = ksp->ks_data;
 7265 
 7266         if (rw == KSTAT_WRITE)
 7267                 return (SET_ERROR(EACCES));
 7268 
 7269         as->arcstat_hits.value.ui64 =
 7270             wmsum_value(&arc_sums.arcstat_hits);
 7271         as->arcstat_iohits.value.ui64 =
 7272             wmsum_value(&arc_sums.arcstat_iohits);
 7273         as->arcstat_misses.value.ui64 =
 7274             wmsum_value(&arc_sums.arcstat_misses);
 7275         as->arcstat_demand_data_hits.value.ui64 =
 7276             wmsum_value(&arc_sums.arcstat_demand_data_hits);
 7277         as->arcstat_demand_data_iohits.value.ui64 =
 7278             wmsum_value(&arc_sums.arcstat_demand_data_iohits);
 7279         as->arcstat_demand_data_misses.value.ui64 =
 7280             wmsum_value(&arc_sums.arcstat_demand_data_misses);
 7281         as->arcstat_demand_metadata_hits.value.ui64 =
 7282             wmsum_value(&arc_sums.arcstat_demand_metadata_hits);
 7283         as->arcstat_demand_metadata_iohits.value.ui64 =
 7284             wmsum_value(&arc_sums.arcstat_demand_metadata_iohits);
 7285         as->arcstat_demand_metadata_misses.value.ui64 =
 7286             wmsum_value(&arc_sums.arcstat_demand_metadata_misses);
 7287         as->arcstat_prefetch_data_hits.value.ui64 =
 7288             wmsum_value(&arc_sums.arcstat_prefetch_data_hits);
 7289         as->arcstat_prefetch_data_iohits.value.ui64 =
 7290             wmsum_value(&arc_sums.arcstat_prefetch_data_iohits);
 7291         as->arcstat_prefetch_data_misses.value.ui64 =
 7292             wmsum_value(&arc_sums.arcstat_prefetch_data_misses);
 7293         as->arcstat_prefetch_metadata_hits.value.ui64 =
 7294             wmsum_value(&arc_sums.arcstat_prefetch_metadata_hits);
 7295         as->arcstat_prefetch_metadata_iohits.value.ui64 =
 7296             wmsum_value(&arc_sums.arcstat_prefetch_metadata_iohits);
 7297         as->arcstat_prefetch_metadata_misses.value.ui64 =
 7298             wmsum_value(&arc_sums.arcstat_prefetch_metadata_misses);
 7299         as->arcstat_mru_hits.value.ui64 =
 7300             wmsum_value(&arc_sums.arcstat_mru_hits);
 7301         as->arcstat_mru_ghost_hits.value.ui64 =
 7302             wmsum_value(&arc_sums.arcstat_mru_ghost_hits);
 7303         as->arcstat_mfu_hits.value.ui64 =
 7304             wmsum_value(&arc_sums.arcstat_mfu_hits);
 7305         as->arcstat_mfu_ghost_hits.value.ui64 =
 7306             wmsum_value(&arc_sums.arcstat_mfu_ghost_hits);
 7307         as->arcstat_uncached_hits.value.ui64 =
 7308             wmsum_value(&arc_sums.arcstat_uncached_hits);
 7309         as->arcstat_deleted.value.ui64 =
 7310             wmsum_value(&arc_sums.arcstat_deleted);
 7311         as->arcstat_mutex_miss.value.ui64 =
 7312             wmsum_value(&arc_sums.arcstat_mutex_miss);
 7313         as->arcstat_access_skip.value.ui64 =
 7314             wmsum_value(&arc_sums.arcstat_access_skip);
 7315         as->arcstat_evict_skip.value.ui64 =
 7316             wmsum_value(&arc_sums.arcstat_evict_skip);
 7317         as->arcstat_evict_not_enough.value.ui64 =
 7318             wmsum_value(&arc_sums.arcstat_evict_not_enough);
 7319         as->arcstat_evict_l2_cached.value.ui64 =
 7320             wmsum_value(&arc_sums.arcstat_evict_l2_cached);
 7321         as->arcstat_evict_l2_eligible.value.ui64 =
 7322             wmsum_value(&arc_sums.arcstat_evict_l2_eligible);
 7323         as->arcstat_evict_l2_eligible_mfu.value.ui64 =
 7324             wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mfu);
 7325         as->arcstat_evict_l2_eligible_mru.value.ui64 =
 7326             wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mru);
 7327         as->arcstat_evict_l2_ineligible.value.ui64 =
 7328             wmsum_value(&arc_sums.arcstat_evict_l2_ineligible);
 7329         as->arcstat_evict_l2_skip.value.ui64 =
 7330             wmsum_value(&arc_sums.arcstat_evict_l2_skip);
 7331         as->arcstat_hash_collisions.value.ui64 =
 7332             wmsum_value(&arc_sums.arcstat_hash_collisions);
 7333         as->arcstat_hash_chains.value.ui64 =
 7334             wmsum_value(&arc_sums.arcstat_hash_chains);
 7335         as->arcstat_size.value.ui64 =
 7336             aggsum_value(&arc_sums.arcstat_size);
 7337         as->arcstat_compressed_size.value.ui64 =
 7338             wmsum_value(&arc_sums.arcstat_compressed_size);
 7339         as->arcstat_uncompressed_size.value.ui64 =
 7340             wmsum_value(&arc_sums.arcstat_uncompressed_size);
 7341         as->arcstat_overhead_size.value.ui64 =
 7342             wmsum_value(&arc_sums.arcstat_overhead_size);
 7343         as->arcstat_hdr_size.value.ui64 =
 7344             wmsum_value(&arc_sums.arcstat_hdr_size);
 7345         as->arcstat_data_size.value.ui64 =
 7346             wmsum_value(&arc_sums.arcstat_data_size);
 7347         as->arcstat_metadata_size.value.ui64 =
 7348             wmsum_value(&arc_sums.arcstat_metadata_size);
 7349         as->arcstat_dbuf_size.value.ui64 =
 7350             wmsum_value(&arc_sums.arcstat_dbuf_size);
 7351 #if defined(COMPAT_FREEBSD11)
 7352         as->arcstat_other_size.value.ui64 =
 7353             wmsum_value(&arc_sums.arcstat_bonus_size) +
 7354             aggsum_value(&arc_sums.arcstat_dnode_size) +
 7355             wmsum_value(&arc_sums.arcstat_dbuf_size);
 7356 #endif
 7357 
 7358         arc_kstat_update_state(arc_anon,
 7359             &as->arcstat_anon_size,
 7360             &as->arcstat_anon_evictable_data,
 7361             &as->arcstat_anon_evictable_metadata);
 7362         arc_kstat_update_state(arc_mru,
 7363             &as->arcstat_mru_size,
 7364             &as->arcstat_mru_evictable_data,
 7365             &as->arcstat_mru_evictable_metadata);
 7366         arc_kstat_update_state(arc_mru_ghost,
 7367             &as->arcstat_mru_ghost_size,
 7368             &as->arcstat_mru_ghost_evictable_data,
 7369             &as->arcstat_mru_ghost_evictable_metadata);
 7370         arc_kstat_update_state(arc_mfu,
 7371             &as->arcstat_mfu_size,
 7372             &as->arcstat_mfu_evictable_data,
 7373             &as->arcstat_mfu_evictable_metadata);
 7374         arc_kstat_update_state(arc_mfu_ghost,
 7375             &as->arcstat_mfu_ghost_size,
 7376             &as->arcstat_mfu_ghost_evictable_data,
 7377             &as->arcstat_mfu_ghost_evictable_metadata);
 7378         arc_kstat_update_state(arc_uncached,
 7379             &as->arcstat_uncached_size,
 7380             &as->arcstat_uncached_evictable_data,
 7381             &as->arcstat_uncached_evictable_metadata);
 7382 
 7383         as->arcstat_dnode_size.value.ui64 =
 7384             aggsum_value(&arc_sums.arcstat_dnode_size);
 7385         as->arcstat_bonus_size.value.ui64 =
 7386             wmsum_value(&arc_sums.arcstat_bonus_size);
 7387         as->arcstat_l2_hits.value.ui64 =
 7388             wmsum_value(&arc_sums.arcstat_l2_hits);
 7389         as->arcstat_l2_misses.value.ui64 =
 7390             wmsum_value(&arc_sums.arcstat_l2_misses);
 7391         as->arcstat_l2_prefetch_asize.value.ui64 =
 7392             wmsum_value(&arc_sums.arcstat_l2_prefetch_asize);
 7393         as->arcstat_l2_mru_asize.value.ui64 =
 7394             wmsum_value(&arc_sums.arcstat_l2_mru_asize);
 7395         as->arcstat_l2_mfu_asize.value.ui64 =
 7396             wmsum_value(&arc_sums.arcstat_l2_mfu_asize);
 7397         as->arcstat_l2_bufc_data_asize.value.ui64 =
 7398             wmsum_value(&arc_sums.arcstat_l2_bufc_data_asize);
 7399         as->arcstat_l2_bufc_metadata_asize.value.ui64 =
 7400             wmsum_value(&arc_sums.arcstat_l2_bufc_metadata_asize);
 7401         as->arcstat_l2_feeds.value.ui64 =
 7402             wmsum_value(&arc_sums.arcstat_l2_feeds);
 7403         as->arcstat_l2_rw_clash.value.ui64 =
 7404             wmsum_value(&arc_sums.arcstat_l2_rw_clash);
 7405         as->arcstat_l2_read_bytes.value.ui64 =
 7406             wmsum_value(&arc_sums.arcstat_l2_read_bytes);
 7407         as->arcstat_l2_write_bytes.value.ui64 =
 7408             wmsum_value(&arc_sums.arcstat_l2_write_bytes);
 7409         as->arcstat_l2_writes_sent.value.ui64 =
 7410             wmsum_value(&arc_sums.arcstat_l2_writes_sent);
 7411         as->arcstat_l2_writes_done.value.ui64 =
 7412             wmsum_value(&arc_sums.arcstat_l2_writes_done);
 7413         as->arcstat_l2_writes_error.value.ui64 =
 7414             wmsum_value(&arc_sums.arcstat_l2_writes_error);
 7415         as->arcstat_l2_writes_lock_retry.value.ui64 =
 7416             wmsum_value(&arc_sums.arcstat_l2_writes_lock_retry);
 7417         as->arcstat_l2_evict_lock_retry.value.ui64 =
 7418             wmsum_value(&arc_sums.arcstat_l2_evict_lock_retry);
 7419         as->arcstat_l2_evict_reading.value.ui64 =
 7420             wmsum_value(&arc_sums.arcstat_l2_evict_reading);
 7421         as->arcstat_l2_evict_l1cached.value.ui64 =
 7422             wmsum_value(&arc_sums.arcstat_l2_evict_l1cached);
 7423         as->arcstat_l2_free_on_write.value.ui64 =
 7424             wmsum_value(&arc_sums.arcstat_l2_free_on_write);
 7425         as->arcstat_l2_abort_lowmem.value.ui64 =
 7426             wmsum_value(&arc_sums.arcstat_l2_abort_lowmem);
 7427         as->arcstat_l2_cksum_bad.value.ui64 =
 7428             wmsum_value(&arc_sums.arcstat_l2_cksum_bad);
 7429         as->arcstat_l2_io_error.value.ui64 =
 7430             wmsum_value(&arc_sums.arcstat_l2_io_error);
 7431         as->arcstat_l2_lsize.value.ui64 =
 7432             wmsum_value(&arc_sums.arcstat_l2_lsize);
 7433         as->arcstat_l2_psize.value.ui64 =
 7434             wmsum_value(&arc_sums.arcstat_l2_psize);
 7435         as->arcstat_l2_hdr_size.value.ui64 =
 7436             aggsum_value(&arc_sums.arcstat_l2_hdr_size);
 7437         as->arcstat_l2_log_blk_writes.value.ui64 =
 7438             wmsum_value(&arc_sums.arcstat_l2_log_blk_writes);
 7439         as->arcstat_l2_log_blk_asize.value.ui64 =
 7440             wmsum_value(&arc_sums.arcstat_l2_log_blk_asize);
 7441         as->arcstat_l2_log_blk_count.value.ui64 =
 7442             wmsum_value(&arc_sums.arcstat_l2_log_blk_count);
 7443         as->arcstat_l2_rebuild_success.value.ui64 =
 7444             wmsum_value(&arc_sums.arcstat_l2_rebuild_success);
 7445         as->arcstat_l2_rebuild_abort_unsupported.value.ui64 =
 7446             wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_unsupported);
 7447         as->arcstat_l2_rebuild_abort_io_errors.value.ui64 =
 7448             wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_io_errors);
 7449         as->arcstat_l2_rebuild_abort_dh_errors.value.ui64 =
 7450             wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_dh_errors);
 7451         as->arcstat_l2_rebuild_abort_cksum_lb_errors.value.ui64 =
 7452             wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors);
 7453         as->arcstat_l2_rebuild_abort_lowmem.value.ui64 =
 7454             wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_lowmem);
 7455         as->arcstat_l2_rebuild_size.value.ui64 =
 7456             wmsum_value(&arc_sums.arcstat_l2_rebuild_size);
 7457         as->arcstat_l2_rebuild_asize.value.ui64 =
 7458             wmsum_value(&arc_sums.arcstat_l2_rebuild_asize);
 7459         as->arcstat_l2_rebuild_bufs.value.ui64 =
 7460             wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs);
 7461         as->arcstat_l2_rebuild_bufs_precached.value.ui64 =
 7462             wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs_precached);
 7463         as->arcstat_l2_rebuild_log_blks.value.ui64 =
 7464             wmsum_value(&arc_sums.arcstat_l2_rebuild_log_blks);
 7465         as->arcstat_memory_throttle_count.value.ui64 =
 7466             wmsum_value(&arc_sums.arcstat_memory_throttle_count);
 7467         as->arcstat_memory_direct_count.value.ui64 =
 7468             wmsum_value(&arc_sums.arcstat_memory_direct_count);
 7469         as->arcstat_memory_indirect_count.value.ui64 =
 7470             wmsum_value(&arc_sums.arcstat_memory_indirect_count);
 7471 
 7472         as->arcstat_memory_all_bytes.value.ui64 =
 7473             arc_all_memory();
 7474         as->arcstat_memory_free_bytes.value.ui64 =
 7475             arc_free_memory();
 7476         as->arcstat_memory_available_bytes.value.i64 =
 7477             arc_available_memory();
 7478 
 7479         as->arcstat_prune.value.ui64 =
 7480             wmsum_value(&arc_sums.arcstat_prune);
 7481         as->arcstat_meta_used.value.ui64 =
 7482             aggsum_value(&arc_sums.arcstat_meta_used);
 7483         as->arcstat_async_upgrade_sync.value.ui64 =
 7484             wmsum_value(&arc_sums.arcstat_async_upgrade_sync);
 7485         as->arcstat_predictive_prefetch.value.ui64 =
 7486             wmsum_value(&arc_sums.arcstat_predictive_prefetch);
 7487         as->arcstat_demand_hit_predictive_prefetch.value.ui64 =
 7488             wmsum_value(&arc_sums.arcstat_demand_hit_predictive_prefetch);
 7489         as->arcstat_demand_iohit_predictive_prefetch.value.ui64 =
 7490             wmsum_value(&arc_sums.arcstat_demand_iohit_predictive_prefetch);
 7491         as->arcstat_prescient_prefetch.value.ui64 =
 7492             wmsum_value(&arc_sums.arcstat_prescient_prefetch);
 7493         as->arcstat_demand_hit_prescient_prefetch.value.ui64 =
 7494             wmsum_value(&arc_sums.arcstat_demand_hit_prescient_prefetch);
 7495         as->arcstat_demand_iohit_prescient_prefetch.value.ui64 =
 7496             wmsum_value(&arc_sums.arcstat_demand_iohit_prescient_prefetch);
 7497         as->arcstat_raw_size.value.ui64 =
 7498             wmsum_value(&arc_sums.arcstat_raw_size);
 7499         as->arcstat_cached_only_in_progress.value.ui64 =
 7500             wmsum_value(&arc_sums.arcstat_cached_only_in_progress);
 7501         as->arcstat_abd_chunk_waste_size.value.ui64 =
 7502             wmsum_value(&arc_sums.arcstat_abd_chunk_waste_size);
 7503 
 7504         return (0);
 7505 }
 7506 
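/*
 * Editor's note: standalone, illustrative sketch only -- not part of arc.c.
 * It shows one way a userland program could read a value published by
 * arc_kstat_update() above.  On FreeBSD the kstat created in arc_init()
 * below with kstat_create("zfs", 0, "arcstats", "misc", ...) surfaces as
 * sysctl nodes; the exact node name "kstat.zfs.misc.arcstats.size" is an
 * assumption based on that module/class/name tuple.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t arc_size;
	size_t len = sizeof (arc_size);

	/* Fetch the current total ARC size as reported by the arcstats kstat. */
	if (sysctlbyname("kstat.zfs.misc.arcstats.size", &arc_size,
	    &len, NULL, 0) != 0) {
		perror("sysctlbyname");
		return (1);
	}
	printf("ARC size: %ju bytes\n", (uintmax_t)arc_size);
	return (0);
}
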
 7507 /*
 7508  * This function *must* return indices evenly distributed between all
 7509  * sublists of the multilist. This is needed due to how the ARC eviction
 7510  * code is laid out; arc_evict_state() assumes ARC buffers are evenly
 7511  * distributed between all sublists and uses this assumption when
 7512  * deciding which sublist to evict from and how much to evict from it.
 7513  */
 7514 static unsigned int
 7515 arc_state_multilist_index_func(multilist_t *ml, void *obj)
 7516 {
 7517         arc_buf_hdr_t *hdr = obj;
 7518 
 7519         /*
 7520          * We rely on b_dva to generate evenly distributed index
 7521          * numbers using buf_hash below. So, as an added precaution,
 7522          * let's make sure we never add empty buffers to the arc lists.
 7523          */
 7524         ASSERT(!HDR_EMPTY(hdr));
 7525 
 7526         /*
 7527          * The assumption here is that the hash value for a given
 7528          * arc_buf_hdr_t will remain constant throughout its lifetime
 7529          * (i.e. its b_spa, b_dva, and b_birth fields don't change).
 7530          * Thus, we don't need to store the header's sublist index
 7531          * on insertion, as this index can be recalculated on removal.
 7532          *
 7533          * Also, the low-order bits of the hash value are assumed to be
 7534          * distributed evenly; otherwise, when the multilist has a
 7535          * power-of-two number of sublists, each sublist's usage would
 7536          * not be evenly distributed.  In this context a full 64-bit
 7537          * division would be a waste of time, so limit it to 32 bits.
 7538          */
 7539         return ((unsigned int)buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %
 7540             multilist_get_num_sublists(ml));
 7541 }
 7542 
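/*
 * Editor's note: standalone, illustrative sketch (not part of arc.c) of the
 * reasoning in arc_state_multilist_index_func() above: because the index is
 * "hash % num_sublists", a power-of-two sublist count only ever consumes the
 * hash's low-order bits, so those bits must be well distributed.  The toy
 * hash sequence below is an assumption chosen to make the skew obvious.
 */
#include <stdio.h>

int
main(void)
{
	unsigned int num_sublists = 8;	/* power of two, as in the comment */
	unsigned int counts[8] = { 0 };

	/*
	 * A degenerate "hash" that only produces multiples of 4: the modulo
	 * then only ever selects sublists 0 and 4, leaving the rest empty.
	 */
	for (unsigned int h = 0; h < 1024; h += 4)
		counts[h % num_sublists]++;

	for (unsigned int i = 0; i < num_sublists; i++)
		printf("sublist %u: %u entries\n", i, counts[i]);
	return (0);
}
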
 7543 static unsigned int
 7544 arc_state_l2c_multilist_index_func(multilist_t *ml, void *obj)
 7545 {
 7546         panic("Header %p insert into arc_l2c_only %p", obj, ml);
 7547 }
 7548 
 7549 #define WARN_IF_TUNING_IGNORED(tuning, value, do_warn) do {     \
 7550         if ((do_warn) && (tuning) && ((tuning) != (value))) {   \
 7551                 cmn_err(CE_WARN,                                \
 7552                     "ignoring tunable %s (using %llu instead)", \
 7553                     (#tuning), (u_longlong_t)(value));  \
 7554         }                                                       \
 7555 } while (0)
 7556 
 7557 /*
 7558  * Called during module initialization and periodically thereafter to
 7559  * apply reasonable changes to the exposed performance tunings.  Can also be
 7560  * called explicitly by param_set_arc_*() functions when ARC tunables are
 7561  * updated manually.  Non-zero zfs_* values which differ from the currently set
 7562  * values will be applied.
 7563  */
 7564 void
 7565 arc_tuning_update(boolean_t verbose)
 7566 {
 7567         uint64_t allmem = arc_all_memory();
 7568         unsigned long limit;
 7569 
 7570         /* Valid range: 32M - <arc_c_max> */
 7571         if ((zfs_arc_min) && (zfs_arc_min != arc_c_min) &&
 7572             (zfs_arc_min >= 2ULL << SPA_MAXBLOCKSHIFT) &&
 7573             (zfs_arc_min <= arc_c_max)) {
 7574                 arc_c_min = zfs_arc_min;
 7575                 arc_c = MAX(arc_c, arc_c_min);
 7576         }
 7577         WARN_IF_TUNING_IGNORED(zfs_arc_min, arc_c_min, verbose);
 7578 
 7579         /* Valid range: 64M - <all physical memory> */
 7580         if ((zfs_arc_max) && (zfs_arc_max != arc_c_max) &&
 7581             (zfs_arc_max >= MIN_ARC_MAX) && (zfs_arc_max < allmem) &&
 7582             (zfs_arc_max > arc_c_min)) {
 7583                 arc_c_max = zfs_arc_max;
 7584                 arc_c = MIN(arc_c, arc_c_max);
 7585                 arc_p = (arc_c >> 1);
 7586                 if (arc_meta_limit > arc_c_max)
 7587                         arc_meta_limit = arc_c_max;
 7588                 if (arc_dnode_size_limit > arc_meta_limit)
 7589                         arc_dnode_size_limit = arc_meta_limit;
 7590         }
 7591         WARN_IF_TUNING_IGNORED(zfs_arc_max, arc_c_max, verbose);
 7592 
 7593         /* Valid range: 16M - <arc_c_max> */
 7594         if ((zfs_arc_meta_min) && (zfs_arc_meta_min != arc_meta_min) &&
 7595             (zfs_arc_meta_min >= 1ULL << SPA_MAXBLOCKSHIFT) &&
 7596             (zfs_arc_meta_min <= arc_c_max)) {
 7597                 arc_meta_min = zfs_arc_meta_min;
 7598                 if (arc_meta_limit < arc_meta_min)
 7599                         arc_meta_limit = arc_meta_min;
 7600                 if (arc_dnode_size_limit < arc_meta_min)
 7601                         arc_dnode_size_limit = arc_meta_min;
 7602         }
 7603         WARN_IF_TUNING_IGNORED(zfs_arc_meta_min, arc_meta_min, verbose);
 7604 
 7605         /* Valid range: <arc_meta_min> - <arc_c_max> */
 7606         limit = zfs_arc_meta_limit ? zfs_arc_meta_limit :
 7607             MIN(zfs_arc_meta_limit_percent, 100) * arc_c_max / 100;
 7608         if ((limit != arc_meta_limit) &&
 7609             (limit >= arc_meta_min) &&
 7610             (limit <= arc_c_max))
 7611                 arc_meta_limit = limit;
 7612         WARN_IF_TUNING_IGNORED(zfs_arc_meta_limit, arc_meta_limit, verbose);
 7613 
 7614         /* Valid range: <arc_meta_min> - <arc_meta_limit> */
 7615         limit = zfs_arc_dnode_limit ? zfs_arc_dnode_limit :
 7616             MIN(zfs_arc_dnode_limit_percent, 100) * arc_meta_limit / 100;
 7617         if ((limit != arc_dnode_size_limit) &&
 7618             (limit >= arc_meta_min) &&
 7619             (limit <= arc_meta_limit))
 7620                 arc_dnode_size_limit = limit;
 7621         WARN_IF_TUNING_IGNORED(zfs_arc_dnode_limit, arc_dnode_size_limit,
 7622             verbose);
 7623 
 7624         /* Valid range: 1 - N */
 7625         if (zfs_arc_grow_retry)
 7626                 arc_grow_retry = zfs_arc_grow_retry;
 7627 
 7628         /* Valid range: 1 - N */
 7629         if (zfs_arc_shrink_shift) {
 7630                 arc_shrink_shift = zfs_arc_shrink_shift;
 7631                 arc_no_grow_shift = MIN(arc_no_grow_shift, arc_shrink_shift - 1);
 7632         }
 7633 
 7634         /* Valid range: 1 - N */
 7635         if (zfs_arc_p_min_shift)
 7636                 arc_p_min_shift = zfs_arc_p_min_shift;
 7637 
 7638         /* Valid range: 1 - N ms */
 7639         if (zfs_arc_min_prefetch_ms)
 7640                 arc_min_prefetch_ms = zfs_arc_min_prefetch_ms;
 7641 
 7642         /* Valid range: 1 - N ms */
 7643         if (zfs_arc_min_prescient_prefetch_ms) {
 7644                 arc_min_prescient_prefetch_ms =
 7645                     zfs_arc_min_prescient_prefetch_ms;
 7646         }
 7647 
 7648         /* Valid range: 0 - 100 */
 7649         if (zfs_arc_lotsfree_percent <= 100)
 7650                 arc_lotsfree_percent = zfs_arc_lotsfree_percent;
 7651         WARN_IF_TUNING_IGNORED(zfs_arc_lotsfree_percent, arc_lotsfree_percent,
 7652             verbose);
 7653 
 7654         /* Valid range: 0 - <all physical memory> */
 7655         if ((zfs_arc_sys_free) && (zfs_arc_sys_free != arc_sys_free))
 7656                 arc_sys_free = MIN(zfs_arc_sys_free, allmem);
 7657         WARN_IF_TUNING_IGNORED(zfs_arc_sys_free, arc_sys_free, verbose);
 7658 }
 7659 
 7660 static void
 7661 arc_state_multilist_init(multilist_t *ml,
 7662     multilist_sublist_index_func_t *index_func, int *maxcountp)
 7663 {
 7664         multilist_create(ml, sizeof (arc_buf_hdr_t),
 7665             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), index_func);
 7666         *maxcountp = MAX(*maxcountp, multilist_get_num_sublists(ml));
 7667 }
 7668 
 7669 static void
 7670 arc_state_init(void)
 7671 {
 7672         int num_sublists = 0;
 7673 
 7674         arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_METADATA],
 7675             arc_state_multilist_index_func, &num_sublists);
 7676         arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_DATA],
 7677             arc_state_multilist_index_func, &num_sublists);
 7678         arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],
 7679             arc_state_multilist_index_func, &num_sublists);
 7680         arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],
 7681             arc_state_multilist_index_func, &num_sublists);
 7682         arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_METADATA],
 7683             arc_state_multilist_index_func, &num_sublists);
 7684         arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_DATA],
 7685             arc_state_multilist_index_func, &num_sublists);
 7686         arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],
 7687             arc_state_multilist_index_func, &num_sublists);
 7688         arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],
 7689             arc_state_multilist_index_func, &num_sublists);
 7690         arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_METADATA],
 7691             arc_state_multilist_index_func, &num_sublists);
 7692         arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_DATA],
 7693             arc_state_multilist_index_func, &num_sublists);
 7694 
 7695         /*
 7696          * L2 headers should never be on the L2 state list since they don't
 7697          * have L1 headers allocated.  Special index function asserts that.
 7698          */
 7699         arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],
 7700             arc_state_l2c_multilist_index_func, &num_sublists);
 7701         arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],
 7702             arc_state_l2c_multilist_index_func, &num_sublists);
 7703 
 7704         /*
 7705          * Keep track of the number of markers needed to reclaim buffers from
 7706          * any ARC state.  The markers will be pre-allocated so as to minimize
 7707          * the number of memory allocations performed by the eviction thread.
 7708          */
 7709         arc_state_evict_marker_count = num_sublists;
 7710 
 7711         zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
 7712         zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
 7713         zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
 7714         zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
 7715         zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
 7716         zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
 7717         zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
 7718         zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
 7719         zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
 7720         zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
 7721         zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
 7722         zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
 7723         zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]);
 7724         zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_DATA]);
 7725 
 7726         zfs_refcount_create(&arc_anon->arcs_size);
 7727         zfs_refcount_create(&arc_mru->arcs_size);
 7728         zfs_refcount_create(&arc_mru_ghost->arcs_size);
 7729         zfs_refcount_create(&arc_mfu->arcs_size);
 7730         zfs_refcount_create(&arc_mfu_ghost->arcs_size);
 7731         zfs_refcount_create(&arc_l2c_only->arcs_size);
 7732         zfs_refcount_create(&arc_uncached->arcs_size);
 7733 
 7734         wmsum_init(&arc_sums.arcstat_hits, 0);
 7735         wmsum_init(&arc_sums.arcstat_iohits, 0);
 7736         wmsum_init(&arc_sums.arcstat_misses, 0);
 7737         wmsum_init(&arc_sums.arcstat_demand_data_hits, 0);
 7738         wmsum_init(&arc_sums.arcstat_demand_data_iohits, 0);
 7739         wmsum_init(&arc_sums.arcstat_demand_data_misses, 0);
 7740         wmsum_init(&arc_sums.arcstat_demand_metadata_hits, 0);
 7741         wmsum_init(&arc_sums.arcstat_demand_metadata_iohits, 0);
 7742         wmsum_init(&arc_sums.arcstat_demand_metadata_misses, 0);
 7743         wmsum_init(&arc_sums.arcstat_prefetch_data_hits, 0);
 7744         wmsum_init(&arc_sums.arcstat_prefetch_data_iohits, 0);
 7745         wmsum_init(&arc_sums.arcstat_prefetch_data_misses, 0);
 7746         wmsum_init(&arc_sums.arcstat_prefetch_metadata_hits, 0);
 7747         wmsum_init(&arc_sums.arcstat_prefetch_metadata_iohits, 0);
 7748         wmsum_init(&arc_sums.arcstat_prefetch_metadata_misses, 0);
 7749         wmsum_init(&arc_sums.arcstat_mru_hits, 0);
 7750         wmsum_init(&arc_sums.arcstat_mru_ghost_hits, 0);
 7751         wmsum_init(&arc_sums.arcstat_mfu_hits, 0);
 7752         wmsum_init(&arc_sums.arcstat_mfu_ghost_hits, 0);
 7753         wmsum_init(&arc_sums.arcstat_uncached_hits, 0);
 7754         wmsum_init(&arc_sums.arcstat_deleted, 0);
 7755         wmsum_init(&arc_sums.arcstat_mutex_miss, 0);
 7756         wmsum_init(&arc_sums.arcstat_access_skip, 0);
 7757         wmsum_init(&arc_sums.arcstat_evict_skip, 0);
 7758         wmsum_init(&arc_sums.arcstat_evict_not_enough, 0);
 7759         wmsum_init(&arc_sums.arcstat_evict_l2_cached, 0);
 7760         wmsum_init(&arc_sums.arcstat_evict_l2_eligible, 0);
 7761         wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mfu, 0);
 7762         wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mru, 0);
 7763         wmsum_init(&arc_sums.arcstat_evict_l2_ineligible, 0);
 7764         wmsum_init(&arc_sums.arcstat_evict_l2_skip, 0);
 7765         wmsum_init(&arc_sums.arcstat_hash_collisions, 0);
 7766         wmsum_init(&arc_sums.arcstat_hash_chains, 0);
 7767         aggsum_init(&arc_sums.arcstat_size, 0);
 7768         wmsum_init(&arc_sums.arcstat_compressed_size, 0);
 7769         wmsum_init(&arc_sums.arcstat_uncompressed_size, 0);
 7770         wmsum_init(&arc_sums.arcstat_overhead_size, 0);
 7771         wmsum_init(&arc_sums.arcstat_hdr_size, 0);
 7772         wmsum_init(&arc_sums.arcstat_data_size, 0);
 7773         wmsum_init(&arc_sums.arcstat_metadata_size, 0);
 7774         wmsum_init(&arc_sums.arcstat_dbuf_size, 0);
 7775         aggsum_init(&arc_sums.arcstat_dnode_size, 0);
 7776         wmsum_init(&arc_sums.arcstat_bonus_size, 0);
 7777         wmsum_init(&arc_sums.arcstat_l2_hits, 0);
 7778         wmsum_init(&arc_sums.arcstat_l2_misses, 0);
 7779         wmsum_init(&arc_sums.arcstat_l2_prefetch_asize, 0);
 7780         wmsum_init(&arc_sums.arcstat_l2_mru_asize, 0);
 7781         wmsum_init(&arc_sums.arcstat_l2_mfu_asize, 0);
 7782         wmsum_init(&arc_sums.arcstat_l2_bufc_data_asize, 0);
 7783         wmsum_init(&arc_sums.arcstat_l2_bufc_metadata_asize, 0);
 7784         wmsum_init(&arc_sums.arcstat_l2_feeds, 0);
 7785         wmsum_init(&arc_sums.arcstat_l2_rw_clash, 0);
 7786         wmsum_init(&arc_sums.arcstat_l2_read_bytes, 0);
 7787         wmsum_init(&arc_sums.arcstat_l2_write_bytes, 0);
 7788         wmsum_init(&arc_sums.arcstat_l2_writes_sent, 0);
 7789         wmsum_init(&arc_sums.arcstat_l2_writes_done, 0);
 7790         wmsum_init(&arc_sums.arcstat_l2_writes_error, 0);
 7791         wmsum_init(&arc_sums.arcstat_l2_writes_lock_retry, 0);
 7792         wmsum_init(&arc_sums.arcstat_l2_evict_lock_retry, 0);
 7793         wmsum_init(&arc_sums.arcstat_l2_evict_reading, 0);
 7794         wmsum_init(&arc_sums.arcstat_l2_evict_l1cached, 0);
 7795         wmsum_init(&arc_sums.arcstat_l2_free_on_write, 0);
 7796         wmsum_init(&arc_sums.arcstat_l2_abort_lowmem, 0);
 7797         wmsum_init(&arc_sums.arcstat_l2_cksum_bad, 0);
 7798         wmsum_init(&arc_sums.arcstat_l2_io_error, 0);
 7799         wmsum_init(&arc_sums.arcstat_l2_lsize, 0);
 7800         wmsum_init(&arc_sums.arcstat_l2_psize, 0);
 7801         aggsum_init(&arc_sums.arcstat_l2_hdr_size, 0);
 7802         wmsum_init(&arc_sums.arcstat_l2_log_blk_writes, 0);
 7803         wmsum_init(&arc_sums.arcstat_l2_log_blk_asize, 0);
 7804         wmsum_init(&arc_sums.arcstat_l2_log_blk_count, 0);
 7805         wmsum_init(&arc_sums.arcstat_l2_rebuild_success, 0);
 7806         wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_unsupported, 0);
 7807         wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_io_errors, 0);
 7808         wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_dh_errors, 0);
 7809         wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors, 0);
 7810         wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_lowmem, 0);
 7811         wmsum_init(&arc_sums.arcstat_l2_rebuild_size, 0);
 7812         wmsum_init(&arc_sums.arcstat_l2_rebuild_asize, 0);
 7813         wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs, 0);
 7814         wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs_precached, 0);
 7815         wmsum_init(&arc_sums.arcstat_l2_rebuild_log_blks, 0);
 7816         wmsum_init(&arc_sums.arcstat_memory_throttle_count, 0);
 7817         wmsum_init(&arc_sums.arcstat_memory_direct_count, 0);
 7818         wmsum_init(&arc_sums.arcstat_memory_indirect_count, 0);
 7819         wmsum_init(&arc_sums.arcstat_prune, 0);
 7820         aggsum_init(&arc_sums.arcstat_meta_used, 0);
 7821         wmsum_init(&arc_sums.arcstat_async_upgrade_sync, 0);
 7822         wmsum_init(&arc_sums.arcstat_predictive_prefetch, 0);
 7823         wmsum_init(&arc_sums.arcstat_demand_hit_predictive_prefetch, 0);
 7824         wmsum_init(&arc_sums.arcstat_demand_iohit_predictive_prefetch, 0);
 7825         wmsum_init(&arc_sums.arcstat_prescient_prefetch, 0);
 7826         wmsum_init(&arc_sums.arcstat_demand_hit_prescient_prefetch, 0);
 7827         wmsum_init(&arc_sums.arcstat_demand_iohit_prescient_prefetch, 0);
 7828         wmsum_init(&arc_sums.arcstat_raw_size, 0);
 7829         wmsum_init(&arc_sums.arcstat_cached_only_in_progress, 0);
 7830         wmsum_init(&arc_sums.arcstat_abd_chunk_waste_size, 0);
 7831 
 7832         arc_anon->arcs_state = ARC_STATE_ANON;
 7833         arc_mru->arcs_state = ARC_STATE_MRU;
 7834         arc_mru_ghost->arcs_state = ARC_STATE_MRU_GHOST;
 7835         arc_mfu->arcs_state = ARC_STATE_MFU;
 7836         arc_mfu_ghost->arcs_state = ARC_STATE_MFU_GHOST;
 7837         arc_l2c_only->arcs_state = ARC_STATE_L2C_ONLY;
 7838         arc_uncached->arcs_state = ARC_STATE_UNCACHED;
 7839 }
 7840 
 7841 static void
 7842 arc_state_fini(void)
 7843 {
 7844         zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
 7845         zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
 7846         zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
 7847         zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
 7848         zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
 7849         zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
 7850         zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
 7851         zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
 7852         zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
 7853         zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
 7854         zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
 7855         zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
 7856         zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]);
 7857         zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_DATA]);
 7858 
 7859         zfs_refcount_destroy(&arc_anon->arcs_size);
 7860         zfs_refcount_destroy(&arc_mru->arcs_size);
 7861         zfs_refcount_destroy(&arc_mru_ghost->arcs_size);
 7862         zfs_refcount_destroy(&arc_mfu->arcs_size);
 7863         zfs_refcount_destroy(&arc_mfu_ghost->arcs_size);
 7864         zfs_refcount_destroy(&arc_l2c_only->arcs_size);
 7865         zfs_refcount_destroy(&arc_uncached->arcs_size);
 7866 
 7867         multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]);
 7868         multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
 7869         multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]);
 7870         multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
 7871         multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]);
 7872         multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
 7873         multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]);
 7874         multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
 7875         multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA]);
 7876         multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_DATA]);
 7877         multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_METADATA]);
 7878         multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_DATA]);
 7879 
 7880         wmsum_fini(&arc_sums.arcstat_hits);
 7881         wmsum_fini(&arc_sums.arcstat_iohits);
 7882         wmsum_fini(&arc_sums.arcstat_misses);
 7883         wmsum_fini(&arc_sums.arcstat_demand_data_hits);
 7884         wmsum_fini(&arc_sums.arcstat_demand_data_iohits);
 7885         wmsum_fini(&arc_sums.arcstat_demand_data_misses);
 7886         wmsum_fini(&arc_sums.arcstat_demand_metadata_hits);
 7887         wmsum_fini(&arc_sums.arcstat_demand_metadata_iohits);
 7888         wmsum_fini(&arc_sums.arcstat_demand_metadata_misses);
 7889         wmsum_fini(&arc_sums.arcstat_prefetch_data_hits);
 7890         wmsum_fini(&arc_sums.arcstat_prefetch_data_iohits);
 7891         wmsum_fini(&arc_sums.arcstat_prefetch_data_misses);
 7892         wmsum_fini(&arc_sums.arcstat_prefetch_metadata_hits);
 7893         wmsum_fini(&arc_sums.arcstat_prefetch_metadata_iohits);
 7894         wmsum_fini(&arc_sums.arcstat_prefetch_metadata_misses);
 7895         wmsum_fini(&arc_sums.arcstat_mru_hits);
 7896         wmsum_fini(&arc_sums.arcstat_mru_ghost_hits);
 7897         wmsum_fini(&arc_sums.arcstat_mfu_hits);
 7898         wmsum_fini(&arc_sums.arcstat_mfu_ghost_hits);
 7899         wmsum_fini(&arc_sums.arcstat_uncached_hits);
 7900         wmsum_fini(&arc_sums.arcstat_deleted);
 7901         wmsum_fini(&arc_sums.arcstat_mutex_miss);
 7902         wmsum_fini(&arc_sums.arcstat_access_skip);
 7903         wmsum_fini(&arc_sums.arcstat_evict_skip);
 7904         wmsum_fini(&arc_sums.arcstat_evict_not_enough);
 7905         wmsum_fini(&arc_sums.arcstat_evict_l2_cached);
 7906         wmsum_fini(&arc_sums.arcstat_evict_l2_eligible);
 7907         wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mfu);
 7908         wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mru);
 7909         wmsum_fini(&arc_sums.arcstat_evict_l2_ineligible);
 7910         wmsum_fini(&arc_sums.arcstat_evict_l2_skip);
 7911         wmsum_fini(&arc_sums.arcstat_hash_collisions);
 7912         wmsum_fini(&arc_sums.arcstat_hash_chains);
 7913         aggsum_fini(&arc_sums.arcstat_size);
 7914         wmsum_fini(&arc_sums.arcstat_compressed_size);
 7915         wmsum_fini(&arc_sums.arcstat_uncompressed_size);
 7916         wmsum_fini(&arc_sums.arcstat_overhead_size);
 7917         wmsum_fini(&arc_sums.arcstat_hdr_size);
 7918         wmsum_fini(&arc_sums.arcstat_data_size);
 7919         wmsum_fini(&arc_sums.arcstat_metadata_size);
 7920         wmsum_fini(&arc_sums.arcstat_dbuf_size);
 7921         aggsum_fini(&arc_sums.arcstat_dnode_size);
 7922         wmsum_fini(&arc_sums.arcstat_bonus_size);
 7923         wmsum_fini(&arc_sums.arcstat_l2_hits);
 7924         wmsum_fini(&arc_sums.arcstat_l2_misses);
 7925         wmsum_fini(&arc_sums.arcstat_l2_prefetch_asize);
 7926         wmsum_fini(&arc_sums.arcstat_l2_mru_asize);
 7927         wmsum_fini(&arc_sums.arcstat_l2_mfu_asize);
 7928         wmsum_fini(&arc_sums.arcstat_l2_bufc_data_asize);
 7929         wmsum_fini(&arc_sums.arcstat_l2_bufc_metadata_asize);
 7930         wmsum_fini(&arc_sums.arcstat_l2_feeds);
 7931         wmsum_fini(&arc_sums.arcstat_l2_rw_clash);
 7932         wmsum_fini(&arc_sums.arcstat_l2_read_bytes);
 7933         wmsum_fini(&arc_sums.arcstat_l2_write_bytes);
 7934         wmsum_fini(&arc_sums.arcstat_l2_writes_sent);
 7935         wmsum_fini(&arc_sums.arcstat_l2_writes_done);
 7936         wmsum_fini(&arc_sums.arcstat_l2_writes_error);
 7937         wmsum_fini(&arc_sums.arcstat_l2_writes_lock_retry);
 7938         wmsum_fini(&arc_sums.arcstat_l2_evict_lock_retry);
 7939         wmsum_fini(&arc_sums.arcstat_l2_evict_reading);
 7940         wmsum_fini(&arc_sums.arcstat_l2_evict_l1cached);
 7941         wmsum_fini(&arc_sums.arcstat_l2_free_on_write);
 7942         wmsum_fini(&arc_sums.arcstat_l2_abort_lowmem);
 7943         wmsum_fini(&arc_sums.arcstat_l2_cksum_bad);
 7944         wmsum_fini(&arc_sums.arcstat_l2_io_error);
 7945         wmsum_fini(&arc_sums.arcstat_l2_lsize);
 7946         wmsum_fini(&arc_sums.arcstat_l2_psize);
 7947         aggsum_fini(&arc_sums.arcstat_l2_hdr_size);
 7948         wmsum_fini(&arc_sums.arcstat_l2_log_blk_writes);
 7949         wmsum_fini(&arc_sums.arcstat_l2_log_blk_asize);
 7950         wmsum_fini(&arc_sums.arcstat_l2_log_blk_count);
 7951         wmsum_fini(&arc_sums.arcstat_l2_rebuild_success);
 7952         wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_unsupported);
 7953         wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_io_errors);
 7954         wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_dh_errors);
 7955         wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors);
 7956         wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_lowmem);
 7957         wmsum_fini(&arc_sums.arcstat_l2_rebuild_size);
 7958         wmsum_fini(&arc_sums.arcstat_l2_rebuild_asize);
 7959         wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs);
 7960         wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs_precached);
 7961         wmsum_fini(&arc_sums.arcstat_l2_rebuild_log_blks);
 7962         wmsum_fini(&arc_sums.arcstat_memory_throttle_count);
 7963         wmsum_fini(&arc_sums.arcstat_memory_direct_count);
 7964         wmsum_fini(&arc_sums.arcstat_memory_indirect_count);
 7965         wmsum_fini(&arc_sums.arcstat_prune);
 7966         aggsum_fini(&arc_sums.arcstat_meta_used);
 7967         wmsum_fini(&arc_sums.arcstat_async_upgrade_sync);
 7968         wmsum_fini(&arc_sums.arcstat_predictive_prefetch);
 7969         wmsum_fini(&arc_sums.arcstat_demand_hit_predictive_prefetch);
 7970         wmsum_fini(&arc_sums.arcstat_demand_iohit_predictive_prefetch);
 7971         wmsum_fini(&arc_sums.arcstat_prescient_prefetch);
 7972         wmsum_fini(&arc_sums.arcstat_demand_hit_prescient_prefetch);
 7973         wmsum_fini(&arc_sums.arcstat_demand_iohit_prescient_prefetch);
 7974         wmsum_fini(&arc_sums.arcstat_raw_size);
 7975         wmsum_fini(&arc_sums.arcstat_cached_only_in_progress);
 7976         wmsum_fini(&arc_sums.arcstat_abd_chunk_waste_size);
 7977 }
 7978 
 7979 uint64_t
 7980 arc_target_bytes(void)
 7981 {
 7982         return (arc_c);
 7983 }
 7984 
 7985 void
 7986 arc_set_limits(uint64_t allmem)
 7987 {
 7988         /* Set min cache to 1/32 of all memory, or 32MB, whichever is more. */
 7989         arc_c_min = MAX(allmem / 32, 2ULL << SPA_MAXBLOCKSHIFT);
 7990 
 7991         /* How to set default max varies by platform. */
 7992         arc_c_max = arc_default_max(arc_c_min, allmem);
 7993 }
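
/*
 * Editor's note: worked example for arc_set_limits() above (illustrative
 * only).  Since 2ULL << SPA_MAXBLOCKSHIFT is the 32MB floor mentioned in
 * the comment, a machine with 16 GiB of memory gets
 * arc_c_min = MAX(16 GiB / 32, 32 MiB) = 512 MiB, while a machine with
 * 512 MiB of memory keeps the 32 MiB floor.  arc_c_max then depends on the
 * platform-specific arc_default_max().
 */
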
 7994 void
 7995 arc_init(void)
 7996 {
 7997         uint64_t percent, allmem = arc_all_memory();
 7998         mutex_init(&arc_evict_lock, NULL, MUTEX_DEFAULT, NULL);
 7999         list_create(&arc_evict_waiters, sizeof (arc_evict_waiter_t),
 8000             offsetof(arc_evict_waiter_t, aew_node));
 8001 
 8002         arc_min_prefetch_ms = 1000;
 8003         arc_min_prescient_prefetch_ms = 6000;
 8004 
 8005 #if defined(_KERNEL)
 8006         arc_lowmem_init();
 8007 #endif
 8008 
 8009         arc_set_limits(allmem);
 8010 
 8011 #ifdef _KERNEL
 8012         /*
 8013          * If zfs_arc_max is non-zero at init, meaning it was set in the kernel
 8014          * environment before the module was loaded, don't block setting the
 8015          * maximum just because it is less than arc_c_min; instead, reset
 8016          * arc_c_min to a lower value.
 8017          * zfs_arc_min will be handled by arc_tuning_update().
 8018          */
 8019         if (zfs_arc_max != 0 && zfs_arc_max >= MIN_ARC_MAX &&
 8020             zfs_arc_max < allmem) {
 8021                 arc_c_max = zfs_arc_max;
 8022                 if (arc_c_min >= arc_c_max) {
 8023                         arc_c_min = MAX(zfs_arc_max / 2,
 8024                             2ULL << SPA_MAXBLOCKSHIFT);
 8025                 }
 8026         }
 8027 #else
 8028         /*
 8029          * In userland, there's only the memory pressure that we artificially
 8030          * create (see arc_available_memory()).  Don't let arc_c get too
 8031          * small, because it can cause transactions to be larger than
 8032          * arc_c, causing arc_tempreserve_space() to fail.
 8033          */
 8034         arc_c_min = MAX(arc_c_max / 2, 2ULL << SPA_MAXBLOCKSHIFT);
 8035 #endif
 8036 
 8037         arc_c = arc_c_min;
 8038         arc_p = (arc_c >> 1);
 8039 
 8040         /* Set arc_meta_min to 1/2 of the minimum arc_c_min (i.e. 16M). */
 8041         arc_meta_min = 1ULL << SPA_MAXBLOCKSHIFT;
 8042         /*
 8043          * Set arc_meta_limit to a percent of arc_c_max with a floor of
 8044          * arc_meta_min, and a ceiling of arc_c_max.
 8045          */
 8046         percent = MIN(zfs_arc_meta_limit_percent, 100);
 8047         arc_meta_limit = MAX(arc_meta_min, (percent * arc_c_max) / 100);
 8048         percent = MIN(zfs_arc_dnode_limit_percent, 100);
 8049         arc_dnode_size_limit = (percent * arc_meta_limit) / 100;
 8050 
 8051         /* Apply user specified tunings */
 8052         arc_tuning_update(B_TRUE);
 8053 
 8054         /* If kmem_flags are set, let's try to use less memory. */
 8055         if (kmem_debugging())
 8056                 arc_c = arc_c / 2;
 8057         if (arc_c < arc_c_min)
 8058                 arc_c = arc_c_min;
 8059 
 8060         arc_register_hotplug();
 8061 
 8062         arc_state_init();
 8063 
 8064         buf_init();
 8065 
 8066         list_create(&arc_prune_list, sizeof (arc_prune_t),
 8067             offsetof(arc_prune_t, p_node));
 8068         mutex_init(&arc_prune_mtx, NULL, MUTEX_DEFAULT, NULL);
 8069 
 8070         arc_prune_taskq = taskq_create("arc_prune", zfs_arc_prune_task_threads,
 8071             defclsyspri, 100, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC);
 8072 
 8073         arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
 8074             sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
 8075 
 8076         if (arc_ksp != NULL) {
 8077                 arc_ksp->ks_data = &arc_stats;
 8078                 arc_ksp->ks_update = arc_kstat_update;
 8079                 kstat_install(arc_ksp);
 8080         }
 8081 
 8082         arc_state_evict_markers =
 8083             arc_state_alloc_markers(arc_state_evict_marker_count);
 8084         arc_evict_zthr = zthr_create_timer("arc_evict",
 8085             arc_evict_cb_check, arc_evict_cb, NULL, SEC2NSEC(1), defclsyspri);
 8086         arc_reap_zthr = zthr_create_timer("arc_reap",
 8087             arc_reap_cb_check, arc_reap_cb, NULL, SEC2NSEC(1), minclsyspri);
 8088 
 8089         arc_warm = B_FALSE;
 8090 
 8091         /*
 8092          * Calculate maximum amount of dirty data per pool.
 8093          *
 8094          * If it has been set by a module parameter, take that.
 8095          * Otherwise, use a percentage of physical memory defined by
 8096          * zfs_dirty_data_max_percent (default 10%) with a cap at
 8097          * zfs_dirty_data_max_max (default 4G or 25% of physical memory).
 8098          */
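        /*
         * Editor's note: worked example (illustrative; assumes the default
         * percentages named above, zfs_dirty_data_max_percent = 10 and
         * zfs_dirty_data_max_max_percent = 25).  On a 64-bit machine with
         * 16 GiB of memory: zfs_dirty_data_max_max =
         * MIN(4 GiB, 16 GiB * 25 / 100) = 4 GiB, and zfs_dirty_data_max =
         * MIN(16 GiB * 10 / 100, 4 GiB) ~= 1.6 GiB.
         */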
 8099 #ifdef __LP64__
 8100         if (zfs_dirty_data_max_max == 0)
 8101                 zfs_dirty_data_max_max = MIN(4ULL * 1024 * 1024 * 1024,
 8102                     allmem * zfs_dirty_data_max_max_percent / 100);
 8103 #else
 8104         if (zfs_dirty_data_max_max == 0)
 8105                 zfs_dirty_data_max_max = MIN(1ULL * 1024 * 1024 * 1024,
 8106                     allmem * zfs_dirty_data_max_max_percent / 100);
 8107 #endif
 8108 
 8109         if (zfs_dirty_data_max == 0) {
 8110                 zfs_dirty_data_max = allmem *
 8111                     zfs_dirty_data_max_percent / 100;
 8112                 zfs_dirty_data_max = MIN(zfs_dirty_data_max,
 8113                     zfs_dirty_data_max_max);
 8114         }
 8115 
 8116         if (zfs_wrlog_data_max == 0) {
 8117 
 8118                 /*
 8119                  * dp_wrlog_total is reduced for each txg at the end of
 8120                  * spa_sync(). However, dp_dirty_total is reduced every time
 8121                  * a block is written out. Thus under normal operation,
 8122          * dp_wrlog_total could grow up to twice as large as
 8123          * zfs_dirty_data_max.
 8124                  */
 8125                 zfs_wrlog_data_max = zfs_dirty_data_max * 2;
 8126         }
 8127 }
 8128 
 8129 void
 8130 arc_fini(void)
 8131 {
 8132         arc_prune_t *p;
 8133 
 8134 #ifdef _KERNEL
 8135         arc_lowmem_fini();
 8136 #endif /* _KERNEL */
 8137 
 8138         /* Use B_TRUE to ensure *all* buffers are evicted */
 8139         arc_flush(NULL, B_TRUE);
 8140 
 8141         if (arc_ksp != NULL) {
 8142                 kstat_delete(arc_ksp);
 8143                 arc_ksp = NULL;
 8144         }
 8145 
 8146         taskq_wait(arc_prune_taskq);
 8147         taskq_destroy(arc_prune_taskq);
 8148 
 8149         mutex_enter(&arc_prune_mtx);
 8150         while ((p = list_head(&arc_prune_list)) != NULL) {
 8151                 list_remove(&arc_prune_list, p);
 8152                 zfs_refcount_remove(&p->p_refcnt, &arc_prune_list);
 8153                 zfs_refcount_destroy(&p->p_refcnt);
 8154                 kmem_free(p, sizeof (*p));
 8155         }
 8156         mutex_exit(&arc_prune_mtx);
 8157 
 8158         list_destroy(&arc_prune_list);
 8159         mutex_destroy(&arc_prune_mtx);
 8160 
 8161         (void) zthr_cancel(arc_evict_zthr);
 8162         (void) zthr_cancel(arc_reap_zthr);
 8163         arc_state_free_markers(arc_state_evict_markers,
 8164             arc_state_evict_marker_count);
 8165 
 8166         mutex_destroy(&arc_evict_lock);
 8167         list_destroy(&arc_evict_waiters);
 8168 
 8169         /*
 8170          * Free any buffers that were tagged for destruction.  This needs
 8171          * to occur before arc_state_fini() runs and destroys the aggsum
 8172          * values which are updated when freeing scatter ABDs.
 8173          */
 8174         l2arc_do_free_on_write();
 8175 
 8176         /*
 8177          * buf_fini() must precede arc_state_fini() because buf_fini() may
 8178          * trigger the release of kmem magazines, which can call back into
 8179          * arc_space_return(), which accesses aggsums freed in arc_state_fini().
 8180          */
 8181         buf_fini();
 8182         arc_state_fini();
 8183 
 8184         arc_unregister_hotplug();
 8185 
 8186         /*
 8187          * We destroy the zthrs after all the ARC state has been
 8188          * torn down to avoid the case of them receiving any
 8189          * wakeup() signals after they are destroyed.
 8190          */
 8191         zthr_destroy(arc_evict_zthr);
 8192         zthr_destroy(arc_reap_zthr);
 8193 
 8194         ASSERT0(arc_loaned_bytes);
 8195 }
 8196 
 8197 /*
 8198  * Level 2 ARC
 8199  *
 8200  * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
 8201  * It uses dedicated storage devices to hold cached data, which are populated
 8202  * using large infrequent writes.  The main role of this cache is to boost
 8203  * the performance of random read workloads.  The intended L2ARC devices
 8204  * include short-stroked disks, solid state disks, and other media with
 8205  * substantially faster read latency than disk.
 8206  *
 8207  *                 +-----------------------+
 8208  *                 |         ARC           |
 8209  *                 +-----------------------+
 8210  *                    |         ^     ^
 8211  *                    |         |     |
 8212  *      l2arc_feed_thread()    arc_read()
 8213  *                    |         |     |
 8214  *                    |  l2arc read   |
 8215  *                    V         |     |
 8216  *               +---------------+    |
 8217  *               |     L2ARC     |    |
 8218  *               +---------------+    |
 8219  *                   |    ^           |
 8220  *          l2arc_write() |           |
 8221  *                   |    |           |
 8222  *                   V    |           |
 8223  *                 +-------+      +-------+
 8224  *                 | vdev  |      | vdev  |
 8225  *                 | cache |      | cache |
 8226  *                 +-------+      +-------+
 8227  *                 +=========+     .-----.
 8228  *                 :  L2ARC  :    |-_____-|
 8229  *                 : devices :    | Disks |
 8230  *                 +=========+    `-_____-'
 8231  *
 8232  * Read requests are satisfied from the following sources, in order:
 8233  *
 8234  *      1) ARC
 8235  *      2) vdev cache of L2ARC devices
 8236  *      3) L2ARC devices
 8237  *      4) vdev cache of disks
 8238  *      5) disks
 8239  *
 8240  * Some L2ARC device types exhibit extremely slow write performance.
 8241  * To accommodate this, there are some significant differences between
 8242  * the L2ARC and traditional cache design:
 8243  *
 8244  * 1. There is no eviction path from the ARC to the L2ARC.  Evictions from
 8245  * the ARC behave as usual, freeing buffers and placing headers on ghost
 8246  * lists.  The ARC does not send buffers to the L2ARC during eviction as
 8247  * this would add inflated write latencies for all ARC memory pressure.
 8248  *
 8249  * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
 8250  * It does this by periodically scanning buffers from the eviction-end of
 8251  * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
 8252  * not already there. It scans until a headroom of buffers is satisfied,
 8253  * which itself serves as a cushion ahead of ARC eviction. If a compressible buffer is
 8254  * found during scanning and selected for writing to an L2ARC device, we
 8255  * temporarily boost scanning headroom during the next scan cycle to make
 8256  * sure we adapt to compression effects (which might significantly reduce
 8257  * the data volume we write to L2ARC). The thread that does this is
 8258  * l2arc_feed_thread(), illustrated below; example sizes are included to
 8259  * provide a better sense of ratio than this diagram:
 8260  *
 8261  *             head -->                        tail
 8262  *              +---------------------+----------+
 8263  *      ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
 8264  *              +---------------------+----------+   |   o L2ARC eligible
 8265  *      ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
 8266  *              +---------------------+----------+   |
 8267  *                   15.9 Gbytes      ^ 32 Mbytes    |
 8268  *                                 headroom          |
 8269  *                                            l2arc_feed_thread()
 8270  *                                                   |
 8271  *                       l2arc write hand <--[oooo]--'
 8272  *                               |           8 Mbyte
 8273  *                               |          write max
 8274  *                               V
 8275  *                +==============================+
 8276  *      L2ARC dev |####|#|###|###|    |####| ... |
 8277  *                +==============================+
 8278  *                           32 Gbytes
 8279  *
 8280  * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
 8281  * evicted, then the L2ARC has cached a buffer much sooner than it probably
 8282  * needed to, potentially wasting L2ARC device bandwidth and storage.  It is
 8283  * safe to say that this is an uncommon case, since buffers at the end of
 8284  * the ARC lists have moved there due to inactivity.
 8285  *
 8286  * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
 8287  * then the L2ARC simply misses copying some buffers.  This serves as a
 8288  * pressure valve to prevent heavy read workloads from both stalling the ARC
 8289  * with waits and clogging the L2ARC with writes.  This also helps prevent
 8290  * the potential for the L2ARC to churn if it attempts to cache content too
 8291  * quickly, such as during backups of the entire pool.
 8292  *
 8293  * 5. After system boot and before the ARC has filled main memory, there are
 8294  * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
 8295  * lists can remain mostly static.  Instead of searching from the tail of
 8296  * these lists as pictured, l2arc_feed_thread() will search from the list heads
 8297  * for eligible buffers, greatly increasing its chance of finding them.
 8298  *
 8299  * The L2ARC device write speed is also boosted during this time so that
 8300  * the L2ARC warms up faster.  Since there have been no ARC evictions yet,
 8301  * there are no L2ARC reads, and no fear of degrading read performance
 8302  * through increased writes.
 8303  *
 8304  * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
 8305  * the vdev queue can aggregate them into larger and fewer writes.  Each
 8306  * device is written to in a rotor fashion, sweeping writes through
 8307  * available space then repeating.
 8308  *
 8309  * 7. The L2ARC does not store dirty content.  It never needs to flush
 8310  * write buffers back to disk based storage.
 8311  *
 8312  * 8. If an ARC buffer is written (and dirtied) which also exists in the
 8313  * L2ARC, the now stale L2ARC buffer is immediately dropped.
 8314  *
 8315  * The performance of the L2ARC can be tweaked by a number of tunables, which
 8316  * may be necessary for different workloads:
 8317  *
 8318  *      l2arc_write_max         max write bytes per interval
 8319  *      l2arc_write_boost       extra write bytes during device warmup
 8320  *      l2arc_noprefetch        skip caching prefetched buffers
 8321  *      l2arc_headroom          number of max device writes to precache
 8322  *      l2arc_headroom_boost    when we find compressed buffers during ARC
 8323  *                              scanning, we multiply headroom by this
 8324  *                              percentage factor for the next scan cycle,
 8325  *                              since more compressed buffers are likely to
 8326  *                              be present
 8327  *      l2arc_feed_secs         seconds between L2ARC writing
 8328  *
 8329  * Tunables may be removed or added as future performance improvements are
 8330  * integrated, and also may become zpool properties.
 8331  *
 8332  * There are three key functions that control how the L2ARC warms up:
 8333  *
 8334  *      l2arc_write_eligible()  check if a buffer is eligible to cache
 8335  *      l2arc_write_size()      calculate how much to write
 8336  *      l2arc_write_interval()  calculate sleep delay between writes
 8337  *
 8338  * These three functions determine what to write, how much, and how quickly
 8339  * to send writes.
 8340  *
 8341  * L2ARC persistence:
 8342  *
 8343  * When writing buffers to L2ARC, we periodically add some metadata to
 8344  * make sure we can pick them up after reboot, thus dramatically reducing
 8345  * the impact that any downtime has on the performance of storage systems
 8346  * with large caches.
 8347  *
 8348  * The implementation works fairly simply by integrating the following two
 8349  * modifications:
 8350  *
 8351  * *) When writing to the L2ARC, we occasionally write an "l2arc log block",
 8352  *    which is an additional piece of metadata which describes what's been
 8353  *    written. This allows us to rebuild the arc_buf_hdr_t structures of the
 8354  *    main ARC buffers. There are 2 linked-lists of log blocks headed by
 8355  *    dh_start_lbps[2]. We alternate which chain we append to, so they are
 8356  *    time-wise and offset-wise interleaved, but that is an optimization rather
 8357  *    than for correctness. The log block also includes a pointer to the
 8358  *    previous block in its chain.
 8359  *
 8360  * *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device
 8361  *    for our header bookkeeping purposes. This contains a device header,
 8362  *    which contains our top-level reference structures. We update it each
 8363  *    time we write a new log block, so that we're able to locate it in the
 8364  *    L2ARC device. If this write results in an inconsistent device header
 8365  *    (e.g. due to power failure), we detect this by verifying the header's
 8366  *    checksum and simply fail to reconstruct the L2ARC after reboot.
 8367  *
 8368  * Implementation diagram:
 8369  *
 8370  * +=== L2ARC device (not to scale) ======================================+
 8371  * |       ___two newest log block pointers__.__________                  |
 8372  * |      /                                   \dh_start_lbps[1]           |
 8373  * |     /                                     \         \dh_start_lbps[0]|
 8374  * |.___/__.                                    V         V               |
 8375  * ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---|
 8376  * ||   hdr|      ^         /^       /^        /         /                |
 8377  * |+------+  ...--\-------/  \-----/--\------/         /                 |
 8378  * |                \--------------/    \--------------/                  |
 8379  * +======================================================================+
 8380  *
 8381  * As can be seen in the diagram, rather than using a simple linked list,
 8382  * we use a pair of linked lists with alternating elements.  This is a
 8383  * performance enhancement: with a single linked list, we would only find
 8384  * out the address of the next log block once the current block had been
 8385  * completely read in, which would keep the device's I/O queue at only
 8386  * one operation deep and incur a large amount of I/O round-trip latency.
 8387  * Having two lists, each holding every other log block, allows us to
 8388  * fetch two log blocks ahead of where we are currently rebuilding the
 8389  * L2ARC buffers.
 8390  *
 8391  * On-device data structures:
 8392  *
 8393  * L2ARC device header: l2arc_dev_hdr_phys_t
 8394  * L2ARC log block:     l2arc_log_blk_phys_t
 8395  *
 8396  * L2ARC reconstruction:
 8397  *
 8398  * When writing data, we simply write in the standard rotary fashion,
 8399  * evicting buffers as we go and writing new data over them (writing
 8400  * a new log block every now and then). This obviously means that once we
 8401  * loop around the end of the device, we will start cutting into an already
 8402  * committed log block (and its referenced data buffers), like so:
 8403  *
 8404  *    current write head__       __old tail
 8405  *                        \     /
 8406  *                        V    V
 8407  * <--|bufs |lb |bufs |lb |    |bufs |lb |bufs |lb |-->
 8408  *                         ^    ^^^^^^^^^___________________________________
 8409  *                         |                                                \
 8410  *                   <<nextwrite>> may overwrite this blk and/or its bufs --'
 8411  *
 8412  * When importing the pool, we detect this situation and use it to stop
 8413  * our scanning process (see l2arc_rebuild).
 8414  *
 8415  * There is one significant caveat to consider when rebuilding ARC contents
 8416  * from an L2ARC device: what about invalidated buffers? Given the above
 8417  * construction, we cannot update blocks which we've already written to amend
 8418  * them to remove buffers which were invalidated. Thus, during reconstruction,
 8419  * we might be populating the cache with buffers for data that's not on the
 8420  * main pool anymore, or may have been overwritten!
 8421  *
 8422  * As it turns out, this isn't a problem. Every arc_read request includes
 8423  * both the DVA and, crucially, the birth TXG of the BP the caller is
 8424  * looking for. So even if the cache were populated by completely rotten
 8425  * blocks for data that had been long deleted and/or overwritten, we'll
 8426  * never actually return bad data from the cache, since the DVA together
 8427  * with the birth TXG uniquely identifies a block in space and time - once
 8428  * created, a block is immutable on disk. The worst we will have done is
 8429  * waste some time and memory during l2arc rebuild reconstructing outdated
 8430  * ARC entries that will get dropped from the l2arc as it is being updated
 8431  * with new blocks.
 8432  *
 8433  * L2ARC buffers that have been evicted by l2arc_evict() ahead of the write
 8434  * hand are not restored. This is done by saving the offset (in bytes)
 8435  * l2arc_evict() has evicted to in the L2ARC device header and taking it
 8436  * into account when restoring buffers.
 8437  */
 8438 
 8439 static boolean_t
 8440 l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
 8441 {
 8442         /*
 8443          * A buffer is *not* eligible for the L2ARC if it:
 8444          * 1. belongs to a different spa.
 8445          * 2. is already cached on the L2ARC.
 8446          * 3. has an I/O in progress (it may be an incomplete read).
 8447          * 4. is flagged not eligible (zfs property).
 8448          */
 8449         if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||
 8450             HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))
 8451                 return (B_FALSE);
 8452 
 8453         return (B_TRUE);
 8454 }
 8455 
 8456 static uint64_t
 8457 l2arc_write_size(l2arc_dev_t *dev)
 8458 {
 8459         uint64_t size, dev_size, tsize;
 8460 
 8461         /*
 8462          * Make sure our globals have meaningful values in case the user
 8463          * altered them.
 8464          */
 8465         size = l2arc_write_max;
 8466         if (size == 0) {
 8467                 cmn_err(CE_NOTE, "Bad value for l2arc_write_max, value must "
 8468                     "be greater than zero, resetting it to the default (%d)",
 8469                     L2ARC_WRITE_SIZE);
 8470                 size = l2arc_write_max = L2ARC_WRITE_SIZE;
 8471         }
 8472 
 8473         if (arc_warm == B_FALSE)
 8474                 size += l2arc_write_boost;
 8475 
 8476         /*
 8477          * Make sure the write size does not exceed the size of the cache
 8478          * device. This is important in l2arc_evict(); otherwise, infinite
 8479          * iteration can occur.
 8480          */
 8481         dev_size = dev->l2ad_end - dev->l2ad_start;
 8482         tsize = size + l2arc_log_blk_overhead(size, dev);
 8483         if (dev->l2ad_vdev->vdev_has_trim && l2arc_trim_ahead > 0)
 8484                 tsize += MAX(64 * 1024 * 1024,
 8485                     (tsize * l2arc_trim_ahead) / 100);
 8486 
 8487         if (tsize >= dev_size) {
 8488                 cmn_err(CE_NOTE, "l2arc_write_max or l2arc_write_boost "
 8489                     "plus the overhead of log blocks (persistent L2ARC, "
 8490                     "%llu bytes) exceeds the size of the cache device "
 8491                     "(guid %llu), resetting them to the default (%d)",
 8492                     (u_longlong_t)l2arc_log_blk_overhead(size, dev),
 8493                     (u_longlong_t)dev->l2ad_vdev->vdev_guid, L2ARC_WRITE_SIZE);
 8494                 size = l2arc_write_max = l2arc_write_boost = L2ARC_WRITE_SIZE;
 8495 
 8496                 if (arc_warm == B_FALSE)
 8497                         size += l2arc_write_boost;
 8498         }
 8499 
 8500         return (size);
 8501 
 8502 }
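/*
 * Worked example of the check above, with illustrative (not necessarily
 * default) values: if l2arc_write_max is 8MB, arc_warm is B_FALSE and
 * l2arc_write_boost is 8MB, then size is 16MB.  tsize is that 16MB plus the
 * log block overhead for such a write, and if the vdev supports TRIM with
 * l2arc_trim_ahead = 100, a further MAX(64MB, tsize) is added, which is 64MB
 * here.  A cache device of roughly 80MB or less would therefore trip the
 * reset to L2ARC_WRITE_SIZE.
 */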
 8503 
 8504 static clock_t
 8505 l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
 8506 {
 8507         clock_t interval, next, now;
 8508 
 8509         /*
 8510          * If the ARC lists are busy, increase our write rate; if the
 8511          * lists are stale, idle back.  This is achieved by checking
 8512          * how much we previously wrote - if it was more than half of
 8513          * what we wanted, schedule the next write much sooner.
 8514          */
 8515         if (l2arc_feed_again && wrote > (wanted / 2))
 8516                 interval = (hz * l2arc_feed_min_ms) / 1000;
 8517         else
 8518                 interval = hz * l2arc_feed_secs;
 8519 
 8520         now = ddi_get_lbolt();
 8521         next = MAX(now, MIN(now + interval, began + interval));
 8522 
 8523         return (next);
 8524 }
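/*
 * For example, assuming hz = 1000 and the common defaults of
 * l2arc_feed_secs = 1 and l2arc_feed_min_ms = 200: if the previous cycle
 * wrote more than half of what it wanted, the next feed is scheduled 200
 * ticks (200ms) after the cycle began, otherwise 1000 ticks (1s) after it
 * began.  The outer MAX(now, ...) keeps the result from landing in the past
 * when the write itself took longer than the chosen interval.
 */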
 8525 
 8526 /*
 8527  * Cycle through L2ARC devices.  This is how L2ARC load balances.
 8528  * If a device is returned, this also returns holding the spa config lock.
 8529  */
 8530 static l2arc_dev_t *
 8531 l2arc_dev_get_next(void)
 8532 {
 8533         l2arc_dev_t *first, *next = NULL;
 8534 
 8535         /*
 8536          * Lock out the removal of spas (spa_namespace_lock), then removal
 8537          * of cache devices (l2arc_dev_mtx).  Once a device has been selected,
 8538          * both locks will be dropped and a spa config lock held instead.
 8539          */
 8540         mutex_enter(&spa_namespace_lock);
 8541         mutex_enter(&l2arc_dev_mtx);
 8542 
 8543         /* if there are no vdevs, there is nothing to do */
 8544         if (l2arc_ndev == 0)
 8545                 goto out;
 8546 
 8547         first = NULL;
 8548         next = l2arc_dev_last;
 8549         do {
 8550                 /* loop around the list looking for a non-faulted vdev */
 8551                 if (next == NULL) {
 8552                         next = list_head(l2arc_dev_list);
 8553                 } else {
 8554                         next = list_next(l2arc_dev_list, next);
 8555                         if (next == NULL)
 8556                                 next = list_head(l2arc_dev_list);
 8557                 }
 8558 
 8559                 /* if we have come back to the start, bail out */
 8560                 if (first == NULL)
 8561                         first = next;
 8562                 else if (next == first)
 8563                         break;
 8564 
 8565                 ASSERT3P(next, !=, NULL);
 8566         } while (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild ||
 8567             next->l2ad_trim_all);
 8568 
 8569         /* if we were unable to find any usable vdevs, return NULL */
 8570         if (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild ||
 8571             next->l2ad_trim_all)
 8572                 next = NULL;
 8573 
 8574         l2arc_dev_last = next;
 8575 
 8576 out:
 8577         mutex_exit(&l2arc_dev_mtx);
 8578 
 8579         /*
 8580          * Grab the config lock to prevent the 'next' device from being
 8581          * removed while we are writing to it.
 8582          */
 8583         if (next != NULL)
 8584                 spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
 8585         mutex_exit(&spa_namespace_lock);
 8586 
 8587         return (next);
 8588 }
 8589 
 8590 /*
 8591  * Free buffers that were tagged for destruction.
 8592  */
 8593 static void
 8594 l2arc_do_free_on_write(void)
 8595 {
 8596         list_t *buflist;
 8597         l2arc_data_free_t *df, *df_prev;
 8598 
 8599         mutex_enter(&l2arc_free_on_write_mtx);
 8600         buflist = l2arc_free_on_write;
 8601 
 8602         for (df = list_tail(buflist); df; df = df_prev) {
 8603                 df_prev = list_prev(buflist, df);
 8604                 ASSERT3P(df->l2df_abd, !=, NULL);
 8605                 abd_free(df->l2df_abd);
 8606                 list_remove(buflist, df);
 8607                 kmem_free(df, sizeof (l2arc_data_free_t));
 8608         }
 8609 
 8610         mutex_exit(&l2arc_free_on_write_mtx);
 8611 }
 8612 
 8613 /*
 8614  * A write to a cache device has completed.  Update all headers to allow
 8615  * reads from these buffers to begin.
 8616  */
 8617 static void
 8618 l2arc_write_done(zio_t *zio)
 8619 {
 8620         l2arc_write_callback_t  *cb;
 8621         l2arc_lb_abd_buf_t      *abd_buf;
 8622         l2arc_lb_ptr_buf_t      *lb_ptr_buf;
 8623         l2arc_dev_t             *dev;
 8624         l2arc_dev_hdr_phys_t    *l2dhdr;
 8625         list_t                  *buflist;
 8626         arc_buf_hdr_t           *head, *hdr, *hdr_prev;
 8627         kmutex_t                *hash_lock;
 8628         int64_t                 bytes_dropped = 0;
 8629 
 8630         cb = zio->io_private;
 8631         ASSERT3P(cb, !=, NULL);
 8632         dev = cb->l2wcb_dev;
 8633         l2dhdr = dev->l2ad_dev_hdr;
 8634         ASSERT3P(dev, !=, NULL);
 8635         head = cb->l2wcb_head;
 8636         ASSERT3P(head, !=, NULL);
 8637         buflist = &dev->l2ad_buflist;
 8638         ASSERT3P(buflist, !=, NULL);
 8639         DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
 8640             l2arc_write_callback_t *, cb);
 8641 
 8642         /*
 8643          * All writes completed, or an error was hit.
 8644          */
 8645 top:
 8646         mutex_enter(&dev->l2ad_mtx);
 8647         for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) {
 8648                 hdr_prev = list_prev(buflist, hdr);
 8649 
 8650                 hash_lock = HDR_LOCK(hdr);
 8651 
 8652                 /*
 8653                  * We cannot use mutex_enter or else we can deadlock
 8654                  * with l2arc_write_buffers (due to swapping the order
 8655                  * the hash lock and l2ad_mtx are taken).
 8656                  */
 8657                 if (!mutex_tryenter(hash_lock)) {
 8658                         /*
 8659                          * Missed the hash lock. We must retry so we
 8660                          * don't leave the ARC_FLAG_L2_WRITING bit set.
 8661                          */
 8662                         ARCSTAT_BUMP(arcstat_l2_writes_lock_retry);
 8663 
 8664                         /*
 8665                          * We don't want to rescan the headers we've
 8666                          * already marked as having been written out, so
 8667                          * we reinsert the head node so we can pick up
 8668                          * where we left off.
 8669                          */
 8670                         list_remove(buflist, head);
 8671                         list_insert_after(buflist, hdr, head);
 8672 
 8673                         mutex_exit(&dev->l2ad_mtx);
 8674 
 8675                         /*
 8676                          * We wait for the hash lock to become available
 8677                          * to try and prevent busy waiting, and increase
 8678                          * the chance we'll be able to acquire the lock
 8679                          * the next time around.
 8680                          */
 8681                         mutex_enter(hash_lock);
 8682                         mutex_exit(hash_lock);
 8683                         goto top;
 8684                 }
 8685 
 8686                 /*
 8687                  * We could not have been moved into the arc_l2c_only
 8688                  * state while in-flight due to our ARC_FLAG_L2_WRITING
 8689                  * bit being set. Let's just ensure that's being enforced.
 8690                  */
 8691                 ASSERT(HDR_HAS_L1HDR(hdr));
 8692 
 8693                 /*
 8694                  * Skipped - drop L2ARC entry and mark the header as no
 8695                  * longer L2 eligible.
 8696                  */
 8697                 if (zio->io_error != 0) {
 8698                         /*
 8699                          * Error - drop L2ARC entry.
 8700                          */
 8701                         list_remove(buflist, hdr);
 8702                         arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
 8703 
 8704                         uint64_t psize = HDR_GET_PSIZE(hdr);
 8705                         l2arc_hdr_arcstats_decrement(hdr);
 8706 
 8707                         bytes_dropped +=
 8708                             vdev_psize_to_asize(dev->l2ad_vdev, psize);
 8709                         (void) zfs_refcount_remove_many(&dev->l2ad_alloc,
 8710                             arc_hdr_size(hdr), hdr);
 8711                 }
 8712 
 8713                 /*
 8714                  * Allow ARC to begin reads and ghost list evictions to
 8715                  * this L2ARC entry.
 8716                  */
 8717                 arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);
 8718 
 8719                 mutex_exit(hash_lock);
 8720         }
 8721 
 8722         /*
 8723          * Free the allocated abd buffers for writing the log blocks.
 8724          * If the zio failed, reclaim the allocated space and remove the
 8725          * pointers to these log blocks from the log block pointer list
 8726          * of the L2ARC device.
 8727          */
 8728         while ((abd_buf = list_remove_tail(&cb->l2wcb_abd_list)) != NULL) {
 8729                 abd_free(abd_buf->abd);
 8730                 zio_buf_free(abd_buf, sizeof (*abd_buf));
 8731                 if (zio->io_error != 0) {
 8732                         lb_ptr_buf = list_remove_head(&dev->l2ad_lbptr_list);
 8733                         /*
 8734                          * L2BLK_GET_PSIZE returns aligned size for log
 8735                          * blocks.
 8736                          */
 8737                         uint64_t asize =
 8738                             L2BLK_GET_PSIZE((lb_ptr_buf->lb_ptr)->lbp_prop);
 8739                         bytes_dropped += asize;
 8740                         ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize);
 8741                         ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count);
 8742                         zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize,
 8743                             lb_ptr_buf);
 8744                         zfs_refcount_remove(&dev->l2ad_lb_count, lb_ptr_buf);
 8745                         kmem_free(lb_ptr_buf->lb_ptr,
 8746                             sizeof (l2arc_log_blkptr_t));
 8747                         kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t));
 8748                 }
 8749         }
 8750         list_destroy(&cb->l2wcb_abd_list);
 8751 
 8752         if (zio->io_error != 0) {
 8753                 ARCSTAT_BUMP(arcstat_l2_writes_error);
 8754 
 8755                 /*
 8756                  * Restore the lbps array in the header to its previous state.
 8757                  * If the list of log block pointers is empty, zero out the
 8758                  * log block pointers in the device header.
 8759                  */
 8760                 lb_ptr_buf = list_head(&dev->l2ad_lbptr_list);
 8761                 for (int i = 0; i < 2; i++) {
 8762                         if (lb_ptr_buf == NULL) {
 8763                                 /*
 8764                                  * If the list is empty, zero out the device
 8765                                  * header. Otherwise zero out the second log
 8766                                  * block pointer in the header.
 8767                                  */
 8768                                 if (i == 0) {
 8769                                         memset(l2dhdr, 0,
 8770                                             dev->l2ad_dev_hdr_asize);
 8771                                 } else {
 8772                                         memset(&l2dhdr->dh_start_lbps[i], 0,
 8773                                             sizeof (l2arc_log_blkptr_t));
 8774                                 }
 8775                                 break;
 8776                         }
 8777                         memcpy(&l2dhdr->dh_start_lbps[i], lb_ptr_buf->lb_ptr,
 8778                             sizeof (l2arc_log_blkptr_t));
 8779                         lb_ptr_buf = list_next(&dev->l2ad_lbptr_list,
 8780                             lb_ptr_buf);
 8781                 }
 8782         }
 8783 
 8784         ARCSTAT_BUMP(arcstat_l2_writes_done);
 8785         list_remove(buflist, head);
 8786         ASSERT(!HDR_HAS_L1HDR(head));
 8787         kmem_cache_free(hdr_l2only_cache, head);
 8788         mutex_exit(&dev->l2ad_mtx);
 8789 
 8790         ASSERT(dev->l2ad_vdev != NULL);
 8791         vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
 8792 
 8793         l2arc_do_free_on_write();
 8794 
 8795         kmem_free(cb, sizeof (l2arc_write_callback_t));
 8796 }
 8797 
 8798 static int
 8799 l2arc_untransform(zio_t *zio, l2arc_read_callback_t *cb)
 8800 {
 8801         int ret;
 8802         spa_t *spa = zio->io_spa;
 8803         arc_buf_hdr_t *hdr = cb->l2rcb_hdr;
 8804         blkptr_t *bp = zio->io_bp;
 8805         uint8_t salt[ZIO_DATA_SALT_LEN];
 8806         uint8_t iv[ZIO_DATA_IV_LEN];
 8807         uint8_t mac[ZIO_DATA_MAC_LEN];
 8808         boolean_t no_crypt = B_FALSE;
 8809 
 8810         /*
 8811  * ZIL data is never written to the L2ARC, so we don't need
 8812          * special handling for its unique MAC storage.
 8813          */
 8814         ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG);
 8815         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
 8816         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
 8817 
 8818         /*
 8819          * If the data was encrypted, decrypt it now. Note that
 8820          * we must check the bp here and not the hdr, since the
 8821          * hdr does not have its encryption parameters updated
 8822          * until arc_read_done().
 8823          */
 8824         if (BP_IS_ENCRYPTED(bp)) {
 8825                 abd_t *eabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr,
 8826                     ARC_HDR_DO_ADAPT | ARC_HDR_USE_RESERVE);
 8827 
 8828                 zio_crypt_decode_params_bp(bp, salt, iv);
 8829                 zio_crypt_decode_mac_bp(bp, mac);
 8830 
 8831                 ret = spa_do_crypt_abd(B_FALSE, spa, &cb->l2rcb_zb,
 8832                     BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp),
 8833                     salt, iv, mac, HDR_GET_PSIZE(hdr), eabd,
 8834                     hdr->b_l1hdr.b_pabd, &no_crypt);
 8835                 if (ret != 0) {
 8836                         arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr);
 8837                         goto error;
 8838                 }
 8839 
 8840                 /*
 8841                  * If we actually performed decryption, replace b_pabd
 8842                  * with the decrypted data. Otherwise we can just throw
 8843                  * our decryption buffer away.
 8844                  */
 8845                 if (!no_crypt) {
 8846                         arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
 8847                             arc_hdr_size(hdr), hdr);
 8848                         hdr->b_l1hdr.b_pabd = eabd;
 8849                         zio->io_abd = eabd;
 8850                 } else {
 8851                         arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr);
 8852                 }
 8853         }
 8854 
 8855         /*
 8856          * If the L2ARC block was compressed, but ARC compression
 8857          * is disabled we decompress the data into a new buffer and
 8858          * replace the existing data.
 8859          */
 8860         if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
 8861             !HDR_COMPRESSION_ENABLED(hdr)) {
 8862                 abd_t *cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr,
 8863                     ARC_HDR_DO_ADAPT | ARC_HDR_USE_RESERVE);
 8864                 void *tmp = abd_borrow_buf(cabd, arc_hdr_size(hdr));
 8865 
 8866                 ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),
 8867                     hdr->b_l1hdr.b_pabd, tmp, HDR_GET_PSIZE(hdr),
 8868                     HDR_GET_LSIZE(hdr), &hdr->b_complevel);
 8869                 if (ret != 0) {
 8870                         abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr));
 8871                         arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr);
 8872                         goto error;
 8873                 }
 8874 
 8875                 abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr));
 8876                 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
 8877                     arc_hdr_size(hdr), hdr);
 8878                 hdr->b_l1hdr.b_pabd = cabd;
 8879                 zio->io_abd = cabd;
 8880                 zio->io_size = HDR_GET_LSIZE(hdr);
 8881         }
 8882 
 8883         return (0);
 8884 
 8885 error:
 8886         return (ret);
 8887 }
 8888 
 8889 
 8890 /*
 8891  * A read to a cache device completed.  Validate buffer contents before
 8892  * handing over to the regular ARC routines.
 8893  */
 8894 static void
 8895 l2arc_read_done(zio_t *zio)
 8896 {
 8897         int tfm_error = 0;
 8898         l2arc_read_callback_t *cb = zio->io_private;
 8899         arc_buf_hdr_t *hdr;
 8900         kmutex_t *hash_lock;
 8901         boolean_t valid_cksum;
 8902         boolean_t using_rdata = (BP_IS_ENCRYPTED(&cb->l2rcb_bp) &&
 8903             (cb->l2rcb_flags & ZIO_FLAG_RAW_ENCRYPT));
 8904 
 8905         ASSERT3P(zio->io_vd, !=, NULL);
 8906         ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
 8907 
 8908         spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
 8909 
 8910         ASSERT3P(cb, !=, NULL);
 8911         hdr = cb->l2rcb_hdr;
 8912         ASSERT3P(hdr, !=, NULL);
 8913 
 8914         hash_lock = HDR_LOCK(hdr);
 8915         mutex_enter(hash_lock);
 8916         ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
 8917 
 8918         /*
 8919          * If the data was read into a temporary buffer,
 8920          * move it and free the buffer.
 8921          */
 8922         if (cb->l2rcb_abd != NULL) {
 8923                 ASSERT3U(arc_hdr_size(hdr), <, zio->io_size);
 8924                 if (zio->io_error == 0) {
 8925                         if (using_rdata) {
 8926                                 abd_copy(hdr->b_crypt_hdr.b_rabd,
 8927                                     cb->l2rcb_abd, arc_hdr_size(hdr));
 8928                         } else {
 8929                                 abd_copy(hdr->b_l1hdr.b_pabd,
 8930                                     cb->l2rcb_abd, arc_hdr_size(hdr));
 8931                         }
 8932                 }
 8933 
 8934                 /*
 8935                  * The following must be done regardless of whether
 8936                  * there was an error:
 8937                  * - free the temporary buffer
 8938                  * - point zio to the real ARC buffer
 8939                  * - set zio size accordingly
 8940                  * These are required because zio is either re-used for
 8941                  * an I/O of the block in the case of the error
 8942                  * or the zio is passed to arc_read_done() and it
 8943                  * needs real data.
 8944                  */
 8945                 abd_free(cb->l2rcb_abd);
 8946                 zio->io_size = zio->io_orig_size = arc_hdr_size(hdr);
 8947 
 8948                 if (using_rdata) {
 8949                         ASSERT(HDR_HAS_RABD(hdr));
 8950                         zio->io_abd = zio->io_orig_abd =
 8951                             hdr->b_crypt_hdr.b_rabd;
 8952                 } else {
 8953                         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
 8954                         zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd;
 8955                 }
 8956         }
 8957 
 8958         ASSERT3P(zio->io_abd, !=, NULL);
 8959 
 8960         /*
 8961          * Check this survived the L2ARC journey.
 8962          */
 8963         ASSERT(zio->io_abd == hdr->b_l1hdr.b_pabd ||
 8964             (HDR_HAS_RABD(hdr) && zio->io_abd == hdr->b_crypt_hdr.b_rabd));
 8965         zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
 8966         zio->io_bp = &zio->io_bp_copy;  /* XXX fix in L2ARC 2.0 */
 8967         zio->io_prop.zp_complevel = hdr->b_complevel;
 8968 
 8969         valid_cksum = arc_cksum_is_equal(hdr, zio);
 8970 
 8971         /*
 8972          * b_rabd will always match the data as it exists on disk if it is
 8973          * being used. Therefore if we are reading into b_rabd we do not
 8974          * attempt to untransform the data.
 8975          */
 8976         if (valid_cksum && !using_rdata)
 8977                 tfm_error = l2arc_untransform(zio, cb);
 8978 
 8979         if (valid_cksum && tfm_error == 0 && zio->io_error == 0 &&
 8980             !HDR_L2_EVICTED(hdr)) {
 8981                 mutex_exit(hash_lock);
 8982                 zio->io_private = hdr;
 8983                 arc_read_done(zio);
 8984         } else {
 8985                 /*
 8986                  * Buffer didn't survive caching.  Increment stats and
 8987                  * reissue to the original storage device.
 8988                  */
 8989                 if (zio->io_error != 0) {
 8990                         ARCSTAT_BUMP(arcstat_l2_io_error);
 8991                 } else {
 8992                         zio->io_error = SET_ERROR(EIO);
 8993                 }
 8994                 if (!valid_cksum || tfm_error != 0)
 8995                         ARCSTAT_BUMP(arcstat_l2_cksum_bad);
 8996 
 8997                 /*
 8998                  * If there's no waiter, issue an async i/o to the primary
 8999                  * storage now.  If there *is* a waiter, the caller must
 9000                  * issue the i/o in a context where it's OK to block.
 9001                  */
 9002                 if (zio->io_waiter == NULL) {
 9003                         zio_t *pio = zio_unique_parent(zio);
 9004                         void *abd = (using_rdata) ?
 9005                             hdr->b_crypt_hdr.b_rabd : hdr->b_l1hdr.b_pabd;
 9006 
 9007                         ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
 9008 
 9009                         zio = zio_read(pio, zio->io_spa, zio->io_bp,
 9010                             abd, zio->io_size, arc_read_done,
 9011                             hdr, zio->io_priority, cb->l2rcb_flags,
 9012                             &cb->l2rcb_zb);
 9013 
 9014                         /*
 9015                          * Original ZIO will be freed, so we need to update
 9016                          * ARC header with the new ZIO pointer to be used
 9017                          * by zio_change_priority() in arc_read().
 9018                          */
 9019                         for (struct arc_callback *acb = hdr->b_l1hdr.b_acb;
 9020                             acb != NULL; acb = acb->acb_next)
 9021                                 acb->acb_zio_head = zio;
 9022 
 9023                         mutex_exit(hash_lock);
 9024                         zio_nowait(zio);
 9025                 } else {
 9026                         mutex_exit(hash_lock);
 9027                 }
 9028         }
 9029 
 9030         kmem_free(cb, sizeof (l2arc_read_callback_t));
 9031 }
 9032 
 9033 /*
 9034  * This is the list priority from which the L2ARC will search for pages to
 9035  * cache.  This is used within loops (0..3) to cycle through lists in the
 9036  * desired order.  This order can have a significant effect on cache
 9037  * performance.
 9038  *
 9039  * Currently the metadata lists are hit first, MFU then MRU, followed by
 9040  * the data lists.  This function returns a locked list, and also returns
 9041  * the lock pointer.
 9042  */
 9043 static multilist_sublist_t *
 9044 l2arc_sublist_lock(int list_num)
 9045 {
 9046         multilist_t *ml = NULL;
 9047         unsigned int idx;
 9048 
 9049         ASSERT(list_num >= 0 && list_num < L2ARC_FEED_TYPES);
 9050 
 9051         switch (list_num) {
 9052         case 0:
 9053                 ml = &arc_mfu->arcs_list[ARC_BUFC_METADATA];
 9054                 break;
 9055         case 1:
 9056                 ml = &arc_mru->arcs_list[ARC_BUFC_METADATA];
 9057                 break;
 9058         case 2:
 9059                 ml = &arc_mfu->arcs_list[ARC_BUFC_DATA];
 9060                 break;
 9061         case 3:
 9062                 ml = &arc_mru->arcs_list[ARC_BUFC_DATA];
 9063                 break;
 9064         default:
 9065                 return (NULL);
 9066         }
 9067 
 9068         /*
 9069          * Return a randomly-selected sublist. This is acceptable
 9070          * because the caller feeds only a little bit of data for each
 9071          * call (8MB). Subsequent calls will result in different
 9072          * sublists being selected.
 9073          */
 9074         idx = multilist_get_random_index(ml);
 9075         return (multilist_sublist_lock(ml, idx));
 9076 }
 9077 
 9078 /*
 9079  * Calculates the maximum overhead of L2ARC metadata log blocks for a given
 9080  * L2ARC write size. l2arc_evict and l2arc_write_size need to include this
 9081  * overhead in processing to make sure there is enough headroom available
 9082  * when writing buffers.
 9083  */
 9084 static inline uint64_t
 9085 l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev)
 9086 {
 9087         if (dev->l2ad_log_entries == 0) {
 9088                 return (0);
 9089         } else {
 9090                 uint64_t log_entries = write_sz >> SPA_MINBLOCKSHIFT;
 9091 
 9092                 uint64_t log_blocks = (log_entries +
 9093                     dev->l2ad_log_entries - 1) /
 9094                     dev->l2ad_log_entries;
 9095 
 9096                 return (vdev_psize_to_asize(dev->l2ad_vdev,
 9097                     sizeof (l2arc_log_blk_phys_t)) * log_blocks);
 9098         }
 9099 }
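/*
 * Worked example: a write_sz of 8MB covers 8MB >> SPA_MINBLOCKSHIFT = 16384
 * minimum-sized (512-byte) blocks, the worst case number of log entries.
 * Assuming, for illustration, 1022 entries per log block, the rounded-up
 * division gives (16384 + 1021) / 1022 = 17 log blocks, so the overhead is
 * 17 times the allocated size of one l2arc_log_blk_phys_t on this vdev.
 */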
 9100 
 9101 /*
 9102  * Evict buffers from the device write hand to the distance specified in
 9103  * bytes. This distance may span populated buffers, or it may span nothing.
 9104  * This is clearing a region on the L2ARC device ready for writing.
 9105  * If the 'all' boolean is set, every buffer is evicted.
 9106  */
 9107 static void
 9108 l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
 9109 {
 9110         list_t *buflist;
 9111         arc_buf_hdr_t *hdr, *hdr_prev;
 9112         kmutex_t *hash_lock;
 9113         uint64_t taddr;
 9114         l2arc_lb_ptr_buf_t *lb_ptr_buf, *lb_ptr_buf_prev;
 9115         vdev_t *vd = dev->l2ad_vdev;
 9116         boolean_t rerun;
 9117 
 9118         buflist = &dev->l2ad_buflist;
 9119 
 9120         /*
 9121          * We need to add in the worst case scenario of log block overhead.
 9122          */
 9123         distance += l2arc_log_blk_overhead(distance, dev);
 9124         if (vd->vdev_has_trim && l2arc_trim_ahead > 0) {
 9125                 /*
 9126                  * Trim ahead of the write size by 64MB or (l2arc_trim_ahead/100)
 9127                  * times the write size, whichever is greater.
 9128                  */
 9129                 distance += MAX(64 * 1024 * 1024,
 9130                     (distance * l2arc_trim_ahead) / 100);
 9131         }
 9132 
 9133 top:
 9134         rerun = B_FALSE;
 9135         if (dev->l2ad_hand >= (dev->l2ad_end - distance)) {
 9136                 /*
 9137                  * When there is no space to accommodate upcoming writes,
 9138                  * evict to the end. Then bump the write and evict hands
 9139                  * to the start and iterate. This iteration does not
 9140                  * happen indefinitely as we make sure in
 9141                  * l2arc_write_size() that when the write hand is reset,
 9142                  * the write size does not exceed the end of the device.
 9143                  */
 9144                 rerun = B_TRUE;
 9145                 taddr = dev->l2ad_end;
 9146         } else {
 9147                 taddr = dev->l2ad_hand + distance;
 9148         }
 9149         DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
 9150             uint64_t, taddr, boolean_t, all);
 9151 
 9152         if (!all) {
 9153                 /*
 9154                  * This check has to be placed after deciding whether to
 9155                  * iterate (rerun).
 9156                  */
 9157                 if (dev->l2ad_first) {
 9158                         /*
 9159                          * This is the first sweep through the device. There is
 9160                          * nothing to evict. We have already trimmed the
 9161                          * whole device.
 9162                          */
 9163                         goto out;
 9164                 } else {
 9165                         /*
 9166                          * Trim the space to be evicted.
 9167                          */
 9168                         if (vd->vdev_has_trim && dev->l2ad_evict < taddr &&
 9169                             l2arc_trim_ahead > 0) {
 9170                                 /*
 9171                                  * We have to drop the spa_config lock because
 9172                                  * vdev_trim_range() will acquire it.
 9173                                  * l2ad_evict already accounts for the label
 9174                                  * size. To prevent vdev_trim_ranges() from
 9175                                  * adding it again, we subtract it from
 9176                                  * l2ad_evict.
 9177                                  */
 9178                                 spa_config_exit(dev->l2ad_spa, SCL_L2ARC, dev);
 9179                                 vdev_trim_simple(vd,
 9180                                     dev->l2ad_evict - VDEV_LABEL_START_SIZE,
 9181                                     taddr - dev->l2ad_evict);
 9182                                 spa_config_enter(dev->l2ad_spa, SCL_L2ARC, dev,
 9183                                     RW_READER);
 9184                         }
 9185 
 9186                         /*
 9187                          * When rebuilding L2ARC we retrieve the evict hand
 9188                          * from the header of the device. Of note, l2arc_evict()
 9189                          * does not actually delete buffers from the cache
 9190                          * device, but trimming may do so depending on the
 9191                          * hardware implementation. Thus keeping track of the
 9192                          * evict hand is useful.
 9193                          */
 9194                         dev->l2ad_evict = MAX(dev->l2ad_evict, taddr);
 9195                 }
 9196         }
 9197 
 9198 retry:
 9199         mutex_enter(&dev->l2ad_mtx);
 9200         /*
 9201          * We have to account for evicted log blocks. Run vdev_space_update()
 9202          * on log blocks whose offset (in bytes) is before the evicted offset
 9203          * (in bytes) by searching in the list of pointers to log blocks
 9204          * present in the L2ARC device.
 9205          */
 9206         for (lb_ptr_buf = list_tail(&dev->l2ad_lbptr_list); lb_ptr_buf;
 9207             lb_ptr_buf = lb_ptr_buf_prev) {
 9208 
 9209                 lb_ptr_buf_prev = list_prev(&dev->l2ad_lbptr_list, lb_ptr_buf);
 9210 
 9211                 /* L2BLK_GET_PSIZE returns aligned size for log blocks */
 9212                 uint64_t asize = L2BLK_GET_PSIZE(
 9213                     (lb_ptr_buf->lb_ptr)->lbp_prop);
 9214 
 9215                 /*
 9216                  * We don't worry about log blocks left behind (ie
 9217                  * lbp_payload_start < l2ad_hand) because l2arc_write_buffers()
 9218                  * will never write more than l2arc_evict() evicts.
 9219                  */
 9220                 if (!all && l2arc_log_blkptr_valid(dev, lb_ptr_buf->lb_ptr)) {
 9221                         break;
 9222                 } else {
 9223                         vdev_space_update(vd, -asize, 0, 0);
 9224                         ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize);
 9225                         ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count);
 9226                         zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize,
 9227                             lb_ptr_buf);
 9228                         zfs_refcount_remove(&dev->l2ad_lb_count, lb_ptr_buf);
 9229                         list_remove(&dev->l2ad_lbptr_list, lb_ptr_buf);
 9230                         kmem_free(lb_ptr_buf->lb_ptr,
 9231                             sizeof (l2arc_log_blkptr_t));
 9232                         kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t));
 9233                 }
 9234         }
 9235 
 9236         for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) {
 9237                 hdr_prev = list_prev(buflist, hdr);
 9238 
 9239                 ASSERT(!HDR_EMPTY(hdr));
 9240                 hash_lock = HDR_LOCK(hdr);
 9241 
 9242                 /*
 9243                  * We cannot use mutex_enter or else we can deadlock
 9244                  * with l2arc_write_buffers (due to swapping the order
 9245                  * the hash lock and l2ad_mtx are taken).
 9246                  */
 9247                 if (!mutex_tryenter(hash_lock)) {
 9248                         /*
 9249                          * Missed the hash lock.  Retry.
 9250                          */
 9251                         ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
 9252                         mutex_exit(&dev->l2ad_mtx);
 9253                         mutex_enter(hash_lock);
 9254                         mutex_exit(hash_lock);
 9255                         goto retry;
 9256                 }
 9257 
 9258                 /*
 9259                  * A header can't be on this list if it doesn't have an L2 header.
 9260                  */
 9261                 ASSERT(HDR_HAS_L2HDR(hdr));
 9262 
 9263                 /* Ensure this header has finished being written. */
 9264                 ASSERT(!HDR_L2_WRITING(hdr));
 9265                 ASSERT(!HDR_L2_WRITE_HEAD(hdr));
 9266 
 9267                 if (!all && (hdr->b_l2hdr.b_daddr >= dev->l2ad_evict ||
 9268                     hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) {
 9269                         /*
 9270                          * We've evicted to the target address,
 9271                          * or the end of the device.
 9272                          */
 9273                         mutex_exit(hash_lock);
 9274                         break;
 9275                 }
 9276 
 9277                 if (!HDR_HAS_L1HDR(hdr)) {
 9278                         ASSERT(!HDR_L2_READING(hdr));
 9279                         /*
 9280                          * This doesn't exist in the ARC.  Destroy.
 9281                          * arc_hdr_destroy() will call list_remove()
 9282                          * and decrement arcstat_l2_lsize.
 9283                          */
 9284                         arc_change_state(arc_anon, hdr);
 9285                         arc_hdr_destroy(hdr);
 9286                 } else {
 9287                         ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only);
 9288                         ARCSTAT_BUMP(arcstat_l2_evict_l1cached);
 9289                         /*
 9290                          * Invalidate issued or about to be issued
 9291                          * reads, since we may be about to write
 9292                          * over this location.
 9293                          */
 9294                         if (HDR_L2_READING(hdr)) {
 9295                                 ARCSTAT_BUMP(arcstat_l2_evict_reading);
 9296                                 arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);
 9297                         }
 9298 
 9299                         arc_hdr_l2hdr_destroy(hdr);
 9300                 }
 9301                 mutex_exit(hash_lock);
 9302         }
 9303         mutex_exit(&dev->l2ad_mtx);
 9304 
 9305 out:
 9306         /*
 9307          * We need to check whether we are evicting all buffers; otherwise we
 9308          * may iterate unnecessarily.
 9309          */
 9310         if (!all && rerun) {
 9311                 /*
 9312                  * Bump device hand to the device start if it is approaching the
 9313                  * end. l2arc_evict() has already evicted ahead for this case.
 9314                  */
 9315                 dev->l2ad_hand = dev->l2ad_start;
 9316                 dev->l2ad_evict = dev->l2ad_start;
 9317                 dev->l2ad_first = B_FALSE;
 9318                 goto top;
 9319         }
 9320 
 9321         if (!all) {
 9322                 /*
 9323                  * In case of cache device removal (all) the following
 9324                  * assertions may be violated without functional consequences
 9325                  * as the device is about to be removed.
 9326                  */
 9327                 ASSERT3U(dev->l2ad_hand + distance, <, dev->l2ad_end);
 9328                 if (!dev->l2ad_first)
 9329                         ASSERT3U(dev->l2ad_hand, <, dev->l2ad_evict);
 9330         }
 9331 }
 9332 
 9333 /*
 9334  * Handle any abd transforms that might be required for writing to the L2ARC.
 9335  * If successful, this function will always return an abd with the data
 9336  * transformed as it is on disk in a new abd of asize bytes.
 9337  */
 9338 static int
 9339 l2arc_apply_transforms(spa_t *spa, arc_buf_hdr_t *hdr, uint64_t asize,
 9340     abd_t **abd_out)
 9341 {
 9342         int ret;
 9343         void *tmp = NULL;
 9344         abd_t *cabd = NULL, *eabd = NULL, *to_write = hdr->b_l1hdr.b_pabd;
 9345         enum zio_compress compress = HDR_GET_COMPRESS(hdr);
 9346         uint64_t psize = HDR_GET_PSIZE(hdr);
 9347         uint64_t size = arc_hdr_size(hdr);
 9348         boolean_t ismd = HDR_ISTYPE_METADATA(hdr);
 9349         boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);
 9350         dsl_crypto_key_t *dck = NULL;
 9351         uint8_t mac[ZIO_DATA_MAC_LEN] = { 0 };
 9352         boolean_t no_crypt = B_FALSE;
 9353 
 9354         ASSERT((HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
 9355             !HDR_COMPRESSION_ENABLED(hdr)) ||
 9356             HDR_ENCRYPTED(hdr) || HDR_SHARED_DATA(hdr) || psize != asize);
 9357         ASSERT3U(psize, <=, asize);
 9358 
 9359         /*
 9360          * If this data simply needs its own buffer, we allocate it
 9361          * and copy the data. This may be done to eliminate a dependency on a
 9362          * shared buffer or to reallocate the buffer to match asize.
 9363          */
 9364         if (HDR_HAS_RABD(hdr) && asize != psize) {
 9365                 ASSERT3U(asize, >=, psize);
 9366                 to_write = abd_alloc_for_io(asize, ismd);
 9367                 abd_copy(to_write, hdr->b_crypt_hdr.b_rabd, psize);
 9368                 if (psize != asize)
 9369                         abd_zero_off(to_write, psize, asize - psize);
 9370                 goto out;
 9371         }
 9372 
 9373         if ((compress == ZIO_COMPRESS_OFF || HDR_COMPRESSION_ENABLED(hdr)) &&
 9374             !HDR_ENCRYPTED(hdr)) {
 9375                 ASSERT3U(size, ==, psize);
 9376                 to_write = abd_alloc_for_io(asize, ismd);
 9377                 abd_copy(to_write, hdr->b_l1hdr.b_pabd, size);
 9378                 if (size != asize)
 9379                         abd_zero_off(to_write, size, asize - size);
 9380                 goto out;
 9381         }
 9382 
 9383         if (compress != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) {
 9384                 /*
 9385                  * In some cases, we can wind up with size > asize, so
 9386                  * we need to opt for the larger allocation option here.
 9387                  *
 9388                  * (We also need abd_return_buf_copy() in all cases because
 9389                  * modifying a borrowed buffer and handing it back with plain
 9390                  * abd_return_buf() trips an ASSERT(), and nearly every
 9391                  * compressor writes to the buffer before deciding to fail
 9392                  * compression.)
 9393                  */
 9394                 cabd = abd_alloc_for_io(size, ismd);
 9395                 tmp = abd_borrow_buf(cabd, size);
 9396 
 9397                 psize = zio_compress_data(compress, to_write, tmp, size,
 9398                     hdr->b_complevel);
 9399 
 9400                 if (psize >= asize) {
 9401                         psize = HDR_GET_PSIZE(hdr);
 9402                         abd_return_buf_copy(cabd, tmp, size);
 9403                         HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_OFF);
 9404                         to_write = cabd;
 9405                         abd_copy(to_write, hdr->b_l1hdr.b_pabd, psize);
 9406                         if (psize != asize)
 9407                                 abd_zero_off(to_write, psize, asize - psize);
 9408                         goto encrypt;
 9409                 }
 9410                 ASSERT3U(psize, <=, HDR_GET_PSIZE(hdr));
 9411                 if (psize < asize)
 9412                         memset((char *)tmp + psize, 0, asize - psize);
 9413                 psize = HDR_GET_PSIZE(hdr);
 9414                 abd_return_buf_copy(cabd, tmp, size);
 9415                 to_write = cabd;
 9416         }
 9417 
 9418 encrypt:
 9419         if (HDR_ENCRYPTED(hdr)) {
 9420                 eabd = abd_alloc_for_io(asize, ismd);
 9421 
 9422                 /*
 9423                  * If the dataset was disowned before the buffer
 9424                  * made it to this point, the key to re-encrypt
 9425                  * it won't be available. In this case we simply
 9426                  * won't write the buffer to the L2ARC.
 9427                  */
 9428                 ret = spa_keystore_lookup_key(spa, hdr->b_crypt_hdr.b_dsobj,
 9429                     FTAG, &dck);
 9430                 if (ret != 0)
 9431                         goto error;
 9432 
 9433                 ret = zio_do_crypt_abd(B_TRUE, &dck->dck_key,
 9434                     hdr->b_crypt_hdr.b_ot, bswap, hdr->b_crypt_hdr.b_salt,
 9435                     hdr->b_crypt_hdr.b_iv, mac, psize, to_write, eabd,
 9436                     &no_crypt);
 9437                 if (ret != 0)
 9438                         goto error;
 9439 
 9440                 if (no_crypt)
 9441                         abd_copy(eabd, to_write, psize);
 9442 
 9443                 if (psize != asize)
 9444                         abd_zero_off(eabd, psize, asize - psize);
 9445 
 9446                 /* assert that the MAC we got here matches the one we saved */
 9447                 ASSERT0(memcmp(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN));
 9448                 spa_keystore_dsl_key_rele(spa, dck, FTAG);
 9449 
 9450                 if (to_write == cabd)
 9451                         abd_free(cabd);
 9452 
 9453                 to_write = eabd;
 9454         }
 9455 
 9456 out:
 9457         ASSERT3P(to_write, !=, hdr->b_l1hdr.b_pabd);
 9458         *abd_out = to_write;
 9459         return (0);
 9460 
 9461 error:
 9462         if (dck != NULL)
 9463                 spa_keystore_dsl_key_rele(spa, dck, FTAG);
 9464         if (cabd != NULL)
 9465                 abd_free(cabd);
 9466         if (eabd != NULL)
 9467                 abd_free(eabd);
 9468 
 9469         *abd_out = NULL;
 9470         return (ret);
 9471 }
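/*
 * To summarize the paths above: l2arc_apply_transforms() always hands back an
 * abd of exactly asize bytes, zero-padded past psize.  Depending on the
 * header it is a straight copy of b_rabd or b_pabd, or a freshly compressed
 * and/or re-encrypted copy of the data.  The caller (l2arc_write_buffers())
 * then queues the returned abd on the free-on-write list so that it is only
 * freed once the write zio has completed.
 */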
 9472 
 9473 static void
 9474 l2arc_blk_fetch_done(zio_t *zio)
 9475 {
 9476         l2arc_read_callback_t *cb;
 9477 
 9478         cb = zio->io_private;
 9479         if (cb->l2rcb_abd != NULL)
 9480                 abd_free(cb->l2rcb_abd);
 9481         kmem_free(cb, sizeof (l2arc_read_callback_t));
 9482 }
 9483 
 9484 /*
 9485  * Find and write ARC buffers to the L2ARC device.
 9486  *
 9487  * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid
 9488  * for reading until they have completed writing.
 9489  * The l2arc_headroom and l2arc_headroom_boost tunables determine how far
 9490  * down the ARC lists are scanned for write candidates on each pass.
 9491  *
 9492  * Returns the number of bytes actually written (which may be smaller than
 9493  * the delta by which the device hand has changed due to alignment and the
 9494  * writing of log blocks).
 9495  */
 9496 static uint64_t
 9497 l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
 9498 {
 9499         arc_buf_hdr_t           *hdr, *hdr_prev, *head;
 9500         uint64_t                write_asize, write_psize, write_lsize, headroom;
 9501         boolean_t               full;
 9502         l2arc_write_callback_t  *cb = NULL;
 9503         zio_t                   *pio, *wzio;
 9504         uint64_t                guid = spa_load_guid(spa);
 9505         l2arc_dev_hdr_phys_t    *l2dhdr = dev->l2ad_dev_hdr;
 9506 
 9507         ASSERT3P(dev->l2ad_vdev, !=, NULL);
 9508 
 9509         pio = NULL;
 9510         write_lsize = write_asize = write_psize = 0;
 9511         full = B_FALSE;
 9512         head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
 9513         arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);
 9514 
 9515         /*
 9516          * Copy buffers for L2ARC writing.
 9517          */
 9518         for (int pass = 0; pass < L2ARC_FEED_TYPES; pass++) {
 9519                 /*
 9520                  * If pass == 1 or 3, we cache MRU metadata and data
 9521                  * respectively.
 9522                  */
 9523                 if (l2arc_mfuonly) {
 9524                         if (pass == 1 || pass == 3)
 9525                                 continue;
 9526                 }
 9527 
 9528                 multilist_sublist_t *mls = l2arc_sublist_lock(pass);
 9529                 uint64_t passed_sz = 0;
 9530 
 9531                 VERIFY3P(mls, !=, NULL);
 9532 
 9533                 /*
 9534                  * L2ARC fast warmup.
 9535                  *
 9536                  * Until the ARC is warm and starts to evict, read from the
 9537                  * head of the ARC lists rather than the tail.
 9538                  */
 9539                 if (arc_warm == B_FALSE)
 9540                         hdr = multilist_sublist_head(mls);
 9541                 else
 9542                         hdr = multilist_sublist_tail(mls);
 9543 
 9544                 headroom = target_sz * l2arc_headroom;
 9545                 if (zfs_compressed_arc_enabled)
 9546                         headroom = (headroom * l2arc_headroom_boost) / 100;
 9547 
 9548                 for (; hdr; hdr = hdr_prev) {
 9549                         kmutex_t *hash_lock;
 9550                         abd_t *to_write = NULL;
 9551 
 9552                         if (arc_warm == B_FALSE)
 9553                                 hdr_prev = multilist_sublist_next(mls, hdr);
 9554                         else
 9555                                 hdr_prev = multilist_sublist_prev(mls, hdr);
 9556 
 9557                         hash_lock = HDR_LOCK(hdr);
 9558                         if (!mutex_tryenter(hash_lock)) {
 9559                                 /*
 9560                                  * Skip this buffer rather than waiting.
 9561                                  */
 9562                                 continue;
 9563                         }
 9564 
 9565                         passed_sz += HDR_GET_LSIZE(hdr);
 9566                         if (l2arc_headroom != 0 && passed_sz > headroom) {
 9567                                 /*
 9568                                  * Searched too far.
 9569                                  */
 9570                                 mutex_exit(hash_lock);
 9571                                 break;
 9572                         }
 9573 
 9574                         if (!l2arc_write_eligible(guid, hdr)) {
 9575                                 mutex_exit(hash_lock);
 9576                                 continue;
 9577                         }
 9578 
 9579                         ASSERT(HDR_HAS_L1HDR(hdr));
 9580 
 9581                         ASSERT3U(HDR_GET_PSIZE(hdr), >, 0);
 9582                         ASSERT3U(arc_hdr_size(hdr), >, 0);
 9583                         ASSERT(hdr->b_l1hdr.b_pabd != NULL ||
 9584                             HDR_HAS_RABD(hdr));
 9585                         uint64_t psize = HDR_GET_PSIZE(hdr);
 9586                         uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
 9587                             psize);
 9588 
 9589                         if ((write_asize + asize) > target_sz) {
 9590                                 full = B_TRUE;
 9591                                 mutex_exit(hash_lock);
 9592                                 break;
 9593                         }
 9594 
 9595                         /*
 9596                          * We rely on the L1 portion of the header below, so
 9597                          * it's invalid for this header to have been evicted out
 9598                          * of the ghost cache, prior to being written out. The
 9599                          * ARC_FLAG_L2_WRITING bit ensures this won't happen.
 9600                          */
 9601                         arc_hdr_set_flags(hdr, ARC_FLAG_L2_WRITING);
 9602 
 9603                         /*
 9604                          * If this header has b_rabd, we can use this since it
 9605                          * must always match the data exactly as it exists on
 9606                          * disk. Otherwise, the L2ARC can normally use the
 9607                          * hdr's data, but if we're sharing data between the
 9608                          * hdr and one of its bufs, L2ARC needs its own copy of
 9609                          * the data so that the ZIO below can't race with the
 9610                          * buf consumer. To ensure that this copy will be
 9611                          * available for the lifetime of the ZIO and be cleaned
 9612                          * up afterwards, we add it to the l2arc_free_on_write
 9613                          * queue. If we need to apply any transforms to the
 9614                          * data (compression, encryption) we will also need the
 9615                          * extra buffer.
 9616                          */
 9617                         if (HDR_HAS_RABD(hdr) && psize == asize) {
 9618                                 to_write = hdr->b_crypt_hdr.b_rabd;
 9619                         } else if ((HDR_COMPRESSION_ENABLED(hdr) ||
 9620                             HDR_GET_COMPRESS(hdr) == ZIO_COMPRESS_OFF) &&
 9621                             !HDR_ENCRYPTED(hdr) && !HDR_SHARED_DATA(hdr) &&
 9622                             psize == asize) {
 9623                                 to_write = hdr->b_l1hdr.b_pabd;
 9624                         } else {
 9625                                 int ret;
 9626                                 arc_buf_contents_t type = arc_buf_type(hdr);
 9627 
 9628                                 ret = l2arc_apply_transforms(spa, hdr, asize,
 9629                                     &to_write);
 9630                                 if (ret != 0) {
 9631                                         arc_hdr_clear_flags(hdr,
 9632                                             ARC_FLAG_L2_WRITING);
 9633                                         mutex_exit(hash_lock);
 9634                                         continue;
 9635                                 }
 9636 
 9637                                 l2arc_free_abd_on_write(to_write, asize, type);
 9638                         }
 9639 
 9640                         if (pio == NULL) {
 9641                                 /*
 9642                                  * Insert a dummy header on the buflist so
 9643                                  * l2arc_write_done() can find where the
 9644                                  * write buffers begin without searching.
 9645                                  */
 9646                                 mutex_enter(&dev->l2ad_mtx);
 9647                                 list_insert_head(&dev->l2ad_buflist, head);
 9648                                 mutex_exit(&dev->l2ad_mtx);
 9649 
 9650                                 cb = kmem_alloc(
 9651                                     sizeof (l2arc_write_callback_t), KM_SLEEP);
 9652                                 cb->l2wcb_dev = dev;
 9653                                 cb->l2wcb_head = head;
 9654                                 /*
 9655                                  * Create a list to save allocated abd buffers
 9656                                  * for l2arc_log_blk_commit().
 9657                                  */
 9658                                 list_create(&cb->l2wcb_abd_list,
 9659                                     sizeof (l2arc_lb_abd_buf_t),
 9660                                     offsetof(l2arc_lb_abd_buf_t, node));
 9661                                 pio = zio_root(spa, l2arc_write_done, cb,
 9662                                     ZIO_FLAG_CANFAIL);
 9663                         }
 9664 
 9665                         hdr->b_l2hdr.b_dev = dev;
 9666                         hdr->b_l2hdr.b_hits = 0;
 9667 
 9668                         hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
 9669                         hdr->b_l2hdr.b_arcs_state =
 9670                             hdr->b_l1hdr.b_state->arcs_state;
 9671                         arc_hdr_set_flags(hdr, ARC_FLAG_HAS_L2HDR);
 9672 
 9673                         mutex_enter(&dev->l2ad_mtx);
 9674                         list_insert_head(&dev->l2ad_buflist, hdr);
 9675                         mutex_exit(&dev->l2ad_mtx);
 9676 
 9677                         (void) zfs_refcount_add_many(&dev->l2ad_alloc,
 9678                             arc_hdr_size(hdr), hdr);
 9679 
 9680                         wzio = zio_write_phys(pio, dev->l2ad_vdev,
 9681                             hdr->b_l2hdr.b_daddr, asize, to_write,
 9682                             ZIO_CHECKSUM_OFF, NULL, hdr,
 9683                             ZIO_PRIORITY_ASYNC_WRITE,
 9684                             ZIO_FLAG_CANFAIL, B_FALSE);
 9685 
 9686                         write_lsize += HDR_GET_LSIZE(hdr);
 9687                         DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
 9688                             zio_t *, wzio);
 9689 
 9690                         write_psize += psize;
 9691                         write_asize += asize;
 9692                         dev->l2ad_hand += asize;
 9693                         l2arc_hdr_arcstats_increment(hdr);
 9694                         vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
 9695 
 9696                         mutex_exit(hash_lock);
 9697 
 9698                         /*
 9699                          * Append buf info to current log and commit if full.
 9700                          * arcstat_l2_{size,asize} kstats are updated
 9701                          * internally.
 9702                          */
 9703                         if (l2arc_log_blk_insert(dev, hdr))
 9704                                 l2arc_log_blk_commit(dev, pio, cb);
 9705 
 9706                         zio_nowait(wzio);
 9707                 }
 9708 
 9709                 multilist_sublist_unlock(mls);
 9710 
 9711                 if (full == B_TRUE)
 9712                         break;
 9713         }
 9714 
 9715         /* No buffers selected for writing? */
 9716         if (pio == NULL) {
 9717                 ASSERT0(write_lsize);
 9718                 ASSERT(!HDR_HAS_L1HDR(head));
 9719                 kmem_cache_free(hdr_l2only_cache, head);
 9720 
 9721                 /*
 9722                  * Although we did not write any buffers, l2ad_evict may
 9723                  * have advanced.
 9724                  */
 9725                 if (dev->l2ad_evict != l2dhdr->dh_evict)
 9726                         l2arc_dev_hdr_update(dev);
 9727 
 9728                 return (0);
 9729         }
 9730 
 9731         if (!dev->l2ad_first)
 9732                 ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict);
 9733 
 9734         ASSERT3U(write_asize, <=, target_sz);
 9735         ARCSTAT_BUMP(arcstat_l2_writes_sent);
 9736         ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);
 9737 
 9738         dev->l2ad_writing = B_TRUE;
 9739         (void) zio_wait(pio);
 9740         dev->l2ad_writing = B_FALSE;
 9741 
 9742         /*
 9743          * Update the device header after the zio completes as
 9744          * l2arc_write_done() may have updated the memory holding the log block
 9745          * pointers in the device header.
 9746          */
 9747         l2arc_dev_hdr_update(dev);
 9748 
 9749         return (write_asize);
 9750 }
 9751 
 9752 static boolean_t
 9753 l2arc_hdr_limit_reached(void)
 9754 {
 9755         int64_t s = aggsum_upper_bound(&arc_sums.arcstat_l2_hdr_size);
 9756 
 9757         return (arc_reclaim_needed() || (s > arc_meta_limit * 3 / 4) ||
 9758             (s > (arc_warm ? arc_c : arc_c_max) * l2arc_meta_percent / 100));
 9759 }
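
      /*
       * Editor's note (not part of upstream arc.c): a worked example of the
       * check above, assuming hypothetical values arc_meta_limit = 4 GiB,
       * arc_c_max = 16 GiB and l2arc_meta_percent = 33.  The limit is
       * considered reached when arc_reclaim_needed() is true, when the L2ARC
       * header footprint exceeds 3/4 * 4 GiB = 3 GiB, or when it exceeds
       * 33% of 16 GiB ~= 5.3 GiB (arc_c is used instead of arc_c_max once
       * the cache is warm).  Any one of the three conditions suffices.
       */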
 9760 
 9761 /*
 9762  * This thread feeds the L2ARC at regular intervals.  This is the beating
 9763  * heart of the L2ARC.
 9764  */
 9765 static  __attribute__((noreturn)) void
 9766 l2arc_feed_thread(void *unused)
 9767 {
 9768         (void) unused;
 9769         callb_cpr_t cpr;
 9770         l2arc_dev_t *dev;
 9771         spa_t *spa;
 9772         uint64_t size, wrote;
 9773         clock_t begin, next = ddi_get_lbolt();
 9774         fstrans_cookie_t cookie;
 9775 
 9776         CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
 9777 
 9778         mutex_enter(&l2arc_feed_thr_lock);
 9779 
 9780         cookie = spl_fstrans_mark();
 9781         while (l2arc_thread_exit == 0) {
 9782                 CALLB_CPR_SAFE_BEGIN(&cpr);
 9783                 (void) cv_timedwait_idle(&l2arc_feed_thr_cv,
 9784                     &l2arc_feed_thr_lock, next);
 9785                 CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
 9786                 next = ddi_get_lbolt() + hz;
 9787 
 9788                 /*
 9789                  * Quick check for L2ARC devices.
 9790                  */
 9791                 mutex_enter(&l2arc_dev_mtx);
 9792                 if (l2arc_ndev == 0) {
 9793                         mutex_exit(&l2arc_dev_mtx);
 9794                         continue;
 9795                 }
 9796                 mutex_exit(&l2arc_dev_mtx);
 9797                 begin = ddi_get_lbolt();
 9798 
 9799                 /*
 9800                  * This selects the next l2arc device to write to, and in
 9801                  * doing so the next spa to feed from: dev->l2ad_spa.   This
 9802                  * will return NULL if there are now no l2arc devices or if
 9803                  * they are all faulted.
 9804                  *
 9805                  * If a device is returned, its spa's config lock is also
 9806                  * held to prevent device removal.  l2arc_dev_get_next()
 9807                  * will grab and release l2arc_dev_mtx.
 9808                  */
 9809                 if ((dev = l2arc_dev_get_next()) == NULL)
 9810                         continue;
 9811 
 9812                 spa = dev->l2ad_spa;
 9813                 ASSERT3P(spa, !=, NULL);
 9814 
 9815                 /*
 9816                  * If the pool is read-only then force the feed thread to
 9817                  * sleep a little longer.
 9818                  */
 9819                 if (!spa_writeable(spa)) {
 9820                         next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
 9821                         spa_config_exit(spa, SCL_L2ARC, dev);
 9822                         continue;
 9823                 }
 9824 
 9825                 /*
 9826                  * Avoid contributing to memory pressure.
 9827                  */
 9828                 if (l2arc_hdr_limit_reached()) {
 9829                         ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
 9830                         spa_config_exit(spa, SCL_L2ARC, dev);
 9831                         continue;
 9832                 }
 9833 
 9834                 ARCSTAT_BUMP(arcstat_l2_feeds);
 9835 
 9836                 size = l2arc_write_size(dev);
 9837 
 9838                 /*
 9839                  * Evict L2ARC buffers that will be overwritten.
 9840                  */
 9841                 l2arc_evict(dev, size, B_FALSE);
 9842 
 9843                 /*
 9844                  * Write ARC buffers.
 9845                  */
 9846                 wrote = l2arc_write_buffers(spa, dev, size);
 9847 
 9848                 /*
 9849                  * Calculate interval between writes.
 9850                  */
 9851                 next = l2arc_write_interval(begin, size, wrote);
 9852                 spa_config_exit(spa, SCL_L2ARC, dev);
 9853         }
 9854         spl_fstrans_unmark(cookie);
 9855 
 9856         l2arc_thread_exit = 0;
 9857         cv_broadcast(&l2arc_feed_thr_cv);
 9858         CALLB_CPR_EXIT(&cpr);           /* drops l2arc_feed_thr_lock */
 9859         thread_exit();
 9860 }
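
      /*
       * Editor's sketch (not part of upstream arc.c) of one feed cycle, as
       * implemented by the loop above:
       *
       *      dev = l2arc_dev_get_next();             (holds SCL_L2ARC)
       *      size = l2arc_write_size(dev);
       *      l2arc_evict(dev, size, B_FALSE);        (make room ahead of the hand)
       *      wrote = l2arc_write_buffers(spa, dev, size);
       *      next = l2arc_write_interval(begin, size, wrote);
       *      spa_config_exit(spa, SCL_L2ARC, dev);
       *
       * Read-only pools and L2ARC header-size pressure short-circuit the
       * cycle before any eviction or write takes place.
       */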
 9861 
 9862 boolean_t
 9863 l2arc_vdev_present(vdev_t *vd)
 9864 {
 9865         return (l2arc_vdev_get(vd) != NULL);
 9866 }
 9867 
 9868 /*
 9869  * Returns the l2arc_dev_t associated with a particular vdev_t or NULL if
 9870  * the vdev_t isn't an L2ARC device.
 9871  */
 9872 l2arc_dev_t *
 9873 l2arc_vdev_get(vdev_t *vd)
 9874 {
 9875         l2arc_dev_t     *dev;
 9876 
 9877         mutex_enter(&l2arc_dev_mtx);
 9878         for (dev = list_head(l2arc_dev_list); dev != NULL;
 9879             dev = list_next(l2arc_dev_list, dev)) {
 9880                 if (dev->l2ad_vdev == vd)
 9881                         break;
 9882         }
 9883         mutex_exit(&l2arc_dev_mtx);
 9884 
 9885         return (dev);
 9886 }
 9887 
 9888 static void
 9889 l2arc_rebuild_dev(l2arc_dev_t *dev, boolean_t reopen)
 9890 {
 9891         l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
 9892         uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;
 9893         spa_t *spa = dev->l2ad_spa;
 9894 
 9895         /*
 9896          * The L2ARC has to hold at least the payload of one log block for
 9897          * its buffers to be restored (persistent L2ARC). The payload of a log
 9898          * block depends on the number of its log entries. Log blocks are
 9899          * written with up to 1022 entries; how many of them are committed or
 9900          * restored depends on the size of the L2ARC device. Thus the maximum
 9901          * payload of one log block is 1022 * SPA_MAXBLOCKSIZE = 16GB. If the
 9902          * L2ARC device is smaller than that, we reduce the number of committed
 9903          * and restored log entries per block so as to enable persistence.
 9904          */
 9905         if (dev->l2ad_end < l2arc_rebuild_blocks_min_l2size) {
 9906                 dev->l2ad_log_entries = 0;
 9907         } else {
 9908                 dev->l2ad_log_entries = MIN((dev->l2ad_end -
 9909                     dev->l2ad_start) >> SPA_MAXBLOCKSHIFT,
 9910                     L2ARC_LOG_BLK_MAX_ENTRIES);
 9911         }
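
              /*
               * Illustrative arithmetic (editor's note, not part of upstream
               * arc.c): with SPA_MAXBLOCKSHIFT = 24 (16 MiB), a hypothetical
               * 4 GiB usable region gives (l2ad_end - l2ad_start) >>
               * SPA_MAXBLOCKSHIFT = 256, so l2ad_log_entries = MIN(256,
               * L2ARC_LOG_BLK_MAX_ENTRIES) = 256.  Only devices with roughly
               * 1022 * 16 MiB (~16 GiB) or more of usable space use the full
               * 1022 entries per log block.
               */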
 9912 
 9913         /*
 9914          * Read the device header, if an error is returned do not rebuild L2ARC.
 9915          */
 9916         if (l2arc_dev_hdr_read(dev) == 0 && dev->l2ad_log_entries > 0) {
 9917                 /*
 9918                  * If we are onlining a cache device (vdev_reopen) that was
 9919                  * still present (l2arc_vdev_present()) and rebuild is enabled,
 9920                  * we should evict all ARC buffers and pointers to log blocks
 9921                  * and reclaim their space before restoring its contents to
 9922                  * L2ARC.
 9923                  */
 9924                 if (reopen) {
 9925                         if (!l2arc_rebuild_enabled) {
 9926                                 return;
 9927                         } else {
 9928                                 l2arc_evict(dev, 0, B_TRUE);
 9929                                 /* start a new log block */
 9930                                 dev->l2ad_log_ent_idx = 0;
 9931                                 dev->l2ad_log_blk_payload_asize = 0;
 9932                                 dev->l2ad_log_blk_payload_start = 0;
 9933                         }
 9934                 }
 9935                 /*
 9936                  * Just mark the device as pending for a rebuild. We won't
 9937          * start a rebuild inline here, as it would block pool
 9938          * import. Instead, spa_load_impl() will hand that off to an
 9939          * async task which will call l2arc_spa_rebuild_start().
 9940                  */
 9941                 dev->l2ad_rebuild = B_TRUE;
 9942         } else if (spa_writeable(spa)) {
 9943                 /*
 9944                  * In this case TRIM the whole device if l2arc_trim_ahead > 0,
 9945                  * otherwise create a new header. We zero out the memory holding
 9946                  * the header to reset dh_start_lbps. If we TRIM the whole
 9947                  * device the new header will be written by
 9948                  * vdev_trim_l2arc_thread() at the end of the TRIM to update the
 9949                  * trim_state in the header too. When reading the header, if
 9950                  * trim_state is not VDEV_TRIM_COMPLETE and l2arc_trim_ahead > 0
 9951                  * we opt to TRIM the whole device again.
 9952                  */
 9953                 if (l2arc_trim_ahead > 0) {
 9954                         dev->l2ad_trim_all = B_TRUE;
 9955                 } else {
 9956                         memset(l2dhdr, 0, l2dhdr_asize);
 9957                         l2arc_dev_hdr_update(dev);
 9958                 }
 9959         }
 9960 }
 9961 
 9962 /*
 9963  * Add a vdev for use by the L2ARC.  By this point the spa has already
 9964  * validated the vdev and opened it.
 9965  */
 9966 void
 9967 l2arc_add_vdev(spa_t *spa, vdev_t *vd)
 9968 {
 9969         l2arc_dev_t             *adddev;
 9970         uint64_t                l2dhdr_asize;
 9971 
 9972         ASSERT(!l2arc_vdev_present(vd));
 9973 
 9974         /*
 9975          * Create a new l2arc device entry.
 9976          */
 9977         adddev = vmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
 9978         adddev->l2ad_spa = spa;
 9979         adddev->l2ad_vdev = vd;
 9980         /* leave extra size for an l2arc device header */
 9981         l2dhdr_asize = adddev->l2ad_dev_hdr_asize =
 9982             MAX(sizeof (*adddev->l2ad_dev_hdr), 1 << vd->vdev_ashift);
 9983         adddev->l2ad_start = VDEV_LABEL_START_SIZE + l2dhdr_asize;
 9984         adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
 9985         ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end);
 9986         adddev->l2ad_hand = adddev->l2ad_start;
 9987         adddev->l2ad_evict = adddev->l2ad_start;
 9988         adddev->l2ad_first = B_TRUE;
 9989         adddev->l2ad_writing = B_FALSE;
 9990         adddev->l2ad_trim_all = B_FALSE;
 9991         list_link_init(&adddev->l2ad_node);
 9992         adddev->l2ad_dev_hdr = kmem_zalloc(l2dhdr_asize, KM_SLEEP);
 9993 
 9994         mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);
 9995         /*
 9996          * This is a list of all ARC buffers that are still valid on the
 9997          * device.
 9998          */
 9999         list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
10000             offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));
10001 
10002         /*
10003          * This is a list of pointers to log blocks that are still present
10004          * on the device.
10005          */
10006         list_create(&adddev->l2ad_lbptr_list, sizeof (l2arc_lb_ptr_buf_t),
10007             offsetof(l2arc_lb_ptr_buf_t, node));
10008 
10009         vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
10010         zfs_refcount_create(&adddev->l2ad_alloc);
10011         zfs_refcount_create(&adddev->l2ad_lb_asize);
10012         zfs_refcount_create(&adddev->l2ad_lb_count);
10013 
10014         /*
10015          * Decide if dev is eligible for L2ARC rebuild or whole device
10016          * trimming. This has to happen before the device is added in the
10017          * cache device list and l2arc_dev_mtx is released. Otherwise
10018          * l2arc_feed_thread() might already start writing on the
10019          * device.
10020          */
10021         l2arc_rebuild_dev(adddev, B_FALSE);
10022 
10023         /*
10024          * Add device to global list
10025          */
10026         mutex_enter(&l2arc_dev_mtx);
10027         list_insert_head(l2arc_dev_list, adddev);
10028         atomic_inc_64(&l2arc_ndev);
10029         mutex_exit(&l2arc_dev_mtx);
10030 }
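
      /*
       * Editor's sketch (not part of upstream arc.c) of the layout that
       * l2arc_add_vdev() establishes on the cache device (not to scale):
       *
       *   offset 0
       *   +-------------------------+------------------------+-------------
       *   | vdev labels / boot area | l2arc device header    | rotary write
       *   | (VDEV_LABEL_START_SIZE) | (l2ad_dev_hdr_asize)   | buffer ...
       *   +-------------------------+------------------------+-------------
       *                                                      ^
       *                                                  l2ad_start
       *
       * The rotary buffer extends to l2ad_end = VDEV_LABEL_START_SIZE +
       * vdev_get_min_asize(vd); l2ad_hand and l2ad_evict both start at
       * l2ad_start.
       */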
10031 
10032 /*
10033  * Decide if a vdev is eligible for L2ARC rebuild, called from vdev_reopen()
10034  * in case of onlining a cache device.
10035  */
10036 void
10037 l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen)
10038 {
10039         l2arc_dev_t             *dev = NULL;
10040 
10041         dev = l2arc_vdev_get(vd);
10042         ASSERT3P(dev, !=, NULL);
10043 
10044         /*
10045          * In contrast to l2arc_add_vdev() we do not have to worry about
10046          * l2arc_feed_thread() invalidating previous content when onlining a
10047          * cache device. The device parameters (l2ad*) are not cleared when
10048          * offlining the device, and writing new buffers will not invalidate
10049          * all previous content. In the worst case, only buffers that have not had
10050          * their log block written to the device will be lost.
10051          * When onlining the cache device (i.e. offline->online without exporting
10052          * the pool in between) this happens:
10053          * vdev_reopen() -> vdev_open() -> l2arc_rebuild_vdev()
10054          *                      |                       |
10055          *              vdev_is_dead() = B_FALSE        l2ad_rebuild = B_TRUE
10056          * During the time when vdev_is_dead = B_FALSE and until l2ad_rebuild
10057          * is set to B_TRUE we might write additional buffers to the device.
10058          */
10059         l2arc_rebuild_dev(dev, reopen);
10060 }
10061 
10062 /*
10063  * Remove a vdev from the L2ARC.
10064  */
10065 void
10066 l2arc_remove_vdev(vdev_t *vd)
10067 {
10068         l2arc_dev_t *remdev = NULL;
10069 
10070         /*
10071          * Find the device by vdev
10072          */
10073         remdev = l2arc_vdev_get(vd);
10074         ASSERT3P(remdev, !=, NULL);
10075 
10076         /*
10077          * Cancel any ongoing or scheduled rebuild.
10078          */
10079         mutex_enter(&l2arc_rebuild_thr_lock);
10080         if (remdev->l2ad_rebuild_began == B_TRUE) {
10081                 remdev->l2ad_rebuild_cancel = B_TRUE;
10082                 while (remdev->l2ad_rebuild == B_TRUE)
10083                         cv_wait(&l2arc_rebuild_thr_cv, &l2arc_rebuild_thr_lock);
10084         }
10085         mutex_exit(&l2arc_rebuild_thr_lock);
10086 
10087         /*
10088          * Remove device from global list
10089          */
10090         mutex_enter(&l2arc_dev_mtx);
10091         list_remove(l2arc_dev_list, remdev);
10092         l2arc_dev_last = NULL;          /* may have been invalidated */
10093         atomic_dec_64(&l2arc_ndev);
10094         mutex_exit(&l2arc_dev_mtx);
10095 
10096         /*
10097          * Clear all buflists and ARC references.  L2ARC device flush.
10098          */
10099         l2arc_evict(remdev, 0, B_TRUE);
10100         list_destroy(&remdev->l2ad_buflist);
10101         ASSERT(list_is_empty(&remdev->l2ad_lbptr_list));
10102         list_destroy(&remdev->l2ad_lbptr_list);
10103         mutex_destroy(&remdev->l2ad_mtx);
10104         zfs_refcount_destroy(&remdev->l2ad_alloc);
10105         zfs_refcount_destroy(&remdev->l2ad_lb_asize);
10106         zfs_refcount_destroy(&remdev->l2ad_lb_count);
10107         kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize);
10108         vmem_free(remdev, sizeof (l2arc_dev_t));
10109 }
10110 
10111 void
10112 l2arc_init(void)
10113 {
10114         l2arc_thread_exit = 0;
10115         l2arc_ndev = 0;
10116 
10117         mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
10118         cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
10119         mutex_init(&l2arc_rebuild_thr_lock, NULL, MUTEX_DEFAULT, NULL);
10120         cv_init(&l2arc_rebuild_thr_cv, NULL, CV_DEFAULT, NULL);
10121         mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
10122         mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
10123 
10124         l2arc_dev_list = &L2ARC_dev_list;
10125         l2arc_free_on_write = &L2ARC_free_on_write;
10126         list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
10127             offsetof(l2arc_dev_t, l2ad_node));
10128         list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
10129             offsetof(l2arc_data_free_t, l2df_list_node));
10130 }
10131 
10132 void
10133 l2arc_fini(void)
10134 {
10135         mutex_destroy(&l2arc_feed_thr_lock);
10136         cv_destroy(&l2arc_feed_thr_cv);
10137         mutex_destroy(&l2arc_rebuild_thr_lock);
10138         cv_destroy(&l2arc_rebuild_thr_cv);
10139         mutex_destroy(&l2arc_dev_mtx);
10140         mutex_destroy(&l2arc_free_on_write_mtx);
10141 
10142         list_destroy(l2arc_dev_list);
10143         list_destroy(l2arc_free_on_write);
10144 }
10145 
10146 void
10147 l2arc_start(void)
10148 {
10149         if (!(spa_mode_global & SPA_MODE_WRITE))
10150                 return;
10151 
10152         (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
10153             TS_RUN, defclsyspri);
10154 }
10155 
10156 void
10157 l2arc_stop(void)
10158 {
10159         if (!(spa_mode_global & SPA_MODE_WRITE))
10160                 return;
10161 
10162         mutex_enter(&l2arc_feed_thr_lock);
10163         cv_signal(&l2arc_feed_thr_cv);  /* kick thread out of startup */
10164         l2arc_thread_exit = 1;
10165         while (l2arc_thread_exit != 0)
10166                 cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
10167         mutex_exit(&l2arc_feed_thr_lock);
10168 }
10169 
10170 /*
10171  * Punches out rebuild threads for the L2ARC devices in a spa. This should
10172  * be called after pool import from the spa async thread, since starting
10173  * these threads directly from spa_import() will make them part of the
10174  * "zpool import" context and delay process exit (and thus pool import).
10175  */
10176 void
10177 l2arc_spa_rebuild_start(spa_t *spa)
10178 {
10179         ASSERT(MUTEX_HELD(&spa_namespace_lock));
10180 
10181         /*
10182          * Locate the spa's l2arc devices and kick off rebuild threads.
10183          */
10184         for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {
10185                 l2arc_dev_t *dev =
10186                     l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);
10187                 if (dev == NULL) {
10188                         /* Don't attempt a rebuild if the vdev is UNAVAIL */
10189                         continue;
10190                 }
10191                 mutex_enter(&l2arc_rebuild_thr_lock);
10192                 if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
10193                         dev->l2ad_rebuild_began = B_TRUE;
10194                         (void) thread_create(NULL, 0, l2arc_dev_rebuild_thread,
10195                             dev, 0, &p0, TS_RUN, minclsyspri);
10196                 }
10197                 mutex_exit(&l2arc_rebuild_thr_lock);
10198         }
10199 }
10200 
10201 /*
10202  * Main entry point for L2ARC rebuilding.
10203  */
10204 static __attribute__((noreturn)) void
10205 l2arc_dev_rebuild_thread(void *arg)
10206 {
10207         l2arc_dev_t *dev = arg;
10208 
10209         VERIFY(!dev->l2ad_rebuild_cancel);
10210         VERIFY(dev->l2ad_rebuild);
10211         (void) l2arc_rebuild(dev);
10212         mutex_enter(&l2arc_rebuild_thr_lock);
10213         dev->l2ad_rebuild_began = B_FALSE;
10214         dev->l2ad_rebuild = B_FALSE;
10215         mutex_exit(&l2arc_rebuild_thr_lock);
10216 
10217         thread_exit();
10218 }
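
      /*
       * Editor's sketch (not part of upstream arc.c) of the cancellation
       * handshake between l2arc_remove_vdev() and an in-progress rebuild:
       *
       *   l2arc_remove_vdev()                   l2arc_rebuild()
       *   -------------------                   ---------------
       *   mutex_enter(rebuild_thr_lock)
       *   l2ad_rebuild_cancel = B_TRUE
       *   cv_wait(rebuild_thr_cv, ...)          mutex_enter(rebuild_thr_lock)
       *     (lock dropped while waiting)        sees l2ad_rebuild_cancel
       *                                         l2ad_rebuild = B_FALSE
       *                                         cv_signal(rebuild_thr_cv)
       *                                         mutex_exit(), returns ECANCELED
       *   wakes, sees l2ad_rebuild == B_FALSE
       *   mutex_exit(rebuild_thr_lock)
       */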
10219 
10220 /*
10221  * This function implements the actual L2ARC metadata rebuild. It:
10222  * starts reading the log block chain and restores each block's contents
10223  * to memory (reconstructing arc_buf_hdr_t's).
10224  *
10225  * Operation stops under any of the following conditions:
10226  *
10227  * 1) We reach the end of the log block chain.
10228  * 2) We encounter *any* error condition (cksum errors, io errors)
10229  */
10230 static int
10231 l2arc_rebuild(l2arc_dev_t *dev)
10232 {
10233         vdev_t                  *vd = dev->l2ad_vdev;
10234         spa_t                   *spa = vd->vdev_spa;
10235         int                     err = 0;
10236         l2arc_dev_hdr_phys_t    *l2dhdr = dev->l2ad_dev_hdr;
10237         l2arc_log_blk_phys_t    *this_lb, *next_lb;
10238         zio_t                   *this_io = NULL, *next_io = NULL;
10239         l2arc_log_blkptr_t      lbps[2];
10240         l2arc_lb_ptr_buf_t      *lb_ptr_buf;
10241         boolean_t               lock_held;
10242 
10243         this_lb = vmem_zalloc(sizeof (*this_lb), KM_SLEEP);
10244         next_lb = vmem_zalloc(sizeof (*next_lb), KM_SLEEP);
10245 
10246         /*
10247          * We prevent device removal while issuing reads to the device,
10248          * then during the rebuilding phases we drop this lock again so
10249          * that a spa_unload or device remove can be initiated - this is
10250          * safe, because the spa will signal us to stop before removing
10251          * our device and wait for us to stop.
10252          */
10253         spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);
10254         lock_held = B_TRUE;
10255 
10256         /*
10257          * Retrieve the persistent L2ARC device state.
10258          * L2BLK_GET_PSIZE returns aligned size for log blocks.
10259          */
10260         dev->l2ad_evict = MAX(l2dhdr->dh_evict, dev->l2ad_start);
10261         dev->l2ad_hand = MAX(l2dhdr->dh_start_lbps[0].lbp_daddr +
10262             L2BLK_GET_PSIZE((&l2dhdr->dh_start_lbps[0])->lbp_prop),
10263             dev->l2ad_start);
10264         dev->l2ad_first = !!(l2dhdr->dh_flags & L2ARC_DEV_HDR_EVICT_FIRST);
10265 
10266         vd->vdev_trim_action_time = l2dhdr->dh_trim_action_time;
10267         vd->vdev_trim_state = l2dhdr->dh_trim_state;
10268 
10269         /*
10270          * In case the zfs module parameter l2arc_rebuild_enabled is false
10271          * we do not start the rebuild process.
10272          */
10273         if (!l2arc_rebuild_enabled)
10274                 goto out;
10275 
10276         /* Prepare the rebuild process */
10277         memcpy(lbps, l2dhdr->dh_start_lbps, sizeof (lbps));
10278 
10279         /* Start the rebuild process */
10280         for (;;) {
10281                 if (!l2arc_log_blkptr_valid(dev, &lbps[0]))
10282                         break;
10283 
10284                 if ((err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1],
10285                     this_lb, next_lb, this_io, &next_io)) != 0)
10286                         goto out;
10287 
10288                 /*
10289                  * Our memory pressure valve. If the system is running low
10290                  * on memory, rather than swamping memory with new ARC buf
10291                  * hdrs, we opt not to rebuild the L2ARC. At this point,
10292                  * however, we have already set up our L2ARC dev to chain in
10293                  * new metadata log blocks, so the user may choose to offline/
10294                  * online the L2ARC dev at a later time (or re-import the pool)
10295                  * to reconstruct it (when there's less memory pressure).
10296                  */
10297                 if (l2arc_hdr_limit_reached()) {
10298                         ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem);
10299                         cmn_err(CE_NOTE, "System running low on memory, "
10300                             "aborting L2ARC rebuild.");
10301                         err = SET_ERROR(ENOMEM);
10302                         goto out;
10303                 }
10304 
10305                 spa_config_exit(spa, SCL_L2ARC, vd);
10306                 lock_held = B_FALSE;
10307 
10308                 /*
10309                  * Now that we know this log block checks out all right, we
10310                  * can start reconstruction from it.
10311                  * L2BLK_GET_PSIZE returns aligned size for log blocks.
10312                  */
10313                 uint64_t asize = L2BLK_GET_PSIZE((&lbps[0])->lbp_prop);
10314                 l2arc_log_blk_restore(dev, this_lb, asize);
10315 
10316                 /*
10317                  * The log block was restored; include its pointer in the list of
10318                  * pointers to log blocks present in the L2ARC device.
10319                  */
10320                 lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);
10321                 lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t),
10322                     KM_SLEEP);
10323                 memcpy(lb_ptr_buf->lb_ptr, &lbps[0],
10324                     sizeof (l2arc_log_blkptr_t));
10325                 mutex_enter(&dev->l2ad_mtx);
10326                 list_insert_tail(&dev->l2ad_lbptr_list, lb_ptr_buf);
10327                 ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);
10328                 ARCSTAT_BUMP(arcstat_l2_log_blk_count);
10329                 zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);
10330                 zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);
10331                 mutex_exit(&dev->l2ad_mtx);
10332                 vdev_space_update(vd, asize, 0, 0);
10333 
10334                 /*
10335                  * Protection against loops of log blocks:
10336                  *
10337                  *                                     l2ad_hand  l2ad_evict
10338                  *                                         V          V
10339                  * l2ad_start |=======================================| l2ad_end
10340                  *             -----|||----|||---|||----|||
10341                  *                  (3)    (2)   (1)    (0)
10342                  *             ---|||---|||----|||---|||
10343                  *                (7)   (6)    (5)   (4)
10344                  *
10345                  * In this situation the pointer of log block (4) passes
10346                  * l2arc_log_blkptr_valid() but the log block should not be
10347                  * restored as it is overwritten by the payload of log block
10348                  * (0). Only log blocks (0)-(3) should be restored. We check
10349                  * whether l2ad_evict lies in between the payload starting
10350                  * offset of the next log block (lbps[1].lbp_payload_start)
10351                  * and the payload starting offset of the present log block
10352                  * (lbps[0].lbp_payload_start). If true and this isn't the
10353                  * first pass, we are looping from the beginning and we should
10354                  * stop.
10355                  */
10356                 if (l2arc_range_check_overlap(lbps[1].lbp_payload_start,
10357                     lbps[0].lbp_payload_start, dev->l2ad_evict) &&
10358                     !dev->l2ad_first)
10359                         goto out;
10360 
10361                 kpreempt(KPREEMPT_SYNC);
10362                 for (;;) {
10363                         mutex_enter(&l2arc_rebuild_thr_lock);
10364                         if (dev->l2ad_rebuild_cancel) {
10365                                 dev->l2ad_rebuild = B_FALSE;
10366                                 cv_signal(&l2arc_rebuild_thr_cv);
10367                                 mutex_exit(&l2arc_rebuild_thr_lock);
10368                                 err = SET_ERROR(ECANCELED);
10369                                 goto out;
10370                         }
10371                         mutex_exit(&l2arc_rebuild_thr_lock);
10372                         if (spa_config_tryenter(spa, SCL_L2ARC, vd,
10373                             RW_READER)) {
10374                                 lock_held = B_TRUE;
10375                                 break;
10376                         }
10377                         /*
10378                          * The L2ARC config lock is held by somebody as
10379                          * writer, possibly because they are trying to remove
10380                          * us. They likely want us to shut down, so after a
10381                          * little delay we check l2ad_rebuild_cancel and retry
10382                          * taking the lock.
10383                          */
10384                         delay(1);
10385                 }
10386 
10387                 /*
10388                  * Continue with the next log block.
10389                  */
10390                 lbps[0] = lbps[1];
10391                 lbps[1] = this_lb->lb_prev_lbp;
10392                 PTR_SWAP(this_lb, next_lb);
10393                 this_io = next_io;
10394                 next_io = NULL;
10395         }
10396 
10397         if (this_io != NULL)
10398                 l2arc_log_blk_fetch_abort(this_io);
10399 out:
10400         if (next_io != NULL)
10401                 l2arc_log_blk_fetch_abort(next_io);
10402         vmem_free(this_lb, sizeof (*this_lb));
10403         vmem_free(next_lb, sizeof (*next_lb));
10404 
10405         if (!l2arc_rebuild_enabled) {
10406                 spa_history_log_internal(spa, "L2ARC rebuild", NULL,
10407                     "disabled");
10408         } else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) > 0) {
10409                 ARCSTAT_BUMP(arcstat_l2_rebuild_success);
10410                 spa_history_log_internal(spa, "L2ARC rebuild", NULL,
10411                     "successful, restored %llu blocks",
10412                     (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
10413         } else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) == 0) {
10414                 /*
10415                  * No error but also nothing restored, meaning the lbps array
10416                  * in the device header points to invalid/non-present log
10417                  * blocks. Reset the header.
10418                  */
10419                 spa_history_log_internal(spa, "L2ARC rebuild", NULL,
10420                     "no valid log blocks");
10421                 memset(l2dhdr, 0, dev->l2ad_dev_hdr_asize);
10422                 l2arc_dev_hdr_update(dev);
10423         } else if (err == ECANCELED) {
10424                 /*
10425                  * In case the rebuild was canceled do not log to spa history
10426                  * log as the pool may be in the process of being removed.
10427                  */
10428                 zfs_dbgmsg("L2ARC rebuild aborted, restored %llu blocks",
10429                     (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
10430         } else if (err != 0) {
10431                 spa_history_log_internal(spa, "L2ARC rebuild", NULL,
10432                     "aborted, restored %llu blocks",
10433                     (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
10434         }
10435 
10436         if (lock_held)
10437                 spa_config_exit(spa, SCL_L2ARC, vd);
10438 
10439         return (err);
10440 }
10441 
10442 /*
10443  * Attempts to read the device header on the provided L2ARC device and writes
10444  * it to dev->l2ad_dev_hdr. On success, this function returns 0, otherwise
10445  * the appropriate error code is returned.
10446  */
10447 static int
10448 l2arc_dev_hdr_read(l2arc_dev_t *dev)
10449 {
10450         int                     err;
10451         uint64_t                guid;
10452         l2arc_dev_hdr_phys_t    *l2dhdr = dev->l2ad_dev_hdr;
10453         const uint64_t          l2dhdr_asize = dev->l2ad_dev_hdr_asize;
10454         abd_t                   *abd;
10455 
10456         guid = spa_guid(dev->l2ad_vdev->vdev_spa);
10457 
10458         abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);
10459 
10460         err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev,
10461             VDEV_LABEL_START_SIZE, l2dhdr_asize, abd,
10462             ZIO_CHECKSUM_LABEL, NULL, NULL, ZIO_PRIORITY_SYNC_READ,
10463             ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
10464             ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
10465             ZIO_FLAG_SPECULATIVE, B_FALSE));
10466 
10467         abd_free(abd);
10468 
10469         if (err != 0) {
10470                 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_dh_errors);
10471                 zfs_dbgmsg("L2ARC IO error (%d) while reading device header, "
10472                     "vdev guid: %llu", err,
10473                     (u_longlong_t)dev->l2ad_vdev->vdev_guid);
10474                 return (err);
10475         }
10476 
10477         if (l2dhdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC))
10478                 byteswap_uint64_array(l2dhdr, sizeof (*l2dhdr));
10479 
10480         if (l2dhdr->dh_magic != L2ARC_DEV_HDR_MAGIC ||
10481             l2dhdr->dh_spa_guid != guid ||
10482             l2dhdr->dh_vdev_guid != dev->l2ad_vdev->vdev_guid ||
10483             l2dhdr->dh_version != L2ARC_PERSISTENT_VERSION ||
10484             l2dhdr->dh_log_entries != dev->l2ad_log_entries ||
10485             l2dhdr->dh_end != dev->l2ad_end ||
10486             !l2arc_range_check_overlap(dev->l2ad_start, dev->l2ad_end,
10487             l2dhdr->dh_evict) ||
10488             (l2dhdr->dh_trim_state != VDEV_TRIM_COMPLETE &&
10489             l2arc_trim_ahead > 0)) {
10490                 /*
10491                  * Attempt to rebuild a device containing no actual dev hdr
10492                  * or containing a header from some other pool or from another
10493                  * version of persistent L2ARC.
10494                  */
10495                 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported);
10496                 return (SET_ERROR(ENOTSUP));
10497         }
10498 
10499         return (0);
10500 }
10501 
10502 /*
10503  * Reads L2ARC log blocks from storage and validates their contents.
10504  *
10505  * This function implements a simple fetcher to make sure that while
10506  * we're processing one buffer the L2ARC is already fetching the next
10507  * one in the chain.
10508  *
10509  * The arguments this_lbp and next_lbp point to the current and next log block
10510  * address in the block chain. Similarly, this_lb and next_lb hold the
10511  * l2arc_log_blk_phys_t's of the current and next L2ARC blk.
10512  *
10513  * The `this_io' and `next_io' arguments are used for block fetching.
10514  * When issuing the first blk IO during rebuild, you should pass NULL for
10515  * `this_io'. This function will then issue a sync IO to read the block and
10516  * also issue an async IO to fetch the next block in the block chain. The
10517  * fetched IO is returned in `next_io'. On subsequent calls to this
10518  * function, pass the value returned in `next_io' from the previous call
10519  * as `this_io' and a fresh `next_io' pointer to hold the next fetch IO.
10520  * Prior to the call, you should initialize your `next_io' pointer to be
10521  * NULL. If no fetch IO was issued, the pointer is left set at NULL.
10522  *
10523  * On success, this function returns 0, otherwise it returns an appropriate
10524  * error code. On error the fetching IO is aborted and cleared before
10525  * returning from this function. Therefore, if we return `success', the
10526  * caller can assume that we have taken care of cleanup of fetch IOs.
10527  */
10528 static int
10529 l2arc_log_blk_read(l2arc_dev_t *dev,
10530     const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp,
10531     l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
10532     zio_t *this_io, zio_t **next_io)
10533 {
10534         int             err = 0;
10535         zio_cksum_t     cksum;
10536         abd_t           *abd = NULL;
10537         uint64_t        asize;
10538 
10539         ASSERT(this_lbp != NULL && next_lbp != NULL);
10540         ASSERT(this_lb != NULL && next_lb != NULL);
10541         ASSERT(next_io != NULL && *next_io == NULL);
10542         ASSERT(l2arc_log_blkptr_valid(dev, this_lbp));
10543 
10544         /*
10545          * Check to see if we have issued the IO for this log block in a
10546          * previous run. If not, this is the first call, so issue it now.
10547          */
10548         if (this_io == NULL) {
10549                 this_io = l2arc_log_blk_fetch(dev->l2ad_vdev, this_lbp,
10550                     this_lb);
10551         }
10552 
10553         /*
10554          * Peek to see if we can start issuing the next IO immediately.
10555          */
10556         if (l2arc_log_blkptr_valid(dev, next_lbp)) {
10557                 /*
10558                  * Start issuing IO for the next log block early - this
10559                  * should help keep the L2ARC device busy while we
10560                  * decompress and restore this log block.
10561                  */
10562                 *next_io = l2arc_log_blk_fetch(dev->l2ad_vdev, next_lbp,
10563                     next_lb);
10564         }
10565 
10566         /* Wait for the IO to read this log block to complete */
10567         if ((err = zio_wait(this_io)) != 0) {
10568                 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
10569                 zfs_dbgmsg("L2ARC IO error (%d) while reading log block, "
10570                     "offset: %llu, vdev guid: %llu", err,
10571                     (u_longlong_t)this_lbp->lbp_daddr,
10572                     (u_longlong_t)dev->l2ad_vdev->vdev_guid);
10573                 goto cleanup;
10574         }
10575 
10576         /*
10577          * Make sure the buffer checks out.
10578          * L2BLK_GET_PSIZE returns aligned size for log blocks.
10579          */
10580         asize = L2BLK_GET_PSIZE((this_lbp)->lbp_prop);
10581         fletcher_4_native(this_lb, asize, NULL, &cksum);
10582         if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) {
10583                 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_lb_errors);
10584                 zfs_dbgmsg("L2ARC log block cksum failed, offset: %llu, "
10585                     "vdev guid: %llu, l2ad_hand: %llu, l2ad_evict: %llu",
10586                     (u_longlong_t)this_lbp->lbp_daddr,
10587                     (u_longlong_t)dev->l2ad_vdev->vdev_guid,
10588                     (u_longlong_t)dev->l2ad_hand,
10589                     (u_longlong_t)dev->l2ad_evict);
10590                 err = SET_ERROR(ECKSUM);
10591                 goto cleanup;
10592         }
10593 
10594         /* Now we can take our time decoding this buffer */
10595         switch (L2BLK_GET_COMPRESS((this_lbp)->lbp_prop)) {
10596         case ZIO_COMPRESS_OFF:
10597                 break;
10598         case ZIO_COMPRESS_LZ4:
10599                 abd = abd_alloc_for_io(asize, B_TRUE);
10600                 abd_copy_from_buf_off(abd, this_lb, 0, asize);
10601                 if ((err = zio_decompress_data(
10602                     L2BLK_GET_COMPRESS((this_lbp)->lbp_prop),
10603                     abd, this_lb, asize, sizeof (*this_lb), NULL)) != 0) {
10604                         err = SET_ERROR(EINVAL);
10605                         goto cleanup;
10606                 }
10607                 break;
10608         default:
10609                 err = SET_ERROR(EINVAL);
10610                 goto cleanup;
10611         }
10612         if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC))
10613                 byteswap_uint64_array(this_lb, sizeof (*this_lb));
10614         if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) {
10615                 err = SET_ERROR(EINVAL);
10616                 goto cleanup;
10617         }
10618 cleanup:
10619         /* Abort an in-flight fetch I/O in case of error */
10620         if (err != 0 && *next_io != NULL) {
10621                 l2arc_log_blk_fetch_abort(*next_io);
10622                 *next_io = NULL;
10623         }
10624         if (abd != NULL)
10625                 abd_free(abd);
10626         return (err);
10627 }
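
      /*
       * Editor's sketch (not part of upstream arc.c): the calling pattern the
       * this_io/next_io protocol above expects, distilled from the rebuild
       * loop in l2arc_rebuild():
       *
       *      zio_t *this_io = NULL, *next_io = NULL;
       *      while (l2arc_log_blkptr_valid(dev, &lbps[0])) {
       *              if ((err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1],
       *                  this_lb, next_lb, this_io, &next_io)) != 0)
       *                      return (err);   (next_io already aborted inside)
       *              ... restore this_lb ...
       *              lbps[0] = lbps[1];
       *              lbps[1] = this_lb->lb_prev_lbp;
       *              PTR_SWAP(this_lb, next_lb);
       *              this_io = next_io;      (hand the prefetched zio back in)
       *              next_io = NULL;
       *      }
       *      if (this_io != NULL)            (unconsumed prefetch, if any)
       *              l2arc_log_blk_fetch_abort(this_io);
       */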
10628 
10629 /*
10630  * Restores the payload of a log block to ARC. This creates empty ARC hdr
10631  * entries which only contain an l2arc hdr, essentially restoring the
10632  * buffers to their L2ARC evicted state. This function also updates space
10633  * usage on the L2ARC vdev to make sure it tracks restored buffers.
10634  */
10635 static void
10636 l2arc_log_blk_restore(l2arc_dev_t *dev, const l2arc_log_blk_phys_t *lb,
10637     uint64_t lb_asize)
10638 {
10639         uint64_t        size = 0, asize = 0;
10640         uint64_t        log_entries = dev->l2ad_log_entries;
10641 
10642         /*
10643          * Usually arc_adapt() is called only for data, not headers, but
10644          * since we may allocate a significant amount of memory here, let ARC
10645          * grow its arc_c.
10646          */
10647         arc_adapt(log_entries * HDR_L2ONLY_SIZE, arc_l2c_only);
10648 
10649         for (int i = log_entries - 1; i >= 0; i--) {
10650                 /*
10651                  * Restore goes in the reverse temporal direction to preserve
10652                  * correct temporal ordering of buffers in the l2ad_buflist.
10653                  * l2arc_hdr_restore also does a list_insert_tail instead of
10654                  * list_insert_head on the l2ad_buflist:
10655                  *
10656                  *              LIST    l2ad_buflist            LIST
10657                  *              HEAD  <------ (time) ------     TAIL
10658                  * direction    +-----+-----+-----+-----+-----+    direction
10659                  * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild
10660                  * fill         +-----+-----+-----+-----+-----+
10661                  *              ^                               ^
10662                  *              |                               |
10663                  *              |                               |
10664                  *      l2arc_feed_thread               l2arc_rebuild
10665                  *      will place new bufs here        restores bufs here
10666                  *
10667                  * During l2arc_rebuild() the device is not used by
10668                  * l2arc_feed_thread() as dev->l2ad_rebuild is set to true.
10669                  */
10670                 size += L2BLK_GET_LSIZE((&lb->lb_entries[i])->le_prop);
10671                 asize += vdev_psize_to_asize(dev->l2ad_vdev,
10672                     L2BLK_GET_PSIZE((&lb->lb_entries[i])->le_prop));
10673                 l2arc_hdr_restore(&lb->lb_entries[i], dev);
10674         }
10675 
10676         /*
10677          * Record rebuild stats:
10678          *      size            Logical size of restored buffers in the L2ARC
10679          *      asize           Aligned size of restored buffers in the L2ARC
10680          */
10681         ARCSTAT_INCR(arcstat_l2_rebuild_size, size);
10682         ARCSTAT_INCR(arcstat_l2_rebuild_asize, asize);
10683         ARCSTAT_INCR(arcstat_l2_rebuild_bufs, log_entries);
10684         ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, lb_asize);
10685         ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, asize / lb_asize);
10686         ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks);
10687 }
10688 
10689 /*
10690  * Restores a single ARC buf hdr from a log entry. The ARC buffer is put
10691  * into a state indicating that it has been evicted to L2ARC.
10692  */
10693 static void
10694 l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev)
10695 {
10696         arc_buf_hdr_t           *hdr, *exists;
10697         kmutex_t                *hash_lock;
10698         arc_buf_contents_t      type = L2BLK_GET_TYPE((le)->le_prop);
10699         uint64_t                asize;
10700 
10701         /*
10702          * Do all the allocation before grabbing any locks, this lets us
10703          * sleep if memory is full and we don't have to deal with failed
10704          * allocations.
10705          */
10706         hdr = arc_buf_alloc_l2only(L2BLK_GET_LSIZE((le)->le_prop), type,
10707             dev, le->le_dva, le->le_daddr,
10708             L2BLK_GET_PSIZE((le)->le_prop), le->le_birth,
10709             L2BLK_GET_COMPRESS((le)->le_prop), le->le_complevel,
10710             L2BLK_GET_PROTECTED((le)->le_prop),
10711             L2BLK_GET_PREFETCH((le)->le_prop),
10712             L2BLK_GET_STATE((le)->le_prop));
10713         asize = vdev_psize_to_asize(dev->l2ad_vdev,
10714             L2BLK_GET_PSIZE((le)->le_prop));
10715 
10716         /*
10717          * vdev_space_update() has to be called before arc_hdr_destroy() to
10718          * avoid underflow since the latter also calls vdev_space_update().
10719          */
10720         l2arc_hdr_arcstats_increment(hdr);
10721         vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
10722 
10723         mutex_enter(&dev->l2ad_mtx);
10724         list_insert_tail(&dev->l2ad_buflist, hdr);
10725         (void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
10726         mutex_exit(&dev->l2ad_mtx);
10727 
10728         exists = buf_hash_insert(hdr, &hash_lock);
10729         if (exists) {
10730                 /* Buffer was already cached, no need to restore it. */
10731                 arc_hdr_destroy(hdr);
10732                 /*
10733                  * If the buffer is already cached, check whether it has
10734                  * L2ARC metadata. If not, add the metadata and update the flag.
10735                  * This is important in case of onlining a cache device, since
10736                  * we previously evicted all L2ARC metadata from ARC.
10737                  */
10738                 if (!HDR_HAS_L2HDR(exists)) {
10739                         arc_hdr_set_flags(exists, ARC_FLAG_HAS_L2HDR);
10740                         exists->b_l2hdr.b_dev = dev;
10741                         exists->b_l2hdr.b_daddr = le->le_daddr;
10742                         exists->b_l2hdr.b_arcs_state =
10743                             L2BLK_GET_STATE((le)->le_prop);
10744                         mutex_enter(&dev->l2ad_mtx);
10745                         list_insert_tail(&dev->l2ad_buflist, exists);
10746                         (void) zfs_refcount_add_many(&dev->l2ad_alloc,
10747                             arc_hdr_size(exists), exists);
10748                         mutex_exit(&dev->l2ad_mtx);
10749                         l2arc_hdr_arcstats_increment(exists);
10750                         vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
10751                 }
10752                 ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached);
10753         }
10754 
10755         mutex_exit(hash_lock);
10756 }
10757 
10758 /*
10759  * Starts an asynchronous read IO to read a log block. This is used in log
10760  * block reconstruction to start reading the next block before we are done
10761  * decoding and reconstructing the current block, to keep the l2arc device
10762  * nice and hot with read IO to process.
10763  * The returned zio will contain newly allocated memory buffers for the IO
10764  * data which should then be freed by the caller once the zio is no longer
10765  * needed (i.e. due to it having completed). If you wish to abort this
10766  * zio, you should do so using l2arc_log_blk_fetch_abort, which takes
10767  * care of disposing of the allocated buffers correctly.
10768  */
10769 static zio_t *
10770 l2arc_log_blk_fetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp,
10771     l2arc_log_blk_phys_t *lb)
10772 {
10773         uint32_t                asize;
10774         zio_t                   *pio;
10775         l2arc_read_callback_t   *cb;
10776 
10777         /* L2BLK_GET_PSIZE returns aligned size for log blocks */
10778         asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
10779         ASSERT(asize <= sizeof (l2arc_log_blk_phys_t));
10780 
10781         cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP);
10782         cb->l2rcb_abd = abd_get_from_buf(lb, asize);
10783         pio = zio_root(vd->vdev_spa, l2arc_blk_fetch_done, cb,
10784             ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
10785             ZIO_FLAG_DONT_RETRY);
10786         (void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, asize,
10787             cb->l2rcb_abd, ZIO_CHECKSUM_OFF, NULL, NULL,
10788             ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
10789             ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
10790 
10791         return (pio);
10792 }
10793 
10794 /*
10795  * Aborts a zio returned from l2arc_log_blk_fetch and frees the data
10796  * buffers allocated for it.
10797  */
10798 static void
10799 l2arc_log_blk_fetch_abort(zio_t *zio)
10800 {
10801         (void) zio_wait(zio);
10802 }
10803 
10804 /*
10805  * Creates a zio to update the device header on an l2arc device.
10806  */
10807 void
10808 l2arc_dev_hdr_update(l2arc_dev_t *dev)
10809 {
10810         l2arc_dev_hdr_phys_t    *l2dhdr = dev->l2ad_dev_hdr;
10811         const uint64_t          l2dhdr_asize = dev->l2ad_dev_hdr_asize;
10812         abd_t                   *abd;
10813         int                     err;
10814 
10815         VERIFY(spa_config_held(dev->l2ad_spa, SCL_STATE_ALL, RW_READER));
10816 
10817         l2dhdr->dh_magic = L2ARC_DEV_HDR_MAGIC;
10818         l2dhdr->dh_version = L2ARC_PERSISTENT_VERSION;
10819         l2dhdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa);
10820         l2dhdr->dh_vdev_guid = dev->l2ad_vdev->vdev_guid;
10821         l2dhdr->dh_log_entries = dev->l2ad_log_entries;
10822         l2dhdr->dh_evict = dev->l2ad_evict;
10823         l2dhdr->dh_start = dev->l2ad_start;
10824         l2dhdr->dh_end = dev->l2ad_end;
10825         l2dhdr->dh_lb_asize = zfs_refcount_count(&dev->l2ad_lb_asize);
10826         l2dhdr->dh_lb_count = zfs_refcount_count(&dev->l2ad_lb_count);
10827         l2dhdr->dh_flags = 0;
10828         l2dhdr->dh_trim_action_time = dev->l2ad_vdev->vdev_trim_action_time;
10829         l2dhdr->dh_trim_state = dev->l2ad_vdev->vdev_trim_state;
10830         if (dev->l2ad_first)
10831                 l2dhdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST;
10832 
10833         abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);
10834 
10835         err = zio_wait(zio_write_phys(NULL, dev->l2ad_vdev,
10836             VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL,
10837             NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE));
10838 
10839         abd_free(abd);
10840 
10841         if (err != 0) {
10842                 zfs_dbgmsg("L2ARC IO error (%d) while writing device header, "
10843                     "vdev guid: %llu", err,
10844                     (u_longlong_t)dev->l2ad_vdev->vdev_guid);
10845         }
10846 }
10847 
10848 /*
10849  * Commits a log block to the L2ARC device. This routine is invoked from
10850  * l2arc_write_buffers when the log block fills up.
10851  * This function allocates some memory to temporarily hold the serialized
10852  * buffer to be written. This is then released in l2arc_write_done.
10853  */
10854 static void
10855 l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio, l2arc_write_callback_t *cb)
10856 {
10857         l2arc_log_blk_phys_t    *lb = &dev->l2ad_log_blk;
10858         l2arc_dev_hdr_phys_t    *l2dhdr = dev->l2ad_dev_hdr;
10859         uint64_t                psize, asize;
10860         zio_t                   *wzio;
10861         l2arc_lb_abd_buf_t      *abd_buf;
10862         uint8_t                 *tmpbuf;
10863         l2arc_lb_ptr_buf_t      *lb_ptr_buf;
10864 
10865         VERIFY3S(dev->l2ad_log_ent_idx, ==, dev->l2ad_log_entries);
10866 
10867         tmpbuf = zio_buf_alloc(sizeof (*lb));
10868         abd_buf = zio_buf_alloc(sizeof (*abd_buf));
10869         abd_buf->abd = abd_get_from_buf(lb, sizeof (*lb));
10870         lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);
10871         lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t), KM_SLEEP);
10872 
10873         /* link the buffer into the block chain */
10874         lb->lb_prev_lbp = l2dhdr->dh_start_lbps[1];
10875         lb->lb_magic = L2ARC_LOG_BLK_MAGIC;
10876 
10877         /*
10878          * l2arc_log_blk_commit() may be called multiple times during a single
10879          * l2arc_write_buffers() call. Save the allocated abd buffers in a list
10880          * so we can free them in l2arc_write_done() later on.
10881          */
10882         list_insert_tail(&cb->l2wcb_abd_list, abd_buf);
10883 
10884         /* try to compress the buffer */
10885         psize = zio_compress_data(ZIO_COMPRESS_LZ4,
10886             abd_buf->abd, tmpbuf, sizeof (*lb), 0);
10887 
10888         /* a log block is never entirely zero */
10889         ASSERT(psize != 0);
10890         asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
10891         ASSERT(asize <= sizeof (*lb));
10892 
10893         /*
10894          * Update the start log block pointer in the device header to point
10895          * to the log block we're about to write.
10896          */
10897         l2dhdr->dh_start_lbps[1] = l2dhdr->dh_start_lbps[0];
10898         l2dhdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand;
10899         l2dhdr->dh_start_lbps[0].lbp_payload_asize =
10900             dev->l2ad_log_blk_payload_asize;
10901         l2dhdr->dh_start_lbps[0].lbp_payload_start =
10902             dev->l2ad_log_blk_payload_start;
10903         L2BLK_SET_LSIZE(
10904             (&l2dhdr->dh_start_lbps[0])->lbp_prop, sizeof (*lb));
10905         L2BLK_SET_PSIZE(
10906             (&l2dhdr->dh_start_lbps[0])->lbp_prop, asize);
10907         L2BLK_SET_CHECKSUM(
10908             (&l2dhdr->dh_start_lbps[0])->lbp_prop,
10909             ZIO_CHECKSUM_FLETCHER_4);
10910         if (asize < sizeof (*lb)) {
10911                 /* compression succeeded */
10912                 memset(tmpbuf + psize, 0, asize - psize);
10913                 L2BLK_SET_COMPRESS(
10914                     (&l2dhdr->dh_start_lbps[0])->lbp_prop,
10915                     ZIO_COMPRESS_LZ4);
10916         } else {
10917                 /* compression failed */
10918                 memcpy(tmpbuf, lb, sizeof (*lb));
10919                 L2BLK_SET_COMPRESS(
10920                     (&l2dhdr->dh_start_lbps[0])->lbp_prop,
10921                     ZIO_COMPRESS_OFF);
10922         }
10923 
10924         /* checksum what we're about to write */
10925         fletcher_4_native(tmpbuf, asize, NULL,
10926             &l2dhdr->dh_start_lbps[0].lbp_cksum);
10927 
10928         abd_free(abd_buf->abd);
10929 
10930         /* perform the write itself */
10931         abd_buf->abd = abd_get_from_buf(tmpbuf, sizeof (*lb));
10932         abd_take_ownership_of_buf(abd_buf->abd, B_TRUE);
10933         wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand,
10934             asize, abd_buf->abd, ZIO_CHECKSUM_OFF, NULL, NULL,
10935             ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
10936         DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
10937         (void) zio_nowait(wzio);
10938 
10939         dev->l2ad_hand += asize;
10940         /*
10941          * Include the committed log block's pointer in the list of pointers
10942          * to log blocks present in the L2ARC device.
10943          */
10944         memcpy(lb_ptr_buf->lb_ptr, &l2dhdr->dh_start_lbps[0],
10945             sizeof (l2arc_log_blkptr_t));
10946         mutex_enter(&dev->l2ad_mtx);
10947         list_insert_head(&dev->l2ad_lbptr_list, lb_ptr_buf);
10948         ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);
10949         ARCSTAT_BUMP(arcstat_l2_log_blk_count);
10950         zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);
10951         zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);
10952         mutex_exit(&dev->l2ad_mtx);
10953         vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
10954 
10955         /* bump the kstats */
10956         ARCSTAT_INCR(arcstat_l2_write_bytes, asize);
10957         ARCSTAT_BUMP(arcstat_l2_log_blk_writes);
10958         ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, asize);
10959         ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio,
10960             dev->l2ad_log_blk_payload_asize / asize);
10961 
10962         /* start a new log block */
10963         dev->l2ad_log_ent_idx = 0;
10964         dev->l2ad_log_blk_payload_asize = 0;
10965         dev->l2ad_log_blk_payload_start = 0;
10966 }
10967 
10968 /*
10969  * Validates an L2ARC log block address to make sure that it can be read
10970  * from the provided L2ARC device.
10971  */
10972 boolean_t
10973 l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp)
10974 {
10975         /* L2BLK_GET_PSIZE returns aligned size for log blocks */
10976         uint64_t asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
10977         uint64_t end = lbp->lbp_daddr + asize - 1;
10978         uint64_t start = lbp->lbp_payload_start;
10979         boolean_t evicted = B_FALSE;
10980 
10981         /*
10982          * A log block is valid if all of the following conditions are true:
10983          * - it fits entirely (including its payload) between l2ad_start and
10984          *   l2ad_end
10985          * - it has a valid size
10986          * - neither the log block itself nor part of its payload was evicted
10987          *   by l2arc_evict():
10988          *
10989          *              l2ad_hand          l2ad_evict
10990          *              |                        |      lbp_daddr
10991          *              |     start              |      |  end
10992          *              |     |                  |      |  |
10993          *              V     V                  V      V  V
10994          *   l2ad_start ============================================ l2ad_end
10995          *                    --------------------------||||
10996          *                              ^                ^
10997          *                              |               log block
10998          *                              payload
10999          */
11000 
11001         evicted =
11002             l2arc_range_check_overlap(start, end, dev->l2ad_hand) ||
11003             l2arc_range_check_overlap(start, end, dev->l2ad_evict) ||
11004             l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, start) ||
11005             l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, end);
11006 
11007         return (start >= dev->l2ad_start && end <= dev->l2ad_end &&
11008             asize > 0 && asize <= sizeof (l2arc_log_blk_phys_t) &&
11009             (!evicted || dev->l2ad_first));
11010 }
11011 
11012 /*
11013  * Inserts ARC buffer header `hdr' into the current L2ARC log block on
11014  * the device. The buffer being inserted must be present in L2ARC.
11015  * Returns B_TRUE if the L2ARC log block is full and needs to be committed
11016  * to L2ARC, or B_FALSE if it still has room for more ARC buffers.
11017  */
11018 static boolean_t
11019 l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *hdr)
11020 {
11021         l2arc_log_blk_phys_t    *lb = &dev->l2ad_log_blk;
11022         l2arc_log_ent_phys_t    *le;
11023 
11024         if (dev->l2ad_log_entries == 0)
11025                 return (B_FALSE);
11026 
11027         int index = dev->l2ad_log_ent_idx++;
11028 
11029         ASSERT3S(index, <, dev->l2ad_log_entries);
11030         ASSERT(HDR_HAS_L2HDR(hdr));
11031 
11032         le = &lb->lb_entries[index];
11033         memset(le, 0, sizeof (*le));
11034         le->le_dva = hdr->b_dva;
11035         le->le_birth = hdr->b_birth;
11036         le->le_daddr = hdr->b_l2hdr.b_daddr;
11037         if (index == 0)
11038                 dev->l2ad_log_blk_payload_start = le->le_daddr;
11039         L2BLK_SET_LSIZE((le)->le_prop, HDR_GET_LSIZE(hdr));
11040         L2BLK_SET_PSIZE((le)->le_prop, HDR_GET_PSIZE(hdr));
11041         L2BLK_SET_COMPRESS((le)->le_prop, HDR_GET_COMPRESS(hdr));
11042         le->le_complevel = hdr->b_complevel;
11043         L2BLK_SET_TYPE((le)->le_prop, hdr->b_type);
11044         L2BLK_SET_PROTECTED((le)->le_prop, !!(HDR_PROTECTED(hdr)));
11045         L2BLK_SET_PREFETCH((le)->le_prop, !!(HDR_PREFETCH(hdr)));
11046         L2BLK_SET_STATE((le)->le_prop, hdr->b_l1hdr.b_state->arcs_state);
11047 
11048         dev->l2ad_log_blk_payload_asize += vdev_psize_to_asize(dev->l2ad_vdev,
11049             HDR_GET_PSIZE(hdr));
11050 
11051         return (dev->l2ad_log_ent_idx == dev->l2ad_log_entries);
11052 }
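
l2arc_log_blk_insert() appends one entry per cached buffer and reports, through its return value, when lb_entries is full; the caller, l2arc_write_buffers() earlier in this file, then commits the block with l2arc_log_blk_commit(), after which the entry index resets for the next block. The toy sketch below mirrors only that fill-then-commit contract; the toy_* types, sizes, and printouts are hypothetical stand-ins, not the real structures.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins; the real types live in arc.c and the l2arc headers. */
#define TOY_LOG_ENTRIES 4

typedef struct {
        int ent_idx;
        int entries[TOY_LOG_ENTRIES];
} toy_log_blk_t;

/* Mirrors the contract of l2arc_log_blk_insert(): returns true when full. */
static bool
toy_log_blk_insert(toy_log_blk_t *lb, int hdr_id)
{
        lb->entries[lb->ent_idx++] = hdr_id;
        return (lb->ent_idx == TOY_LOG_ENTRIES);
}

static void
toy_log_blk_commit(toy_log_blk_t *lb)
{
        printf("committing log block with %d entries\n", lb->ent_idx);
        lb->ent_idx = 0;        /* start a new log block */
}

int
main(void)
{
        toy_log_blk_t lb = { 0 };

        /* A write pass appends headers and commits whenever a block fills. */
        for (int hdr_id = 0; hdr_id < 10; hdr_id++) {
                if (toy_log_blk_insert(&lb, hdr_id))
                        toy_log_blk_commit(&lb);
        }
        return (0);
}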
11053 
11054 /*
11055  * Checks whether a given L2ARC device address sits in a time-sequential
11056  * range. The trick here is that the L2ARC is a rotary buffer, so we can't
11057  * just do a range comparison; we need to handle the situation in which the
11058  * range wraps around the end of the L2ARC device. Arguments:
11059  *      bottom -- Lower end of the range to check (written to earlier).
11060  *      top    -- Upper end of the range to check (written to later).
11061  *      check  -- The address for which we want to determine if it sits in
11062  *                between the top and bottom.
11063  *
11064  * The 3-way conditional below represents the following cases:
11065  *
11066  *      bottom < top : Sequentially ordered case:
11067  *        <check>--------+-------------------+
11068  *                       |  (overlap here?)  |
11069  *       L2ARC dev       V                   V
11070  *       |---------------<bottom>============<top>--------------|
11071  *
11072  *      bottom > top: Looped-around case:
11073  *                            <check>--------+------------------+
11074  *                                           |  (overlap here?) |
11075  *       L2ARC dev                           V                  V
11076  *       |===============<top>---------------<bottom>===========|
11077  *       ^               ^
11078  *       |  (or here?)   |
11079  *       +---------------+---------<check>
11080  *
11081  *      top == bottom : Just a single address comparison.
11082  */
11083 boolean_t
11084 l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
11085 {
11086         if (bottom < top)
11087                 return (bottom <= check && check <= top);
11088         else if (bottom > top)
11089                 return (check <= top || bottom <= check);
11090         else
11091                 return (check == top);
11092 }
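
Because l2arc_range_check_overlap() is pure integer arithmetic, the three cases in the comment above (sequential, wrapped, and equal endpoints) are easy to exercise in isolation; l2arc_log_blkptr_valid() relies on exactly this behavior when it tests a log block against l2ad_hand and l2ad_evict. The harness below copies the logic verbatim and asserts a couple of examples per case; it is a standalone sketch for experimentation, not part of the module.

#include <assert.h>
#include <stdint.h>

typedef int boolean_t;
#define B_TRUE  1
#define B_FALSE 0

/* Verbatim copy of the overlap check above, for standalone testing. */
static boolean_t
range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
{
        if (bottom < top)
                return (bottom <= check && check <= top);
        else if (bottom > top)
                return (check <= top || bottom <= check);
        else
                return (check == top);
}

int
main(void)
{
        /* Sequential case: bottom < top, check must fall inside [bottom, top]. */
        assert(range_check_overlap(100, 200, 150) == B_TRUE);
        assert(range_check_overlap(100, 200, 250) == B_FALSE);

        /* Wrapped case: bottom > top, the range wraps past the device end. */
        assert(range_check_overlap(200, 100, 50) == B_TRUE);   /* below top */
        assert(range_check_overlap(200, 100, 250) == B_TRUE);  /* at/after bottom */
        assert(range_check_overlap(200, 100, 150) == B_FALSE); /* in the gap */

        /* Degenerate case: bottom == top, only an exact match overlaps. */
        assert(range_check_overlap(100, 100, 100) == B_TRUE);
        assert(range_check_overlap(100, 100, 101) == B_FALSE);

        return (0);
}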
11093 
11094 EXPORT_SYMBOL(arc_buf_size);
11095 EXPORT_SYMBOL(arc_write);
11096 EXPORT_SYMBOL(arc_read);
11097 EXPORT_SYMBOL(arc_buf_info);
11098 EXPORT_SYMBOL(arc_getbuf_func);
11099 EXPORT_SYMBOL(arc_add_prune_callback);
11100 EXPORT_SYMBOL(arc_remove_prune_callback);
11101 
11102 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min, param_set_arc_min,
11103         spl_param_get_u64, ZMOD_RW, "Minimum ARC size in bytes");
11104 
11105 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, max, param_set_arc_max,
11106         spl_param_get_u64, ZMOD_RW, "Maximum ARC size in bytes");
11107 
11108 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, meta_limit, param_set_arc_u64,
11109         spl_param_get_u64, ZMOD_RW, "Metadata limit for ARC size in bytes");
11110 
11111 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, meta_limit_percent,
11112     param_set_arc_int, param_get_uint, ZMOD_RW,
11113         "Percent of ARC size for ARC meta limit");
11114 
11115 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, meta_min, param_set_arc_u64,
11116         spl_param_get_u64, ZMOD_RW, "Minimum ARC metadata size in bytes");
11117 
11118 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_prune, INT, ZMOD_RW,
11119         "Meta objects to scan for prune");
11120 
11121 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_adjust_restarts, UINT, ZMOD_RW,
11122         "Limit number of restarts in arc_evict_meta");
11123 
11124 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_strategy, UINT, ZMOD_RW,
11125         "Meta reclaim strategy");
11126 
11127 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, grow_retry, param_set_arc_int,
11128         param_get_uint, ZMOD_RW, "Seconds before growing ARC size");
11129 
11130 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, p_dampener_disable, INT, ZMOD_RW,
11131         "Disable arc_p adapt dampener");
11132 
11133 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, shrink_shift, param_set_arc_int,
11134         param_get_uint, ZMOD_RW, "log2(fraction of ARC to reclaim)");
11135 
11136 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, pc_percent, UINT, ZMOD_RW,
11137         "Percent of pagecache to reclaim ARC to");
11138 
11139 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, p_min_shift, param_set_arc_int,
11140         param_get_uint, ZMOD_RW, "arc_c shift to calc min/max arc_p");
11141 
11142 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, average_blocksize, UINT, ZMOD_RD,
11143         "Target average block size");
11144 
11145 ZFS_MODULE_PARAM(zfs, zfs_, compressed_arc_enabled, INT, ZMOD_RW,
11146         "Enable or disable compressed ARC buffers");
11147 
11148 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prefetch_ms, param_set_arc_int,
11149         param_get_uint, ZMOD_RW, "Min life of prefetch block in ms");
11150 
11151 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prescient_prefetch_ms,
11152     param_set_arc_int, param_get_uint, ZMOD_RW,
11153         "Min life of prescient prefetched block in ms");
11154 
11155 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_max, U64, ZMOD_RW,
11156         "Max write bytes per interval");
11157 
11158 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_boost, U64, ZMOD_RW,
11159         "Extra write bytes during device warmup");
11160 
11161 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom, U64, ZMOD_RW,
11162         "Number of max device writes to precache");
11163 
11164 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom_boost, U64, ZMOD_RW,
11165         "Compressed l2arc_headroom multiplier");
11166 
11167 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, trim_ahead, U64, ZMOD_RW,
11168         "TRIM ahead L2ARC write size multiplier");
11169 
11170 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_secs, U64, ZMOD_RW,
11171         "Seconds between L2ARC writing");
11172 
11173 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_min_ms, U64, ZMOD_RW,
11174         "Min feed interval in milliseconds");
11175 
11176 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, noprefetch, INT, ZMOD_RW,
11177         "Skip caching prefetched buffers");
11178 
11179 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_again, INT, ZMOD_RW,
11180         "Turbo L2ARC warmup");
11181 
11182 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, norw, INT, ZMOD_RW,
11183         "No reads during writes");
11184 
11185 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, meta_percent, UINT, ZMOD_RW,
11186         "Percent of ARC size allowed for L2ARC-only headers");
11187 
11188 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_enabled, INT, ZMOD_RW,
11189         "Rebuild the L2ARC when importing a pool");
11190 
11191 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_blocks_min_l2size, U64, ZMOD_RW,
11192         "Min size in bytes to write rebuild log blocks in L2ARC");
11193 
11194 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, mfuonly, INT, ZMOD_RW,
11195         "Cache only MFU data from ARC into L2ARC");
11196 
11197 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, exclude_special, INT, ZMOD_RW,
11198         "Exclude dbufs on special vdevs from being cached to L2ARC if set.");
11199 
11200 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, lotsfree_percent, param_set_arc_int,
11201         param_get_uint, ZMOD_RW, "System free memory I/O throttle (percent)");
11202 
11203 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, sys_free, param_set_arc_u64,
11204         spl_param_get_u64, ZMOD_RW, "System free memory target size in bytes");
11205 
11206 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit, param_set_arc_u64,
11207         spl_param_get_u64, ZMOD_RW, "Minimum bytes of dnodes in ARC");
11208 
11209 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit_percent,
11210     param_set_arc_int, param_get_uint, ZMOD_RW,
11211         "Percent of ARC meta buffers for dnodes");
11212 
11213 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, dnode_reduce_percent, UINT, ZMOD_RW,
11214         "Percentage of excess dnodes to try to unpin");
11215 
11216 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, eviction_pct, UINT, ZMOD_RW,
11217         "When full, ARC allocation waits for eviction of this % of alloc size");
11218 
11219 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, evict_batch_limit, UINT, ZMOD_RW,
11220         "The number of headers to evict per sublist before moving to the next");
11221 
11222 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, prune_task_threads, INT, ZMOD_RW,
11223         "Number of arc_prune threads");
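
On Linux, each ZFS_MODULE_PARAM/ZFS_MODULE_PARAM_CALL declaration above surfaces as a module parameter that can be inspected or tuned at run time, normally under /sys/module/zfs/parameters/<name>; FreeBSD exposes the same tunables through sysctl(8) in the vfs.zfs tree. The short sketch below reads one L2ARC tunable through the Linux interface; the exact path is an assumption about a typical ZFS-on-Linux installation, not something defined in this file.

#include <stdio.h>

int
main(void)
{
        /* Assumed location of the tunable on a ZFS-on-Linux system. */
        const char *path = "/sys/module/zfs/parameters/l2arc_write_max";
        char value[64];
        FILE *fp = fopen(path, "r");

        if (fp == NULL) {
                perror(path);
                return (1);
        }
        if (fgets(value, sizeof (value), fp) != NULL)
                printf("l2arc_write_max = %s", value);
        fclose(fp);
        return (0);
}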
