FreeBSD/Linux Kernel Cross Reference
sys/contrib/openzfs/module/zfs/vdev_rebuild.c

    1 /*
    2  * CDDL HEADER START
    3  *
    4  * The contents of this file are subject to the terms of the
    5  * Common Development and Distribution License (the "License").
    6  * You may not use this file except in compliance with the License.
    7  *
    8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
    9  * or https://opensource.org/licenses/CDDL-1.0.
   10  * See the License for the specific language governing permissions
   11  * and limitations under the License.
   12  *
   13  * When distributing Covered Code, include this CDDL HEADER in each
   14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
   15  * If applicable, add the following below this CDDL HEADER, with the
   16  * fields enclosed by brackets "[]" replaced with your own identifying
   17  * information: Portions Copyright [yyyy] [name of copyright owner]
   18  *
   19  * CDDL HEADER END
   20  */
   21 /*
   22  *
   23  * Copyright (c) 2018, Intel Corporation.
   24  * Copyright (c) 2020 by Lawrence Livermore National Security, LLC.
   25  * Copyright (c) 2022 Hewlett Packard Enterprise Development LP.
   26  */
   27 
   28 #include <sys/vdev_impl.h>
   29 #include <sys/vdev_draid.h>
   30 #include <sys/dsl_scan.h>
   31 #include <sys/spa_impl.h>
   32 #include <sys/metaslab_impl.h>
   33 #include <sys/vdev_rebuild.h>
   34 #include <sys/zio.h>
   35 #include <sys/dmu_tx.h>
   36 #include <sys/arc.h>
   37 #include <sys/zap.h>
   38 
   39 /*
   40  * This file contains the sequential reconstruction implementation for
   41  * resilvering.  This form of resilvering is internally referred to as device
   42  * rebuild to avoid conflating it with the traditional healing reconstruction
   43  * performed by the dsl scan code.
   44  *
   45  * When replacing a device, or scrubbing the pool, ZFS has historically used
   46  * a process called resilvering which is a form of healing reconstruction.
   47  * This approach has the advantage that as blocks are read from disk their
   48  * checksums can be immediately verified and the data repaired.  Unfortunately,
   49  * it also results in a random IO pattern to the disk even when extra care
   50  * is taken to sequentialize the IO as much as possible.  This substantially
   51  * increases the time required to resilver the pool and restore redundancy.
   52  *
   53  * For mirrored devices it's possible to implement an alternate sequential
   54  * reconstruction strategy when resilvering.  Sequential reconstruction
   55  * behaves like a traditional RAID rebuild and reconstructs a device in LBA
   56  * order without verifying the checksum.  After this phase completes a second
   57  * scrub phase is started to verify all of the checksums.  This two phase
   58  * process will take longer than the healing reconstruction described above.
   59  * However, it has the advantage that after the first reconstruction phase
   60  * completes redundancy has been restored.  At this point the pool can incur
   61  * another device failure without risking data loss.
   62  *
   63  * There are a few noteworthy limitations and other advantages of resilvering
   64  * using sequential reconstruction vs healing reconstruction.
   65  *
   66  * Limitations:
   67  *
   68  *   - Sequential reconstruction is not possible on RAIDZ due to its
   69  *     variable stripe width.  Note dRAID uses a fixed stripe width which
   70  *     avoids this issue, but comes at the expense of some usable capacity.
   71  *
   72  *   - Block checksums are not verified during sequential reconstruction.
   73  *     Similar to traditional RAID the parity/mirror data is reconstructed
   74  *     but cannot be immediately double checked.  For this reason when the
   75  *     last active resilver completes the pool is automatically scrubbed
   76  *     by default.
   77  *
   78  *   - Deferred resilvers using sequential reconstruction are not currently
   79  *     supported.  When adding another vdev to an active top-level resilver
   80  *     it must be restarted.
   81  *
   82  * Advantages:
   83  *
   84  *   - Sequential reconstruction is performed in LBA order which may be faster
   85  *     than healing reconstruction particularly when using HDDs (or
   86  *     especially with SMR devices).  Only allocated capacity is resilvered.
   87  *
   88  *   - Sequential reconstruction is not constrained by ZFS block boundaries.
   89  *     This allows it to issue larger IOs to disk which span multiple blocks
   90  *     allowing all of these logical blocks to be repaired with a single IO.
   91  *
   92  *   - Unlike a healing resilver or scrub which are pool wide operations,
   93  *     sequential reconstruction is handled by the top-level vdevs.  This
   94  *     allows for it to be started or canceled on a top-level vdev without
   95  *     impacting any other top-level vdevs in the pool.
   96  *
   97  *   - Data only referenced by a pool checkpoint will be repaired because
   98  *     that space is reflected in the space maps.  This differs from a
   99  *     healing resilver or scrub, which will not repair that data.
  100  */
  101 
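      /*
       * Note: from an administrator's perspective, sequential reconstruction
       * is typically requested with the -s option of "zpool attach" or
       * "zpool replace" (and is used when a dRAID pool rebuilds to a
       * distributed spare); without -s the traditional healing resilver
       * described above is performed.
       */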
  102 
  103 /*
  104  * Size of rebuild reads; defaults to 1MiB per data disk and is capped at
  105  * SPA_MAXBLOCKSIZE.
  106  */
  107 static uint64_t zfs_rebuild_max_segment = 1024 * 1024;
  108 
  109 /*
  110  * Maximum number of bytes of rebuild I/O kept in flight per leaf vdev by
  111  * a sequential resilver.  We attempt to strike a balance here between keeping
  112  * the vdev queues full of I/Os at all times and not overflowing the queues
  113  * to cause long latency, which would cause long txg sync times.
  114  *
  115  * A large default value can be safely used here because the default target
  116  * segment size is also large (zfs_rebuild_max_segment=1M), so the in-flight
  117  * bytes translate into only a modest number of queued I/Os per leaf vdev.
  118  *
  119  * 32MB was selected as the default value to achieve good performance with
  120  * a large 90-drive dRAID HDD configuration (draid2:8d:90c:2s). A sequential
  121  * rebuild was unable to saturate all of the drives using smaller values.
  122  * With a value of 32MB the sequential resilver write rate was measured at
  123  * 800MB/s sustained while rebuilding to a distributed spare.
  124  */
  125 static uint64_t zfs_rebuild_vdev_limit = 32 << 20;
  126 
  127 /*
  128  * Automatically start a pool scrub when the last active sequential resilver
  129  * completes in order to verify the checksums of all blocks which have been
  130  * resilvered. This option is enabled by default and is strongly recommended.
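       * See vdev_rebuild_complete_sync() below, where the scrub is scheduled
       * from syncing context once the rebuild finishes.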
  131  */
  132 static int zfs_rebuild_scrub_enabled = 1;
  133 
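      /*
       * The tunables above are exposed through the ZFS_MODULE_PARAM
       * declarations at the end of this file.  On Linux they can typically be
       * adjusted at runtime under /sys/module/zfs/parameters/, e.g.:
       *
       *   echo 67108864 > /sys/module/zfs/parameters/zfs_rebuild_vdev_limit
       *
       * On FreeBSD the corresponding vfs.zfs.* sysctls serve the same purpose.
       */
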
  134 /*
  135  * For vdev_rebuild_initiate_sync() and vdev_rebuild_reset_sync().
  136  */
  137 static __attribute__((noreturn)) void vdev_rebuild_thread(void *arg);
  138 static void vdev_rebuild_reset_sync(void *arg, dmu_tx_t *tx);
  139 
  140 /*
  141  * Clear the per-vdev rebuild bytes value for a vdev tree.
  142  */
  143 static void
  144 clear_rebuild_bytes(vdev_t *vd)
  145 {
  146         vdev_stat_t *vs = &vd->vdev_stat;
  147 
  148         for (uint64_t i = 0; i < vd->vdev_children; i++)
  149                 clear_rebuild_bytes(vd->vdev_child[i]);
  150 
  151         mutex_enter(&vd->vdev_stat_lock);
  152         vs->vs_rebuild_processed = 0;
  153         mutex_exit(&vd->vdev_stat_lock);
  154 }
  155 
  156 /*
  157  * Determines whether a vdev_rebuild_thread() should be stopped.
  158  */
  159 static boolean_t
  160 vdev_rebuild_should_stop(vdev_t *vd)
  161 {
  162         return (!vdev_writeable(vd) || vd->vdev_removing ||
  163             vd->vdev_rebuild_exit_wanted ||
  164             vd->vdev_rebuild_cancel_wanted ||
  165             vd->vdev_rebuild_reset_wanted);
  166 }
  167 
  168 /*
  169  * Determine if the rebuild should be canceled.  This may happen when all
  170  * vdevs with MISSING DTLs are detached.
  171  */
  172 static boolean_t
  173 vdev_rebuild_should_cancel(vdev_t *vd)
  174 {
  175         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  176         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  177 
  178         if (!vdev_resilver_needed(vd, &vrp->vrp_min_txg, &vrp->vrp_max_txg))
  179                 return (B_TRUE);
  180 
  181         return (B_FALSE);
  182 }
  183 
  184 /*
  185  * The sync task for updating the on-disk state of a rebuild.  This is
  186  * scheduled by vdev_rebuild_range().
  187  */
  188 static void
  189 vdev_rebuild_update_sync(void *arg, dmu_tx_t *tx)
  190 {
  191         int vdev_id = (uintptr_t)arg;
  192         spa_t *spa = dmu_tx_pool(tx)->dp_spa;
  193         vdev_t *vd = vdev_lookup_top(spa, vdev_id);
  194         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  195         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  196         uint64_t txg = dmu_tx_get_txg(tx);
  197 
  198         mutex_enter(&vd->vdev_rebuild_lock);
  199 
  200         if (vr->vr_scan_offset[txg & TXG_MASK] > 0) {
  201                 vrp->vrp_last_offset = vr->vr_scan_offset[txg & TXG_MASK];
  202                 vr->vr_scan_offset[txg & TXG_MASK] = 0;
  203         }
  204 
  205         vrp->vrp_scan_time_ms = vr->vr_prev_scan_time_ms +
  206             NSEC2MSEC(gethrtime() - vr->vr_pass_start_time);
  207 
  208         VERIFY0(zap_update(vd->vdev_spa->spa_meta_objset, vd->vdev_top_zap,
  209             VDEV_TOP_ZAP_VDEV_REBUILD_PHYS, sizeof (uint64_t),
  210             REBUILD_PHYS_ENTRIES, vrp, tx));
  211 
  212         mutex_exit(&vd->vdev_rebuild_lock);
  213 }
  214 
  215 /*
  216  * Initialize the on-disk state for a new rebuild, start the rebuild thread.
  217  */
  218 static void
  219 vdev_rebuild_initiate_sync(void *arg, dmu_tx_t *tx)
  220 {
  221         int vdev_id = (uintptr_t)arg;
  222         spa_t *spa = dmu_tx_pool(tx)->dp_spa;
  223         vdev_t *vd = vdev_lookup_top(spa, vdev_id);
  224         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  225         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  226 
  227         ASSERT(vd->vdev_rebuilding);
  228 
  229         spa_feature_incr(vd->vdev_spa, SPA_FEATURE_DEVICE_REBUILD, tx);
  230 
  231         mutex_enter(&vd->vdev_rebuild_lock);
  232         memset(vrp, 0, sizeof (uint64_t) * REBUILD_PHYS_ENTRIES);
  233         vrp->vrp_rebuild_state = VDEV_REBUILD_ACTIVE;
  234         vrp->vrp_min_txg = 0;
  235         vrp->vrp_max_txg = dmu_tx_get_txg(tx);
  236         vrp->vrp_start_time = gethrestime_sec();
  237         vrp->vrp_scan_time_ms = 0;
  238         vr->vr_prev_scan_time_ms = 0;
  239 
  240         /*
  241          * Rebuilds are currently only used when replacing a device, in which
  242          * case there must be DTL_MISSING entries.  In the future, we could
  243          * allow rebuilds to be used in a way similar to a scrub.  This would
  244          * be useful because it would allow us to rebuild the space used by
  245          * pool checkpoints.
  246          */
  247         VERIFY(vdev_resilver_needed(vd, &vrp->vrp_min_txg, &vrp->vrp_max_txg));
  248 
  249         VERIFY0(zap_update(vd->vdev_spa->spa_meta_objset, vd->vdev_top_zap,
  250             VDEV_TOP_ZAP_VDEV_REBUILD_PHYS, sizeof (uint64_t),
  251             REBUILD_PHYS_ENTRIES, vrp, tx));
  252 
  253         spa_history_log_internal(spa, "rebuild", tx,
  254             "vdev_id=%llu vdev_guid=%llu started",
  255             (u_longlong_t)vd->vdev_id, (u_longlong_t)vd->vdev_guid);
  256 
  257         ASSERT3P(vd->vdev_rebuild_thread, ==, NULL);
  258         vd->vdev_rebuild_thread = thread_create(NULL, 0,
  259             vdev_rebuild_thread, vd, 0, &p0, TS_RUN, maxclsyspri);
  260 
  261         mutex_exit(&vd->vdev_rebuild_lock);
  262 }
  263 
  264 static void
  265 vdev_rebuild_log_notify(spa_t *spa, vdev_t *vd, const char *name)
  266 {
  267         nvlist_t *aux = fnvlist_alloc();
  268 
  269         fnvlist_add_string(aux, ZFS_EV_RESILVER_TYPE, "sequential");
  270         spa_event_notify(spa, vd, aux, name);
  271         nvlist_free(aux);
  272 }
  273 
  274 /*
  275  * Called to request that a new rebuild be started.  The feature will remain
  276  * active for the duration of the rebuild, then revert to the enabled state.
  277  */
  278 static void
  279 vdev_rebuild_initiate(vdev_t *vd)
  280 {
  281         spa_t *spa = vd->vdev_spa;
  282 
  283         ASSERT(vd->vdev_top == vd);
  284         ASSERT(MUTEX_HELD(&vd->vdev_rebuild_lock));
  285         ASSERT(!vd->vdev_rebuilding);
  286 
  287         dmu_tx_t *tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
  288         VERIFY0(dmu_tx_assign(tx, TXG_WAIT));
  289 
  290         vd->vdev_rebuilding = B_TRUE;
  291 
  292         dsl_sync_task_nowait(spa_get_dsl(spa), vdev_rebuild_initiate_sync,
  293             (void *)(uintptr_t)vd->vdev_id, tx);
  294         dmu_tx_commit(tx);
  295 
  296         vdev_rebuild_log_notify(spa, vd, ESC_ZFS_RESILVER_START);
  297 }
  298 
  299 /*
  300  * Update the on-disk state to completed when a rebuild finishes.
  301  */
  302 static void
  303 vdev_rebuild_complete_sync(void *arg, dmu_tx_t *tx)
  304 {
  305         int vdev_id = (uintptr_t)arg;
  306         spa_t *spa = dmu_tx_pool(tx)->dp_spa;
  307         vdev_t *vd = vdev_lookup_top(spa, vdev_id);
  308         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  309         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  310 
  311         mutex_enter(&vd->vdev_rebuild_lock);
  312 
  313         /*
  314          * Handle a second device failure if it occurs after all rebuild I/O
  315          * has completed but before this sync task has been executed.
  316          */
  317         if (vd->vdev_rebuild_reset_wanted) {
  318                 mutex_exit(&vd->vdev_rebuild_lock);
  319                 vdev_rebuild_reset_sync(arg, tx);
  320                 return;
  321         }
  322 
  323         vrp->vrp_rebuild_state = VDEV_REBUILD_COMPLETE;
  324         vrp->vrp_end_time = gethrestime_sec();
  325 
  326         VERIFY0(zap_update(vd->vdev_spa->spa_meta_objset, vd->vdev_top_zap,
  327             VDEV_TOP_ZAP_VDEV_REBUILD_PHYS, sizeof (uint64_t),
  328             REBUILD_PHYS_ENTRIES, vrp, tx));
  329 
  330         vdev_dtl_reassess(vd, tx->tx_txg, vrp->vrp_max_txg, B_TRUE, B_TRUE);
  331         spa_feature_decr(vd->vdev_spa, SPA_FEATURE_DEVICE_REBUILD, tx);
  332 
  333         spa_history_log_internal(spa, "rebuild",  tx,
  334             "vdev_id=%llu vdev_guid=%llu complete",
  335             (u_longlong_t)vd->vdev_id, (u_longlong_t)vd->vdev_guid);
  336         vdev_rebuild_log_notify(spa, vd, ESC_ZFS_RESILVER_FINISH);
  337 
  338         /* Handles detaching of spares */
  339         spa_async_request(spa, SPA_ASYNC_REBUILD_DONE);
  340         vd->vdev_rebuilding = B_FALSE;
  341         mutex_exit(&vd->vdev_rebuild_lock);
  342 
  343         /*
  344          * While we're in syncing context, take the opportunity to
  345          * set up the scrub when there are no more active rebuilds.
  346          */
  347         pool_scan_func_t func = POOL_SCAN_SCRUB;
  348         if (dsl_scan_setup_check(&func, tx) == 0 &&
  349             zfs_rebuild_scrub_enabled) {
  350                 dsl_scan_setup_sync(&func, tx);
  351         }
  352 
  353         cv_broadcast(&vd->vdev_rebuild_cv);
  354 
  355         /* Clear recent error events (i.e. duplicate events tracking) */
  356         zfs_ereport_clear(spa, NULL);
  357 }
  358 
  359 /*
  360  * Update the on-disk state to canceled when a rebuild finishes.
  361  */
  362 static void
  363 vdev_rebuild_cancel_sync(void *arg, dmu_tx_t *tx)
  364 {
  365         int vdev_id = (uintptr_t)arg;
  366         spa_t *spa = dmu_tx_pool(tx)->dp_spa;
  367         vdev_t *vd = vdev_lookup_top(spa, vdev_id);
  368         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  369         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  370 
  371         mutex_enter(&vd->vdev_rebuild_lock);
  372         vrp->vrp_rebuild_state = VDEV_REBUILD_CANCELED;
  373         vrp->vrp_end_time = gethrestime_sec();
  374 
  375         VERIFY0(zap_update(vd->vdev_spa->spa_meta_objset, vd->vdev_top_zap,
  376             VDEV_TOP_ZAP_VDEV_REBUILD_PHYS, sizeof (uint64_t),
  377             REBUILD_PHYS_ENTRIES, vrp, tx));
  378 
  379         spa_feature_decr(vd->vdev_spa, SPA_FEATURE_DEVICE_REBUILD, tx);
  380 
  381         spa_history_log_internal(spa, "rebuild",  tx,
  382             "vdev_id=%llu vdev_guid=%llu canceled",
  383             (u_longlong_t)vd->vdev_id, (u_longlong_t)vd->vdev_guid);
  384         vdev_rebuild_log_notify(spa, vd, ESC_ZFS_RESILVER_FINISH);
  385 
  386         vd->vdev_rebuild_cancel_wanted = B_FALSE;
  387         vd->vdev_rebuilding = B_FALSE;
  388         mutex_exit(&vd->vdev_rebuild_lock);
  389 
  390         spa_notify_waiters(spa);
  391         cv_broadcast(&vd->vdev_rebuild_cv);
  392 }
  393 
  394 /*
  395  * Resets the progress of a running rebuild.  This will occur when a new
  396  * vdev is attached and must participate in the rebuild.
  397  */
  398 static void
  399 vdev_rebuild_reset_sync(void *arg, dmu_tx_t *tx)
  400 {
  401         int vdev_id = (uintptr_t)arg;
  402         spa_t *spa = dmu_tx_pool(tx)->dp_spa;
  403         vdev_t *vd = vdev_lookup_top(spa, vdev_id);
  404         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  405         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  406 
  407         mutex_enter(&vd->vdev_rebuild_lock);
  408 
  409         ASSERT(vrp->vrp_rebuild_state == VDEV_REBUILD_ACTIVE);
  410         ASSERT3P(vd->vdev_rebuild_thread, ==, NULL);
  411 
  412         vrp->vrp_last_offset = 0;
  413         vrp->vrp_min_txg = 0;
  414         vrp->vrp_max_txg = dmu_tx_get_txg(tx);
  415         vrp->vrp_bytes_scanned = 0;
  416         vrp->vrp_bytes_issued = 0;
  417         vrp->vrp_bytes_rebuilt = 0;
  418         vrp->vrp_bytes_est = 0;
  419         vrp->vrp_scan_time_ms = 0;
  420         vr->vr_prev_scan_time_ms = 0;
  421 
  422         /* See vdev_rebuild_initiate_sync comment */
  423         VERIFY(vdev_resilver_needed(vd, &vrp->vrp_min_txg, &vrp->vrp_max_txg));
  424 
  425         VERIFY0(zap_update(vd->vdev_spa->spa_meta_objset, vd->vdev_top_zap,
  426             VDEV_TOP_ZAP_VDEV_REBUILD_PHYS, sizeof (uint64_t),
  427             REBUILD_PHYS_ENTRIES, vrp, tx));
  428 
  429         spa_history_log_internal(spa, "rebuild",  tx,
  430             "vdev_id=%llu vdev_guid=%llu reset",
  431             (u_longlong_t)vd->vdev_id, (u_longlong_t)vd->vdev_guid);
  432 
  433         vd->vdev_rebuild_reset_wanted = B_FALSE;
  434         ASSERT(vd->vdev_rebuilding);
  435 
  436         vd->vdev_rebuild_thread = thread_create(NULL, 0,
  437             vdev_rebuild_thread, vd, 0, &p0, TS_RUN, maxclsyspri);
  438 
  439         mutex_exit(&vd->vdev_rebuild_lock);
  440 }
  441 
  442 /*
  443  * Clear the last rebuild status.
  444  */
  445 void
  446 vdev_rebuild_clear_sync(void *arg, dmu_tx_t *tx)
  447 {
  448         int vdev_id = (uintptr_t)arg;
  449         spa_t *spa = dmu_tx_pool(tx)->dp_spa;
  450         vdev_t *vd = vdev_lookup_top(spa, vdev_id);
  451         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  452         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  453         objset_t *mos = spa_meta_objset(spa);
  454 
  455         mutex_enter(&vd->vdev_rebuild_lock);
  456 
  457         if (!spa_feature_is_enabled(spa, SPA_FEATURE_DEVICE_REBUILD) ||
  458             vrp->vrp_rebuild_state == VDEV_REBUILD_ACTIVE) {
  459                 mutex_exit(&vd->vdev_rebuild_lock);
  460                 return;
  461         }
  462 
  463         clear_rebuild_bytes(vd);
  464         memset(vrp, 0, sizeof (uint64_t) * REBUILD_PHYS_ENTRIES);
  465 
  466         if (vd->vdev_top_zap != 0 && zap_contains(mos, vd->vdev_top_zap,
  467             VDEV_TOP_ZAP_VDEV_REBUILD_PHYS) == 0) {
  468                 VERIFY0(zap_update(mos, vd->vdev_top_zap,
  469                     VDEV_TOP_ZAP_VDEV_REBUILD_PHYS, sizeof (uint64_t),
  470                     REBUILD_PHYS_ENTRIES, vrp, tx));
  471         }
  472 
  473         mutex_exit(&vd->vdev_rebuild_lock);
  474 }
  475 
  476 /*
  477  * The zio_done_func_t callback for each rebuild I/O issued.  It's responsible
  478  * for updating the rebuild stats and limiting the number of in flight I/Os.
  479  */
  480 static void
  481 vdev_rebuild_cb(zio_t *zio)
  482 {
  483         vdev_rebuild_t *vr = zio->io_private;
  484         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  485         vdev_t *vd = vr->vr_top_vdev;
  486 
  487         mutex_enter(&vr->vr_io_lock);
  488         if (zio->io_error == ENXIO && !vdev_writeable(vd)) {
  489                 /*
  490                  * The I/O failed because the top-level vdev was unavailable.
  491                  * Attempt to roll back to the last completed offset, in order
  492                  * to resume from the correct location if the pool is resumed.
  493                  * (This works because spa_sync waits on spa_txg_zio before
  494                  * it runs sync tasks.)
  495                  */
  496                 uint64_t *off = &vr->vr_scan_offset[zio->io_txg & TXG_MASK];
  497                 *off = MIN(*off, zio->io_offset);
  498         } else if (zio->io_error) {
  499                 vrp->vrp_errors++;
  500         }
  501 
  502         abd_free(zio->io_abd);
  503 
  504         ASSERT3U(vr->vr_bytes_inflight, >, 0);
  505         vr->vr_bytes_inflight -= zio->io_size;
  506         cv_broadcast(&vr->vr_io_cv);
  507         mutex_exit(&vr->vr_io_lock);
  508 
  509         spa_config_exit(vd->vdev_spa, SCL_STATE_ALL, vd);
  510 }
  511 
  512 /*
  513  * Initialize a block pointer that can be used to read the given segment
  514  * for sequential rebuild.
  515  */
  516 static void
  517 vdev_rebuild_blkptr_init(blkptr_t *bp, vdev_t *vd, uint64_t start,
  518     uint64_t asize)
  519 {
  520         ASSERT(vd->vdev_ops == &vdev_draid_ops ||
  521             vd->vdev_ops == &vdev_mirror_ops ||
  522             vd->vdev_ops == &vdev_replacing_ops ||
  523             vd->vdev_ops == &vdev_spare_ops);
  524 
  525         uint64_t psize = vd->vdev_ops == &vdev_draid_ops ?
  526             vdev_draid_asize_to_psize(vd, asize) : asize;
  527 
  528         BP_ZERO(bp);
  529 
  530         DVA_SET_VDEV(&bp->blk_dva[0], vd->vdev_id);
  531         DVA_SET_OFFSET(&bp->blk_dva[0], start);
  532         DVA_SET_GANG(&bp->blk_dva[0], 0);
  533         DVA_SET_ASIZE(&bp->blk_dva[0], asize);
  534 
  535         BP_SET_BIRTH(bp, TXG_INITIAL, TXG_INITIAL);
  536         BP_SET_LSIZE(bp, psize);
  537         BP_SET_PSIZE(bp, psize);
  538         BP_SET_COMPRESS(bp, ZIO_COMPRESS_OFF);
  539         BP_SET_CHECKSUM(bp, ZIO_CHECKSUM_OFF);
  540         BP_SET_TYPE(bp, DMU_OT_NONE);
  541         BP_SET_LEVEL(bp, 0);
  542         BP_SET_DEDUP(bp, 0);
  543         BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);
  544 }
  545 
  546 /*
  547  * Issues a rebuild I/O and takes care of rate limiting the number of queued
  548  * rebuild I/Os.  The provided start and size must be properly aligned for the
  549  * top-level vdev type being rebuilt.
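       *
       * Progress is checkpointed per txg: the first range issued in a txg
       * records its starting offset in vr_scan_offset[txg & TXG_MASK] and
       * schedules vdev_rebuild_update_sync() to persist it, so an interrupted
       * rebuild can later resume from the last offset known to be on disk.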
  550  */
  551 static int
  552 vdev_rebuild_range(vdev_rebuild_t *vr, uint64_t start, uint64_t size)
  553 {
  554         uint64_t ms_id __maybe_unused = vr->vr_scan_msp->ms_id;
  555         vdev_t *vd = vr->vr_top_vdev;
  556         spa_t *spa = vd->vdev_spa;
  557         blkptr_t blk;
  558 
  559         ASSERT3U(ms_id, ==, start >> vd->vdev_ms_shift);
  560         ASSERT3U(ms_id, ==, (start + size - 1) >> vd->vdev_ms_shift);
  561 
  562         vr->vr_pass_bytes_scanned += size;
  563         vr->vr_rebuild_phys.vrp_bytes_scanned += size;
  564 
  565         /*
  566          * Rebuild the data in this range by constructing a special block
  567          * pointer.  It has no relation to any existing blocks in the pool.
  568          * However, by disabling checksum verification and issuing a scrub IO
  569          * we can reconstruct and repair any children with missing data.
  570          */
  571         vdev_rebuild_blkptr_init(&blk, vd, start, size);
  572         uint64_t psize = BP_GET_PSIZE(&blk);
  573 
  574         if (!vdev_dtl_need_resilver(vd, &blk.blk_dva[0], psize, TXG_UNKNOWN))
  575                 return (0);
  576 
  577         mutex_enter(&vr->vr_io_lock);
  578 
  579         /* Limit in flight rebuild I/Os */
  580         while (vr->vr_bytes_inflight >= vr->vr_bytes_inflight_max)
  581                 cv_wait(&vr->vr_io_cv, &vr->vr_io_lock);
  582 
  583         vr->vr_bytes_inflight += psize;
  584         mutex_exit(&vr->vr_io_lock);
  585 
  586         dmu_tx_t *tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
  587         VERIFY0(dmu_tx_assign(tx, TXG_WAIT));
  588         uint64_t txg = dmu_tx_get_txg(tx);
  589 
  590         spa_config_enter(spa, SCL_STATE_ALL, vd, RW_READER);
  591         mutex_enter(&vd->vdev_rebuild_lock);
  592 
  593         /* This is the first I/O for this txg. */
  594         if (vr->vr_scan_offset[txg & TXG_MASK] == 0) {
  595                 vr->vr_scan_offset[txg & TXG_MASK] = start;
  596                 dsl_sync_task_nowait(spa_get_dsl(spa),
  597                     vdev_rebuild_update_sync,
  598                     (void *)(uintptr_t)vd->vdev_id, tx);
  599         }
  600 
  601         /* When exiting, write out our progress. */
  602         if (vdev_rebuild_should_stop(vd)) {
  603                 mutex_enter(&vr->vr_io_lock);
  604                 vr->vr_bytes_inflight -= psize;
  605                 mutex_exit(&vr->vr_io_lock);
  606                 spa_config_exit(vd->vdev_spa, SCL_STATE_ALL, vd);
  607                 mutex_exit(&vd->vdev_rebuild_lock);
  608                 dmu_tx_commit(tx);
  609                 return (SET_ERROR(EINTR));
  610         }
  611         mutex_exit(&vd->vdev_rebuild_lock);
  612         dmu_tx_commit(tx);
  613 
  614         vr->vr_scan_offset[txg & TXG_MASK] = start + size;
  615         vr->vr_pass_bytes_issued += size;
  616         vr->vr_rebuild_phys.vrp_bytes_issued += size;
  617 
  618         zio_nowait(zio_read(spa->spa_txg_zio[txg & TXG_MASK], spa, &blk,
  619             abd_alloc(psize, B_FALSE), psize, vdev_rebuild_cb, vr,
  620             ZIO_PRIORITY_REBUILD, ZIO_FLAG_RAW | ZIO_FLAG_CANFAIL |
  621             ZIO_FLAG_RESILVER, NULL));
  622 
  623         return (0);
  624 }
  625 
  626 /*
  627  * Issues rebuild I/Os for all ranges in the provided vr->vr_scan_tree range tree.
  628  */
  629 static int
  630 vdev_rebuild_ranges(vdev_rebuild_t *vr)
  631 {
  632         vdev_t *vd = vr->vr_top_vdev;
  633         zfs_btree_t *t = &vr->vr_scan_tree->rt_root;
  634         zfs_btree_index_t idx;
  635         int error;
  636 
  637         for (range_seg_t *rs = zfs_btree_first(t, &idx); rs != NULL;
  638             rs = zfs_btree_next(t, &idx, &idx)) {
  639                 uint64_t start = rs_get_start(rs, vr->vr_scan_tree);
  640                 uint64_t size = rs_get_end(rs, vr->vr_scan_tree) - start;
  641 
  642                 /*
  643                  * zfs_scan_suspend_progress can be set to disable rebuild
  644                  * progress for testing.  See comment in dsl_scan_sync().
  645                  */
  646                 while (zfs_scan_suspend_progress &&
  647                     !vdev_rebuild_should_stop(vd)) {
  648                         delay(hz);
  649                 }
  650 
  651                 while (size > 0) {
  652                         uint64_t chunk_size;
  653 
  654                         /*
  655                          * Split range into legally-sized logical chunks
  656                          * given the constraints of the top-level vdev
  657                          * being rebuilt (dRAID or mirror).
  658                          */
  659                         ASSERT3P(vd->vdev_ops, !=, NULL);
  660                         chunk_size = vd->vdev_ops->vdev_op_rebuild_asize(vd,
  661                             start, size, zfs_rebuild_max_segment);
  662 
  663                         error = vdev_rebuild_range(vr, start, chunk_size);
  664                         if (error != 0)
  665                                 return (error);
  666 
  667                         size -= chunk_size;
  668                         start += chunk_size;
  669                 }
  670         }
  671 
  672         return (0);
  673 }
  674 
  675 /*
  676  * Calculates the estimated capacity which remains to be scanned.  Since
  677  * we traverse the pool in metaslab order, only allocated capacity beyond
  678  * vrp_last_offset need be considered.  All lower offsets must have
  679  * already been rebuilt and are thus already included in vrp_bytes_scanned.
  680  */
  681 static void
  682 vdev_rebuild_update_bytes_est(vdev_t *vd, uint64_t ms_id)
  683 {
  684         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  685         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  686         uint64_t bytes_est = vrp->vrp_bytes_scanned;
  687 
  688         if (vrp->vrp_last_offset < vd->vdev_ms[ms_id]->ms_start)
  689                 return;
  690 
  691         for (uint64_t i = ms_id; i < vd->vdev_ms_count; i++) {
  692                 metaslab_t *msp = vd->vdev_ms[i];
  693 
  694                 mutex_enter(&msp->ms_lock);
  695                 bytes_est += metaslab_allocated_space(msp);
  696                 mutex_exit(&msp->ms_lock);
  697         }
  698 
  699         vrp->vrp_bytes_est = bytes_est;
  700 }
  701 
  702 /*
  703  * Load from disk the top-level vdev's rebuild information.
  704  */
  705 int
  706 vdev_rebuild_load(vdev_t *vd)
  707 {
  708         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  709         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  710         spa_t *spa = vd->vdev_spa;
  711         int err = 0;
  712 
  713         mutex_enter(&vd->vdev_rebuild_lock);
  714         vd->vdev_rebuilding = B_FALSE;
  715 
  716         if (!spa_feature_is_enabled(spa, SPA_FEATURE_DEVICE_REBUILD)) {
  717                 memset(vrp, 0, sizeof (uint64_t) * REBUILD_PHYS_ENTRIES);
  718                 mutex_exit(&vd->vdev_rebuild_lock);
  719                 return (SET_ERROR(ENOTSUP));
  720         }
  721 
  722         ASSERT(vd->vdev_top == vd);
  723 
  724         err = zap_lookup(spa->spa_meta_objset, vd->vdev_top_zap,
  725             VDEV_TOP_ZAP_VDEV_REBUILD_PHYS, sizeof (uint64_t),
  726             REBUILD_PHYS_ENTRIES, vrp);
  727 
  728         /*
  729          * A missing or damaged VDEV_TOP_ZAP_VDEV_REBUILD_PHYS should
  730          * not prevent a pool from being imported.  Clear the rebuild
  731          * status allowing a new resilver/rebuild to be started.
  732          */
  733         if (err == ENOENT || err == EOVERFLOW || err == ECKSUM) {
  734                 memset(vrp, 0, sizeof (uint64_t) * REBUILD_PHYS_ENTRIES);
  735         } else if (err) {
  736                 mutex_exit(&vd->vdev_rebuild_lock);
  737                 return (err);
  738         }
  739 
  740         vr->vr_prev_scan_time_ms = vrp->vrp_scan_time_ms;
  741         vr->vr_top_vdev = vd;
  742 
  743         mutex_exit(&vd->vdev_rebuild_lock);
  744 
  745         return (0);
  746 }
  747 
  748 /*
  749  * Each scan thread is responsible for rebuilding a top-level vdev.  The
  750  * rebuild progress is tracked on-disk in VDEV_TOP_ZAP_VDEV_REBUILD_PHYS.
  751  */
  752 static __attribute__((noreturn)) void
  753 vdev_rebuild_thread(void *arg)
  754 {
  755         vdev_t *vd = arg;
  756         spa_t *spa = vd->vdev_spa;
  757         int error = 0;
  758 
  759         /*
  760          * If there's a scrub in progress, request that it be stopped.  This
  761          * is not required for a correct rebuild, but we do want rebuilds to
  762          * emulate the resilver behavior as much as possible.
  763          */
  764         dsl_pool_t *dsl = spa_get_dsl(spa);
  765         if (dsl_scan_scrubbing(dsl))
  766                 dsl_scan_cancel(dsl);
  767 
  768         spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
  769         mutex_enter(&vd->vdev_rebuild_lock);
  770 
  771         ASSERT3P(vd->vdev_top, ==, vd);
  772         ASSERT3P(vd->vdev_rebuild_thread, !=, NULL);
  773         ASSERT(vd->vdev_rebuilding);
  774         ASSERT(spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REBUILD));
  775         ASSERT3B(vd->vdev_rebuild_cancel_wanted, ==, B_FALSE);
  776 
  777         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  778         vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  779         vr->vr_top_vdev = vd;
  780         vr->vr_scan_msp = NULL;
  781         vr->vr_scan_tree = range_tree_create(NULL, RANGE_SEG64, NULL, 0, 0);
  782         mutex_init(&vr->vr_io_lock, NULL, MUTEX_DEFAULT, NULL);
  783         cv_init(&vr->vr_io_cv, NULL, CV_DEFAULT, NULL);
  784 
  785         vr->vr_pass_start_time = gethrtime();
  786         vr->vr_pass_bytes_scanned = 0;
  787         vr->vr_pass_bytes_issued = 0;
  788 
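              /*
               * Cap the rebuild I/O kept in flight for this top-level vdev:
               * zfs_rebuild_vdev_limit bytes per child vdev, with a 1MB floor.
               */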
  789         vr->vr_bytes_inflight_max = MAX(1ULL << 20,
  790             zfs_rebuild_vdev_limit * vd->vdev_children);
  791 
  792         uint64_t update_est_time = gethrtime();
  793         vdev_rebuild_update_bytes_est(vd, 0);
  794 
  795         clear_rebuild_bytes(vr->vr_top_vdev);
  796 
  797         mutex_exit(&vd->vdev_rebuild_lock);
  798 
  799         /*
  800          * Systematically walk the metaslabs and issue rebuild I/Os for
  801          * all ranges in the allocated space map.
  802          */
  803         for (uint64_t i = 0; i < vd->vdev_ms_count; i++) {
  804                 metaslab_t *msp = vd->vdev_ms[i];
  805                 vr->vr_scan_msp = msp;
  806 
  807                 /*
  808                  * Removal of vdevs from the vdev tree may eliminate the need
  809                  * for the rebuild, in which case it should be canceled.  The
  810                  * vdev_rebuild_cancel_wanted flag is set until the sync task
  811                  * completes.  This may be after the rebuild thread exits.
  812                  */
  813                 if (vdev_rebuild_should_cancel(vd)) {
  814                         vd->vdev_rebuild_cancel_wanted = B_TRUE;
  815                         error = EINTR;
  816                         break;
  817                 }
  818 
  819                 ASSERT0(range_tree_space(vr->vr_scan_tree));
  820 
  821                 /* Disable any new allocations to this metaslab */
  822                 spa_config_exit(spa, SCL_CONFIG, FTAG);
  823                 metaslab_disable(msp);
  824 
  825                 mutex_enter(&msp->ms_sync_lock);
  826                 mutex_enter(&msp->ms_lock);
  827 
  828                 /*
  829                  * If there are outstanding allocations wait for them to be
  830                  * synced.  This is needed to ensure all allocated ranges are
  831                  * on disk and therefore will be rebuilt.
  832                  */
  833                 for (int j = 0; j < TXG_SIZE; j++) {
  834                         if (range_tree_space(msp->ms_allocating[j])) {
  835                                 mutex_exit(&msp->ms_lock);
  836                                 mutex_exit(&msp->ms_sync_lock);
  837                                 txg_wait_synced(dsl, 0);
  838                                 mutex_enter(&msp->ms_sync_lock);
  839                                 mutex_enter(&msp->ms_lock);
  840                                 break;
  841                         }
  842                 }
  843 
  844                 /*
  845                  * When a metaslab has been allocated from, read its allocated
  846                  * ranges from the space map object into the vr_scan_tree.
  847                  * Then add inflight / unflushed ranges and remove inflight /
  848                  * unflushed frees.  This is the minimum range to be rebuilt.
  849                  */
  850                 if (msp->ms_sm != NULL) {
  851                         VERIFY0(space_map_load(msp->ms_sm,
  852                             vr->vr_scan_tree, SM_ALLOC));
  853 
  854                         for (int i = 0; i < TXG_SIZE; i++) {
  855                                 ASSERT0(range_tree_space(
  856                                     msp->ms_allocating[i]));
  857                         }
  858 
  859                         range_tree_walk(msp->ms_unflushed_allocs,
  860                             range_tree_add, vr->vr_scan_tree);
  861                         range_tree_walk(msp->ms_unflushed_frees,
  862                             range_tree_remove, vr->vr_scan_tree);
  863 
  864                         /*
  865                          * Remove ranges which have already been rebuilt based
  866                          * on the last offset.  This can happen when restarting
  867                          * a scan after exporting and re-importing the pool.
  868                          */
  869                         range_tree_clear(vr->vr_scan_tree, 0,
  870                             vrp->vrp_last_offset);
  871                 }
  872 
  873                 mutex_exit(&msp->ms_lock);
  874                 mutex_exit(&msp->ms_sync_lock);
  875 
  876                 /*
  877                  * To provide an accurate estimate re-calculate the estimated
  878                  * size every 5 minutes to account for recent allocations and
  879                  * frees made to space maps which have not yet been rebuilt.
  880                  */
  881                 if (gethrtime() > update_est_time + SEC2NSEC(300)) {
  882                         update_est_time = gethrtime();
  883                         vdev_rebuild_update_bytes_est(vd, i);
  884                 }
  885 
  886                 /*
  887                  * Walk the allocated space map and issue the rebuild I/O.
  888                  */
  889                 error = vdev_rebuild_ranges(vr);
  890                 range_tree_vacate(vr->vr_scan_tree, NULL, NULL);
  891 
  892                 spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
  893                 metaslab_enable(msp, B_FALSE, B_FALSE);
  894 
  895                 if (error != 0)
  896                         break;
  897         }
  898 
  899         range_tree_destroy(vr->vr_scan_tree);
  900         spa_config_exit(spa, SCL_CONFIG, FTAG);
  901 
  902         /* Wait for any remaining rebuild I/O to complete */
  903         mutex_enter(&vr->vr_io_lock);
  904         while (vr->vr_bytes_inflight > 0)
  905                 cv_wait(&vr->vr_io_cv, &vr->vr_io_lock);
  906 
  907         mutex_exit(&vr->vr_io_lock);
  908 
  909         mutex_destroy(&vr->vr_io_lock);
  910         cv_destroy(&vr->vr_io_cv);
  911 
  912         spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
  913 
  914         dsl_pool_t *dp = spa_get_dsl(spa);
  915         dmu_tx_t *tx = dmu_tx_create_dd(dp->dp_mos_dir);
  916         VERIFY0(dmu_tx_assign(tx, TXG_WAIT));
  917 
  918         mutex_enter(&vd->vdev_rebuild_lock);
  919         if (error == 0) {
  920                 /*
  921                  * After a successful rebuild clear the DTLs of all ranges
  922                  * which were missing when the rebuild was started.  These
  923                  * ranges must have been rebuilt as a consequence of rebuilding
  924                  * all allocated space.  Note that unlike a scrub or resilver
  925                  * the rebuild operation will reconstruct data only referenced
  926                  * by a pool checkpoint.  See the dsl_scan_done() comments.
  927                  */
  928                 dsl_sync_task_nowait(dp, vdev_rebuild_complete_sync,
  929                     (void *)(uintptr_t)vd->vdev_id, tx);
  930         } else if (vd->vdev_rebuild_cancel_wanted) {
  931                 /*
  932                  * The rebuild operation was canceled.  This will occur when
  933                  * a device participating in the rebuild is detached.
  934                  */
  935                 dsl_sync_task_nowait(dp, vdev_rebuild_cancel_sync,
  936                     (void *)(uintptr_t)vd->vdev_id, tx);
  937         } else if (vd->vdev_rebuild_reset_wanted) {
  938                 /*
  939                  * Reset the running rebuild without canceling and restarting
  940                  * it.  This will occur when a new device is attached and must
  941                  * participate in the rebuild.
  942                  */
  943                 dsl_sync_task_nowait(dp, vdev_rebuild_reset_sync,
  944                     (void *)(uintptr_t)vd->vdev_id, tx);
  945         } else {
  946                 /*
  947                  * The rebuild operation should be suspended.  This may occur
  948                  * when detaching a child vdev or when exporting the pool.  The
  949                  * rebuild is left in the active state so it will be resumed.
  950                  */
  951                 ASSERT(vrp->vrp_rebuild_state == VDEV_REBUILD_ACTIVE);
  952                 vd->vdev_rebuilding = B_FALSE;
  953         }
  954 
  955         dmu_tx_commit(tx);
  956 
  957         vd->vdev_rebuild_thread = NULL;
  958         mutex_exit(&vd->vdev_rebuild_lock);
  959         spa_config_exit(spa, SCL_CONFIG, FTAG);
  960 
  961         cv_broadcast(&vd->vdev_rebuild_cv);
  962 
  963         thread_exit();
  964 }
  965 
  966 /*
  967  * Returns B_TRUE if any top-level vdevs are rebuilding.
  968  */
  969 boolean_t
  970 vdev_rebuild_active(vdev_t *vd)
  971 {
  972         spa_t *spa = vd->vdev_spa;
  973         boolean_t ret = B_FALSE;
  974 
  975         if (vd == spa->spa_root_vdev) {
  976                 for (uint64_t i = 0; i < vd->vdev_children; i++) {
  977                         ret = vdev_rebuild_active(vd->vdev_child[i]);
  978                         if (ret)
  979                                 return (ret);
  980                 }
  981         } else if (vd->vdev_top_zap != 0) {
  982                 vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
  983                 vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
  984 
  985                 mutex_enter(&vd->vdev_rebuild_lock);
  986                 ret = (vrp->vrp_rebuild_state == VDEV_REBUILD_ACTIVE);
  987                 mutex_exit(&vd->vdev_rebuild_lock);
  988         }
  989 
  990         return (ret);
  991 }
  992 
  993 /*
  994  * Start a rebuild operation.  If the top-level vdev is already actively
  995  * rebuilding, the existing rebuild is restarted from the beginning.
  996  */
  997 void
  998 vdev_rebuild(vdev_t *vd)
  999 {
 1000         vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
 1001         vdev_rebuild_phys_t *vrp __maybe_unused = &vr->vr_rebuild_phys;
 1002 
 1003         ASSERT(vd->vdev_top == vd);
 1004         ASSERT(vdev_is_concrete(vd));
 1005         ASSERT(!vd->vdev_removing);
 1006         ASSERT(spa_feature_is_enabled(vd->vdev_spa,
 1007             SPA_FEATURE_DEVICE_REBUILD));
 1008 
 1009         mutex_enter(&vd->vdev_rebuild_lock);
 1010         if (vd->vdev_rebuilding) {
 1011                 ASSERT3U(vrp->vrp_rebuild_state, ==, VDEV_REBUILD_ACTIVE);
 1012 
 1013                 /*
 1014                  * Signal a running rebuild operation that it should restart
 1015                  * from the beginning because a new device was attached.  The
 1016                  * vdev_rebuild_reset_wanted flag is set until the sync task
 1017                  * completes.  This may be after the rebuild thread exits.
 1018                  */
 1019                 if (!vd->vdev_rebuild_reset_wanted)
 1020                         vd->vdev_rebuild_reset_wanted = B_TRUE;
 1021         } else {
 1022                 vdev_rebuild_initiate(vd);
 1023         }
 1024         mutex_exit(&vd->vdev_rebuild_lock);
 1025 }
 1026 
 1027 static void
 1028 vdev_rebuild_restart_impl(vdev_t *vd)
 1029 {
 1030         spa_t *spa = vd->vdev_spa;
 1031 
 1032         if (vd == spa->spa_root_vdev) {
 1033                 for (uint64_t i = 0; i < vd->vdev_children; i++)
 1034                         vdev_rebuild_restart_impl(vd->vdev_child[i]);
 1035 
 1036         } else if (vd->vdev_top_zap != 0) {
 1037                 vdev_rebuild_t *vr = &vd->vdev_rebuild_config;
 1038                 vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
 1039 
 1040                 mutex_enter(&vd->vdev_rebuild_lock);
 1041                 if (vrp->vrp_rebuild_state == VDEV_REBUILD_ACTIVE &&
 1042                     vdev_writeable(vd) && !vd->vdev_rebuilding) {
 1043                         ASSERT(spa_feature_is_active(spa,
 1044                             SPA_FEATURE_DEVICE_REBUILD));
 1045                         vd->vdev_rebuilding = B_TRUE;
 1046                         vd->vdev_rebuild_thread = thread_create(NULL, 0,
 1047                             vdev_rebuild_thread, vd, 0, &p0, TS_RUN,
 1048                             maxclsyspri);
 1049                 }
 1050                 mutex_exit(&vd->vdev_rebuild_lock);
 1051         }
 1052 }
 1053 
 1054 /*
 1055  * Conditionally restart all of the vdev_rebuild_thread's for a pool.  The
 1056  * feature flag must be active and the rebuild must be in the active state.
 1057  * This cannot be used to start a new rebuild.
 1058  */
 1059 void
 1060 vdev_rebuild_restart(spa_t *spa)
 1061 {
 1062         ASSERT(MUTEX_HELD(&spa_namespace_lock));
 1063 
 1064         vdev_rebuild_restart_impl(spa->spa_root_vdev);
 1065 }
 1066 
 1067 /*
 1068  * Stop and wait for all of the vdev_rebuild_thread's associated with the
 1069  * vdev tree provided to be terminated (canceled or stopped).
 1070  */
 1071 void
 1072 vdev_rebuild_stop_wait(vdev_t *vd)
 1073 {
 1074         spa_t *spa = vd->vdev_spa;
 1075 
 1076         ASSERT(MUTEX_HELD(&spa_namespace_lock));
 1077 
 1078         if (vd == spa->spa_root_vdev) {
 1079                 for (uint64_t i = 0; i < vd->vdev_children; i++)
 1080                         vdev_rebuild_stop_wait(vd->vdev_child[i]);
 1081 
 1082         } else if (vd->vdev_top_zap != 0) {
 1083                 ASSERT(vd == vd->vdev_top);
 1084 
 1085                 mutex_enter(&vd->vdev_rebuild_lock);
 1086                 if (vd->vdev_rebuild_thread != NULL) {
 1087                         vd->vdev_rebuild_exit_wanted = B_TRUE;
 1088                         while (vd->vdev_rebuilding) {
 1089                                 cv_wait(&vd->vdev_rebuild_cv,
 1090                                     &vd->vdev_rebuild_lock);
 1091                         }
 1092                         vd->vdev_rebuild_exit_wanted = B_FALSE;
 1093                 }
 1094                 mutex_exit(&vd->vdev_rebuild_lock);
 1095         }
 1096 }
 1097 
 1098 /*
 1099  * Stop all rebuild operations but leave them in the active state so they
 1100  * will be resumed when importing the pool.
 1101  */
 1102 void
 1103 vdev_rebuild_stop_all(spa_t *spa)
 1104 {
 1105         vdev_rebuild_stop_wait(spa->spa_root_vdev);
 1106 }
 1107 
 1108 /*
 1109  * Rebuild statistics reported per top-level vdev.
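       * These are the values typically surfaced to userspace (e.g. by
       * "zpool status") when reporting sequential resilver progress.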
 1110  */
 1111 int
 1112 vdev_rebuild_get_stats(vdev_t *tvd, vdev_rebuild_stat_t *vrs)
 1113 {
 1114         spa_t *spa = tvd->vdev_spa;
 1115 
 1116         if (!spa_feature_is_enabled(spa, SPA_FEATURE_DEVICE_REBUILD))
 1117                 return (SET_ERROR(ENOTSUP));
 1118 
 1119         if (tvd != tvd->vdev_top || tvd->vdev_top_zap == 0)
 1120                 return (SET_ERROR(EINVAL));
 1121 
 1122         int error = zap_contains(spa_meta_objset(spa),
 1123             tvd->vdev_top_zap, VDEV_TOP_ZAP_VDEV_REBUILD_PHYS);
 1124 
 1125         if (error == ENOENT) {
 1126                 memset(vrs, 0, sizeof (vdev_rebuild_stat_t));
 1127                 vrs->vrs_state = VDEV_REBUILD_NONE;
 1128                 error = 0;
 1129         } else if (error == 0) {
 1130                 vdev_rebuild_t *vr = &tvd->vdev_rebuild_config;
 1131                 vdev_rebuild_phys_t *vrp = &vr->vr_rebuild_phys;
 1132 
 1133                 mutex_enter(&tvd->vdev_rebuild_lock);
 1134                 vrs->vrs_state = vrp->vrp_rebuild_state;
 1135                 vrs->vrs_start_time = vrp->vrp_start_time;
 1136                 vrs->vrs_end_time = vrp->vrp_end_time;
 1137                 vrs->vrs_scan_time_ms = vrp->vrp_scan_time_ms;
 1138                 vrs->vrs_bytes_scanned = vrp->vrp_bytes_scanned;
 1139                 vrs->vrs_bytes_issued = vrp->vrp_bytes_issued;
 1140                 vrs->vrs_bytes_rebuilt = vrp->vrp_bytes_rebuilt;
 1141                 vrs->vrs_bytes_est = vrp->vrp_bytes_est;
 1142                 vrs->vrs_errors = vrp->vrp_errors;
 1143                 vrs->vrs_pass_time_ms = NSEC2MSEC(gethrtime() -
 1144                     vr->vr_pass_start_time);
 1145                 vrs->vrs_pass_bytes_scanned = vr->vr_pass_bytes_scanned;
 1146                 vrs->vrs_pass_bytes_issued = vr->vr_pass_bytes_issued;
 1147                 mutex_exit(&tvd->vdev_rebuild_lock);
 1148         }
 1149 
 1150         return (error);
 1151 }
 1152 
 1153 ZFS_MODULE_PARAM(zfs, zfs_, rebuild_max_segment, U64, ZMOD_RW,
 1154         "Max segment size in bytes of rebuild reads");
 1155 
 1156 ZFS_MODULE_PARAM(zfs, zfs_, rebuild_vdev_limit, U64, ZMOD_RW,
 1157         "Max bytes in flight per leaf vdev for sequential resilvers");
 1158 
 1159 ZFS_MODULE_PARAM(zfs, zfs_, rebuild_scrub_enabled, INT, ZMOD_RW,
 1160         "Automatically scrub after sequential resilver completes");
