[v5,23/28] btrfs: support dev-replace in HMZONED mode

Message ID	20191204081735.852438-24-naohiro.aota@wdc.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=ure3=Z2=vger.kernel.org=linux-btrfs-owner@kernel.org> IronPort-SDR: Tp/3pCeYOdu8zWDuvUEDh0EBHUXWq0PNpOD2p+WbdsmmT+/Bm7uUyemMGagUx23tKzLEG+vAhO fnJrxRYA2n1nMbKHcaWtQLFj6x5DQ9l4354QDAYCCtTiJOJrk/Cj/uLii9xZmQxY1OlmAeiIA7 2x1Rrq2F30+7bLmHrOy7VGyAir/qDooKDIL76hoBgcB8ZT8UdildYxig6UKGFTTDScZI9Ovt4X WDqQoSl6naJhztt2QnyvU5PdmiWz1aorEZOdIfDtdSVFZOawqjmpyRan74s7HoGeYkxAK9c96y SSI= IronPort-SDR: 40Kd1SGzRSminTpSL0mH+hEtFdsTc3Plt9t/UTm9n7R0Yd30CbR96rTEjPhhtp6jYlhBADjXL2 BUaowjJ459JIbUaHpade94g9yi2D4lUebCkUYL0IU9ah0mzSJguxxKoJ7/ijFqYNl+K6+jHm4Q mkERj+aJwnZrs9FuBzj3gyLG725YclYQjjW+WH3oEEacbdmrPjGDMIjzBi6gyGJQVB/6sb6qZi +JJwskqkjz83gqB2whuaQkxOAcc0RddAJFtaxqnMGmazIFC7UfucXy+eIZyjiHSstF3wcjVKtw AkKe47qA9htC1h1ba3EBfQBu IronPort-SDR: nP1RrEcAB42F3/wVp5F+eYWVlvZNZc4WFgBXwwh2id9iH5EekfB7Jqn8gOdNj1wwfnDP+5vbfh B2L4iPHLnpdzUNoTKaFLwOMXjvKJoOdBIcH1f9vqOCpCHJgSFzB0kJbRy+7RuqcqQWGHI+AuDp Pvh2zrwjfWJrztj+8TmRuL8znse5hv9NhKcJN+mKfNJDq+b/HPS75o64KbNtatozxBh7UjnWf0 IMBsGLP8EIm8EViowlMPieCWiKoZ57OcjX7saf6+3ODIbzf5bzownCktuh+GpOWThLyhyaOHQK hxM= WDCIronportException: Internal From: Naohiro Aota <naohiro.aota@wdc.com> To: linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, Nikolay Borisov <nborisov@suse.com>, Damien Le Moal <damien.lemoal@wdc.com>, Johannes Thumshirn <jthumshirn@suse.de>, Hannes Reinecke <hare@suse.com>, Anand Jain <anand.jain@oracle.com>, linux-fsdevel@vger.kernel.org, Naohiro Aota <naohiro.aota@wdc.com> Subject: [PATCH v5 23/28] btrfs: support dev-replace in HMZONED mode Date: Wed, 4 Dec 2019 17:17:30 +0900 Message-Id: <20191204081735.852438-24-naohiro.aota@wdc.com> In-Reply-To: <20191204081735.852438-1-naohiro.aota@wdc.com> References: <20191204081735.852438-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk
Series	btrfs: zoned block device support \| expand [v5,00/28] btrfs: zoned block device support [v5,01/28] btrfs: introduce HMZONED feature flag [v5,02/28] btrfs: Get zone information of zoned block devices [v5,03/28] btrfs: Check and enable HMZONED mode [v5,04/28] btrfs: disallow RAID5/6 in HMZONED mode [v5,05/28] btrfs: disallow space_cache in HMZONED mode [v5,06/28] btrfs: disallow NODATACOW in HMZONED mode [v5,07/28] btrfs: disable fallocate in HMZONED mode [v5,08/28] btrfs: implement log-structured superblock for HMZONED mode [v5,09/28] btrfs: align device extent allocation to zone boundary [v5,10/28] btrfs: do sequential extent allocation in HMZONED mode [v5,11/28] btrfs: make unmirroed BGs readonly only if we have at least one writable BG [v5,12/28] btrfs: ensure metadata space available on/after degraded mount in HMZONED [v5,13/28] btrfs: reset zones of unused block groups [v5,14/28] btrfs: redirty released extent buffers in HMZONED mode [v5,15/28] btrfs: serialize data allocation and submit IOs [v5,16/28] btrfs: implement atomic compressed IO submission [v5,17/28] btrfs: support direct write IO in HMZONED [v5,18/28] btrfs: serialize meta IOs on HMZONED mode [v5,19/28] btrfs: wait existing extents before truncating [v5,20/28] btrfs: avoid async checksum on HMZONED mode [v5,21/28] btrfs: disallow mixed-bg in HMZONED mode [v5,22/28] btrfs: disallow inode_cache in HMZONED mode [v5,23/28] btrfs: support dev-replace in HMZONED mode [v5,24/28] btrfs: enable relocation in HMZONED mode [v5,25/28] btrfs: relocate block group to repair IO failure in HMZONED [v5,26/28] btrfs: split alloc_log_tree() [v5,27/28] btrfs: enable tree-log on HMZONED mode [v5,28/28] btrfs: enable to mount HMZONED incompat flag

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 8827869f1744..323ba01ad8a9 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -83,6 +83,7 @@ struct btrfs_block_group { unsigned int has_caching_ctl:1; unsigned int removed:1; unsigned int wp_broken:1; + unsigned int to_copy:1; int disk_cache_state; diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 9286c6e0b636..6ac6aa0eb0b6 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -265,6 +265,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); device->fs_devices = fs_info->fs_devices; + ret = btrfs_get_dev_zone_info(device); + if (ret) + goto error; + mutex_lock(&fs_info->fs_devices->device_list_mutex); list_add(&device->dev_list, &fs_info->fs_devices->devices); fs_info->fs_devices->num_devices++; @@ -399,6 +403,176 @@ static char* btrfs_dev_name(struct btrfs_device *device) return rcu_str_deref(device->name); } +static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info, + struct btrfs_device *src_dev) +{ + struct btrfs_path *path; + struct btrfs_key key; + struct btrfs_key found_key; + struct btrfs_root *root = fs_info->dev_root; + struct btrfs_dev_extent *dev_extent = NULL; + struct btrfs_block_group *cache; + struct extent_buffer *l; + struct btrfs_trans_handle *trans; + int slot; + int ret = 0; + u64 chunk_offset, length; + + /* Do not use "to_copy" on non-HMZONED for now */ + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return 0; + + mutex_lock(&fs_info->chunk_mutex); + + /* ensulre we don't have pending new block group */ + while (fs_info->running_transaction && + !list_empty(&fs_info->running_transaction->dev_update_list)) { + mutex_unlock(&fs_info->chunk_mutex); + trans = btrfs_attach_transaction(root); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + mutex_lock(&fs_info->chunk_mutex); + if (ret == -ENOENT) + continue; + else + goto out; + } + + ret = btrfs_commit_transaction(trans); + mutex_lock(&fs_info->chunk_mutex); + if (ret) + goto out; + } + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + path->reada = READA_FORWARD; + path->search_commit_root = 1; + path->skip_locking = 1; + + key.objectid = src_dev->devid; + key.offset = 0ull; + key.type = BTRFS_DEV_EXTENT_KEY; + + while (1) { + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); + if (ret < 0) + break; + if (ret > 0) { + if (path->slots[0] >= + btrfs_header_nritems(path->nodes[0])) { + ret = btrfs_next_leaf(root, path); + if (ret < 0) + break; + if (ret > 0) { + ret = 0; + break; + } + } else { + ret = 0; + } + } + + l = path->nodes[0]; + slot = path->slots[0]; + + btrfs_item_key_to_cpu(l, &found_key, slot); + + if (found_key.objectid != src_dev->devid) + break; + + if (found_key.type != BTRFS_DEV_EXTENT_KEY) + break; + + if (found_key.offset < key.offset) + break; + + dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent); + length = btrfs_dev_extent_length(l, dev_extent); + + chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent); + + cache = btrfs_lookup_block_group(fs_info, chunk_offset); + if (!cache) + goto skip; + + spin_lock(&cache->lock); + cache->to_copy = 1; + spin_unlock(&cache->lock); + + btrfs_put_block_group(cache); + +skip: + key.offset = found_key.offset + length; + btrfs_release_path(path); + } + + btrfs_free_path(path); +out: + mutex_unlock(&fs_info->chunk_mutex); + + return ret; +} + +bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev, + struct btrfs_block_group *cache, + u64 physical) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct extent_map *em; + struct map_lookup *map; + u64 chunk_offset = cache->start; + int num_extents, cur_extent; + int i; + + /* Do not use "to_copy" on non-HMZONED for now */ + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return true; + + spin_lock(&cache->lock); + if (cache->removed) { + spin_unlock(&cache->lock); + return true; + } + spin_unlock(&cache->lock); + + em = btrfs_get_chunk_map(fs_info, chunk_offset, 1); + BUG_ON(IS_ERR(em)); + map = em->map_lookup; + + num_extents = cur_extent = 0; + for (i = 0; i < map->num_stripes; i++) { + /* we have more device extent to copy */ + if (srcdev != map->stripes[i].dev) + continue; + + num_extents++; + if (physical == map->stripes[i].physical) + cur_extent = i; + } + + free_extent_map(em); + + if (num_extents > 1 && cur_extent < num_extents - 1) { + /* + * Has more stripes on this device. Keep this BG + * readonly until we finish all the stripes. + */ + return false; + } + + /* last stripe on this device */ + spin_lock(&cache->lock); + cache->to_copy = 0; + spin_unlock(&cache->lock); + + return true; +} + static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, const char *tgtdev_name, u64 srcdevid, const char *srcdev_name, int read_src) @@ -440,6 +614,10 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, if (ret) return ret; + ret = mark_block_group_to_copy(fs_info, src_device); + if (ret) + return ret; + down_write(&dev_replace->rwsem); switch (dev_replace->replace_state) { case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED: diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h index 60b70dacc299..3911049a5f23 100644 --- a/fs/btrfs/dev-replace.h +++ b/fs/btrfs/dev-replace.h @@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info); void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info); int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info); int __pure btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace); +bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev, + struct btrfs_block_group *cache, + u64 physical); #endif diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index d1f326b6c4d4..69c4ce8ec83e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -34,6 +34,7 @@ #include "block-group.h" #include "rcu-string.h" #include "hmzoned.h" +#include "dev-replace.h" #undef SCRAMBLE_DELAYED_REFS @@ -1343,6 +1344,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, u64 length = stripe->length; u64 bytes; struct request_queue *req_q; + struct btrfs_dev_replace *dev_replace = + &fs_info->dev_replace; if (!stripe->dev->bdev) { ASSERT(btrfs_test_opt(fs_info, DEGRADED)); @@ -1351,15 +1354,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, req_q = bdev_get_queue(stripe->dev->bdev); /* zone reset in HMZONED mode */ - if (btrfs_can_zone_reset(dev, physical, length)) + if (btrfs_can_zone_reset(dev, physical, length)) { ret = btrfs_reset_device_zone(dev, physical, length, &bytes); - else if (blk_queue_discard(req_q)) + if (ret) + goto next; + if (!btrfs_dev_replace_is_ongoing( + dev_replace) || + dev != dev_replace->srcdev) + goto next; + + discarded_bytes += bytes; + /* send to replace target as well */ + ret = btrfs_reset_device_zone( + dev_replace->tgtdev, + physical, length, &bytes); + } else if (blk_queue_discard(req_q)) ret = btrfs_issue_discard(dev->bdev, physical, length, &bytes); else continue; +next: if (!ret) { discarded_bytes += bytes; } else if (ret != -EOPNOTSUPP) { diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index 55530782c1b1..08ba72a2c973 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -18,6 +18,7 @@ #include "locking.h" #include "space-info.h" #include "transaction.h" +#include "dev-replace.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -845,6 +846,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) for (i = 0; i < map->num_stripes; i++) { bool is_sequential; struct blk_zone zone; + struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace; + int dev_replace_is_ongoing = 0; device = map->stripes[i].dev; physical = map->stripes[i].physical; @@ -871,6 +874,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) */ btrfs_dev_clear_zone_empty(device, physical); + down_read(&dev_replace->rwsem); + dev_replace_is_ongoing = + btrfs_dev_replace_is_ongoing(dev_replace); + if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL) + btrfs_dev_clear_zone_empty(dev_replace->tgtdev, + physical); + up_read(&dev_replace->rwsem); + /* * The group is mapped to a sequential zone. Get the zone write * pointer to determine the allocation offset within the zone. @@ -1239,3 +1250,83 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, ASSERT(cache->meta_write_pointer == eb->start + eb->len); cache->meta_write_pointer = eb->start; } + +int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical, + u64 length) +{ + if (!btrfs_dev_is_sequential(device, physical)) + return -EOPNOTSUPP; + + return blkdev_issue_zeroout(device->bdev, + physical >> SECTOR_SHIFT, + length >> SECTOR_SHIFT, + GFP_NOFS, 0); +} + +static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical, + struct blk_zone *zone) +{ + struct btrfs_bio *bbio = NULL; + u64 mapped_length = PAGE_SIZE; + unsigned int nofs_flag; + int nmirrors; + int i, ret; + + ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical, + &mapped_length, &bbio); + if (ret || !bbio || mapped_length < PAGE_SIZE) { + btrfs_put_bbio(bbio); + return -EIO; + } + + if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) + return -EINVAL; + + nofs_flag = memalloc_nofs_save(); + nmirrors = (int)bbio->num_stripes; + for (i = 0; i < nmirrors; i++) { + u64 physical = bbio->stripes[i].physical; + struct btrfs_device *dev = bbio->stripes[i].dev; + + /* missing device */ + if (!dev->bdev) + continue; + + ret = btrfs_get_dev_zone(dev, physical, zone); + /* failing device */ + if (ret == -EIO || ret == -EOPNOTSUPP) + continue; + break; + } + memalloc_nofs_restore(nofs_flag); + + return ret; +} + +int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical, + u64 physical_start, u64 physical_pos) +{ + struct btrfs_fs_info *fs_info = tgt_dev->fs_info; + struct blk_zone zone; + u64 length; + u64 wp; + int ret; + + if (!btrfs_dev_is_sequential(tgt_dev, physical_pos)) + return 0; + + ret = read_zone_info(fs_info, logical, &zone); + if (ret) + return ret; + + wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT); + + if (physical_pos == wp) + return 0; + + if (physical_pos > wp) + return -EUCLEAN; + + length = wp - physical_pos; + return btrfs_hmzoned_issue_zeroout(tgt_dev, physical_pos, length); +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index 54f1affa6919..8558dd692b08 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -55,6 +55,10 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, struct btrfs_block_group **cache_ret); void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, struct extent_buffer *eb); +int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical, + u64 length); +int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical, + u64 physical_start, u64 physical_pos); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -133,6 +137,18 @@ static inline bool btrfs_check_meta_write_pointer( } static inline void btrfs_revert_meta_write_pointer( struct btrfs_block_group *cache, struct extent_buffer *eb) { } +static inline int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, + u64 physical, u64 length) +{ + return -EOPNOTSUPP; +} +static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, + u64 logical, + u64 physical_start, + u64 physical_pos) +{ + return -EOPNOTSUPP; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index af7cec962619..e88f32256ccc 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -168,6 +168,7 @@ struct scrub_ctx { int pages_per_rd_bio; int is_dev_replace; + u64 write_pointer; struct scrub_bio *wr_curr_bio; struct mutex wr_lock; @@ -1627,6 +1628,25 @@ static int scrub_write_page_to_dev_replace(struct scrub_block *sblock, return scrub_add_page_to_wr_bio(sblock->sctx, spage); } +static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical) +{ + int ret = 0; + u64 length; + + if (!btrfs_fs_incompat(sctx->fs_info, HMZONED)) + return 0; + + if (sctx->write_pointer < physical) { + length = physical - sctx->write_pointer; + + ret = btrfs_hmzoned_issue_zeroout(sctx->wr_tgtdev, + sctx->write_pointer, length); + if (!ret) + sctx->write_pointer = physical; + } + return ret; +} + static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, struct scrub_page *spage) { @@ -1649,6 +1669,13 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, if (sbio->page_count == 0) { struct bio *bio; + ret = fill_writer_pointer_gap(sctx, + spage->physical_for_dev_replace); + if (ret) { + mutex_unlock(&sctx->wr_lock); + return ret; + } + sbio->physical = spage->physical_for_dev_replace; sbio->logical = spage->logical; sbio->dev = sctx->wr_tgtdev; @@ -1710,6 +1737,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx) * doubled the write performance on spinning disks when measured * with Linux 3.5 */ btrfsic_submit_bio(sbio->bio); + + if (btrfs_fs_incompat(sctx->fs_info, HMZONED)) + sctx->write_pointer = sbio->physical + + sbio->page_count * PAGE_SIZE; } static void scrub_wr_bio_end_io(struct bio *bio) @@ -3040,6 +3071,46 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx, return ret < 0 ? ret : 0; } +static void sync_replace_for_hmzoned(struct scrub_ctx *sctx) +{ + if (!btrfs_fs_incompat(sctx->fs_info, HMZONED)) + return; + + sctx->flush_all_writes = true; + scrub_submit(sctx); + mutex_lock(&sctx->wr_lock); + scrub_wr_submit(sctx); + mutex_unlock(&sctx->wr_lock); + + wait_event(sctx->list_wait, + atomic_read(&sctx->bios_in_flight) == 0); +} + +static int sync_write_pointer_for_hmzoned(struct scrub_ctx *sctx, u64 logical, + u64 physical, u64 physical_end) +{ + struct btrfs_fs_info *fs_info = sctx->fs_info; + int ret = 0; + + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return 0; + + wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0); + + mutex_lock(&sctx->wr_lock); + if (sctx->write_pointer < physical_end) { + ret = btrfs_sync_zone_write_pointer(sctx->wr_tgtdev, logical, + physical, + sctx->write_pointer); + if (ret) + btrfs_err(fs_info, "failed to recover write pointer"); + } + mutex_unlock(&sctx->wr_lock); + btrfs_dev_clear_zone_empty(sctx->wr_tgtdev, physical); + + return ret; +} + static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct map_lookup *map, struct btrfs_device *scrub_dev, @@ -3052,7 +3123,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct btrfs_extent_item *extent; struct blk_plug plug; u64 flags; - int ret; + int ret, ret2; int slot; u64 nstripes; struct extent_buffer *l; @@ -3171,6 +3242,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, */ blk_start_plug(&plug); + if (sctx->is_dev_replace && + btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) { + mutex_lock(&sctx->wr_lock); + sctx->write_pointer = physical; + mutex_unlock(&sctx->wr_lock); + sctx->flush_all_writes = true; + } + /* * now find all extents for each stripe and scrub them */ @@ -3343,6 +3422,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, if (ret) goto out; + if (sctx->is_dev_replace) + sync_replace_for_hmzoned(sctx); + if (extent_logical + extent_len < key.objectid + bytes) { if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) { @@ -3410,6 +3492,15 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, blk_finish_plug(&plug); btrfs_free_path(path); btrfs_free_path(ppath); + + if (sctx->is_dev_replace && ret >= 0) { + ret2 = sync_write_pointer_for_hmzoned( + sctx, base + offset, + map->stripes[num].physical, physical_end); + if (ret2) + ret = ret2; + } + return ret < 0 ? ret : 0; } @@ -3465,6 +3556,25 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, return ret; } +static int finish_extent_writes_for_hmzoned(struct btrfs_root *root, + struct btrfs_block_group *cache) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_trans_handle *trans; + + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return 0; + + btrfs_wait_block_group_reservations(cache); + btrfs_wait_nocow_writers(cache); + btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length); + + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) + return PTR_ERR(trans); + return btrfs_commit_transaction(trans); +} + static noinline_for_stack int scrub_enumerate_chunks(struct scrub_ctx *sctx, struct btrfs_device *scrub_dev, u64 start, u64 end) @@ -3483,6 +3593,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, struct btrfs_key found_key; struct btrfs_block_group *cache; struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace; + bool do_chunk_alloc = btrfs_fs_incompat(fs_info, HMZONED); path = btrfs_alloc_path(); if (!path) @@ -3551,6 +3662,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, if (!cache) goto skip; + + if (sctx->is_dev_replace && + btrfs_fs_incompat(fs_info, HMZONED)) { + spin_lock(&cache->lock); + if (!cache->to_copy) { + spin_unlock(&cache->lock); + ro_set = 0; + goto done; + } + spin_unlock(&cache->lock); + } + /* * we need call btrfs_inc_block_group_ro() with scrubs_paused, * to avoid deadlock caused by: @@ -3579,7 +3702,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, * thread can't be triggered fast enough, and use up all space * of btrfs_super_block::sys_chunk_array */ - ret = btrfs_inc_block_group_ro(cache, false); + ret = btrfs_inc_block_group_ro(cache, do_chunk_alloc); + if (!ret && sctx->is_dev_replace) { + ret = finish_extent_writes_for_hmzoned(root, cache); + if (ret) { + btrfs_dec_block_group_ro(cache); + scrub_pause_off(fs_info); + btrfs_put_block_group(cache); + break; + } + } scrub_pause_off(fs_info); if (ret == 0) { @@ -3641,6 +3773,12 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, scrub_pause_off(fs_info); + if (sctx->is_dev_replace && + !btrfs_finish_block_group_to_copy(dev_replace->srcdev, + cache, found_key.offset)) + ro_set = 0; + +done: down_write(&dev_replace->rwsem); dev_replace->cursor_left = dev_replace->cursor_right; dev_replace->item_needs_writeback = 1; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d058ea613627..c4061b65a38b 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1413,6 +1413,9 @@ static int find_free_dev_extent_start(struct btrfs_device *device, search_start = max_t(u64, search_start, zone_size); search_start = btrfs_zone_align(device, search_start); + WARN_ON(device->zone_info && + !IS_ALIGNED(num_bytes, device->zone_info->zone_size)); + path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -5720,9 +5723,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info, return ret; } +static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical) +{ + struct btrfs_block_group *cache; + bool ret; + + /* non-HMZONED mode does not use "to_copy" flag */ + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return false; + + cache = btrfs_lookup_block_group(fs_info, logical); + + spin_lock(&cache->lock); + ret = cache->to_copy; + spin_unlock(&cache->lock); + + btrfs_put_block_group(cache); + return ret; +} + static void handle_ops_on_dev_replace(enum btrfs_map_op op, struct btrfs_bio **bbio_ret, struct btrfs_dev_replace *dev_replace, + u64 logical, int *num_stripes_ret, int *max_errors_ret) { struct btrfs_bio *bbio = *bbio_ret; @@ -5735,6 +5758,15 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op, if (op == BTRFS_MAP_WRITE) { int index_where_to_add; + /* + * a block group which have "to_copy" set will + * eventually copied by dev-replace process. We can + * avoid cloning IO here. + */ + if (is_block_group_to_copy(dev_replace->srcdev->fs_info, + logical)) + return; + /* * duplicate the write operations while the dev replace * procedure is running. Since the copying of the old disk to @@ -6145,8 +6177,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op)) { - handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes, - &max_errors); + handle_ops_on_dev_replace(op, &bbio, dev_replace, logical, + &num_stripes, &max_errors); } *bbio_ret = bbio;

[v5,23/28] btrfs: support dev-replace in HMZONED mode

Commit Message

Patch