
btrfs: zoned: don't read beyond write pointer for scrub

Message ID 51dc3cd7d8e7d77fb95277f35030165d8e12c8bd.1688384570.git.johannes.thumshirn@wdc.com (mailing list archive)
State New, archived
Series: btrfs: zoned: don't read beyond write pointer for scrub

Commit Message

Johannes Thumshirn July 3, 2023, 11:47 a.m. UTC
As an optimization, scrub_simple_mirror() performs reads in 64k chunks if
at least one block of this chunk has an extent allocated to it.

For zoned devices, this can lead to a read beyond the zone's write
pointer. But as there can't be any data beyond the write pointer, there's
no point in reading from it.

Cc: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

---
While this is only a marginal optimization for the current code, it will
become necessary for RAID on zoned drives using the RAID stripe tree.
---
 fs/btrfs/scrub.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Qu Wenruo July 3, 2023, 12:03 p.m. UTC | #1
On 2023/7/3 19:47, Johannes Thumshirn wrote:
> As an optimization, scrub_simple_mirror() performs reads in 64k chunks if
> at least one block of this chunk has an extent allocated to it.
>
> For zoned devices, this can lead to a read beyond the zone's write
> pointer. But as there can't be any data beyond the write pointer, there's
> no point in reading from it.
>
> Cc: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> ---
> While this is only a marginal optimization for the current code, it will
> become necessary for RAID on zoned drives using the RAID stripe tree.
> ---

Personally speaking, I'd prefer RST code to change
scrub_submit_initial_read() to only submit the read with set bits of
extent_sector_bitmap.
(Change it to something like scrub_stripe_submit_repair_read()).

The current change is a little too ad-hoc, and not that worthy by itself.
Especially considering that reading garbage (to reduce IOPS) is a
feature (if not a selling point of lower IOPS) of the new scrub code.

I totally understand RST would hate to read any garbage, as that would
make btrfs_map_block() complain heavily.

Thus in that case, I'd prefer the initial read for RST (regular zoned
are still free to read the garbage) to only submit the sectors covered
by extents.

Thanks,
Qu

>   fs/btrfs/scrub.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 38c103f13fd5..250317da1730 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -2060,13 +2060,19 @@ static int scrub_simple_stripe(struct scrub_ctx *sctx,
>   	int ret = 0;
>
>   	while (cur_logical < bg->start + bg->length) {
> +		u64 length = BTRFS_STRIPE_LEN;
> +
> +		if (btrfs_is_zoned(bg->fs_info) &&
> +		    cur_logical + BTRFS_STRIPE_LEN > bg->alloc_offset) {
> +			length = bg->alloc_offset - cur_logical;
> +		}
>   		/*
>   		 * Inside each stripe, RAID0 is just SINGLE, and RAID10 is
>   		 * just RAID1, so we can reuse scrub_simple_mirror() to scrub
>   		 * this stripe.
>   		 */
>   		ret = scrub_simple_mirror(sctx, bg, map, cur_logical,
> -					  BTRFS_STRIPE_LEN, device, cur_physical,
> +					  length, device, cur_physical,
>   					  mirror_num);
>   		if (ret)
>   			return ret;
Johannes Thumshirn July 3, 2023, 3:48 p.m. UTC | #2
On 03.07.23 14:03, Qu Wenruo wrote:
> 
> 
> On 2023/7/3 19:47, Johannes Thumshirn wrote:
>> As an optimization, scrub_simple_mirror() performs reads in 64k chunks if
>> at least one block of this chunk has an extent allocated to it.
>>
>> For zoned devices, this can lead to a read beyond the zone's write
>> pointer. But as there can't be any data beyond the write pointer, there's
>> no point in reading from it.
>>
>> Cc: Qu Wenruo <wqu@suse.com>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> ---
>> While this is only a marginal optimization for the current code, it will
>> become necessary for RAID on zoned drives using the RAID stripe tree.
>> ---
> 
> Personally speaking, I'd prefer RST code to change
> scrub_submit_initial_read() to only submit the read with set bits of
> extent_sector_bitmap.
> (Change it to something like scrub_stripe_submit_repair_read()).

I'll look into it.
 
> The current change is a little too ad-hoc, and not that worthy by itself.
> Especially considering that reading garbage (to reduce IOPS) is a
> feature (if not a selling point of lower IOPS) of the new scrub code.
> 
> I totally understand RST would hate to read any garbage, as that would
> make btrfs_map_block() complain heavily.

Yeah I've started to skip ranges that return ENOENT from btrfs_map_block()
in btrfs_submit_bio(). I /think/ that's also needed for holes.

> 
> Thus in that case, I'd prefer the initial read for RST (regular zoned
> are still free to read the garbage) to only submit the sectors covered
> by extents.
Qu Wenruo July 3, 2023, 10:41 p.m. UTC | #3
On 2023/7/3 23:48, Johannes Thumshirn wrote:
> On 03.07.23 14:03, Qu Wenruo wrote:
>>
>>
>> On 2023/7/3 19:47, Johannes Thumshirn wrote:
>>> As an optimization, scrub_simple_mirror() performs reads in 64k chunks if
>>> at least one block of this chunk has an extent allocated to it.
>>>
>>> For zoned devices, this can lead to a read beyond the zone's write
>>> pointer. But as there can't be any data beyond the write pointer, there's
>>> no point in reading from it.
>>>
>>> Cc: Qu Wenruo <wqu@suse.com>
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> ---
>>> While this is only a marginal optimization for the current code, it will
>>> become necessary for RAID on zoned drives using the RAID stripe tree.
>>> ---
>>
>> Personally speaking, I'd prefer RST code to change
>> scrub_submit_initial_read() to only submit the read with set bits of
>> extent_sector_bitmap.
>> (Change it to something like scrub_stripe_submit_repair_read()).
>
> I'll look into it.

I can craft an RFC for the use case soon.

Thanks,
Qu
>
>> The current change is a little too ad-hoc, and not that worthy by itself.
>> Especially considering that reading garbage (to reduce IOPS) is a
>> feature (if not a selling point of lower IOPS) of the new scrub code.
>>
>> I totally understand RST would hate to read any garbage, as that would
>> make btrfs_map_block() complain heavily.
>
> Yeah I've started to skip ranges that return ENOENT from btrfs_map_block()
> in btrfs_submit_bio(). I /think/ that's also needed for holes.
>
>>
>> Thus in that case, I'd prefer the initial read for RST (regular zoned
>> are still free to read the garbage) to only submit the sectors covered
>> by extents.
>
Johannes Thumshirn July 5, 2023, 8:42 a.m. UTC | #4
On 04.07.23 00:41, Qu Wenruo wrote:
> 
> 
> On 2023/7/3 23:48, Johannes Thumshirn wrote:
>> On 03.07.23 14:03, Qu Wenruo wrote:
>>>
>>>
>>> On 2023/7/3 19:47, Johannes Thumshirn wrote:
>>>> As an optimization, scrub_simple_mirror() performs reads in 64k chunks if
>>>> at least one block of this chunk has an extent allocated to it.
>>>>
>>>> For zoned devices, this can lead to a read beyond the zone's write
>>>> pointer. But as there can't be any data beyond the write pointer, there's
>>>> no point in reading from it.
>>>>
>>>> Cc: Qu Wenruo <wqu@suse.com>
>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>
>>>> ---
>>>> While this is only a marginal optimization for the current code, it will
>>>> become necessary for RAID on zoned drives using the RAID stripe tree.
>>>> ---
>>>
>>> Personally speaking, I'd prefer RST code to change
>>> scrub_submit_initial_read() to only submit the read with set bits of
>>> extent_sector_bitmap.
>>> (Change it to something like scrub_stripe_submit_repair_read()).
>>
>> I'll look into it.
> 
> I can craft an RFC for the use case soon.

Did you envision something like this (untested and needs cleanup of the ASSERT and
btrfs_get_chunk_map() call, but I think you get the point):

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 38c103f13fd5..f0301da98fb6 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -24,6 +24,7 @@
 #include "accessors.h"
 #include "file-item.h"
 #include "scrub.h"
+#include "raid-stripe-tree.h"
 
 /*
  * This is only the first step towards a full-features scrub. It reads all
@@ -1593,6 +1594,47 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
        }
 }
 
+static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
+                                           struct scrub_stripe *stripe)
+{
+       struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
+       struct btrfs_bio *bbio = NULL;
+       int mirror = stripe->mirror_num;
+       int i;
+
+       for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
+               struct page *page;
+               int pgoff;
+
+               page = scrub_stripe_get_page(stripe, i);
+               pgoff = scrub_stripe_get_page_offset(stripe, i);
+
+               /* The current sector cannot be merged, submit the bio. */
+               if (bbio &&
+                   ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
+                    bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
+                       ASSERT(bbio->bio.bi_iter.bi_size);
+                       atomic_inc(&stripe->pending_io);
+                       btrfs_submit_bio(bbio, mirror);
+                       bbio = NULL;
+               }
+
+               if (!bbio) {
+                       bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
+                               fs_info, scrub_read_endio, stripe);
+                       bbio->bio.bi_iter.bi_sector = (stripe->logical +
+                               (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
+               }
+
+               __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
+       }
+       if (bbio) {
+               ASSERT(bbio->bio.bi_iter.bi_size);
+               atomic_inc(&stripe->pending_io);
+               btrfs_submit_bio(bbio, mirror);
+       }
+}
+
 static void scrub_submit_initial_read(struct scrub_ctx *sctx,
                                      struct scrub_stripe *stripe)
 {
@@ -1604,6 +1646,21 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
        ASSERT(stripe->mirror_num > 0);
        ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
 
+       /* FIXME: cache map->type in stripe */
+       if (fs_info->stripe_root) {
+               struct extent_map *em;
+               u64 map_type;
+
+               em = btrfs_get_chunk_map(fs_info, stripe->logical,
+                                        BTRFS_STRIPE_LEN);
+               ASSERT(em);
+               map_type = em->map_lookup->type;
+               free_extent_map(em);
+
+               if (btrfs_need_stripe_tree_update(fs_info, map_type))
+                       return scrub_submit_extent_sector_read(sctx, stripe);
+       }
+
        bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
                               scrub_read_endio, stripe);
Qu Wenruo July 5, 2023, 8:54 a.m. UTC | #5
On 2023/7/5 16:42, Johannes Thumshirn wrote:
> On 04.07.23 00:41, Qu Wenruo wrote:
[...]
>>
>> I can craft an RFC for the use case soon.
> 
> Did you envision sth. like this (untested and needs cleanup of ASSERT and
> btrfs_chunk_map() call but I think you get the point):

Yes, that's exactly the idea.

Just some hidden traps in my head:

[...]
> +static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> +                                           struct scrub_stripe *stripe)
> +{
> +       struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
> +       struct btrfs_bio *bbio = NULL;
> +       int mirror = stripe->mirror_num;
> +       int i;
> +
Unlike the initial read, which should only have one bio in flight, here
we can have a race where the first bio we submitted finishes before we
even submit the next one.

This can lead to the delayed work being queued prematurely (and even
queued multiple times).

So here we should increase the atomic before we enter the loop, to
ensure that a bio completing before we exit the loop won't queue the work.

> +       for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> +               struct page *page;
> +               int pgoff;
> +
> +               page = scrub_stripe_get_page(stripe, i);
> +               pgoff = scrub_stripe_get_page_offset(stripe, i);
> +
> +               /* The current sector cannot be merged, submit the bio. */
> +               if (bbio &&
> +                   ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> +                    bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
> +                       ASSERT(bbio->bio.bi_iter.bi_size);
> +                       atomic_inc(&stripe->pending_io);
> +                       btrfs_submit_bio(bbio, mirror);
> +                       bbio = NULL;
> +               }
> +
> +               if (!bbio) {
> +                       bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
> +                               fs_info, scrub_read_endio, stripe);
> +                       bbio->bio.bi_iter.bi_sector = (stripe->logical +
> +                               (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
> +               }
> +
> +               __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
> +       }
> +       if (bbio) {
> +               ASSERT(bbio->bio.bi_iter.bi_size);
> +               atomic_inc(&stripe->pending_io);
> +               btrfs_submit_bio(bbio, mirror);
> +       }

And after we have submitted the last bbio, we should decrease the 
pending_io and check if we need to queue the work:

	if (atomic_dec_and_test(&stripe->pending_io)) {
		INIT_WORK();
		queue_work();
	}
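
Put together, the submission path would then look roughly like this (untested
sketch; the work item and worker names are assumed from the existing
scrub_read_endio() completion path and may need adjusting):

	/* Extra reference so an early bio completion can't queue the work. */
	atomic_inc(&stripe->pending_io);

	for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
		/*
		 * Build and submit the bbios exactly as in the draft above;
		 * every btrfs_submit_bio() is still preceded by its own
		 * atomic_inc(&stripe->pending_io).
		 */
	}
	if (bbio) {
		atomic_inc(&stripe->pending_io);
		btrfs_submit_bio(bbio, mirror);
	}

	/* Drop the extra ref; if all bios already finished, queue the work here. */
	if (atomic_dec_and_test(&stripe->pending_io)) {
		INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
		queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
	}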

> +}
> +
>   static void scrub_submit_initial_read(struct scrub_ctx *sctx,
>                                        struct scrub_stripe *stripe)
>   {
> @@ -1604,6 +1646,21 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
>          ASSERT(stripe->mirror_num > 0);
>          ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
>   
> +       /* FIXME: cache map->type in stripe */
> +       if (fs_info->stripe_root) {
> +               struct extent_map *em;
> +               u64 map_type;
> +
> +               em = btrfs_get_chunk_map(fs_info, stripe->logical,
> +                                        BTRFS_STRIPE_LEN);
> +               ASSERT(em);
> +               map_type = em->map_lookup->type;

IIRC we have stripe->bg to grab the type.

Thanks,
Qu
> +               free_extent_map(em);
> +
> +               if (btrfs_need_stripe_tree_update(fs_info, map_type))
> +                       return scrub_submit_extent_sector_read(sctx, stripe);
> +       }
> +
>          bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
>                                 scrub_read_endio, stripe);
>
Johannes Thumshirn July 5, 2023, 9:06 a.m. UTC | #6
On 05.07.23 10:54, Qu Wenruo wrote:
> 
> 
> On 2023/7/5 16:42, Johannes Thumshirn wrote:
>> On 04.07.23 00:41, Qu Wenruo wrote:
> [...]
>>>
>>> I can craft an RFC for the use case soon.
>>
>> Did you envision sth. like this (untested and needs cleanup of ASSERT and
>> btrfs_chunk_map() call but I think you get the point):
> 
> Yes, that's exactly the idea.
> 
> Just some hidden traps in my head:
> 
> [...]
>> +static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
>> +                                           struct scrub_stripe *stripe)
>> +{
>> +       struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
>> +       struct btrfs_bio *bbio = NULL;
>> +       int mirror = stripe->mirror_num;
>> +       int i;
>> +
> Unlike the initial read which should only have one bio flying, this time 
> we can have a race where the first bio we submitted finished before we 
> even submit the next bio.
> 
> This can lead to the delayed work to be queued immaturely (and even 
> queued multiple times).
> 
> So here we should increase the atomic before we enter the loop, to 
> ensure any bio ended before we exit the loop won't queue the work.
> 
>> +       for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
>> +               struct page *page;
>> +               int pgoff;
>> +
>> +               page = scrub_stripe_get_page(stripe, i);
>> +               pgoff = scrub_stripe_get_page_offset(stripe, i);
>> +
>> +               /* The current sector cannot be merged, submit the bio. */
>> +               if (bbio &&
>> +                   ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
>> +                    bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
>> +                       ASSERT(bbio->bio.bi_iter.bi_size);
>> +                       atomic_inc(&stripe->pending_io);
>> +                       btrfs_submit_bio(bbio, mirror);
>> +                       bbio = NULL;
>> +               }
>> +
>> +               if (!bbio) {
>> +                       bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
>> +                               fs_info, scrub_read_endio, stripe);
>> +                       bbio->bio.bi_iter.bi_sector = (stripe->logical +
>> +                               (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
>> +               }
>> +
>> +               __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
>> +       }
>> +       if (bbio) {
>> +               ASSERT(bbio->bio.bi_iter.bi_size);
>> +               atomic_inc(&stripe->pending_io);
>> +               btrfs_submit_bio(bbio, mirror);
>> +       }
> 
> And after we have submitted the last bbio, we should decrease the 
> pending_io and check if we need to queue the work:
> 
> 	if (atomic_dec_and_test(&stripe->pending_io)) {
> 		INIT_WORK();
> 		queue_work();
> 	}
> 

OK let me incorporate that and run btrfs/060.

>> +}
>> +
>>   static void scrub_submit_initial_read(struct scrub_ctx *sctx,
>>                                        struct scrub_stripe *stripe)
>>   {
>> @@ -1604,6 +1646,21 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
>>          ASSERT(stripe->mirror_num > 0);
>>          ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
>>   
>> +       /* FIXME: cache map->type in stripe */
>> +       if (fs_info->stripe_root) {
>> +               struct extent_map *em;
>> +               u64 map_type;
>> +
>> +               em = btrfs_get_chunk_map(fs_info, stripe->logical,
>> +                                        BTRFS_STRIPE_LEN);
>> +               ASSERT(em);
>> +               map_type = em->map_lookup->type;
> 
> IIRC we have stripe->bg to grab the type.

Indeed we do, that'll simplify things a lot.
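
E.g. the whole chunk map lookup could then shrink to something like this (rough
sketch, assuming the chunk type bits are available via the block group's flags
field):

	if (fs_info->stripe_root &&
	    btrfs_need_stripe_tree_update(fs_info, stripe->bg->flags))
		return scrub_submit_extent_sector_read(sctx, stripe);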

Patch

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 38c103f13fd5..250317da1730 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2060,13 +2060,19 @@  static int scrub_simple_stripe(struct scrub_ctx *sctx,
 	int ret = 0;
 
 	while (cur_logical < bg->start + bg->length) {
+		u64 length = BTRFS_STRIPE_LEN;
+
+		if (btrfs_is_zoned(bg->fs_info) &&
+		    cur_logical + BTRFS_STRIPE_LEN > bg->alloc_offset) {
+			length = bg->alloc_offset - cur_logical;
+		}
 		/*
 		 * Inside each stripe, RAID0 is just SINGLE, and RAID10 is
 		 * just RAID1, so we can reuse scrub_simple_mirror() to scrub
 		 * this stripe.
 		 */
 		ret = scrub_simple_mirror(sctx, bg, map, cur_logical,
-					  BTRFS_STRIPE_LEN, device, cur_physical,
+					  length, device, cur_physical,
 					  mirror_num);
 		if (ret)
 			return ret;