
[1/3] dm: don't grab target io reference in dm_zone_map_bio

Message ID 20220408171254.935171-2-ming.lei@redhat.com (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer
Series: dm: improvement on io referencing and io poll

Commit Message

Ming Lei April 8, 2022, 5:12 p.m. UTC
dm_zone_map_bio() is only called from __map_bio(), in which the io's
reference has already been grabbed, and that reference is not released
until the bio is submitted, so there is no need to grab it again in
dm_zone_map_bio().

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-core.h |  7 -------
 drivers/md/dm-zone.c | 10 ----------
 drivers/md/dm.c      |  7 ++++++-
 3 files changed, 6 insertions(+), 18 deletions(-)
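
For context, the reference lifecycle the commit message relies on looks
roughly like this (a simplified sketch reconstructed from the discussion
below, not the exact kernel source):

	dm_submit_bio()
		dm_split_and_process_bio()
			init_clone_info()		/* io_count = 1 */
			__map_bio()
				dm_io_inc_pending(io)	/* io_count = 2 */
				dm_zone_map_bio(tio)	/* reference already held */
			dm_io_dec_pending()		/* drops one reference */

	/* the original bio completes only once io_count reaches 0 */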

Comments

Damien Le Moal April 11, 2022, 12:18 a.m. UTC | #1
On 4/9/22 02:12, Ming Lei wrote:
> dm_zone_map_bio() is only called from __map_bio(), in which the io's
> reference has already been grabbed, and that reference is not released
> until the bio is submitted, so there is no need to grab it again in
> dm_zone_map_bio().

I do not think that this patch is correct. Removing the extra reference on
the io can lead to problems if the io is completed in the target
->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
to reference the original bio, so we must ensure that it is still around,
which means keeping an extra IO reference to avoid dm_io_dec_pending()
calling bio_endio() on the orig bio before dm_zone_map_bio_end() has a
chance to run.
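
Sketched as a timeline, the sequence I am worried about is roughly the
following (hypothetical, assuming the extra reference is removed and the
target completes the clone from inside its ->map() method):

	dm_zone_map_bio(tio)
		r = ti->type->map(ti, clone);
			bio_endio(clone)
				clone_endio()
					dm_io_dec_pending(io)	/* io may go away */
		...
		dm_zone_map_bio_end(...)	/* io/orig_bio possibly gone */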

> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/md/dm-core.h |  7 -------
>  drivers/md/dm-zone.c | 10 ----------
>  drivers/md/dm.c      |  7 ++++++-
>  3 files changed, 6 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 4277853c7535..13136bd47cb3 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -277,13 +277,6 @@ static inline void dm_io_set_flag(struct dm_io *io, unsigned int bit)
>  	io->flags |= (1U << bit);
>  }
>  
> -static inline void dm_io_inc_pending(struct dm_io *io)
> -{
> -	atomic_inc(&io->io_count);
> -}
> -
> -void dm_io_dec_pending(struct dm_io *io, blk_status_t error);
> -
>  static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
>  {
>  	return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index c1ca9be4b79e..85d3c158719f 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -545,13 +545,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		return DM_MAPIO_KILL;
>  	}
>  
> -	/*
> -	 * The target map function may issue and complete the IO quickly.
> -	 * Take an extra reference on the IO to make sure it does disappear
> -	 * until we run dm_zone_map_bio_end().
> -	 */
> -	dm_io_inc_pending(io);
> -
>  	/* Let the target do its work */
>  	r = ti->type->map(ti, clone);
>  	switch (r) {
> @@ -580,9 +573,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		break;
>  	}
>  
> -	/* Drop the extra reference on the IO */
> -	dm_io_dec_pending(io, sts);
> -
>  	if (sts != BLK_STS_OK)
>  		return DM_MAPIO_KILL;
>  
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..b8424a4b4725 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -929,11 +929,16 @@ static inline bool dm_tio_is_normal(struct dm_target_io *tio)
>  		!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
>  }
>  
> +static void dm_io_inc_pending(struct dm_io *io)
> +{
> +	atomic_inc(&io->io_count);
> +}
> +
>  /*
>   * Decrements the number of outstanding ios that a bio has been
>   * cloned into, completing the original io if necc.
>   */
> -void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
> +static void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
>  {
>  	/* Push-back supersedes any I/O errors */
>  	if (unlikely(error)) {
Ming Lei April 11, 2022, 12:36 a.m. UTC | #2
On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> On 4/9/22 02:12, Ming Lei wrote:
> > [...]
> 
> I do not think that this patch is correct. Removing the extra reference on
> the io can lead to problems if the io is completed in the target
> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs

__map_bio():
	...
	dm_io_inc_pending(io);
	...
	dm_zone_map_bio(tio);
	...

dm_submit_bio():
	dm_split_and_process_bio
		init_clone_info(&ci, md, map, bio)
			atomic_set(&io->io_count, 1)
		...
		__map_bio()
		...
		dm_io_dec_pending()

The target io can only be completed after the above line returns,
so the extra reference in dm_zone_map_bio() doesn't make any difference,
does it?


Thanks, 
Ming
Damien Le Moal April 11, 2022, 12:50 a.m. UTC | #3
On 4/11/22 09:36, Ming Lei wrote:
> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
>> [...]
>>
>> I do not think that this patch is correct. Removing the extra reference on
>> the io can lead to problems if the io is completed in the target
>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
> 
> __map_bio():
> 	...
> 	dm_io_inc_pending(io);
> 	...
> 	dm_zone_map_bio(tio);
> 	...

dm-crypt (for instance) may terminate the clone bio immediately in its
->map() function context, resulting in the bio_endio(clone) ->
clone_endio() -> dm_io_dec_pending() call chain.

With that, the io is gone and dm_zone_map_bio_end() will not have a valid
reference on the orig bio.

What am I missing here?

> 
> dm_submit_bio():
> 	dm_split_and_process_bio
> 		init_clone_info(&ci, md, map, bio)
> 			atomic_set(&io->io_count, 1)
> 		...
> 		__map_bio()
> 		...
> 		dm_io_dec_pending()
> 
> The target io can only be completed after the above line returns,
> so the extra reference in dm_zone_map_bio() doesn't make any difference,
> does it?
Ming Lei April 11, 2022, 1:04 a.m. UTC | #4
On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
> [...]
> 
> dm-crypt (for instance) may terminate the clone bio immediately in its
> ->map() function context, resulting in the bio_endio(clone) ->
> clone_endio() -> dm_io_dec_pending() call chain.
> 
> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
> reference on the orig bio.

Any target can complete the io during ->map. Nothing looks special about
dm-crypt or dm-zone here, so why does only dm-zone need the extra reference?

The reference counter is initialized to 1 in init_clone_info(), and
dm_io_inc_pending() in __map_bio() increases it to 2, so after the call chain
you mentioned is done the counter becomes 1. The original bio can't be
completed until dm_io_dec_pending() in dm_split_and_process_bio() is called.

Or maybe I am missing some extra requirement from dm-zone?
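
As a side note, the counting can be reproduced with a small user-space
analogue of the pattern (hypothetical demo code, not the DM implementation;
the names only mirror the DM helpers for readability):

	#include <stdatomic.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct io {
		atomic_int io_count;
	};

	/* analogue of dm_io_dec_pending(): the last ref completes the io */
	static void io_put(struct io *io)
	{
		if (atomic_fetch_sub(&io->io_count, 1) == 1) {
			printf("original bio completed\n");
			free(io);
		}
	}

	/* analogue of a ->map() that completes its clone immediately */
	static void map_and_complete_clone(struct io *io)
	{
		io_put(io);			/* clone_endio() path */
	}

	int main(void)
	{
		struct io *io = malloc(sizeof(*io));

		atomic_init(&io->io_count, 1);		/* init_clone_info() */
		atomic_fetch_add(&io->io_count, 1);	/* __map_bio() */
		map_and_complete_clone(io);	/* 2 -> 1, io still alive */
		io_put(io);			/* final ref: 1 -> 0, complete */
		return 0;
	}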


thanks,
Ming
Damien Le Moal April 11, 2022, 1:08 a.m. UTC | #5
On 4/11/22 10:04, Ming Lei wrote:
> [...]
> 
> Any target can complete the io during ->map. Nothing looks special about
> dm-crypt or dm-zone here, so why does only dm-zone need the extra reference?
> 
> The reference counter is initialized to 1 in init_clone_info(), and
> dm_io_inc_pending() in __map_bio() increases it to 2, so after the call chain
> you mentioned is done the counter becomes 1. The original bio can't be
> completed until dm_io_dec_pending() in dm_split_and_process_bio() is called.
> 
> Or maybe I am missing some extra requirement from dm-zone?

I added that extra reference as I discovered it was needed when testing
zone append emulation with dm-crypt, back when it was implemented (the
emulation, not dm-crypt :)). Let me revisit this.

Damien Le Moal April 11, 2022, 2:19 a.m. UTC | #6
On 4/11/22 10:04, Ming Lei wrote:
> [...]
> 
> Or maybe I am missing some extra requirement from dm-zone?

Something is wrong... With and without your patch, when I set up a dm-crypt
target on top of a zoned nullblk device, I get:

[  292.596454] device-mapper: uevent: version 1.0.3
[  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
initialised: dm-devel@redhat.com
[  292.732217] general protection fault, probably for non-canonical
address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
[  292.743724] KASAN: null-ptr-deref in range
[0x0000000000000010-0x0000000000000017]
[  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
5.18.0-rc2+ #1458
[  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
02/21/2020
[  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
[  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
<0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
[  292.790946] RSP: 0018:ffff8883cd847218 EFLAGS: 00010202
[  292.796260] RAX: dffffc0000000000 RBX: ffff8885c5bcdce8 RCX:
1ffff11034470027
[  292.803496] RDX: 0000000000000002 RSI: 0000000000000008 RDI:
ffff8885c5bcdc60
[  292.810732] RBP: 1ffff11079b08e4f R08: ffff8881a23801d8 R09:
ffff8881a238013f
[  292.817970] R10: ffff88821c594040 R11: 0000000000000001 R12:
0000000000000000
[  292.825206] R13: ffff8885c5bcdc50 R14: ffff8881a2380000 R15:
ffff8885c5bcdd08
[  292.832442] FS:  00007fe169b06b40(0000) GS:ffff88880fc00000(0000)
knlGS:0000000000000000
[  292.840646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  292.846481] CR2: 00007ffd80a57a38 CR3: 00000004b91b0006 CR4:
00000000007706f0
[  292.853722] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  292.860957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  292.868194] PKRU: 55555554
[  292.870949] Call Trace:
[  292.873446]  <TASK>
[  292.875593]  ? lock_is_held_type+0xd7/0x130
[  292.879860]  ? dm_set_zones_restrictions+0x8f0/0x8f0 [dm_mod]
[  292.885718]  ? __module_address.part.0+0x25/0x300
[  292.890509]  ? is_module_address+0x43/0x70
[  292.894674]  ? static_obj+0x8a/0xc0
[  292.898233]  __map_bio+0x352/0x740 [dm_mod]
[  292.902512]  dm_submit_bio+0x72f/0x17a0 [dm_mod]
[  292.907222]  ? find_held_lock+0x2c/0x110
[  292.911217]  ? __send_empty_flush+0x2b0/0x2b0 [dm_mod]
[  292.916459]  ? lock_release+0x3b2/0x6f0
[  292.920368]  ? lock_downgrade+0x6d0/0x6d0
[  292.924458]  ? lock_is_held_type+0xd7/0x130
[  292.928714]  __submit_bio+0x12a/0x1f0
[  292.932450]  submit_bio_noacct_nocheck+0x324/0x840
[  292.937324]  ? should_fail_request+0x70/0x70
[  292.941670]  ? rcu_read_lock_sched_held+0x3f/0x70
[  292.946458]  ? submit_bio_noacct+0xfa4/0x1530
[  292.950888]  ? lock_is_held_type+0xd7/0x130
[  292.957813]  mpage_readahead+0x32e/0x4b0
[  292.964470]  ? do_mpage_readpage+0x17c0/0x17c0
[  292.971661]  ? blkdev_write_begin+0x20/0x20
[  292.978567]  ? lock_release+0x3b2/0x6f0
[  292.985073]  ? folio_add_lru+0x217/0x3f0
[  292.991620]  ? lock_downgrade+0x6d0/0x6d0
[  292.998237]  read_pages+0x18c/0x990
[  293.004308]  ? memcg_list_lru_alloc+0x810/0x810
[  293.011404]  ? folio_add_lru+0x238/0x3f0
[  293.017805]  ? file_ra_state_init+0xd0/0xd0
[  293.024395]  ? policy_node+0xbb/0x140
[  293.030416]  page_cache_ra_unbounded+0x258/0x410
[  293.037376]  force_page_cache_ra+0x281/0x400
[  293.043944]  filemap_get_pages+0x25e/0x1290
[  293.050342]  ? __lock_acquire+0x1603/0x6180
[  293.056654]  ? filemap_add_folio+0x140/0x140
[  293.063002]  ? lock_is_held_type+0xd7/0x130
[  293.069236]  filemap_read+0x29e/0x910
[  293.074927]  ? filemap_get_pages+0x1290/0x1290
[  293.081378]  ? __lock_acquire+0x1603/0x6180
[  293.087558]  blkdev_read_iter+0x20c/0x640
[  293.093529]  ? cp_new_stat+0x47a/0x590
[  293.099190]  ? cp_old_stat+0x470/0x470
[  293.104795]  new_sync_read+0x2e4/0x520
[  293.110362]  ? __x64_sys_lseek+0x1d0/0x1d0
[  293.116269]  ? lock_acquire+0x1b2/0x4d0
[  293.121928]  ? find_held_lock+0x2c/0x110
[  293.127648]  vfs_read+0x312/0x430
[  293.132755]  ksys_read+0xf3/0x1d0
[  293.137863]  ? __x64_sys_pwrite64+0x1f0/0x1f0
[  293.144032]  do_syscall_64+0x35/0x80
[  293.149391]  entry_SYSCALL_64_after_hwframe+0x44/0xae

The crash is at: drivers/md/dm-zone.c:499, which is
dm_need_zone_wp_tracking() called from dm_zone_map_bio(). The orig_bio
pointer is invalid. Weird. Investigating.

Also checking why our weekly test runs did not catch this.
Damien Le Moal April 11, 2022, 2:55 a.m. UTC | #7
On 4/11/22 11:19, Damien Le Moal wrote:
> [...]
> 
> Something is wrong... With and without your patch, when I set up a dm-crypt
> target on top of a zoned nullblk device, I get:
> 
> [... KASAN splat snipped, see the previous message ...]
> 
> The crash is at: drivers/md/dm-zone.c:499, which is
> dm_need_zone_wp_tracking() called from dm_zone_map_bio(). The orig_bio
> pointer is invalid. Weird. Investigating.
> 
> Also checking why our weekly test runs did not catch this.

This fixes the issue:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..3dd6735450c5 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
        io->status = 0;
        atomic_set(&io->io_count, 1);
        this_cpu_inc(*md->pending_io);
-       io->orig_bio = NULL;
+       io->orig_bio = bio;
        io->md = md;
        io->map_task = current;
        spin_lock_init(&io->lock);

Otherwise, the dm-zone.c code sees a NULL orig_bio.
However, this change may be messing up the bio accounting. Need to check that.
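
For reference, the ordering that leads to the NULL dereference is roughly the
following (simplified from the code paths discussed in this thread):

	dm_split_and_process_bio()
		init_clone_info() / alloc_io()
			io->orig_bio = NULL;	/* since 0fbb4d93b38b */
		__map_bio()
			dm_zone_map_bio(tio)
				/* dereferences io->orig_bio: NULL here */
		...
		/* io->orig_bio only assigned at the end, i.e. too late */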
Damien Le Moal April 11, 2022, 3:10 a.m. UTC | #8
On 4/11/22 11:55, Damien Le Moal wrote:
> [...]
> 
> This fixes the issue:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..3dd6735450c5 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>         io->status = 0;
>         atomic_set(&io->io_count, 1);
>         this_cpu_inc(*md->pending_io);
> -       io->orig_bio = NULL;
> +       io->orig_bio = bio;
>         io->md = md;
>         io->map_task = current;
>         spin_lock_init(&io->lock);
> 
> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> However, this change may be messing up the bio accounting. Need to check that.

Forgot to mention that with the above fix + your patch applied, everything
seems to be fine. No issues. The extra reference is indeed not needed;
only the missing orig_bio initialization is a problem.
Ming Lei April 11, 2022, 7:34 a.m. UTC | #9
On Mon, Apr 11, 2022 at 11:55:14AM +0900, Damien Le Moal wrote:
> [...]
> 
> This fixes the issue:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..3dd6735450c5 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>         io->status = 0;
>         atomic_set(&io->io_count, 1);
>         this_cpu_inc(*md->pending_io);
> -       io->orig_bio = NULL;
> +       io->orig_bio = bio;
>         io->md = md;
>         io->map_task = current;
>         spin_lock_init(&io->lock);
> 
> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> However, this change may be messing up the bio accounting. Need to check that.

Looks like this is a recent regression introduced by:

commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")


Thanks,
Ming
Damien Le Moal April 11, 2022, 7:42 a.m. UTC | #10
On 4/11/22 16:34, Ming Lei wrote:
> On Mon, Apr 11, 2022 at 11:55:14AM +0900, Damien Le Moal wrote:
>> [...]
>>
>> This fixes the issue:
>>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 3c5fad7c4ee6..3dd6735450c5 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>>         io->status = 0;
>>         atomic_set(&io->io_count, 1);
>>         this_cpu_inc(*md->pending_io);
>> -       io->orig_bio = NULL;
>> +       io->orig_bio = bio;
>>         io->md = md;
>>         io->map_task = current;
>>         spin_lock_init(&io->lock);
>>
>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
>> However, this change may be messing up the bio accounting. Need to check that.
> 
> Looks like this is a recent regression introduced by:
> 
> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")

Yep, saw that. Problem is, I really do not understand that change setting
io->orig_bio *after* __map_bio() is called. It seems that the accounting
is done on each fragment of the orig_bio instead of once for the entire
BIO... So my "fix" above seems wrong. Apart from passing orig_bio along
as an argument to __map_bio() from __split_and_process_bio(), I do not
think my change is correct. Thoughts?
Damien Le Moal April 11, 2022, 9:40 a.m. UTC | #11
On 4/9/22 02:12, Ming Lei wrote:
> dm_zone_map_bio() is only called from __map_bio(), in which the io's
> reference has already been grabbed, and that reference is not released
> until the bio is submitted, so there is no need to grab it again in
> dm_zone_map_bio().
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Ming Lei April 11, 2022, 2:18 p.m. UTC | #12
On Mon, Apr 11, 2022 at 04:42:59PM +0900, Damien Le Moal wrote:
> [...]
> 
> Yep, saw that. Problem is, I really do not understand that change setting
> io->orig_bio *after* __map_bio() is called. It seems that the accounting
> is done on each fragment of the orig_bio instead of once for the entire
> BIO... So my "fix" above seems wrong. Apart from passing orig_bio along
> as an argument to __map_bio() from __split_and_process_bio(), I do not
> think my change is correct. Thoughts?

Frankly speaking, both changing ->orig_bio after the split and setting
->orig_bio after ->map() look ugly & tricky. The following change should
avoid the issue and, meanwhile, simplify the dm accounting a bit:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..f1fe83113608 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io->status = 0;
 	atomic_set(&io->io_count, 1);
 	this_cpu_inc(*md->pending_io);
-	io->orig_bio = NULL;
+	io->orig_bio = bio;
 	io->md = md;
 	io->map_task = current;
 	spin_lock_init(&io->lock);
@@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
 	 * Account io->origin_bio to DM dev on behalf of target
 	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
 	 */
-	if (io->map_task == current) {
+	if (io->map_task == current)
 		/* Still in target's map function */
 		dm_io_set_flag(io, DM_IO_START_ACCT);
-	} else {
-		/*
-		 * Called by another thread, managed by DM target,
-		 * wait for dm_split_and_process_bio() to store
-		 * io->orig_bio
-		 */
-		while (unlikely(!smp_load_acquire(&io->orig_bio)))
-			msleep(1);
+	else
 		dm_start_io_acct(io, clone);
-	}
 
 	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
 			      tio->old_sector);
@@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 				     struct dm_table *map, struct bio *bio)
 {
 	struct clone_info ci;
-	struct bio *orig_bio = NULL;
+	struct bio *new_bio = NULL;
 	int error = 0;
 
 	init_clone_info(&ci, md, map, bio);
@@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	if (error || !ci.sector_count)
 		goto out;
 
-	/*
-	 * Remainder must be passed to submit_bio_noacct() so it gets handled
-	 * *after* bios already submitted have been completely processed.
-	 * We take a clone of the original to store in ci.io->orig_bio to be
-	 * used by dm_end_io_acct() and for dm_io_complete() to use for
-	 * completion handling.
-	 */
-	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
-			     GFP_NOIO, &md->queue->bio_split);
-	bio_chain(orig_bio, bio);
-	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
-	submit_bio_noacct(bio);
+	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
+			&md->queue->bio_split);
+	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
+	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
+	bio_chain(new_bio, bio);
+	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
+	submit_bio_noacct(new_bio);
 out:
-	if (!orig_bio)
-		orig_bio = bio;
-	smp_store_release(&ci.io->orig_bio, orig_bio);
 	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
 		dm_start_io_acct(ci.io, NULL);
Damien Le Moal April 11, 2022, 11:33 p.m. UTC | #13
On 4/11/22 23:18, Ming Lei wrote:
>>>> This fixes the issue:
>>>>
>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>> index 3c5fad7c4ee6..3dd6735450c5 100644
>>>> --- a/drivers/md/dm.c
>>>> +++ b/drivers/md/dm.c
>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
>>>> *md, struct bio *bio)
>>>>         io->status = 0;
>>>>         atomic_set(&io->io_count, 1);
>>>>         this_cpu_inc(*md->pending_io);
>>>> -       io->orig_bio = NULL;
>>>> +       io->orig_bio = bio;
>>>>         io->md = md;
>>>>         io->map_task = current;
>>>>         spin_lock_init(&io->lock);
>>>>
>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
>>>> However, this change may be messing up the bio accounting. Need to check that.
>>>
>>> Looks it is one recent regression since:
>>>
>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
>>
>> Yep, saw that. Problem is, I really do not understand that change setting
>> io->orig_bio *after* __map_bio() is called. It seems that the accounting
>> is done on each fragment of the orig_bio instead of once for the entire
>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
>> an argument to  __map_bio() from __split_and_process_bio(), I do not think
>> my change is correct. Thoughts ?
> 
> Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
> after ->map() looks ugly & tricky, and the following change should avoid the
> issue, meantime simplify dm accounting a bit:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..f1fe83113608 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>  	io->status = 0;
>  	atomic_set(&io->io_count, 1);
>  	this_cpu_inc(*md->pending_io);
> -	io->orig_bio = NULL;
> +	io->orig_bio = bio;
>  	io->md = md;
>  	io->map_task = current;
>  	spin_lock_init(&io->lock);
> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
>  	 * Account io->origin_bio to DM dev on behalf of target
>  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
>  	 */
> -	if (io->map_task == current) {
> +	if (io->map_task == current)
>  		/* Still in target's map function */
>  		dm_io_set_flag(io, DM_IO_START_ACCT);
> -	} else {
> -		/*
> -		 * Called by another thread, managed by DM target,
> -		 * wait for dm_split_and_process_bio() to store
> -		 * io->orig_bio
> -		 */
> -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
> -			msleep(1);
> +	else

Curly brackets around the else here.
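
(For reference, kernel coding style requires that if one branch of an
if/else is braced, both are. A minimal sketch of the requested shape,
not the literal patch:)

	if (io->map_task == current) {
		/* Still in target's map function */
		dm_io_set_flag(io, DM_IO_START_ACCT);
	} else {
		dm_start_io_acct(io, clone);
	}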

>  		dm_start_io_acct(io, clone);
> -	}
>  
>  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
>  			      tio->old_sector);
> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>  				     struct dm_table *map, struct bio *bio)
>  {
>  	struct clone_info ci;
> -	struct bio *orig_bio = NULL;
> +	struct bio *new_bio = NULL;
>  	int error = 0;
>  
>  	init_clone_info(&ci, md, map, bio);
> @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>  	if (error || !ci.sector_count)
>  		goto out;
>  
> -	/*
> -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
> -	 * *after* bios already submitted have been completely processed.
> -	 * We take a clone of the original to store in ci.io->orig_bio to be
> -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
> -	 * completion handling.
> -	 */

This comment should remain with some adjustment.

> -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
> -			     GFP_NOIO, &md->queue->bio_split);
> -	bio_chain(orig_bio, bio);
> -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
> -	submit_bio_noacct(bio);
> +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
> +			&md->queue->bio_split);

Why not keep using bio_split()?

> +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
> +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
> +	bio_chain(new_bio, bio);
> +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
> +	submit_bio_noacct(new_bio);
>  out:
> -	if (!orig_bio)
> -		orig_bio = bio;
> -	smp_store_release(&ci.io->orig_bio, orig_bio);
>  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
>  		dm_start_io_acct(ci.io, NULL);

I tested this and it works. Need to check the accounting though.
And I agree this is a lot cleaner :)
Ming Lei April 12, 2022, 12:09 a.m. UTC | #14
On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
> On 4/11/22 23:18, Ming Lei wrote:
> >>>> This fixes the issue:
> >>>>
> >>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>>> index 3c5fad7c4ee6..3dd6735450c5 100644
> >>>> --- a/drivers/md/dm.c
> >>>> +++ b/drivers/md/dm.c
> >>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> >>>> *md, struct bio *bio)
> >>>>         io->status = 0;
> >>>>         atomic_set(&io->io_count, 1);
> >>>>         this_cpu_inc(*md->pending_io);
> >>>> -       io->orig_bio = NULL;
> >>>> +       io->orig_bio = bio;
> >>>>         io->md = md;
> >>>>         io->map_task = current;
> >>>>         spin_lock_init(&io->lock);
> >>>>
> >>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> >>>> However, this change may be messing up the bio accounting. Need to check that.
> >>>
> >>> Looks it is one recent regression since:
> >>>
> >>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
> >>
> >> Yep, saw that. Problem is, I really do not understand that change setting
> >> io->orig_bio *after* __map_bio() is called. It seems that the accounting
> >> is done on each fragment of the orig_bio instead of once for the entire
> >> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
> >> an argument to  __map_bio() from __split_and_process_bio(), I do not think
> >> my change is correct. Thoughts ?
> > 
> > Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
> > after ->map() looks ugly & tricky, and the following change should avoid the
> > issue, meantime simplify dm accounting a bit:
> > 
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3c5fad7c4ee6..f1fe83113608 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
> >  	io->status = 0;
> >  	atomic_set(&io->io_count, 1);
> >  	this_cpu_inc(*md->pending_io);
> > -	io->orig_bio = NULL;
> > +	io->orig_bio = bio;
> >  	io->md = md;
> >  	io->map_task = current;
> >  	spin_lock_init(&io->lock);
> > @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
> >  	 * Account io->origin_bio to DM dev on behalf of target
> >  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
> >  	 */
> > -	if (io->map_task == current) {
> > +	if (io->map_task == current)
> >  		/* Still in target's map function */
> >  		dm_io_set_flag(io, DM_IO_START_ACCT);
> > -	} else {
> > -		/*
> > -		 * Called by another thread, managed by DM target,
> > -		 * wait for dm_split_and_process_bio() to store
> > -		 * io->orig_bio
> > -		 */
> > -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
> > -			msleep(1);
> > +	else
> 
> Curly brackets around the else here.
> 
> >  		dm_start_io_acct(io, clone);
> > -	}
> >  
> >  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
> >  			      tio->old_sector);
> > @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >  				     struct dm_table *map, struct bio *bio)
> >  {
> >  	struct clone_info ci;
> > -	struct bio *orig_bio = NULL;
> > +	struct bio *new_bio = NULL;
> >  	int error = 0;
> >  
> >  	init_clone_info(&ci, md, map, bio);
> > @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >  	if (error || !ci.sector_count)
> >  		goto out;
> >  
> > -	/*
> > -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
> > -	 * *after* bios already submitted have been completely processed.
> > -	 * We take a clone of the original to store in ci.io->orig_bio to be
> > -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
> > -	 * completion handling.
> > -	 */
> 
> This comment should remain with some adjustment.

Fine, I just felt the approach was straightforward enough not to need it.

> 
> > -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
> > -			     GFP_NOIO, &md->queue->bio_split);
> > -	bio_chain(orig_bio, bio);
> > -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
> > -	submit_bio_noacct(bio);
> > +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
> > +			&md->queue->bio_split);
> 
> Why not keep using bio_split() ?

With bio_split(), 'bio' actually tracks the remainder, and the returned
'orig_bio' tracks the part for the current target io, so ->orig_bio has to
be updated in this way.

With bio_alloc_clone() and bio_trim(), ->orig_bio needn't be changed, and
assigning it in alloc_io() works perfectly.
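
For illustration only, a minimal sketch of the two styles (assuming 'mapped'
holds the sector count consumed by the current target io and 'bs' is the
bio_set; this is not the literal patch):

	/* bio_split(): the returned bio describes the mapped front part,
	 * while 'bio' is mutated to describe the remainder, so ->orig_bio
	 * has to be repointed at the returned bio afterwards
	 */
	struct bio *front = bio_split(bio, mapped, GFP_NOIO, bs);
	bio_chain(front, bio);
	submit_bio_noacct(bio);

	/* bio_alloc_clone() + bio_trim(): 'bio' keeps describing the mapped
	 * front part, so the ->orig_bio assigned in alloc_io() stays valid
	 * and only the clone carries the remainder
	 */
	struct bio *rest = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO, bs);
	bio_trim(rest, mapped, bio_sectors(bio) - mapped);
	bio_trim(bio, 0, mapped);
	bio_chain(rest, bio);
	submit_bio_noacct(rest);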

> 
> > +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
> > +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
> > +	bio_chain(new_bio, bio);
> > +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
> > +	submit_bio_noacct(new_bio);
> >  out:
> > -	if (!orig_bio)
> > -		orig_bio = bio;
> > -	smp_store_release(&ci.io->orig_bio, orig_bio);
> >  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
> >  		dm_start_io_acct(ci.io, NULL);
> 
> I tested this and it works. Need to check the accounting though.
> And I agree this is a lot cleaner :)

BTW, the cloned bio for the split is just for accounting purposes; if
->bi_iter.bi_sector & ->bi_iter.bi_size could be stored in 'struct dm_io',
the cloned bio could be avoided, but the code may become less readable than
before.
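
A rough sketch of that alternative, with hypothetical fields (nothing like
this was posted in the thread):

	/* in alloc_io(), snapshot the extent before any split shrinks it */
	io->orig_sector = bio->bi_iter.bi_sector;	/* hypothetical field */
	io->orig_size = bio->bi_iter.bi_size;		/* hypothetical field */

	/* accounting would then use io->orig_sector / io->orig_size
	 * instead of dereferencing a cloned bio kept around only for that
	 */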


Thanks,
Ming
Damien Le Moal April 12, 2022, 12:28 a.m. UTC | #15
On 4/12/22 09:09, Ming Lei wrote:
> On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
>> On 4/11/22 23:18, Ming Lei wrote:
>>>>>> This fixes the issue:
>>>>>>
>>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>>>> index 3c5fad7c4ee6..3dd6735450c5 100644
>>>>>> --- a/drivers/md/dm.c
>>>>>> +++ b/drivers/md/dm.c
>>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
>>>>>> *md, struct bio *bio)
>>>>>>         io->status = 0;
>>>>>>         atomic_set(&io->io_count, 1);
>>>>>>         this_cpu_inc(*md->pending_io);
>>>>>> -       io->orig_bio = NULL;
>>>>>> +       io->orig_bio = bio;
>>>>>>         io->md = md;
>>>>>>         io->map_task = current;
>>>>>>         spin_lock_init(&io->lock);
>>>>>>
>>>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
>>>>>> However, this change may be messing up the bio accounting. Need to check that.
>>>>>
>>>>> Looks it is one recent regression since:
>>>>>
>>>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
>>>>
>>>> Yep, saw that. Problem is, I really do not understand that change setting
>>>> io->orig_bio *after* __map_bio() is called. It seems that the accounting
>>>> is done on each fragment of the orig_bio instead of once for the entire
>>>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
>>>> an argument to  __map_bio() from __split_and_process_bio(), I do not think
>>>> my change is correct. Thoughts ?
>>>
>>> Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
>>> after ->map() looks ugly & tricky, and the following change should avoid the
>>> issue, meantime simplify dm accounting a bit:
>>>
>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>> index 3c5fad7c4ee6..f1fe83113608 100644
>>> --- a/drivers/md/dm.c
>>> +++ b/drivers/md/dm.c
>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>>>  	io->status = 0;
>>>  	atomic_set(&io->io_count, 1);
>>>  	this_cpu_inc(*md->pending_io);
>>> -	io->orig_bio = NULL;
>>> +	io->orig_bio = bio;
>>>  	io->md = md;
>>>  	io->map_task = current;
>>>  	spin_lock_init(&io->lock);
>>> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
>>>  	 * Account io->origin_bio to DM dev on behalf of target
>>>  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
>>>  	 */
>>> -	if (io->map_task == current) {
>>> +	if (io->map_task == current)
>>>  		/* Still in target's map function */
>>>  		dm_io_set_flag(io, DM_IO_START_ACCT);
>>> -	} else {
>>> -		/*
>>> -		 * Called by another thread, managed by DM target,
>>> -		 * wait for dm_split_and_process_bio() to store
>>> -		 * io->orig_bio
>>> -		 */
>>> -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
>>> -			msleep(1);
>>> +	else
>>
>> Curly brackets around the else here.
>>
>>>  		dm_start_io_acct(io, clone);
>>> -	}
>>>  
>>>  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
>>>  			      tio->old_sector);
>>> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>>>  				     struct dm_table *map, struct bio *bio)
>>>  {
>>>  	struct clone_info ci;
>>> -	struct bio *orig_bio = NULL;
>>> +	struct bio *new_bio = NULL;
>>>  	int error = 0;
>>>  
>>>  	init_clone_info(&ci, md, map, bio);
>>> @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>>>  	if (error || !ci.sector_count)
>>>  		goto out;
>>>  
>>> -	/*
>>> -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
>>> -	 * *after* bios already submitted have been completely processed.
>>> -	 * We take a clone of the original to store in ci.io->orig_bio to be
>>> -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
>>> -	 * completion handling.
>>> -	 */
>>
>> This comment should remain with some adjustment.
> 
> Fine, just felt the approach is very straightforward.
> 
>>
>>> -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
>>> -			     GFP_NOIO, &md->queue->bio_split);
>>> -	bio_chain(orig_bio, bio);
>>> -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
>>> -	submit_bio_noacct(bio);
>>> +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
>>> +			&md->queue->bio_split);
>>
>> Why not keep using bio_split() ?
> 
> With bio_split(), 'bio' actually tracks the remainder, and the returned
> 'orig_bio' tracks the part for current target io, so ->orig_bio has to
> be updated in this way.
> 
> With bio_alloc_clone() and bio_trim(), ->orig_bio needn't to be
> changed, and assigning it in alloc_io() works perfectly.

OK. Got it.

>>> +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
>>> +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
>>> +	bio_chain(new_bio, bio);
>>> +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
>>> +	submit_bio_noacct(new_bio);
>>>  out:
>>> -	if (!orig_bio)
>>> -		orig_bio = bio;
>>> -	smp_store_release(&ci.io->orig_bio, orig_bio);
>>>  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
>>>  		dm_start_io_acct(ci.io, NULL);
>>
>> I tested this and it works. Need to check the accounting though.
>> And I agree this is a lot cleaner :)
> 
> BTW, the cloned bio for split is just for accounting purpose, if
> ->bi_iter.bi_sector & ->bi_iter.bi_size can be stored in 'struct dm_io',
> the cloned bio can be avoided, but code may become not readable as
> before.

The BIO op would be needed too for remapping zone append to regular writes
when zone append emulation is enabled. That is actually why
dm_zone_map_bio() needs the original BIO, to have the unmodified BIO op
since the clone op may not be the same.
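
(A minimal sketch of what that implies, not the actual dm-zone.c code: with
emulation the clone is issued as a regular write, so the original op has to
be read from the original bio.)

	if (bio_op(io->orig_bio) == REQ_OP_ZONE_APPEND) {
		/* the clone completed as a plain write; report the actual
		 * write position back through the original bio, as zone
		 * append semantics require
		 */
		io->orig_bio->bi_iter.bi_sector = clone->bi_iter.bi_sector;
	}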

So I think this fix+cleanup as is is good for now. Will you send a proper
patch?

> 
> 
> Thanks,
> Ming
>
Ming Lei April 12, 2022, 1:02 a.m. UTC | #16
On Tue, Apr 12, 2022 at 09:28:46AM +0900, Damien Le Moal wrote:
> On 4/12/22 09:09, Ming Lei wrote:
> > On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
> >> On 4/11/22 23:18, Ming Lei wrote:
> >>>>>> This fixes the issue:
> >>>>>>
> >>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>>>>> index 3c5fad7c4ee6..3dd6735450c5 100644
> >>>>>> --- a/drivers/md/dm.c
> >>>>>> +++ b/drivers/md/dm.c
> >>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> >>>>>> *md, struct bio *bio)
> >>>>>>         io->status = 0;
> >>>>>>         atomic_set(&io->io_count, 1);
> >>>>>>         this_cpu_inc(*md->pending_io);
> >>>>>> -       io->orig_bio = NULL;
> >>>>>> +       io->orig_bio = bio;
> >>>>>>         io->md = md;
> >>>>>>         io->map_task = current;
> >>>>>>         spin_lock_init(&io->lock);
> >>>>>>
> >>>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> >>>>>> However, this change may be messing up the bio accounting. Need to check that.
> >>>>>
> >>>>> Looks it is one recent regression since:
> >>>>>
> >>>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
> >>>>
> >>>> Yep, saw that. Problem is, I really do not understand that change setting
> >>>> io->orig_bio *after* __map_bio() is called. It seems that the accounting
> >>>> is done on each fragment of the orig_bio instead of once for the entire
> >>>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
> >>>> an argument to  __map_bio() from __split_and_process_bio(), I do not think
> >>>> my change is correct. Thoughts ?
> >>>
> >>> Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
> >>> after ->map() looks ugly & tricky, and the following change should avoid the
> >>> issue, meantime simplify dm accounting a bit:
> >>>
> >>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>> index 3c5fad7c4ee6..f1fe83113608 100644
> >>> --- a/drivers/md/dm.c
> >>> +++ b/drivers/md/dm.c
> >>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
> >>>  	io->status = 0;
> >>>  	atomic_set(&io->io_count, 1);
> >>>  	this_cpu_inc(*md->pending_io);
> >>> -	io->orig_bio = NULL;
> >>> +	io->orig_bio = bio;
> >>>  	io->md = md;
> >>>  	io->map_task = current;
> >>>  	spin_lock_init(&io->lock);
> >>> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
> >>>  	 * Account io->origin_bio to DM dev on behalf of target
> >>>  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
> >>>  	 */
> >>> -	if (io->map_task == current) {
> >>> +	if (io->map_task == current)
> >>>  		/* Still in target's map function */
> >>>  		dm_io_set_flag(io, DM_IO_START_ACCT);
> >>> -	} else {
> >>> -		/*
> >>> -		 * Called by another thread, managed by DM target,
> >>> -		 * wait for dm_split_and_process_bio() to store
> >>> -		 * io->orig_bio
> >>> -		 */
> >>> -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
> >>> -			msleep(1);
> >>> +	else
> >>
> >> Curly brackets around the else here.
> >>
> >>>  		dm_start_io_acct(io, clone);
> >>> -	}
> >>>  
> >>>  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
> >>>  			      tio->old_sector);
> >>> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >>>  				     struct dm_table *map, struct bio *bio)
> >>>  {
> >>>  	struct clone_info ci;
> >>> -	struct bio *orig_bio = NULL;
> >>> +	struct bio *new_bio = NULL;
> >>>  	int error = 0;
> >>>  
> >>>  	init_clone_info(&ci, md, map, bio);
> >>> @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >>>  	if (error || !ci.sector_count)
> >>>  		goto out;
> >>>  
> >>> -	/*
> >>> -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
> >>> -	 * *after* bios already submitted have been completely processed.
> >>> -	 * We take a clone of the original to store in ci.io->orig_bio to be
> >>> -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
> >>> -	 * completion handling.
> >>> -	 */
> >>
> >> This comment should remain with some adjustment.
> > 
> > Fine, just felt the approach is very straightforward.
> > 
> >>
> >>> -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
> >>> -			     GFP_NOIO, &md->queue->bio_split);
> >>> -	bio_chain(orig_bio, bio);
> >>> -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
> >>> -	submit_bio_noacct(bio);
> >>> +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
> >>> +			&md->queue->bio_split);
> >>
> >> Why not keep using bio_split() ?
> > 
> > With bio_split(), 'bio' actually tracks the remainder, and the returned
> > 'orig_bio' tracks the part for current target io, so ->orig_bio has to
> > be updated in this way.
> > 
> > With bio_alloc_clone() and bio_trim(), ->orig_bio needn't to be
> > changed, and assigning it in alloc_io() works perfectly.
> 
> OK. Got it.
> 
> >>> +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
> >>> +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
> >>> +	bio_chain(new_bio, bio);
> >>> +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
> >>> +	submit_bio_noacct(new_bio);
> >>>  out:
> >>> -	if (!orig_bio)
> >>> -		orig_bio = bio;
> >>> -	smp_store_release(&ci.io->orig_bio, orig_bio);
> >>>  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
> >>>  		dm_start_io_acct(ci.io, NULL);
> >>
> >> I tested this and it works. Need to check the accounting though.
> >> And I agree this is a lot cleaner :)
> > 
> > BTW, the cloned bio for split is just for accounting purpose, if
> > ->bi_iter.bi_sector & ->bi_iter.bi_size can be stored in 'struct dm_io',
> > the cloned bio can be avoided, but code may become not readable as
> > before.
> 
> The BIO op would be needed too for remapping zone append to regular writes
> when zone append emulation is enabled. That is actually why
> dm_zone_map_bio() needs the original BIO, to have the unmodified BIO op
> since the clone op may not be the same.

The clone inherits the original bio's op. I meant to always store the real
bio from the FS as ->orig_bio in alloc_io(), and simply trim it in case of
split & re-submission, while storing orig_bio's ->bi_sector & ->bi_size in
'struct dm_io' for accounting purposes. But it looks a bit complicated and
messy.

Wrt the dm zone code, I'd suggest double-checking everywhere the orig bio is
used, since only part of the orig bio may be mapped in case of splitting (the
mapping actually happens post-split).

If bio splits never happen for dm-zone, your patch is fine. But I guess that
isn't true for dm-zone.

> 
> So I think this fix+cleanup as is is good for now. Will you send a proper
> patch ?

Not yet; the fix+cleanup patch actually breaks the recent dm io polling, and
I haven't figured out a solution yet.


Thanks,
Ming
Damien Le Moal April 12, 2022, 1:17 a.m. UTC | #17
On 4/12/22 10:02, Ming Lei wrote:
> On Tue, Apr 12, 2022 at 09:28:46AM +0900, Damien Le Moal wrote:
>> On 4/12/22 09:09, Ming Lei wrote:
>>> On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
>>>> On 4/11/22 23:18, Ming Lei wrote:
>>>>>>>> This fixes the issue:
>>>>>>>>
>>>>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>>>>>> index 3c5fad7c4ee6..3dd6735450c5 100644
>>>>>>>> --- a/drivers/md/dm.c
>>>>>>>> +++ b/drivers/md/dm.c
>>>>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
>>>>>>>> *md, struct bio *bio)
>>>>>>>>         io->status = 0;
>>>>>>>>         atomic_set(&io->io_count, 1);
>>>>>>>>         this_cpu_inc(*md->pending_io);
>>>>>>>> -       io->orig_bio = NULL;
>>>>>>>> +       io->orig_bio = bio;
>>>>>>>>         io->md = md;
>>>>>>>>         io->map_task = current;
>>>>>>>>         spin_lock_init(&io->lock);
>>>>>>>>
>>>>>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
>>>>>>>> However, this change may be messing up the bio accounting. Need to check that.
>>>>>>>
>>>>>>> Looks it is one recent regression since:
>>>>>>>
>>>>>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
>>>>>>
>>>>>> Yep, saw that. Problem is, I really do not understand that change setting
>>>>>> io->orig_bio *after* __map_bio() is called. It seems that the accounting
>>>>>> is done on each fragment of the orig_bio instead of once for the entire
>>>>>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
>>>>>> an argument to  __map_bio() from __split_and_process_bio(), I do not think
>>>>>> my change is correct. Thoughts ?
>>>>>
>>>>> Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
>>>>> after ->map() looks ugly & tricky, and the following change should avoid the
>>>>> issue, meantime simplify dm accounting a bit:
>>>>>
>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>>> index 3c5fad7c4ee6..f1fe83113608 100644
>>>>> --- a/drivers/md/dm.c
>>>>> +++ b/drivers/md/dm.c
>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>>>>>  	io->status = 0;
>>>>>  	atomic_set(&io->io_count, 1);
>>>>>  	this_cpu_inc(*md->pending_io);
>>>>> -	io->orig_bio = NULL;
>>>>> +	io->orig_bio = bio;
>>>>>  	io->md = md;
>>>>>  	io->map_task = current;
>>>>>  	spin_lock_init(&io->lock);
>>>>> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
>>>>>  	 * Account io->origin_bio to DM dev on behalf of target
>>>>>  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
>>>>>  	 */
>>>>> -	if (io->map_task == current) {
>>>>> +	if (io->map_task == current)
>>>>>  		/* Still in target's map function */
>>>>>  		dm_io_set_flag(io, DM_IO_START_ACCT);
>>>>> -	} else {
>>>>> -		/*
>>>>> -		 * Called by another thread, managed by DM target,
>>>>> -		 * wait for dm_split_and_process_bio() to store
>>>>> -		 * io->orig_bio
>>>>> -		 */
>>>>> -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
>>>>> -			msleep(1);
>>>>> +	else
>>>>
>>>> Curly brackets around the else here.
>>>>
>>>>>  		dm_start_io_acct(io, clone);
>>>>> -	}
>>>>>  
>>>>>  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
>>>>>  			      tio->old_sector);
>>>>> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>>>>>  				     struct dm_table *map, struct bio *bio)
>>>>>  {
>>>>>  	struct clone_info ci;
>>>>> -	struct bio *orig_bio = NULL;
>>>>> +	struct bio *new_bio = NULL;
>>>>>  	int error = 0;
>>>>>  
>>>>>  	init_clone_info(&ci, md, map, bio);
>>>>> @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>>>>>  	if (error || !ci.sector_count)
>>>>>  		goto out;
>>>>>  
>>>>> -	/*
>>>>> -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
>>>>> -	 * *after* bios already submitted have been completely processed.
>>>>> -	 * We take a clone of the original to store in ci.io->orig_bio to be
>>>>> -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
>>>>> -	 * completion handling.
>>>>> -	 */
>>>>
>>>> This comment should remain with some adjustment.
>>>
>>> Fine, just felt the approach is very straightforward.
>>>
>>>>
>>>>> -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
>>>>> -			     GFP_NOIO, &md->queue->bio_split);
>>>>> -	bio_chain(orig_bio, bio);
>>>>> -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
>>>>> -	submit_bio_noacct(bio);
>>>>> +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
>>>>> +			&md->queue->bio_split);
>>>>
>>>> Why not keep using bio_split() ?
>>>
>>> With bio_split(), 'bio' actually tracks the remainder, and the returned
>>> 'orig_bio' tracks the part for current target io, so ->orig_bio has to
>>> be updated in this way.
>>>
>>> With bio_alloc_clone() and bio_trim(), ->orig_bio needn't to be
>>> changed, and assigning it in alloc_io() works perfectly.
>>
>> OK. Got it.
>>
>>>>> +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
>>>>> +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
>>>>> +	bio_chain(new_bio, bio);
>>>>> +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
>>>>> +	submit_bio_noacct(new_bio);
>>>>>  out:
>>>>> -	if (!orig_bio)
>>>>> -		orig_bio = bio;
>>>>> -	smp_store_release(&ci.io->orig_bio, orig_bio);
>>>>>  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
>>>>>  		dm_start_io_acct(ci.io, NULL);
>>>>
>>>> I tested this and it works. Need to check the accounting though.
>>>> And I agree this is a lot cleaner :)
>>>
>>> BTW, the cloned bio for split is just for accounting purpose, if
>>> ->bi_iter.bi_sector & ->bi_iter.bi_size can be stored in 'struct dm_io',
>>> the cloned bio can be avoided, but code may become not readable as
>>> before.
>>
>> The BIO op would be needed too for remapping zone append to regular writes
>> when zone append emulation is enabled. That is actually why
>> dm_zone_map_bio() needs the original BIO, to have the unmodified BIO op
>> since the clone op may not be the same.
> 
> clone inherits the original bio's op. I meant to store the
> real bio from FS as ->orig_bio always in alloc_io(), and simply trim it in case
> of split & re-submission, meantime store orig_bio's ->bi_sector & ->size into
> 'dm_io' for account purpose. But it looks a bit complicated and messy.

Indeed.

> Wrt. dm zone, I'd suggest to double check anywhere orig bio is used,
> since only part of orig bio may be mapped in case of splitting, which is
> actually post-split.
> 
> If bio split won't happen for dm-zone, your patch is fine. But I guess
> it isn't true for dm-zone.

Zone append operations can never be split. That is checked before a user
bio reaches DM __map_bio()/dm_zone_map_bio(). So I think that there is no
issue with the clone+trim change since that will never be done for a zone
append operation.
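
(A sketch of that invariant as a hypothetical helper; the real check lives in
the block layer before DM ever sees the bio:)

	/* zone append bios must never be split: the device reports the
	 * actual write position through the single completed bio
	 */
	static inline bool dm_bio_may_split(struct bio *bio)
	{
		return bio_op(bio) != REQ_OP_ZONE_APPEND;
	}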

In the case of emulated zone append, regular writes will still go through
the dm_zone_map_bio() function, and these writes may be split. But then
the orig_bio does not really matter in that case and the zone write
pointer tracking relies on the clone sector & size, not on the original BIO.

I will run more tests with this patch. But so far, this seems all OK.

>> So I think this fix+cleanup as is is good for now. Will you send a proper
>> patch ?
> 
> Not yet, the fix+cleanup patch actually breaks recent dm io polling, and I
> don't figure out one solution yet.

OK. Let me know if you need help testing.
diff mbox series

Patch

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 4277853c7535..13136bd47cb3 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -277,13 +277,6 @@  static inline void dm_io_set_flag(struct dm_io *io, unsigned int bit)
 	io->flags |= (1U << bit);
 }
 
-static inline void dm_io_inc_pending(struct dm_io *io)
-{
-	atomic_inc(&io->io_count);
-}
-
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error);
-
 static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
 {
 	return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index c1ca9be4b79e..85d3c158719f 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -545,13 +545,6 @@  int dm_zone_map_bio(struct dm_target_io *tio)
 		return DM_MAPIO_KILL;
 	}
 
-	/*
-	 * The target map function may issue and complete the IO quickly.
-	 * Take an extra reference on the IO to make sure it does disappear
-	 * until we run dm_zone_map_bio_end().
-	 */
-	dm_io_inc_pending(io);
-
 	/* Let the target do its work */
 	r = ti->type->map(ti, clone);
 	switch (r) {
@@ -580,9 +573,6 @@  int dm_zone_map_bio(struct dm_target_io *tio)
 		break;
 	}
 
-	/* Drop the extra reference on the IO */
-	dm_io_dec_pending(io, sts);
-
 	if (sts != BLK_STS_OK)
 		return DM_MAPIO_KILL;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..b8424a4b4725 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -929,11 +929,16 @@  static inline bool dm_tio_is_normal(struct dm_target_io *tio)
 		!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
 }
 
+static void dm_io_inc_pending(struct dm_io *io)
+{
+	atomic_inc(&io->io_count);
+}
+
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
  */
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
+static void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
 {
 	/* Push-back supersedes any I/O errors */
 	if (unlikely(error)) {