[RFC,0/7] dm: fix issues with swapping dm tables

Message ID	20250309222904.449803-1-bmarzins@redhat.com (mailing list archive)
Headers	show Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 755CB2153FE for <linux-block@vger.kernel.org>; Sun, 9 Mar 2025 22:29:13 +0000 (UTC) From: Benjamin Marzinski <bmarzins@redhat.com> To: Mikulas Patocka <mpatocka@redhat.com>, Mike Snitzer <snitzer@redhat.com>, Jens Axboe <axboe@kernel.dk> Cc: dm-devel@lists.linux.dev, linux-block@vger.kernel.org, Damien Le Moal <dlemoal@kernel.org>, Christoph Hellwig <hch@lst.de> Subject: [RFC PATCH 0/7] dm: fix issues with swapping dm tables Date: Sun, 9 Mar 2025 18:28:56 -0400 Message-ID: <20250309222904.449803-1-bmarzins@redhat.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	dm: fix issues with swapping dm tables \| expand [RFC,0/7] dm: fix issues with swapping dm tables [1/7] dm: don't change md if dm_table_set_restrictions() fails [2/7] dm: free table mempools if not used in __bind [3/7] dm: handle failures in dm_table_set_restrictions [4/7] dm: fix dm_blk_report_zones [5/7] blk-zoned: clean up zone settings for devices without zwplugs [RFC,6/7] blk-zoned: modify blk_revalidate_disk_zones for bio-based drivers [RFC,7/7] dm: allow devices to revalidate existing zones

Message ID

20250309222904.449803-1-bmarzins@redhat.com (mailing list archive)

Headers

From: Benjamin Marzinski <bmarzins@redhat.com>
To: Mikulas Patocka <mpatocka@redhat.com>, Mike Snitzer <snitzer@redhat.com>,
        Jens Axboe <axboe@kernel.dk>
Cc: dm-devel@lists.linux.dev, linux-block@vger.kernel.org,
        Damien Le Moal <dlemoal@kernel.org>, Christoph Hellwig <hch@lst.de>
Subject: [RFC PATCH 0/7] dm: fix issues with swapping dm tables
Date: Sun,  9 Mar 2025 18:28:56 -0400
Message-ID: <20250309222904.449803-1-bmarzins@redhat.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

dm: fix issues with swapping dm tables | expand

Message

Benjamin Marzinski March 9, 2025, 10:28 p.m. UTC

There were multiple places in dm's __bind() function where it could fail
and not completely roll back, leaving the device using the the old
table, but with device limits and resources from the new table.
Additionally, unused mempools for request-based devices were not always
freed immediately.

Finally, there were a number of issues with switching zoned tables that
emulate zone append (in other words, dm-crypt on top of zoned devices).
dm_blk_report_zones() could be called while the device was suspended and
modifying zoned resources or could possibly fail to end a srcu read
section.  More importantly, blk_revalidate_disk_zones() would never get
called when updating a zoned table. This could cause the dm device to
see the wrong zone write offsets, not have a large enough zwplugs
reserved in its mempool, or read invalid memory when checking the
conventional zones bitmap.

This patchset fixes these issues. It does not make it so that
device-mapper is able to load any zoned table from any other zoned
table. Zoned dm-crypt devices can be safely grown and shrunk, but
reloading a zoned dm-crypt device to, for instance, point at a
completely different underlying device won't work correctly. IO might
fail since the zone write offsets of the dm-crypt device will not be
updated for all the existing zones with plugs. If the new device's zone
offsets don't match the old device's offsets, IO to the zone will fail.
If the ability to switch tables from a zoned dm-crypt device to an
abritry other zoned dm-crypt device is important to people, it could be
done as long as there are no plugged zones when dm suspends.

This patchset also doesn't touch the code for switching from a zoned to
a non-zoned device. Switching from a zoned dm-crypt device to a
non-zoned device is problematic if there are plugged zones, since the
underlying device will not be prepared to handle these plugged writes.
Switching to a target that does not pass down IOs, like the dm-error
target, is fine. So is switching when there are no plugged zones, except
that we do not free the zoned resources in this case, even though we
safely could.

If people are interested in removing some of these limitations, I can
send patches for them, but I'm not sure how much extra code we want,
just to support niche zoned dm-crypt reloads.

Benjamin Marzinski (7):
  dm: don't change md if dm_table_set_restrictions() fails
  dm: free table mempools if not used in __bind
  dm: handle failures in dm_table_set_restrictions
  dm: fix dm_blk_report_zones
  blk-zoned: clean up zone settings for devices without zwplugs
  blk-zoned: modify blk_revalidate_disk_zones for bio-based drivers
  dm: allow devices to revalidate existing zones

 block/blk-zoned.c      | 78 ++++++++++++++++++++++--------------------
 block/blk.h            |  4 ---
 drivers/md/dm-core.h   |  1 +
 drivers/md/dm-table.c  | 24 ++++++++-----
 drivers/md/dm-zone.c   | 75 ++++++++++++++++++++++++++--------------
 drivers/md/dm.c        | 30 ++++++++--------
 drivers/md/dm.h        |  1 +
 include/linux/blkdev.h |  4 +++
 8 files changed, 128 insertions(+), 89 deletions(-)

Comments

Damien Le Moal March 9, 2025, 11:16 p.m. UTC | #1

On 3/10/25 07:28, Benjamin Marzinski wrote:
> There were multiple places in dm's __bind() function where it could fail
> and not completely roll back, leaving the device using the the old
> table, but with device limits and resources from the new table.
> Additionally, unused mempools for request-based devices were not always
> freed immediately.
> 
> Finally, there were a number of issues with switching zoned tables that
> emulate zone append (in other words, dm-crypt on top of zoned devices).
> dm_blk_report_zones() could be called while the device was suspended and
> modifying zoned resources or could possibly fail to end a srcu read
> section.  More importantly, blk_revalidate_disk_zones() would never get
> called when updating a zoned table. This could cause the dm device to
> see the wrong zone write offsets, not have a large enough zwplugs
> reserved in its mempool, or read invalid memory when checking the
> conventional zones bitmap.
> 
> This patchset fixes these issues. It does not make it so that
> device-mapper is able to load any zoned table from any other zoned
> table. Zoned dm-crypt devices can be safely grown and shrunk, but
> reloading a zoned dm-crypt device to, for instance, point at a
> completely different underlying device won't work correctly. IO might
> fail since the zone write offsets of the dm-crypt device will not be
> updated for all the existing zones with plugs. If the new device's zone
> offsets don't match the old device's offsets, IO to the zone will fail.
> If the ability to switch tables from a zoned dm-crypt device to an
> abritry other zoned dm-crypt device is important to people, it could be
> done as long as there are no plugged zones when dm suspends.

Thanks for fixing this.

Given that in the general case switching tables will always likely result in
unaligned write errors, I think we should just report a ENOTSUPP error if the
user attempts to swap tables.

> This patchset also doesn't touch the code for switching from a zoned to
> a non-zoned device. Switching from a zoned dm-crypt device to a
> non-zoned device is problematic if there are plugged zones, since the
> underlying device will not be prepared to handle these plugged writes.
> Switching to a target that does not pass down IOs, like the dm-error
> target, is fine. So is switching when there are no plugged zones, except
> that we do not free the zoned resources in this case, even though we
> safely could.

This is another case that does not make much sense in practice. So instead of
still allowing it knowing that it most likely will not work, we should return
ENOTSUPP here too I think.

> If people are interested in removing some of these limitations, I can
> send patches for them, but I'm not sure how much extra code we want,
> just to support niche zoned dm-crypt reloads.

I have never heard any complaints/bug reports from people attempting this. Which
likely means that no-one is needing/trying to do this. So as mentionned above,
we should make sure that this feature is not reported as not supported with a
ENOTSUPP error, and maybe a warning.

Benjamin Marzinski March 10, 2025, 4:38 p.m. UTC | #2

On Mon, Mar 10, 2025 at 08:16:43AM +0900, Damien Le Moal wrote:
> On 3/10/25 07:28, Benjamin Marzinski wrote:
> > There were multiple places in dm's __bind() function where it could fail
> > and not completely roll back, leaving the device using the the old
> > table, but with device limits and resources from the new table.
> > Additionally, unused mempools for request-based devices were not always
> > freed immediately.
> > 
> > Finally, there were a number of issues with switching zoned tables that
> > emulate zone append (in other words, dm-crypt on top of zoned devices).
> > dm_blk_report_zones() could be called while the device was suspended and
> > modifying zoned resources or could possibly fail to end a srcu read
> > section.  More importantly, blk_revalidate_disk_zones() would never get
> > called when updating a zoned table. This could cause the dm device to
> > see the wrong zone write offsets, not have a large enough zwplugs
> > reserved in its mempool, or read invalid memory when checking the
> > conventional zones bitmap.
> > 
> > This patchset fixes these issues. It does not make it so that
> > device-mapper is able to load any zoned table from any other zoned
> > table. Zoned dm-crypt devices can be safely grown and shrunk, but
> > reloading a zoned dm-crypt device to, for instance, point at a
> > completely different underlying device won't work correctly. IO might
> > fail since the zone write offsets of the dm-crypt device will not be
> > updated for all the existing zones with plugs. If the new device's zone
> > offsets don't match the old device's offsets, IO to the zone will fail.
> > If the ability to switch tables from a zoned dm-crypt device to an
> > abritry other zoned dm-crypt device is important to people, it could be
> > done as long as there are no plugged zones when dm suspends.
> 
> Thanks for fixing this.
> 
> Given that in the general case switching tables will always likely result in
> unaligned write errors, I think we should just report a ENOTSUPP error if the
> user attempts to swap tables.

If we don't think there's any interest in growing or shrinking zoned
dm-crypt devices, that's fine.  I do think we should make an exception
for switching to the dm-error target. We specifically call that out with
DM_TARGET_WILDCARD so that we can always switch to it from any table if
we just want to fail out all the IO.

-Ben
 
> > This patchset also doesn't touch the code for switching from a zoned to
> > a non-zoned device. Switching from a zoned dm-crypt device to a
> > non-zoned device is problematic if there are plugged zones, since the
> > underlying device will not be prepared to handle these plugged writes.
> > Switching to a target that does not pass down IOs, like the dm-error
> > target, is fine. So is switching when there are no plugged zones, except
> > that we do not free the zoned resources in this case, even though we
> > safely could.
> 
> This is another case that does not make much sense in practice. So instead of
> still allowing it knowing that it most likely will not work, we should return
> ENOTSUPP here too I think.
> 
> > If people are interested in removing some of these limitations, I can
> > send patches for them, but I'm not sure how much extra code we want,
> > just to support niche zoned dm-crypt reloads.
> 
> I have never heard any complaints/bug reports from people attempting this. Which
> likely means that no-one is needing/trying to do this. So as mentionned above,
> we should make sure that this feature is not reported as not supported with a
> ENOTSUPP error, and maybe a warning.
> 
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research

Damien Le Moal March 10, 2025, 11:13 p.m. UTC | #3

On 3/11/25 01:38, Benjamin Marzinski wrote:
> On Mon, Mar 10, 2025 at 08:16:43AM +0900, Damien Le Moal wrote:
>> On 3/10/25 07:28, Benjamin Marzinski wrote:
>>> There were multiple places in dm's __bind() function where it could fail
>>> and not completely roll back, leaving the device using the the old
>>> table, but with device limits and resources from the new table.
>>> Additionally, unused mempools for request-based devices were not always
>>> freed immediately.
>>>
>>> Finally, there were a number of issues with switching zoned tables that
>>> emulate zone append (in other words, dm-crypt on top of zoned devices).
>>> dm_blk_report_zones() could be called while the device was suspended and
>>> modifying zoned resources or could possibly fail to end a srcu read
>>> section.  More importantly, blk_revalidate_disk_zones() would never get
>>> called when updating a zoned table. This could cause the dm device to
>>> see the wrong zone write offsets, not have a large enough zwplugs
>>> reserved in its mempool, or read invalid memory when checking the
>>> conventional zones bitmap.
>>>
>>> This patchset fixes these issues. It does not make it so that
>>> device-mapper is able to load any zoned table from any other zoned
>>> table. Zoned dm-crypt devices can be safely grown and shrunk, but
>>> reloading a zoned dm-crypt device to, for instance, point at a
>>> completely different underlying device won't work correctly. IO might
>>> fail since the zone write offsets of the dm-crypt device will not be
>>> updated for all the existing zones with plugs. If the new device's zone
>>> offsets don't match the old device's offsets, IO to the zone will fail.
>>> If the ability to switch tables from a zoned dm-crypt device to an
>>> abritry other zoned dm-crypt device is important to people, it could be
>>> done as long as there are no plugged zones when dm suspends.
>>
>> Thanks for fixing this.
>>
>> Given that in the general case switching tables will always likely result in
>> unaligned write errors, I think we should just report a ENOTSUPP error if the
>> user attempts to swap tables.
> 
> If we don't think there's any interest in growing or shrinking zoned
> dm-crypt devices, that's fine.  I do think we should make an exception
> for switching to the dm-error target. We specifically call that out with
> DM_TARGET_WILDCARD so that we can always switch to it from any table if
> we just want to fail out all the IO.

Arg ! dm-error is used in xfstests so we need it (for btrfs at least since btrfs
supports zoned devices, and soon xfs as well). So I guess we should disallow
switching tables when the new table changes something to the zone configuration
(grow, shrink, zone size, zoned/non-zoned). dm-error does not change anything,
so we should still be able to allow it.