diff mbox series

[v9,3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

Message ID 1675522718-88-4-git-send-email-ruansy.fnst@fujitsu.com (mailing list archive)
State New, archived
Headers show
Series mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind | expand

Commit Message

Shiyang Ruan Feb. 4, 2023, 2:58 p.m. UTC
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event.  So do not shutdown filesystem directly if something not
supported, or if failure range includes metadata area.  Make sure all
files and processes are handled correctly.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c         |  3 ++-
 fs/xfs/xfs_notify_failure.c | 28 +++++++++++++++++++++++++++-
 include/linux/mm.h          |  1 +
 3 files changed, 30 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox (Oracle) Feb. 5, 2023, 9:10 p.m. UTC | #1
On Sat, Feb 04, 2023 at 02:58:38PM +0000, Shiyang Ruan wrote:
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "device is about to be removed!");
> +		down_write(&mp->m_super->s_umount);
> +		error = sync_filesystem(mp->m_super);
> +		/* invalidate_inode_pages2() invalidates dax mapping */
> +		super_drop_pagecache(mp->m_super, invalidate_inode_pages2);

OK, there's a better way to handle all of this.

First, an essentially untyped interface with a void * argument is
bad.  Second, we can do all this with invalidate_inode_pages2_range()
and invalidate_mapping_pages() without introducing a new function.

Here's the proposal:

void super_drop_pagecache(struct super_block *sb,
		int (*invalidate)(struct address_space *, pgoff_t start, pgoff_t end))

In fs/drop-caches.c:

static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
	super_drop_pagecache(sb, invalidate_mapping_pages);
}

In XFS:

		super_drop_pagecache(mp->m_super,
				invalidate_inode_pages2_range);

Much smaller change ...
Dave Chinner Feb. 5, 2023, 9:50 p.m. UTC | #2
On Sat, Feb 04, 2023 at 02:58:38PM +0000, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.

.....

> @@ -182,12 +188,24 @@ xfs_dax_notify_failure(
>  	struct xfs_mount	*mp = dax_holder(dax_dev);
>  	u64			ddev_start;
>  	u64			ddev_end;
> +	int			error;
>  
>  	if (!(mp->m_super->s_flags & SB_BORN)) {
>  		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>  		return -EIO;
>  	}
>  
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "device is about to be removed!");
> +		down_write(&mp->m_super->s_umount);
> +		error = sync_filesystem(mp->m_super);
> +		/* invalidate_inode_pages2() invalidates dax mapping */
> +		super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
> +		up_write(&mp->m_super->s_umount);

I really don't like this.

super_drop_pagecache() doesn't guarantee that everything is removed
from cache. It is racy - it doesn't touch inodes being freed or
being instantiated, nor does it prevent concurrent accesses to the
inodes from re-instantiating page cache pages and dirtying them
after the inode has been scanned by super_drop_pagecache().

If we are about to remove the block device and we want to guarantee
that the filesystem is cleaned and stable before the device gets
yanked out from under running applications, then we have to
guarantee that we stall the running applications trying to modify
the filesystem between the MF_MEM_PRE_REMOVE and the actual removal
event that then shuts down the filesystem. Invalidating the page
cache is not enough to guarantee this.

Keep in mind that we're going to walk the rmap after writing the
data to kill any processes that have mmap()d files in the filesystem
after we've dropped the page cache - the page cache invalidation
doesn't change this at all - and this will kill any active userspace
DAX mappings before the device is unplugged. So I don't actually see
how walking the page cache to invalidate it here actually helps
"invalidate dax mapping" reliably as new write page faults on dax
VMAs can still occur between super_drop_pagecache() and the rmap
walk triggering kills on processes with DAX mapped VMAs.

We also don't care if read-only operations race with device unplug -
they are going to get EIO the moment the device is actually
unplugged or the filesystem is shutdown anyway, so it doesn't matter
if reads race with the device remove event.  Hence all we really
care about here is not dirtying the filesystem after we've started
processing the MF_MEM_PRE_REMOVE event.

Realistically, I think we need to freeze the filesystem here to
prevent racing modifications occurring during the rmap + VMA walk +
proc kill. That could be write() IO dirtying new data or other
transactions running dirtying the journal/metadata. Both
sync_filesystem() and super_drop_pagecache() operate on current
state - they don't prevent future dax mapping instantiation or
dirtying from happening on the device, so they don't prevent this...

Cheers,

Dave.
diff mbox series

Patch

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index da4438f3188c..40274d19f4f9 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@  void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 3830f908e215..5c1e678a1285 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@ 
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -77,6 +78,9 @@  xfs_dax_failure_fn(
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* The device is about to be removed.  Not a really failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -168,7 +172,9 @@  xfs_dax_notify_ddev_failure(
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
-	}
+	} else if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
 	return error;
 }
 
@@ -182,12 +188,24 @@  xfs_dax_notify_failure(
 	struct xfs_mount	*mp = dax_holder(dax_dev);
 	u64			ddev_start;
 	u64			ddev_end;
+	int			error;
 
 	if (!(mp->m_super->s_flags & SB_BORN)) {
 		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
 		return -EIO;
 	}
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "device is about to be removed!");
+		down_write(&mp->m_super->s_umount);
+		error = sync_filesystem(mp->m_super);
+		/* invalidate_inode_pages2() invalidates dax mapping */
+		super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
+		up_write(&mp->m_super->s_umount);
+		if (error)
+			return error;
+	}
+
 	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
 		xfs_debug(mp,
 			 "notify_failure() not supported on realtime device!");
@@ -196,6 +214,8 @@  xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -209,6 +229,12 @@  xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f857163ac89..9711dbc9451f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3424,6 +3424,7 @@  enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);