diff mbox series

[v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind

Message ID 20230928103227.250550-1-ruansy.fnst@fujitsu.com (mailing list archive)
State Superseded, archived
Headers show
Series [v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind | expand

Commit Message

Shiyang Ruan Sept. 28, 2023, 10:32 a.m. UTC
====
Changes since v14:
 1. added/fixed code comments per Dan's comments
====

Now, if we suddenly remove a PMEM device(by calling unbind) which
contains FSDAX while programs are still accessing data in this device,
e.g.:
```
 $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
 # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
 echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
it could come into an unacceptable state:
  1. device has gone but mount point still exists, and umount will fail
       with "target is busy"
  2. programs will hang and cannot be killed
  3. may crash with NULL pointer dereference

To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
are going to remove the whole device, and make sure all related processes
could be notified so that they could end up gracefully.

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
on it to unmap all files in use, and notify processes who are using
those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()
      `-> freeze_super()             // freeze (kernel call)
      `-> do xfs rmap
      ` -> mf_dax_kill_procs()
      `  -> collect_procs_fsdax()    // all associated processes
      `  -> unmap_and_kill()
      ` -> invalidate_inode_pages2_range() // drop file's cache
      `-> thaw_super()               // thaw (both kernel & user call)

Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
new dax mapping from being created.  Do not shutdown filesystem directly
if configuration is not supported, or if failure range includes metadata
area.  Make sure all files and processes(not only the current progress)
are handled correctly.  Also drop the cache of associated files before
pmem is removed.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c         |   3 +-
 fs/xfs/xfs_notify_failure.c | 108 ++++++++++++++++++++++++++++++++++--
 include/linux/mm.h          |   1 +
 mm/memory-failure.c         |  21 +++++--
 4 files changed, 122 insertions(+), 11 deletions(-)

Comments

Dan Williams Sept. 29, 2023, 6:31 p.m. UTC | #1
Shiyang Ruan wrote:
> ====
> Changes since v14:
>  1. added/fixed code comments per Dan's comments
> ====
> 
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
>  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
>   1. device has gone but mount point still exists, and umount will fail
>        with "target is busy"
>   2. programs will hang and cannot be killed
>   3. may crash with NULL pointer dereference
> 
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
> 
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created.  Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area.  Make sure all files and processes(not only the current progress)
> are handled correctly.  Also drop the cache of associated files before
> pmem is removed.
> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>

This version address my feedback you can upgrade that Acked-by: to

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
kernel test robot Oct. 1, 2023, 1:43 a.m. UTC | #2
Hi Shiyang,

kernel test robot noticed the following build errors:



url:    https://github.com/intel-lab-lkp/linux/commits/UPDATE-20230928-183310/Shiyang-Ruan/xfs-fix-the-calculation-for-end-and-length/20230629-161913
base:   the 2th patch of https://lore.kernel.org/r/20230629081651.253626-3-ruansy.fnst%40fujitsu.com
patch link:    https://lore.kernel.org/r/20230928103227.250550-1-ruansy.fnst%40fujitsu.com
patch subject: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310010955.feI4HCwZ-lkp@intel.com/

All errors (new ones prefixed by >>):

>> fs/xfs/xfs_notify_failure.c:127:27: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
           error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
                                    ^
   fs/xfs/xfs_notify_failure.c:143:26: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
                   error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
                                          ^
>> fs/xfs/xfs_notify_failure.c:153:17: error: use of undeclared identifier 'FREEZE_HOLDER_USERSPACE'
           thaw_super(sb, FREEZE_HOLDER_USERSPACE);
                          ^
   3 errors generated.


vim +/FREEZE_HOLDER_KERNEL +127 fs/xfs/xfs_notify_failure.c

   119	
   120	static int
   121	xfs_dax_notify_failure_freeze(
   122		struct xfs_mount	*mp)
   123	{
   124		struct super_block	*sb = mp->m_super;
   125		int			error;
   126	
 > 127		error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
   128		if (error)
   129			xfs_emerg(mp, "already frozen by kernel, err=%d", error);
   130	
   131		return error;
   132	}
   133	
   134	static void
   135	xfs_dax_notify_failure_thaw(
   136		struct xfs_mount	*mp,
   137		bool			kernel_frozen)
   138	{
   139		struct super_block	*sb = mp->m_super;
   140		int			error;
   141	
   142		if (kernel_frozen) {
   143			error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
   144			if (error)
   145				xfs_emerg(mp, "still frozen after notify failure, err=%d",
   146					error);
   147		}
   148	
   149		/*
   150		 * Also thaw userspace call anyway because the device is about to be
   151		 * removed immediately.
   152		 */
 > 153		thaw_super(sb, FREEZE_HOLDER_USERSPACE);
   154	}
   155
Shiyang Ruan Oct. 2, 2023, 11:57 a.m. UTC | #3
在 2023/10/1 9:43, kernel test robot 写道:
> Hi Shiyang,
> 
> kernel test robot noticed the following build errors:
> 
> 
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/UPDATE-20230928-183310/Shiyang-Ruan/xfs-fix-the-calculation-for-end-and-length/20230629-161913
> base:   the 2th patch of https://lore.kernel.org/r/20230629081651.253626-3-ruansy.fnst%40fujitsu.com
> patch link:    https://lore.kernel.org/r/20230928103227.250550-1-ruansy.fnst%40fujitsu.com
> patch subject: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
> config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/config)
> compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202310010955.feI4HCwZ-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
>>> fs/xfs/xfs_notify_failure.c:127:27: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
>             error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>                                      ^
>     fs/xfs/xfs_notify_failure.c:143:26: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
>                     error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>                                            ^
>>> fs/xfs/xfs_notify_failure.c:153:17: error: use of undeclared identifier 'FREEZE_HOLDER_USERSPACE'
>             thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>                            ^
>     3 errors generated.
> 

The two enums has been introduced since 880b9577855e ("fs: distinguish 
between user initiated freeze and kernel initiated freeze"), v6.6-rc1. 
I also compiled my patches based on v6.6-rc1 with your config file, it 
passed with no error.

So, which kernel version were you testing?


--
Thanks,
Ruan.

> 
> vim +/FREEZE_HOLDER_KERNEL +127 fs/xfs/xfs_notify_failure.c
> 
>     119	
>     120	static int
>     121	xfs_dax_notify_failure_freeze(
>     122		struct xfs_mount	*mp)
>     123	{
>     124		struct super_block	*sb = mp->m_super;
>     125		int			error;
>     126	
>   > 127		error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>     128		if (error)
>     129			xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>     130	
>     131		return error;
>     132	}
>     133	
>     134	static void
>     135	xfs_dax_notify_failure_thaw(
>     136		struct xfs_mount	*mp,
>     137		bool			kernel_frozen)
>     138	{
>     139		struct super_block	*sb = mp->m_super;
>     140		int			error;
>     141	
>     142		if (kernel_frozen) {
>     143			error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>     144			if (error)
>     145				xfs_emerg(mp, "still frozen after notify failure, err=%d",
>     146					error);
>     147		}
>     148	
>     149		/*
>     150		 * Also thaw userspace call anyway because the device is about to be
>     151		 * removed immediately.
>     152		 */
>   > 153		thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>     154	}
>     155	
>
Chandan Babu R Oct. 20, 2023, 9:56 a.m. UTC | #4
On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
> ====
> Changes since v14:
>  1. added/fixed code comments per Dan's comments
> ====
>
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
>  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
>   1. device has gone but mount point still exists, and umount will fail
>        with "target is busy"
>   2. programs will hang and cannot be killed
>   3. may crash with NULL pointer dereference
>
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
>
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
>
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created.  Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area.  Make sure all files and processes(not only the current progress)
> are handled correctly.  Also drop the cache of associated files before
> pmem is removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>

Hi Andrew,

Shiyang had indicated that this patch has been added to
akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
that branch.

I am about to start collecting XFS patches for v6.7 cycle. Please let me know
if you have any objections with me taking this patch via the XFS tree.
Darrick J. Wong Oct. 20, 2023, 3:40 p.m. UTC | #5
On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
> > ====
> > Changes since v14:
> >  1. added/fixed code comments per Dan's comments
> > ====
> >
> > Now, if we suddenly remove a PMEM device(by calling unbind) which
> > contains FSDAX while programs are still accessing data in this device,
> > e.g.:
> > ```
> >  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> >  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> >  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> > ```
> > it could come into an unacceptable state:
> >   1. device has gone but mount point still exists, and umount will fail
> >        with "target is busy"
> >   2. programs will hang and cannot be killed
> >   3. may crash with NULL pointer dereference
> >
> > To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> > are going to remove the whole device, and make sure all related processes
> > could be notified so that they could end up gracefully.
> >
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > on it to unmap all files in use, and notify processes who are using
> > those files.
> >
> > Call trace:
> > trigger unbind
> >  -> unbind_store()
> >   -> ... (skip)
> >    -> devres_release_all()
> >     -> kill_dax()
> >      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> >       -> xfs_dax_notify_failure()
> >       `-> freeze_super()             // freeze (kernel call)
> >       `-> do xfs rmap
> >       ` -> mf_dax_kill_procs()
> >       `  -> collect_procs_fsdax()    // all associated processes
> >       `  -> unmap_and_kill()
> >       ` -> invalidate_inode_pages2_range() // drop file's cache
> >       `-> thaw_super()               // thaw (both kernel & user call)
> >
> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> > new dax mapping from being created.  Do not shutdown filesystem directly
> > if configuration is not supported, or if failure range includes metadata
> > area.  Make sure all files and processes(not only the current progress)
> > are handled correctly.  Also drop the cache of associated files before
> > pmem is removed.
> >
> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > Acked-by: Dan Williams <dan.j.williams@intel.com>
> 
> Hi Andrew,
> 
> Shiyang had indicated that this patch has been added to
> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
> that branch.
> 
> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
> if you have any objections with me taking this patch via the XFS tree.

V15 was dropped from his tree on 28 Sept., you might as well pull it
into your own tree for 6.7.  It's been testing fine on my trees for the
past 3 weeks.

https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/

--D

> 
> -- 
> Chandan
Chandan Babu R Oct. 23, 2023, 6:40 a.m. UTC | #6
On Fri, Oct 20, 2023 at 08:40:09 AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
>> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
>> > ====
>> > Changes since v14:
>> >  1. added/fixed code comments per Dan's comments
>> > ====
>> >
>> > Now, if we suddenly remove a PMEM device(by calling unbind) which
>> > contains FSDAX while programs are still accessing data in this device,
>> > e.g.:
>> > ```
>> >  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>> >  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>> >  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>> > ```
>> > it could come into an unacceptable state:
>> >   1. device has gone but mount point still exists, and umount will fail
>> >        with "target is busy"
>> >   2. programs will hang and cannot be killed
>> >   3. may crash with NULL pointer dereference
>> >
>> > To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>> > are going to remove the whole device, and make sure all related processes
>> > could be notified so that they could end up gracefully.
>> >
>> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> > dev_pagemap_failure()"[1].  With the help of dax_holder and
>> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> > on it to unmap all files in use, and notify processes who are using
>> > those files.
>> >
>> > Call trace:
>> > trigger unbind
>> >  -> unbind_store()
>> >   -> ... (skip)
>> >    -> devres_release_all()
>> >     -> kill_dax()
>> >      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>> >       -> xfs_dax_notify_failure()
>> >       `-> freeze_super()             // freeze (kernel call)
>> >       `-> do xfs rmap
>> >       ` -> mf_dax_kill_procs()
>> >       `  -> collect_procs_fsdax()    // all associated processes
>> >       `  -> unmap_and_kill()
>> >       ` -> invalidate_inode_pages2_range() // drop file's cache
>> >       `-> thaw_super()               // thaw (both kernel & user call)
>> >
>> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> > event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> > new dax mapping from being created.  Do not shutdown filesystem directly
>> > if configuration is not supported, or if failure range includes metadata
>> > area.  Make sure all files and processes(not only the current progress)
>> > are handled correctly.  Also drop the cache of associated files before
>> > pmem is removed.
>> >
>> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>> >
>> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>> > Acked-by: Dan Williams <dan.j.williams@intel.com>
>> 
>> Hi Andrew,
>> 
>> Shiyang had indicated that this patch has been added to
>> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
>> that branch.
>> 
>> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
>> if you have any objections with me taking this patch via the XFS tree.
>
> V15 was dropped from his tree on 28 Sept., you might as well pull it
> into your own tree for 6.7.  It's been testing fine on my trees for the
> past 3 weeks.
>
> https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/

Shiyang, this patch does not apply cleanly on v6.6-rc7. Can you please rebase
the patch on v6.6-rc7 and send it to the mailing list?
Shiyang Ruan Oct. 23, 2023, 7:26 a.m. UTC | #7
在 2023/10/23 14:40, Chandan Babu R 写道:
> 
> On Fri, Oct 20, 2023 at 08:40:09 AM -0700, Darrick J. Wong wrote:
>> On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
>>> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
>>>> ====
>>>> Changes since v14:
>>>>   1. added/fixed code comments per Dan's comments
>>>> ====
>>>>
>>>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>>>> contains FSDAX while programs are still accessing data in this device,
>>>> e.g.:
>>>> ```
>>>>   $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>>>   # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>>>   echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>>>> ```
>>>> it could come into an unacceptable state:
>>>>    1. device has gone but mount point still exists, and umount will fail
>>>>         with "target is busy"
>>>>    2. programs will hang and cannot be killed
>>>>    3. may crash with NULL pointer dereference
>>>>
>>>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>>>> are going to remove the whole device, and make sure all related processes
>>>> could be notified so that they could end up gracefully.
>>>>
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> on it to unmap all files in use, and notify processes who are using
>>>> those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>>   -> unbind_store()
>>>>    -> ... (skip)
>>>>     -> devres_release_all()
>>>>      -> kill_dax()
>>>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>>        -> xfs_dax_notify_failure()
>>>>        `-> freeze_super()             // freeze (kernel call)
>>>>        `-> do xfs rmap
>>>>        ` -> mf_dax_kill_procs()
>>>>        `  -> collect_procs_fsdax()    // all associated processes
>>>>        `  -> unmap_and_kill()
>>>>        ` -> invalidate_inode_pages2_range() // drop file's cache
>>>>        `-> thaw_super()               // thaw (both kernel & user call)
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>> new dax mapping from being created.  Do not shutdown filesystem directly
>>>> if configuration is not supported, or if failure range includes metadata
>>>> area.  Make sure all files and processes(not only the current progress)
>>>> are handled correctly.  Also drop the cache of associated files before
>>>> pmem is removed.
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>>> Acked-by: Dan Williams <dan.j.williams@intel.com>
>>>
>>> Hi Andrew,
>>>
>>> Shiyang had indicated that this patch has been added to
>>> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
>>> that branch.
>>>
>>> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
>>> if you have any objections with me taking this patch via the XFS tree.
>>
>> V15 was dropped from his tree on 28 Sept., you might as well pull it
>> into your own tree for 6.7.  It's been testing fine on my trees for the
>> past 3 weeks.
>>
>> https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/
> 
> Shiyang, this patch does not apply cleanly on v6.6-rc7. Can you please rebase
> the patch on v6.6-rc7 and send it to the mailing list?

Sure.  I have rebased it and sent a v15.1.  Please check it:

https://lore.kernel.org/linux-xfs/20231023072046.1626474-1-ruansy.fnst@fujitsu.com/


--
Thanks,
Ruan.

>
Chandan Babu R Oct. 23, 2023, 12:21 p.m. UTC | #8
On Mon, Oct 23, 2023 at 03:26:52 PM +0800, Shiyang Ruan wrote:
> 在 2023/10/23 14:40, Chandan Babu R 写道:
>> On Fri, Oct 20, 2023 at 08:40:09 AM -0700, Darrick J. Wong wrote:
>>> On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
>>>> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
>>>>> ====
>>>>> Changes since v14:
>>>>>   1. added/fixed code comments per Dan's comments
>>>>> ====
>>>>>
>>>>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>>>>> contains FSDAX while programs are still accessing data in this device,
>>>>> e.g.:
>>>>> ```
>>>>>   $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>>>>   # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>>>>   echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>>>>> ```
>>>>> it could come into an unacceptable state:
>>>>>    1. device has gone but mount point still exists, and umount will fail
>>>>>         with "target is busy"
>>>>>    2. programs will hang and cannot be killed
>>>>>    3. may crash with NULL pointer dereference
>>>>>
>>>>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>>>>> are going to remove the whole device, and make sure all related processes
>>>>> could be notified so that they could end up gracefully.
>>>>>
>>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>>> on it to unmap all files in use, and notify processes who are using
>>>>> those files.
>>>>>
>>>>> Call trace:
>>>>> trigger unbind
>>>>>   -> unbind_store()
>>>>>    -> ... (skip)
>>>>>     -> devres_release_all()
>>>>>      -> kill_dax()
>>>>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>>>        -> xfs_dax_notify_failure()
>>>>>        `-> freeze_super()             // freeze (kernel call)
>>>>>        `-> do xfs rmap
>>>>>        ` -> mf_dax_kill_procs()
>>>>>        `  -> collect_procs_fsdax()    // all associated processes
>>>>>        `  -> unmap_and_kill()
>>>>>        ` -> invalidate_inode_pages2_range() // drop file's cache
>>>>>        `-> thaw_super()               // thaw (both kernel & user call)
>>>>>
>>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>>> new dax mapping from being created.  Do not shutdown filesystem directly
>>>>> if configuration is not supported, or if failure range includes metadata
>>>>> area.  Make sure all files and processes(not only the current progress)
>>>>> are handled correctly.  Also drop the cache of associated files before
>>>>> pmem is removed.
>>>>>
>>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>>>>
>>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>>>> Acked-by: Dan Williams <dan.j.williams@intel.com>
>>>>
>>>> Hi Andrew,
>>>>
>>>> Shiyang had indicated that this patch has been added to
>>>> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
>>>> that branch.
>>>>
>>>> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
>>>> if you have any objections with me taking this patch via the XFS tree.
>>>
>>> V15 was dropped from his tree on 28 Sept., you might as well pull it
>>> into your own tree for 6.7.  It's been testing fine on my trees for the
>>> past 3 weeks.
>>>
>>> https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/
>> Shiyang, this patch does not apply cleanly on v6.6-rc7. Can you
>> please rebase
>> the patch on v6.6-rc7 and send it to the mailing list?
>
> Sure.  I have rebased it and sent a v15.1.  Please check it:
>
> https://lore.kernel.org/linux-xfs/20231023072046.1626474-1-ruansy.fnst@fujitsu.com/

Thank you. I have applied the patch to my local Git tree.
diff mbox series

Patch

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..f4b635526345 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -326,7 +326,8 @@  void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 4a9bbd3fe120..30e9f4e09f76 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@ 
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -73,10 +74,16 @@  xfs_dax_failure_fn(
 	struct xfs_mount		*mp = cur->bc_mp;
 	struct xfs_inode		*ip;
 	struct xfs_failure_info		*notify = data;
+	struct address_space		*mapping;
+	pgoff_t				pgoff;
+	unsigned long			pgcnt;
 	int				error = 0;
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* Continue the query because this isn't a failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -92,14 +99,60 @@  xfs_dax_failure_fn(
 		return 0;
 	}
 
-	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
-				  xfs_failure_pgoff(mp, rec, notify),
-				  xfs_failure_pgcnt(mp, rec, notify),
-				  notify->mf_flags);
+	mapping = VFS_I(ip)->i_mapping;
+	pgoff = xfs_failure_pgoff(mp, rec, notify);
+	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+	/* Continue the rmap query if the inode isn't a dax file. */
+	if (dax_mapping(mapping))
+		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+					  notify->mf_flags);
+
+	/* Invalidate the cache in dax pages. */
+	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+		invalidate_inode_pages2_range(mapping, pgoff,
+					      pgoff + pgcnt - 1);
+
 	xfs_irele(ip);
 	return error;
 }
 
+static int
+xfs_dax_notify_failure_freeze(
+	struct xfs_mount	*mp)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+	if (error)
+		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+	return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+	struct xfs_mount	*mp,
+	bool			kernel_frozen)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	if (kernel_frozen) {
+		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+		if (error)
+			xfs_emerg(mp, "still frozen after notify failure, err=%d",
+				error);
+	}
+
+	/*
+	 * Also thaw userspace call anyway because the device is about to be
+	 * removed immediately.
+	 */
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
 static int
 xfs_dax_notify_ddev_failure(
 	struct xfs_mount	*mp,
@@ -112,15 +165,29 @@  xfs_dax_notify_ddev_failure(
 	struct xfs_btree_cur	*cur = NULL;
 	struct xfs_buf		*agf_bp = NULL;
 	int			error = 0;
+	bool			kernel_frozen = false;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
 	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
 	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
 							     daddr + bblen - 1);
 	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "Device is about to be removed!");
+		/*
+		 * Freeze fs to prevent new mappings from being created.
+		 * - Keep going on if others already hold the kernel forzen.
+		 * - Keep going on if other errors too because this device is
+		 *   starting to fail.
+		 * - If kernel frozen state is hold successfully here, thaw it
+		 *   here as well at the end.
+		 */
+		kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
+	}
+
 	error = xfs_trans_alloc_empty(mp, &tp);
 	if (error)
-		return error;
+		goto out;
 
 	for (; agno <= end_agno; agno++) {
 		struct xfs_rmap_irec	ri_low = { };
@@ -165,11 +232,26 @@  xfs_dax_notify_ddev_failure(
 	}
 
 	xfs_trans_cancel(tp);
-	if (error || notify.want_shutdown) {
+
+	/*
+	 * Shutdown fs from a force umount in pre-remove case which won't fail,
+	 * so errors can be ignored.  Otherwise, shutdown the filesystem with
+	 * CORRUPT flag if error occured or notify.want_shutdown was set during
+	 * RMAP querying.
+	 */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+	else if (error || notify.want_shutdown) {
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
 	}
+
+out:
+	/* Thaw the fs if it has been frozen before. */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
 	return error;
 }
 
@@ -197,6 +279,14 @@  xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		/*
+		 * In the pre-remove case the failure notification is attempting
+		 * to trigger a force unmount.  The expectation is that the
+		 * device is still present, but its removal is in progress and
+		 * can not be cancelled, proceed with accessing the log device.
+		 */
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -210,6 +300,12 @@  xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device. */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..a10c75bebd6d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3665,6 +3665,7 @@  enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e245191e6b04..955edea9837f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -683,7 +683,7 @@  static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
  */
 static void collect_procs_fsdax(struct page *page,
 		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+		struct list_head *to_kill, bool pre_remove)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -691,8 +691,15 @@  static void collect_procs_fsdax(struct page *page,
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		struct task_struct *t = task_early_kill(tsk, true);
+		struct task_struct *t = tsk;
 
+		/*
+		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+		 * the current may not be the one accessing the fsdax page.
+		 * Otherwise, search for the current task.
+		 */
+		if (!pre_remove)
+			t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1788,6 +1795,7 @@  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 	dax_entry_t cookie;
 	struct page *page;
 	size_t end = index + count;
+	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
 
 	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
 
@@ -1799,9 +1807,14 @@  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		if (!page)
 			goto unlock;
 
-		SetPageHWPoison(page);
+		if (!pre_remove)
+			SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		/*
+		 * The pre_remove case is revoking access, the memory is still
+		 * good and could theoretically be put back into service.
+		 */
+		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock: