diff mbox series

[v7,2/2] xfs: avoid transaction reservation recursion

Message ID 20200827013444.24270-3-laoar.shao@gmail.com
State Superseded
Headers show
Series avoid xfs transaction reservation recursion | expand

Commit Message

Yafang Shao Aug. 27, 2020, 1:34 a.m. UTC
PF_FSTRANS which is used to avoid transaction reservation recursion, is
dropped since commit 9070733b4efa ("xfs: abstract PF_FSTRANS to
PF_MEMALLOC_NOFS") and commit 7dea19f9ee63 ("mm: introduce
memalloc_nofs_{save,restore} API") and replaced by PF_MEMALLOC_NOFS which
means to avoid filesystem reclaim recursion. That change is subtle.
Let's take the exmple of the check of WARN_ON_ONCE(current->flags &
PF_MEMALLOC_NOFS)) to explain why this abstraction from PF_FSTRANS to
PF_MEMALLOC_NOFS is not proper.
Below comment is quoted from Dave,
> It wasn't for memory allocation recursion protection in XFS - it was for
> transaction reservation recursion protection by something trying to flush
> data pages while holding a transaction reservation. Doing
> this could deadlock the journal because the existing reservation
> could prevent the nested reservation for being able to reserve space
> in the journal and that is a self-deadlock vector.
> IOWs, this check is not protecting against memory reclaim recursion
> bugs at all (that's the previous check [1]). This check is
> protecting against the filesystem calling writepages directly from a
> context where it can self-deadlock.
> So what we are seeing here is that the PF_FSTRANS ->
> PF_MEMALLOC_NOFS abstraction lost all the actual useful information
> about what type of error this check was protecting against.

As a result, we should reintroduce PF_FSTRANS. As current->journal_info
isn't used in XFS, we can reuse it to indicate whehter the task is in
fstrans or not, Per Willy. To achieve that, four new helpers are introduce
in this patch, per Dave:
- xfs_trans_context_set()
  Used in xfs_trans_alloc()
- xfs_trans_context_clear()
  Used in xfs_trans_commit() and xfs_trans_cancel()
- xfs_trans_context_update()
  Used in xfs_trans_roll()
- xfs_trans_context_active()
  To check whehter current is in fs transcation or not

This recursion check works for XFS only currently, so a new member is
introduced in struct writeback_control.

[1]. Below check is to avoid memory reclaim recursion.
if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
        PF_MEMALLOC))
        goto redirty;

Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 fs/iomap/buffered-io.c    |  4 ++--
 fs/xfs/xfs_aops.c         | 11 +++++++++--
 fs/xfs/xfs_linux.h        |  4 ----
 fs/xfs/xfs_trans.c        | 19 +++++++++----------
 fs/xfs/xfs_trans.h        | 30 ++++++++++++++++++++++++++++++
 include/linux/writeback.h |  3 +++
 6 files changed, 53 insertions(+), 18 deletions(-)

Comments

Matthew Wilcox Aug. 27, 2020, 1:58 a.m. UTC | #1
On Thu, Aug 27, 2020 at 09:34:44AM +0800, Yafang Shao wrote:
> @@ -1500,9 +1500,9 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  
>  	/*
>  	 * Given that we do not allow direct reclaim to call us, we should
> -	 * never be called in a recursive filesystem reclaim context.
> +	 * never be called while in a filesystem transaction.
>  	 */
> -	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> +	if (WARN_ON_ONCE(wbc->fstrans_recursion))
>  		goto redirty;

Erm, Dave said:

> I think we should just remove
> the check completely from iomap_writepage() and move it up into
> xfs_vm_writepage() and xfs_vm_writepages().

ie everywhere you set this new bit, just check current->journal_info.
Yafang Shao Aug. 27, 2020, 2:13 a.m. UTC | #2
On Thu, Aug 27, 2020 at 9:58 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Aug 27, 2020 at 09:34:44AM +0800, Yafang Shao wrote:
> > @@ -1500,9 +1500,9 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
> >
> >       /*
> >        * Given that we do not allow direct reclaim to call us, we should
> > -      * never be called in a recursive filesystem reclaim context.
> > +      * never be called while in a filesystem transaction.
> >        */
> > -     if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > +     if (WARN_ON_ONCE(wbc->fstrans_recursion))
> >               goto redirty;
>
> Erm, Dave said:
>
> > I think we should just remove
> > the check completely from iomap_writepage() and move it up into
> > xfs_vm_writepage() and xfs_vm_writepages().
>
> ie everywhere you set this new bit, just check current->journal_info.


I can't get you. Would you pls. be more specific ?

I move the check of current->journal into xfs_vm_writepage() and
xfs_vm_writepages(), and I think that is the easiest way to implement
it.

       /* we abort the update if there was an IO error */
@@ -564,6 +565,9 @@ xfs_vm_writepage(
 {
        struct xfs_writepage_ctx wpc = { };

+       if (xfs_trans_context_active())
+               wbc->fstrans_recursion = 1;    <<< set for XFS only.
+
        return iomap_writepage(page, wbc, &wpc.ctx, &xfs_writeback_ops);
 }


--
Thanks
Yafang
Dave Chinner Aug. 27, 2020, 2:42 a.m. UTC | #3
On Thu, Aug 27, 2020 at 10:13:15AM +0800, Yafang Shao wrote:
> On Thu, Aug 27, 2020 at 9:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Aug 27, 2020 at 09:34:44AM +0800, Yafang Shao wrote:
> > > @@ -1500,9 +1500,9 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
> > >
> > >       /*
> > >        * Given that we do not allow direct reclaim to call us, we should
> > > -      * never be called in a recursive filesystem reclaim context.
> > > +      * never be called while in a filesystem transaction.
> > >        */
> > > -     if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > > +     if (WARN_ON_ONCE(wbc->fstrans_recursion))
> > >               goto redirty;
> >
> > Erm, Dave said:
> >
> > > I think we should just remove
> > > the check completely from iomap_writepage() and move it up into
> > > xfs_vm_writepage() and xfs_vm_writepages().
> >
> > ie everywhere you set this new bit, just check current->journal_info.
> 
> 
> I can't get you. Would you pls. be more specific ?
> 
> I move the check of current->journal into xfs_vm_writepage() and
> xfs_vm_writepages(), and I think that is the easiest way to implement
> it.
> 
>        /* we abort the update if there was an IO error */
> @@ -564,6 +565,9 @@ xfs_vm_writepage(
>  {
>         struct xfs_writepage_ctx wpc = { };
> 
> +       if (xfs_trans_context_active())
> +               wbc->fstrans_recursion = 1;    <<< set for XFS only.
> +
>         return iomap_writepage(page, wbc, &wpc.ctx, &xfs_writeback_ops);

Get rid of wbc->fstrans_recursion. Just do

	if (WARN_ON_ONCE(current->journal_info))
		.....

right here in the XFS code.

Cheers,

Dave.
Yafang Shao Aug. 27, 2020, 2:59 a.m. UTC | #4
On Thu, Aug 27, 2020 at 10:42 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Aug 27, 2020 at 10:13:15AM +0800, Yafang Shao wrote:
> > On Thu, Aug 27, 2020 at 9:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Aug 27, 2020 at 09:34:44AM +0800, Yafang Shao wrote:
> > > > @@ -1500,9 +1500,9 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
> > > >
> > > >       /*
> > > >        * Given that we do not allow direct reclaim to call us, we should
> > > > -      * never be called in a recursive filesystem reclaim context.
> > > > +      * never be called while in a filesystem transaction.
> > > >        */
> > > > -     if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > > > +     if (WARN_ON_ONCE(wbc->fstrans_recursion))
> > > >               goto redirty;
> > >
> > > Erm, Dave said:
> > >
> > > > I think we should just remove
> > > > the check completely from iomap_writepage() and move it up into
> > > > xfs_vm_writepage() and xfs_vm_writepages().
> > >
> > > ie everywhere you set this new bit, just check current->journal_info.
> >
> >
> > I can't get you. Would you pls. be more specific ?
> >
> > I move the check of current->journal into xfs_vm_writepage() and
> > xfs_vm_writepages(), and I think that is the easiest way to implement
> > it.
> >
> >        /* we abort the update if there was an IO error */
> > @@ -564,6 +565,9 @@ xfs_vm_writepage(
> >  {
> >         struct xfs_writepage_ctx wpc = { };
> >
> > +       if (xfs_trans_context_active())
> > +               wbc->fstrans_recursion = 1;    <<< set for XFS only.
> > +
> >         return iomap_writepage(page, wbc, &wpc.ctx, &xfs_writeback_ops);
>
> Get rid of wbc->fstrans_recursion. Just do
>
>         if (WARN_ON_ONCE(current->journal_info))
>                 .....
>
> right here in the XFS code.
>

Understood.
But we have to implement the 'redirty' as well in the XFS code, that
may make the implementation more complicated.
diff mbox series

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..12e0fc4cb825 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1500,9 +1500,9 @@  iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 
 	/*
 	 * Given that we do not allow direct reclaim to call us, we should
-	 * never be called in a recursive filesystem reclaim context.
+	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
+	if (WARN_ON_ONCE(wbc->fstrans_recursion))
 		goto redirty;
 
 	/*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..9e2eedc9d208 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -62,7 +62,8 @@  xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	xfs_trans_context_clear(tp);
+
 	return 0;
 }
 
@@ -125,7 +126,7 @@  xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	xfs_trans_context_set(tp);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -564,6 +565,9 @@  xfs_vm_writepage(
 {
 	struct xfs_writepage_ctx wpc = { };
 
+	if (xfs_trans_context_active())
+		wbc->fstrans_recursion = 1;
+
 	return iomap_writepage(page, wbc, &wpc.ctx, &xfs_writeback_ops);
 }
 
@@ -574,6 +578,9 @@  xfs_vm_writepages(
 {
 	struct xfs_writepage_ctx wpc = { };
 
+	if (xfs_trans_context_active())
+		wbc->fstrans_recursion = 1;
+
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
 }
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index ab737fed7b12..8a4f6db77e33 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -102,10 +102,6 @@  typedef __u32			xfs_nlink_t;
 #define xfs_cowb_secs		xfs_params.cowb_timer.val
 
 #define current_cpu()		(raw_smp_processor_id())
-#define current_set_flags_nested(sp, f)		\
-		(*(sp) = current->flags, current->flags |= (f))
-#define current_restore_flags_nested(sp, f)	\
-		(current->flags = ((current->flags & ~(f)) | (*(sp) & (f))))
 
 #define NBBY		8		/* number of bits per byte */
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index ed72867b1a19..5f3a4ff51b3c 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -153,8 +153,6 @@  xfs_trans_reserve(
 	int			error = 0;
 	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
-	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -163,10 +161,8 @@  xfs_trans_reserve(
 	 */
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
-		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+		if (error != 0)
 			return -ENOSPC;
-		}
 		tp->t_blk_res += blocks;
 	}
 
@@ -241,8 +237,6 @@  xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
-
 	return error;
 }
 
@@ -284,6 +278,8 @@  xfs_trans_alloc(
 	INIT_LIST_HEAD(&tp->t_dfops);
 	tp->t_firstblock = NULLFSBLOCK;
 
+	/* Mark this thread as being in a transaction */
+	xfs_trans_context_set(tp);
 	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
 	if (error) {
 		xfs_trans_cancel(tp);
@@ -878,7 +874,8 @@  __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	if (!regrant)
+		xfs_trans_context_clear(tp);
 	xfs_trans_free(tp);
 
 	/*
@@ -910,7 +907,8 @@  __xfs_trans_commit(
 			xfs_log_ticket_ungrant(mp->m_log, tp->t_ticket);
 		tp->t_ticket = NULL;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+
+	xfs_trans_context_clear(tp);
 	xfs_trans_free_items(tp, !!error);
 	xfs_trans_free(tp);
 
@@ -971,7 +969,7 @@  xfs_trans_cancel(
 	}
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	xfs_trans_context_clear(tp);
 
 	xfs_trans_free_items(tp, dirty);
 	xfs_trans_free(tp);
@@ -1013,6 +1011,7 @@  xfs_trans_roll(
 	if (error)
 		return error;
 
+	xfs_trans_context_update(trans, *tpp);
 	/*
 	 * Reserve space in the log for the next transaction.
 	 * This also pushes items in the "AIL", the list of logged items,
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index b752501818d2..f84b563438f6 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -243,4 +243,34 @@  void		xfs_trans_buf_copy_type(struct xfs_buf *dst_bp,
 
 extern kmem_zone_t	*xfs_trans_zone;
 
+static inline void
+xfs_trans_context_set(struct xfs_trans *tp)
+{
+	ASSERT(!current->journal_info);
+	current->journal_info = tp;
+	tp->t_pflags = memalloc_nofs_save();
+}
+
+static inline void
+xfs_trans_context_update(struct xfs_trans *old, struct xfs_trans *new)
+{
+	ASSERT(current->journal_info == old);
+	current->journal_info = new;
+}
+
+static inline void
+xfs_trans_context_clear(struct xfs_trans *tp)
+{
+	ASSERT(current->journal_info == tp);
+	current->journal_info = NULL;
+	memalloc_nofs_restore(tp->t_pflags);
+}
+
+static inline bool
+xfs_trans_context_active(void)
+{
+	/* Use journal_info to indicate current is in a transaction */
+	return current->journal_info != NULL;
+}
+
 #endif	/* __XFS_TRANS_H__ */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8e5c5bb16e2d..33f6fb901a78 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -80,6 +80,9 @@  struct writeback_control {
 
 	unsigned punt_to_cgroup:1;	/* cgrp punting, see __REQ_CGROUP_PUNT */
 
+	/* To avoid filesystem transaction reservation recursion */
+	unsigned fstrans_recursion:1;
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback *wb;	/* wb this writeback is issued under */
 	struct inode *inode;		/* inode being written out */