[v4,2/2] xfs: avoid transaction reservation recursion
diff mbox series

Message ID 20200801154632.866356-3-laoar.shao@gmail.com
State New
Headers show
Series
  • void xfs transaction reservation recursion
Related show

Commit Message

Yafang Shao Aug. 1, 2020, 3:46 p.m. UTC
From: Yafang Shao <shaoyafang@didiglobal.com>

PF_FSTRANS which is used to avoid transaction reservation recursion, is
dropped since commit 9070733b4efa ("xfs: abstract PF_FSTRANS to
PF_MEMALLOC_NOFS") and commit 7dea19f9ee63 ("mm: introduce
memalloc_nofs_{save,restore} API") and replaced by PF_MEMALLOC_NOFS which
means to avoid filesystem reclaim recursion. That change is subtle.
Let's take the exmple of the check of WARN_ON_ONCE(current->flags &
PF_MEMALLOC_NOFS)) to explain why this abstraction from PF_FSTRANS to
PF_MEMALLOC_NOFS is not proper.
Below comment is quoted from Dave,
> It wasn't for memory allocation recursion protection in XFS - it was for
> transaction reservation recursion protection by something trying to flush
> data pages while holding a transaction reservation. Doing
> this could deadlock the journal because the existing reservation
> could prevent the nested reservation for being able to reserve space
> in the journal and that is a self-deadlock vector.
> IOWs, this check is not protecting against memory reclaim recursion
> bugs at all (that's the previous check [1]). This check is
> protecting against the filesystem calling writepages directly from a
> context where it can self-deadlock.
> So what we are seeing here is that the PF_FSTRANS ->
> PF_MEMALLOC_NOFS abstraction lost all the actual useful information
> about what type of error this check was protecting against.

As a result, we should reintroduce PF_FSTRANS. As current->journal_info
isn't used in XFS, we can reuse it to indicate whehter the task is in
fstrans or not.

[1]. Below check is to avoid memory reclaim recursion.
if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
        PF_MEMALLOC))
        goto redirty;

Signed-off-by: Yafang Shao <shaoyafang@didiglobal.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
---
 fs/iomap/buffered-io.c    |  4 ++--
 fs/xfs/libxfs/xfs_btree.c |  2 ++
 fs/xfs/xfs_aops.c         |  3 +++
 fs/xfs/xfs_linux.h        | 19 +++++++++++++++++++
 fs/xfs/xfs_trans.c        |  8 +++++++-
 5 files changed, 33 insertions(+), 3 deletions(-)

Comments

Dave Chinner Aug. 4, 2020, 11:35 p.m. UTC | #1
On Sat, Aug 01, 2020 at 11:46:32AM -0400, Yafang Shao wrote:
> From: Yafang Shao <shaoyafang@didiglobal.com>
> 
> PF_FSTRANS which is used to avoid transaction reservation recursion, is
> dropped since commit 9070733b4efa ("xfs: abstract PF_FSTRANS to
> PF_MEMALLOC_NOFS") and commit 7dea19f9ee63 ("mm: introduce
> memalloc_nofs_{save,restore} API") and replaced by PF_MEMALLOC_NOFS which
> means to avoid filesystem reclaim recursion. That change is subtle.
> Let's take the exmple of the check of WARN_ON_ONCE(current->flags &
> PF_MEMALLOC_NOFS)) to explain why this abstraction from PF_FSTRANS to
> PF_MEMALLOC_NOFS is not proper.
> Below comment is quoted from Dave,
> > It wasn't for memory allocation recursion protection in XFS - it was for
> > transaction reservation recursion protection by something trying to flush
> > data pages while holding a transaction reservation. Doing
> > this could deadlock the journal because the existing reservation
> > could prevent the nested reservation for being able to reserve space
> > in the journal and that is a self-deadlock vector.
> > IOWs, this check is not protecting against memory reclaim recursion
> > bugs at all (that's the previous check [1]). This check is
> > protecting against the filesystem calling writepages directly from a
> > context where it can self-deadlock.
> > So what we are seeing here is that the PF_FSTRANS ->
> > PF_MEMALLOC_NOFS abstraction lost all the actual useful information
> > about what type of error this check was protecting against.
> 
> As a result, we should reintroduce PF_FSTRANS. As current->journal_info
> isn't used in XFS, we can reuse it to indicate whehter the task is in
> fstrans or not.

IF we are just going to use ->journal_info, do it the simple way.

> index 2d25bab68764..0795511f9e6a 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2825,6 +2825,7 @@ xfs_btree_split_worker(
>  	if (args->kswapd)
>  		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  
> +	xfs_trans_context_start();

Not really. We are transferring a transaction context here, not
starting one. Hence this reads somewhat strangely.

This implies that the helper function should be something like:

	xfs_trans_context_set(tp);


>  	current_set_flags_nested(&pflags, new_pflags);
>  
>  	args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
> @@ -2832,6 +2833,7 @@ xfs_btree_split_worker(
>  	complete(args->done);
>  
>  	current_restore_flags_nested(&pflags, new_pflags);
> +	xfs_trans_context_end();

And this is more likely xfs_trans_context_clear(tp)

Reasons for this will become clear soon...

>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index b35611882ff9..39ef95acdd8e 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -63,6 +63,8 @@ xfs_setfilesize_trans_alloc(
>  	 * clear the flag here.
>  	 */
>  	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> +	xfs_trans_context_end();

Note the repeated pairing of functions in this patch?

> +
>  	return 0;
>  }
>  
> @@ -125,6 +127,7 @@ xfs_setfilesize_ioend(
>  	 * thus we need to mark ourselves as being in a transaction manually.
>  	 * Similarly for freeze protection.
>  	 */
> +	xfs_trans_context_start();
>  	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>  
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index 9f70d2f68e05..1192b660a968 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -111,6 +111,25 @@ typedef __u32			xfs_nlink_t;
>  #define current_restore_flags_nested(sp, f)	\
>  		(current->flags = ((current->flags & ~(f)) | (*(sp) & (f))))
>  
> +static inline void xfs_trans_context_start(void)
> +{
> +	long flags = (long)current->journal_info;
> +
> +	/*
> +	 * Reuse journal_info to indicate whehter the current is in fstrans
> +	 * or not.
> +	 */
> +	current->journal_info = (void *)(flags + 1);
> +}
> +
> +static inline void xfs_trans_context_end(void)
> +{
> +	long flags = (long)current->journal_info;
> +
> +	WARN_ON_ONCE(flags <= 0);
> +	current->journal_info = ((void *)(flags - 1));
> +}

This is overly complex, and cannot be used for validation that we
are clearing the transaction context we expect to be clearing. These
are really "set" and "clear" operations, and for rolling
transactions we are going to need an "update" operation, too.

As per my comments about the previous patch, _set() would be done
in xfs_trans_alloc(), _clear() would be done on the final
xfs_trans_commit() or _cancel() and _update() would be done when the
transaction rolls.

Then we can roll in the NOFS updates, and we get these three helper
functions that keep all the per-transaction thread state coherent:

static inline void
xfs_trans_context_set(struct xfs_trans *tp)
{
	ASSERT(!current->journal_info);
	current->journal_info = tp;
	tp->t_flags = memalloc_nofs_save();
}

static inline void
xfs_trans_context_update(struct xfs_trans *old, struct xfs_trans *new)
{
	ASSERT(current->journal_info == old);
	current->journal_info = new;
	new->t_flags = old->t_flags;
}

static inline void
xfs_trans_context_clear(struct xfs_trans *tp)
{
	ASSERT(current->journal_info == tp);
	current->journal_info = NULL;
	memalloc_nofs_restore(tp->t_flags);
}

static bool
xfs_trans_context_active(void)
{
	return current->journal_info != NULL;
}


Cheers,

Dave.
Yafang Shao Aug. 7, 2020, 4:11 a.m. UTC | #2
On Wed, Aug 5, 2020 at 7:35 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Aug 01, 2020 at 11:46:32AM -0400, Yafang Shao wrote:
> > From: Yafang Shao <shaoyafang@didiglobal.com>
> >
> > PF_FSTRANS which is used to avoid transaction reservation recursion, is
> > dropped since commit 9070733b4efa ("xfs: abstract PF_FSTRANS to
> > PF_MEMALLOC_NOFS") and commit 7dea19f9ee63 ("mm: introduce
> > memalloc_nofs_{save,restore} API") and replaced by PF_MEMALLOC_NOFS which
> > means to avoid filesystem reclaim recursion. That change is subtle.
> > Let's take the exmple of the check of WARN_ON_ONCE(current->flags &
> > PF_MEMALLOC_NOFS)) to explain why this abstraction from PF_FSTRANS to
> > PF_MEMALLOC_NOFS is not proper.
> > Below comment is quoted from Dave,
> > > It wasn't for memory allocation recursion protection in XFS - it was for
> > > transaction reservation recursion protection by something trying to flush
> > > data pages while holding a transaction reservation. Doing
> > > this could deadlock the journal because the existing reservation
> > > could prevent the nested reservation for being able to reserve space
> > > in the journal and that is a self-deadlock vector.
> > > IOWs, this check is not protecting against memory reclaim recursion
> > > bugs at all (that's the previous check [1]). This check is
> > > protecting against the filesystem calling writepages directly from a
> > > context where it can self-deadlock.
> > > So what we are seeing here is that the PF_FSTRANS ->
> > > PF_MEMALLOC_NOFS abstraction lost all the actual useful information
> > > about what type of error this check was protecting against.
> >
> > As a result, we should reintroduce PF_FSTRANS. As current->journal_info
> > isn't used in XFS, we can reuse it to indicate whehter the task is in
> > fstrans or not.
>
> IF we are just going to use ->journal_info, do it the simple way.
>
> > index 2d25bab68764..0795511f9e6a 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -2825,6 +2825,7 @@ xfs_btree_split_worker(
> >       if (args->kswapd)
> >               new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >
> > +     xfs_trans_context_start();
>
> Not really. We are transferring a transaction context here, not
> starting one. Hence this reads somewhat strangely.
>
> This implies that the helper function should be something like:
>
>         xfs_trans_context_set(tp);
>
>
> >       current_set_flags_nested(&pflags, new_pflags);
> >
> >       args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
> > @@ -2832,6 +2833,7 @@ xfs_btree_split_worker(
> >       complete(args->done);
> >
> >       current_restore_flags_nested(&pflags, new_pflags);
> > +     xfs_trans_context_end();
>
> And this is more likely xfs_trans_context_clear(tp)
>
> Reasons for this will become clear soon...
>
> >  }
> >
> >  /*
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index b35611882ff9..39ef95acdd8e 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -63,6 +63,8 @@ xfs_setfilesize_trans_alloc(
> >        * clear the flag here.
> >        */
> >       current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> > +     xfs_trans_context_end();
>
> Note the repeated pairing of functions in this patch?
>
> > +
> >       return 0;
> >  }
> >
> > @@ -125,6 +127,7 @@ xfs_setfilesize_ioend(
> >        * thus we need to mark ourselves as being in a transaction manually.
> >        * Similarly for freeze protection.
> >        */
> > +     xfs_trans_context_start();
> >       current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> >       __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
> >
> > diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> > index 9f70d2f68e05..1192b660a968 100644
> > --- a/fs/xfs/xfs_linux.h
> > +++ b/fs/xfs/xfs_linux.h
> > @@ -111,6 +111,25 @@ typedef __u32                    xfs_nlink_t;
> >  #define current_restore_flags_nested(sp, f)  \
> >               (current->flags = ((current->flags & ~(f)) | (*(sp) & (f))))
> >
> > +static inline void xfs_trans_context_start(void)
> > +{
> > +     long flags = (long)current->journal_info;
> > +
> > +     /*
> > +      * Reuse journal_info to indicate whehter the current is in fstrans
> > +      * or not.
> > +      */
> > +     current->journal_info = (void *)(flags + 1);
> > +}
> > +
> > +static inline void xfs_trans_context_end(void)
> > +{
> > +     long flags = (long)current->journal_info;
> > +
> > +     WARN_ON_ONCE(flags <= 0);
> > +     current->journal_info = ((void *)(flags - 1));
> > +}
>
> This is overly complex, and cannot be used for validation that we
> are clearing the transaction context we expect to be clearing. These
> are really "set" and "clear" operations, and for rolling
> transactions we are going to need an "update" operation, too.
>
> As per my comments about the previous patch, _set() would be done
> in xfs_trans_alloc(), _clear() would be done on the final
> xfs_trans_commit() or _cancel() and _update() would be done when the
> transaction rolls.
>
> Then we can roll in the NOFS updates, and we get these three helper
> functions that keep all the per-transaction thread state coherent:
>
> static inline void
> xfs_trans_context_set(struct xfs_trans *tp)
> {
>         ASSERT(!current->journal_info);
>         current->journal_info = tp;
>         tp->t_flags = memalloc_nofs_save();
> }
>
> static inline void
> xfs_trans_context_update(struct xfs_trans *old, struct xfs_trans *new)
> {
>         ASSERT(current->journal_info == old);
>         current->journal_info = new;
>         new->t_flags = old->t_flags;
> }
>
> static inline void
> xfs_trans_context_clear(struct xfs_trans *tp)
> {
>         ASSERT(current->journal_info == tp);
>         current->journal_info = NULL;
>         memalloc_nofs_restore(tp->t_flags);
> }
>

Below helper will be used in fs/iomap/buffered-io.c, so I think we'd
better name it with fstrans_context_active() and put it in
include/linux/iomap.h, while regarding the other three helpers I think
we'd better put them in fs/xfs/xfs_trans.h.
> static bool
> xfs_trans_context_active(void)
> {
>         return current->journal_info != NULL;
> }
>

Many thanks for the detailed explanation, I will update with your
suggestion in the next version.

Patch
diff mbox series

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..b3f66b6b5116 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1500,9 +1500,9 @@  iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 
 	/*
 	 * Given that we do not allow direct reclaim to call us, we should
-	 * never be called in a recursive filesystem reclaim context.
+	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
+	if (WARN_ON_ONCE(current->journal_info))
 		goto redirty;
 
 	/*
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 2d25bab68764..0795511f9e6a 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2825,6 +2825,7 @@  xfs_btree_split_worker(
 	if (args->kswapd)
 		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 
+	xfs_trans_context_start();
 	current_set_flags_nested(&pflags, new_pflags);
 
 	args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
@@ -2832,6 +2833,7 @@  xfs_btree_split_worker(
 	complete(args->done);
 
 	current_restore_flags_nested(&pflags, new_pflags);
+	xfs_trans_context_end();
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..39ef95acdd8e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -63,6 +63,8 @@  xfs_setfilesize_trans_alloc(
 	 * clear the flag here.
 	 */
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	xfs_trans_context_end();
+
 	return 0;
 }
 
@@ -125,6 +127,7 @@  xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
+	xfs_trans_context_start();
 	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 9f70d2f68e05..1192b660a968 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -111,6 +111,25 @@  typedef __u32			xfs_nlink_t;
 #define current_restore_flags_nested(sp, f)	\
 		(current->flags = ((current->flags & ~(f)) | (*(sp) & (f))))
 
+static inline void xfs_trans_context_start(void)
+{
+	long flags = (long)current->journal_info;
+
+	/*
+	 * Reuse journal_info to indicate whehter the current is in fstrans
+	 * or not.
+	 */
+	current->journal_info = (void *)(flags + 1);
+}
+
+static inline void xfs_trans_context_end(void)
+{
+	long flags = (long)current->journal_info;
+
+	WARN_ON_ONCE(flags <= 0);
+	current->journal_info = ((void *)(flags - 1));
+}
+
 #define NBBY		8		/* number of bits per byte */
 
 /*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 9ff41970d0c7..38d94679ad41 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -153,6 +153,7 @@  xfs_trans_reserve(
 	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
+	xfs_trans_context_start();
 	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
@@ -859,6 +860,7 @@  __xfs_trans_commit(
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	xfs_trans_context_end();
 	xfs_trans_free(tp);
 
 	/*
@@ -891,6 +893,7 @@  __xfs_trans_commit(
 		tp->t_ticket = NULL;
 	}
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	xfs_trans_context_end();
 	xfs_trans_free_items(tp, !!error);
 	xfs_trans_free(tp);
 
@@ -952,6 +955,7 @@  xfs_trans_cancel(
 
 	/* mark this thread as no longer being in a transaction */
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	xfs_trans_context_end();
 
 	xfs_trans_free_items(tp, dirty);
 	xfs_trans_free(tp);
@@ -1005,8 +1009,10 @@  xfs_trans_roll(
 	tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
 	tp = *tpp;
 	error = xfs_trans_reserve(tp, &tres, 0, 0);
-	if (error)
+	if (error) {
 		current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+		xfs_trans_context_end();
+	}
 
 	return error;
 }