[2/5] xfs: allocate quotaoff transactions up front to avoid log deadlock
diff mbox

Message ID 1487173247-5965-3-git-send-email-bfoster@redhat.com
State New
Headers show

Commit Message

Brian Foster Feb. 15, 2017, 3:40 p.m. UTC
The quotaoff operation commits two explicitly synchronous
transactions to correctly handle log recovery of dquots being
modified at the time the quotaoff occurs. The first quotaoff
transaction pins the tail of the log with a qoff logitem in the AIL
to ensure further logged dquots stay ahead of the quotaoff operation
in the log. The second quotaoff_end transaction is committed after
the quotaoff operation completes, releases the original log item and
thus unpins the log.

Since the first transaction pins the tail of the log, this means a
finite amount of space in the log is available between the time a
quotaoff starts and completes before transaction reservations have
to block on the log. While the quotaoff operation itself consumes no
further log space, it is possible for other operations in the
filesystem to consume the remaining log space before the quotaoff
completes. If this occurs, the quotaoff_end transaction also blocks
on the log which prevents the release of the log item and the
filesystem deadlocks. This has been reproduced via repeated xfs/305
iterations on a vm with fairly limited resources.

To avoid a deadlock due to a particularly slow quotaoff operation,
allocate the quotaoff_end transaction immediately after the initial
quotaoff transaction is committed. Carry a reference to the
transaction through xfs_qm_scall_quotaoff() rather than the qoff log
item and commit it once the quotaoff completes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_qm_syscalls.c | 42 ++++++++++++++++++++++++------------------
 1 file changed, 24 insertions(+), 18 deletions(-)

Comments

Darrick J. Wong April 26, 2017, 9:23 p.m. UTC | #1
On Wed, Feb 15, 2017 at 10:40:44AM -0500, Brian Foster wrote:
> The quotaoff operation commits two explicitly synchronous
> transactions to correctly handle log recovery of dquots being
> modified at the time the quotaoff occurs. The first quotaoff
> transaction pins the tail of the log with a qoff logitem in the AIL
> to ensure further logged dquots stay ahead of the quotaoff operation
> in the log. The second quotaoff_end transaction is committed after
> the quotaoff operation completes, releases the original log item and
> thus unpins the log.
> 
> Since the first transaction pins the tail of the log, this means a
> finite amount of space in the log is available between the time a
> quotaoff starts and completes before transaction reservations have
> to block on the log. While the quotaoff operation itself consumes no
> further log space, it is possible for other operations in the
> filesystem to consume the remaining log space before the quotaoff
> completes. If this occurs, the quotaoff_end transaction also blocks
> on the log which prevents the release of the log item and the
> filesystem deadlocks. This has been reproduced via repeated xfs/305
> iterations on a vm with fairly limited resources.
> 
> To avoid a deadlock due to a particularly slow quotaoff operation,
> allocate the quotaoff_end transaction immediately after the initial
> quotaoff transaction is committed. Carry a reference to the
> transaction through xfs_qm_scall_quotaoff() rather than the qoff log
> item and commit it once the quotaoff completes.

Hmm... so instead of holding a pinned log item for however long it takes
to finish quotaoff, we instead hold a transaction.  That does fix the
problem where we pin the log tail and the head crashes into it, but also
comes with its own problems, doesn't it?

In xfs_qm_scall_quotaoff, we call xfs_qm_dqrele_all_inodes, which
calls xfs_dqrele_inode to call xfs_qm_dqrele on all the inodes in the
system.  However, if any of those inodes are unlinked and inactive, the
iput -> xfs_fs_destroy_inode -> xfs_inactive path truncates the inode,
which itself requires a transaction, so now we're running with nested
transactions.

I'm not sure that's necessarily a problem because the outer transaction
contains only a single qoffend item.  AFAICT I haven't encountered any
other places in XFS where a single thread runs nested transactions, but
it still seems suspect to me.

Heh, lockdep just spat this at me:

[19767.048839] =============================================
[19767.049699] [ INFO: possible recursive locking detected ]
[19767.050511] 4.11.0-rc4-djw #3 Not tainted
[19767.051122] ---------------------------------------------
[19767.051907] xfs_quota/4152 is trying to acquire lock:
[19767.052617]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.054151] 
[19767.054151] but task is already holding lock:
[19767.055246]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.056897] 
[19767.056897] other info that might help us debug this:
[19767.057980]  Possible unsafe locking scenario:
[19767.057980] 
[19767.058967]        CPU0
[19767.059385]        ----
[19767.060339]   lock(sb_internal);
[19767.061187]   lock(sb_internal);
[19767.061771] 
[19767.061771]  *** DEADLOCK ***
[19767.061771] 
[19767.062818]  May be due to missing lock nesting notation
[19767.062818] 
[19767.063967] 3 locks held by xfs_quota/4152:
[19767.064521]  #0:  (&type->s_umount_key#34){++++++}, at: [<ffffffff812287f0>] __get_super.part.18+0xc0/0xe0
[19767.065805]  #1:  (&qinf->qi_quotaofflock){+.+.+.}, at: [<ffffffffa0182a53>] xfs_qm_scall_quotaoff+0x53/0x6c0 [xfs]
[19767.067226]  #2:  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.068543] 
[19767.068543] stack backtrace:
[19767.069191] CPU: 2 PID: 4152 Comm: xfs_quota Not tainted 4.11.0-rc4-djw #3
[19767.070124] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[19767.071405] Call Trace:
[19767.071798]  dump_stack+0x85/0xc2
[19767.072298]  __lock_acquire+0x164e/0x16a0
[19767.072866]  ? kvm_sched_clock_read+0x9/0x20
[19767.073460]  ? sched_clock+0x9/0x10
[19767.073979]  lock_acquire+0xbf/0x210
[19767.074532]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.075166]  __sb_start_write+0xc0/0x220
[19767.075782]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.076475]  xfs_trans_alloc+0xe3/0x130 [xfs]
[19767.077138]  xfs_inactive_truncate+0x31/0x130 [xfs]
[19767.077862]  ? xfs_qm_dqattach+0x16/0x50 [xfs]
[19767.078524]  xfs_inactive+0x17d/0x270 [xfs]
[19767.079141]  xfs_fs_destroy_inode+0xb3/0x350 [xfs]
[19767.079964]  destroy_inode+0x3b/0x60
[19767.080679]  evict+0x132/0x190
[19767.081301]  iput+0x243/0x320
[19767.081961]  xfs_inode_ag_walk+0x2a8/0x690 [xfs]
[19767.082757]  ? xfs_inode_ag_walk+0x92/0x690 [xfs]
[19767.083545]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
[19767.084279]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
[19767.084753]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
[19767.085234]  xfs_inode_ag_iterator_flags+0x69/0xa0 [xfs]
[19767.086251]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
[19767.087219]  xfs_qm_dqrele_all_inodes+0x36/0x60 [xfs]
[19767.088431]  xfs_qm_scall_quotaoff+0x100/0x6c0 [xfs]
[19767.089285]  xfs_quota_disable+0x3d/0x50 [xfs]
[19767.090138]  SyS_quotactl+0x3ff/0x820
[19767.090853]  ? SYSC_newstat+0x3b/0x50
[19767.091501]  ? trace_hardirqs_on_caller+0x129/0x1b0
[19767.092398]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[19767.093238]  entry_SYSCALL_64_fastpath+0x1f/0xc2
[19767.094150] RIP: 0033:0x7f43530a70ca
[19767.094851] RSP: 002b:00007ffe7570fbe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3
[19767.096078] RAX: ffffffffffffffda RBX: 00007ffe7570fde8 RCX: 00007f43530a70ca
[19767.097242] RDX: 0000000000000000 RSI: 00000000012d26f0 RDI: 0000000000580202
[19767.098402] RBP: 0000000000000005 R08: 00007ffe7570fbfc R09: 0000000000005802
[19767.099645] R10: 00007ffe7570fbfc R11: 0000000000000206 R12: 0000000000401e50
[19767.100841] R13: 00007ffe7570fde0 R14: 0000000000000000 R15: 0000000000000000

--D

> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_qm_syscalls.c | 42 ++++++++++++++++++++++++------------------
>  1 file changed, 24 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index 475a388..dbb6802 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -35,9 +35,11 @@
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
>  
> -STATIC int	xfs_qm_log_quotaoff(xfs_mount_t *, xfs_qoff_logitem_t **, uint);
> -STATIC int	xfs_qm_log_quotaoff_end(xfs_mount_t *, xfs_qoff_logitem_t *,
> -					uint);
> +STATIC int	xfs_qm_log_quotaoff(struct xfs_mount *, struct xfs_trans **,
> +				    uint);
> +STATIC int	xfs_qm_log_quotaoff_end(struct xfs_mount *,
> +					struct xfs_qoff_logitem *,
> +					struct xfs_trans **, uint);
>  
>  /*
>   * Turn off quota accounting and/or enforcement for all udquots and/or
> @@ -56,7 +58,7 @@ xfs_qm_scall_quotaoff(
>  	uint			dqtype;
>  	int			error;
>  	uint			inactivate_flags;
> -	xfs_qoff_logitem_t	*qoffstart;
> +	struct xfs_trans	*qend_tp;
>  
>  	/*
>  	 * No file system can have quotas enabled on disk but not in core.
> @@ -128,7 +130,7 @@ xfs_qm_scall_quotaoff(
>  	 * and synchronously. If we fail to write, we should abort the
>  	 * operation as it cannot be recovered safely if we crash.
>  	 */
> -	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> +	error = xfs_qm_log_quotaoff(mp, &qend_tp, flags);
>  	if (error)
>  		goto out_unlock;
>  
> @@ -181,7 +183,7 @@ xfs_qm_scall_quotaoff(
>  	 * So, we have QUOTAOFF start and end logitems; the start
>  	 * logitem won't get overwritten until the end logitem appears...
>  	 */
> -	error = xfs_qm_log_quotaoff_end(mp, qoffstart, flags);
> +	error = xfs_trans_commit(qend_tp);
>  	if (error) {
>  		/* We're screwed now. Shutdown is the only option. */
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> @@ -556,13 +558,14 @@ xfs_qm_scall_setqlim(
>  
>  STATIC int
>  xfs_qm_log_quotaoff_end(
> -	xfs_mount_t		*mp,
> -	xfs_qoff_logitem_t	*startqoff,
> +	struct xfs_mount	*mp,
> +	struct xfs_qoff_logitem	*startqoff,
> +	struct xfs_trans	**tpp,
>  	uint			flags)
>  {
> -	xfs_trans_t		*tp;
> +	struct xfs_trans	*tp;
>  	int			error;
> -	xfs_qoff_logitem_t	*qoffi;
> +	struct xfs_qoff_logitem	*qoffi;
>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
>  	if (error)
> @@ -578,21 +581,22 @@ xfs_qm_log_quotaoff_end(
>  	 * We don't care about quotoff's performance.
>  	 */
>  	xfs_trans_set_sync(tp);
> -	return xfs_trans_commit(tp);
> +	*tpp = tp;
> +	return 0;
>  }
>  
>  
>  STATIC int
>  xfs_qm_log_quotaoff(
> -	xfs_mount_t	       *mp,
> -	xfs_qoff_logitem_t     **qoffstartp,
> -	uint		       flags)
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	**end_tp,
> +	uint			flags)
>  {
> -	xfs_trans_t	       *tp;
> +	struct xfs_trans	*tp;
>  	int			error;
> -	xfs_qoff_logitem_t     *qoffi;
> +	struct xfs_qoff_logitem	*qoffi;
>  
> -	*qoffstartp = NULL;
> +	*end_tp = NULL;
>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
>  	if (error)
> @@ -617,7 +621,9 @@ xfs_qm_log_quotaoff(
>  	if (error)
>  		goto out;
>  
> -	*qoffstartp = qoffi;
> +	error = xfs_qm_log_quotaoff_end(mp, qoffi, end_tp, flags);
> +	if (error)
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  out:
>  	return error;
>  }
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster April 27, 2017, 12:03 p.m. UTC | #2
On Wed, Apr 26, 2017 at 02:23:33PM -0700, Darrick J. Wong wrote:
> On Wed, Feb 15, 2017 at 10:40:44AM -0500, Brian Foster wrote:
> > The quotaoff operation commits two explicitly synchronous
> > transactions to correctly handle log recovery of dquots being
> > modified at the time the quotaoff occurs. The first quotaoff
> > transaction pins the tail of the log with a qoff logitem in the AIL
> > to ensure further logged dquots stay ahead of the quotaoff operation
> > in the log. The second quotaoff_end transaction is committed after
> > the quotaoff operation completes, releases the original log item and
> > thus unpins the log.
> > 
> > Since the first transaction pins the tail of the log, this means a
> > finite amount of space in the log is available between the time a
> > quotaoff starts and completes before transaction reservations have
> > to block on the log. While the quotaoff operation itself consumes no
> > further log space, it is possible for other operations in the
> > filesystem to consume the remaining log space before the quotaoff
> > completes. If this occurs, the quotaoff_end transaction also blocks
> > on the log which prevents the release of the log item and the
> > filesystem deadlocks. This has been reproduced via repeated xfs/305
> > iterations on a vm with fairly limited resources.
> > 
> > To avoid a deadlock due to a particularly slow quotaoff operation,
> > allocate the quotaoff_end transaction immediately after the initial
> > quotaoff transaction is committed. Carry a reference to the
> > transaction through xfs_qm_scall_quotaoff() rather than the qoff log
> > item and commit it once the quotaoff completes.
> 
> Hmm... so instead of holding a pinned log item for however long it takes
> to finish quotaoff, we instead hold a transaction.  That does fix the
> problem where we pin the log tail and the head crashes into it, but also
> comes with its own problems, doesn't it?
> 
> In xfs_qm_scall_quotaoff, we call xfs_qm_dqrele_all_inodes, which
> calls xfs_dqrele_inode to call xfs_qm_dqrele on all the inodes in the
> system.  However, if any of those inodes are unlinked and inactive, the
> iput -> xfs_fs_destroy_inode -> xfs_inactive path truncates the inode,
> which itself requires a transaction, so now we're running with nested
> transactions.
> 
> I'm not sure that's necessarily a problem because the outer transaction
> contains only a single qoffend item.  AFAICT I haven't encountered any
> other places in XFS where a single thread runs nested transactions, but
> it still seems suspect to me.
> 

Yeah, good spot. I wasn't expecting to be logging anything else in this
window of time, but it looks like that is possible. We definitely don't
want to be doing nested transactions. I'll have to think about this one
some more.

Note that this was a one off fix for something I reproduced when trying
to reproduce the livelock addressed by the subsequent patches in the
series. It's not a dependency for the subsequent 3 patches.

> Heh, lockdep just spat this at me:
> 

I'm guessing I just didn't have lockdep enabled when I tested this. To
be sure... what is the reproducer for this? Thanks.

Brian

> [19767.048839] =============================================
> [19767.049699] [ INFO: possible recursive locking detected ]
> [19767.050511] 4.11.0-rc4-djw #3 Not tainted
> [19767.051122] ---------------------------------------------
> [19767.051907] xfs_quota/4152 is trying to acquire lock:
> [19767.052617]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.054151] 
> [19767.054151] but task is already holding lock:
> [19767.055246]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.056897] 
> [19767.056897] other info that might help us debug this:
> [19767.057980]  Possible unsafe locking scenario:
> [19767.057980] 
> [19767.058967]        CPU0
> [19767.059385]        ----
> [19767.060339]   lock(sb_internal);
> [19767.061187]   lock(sb_internal);
> [19767.061771] 
> [19767.061771]  *** DEADLOCK ***
> [19767.061771] 
> [19767.062818]  May be due to missing lock nesting notation
> [19767.062818] 
> [19767.063967] 3 locks held by xfs_quota/4152:
> [19767.064521]  #0:  (&type->s_umount_key#34){++++++}, at: [<ffffffff812287f0>] __get_super.part.18+0xc0/0xe0
> [19767.065805]  #1:  (&qinf->qi_quotaofflock){+.+.+.}, at: [<ffffffffa0182a53>] xfs_qm_scall_quotaoff+0x53/0x6c0 [xfs]
> [19767.067226]  #2:  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.068543] 
> [19767.068543] stack backtrace:
> [19767.069191] CPU: 2 PID: 4152 Comm: xfs_quota Not tainted 4.11.0-rc4-djw #3
> [19767.070124] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [19767.071405] Call Trace:
> [19767.071798]  dump_stack+0x85/0xc2
> [19767.072298]  __lock_acquire+0x164e/0x16a0
> [19767.072866]  ? kvm_sched_clock_read+0x9/0x20
> [19767.073460]  ? sched_clock+0x9/0x10
> [19767.073979]  lock_acquire+0xbf/0x210
> [19767.074532]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.075166]  __sb_start_write+0xc0/0x220
> [19767.075782]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.076475]  xfs_trans_alloc+0xe3/0x130 [xfs]
> [19767.077138]  xfs_inactive_truncate+0x31/0x130 [xfs]
> [19767.077862]  ? xfs_qm_dqattach+0x16/0x50 [xfs]
> [19767.078524]  xfs_inactive+0x17d/0x270 [xfs]
> [19767.079141]  xfs_fs_destroy_inode+0xb3/0x350 [xfs]
> [19767.079964]  destroy_inode+0x3b/0x60
> [19767.080679]  evict+0x132/0x190
> [19767.081301]  iput+0x243/0x320
> [19767.081961]  xfs_inode_ag_walk+0x2a8/0x690 [xfs]
> [19767.082757]  ? xfs_inode_ag_walk+0x92/0x690 [xfs]
> [19767.083545]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
> [19767.084279]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
> [19767.084753]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
> [19767.085234]  xfs_inode_ag_iterator_flags+0x69/0xa0 [xfs]
> [19767.086251]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
> [19767.087219]  xfs_qm_dqrele_all_inodes+0x36/0x60 [xfs]
> [19767.088431]  xfs_qm_scall_quotaoff+0x100/0x6c0 [xfs]
> [19767.089285]  xfs_quota_disable+0x3d/0x50 [xfs]
> [19767.090138]  SyS_quotactl+0x3ff/0x820
> [19767.090853]  ? SYSC_newstat+0x3b/0x50
> [19767.091501]  ? trace_hardirqs_on_caller+0x129/0x1b0
> [19767.092398]  ? trace_hardirqs_on_thunk+0x1a/0x1c
> [19767.093238]  entry_SYSCALL_64_fastpath+0x1f/0xc2
> [19767.094150] RIP: 0033:0x7f43530a70ca
> [19767.094851] RSP: 002b:00007ffe7570fbe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3
> [19767.096078] RAX: ffffffffffffffda RBX: 00007ffe7570fde8 RCX: 00007f43530a70ca
> [19767.097242] RDX: 0000000000000000 RSI: 00000000012d26f0 RDI: 0000000000580202
> [19767.098402] RBP: 0000000000000005 R08: 00007ffe7570fbfc R09: 0000000000005802
> [19767.099645] R10: 00007ffe7570fbfc R11: 0000000000000206 R12: 0000000000401e50
> [19767.100841] R13: 00007ffe7570fde0 R14: 0000000000000000 R15: 0000000000000000
> 
> --D
> 
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_qm_syscalls.c | 42 ++++++++++++++++++++++++------------------
> >  1 file changed, 24 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > index 475a388..dbb6802 100644
> > --- a/fs/xfs/xfs_qm_syscalls.c
> > +++ b/fs/xfs/xfs_qm_syscalls.c
> > @@ -35,9 +35,11 @@
> >  #include "xfs_trace.h"
> >  #include "xfs_icache.h"
> >  
> > -STATIC int	xfs_qm_log_quotaoff(xfs_mount_t *, xfs_qoff_logitem_t **, uint);
> > -STATIC int	xfs_qm_log_quotaoff_end(xfs_mount_t *, xfs_qoff_logitem_t *,
> > -					uint);
> > +STATIC int	xfs_qm_log_quotaoff(struct xfs_mount *, struct xfs_trans **,
> > +				    uint);
> > +STATIC int	xfs_qm_log_quotaoff_end(struct xfs_mount *,
> > +					struct xfs_qoff_logitem *,
> > +					struct xfs_trans **, uint);
> >  
> >  /*
> >   * Turn off quota accounting and/or enforcement for all udquots and/or
> > @@ -56,7 +58,7 @@ xfs_qm_scall_quotaoff(
> >  	uint			dqtype;
> >  	int			error;
> >  	uint			inactivate_flags;
> > -	xfs_qoff_logitem_t	*qoffstart;
> > +	struct xfs_trans	*qend_tp;
> >  
> >  	/*
> >  	 * No file system can have quotas enabled on disk but not in core.
> > @@ -128,7 +130,7 @@ xfs_qm_scall_quotaoff(
> >  	 * and synchronously. If we fail to write, we should abort the
> >  	 * operation as it cannot be recovered safely if we crash.
> >  	 */
> > -	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> > +	error = xfs_qm_log_quotaoff(mp, &qend_tp, flags);
> >  	if (error)
> >  		goto out_unlock;
> >  
> > @@ -181,7 +183,7 @@ xfs_qm_scall_quotaoff(
> >  	 * So, we have QUOTAOFF start and end logitems; the start
> >  	 * logitem won't get overwritten until the end logitem appears...
> >  	 */
> > -	error = xfs_qm_log_quotaoff_end(mp, qoffstart, flags);
> > +	error = xfs_trans_commit(qend_tp);
> >  	if (error) {
> >  		/* We're screwed now. Shutdown is the only option. */
> >  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > @@ -556,13 +558,14 @@ xfs_qm_scall_setqlim(
> >  
> >  STATIC int
> >  xfs_qm_log_quotaoff_end(
> > -	xfs_mount_t		*mp,
> > -	xfs_qoff_logitem_t	*startqoff,
> > +	struct xfs_mount	*mp,
> > +	struct xfs_qoff_logitem	*startqoff,
> > +	struct xfs_trans	**tpp,
> >  	uint			flags)
> >  {
> > -	xfs_trans_t		*tp;
> > +	struct xfs_trans	*tp;
> >  	int			error;
> > -	xfs_qoff_logitem_t	*qoffi;
> > +	struct xfs_qoff_logitem	*qoffi;
> >  
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
> >  	if (error)
> > @@ -578,21 +581,22 @@ xfs_qm_log_quotaoff_end(
> >  	 * We don't care about quotoff's performance.
> >  	 */
> >  	xfs_trans_set_sync(tp);
> > -	return xfs_trans_commit(tp);
> > +	*tpp = tp;
> > +	return 0;
> >  }
> >  
> >  
> >  STATIC int
> >  xfs_qm_log_quotaoff(
> > -	xfs_mount_t	       *mp,
> > -	xfs_qoff_logitem_t     **qoffstartp,
> > -	uint		       flags)
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	**end_tp,
> > +	uint			flags)
> >  {
> > -	xfs_trans_t	       *tp;
> > +	struct xfs_trans	*tp;
> >  	int			error;
> > -	xfs_qoff_logitem_t     *qoffi;
> > +	struct xfs_qoff_logitem	*qoffi;
> >  
> > -	*qoffstartp = NULL;
> > +	*end_tp = NULL;
> >  
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> >  	if (error)
> > @@ -617,7 +621,9 @@ xfs_qm_log_quotaoff(
> >  	if (error)
> >  		goto out;
> >  
> > -	*qoffstartp = qoffi;
> > +	error = xfs_qm_log_quotaoff_end(mp, qoffi, end_tp, flags);
> > +	if (error)
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> >  out:
> >  	return error;
> >  }
> > -- 
> > 2.7.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong April 27, 2017, 3:47 p.m. UTC | #3
On Thu, Apr 27, 2017 at 08:03:27AM -0400, Brian Foster wrote:
> On Wed, Apr 26, 2017 at 02:23:33PM -0700, Darrick J. Wong wrote:
> > On Wed, Feb 15, 2017 at 10:40:44AM -0500, Brian Foster wrote:
> > > The quotaoff operation commits two explicitly synchronous
> > > transactions to correctly handle log recovery of dquots being
> > > modified at the time the quotaoff occurs. The first quotaoff
> > > transaction pins the tail of the log with a qoff logitem in the AIL
> > > to ensure further logged dquots stay ahead of the quotaoff operation
> > > in the log. The second quotaoff_end transaction is committed after
> > > the quotaoff operation completes, releases the original log item and
> > > thus unpins the log.
> > > 
> > > Since the first transaction pins the tail of the log, this means a
> > > finite amount of space in the log is available between the time a
> > > quotaoff starts and completes before transaction reservations have
> > > to block on the log. While the quotaoff operation itself consumes no
> > > further log space, it is possible for other operations in the
> > > filesystem to consume the remaining log space before the quotaoff
> > > completes. If this occurs, the quotaoff_end transaction also blocks
> > > on the log which prevents the release of the log item and the
> > > filesystem deadlocks. This has been reproduced via repeated xfs/305
> > > iterations on a vm with fairly limited resources.
> > > 
> > > To avoid a deadlock due to a particularly slow quotaoff operation,
> > > allocate the quotaoff_end transaction immediately after the initial
> > > quotaoff transaction is committed. Carry a reference to the
> > > transaction through xfs_qm_scall_quotaoff() rather than the qoff log
> > > item and commit it once the quotaoff completes.
> > 
> > Hmm... so instead of holding a pinned log item for however long it takes
> > to finish quotaoff, we instead hold a transaction.  That does fix the
> > problem where we pin the log tail and the head crashes into it, but also
> > comes with its own problems, doesn't it?
> > 
> > In xfs_qm_scall_quotaoff, we call xfs_qm_dqrele_all_inodes, which
> > calls xfs_dqrele_inode to call xfs_qm_dqrele on all the inodes in the
> > system.  However, if any of those inodes are unlinked and inactive, the
> > iput -> xfs_fs_destroy_inode -> xfs_inactive path truncates the inode,
> > which itself requires a transaction, so now we're running with nested
> > transactions.
> > 
> > I'm not sure that's necessarily a problem because the outer transaction
> > contains only a single qoffend item.  AFAICT I haven't encountered any
> > other places in XFS where a single thread runs nested transactions, but
> > it still seems suspect to me.
> > 
> 
> Yeah, good spot. I wasn't expecting to be logging anything else in this
> window of time, but it looks like that is possible. We definitely don't
> want to be doing nested transactions. I'll have to think about this one
> some more.
> 
> Note that this was a one off fix for something I reproduced when trying
> to reproduce the livelock addressed by the subsequent patches in the
> series. It's not a dependency for the subsequent 3 patches.

Oh, ok.  Thank you for pointing that out, as I didn't see a strong
connection between this patch and the next three, and was wondering if
there was one.

> > Heh, lockdep just spat this at me:
> > 
> 
> I'm guessing I just didn't have lockdep enabled when I tested this. To
> be sure... what is the reproducer for this? Thanks.

Running xfs/305 in a loop.

--D

> 
> Brian
> 
> > [19767.048839] =============================================
> > [19767.049699] [ INFO: possible recursive locking detected ]
> > [19767.050511] 4.11.0-rc4-djw #3 Not tainted
> > [19767.051122] ---------------------------------------------
> > [19767.051907] xfs_quota/4152 is trying to acquire lock:
> > [19767.052617]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
> > [19767.054151] 
> > [19767.054151] but task is already holding lock:
> > [19767.055246]  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
> > [19767.056897] 
> > [19767.056897] other info that might help us debug this:
> > [19767.057980]  Possible unsafe locking scenario:
> > [19767.057980] 
> > [19767.058967]        CPU0
> > [19767.059385]        ----
> > [19767.060339]   lock(sb_internal);
> > [19767.061187]   lock(sb_internal);
> > [19767.061771] 
> > [19767.061771]  *** DEADLOCK ***
> > [19767.061771] 
> > [19767.062818]  May be due to missing lock nesting notation
> > [19767.062818] 
> > [19767.063967] 3 locks held by xfs_quota/4152:
> > [19767.064521]  #0:  (&type->s_umount_key#34){++++++}, at: [<ffffffff812287f0>] __get_super.part.18+0xc0/0xe0
> > [19767.065805]  #1:  (&qinf->qi_quotaofflock){+.+.+.}, at: [<ffffffffa0182a53>] xfs_qm_scall_quotaoff+0x53/0x6c0 [xfs]
> > [19767.067226]  #2:  (sb_internal){.+.+.+}, at: [<ffffffffa015ee13>] xfs_trans_alloc+0xe3/0x130 [xfs]
> > [19767.068543] 
> > [19767.068543] stack backtrace:
> > [19767.069191] CPU: 2 PID: 4152 Comm: xfs_quota Not tainted 4.11.0-rc4-djw #3
> > [19767.070124] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > [19767.071405] Call Trace:
> > [19767.071798]  dump_stack+0x85/0xc2
> > [19767.072298]  __lock_acquire+0x164e/0x16a0
> > [19767.072866]  ? kvm_sched_clock_read+0x9/0x20
> > [19767.073460]  ? sched_clock+0x9/0x10
> > [19767.073979]  lock_acquire+0xbf/0x210
> > [19767.074532]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
> > [19767.075166]  __sb_start_write+0xc0/0x220
> > [19767.075782]  ? xfs_trans_alloc+0xe3/0x130 [xfs]
> > [19767.076475]  xfs_trans_alloc+0xe3/0x130 [xfs]
> > [19767.077138]  xfs_inactive_truncate+0x31/0x130 [xfs]
> > [19767.077862]  ? xfs_qm_dqattach+0x16/0x50 [xfs]
> > [19767.078524]  xfs_inactive+0x17d/0x270 [xfs]
> > [19767.079141]  xfs_fs_destroy_inode+0xb3/0x350 [xfs]
> > [19767.079964]  destroy_inode+0x3b/0x60
> > [19767.080679]  evict+0x132/0x190
> > [19767.081301]  iput+0x243/0x320
> > [19767.081961]  xfs_inode_ag_walk+0x2a8/0x690 [xfs]
> > [19767.082757]  ? xfs_inode_ag_walk+0x92/0x690 [xfs]
> > [19767.083545]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
> > [19767.084279]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
> > [19767.084753]  ? xfs_perag_get+0xa8/0x2b0 [xfs]
> > [19767.085234]  xfs_inode_ag_iterator_flags+0x69/0xa0 [xfs]
> > [19767.086251]  ? xfs_trans_free_dqinfo+0x30/0x30 [xfs]
> > [19767.087219]  xfs_qm_dqrele_all_inodes+0x36/0x60 [xfs]
> > [19767.088431]  xfs_qm_scall_quotaoff+0x100/0x6c0 [xfs]
> > [19767.089285]  xfs_quota_disable+0x3d/0x50 [xfs]
> > [19767.090138]  SyS_quotactl+0x3ff/0x820
> > [19767.090853]  ? SYSC_newstat+0x3b/0x50
> > [19767.091501]  ? trace_hardirqs_on_caller+0x129/0x1b0
> > [19767.092398]  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > [19767.093238]  entry_SYSCALL_64_fastpath+0x1f/0xc2
> > [19767.094150] RIP: 0033:0x7f43530a70ca
> > [19767.094851] RSP: 002b:00007ffe7570fbe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3
> > [19767.096078] RAX: ffffffffffffffda RBX: 00007ffe7570fde8 RCX: 00007f43530a70ca
> > [19767.097242] RDX: 0000000000000000 RSI: 00000000012d26f0 RDI: 0000000000580202
> > [19767.098402] RBP: 0000000000000005 R08: 00007ffe7570fbfc R09: 0000000000005802
> > [19767.099645] R10: 00007ffe7570fbfc R11: 0000000000000206 R12: 0000000000401e50
> > [19767.100841] R13: 00007ffe7570fde0 R14: 0000000000000000 R15: 0000000000000000
> > 
> > --D
> > 
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_qm_syscalls.c | 42 ++++++++++++++++++++++++------------------
> > >  1 file changed, 24 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > > index 475a388..dbb6802 100644
> > > --- a/fs/xfs/xfs_qm_syscalls.c
> > > +++ b/fs/xfs/xfs_qm_syscalls.c
> > > @@ -35,9 +35,11 @@
> > >  #include "xfs_trace.h"
> > >  #include "xfs_icache.h"
> > >  
> > > -STATIC int	xfs_qm_log_quotaoff(xfs_mount_t *, xfs_qoff_logitem_t **, uint);
> > > -STATIC int	xfs_qm_log_quotaoff_end(xfs_mount_t *, xfs_qoff_logitem_t *,
> > > -					uint);
> > > +STATIC int	xfs_qm_log_quotaoff(struct xfs_mount *, struct xfs_trans **,
> > > +				    uint);
> > > +STATIC int	xfs_qm_log_quotaoff_end(struct xfs_mount *,
> > > +					struct xfs_qoff_logitem *,
> > > +					struct xfs_trans **, uint);
> > >  
> > >  /*
> > >   * Turn off quota accounting and/or enforcement for all udquots and/or
> > > @@ -56,7 +58,7 @@ xfs_qm_scall_quotaoff(
> > >  	uint			dqtype;
> > >  	int			error;
> > >  	uint			inactivate_flags;
> > > -	xfs_qoff_logitem_t	*qoffstart;
> > > +	struct xfs_trans	*qend_tp;
> > >  
> > >  	/*
> > >  	 * No file system can have quotas enabled on disk but not in core.
> > > @@ -128,7 +130,7 @@ xfs_qm_scall_quotaoff(
> > >  	 * and synchronously. If we fail to write, we should abort the
> > >  	 * operation as it cannot be recovered safely if we crash.
> > >  	 */
> > > -	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> > > +	error = xfs_qm_log_quotaoff(mp, &qend_tp, flags);
> > >  	if (error)
> > >  		goto out_unlock;
> > >  
> > > @@ -181,7 +183,7 @@ xfs_qm_scall_quotaoff(
> > >  	 * So, we have QUOTAOFF start and end logitems; the start
> > >  	 * logitem won't get overwritten until the end logitem appears...
> > >  	 */
> > > -	error = xfs_qm_log_quotaoff_end(mp, qoffstart, flags);
> > > +	error = xfs_trans_commit(qend_tp);
> > >  	if (error) {
> > >  		/* We're screwed now. Shutdown is the only option. */
> > >  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > @@ -556,13 +558,14 @@ xfs_qm_scall_setqlim(
> > >  
> > >  STATIC int
> > >  xfs_qm_log_quotaoff_end(
> > > -	xfs_mount_t		*mp,
> > > -	xfs_qoff_logitem_t	*startqoff,
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_qoff_logitem	*startqoff,
> > > +	struct xfs_trans	**tpp,
> > >  	uint			flags)
> > >  {
> > > -	xfs_trans_t		*tp;
> > > +	struct xfs_trans	*tp;
> > >  	int			error;
> > > -	xfs_qoff_logitem_t	*qoffi;
> > > +	struct xfs_qoff_logitem	*qoffi;
> > >  
> > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
> > >  	if (error)
> > > @@ -578,21 +581,22 @@ xfs_qm_log_quotaoff_end(
> > >  	 * We don't care about quotoff's performance.
> > >  	 */
> > >  	xfs_trans_set_sync(tp);
> > > -	return xfs_trans_commit(tp);
> > > +	*tpp = tp;
> > > +	return 0;
> > >  }
> > >  
> > >  
> > >  STATIC int
> > >  xfs_qm_log_quotaoff(
> > > -	xfs_mount_t	       *mp,
> > > -	xfs_qoff_logitem_t     **qoffstartp,
> > > -	uint		       flags)
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_trans	**end_tp,
> > > +	uint			flags)
> > >  {
> > > -	xfs_trans_t	       *tp;
> > > +	struct xfs_trans	*tp;
> > >  	int			error;
> > > -	xfs_qoff_logitem_t     *qoffi;
> > > +	struct xfs_qoff_logitem	*qoffi;
> > >  
> > > -	*qoffstartp = NULL;
> > > +	*end_tp = NULL;
> > >  
> > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > >  	if (error)
> > > @@ -617,7 +621,9 @@ xfs_qm_log_quotaoff(
> > >  	if (error)
> > >  		goto out;
> > >  
> > > -	*qoffstartp = qoffi;
> > > +	error = xfs_qm_log_quotaoff_end(mp, qoffi, end_tp, flags);
> > > +	if (error)
> > > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > >  out:
> > >  	return error;
> > >  }
> > > -- 
> > > 2.7.4
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 475a388..dbb6802 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -35,9 +35,11 @@ 
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 
-STATIC int	xfs_qm_log_quotaoff(xfs_mount_t *, xfs_qoff_logitem_t **, uint);
-STATIC int	xfs_qm_log_quotaoff_end(xfs_mount_t *, xfs_qoff_logitem_t *,
-					uint);
+STATIC int	xfs_qm_log_quotaoff(struct xfs_mount *, struct xfs_trans **,
+				    uint);
+STATIC int	xfs_qm_log_quotaoff_end(struct xfs_mount *,
+					struct xfs_qoff_logitem *,
+					struct xfs_trans **, uint);
 
 /*
  * Turn off quota accounting and/or enforcement for all udquots and/or
@@ -56,7 +58,7 @@  xfs_qm_scall_quotaoff(
 	uint			dqtype;
 	int			error;
 	uint			inactivate_flags;
-	xfs_qoff_logitem_t	*qoffstart;
+	struct xfs_trans	*qend_tp;
 
 	/*
 	 * No file system can have quotas enabled on disk but not in core.
@@ -128,7 +130,7 @@  xfs_qm_scall_quotaoff(
 	 * and synchronously. If we fail to write, we should abort the
 	 * operation as it cannot be recovered safely if we crash.
 	 */
-	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
+	error = xfs_qm_log_quotaoff(mp, &qend_tp, flags);
 	if (error)
 		goto out_unlock;
 
@@ -181,7 +183,7 @@  xfs_qm_scall_quotaoff(
 	 * So, we have QUOTAOFF start and end logitems; the start
 	 * logitem won't get overwritten until the end logitem appears...
 	 */
-	error = xfs_qm_log_quotaoff_end(mp, qoffstart, flags);
+	error = xfs_trans_commit(qend_tp);
 	if (error) {
 		/* We're screwed now. Shutdown is the only option. */
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
@@ -556,13 +558,14 @@  xfs_qm_scall_setqlim(
 
 STATIC int
 xfs_qm_log_quotaoff_end(
-	xfs_mount_t		*mp,
-	xfs_qoff_logitem_t	*startqoff,
+	struct xfs_mount	*mp,
+	struct xfs_qoff_logitem	*startqoff,
+	struct xfs_trans	**tpp,
 	uint			flags)
 {
-	xfs_trans_t		*tp;
+	struct xfs_trans	*tp;
 	int			error;
-	xfs_qoff_logitem_t	*qoffi;
+	struct xfs_qoff_logitem	*qoffi;
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_equotaoff, 0, 0, 0, &tp);
 	if (error)
@@ -578,21 +581,22 @@  xfs_qm_log_quotaoff_end(
 	 * We don't care about quotoff's performance.
 	 */
 	xfs_trans_set_sync(tp);
-	return xfs_trans_commit(tp);
+	*tpp = tp;
+	return 0;
 }
 
 
 STATIC int
 xfs_qm_log_quotaoff(
-	xfs_mount_t	       *mp,
-	xfs_qoff_logitem_t     **qoffstartp,
-	uint		       flags)
+	struct xfs_mount	*mp,
+	struct xfs_trans	**end_tp,
+	uint			flags)
 {
-	xfs_trans_t	       *tp;
+	struct xfs_trans	*tp;
 	int			error;
-	xfs_qoff_logitem_t     *qoffi;
+	struct xfs_qoff_logitem	*qoffi;
 
-	*qoffstartp = NULL;
+	*end_tp = NULL;
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
 	if (error)
@@ -617,7 +621,9 @@  xfs_qm_log_quotaoff(
 	if (error)
 		goto out;
 
-	*qoffstartp = qoffi;
+	error = xfs_qm_log_quotaoff_end(mp, qoffi, end_tp, flags);
+	if (error)
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 out:
 	return error;
 }