[v2] xfs: use ordered buffers to initialize dquot buffers during quotacheck
diff mbox series

Message ID 20200514165658.GC6714@magnolia
State Accepted
Headers show
Series
  • [v2] xfs: use ordered buffers to initialize dquot buffers during quotacheck
Related show

Commit Message

Darrick J. Wong May 14, 2020, 4:56 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

While QAing the new xfs_repair quotacheck code, I uncovered a quota
corruption bug resulting from a bad interaction between dquot buffer
initialization and quotacheck.  The bug can be reproduced with the
following sequence:

# mkfs.xfs -f /dev/sdf
# mount /dev/sdf /opt -o usrquota
# su nobody -s /bin/bash -c 'touch /opt/barf'
# sync
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
                        Inodes
User ID      Used   Soft   Hard Warn/Grace
---------- ---------------------------------
root            3      0      0  00 [------]
nobody          1      0      0  00 [------]

# xfs_io -x -c 'shutdown' /opt
# umount /opt
# mount /dev/sdf /opt -o usrquota
# touch /opt/man2
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
                        Inodes
User ID      Used   Soft   Hard Warn/Grace
---------- ---------------------------------
root            1      0      0  00 [------]
nobody          1      0      0  00 [------]

# umount /opt

Notice how the initial quotacheck set the root dquot icount to 3
(rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
xfs_quota reports that the root dquot has only 1 icount.  We haven't
deleted anything from the filesystem, which means that quota is now
under-counting.  This behavior is not limited to icount or the root
dquot, but this is the shortest reproducer.

I traced the cause of this discrepancy to the way that we handle ondisk
dquot updates during quotacheck vs. regular fs activity.  Normally, when
we allocate a disk block for a dquot, we log the buffer as a regular
(dquot) buffer.  Subsequent updates to the dquots backed by that block
are done via separate dquot log item updates, which means that they
depend on the logged buffer update being written to disk before the
dquot items.  Because individual dquots have their own LSN fields, that
initial dquot buffer must always be recovered.

However, the story changes for quotacheck, which can cause dquot block
allocations but persists the final dquot counter values via a delwri
list.  Because recovery doesn't gate dquot buffer replay on an LSN, this
means that the initial dquot buffer can be replayed over the (newer)
contents that were delwritten at the end of quotacheck.  In effect, this
re-initializes the dquot counters after they've been updated.  If the
log does not contain any other dquot items to recover, the obsolete
dquot contents will not be corrected by log recovery.

Because quotacheck uses a transaction to log the setting of the CHKD
flags in the superblock, we skip quotacheck during the second mount
call, which allows the incorrect icount to remain.

Fix this by changing the ondisk dquot initialization function to use
ordered buffers to write out fresh dquot blocks if it detects that we're
running quotacheck.  If the system goes down before quotacheck can
complete, the CHKD flags will not be set in the superblock and the next
mount will run quotacheck again, which can fix uninitialized dquot
buffers.  This requires amending the defer code to maintaine ordered
buffer state across defer rolls for the sake of the dquot allocation
code.

For regular operations we preserve the current behavior since the dquot
items require properly initialized ondisk dquot records.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: rework the code comment explaining all this
---
 fs/xfs/libxfs/xfs_defer.c |   10 +++++++
 fs/xfs/xfs_dquot.c        |   62 ++++++++++++++++++++++++++++++++++++---------
 2 files changed, 58 insertions(+), 14 deletions(-)

Comments

Brian Foster May 18, 2020, 1:16 p.m. UTC | #1
On Thu, May 14, 2020 at 09:56:58AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
...
> 
> Fix this by changing the ondisk dquot initialization function to use
> ordered buffers to write out fresh dquot blocks if it detects that we're
> running quotacheck.  If the system goes down before quotacheck can
> complete, the CHKD flags will not be set in the superblock and the next
> mount will run quotacheck again, which can fix uninitialized dquot
> buffers.  This requires amending the defer code to maintaine ordered
> buffer state across defer rolls for the sake of the dquot allocation
> code.
> 
> For regular operations we preserve the current behavior since the dquot
> items require properly initialized ondisk dquot records.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: rework the code comment explaining all this
> ---
>  fs/xfs/libxfs/xfs_defer.c |   10 +++++++
>  fs/xfs/xfs_dquot.c        |   62 ++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 58 insertions(+), 14 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index 52e0f7245afc..f60a8967f9d5 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
...
> @@ -238,11 +240,45 @@ xfs_qm_init_dquot_blk(
...
> +
> +	/*
> +	 * When quotacheck runs, we use delayed writes to update all the dquots
> +	 * on disk in an efficient manner instead of logging the individual
> +	 * dquot changes as they are made.
> +	 *
> +	 * Hence if we log the buffer that we allocate here, then crash
> +	 * post-quotacheck while the logged initialisation is still in the
> +	 * active region of the log, we can lose the information quotacheck
> +	 * wrote directly to the buffer. That is, log recovery will replay the
> +	 * dquot buffer initialisation over the top of whatever information
> +	 * quotacheck had written to the buffer.
> +	 *
> +	 * To avoid this problem, dquot allocation during quotacheck needs to
> +	 * avoid logging the initialised buffer, but we still need to have
> +	 * writeback of the buffer pin the tail of the log so that it is
> +	 * initialised on disk before we remove the allocation transaction from
> +	 * the active region of the log. Marking the buffer as ordered instead
> +	 * of logging it provides this behaviour.
> +	 *
> +	 * If we crash before quotacheck completes, a subsequent quotacheck run
> +	 * will re-allocate and re-initialize the dquot records as needed.
> +	 */

I took a stab at condensing the comment a bit, FWIW (diff below). LGTM
either way. Thanks for the update.

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +	if (!(mp->m_qflags & qflag))
> +		xfs_trans_ordered_buf(tp, bp);
> +	else
> +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
>  }
>  
>  /*
> 

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index f60a8967f9d5..55b95d45303b 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -254,26 +254,20 @@ xfs_qm_init_dquot_blk(
 	xfs_trans_dquot_buf(tp, bp, blftype);
 
 	/*
-	 * When quotacheck runs, we use delayed writes to update all the dquots
-	 * on disk in an efficient manner instead of logging the individual
-	 * dquot changes as they are made.
+	 * quotacheck uses delayed writes to update all the dquots on disk in an
+	 * efficient manner instead of logging the individual dquot changes as
+	 * they are made. However if we log the buffer allocated here and crash
+	 * after quotacheck while the logged initialisation is still in the
+	 * active region of the log, log recovery can replay the dquot buffer
+	 * initialisation over the top of the checked dquots and corrupt quota
+	 * accounting.
 	 *
-	 * Hence if we log the buffer that we allocate here, then crash
-	 * post-quotacheck while the logged initialisation is still in the
-	 * active region of the log, we can lose the information quotacheck
-	 * wrote directly to the buffer. That is, log recovery will replay the
-	 * dquot buffer initialisation over the top of whatever information
-	 * quotacheck had written to the buffer.
-	 *
-	 * To avoid this problem, dquot allocation during quotacheck needs to
-	 * avoid logging the initialised buffer, but we still need to have
-	 * writeback of the buffer pin the tail of the log so that it is
-	 * initialised on disk before we remove the allocation transaction from
-	 * the active region of the log. Marking the buffer as ordered instead
-	 * of logging it provides this behaviour.
-	 *
-	 * If we crash before quotacheck completes, a subsequent quotacheck run
-	 * will re-allocate and re-initialize the dquot records as needed.
+	 * To avoid this problem, quotacheck cannot log the initialised buffer.
+	 * We must still dirty the buffer and write it back before the
+	 * allocation transaction clears the log. Therefore, mark the buffer as
+	 * ordered instead of logging it directly. This is safe for quotacheck
+	 * because it detects and repairs allocated but initialized dquot blocks
+	 * in the quota inodes.
 	 */
 	if (!(mp->m_qflags & qflag))
 		xfs_trans_ordered_buf(tp, bp);
Darrick J. Wong May 18, 2020, 4:52 p.m. UTC | #2
On Mon, May 18, 2020 at 09:16:25AM -0400, Brian Foster wrote:
> On Thu, May 14, 2020 at 09:56:58AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> ...
> > 
> > Fix this by changing the ondisk dquot initialization function to use
> > ordered buffers to write out fresh dquot blocks if it detects that we're
> > running quotacheck.  If the system goes down before quotacheck can
> > complete, the CHKD flags will not be set in the superblock and the next
> > mount will run quotacheck again, which can fix uninitialized dquot
> > buffers.  This requires amending the defer code to maintaine ordered
> > buffer state across defer rolls for the sake of the dquot allocation
> > code.
> > 
> > For regular operations we preserve the current behavior since the dquot
> > items require properly initialized ondisk dquot records.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: rework the code comment explaining all this
> > ---
> >  fs/xfs/libxfs/xfs_defer.c |   10 +++++++
> >  fs/xfs/xfs_dquot.c        |   62 ++++++++++++++++++++++++++++++++++++---------
> >  2 files changed, 58 insertions(+), 14 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > index 52e0f7245afc..f60a8967f9d5 100644
> > --- a/fs/xfs/xfs_dquot.c
> > +++ b/fs/xfs/xfs_dquot.c
> ...
> > @@ -238,11 +240,45 @@ xfs_qm_init_dquot_blk(
> ...
> > +
> > +	/*
> > +	 * When quotacheck runs, we use delayed writes to update all the dquots
> > +	 * on disk in an efficient manner instead of logging the individual
> > +	 * dquot changes as they are made.
> > +	 *
> > +	 * Hence if we log the buffer that we allocate here, then crash
> > +	 * post-quotacheck while the logged initialisation is still in the
> > +	 * active region of the log, we can lose the information quotacheck
> > +	 * wrote directly to the buffer. That is, log recovery will replay the
> > +	 * dquot buffer initialisation over the top of whatever information
> > +	 * quotacheck had written to the buffer.
> > +	 *
> > +	 * To avoid this problem, dquot allocation during quotacheck needs to
> > +	 * avoid logging the initialised buffer, but we still need to have
> > +	 * writeback of the buffer pin the tail of the log so that it is
> > +	 * initialised on disk before we remove the allocation transaction from
> > +	 * the active region of the log. Marking the buffer as ordered instead
> > +	 * of logging it provides this behaviour.
> > +	 *
> > +	 * If we crash before quotacheck completes, a subsequent quotacheck run
> > +	 * will re-allocate and re-initialize the dquot records as needed.
> > +	 */
> 
> I took a stab at condensing the comment a bit, FWIW (diff below). LGTM
> either way. Thanks for the update.
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +	if (!(mp->m_qflags & qflag))
> > +		xfs_trans_ordered_buf(tp, bp);
> > +	else
> > +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> >  }
> >  
> >  /*
> > 
> 
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index f60a8967f9d5..55b95d45303b 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -254,26 +254,20 @@ xfs_qm_init_dquot_blk(
>  	xfs_trans_dquot_buf(tp, bp, blftype);
>  
>  	/*
> -	 * When quotacheck runs, we use delayed writes to update all the dquots
> -	 * on disk in an efficient manner instead of logging the individual
> -	 * dquot changes as they are made.
> +	 * quotacheck uses delayed writes to update all the dquots on disk in an
> +	 * efficient manner instead of logging the individual dquot changes as
> +	 * they are made. However if we log the buffer allocated here and crash
> +	 * after quotacheck while the logged initialisation is still in the
> +	 * active region of the log, log recovery can replay the dquot buffer
> +	 * initialisation over the top of the checked dquots and corrupt quota
> +	 * accounting.
>  	 *
> -	 * Hence if we log the buffer that we allocate here, then crash
> -	 * post-quotacheck while the logged initialisation is still in the
> -	 * active region of the log, we can lose the information quotacheck
> -	 * wrote directly to the buffer. That is, log recovery will replay the
> -	 * dquot buffer initialisation over the top of whatever information
> -	 * quotacheck had written to the buffer.
> -	 *
> -	 * To avoid this problem, dquot allocation during quotacheck needs to
> -	 * avoid logging the initialised buffer, but we still need to have
> -	 * writeback of the buffer pin the tail of the log so that it is
> -	 * initialised on disk before we remove the allocation transaction from
> -	 * the active region of the log. Marking the buffer as ordered instead
> -	 * of logging it provides this behaviour.
> -	 *
> -	 * If we crash before quotacheck completes, a subsequent quotacheck run
> -	 * will re-allocate and re-initialize the dquot records as needed.
> +	 * To avoid this problem, quotacheck cannot log the initialised buffer.
> +	 * We must still dirty the buffer and write it back before the
> +	 * allocation transaction clears the log. Therefore, mark the buffer as
> +	 * ordered instead of logging it directly. This is safe for quotacheck
> +	 * because it detects and repairs allocated but initialized dquot blocks
> +	 * in the quota inodes.

I think I like your revised comment better. :)

--D

>  	 */
>  	if (!(mp->m_qflags & qflag))
>  		xfs_trans_ordered_buf(tp, bp);
>
Christoph Hellwig May 18, 2020, 4:58 p.m. UTC | #3
I would have split the addition of support for order buffers to the
defer mechanism into a separate patch.

Otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Patch
diff mbox series

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 1172fbf072d8..d8f586256add 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -240,10 +240,13 @@  xfs_defer_trans_roll(
 	struct xfs_log_item		*lip;
 	struct xfs_buf			*bplist[XFS_DEFER_OPS_NR_BUFS];
 	struct xfs_inode		*iplist[XFS_DEFER_OPS_NR_INODES];
+	unsigned int			ordered = 0; /* bitmap */
 	int				bpcount = 0, ipcount = 0;
 	int				i;
 	int				error;
 
+	BUILD_BUG_ON(NBBY * sizeof(ordered) < XFS_DEFER_OPS_NR_BUFS);
+
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 		switch (lip->li_type) {
 		case XFS_LI_BUF:
@@ -254,7 +257,10 @@  xfs_defer_trans_roll(
 					ASSERT(0);
 					return -EFSCORRUPTED;
 				}
-				xfs_trans_dirty_buf(tp, bli->bli_buf);
+				if (bli->bli_flags & XFS_BLI_ORDERED)
+					ordered |= (1U << bpcount);
+				else
+					xfs_trans_dirty_buf(tp, bli->bli_buf);
 				bplist[bpcount++] = bli->bli_buf;
 			}
 			break;
@@ -295,6 +301,8 @@  xfs_defer_trans_roll(
 	/* Rejoin the buffers and dirty them so the log moves forward. */
 	for (i = 0; i < bpcount; i++) {
 		xfs_trans_bjoin(tp, bplist[i]);
+		if (ordered & (1U << i))
+			xfs_trans_ordered_buf(tp, bplist[i]);
 		xfs_trans_bhold(tp, bplist[i]);
 	}
 
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 52e0f7245afc..f60a8967f9d5 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -205,16 +205,18 @@  xfs_qm_adjust_dqtimers(
  */
 STATIC void
 xfs_qm_init_dquot_blk(
-	xfs_trans_t	*tp,
-	xfs_mount_t	*mp,
-	xfs_dqid_t	id,
-	uint		type,
-	xfs_buf_t	*bp)
+	struct xfs_trans	*tp,
+	struct xfs_mount	*mp,
+	xfs_dqid_t		id,
+	uint			type,
+	struct xfs_buf		*bp)
 {
 	struct xfs_quotainfo	*q = mp->m_quotainfo;
-	xfs_dqblk_t	*d;
-	xfs_dqid_t	curid;
-	int		i;
+	struct xfs_dqblk	*d;
+	xfs_dqid_t		curid;
+	unsigned int		qflag;
+	unsigned int		blftype;
+	int			i;
 
 	ASSERT(tp);
 	ASSERT(xfs_buf_islocked(bp));
@@ -238,11 +240,45 @@  xfs_qm_init_dquot_blk(
 		}
 	}
 
-	xfs_trans_dquot_buf(tp, bp,
-			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
-			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
-			     XFS_BLF_GDQUOT_BUF)));
-	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
+	if (type & XFS_DQ_USER) {
+		qflag = XFS_UQUOTA_CHKD;
+		blftype = XFS_BLF_UDQUOT_BUF;
+	} else if (type & XFS_DQ_PROJ) {
+		qflag = XFS_PQUOTA_CHKD;
+		blftype = XFS_BLF_PDQUOT_BUF;
+	} else {
+		qflag = XFS_GQUOTA_CHKD;
+		blftype = XFS_BLF_GDQUOT_BUF;
+	}
+
+	xfs_trans_dquot_buf(tp, bp, blftype);
+
+	/*
+	 * When quotacheck runs, we use delayed writes to update all the dquots
+	 * on disk in an efficient manner instead of logging the individual
+	 * dquot changes as they are made.
+	 *
+	 * Hence if we log the buffer that we allocate here, then crash
+	 * post-quotacheck while the logged initialisation is still in the
+	 * active region of the log, we can lose the information quotacheck
+	 * wrote directly to the buffer. That is, log recovery will replay the
+	 * dquot buffer initialisation over the top of whatever information
+	 * quotacheck had written to the buffer.
+	 *
+	 * To avoid this problem, dquot allocation during quotacheck needs to
+	 * avoid logging the initialised buffer, but we still need to have
+	 * writeback of the buffer pin the tail of the log so that it is
+	 * initialised on disk before we remove the allocation transaction from
+	 * the active region of the log. Marking the buffer as ordered instead
+	 * of logging it provides this behaviour.
+	 *
+	 * If we crash before quotacheck completes, a subsequent quotacheck run
+	 * will re-allocate and re-initialize the dquot records as needed.
+	 */
+	if (!(mp->m_qflags & qflag))
+		xfs_trans_ordered_buf(tp, bp);
+	else
+		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
 }
 
 /*