xfs: use ordered buffers to initialize dquot buffers during quotacheck

Message ID 20200512210033.GL6714@magnolia
State Superseded
Series
  • xfs: use ordered buffers to initialize dquot buffers during quotacheck

Commit Message

Darrick J. Wong May 12, 2020, 9 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

While QAing the new xfs_repair quotacheck code, I uncovered a quota
corruption bug resulting from a bad interaction between dquot buffer
initialization and quotacheck.  The bug can be reproduced with the
following sequence:

# mkfs.xfs -f /dev/sdf
# mount /dev/sdf /opt -o usrquota
# su nobody -s /bin/bash -c 'touch /opt/barf'
# sync
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
                        Inodes
User ID      Used   Soft   Hard Warn/Grace
---------- ---------------------------------
root            3      0      0  00 [------]
nobody          1      0      0  00 [------]

# xfs_io -x -c 'shutdown' /opt
# umount /opt
# mount /dev/sdf /opt -o usrquota
# touch /opt/man2
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
                        Inodes
User ID      Used   Soft   Hard Warn/Grace
---------- ---------------------------------
root            1      0      0  00 [------]
nobody          1      0      0  00 [------]

# umount /opt

Notice how the initial quotacheck set the root dquot icount to 3
(rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
xfs_quota reports that the root dquot has only 1 icount.  We haven't
deleted anything from the filesystem, which means that quota is now
under-counting.  This behavior is not limited to icount or the root
dquot, but this is the shortest reproducer.

I traced the cause of this discrepancy to the way that we handle ondisk
dquot updates during quotacheck vs. regular fs activity.  Normally, when
we allocate a disk block for a dquot, we log the buffer as a regular
(dquot) buffer.  Subsequent updates to the dquots backed by that block
are done via separate dquot log item updates, which means that they
depend on the logged buffer update being written to disk before the
dquot items.  Because individual dquots have their own LSN fields, that
initial dquot buffer must always be recovered.

However, the story changes for quotacheck, which can cause dquot block
allocations but persists the final dquot counter values via a delwri
list.  Because recovery doesn't gate dquot buffer replay on an LSN, this
means that the initial dquot buffer can be replayed over the (newer)
contents that were delwritten at the end of quotacheck.  In effect, this
re-initializes the dquot counters after they've been updated.  If the
log does not contain any other dquot items to recover, the obsolete
dquot contents will not be corrected by log recovery.

Because quotacheck uses a transaction to log the setting of the CHKD
flags in the superblock, we skip quotacheck during the second mount
call, which allows the incorrect icount to remain.

Fix this by changing the ondisk dquot initialization function to use
ordered buffers to write out fresh dquot blocks if it detects that we're
running quotacheck.  If the system goes down before quotacheck can
complete, the CHKD flags will not be set in the superblock and the next
mount will run quotacheck again, which can fix uninitialized dquot
buffers.  This requires amending the defer code to maintain ordered
buffer state across defer rolls for the sake of the dquot allocation
code.

For regular operations we preserve the current behavior since the dquot
items require properly initialized ondisk dquot records.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c |   10 ++++++++-
 fs/xfs/xfs_dquot.c        |   51 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 47 insertions(+), 14 deletions(-)

Comments

Dave Chinner May 13, 2020, 12:06 a.m. UTC | #1
On Tue, May 12, 2020 at 02:00:33PM -0700, Darrick J. Wong wrote:
> @@ -277,11 +279,34 @@ xfs_qm_init_dquot_blk(
>  		}
>  	}
>  
> -	xfs_trans_dquot_buf(tp, bp,
> -			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
> -			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
> -			     XFS_BLF_GDQUOT_BUF)));
> -	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> +	if (type & XFS_DQ_USER) {
> +		qflag = XFS_UQUOTA_CHKD;
> +		blftype = XFS_BLF_UDQUOT_BUF;
> +	} else if (type & XFS_DQ_PROJ) {
> +		qflag = XFS_PQUOTA_CHKD;
> +		blftype = XFS_BLF_PDQUOT_BUF;
> +	} else {
> +		qflag = XFS_GQUOTA_CHKD;
> +		blftype = XFS_BLF_GDQUOT_BUF;
> +	}
> +
> +	xfs_trans_dquot_buf(tp, bp, blftype);
> +
> +	/*
> +	 * If the CHKD flag isn't set, we're running quotacheck and need to use
> +	 * ordered buffers so that the logged initialization buffer does not
> +	 * get replayed over the delwritten quotacheck buffer.  If we crash
> +	 * before the end of quotacheck, the CHKD flags will not be set in the
> +	 * superblock and we'll re-run quotacheck at next mount.
> +	 *
> +	 * Outside of quotacheck, dquot updates are logged via dquot items and
> +	 * we must use the regular buffer logging mechanisms to ensure that the
> +	 * initial buffer state is recovered before dquot items.
> +	 */
> +	if (mp->m_qflags & qflag)
> +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> +	else
> +		xfs_trans_ordered_buf(tp, bp);
>  }

That comment is ... difficult to understand. It conflates what we
are currently doing with what might happen in future if we did
something differently at the current time. IIUC, what you actually
mean is this:

	/*
	 * When quotacheck runs, we use delayed writes to update all the dquots
	 * on disk in an efficient manner instead of logging the individual
	 * dquot changes as they are made.
	 *
	 * Hence if we log the buffer that we allocate here, then crash
	 * post-quotacheck while the logged initialisation is still in the
	 * active region of the log, we can lose the information quotacheck
	 * wrote directly to the buffer. That is, log recovery will replay the
	 * dquot buffer initialisation over the top of whatever information
	 * quotacheck had written to the buffer.
	 *
	 * To avoid this problem, dquot allocation during quotacheck needs to
	 * avoid logging the initialised buffer, but we still need to have
	 * writeback of the buffer pin the tail of the log so that it is
	 * initialised on disk before we remove the allocation transaction from
	 * the active region of the log. Marking the buffer as ordered instead
	 * of logging it provides this behaviour.
	 */

Also, does this mean quotacheck completion should force the log and push the AIL
to ensure that all the allocations are completed and removed from the log before
marking the quota as CHKD?

Cheers,

Dave.
Darrick J. Wong May 13, 2020, 12:25 a.m. UTC | #2
On Wed, May 13, 2020 at 10:06:28AM +1000, Dave Chinner wrote:
> On Tue, May 12, 2020 at 02:00:33PM -0700, Darrick J. Wong wrote:
> > @@ -277,11 +279,34 @@ xfs_qm_init_dquot_blk(
> >  		}
> >  	}
> >  
> > -	xfs_trans_dquot_buf(tp, bp,
> > -			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
> > -			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
> > -			     XFS_BLF_GDQUOT_BUF)));
> > -	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > +	if (type & XFS_DQ_USER) {
> > +		qflag = XFS_UQUOTA_CHKD;
> > +		blftype = XFS_BLF_UDQUOT_BUF;
> > +	} else if (type & XFS_DQ_PROJ) {
> > +		qflag = XFS_PQUOTA_CHKD;
> > +		blftype = XFS_BLF_PDQUOT_BUF;
> > +	} else {
> > +		qflag = XFS_GQUOTA_CHKD;
> > +		blftype = XFS_BLF_GDQUOT_BUF;
> > +	}
> > +
> > +	xfs_trans_dquot_buf(tp, bp, blftype);
> > +
> > +	/*
> > +	 * If the CHKD flag isn't set, we're running quotacheck and need to use
> > +	 * ordered buffers so that the logged initialization buffer does not
> > +	 * get replayed over the delwritten quotacheck buffer.  If we crash
> > +	 * before the end of quotacheck, the CHKD flags will not be set in the
> > +	 * superblock and we'll re-run quotacheck at next mount.
> > +	 *
> > +	 * Outside of quotacheck, dquot updates are logged via dquot items and
> > +	 * we must use the regular buffer logging mechanisms to ensure that the
> > +	 * initial buffer state is recovered before dquot items.
> > +	 */
> > +	if (mp->m_qflags & qflag)
> > +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > +	else
> > +		xfs_trans_ordered_buf(tp, bp);
> >  }
> 
> That comment is ... difficult to understand. It conflates what we
> are currently doing with what might happen in future if we did
> something differently at the current time. IIUC, what you actually
> mean is this:
> 
> 	/*
> 	 * When quotacheck runs, we use delayed writes to update all the dquots
> 	 * on disk in an efficient manner instead of logging the individual
> 	 * dquot changes as they are made.
> 	 *
> 	 * Hence if we log the buffer that we allocate here, then crash
> 	 * post-quotacheck while the logged initialisation is still in the
> 	 * active region of the log, we can lose the information quotacheck
> 	 * wrote directly to the buffer. That is, log recovery will replay the
> 	 * dquot buffer initialisation over the top of whatever information
> 	 * quotacheck had written to the buffer.
> 	 *
> 	 * To avoid this problem, dquot allocation during quotacheck needs to
> 	 * avoid logging the initialised buffer, but we still need to have
> 	 * writeback of the buffer pin the tail of the log so that it is
> 	 * initialised on disk before we remove the allocation transaction from
> 	 * the active region of the log. Marking the buffer as ordered instead
> 	 * of logging it provides this behaviour.
> 	 */

That's certainly a /lot/ clearer. :)

> Also, does this mean quotacheck completion should force the log and push the AIL
> to ensure that all the allocations are completed and removed from the log before
> marking the quota as CHKD?

I need to think about this more, but I think the answer is that we don't
need to force/push the log because the delwri means we've persisted the
new dquot contents before we set CHKD, and if we crash before we set
CHKD, on the next mount attempt we'll re-run quotacheck, which can
reallocate or reinitialize the ondisk dquots.

But I dunno, maybe there's some other subtlety I haven't thought of...

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner May 13, 2020, 12:38 a.m. UTC | #3
On Tue, May 12, 2020 at 05:25:48PM -0700, Darrick J. Wong wrote:
> On Wed, May 13, 2020 at 10:06:28AM +1000, Dave Chinner wrote:
> > On Tue, May 12, 2020 at 02:00:33PM -0700, Darrick J. Wong wrote:
> > > @@ -277,11 +279,34 @@ xfs_qm_init_dquot_blk(
> > >  		}
> > >  	}
> > >  
> > > -	xfs_trans_dquot_buf(tp, bp,
> > > -			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
> > > -			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
> > > -			     XFS_BLF_GDQUOT_BUF)));
> > > -	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > > +	if (type & XFS_DQ_USER) {
> > > +		qflag = XFS_UQUOTA_CHKD;
> > > +		blftype = XFS_BLF_UDQUOT_BUF;
> > > +	} else if (type & XFS_DQ_PROJ) {
> > > +		qflag = XFS_PQUOTA_CHKD;
> > > +		blftype = XFS_BLF_PDQUOT_BUF;
> > > +	} else {
> > > +		qflag = XFS_GQUOTA_CHKD;
> > > +		blftype = XFS_BLF_GDQUOT_BUF;
> > > +	}
> > > +
> > > +	xfs_trans_dquot_buf(tp, bp, blftype);
> > > +
> > > +	/*
> > > +	 * If the CHKD flag isn't set, we're running quotacheck and need to use
> > > +	 * ordered buffers so that the logged initialization buffer does not
> > > +	 * get replayed over the delwritten quotacheck buffer.  If we crash
> > > +	 * before the end of quotacheck, the CHKD flags will not be set in the
> > > +	 * superblock and we'll re-run quotacheck at next mount.
> > > +	 *
> > > +	 * Outside of quotacheck, dquot updates are logged via dquot items and
> > > +	 * we must use the regular buffer logging mechanisms to ensure that the
> > > +	 * initial buffer state is recovered before dquot items.
> > > +	 */
> > > +	if (mp->m_qflags & qflag)
> > > +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > > +	else
> > > +		xfs_trans_ordered_buf(tp, bp);
> > >  }
> > 
> > That comment is ... difficult to understand. It conflates what we
> > are currently doing with what might happen in future if we did
> > something differently at the current time. IIUC, what you actually
> > mean is this:
> > 
> > 	/*
> > 	 * When quotacheck runs, we use delayed writes to update all the dquots
> > 	 * on disk in an efficient manner instead of logging the individual
> > 	 * dquot changes as they are made.
> > 	 *
> > 	 * Hence if we log the buffer that we allocate here, then crash
> > 	 * post-quotacheck while the logged initialisation is still in the
> > 	 * active region of the log, we can lose the information quotacheck
> > 	 * wrote directly to the buffer. That is, log recovery will replay the
> > 	 * dquot buffer initialisation over the top of whatever information
> > 	 * quotacheck had written to the buffer.
> > 	 *
> > 	 * To avoid this problem, dquot allocation during quotacheck needs to
> > 	 * avoid logging the initialised buffer, but we still need to have
> > 	 * writeback of the buffer pin the tail of the log so that it is
> > 	 * initialised on disk before we remove the allocation transaction from
> > 	 * the active region of the log. Marking the buffer as ordered instead
> > 	 * of logging it provides this behaviour.
> > 	 */
> 
> That's certainly a /lot/ clearer. :)
> 
> > Also, does this mean quotacheck completion should force the log and push the AIL
> > to ensure that all the allocations are completed and removed from the log before
> > marking the quota as CHKD?
> 
> I need to think about this more, but I think the answer is that we don't
> need to force/push the log because the delwri means we've persisted the
> new dquot contents before we set CHKD, and if we crash before we set
> CHKD, on the next mount attempt we'll re-run quotacheck, which can
> reallocate or reinitialize the ondisk dquots.

*nod*

That was pretty much my thinking, but I haven't looked at the code
to determine if there was...

> But I dunno, maybe there's some other subtlety I haven't thought of...

.... some other subtlety I hadn't thought of... :)

Cheers,

Dave.
Brian Foster May 13, 2020, 1:14 p.m. UTC | #4
On Tue, May 12, 2020 at 02:00:33PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> While QAing the new xfs_repair quotacheck code, I uncovered a quota
> corruption bug resulting from a bad interaction between dquot buffer
> initialization and quotacheck.  The bug can be reproduced with the
> following sequence:
> 
> # mkfs.xfs -f /dev/sdf
> # mount /dev/sdf /opt -o usrquota
> # su nobody -s /bin/bash -c 'touch /opt/barf'
> # sync
> # xfs_quota -x -c 'report -ahi' /opt
> User quota on /opt (/dev/sdf)
>                         Inodes
> User ID      Used   Soft   Hard Warn/Grace
> ---------- ---------------------------------
> root            3      0      0  00 [------]
> nobody          1      0      0  00 [------]
> 
> # xfs_io -x -c 'shutdown' /opt
> # umount /opt
> # mount /dev/sdf /opt -o usrquota
> # touch /opt/man2
> # xfs_quota -x -c 'report -ahi' /opt
> User quota on /opt (/dev/sdf)
>                         Inodes
> User ID      Used   Soft   Hard Warn/Grace
> ---------- ---------------------------------
> root            1      0      0  00 [------]
> nobody          1      0      0  00 [------]
> 
> # umount /opt
> 
> Notice how the initial quotacheck set the root dquot icount to 3
> (rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
> xfs_quota reports that the root dquot has only 1 icount.  We haven't
> deleted anything from the filesystem, which means that quota is now
> under-counting.  This behavior is not limited to icount or the root
> dquot, but this is the shortest reproducer.
> 
> I traced the cause of this discrepancy to the way that we handle ondisk
> dquot updates during quotacheck vs. regular fs activity.  Normally, when
> we allocate a disk block for a dquot, we log the buffer as a regular
> (dquot) buffer.  Subsequent updates to the dquots backed by that block
> are done via separate dquot log item updates, which means that they
> depend on the logged buffer update being written to disk before the
> dquot items.  Because individual dquots have their own LSN fields, that
> initial dquot buffer must always be recovered.
> 
> However, the story changes for quotacheck, which can cause dquot block
> allocations but persists the final dquot counter values via a delwri
> list.  Because recovery doesn't gate dquot buffer replay on an LSN, this
> means that the initial dquot buffer can be replayed over the (newer)
> contents that were delwritten at the end of quotacheck.  In effect, this
> re-initializes the dquot counters after they've been updated.  If the
> log does not contain any other dquot items to recover, the obsolete
> dquot contents will not be corrected by log recovery.
> 
> Because quotacheck uses a transaction to log the setting of the CHKD
> flags in the superblock, we skip quotacheck during the second mount
> call, which allows the incorrect icount to remain.
> 
> Fix this by changing the ondisk dquot initialization function to use
> ordered buffers to write out fresh dquot blocks if it detects that we're
> running quotacheck.  If the system goes down before quotacheck can
> complete, the CHKD flags will not be set in the superblock and the next
> mount will run quotacheck again, which can fix uninitialized dquot
> buffers.  This requires amending the defer code to maintain ordered
> buffer state across defer rolls for the sake of the dquot allocation
> code.
> 
> For regular operations we preserve the current behavior since the dquot
> items require properly initialized ondisk dquot records.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_defer.c |   10 ++++++++-
>  fs/xfs/xfs_dquot.c        |   51 ++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 47 insertions(+), 14 deletions(-)
> 
...  
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index 7f39dd24d475..cf8b2f4de587 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
...
> @@ -277,11 +279,34 @@ xfs_qm_init_dquot_blk(
>  		}
>  	}
>  
> -	xfs_trans_dquot_buf(tp, bp,
> -			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
> -			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
> -			     XFS_BLF_GDQUOT_BUF)));
> -	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> +	if (type & XFS_DQ_USER) {
> +		qflag = XFS_UQUOTA_CHKD;
> +		blftype = XFS_BLF_UDQUOT_BUF;
> +	} else if (type & XFS_DQ_PROJ) {
> +		qflag = XFS_PQUOTA_CHKD;
> +		blftype = XFS_BLF_PDQUOT_BUF;
> +	} else {
> +		qflag = XFS_GQUOTA_CHKD;
> +		blftype = XFS_BLF_GDQUOT_BUF;
> +	}
> +
> +	xfs_trans_dquot_buf(tp, bp, blftype);
> +
> +	/*
> +	 * If the CHKD flag isn't set, we're running quotacheck and need to use
> +	 * ordered buffers so that the logged initialization buffer does not
> +	 * get replayed over the delwritten quotacheck buffer.  If we crash
> +	 * before the end of quotacheck, the CHKD flags will not be set in the
> +	 * superblock and we'll re-run quotacheck at next mount.
> +	 *
> +	 * Outside of quotacheck, dquot updates are logged via dquot items and
> +	 * we must use the regular buffer logging mechanisms to ensure that the
> +	 * initial buffer state is recovered before dquot items.
> +	 */
> +	if (mp->m_qflags & qflag)
> +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> +	else
> +		xfs_trans_ordered_buf(tp, bp);

Ok, I think I follow what's going on here. quotacheck runs and allocates
a dquot block and inits it in a transaction. Some time later quotacheck
updates dquots in said block and queues the buffer in its own delwri
list. Quotacheck completes, the buffer is written back and then the
filesystem immediately crashes. We replay the content of the alloc/init
transaction over the updated content in the block on disk. This isn't a
problem outside of quotacheck because a subsequent dquot update would
have also been logged instead of directly written back.

Therefore, the purpose of the change above is primarily to avoid logging
the init of the dquot buffer during quotacheck. That makes sense, but
what about the integrity of this particular transaction? For example,
what happens if this transaction commits to the on-disk log and we crash
before quotacheck completes and the buffer is written back? I'm assuming
recovery would replay the dquot allocation but not the initialization,
then quotacheck would run again and find the buffer in an unknown state
(perhaps failing a read verifier?). Hm?

Brian

>  }
>  
>  /*
>
Darrick J. Wong May 13, 2020, 3:26 p.m. UTC | #5
On Wed, May 13, 2020 at 09:14:15AM -0400, Brian Foster wrote:
> On Tue, May 12, 2020 at 02:00:33PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > While QAing the new xfs_repair quotacheck code, I uncovered a quota
> > corruption bug resulting from a bad interaction between dquot buffer
> > initialization and quotacheck.  The bug can be reproduced with the
> > following sequence:
> > 
> > # mkfs.xfs -f /dev/sdf
> > # mount /dev/sdf /opt -o usrquota
> > # su nobody -s /bin/bash -c 'touch /opt/barf'
> > # sync
> > # xfs_quota -x -c 'report -ahi' /opt
> > User quota on /opt (/dev/sdf)
> >                         Inodes
> > User ID      Used   Soft   Hard Warn/Grace
> > ---------- ---------------------------------
> > root            3      0      0  00 [------]
> > nobody          1      0      0  00 [------]
> > 
> > # xfs_io -x -c 'shutdown' /opt
> > # umount /opt
> > # mount /dev/sdf /opt -o usrquota
> > # touch /opt/man2
> > # xfs_quota -x -c 'report -ahi' /opt
> > User quota on /opt (/dev/sdf)
> >                         Inodes
> > User ID      Used   Soft   Hard Warn/Grace
> > ---------- ---------------------------------
> > root            1      0      0  00 [------]
> > nobody          1      0      0  00 [------]
> > 
> > # umount /opt
> > 
> > Notice how the initial quotacheck set the root dquot icount to 3
> > (rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
> > xfs_quota reports that the root dquot has only 1 icount.  We haven't
> > deleted anything from the filesystem, which means that quota is now
> > under-counting.  This behavior is not limited to icount or the root
> > dquot, but this is the shortest reproducer.
> > 
> > I traced the cause of this discrepancy to the way that we handle ondisk
> > dquot updates during quotacheck vs. regular fs activity.  Normally, when
> > we allocate a disk block for a dquot, we log the buffer as a regular
> > (dquot) buffer.  Subsequent updates to the dquots backed by that block
> > are done via separate dquot log item updates, which means that they
> > depend on the logged buffer update being written to disk before the
> > dquot items.  Because individual dquots have their own LSN fields, that
> > initial dquot buffer must always be recovered.
> > 
> > However, the story changes for quotacheck, which can cause dquot block
> > allocations but persists the final dquot counter values via a delwri
> > list.  Because recovery doesn't gate dquot buffer replay on an LSN, this
> > means that the initial dquot buffer can be replayed over the (newer)
> > contents that were delwritten at the end of quotacheck.  In effect, this
> > re-initializes the dquot counters after they've been updated.  If the
> > log does not contain any other dquot items to recover, the obsolete
> > dquot contents will not be corrected by log recovery.
> > 
> > Because quotacheck uses a transaction to log the setting of the CHKD
> > flags in the superblock, we skip quotacheck during the second mount
> > call, which allows the incorrect icount to remain.
> > 
> > Fix this by changing the ondisk dquot initialization function to use
> > ordered buffers to write out fresh dquot blocks if it detects that we're
> > running quotacheck.  If the system goes down before quotacheck can
> > complete, the CHKD flags will not be set in the superblock and the next
> > mount will run quotacheck again, which can fix uninitialized dquot
> > buffers.  This requires amending the defer code to maintain ordered
> > buffer state across defer rolls for the sake of the dquot allocation
> > code.
> > 
> > For regular operations we preserve the current behavior since the dquot
> > items require properly initialized ondisk dquot records.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_defer.c |   10 ++++++++-
> >  fs/xfs/xfs_dquot.c        |   51 ++++++++++++++++++++++++++++++++++-----------
> >  2 files changed, 47 insertions(+), 14 deletions(-)
> > 
> ...  
> > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > index 7f39dd24d475..cf8b2f4de587 100644
> > --- a/fs/xfs/xfs_dquot.c
> > +++ b/fs/xfs/xfs_dquot.c
> ...
> > @@ -277,11 +279,34 @@ xfs_qm_init_dquot_blk(
> >  		}
> >  	}
> >  
> > -	xfs_trans_dquot_buf(tp, bp,
> > -			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
> > -			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
> > -			     XFS_BLF_GDQUOT_BUF)));
> > -	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > +	if (type & XFS_DQ_USER) {
> > +		qflag = XFS_UQUOTA_CHKD;
> > +		blftype = XFS_BLF_UDQUOT_BUF;
> > +	} else if (type & XFS_DQ_PROJ) {
> > +		qflag = XFS_PQUOTA_CHKD;
> > +		blftype = XFS_BLF_PDQUOT_BUF;
> > +	} else {
> > +		qflag = XFS_GQUOTA_CHKD;
> > +		blftype = XFS_BLF_GDQUOT_BUF;
> > +	}
> > +
> > +	xfs_trans_dquot_buf(tp, bp, blftype);
> > +
> > +	/*
> > +	 * If the CHKD flag isn't set, we're running quotacheck and need to use
> > +	 * ordered buffers so that the logged initialization buffer does not
> > +	 * get replayed over the delwritten quotacheck buffer.  If we crash
> > +	 * before the end of quotacheck, the CHKD flags will not be set in the
> > +	 * superblock and we'll re-run quotacheck at next mount.
> > +	 *
> > +	 * Outside of quotacheck, dquot updates are logged via dquot items and
> > +	 * we must use the regular buffer logging mechanisms to ensure that the
> > +	 * initial buffer state is recovered before dquot items.
> > +	 */
> > +	if (mp->m_qflags & qflag)
> > +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > +	else
> > +		xfs_trans_ordered_buf(tp, bp);
> 
> Ok, I think I follow what's going on here. quotacheck runs and allocates
> a dquot block and inits it in a transaction. Some time later quotacheck
> updates dquots in said block and queues the buffer in its own delwri
> list. Quotacheck completes, the buffer is written back and then the
> filesystem immediately crashes. We replay the content of the alloc/init
> transaction over the updated content in the block on disk. This isn't a
> problem outside of quotacheck because a subsequent dquot update would
> have also been logged instead of directly written back.

Correct.

> Therefore, the purpose of the change above is primarily to avoid logging
> the init of the dquot buffer during quotacheck. That makes sense, but
> what about the integrity of this particular transaction? For example,
> what happens if this transaction commits to the on-disk log and we crash
> before quotacheck completes and the buffer is written back? I'm assuming
> recovery would replay the dquot allocation but not the initialization,
> then quotacheck would run again and find the buffer in an unknown state
> (perhaps failing a read verifier?). Hm?

Yes, the read verifier can fail in quotacheck, but quotacheck reacts to
that by re-trying the buffer read with a NULL buffer ops (in
xfs_qm_reset_dqcounts_all).  Next, it calls xfs_qm_reset_dqcounts, which
calls xfs_dqblk_repair to reinitialize the dquot records.  After that,
xfs_qm_reset_dqcounts resets even more of the dquot state (counters,
timers, flags, warns).

In short, quotacheck will fix anything it doesn't like about the dquot
buffers that are attached to the quota inodes.

--D

> Brian
> 
> >  }
> >  
> >  /*
> > 
>
Brian Foster May 13, 2020, 4:12 p.m. UTC | #6
On Wed, May 13, 2020 at 08:26:28AM -0700, Darrick J. Wong wrote:
> On Wed, May 13, 2020 at 09:14:15AM -0400, Brian Foster wrote:
> > On Tue, May 12, 2020 at 02:00:33PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > While QAing the new xfs_repair quotacheck code, I uncovered a quota
> > > corruption bug resulting from a bad interaction between dquot buffer
> > > initialization and quotacheck.  The bug can be reproduced with the
> > > following sequence:
> > > 
> > > # mkfs.xfs -f /dev/sdf
> > > # mount /dev/sdf /opt -o usrquota
> > > # su nobody -s /bin/bash -c 'touch /opt/barf'
> > > # sync
> > > # xfs_quota -x -c 'report -ahi' /opt
> > > User quota on /opt (/dev/sdf)
> > >                         Inodes
> > > User ID      Used   Soft   Hard Warn/Grace
> > > ---------- ---------------------------------
> > > root            3      0      0  00 [------]
> > > nobody          1      0      0  00 [------]
> > > 
> > > # xfs_io -x -c 'shutdown' /opt
> > > # umount /opt
> > > # mount /dev/sdf /opt -o usrquota
> > > # touch /opt/man2
> > > # xfs_quota -x -c 'report -ahi' /opt
> > > User quota on /opt (/dev/sdf)
> > >                         Inodes
> > > User ID      Used   Soft   Hard Warn/Grace
> > > ---------- ---------------------------------
> > > root            1      0      0  00 [------]
> > > nobody          1      0      0  00 [------]
> > > 
> > > # umount /opt
> > > 
> > > Notice how the initial quotacheck set the root dquot icount to 3
> > > (rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
> > > xfs_quota reports that the root dquot has only 1 icount.  We haven't
> > > deleted anything from the filesystem, which means that quota is now
> > > under-counting.  This behavior is not limited to icount or the root
> > > dquot, but this is the shortest reproducer.
> > > 
> > > I traced the cause of this discrepancy to the way that we handle ondisk
> > > dquot updates during quotacheck vs. regular fs activity.  Normally, when
> > > we allocate a disk block for a dquot, we log the buffer as a regular
> > > (dquot) buffer.  Subsequent updates to the dquots backed by that block
> > > are done via separate dquot log item updates, which means that they
> > > depend on the logged buffer update being written to disk before the
> > > dquot items.  Because individual dquots have their own LSN fields, that
> > > initial dquot buffer must always be recovered.
> > > 
> > > However, the story changes for quotacheck, which can cause dquot block
> > > allocations but persists the final dquot counter values via a delwri
> > > list.  Because recovery doesn't gate dquot buffer replay on an LSN, this
> > > means that the initial dquot buffer can be replayed over the (newer)
> > > contents that were delwritten at the end of quotacheck.  In effect, this
> > > re-initializes the dquot counters after they've been updated.  If the
> > > log does not contain any other dquot items to recover, the obsolete
> > > dquot contents will not be corrected by log recovery.
> > > 
> > > Because quotacheck uses a transaction to log the setting of the CHKD
> > > flags in the superblock, we skip quotacheck during the second mount
> > > call, which allows the incorrect icount to remain.
> > > 
> > > Fix this by changing the ondisk dquot initialization function to use
> > > ordered buffers to write out fresh dquot blocks if it detects that we're
> > > running quotacheck.  If the system goes down before quotacheck can
> > > complete, the CHKD flags will not be set in the superblock and the next
> > > mount will run quotacheck again, which can fix uninitialized dquot
> > > buffers.  This requires amending the defer code to maintain ordered
> > > buffer state across defer rolls for the sake of the dquot allocation
> > > code.
> > > 
> > > For regular operations we preserve the current behavior since the dquot
> > > items require properly initialized ondisk dquot records.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_defer.c |   10 ++++++++-
> > >  fs/xfs/xfs_dquot.c        |   51 ++++++++++++++++++++++++++++++++++-----------
> > >  2 files changed, 47 insertions(+), 14 deletions(-)
> > > 
> > ...  
> > > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > > index 7f39dd24d475..cf8b2f4de587 100644
> > > --- a/fs/xfs/xfs_dquot.c
> > > +++ b/fs/xfs/xfs_dquot.c
> > ...
> > > @@ -277,11 +279,34 @@ xfs_qm_init_dquot_blk(
> > >  		}
> > >  	}
> > >  
> > > -	xfs_trans_dquot_buf(tp, bp,
> > > -			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
> > > -			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
> > > -			     XFS_BLF_GDQUOT_BUF)));
> > > -	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > > +	if (type & XFS_DQ_USER) {
> > > +		qflag = XFS_UQUOTA_CHKD;
> > > +		blftype = XFS_BLF_UDQUOT_BUF;
> > > +	} else if (type & XFS_DQ_PROJ) {
> > > +		qflag = XFS_PQUOTA_CHKD;
> > > +		blftype = XFS_BLF_PDQUOT_BUF;
> > > +	} else {
> > > +		qflag = XFS_GQUOTA_CHKD;
> > > +		blftype = XFS_BLF_GDQUOT_BUF;
> > > +	}
> > > +
> > > +	xfs_trans_dquot_buf(tp, bp, blftype);
> > > +
> > > +	/*
> > > +	 * If the CHKD flag isn't set, we're running quotacheck and need to use
> > > +	 * ordered buffers so that the logged initialization buffer does not
> > > +	 * get replayed over the delwritten quotacheck buffer.  If we crash
> > > +	 * before the end of quotacheck, the CHKD flags will not be set in the
> > > +	 * superblock and we'll re-run quotacheck at next mount.
> > > +	 *
> > > +	 * Outside of quotacheck, dquot updates are logged via dquot items and
> > > +	 * we must use the regular buffer logging mechanisms to ensure that the
> > > +	 * initial buffer state is recovered before dquot items.
> > > +	 */
> > > +	if (mp->m_qflags & qflag)
> > > +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > > +	else
> > > +		xfs_trans_ordered_buf(tp, bp);
> > 
> > Ok, I think I follow what's going on here. quotacheck runs and allocates
> > a dquot block and inits it in a transaction. Some time later quotacheck
> > updates dquots in said block and queues the buffer in its own delwri
> > list. Quotacheck completes, the buffer is written back and then the
> > filesystem immediately crashes. We replay the content of the alloc/init
> > transaction over the updated content in the block on disk. This isn't a
> > problem outside of quotacheck because a subsequent dquot update would
> > have also been logged instead of directly written back.
> 
> Correct.
> 
> > Therefore, the purpose of the change above is primarily to avoid logging
> > the init of the dquot buffer during quotacheck. That makes sense, but
> > what about the integrity of this particular transaction? For example,
> > what happens if this transaction commits to the on-disk log and we crash
> > before quotacheck completes and the buffer is written back? I'm assuming
> > recovery would replay the dquot allocation but not the initialization,
> > then quotacheck would run again and find the buffer in an unknown state
> > (perhaps failing a read verifier?). Hm?
> 
> Yes, the read verifier can fail in quotacheck, but quotacheck reacts to
> that by re-trying the buffer read with a NULL buffer ops (in
> xfs_qm_reset_dqcounts_all).  Next, it calls xfs_qm_reset_dqcounts, which
> calls xfs_dqblk_repair to reinitialize the dquot records.  After that,
> xfs_qm_reset_dqcounts resets even more of the dquot state (counters,
> timers, flags, warns).
> 

Ok, I was thinking we might get back to this "read or allocate" path but
it sounds like the earlier reset phase reads every chunk, does this
verifier dance and reinits if necessary. It's somewhat unfortunate that
we can create those conditions intentionally, but I suppose that's
better than the current state in that we at least recover from it. We
should probably document that somewhere to explain why ordering the
buffer like this is safe in quotacheck context.

Brian

> In short, quotacheck will fix anything it doesn't like about the dquot
> buffers that are attached to the quota inodes.
> 
> --D
> 
> > Brian
> > 
> > >  }
> > >  
> > >  /*
> > > 
> > 
>
Darrick J. Wong May 13, 2020, 5:06 p.m. UTC | #7
On Wed, May 13, 2020 at 12:12:23PM -0400, Brian Foster wrote:
> On Wed, May 13, 2020 at 08:26:28AM -0700, Darrick J. Wong wrote:
> > On Wed, May 13, 2020 at 09:14:15AM -0400, Brian Foster wrote:
> > > On Tue, May 12, 2020 at 02:00:33PM -0700, Darrick J. Wong wrote:
> > > > ...
> > > > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > > > index 7f39dd24d475..cf8b2f4de587 100644
> > > > --- a/fs/xfs/xfs_dquot.c
> > > > +++ b/fs/xfs/xfs_dquot.c
> > > ...
> > > > @@ -277,11 +279,34 @@ xfs_qm_init_dquot_blk(
> > > >  		}
> > > >  	}
> > > >  
> > > > -	xfs_trans_dquot_buf(tp, bp,
> > > > -			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
> > > > -			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
> > > > -			     XFS_BLF_GDQUOT_BUF)));
> > > > -	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > > > +	if (type & XFS_DQ_USER) {
> > > > +		qflag = XFS_UQUOTA_CHKD;
> > > > +		blftype = XFS_BLF_UDQUOT_BUF;
> > > > +	} else if (type & XFS_DQ_PROJ) {
> > > > +		qflag = XFS_PQUOTA_CHKD;
> > > > +		blftype = XFS_BLF_PDQUOT_BUF;
> > > > +	} else {
> > > > +		qflag = XFS_GQUOTA_CHKD;
> > > > +		blftype = XFS_BLF_GDQUOT_BUF;
> > > > +	}
> > > > +
> > > > +	xfs_trans_dquot_buf(tp, bp, blftype);
> > > > +
> > > > +	/*
> > > > +	 * If the CHKD flag isn't set, we're running quotacheck and need to use
> > > > +	 * ordered buffers so that the logged initialization buffer does not
> > > > +	 * get replayed over the delwritten quotacheck buffer.  If we crash
> > > > +	 * before the end of quotacheck, the CHKD flags will not be set in the
> > > > +	 * superblock and we'll re-run quotacheck at next mount.
> > > > +	 *
> > > > +	 * Outside of quotacheck, dquot updates are logged via dquot items and
> > > > +	 * we must use the regular buffer logging mechanisms to ensure that the
> > > > +	 * initial buffer state is recovered before dquot items.
> > > > +	 */
> > > > +	if (mp->m_qflags & qflag)
> > > > +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > > > +	else
> > > > +		xfs_trans_ordered_buf(tp, bp);
> > > 
> > > Ok, I think I follow what's going on here. quotacheck runs and allocates
> > > a dquot block and inits it in a transaction. Some time later quotacheck
> > > updates dquots in said block and queues the buffer in its own delwri
> > > list. Quotacheck completes, the buffer is written back and then the
> > > filesystem immediately crashes. We replay the content of the alloc/init
> > > transaction over the updated content in the block on disk. This isn't a
> > > problem outside of quotacheck because a subsequent dquot update would
> > > have also been logged instead of directly written back.
> > 
> > Correct.
> > 
> > > Therefore, the purpose of the change above is primarily to avoid logging
> > > the init of the dquot buffer during quotacheck. That makes sense, but
> > > what about the integrity of this particular transaction? For example,
> > > what happens if this transaction commits to the on-disk log and we crash
> > > before quotacheck completes and the buffer is written back? I'm assuming
> > > recovery would replay the dquot allocation but not the initialization,
> > > then quotacheck would run again and find the buffer in an unknown state
> > > (perhaps failing a read verifier?). Hm?
> > 
> > Yes, the read verifier can fail in quotacheck, but quotacheck reacts to
> > that by re-trying the buffer read with a NULL buffer ops (in
> > xfs_qm_reset_dqcounts_all).  Next, it calls xfs_qm_reset_dqcounts, which
> > calls xfs_dqblk_repair to reinitialize the dquot records.  After that,
> > xfs_qm_reset_dqcounts resets even more of the dquot state (counters,
> > timers, flags, warns).
> > 
> 
> Ok, I was thinking we might get back to this "read or allocate" path but
> it sounds like the earlier reset phase reads every chunk, does this
> verifier dance and reinits if necessary. It's somewhat unfortunate that
> we can create those conditions intentionally, but I suppose that's
> better than the current state in that we at least recover from it. We
> should probably document that somewhere to explain why ordering the
> buffer like this is safe in quotacheck context.

Yeah, I was trying to get everyone there in the comment above, but maybe
this ought to be more explicitly stated in the comment.  The current
version of that looks like:

	/*
	 * When quotacheck runs, we use delayed writes to update all the
	 * dquots on disk in an efficient manner instead of logging the
	 * individual dquot changes as they are made.
	 *
	 * Hence if we log the buffer that we allocate here, then crash
	 * post-quotacheck while the logged initialisation is still in
	 * the active region of the log, we can lose the information
	 * quotacheck wrote directly to the buffer. That is, log
	 * recovery will replay the dquot buffer initialisation over the
	 * top of whatever information quotacheck had written to the
	 * buffer.
	 *
	 * To avoid this problem, dquot allocation during quotacheck
	 * needs to avoid logging the initialised buffer, but we still
	 * need to have writeback of the buffer pin the tail of the log
	 * so that it is initialised on disk before we remove the
	 * allocation transaction from the active region of the log.
	 * Marking the buffer as ordered instead of logging it provides
	 * this behaviour.
	 *
	 * If we crash before quotacheck completes, a subsequent
	 * quotacheck run will re-allocate and re-initialize the dquot
	 * records as needed.
	 */
	if (!(mp->m_qflags & qflag))
		xfs_trans_ordered_buf(tp, bp);
	else
		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);


--D

> Brian
> 
> > In short, quotacheck will fix anything it doesn't like about the dquot
> > buffers that are attached to the quota inodes.
> > 
> > --D
> > 
> > > Brian
> > > 
> > > >  }
> > > >  
> > > >  /*
> > > > 
> > > 
> > 
>

Patch
diff mbox series

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 00bd0e478829..c56d4715f521 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -243,10 +243,13 @@  xfs_defer_trans_roll(
 	struct xfs_log_item		*lip;
 	struct xfs_buf			*bplist[XFS_DEFER_OPS_NR_BUFS];
 	struct xfs_inode		*iplist[XFS_DEFER_OPS_NR_INODES];
+	unsigned int			ordered = 0; /* bitmap */
 	int				bpcount = 0, ipcount = 0;
 	int				i;
 	int				error;
 
+	BUILD_BUG_ON(NBBY * sizeof(ordered) < XFS_DEFER_OPS_NR_BUFS);
+
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 		switch (lip->li_type) {
 		case XFS_LI_BUF:
@@ -257,7 +260,10 @@  xfs_defer_trans_roll(
 					ASSERT(0);
 					return -EFSCORRUPTED;
 				}
-				xfs_trans_dirty_buf(tp, bli->bli_buf);
+				if (bli->bli_flags & XFS_BLI_ORDERED)
+					ordered |= (1U << bpcount);
+				else
+					xfs_trans_dirty_buf(tp, bli->bli_buf);
 				bplist[bpcount++] = bli->bli_buf;
 			}
 			break;
@@ -298,6 +304,8 @@  xfs_defer_trans_roll(
 	/* Rejoin the buffers and dirty them so the log moves forward. */
 	for (i = 0; i < bpcount; i++) {
 		xfs_trans_bjoin(tp, bplist[i]);
+		if (ordered & (1U << i))
+			xfs_trans_ordered_buf(tp, bplist[i]);
 		xfs_trans_bhold(tp, bplist[i]);
 	}
 
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 7f39dd24d475..cf8b2f4de587 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -243,16 +243,18 @@  xfs_qm_adjust_dqtimers(
  */
 STATIC void
 xfs_qm_init_dquot_blk(
-	xfs_trans_t	*tp,
-	xfs_mount_t	*mp,
-	xfs_dqid_t	id,
-	uint		type,
-	xfs_buf_t	*bp)
+	struct xfs_trans	*tp,
+	struct xfs_mount	*mp,
+	xfs_dqid_t		id,
+	uint			type,
+	struct xfs_buf		*bp)
 {
 	struct xfs_quotainfo	*q = mp->m_quotainfo;
-	xfs_dqblk_t	*d;
-	xfs_dqid_t	curid;
-	int		i;
+	struct xfs_dqblk	*d;
+	xfs_dqid_t		curid;
+	unsigned int		qflag;
+	unsigned int		blftype;
+	int			i;
 
 	ASSERT(tp);
 	ASSERT(xfs_buf_islocked(bp));
@@ -277,11 +279,34 @@  xfs_qm_init_dquot_blk(
 		}
 	}
 
-	xfs_trans_dquot_buf(tp, bp,
-			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
-			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
-			     XFS_BLF_GDQUOT_BUF)));
-	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
+	if (type & XFS_DQ_USER) {
+		qflag = XFS_UQUOTA_CHKD;
+		blftype = XFS_BLF_UDQUOT_BUF;
+	} else if (type & XFS_DQ_PROJ) {
+		qflag = XFS_PQUOTA_CHKD;
+		blftype = XFS_BLF_PDQUOT_BUF;
+	} else {
+		qflag = XFS_GQUOTA_CHKD;
+		blftype = XFS_BLF_GDQUOT_BUF;
+	}
+
+	xfs_trans_dquot_buf(tp, bp, blftype);
+
+	/*
+	 * If the CHKD flag isn't set, we're running quotacheck and need to use
+	 * ordered buffers so that the logged initialization buffer does not
+	 * get replayed over the delwritten quotacheck buffer.  If we crash
+	 * before the end of quotacheck, the CHKD flags will not be set in the
+	 * superblock and we'll re-run quotacheck at next mount.
+	 *
+	 * Outside of quotacheck, dquot updates are logged via dquot items and
+	 * we must use the regular buffer logging mechanisms to ensure that the
+	 * initial buffer state is recovered before dquot items.
+	 */
+	if (mp->m_qflags & qflag)
+		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
+	else
+		xfs_trans_ordered_buf(tp, bp);
 }
 
 /*