diff mbox series

[07/13] xfs: document the invalidate_bdev call in invalidate_bdev

Message ID 20230809220545.1308228-8-hch@lst.de (mailing list archive)
State New, archived
Headers show
Series [01/13] xfs: reformat the xfs_fs_free prototype | expand

Commit Message

Christoph Hellwig Aug. 9, 2023, 10:05 p.m. UTC
Copy and paste the commit message from Darrick into a comment to explain
the seemly odd invalidate_bdev in xfs_shutdown_devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_super.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

Comments

Darrick J. Wong Aug. 9, 2023, 10:39 p.m. UTC | #1
On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote:
> Copy and paste the commit message from Darrick into a comment to explain
> the seemly odd invalidate_bdev in xfs_shutdown_devices.

      ^ seemingly?
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_super.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 4ae3b01ed038c7..c169beb0d8cab3 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -399,6 +399,32 @@ STATIC void
>  xfs_shutdown_devices(
>  	struct xfs_mount	*mp)
>  {
> +	/*
> +	 * Udev is triggered whenever anyone closes a block device or unmounts
> +	 * a file systemm on a block device.
> +	 * The default udev rules invoke blkid to read the fs super and create
> +	 * symlinks to the bdev under /dev/disk.  For this, it uses buffered
> +	 * reads through the page cache.
> +	 *
> +	 * xfs_db also uses buffered reads to examine metadata.  There is no
> +	 * coordination between xfs_db and udev, which means that they can run
> +	 * concurrently.  Note there is no coordination between the kernel and
> +	 * blkid either.
> +	 *
> +	 * On a system with 64k pages, the page cache can cache the superblock
> +	 * and the root inode (and hence the root directory) with the same 64k
> +	 * page.  If udev spawns blkid after the mkfs and the system is busy
> +	 * enough that it is still running when xfs_db starts up, they'll both
> +	 * read from the same page in the pagecache.
> +	 *
> +	 * The unmount writes updated inode metadata to disk directly.  The XFS
> +	 * buffer cache does not use the bdev pagecache, nor does it invalidate
> +	 * the pagecache on umount.  If the above scenario occurs, the pagecache

This sentence reads a little strangely, since "nor does it invalidate"
would seem to conflict with the invalidate_bdev call below.  I suggest
changing the verb a bit:

"The XFS buffer cache does not use the bdev pagecache, so it needs to
invalidate that pagecache on unmount."

With those two things changed,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +	 * no longer reflects what's on disk, xfs_db reads the stale metadata,
> +	 * and fails to find /a.  Most of the time this succeeds because closing
> +	 * a bdev invalidates the page cache, but when processes race, everyone
> +	 * loses.
> +	 */
>  	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
>  		blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
>  		invalidate_bdev(mp->m_logdev_targp->bt_bdev);
> -- 
> 2.39.2
>
Christian Brauner Aug. 10, 2023, 8:20 a.m. UTC | #2
On Wed, Aug 09, 2023 at 03:39:23PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote:
> > Copy and paste the commit message from Darrick into a comment to explain
> > the seemly odd invalidate_bdev in xfs_shutdown_devices.
> 
>       ^ seemingly?
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/xfs_super.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 4ae3b01ed038c7..c169beb0d8cab3 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -399,6 +399,32 @@ STATIC void
> >  xfs_shutdown_devices(
> >  	struct xfs_mount	*mp)
> >  {
> > +	/*
> > +	 * Udev is triggered whenever anyone closes a block device or unmounts
> > +	 * a file systemm on a block device.
> > +	 * The default udev rules invoke blkid to read the fs super and create
> > +	 * symlinks to the bdev under /dev/disk.  For this, it uses buffered
> > +	 * reads through the page cache.
> > +	 *
> > +	 * xfs_db also uses buffered reads to examine metadata.  There is no
> > +	 * coordination between xfs_db and udev, which means that they can run
> > +	 * concurrently.  Note there is no coordination between the kernel and
> > +	 * blkid either.
> > +	 *
> > +	 * On a system with 64k pages, the page cache can cache the superblock
> > +	 * and the root inode (and hence the root directory) with the same 64k
> > +	 * page.  If udev spawns blkid after the mkfs and the system is busy
> > +	 * enough that it is still running when xfs_db starts up, they'll both
> > +	 * read from the same page in the pagecache.
> > +	 *
> > +	 * The unmount writes updated inode metadata to disk directly.  The XFS
> > +	 * buffer cache does not use the bdev pagecache, nor does it invalidate
> > +	 * the pagecache on umount.  If the above scenario occurs, the pagecache
> 
> This sentence reads a little strangely, since "nor does it invalidate"
> would seem to conflict with the invalidate_bdev call below.  I suggest
> changing the verb a bit:
> 
> "The XFS buffer cache does not use the bdev pagecache, so it needs to
> invalidate that pagecache on unmount."
> 
> With those two things changed,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Fixed in-tree.
Matthew Wilcox Aug. 10, 2023, 3:22 p.m. UTC | #3
On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote:
> +	/*
> +	 * Udev is triggered whenever anyone closes a block device or unmounts
> +	 * a file systemm on a block device.
> +	 * The default udev rules invoke blkid to read the fs super and create
> +	 * symlinks to the bdev under /dev/disk.  For this, it uses buffered
> +	 * reads through the page cache.
> +	 *
> +	 * xfs_db also uses buffered reads to examine metadata.  There is no
> +	 * coordination between xfs_db and udev, which means that they can run
> +	 * concurrently.  Note there is no coordination between the kernel and
> +	 * blkid either.
> +	 *
> +	 * On a system with 64k pages, the page cache can cache the superblock
> +	 * and the root inode (and hence the root directory) with the same 64k
> +	 * page.  If udev spawns blkid after the mkfs and the system is busy
> +	 * enough that it is still running when xfs_db starts up, they'll both
> +	 * read from the same page in the pagecache.
> +	 *
> +	 * The unmount writes updated inode metadata to disk directly.  The XFS
> +	 * buffer cache does not use the bdev pagecache, nor does it invalidate
> +	 * the pagecache on umount.  If the above scenario occurs, the pagecache
> +	 * no longer reflects what's on disk, xfs_db reads the stale metadata,
> +	 * and fails to find /a.  Most of the time this succeeds because closing
> +	 * a bdev invalidates the page cache, but when processes race, everyone
> +	 * loses.
> +	 */
>  	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
>  		blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
>  		invalidate_bdev(mp->m_logdev_targp->bt_bdev);

While I have no complaints with this as a commit message, it's just too
verbose for an inline comment, IMO.  Something pithier and more generic
would seem appropriate.  How about:

	/*
	 * Prevent userspace (eg blkid or xfs_db) from seeing stale data.
	 * XFS is not coherent with the bdev's page cache.
	 */
Christoph Hellwig Aug. 10, 2023, 3:52 p.m. UTC | #4
On Thu, Aug 10, 2023 at 04:22:15PM +0100, Matthew Wilcox wrote:
> >  		blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
> >  		invalidate_bdev(mp->m_logdev_targp->bt_bdev);
> 
> While I have no complaints with this as a commit message, it's just too
> verbose for an inline comment, IMO.  Something pithier and more generic
> would seem appropriate.  How about:
> 
> 	/*
> 	 * Prevent userspace (eg blkid or xfs_db) from seeing stale data.
> 	 * XFS is not coherent with the bdev's page cache.
> 	 */

Well, this completely misses the point.  The point is that XFS should
never have to invalidate the page cache because it's not using it,
but it has to due to weird races.  I tried to condese the message but
I could not come up with a good one that's not losing information.
Christoph Hellwig Aug. 10, 2023, 3:53 p.m. UTC | #5
On Wed, Aug 09, 2023 at 03:39:23PM -0700, Darrick J. Wong wrote:
> > +	 * read from the same page in the pagecache.
> > +	 *
> > +	 * The unmount writes updated inode metadata to disk directly.  The XFS
> > +	 * buffer cache does not use the bdev pagecache, nor does it invalidate
> > +	 * the pagecache on umount.  If the above scenario occurs, the pagecache
> 
> This sentence reads a little strangely, since "nor does it invalidate"
> would seem to conflict with the invalidate_bdev call below.  I suggest
> changing the verb a bit:
> 
> "The XFS buffer cache does not use the bdev pagecache, so it needs to
> invalidate that pagecache on unmount."

Agreed. I'll forward it to the original author of the sentence time
permitting :)
Darrick J. Wong Aug. 10, 2023, 3:57 p.m. UTC | #6
On Thu, Aug 10, 2023 at 04:22:15PM +0100, Matthew Wilcox wrote:
> On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote:
> > +	/*
> > +	 * Udev is triggered whenever anyone closes a block device or unmounts
> > +	 * a file systemm on a block device.
> > +	 * The default udev rules invoke blkid to read the fs super and create
> > +	 * symlinks to the bdev under /dev/disk.  For this, it uses buffered
> > +	 * reads through the page cache.
> > +	 *
> > +	 * xfs_db also uses buffered reads to examine metadata.  There is no
> > +	 * coordination between xfs_db and udev, which means that they can run
> > +	 * concurrently.  Note there is no coordination between the kernel and
> > +	 * blkid either.
> > +	 *
> > +	 * On a system with 64k pages, the page cache can cache the superblock
> > +	 * and the root inode (and hence the root directory) with the same 64k
> > +	 * page.  If udev spawns blkid after the mkfs and the system is busy
> > +	 * enough that it is still running when xfs_db starts up, they'll both
> > +	 * read from the same page in the pagecache.
> > +	 *
> > +	 * The unmount writes updated inode metadata to disk directly.  The XFS
> > +	 * buffer cache does not use the bdev pagecache, nor does it invalidate
> > +	 * the pagecache on umount.  If the above scenario occurs, the pagecache
> > +	 * no longer reflects what's on disk, xfs_db reads the stale metadata,
> > +	 * and fails to find /a.  Most of the time this succeeds because closing
> > +	 * a bdev invalidates the page cache, but when processes race, everyone
> > +	 * loses.
> > +	 */
> >  	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
> >  		blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
> >  		invalidate_bdev(mp->m_logdev_targp->bt_bdev);
> 
> While I have no complaints with this as a commit message, it's just too
> verbose for an inline comment, IMO.  Something pithier and more generic
> would seem appropriate.  How about:
> 
> 	/*
> 	 * Prevent userspace (eg blkid or xfs_db) from seeing stale data.
> 	 * XFS is not coherent with the bdev's page cache.

"XFS' buffer cache is not coherent with the bdev's page cache."
?

--D

> 	 */
Darrick J. Wong Aug. 10, 2023, 4 p.m. UTC | #7
On Thu, Aug 10, 2023 at 05:52:25PM +0200, Christoph Hellwig wrote:
> On Thu, Aug 10, 2023 at 04:22:15PM +0100, Matthew Wilcox wrote:
> > >  		blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
> > >  		invalidate_bdev(mp->m_logdev_targp->bt_bdev);
> > 
> > While I have no complaints with this as a commit message, it's just too
> > verbose for an inline comment, IMO.  Something pithier and more generic
> > would seem appropriate.  How about:
> > 
> > 	/*
> > 	 * Prevent userspace (eg blkid or xfs_db) from seeing stale data.
> > 	 * XFS is not coherent with the bdev's page cache.
> > 	 */
> 
> Well, this completely misses the point.  The point is that XFS should
> never have to invalidate the page cache because it's not using it,
> but it has to due to weird races.  I tried to condese the message but
> I could not come up with a good one that's not losing information.

Agreed -- it took me a while to set up an arm64 box with just the right
debugging info to figure out why certain fstests were flaky.  I do think
it's useful (despite my other reply to willy) to retain the defect
details for hard-to-reproduce errors, and the only way to do that
without encountering the dead url problem is to dump it in a huge
commit message or a comment.

(Too bad there's no way to have a commit whose code comments reference
the commit id of that commit to say "Hey, you need to read this commit
before you touch this line"...)

--D
diff mbox series

Patch

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 4ae3b01ed038c7..c169beb0d8cab3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -399,6 +399,32 @@  STATIC void
 xfs_shutdown_devices(
 	struct xfs_mount	*mp)
 {
+	/*
+	 * Udev is triggered whenever anyone closes a block device or unmounts
+	 * a file systemm on a block device.
+	 * The default udev rules invoke blkid to read the fs super and create
+	 * symlinks to the bdev under /dev/disk.  For this, it uses buffered
+	 * reads through the page cache.
+	 *
+	 * xfs_db also uses buffered reads to examine metadata.  There is no
+	 * coordination between xfs_db and udev, which means that they can run
+	 * concurrently.  Note there is no coordination between the kernel and
+	 * blkid either.
+	 *
+	 * On a system with 64k pages, the page cache can cache the superblock
+	 * and the root inode (and hence the root directory) with the same 64k
+	 * page.  If udev spawns blkid after the mkfs and the system is busy
+	 * enough that it is still running when xfs_db starts up, they'll both
+	 * read from the same page in the pagecache.
+	 *
+	 * The unmount writes updated inode metadata to disk directly.  The XFS
+	 * buffer cache does not use the bdev pagecache, nor does it invalidate
+	 * the pagecache on umount.  If the above scenario occurs, the pagecache
+	 * no longer reflects what's on disk, xfs_db reads the stale metadata,
+	 * and fails to find /a.  Most of the time this succeeds because closing
+	 * a bdev invalidates the page cache, but when processes race, everyone
+	 * loses.
+	 */
 	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
 		blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
 		invalidate_bdev(mp->m_logdev_targp->bt_bdev);