
xfs: recheck appropriateness of map_shared lock

Message ID Y8ib6ls32e/pJezE@magnolia (mailing list archive)
State New, archived
Series: xfs: recheck appropriateness of map_shared lock

Commit Message

Darrick J. Wong Jan. 19, 2023, 1:24 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

While fuzzing the data fork extent count on a btree-format directory
with xfs/375, I observed the following (excerpted) splat:

XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
------------[ cut here ]------------
WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
Call Trace:
 <TASK>
 xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 __x64_sys_ioctl+0x82/0xa0
 do_syscall_64+0x2b/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

The cause of this is a race condition in xfs_ilock_data_map_shared,
which performs an unlocked access to the data fork to guess which lock
mode it needs:

Thread 0                          Thread 1

xfs_need_iread_extents
<observe no iext tree>
xfs_ilock(..., ILOCK_EXCL)
xfs_iread_extents
<observe no iext tree>
<check ILOCK_EXCL>
<load bmbt extents into iext>
<notice iext size doesn't
 match nextents>
                                  xfs_need_iread_extents
                                  <observe iext tree>
                                  xfs_ilock(..., ILOCK_SHARED)
<tear down iext tree>
xfs_iunlock(..., ILOCK_EXCL)
                                  xfs_iread_extents
                                  <observe no iext tree>
                                  <check ILOCK_EXCL>
                                  *BOOM*

Mitigate this race by having thread 1 recheck xfs_need_iread_extents
after taking the shared ILOCK.  If the iext tree isn't present, then we
need to upgrade to the exclusive ILOCK to try to load the bmbt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_inode.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

Comments

Dave Chinner Jan. 19, 2023, 5:14 a.m. UTC | #1
On Wed, Jan 18, 2023 at 05:24:58PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While fuzzing the data fork extent count on a btree-format directory
> with xfs/375, I observed the following (excerpted) splat:
> 
> XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
> Call Trace:
>  <TASK>
>  xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  __x64_sys_ioctl+0x82/0xa0
>  do_syscall_64+0x2b/0x80
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> The cause of this is a race condition in xfs_ilock_data_map_shared,
> which performs an unlocked access to the data fork to guess which lock
> mode it needs:
> 
> Thread 0                          Thread 1
> 
> xfs_need_iread_extents
> <observe no iext tree>
> xfs_ilock(..., ILOCK_EXCL)
> xfs_iread_extents
> <observe no iext tree>
> <check ILOCK_EXCL>
> <load bmbt extents into iext>
> <notice iext size doesn't
>  match nextents>
>                                   xfs_need_iread_extents
>                                   <observe iext tree>
>                                   xfs_ilock(..., ILOCK_SHARED)
> <tear down iext tree>
> xfs_iunlock(..., ILOCK_EXCL)
>                                   xfs_iread_extents
>                                   <observe no iext tree>
>                                   <check ILOCK_EXCL>
>                                   *BOOM*
> 
> mitigate this race by having thread 1 to recheck xfs_need_iread_extents
> after taking the shared ILOCK.  If the iext tree isn't present, then we
> need to upgrade to the exclusive ILOCK to try to load the bmbt.

Yup, I see the problem - this check is failing:

        if (XFS_IS_CORRUPT(mp, ir.loaded != ifp->if_nextents)) {
                error = -EFSCORRUPTED;
                goto out;
        }

and that results in calling xfs_iext_destroy() to tear down the
extent tree.

But we know the BMBT is corrupted and the extent list cannot be read
until the corruption is fixed. IOWs, we can't access any data in the
inode no matter how we lock it until the corruption is repaired.

> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_inode.c |   29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index d354ea2b74f9..6ce1e0e9f256 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -117,6 +117,20 @@ xfs_ilock_data_map_shared(
>  	if (xfs_need_iread_extents(&ip->i_df))
>  		lock_mode = XFS_ILOCK_EXCL;
>  	xfs_ilock(ip, lock_mode);
> +
> +	/*
> +	 * It's possible that the unlocked access of the data fork to determine
> +	 * the lock mode could have raced with another thread that was failing
> +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> +	 * the lock mode and upgrade to an exclusive lock if we need to.
> +	 */
> +	if (lock_mode == XFS_ILOCK_SHARED &&
> +	    xfs_need_iread_extents(&ip->i_df)) {
> +		xfs_iunlock(ip, lock_mode);
> +		lock_mode = XFS_ILOCK_EXCL;
> +		xfs_ilock(ip, lock_mode);
> +	}

.... and this makes me cringe. :/

If we hit this race condition, re-reading the extent list from disk
isn't going to fix the corruption, so I don't see much point in
papering over the problem just by changing the locking and failing
to read in the extent list again and returning -EFSCORRUPTED to the
operation.

So.... shouldn't we mark the inode as sick when we detect the extent
list corruption issue? i.e. before destroying the iext tree, calling
xfs_inode_mark_sick(XFS_SICK_INO_BMBTD) (or BMBTA, depending on the
fork being read) so that there is a record of the BMBT being
corrupt?

That would mean that this path simply becomes:

	if (ip->i_sick & XFS_SICK_INO_BMBTD) {
		xfs_iunlock(ip, lock_mode);
		return -EFSCORRUPTED;
	}

Which makes it pretty clear that there's no point continuing
because we can't read in the extent list, and in doing so we've
removed the race condition caused by temporarily filling the in-core
extent list.

> +
>  	return lock_mode;
>  }
>  
> @@ -129,6 +143,21 @@ xfs_ilock_attr_map_shared(
>  	if (xfs_inode_has_attr_fork(ip) && xfs_need_iread_extents(&ip->i_af))
>  		lock_mode = XFS_ILOCK_EXCL;
>  	xfs_ilock(ip, lock_mode);
> +
> +	/*
> +	 * It's possible that the unlocked access of the attr fork to determine
> +	 * the lock mode could have raced with another thread that was failing
> +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> +	 * the lock mode and upgrade to an exclusive lock if we need to.
> +	 */
> +	if (lock_mode == XFS_ILOCK_SHARED &&
> +	    xfs_inode_has_attr_fork(ip) &&
> +	    xfs_need_iread_extents(&ip->i_af)) {
> +		xfs_iunlock(ip, lock_mode);
> +		lock_mode = XFS_ILOCK_EXCL;
> +		xfs_ilock(ip, lock_mode);
> +	}

And this can just check for XFS_SICK_INO_BMBTA instead...

Cheers,

Dave.
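Dave's sick-flag scheme can be sketched in userspace as well: record "the BMBT is corrupt" once at detection time, then have the lock path fail fast instead of retrying the doomed read. Everything here (inode_model, MODEL_SICK_BMBTD) is a stand-in modeled on the mail above, not real XFS code:

```c
#include <assert.h>

/* Hypothetical model of the health-flag scheme: a per-inode sick mask
 * set where xfs_iread_extents() detects the extent-count mismatch
 * today, and checked by the lock helper before any work is done. */
#define MODEL_SICK_BMBTD	(1 << 0)	/* models XFS_SICK_INO_BMBTD */
#define MODEL_EFSCORRUPTED	117		/* EFSCORRUPTED's value on Linux */

struct inode_model {
	unsigned int i_sick;
	int read_attempts;	/* counts how often we would hit the disk */
};

/* Called at the point where the corruption is first detected,
 * just before tearing down the iext tree. */
static void mark_bmbtd_sick(struct inode_model *ip)
{
	ip->i_sick |= MODEL_SICK_BMBTD;
}

/* Model of the lock-and-read path: bail out early if the data fork
 * is already known to be corrupt, since re-reading cannot help. */
static int lock_and_read_extents(struct inode_model *ip)
{
	if (ip->i_sick & MODEL_SICK_BMBTD)
		return -MODEL_EFSCORRUPTED;
	ip->read_attempts++;
	return 0;
}
```

The point of the scheme is visible in the counter: once the inode is marked sick, no further read attempts are made, which also removes the window where a transient iext tree exists.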
Christoph Hellwig Jan. 19, 2023, 6:31 p.m. UTC | #2
On Wed, Jan 18, 2023 at 05:24:58PM -0800, Darrick J. Wong wrote:
>  	xfs_ilock(ip, lock_mode);
> +
> +	/*
> +	 * It's possible that the unlocked access of the data fork to determine
> +	 * the lock mode could have raced with another thread that was failing
> +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> +	 * the lock mode and upgrade to an exclusive lock if we need to.
> +	 */
> +	if (lock_mode == XFS_ILOCK_SHARED &&
> +	    xfs_need_iread_extents(&ip->i_df)) {

Eww.  I think the proper fix here is to make sure
xfs_need_iread_extents does not return false until we've actually
read the extents.  So I think we'll need a new inode flag
XFS_INEED_READ - gets set when reading an inode in btree format,
and gets cleared at the very end of xfs_iread_extents once we know
the read succeeded.
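Christoph's proposal can be sketched the same way. XFS_INEED_READ is a name from his mail, not an existing kernel flag, and the model below invents its own types (fork_flag_model): the flag is set when the btree-format inode is read in and cleared only once the extent load has fully succeeded, so a failed load that tears the iext tree back down never makes the unlocked check transiently report "extents are in core":

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the XFS_INEED_READ idea: the "do I need the
 * exclusive lock" check is keyed off a flag with clean semantics
 * instead of off the presence of an iext tree. */
struct fork_flag_model {
	bool ineed_read;
};

static void inode_setup_btree_fork(struct fork_flag_model *f)
{
	f->ineed_read = true;		/* set at inode read time */
}

static bool need_iread_extents(const struct fork_flag_model *f)
{
	return f->ineed_read;
}

/* ok says whether the bmbt load succeeded.  On failure the flag stays
 * set, even though an iext tree may have existed briefly while the
 * load was in flight. */
static void iread_extents(struct fork_flag_model *f, bool ok)
{
	if (ok)
		f->ineed_read = false;	/* cleared only on success */
}
```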
Christoph Hellwig Jan. 19, 2023, 6:39 p.m. UTC | #3
On Thu, Jan 19, 2023 at 04:14:11PM +1100, Dave Chinner wrote:
> If we hit this race condition, re-reading the extent list from disk
> isn't going to fix the corruption, so I don't see much point in
> papering over the problem just by changing the locking and failing
> to read in the extent list again and returning -EFSCORRUPTED to the
> operation.

Yep.

> So.... shouldn't we mark the inode as sick when we detect the extent
> list corruption issue? i.e. before destroying the iext tree, calling
> xfs_inode_mark_sick(XFS_SICK_INO_BMBTD) (or BMBTA, depending on the
> fork being read) so that there is a record of the BMBT being
> corrupt?

Yes.

> That would mean that this path simply becomes:
> 
> 	if (ip->i_sick & XFS_SICK_INO_BMBTD) {
> 		xfs_iunlock(ip, lock_mode);
> 		return -EFSCORRUPTED;
> 	}

This path being xfs_ilock_{data,attr}_map_shared?  These don't
return an error.  But if we make sure xfs_need_iread_extents
returns true for XFS_SICK_INO_BMBTD, xfs_iread_extents can
return -EFSCORRUPTED.
Dave Chinner Jan. 19, 2023, 8:34 p.m. UTC | #4
On Thu, Jan 19, 2023 at 10:39:34AM -0800, Christoph Hellwig wrote:
> On Thu, Jan 19, 2023 at 04:14:11PM +1100, Dave Chinner wrote:
> > If we hit this race condition, re-reading the extent list from disk
> > isn't going to fix the corruption, so I don't see much point in
> > papering over the problem just by changing the locking and failing
> > to read in the extent list again and returning -EFSCORRUPTED to the
> > operation.
> 
> Yep.
> 
> > So.... shouldn't we mark the inode as sick when we detect the extent
> > list corruption issue? i.e. before destroying the iext tree, calling
> > xfs_inode_mark_sick(XFS_SICK_INO_BMBTD) (or BMBTA, depending on the
> > fork being read) so that there is a record of the BMBT being
> > corrupt?
> 
> Yes.
> 
> > That would mean that this path simply becomes:
> > 
> > 	if (ip->i_sick & XFS_SICK_INO_BMBTD) {
> > 		xfs_iunlock(ip, lock_mode);
> > 		return -EFSCORRUPTED;
> > 	}
> 
> This path being xfs_ilock_{data,attr}_map_shared?  These don't
> return an error.

I was thinking we just change the function parameters to take a "int
*lockmode" parameter and return an error state similar to
what we do in the IO path with the xfs_ilock_iocb() wrapper.

> But if we make sure xfs_need_iread_extents
> returns true for XFS_SICK_INO_BMBTD, xfs_iread_extents can
> return -EFSCORRUPTED.

I don't think that solves the race condition because
xfs_need_iread_extents() is run unlocked. Just like it can race with
filling the extent list and then removing it again while we wait on
the ILOCK, it can return true before XFS_SICK_INO_BMBTD is set and
then when we get the lock we find XFS_SICK_INO_BMBTD is set and
extent list is empty...

Hence I think the check for extent list corruption has to be done
after we gain the inode lock so we wait correctly for the result of
the racing extent loading before proceeding.

Cheers,

Dave.
Darrick J. Wong Feb. 28, 2023, 8:08 p.m. UTC | #5
On Thu, Jan 19, 2023 at 04:14:11PM +1100, Dave Chinner wrote:
> On Wed, Jan 18, 2023 at 05:24:58PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > While fuzzing the data fork extent count on a btree-format directory
> > with xfs/375, I observed the following (excerpted) splat:
> > 
> > XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
> > Call Trace:
> >  <TASK>
> >  xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
> >  xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
> >  xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
> >  xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
> >  xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
> >  xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
> >  xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
> >  xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
> >  xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
> >  __x64_sys_ioctl+0x82/0xa0
> >  do_syscall_64+0x2b/0x80
> >  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > 
> > The cause of this is a race condition in xfs_ilock_data_map_shared,
> > which performs an unlocked access to the data fork to guess which lock
> > mode it needs:
> > 
> > Thread 0                          Thread 1
> > 
> > xfs_need_iread_extents
> > <observe no iext tree>
> > xfs_ilock(..., ILOCK_EXCL)
> > xfs_iread_extents
> > <observe no iext tree>
> > <check ILOCK_EXCL>
> > <load bmbt extents into iext>
> > <notice iext size doesn't
> >  match nextents>
> >                                   xfs_need_iread_extents
> >                                   <observe iext tree>
> >                                   xfs_ilock(..., ILOCK_SHARED)
> > <tear down iext tree>
> > xfs_iunlock(..., ILOCK_EXCL)
> >                                   xfs_iread_extents
> >                                   <observe no iext tree>
> >                                   <check ILOCK_EXCL>
> >                                   *BOOM*
> > 
> > mitigate this race by having thread 1 to recheck xfs_need_iread_extents
> > after taking the shared ILOCK.  If the iext tree isn't present, then we
> > need to upgrade to the exclusive ILOCK to try to load the bmbt.
> 
> Yup, I see the problem - this check is failing:
> 
>         if (XFS_IS_CORRUPT(mp, ir.loaded != ifp->if_nextents)) {
>                 error = -EFSCORRUPTED;
>                 goto out;
>         }
> 
> and that results in calling xfs_iext_destroy() to tear down the
> extent tree.
> 
> But we know the BMBT is corrupted and the extent list cannot be read
> until the corruption is fixed. IOWs, we can't access any data in the
> inode no matter how we lock it until the corruption is repaired.
> 
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_inode.c |   29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index d354ea2b74f9..6ce1e0e9f256 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -117,6 +117,20 @@ xfs_ilock_data_map_shared(
> >  	if (xfs_need_iread_extents(&ip->i_df))
> >  		lock_mode = XFS_ILOCK_EXCL;
> >  	xfs_ilock(ip, lock_mode);
> > +
> > +	/*
> > +	 * It's possible that the unlocked access of the data fork to determine
> > +	 * the lock mode could have raced with another thread that was failing
> > +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> > +	 * the lock mode and upgrade to an exclusive lock if we need to.
> > +	 */
> > +	if (lock_mode == XFS_ILOCK_SHARED &&
> > +	    xfs_need_iread_extents(&ip->i_df)) {
> > +		xfs_iunlock(ip, lock_mode);
> > +		lock_mode = XFS_ILOCK_EXCL;
> > +		xfs_ilock(ip, lock_mode);
> > +	}
> 
> .... and this makes me cringe. :/
> 
> If we hit this race condition, re-reading the extent list from disk
> isn't going to fix the corruption, so I don't see much point in
> papering over the problem just by changing the locking and failing
> to read in the extent list again and returning -EFSCORRUPTED to the
> operation.

Doing it this (suboptimal) way means that we can backport the race fix
to older kernels without having to push the API change as well.  Threads
will continue to (pointlessly) try to load the iext tree from the
corrupt btree, but at least they won't be doing it while holding
ILOCK_SHARED.

> So.... shouldn't we mark the inode as sick when we detect the extent
> list corruption issue? i.e. before destroying the iext tree, calling
> xfs_inode_mark_sick(XFS_SICK_INO_BMBTD) (or BMBTA, depending on the
> fork being read) so that there is a record of the BMBT being
> corrupt?

Yes, we should, but the codebase is not yet ready to use
XFS_SICK_INO_BMBTD to detect bmbt corruption.  Notice this other
function call in xfs_iread_extents:

	error = xfs_btree_visit_blocks(cur, xfs_iread_bmbt_block,
			XFS_BTREE_VISIT_RECORDS, &ir);

Corruption errors in the btree code also trigger the xfs_iext_destroy
call, but the generic btree code hasn't yet been outfitted with the
appropriate _mark_sick calls to set the XFS_SICK state.

Patches to add that have been out for review since November 2019.  In
the past 39 months, only one reviewer (Brian) came forth:

https://lore.kernel.org/linux-xfs/157375555426.3692735.1357467392517392169.stgit@magnolia/

That review ended on the suggestion that callers of xfs_buf_read should
push the necessary context information through struct xfs_buf so that
verifiers themselves can trigger the health state updates.

In other words, Brian wanted me to explore capturing local variables
from the caller's state and passing the captured information to a
caller-supplied callback function.  Many other languages provide this in
the form of closures and lambda functions, but C is not one of them.  I
concluded that this approach was not feasible and moved on.
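The standard C substitute for a closure is a function pointer paired with an explicit context pointer: the caller packs its "captured" locals into a struct and threads it through the callee. A toy sketch of what that plumbing amounts to, with all names invented (it does not claim to capture the full shape of what was asked for on the list):

```c
#include <assert.h>

/* The caller's "captured" state, passed by hand instead of closed
 * over: which fork is being verified, plus accumulated health flags. */
struct verify_ctx {
	int whichfork;
	int sick_mask;
};

typedef void (*mark_sick_fn)(void *ctx, int flag);

/* The caller-supplied callback: unpack the context and record the
 * health state update. */
static void mark_sick(void *opaque, int flag)
{
	struct verify_ctx *ctx = opaque;
	ctx->sick_mask |= flag;
}

/* Stand-in for a buffer verifier: it sees only the callback and the
 * opaque context, not the caller's locals. */
static void verifier(int corrupt, mark_sick_fn cb, void *ctx)
{
	if (corrupt)
		cb(ctx, 1 << 0);
}
```

The cost that makes this unattractive at scale is exactly the hand-threading: every xfs_buf_read caller would have to define a context struct and push it through the buffer layer to its verifier.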

Since then, the patchset has been reposted for review in December 2019,
December 2020, December 2021, and December 2022.  Nobody has reviewed
it:

https://lore.kernel.org/linux-xfs/?q=report+corruption+to+the+health+trackers

*After* we merge online repair, it should be possible to base our
behavior off of XFS_SICK_INO_BMBTD.  However, this race affects current
kernels, which is why I sent it separately as a bug fix, keyed off of a
second call to _need_iread_extents.

--D

> That would mean that this path simply becomes:
> 
> 	if (ip->i_sick & XFS_SICK_INO_BMBTD) {
> 		xfs_iunlock(ip, lock_mode);
> 		return -EFSCORRUPTED;
> 	}
> 
> Which is now pretty clear that we there's no point continuing
> because we can't read in the extent list, and in doing so we've
> removed the race condition caused by temporarily filling the in-core
> extent list.
> 
> > +
> >  	return lock_mode;
> >  }
> >  
> > @@ -129,6 +143,21 @@ xfs_ilock_attr_map_shared(
> >  	if (xfs_inode_has_attr_fork(ip) && xfs_need_iread_extents(&ip->i_af))
> >  		lock_mode = XFS_ILOCK_EXCL;
> >  	xfs_ilock(ip, lock_mode);
> > +
> > +	/*
> > +	 * It's possible that the unlocked access of the attr fork to determine
> > +	 * the lock mode could have raced with another thread that was failing
> > +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> > +	 * the lock mode and upgrade to an exclusive lock if we need to.
> > +	 */
> > +	if (lock_mode == XFS_ILOCK_SHARED &&
> > +	    xfs_inode_has_attr_fork(ip) &&
> > +	    xfs_need_iread_extents(&ip->i_af)) {
> > +		xfs_iunlock(ip, lock_mode);
> > +		lock_mode = XFS_ILOCK_EXCL;
> > +		xfs_ilock(ip, lock_mode);
> > +	}
> 
> And this can just check for XFS_SICK_INO_BMBTA instead...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong April 11, 2023, 1:05 a.m. UTC | #6
On Thu, Jan 19, 2023 at 10:31:06AM -0800, Christoph Hellwig wrote:
> On Wed, Jan 18, 2023 at 05:24:58PM -0800, Darrick J. Wong wrote:
> >  	xfs_ilock(ip, lock_mode);
> > +
> > +	/*
> > +	 * It's possible that the unlocked access of the data fork to determine
> > +	 * the lock mode could have raced with another thread that was failing
> > +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> > +	 * the lock mode and upgrade to an exclusive lock if we need to.
> > +	 */
> > +	if (lock_mode == XFS_ILOCK_SHARED &&
> > +	    xfs_need_iread_extents(&ip->i_df)) {
> 
> Eww.  I think the proper fix here is to make sure
> xfs_need_iread_extents does not return false until we're actually
> read the extents.  So I think we'll need a new inode flag
> XFS_INEED_READ - gets set when reading inode in btree format,
> and gets cleared at the very end of xfs_iread_extents once we know
> the read succeeded.

So I finally cleared enough off my plate to get back to this, and
reworking the patch this way *looks* promising.  It definitely fixes the
xfs/375 problems, and over the weekend I didn't see any obvious splats.

--D

Patch

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d354ea2b74f9..6ce1e0e9f256 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -117,6 +117,20 @@  xfs_ilock_data_map_shared(
 	if (xfs_need_iread_extents(&ip->i_df))
 		lock_mode = XFS_ILOCK_EXCL;
 	xfs_ilock(ip, lock_mode);
+
+	/*
+	 * It's possible that the unlocked access of the data fork to determine
+	 * the lock mode could have raced with another thread that was failing
+	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
+	 * the lock mode and upgrade to an exclusive lock if we need to.
+	 */
+	if (lock_mode == XFS_ILOCK_SHARED &&
+	    xfs_need_iread_extents(&ip->i_df)) {
+		xfs_iunlock(ip, lock_mode);
+		lock_mode = XFS_ILOCK_EXCL;
+		xfs_ilock(ip, lock_mode);
+	}
+
 	return lock_mode;
 }
 
@@ -129,6 +143,21 @@  xfs_ilock_attr_map_shared(
 	if (xfs_inode_has_attr_fork(ip) && xfs_need_iread_extents(&ip->i_af))
 		lock_mode = XFS_ILOCK_EXCL;
 	xfs_ilock(ip, lock_mode);
+
+	/*
+	 * It's possible that the unlocked access of the attr fork to determine
+	 * the lock mode could have raced with another thread that was failing
+	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
+	 * the lock mode and upgrade to an exclusive lock if we need to.
+	 */
+	if (lock_mode == XFS_ILOCK_SHARED &&
+	    xfs_inode_has_attr_fork(ip) &&
+	    xfs_need_iread_extents(&ip->i_af)) {
+		xfs_iunlock(ip, lock_mode);
+		lock_mode = XFS_ILOCK_EXCL;
+		xfs_ilock(ip, lock_mode);
+	}
+
 	return lock_mode;
 }