diff mbox

xfs: recheck reflink / dirty page status before freeing CoW reservations

Message ID 20180110220336.GU5602@magnolia (mailing list archive)
State Accepted
Headers show

Commit Message

Darrick J. Wong Jan. 10, 2018, 10:03 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

Eryu Guan reported seeing occasional hangs when running generic/269 with
a new fsstress that supports clonerange/deduperange.  The cause of this
hang is an infinite loop when we convert the CoW fork extents from
unwritten to real just prior to writing the pages out; the infinite
loop happens because there's nothing in the CoW fork to convert, and so
it spins forever.

The underlying issue here is that when we go to perform these CoW fork
conversions, we're supposed to have an extent waiting for us, but the
low space CoW reaper has snuck in and blown them away!  There are four
conditions that can dissuade the reaper from touching our file -- no
reflink iflag; dirty page cache; writeback in progress; or directio in
progress.  We check the four conditions prior to taking the locks, but
we neglect to recheck them once we have the locks, which is how we end
up whacking the writeback that's in progress.

Therefore, refactor the four checks into a helper function and call it
once again once we have the locks to make sure we really want to reap
the inode.  While we're at it, add an ASSERT for this weird condition so
that we'll fail noisily if we ever screw this up again.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |    7 +++++
 fs/xfs/xfs_icache.c      |   61 +++++++++++++++++++++++++++++-----------------
 2 files changed, 46 insertions(+), 22 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eryu Guan Jan. 11, 2018, 7:54 a.m. UTC | #1
On Wed, Jan 10, 2018 at 02:03:36PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Eryu Guan reported seeing occasional hangs when running generic/269 with
> a new fsstress that supports clonerange/deduperange.  The cause of this
> hang is an infinite loop when we convert the CoW fork extents from
> unwritten to real just prior to writing the pages out; the infinite
> loop happens because there's nothing in the CoW fork to convert, and so
> it spins forever.
> 
> The underlying issue here is that when we go to perform these CoW fork
> conversions, we're supposed to have an extent waiting for us, but the
> low space CoW reaper has snuck in and blown them away!  There are four
> conditions that can dissuade the reaper from touching our file -- no
> reflink iflag; dirty page cache; writeback in progress; or directio in
> progress.  We check the four conditions prior to taking the locks, but
> we neglect to recheck them once we have the locks, which is how we end
> up whacking the writeback that's in progress.
> 
> Therefore, refactor the four checks into a helper function and call it
> once again once we have the locks to make sure we really want to reap
> the inode.  While we're at it, add an ASSERT for this weird condition so
> that we'll fail noisily if we ever screw this up again.
> 
> Reported-by: Eryu Guan <eguan@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

I applied this patch on top of v4.15-rc5 kernel, and ran generic/083
generic/269 and generic/270 (where I hit the soft lockup and hang before)
multiple times and tests all passed. I also ran all tests in 'enospc'
group on 1k/2k/4k XFS with reflink enabled, tests passed too. So

Tested-by: Eryu Guan <eguan@redhat.com>

Thanks!

Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Jan. 11, 2018, 12:04 p.m. UTC | #2
On Wed, Jan 10, 2018 at 02:03:36PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Eryu Guan reported seeing occasional hangs when running generic/269 with
> a new fsstress that supports clonerange/deduperange.  The cause of this
> hang is an infinite loop when we convert the CoW fork extents from
> unwritten to real just prior to writing the pages out; the infinite
> loop happens because there's nothing in the CoW fork to convert, and so
> it spins forever.
> 
> The underlying issue here is that when we go to perform these CoW fork
> conversions, we're supposed to have an extent waiting for us, but the
> low space CoW reaper has snuck in and blown them away!  There are four
> conditions that can dissuade the reaper from touching our file -- no
> reflink iflag; dirty page cache; writeback in progress; or directio in
> progress.  We check the four conditions prior to taking the locks, but
> we neglect to recheck them once we have the locks, which is how we end
> up whacking the writeback that's in progress.
> 
> Therefore, refactor the four checks into a helper function and call it
> once again once we have the locks to make sure we really want to reap
> the inode.  While we're at it, add an ASSERT for this weird condition so
> that we'll fail noisily if we ever screw this up again.
> 
> Reported-by: Eryu Guan <eguan@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |    7 +++++
>  fs/xfs/xfs_icache.c      |   61 +++++++++++++++++++++++++++++-----------------
>  2 files changed, 46 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index a01cef4..7bd933f 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4311,6 +4311,13 @@ xfs_bmapi_write(
>  	while (bno < end && n < *nmap) {
>  		bool			need_alloc = false, wasdelay = false;
>  
> +		/*
> +		 * CoW fork conversions should /never/ hit EOF.  There should
> +		 * always be something for us to work on.
> +		 */
> +		ASSERT(!eof || !(flags & XFS_BMAPI_CONVERT) ||
> +			       !(flags & XFS_BMAPI_COWFORK));
> +

The hunk just below asserts for BMAPI_COWFORK in a case that explicitly
considers eof. That makes the logic confusing to follow IMO, but I'm
more wondering whether pushing something like ASSERT(!((flags & CONVERT)
&& (flags & COWFORK))) down into that hunk is effectively the same
thing..?  I.e., is it also true that we should not find a hole in the
(CONVERT & COW) case?

>  		/* in hole or beyoned EOF? */
>  		if (eof || bma.got.br_startoff > bno) {
>  			if (flags & XFS_BMAPI_DELALLOC) {
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 1f84562..3fbcc03 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1654,6 +1654,35 @@ xfs_inode_clear_eofblocks_tag(
>  			trace_xfs_perag_clear_eofblocks, XFS_ICI_EOFBLOCKS_TAG);
>  }
>  
> +/* Is this a good time to reap the CoW reservations for this file? */
> +static bool
> +xfs_can_free_cowblocks(
> +	struct xfs_inode	*ip,
> +	struct xfs_ifork	*ifp)
> +{
> +	/*
> +	 * Just clear the tag if we have an empty cow fork or none at all. It's
> +	 * possible the inode was fully unshared since it was originally tagged.
> +	 */
> +	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
> +		trace_xfs_inode_free_cowblocks_invalid(ip);
> +		xfs_inode_clear_cowblocks_tag(ip);
> +		return false;

I think the flag update and tracepoint should probably remain in the
caller. They're somewhat misplaced for a "xfs_can_do_something()"
helper, particularly if it's ever exported and used in other contexts in
the future. Otherwise seems fine.

Brian

> +	}
> +
> +	/*
> +	 * If the mapping is dirty or under writeback we cannot touch the
> +	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> +	 */
> +	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> +	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> +	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> +	    atomic_read(&VFS_I(ip)->i_dio_count))
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * Automatic CoW Reservation Freeing
>   *
> @@ -1672,29 +1701,12 @@ xfs_inode_free_cowblocks(
>  	int			flags,
>  	void			*args)
>  {
> -	int ret;
> -	struct xfs_eofblocks *eofb = args;
> -	int match;
> +	struct xfs_eofblocks	*eofb = args;
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> +	int			match;
> +	int			ret = 0;
>  
> -	/*
> -	 * Just clear the tag if we have an empty cow fork or none at all. It's
> -	 * possible the inode was fully unshared since it was originally tagged.
> -	 */
> -	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
> -		trace_xfs_inode_free_cowblocks_invalid(ip);
> -		xfs_inode_clear_cowblocks_tag(ip);
> -		return 0;
> -	}
> -
> -	/*
> -	 * If the mapping is dirty or under writeback we cannot touch the
> -	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> -	 */
> -	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> -	    atomic_read(&VFS_I(ip)->i_dio_count))
> +	if (!xfs_can_free_cowblocks(ip, ifp))
>  		return 0;
>  
>  	if (eofb) {
> @@ -1715,7 +1727,12 @@ xfs_inode_free_cowblocks(
>  	xfs_ilock(ip, XFS_IOLOCK_EXCL);
>  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  
> -	ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> +	/*
> +	 * Check again, nobody else should be able to dirty blocks or change
> +	 * the reflink iflag now that we have the first two locks held.
> +	 */
> +	if (xfs_can_free_cowblocks(ip, ifp))
> +		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
>  
>  	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>  	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Jan. 11, 2018, 5:40 p.m. UTC | #3
On Thu, Jan 11, 2018 at 07:04:10AM -0500, Brian Foster wrote:
> On Wed, Jan 10, 2018 at 02:03:36PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Eryu Guan reported seeing occasional hangs when running generic/269 with
> > a new fsstress that supports clonerange/deduperange.  The cause of this
> > hang is an infinite loop when we convert the CoW fork extents from
> > unwritten to real just prior to writing the pages out; the infinite
> > loop happens because there's nothing in the CoW fork to convert, and so
> > it spins forever.
> > 
> > The underlying issue here is that when we go to perform these CoW fork
> > conversions, we're supposed to have an extent waiting for us, but the
> > low space CoW reaper has snuck in and blown them away!  There are four
> > conditions that can dissuade the reaper from touching our file -- no
> > reflink iflag; dirty page cache; writeback in progress; or directio in
> > progress.  We check the four conditions prior to taking the locks, but
> > we neglect to recheck them once we have the locks, which is how we end
> > up whacking the writeback that's in progress.
> > 
> > Therefore, refactor the four checks into a helper function and call it
> > once again once we have the locks to make sure we really want to reap
> > the inode.  While we're at it, add an ASSERT for this weird condition so
> > that we'll fail noisily if we ever screw this up again.
> > 
> > Reported-by: Eryu Guan <eguan@redhat.com>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c |    7 +++++
> >  fs/xfs/xfs_icache.c      |   61 +++++++++++++++++++++++++++++-----------------
> >  2 files changed, 46 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index a01cef4..7bd933f 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -4311,6 +4311,13 @@ xfs_bmapi_write(
> >  	while (bno < end && n < *nmap) {
> >  		bool			need_alloc = false, wasdelay = false;
> >  
> > +		/*
> > +		 * CoW fork conversions should /never/ hit EOF.  There should
> > +		 * always be something for us to work on.
> > +		 */
> > +		ASSERT(!eof || !(flags & XFS_BMAPI_CONVERT) ||
> > +			       !(flags & XFS_BMAPI_COWFORK));
> > +
> 
> The hunk just below asserts for BMAPI_COWFORK in a case that explicitly
> considers eof. That makes the logic confusing to follow IMO, but I'm
> more wondering whether pushing something like ASSERT(!((flags & CONVERT)
> && (flags & COWFORK))) down into that hunk is effectively the same
> thing..?  I.e., is it also true that we should not find a hole in the
> (CONVERT & COW) case?

Yes.

> >  		/* in hole or beyoned EOF? */
> >  		if (eof || bma.got.br_startoff > bno) {
> >  			if (flags & XFS_BMAPI_DELALLOC) {
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 1f84562..3fbcc03 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1654,6 +1654,35 @@ xfs_inode_clear_eofblocks_tag(
> >  			trace_xfs_perag_clear_eofblocks, XFS_ICI_EOFBLOCKS_TAG);
> >  }
> >  
> > +/* Is this a good time to reap the CoW reservations for this file? */
> > +static bool
> > +xfs_can_free_cowblocks(
> > +	struct xfs_inode	*ip,
> > +	struct xfs_ifork	*ifp)
> > +{
> > +	/*
> > +	 * Just clear the tag if we have an empty cow fork or none at all. It's
> > +	 * possible the inode was fully unshared since it was originally tagged.
> > +	 */
> > +	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
> > +		trace_xfs_inode_free_cowblocks_invalid(ip);
> > +		xfs_inode_clear_cowblocks_tag(ip);
> > +		return false;
> 
> I think the flag update and tracepoint should probably remain in the
> caller. They're somewhat misplaced for a "xfs_can_do_something()"
> helper, particularly if it's ever exported and used in other contexts in
> the future. Otherwise seems fine.

Hmm.  There's a subtlety to step around here, which is that this
predicate can return false to mean "nothing here to see" or to mean
"cannot clear anything at this time".  We want the trace+clear for the
first case, but not the second.

I suppose this function could return the regular error code int and the
caller can figure out what that means to it, but then all the post-check
stuff ends up duplicated in the callers... so maybe I should just rename
it xfs_prep_free_cowblocks().

And change the comment to:

/*
 * Set ourselves up to free CoW blocks from this file.  If it's already
 * clean then we can bail out quickly, but otherwise we must back off if
 * the file is undergoing some kind of write.
 */

--D

> Brian
> 
> > +	}
> > +
> > +	/*
> > +	 * If the mapping is dirty or under writeback we cannot touch the
> > +	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> > +	 */
> > +	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> > +	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> > +	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> > +	    atomic_read(&VFS_I(ip)->i_dio_count))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  /*
> >   * Automatic CoW Reservation Freeing
> >   *
> > @@ -1672,29 +1701,12 @@ xfs_inode_free_cowblocks(
> >  	int			flags,
> >  	void			*args)
> >  {
> > -	int ret;
> > -	struct xfs_eofblocks *eofb = args;
> > -	int match;
> > +	struct xfs_eofblocks	*eofb = args;
> >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > +	int			match;
> > +	int			ret = 0;
> >  
> > -	/*
> > -	 * Just clear the tag if we have an empty cow fork or none at all. It's
> > -	 * possible the inode was fully unshared since it was originally tagged.
> > -	 */
> > -	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
> > -		trace_xfs_inode_free_cowblocks_invalid(ip);
> > -		xfs_inode_clear_cowblocks_tag(ip);
> > -		return 0;
> > -	}
> > -
> > -	/*
> > -	 * If the mapping is dirty or under writeback we cannot touch the
> > -	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> > -	 */
> > -	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> > -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> > -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> > -	    atomic_read(&VFS_I(ip)->i_dio_count))
> > +	if (!xfs_can_free_cowblocks(ip, ifp))
> >  		return 0;
> >  
> >  	if (eofb) {
> > @@ -1715,7 +1727,12 @@ xfs_inode_free_cowblocks(
> >  	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> >  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> >  
> > -	ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> > +	/*
> > +	 * Check again, nobody else should be able to dirty blocks or change
> > +	 * the reflink iflag now that we have the first two locks held.
> > +	 */
> > +	if (xfs_can_free_cowblocks(ip, ifp))
> > +		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> >  
> >  	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> >  	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Jan. 11, 2018, 7:38 p.m. UTC | #4
On Thu, Jan 11, 2018 at 09:40:27AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 11, 2018 at 07:04:10AM -0500, Brian Foster wrote:
> > On Wed, Jan 10, 2018 at 02:03:36PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Eryu Guan reported seeing occasional hangs when running generic/269 with
> > > a new fsstress that supports clonerange/deduperange.  The cause of this
> > > hang is an infinite loop when we convert the CoW fork extents from
> > > unwritten to real just prior to writing the pages out; the infinite
> > > loop happens because there's nothing in the CoW fork to convert, and so
> > > it spins forever.
> > > 
> > > The underlying issue here is that when we go to perform these CoW fork
> > > conversions, we're supposed to have an extent waiting for us, but the
> > > low space CoW reaper has snuck in and blown them away!  There are four
> > > conditions that can dissuade the reaper from touching our file -- no
> > > reflink iflag; dirty page cache; writeback in progress; or directio in
> > > progress.  We check the four conditions prior to taking the locks, but
> > > we neglect to recheck them once we have the locks, which is how we end
> > > up whacking the writeback that's in progress.
> > > 
> > > Therefore, refactor the four checks into a helper function and call it
> > > once again once we have the locks to make sure we really want to reap
> > > the inode.  While we're at it, add an ASSERT for this weird condition so
> > > that we'll fail noisily if we ever screw this up again.
> > > 
> > > Reported-by: Eryu Guan <eguan@redhat.com>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c |    7 +++++
> > >  fs/xfs/xfs_icache.c      |   61 +++++++++++++++++++++++++++++-----------------
> > >  2 files changed, 46 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index a01cef4..7bd933f 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -4311,6 +4311,13 @@ xfs_bmapi_write(
> > >  	while (bno < end && n < *nmap) {
> > >  		bool			need_alloc = false, wasdelay = false;
> > >  
> > > +		/*
> > > +		 * CoW fork conversions should /never/ hit EOF.  There should
> > > +		 * always be something for us to work on.
> > > +		 */
> > > +		ASSERT(!eof || !(flags & XFS_BMAPI_CONVERT) ||
> > > +			       !(flags & XFS_BMAPI_COWFORK));
> > > +
> > 
> > The hunk just below asserts for BMAPI_COWFORK in a case that explicitly
> > considers eof. That makes the logic confusing to follow IMO, but I'm
> > more wondering whether pushing something like ASSERT(!((flags & CONVERT)
> > && (flags & COWFORK))) down into that hunk is effectively the same
> > thing..?  I.e., is it also true that we should not find a hole in the
> > (CONVERT & COW) case?
> 
> Yes.
> 
> > >  		/* in hole or beyoned EOF? */
> > >  		if (eof || bma.got.br_startoff > bno) {
> > >  			if (flags & XFS_BMAPI_DELALLOC) {
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 1f84562..3fbcc03 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -1654,6 +1654,35 @@ xfs_inode_clear_eofblocks_tag(
> > >  			trace_xfs_perag_clear_eofblocks, XFS_ICI_EOFBLOCKS_TAG);
> > >  }
> > >  
> > > +/* Is this a good time to reap the CoW reservations for this file? */
> > > +static bool
> > > +xfs_can_free_cowblocks(
> > > +	struct xfs_inode	*ip,
> > > +	struct xfs_ifork	*ifp)
> > > +{
> > > +	/*
> > > +	 * Just clear the tag if we have an empty cow fork or none at all. It's
> > > +	 * possible the inode was fully unshared since it was originally tagged.
> > > +	 */
> > > +	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
> > > +		trace_xfs_inode_free_cowblocks_invalid(ip);
> > > +		xfs_inode_clear_cowblocks_tag(ip);
> > > +		return false;
> > 
> > I think the flag update and tracepoint should probably remain in the
> > caller. They're somewhat misplaced for a "xfs_can_do_something()"
> > helper, particularly if it's ever exported and used in other contexts in
> > the future. Otherwise seems fine.
> 
> Hmm.  There's a subtlety to step around here, which is that this
> predicate can return false to mean "nothing here to see" or to mean
> "cannot clear anything at this time".  We want the trace+clear for the
> first case, but not the second.
> 
> I suppose this function could return the regular error code int and the
> caller can figure out what that means to it, but then all the post-check
> stuff ends up duplicated in the callers... so maybe I should just rename
> it xfs_prep_free_cowblocks().
> 

Are both checks necessarily required to be repeated under lock to fix
the bug? IOW, Could the !fork || !flag check remain in the caller to
cover the first case?

> And change the comment to:
> 
> /*
>  * Set ourselves up to free CoW blocks from this file.  If it's already
>  * clean then we can bail out quickly, but otherwise we must back off if
>  * the file is undergoing some kind of write.
>  */
> 

That sounds reasonable too.

Brian

> --D
> 
> > Brian
> > 
> > > +	}
> > > +
> > > +	/*
> > > +	 * If the mapping is dirty or under writeback we cannot touch the
> > > +	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> > > +	 */
> > > +	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> > > +	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> > > +	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> > > +	    atomic_read(&VFS_I(ip)->i_dio_count))
> > > +		return false;
> > > +
> > > +	return true;
> > > +}
> > > +
> > >  /*
> > >   * Automatic CoW Reservation Freeing
> > >   *
> > > @@ -1672,29 +1701,12 @@ xfs_inode_free_cowblocks(
> > >  	int			flags,
> > >  	void			*args)
> > >  {
> > > -	int ret;
> > > -	struct xfs_eofblocks *eofb = args;
> > > -	int match;
> > > +	struct xfs_eofblocks	*eofb = args;
> > >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > > +	int			match;
> > > +	int			ret = 0;
> > >  
> > > -	/*
> > > -	 * Just clear the tag if we have an empty cow fork or none at all. It's
> > > -	 * possible the inode was fully unshared since it was originally tagged.
> > > -	 */
> > > -	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
> > > -		trace_xfs_inode_free_cowblocks_invalid(ip);
> > > -		xfs_inode_clear_cowblocks_tag(ip);
> > > -		return 0;
> > > -	}
> > > -
> > > -	/*
> > > -	 * If the mapping is dirty or under writeback we cannot touch the
> > > -	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> > > -	 */
> > > -	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> > > -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> > > -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> > > -	    atomic_read(&VFS_I(ip)->i_dio_count))
> > > +	if (!xfs_can_free_cowblocks(ip, ifp))
> > >  		return 0;
> > >  
> > >  	if (eofb) {
> > > @@ -1715,7 +1727,12 @@ xfs_inode_free_cowblocks(
> > >  	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> > >  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > >  
> > > -	ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> > > +	/*
> > > +	 * Check again, nobody else should be able to dirty blocks or change
> > > +	 * the reflink iflag now that we have the first two locks held.
> > > +	 */
> > > +	if (xfs_can_free_cowblocks(ip, ifp))
> > > +		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> > >  
> > >  	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > >  	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Jan. 11, 2018, 8:32 p.m. UTC | #5
On Thu, Jan 11, 2018 at 02:38:28PM -0500, Brian Foster wrote:
> On Thu, Jan 11, 2018 at 09:40:27AM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 11, 2018 at 07:04:10AM -0500, Brian Foster wrote:
> > > On Wed, Jan 10, 2018 at 02:03:36PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Eryu Guan reported seeing occasional hangs when running generic/269 with
> > > > a new fsstress that supports clonerange/deduperange.  The cause of this
> > > > hang is an infinite loop when we convert the CoW fork extents from
> > > > unwritten to real just prior to writing the pages out; the infinite
> > > > loop happens because there's nothing in the CoW fork to convert, and so
> > > > it spins forever.
> > > > 
> > > > The underlying issue here is that when we go to perform these CoW fork
> > > > conversions, we're supposed to have an extent waiting for us, but the
> > > > low space CoW reaper has snuck in and blown them away!  There are four
> > > > conditions that can dissuade the reaper from touching our file -- no
> > > > reflink iflag; dirty page cache; writeback in progress; or directio in
> > > > progress.  We check the four conditions prior to taking the locks, but
> > > > we neglect to recheck them once we have the locks, which is how we end
> > > > up whacking the writeback that's in progress.
> > > > 
> > > > Therefore, refactor the four checks into a helper function and call it
> > > > once again once we have the locks to make sure we really want to reap
> > > > the inode.  While we're at it, add an ASSERT for this weird condition so
> > > > that we'll fail noisily if we ever screw this up again.
> > > > 
> > > > Reported-by: Eryu Guan <eguan@redhat.com>
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.c |    7 +++++
> > > >  fs/xfs/xfs_icache.c      |   61 +++++++++++++++++++++++++++++-----------------
> > > >  2 files changed, 46 insertions(+), 22 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > index a01cef4..7bd933f 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > @@ -4311,6 +4311,13 @@ xfs_bmapi_write(
> > > >  	while (bno < end && n < *nmap) {
> > > >  		bool			need_alloc = false, wasdelay = false;
> > > >  
> > > > +		/*
> > > > +		 * CoW fork conversions should /never/ hit EOF.  There should
> > > > +		 * always be something for us to work on.
> > > > +		 */
> > > > +		ASSERT(!eof || !(flags & XFS_BMAPI_CONVERT) ||
> > > > +			       !(flags & XFS_BMAPI_COWFORK));
> > > > +
> > > 
> > > The hunk just below asserts for BMAPI_COWFORK in a case that explicitly
> > > considers eof. That makes the logic confusing to follow IMO, but I'm
> > > more wondering whether pushing something like ASSERT(!((flags & CONVERT)
> > > && (flags & COWFORK))) down into that hunk is effectively the same
> > > thing..?  I.e., is it also true that we should not find a hole in the
> > > (CONVERT & COW) case?
> > 
> > Yes.
> > 
> > > >  		/* in hole or beyoned EOF? */
> > > >  		if (eof || bma.got.br_startoff > bno) {
> > > >  			if (flags & XFS_BMAPI_DELALLOC) {
> > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > index 1f84562..3fbcc03 100644
> > > > --- a/fs/xfs/xfs_icache.c
> > > > +++ b/fs/xfs/xfs_icache.c
> > > > @@ -1654,6 +1654,35 @@ xfs_inode_clear_eofblocks_tag(
> > > >  			trace_xfs_perag_clear_eofblocks, XFS_ICI_EOFBLOCKS_TAG);
> > > >  }
> > > >  
> > > > +/* Is this a good time to reap the CoW reservations for this file? */
> > > > +static bool
> > > > +xfs_can_free_cowblocks(
> > > > +	struct xfs_inode	*ip,
> > > > +	struct xfs_ifork	*ifp)
> > > > +{
> > > > +	/*
> > > > +	 * Just clear the tag if we have an empty cow fork or none at all. It's
> > > > +	 * possible the inode was fully unshared since it was originally tagged.
> > > > +	 */
> > > > +	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
> > > > +		trace_xfs_inode_free_cowblocks_invalid(ip);
> > > > +		xfs_inode_clear_cowblocks_tag(ip);
> > > > +		return false;
> > > 
> > > I think the flag update and tracepoint should probably remain in the
> > > caller. They're somewhat misplaced for a "xfs_can_do_something()"
> > > helper, particularly if it's ever exported and used in other contexts in
> > > the future. Otherwise seems fine.
> > 
> > Hmm.  There's a subtlety to step around here, which is that this
> > predicate can return false to mean "nothing here to see" or to mean
> > "cannot clear anything at this time".  We want the trace+clear for the
> > first case, but not the second.
> > 
> > I suppose this function could return the regular error code int and the
> > caller can figure out what that means to it, but then all the post-check
> > stuff ends up duplicated in the callers... so maybe I should just rename
> > it xfs_prep_free_cowblocks().
> > 
> 
> Are both checks necessarily required to be repeated under lock to fix
> the bug? IOW, Could the !fork || !flag check remain in the caller to
> cover the first case?

They're necessary in both cases; generic/269 (when it wasn't hanging)
would also blow the xfs_is_reflink_inode assert in
xfs_reflink_cancel_cow_range if the flag got cleared before we can grab
the iolock/mmaplock.

> > And change the comment to:
> > 
> > /*
> >  * Set ourselves up to free CoW blocks from this file.  If it's already
> >  * clean then we can bail out quickly, but otherwise we must back off if
> >  * the file is undergoing some kind of write.
> >  */
> > 
> 
> That sounds reasonable too.

<nod>

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * If the mapping is dirty or under writeback we cannot touch the
> > > > +	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> > > > +	 */
> > > > +	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> > > > +	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> > > > +	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> > > > +	    atomic_read(&VFS_I(ip)->i_dio_count))
> > > > +		return false;
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Automatic CoW Reservation Freeing
> > > >   *
> > > > @@ -1672,29 +1701,12 @@ xfs_inode_free_cowblocks(
> > > >  	int			flags,
> > > >  	void			*args)
> > > >  {
> > > > -	int ret;
> > > > -	struct xfs_eofblocks *eofb = args;
> > > > -	int match;
> > > > +	struct xfs_eofblocks	*eofb = args;
> > > >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > > > +	int			match;
> > > > +	int			ret = 0;
> > > >  
> > > > -	/*
> > > > -	 * Just clear the tag if we have an empty cow fork or none at all. It's
> > > > -	 * possible the inode was fully unshared since it was originally tagged.
> > > > -	 */
> > > > -	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
> > > > -		trace_xfs_inode_free_cowblocks_invalid(ip);
> > > > -		xfs_inode_clear_cowblocks_tag(ip);
> > > > -		return 0;
> > > > -	}
> > > > -
> > > > -	/*
> > > > -	 * If the mapping is dirty or under writeback we cannot touch the
> > > > -	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> > > > -	 */
> > > > -	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> > > > -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> > > > -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> > > > -	    atomic_read(&VFS_I(ip)->i_dio_count))
> > > > +	if (!xfs_can_free_cowblocks(ip, ifp))
> > > >  		return 0;
> > > >  
> > > >  	if (eofb) {
> > > > @@ -1715,7 +1727,12 @@ xfs_inode_free_cowblocks(
> > > >  	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> > > >  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > > >  
> > > > -	ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> > > > +	/*
> > > > +	 * Check again, nobody else should be able to dirty blocks or change
> > > > +	 * the reflink iflag now that we have the first two locks held.
> > > > +	 */
> > > > +	if (xfs_can_free_cowblocks(ip, ifp))
> > > > +		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
> > > >  
> > > >  	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > > >  	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eryu Guan Jan. 12, 2018, 3:32 a.m. UTC | #6
On Thu, Jan 11, 2018 at 03:54:41PM +0800, Eryu Guan wrote:
> On Wed, Jan 10, 2018 at 02:03:36PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Eryu Guan reported seeing occasional hangs when running generic/269 with
> > a new fsstress that supports clonerange/deduperange.  The cause of this
> > hang is an infinite loop when we convert the CoW fork extents from
> > unwritten to real just prior to writing the pages out; the infinite
> > loop happens because there's nothing in the CoW fork to convert, and so
> > it spins forever.
> > 
> > The underlying issue here is that when we go to perform these CoW fork
> > conversions, we're supposed to have an extent waiting for us, but the
> > low space CoW reaper has snuck in and blown them away!  There are four
> > conditions that can dissuade the reaper from touching our file -- no
> > reflink iflag; dirty page cache; writeback in progress; or directio in
> > progress.  We check the four conditions prior to taking the locks, but
> > we neglect to recheck them once we have the locks, which is how we end
> > up whacking the writeback that's in progress.
> > 
> > Therefore, refactor the four checks into a helper function and call it
> > once again once we have the locks to make sure we really want to reap
> > the inode.  While we're at it, add an ASSERT for this weird condition so
> > that we'll fail noisily if we ever screw this up again.
> > 
> > Reported-by: Eryu Guan <eguan@redhat.com>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> I applied this patch on top of v4.15-rc5 kernel, and ran generic/083
> generic/269 and generic/270 (where I hit the soft lockup and hang before)
> multiple times and tests all passed. I also ran all tests in 'enospc'
> group on 1k/2k/4k XFS with reflink enabled, tests passed too. So
> 
> Tested-by: Eryu Guan <eguan@redhat.com>

Sorry, I have to withdraw this tag for now.. I'm seeing soft lockup
again in generic/269 run with the patched kernel. I'll do more testings
to confirm, paste the soft lockup info here for now:

[596580.126008] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [fsstress:30037]
[596580.126008] Modules linked in: xfs dm_delay btrfs xor zstd_compress raid6_pq zstd_decompress xxhash dm_thin_pool dm_persistent_data dm_bio_prison dm_flakey loop ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc joydev i2c_piix4 8139too virtio_balloon pcspkr 8139cp mii virtio_pci virtio_ring virtio serio_raw floppy ata_generic pata_acpi [last unloaded: scsi_debug]
[596580.129050] irq event stamp: 174005460
[596580.129050] hardirqs last  enabled at (174005459): [<000000004aebc6cd>] restore_regs_and_return_to_kernel+0x0/0x2e
[596580.129050] hardirqs last disabled at (174005460): [<0000000084598378>] apic_timer_interrupt+0xa7/0xc0
[596580.132071] softirqs last  enabled at (79644030): [<000000009174d1b7>] __do_softirq+0x392/0x502
[596580.133052] softirqs last disabled at (79644019): [<000000002b9518d7>] irq_exit+0x102/0x110
[596580.133052] CPU: 3 PID: 30037 Comm: fsstress Tainted: G        W  OEL   4.15.0-rc5 #10
[596580.133052] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[596580.133052] RIP: 0010:xfs_bmapi_convert_unwritten+0xb/0x1b0 [xfs]
[596580.133052] RSP: 0000:ffffbb8643d538d0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff11
[596580.133052] RAX: 000ffffffffe0000 RBX: 0000000000000a40 RCX: 0000000000000a40
[596580.136071] RDX: 0000000000000080 RSI: ffffbb8643d53aa8 RDI: ffffbb8643d53980
[596580.136071] RBP: ffffbb8643d53a68 R08: 0000000000000080 R09: 0000000000000000
[596580.137053] R10: 0000000000000000 R11: fed20f5f8482504e R12: ffffbb8643d53980
[596580.137053] R13: ffffbb8643d539c0 R14: 0000000000000080 R15: 0000000000000001
[596580.137053] FS:  00007fca843b1b80(0000) GS:ffff9c29d7400000(0000) knlGS:0000000000000000
[596580.137053] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[596580.137053] CR2: 00007fca843a8000 CR3: 00000001070f5000 CR4: 00000000000006e0
[596580.137053] Call Trace:
[596580.137053]  xfs_bmapi_write+0x301/0xcc0 [xfs]
[596580.140071]  ? sched_clock+0x5/0x10
[596580.140071]  xfs_reflink_convert_cow+0x8c/0xc0 [xfs]
[596580.140071]  ? __test_set_page_writeback+0x18b/0x3c0
[596580.141051]  xfs_submit_ioend+0x18f/0x1f0 [xfs]
[596580.141051]  xfs_do_writepage+0x39d/0x7e0 [xfs]
[596580.141051]  write_cache_pages+0x1d0/0x550
[596580.141051]  ? xfs_vm_readpage+0x130/0x130 [xfs]
[596580.141051]  xfs_vm_writepages+0xb1/0xd0 [xfs]
[596580.141051]  do_writepages+0x48/0xf0
[596580.141051]  ? __filemap_fdatawrite_range+0xb4/0x100
[596580.141051]  ? __filemap_fdatawrite_range+0xc1/0x100
[596580.141051]  __filemap_fdatawrite_range+0xc1/0x100
[596580.144072]  xfs_release+0x11c/0x160 [xfs]
[596580.144072]  __fput+0xe6/0x1f0
[596580.144072]  task_work_run+0x82/0xb0
[596580.145050]  exit_to_usermode_loop+0xa8/0xb0
[596580.145050]  syscall_return_slowpath+0x153/0x160
[596580.145050]  entry_SYSCALL_64_fastpath+0x94/0x96
[596580.145050] RIP: 0033:0x7fca83b87cb1
[596580.145050] RSP: 002b:00007ffdd89b8368 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[596580.145050] RAX: 0000000000000000 RBX: 000000000000031a RCX: 00007fca83b87cb1
[596580.145050] RDX: 0000000001a675f0 RSI: 0000000001a55010 RDI: 0000000000000003
[596580.145050] RBP: 000000000001146a R08: 0000000000000006 R09: 00007fca83b71d00
[596580.148070] R10: 0000000001a55010 R11: 0000000000000246 R12: 0000000000000003
[596580.148070] R13: 0000000000174143 R14: 0000000001aad800 R15: 0000000000000000
[596580.149050] Code: e9 58 f4 ff ff e8 c6 a8 a9 c9 4c 8b 4d 08 41 b8 99 09 00 00 e9 44 f4 ff ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 57 41 56 41 55 <41> 54 49 89 d5 55 53 89 cd 48 89 fb 49 89 f4 48 83 ec 10 48 8b

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eryu Guan Jan. 15, 2018, 6:36 a.m. UTC | #7
On Fri, Jan 12, 2018 at 11:32:31AM +0800, Eryu Guan wrote:
> On Thu, Jan 11, 2018 at 03:54:41PM +0800, Eryu Guan wrote:
> > On Wed, Jan 10, 2018 at 02:03:36PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Eryu Guan reported seeing occasional hangs when running generic/269 with
> > > a new fsstress that supports clonerange/deduperange.  The cause of this
> > > hang is an infinite loop when we convert the CoW fork extents from
> > > unwritten to real just prior to writing the pages out; the infinite
> > > loop happens because there's nothing in the CoW fork to convert, and so
> > > it spins forever.
> > > 
> > > The underlying issue here is that when we go to perform these CoW fork
> > > conversions, we're supposed to have an extent waiting for us, but the
> > > low space CoW reaper has snuck in and blown them away!  There are four
> > > conditions that can dissuade the reaper from touching our file -- no
> > > reflink iflag; dirty page cache; writeback in progress; or directio in
> > > progress.  We check the four conditions prior to taking the locks, but
> > > we neglect to recheck them once we have the locks, which is how we end
> > > up whacking the writeback that's in progress.
> > > 
> > > Therefore, refactor the four checks into a helper function and call it
> > > once again once we have the locks to make sure we really want to reap
> > > the inode.  While we're at it, add an ASSERT for this weird condition so
> > > that we'll fail noisily if we ever screw this up again.
> > > 
> > > Reported-by: Eryu Guan <eguan@redhat.com>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > I applied this patch on top of v4.15-rc5 kernel, and ran generic/083
> > generic/269 and generic/270 (where I hit the soft lockup and hang before)
> > multiple times and tests all passed. I also ran all tests in 'enospc'
> > group on 1k/2k/4k XFS with reflink enabled, tests passed too. So
> > 
> > Tested-by: Eryu Guan <eguan@redhat.com>
> 
> Sorry, I have to withdraw this tag for now.. I'm seeing soft lockup
> again in generic/269 run with the patched kernel. I'll do more testings
> to confirm, paste the soft lockup info here for now:

I ran generic/269 for over 4000 iterations and didn't hit soft lockup, I
suspect that previously I tested on wrong/unpatched xfs module..

But occationally I saw fs inconsistency in generic/269, it's hard to
reproduce (need 100-200 iterations) but I did see it several times. But
it seems like another problem.

_check_xfs_filesystem: filesystem on /dev/sda6 is inconsistent (r)                                                                                                                                                   
*** xfs_repair -n output ***                                                                                                                                                                                         
Phase 1 - find and verify superblock...                                                                                                                                                                              
Phase 2 - using internal log                                                                                                                                                                                         
        - zero log...                                                                                                                                                                                                
        - scan filesystem freespace and inode maps...                                                                                                                                                                
sb_fdblocks 8178, counted 8188                                                                                                                                                                                       
        - found root inode chunk                                                                                                                                                                                     
Phase 3 - for each AG...                                                                                                                                                                                             
        - scan (but don't clear) agi unlinked lists...                                                                                                                                                               
        - process known inodes and perform inode discovery...                                                                                                                                                        
        - agno = 0                                                                                                                                                                                                   
        - agno = 1                                                                                                                                                                                                   
        - agno = 2                                                                                                                                                                                                   
        - agno = 3                                                                                                                                                                                                   
        - process newly discovered inodes...                                                                                                                                                                         
Phase 4 - check for duplicate blocks...                                                                                                                                                                              
        - setting up duplicate extent list...                                                                                                                                                                        
        - check for inodes claiming duplicate blocks...                                                                                                                                                              
        - agno = 1                                                                                                                                                                                                   
        - agno = 3                                                                                                                                                                                                   
        - agno = 2                                                                                                                                                                                                   
        - agno = 0                                                                                                                                                                                                   
No modify flag set, skipping phase 5                                                                                                                                                                                 
Phase 6 - check inode connectivity...                                                                                                                                                                                
        - traversing filesystem ...                                                                                                                                                                                  
        - traversal finished ...                                                                                                                                                                                     
        - moving disconnected inodes to lost+found ...                                                                                                                                                               
Phase 7 - verify link counts...                                                                                                                                                                                      
No modify flag set, skipping filesystem flush and exiting.                                                                                                                                                           
*** end xfs_repair output

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Jan. 15, 2018, 8:08 p.m. UTC | #8
On Mon, Jan 15, 2018 at 02:36:05PM +0800, Eryu Guan wrote:
> On Fri, Jan 12, 2018 at 11:32:31AM +0800, Eryu Guan wrote:
> > On Thu, Jan 11, 2018 at 03:54:41PM +0800, Eryu Guan wrote:
> > > On Wed, Jan 10, 2018 at 02:03:36PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Eryu Guan reported seeing occasional hangs when running generic/269 with
> > > > a new fsstress that supports clonerange/deduperange.  The cause of this
> > > > hang is an infinite loop when we convert the CoW fork extents from
> > > > unwritten to real just prior to writing the pages out; the infinite
> > > > loop happens because there's nothing in the CoW fork to convert, and so
> > > > it spins forever.
> > > > 
> > > > The underlying issue here is that when we go to perform these CoW fork
> > > > conversions, we're supposed to have an extent waiting for us, but the
> > > > low space CoW reaper has snuck in and blown them away!  There are four
> > > > conditions that can dissuade the reaper from touching our file -- no
> > > > reflink iflag; dirty page cache; writeback in progress; or directio in
> > > > progress.  We check the four conditions prior to taking the locks, but
> > > > we neglect to recheck them once we have the locks, which is how we end
> > > > up whacking the writeback that's in progress.
> > > > 
> > > > Therefore, refactor the four checks into a helper function and call it
> > > > once again once we have the locks to make sure we really want to reap
> > > > the inode.  While we're at it, add an ASSERT for this weird condition so
> > > > that we'll fail noisily if we ever screw this up again.
> > > > 
> > > > Reported-by: Eryu Guan <eguan@redhat.com>
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > I applied this patch on top of v4.15-rc5 kernel, and ran generic/083
> > > generic/269 and generic/270 (where I hit the soft lockup and hang before)
> > > multiple times and tests all passed. I also ran all tests in 'enospc'
> > > group on 1k/2k/4k XFS with reflink enabled, tests passed too. So
> > > 
> > > Tested-by: Eryu Guan <eguan@redhat.com>
> > 
> > Sorry, I have to withdraw this tag for now.. I'm seeing soft lockup
> > again in generic/269 run with the patched kernel. I'll do more testings
> > to confirm, paste the soft lockup info here for now:
> 
> I ran generic/269 for over 4000 iterations and didn't hit soft lockup, I
> suspect that previously I tested on wrong/unpatched xfs module..
> 
> But occationally I saw fs inconsistency in generic/269, it's hard to
> reproduce (need 100-200 iterations) but I did see it several times. But
> it seems like another problem.

I've seen that one go by occasionally too; will see if I can figure out
what's going on, though I don't think it's related to this hang fix.

--D

> _check_xfs_filesystem: filesystem on /dev/sda6 is inconsistent (r)                                                                                                                                                   
> *** xfs_repair -n output ***                                                                                                                                                                                         
> Phase 1 - find and verify superblock...                                                                                                                                                                              
> Phase 2 - using internal log                                                                                                                                                                                         
>         - zero log...                                                                                                                                                                                                
>         - scan filesystem freespace and inode maps...                                                                                                                                                                
> sb_fdblocks 8178, counted 8188                                                                                                                                                                                       
>         - found root inode chunk                                                                                                                                                                                     
> Phase 3 - for each AG...                                                                                                                                                                                             
>         - scan (but don't clear) agi unlinked lists...                                                                                                                                                               
>         - process known inodes and perform inode discovery...                                                                                                                                                        
>         - agno = 0                                                                                                                                                                                                   
>         - agno = 1                                                                                                                                                                                                   
>         - agno = 2                                                                                                                                                                                                   
>         - agno = 3                                                                                                                                                                                                   
>         - process newly discovered inodes...                                                                                                                                                                         
> Phase 4 - check for duplicate blocks...                                                                                                                                                                              
>         - setting up duplicate extent list...                                                                                                                                                                        
>         - check for inodes claiming duplicate blocks...                                                                                                                                                              
>         - agno = 1                                                                                                                                                                                                   
>         - agno = 3                                                                                                                                                                                                   
>         - agno = 2                                                                                                                                                                                                   
>         - agno = 0                                                                                                                                                                                                   
> No modify flag set, skipping phase 5                                                                                                                                                                                 
> Phase 6 - check inode connectivity...                                                                                                                                                                                
>         - traversing filesystem ...                                                                                                                                                                                  
>         - traversal finished ...                                                                                                                                                                                     
>         - moving disconnected inodes to lost+found ...                                                                                                                                                               
> Phase 7 - verify link counts...                                                                                                                                                                                      
> No modify flag set, skipping filesystem flush and exiting.                                                                                                                                                           
> *** end xfs_repair output
> 
> Thanks,
> Eryu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index a01cef4..7bd933f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4311,6 +4311,13 @@  xfs_bmapi_write(
 	while (bno < end && n < *nmap) {
 		bool			need_alloc = false, wasdelay = false;
 
+		/*
+		 * CoW fork conversions should /never/ hit EOF.  There should
+		 * always be something for us to work on.
+		 */
+		ASSERT(!eof || !(flags & XFS_BMAPI_CONVERT) ||
+			       !(flags & XFS_BMAPI_COWFORK));
+
 		/* in hole or beyoned EOF? */
 		if (eof || bma.got.br_startoff > bno) {
 			if (flags & XFS_BMAPI_DELALLOC) {
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 1f84562..3fbcc03 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1654,6 +1654,35 @@  xfs_inode_clear_eofblocks_tag(
 			trace_xfs_perag_clear_eofblocks, XFS_ICI_EOFBLOCKS_TAG);
 }
 
+/* Is this a good time to reap the CoW reservations for this file? */
+static bool
+xfs_can_free_cowblocks(
+	struct xfs_inode	*ip,
+	struct xfs_ifork	*ifp)
+{
+	/*
+	 * Just clear the tag if we have an empty cow fork or none at all. It's
+	 * possible the inode was fully unshared since it was originally tagged.
+	 */
+	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
+		trace_xfs_inode_free_cowblocks_invalid(ip);
+		xfs_inode_clear_cowblocks_tag(ip);
+		return false;
+	}
+
+	/*
+	 * If the mapping is dirty or under writeback we cannot touch the
+	 * CoW fork.  Leave it alone if we're in the midst of a directio.
+	 */
+	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
+	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
+	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
+	    atomic_read(&VFS_I(ip)->i_dio_count))
+		return false;
+
+	return true;
+}
+
 /*
  * Automatic CoW Reservation Freeing
  *
@@ -1672,29 +1701,12 @@  xfs_inode_free_cowblocks(
 	int			flags,
 	void			*args)
 {
-	int ret;
-	struct xfs_eofblocks *eofb = args;
-	int match;
+	struct xfs_eofblocks	*eofb = args;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	int			match;
+	int			ret = 0;
 
-	/*
-	 * Just clear the tag if we have an empty cow fork or none at all. It's
-	 * possible the inode was fully unshared since it was originally tagged.
-	 */
-	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
-		trace_xfs_inode_free_cowblocks_invalid(ip);
-		xfs_inode_clear_cowblocks_tag(ip);
-		return 0;
-	}
-
-	/*
-	 * If the mapping is dirty or under writeback we cannot touch the
-	 * CoW fork.  Leave it alone if we're in the midst of a directio.
-	 */
-	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
-	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
-	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
-	    atomic_read(&VFS_I(ip)->i_dio_count))
+	if (!xfs_can_free_cowblocks(ip, ifp))
 		return 0;
 
 	if (eofb) {
@@ -1715,7 +1727,12 @@  xfs_inode_free_cowblocks(
 	xfs_ilock(ip, XFS_IOLOCK_EXCL);
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 
-	ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
+	/*
+	 * Check again, nobody else should be able to dirty blocks or change
+	 * the reflink iflag now that we have the first two locks held.
+	 */
+	if (xfs_can_free_cowblocks(ip, ifp))
+		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
 
 	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 	xfs_iunlock(ip, XFS_IOLOCK_EXCL);