Deadlock between block allocation and block truncation

Message ID	20170412161017.GA16590@infradead.org (mailing list archive)
State	New, archived
Headers	Return-Path: <linux-xfs-owner@kernel.org>
	Date: Wed, 12 Apr 2017 09:10:17 -0700
	From: Christoph Hellwig <hch@infradead.org>
	To: Nikolay Borisov <n.borisov.lkml@gmail.com>
	Cc: linux-xfs@vger.kernel.org
	Subject: Re: Deadlock between block allocation and block truncation
	Message-ID: <20170412161017.GA16590@infradead.org>
	References: <800468eb-3ded-9166-20a4-047de8018582@gmail.com>
	MIME-Version: 1.0
	Content-Type: text/plain; charset=us-ascii
	Content-Disposition: inline
	In-Reply-To: <800468eb-3ded-9166-20a4-047de8018582@gmail.com>
	User-Agent: Mutt/1.8.0 (2017-02-23)
	Sender: linux-xfs-owner@vger.kernel.org
	Precedence: bulk

Christoph Hellwig April 12, 2017, 4:10 p.m. UTC

Hi Nikolay,

I guess the culprit is that truncate can free up to two extents in
the same transaction and thus try to lock two different AGs without
requiring them to be in increasing order.

Does the one liner below fix the problem for you?

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nikolay Borisov April 12, 2017, 4:12 p.m. UTC | #1

On 12.04.2017 19:10, Christoph Hellwig wrote:
> Hi Nikolay,
> 
> I guess the culprit is that truncate can free up to two extents in
> the same transaction and thus try to lock two different AGs without
> requiring them to be in increasing order.
> 
> Does the one liner below fix the problem for you?
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 7605d8396596..29f2cd5afb04 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -58,7 +58,7 @@ kmem_zone_t *xfs_inode_zone;
>   * Used in xfs_itruncate_extents().  This is the maximum number of extents
>   * freed from a file in a single transaction.
>   */
> -#define	XFS_ITRUNC_MAX_EXTENTS	2
> +#define	XFS_ITRUNC_MAX_EXTENTS	1
>  
>  STATIC int xfs_iflush_int(struct xfs_inode *, struct xfs_buf *);
>  STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
> 


I will apply this to 3.12 and 4.4 and run tests, since I can reproduce
fairly reliably on those. You don't expect any fallout on older kernels,
do you?

Christoph Hellwig April 12, 2017, 4:20 p.m. UTC | #2

On Wed, Apr 12, 2017 at 07:12:50PM +0300, Nikolay Borisov wrote:
> I will apply this to 3.12 and 4.4 and run tests, since I can reproduce
> fairly reliably on those. You don't expect any fallout on older kernels,
> do you?

No.

Nikolay Borisov April 12, 2017, 5:44 p.m. UTC | #3

On 12.04.2017 19:10, Christoph Hellwig wrote:
> Hi Nikolay,
> 
> I guess the culprit is that truncate can free up to two extents in
> the same transaction and thus try to lock two different AGs without
> requiring them to be in increasing order.

On the other hand, Darrick suggested that the problem might be in the
allocation path, because it holds a dirty buffer for AGF1 and then
proceeds to lock AGF0, which is a lock order violation. So the bli
holding AGF1 in the allocating task is:

crash> struct xfs_buf_log_item.bli_flags 0xffff8800a60b1570
  bli_flags = 2

That's XFS_BLI_DIRTY. According to Darrick, here is what *should*
happen:

"
djwong: either agf1 is clean and it needs to release that before going
for agf0, or agf1 is dirty and thus it cannot go for agf0
"

In this case agf1 is dirty and the allocation path continues to agf0,
which is a clear lock order violation?

On the truncation side, the bli flags for agf0 are:

crash> struct -x xfs_buf_log_item.bli_flags 0xffff8801394ed2b8
  bli_flags = 0xa => BLI_DIRTY | BLI_LOGGED

And then it is proceeding to lock AGF1 (ascending order) correctly.
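
For readers following the crash output, the two bli_flags values above
can be decoded with a small userspace helper. This is only a sketch:
the bit values are the ones implied by this thread (2 is XFS_BLI_DIRTY,
and 0xa is DIRTY | LOGGED, so LOGGED is 0x8). The names BLI_DIRTY,
BLI_LOGGED, and bli_flags_str are made up for illustration, and any
other bit is simply reported as OTHER rather than guessed at.

```c
#include <stdio.h>
#include <string.h>

/*
 * Bit values as implied by the crash output in this thread.
 * Other flag bits exist in the kernel and are not decoded here.
 */
#define BLI_DIRTY	0x02u
#define BLI_LOGGED	0x08u

/* Append "name" to buf, separated by '|' if buf is not empty. */
static void append_flag(char *buf, size_t len, const char *name)
{
	size_t used = strlen(buf);

	snprintf(buf + used, len - used, "%s%s", used ? "|" : "", name);
}

/* Decode flags into buf (len must be > 0) and return buf. */
static char *bli_flags_str(unsigned int flags, char *buf, size_t len)
{
	buf[0] = '\0';
	if (flags & BLI_DIRTY)
		append_flag(buf, len, "DIRTY");
	if (flags & BLI_LOGGED)
		append_flag(buf, len, "LOGGED");
	if (flags & ~(BLI_DIRTY | BLI_LOGGED))
		append_flag(buf, len, "OTHER");
	if (buf[0] == '\0')
		append_flag(buf, len, "NONE");
	return buf;
}
```

Calling bli_flags_str() with 0x2 (the allocating task's AGF1) and with
0xa (the truncating task's AGF0) reproduces the two decodings given
above.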

In spite of this, your patch is likely to help in this situation, though
I'm not sure it is modifying the right side of the violation.

Regards,
Nikolay

Christoph Hellwig April 12, 2017, 5:57 p.m. UTC | #4

On Wed, Apr 12, 2017 at 08:44:32PM +0300, Nikolay Borisov wrote:
> "
> djwong: either agf1 is clean and it needs to release that before going
> for agf0, or agf1 is dirty and thus it cannot go for agf0
> "

Yes.  Older kernels had some bugs in this area due to busy extent
tracking, where xfs_alloc_ag_vextent would fail despite
xfs_alloc_fix_freelist picking an AG and possibly dirtying the AGF.
My busy extent tracking changes for the asynchronous discard code in
4.11-rc should have fixed that.

But even with that in place I think that locking two AGFs in any
order in the truncate path is wrong.
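
The rule quoted from Darrick above can be written down as a small
decision function. This is a userspace sketch only, not the kernel's
allocator logic: struct held_agf, enum agf_action, and
agf_lock_decision are hypothetical names, and the model assumes a
transaction holds at most one AGF when it asks for another. It does not
capture the stricter position that the truncate path should not lock
two AGFs at all.

```c
#include <stdbool.h>

/* Hypothetical model of the single AGF a transaction already holds. */
struct held_agf {
	bool		held;	/* is an AGF currently locked? */
	unsigned int	agno;	/* AG number of the held AGF */
	bool		dirty;	/* modified in this transaction? */
};

enum agf_action {
	AGF_OK,			/* no ordering problem, go ahead */
	AGF_RELEASE_THEN_LOCK,	/* drop the clean AGF first */
	AGF_REFUSE,		/* dirty AGF held, must not go backwards */
};

/* Decide what to do when asked to lock the AGF of AG "want". */
static enum agf_action
agf_lock_decision(const struct held_agf *h, unsigned int want)
{
	/* Nothing held, same AG, or ascending order: allowed. */
	if (!h->held || want >= h->agno)
		return AGF_OK;

	/* Going to a lower AG: only safe after releasing a clean AGF. */
	if (!h->dirty)
		return AGF_RELEASE_THEN_LOCK;

	/* A dirty AGF cannot be released mid-transaction. */
	return AGF_REFUSE;
}
```

In the reported deadlock the allocating task holds a dirty AGF1 and
wants AGF0, which this model maps to AGF_REFUSE.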

Nikolay Borisov April 13, 2017, 1:52 p.m. UTC | #5

On 12.04.2017 19:10, Christoph Hellwig wrote:
> Hi Nikolay,
> 
> I guess the culprit is that truncate can free up to two extents in
> the same transaction and thus try to lock two different AGs without
> requiring them to be in increasing order.
> 
> Does the one liner below fix the problem for you?

So after 200 runs of generic/299 I didn't hit the deadlock whereas
before it would hit in the first 30 or so. FWIW :

Tested-by: Nikolay Borisov <nborisov@suse.com>

On a different note - do you think that reducing the unmapped extents
from 2 to 1 would introduce any performance degradation during
truncation? Looking around the code, this define is only used when doing
truncation, so perhaps a better thing to do would be to turn this
xfs_bunmapi arg into a boolean which signals whether we are doing
truncation or not. And if it is set to true, have xfs_bunmapi unmap all
possible extents from only a single AG? I'm going to sift through the
git history to figure out where this limit of at most 2 extents per
truncate transaction came from.

Regards,
Nikolay

Christoph Hellwig April 14, 2017, 7:42 a.m. UTC | #6

On Thu, Apr 13, 2017 at 04:52:03PM +0300, Nikolay Borisov wrote:
> On a different note - do you think that reducing the unmapped extents
> from 2 to 1 would introduce any performance degradation during
> truncation?

There will be some.  But now that we have the CIL it will just be
additional in-kernel overhead instead of overhead in the on-disk
log.

> Looking around the code, this define is only used when doing
> truncation, so perhaps a better thing to do would be to turn this
> xfs_bunmapi arg into a boolean which signals whether we are doing
> truncation or not. And if it is set to true, have xfs_bunmapi unmap all
> possible extents from only a single AG? I'm going to sift through the
> git history to figure out where this limit of at most 2 extents per
> truncate transaction came from.

We have the problem with all transactions that could lock multiple AGF
headers, so that's not going to cut it.

I think we could do multiple transactions IFF in the same AG.  I'll
need to check if that's worth it.  And on top of that I have started
entirely reworking what is currently xfs_bunmapi, but that will have
to wait until after a fix for your issue.

Christoph Hellwig April 25, 2017, 7:36 a.m. UTC | #7

FYI, I've been testing with the patch quite a bit and it seems to
be doing fine.

But I fear it's not the complete fix:  if we do truncates, hole
punches or other complicated operations the bmap btree blocks
might be from different AGs than the data blocks and we could
still run into these issues in theory.  So I fear we might have
to come up with a solution where we roll into a new chained transaction
every time we encounter a new AG instead.
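
To make the idea concrete, here is a userspace sketch of how many
chained transactions such a scheme would need for a given sequence of
blocks to free. It only counts AG boundaries and is not a proposal for
the actual implementation: count_transactions is a made-up name, and
the input is assumed to be the AG number of each extent or btree block
in the order it would be freed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the number of transactions needed if the caller rolls into a
 * new chained transaction every time the AG number changes, so that no
 * single transaction ever touches more than one AGF.
 */
static size_t count_transactions(const unsigned int *agnos, size_t n)
{
	size_t ntrans = 0;
	size_t i;

	assert(agnos != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* First item, or a different AG than the previous one. */
		if (i == 0 || agnos[i] != agnos[i - 1])
			ntrans++;
	}
	return ntrans;
}
```

For a sequence of AG numbers 0, 0, 1, 1, 0 this gives three
transactions, compared with five if every extent gets its own
transaction as with XFS_ITRUNC_MAX_EXTENTS set to 1.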
