xfs: don't change to infinite lock to avoid deadlock

Message ID 20200423172325.8595-1-wen.gang.wang@oracle.com (mailing list archive)
State New, archived
Headers show
Series xfs: don't change to infinite lock to avoid deadlock

Commit Message

Wengang Wang April 23, 2020, 5:23 p.m. UTC
xfs_reclaim_inodes_ag() switches to indefinite blocking on
pag_ici_reclaim_lock in the second round of walking all AGs when
SYNC_TRYLOCK is set (conditionally). That can cause a deadlock in a
special situation:

1) Under heavy memory pressure, process A is doing direct memory reclaim
and is waiting for xfs_inode.i_pincount to drop to zero while holding the
mutex pag_ici_reclaim_lock.

2) i_pincount is increased when the xfs_inode is added to a journal
transaction, and it is expected to be decreased when the transaction's IO
completes. Step 1) happens after i_pincount is increased and before the
transaction IO is issued.

3) Now the transaction IO is issued by process B. In the IO path (which
can be more complex than it first appears), memory allocation and direct
memory reclaim happen as well. During the second walk of the AGs, process
B blocks trying to take pag_ici_reclaim_lock, which is now held by
process A.

Thus process A waits for the IO to complete while holding
pag_ici_reclaim_lock, and process B tries to issue that IO but blocks on
pag_ici_reclaim_lock. That forms a deadlock.

The fix: don't switch to an indefinite wait when SYNC_TRYLOCK is set. To
avoid spinning for a long time, walk each AG only once.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 fs/xfs/xfs_icache.c | 15 ---------------
 1 file changed, 15 deletions(-)
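
A minimal sketch of the reported cycle, pieced together from the
description above and the stack traces later in this thread (call chains
simplified; the exact allocation sites vary):

	/* Process A: direct reclaim path */
	mutex_lock(&pag->pag_ici_reclaim_lock);
	xfs_iunpin_wait(ip);	/* waits for i_pincount to reach 0, i.e.
				   for the transaction IO to complete */

	/* Process B: issuing that transaction IO */
	kmalloc(size, GFP_KERNEL);	/* allocation somewhere in the IO path */
	  -> direct reclaim
	    -> xfs_reclaim_inodes_ag()	/* second, blocking pass */
	      -> mutex_lock(&pag->pag_ici_reclaim_lock);	/* held by A */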

Comments

Dave Chinner April 23, 2020, 11:05 p.m. UTC | #1
On Thu, Apr 23, 2020 at 10:23:25AM -0700, Wengang Wang wrote:
> xfs_reclaim_inodes_ag() switches to indefinite blocking on
> pag_ici_reclaim_lock in the second round of walking all AGs when
> SYNC_TRYLOCK is set (conditionally). That can cause a deadlock in a
> special situation:
> 
> 1) Under heavy memory pressure, process A is doing direct memory reclaim
> and is waiting for xfs_inode.i_pincount to drop to zero while holding the
> mutex pag_ici_reclaim_lock.
> 
> 2) i_pincount is increased when the xfs_inode is added to a journal
> transaction, and it is expected to be decreased when the transaction's IO
> completes. Step 1) happens after i_pincount is increased and before the
> transaction IO is issued.
> 
> 3) Now the transaction IO is issued by process B. In the IO path (which
> can be more complex than it first appears), memory allocation and direct
> memory reclaim happen as well.

Sure, but IO path allocations are done under GFP_NOIO context, which
means IO path allocations can't recurse back into filesystem reclaim
via direct reclaim. Hence there should be no way for an IO path
allocation to block on XFS inode reclaim and hence there's no
possible deadlock here...

IOWs, I don't think this is the deadlock you are looking for. Do you
have a lockdep report or some other set of stack traces that lead
you to this point?

Cheers,

Dave.
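
For reference, the allocation rule Dave is describing looks roughly like
this in a block driver's request path; a minimal sketch, with
build_io_request() as a hypothetical helper name:

#include <linux/slab.h>

/* Hypothetical helper in a driver's IO submission path. */
static void *build_io_request(size_t size)
{
	/*
	 * GFP_KERNEL here could enter direct reclaim and recurse back
	 * into filesystem/IO reclaim; GFP_NOIO forbids that recursion,
	 * so the allocation can never wait on IO that this same path
	 * is responsible for completing.
	 */
	return kmalloc(size, GFP_NOIO);
}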
Wengang Wang April 23, 2020, 11:14 p.m. UTC | #2
On 4/23/20 4:05 PM, Dave Chinner wrote:
> On Thu, Apr 23, 2020 at 10:23:25AM -0700, Wengang Wang wrote:
>> xfs_reclaim_inodes_ag() switches to indefinite blocking on
>> pag_ici_reclaim_lock in the second round of walking all AGs when
>> SYNC_TRYLOCK is set (conditionally). That can cause a deadlock in a
>> special situation:
>>
>> 1) Under heavy memory pressure, process A is doing direct memory reclaim
>> and is waiting for xfs_inode.i_pincount to drop to zero while holding the
>> mutex pag_ici_reclaim_lock.
>>
>> 2) i_pincount is increased when the xfs_inode is added to a journal
>> transaction, and it is expected to be decreased when the transaction's IO
>> completes. Step 1) happens after i_pincount is increased and before the
>> transaction IO is issued.
>>
>> 3) Now the transaction IO is issued by process B. In the IO path (which
>> can be more complex than it first appears), memory allocation and direct
>> memory reclaim happen as well.
> Sure, but IO path allocations are done under GFP_NOIO context, which
> means IO path allocations can't recurse back into filesystem reclaim
> via direct reclaim. Hence there should be no way for an IO path
> allocation to block on XFS inode reclaim and hence there's no
> possible deadlock here...
>
> IOWs, I don't think this is the deadlock you are looking for. Do you
> have a lockdep report or some other set of stack traces that lead
> you to this point?

As I mentioned, the IO path can be more complex than you think.

The real case I hit is that process A is waiting for inode unpin on
XFS A, which is a loop-device-backed mount.

And the backing file is on a different (X)FS B mount. So the IO goes
through the loop device, doing (direct) writes to (X)FS B.

The (direct) writes to (X)FS B do memory allocations and can then enter
direct memory reclaim...

thanks,
wengang
Wengang Wang April 23, 2020, 11:19 p.m. UTC | #3
On 4/23/20 4:14 PM, Wengang Wang wrote:
>
> On 4/23/20 4:05 PM, Dave Chinner wrote:
>> On Thu, Apr 23, 2020 at 10:23:25AM -0700, Wengang Wang wrote:
>>> xfs_reclaim_inodes_ag() switches to indefinite blocking on
>>> pag_ici_reclaim_lock in the second round of walking all AGs when
>>> SYNC_TRYLOCK is set (conditionally). That can cause a deadlock in a
>>> special situation:
>>>
>>> 1) Under heavy memory pressure, process A is doing direct memory
>>> reclaim and is waiting for xfs_inode.i_pincount to drop to zero while
>>> holding the mutex pag_ici_reclaim_lock.
>>>
>>> 2) i_pincount is increased when the xfs_inode is added to a journal
>>> transaction, and it is expected to be decreased when the transaction's
>>> IO completes. Step 1) happens after i_pincount is increased and before
>>> the transaction IO is issued.
>>>
>>> 3) Now the transaction IO is issued by process B. In the IO path (which
>>> can be more complex than it first appears), memory allocation and
>>> direct memory reclaim happen as well.
>> Sure, but IO path allocations are done under GFP_NOIO context, which
>> means IO path allocations can't recurse back into filesystem reclaim
>> via direct reclaim. Hence there should be no way for an IO path
>> allocation to block on XFS inode reclaim and hence there's no
>> possible deadlock here...
>>
>> IOWs, I don't think this is the deadlock you are looking for. Do you
>> have a lockdep report or some other set of stack traces that lead
>> you to this point?
>
> As I mentioned, the IO path can be more complex than you think.
>
> The real case I hit is that process A is waiting for inode unpin
> on XFS A, which is a loop-device-backed mount.

And actually, there is a dm-thin on top of the loop device...

thanks,
wengang

>
> And the backing file is on a different (X)FS B mount. So the IO goes
> through the loop device, doing (direct) writes to (X)FS B.
>
> The (direct) writes to (X)FS B do memory allocations and can then
> enter direct memory reclaim...
>
> thanks,
> wengang
>
Dave Chinner April 24, 2020, 1:39 a.m. UTC | #4
On Thu, Apr 23, 2020 at 04:19:52PM -0700, Wengang Wang wrote:
> 
> On 4/23/20 4:14 PM, Wengang Wang wrote:
> > 
> > On 4/23/20 4:05 PM, Dave Chinner wrote:
> > > On Thu, Apr 23, 2020 at 10:23:25AM -0700, Wengang Wang wrote:
> > > > xfs_reclaim_inodes_ag() switches to indefinite blocking on
> > > > pag_ici_reclaim_lock in the second round of walking all AGs when
> > > > SYNC_TRYLOCK is set (conditionally). That can cause a deadlock in
> > > > a special situation:
> > > > 
> > > > 1) Under heavy memory pressure, process A is doing direct memory
> > > > reclaim and is waiting for xfs_inode.i_pincount to drop to zero
> > > > while holding the mutex pag_ici_reclaim_lock.
> > > > 
> > > > 2) i_pincount is increased when the xfs_inode is added to a
> > > > journal transaction, and it is expected to be decreased when the
> > > > transaction's IO completes. Step 1) happens after i_pincount is
> > > > increased and before the transaction IO is issued.
> > > > 
> > > > 3) Now the transaction IO is issued by process B. In the IO path
> > > > (which can be more complex than it first appears), memory
> > > > allocation and direct memory reclaim happen as well.
> > > Sure, but IO path allocations are done under GFP_NOIO context, which
> > > means IO path allocations can't recurse back into filesystem reclaim
> > > via direct reclaim. Hence there should be no way for an IO path
> > > allocation to block on XFS inode reclaim and hence there's no
> > > possible deadlock here...
> > > 
> > > IOWs, I don't think this is the deadlock you are looking for. Do you
> > > have a lockdep report or some other set of stack traces that lead
> > > you to this point?
> > 
> > As I mentioned, the IO path can be more complex than you think.

I don't think the IO path is complex. I *know* the IO path is
complex. I also know how we manage that complexity to prevent things
like stacked devices and filesystems from deadlocking, but I also
know that there are bugs and other architectural deficiencies that
may be playing a part here.

The problem is, your description of the problem tells me nothing
about where the problem might lie. You are telling me what you think
the problem is, rather than explaining the way the problem comes
about, what storage stack configuration and IO behaviour triggers
it, etc. Hence I cannot determine what the problem you are seeing
actually is, and hence I cannot evaluate whether your patch is
correct.

> > The real case I hit is that process A is waiting for inode unpin on
> > XFS A, which is a loop-device-backed mount.
> 
> And actually, there is a dm-thin on top of the loop device...

Makes no difference, really, because it's still the loop device
that is doing the IO to the underlying filesystem...

> > And the backing file is on a different (X)FS B mount. So the IO goes
> > through the loop device, doing (direct) writes to (X)FS B.
> > 
> > The (direct) writes to (X)FS B do memory allocations and can then
> > enter direct memory reclaim...

The loop device issues IO to the lower filesystem in
memalloc_noio_save() context, which means all memory allocations in
its IO path are done with GFP_NOIO context. Hence those allocations
will not recurse into reclaim on -any filesystem- and hence will not
deadlock on filesystem reclaim. So what I said originally is correct
even when we take filesystems stacked via loop devices into account.

Hence I'll ask again: do you have stack traces of the deadlock or a
lockdep report? If not, can you please describe the storage setup
from top to bottom and lay out exactly where in what layers trigger
this deadlock?

Cheers,

Dave.
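
The context Dave mentions here is the memalloc_noio_save()/
memalloc_noio_restore() pair from linux/sched/mm.h, which marks a whole
region of code as GFP_NOIO; a minimal sketch of the pattern, with
submit_backing_io() and struct backing_dev as hypothetical names:

#include <linux/sched/mm.h>	/* memalloc_noio_save(), memalloc_noio_restore() */

/*
 * Everything allocated between save and restore is implicitly GFP_NOIO,
 * so direct reclaim triggered by those allocations cannot recurse back
 * into filesystem or IO reclaim.
 */
static void submit_backing_io(struct backing_dev *dev)	/* hypothetical */
{
	unsigned int noio_flags;

	noio_flags = memalloc_noio_save();
	/* ... allocate request state, build and submit the IO ... */
	memalloc_noio_restore(noio_flags);
}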
Wengang Wang April 24, 2020, 4:58 p.m. UTC | #5
On 4/23/20 6:39 PM, Dave Chinner wrote:
> On Thu, Apr 23, 2020 at 04:19:52PM -0700, Wengang Wang wrote:
>> On 4/23/20 4:14 PM, Wengang Wang wrote:
>>> On 4/23/20 4:05 PM, Dave Chinner wrote:
>>>> On Thu, Apr 23, 2020 at 10:23:25AM -0700, Wengang Wang wrote:
>>>>> xfs_reclaim_inodes_ag() switches to indefinite blocking on
>>>>> pag_ici_reclaim_lock in the second round of walking all AGs when
>>>>> SYNC_TRYLOCK is set (conditionally). That can cause a deadlock in a
>>>>> special situation:
>>>>>
>>>>> 1) Under heavy memory pressure, process A is doing direct memory
>>>>> reclaim and is waiting for xfs_inode.i_pincount to drop to zero
>>>>> while holding the mutex pag_ici_reclaim_lock.
>>>>>
>>>>> 2) i_pincount is increased when the xfs_inode is added to a journal
>>>>> transaction, and it is expected to be decreased when the
>>>>> transaction's IO completes. Step 1) happens after i_pincount is
>>>>> increased and before the transaction IO is issued.
>>>>>
>>>>> 3) Now the transaction IO is issued by process B. In the IO path
>>>>> (which can be more complex than it first appears), memory allocation
>>>>> and direct memory reclaim happen as well.
>>>> Sure, but IO path allocations are done under GFP_NOIO context, which
>>>> means IO path allocations can't recurse back into filesystem reclaim
>>>> via direct reclaim. Hence there should be no way for an IO path
>>>> allocation to block on XFS inode reclaim and hence there's no
>>>> possible deadlock here...
>>>>
>>>> IOWs, I don't think this is the deadlock you are looking for. Do you
>>>> have a lockdep report or some other set of stack traces that lead
>>>> you to this point?
>>> As I mentioned, the IO path can be more complex than you think.
> I don't think the IO path is complex. I *know* the IO path is
> complex. I also know how we manage that complexity to prevent things
> like stacked devices and filesystems from deadlocking, but I also
> know that there are bugs and other architectural deficiencies that
> may be playing a part here.
Right.
>
> The problem is, your description of the problem tells me nothing
> about where the problem might lie. You are telling me what you think
> the problem is, rather than explaining the way the problem comes
> about, what storage stack configuration and IO behaviour triggers
> it, etc. Hence I cannot determine what the problem you are seeing
> actually is, and hence I cannot evaluate whether your patch is
> correct.

I think my description is a generic one, stating the problem in XFS.
The problem is there as long as direct memory reclaim can happen in the
IO path.

More details about the IO may help in understanding the problem, but
that's not the main point of this patch.

I will provide the details below.

>
>>> The real case I hit is that process A is waiting for inode unpin on
>>> XFS A, which is a loop-device-backed mount.
>> And actually, there is a dm-thin on top of the loop device...
> Makes no difference, really, because it's still the loop device
> that is doing the IO to the underlying filesystem...
I mentioned the IO path here, not the IO itself. In this case, the IO
path includes dm-thin.

We have to consider it as long as we are not sure whether any GFP_KERNEL
allocation (or any flags without NOFS/NOIO) happens in dm-thin.

If dm-thin does a GFP_KERNEL allocation and goes into direct memory
reclaim, the deadlock forms.

>>> And the backing file is on a different (X)FS B mount. So the IO goes
>>> through the loop device, doing (direct) writes to (X)FS B.
>>>
>>> The (direct) writes to (X)FS B do memory allocations and can then
>>> enter direct memory reclaim...
> The loop device issues IO to the lower filesystem in
> memalloc_noio_save() context, which means all memory allocations in
> its IO path are done with GFP_NOIO context. Hence those allocations
> will not recurse into reclaim on -any filesystem- and hence will not
> deadlock on filesystem reclaim. So what I said originally is correct
> even when we take filesystems stacked via loop devices into account.
You are right here. It seems the loop device is doing NOFS|NOIO allocations.

The deadlock happened on a somewhat older kernel, one without the loop
device patch that does the NOFS|NOIO allocation.

Well, here you are only talking about the loop device; that's not enough
to show it's also safe when the memory reclaim happens at a higher layer,
above the loop device, in the IO path.

> Hence I'll ask again: do you have stack traces of the deadlock or a
> lockdep report? If not, can you please describe the storage setup
> from top to bottom and lay out exactly where in what layers trigger
> this deadlock?

Sharing the stack traces:

Process 61234 is holding the mutex pag_ici_reclaim_lock:

PID: 61234  TASK: ffff89e7bc6c1ec0  CPU: 39  COMMAND: "java"
  #0 [ffff9be677cdb470] __schedule at ffffffff8f866a9c
  #1 [ffff9be677cdb508] schedule at ffffffff8f8670b6
  #2 [ffff9be677cdb520] io_schedule at ffffffff8f0c73c6
  #3 [ffff9be677cdb538] __dta___xfs_iunpin_wait_3443 at ffffffffc0565087 [xfs]
  #4 [ffff9be677cdb5b0] xfs_iunpin_wait at ffffffffc0567ca9 [xfs]
  #5 [ffff9be677cdb5c0] __dta_xfs_reclaim_inode_3357 at ffffffffc055b4cc [xfs]
  #6 [ffff9be677cdb610] xfs_reclaim_inodes_ag at ffffffffc055b916 [xfs]
  #7 [ffff9be677cdb7a0] xfs_reclaim_inodes_nr at ffffffffc055cb73 [xfs]
  #8 [ffff9be677cdb7c0] xfs_fs_free_cached_objects at ffffffffc056fe79 [xfs]
  #9 [ffff9be677cdb7d0] super_cache_scan at ffffffff8f287927
#10 [ffff9be677cdb828] shrink_slab at ffffffff8f1efa53
#11 [ffff9be677cdb910] shrink_node at ffffffff8f1f5628
#12 [ffff9be677cdb998] do_try_to_free_pages at ffffffff8f1f5b62
#13 [ffff9be677cdba00] try_to_free_pages at ffffffff8f1f5f09
#14 [ffff9be677cdba88] __alloc_pages_slowpath at ffffffff8f1e479d
#15 [ffff9be677cdbba8] __alloc_pages_nodemask at ffffffff8f1e2231
#16 [ffff9be677cdbc18] alloc_pages_current at ffffffff8f243bea
#17 [ffff9be677cdbc48] skb_page_frag_refill at ffffffff8f6f943c
#18 [ffff9be677cdbc68] sk_page_frag_refill at ffffffff8f6f947d
#19 [ffff9be677cdbc80] tcp_sendmsg_locked at ffffffff8f779103
#20 [ffff9be677cdbd30] tcp_sendmsg at ffffffff8f779cec
#21 [ffff9be677cdbd58] inet_sendmsg at ffffffff8f7a8f07
#22 [ffff9be677cdbd80] sock_sendmsg at ffffffff8f6f30ce
#23 [ffff9be677cdbda0] sock_write_iter at ffffffff8f6f3165
#24 [ffff9be677cdbe18] __vfs_write at ffffffff8f28447c
#25 [ffff9be677cdbea0] vfs_write at ffffffff8f284692
#26 [ffff9be677cdbee0] sys_write at ffffffff8f2848f5
#27 [ffff9be677cdbf28] do_syscall_64 at ffffffff8f003949


And the waiter (of that pag_ici_reclaim_lock) process has the following
stack trace:

PID: 30224  TASK: ffff89f944cd0000  CPU: 52  COMMAND: "loop0"
  #0 [ffff9be663fcb0a8] __schedule at ffffffff8f866a9c
  #1 [ffff9be663fcb140] schedule at ffffffff8f8670b6
  #2 [ffff9be663fcb158] schedule_preempt_disabled at ffffffff8f86740e
  #3 [ffff9be663fcb168] __mutex_lock at ffffffff8f868d7c
  #4 [ffff9be663fcb228] __mutex_lock_slowpath at ffffffff8f8691c3
  #5 [ffff9be663fcb238] mutex_lock at ffffffff8f8691ff
  #6 [ffff9be663fcb250] xfs_reclaim_inodes_ag at ffffffffc055b9b2 [xfs]
  #7 [ffff9be663fcb3e0] xfs_reclaim_inodes_nr at ffffffffc055cb73 [xfs]
  #8 [ffff9be663fcb400] xfs_fs_free_cached_objects at ffffffffc056fe79 [xfs]
  #9 [ffff9be663fcb410] super_cache_scan at ffffffff8f287927
#10 [ffff9be663fcb468] shrink_slab at ffffffff8f1efa53
#11 [ffff9be663fcb550] shrink_node at ffffffff8f1f5628
#12 [ffff9be663fcb5d8] do_try_to_free_pages at ffffffff8f1f5b62
#13 [ffff9be663fcb640] try_to_free_pages at ffffffff8f1f5f09
#14 [ffff9be663fcb6c8] __alloc_pages_slowpath at ffffffff8f1e479d
#15 [ffff9be663fcb7e8] __alloc_pages_nodemask at ffffffff8f1e2231
#16 [ffff9be663fcb858] alloc_pages_current at ffffffff8f243bea
#17 [ffff9be663fcb888] new_slab at ffffffff8f250f09
#18 [ffff9be663fcb8f0] ___slab_alloc at ffffffff8f2517b5
#19 [ffff9be663fcb9b8] __slab_alloc at ffffffff8f255ddd
#20 [ffff9be663fcb9f8] kmem_cache_alloc at ffffffff8f251ab9
#21 [ffff9be663fcba38] mempool_alloc_slab at ffffffff8f1da935
#22 [ffff9be663fcba48] mempool_alloc at ffffffff8f1daa63
#23 [ffff9be663fcbac0] bio_alloc_bioset at ffffffff8f39a7f2
#24 [ffff9be663fcbb10] iomap_dio_actor at ffffffff8f2f1d2d
#25 [ffff9be663fcbb90] iomap_apply at ffffffff8f2f2a07
#26 [ffff9be663fcbc10] iomap_dio_rw at ffffffff8f2f3261
#27 [ffff9be663fcbcc0] __dta_xfs_file_dio_aio_write_3305 at ffffffffc0557078 [xfs]
#28 [ffff9be663fcbd10] xfs_file_write_iter at ffffffffc0557551 [xfs]
#29 [ffff9be663fcbd40] lo_rw_aio at ffffffffc075d2ea [loop]
#30 [ffff9be663fcbdc0] loop_queue_work at ffffffffc075ddc2 [loop]
#31 [ffff9be663fcbea8] kthread_worker_fn at ffffffff8f0b889a
#32 [ffff9be663fcbef8] loop_kthread_worker_fn at ffffffffc075b90e [loop]
#33 [ffff9be663fcbf08] kthread at ffffffff8f0b7fd5
#34 [ffff9be663fcbf50] ret_from_fork at ffffffff8fa00344

The deadlock happened in (under) the loop device layer; the deadlock
under the loop device may not happen in the newest upstream code. But it
can still happen at higher layers in the IO path.

thanks,
wengang
Dave Chinner April 24, 2020, 9:37 p.m. UTC | #6
On Fri, Apr 24, 2020 at 09:58:09AM -0700, Wengang Wang wrote:
> On 4/23/20 6:39 PM, Dave Chinner wrote:
> > On Thu, Apr 23, 2020 at 04:19:52PM -0700, Wengang Wang wrote:
> > > On 4/23/20 4:14 PM, Wengang Wang wrote:
> > > > The real case I hit is that process A is waiting for inode unpin on
> > > > XFS A, which is a loop-device-backed mount.
> > > And actually, there is a dm-thin on top of the loop device...
> > Makes no difference, really, because it's still the loop device
> > that is doing the IO to the underlying filesystem...
> I mentioned the IO path here, not the IO itself. In this case, the IO
> path includes dm-thin.
> 
> We have to consider it as long as we are not sure whether any GFP_KERNEL
> allocation (or any flags without NOFS/NOIO) happens in dm-thin.
> 
> If dm-thin does a GFP_KERNEL allocation and goes into direct memory
> reclaim, the deadlock forms.

If that happens, then that is a bug in dm-thin, not a bug in XFS.
There are rules to how memory allocation must be done to avoid
deadlocks, and one of those is that block device level IO path
allocations *must* use GFP_NOIO. This prevents reclaim from
recursing into subsystems that might require IO to reclaim memory
and hence self deadlock because the IO layer requires allocation to
succeed to make forwards progress.

That's why we have mempools and GFP_NOIO at the block and device
layers....


> > > > And the backing file is on a different (X)FS B mount. So the IO goes
> > > > through the loop device, doing (direct) writes to (X)FS B.
> > > > 
> > > > The (direct) writes to (X)FS B do memory allocations and can then
> > > > enter direct memory reclaim...
> > The loop device issues IO to the lower filesystem in
> > memalloc_noio_save() context, which means all memory allocations in
> > its IO path are done with GFP_NOIO context. Hence those allocations
> > will not recurse into reclaim on -any filesystem- and hence will not
> > deadlock on filesystem reclaim. So what I said originally is correct
> > even when we take filesystems stacked via loop devices into account.
> You are right here. It seems the loop device is doing NOFS|NOIO allocations.
> 
> The deadlock happened on a somewhat older kernel, one without the loop
> device patch that does the NOFS|NOIO allocation.

Right, the loop device used to have an allocation context bug, but
that has been fixed. Either way, this is not an XFS or even a
filesystem layer issue.

> Well, here you are only talking about the loop device; that's not enough
> to show it's also safe when the memory reclaim happens at a higher layer,
> above the loop device, in the IO path.

Yes it is.

Block devices and device drivers are *required* to use GFP_NOIO
context for memory allocations in the IO path. IOWs, any block
device that is doing GFP_KERNEL context allocation violates the
memory allocation rules we have for the IO path.  This architectural
constraint exists exclusively to avoid this entire class of IO-based
memory reclaim recursion deadlocks.

> > Hence I'll ask again: do you have stack traces of the deadlock or a
> > lockdep report? If not, can you please describe the storage setup
> > from top to bottom and lay out exactly where in what layers trigger
> > this deadlock?
> 
> Sharing the callback traces:

<snip>

Yeah, so the loop device is doing GFP_KERNEL allocation in a
GFP_NOIO context. You need to fix the loop device in whatever kernel
you are testing, which you have conveniently never mentioned. I'm
betting this is a vendor kernel that is missing fixes from the
upstream kernel. In which case you need to talk to your OS vendor,
not upstream...

Cheers,

Dave.
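
A minimal sketch of the mempool-plus-GFP_NOIO pattern Dave refers to;
struct io_req, io_req_cache, and io_req_pool are hypothetical names used
only for illustration:

#include <linux/mempool.h>
#include <linux/slab.h>

struct io_req {			/* hypothetical per-request state */
	struct bio	*bio;
};

static struct kmem_cache *io_req_cache;
static mempool_t *io_req_pool;

static int init_io_req_pool(void)
{
	io_req_cache = KMEM_CACHE(io_req, 0);
	if (!io_req_cache)
		return -ENOMEM;
	/* keep a reserve of 16 objects so the IO path always makes progress */
	io_req_pool = mempool_create_slab_pool(16, io_req_cache);
	if (!io_req_pool) {
		kmem_cache_destroy(io_req_cache);
		return -ENOMEM;
	}
	return 0;
}

static struct io_req *get_io_req(void)
{
	/* GFP_NOIO: may sleep, but never recurses into IO-driven reclaim */
	return mempool_alloc(io_req_pool, GFP_NOIO);
}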
Wengang Wang April 24, 2020, 9:45 p.m. UTC | #7
OK, makes sense! Thanks Dave!

Wengang

On 4/24/20 2:37 PM, Dave Chinner wrote:
> On Fri, Apr 24, 2020 at 09:58:09AM -0700, Wengang Wang wrote:
>> On 4/23/20 6:39 PM, Dave Chinner wrote:
>>> On Thu, Apr 23, 2020 at 04:19:52PM -0700, Wengang Wang wrote:
>>>> On 4/23/20 4:14 PM, Wengang Wang wrote:
>>>>> The real case I hit is that process A is waiting for inode unpin on
>>>>> XFS A, which is a loop-device-backed mount.
>>>> And actually, there is a dm-thin on top of the loop device...
>>> Makes no difference, really, because it's still the loop device
>>> that is doing the IO to the underlying filesystem...
>> I mentioned the IO path here, not the IO itself. In this case, the IO
>> path includes dm-thin.
>>
>> We have to consider it as long as we are not sure whether any GFP_KERNEL
>> allocation (or any flags without NOFS/NOIO) happens in dm-thin.
>>
>> If dm-thin does a GFP_KERNEL allocation and goes into direct memory
>> reclaim, the deadlock forms.
> If that happens, then that is a bug in dm-thin, not a bug in XFS.
> There are rules to how memory allocation must be done to avoid
> deadlocks, and one of those is that block device level IO path
> allocations *must* use GFP_NOIO. This prevents reclaim from
> recursing into subsystems that might require IO to reclaim memory
> and hence self deadlock because the IO layer requires allocation to
> succeed to make forwards progress.
>
> That's why we have mempools and GFP_NOIO at the block and device
> layers....
>
>
>>>>> And the backing file is on a different (X)FS B mount. So the IO goes
>>>>> through the loop device, doing (direct) writes to (X)FS B.
>>>>>
>>>>> The (direct) writes to (X)FS B do memory allocations and can then
>>>>> enter direct memory reclaim...
>>> The loop device issues IO to the lower filesystem in
>>> memalloc_noio_save() context, which means all memory allocations in
>>> its IO path are done with GFP_NOIO context. Hence those allocations
>>> will not recurse into reclaim on -any filesystem- and hence will not
>>> deadlock on filesystem reclaim. So what I said originally is correct
>>> even when we take filesystems stacked via loop devices into account.
>> You are right here. It seems the loop device is doing NOFS|NOIO allocations.
>>
>> The deadlock happened on a somewhat older kernel, one without the loop
>> device patch that does the NOFS|NOIO allocation.
> Right, the loop device used to have an allocation context bug, but
> that has been fixed. Either way, this is not an XFS or even a
> filesystem layer issue.
>
>> Well, here you are only talking about the loop device; that's not enough
>> to show it's also safe when the memory reclaim happens at a higher layer,
>> above the loop device, in the IO path.
> Yes it is.
>
> Block devices and device drivers are *required* to use GFP_NOIO
> context for memory allocations in the IO path. IOWs, any block
> device that is doing GFP_KERNEL context allocation violates the
> memory allocation rules we have for the IO path.  This architectural
> constraint exists exclusively to avoid this entire class of IO-based
> memory reclaim recursion deadlocks.
>
>>> Hence I'll ask again: do you have stack traces of the deadlock or a
>>> lockdep report? If not, can you please describe the storage setup
>>> from top to bottom and lay out exactly where in what layers trigger
>>> this deadlock?
>> Sharing the callback traces:
> <snip>
>
> Yeah, so the loop device is doing GFP_KERNEL allocation in a
> GFP_NOIO context. You need to fix the loop device in whatever kernel
> you are testing, which you have conveniently never mentioned. I'm
> betting this is a vendor kernel that is missing fixes from the
> upstream kernel. In which case you need to talk to your OS vendor,
> not upstream...
>
> Cheers,
>
> Dave.

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8dc2e5414276..e2a6ab04db3d 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1245,11 +1245,8 @@  xfs_reclaim_inodes_ag(
 	int			last_error = 0;
 	xfs_agnumber_t		ag;
 	int			trylock = flags & SYNC_TRYLOCK;
-	int			skipped;
 
-restart:
 	ag = 0;
-	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
@@ -1259,7 +1256,6 @@  xfs_reclaim_inodes_ag(
 
 		if (trylock) {
 			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
-				skipped++;
 				xfs_perag_put(pag);
 				continue;
 			}
@@ -1340,17 +1336,6 @@  xfs_reclaim_inodes_ag(
 		xfs_perag_put(pag);
 	}
 
-	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
-	 * another pass this time using blocking reclaim semantics (i.e
-	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
-	 * ensure that when we get more reclaimers than AGs we block rather
-	 * than spin trying to execute reclaim.
-	 */
-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
-		goto restart;
-	}
 	return last_error;
 }
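
After this patch the walk is a single pass: SYNC_TRYLOCK callers simply
skip contended AGs and never fall back to a blocking mutex_lock(). A
simplified sketch of the resulting loop, with the surrounding code and
the per-AG reclaim body elided:

	ag = 0;
	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
		if (trylock) {
			/* skip a contended AG instead of ever blocking on it */
			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
				xfs_perag_put(pag);
				continue;
			}
		}

		/* ... walk and reclaim the inodes in this AG, then unlock ... */

		xfs_perag_put(pag);
	}

	return last_error;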