diff mbox series

mm, memcg: fix reclaim deadlock with writeback

Message ID 20181211132645.31053-1-mhocko@kernel.org (mailing list archive)
State New, archived
Headers show
Series mm, memcg: fix reclaim deadlock with writeback | expand

Commit Message

Michal Hocko Dec. 11, 2018, 1:26 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback
task1:
[<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
[<ffffffff811c5777>] shrink_page_list+0x907/0x960
[<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
[<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
[<ffffffff811c70a8>] shrink_node+0xd8/0x300
[<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
[<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[<ffffffff8122df2d>] try_charge+0x14d/0x720
[<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
[<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
[<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
[<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
[<ffffffff81074247>] pte_alloc_one+0x17/0x40
[<ffffffff811e34de>] __pte_alloc+0x1e/0x110
[<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
[<ffffffff811e5d93>] do_fault+0x103/0x970
[<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
[<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
[<ffffffff8106ecb0>] do_page_fault+0x30/0x80
[<ffffffff8171bce8>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

task2:
[<ffffffff811aadc6>] __lock_page+0x86/0xa0
[<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
[<ffffffff811bbede>] do_writepages+0x1e/0x30
[<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
[<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
[<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
[<ffffffff81273568>] wb_writeback+0x268/0x300
[<ffffffff81273d24>] wb_workfn+0xb4/0x390
[<ffffffff810a2f19>] process_one_work+0x189/0x420
[<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
[<ffffffff810a9786>] kthread+0xe6/0x100
[<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
[<ffffffffffffffff>] 0xffffffffffffffff

He adds
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
: bit the page which tasks1 has locked.

More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a memory
limit reclaim and the memcg reclaim for legacy controller is waiting on
the writeback but that is never going to finish because the writeback
itself is waiting for the page locked in the #PF path. So this is
essentially ABBA deadlock.

Waiting for the writeback in legacy memcg controller is a workaround
for pre-mature OOM killer invocations because there is no dirty IO
throttling available for the controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We have that handy
infrastructure for that already so simply reuse the fault-around pattern
which already does this.

Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
this has been originally reported here [1]. While it could get worked
around in the fs, catching the allocation early sounds like a preferable
approach. Liu Bo has noted that he is not able to reproduce this anymore
because kmem accounting has been disabled in their workload but this
should be quite straightforward to review.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare. I am not
aware of a better solution unfortunately.

I would appreciate if Kirril could have a look and double check I am not
doing something stupid here.

Debugging lock_page deadlocks is an absolute PITA considering a lack of
lockdep support so I would mark it for stable.

[1] http://lkml.kernel.org/r/1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Kirill A . Shutemov Dec. 11, 2018, 3:15 p.m. UTC | #1
On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.
> 
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
> 
> Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> Hi,
> this has been originally reported here [1]. While it could get worked
> around in the fs, catching the allocation early sounds like a preferable
> approach. Liu Bo has noted that he is not able to reproduce this anymore
> because kmem accounting has been disabled in their workload but this
> should be quite straightforward to review.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.
> 
> I would appreciate if Kirril could have a look and double check I am not
> doing something stupid here.
> 
> Debugging lock_page deadlocks is an absolute PITA considering a lack of
> lockdep support so I would mark it for stable.
> 
> [1] http://lkml.kernel.org/r/1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ad2d293ddc2..1a73d2d4659e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}
> +
>  	ret = vma->vm_ops->fault(vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>  			    VM_FAULT_DONE_COW)))

Sorry, but I don't think it fixes anything. Just hides it a level deeper.

The trick with ->prealloc_pte works for faultaround because we can rely on
->map_pages() to not sleep and we know how it will setup page table entry.
Basically, core controls most of the path.

It's not the case with ->fault(). It is free to sleep and allocate
whatever it wants.

For instance, DAX page fault will setup page table entry on its own and
return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to setup the page table
and ignores your pre-allocated page table.

But it's just an example. The problem is that ->fault() is not bounded on
what it can do, unlike ->map_pages().
Michal Hocko Dec. 11, 2018, 4:21 p.m. UTC | #2
On Tue 11-12-18 18:15:42, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
[...]
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	vm_fault_t ret;
> >  
> > +	/*
> > +	 * Preallocate pte before we take page_lock because this might lead to
> > +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> > +	 */
> > +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> > +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
> > +		if (!vmf->prealloc_pte)
> > +			return VM_FAULT_OOM;
> > +		smp_wmb(); /* See comment in __pte_alloc() */
> > +	}
> > +
> >  	ret = vma->vm_ops->fault(vmf);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> >  			    VM_FAULT_DONE_COW)))
> 
> Sorry, but I don't think it fixes anything. Just hides it a level deeper.
> 
> The trick with ->prealloc_pte works for faultaround because we can rely on
> ->map_pages() to not sleep and we know how it will setup page table entry.
> Basically, core controls most of the path.
> 
> It's not the case with ->fault(). It is free to sleep and allocate
> whatever it wants.

Yeah, but if the fault callback wants to allocate then it has to
consider the usual allocation restrictions. e.g. NOFS if the allocation
itself can trip over fs locks.

> For instance, DAX page fault will setup page table entry on its own and
> return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to setup the page table
> and ignores your pre-allocated page table.

Does this happen with a page locked and with __GFP_ACCOUNT allocation. I
am not familiar with that code but I do not see it from a quick look.
 
> But it's just an example. The problem is that ->fault() is not bounded on
> what it can do, unlike ->map_pages().

That is a fair point but the primary issue here is that the generic #PF
code breaks the underlying assumption and performs
__GFP_ACCOUNT|GFP_KERNEL allocation from within a fs owned locked page.
Jan Kara Dec. 11, 2018, 4:39 p.m. UTC | #3
On Tue 11-12-18 17:21:49, Michal Hocko wrote:
> On Tue 11-12-18 18:15:42, Kirill A. Shutemov wrote:
> > For instance, DAX page fault will setup page table entry on its own and
> > return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to setup the page table
> > and ignores your pre-allocated page table.
> 
> Does this happen with a page locked and with __GFP_ACCOUNT allocation. I
> am not familiar with that code but I do not see it from a quick look.

DAX has no page to lock and also no writeback to do so the deadlock isn't
really possible when DAX is in use...

								Honza
Kirill A . Shutemov Dec. 12, 2018, 9:42 a.m. UTC | #4
On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.

Side node:

Do we have PG_writeback vs. PG_locked ordering documentated somewhere?

IIUC, the trace from task2 suggests that we must not wait for writeback
on the locked page.

But that not what I see for many wait_on_page_writeback() users: it usally
called with the page locked. I see it for truncate, shmem, swapfile,
splice...

Maybe the problem is within task2 codepath after all?
Michal Hocko Dec. 12, 2018, 9:48 a.m. UTC | #5
On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> > [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> > [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> > [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> > [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> > [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> > [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [<ffffffff8122df2d>] try_charge+0x14d/0x720
> > [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> > [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> > [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> > [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> > [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> > [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> > [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> > [<ffffffff811e5d93>] do_fault+0x103/0x970
> > [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> > [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> > [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> > [<ffffffff8171bce8>] page_fault+0x28/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > task2:
> > [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> > [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> > [<ffffffff811bbede>] do_writepages+0x1e/0x30
> > [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> > [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> > [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> > [<ffffffff81273568>] wb_writeback+0x268/0x300
> > [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> > [<ffffffff810a2f19>] process_one_work+0x189/0x420
> > [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> > [<ffffffff810a9786>] kthread+0xe6/0x100
> > [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> > : bit the page which tasks1 has locked.
> > 
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock.
> 
> Side node:
> 
> Do we have PG_writeback vs. PG_locked ordering documentated somewhere?

I am not aware of any

> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.
> 
> But that not what I see for many wait_on_page_writeback() users: it usally
> called with the page locked. I see it for truncate, shmem, swapfile,
> splice...
> 
> Maybe the problem is within task2 codepath after all?

Jack and David have explained that this is due to an optimization
multiple filesystems do. They lock and set wribeback on multiple pages
and then send a largeer IO at once. So in this case we have the
following pattern

                                        lock_page(B)
                                        SetPageWriteback(B)
                                        unlock_page(B)
lock_page(A)
                                        lock_page(A)
pte_alloc_pne
  shrink_page_list
    wait_on_page_writeback(B)
                                        SetPageWriteback(A)
                                        unlock_page(A)

                                        # flush A, B to clear the writeback
Jan Kara Dec. 12, 2018, 10:05 a.m. UTC | #6
On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> > [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> > [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> > [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> > [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> > [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> > [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [<ffffffff8122df2d>] try_charge+0x14d/0x720
> > [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> > [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> > [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> > [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> > [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> > [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> > [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> > [<ffffffff811e5d93>] do_fault+0x103/0x970
> > [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> > [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> > [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> > [<ffffffff8171bce8>] page_fault+0x28/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > task2:
> > [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> > [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> > [<ffffffff811bbede>] do_writepages+0x1e/0x30
> > [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> > [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> > [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> > [<ffffffff81273568>] wb_writeback+0x268/0x300
> > [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> > [<ffffffff810a2f19>] process_one_work+0x189/0x420
> > [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> > [<ffffffff810a9786>] kthread+0xe6/0x100
> > [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> > : bit the page which tasks1 has locked.
> > 
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock.
> 
> Side node:
> 
> Do we have PG_writeback vs. PG_locked ordering documentated somewhere?
> 
> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.

Well, waiting on writeback of page A when A is locked has always been fine.
After all that's the only easy way to make sure you really have a page for
which no IO is running as page lock protects you from new writeback attempt
starting.

Waiting on writeback of page B while having page A locked *is* problematic
and prone to deadlocks due to code paths like in task2.

> But that not what I see for many wait_on_page_writeback() users: it usally
> called with the page locked. I see it for truncate, shmem, swapfile,
> splice...
> 
> Maybe the problem is within task2 codepath after all?

So ->writepages() methods in filesystems have the property that to complete
writeback on page with index X, they may need page lock from page X+1. I
agree that this is a bit hairy but from fs point of view it makes a lot of
sense and AFAIK nothing besides that memcg IO throttling can create
deadlocks with such a locking scheme.

								Honza
Kirill A . Shutemov Dec. 12, 2018, 11:58 a.m. UTC | #7
On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.
> 
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
> 
> Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> Hi,
> this has been originally reported here [1]. While it could get worked
> around in the fs, catching the allocation early sounds like a preferable
> approach. Liu Bo has noted that he is not able to reproduce this anymore
> because kmem accounting has been disabled in their workload but this
> should be quite straightforward to review.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.

Okay, I have spent some time on the issue and was not able to find a
better solution too. But I cannot say I like it.

I think we need to spend more time on making ->prealloc_pte useful: looks
like it would help to convert vmf_insert_* helpers to take struct vm_fault
* as input and propagate it down to pmd population point. Otherwise DAX
and drivers would alloacate the page table for nothing.

Have you considered if we need anything similar for anon path? Is it
possible to have similar deadlock with swaping rather than writeback?
Michal Hocko Dec. 12, 2018, 12:13 p.m. UTC | #8
On Wed 12-12-18 14:58:37, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> > [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> > [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> > [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> > [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> > [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> > [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [<ffffffff8122df2d>] try_charge+0x14d/0x720
> > [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> > [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> > [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> > [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> > [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> > [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> > [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> > [<ffffffff811e5d93>] do_fault+0x103/0x970
> > [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> > [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> > [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> > [<ffffffff8171bce8>] page_fault+0x28/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > task2:
> > [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> > [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> > [<ffffffff811bbede>] do_writepages+0x1e/0x30
> > [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> > [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> > [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> > [<ffffffff81273568>] wb_writeback+0x268/0x300
> > [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> > [<ffffffff810a2f19>] process_one_work+0x189/0x420
> > [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> > [<ffffffff810a9786>] kthread+0xe6/0x100
> > [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> > : bit the page which tasks1 has locked.
> > 
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock.
> > 
> > Waiting for the writeback in legacy memcg controller is a workaround
> > for pre-mature OOM killer invocations because there is no dirty IO
> > throttling available for the controller. There is no easy way around
> > that unfortunately. Therefore fix this specific issue by pre-allocating
> > the page table outside of the page lock. We have that handy
> > infrastructure for that already so simply reuse the fault-around pattern
> > which already does this.
> > 
> > Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> > Hi,
> > this has been originally reported here [1]. While it could get worked
> > around in the fs, catching the allocation early sounds like a preferable
> > approach. Liu Bo has noted that he is not able to reproduce this anymore
> > because kmem accounting has been disabled in their workload but this
> > should be quite straightforward to review.
> > 
> > There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> > from under a fs page locked but they should be really rare. I am not
> > aware of a better solution unfortunately.
> 
> Okay, I have spent some time on the issue and was not able to find a
> better solution too. But I cannot say I like it.

Ohh, I do not like it either. I can make it more targeted by abstracting
sane_reclaim() and using it for the check but I am not sure this is
really more helpful.
 
> I think we need to spend more time on making ->prealloc_pte useful: looks
> like it would help to convert vmf_insert_* helpers to take struct vm_fault
> * as input and propagate it down to pmd population point. Otherwise DAX
> and drivers would alloacate the page table for nothing.

Yes this would be an obvious enahancement.

> Have you considered if we need anything similar for anon path? Is it
> possible to have similar deadlock with swaping rather than writeback?

No, I do not see anon path to allocated under page lock.
kernel test robot Dec. 12, 2018, 3:33 p.m. UTC | #9
Hi Michal,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc6 next-20181212]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-fix-reclaim-deadlock-with-writeback/20181212-224633
config: x86_64-randconfig-x000-201849 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   mm/memory.c: In function '__do_fault':
>> mm/memory.c:3001:45: error: 'struct vm_area_struct' has no member named 'vm'
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                ^~
>> mm/memory.c:3001:50: error: 'mm' undeclared (first use in this function); did you mean 'hmm'?
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                     ^~
                                                     hmm
   mm/memory.c:3001:50: note: each undeclared identifier is reported only once for each function it appears in

vim +3001 mm/memory.c

  2985	
  2986	/*
  2987	 * The mmap_sem must have been held on entry, and may have been
  2988	 * released depending on flags and vma->vm_ops->fault() return value.
  2989	 * See filemap_fault() and __lock_page_retry().
  2990	 */
  2991	static vm_fault_t __do_fault(struct vm_fault *vmf)
  2992	{
  2993		struct vm_area_struct *vma = vmf->vma;
  2994		vm_fault_t ret;
  2995	
  2996		/*
  2997		 * Preallocate pte before we take page_lock because this might lead to
  2998		 * deadlocks for memcg reclaim which waits for pages under writeback.
  2999		 */
  3000		if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> 3001			vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
  3002			if (!vmf->prealloc_pte)
  3003				return VM_FAULT_OOM;
  3004			smp_wmb(); /* See comment in __pte_alloc() */
  3005		}
  3006	
  3007		ret = vma->vm_ops->fault(vmf);
  3008		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
  3009				    VM_FAULT_DONE_COW)))
  3010			return ret;
  3011	
  3012		if (unlikely(PageHWPoison(vmf->page))) {
  3013			if (ret & VM_FAULT_LOCKED)
  3014				unlock_page(vmf->page);
  3015			put_page(vmf->page);
  3016			vmf->page = NULL;
  3017			return VM_FAULT_HWPOISON;
  3018		}
  3019	
  3020		if (unlikely(!(ret & VM_FAULT_LOCKED)))
  3021			lock_page(vmf->page);
  3022		else
  3023			VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
  3024	
  3025		return ret;
  3026	}
  3027	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Michal Hocko Dec. 12, 2018, 3:57 p.m. UTC | #10
On Wed 12-12-18 23:33:10, kbuild test robot wrote:
> Hi Michal,
> 
> I love your patch! Yet something to improve:

Well, I hate it ;) (like all obviously broken patches) Sorry this is a
typo. v2 sent out

Thanks!
kernel test robot Dec. 14, 2018, 5:13 p.m. UTC | #11
Hi Michal,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc6 next-20181214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-fix-reclaim-deadlock-with-writeback/20181212-224633
config: nds32-allmodconfig (attached as .config)
compiler: nds32le-linux-gcc (GCC) 6.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=6.4.0 make.cross ARCH=nds32 

All errors (new ones prefixed by >>):

   mm/memory.c: In function '__do_fault':
   mm/memory.c:3001:45: error: 'struct vm_area_struct' has no member named 'vm'
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                ^~
>> mm/memory.c:3001:50: error: 'mm' undeclared (first use in this function)
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                     ^~
   mm/memory.c:3001:50: note: each undeclared identifier is reported only once for each function it appears in

vim +/mm +3001 mm/memory.c

  2985	
  2986	/*
  2987	 * The mmap_sem must have been held on entry, and may have been
  2988	 * released depending on flags and vma->vm_ops->fault() return value.
  2989	 * See filemap_fault() and __lock_page_retry().
  2990	 */
  2991	static vm_fault_t __do_fault(struct vm_fault *vmf)
  2992	{
  2993		struct vm_area_struct *vma = vmf->vma;
  2994		vm_fault_t ret;
  2995	
  2996		/*
  2997		 * Preallocate pte before we take page_lock because this might lead to
  2998		 * deadlocks for memcg reclaim which waits for pages under writeback.
  2999		 */
  3000		if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> 3001			vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
  3002			if (!vmf->prealloc_pte)
  3003				return VM_FAULT_OOM;
  3004			smp_wmb(); /* See comment in __pte_alloc() */
  3005		}
  3006	
  3007		ret = vma->vm_ops->fault(vmf);
  3008		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
  3009				    VM_FAULT_DONE_COW)))
  3010			return ret;
  3011	
  3012		if (unlikely(PageHWPoison(vmf->page))) {
  3013			if (ret & VM_FAULT_LOCKED)
  3014				unlock_page(vmf->page);
  3015			put_page(vmf->page);
  3016			vmf->page = NULL;
  3017			return VM_FAULT_HWPOISON;
  3018		}
  3019	
  3020		if (unlikely(!(ret & VM_FAULT_LOCKED)))
  3021			lock_page(vmf->page);
  3022		else
  3023			VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
  3024	
  3025		return ret;
  3026	}
  3027	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Michal Hocko Dec. 14, 2018, 5:31 p.m. UTC | #12
On Sat 15-12-18 01:13:53, kbuild test robot wrote:
> Hi Michal,
> 
> I love your patch! Yet something to improve:

Could you point the bot to v3 please? http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..1a73d2d4659e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@  static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))