Message ID: 20181211132645.31053-1-mhocko@kernel.org (mailing list archive)
State: New, archived
Series: mm, memcg: fix reclaim deadlock with writeback
On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
>
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.
>
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
>
> Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> Hi,
> this has been originally reported here [1]. While it could get worked
> around in the fs, catching the allocation early sounds like a preferable
> approach. Liu Bo has noted that he is not able to reproduce this anymore
> because kmem accounting has been disabled in their workload but this
> should be quite straightforward to review.
>
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.
>
> I would appreciate if Kirill could have a look and double check I am not
> doing something stupid here.
>
> Debugging lock_page deadlocks is an absolute PITA considering a lack of
> lockdep support so I would mark it for stable.
>
> [1] http://lkml.kernel.org/r/1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com
>
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ad2d293ddc2..1a73d2d4659e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}
> +
>  	ret = vma->vm_ops->fault(vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>  			    VM_FAULT_DONE_COW)))

Sorry, but I don't think it fixes anything. It just hides the problem a
level deeper.

The trick with ->prealloc_pte works for faultaround because we can rely on
->map_pages() not to sleep and we know how it will set up the page table
entry. Basically, the core controls most of the path.

That is not the case with ->fault(). It is free to sleep and allocate
whatever it wants. For instance, a DAX page fault will set up the page
table entry on its own and return VM_FAULT_NOPAGE. It uses
vmf_insert_mixed() to set up the page table and ignores your pre-allocated
page table. But that is just an example. The problem is that ->fault() is
not bounded in what it can do, unlike ->map_pages().
On Tue 11-12-18 18:15:42, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
[...]
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	vm_fault_t ret;
> >  
> > +	/*
> > +	 * Preallocate pte before we take page_lock because this might lead to
> > +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> > +	 */
> > +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> > +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
> > +		if (!vmf->prealloc_pte)
> > +			return VM_FAULT_OOM;
> > +		smp_wmb(); /* See comment in __pte_alloc() */
> > +	}
> > +
> >  	ret = vma->vm_ops->fault(vmf);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> >  			    VM_FAULT_DONE_COW)))
>
> Sorry, but I don't think it fixes anything. Just hides it a level deeper.
>
> The trick with ->prealloc_pte works for faultaround because we can rely on
> ->map_pages() to not sleep and we know how it will setup page table entry.
> Basically, core controls most of the path.
>
> It's not the case with ->fault(). It is free to sleep and allocate
> whatever it wants.

Yeah, but if the fault callback wants to allocate then it has to consider
the usual allocation restrictions, e.g. NOFS if the allocation itself can
trip over fs locks.

> For instance, DAX page fault will setup page table entry on its own and
> return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to setup the page table
> and ignores your pre-allocated page table.

Does this happen with a page locked and with a __GFP_ACCOUNT allocation?
I am not familiar with that code but I do not see it from a quick look.

> But it's just an example. The problem is that ->fault() is not bounded on
> what it can do, unlike ->map_pages().

That is a fair point but the primary issue here is that the generic #PF
code breaks the underlying assumption and performs a
__GFP_ACCOUNT|GFP_KERNEL allocation while holding the lock of an fs-owned
page.
On Tue 11-12-18 17:21:49, Michal Hocko wrote:
> On Tue 11-12-18 18:15:42, Kirill A. Shutemov wrote:
> > For instance, DAX page fault will setup page table entry on its own and
> > return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to setup the page table
> > and ignores your pre-allocated page table.
>
> Does this happen with a page locked and with __GFP_ACCOUNT allocation. I
> am not familiar with that code but I do not see it from a quick look.

DAX has no page to lock and also no writeback to do, so the deadlock isn't
really possible when DAX is in use...

								Honza
On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
[...]
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
>
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.

Side note:

Do we have the PG_writeback vs. PG_locked ordering documented somewhere?

IIUC, the trace from task2 suggests that we must not wait for writeback
on a locked page.

But that is not what I see for many wait_on_page_writeback() users: it is
usually called with the page locked. I see it for truncate, shmem,
swapfile, splice...

Maybe the problem is within the task2 codepath after all?
On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
[...]
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock.
>
> Side note:
>
> Do we have PG_writeback vs. PG_locked ordering documentated somewhere?

I am not aware of any.

> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.
>
> But that not what I see for many wait_on_page_writeback() users: it usally
> called with the page locked. I see it for truncate, shmem, swapfile,
> splice...
>
> Maybe the problem is within task2 codepath after all?

Jack and David have explained that this is due to an optimization that
multiple filesystems do: they lock and set writeback on multiple pages
and then send one larger IO at once. So in this case we have the
following pattern (flusher on the left, #PF path on the right):

	lock_page(B)
	SetPageWriteback(B)
	unlock_page(B)
					lock_page(A)
	lock_page(A)
					pte_alloc_one
					  shrink_page_list
					    wait_on_page_writeback(B)
	SetPageWriteback(A)
	unlock_page(A)

	# flush A, B to clear the writeback
On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
[...]
> Side note:
>
> Do we have PG_writeback vs. PG_locked ordering documentated somewhere?
>
> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.

Well, waiting on writeback of page A when A is locked has always been
fine. After all, that is the only easy way to make sure you really have a
page for which no IO is running, as the page lock protects you from a new
writeback attempt starting. Waiting on writeback of page B while having
page A locked *is* problematic and prone to deadlocks due to code paths
like in task2.

> But that not what I see for many wait_on_page_writeback() users: it usally
> called with the page locked. I see it for truncate, shmem, swapfile,
> splice...
>
> Maybe the problem is within task2 codepath after all?

So ->writepages() methods in filesystems have the property that to
complete writeback on the page with index X, they may need the page lock
of page X+1. I agree that this is a bit hairy but from the fs point of
view it makes a lot of sense and AFAIK nothing besides that memcg IO
throttling can create deadlocks with such a locking scheme.

								Honza
On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
[...]
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.

Okay, I have spent some time on the issue and was not able to find a
better solution either. But I cannot say I like it.

I think we need to spend more time on making ->prealloc_pte useful: it
looks like it would help to convert the vmf_insert_* helpers to take
struct vm_fault * as input and propagate it down to the pmd population
point. Otherwise DAX and drivers would allocate the page table for
nothing.

Have you considered whether we need anything similar for the anon path?
Is it possible to have a similar deadlock with swapping rather than
writeback?
On Wed 12-12-18 14:58:37, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
[...]
> Okay, I have spent some time on the issue and was not able to find a
> better solution too. But I cannot say I like it.

Ohh, I do not like it either. I can make it more targeted by abstracting
sane_reclaim() and using it for the check but I am not sure this is
really more helpful.

> I think we need to spend more time on making ->prealloc_pte useful: looks
> like it would help to convert vmf_insert_* helpers to take struct vm_fault
> * as input and propagate it down to pmd population point. Otherwise DAX
> and drivers would alloacate the page table for nothing.

Yes, this would be an obvious enhancement.

> Have you considered if we need anything similar for anon path? Is it
> possible to have similar deadlock with swaping rather than writeback?

No, I do not see the anon path allocating under the page lock.
Hi Michal,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc6 next-20181212]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-fix-reclaim-deadlock-with-writeback/20181212-224633
config: x86_64-randconfig-x000-201849 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

All errors (new ones prefixed by >>):

   mm/memory.c: In function '__do_fault':
>> mm/memory.c:3001:45: error: 'struct vm_area_struct' has no member named 'vm'
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                 ^~
>> mm/memory.c:3001:50: error: 'mm' undeclared (first use in this function); did you mean 'hmm'?
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                      ^~
                                                      hmm
   mm/memory.c:3001:50: note: each undeclared identifier is reported only once for each function it appears in

vim +3001 mm/memory.c

  2985	
  2986	/*
  2987	 * The mmap_sem must have been held on entry, and may have been
  2988	 * released depending on flags and vma->vm_ops->fault() return value.
  2989	 * See filemap_fault() and __lock_page_retry().
  2990	 */
  2991	static vm_fault_t __do_fault(struct vm_fault *vmf)
  2992	{
  2993		struct vm_area_struct *vma = vmf->vma;
  2994		vm_fault_t ret;
  2995	
  2996		/*
  2997		 * Preallocate pte before we take page_lock because this might lead to
  2998		 * deadlocks for memcg reclaim which waits for pages under writeback.
  2999		 */
  3000		if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> 3001			vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
  3002			if (!vmf->prealloc_pte)
  3003				return VM_FAULT_OOM;
  3004			smp_wmb(); /* See comment in __pte_alloc() */
  3005		}
  3006	
  3007		ret = vma->vm_ops->fault(vmf);
  3008		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
  3009				    VM_FAULT_DONE_COW)))
  3010			return ret;
  3011	
  3012		if (unlikely(PageHWPoison(vmf->page))) {
  3013			if (ret & VM_FAULT_LOCKED)
  3014				unlock_page(vmf->page);
  3015			put_page(vmf->page);
  3016			vmf->page = NULL;
  3017			return VM_FAULT_HWPOISON;
  3018		}
  3019	
  3020		if (unlikely(!(ret & VM_FAULT_LOCKED)))
  3021			lock_page(vmf->page);
  3022		else
  3023			VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
  3024	
  3025		return ret;
  3026	}
  3027	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
On Wed 12-12-18 23:33:10, kbuild test robot wrote:
> Hi Michal,
>
> I love your patch! Yet something to improve:

Well, I hate it ;) (like all obviously broken patches)

Sorry, this is a typo. v2 sent out.

Thanks!
Hi Michal,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc6 next-20181214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-fix-reclaim-deadlock-with-writeback/20181212-224633
config: nds32-allmodconfig (attached as .config)
compiler: nds32le-linux-gcc (GCC) 6.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=6.4.0 make.cross ARCH=nds32

All errors (new ones prefixed by >>):

   mm/memory.c: In function '__do_fault':
   mm/memory.c:3001:45: error: 'struct vm_area_struct' has no member named 'vm'
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                 ^~
>> mm/memory.c:3001:50: error: 'mm' undeclared (first use in this function)
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                      ^~
   mm/memory.c:3001:50: note: each undeclared identifier is reported only once for each function it appears in

vim +/mm +3001 mm/memory.c

  2985	
  2986	/*
  2987	 * The mmap_sem must have been held on entry, and may have been
  2988	 * released depending on flags and vma->vm_ops->fault() return value.
  2989	 * See filemap_fault() and __lock_page_retry().
  2990	 */
  2991	static vm_fault_t __do_fault(struct vm_fault *vmf)
  2992	{
  2993		struct vm_area_struct *vma = vmf->vma;
  2994		vm_fault_t ret;
  2995	
  2996		/*
  2997		 * Preallocate pte before we take page_lock because this might lead to
  2998		 * deadlocks for memcg reclaim which waits for pages under writeback.
  2999		 */
  3000		if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> 3001			vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
  3002			if (!vmf->prealloc_pte)
  3003				return VM_FAULT_OOM;
  3004			smp_wmb(); /* See comment in __pte_alloc() */
  3005		}
  3006	
  3007		ret = vma->vm_ops->fault(vmf);
  3008		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
  3009				    VM_FAULT_DONE_COW)))
  3010			return ret;
  3011	
  3012		if (unlikely(PageHWPoison(vmf->page))) {
  3013			if (ret & VM_FAULT_LOCKED)
  3014				unlock_page(vmf->page);
  3015			put_page(vmf->page);
  3016			vmf->page = NULL;
  3017			return VM_FAULT_HWPOISON;
  3018		}
  3019	
  3020		if (unlikely(!(ret & VM_FAULT_LOCKED)))
  3021			lock_page(vmf->page);
  3022		else
  3023			VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
  3024	
  3025		return ret;
  3026	}
  3027	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
On Sat 15-12-18 01:13:53, kbuild test robot wrote:
> Hi Michal,
>
> I love your patch! Yet something to improve:

Could you point the bot to v3 please?
http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..1a73d2d4659e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))