From patchwork Tue Dec 11 13:26:45 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 10723761
From: Michal Hocko
To: Andrew Morton, "Kirill A. Shutemov"
Cc: Liu Bo, Jan Kara, Dave Chinner, "Theodore Ts'o", Johannes Weiner,
 Vladimir Davydov, LKML, Michal Hocko
Subject: [PATCH] mm, memcg: fix reclaim deadlock with writeback
Date: Tue, 11 Dec 2018 14:26:45 +0100
Message-Id: <20181211132645.31053-1-mhocko@kernel.org>
X-Mailer: git-send-email 2.19.2

From: Michal Hocko

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback:

task1:
[] wait_on_page_bit+0x82/0xa0
[] shrink_page_list+0x907/0x960
[] shrink_inactive_list+0x2c7/0x680
[] shrink_node_memcg+0x404/0x830
[] shrink_node+0xd8/0x300
[] do_try_to_free_pages+0x10d/0x330
[] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[] try_charge+0x14d/0x720
[] memcg_kmem_charge_memcg+0x3c/0xa0
[] memcg_kmem_charge+0x7e/0xd0
[] __alloc_pages_nodemask+0x178/0x260
[] alloc_pages_current+0x95/0x140
[] pte_alloc_one+0x17/0x40
[] __pte_alloc+0x1e/0x110
[] alloc_set_pte+0x5fe/0xc20
[] do_fault+0x103/0x970
[] handle_mm_fault+0x61e/0xd10
[] __do_page_fault+0x252/0x4d0
[] do_page_fault+0x30/0x80
[] page_fault+0x28/0x30
[] 0xffffffffffffffff

task2:
[] __lock_page+0x86/0xa0
[] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[] ext4_writepages+0x479/0xd60
[] do_writepages+0x1e/0x30
[] __writeback_single_inode+0x45/0x320
[] writeback_sb_inodes+0x272/0x600
[] __writeback_inodes_wb+0x92/0xc0
[] wb_writeback+0x268/0x300
[] wb_workfn+0xb4/0x390
[] process_one_work+0x189/0x420
[] worker_thread+0x4e/0x4b0
[] kthread+0xe6/0x100
[] ret_from_fork+0x41/0x50
[] 0xffffffffffffffff

He adds:

: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
: bit of the page which task1 has locked.

More precisely, task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a memory
limit reclaim, and the memcg reclaim for the legacy controller waits on
the writeback. That writeback is never going to finish because it is
itself waiting for the page locked in the #PF path. So this is
essentially an ABBA deadlock.

Waiting for the writeback in the legacy memcg controller is a workaround
for premature OOM killer invocations because there is no dirty IO
throttling available for the controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We already have handy
infrastructure for that, so simply reuse the fault-around pattern which
already does this.

Reported-and-Debugged-by: Liu Bo
Signed-off-by: Michal Hocko
---

Hi,
this was originally reported here [1]. While it could be worked around
in the fs, catching the allocation early sounds like a preferable
approach. Liu Bo has noted that he is not able to reproduce this anymore
because kmem accounting has been disabled in their workload, but this
should be quite straightforward to review.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
made while a fs page is locked, but they should be really rare. I am not
aware of a better solution unfortunately.

I would appreciate it if Kirill could have a look and double check I am
not doing something stupid here.

Debugging lock_page deadlocks is an absolute PITA given the lack of
lockdep support, so I would mark it for stable.
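
For anybody who wants to look at the lock ordering in isolation, here is
a minimal userspace sketch in C. It is not kernel code. The names
model_page, model_fault, model_charged_alloc() and the two
model_fault_*() helpers are hypothetical and exist only for this
illustration, and the "reclaim would wait forever" condition is reduced
to a boolean check instead of real blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model types; nothing here is a real kernel structure. */
struct model_page {
	bool locked;            /* stands in for the page lock */
	bool writeback_pending; /* writeback needs the page lock to finish */
};

struct model_fault {
	void *prealloc_pte;     /* plays the role of vmf->prealloc_pte */
};

/*
 * Model of a charged allocation that hits the memcg limit. Legacy reclaim
 * waits for writeback, and writeback waits for the page lock. If the caller
 * already holds that lock the wait can never finish, so report it via
 * *would_deadlock and return NULL instead of hanging.
 */
void *model_charged_alloc(const struct model_page *page, bool *would_deadlock)
{
	*would_deadlock = page->locked && page->writeback_pending;
	if (*would_deadlock)
		return NULL;
	return malloc(64);
}

/* Old ordering: take the page lock first, then allocate the page table. */
bool model_fault_alloc_under_lock(struct model_page *page,
				  struct model_fault *mf)
{
	bool would_deadlock;

	page->locked = true;
	mf->prealloc_pte = model_charged_alloc(page, &would_deadlock);
	page->locked = false;
	return !would_deadlock;
}

/* Patched ordering: allocate while no page is locked, then take the lock. */
bool model_fault_prealloc(struct model_page *page, struct model_fault *mf)
{
	bool would_deadlock;

	assert(!page->locked);
	if (!mf->prealloc_pte) {
		mf->prealloc_pte = model_charged_alloc(page, &would_deadlock);
		if (!mf->prealloc_pte)
			return false; /* analogous to VM_FAULT_OOM */
	}
	page->locked = true;  /* the ->fault() handler would lock here */
	/* ... the preallocated table would be installed at this point ... */
	page->locked = false;
	return true;
}
```

With writeback_pending set, model_fault_alloc_under_lock() reports the
would-be deadlock while model_fault_prealloc() completes, because its
allocation happens before any page is locked. The caller owns
prealloc_pte and has to free() it.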

[1] http://lkml.kernel.org/r/1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com

 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..1a73d2d4659e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))