From patchwork Sat Jan 4 21:00:15 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 11317941
Date: Sat, 04 Jan 2020 13:00:15 -0800
From: akpm@linux-foundation.org
To: ak@linux.intel.com, akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
 dbueso@suse.de, ktkhai@virtuozzo.com, linux-mm@kvack.org, longman@redhat.com,
 mhocko@suse.com, mike.kravetz@oracle.com, mm-commits@vger.kernel.org,
 torvalds@linux-foundation.org, willy@infradead.org
Subject: [patch 14/17] mm/hugetlb: defer freeing of huge pages if in non-task context
Message-ID: <20200104210015.ImiOBe38U%akpm@linux-foundation.org>

From: Waiman Long
Subject: mm/hugetlb: defer freeing of huge pages if in non-task context

The following lockdep splat was observed when a certain hugetlbfs test
was run:

[  612.388273] ================================
[  612.411273] WARNING: inconsistent lock state
[  612.432273] 4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
[  612.469273] --------------------------------
[  612.489273] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  612.517273] swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  612.541273] ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
[  612.576273] {SOFTIRQ-ON-W} state was registered at:
[  612.598273]   lock_acquire+0x14f/0x3b0
[  612.616273]   _raw_spin_lock+0x30/0x70
[  612.634273]   __nr_hugepages_store_common+0x11b/0xb30
[  612.657273]   hugetlb_sysctl_handler_common+0x209/0x2d0
[  612.681273]   proc_sys_call_handler+0x37f/0x450
[  612.703273]   vfs_write+0x157/0x460
[  612.719273]   ksys_write+0xb8/0x170
[  612.736273]   do_syscall_64+0xa5/0x4d0
[  612.753273]   entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  612.777273] irq event stamp: 691296
[  612.794273] hardirqs last enabled at (691296): [] _raw_spin_unlock_irqrestore+0x4b/0x60
[  612.839273] hardirqs last disabled at (691295): [] _raw_spin_lock_irqsave+0x22/0x81
[  612.882273] softirqs last enabled at (691284): [] irq_enter+0xc3/0xe0
[  612.922273] softirqs last disabled at (691285): [] irq_exit+0x23e/0x2b0
[  612.962273]
[  612.962273] other info that might help us debug this:
[  612.993273]  Possible unsafe locking scenario:
[  612.993273]
[  613.020273]        CPU0
[  613.031273]        ----
[  613.042273]   lock(hugetlb_lock);
[  613.057273]   <Interrupt>
[  613.069273]     lock(hugetlb_lock);
[  613.085273]
[  613.085273]  *** DEADLOCK ***
:
[  613.245273] Call Trace:
[  613.256273]  <IRQ>
[  613.265273]  dump_stack+0x9a/0xf0
[  613.281273]  mark_lock+0xd0c/0x12f0
[  613.297273]  ? print_shortest_lock_dependencies+0x80/0x80
[  613.322273]  ? sched_clock_cpu+0x18/0x1e0
[  613.341273]  __lock_acquire+0x146b/0x48c0
[  613.360273]  ? trace_hardirqs_on+0x10/0x10
[  613.379273]  ? trace_hardirqs_on_caller+0x27b/0x580
[  613.401273]  lock_acquire+0x14f/0x3b0
[  613.419273]  ? free_huge_page+0x36f/0xaa0
[  613.440273]  _raw_spin_lock+0x30/0x70
[  613.458273]  ? free_huge_page+0x36f/0xaa0
[  613.477273]  free_huge_page+0x36f/0xaa0
[  613.495273]  bio_check_pages_dirty+0x2fc/0x5c0
[  613.516273]  clone_endio+0x17f/0x670 [dm_mod]
[  613.536273]  ? disable_discard+0x90/0x90 [dm_mod]
[  613.558273]  ? bio_endio+0x4ba/0x930
[  613.575273]  ? blk_account_io_completion+0x400/0x530
[  613.598273]  blk_update_request+0x276/0xe50
[  613.617273]  scsi_end_request+0x7b/0x6a0
[  613.636273]  ? lock_downgrade+0x6f0/0x6f0
[  613.654273]  scsi_io_completion+0x1c6/0x1570
[  613.674273]  ? sd_completed_bytes+0x3a0/0x3a0 [sd_mod]
[  613.698273]  ? scsi_mq_requeue_cmd+0xc0/0xc0
[  613.718273]  blk_done_softirq+0x22e/0x350
[  613.737273]  ? blk_softirq_cpu_dead+0x230/0x230
[  613.758273]  __do_softirq+0x23d/0xad8
[  613.776273]  irq_exit+0x23e/0x2b0
[  613.792273]  do_IRQ+0x11a/0x200
[  613.806273]  common_interrupt+0xf/0xf
[  613.823273]  </IRQ>

Both the hugetlb_lock and the subpool lock can be acquired in
free_huge_page().  One way to solve the problem is to make both locks
irq-safe.  However, Mike Kravetz had learned that the hugetlb_lock is
held for a linear scan of ALL hugetlb pages during a cgroup reparenting
operation, so irqs would be disabled for far too long unless we can
break hugetlb_lock down into finer-grained locks with shorter lock
hold times.

Another alternative is to defer the freeing to a workqueue job.  This
patch implements the deferred freeing by adding a free_hpage_workfn()
work function to do the actual freeing.  A free_huge_page() call in a
non-task context saves the page to be freed on the hpage_freelist
linked list in a lockless manner using the llist APIs.  The generic
workqueue is used to process the work, but a dedicated workqueue can be
used instead if it is desirable to have the huge page freed ASAP.

Thanks to Kirill Tkhai for suggesting the use of the llist APIs, which
simplify the code.

Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
Signed-off-by: Waiman Long
Reviewed-by: Mike Kravetz
Acked-by: Davidlohr Bueso
Acked-by: Michal Hocko
Reviewed-by: Kirill Tkhai
Cc: Aneesh Kumar K.V
Cc: Matthew Wilcox
Cc: Andi Kleen
Signed-off-by: Andrew Morton
---

 mm/hugetlb.c |   51 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

--- a/mm/hugetlb.c~mm-hugetlb-defer-freeing-of-huge-pages-if-in-non-task-context
+++ a/mm/hugetlb.c
@@ -27,6 +27,7 @@
 #include <linux/swapops.h>
 #include <linux/jhash.h>
 #include <linux/numa.h>
+#include <linux/llist.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -1136,7 +1137,7 @@ static inline void ClearPageHugeTemporar
 	page[2].mapping = NULL;
 }
 
-void free_huge_page(struct page *page)
+static void __free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1199,6 +1200,54 @@ void free_huge_page(struct page *page)
 	spin_unlock(&hugetlb_lock);
 }
 
+/*
+ * As free_huge_page() can be called from a non-task context, we have
+ * to defer the actual freeing in a workqueue to prevent potential
+ * hugetlb_lock deadlock.
+ *
+ * free_hpage_workfn() locklessly retrieves the linked list of pages to
+ * be freed and frees them one-by-one. As the page->mapping pointer is
+ * going to be cleared in __free_huge_page() anyway, it is reused as the
+ * llist_node structure of a lockless linked list of huge pages to be freed.
+ */
+static LLIST_HEAD(hpage_freelist);
+
+static void free_hpage_workfn(struct work_struct *work)
+{
+	struct llist_node *node;
+	struct page *page;
+
+	node = llist_del_all(&hpage_freelist);
+
+	while (node) {
+		page = container_of((struct address_space **)node,
+				     struct page, mapping);
+		node = node->next;
+		__free_huge_page(page);
+	}
+}
+static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
+
+void free_huge_page(struct page *page)
+{
+	/*
+	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
+	 */
+	if (!in_task()) {
+		/*
+		 * Only call schedule_work() if hpage_freelist is previously
+		 * empty. Otherwise, schedule_work() had been called but the
+		 * workfn hasn't retrieved the list yet.
+		 */
+		if (llist_add((struct llist_node *)&page->mapping,
+			      &hpage_freelist))
+			schedule_work(&free_hpage_work);
+		return;
+	}
+
+	__free_huge_page(page);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	INIT_LIST_HEAD(&page->lru);
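
[Editor's note] For readers who have not used the kernel's llist
primitives, the following is a minimal userspace analogue of the two
calls the patch leans on, built on C11 atomics purely for illustration.
The names mirror llist_add()/llist_del_all(), but this is a sketch of
the pattern, not the kernel implementation.  The property the patch
relies on is that llist_add() reports whether the list was empty before
the push, so schedule_work() is called only once per batch of deferred
pages.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct llist_node {
	struct llist_node *next;
};

struct llist_head {
	_Atomic(struct llist_node *) first;
};

/* Lockless push; returns true if the list was empty beforehand. */
static bool llist_add(struct llist_node *node, struct llist_head *head)
{
	struct llist_node *first = atomic_load(&head->first);

	do {
		node->next = first;	/* re-linked on every CAS retry */
	} while (!atomic_compare_exchange_weak(&head->first, &first, node));

	return first == NULL;
}

/* Detach the whole list in one atomic step, as the workfn does. */
static struct llist_node *llist_del_all(struct llist_head *head)
{
	return atomic_exchange(&head->first, NULL);
}

int main(void)
{
	struct llist_head freelist = { NULL };
	struct llist_node a, b;

	/* Producer side: only the first push reports "was empty". */
	printf("%d\n", llist_add(&a, &freelist));	/* prints 1 */
	printf("%d\n", llist_add(&b, &freelist));	/* prints 0 */

	/* Consumer side: grab everything, then walk it unlocked. */
	for (struct llist_node *n = llist_del_all(&freelist); n; n = n->next)
		printf("got a node\n");		/* "free" each entry here */

	return 0;
}

Note also that entries come back in LIFO order, which is harmless here
since every entry receives the same treatment.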
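
[Editor's note] The other piece worth spelling out is how
free_hpage_workfn() walks back from the llist_node overlaid on
page->mapping to the enclosing struct page.  Below is a toy
illustration of the container_of() arithmetic; struct page here is a
hypothetical two-field stand-in, and the overlay is sound only because
page->mapping is pointer-sized and already dead for a page on its way
to being freed.

#include <stddef.h>
#include <stdio.h>

/* Same arithmetic as the kernel macro: subtract the member offset. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct llist_node {
	struct llist_node *next;
};

/* Hypothetical stand-in for struct page; only what the sketch needs. */
struct page {
	unsigned long flags;
	void *mapping;	/* pointer-sized slot reused as a llist_node */
};

int main(void)
{
	struct page pg = { 0, NULL };

	/* The producer overlays a llist_node on the dead mapping field... */
	struct llist_node *node = (struct llist_node *)&pg.mapping;

	/* ...and the consumer recovers the enclosing page from it. */
	struct page *back = container_of((void **)node, struct page, mapping);

	printf("%s\n", back == &pg ? "recovered the page" : "broken");
	return 0;
}

This also explains why the kernel workfn reads node->next before
calling __free_huge_page(): freeing the page reclaims the very field
the node lives in.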