From patchwork Mon Aug 12 18:12:20 2024
From: Peter Xu <peterx@redhat.com>
Subject: [PATCH v5 2/7] mm/mprotect: Push mmu notifier to PUDs
Date: Mon, 12 Aug 2024 14:12:20 -0400
Message-ID: <20240812181225.1360970-3-peterx@redhat.com>
In-Reply-To: <20240812181225.1360970-1-peterx@redhat.com>
References: <20240812181225.1360970-1-peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Kirill A. Shutemov, Nicholas Piggin, David Hildenbrand, Matthew Wilcox,
    Andrew Morton, James Houghton, Huang Ying, Aneesh Kumar K. V,
    peterx@redhat.com, Vlastimil Babka, Rick P Edgecombe, Hugh Dickins,
    Borislav Petkov, Christophe Leroy, Michael Ellerman, Rik van Riel,
    Dan Williams, Mel Gorman, x86@kernel.org, Ingo Molnar,
    linuxppc-dev@lists.ozlabs.org, Dave Hansen, Dave Jiang, Oscar Salvador,
    Thomas Gleixner, kvm@vger.kernel.org, Sean Christopherson,
    Paolo Bonzini, David Rientjes
mprotect() invokes mmu notifiers at the PMD level. It has been that way
since the 2014 commit a5338093bfb4 ("mm: move mmu notifier call from
change_protection to change_pmd_range"). At that time, the issue was that
NUMA balancing could be applied to a huge range of VM memory even if
nothing was populated. The notification can be avoided in that case if no
valid pmd is detected, where "valid" means either a THP or a PTE pgtable
page.

Now, to pave the way for PUD handling, this isn't enough. We need to
generate mmu notifications properly on PUD entries too. mprotect() is
currently broken on PUDs (e.g., one can already easily trigger a kernel
error with dax 1G mappings); this is the start of fixing it.

To fix that, this patch pushes such notifications up to the PUD layer.
There is a risk of regressing the problem Rik wanted to resolve before,
but I don't think it should really happen, and I still chose this
solution for a few reasons:

1) Consider a large VM that should definitely contain more than GBs of
   memory: it's highly likely that the PUDs are also none, in which case
   there will be no regression.

2) KVM has evolved a lot over the years to get rid of rmap walks, which
   might have been the major cause of the previous soft-lockup.
   At least the TDP MMU has already gotten rid of rmap walks as long as
   it is not nested (which should be the major use case, IIUC), so the
   TDP MMU pgtable walker will simply see an empty VM pgtable (e.g., EPT
   on x86). Invalidating a fully empty region should in most cases be
   pretty fast now, compared to 2014.

3) KVM now has explicit code paths to give way to mmu notifiers just
   like this one, e.g., commit d02c357e5bfa ("KVM: x86/mmu: Retry fault
   before acquiring mmu_lock if mapping is changing"). That will also
   avoid contention that may contribute to a soft-lockup.

4) Sticking with the PMD layer simply doesn't work when a PUD is
   there... We need one way or another to fix PUD mappings on
   mprotect().

Pushing it to the PUD level should be the safest approach as of now;
e.g., there's no sign yet of huge P4Ds coming on any known arch.

Cc: kvm@vger.kernel.org
Cc: Sean Christopherson
Cc: Paolo Bonzini
Cc: David Rientjes
Cc: Rik van Riel
Signed-off-by: Peter Xu
---
 mm/mprotect.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 37cf8d249405..d423080e6509 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,9 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	unsigned long next;
 	long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	struct mmu_notifier_range range;
-
-	range.start = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,14 +380,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-						MMU_NOTIFY_PROTECTION_VMA, 0,
-						vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
@@ -431,9 +420,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
-	if (range.start)
-		mmu_notifier_invalidate_range_end(&range);
-
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
 	return pages;
@@ -443,22 +429,36 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
+	struct mmu_notifier_range range;
 	pud_t *pud;
 	unsigned long next;
 	long pages = 0, ret;
 
+	range.start = 0;
+
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		ret = change_prepare(vma, pud, pmd, addr, cp_flags);
-		if (ret)
-			return ret;
+		if (ret) {
+			pages = ret;
+			break;
+		}
 		if (pud_none_or_clear_bad(pud))
 			continue;
+		if (!range.start) {
+			mmu_notifier_range_init(&range,
+						MMU_NOTIFY_PROTECTION_VMA, 0,
+						vma->vm_mm, addr, end);
+			mmu_notifier_invalidate_range_start(&range);
+		}
 		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
 					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
+	if (range.start)
+		mmu_notifier_invalidate_range_end(&range);
+
 	return pages;
 }