From: Peter Xu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe, Axel Rasmussen,
    linux-arm-kernel@lists.infradead.org, x86@kernel.org, peterx@redhat.com,
    Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
    Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
    David Hildenbrand, Thomas Gleixner, kvm@vger.kernel.org, Dave Hansen,
    Alex Williamson, Yan Zhao
Subject: [PATCH 00/19] mm: Support huge pfnmaps
Date: Fri, 9 Aug 2024 12:08:50 -0400
Message-ID: <20240809160909.1023470-1-peterx@redhat.com>
X-Mailer: git-send-email 2.45.0

Overview
========

This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
plus dax 1g fix [1].  Note that this series should also apply without the
dax 1g fix series, but without it mprotect() will trigger similar errors on
PUD mappings.

This series implements huge pfnmap support for mm in general.  Huge pfnmap
allows e.g. VM_PFNMAP vmas to map at either PMD or PUD level, similar to
what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now we
extend that idea to PFN mappings, e.g. PCI MMIO bars, which can grow as
large as 8GB or even bigger.

Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last patch
(from Alex Williamson) will be the first user of huge pfnmap, so as to
enable the vfio-pci driver to fault in huge pfn mappings.

Implementation
==============

In reality, it's relatively simple to add such support compared to many
other types of mappings, because of PFNMAP's specialties when there's no
vmemmap backing it, so that most of the kernel routines on huge mappings
should simply already fail for them, like GUPs or old-school follow_page()
(which was recently rewritten into the folio_walk* APIs by David).

One trick here is that we're still immature on PUDs in generic paths here
and there, as DAX is so far the only user.  This patchset will add the 2nd
user of it.  Hugetlb can be a 3rd user if the hugetlb unification work goes
on smoothly, but that is to be discussed later.

The other trick is how to allow gup-fast to work for such huge mappings
even if there's no direct sign of knowing whether it's a normal page or an
MMIO mapping.  This series chose to keep the pte_special solution, so that
it reuses a similar idea of setting a special bit on pfnmap PMDs/PUDs so
that gup-fast will be able to identify them and fail properly.

Along the way, we'll also notice that the major pgtable pfn walker, aka,
follow_pte(), will need to retire soon due to the fact that it only works
with ptes.  A new set of simple APIs is introduced (follow_pfnmap* API) to
be able to do whatever follow_pte() can already do, plus it can also
process huge pfnmaps now.  Half of this series is about that and converting
all existing pfnmap walkers to use the new API properly.
Hopefully the new API also looks better by avoiding exposing e.g. pgtable
lock details to the callers, so that it can be used in an even more
straightforward way.

Here, three more options will be introduced and involved in huge pfnmap:

  - ARCH_SUPPORTS_HUGE_PFNMAP

    Arch developers will need to select this option in the arch's Kconfig
    when huge pfnmap is supported.  After this patchset is applied, both
    x86_64 and arm64 will start to enable it by default.

  - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP

    These options are for driver developers to identify whether the current
    arch / config supports huge pfnmaps, to decide whether they can use the
    huge pfnmap APIs to inject them.  One can refer to the last vfio-pci
    patch from Alex on how to use them properly in a device driver.

So after the whole set is applied, and if one enables some dynamic debug
lines in the vfio-pci core files, we should observe things like:

  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100

In this specific case, it says that vfio-pci faults in PMDs properly for a
few BAR0 offsets.

Patch Layout
============

Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
Patch 2:         A tiny cleanup
Patch 3-8:       Preparation patches for huge pfnmap (including introducing
                 the special bit for pmd/pud)
Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
                 then drop follow_pte() API
Patch 17:        Add huge pfnmap support for x86_64
Patch 18:        Add huge pfnmap support for arm64
Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)

TODO
====

Nothing I plan to do myself, as in our VM use case most of these don't yet
apply, but I still list some things I think might be interesting.

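For reference when reading the order = 9 debug lines above and the 2M / 1G
sizes in the items below, here is a minimal userspace sketch of the
arithmetic involved.  It is not part of the series: it assumes 4K base
pages, and the helper names are hypothetical, made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages (PAGE_SHIFT == 12). */
#define BASE_PAGE_SHIFT 12u

/* Hypothetical helper: bytes covered by one mapping of the given order. */
static uint64_t order_to_bytes(unsigned int order)
{
	/* Keep the shift within 64 bits. */
	assert(order < 64u - BASE_PAGE_SHIFT);
	return UINT64_C(1) << (BASE_PAGE_SHIFT + order);
}

/* Hypothetical helper: is a page offset (in base pages) order-aligned? */
static bool pgoff_is_aligned(uint64_t pgoff, unsigned int order)
{
	assert(order < 64u);
	return (pgoff & ((UINT64_C(1) << order) - 1)) == 0;
}
```

Under that assumption, order 9 covers 512 base pages (2M, one PMD) and
order 18 covers 1G (one PUD), and the page offsets 0x0, 0x200 and 0x400 in
the log are all multiples of 512 pages.
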
More architectures / More page sizes
------------------------------------

Currently only x86_64 (2M+1G) and arm64 (2M) are supported.

For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
on 1G will be automatically enabled.

Generically speaking, an arch will need to first support THP / THP_1G, then
provide a special bit in pmds/puds to support huge pfnmaps.

remap_pfn_range() support
-------------------------

Currently, remap_pfn_range() still only maps PTEs.  With the new option,
remap_pfn_range() can logically start to inject either PMDs or PUDs when
the alignment requirements match on the VAs.

When the support is there, it should be able to silently benefit all
drivers that are using remap_pfn_range() in their mmap() handlers, with a
better TLB hit rate and overall faster MMIO accesses, similar to processor
accesses on hugepages.

More driver support
-------------------

VFIO is so far the only consumer of huge pfnmaps after this series is
applied.  Besides the above remap_pfn_range() generic optimization, a
device driver can also try to optimize its mmap() for a better VA alignment
for either PMD/PUD sizes.  This may, iiuc, normally require userspace
changes, as the driver doesn't normally decide the VA to map a bar.  But I
don't think I know all the drivers well enough to see the full picture.

Tests
=====

- Cross-build tests that I normally do.  I only saw one bluetooth driver
  build failure on i386 PAE on top of latest mm-unstable, but it shouldn't
  be relevant.

- run_vmtests.sh whole set, no new failures (e.g. mlock2 tests already fail
  on mm-unstable)

- Hacked e1000e QEMU with 128MB BAR 0, with some prefault test, mprotect()
  and fork() tests on the bar mapped

- x86_64 + AMD GPU
  - Needs Alex's modified QEMU to guarantee proper VA alignment to make
    sure all pages are mapped with PUDs
  - Main BAR (8GB) starts to use PUD mappings
  - Sub BAR (??MBs?)
    starts to use PMD mappings
  - Performance wise, slight improvement compared to the old PTE mappings

- aarch64 + NIC
  - Detached NIC test to make sure the driver loads fine with PMD mappings

Credits all go to Alex for helping test the GPU/NIC use cases above.

Comments welcomed, thanks.

[1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com

Alex Williamson (1):
  vfio/pci: Implement huge_fault support

Peter Xu (18):
  mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
  mm: Drop is_huge_zero_pud()
  mm: Mark special bits for huge pfn mappings when inject
  mm: Allow THP orders for PFNMAPs
  mm/gup: Detect huge pfnmap entries in gup-fast
  mm/pagewalk: Check pfnmap early for folio_walk_start()
  mm/fork: Accept huge pfnmap entries
  mm: Always define pxx_pgprot()
  mm: New follow_pfnmap API
  KVM: Use follow_pfnmap API
  s390/pci_mmio: Use follow_pfnmap API
  mm/x86/pat: Use the new follow_pfnmap API
  vfio: Use the new follow_pfnmap API
  acrn: Use the new follow_pfnmap API
  mm/access_process_vm: Use the new follow_pfnmap API
  mm: Remove follow_pte()
  mm/x86: Support large pfn mappings
  mm/arm64: Support large pfn mappings

 arch/arm64/Kconfig                  |   1 +
 arch/arm64/include/asm/pgtable.h    |  30 +++++
 arch/powerpc/include/asm/pgtable.h  |   1 +
 arch/s390/include/asm/pgtable.h     |   1 +
 arch/s390/pci/pci_mmio.c            |  22 ++--
 arch/sparc/include/asm/pgtable_64.h |   1 +
 arch/x86/Kconfig                    |   1 +
 arch/x86/include/asm/pgtable.h      |  80 ++++++++-----
 arch/x86/mm/pat/memtype.c           |  17 ++-
 drivers/vfio/pci/vfio_pci_core.c    |  60 +++++++---
 drivers/vfio/vfio_iommu_type1.c     |  16 +--
 drivers/virt/acrn/mm.c              |  16 +--
 include/linux/huge_mm.h             |  16 +--
 include/linux/mm.h                  |  57 ++++++++-
 include/linux/pgtable.h             |  12 ++
 mm/Kconfig                          |  13 ++
 mm/gup.c                            |   6 +
 mm/huge_memory.c                    |  48 +++++---
 mm/memory.c                         | 178 ++++++++++++++++++++--------
 mm/pagewalk.c                       |   5 +
 virt/kvm/kvm_main.c                 |  19 ++-
 21 files changed, 422 insertions(+), 178 deletions(-)
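
P.S. On the remap_pfn_range() TODO item above: a rough userspace sketch of
what "alignment requirements match" could mean when picking a mapping level.
This is only an illustration under my own assumptions (4K base pages, PMD
order 9, PUD order 18, and the PFN needing the same alignment as the VA).
The helper names are hypothetical and are not what the series or the kernel
implements.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions: 4 KiB base pages, 2 MiB PMD entries, 1 GiB PUD entries. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PMD_ORDER   9u
#define SKETCH_PUD_ORDER  18u

/*
 * Hypothetical check: can one huge entry of the given order map
 * [va, va + len) starting at pfn?  Both the virtual address and the
 * PFN must be aligned, and at least one full entry must remain.
 */
static bool huge_entry_fits(uint64_t va, uint64_t pfn, uint64_t len,
			    unsigned int order)
{
	uint64_t size = UINT64_C(1) << (SKETCH_PAGE_SHIFT + order);
	uint64_t pfn_mask = (UINT64_C(1) << order) - 1;

	return (va & (size - 1)) == 0 &&
	       (pfn & pfn_mask) == 0 &&
	       len >= size;
}

/* Pick the largest usable order: PUD first, then PMD, else 0 (PTE). */
static unsigned int best_order(uint64_t va, uint64_t pfn, uint64_t len)
{
	if (huge_entry_fits(va, pfn, len, SKETCH_PUD_ORDER))
		return SKETCH_PUD_ORDER;
	if (huge_entry_fits(va, pfn, len, SKETCH_PMD_ORDER))
		return SKETCH_PMD_ORDER;
	return 0;
}
```

A mapping loop would call best_order() at each step and advance by the size
it returns, falling back to PTEs whenever either alignment does not hold.
That fallback is also why a driver's mmap() VA alignment matters for
whether huge entries are ever used.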