From patchwork Tue Apr 30 20:40:43 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 13649975
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand, Andrew Morton,
 Vincent Donnefort, Dan Williams
Subject: [PATCH v1 1/2] mm/memory: cleanly support zeropage in
 vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()
Date: Tue, 30 Apr 2024 22:40:43 +0200
Message-ID: <20240430204044.52755-2-david@redhat.com>
In-Reply-To: <20240430204044.52755-1-david@redhat.com>
References: <20240430204044.52755-1-david@redhat.com>

For now we only get the (small) zeropage mapped to user space in four
cases (excluding VM_PFNMAP mappings, such as /dev/mem):

(1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON):
    do_anonymous_page() will not refcount it and map it pte_mkspecial()
(2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem
    (MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and
    map it pte_mkspecial().
(3) KSM in mergeable VMA (anonymous VMA or COW mapping).
    cmp_and_merge_page() will not refcount it and map it
    pte_mkspecial().
(4) FSDAX as an optimization for holes.
    vmf_insert_mixed()->__vm_insert_mixed() might end up calling
    insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the
    zeropage and not mapping it pte_mkspecial(). With
    CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will not
    refcount it and map it pte_mkspecial().

In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets
VM_MIXEDMAP, we removed it for ext4 fsdax in commit e1fb4a086495 ("dax:
remove VM_MIXEDMAP for fsdax and device dax") and for XFS in commit
e1fb4a086495 ("dax: remove VM_MIXEDMAP for fsdax and device dax").

Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP,
vm_normal_page() would currently return the zeropage. We'll refcount the
zeropage when mapping and when unmapping.

Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP,
vm_normal_page() would currently refuse to return the zeropage. So we'd
refcount it when mapping but not when unmapping it ... do we have fsdax
without CONFIG_ARCH_HAS_PTE_SPECIAL in practice? Hard to tell.

Independent of that, we should never refcount the zeropage when we might
be holding that reference for a long time, because even without an
accounting imbalance we might overflow the refcount.

As there is interest in using the zeropage also in other VM_MIXEDMAP
mappings, let's add clean support for that in the cases where it makes
sense:

(A) Never refcount the zeropage when mapping it:

  In insert_page(), special-case the zeropage, do not refcount it, and use
  pte_mkspecial(). Don't involve insert_pfn(), adjusting insert_page()
  looks cleaner than branching off to insert_pfn().

(B) Never refcount the zeropage when unmapping it:

  In vm_normal_page(), also don't return the zeropage in a VM_MIXEDMAP
  mapping without CONFIG_ARCH_HAS_PTE_SPECIAL. Add a VM_WARN_ON_ONCE()
  sanity check if we'd ever return the zeropage, which could happen if
  someone forgets to set pte_mkspecial() when mapping the zeropage.
  Document that.

(C) Allow the zeropage only where reasonable

  s390x never wants the zeropage in some processes running legacy KVM
  guests that make use of storage keys. So disallow that.

  Further, using the zeropage in COW mappings is unproblematic (just what
  we do for other COW mappings), because FAULT_FLAG_UNSHARE can just
  unshare it and GUP with FOLL_LONGTERM would work as expected.

  Similarly, mappings that can never have writable PTEs (implying no
  write faults) are also not problematic, because nothing could end up
  mapping the PTE writable by mistake later.

  But in case we could have writable PTEs, we'll only allow the zeropage
  in FSDAX VMAs, that are incompatible with GUP and are blocked there
  completely.

  We'll always require the zeropage to be mapped with pte_special().
  GUP-fast will reject the zeropage that way, but GUP-slow will allow it.
  (Note that GUP does not refcount the zeropage with FOLL_PIN,
  because there were issues with overflowing the refcount in the past).

Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to
catch early during testing if we'd ever find a zeropage unexpectedly in
code that wants to upgrade write permissions.

Convert the BUG_ON in vm_mixed_ok() to an ordinary check and simply fail
with VM_FAULT_SIGBUS, like we do for other sanity checks. Drop the stale
comment regarding reserved pages from insert_page().

Note that:
* we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and
  vmf_insert_pfn() would allow the zeropage in some cases and
  not refcount it.
* vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP
  mappings and we'll leave that alone for now. People can simply use
  one of the other interfaces.
* we won't bother with the huge zeropage for now. It's never
  PTE-mapped and also GUP does not special-case it yet.
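
For illustration only, the decision order described in (C) can be modelled
in a few lines of userspace C. This is a sketch under stated assumptions,
not kernel code: struct vma_model, its boolean fields and the function
names zeropage_allowed() and zeropage_model_selftest() are hypothetical
stand-ins for the mm_forbids_zeropage(), is_cow_mapping(),
VM_WRITE/VM_MAYWRITE, vm_ops->pfn_mkwrite, vma_is_fsdax() and VM_IO checks
used by the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the VMA properties the policy looks at. */
struct vma_model {
	bool mm_forbids_zeropage; /* e.g. s390x process using storage keys */
	bool is_cow;              /* private mapping that could become writable */
	bool may_write;           /* VM_WRITE or VM_MAYWRITE would be set */
	bool has_pfn_mkwrite;     /* vm_ops->pfn_mkwrite would be provided */
	bool is_fsdax;            /* FSDAX VMA */
	bool is_io;               /* VM_IO would be set */
};

/* Assumed name; mirrors the order of checks described in (C). */
static bool zeropage_allowed(const struct vma_model *v)
{
	/* The mm opted out of the zeropage entirely. */
	if (v->mm_forbids_zeropage)
		return false;
	/* COW mappings: the zeropage can be unshared on demand. */
	if (v->is_cow)
		return true;
	/* No writable PTEs possible: nothing can map it writable later. */
	if (!v->may_write)
		return true;
	/* Writable PTEs possible: only where GUP cannot longterm-pin it. */
	return v->has_pfn_mkwrite && (v->is_fsdax || v->is_io);
}

/* Small self-check exercising one VMA per rule. */
void zeropage_model_selftest(void)
{
	struct vma_model forbidden = { .mm_forbids_zeropage = true, .is_cow = true };
	struct vma_model cow = { .is_cow = true, .may_write = true };
	struct vma_model readonly_shared = { .may_write = false };
	struct vma_model shared_rw = { .may_write = true };
	struct vma_model fsdax = { .may_write = true, .has_pfn_mkwrite = true,
				   .is_fsdax = true };

	assert(!zeropage_allowed(&forbidden));
	assert(zeropage_allowed(&cow));
	assert(zeropage_allowed(&readonly_shared));
	assert(!zeropage_allowed(&shared_rw));
	assert(zeropage_allowed(&fsdax));
}
```

Calling zeropage_model_selftest() from a test program's main() exercises
one model VMA per rule. The ordering matters: the mm-wide opt-out is
checked first so that it overrides the otherwise permissive COW case.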

Signed-off-by: David Hildenbrand
---
 mm/memory.c   | 92 +++++++++++++++++++++++++++++++++++++++------------
 mm/mprotect.c |  2 ++
 2 files changed, 73 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eea6e4984eaef..5fffc9bd3febd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -575,10 +575,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
  * page" backing, however the difference is that _all_ pages with a struct
  * page (that is, those where pfn_valid is true) are refcounted and considered
- * normal pages by the VM. The disadvantage is that pages are refcounted
- * (which can be slower and simply not an option for some PFNMAP users). The
- * advantage is that we don't have to follow the strict linearity rule of
- * PFNMAP mappings in order to support COWable mappings.
+ * normal pages by the VM. The only exception are zeropages, which are
+ * *never* refcounted.
+ *
+ * The disadvantage is that pages are refcounted (which can be slower and
+ * simply not an option for some PFNMAP users). The advantage is that we
+ * don't have to follow the strict linearity rule of PFNMAP mappings in
+ * order to support COWable mappings.
  *
  */
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -616,6 +619,8 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		if (vma->vm_flags & VM_MIXEDMAP) {
 			if (!pfn_valid(pfn))
 				return NULL;
+			if (is_zero_pfn(pfn))
+				return NULL;
 			goto out;
 		} else {
 			unsigned long off;
@@ -641,6 +646,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 	 * eg. VDSO mappings can cause them to exist.
 	 */
 out:
+	VM_WARN_ON_ONCE(is_zero_pfn(pfn));
 	return pfn_to_page(pfn);
 }
 
@@ -1983,10 +1989,47 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 	return pte_alloc_map_lock(mm, pmd, addr, ptl);
 }
 
-static int validate_page_before_insert(struct page *page)
+static bool vm_mixed_zeropage_allowed(struct vm_area_struct *vma)
+{
+	VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
+	/*
+	 * Whoever wants to forbid the zeropage after some zeropages
+	 * might already have been mapped has to scan the page tables and
+	 * bail out on any zeropages. Zeropages in COW mappings can
+	 * be unshared using FAULT_FLAG_UNSHARE faults.
+	 */
+	if (mm_forbids_zeropage(vma->vm_mm))
+		return false;
+	/* zeropages in COW mappings are common and unproblematic. */
+	if (is_cow_mapping(vma->vm_flags))
+		return true;
+	/* Mappings that do not allow for writable PTEs are unproblematic. */
+	if (!(vma->vm_flags & (VM_WRITE | VM_MAYWRITE)))
+		return true;
+	/*
+	 * Why not allow any VMA that has vm_ops->pfn_mkwrite? GUP could
+	 * find the shared zeropage and longterm-pin it, which would
+	 * be problematic as soon as the zeropage gets replaced by a different
+	 * page due to vma->vm_ops->pfn_mkwrite, because what's mapped would
+	 * now differ to what GUP looked up. FSDAX is incompatible to
+	 * FOLL_LONGTERM and VM_IO is incompatible to GUP completely (see
+	 * check_vma_flags).
+	 */
+	return vma->vm_ops && vma->vm_ops->pfn_mkwrite &&
+	       (vma_is_fsdax(vma) || vma->vm_flags & VM_IO);
+}
+
+static int validate_page_before_insert(struct vm_area_struct *vma,
+				       struct page *page)
 {
 	struct folio *folio = page_folio(page);
 
+	if (unlikely(is_zero_folio(folio))) {
+		if (!vm_mixed_zeropage_allowed(vma))
+			return -EINVAL;
+		return 0;
+	}
+
 	if (folio_test_anon(folio) || folio_test_slab(folio) ||
 	    page_has_type(page))
 		return -EINVAL;
@@ -1998,24 +2041,23 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
 			unsigned long addr, struct page *page, pgprot_t prot)
 {
 	struct folio *folio = page_folio(page);
+	pte_t pteval;
 
 	if (!pte_none(ptep_get(pte)))
 		return -EBUSY;
 	/* Ok, finally just insert the thing.. */
-	folio_get(folio);
-	inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
-	folio_add_file_rmap_pte(folio, page, vma);
-	set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
+	pteval = mk_pte(page, prot);
+	if (unlikely(is_zero_folio(folio))) {
+		pteval = pte_mkspecial(pteval);
+	} else {
+		folio_get(folio);
+		inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
+		folio_add_file_rmap_pte(folio, page, vma);
+	}
+	set_pte_at(vma->vm_mm, addr, pte, pteval);
 	return 0;
 }
 
-/*
- * This is the old fallback for page remapping.
- *
- * For historical reasons, it only allows reserved pages. Only
- * old drivers should use this, and they needed to mark their
- * pages reserved for the old functions anyway.
- */
 static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 			struct page *page, pgprot_t prot)
 {
@@ -2023,7 +2065,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	retval = validate_page_before_insert(page);
+	retval = validate_page_before_insert(vma, page);
 	if (retval)
 		goto out;
 	retval = -ENOMEM;
@@ -2043,7 +2085,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
 	if (!page_count(page))
 		return -EINVAL;
 
-	err = validate_page_before_insert(page);
+	err = validate_page_before_insert(vma, page);
 	if (err)
 		return err;
 	return insert_page_into_pte_locked(vma, pte, addr, page, prot);
@@ -2149,7 +2191,8 @@ EXPORT_SYMBOL(vm_insert_pages);
  * @page: source kernel page
  *
  * This allows drivers to insert individual pages they've allocated
- * into a user vma.
+ * into a user vma. The zeropage is supported in some VMAs,
+ * see vm_mixed_zeropage_allowed().
  *
  * The page has to be a nice clean _individual_ kernel allocation.
  * If you allocate a compound page, you need to have marked it as
@@ -2195,6 +2238,8 @@ EXPORT_SYMBOL(vm_insert_page);
  * @offset: user's requested vm_pgoff
  *
  * This allows drivers to map range of kernel pages into a user vma.
+ * The zeropage is supported in some VMAs, see
+ * vm_mixed_zeropage_allowed().
  *
  * Return: 0 on success and error code otherwise.
  */
@@ -2410,8 +2455,11 @@ vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vmf_insert_pfn);
 
-static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
+static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn, bool mkwrite)
 {
+	if (unlikely(is_zero_pfn(pfn_t_to_pfn(pfn))) &&
+	    (mkwrite || !vm_mixed_zeropage_allowed(vma)))
+		return false;
 	/* these checks mirror the abort conditions in vm_normal_page */
 	if (vma->vm_flags & VM_MIXEDMAP)
 		return true;
@@ -2430,7 +2478,8 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
 	pgprot_t pgprot = vma->vm_page_prot;
 	int err;
 
-	BUG_ON(!vm_mixed_ok(vma, pfn));
+	if (!vm_mixed_ok(vma, pfn, mkwrite))
+		return VM_FAULT_SIGBUS;
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
@@ -3178,6 +3227,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
 	pte_t entry;
 
 	VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
+	VM_WARN_ON(is_zero_pfn(pte_pfn(vmf->orig_pte)));
 
 	if (folio) {
 		VM_BUG_ON(folio_test_anon(folio) &&
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8c6cd88252738..888ef66468dbd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -71,6 +71,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 		return page && PageAnon(page) && PageAnonExclusive(page);
 	}
 
+	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
+
 	/*
 	 * Writable MAP_SHARED mapping: "clean" might indicate that the FS still
 	 * needs a real write-fault for writenotify