| Message ID | 02216c30a733ecc84951f9aeb1130cef7497125d.1736488799.git-series.apopple@nvidia.com (mailing list archive) |
|---|---|
| State | New |
| Series | fs/dax: Fix ZONE_DEVICE page reference counts |
Alistair Popple wrote:
> Currently DAX folio/page reference counts are managed differently to
> normal pages. To allow these to be managed the same as normal pages
> introduce vmf_insert_folio_pmd. This will map the entire PMD-sized folio
> and take references as it would for a normally mapped page.
>
> This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
> simply inserts a special devmap PMD entry into the page table without
> holding a reference to the page for the mapping.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Changes for v5:
> - Minor code cleanup suggested by David
> ---
>  include/linux/huge_mm.h |  1 +-
>  mm/huge_memory.c        | 54 ++++++++++++++++++++++++++++++++++--------
>  2 files changed, 45 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5bd1ff7..3633bd3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>
>  vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>  vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> +vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write);
>  vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write);
>
>  enum transparent_hugepage_flag {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 256adc3..d1ea76e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1381,14 +1381,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  {
>      struct mm_struct *mm = vma->vm_mm;
>      pmd_t entry;
> -    spinlock_t *ptl;
>
> -    ptl = pmd_lock(mm, pmd);

Apply this comment to the previous patch too, but I think this would be
more self-documenting as:

    lockdep_assert_held(pmd_lock(mm, pmd));

...to make it clear in this diff and into the future what the locking
constraints of this function are.

After that you can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
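For readers following along, the assertion Dan is suggesting would presumably sit at the top of insert_pfn_pmd() once the locking moves to the callers. A minimal sketch, assuming pmd_lockptr() is the helper used to fetch the lock pointer without acquiring it (pmd_lock() takes the lock as a side effect), might look like:

```c
static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
                           pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
                           pgtable_t pgtable)
{
    struct mm_struct *mm = vma->vm_mm;
    pmd_t entry;

    /*
     * Callers are now expected to hold the PMD page table lock; make
     * that constraint explicit instead of relying on the calling
     * convention alone.
     */
    lockdep_assert_held(pmd_lockptr(mm, pmd));

    /* ... remainder of insert_pfn_pmd() unchanged ... */
}
```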
> +vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write)
> +{
> +    struct vm_area_struct *vma = vmf->vma;
> +    unsigned long addr = vmf->address & PMD_MASK;
> +    struct mm_struct *mm = vma->vm_mm;
> +    spinlock_t *ptl;
> +    pgtable_t pgtable = NULL;
> +
> +    if (addr < vma->vm_start || addr >= vma->vm_end)
> +        return VM_FAULT_SIGBUS;
> +
> +    if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
> +        return VM_FAULT_SIGBUS;
> +
> +    if (arch_needs_pgtable_deposit()) {
> +        pgtable = pte_alloc_one(vma->vm_mm);
> +        if (!pgtable)
> +            return VM_FAULT_OOM;
> +    }

This is interesting and nasty at the same time (only to make ppc64 book3s
with hash tables happy). But it seems to be the right thing to do.

> +
> +    ptl = pmd_lock(mm, vmf->pmd);
> +    if (pmd_none(*vmf->pmd)) {
> +        folio_get(folio);
> +        folio_add_file_rmap_pmd(folio, &folio->page, vma);
> +        add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> +    }
> +    insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
> +        vma->vm_page_prot, write, pgtable);
> +    spin_unlock(ptl);
> +    if (pgtable)
> +        pte_free(mm, pgtable);

Ehm, are you unconditionally freeing the pgtable, even if consumed by
insert_pfn_pmd() ?

Note that setting pgtable to NULL in insert_pfn_pmd() when consumed will
not be visible here.

You'd have to pass a pointer to the ... pointer (&pgtable).

... unless I am missing something, staring at the diff.
David Hildenbrand wrote:
> > +vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write)
> > +{
> > +    struct vm_area_struct *vma = vmf->vma;
> > +    unsigned long addr = vmf->address & PMD_MASK;
> > +    struct mm_struct *mm = vma->vm_mm;
> > +    spinlock_t *ptl;
> > +    pgtable_t pgtable = NULL;
> > +
> > +    if (addr < vma->vm_start || addr >= vma->vm_end)
> > +        return VM_FAULT_SIGBUS;
> > +
> > +    if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
> > +        return VM_FAULT_SIGBUS;
> > +
> > +    if (arch_needs_pgtable_deposit()) {
> > +        pgtable = pte_alloc_one(vma->vm_mm);
> > +        if (!pgtable)
> > +            return VM_FAULT_OOM;
> > +    }
>
> This is interesting and nasty at the same time (only to make ppc64 book3s
> with hash tables happy). But it seems to be the right thing to do.
>
> > +
> > +    ptl = pmd_lock(mm, vmf->pmd);
> > +    if (pmd_none(*vmf->pmd)) {
> > +        folio_get(folio);
> > +        folio_add_file_rmap_pmd(folio, &folio->page, vma);
> > +        add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> > +    }
> > +    insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
> > +        vma->vm_page_prot, write, pgtable);
> > +    spin_unlock(ptl);
> > +    if (pgtable)
> > +        pte_free(mm, pgtable);
>
> Ehm, are you unconditionally freeing the pgtable, even if consumed by
> insert_pfn_pmd() ?
>
> Note that setting pgtable to NULL in insert_pfn_pmd() when consumed will
> not be visible here.
>
> You'd have to pass a pointer to the ... pointer (&pgtable).
>
> ... unless I am missing something, staring at the diff.

In fact I glazed over the fact that this has been commented on before
and assumed it was fixed:

http://lore.kernel.org/66f61ce4da80_964f2294fb@dwillia2-xfh.jf.intel.com.notmuch

So, yes, insert_pfn_pmd needs to take &pgtable to report back if the
allocation got consumed.

Good catch.
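A sketch of the pointer-to-pointer approach David and Dan describe, under the assumption that insert_pfn_pmd() clears the caller's pgtable when the deposit is consumed (illustrative only; the next message notes the series went with a return code instead):

```c
/*
 * Hypothetical variant: insert_pfn_pmd() takes a pgtable_t * and NULLs it
 * out when the preallocated page table is deposited, so the caller's
 * cleanup check sees the updated value.
 */
static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
                           pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
                           pgtable_t *pgtable)
{
    /* ... */
    if (*pgtable) {
        pgtable_trans_huge_deposit(vma->vm_mm, pmd, *pgtable);
        mm_inc_nr_ptes(vma->vm_mm);
        *pgtable = NULL;    /* consumed: caller must not free it */
    }
    /* ... */
}

/* Caller side in vmf_insert_folio_pmd(): */
    insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
                   vma->vm_page_prot, write, &pgtable);
    spin_unlock(ptl);
    if (pgtable)            /* still set, so the deposit was not used */
        pte_free(mm, pgtable);
```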
On Tue, Jan 14, 2025 at 09:22:00AM -0800, Dan Williams wrote:
> David Hildenbrand wrote:
> > > +vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write)
> > > +{
> > > +    struct vm_area_struct *vma = vmf->vma;
> > > +    unsigned long addr = vmf->address & PMD_MASK;
> > > +    struct mm_struct *mm = vma->vm_mm;
> > > +    spinlock_t *ptl;
> > > +    pgtable_t pgtable = NULL;
> > > +
> > > +    if (addr < vma->vm_start || addr >= vma->vm_end)
> > > +        return VM_FAULT_SIGBUS;
> > > +
> > > +    if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
> > > +        return VM_FAULT_SIGBUS;
> > > +
> > > +    if (arch_needs_pgtable_deposit()) {
> > > +        pgtable = pte_alloc_one(vma->vm_mm);
> > > +        if (!pgtable)
> > > +            return VM_FAULT_OOM;
> > > +    }
> >
> > This is interesting and nasty at the same time (only to make ppc64 book3s
> > with hash tables happy). But it seems to be the right thing to do.
> >
> > > +
> > > +    ptl = pmd_lock(mm, vmf->pmd);
> > > +    if (pmd_none(*vmf->pmd)) {
> > > +        folio_get(folio);
> > > +        folio_add_file_rmap_pmd(folio, &folio->page, vma);
> > > +        add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> > > +    }
> > > +    insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
> > > +        vma->vm_page_prot, write, pgtable);
> > > +    spin_unlock(ptl);
> > > +    if (pgtable)
> > > +        pte_free(mm, pgtable);
> >
> > Ehm, are you unconditionally freeing the pgtable, even if consumed by
> > insert_pfn_pmd() ?
> >
> > Note that setting pgtable to NULL in insert_pfn_pmd() when consumed will
> > not be visible here.
> >
> > You'd have to pass a pointer to the ... pointer (&pgtable).
> >
> > ... unless I am missing something, staring at the diff.
>
> In fact I glazed over the fact that this has been commented on before
> and assumed it was fixed:
>
> http://lore.kernel.org/66f61ce4da80_964f2294fb@dwillia2-xfh.jf.intel.com.notmuch
>
> So, yes, insert_pfn_pmd needs to take &pgtable to report back if the
> allocation got consumed.
>
> Good catch.

Yes, thanks Dave and Dan and apologies for missing that originally. Looking
at the thread I suspect I went down the rabbit hole of trying to implement
vmf_insert_folio() and when that wasn't possible forgot to come back and
fix this up.

I have added a return code to insert_pfn_pmd() to indicate whether or not
the pgtable was consumed. I have also added a comment in the commit log
explaining why a vmf_insert_folio() isn't useful.

 - Alistair
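The return-code approach Alistair describes could plausibly look like the following on the caller side, where a non-zero return from insert_pfn_pmd() means the preallocated page table was not deposited. The names and exact return convention here are illustrative, not the code that was ultimately posted:

```c
vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
                                bool write)
{
    struct vm_area_struct *vma = vmf->vma;
    unsigned long addr = vmf->address & PMD_MASK;
    struct mm_struct *mm = vma->vm_mm;
    pgtable_t pgtable = NULL;
    spinlock_t *ptl;
    int error;

    /* ... same bounds, folio_order() and deposit preallocation checks ... */

    ptl = pmd_lock(mm, vmf->pmd);
    if (pmd_none(*vmf->pmd)) {
        folio_get(folio);
        folio_add_file_rmap_pmd(folio, &folio->page, vma);
        add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
    }
    /* 0 here means the PMD was installed and any deposit was consumed */
    error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
                           vma->vm_page_prot, write, pgtable);
    spin_unlock(ptl);

    /* Only free the preallocated page table if it was not deposited */
    if (error && pgtable)
        pte_free(mm, pgtable);

    return VM_FAULT_NOPAGE;
}
```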
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5bd1ff7..3633bd3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write);
 vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write);
 
 enum transparent_hugepage_flag {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 256adc3..d1ea76e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1381,14 +1381,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 {
     struct mm_struct *mm = vma->vm_mm;
     pmd_t entry;
-    spinlock_t *ptl;
 
-    ptl = pmd_lock(mm, pmd);
     if (!pmd_none(*pmd)) {
         if (write) {
             if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
                 WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
-                goto out_unlock;
+                return;
             }
             entry = pmd_mkyoung(*pmd);
             entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1396,7 +1394,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
                 update_mmu_cache_pmd(vma, addr, pmd);
         }
 
-        goto out_unlock;
+        return;
     }
 
     entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
@@ -1417,11 +1415,6 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 
     set_pmd_at(mm, addr, pmd, entry);
     update_mmu_cache_pmd(vma, addr, pmd);
-
-out_unlock:
-    spin_unlock(ptl);
-    if (pgtable)
-        pte_free(mm, pgtable);
 }
 
 /**
@@ -1440,6 +1433,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
     struct vm_area_struct *vma = vmf->vma;
     pgprot_t pgprot = vma->vm_page_prot;
     pgtable_t pgtable = NULL;
+    spinlock_t *ptl;
 
     /*
      * If we had pmd_special, we could avoid all these restrictions,
@@ -1462,12 +1456,52 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
     }
 
     track_pfn_insert(vma, &pgprot, pfn);
-
+    ptl = pmd_lock(vma->vm_mm, vmf->pmd);
     insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
+    spin_unlock(ptl);
+    if (pgtable)
+        pte_free(vma->vm_mm, pgtable);
+
     return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
 
+vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write)
+{
+    struct vm_area_struct *vma = vmf->vma;
+    unsigned long addr = vmf->address & PMD_MASK;
+    struct mm_struct *mm = vma->vm_mm;
+    spinlock_t *ptl;
+    pgtable_t pgtable = NULL;
+
+    if (addr < vma->vm_start || addr >= vma->vm_end)
+        return VM_FAULT_SIGBUS;
+
+    if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
+        return VM_FAULT_SIGBUS;
+
+    if (arch_needs_pgtable_deposit()) {
+        pgtable = pte_alloc_one(vma->vm_mm);
+        if (!pgtable)
+            return VM_FAULT_OOM;
+    }
+
+    ptl = pmd_lock(mm, vmf->pmd);
+    if (pmd_none(*vmf->pmd)) {
+        folio_get(folio);
+        folio_add_file_rmap_pmd(folio, &folio->page, vma);
+        add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
+    }
+    insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
+        vma->vm_page_prot, write, pgtable);
+    spin_unlock(ptl);
+    if (pgtable)
+        pte_free(mm, pgtable);
+
+    return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
+
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
 {
Currently DAX folio/page reference counts are managed differently to
normal pages. To allow these to be managed the same as normal pages
introduce vmf_insert_folio_pmd. This will map the entire PMD-sized folio
and take references as it would for a normally mapped page.

This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
simply inserts a special devmap PMD entry into the page table without
holding a reference to the page for the mapping.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

---

Changes for v5:
- Minor code cleanup suggested by David
---
 include/linux/huge_mm.h |  1 +-
 mm/huge_memory.c        | 54 ++++++++++++++++++++++++++++++++++--------
 2 files changed, 45 insertions(+), 10 deletions(-)
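To see how the new helper is meant to be consumed, here is a hypothetical PMD fault-handler sketch. dax_grab_pmd_folio() is a made-up placeholder for whatever lookup hands back a PMD-order folio; the rest uses existing interfaces:

```c
static vm_fault_t example_huge_fault(struct vm_fault *vmf, unsigned int order)
{
    bool write = vmf->flags & FAULT_FLAG_WRITE;
    struct folio *folio;

    if (order != PMD_ORDER)
        return VM_FAULT_FALLBACK;

    folio = dax_grab_pmd_folio(vmf);    /* placeholder, not a real API */
    if (!folio)
        return VM_FAULT_FALLBACK;

    /*
     * Unlike vmf_insert_pfn_pmd(), this takes a reference on the folio
     * and accounts the mapping like a normal file-backed page.
     */
    return vmf_insert_folio_pmd(vmf, folio, write);
}
```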