Message ID | 60fcfaa3df47885b1df9b064ecb3d4e366fc07e7.1736488799.git-series.apopple@nvidia.com |
---|---|
State | New |
Headers | show |
Series | fs/dax: Fix ZONE_DEVICE page reference counts | expand |
Alistair Popple wrote: > Currently DAX folio/page reference counts are managed differently to > normal pages. To allow these to be managed the same as normal pages > introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio > and take references as it would for a normally mapped page. > > This is distinct from the current mechanism, vmf_insert_pfn_pud, which > simply inserts a special devmap PUD entry into the page table without > holding a reference to the page for the mapping. > > Signed-off-by: Alistair Popple <apopple@nvidia.com> Looks correct for what it is: Reviewed-by: Dan Williams <dan.j.williams@intel.com>
On 10.01.25 07:00, Alistair Popple wrote: > Currently DAX folio/page reference counts are managed differently to > normal pages. To allow these to be managed the same as normal pages > introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio > and take references as it would for a normally mapped page. > > This is distinct from the current mechanism, vmf_insert_pfn_pud, which > simply inserts a special devmap PUD entry into the page table without > holding a reference to the page for the mapping. > > Signed-off-by: Alistair Popple <apopple@nvidia.com> [...] > +/** > + * vmf_insert_folio_pud - insert a pud size folio mapped by a pud entry > + * @vmf: Structure describing the fault > + * @folio: folio to insert > + * @write: whether it's a write fault > + * > + * Return: vm_fault_t value. > + */ > +vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write) > +{ > + struct vm_area_struct *vma = vmf->vma; > + unsigned long addr = vmf->address & PUD_MASK; > + pud_t *pud = vmf->pud; > + struct mm_struct *mm = vma->vm_mm; > + spinlock_t *ptl; > + > + if (addr < vma->vm_start || addr >= vma->vm_end) > + return VM_FAULT_SIGBUS; > + > + if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER)) > + return VM_FAULT_SIGBUS; > + > + ptl = pud_lock(mm, pud); > + if (pud_none(*vmf->pud)) { > + folio_get(folio); > + folio_add_file_rmap_pud(folio, &folio->page, vma); > + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR); > + } > + insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), write); This looks scary at first (inserting something when not taking a reference), but insert_pfn_pud() seems to handle that. A comment here would have been nice. It's weird, though, that if there is already something else, that we only WARN but don't actually return an error. So ... > + spin_unlock(ptl); > + > + return VM_FAULT_NOPAGE; I assume always returning VM_FAULT_NOPAGE, even when something went wrong, is the right thing to do? Apart from that LGTM.
On Tue, Jan 14, 2025 at 05:22:15PM +0100, David Hildenbrand wrote: > On 10.01.25 07:00, Alistair Popple wrote: > > Currently DAX folio/page reference counts are managed differently to > > normal pages. To allow these to be managed the same as normal pages > > introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio > > and take references as it would for a normally mapped page. > > > > This is distinct from the current mechanism, vmf_insert_pfn_pud, which > > simply inserts a special devmap PUD entry into the page table without > > holding a reference to the page for the mapping. > > > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > > [...] > > > +/** > > + * vmf_insert_folio_pud - insert a pud size folio mapped by a pud entry > > + * @vmf: Structure describing the fault > > + * @folio: folio to insert > > + * @write: whether it's a write fault > > + * > > + * Return: vm_fault_t value. > > + */ > > +vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write) > > +{ > > + struct vm_area_struct *vma = vmf->vma; > > + unsigned long addr = vmf->address & PUD_MASK; > > + pud_t *pud = vmf->pud; > > + struct mm_struct *mm = vma->vm_mm; > > + spinlock_t *ptl; > > + > > + if (addr < vma->vm_start || addr >= vma->vm_end) > > + return VM_FAULT_SIGBUS; > > + > > + if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER)) > > + return VM_FAULT_SIGBUS; > > + > > + ptl = pud_lock(mm, pud); > > + if (pud_none(*vmf->pud)) { > > + folio_get(folio); > > + folio_add_file_rmap_pud(folio, &folio->page, vma); > > + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR); > > + } > > + insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), write); > > This looks scary at first (inserting something when not taking a reference), > but insert_pfn_pud() seems to handle that. A comment here would have been > nice. Indeed, I will add one. > It's weird, though, that if there is already something else, that we only > WARN but don't actually return an error. So ... Note we only WARN when there is already a mapping there and we're trying to upgrade it to writeable. This just mimics the logic which currently exists in insert_pfn() and insert_pfn_pmd(). The comment in insert_pfn() sheds more light: /* * For read faults on private mappings the PFN passed * in may not match the PFN we have mapped if the * mapped PFN is a writeable COW page. In the mkwrite * case we are creating a writable PTE for a shared * mapping and we expect the PFNs to match. If they * don't match, we are likely racing with block * allocation and mapping invalidation so just skip the * update. */ > > + spin_unlock(ptl); > > + > > + return VM_FAULT_NOPAGE; > > I assume always returning VM_FAULT_NOPAGE, even when something went wrong, > is the right thing to do? Yes, I think so. I guess in the WARN case we could return something like VM_FAULT_SIGBUS to kill the application, but the existing vmf_insert_*() functions don't currently do that so I think that would be a separate clean-up. > Apart from that LGTM. > > > -- > Cheers, > > David / dhildenb >
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 93e509b..5bd1ff7 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write); vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write); +vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_UNSUPPORTED, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 120cd2c..256adc3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1482,19 +1482,17 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, struct mm_struct *mm = vma->vm_mm; pgprot_t prot = vma->vm_page_prot; pud_t entry; - spinlock_t *ptl; - ptl = pud_lock(mm, pud); if (!pud_none(*pud)) { if (write) { if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn))) - goto out_unlock; + return; entry = pud_mkyoung(*pud); entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma); if (pudp_set_access_flags(vma, addr, pud, entry, 1)) update_mmu_cache_pud(vma, addr, pud); } - goto out_unlock; + return; } entry = pud_mkhuge(pfn_t_pud(pfn, prot)); @@ -1508,9 +1506,6 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, } set_pud_at(mm, addr, pud, entry); update_mmu_cache_pud(vma, addr, pud); - -out_unlock: - spin_unlock(ptl); } /** @@ -1528,6 +1523,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write) unsigned long addr = vmf->address & PUD_MASK; struct vm_area_struct *vma = vmf->vma; pgprot_t pgprot = vma->vm_page_prot; + spinlock_t *ptl; /* * If we had pud_special, we could avoid all these restrictions, @@ -1545,10 +1541,48 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write) track_pfn_insert(vma, &pgprot, pfn); + ptl = pud_lock(vma->vm_mm, vmf->pud); insert_pfn_pud(vma, addr, vmf->pud, pfn, write); + spin_unlock(ptl); + return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud); + +/** + * vmf_insert_folio_pud - insert a pud size folio mapped by a pud entry + * @vmf: Structure describing the fault + * @folio: folio to insert + * @write: whether it's a write fault + * + * Return: vm_fault_t value. + */ +vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long addr = vmf->address & PUD_MASK; + pud_t *pud = vmf->pud; + struct mm_struct *mm = vma->vm_mm; + spinlock_t *ptl; + + if (addr < vma->vm_start || addr >= vma->vm_end) + return VM_FAULT_SIGBUS; + + if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER)) + return VM_FAULT_SIGBUS; + + ptl = pud_lock(mm, pud); + if (pud_none(*vmf->pud)) { + folio_get(folio); + folio_add_file_rmap_pud(folio, &folio->page, vma); + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR); + } + insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), write); + spin_unlock(ptl); + + return VM_FAULT_NOPAGE; +} +EXPORT_SYMBOL_GPL(vmf_insert_folio_pud); #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ void touch_pmd(struct vm_area_struct *vma, unsigned long addr, @@ -2146,7 +2180,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); } else if (is_huge_zero_pmd(orig_pmd)) { - zap_deposited_table(tlb->mm, pmd); + if (!vma_is_dax(vma) || arch_needs_pgtable_deposit()) + zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); } else { struct folio *folio = NULL; @@ -2634,12 +2669,23 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm); arch_check_zapped_pud(vma, orig_pud); tlb_remove_pud_tlb_entry(tlb, pud, addr); - if (vma_is_special_huge(vma)) { + if (!vma_is_dax(vma) && vma_is_special_huge(vma)) { spin_unlock(ptl); /* No zero page support yet */ } else { - /* No support for anonymous PUD pages yet */ - BUG(); + struct page *page = NULL; + struct folio *folio; + + /* No support for anonymous PUD pages or migration yet */ + VM_WARN_ON_ONCE(vma_is_anonymous(vma) || !pud_present(orig_pud)); + + page = pud_page(orig_pud); + folio = page_folio(page); + folio_remove_rmap_pud(folio, page, vma); + add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR); + + spin_unlock(ptl); + tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE); } return 1; } @@ -2647,6 +2693,10 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud, unsigned long haddr) { + struct folio *folio; + struct page *page; + pud_t old_pud; + VM_BUG_ON(haddr & ~HPAGE_PUD_MASK); VM_BUG_ON_VMA(vma->vm_start > haddr, vma); VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma); @@ -2654,7 +2704,22 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud, count_vm_event(THP_SPLIT_PUD); - pudp_huge_clear_flush(vma, haddr, pud); + old_pud = pudp_huge_clear_flush(vma, haddr, pud); + + if (!vma_is_dax(vma)) + return; + + page = pud_page(old_pud); + folio = page_folio(page); + + if (!folio_test_dirty(folio) && pud_dirty(old_pud)) + folio_mark_dirty(folio); + if (!folio_test_referenced(folio) && pud_young(old_pud)) + folio_set_referenced(folio); + folio_remove_rmap_pud(folio, page, vma); + folio_put(folio); + add_mm_counter(vma->vm_mm, mm_counter_file(folio), + -HPAGE_PUD_NR); } void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
Currently DAX folio/page reference counts are managed differently to normal pages. To allow these to be managed the same as normal pages introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio and take references as it would for a normally mapped page. This is distinct from the current mechanism, vmf_insert_pfn_pud, which simply inserts a special devmap PUD entry into the page table without holding a reference to the page for the mapping. Signed-off-by: Alistair Popple <apopple@nvidia.com> --- Changes for v5: - Removed is_huge_zero_pud() as it's unlikely to ever be implemented. - Minor code clean-up suggested by David. --- include/linux/huge_mm.h | 1 +- mm/huge_memory.c | 89 ++++++++++++++++++++++++++++++++++++------ 2 files changed, 78 insertions(+), 12 deletions(-)