Message ID | 25a23433cb70f0fe6af92042eb71e962fcbf092b.1734407924.git-series.apopple@nvidia.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fs/dax: Fix ZONE_DEVICE page reference counts | expand |
On 17.12.24 06:12, Alistair Popple wrote: > In preparation for using insert_page() for DAX, enhance > insert_page_into_pte_locked() to handle establishing writable > mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a > PTE which bypasses the typical set_pte_range() in finish_fault. > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > Suggested-by: Dan Williams <dan.j.williams@intel.com> > > --- > > Changes since v2: > > - New patch split out from "mm/memory: Add dax_insert_pfn" > --- > mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 37 insertions(+), 8 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 06bb29e..cd82952 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, > } > > static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, > - unsigned long addr, struct page *page, pgprot_t prot) > + unsigned long addr, struct page *page, > + pgprot_t prot, bool mkwrite) > { > struct folio *folio = page_folio(page); > + pte_t entry = ptep_get(pte); > pte_t pteval; > > - if (!pte_none(ptep_get(pte))) > - return -EBUSY; > + if (!pte_none(entry)) { > + if (!mkwrite) > + return -EBUSY; > + > + /* > + * For read faults on private mappings the PFN passed in may not > + * match the PFN we have mapped if the mapped PFN is a writeable > + * COW page. In the mkwrite case we are creating a writable PTE > + * for a shared mapping and we expect the PFNs to match. If they > + * don't match, we are likely racing with block allocation and > + * mapping invalidation so just skip the update. > + */ Would it make sense to instead have here /* See insert_pfn(). */ But ... > + if (pte_pfn(entry) != page_to_pfn(page)) { > + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); > + return -EFAULT; > + } > + entry = maybe_mkwrite(entry, vma); > + entry = pte_mkyoung(entry); > + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) > + update_mmu_cache(vma, addr, pte); ... I am not sure if we want the above at all. Someone inserted a page, which is refcounted + mapcounted already. Now you ignore that and do like the second insertion "worked" ? No, that feels wrong, I suspect you will run into refcount+mapcount issues. If there is already something, inserting must fail IMHO. If you want to change something to upgrade write permissions, then a different interface should be used. > + return 0; > + } > + > /* Ok, finally just insert the thing.. */ > pteval = mk_pte(page, prot); > if (unlikely(is_zero_folio(folio))) { > pteval = pte_mkspecial(pteval); > } else { > folio_get(folio); > + entry = mk_pte(page, prot); > + if (mkwrite) { > + entry = pte_mkyoung(entry); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma);> + } > inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); > folio_add_file_rmap_pte(folio, page, vma); > } > @@ -2147,7 +2175,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, > } > > static int insert_page(struct vm_area_struct *vma, unsigned long addr, > - struct page *page, pgprot_t prot) > + struct page *page, pgprot_t prot, bool mkwrite) > { > int retval; > pte_t *pte; > @@ -2160,7 +2188,8 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, > pte = get_locked_pte(vma->vm_mm, addr, &ptl); > if (!pte) > goto out; > - retval = insert_page_into_pte_locked(vma, pte, addr, page, prot); > + retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, > + mkwrite); Alignment looks odd. Likely you can also just put it into a single line.
On 20.12.24 20:01, David Hildenbrand wrote: > On 17.12.24 06:12, Alistair Popple wrote: >> In preparation for using insert_page() for DAX, enhance >> insert_page_into_pte_locked() to handle establishing writable >> mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a >> PTE which bypasses the typical set_pte_range() in finish_fault. >> >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >> Suggested-by: Dan Williams <dan.j.williams@intel.com> >> >> --- >> >> Changes since v2: >> >> - New patch split out from "mm/memory: Add dax_insert_pfn" >> --- >> mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- >> 1 file changed, 37 insertions(+), 8 deletions(-) >> >> diff --git a/mm/memory.c b/mm/memory.c >> index 06bb29e..cd82952 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, >> } >> >> static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, >> - unsigned long addr, struct page *page, pgprot_t prot) >> + unsigned long addr, struct page *page, >> + pgprot_t prot, bool mkwrite) >> { >> struct folio *folio = page_folio(page); >> + pte_t entry = ptep_get(pte); >> pte_t pteval; >> >> - if (!pte_none(ptep_get(pte))) >> - return -EBUSY; >> + if (!pte_none(entry)) { >> + if (!mkwrite) >> + return -EBUSY; >> + >> + /* >> + * For read faults on private mappings the PFN passed in may not >> + * match the PFN we have mapped if the mapped PFN is a writeable >> + * COW page. In the mkwrite case we are creating a writable PTE >> + * for a shared mapping and we expect the PFNs to match. If they >> + * don't match, we are likely racing with block allocation and >> + * mapping invalidation so just skip the update. >> + */ > > Would it make sense to instead have here > > /* See insert_pfn(). */ > > But ... > >> + if (pte_pfn(entry) != page_to_pfn(page)) { >> + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); >> + return -EFAULT; >> + } >> + entry = maybe_mkwrite(entry, vma); >> + entry = pte_mkyoung(entry); >> + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) >> + update_mmu_cache(vma, addr, pte); > > ... I am not sure if we want the above at all. Someone inserted a page, > which is refcounted + mapcounted already. > > Now you ignore that and do like the second insertion "worked" ? > > No, that feels wrong, I suspect you will run into refcount+mapcount issues. > > If there is already something, inserting must fail IMHO. If you want to > change something to upgrade write permissions, then a different > interface should be used. Ah, now I realize that the early exit saves you because we won't adjust the refcount +mapcount. I still wonder if that really belongs in here, I would prefer to not play such tricks to upgrade write permissions if possible.
diff --git a/mm/memory.c b/mm/memory.c index 06bb29e..cd82952 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, } static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, - unsigned long addr, struct page *page, pgprot_t prot) + unsigned long addr, struct page *page, + pgprot_t prot, bool mkwrite) { struct folio *folio = page_folio(page); + pte_t entry = ptep_get(pte); pte_t pteval; - if (!pte_none(ptep_get(pte))) - return -EBUSY; + if (!pte_none(entry)) { + if (!mkwrite) + return -EBUSY; + + /* + * For read faults on private mappings the PFN passed in may not + * match the PFN we have mapped if the mapped PFN is a writeable + * COW page. In the mkwrite case we are creating a writable PTE + * for a shared mapping and we expect the PFNs to match. If they + * don't match, we are likely racing with block allocation and + * mapping invalidation so just skip the update. + */ + if (pte_pfn(entry) != page_to_pfn(page)) { + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); + return -EFAULT; + } + entry = maybe_mkwrite(entry, vma); + entry = pte_mkyoung(entry); + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) + update_mmu_cache(vma, addr, pte); + return 0; + } + /* Ok, finally just insert the thing.. */ pteval = mk_pte(page, prot); if (unlikely(is_zero_folio(folio))) { pteval = pte_mkspecial(pteval); } else { folio_get(folio); + entry = mk_pte(page, prot); + if (mkwrite) { + entry = pte_mkyoung(entry); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + } inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); folio_add_file_rmap_pte(folio, page, vma); } @@ -2147,7 +2175,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, } static int insert_page(struct vm_area_struct *vma, unsigned long addr, - struct page *page, pgprot_t prot) + struct page *page, pgprot_t prot, bool mkwrite) { int retval; pte_t *pte; @@ -2160,7 +2188,8 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, pte = get_locked_pte(vma->vm_mm, addr, &ptl); if (!pte) goto out; - retval = insert_page_into_pte_locked(vma, pte, addr, page, prot); + retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, + mkwrite); pte_unmap_unlock(pte, ptl); out: return retval; @@ -2174,7 +2203,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte, err = validate_page_before_insert(vma, page); if (err) return err; - return insert_page_into_pte_locked(vma, pte, addr, page, prot); + return insert_page_into_pte_locked(vma, pte, addr, page, prot, false); } /* insert_pages() amortizes the cost of spinlock operations @@ -2310,7 +2339,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, BUG_ON(vma->vm_flags & VM_PFNMAP); vm_flags_set(vma, VM_MIXEDMAP); } - return insert_page(vma, addr, page, vma->vm_page_prot); + return insert_page(vma, addr, page, vma->vm_page_prot, false); } EXPORT_SYMBOL(vm_insert_page); @@ -2590,7 +2619,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, * result in pfn_t_has_page() == false. */ page = pfn_to_page(pfn_t_to_pfn(pfn)); - err = insert_page(vma, addr, page, pgprot); + err = insert_page(vma, addr, page, pgprot, mkwrite); } else { return insert_pfn(vma, addr, pfn, pgprot, mkwrite); }
In preparation for using insert_page() for DAX, enhance insert_page_into_pte_locked() to handle establishing writable mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a PTE which bypasses the typical set_pte_range() in finish_fault. Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> --- Changes since v2: - New patch split out from "mm/memory: Add dax_insert_pfn" --- mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- 1 file changed, 37 insertions(+), 8 deletions(-)