Message ID | 25a23433cb70f0fe6af92042eb71e962fcbf092b.1734407924.git-series.apopple@nvidia.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fs/dax: Fix ZONE_DEVICE page reference counts | expand |
On 17.12.24 06:12, Alistair Popple wrote: > In preparation for using insert_page() for DAX, enhance > insert_page_into_pte_locked() to handle establishing writable > mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a > PTE which bypasses the typical set_pte_range() in finish_fault. > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > Suggested-by: Dan Williams <dan.j.williams@intel.com> > > --- > > Changes since v2: > > - New patch split out from "mm/memory: Add dax_insert_pfn" > --- > mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 37 insertions(+), 8 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 06bb29e..cd82952 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, > } > > static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, > - unsigned long addr, struct page *page, pgprot_t prot) > + unsigned long addr, struct page *page, > + pgprot_t prot, bool mkwrite) > { > struct folio *folio = page_folio(page); > + pte_t entry = ptep_get(pte); > pte_t pteval; > > - if (!pte_none(ptep_get(pte))) > - return -EBUSY; > + if (!pte_none(entry)) { > + if (!mkwrite) > + return -EBUSY; > + > + /* > + * For read faults on private mappings the PFN passed in may not > + * match the PFN we have mapped if the mapped PFN is a writeable > + * COW page. In the mkwrite case we are creating a writable PTE > + * for a shared mapping and we expect the PFNs to match. If they > + * don't match, we are likely racing with block allocation and > + * mapping invalidation so just skip the update. > + */ Would it make sense to instead have here /* See insert_pfn(). */ But ... > + if (pte_pfn(entry) != page_to_pfn(page)) { > + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); > + return -EFAULT; > + } > + entry = maybe_mkwrite(entry, vma); > + entry = pte_mkyoung(entry); > + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) > + update_mmu_cache(vma, addr, pte); ... I am not sure if we want the above at all. Someone inserted a page, which is refcounted + mapcounted already. Now you ignore that and do like the second insertion "worked" ? No, that feels wrong, I suspect you will run into refcount+mapcount issues. If there is already something, inserting must fail IMHO. If you want to change something to upgrade write permissions, then a different interface should be used. > + return 0; > + } > + > /* Ok, finally just insert the thing.. */ > pteval = mk_pte(page, prot); > if (unlikely(is_zero_folio(folio))) { > pteval = pte_mkspecial(pteval); > } else { > folio_get(folio); > + entry = mk_pte(page, prot); > + if (mkwrite) { > + entry = pte_mkyoung(entry); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma);> + } > inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); > folio_add_file_rmap_pte(folio, page, vma); > } > @@ -2147,7 +2175,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, > } > > static int insert_page(struct vm_area_struct *vma, unsigned long addr, > - struct page *page, pgprot_t prot) > + struct page *page, pgprot_t prot, bool mkwrite) > { > int retval; > pte_t *pte; > @@ -2160,7 +2188,8 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, > pte = get_locked_pte(vma->vm_mm, addr, &ptl); > if (!pte) > goto out; > - retval = insert_page_into_pte_locked(vma, pte, addr, page, prot); > + retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, > + mkwrite); Alignment looks odd. Likely you can also just put it into a single line.
On 20.12.24 20:01, David Hildenbrand wrote: > On 17.12.24 06:12, Alistair Popple wrote: >> In preparation for using insert_page() for DAX, enhance >> insert_page_into_pte_locked() to handle establishing writable >> mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a >> PTE which bypasses the typical set_pte_range() in finish_fault. >> >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >> Suggested-by: Dan Williams <dan.j.williams@intel.com> >> >> --- >> >> Changes since v2: >> >> - New patch split out from "mm/memory: Add dax_insert_pfn" >> --- >> mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- >> 1 file changed, 37 insertions(+), 8 deletions(-) >> >> diff --git a/mm/memory.c b/mm/memory.c >> index 06bb29e..cd82952 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, >> } >> >> static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, >> - unsigned long addr, struct page *page, pgprot_t prot) >> + unsigned long addr, struct page *page, >> + pgprot_t prot, bool mkwrite) >> { >> struct folio *folio = page_folio(page); >> + pte_t entry = ptep_get(pte); >> pte_t pteval; >> >> - if (!pte_none(ptep_get(pte))) >> - return -EBUSY; >> + if (!pte_none(entry)) { >> + if (!mkwrite) >> + return -EBUSY; >> + >> + /* >> + * For read faults on private mappings the PFN passed in may not >> + * match the PFN we have mapped if the mapped PFN is a writeable >> + * COW page. In the mkwrite case we are creating a writable PTE >> + * for a shared mapping and we expect the PFNs to match. If they >> + * don't match, we are likely racing with block allocation and >> + * mapping invalidation so just skip the update. >> + */ > > Would it make sense to instead have here > > /* See insert_pfn(). */ > > But ... > >> + if (pte_pfn(entry) != page_to_pfn(page)) { >> + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); >> + return -EFAULT; >> + } >> + entry = maybe_mkwrite(entry, vma); >> + entry = pte_mkyoung(entry); >> + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) >> + update_mmu_cache(vma, addr, pte); > > ... I am not sure if we want the above at all. Someone inserted a page, > which is refcounted + mapcounted already. > > Now you ignore that and do like the second insertion "worked" ? > > No, that feels wrong, I suspect you will run into refcount+mapcount issues. > > If there is already something, inserting must fail IMHO. If you want to > change something to upgrade write permissions, then a different > interface should be used. Ah, now I realize that the early exit saves you because we won't adjust the refcount +mapcount. I still wonder if that really belongs in here, I would prefer to not play such tricks to upgrade write permissions if possible.
On Fri, Dec 20, 2024 at 08:06:48PM +0100, David Hildenbrand wrote: > On 20.12.24 20:01, David Hildenbrand wrote: > > On 17.12.24 06:12, Alistair Popple wrote: > > > In preparation for using insert_page() for DAX, enhance > > > insert_page_into_pte_locked() to handle establishing writable > > > mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a > > > PTE which bypasses the typical set_pte_range() in finish_fault. > > > > > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > > > Suggested-by: Dan Williams <dan.j.williams@intel.com> > > > > > > --- > > > > > > Changes since v2: > > > > > > - New patch split out from "mm/memory: Add dax_insert_pfn" > > > --- > > > mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- > > > 1 file changed, 37 insertions(+), 8 deletions(-) > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 06bb29e..cd82952 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, > > > } > > > static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, > > > - unsigned long addr, struct page *page, pgprot_t prot) > > > + unsigned long addr, struct page *page, > > > + pgprot_t prot, bool mkwrite) > > > { > > > struct folio *folio = page_folio(page); > > > + pte_t entry = ptep_get(pte); > > > pte_t pteval; > > > - if (!pte_none(ptep_get(pte))) > > > - return -EBUSY; > > > + if (!pte_none(entry)) { > > > + if (!mkwrite) > > > + return -EBUSY; > > > + > > > + /* > > > + * For read faults on private mappings the PFN passed in may not > > > + * match the PFN we have mapped if the mapped PFN is a writeable > > > + * COW page. In the mkwrite case we are creating a writable PTE > > > + * for a shared mapping and we expect the PFNs to match. If they > > > + * don't match, we are likely racing with block allocation and > > > + * mapping invalidation so just skip the update. > > > + */ > > > > Would it make sense to instead have here > > > > /* See insert_pfn(). */ > > > > But ... > > > > > + if (pte_pfn(entry) != page_to_pfn(page)) { > > > + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); > > > + return -EFAULT; > > > + } > > > + entry = maybe_mkwrite(entry, vma); > > > + entry = pte_mkyoung(entry); > > > + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) > > > + update_mmu_cache(vma, addr, pte); > > > > ... I am not sure if we want the above at all. Someone inserted a page, > > which is refcounted + mapcounted already. > > > > Now you ignore that and do like the second insertion "worked" ? > > > > No, that feels wrong, I suspect you will run into refcount+mapcount issues. > > > > If there is already something, inserting must fail IMHO. If you want to > > change something to upgrade write permissions, then a different > > interface should be used. > > Ah, now I realize that the early exit saves you because we won't adjust the > refcount +mapcount. Right. > I still wonder if that really belongs in here, I would prefer to not play > such tricks to upgrade write permissions if possible. As you have pointed out this was all inspired (ie. mostly copied) from the existing insert_pfn() implementation which is used from vmf_insert_mixed{_mkwrite}(). I agree a different interface to upgrade permissions would be nice. However it's tricky because in general callers of these functions (eg. FS DAX) aren't aware if the page is already mapped by a PTE/PMD. They only know a fault has occured and the faulting permissions. This wouldn't be impossible to fix - the mm does provide vm_ops->page_mkwrite() for permission upgrades. The difficulty is that most filesystems that support FS DAX (ie. ext4, XFS) don't treat a vm_ops->page_mkwrite() call any differently from a vm_ops->fault() call due to write fault. Therefore the FS DAX code is unaware of whether or not this is a permission upgrade or initial writeable mapping of the page in the VMA. A further issue in there is currently no vm_ops->huge_mkwrite() callback. Obviously this could all be plumbed through the MM/FS layers, but that would require a separate patch series. Given the current implementation has no issues beyond the cosmetic I'd rather not delay this series any longer, especially as the cosmetic defect is largely pre-existing (vmf_insert_mixed{_mkwrite}() could have equally had a separate upgrade interface). > -- > > Cheers, > > David / dhildenb >
On 06.01.25 03:07, Alistair Popple wrote: > On Fri, Dec 20, 2024 at 08:06:48PM +0100, David Hildenbrand wrote: >> On 20.12.24 20:01, David Hildenbrand wrote: >>> On 17.12.24 06:12, Alistair Popple wrote: >>>> In preparation for using insert_page() for DAX, enhance >>>> insert_page_into_pte_locked() to handle establishing writable >>>> mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a >>>> PTE which bypasses the typical set_pte_range() in finish_fault. >>>> >>>> Signed-off-by: Alistair Popple <apopple@nvidia.com> >>>> Suggested-by: Dan Williams <dan.j.williams@intel.com> >>>> >>>> --- >>>> >>>> Changes since v2: >>>> >>>> - New patch split out from "mm/memory: Add dax_insert_pfn" >>>> --- >>>> mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- >>>> 1 file changed, 37 insertions(+), 8 deletions(-) >>>> >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 06bb29e..cd82952 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, >>>> } >>>> static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, >>>> - unsigned long addr, struct page *page, pgprot_t prot) >>>> + unsigned long addr, struct page *page, >>>> + pgprot_t prot, bool mkwrite) >>>> { >>>> struct folio *folio = page_folio(page); >>>> + pte_t entry = ptep_get(pte); >>>> pte_t pteval; >>>> - if (!pte_none(ptep_get(pte))) >>>> - return -EBUSY; >>>> + if (!pte_none(entry)) { >>>> + if (!mkwrite) >>>> + return -EBUSY; >>>> + >>>> + /* >>>> + * For read faults on private mappings the PFN passed in may not >>>> + * match the PFN we have mapped if the mapped PFN is a writeable >>>> + * COW page. In the mkwrite case we are creating a writable PTE >>>> + * for a shared mapping and we expect the PFNs to match. If they >>>> + * don't match, we are likely racing with block allocation and >>>> + * mapping invalidation so just skip the update. >>>> + */ >>> >>> Would it make sense to instead have here >>> >>> /* See insert_pfn(). */ >>> >>> But ... >>> >>>> + if (pte_pfn(entry) != page_to_pfn(page)) { >>>> + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); >>>> + return -EFAULT; >>>> + } >>>> + entry = maybe_mkwrite(entry, vma); >>>> + entry = pte_mkyoung(entry); >>>> + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) >>>> + update_mmu_cache(vma, addr, pte); >>> >>> ... I am not sure if we want the above at all. Someone inserted a page, >>> which is refcounted + mapcounted already. >>> >>> Now you ignore that and do like the second insertion "worked" ? >>> >>> No, that feels wrong, I suspect you will run into refcount+mapcount issues. >>> >>> If there is already something, inserting must fail IMHO. If you want to >>> change something to upgrade write permissions, then a different >>> interface should be used. >> >> Ah, now I realize that the early exit saves you because we won't adjust the >> refcount +mapcount. > > Right. > >> I still wonder if that really belongs in here, I would prefer to not play >> such tricks to upgrade write permissions if possible. > > As you have pointed out this was all inspired (ie. mostly copied) > from the existing insert_pfn() implementation which is used from > vmf_insert_mixed{_mkwrite}(). > > I agree a different interface to upgrade permissions would be nice. However > it's tricky because in general callers of these functions (eg. FS DAX) aren't > aware if the page is already mapped by a PTE/PMD. They only know a fault has > occured and the faulting permissions. > > This wouldn't be impossible to fix - the mm does provide vm_ops->page_mkwrite() > for permission upgrades. The difficulty is that most filesystems that support > FS DAX (ie. ext4, XFS) don't treat a vm_ops->page_mkwrite() call any differently > from a vm_ops->fault() call due to write fault. Therefore the FS DAX code is > unaware of whether or not this is a permission upgrade or initial writeable > mapping of the page in the VMA. > > A further issue in there is currently no vm_ops->huge_mkwrite() callback. > > Obviously this could all be plumbed through the MM/FS layers, but that would > require a separate patch series. Given the current implementation has no issues > beyond the cosmetic I'd rather not delay this series any longer, especially as > the cosmetic defect is largely pre-existing (vmf_insert_mixed{_mkwrite}() could > have equally had a separate upgrade interface). Fine with me, just stumbled over it an thought "that looks odd".
diff --git a/mm/memory.c b/mm/memory.c index 06bb29e..cd82952 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, } static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, - unsigned long addr, struct page *page, pgprot_t prot) + unsigned long addr, struct page *page, + pgprot_t prot, bool mkwrite) { struct folio *folio = page_folio(page); + pte_t entry = ptep_get(pte); pte_t pteval; - if (!pte_none(ptep_get(pte))) - return -EBUSY; + if (!pte_none(entry)) { + if (!mkwrite) + return -EBUSY; + + /* + * For read faults on private mappings the PFN passed in may not + * match the PFN we have mapped if the mapped PFN is a writeable + * COW page. In the mkwrite case we are creating a writable PTE + * for a shared mapping and we expect the PFNs to match. If they + * don't match, we are likely racing with block allocation and + * mapping invalidation so just skip the update. + */ + if (pte_pfn(entry) != page_to_pfn(page)) { + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); + return -EFAULT; + } + entry = maybe_mkwrite(entry, vma); + entry = pte_mkyoung(entry); + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) + update_mmu_cache(vma, addr, pte); + return 0; + } + /* Ok, finally just insert the thing.. */ pteval = mk_pte(page, prot); if (unlikely(is_zero_folio(folio))) { pteval = pte_mkspecial(pteval); } else { folio_get(folio); + entry = mk_pte(page, prot); + if (mkwrite) { + entry = pte_mkyoung(entry); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + } inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); folio_add_file_rmap_pte(folio, page, vma); } @@ -2147,7 +2175,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, } static int insert_page(struct vm_area_struct *vma, unsigned long addr, - struct page *page, pgprot_t prot) + struct page *page, pgprot_t prot, bool mkwrite) { int retval; pte_t *pte; @@ -2160,7 +2188,8 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, pte = get_locked_pte(vma->vm_mm, addr, &ptl); if (!pte) goto out; - retval = insert_page_into_pte_locked(vma, pte, addr, page, prot); + retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, + mkwrite); pte_unmap_unlock(pte, ptl); out: return retval; @@ -2174,7 +2203,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte, err = validate_page_before_insert(vma, page); if (err) return err; - return insert_page_into_pte_locked(vma, pte, addr, page, prot); + return insert_page_into_pte_locked(vma, pte, addr, page, prot, false); } /* insert_pages() amortizes the cost of spinlock operations @@ -2310,7 +2339,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, BUG_ON(vma->vm_flags & VM_PFNMAP); vm_flags_set(vma, VM_MIXEDMAP); } - return insert_page(vma, addr, page, vma->vm_page_prot); + return insert_page(vma, addr, page, vma->vm_page_prot, false); } EXPORT_SYMBOL(vm_insert_page); @@ -2590,7 +2619,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, * result in pfn_t_has_page() == false. */ page = pfn_to_page(pfn_t_to_pfn(pfn)); - err = insert_page(vma, addr, page, pgprot); + err = insert_page(vma, addr, page, pgprot, mkwrite); } else { return insert_pfn(vma, addr, pfn, pgprot, mkwrite); }
In preparation for using insert_page() for DAX, enhance insert_page_into_pte_locked() to handle establishing writable mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a PTE which bypasses the typical set_pte_range() in finish_fault. Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> --- Changes since v2: - New patch split out from "mm/memory: Add dax_insert_pfn" --- mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- 1 file changed, 37 insertions(+), 8 deletions(-)