Message ID | e75232267bb9b5411b67df267e16aa27597eba33.1736488799.git-series.apopple@nvidia.com (mailing list archive)
---|---
State | New
Series | fs/dax: Fix ZONE_DEVICE page reference counts
Alistair Popple wrote:
> Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
> creates a special devmap PTE entry for the pfn but does not take a
> reference on the underlying struct page for the mapping. This is
> because DAX page refcounts are treated specially, as indicated by the
> presence of a devmap entry.
>
> To allow DAX page refcounts to be managed the same as normal page
> refcounts introduce vmf_insert_page_mkwrite(). This will take a
> reference on the underlying page much the same as vmf_insert_page,
> except it also permits upgrading an existing mapping to be writable if
> requested/possible.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Updates from v2:
>
>  - Rename function so it is not DAX specific.
>
>  - Split the insert_page_into_pte_locked() change into a separate
>    patch.
>
> Updates from v1:
>
>  - Re-arrange code in insert_page_into_pte_locked() based on comments
>    from Jan Kara.
>
>  - Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
> ---
>  include/linux/mm.h |  2 ++
>  mm/memory.c        | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+)

Looks good to me, you can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
On 10.01.25 07:00, Alistair Popple wrote:
> Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
> creates a special devmap PTE entry for the pfn but does not take a
> reference on the underlying struct page for the mapping. This is
> because DAX page refcounts are treated specially, as indicated by the
> presence of a devmap entry.
>
> To allow DAX page refcounts to be managed the same as normal page
> refcounts introduce vmf_insert_page_mkwrite(). This will take a
> reference on the underlying page much the same as vmf_insert_page,
> except it also permits upgrading an existing mapping to be writable if
> requested/possible.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Updates from v2:
>
>  - Rename function so it is not DAX specific.
>
>  - Split the insert_page_into_pte_locked() change into a separate
>    patch.
>
> Updates from v1:
>
>  - Re-arrange code in insert_page_into_pte_locked() based on comments
>    from Jan Kara.
>
>  - Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
> ---
>  include/linux/mm.h |  2 ++
>  mm/memory.c        | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e790298..f267b06 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3620,6 +3620,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
>  				unsigned long num);
>  int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
>  				unsigned long num);
> +vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
> +		bool write);
>  vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  			unsigned long pfn);
>  vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index 8531acb..c60b819 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2624,6 +2624,42 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
>  	return VM_FAULT_NOPAGE;
>  }
>
> +vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
> +		bool write)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	pgprot_t pgprot = vma->vm_page_prot;
> +	unsigned long pfn = page_to_pfn(page);
> +	unsigned long addr = vmf->address;
> +	int err;
> +
> +	if (addr < vma->vm_start || addr >= vma->vm_end)
> +		return VM_FAULT_SIGBUS;
> +
> +	track_pfn_insert(vma, &pgprot, pfn_to_pfn_t(pfn));

I think I raised this before: why is this track_pfn_insert() in here? It
only ever does something to VM_PFNMAP mappings, and that cannot possibly
be the case here (nothing in VM_PFNMAP is refcounted, ever)?

> +
> +	if (!pfn_modify_allowed(pfn, pgprot))
> +		return VM_FAULT_SIGBUS;

Why is that required? Why are we messing so much with PFNs? :)

Note that x86 does in there

	/* If it's real memory always allow */
	if (pfn_valid(pfn))
		return true;

See below: when would we ever have a "struct page *" but !pfn_valid()?

> +
> +	/*
> +	 * We refcount the page normally so make sure pfn_valid is true.
> +	 */
> +	if (!pfn_valid(pfn))
> +		return VM_FAULT_SIGBUS;

Somebody gave us a "struct page"; how could the pfn ever be invalid (not
have a struct page)?

I think all of the above regarding PFNs should be dropped -- unless I am
missing something important.

> +
> +	if (WARN_ON(is_zero_pfn(pfn) && write))
> +		return VM_FAULT_SIGBUS;

is_zero_page() if you already have the "page". But note that in
validate_page_before_insert() we do have a check that allows for
conditional insertion of the shared zeropage.

So maybe this hunk is also not required.

> +
> +	err = insert_page(vma, addr, page, pgprot, write);
> +	if (err == -ENOMEM)
> +		return VM_FAULT_OOM;
> +	if (err < 0 && err != -EBUSY)
> +		return VM_FAULT_SIGBUS;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite);
On Tue, Jan 14, 2025 at 05:15:54PM +0100, David Hildenbrand wrote:
> On 10.01.25 07:00, Alistair Popple wrote:
> > Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
> > creates a special devmap PTE entry for the pfn but does not take a
> > reference on the underlying struct page for the mapping. This is
> > because DAX page refcounts are treated specially, as indicated by the
> > presence of a devmap entry.
> >
> > To allow DAX page refcounts to be managed the same as normal page
> > refcounts introduce vmf_insert_page_mkwrite(). This will take a
> > reference on the underlying page much the same as vmf_insert_page,
> > except it also permits upgrading an existing mapping to be writable if
> > requested/possible.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >
> > ---
> >
> > Updates from v2:
> >
> >  - Rename function so it is not DAX specific.
> >
> >  - Split the insert_page_into_pte_locked() change into a separate
> >    patch.
> >
> > Updates from v1:
> >
> >  - Re-arrange code in insert_page_into_pte_locked() based on comments
> >    from Jan Kara.
> >
> >  - Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
> > ---
> >  include/linux/mm.h |  2 ++
> >  mm/memory.c        | 36 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 38 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e790298..f267b06 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3620,6 +3620,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
> >  				unsigned long num);
> >  int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
> >  				unsigned long num);
> > +vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
> > +		bool write);
> >  vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> >  			unsigned long pfn);
> >  vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8531acb..c60b819 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2624,6 +2624,42 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
> >  	return VM_FAULT_NOPAGE;
> >  }
> >
> > +vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
> > +		bool write)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	pgprot_t pgprot = vma->vm_page_prot;
> > +	unsigned long pfn = page_to_pfn(page);
> > +	unsigned long addr = vmf->address;
> > +	int err;
> > +
> > +	if (addr < vma->vm_start || addr >= vma->vm_end)
> > +		return VM_FAULT_SIGBUS;
> > +
> > +	track_pfn_insert(vma, &pgprot, pfn_to_pfn_t(pfn));
>
> I think I raised this before: why is this track_pfn_insert() in here? It
> only ever does something to VM_PFNMAP mappings, and that cannot possibly
> be the case here (nothing in VM_PFNMAP is refcounted, ever)?

Yes, I also had deja vu reading this comment and a vague recollection of
fixing them too. Your comments[1] were for vmf_insert_folio_pud() though,
which explains why I neglected to do the same clean-up here even though I
should have, so thanks for pointing them out.

[1] - https://lore.kernel.org/linux-mm/ee19854f-fa1f-4207-9176-3c7b79bccd07@redhat.com/

> > +
> > +	if (!pfn_modify_allowed(pfn, pgprot))
> > +		return VM_FAULT_SIGBUS;
>
> Why is that required? Why are we messing so much with PFNs? :)
>
> Note that x86 does in there
>
> 	/* If it's real memory always allow */
> 	if (pfn_valid(pfn))
> 		return true;
>
> See below: when would we ever have a "struct page *" but !pfn_valid()?
>
> > +
> > +	/*
> > +	 * We refcount the page normally so make sure pfn_valid is true.
> > +	 */
> > +	if (!pfn_valid(pfn))
> > +		return VM_FAULT_SIGBUS;
>
> Somebody gave us a "struct page"; how could the pfn ever be invalid (not
> have a struct page)?
>
> I think all of the above regarding PFNs should be dropped -- unless I am
> missing something important.
>
> > +
> > +	if (WARN_ON(is_zero_pfn(pfn) && write))
> > +		return VM_FAULT_SIGBUS;
>
> is_zero_page() if you already have the "page". But note that in
> validate_page_before_insert() we do have a check that allows for
> conditional insertion of the shared zeropage.
>
> So maybe this hunk is also not required.

Yes, also not required. I have removed the above hunks as well because we
don't need any of this pfn stuff. Again it's just a hangover from an
earlier version of the series when I was passing pfns rather than pages
here.

> > +
> > +	err = insert_page(vma, addr, page, pgprot, write);
> > +	if (err == -ENOMEM)
> > +		return VM_FAULT_OOM;
> > +	if (err < 0 && err != -EBUSY)
> > +		return VM_FAULT_SIGBUS;
> > +
> > +	return VM_FAULT_NOPAGE;
> > +}
> > +EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite);
>
> --
> Cheers,
>
> David / dhildenb
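Putting the agreed clean-ups together, with the track_pfn_insert(), pfn_modify_allowed(), pfn_valid() and zero-pfn hunks dropped, the helper would reduce to roughly the following. This is only a sketch of the direction discussed above, not the actual respin of the patch:

```c
vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
		bool write)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long addr = vmf->address;
	int err;

	if (addr < vma->vm_start || addr >= vma->vm_end)
		return VM_FAULT_SIGBUS;

	/*
	 * insert_page() takes a reference on @page; zero-page and other
	 * sanity checks already live in validate_page_before_insert(),
	 * so no pfn-based checks are needed here.
	 */
	err = insert_page(vma, addr, page, vma->vm_page_prot, write);
	if (err == -ENOMEM)
		return VM_FAULT_OOM;
	if (err < 0 && err != -EBUSY)
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite);
```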
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e790298..f267b06 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3620,6 +3620,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
 				unsigned long num);
 int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
 				unsigned long num);
+vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
+		bool write);
 vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 8531acb..c60b819 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2624,6 +2624,42 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
 	return VM_FAULT_NOPAGE;
 }
 
+vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
+		bool write)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	pgprot_t pgprot = vma->vm_page_prot;
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long addr = vmf->address;
+	int err;
+
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return VM_FAULT_SIGBUS;
+
+	track_pfn_insert(vma, &pgprot, pfn_to_pfn_t(pfn));
+
+	if (!pfn_modify_allowed(pfn, pgprot))
+		return VM_FAULT_SIGBUS;
+
+	/*
+	 * We refcount the page normally so make sure pfn_valid is true.
+	 */
+	if (!pfn_valid(pfn))
+		return VM_FAULT_SIGBUS;
+
+	if (WARN_ON(is_zero_pfn(pfn) && write))
+		return VM_FAULT_SIGBUS;
+
+	err = insert_page(vma, addr, page, pgprot, write);
+	if (err == -ENOMEM)
+		return VM_FAULT_OOM;
+	if (err < 0 && err != -EBUSY)
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite);
+
 vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 		pfn_t pfn)
 {
Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
creates a special devmap PTE entry for the pfn but does not take a
reference on the underlying struct page for the mapping. This is
because DAX page refcounts are treated specially, as indicated by the
presence of a devmap entry.

To allow DAX page refcounts to be managed the same as normal page
refcounts introduce vmf_insert_page_mkwrite(). This will take a
reference on the underlying page much the same as vmf_insert_page,
except it also permits upgrading an existing mapping to be writable if
requested/possible.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

---

Updates from v2:

 - Rename function so it is not DAX specific.

 - Split the insert_page_into_pte_locked() change into a separate
   patch.

Updates from v1:

 - Re-arrange code in insert_page_into_pte_locked() based on comments
   from Jan Kara.

 - Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)
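For context, a driver fault handler converting from vmf_insert_pfn() would pass the backing page directly. The sketch below is purely illustrative; the handler name and the example_lookup_page() helper are hypothetical stand-ins, not part of this patch:

```c
static vm_fault_t example_dax_fault(struct vm_fault *vmf)
{
	bool write = vmf->flags & FAULT_FLAG_WRITE;
	struct page *page;

	/*
	 * example_lookup_page() stands in for however the driver
	 * resolves the faulting address to its backing page.
	 */
	page = example_lookup_page(vmf);
	if (!page)
		return VM_FAULT_SIGBUS;

	/*
	 * Takes a reference on @page like vmf_insert_page(), and can
	 * upgrade an existing read-only mapping when @write is true.
	 */
	return vmf_insert_page_mkwrite(vmf, page, write);
}
```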