Message ID: 9210f90866fef17b54884130fb3e55ab410dd015.1736488799.git-series.apopple@nvidia.com
State: New
Series: fs/dax: Fix ZONE_DEVICE page reference counts
On 10.01.25 07:00, Alistair Popple wrote:
> Zone device pages are used to represent various types of device memory
> managed by device drivers. Currently compound zone device pages are
> not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
> user of higher order zone device pages and have their own page
> reference counting.
> 
> A future change will unify FS DAX reference counting with normal page
> reference counting rules and remove the special FS DAX reference
> counting. Supporting that requires compound zone device pages.
> 
> Supporting compound zone device pages requires compound_head() to
> distinguish between head and tail pages whilst still preserving the
> special struct page fields that are specific to zone device pages.
> 
> A tail page is distinguished by having bit zero set in
> page->compound_head, with the remaining bits pointing to the head
> page. For zone device pages page->compound_head is shared with
> page->pgmap.
> 
> The page->pgmap field is common to all pages within a memory section.
> Therefore pgmap is the same for both head and tail pages and can be
> moved into the folio and we can use the standard scheme to find
> compound_head from a tail page.

The more relevant thing is that the pgmap field must be common to all
pages in a folio, even if a folio exceeds memory sections (e.g., 128 MiB
on x86_64 where we have 1 GiB folios).

> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
> ---
> 
> Changes for v4:
>  - Fix build breakages reported by kernel test robot
> 
> Changes since v2:
> 
>  - Indentation fix
>  - Rename page_dev_pagemap() to page_pgmap()
>  - Rename folio _unused field to _unused_pgmap_compound_head
>  - s/WARN_ON/VM_WARN_ON_ONCE_PAGE/
> 
> Changes since v1:
> 
>  - Move pgmap to the folio as suggested by Matthew Wilcox
> ---

[...]

>   static inline bool folio_is_device_coherent(const struct folio *folio)
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 29919fa..61899ec 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -205,8 +205,8 @@ struct migrate_vma {
>   	unsigned long		end;
>   
>   	/*
> -	 * Set to the owner value also stored in page->pgmap->owner for
> -	 * migrating out of device private memory. The flags also need to
> +	 * Set to the owner value also stored in page_pgmap(page)->owner
> +	 * for migrating out of device private memory. The flags also need to
>   	 * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
>   	 * The caller should always set this field when using mmu notifier
>   	 * callbacks to avoid device MMU invalidations for device private
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index df8f515..54b59b8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -129,8 +129,11 @@ struct page {
>   				unsigned long compound_head;	/* Bit zero is set */
>   			};
>   			struct {	/* ZONE_DEVICE pages */
> -				/** @pgmap: Points to the hosting device page map. */
> -				struct dev_pagemap *pgmap;
> +				/*
> +				 * The first word is used for compound_head or folio
> +				 * pgmap
> +				 */
> +				void *_unused_pgmap_compound_head;
>   				void *zone_device_data;
>   				/*
>   				 * ZONE_DEVICE private pages are counted as being
> @@ -299,6 +302,7 @@ typedef struct {
>    * @_refcount: Do not access this member directly. Use folio_ref_count()
>    *    to find how many references there are to this folio.
>    * @memcg_data: Memory Control Group data.
> + * @pgmap: Metadata for ZONE_DEVICE mappings
>    * @virtual: Virtual address in the kernel direct map.
>    * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
>    * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
> @@ -337,6 +341,7 @@ struct folio {
>   		/* private: */
>   			};
>   		/* public: */
> +			struct dev_pagemap *pgmap;

Agreed, that should work.

Acked-by: David Hildenbrand <david@redhat.com>
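The tail-page encoding the commit message relies on can be summarised in a
short sketch. This is illustrative only, not the kernel's implementation:
the real compound_head() lives in include/linux/page-flags.h and reads the
word with READ_ONCE(); the demo_ names below are made up.

#include <stdbool.h>

/*
 * Sketch of the bit-zero encoding described above. A tail page stores a
 * pointer to its head page with bit zero set; a word with bit zero clear
 * belongs to a head (or non-compound) page. ZONE_DEVICE pages previously
 * reused this word for page->pgmap, which is the conflict this patch
 * resolves by moving pgmap into the folio.
 */
struct demo_page {
	unsigned long compound_head;	/* bit 0 set => tail page */
};

static bool demo_page_is_tail(const struct demo_page *page)
{
	return page->compound_head & 1UL;
}

static struct demo_page *demo_compound_head(struct demo_page *page)
{
	unsigned long head = page->compound_head;

	if (head & 1UL)			/* tail: strip the tag bit */
		return (struct demo_page *)(head - 1UL);
	return page;			/* already a head page */
}

With pgmap stored once in the folio (i.e., in the head page), a tail page's
first word is free to hold the tagged head pointer, so the standard scheme
above works unchanged for zone device pages.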
On Tue, Jan 14, 2025 at 03:59:31PM +0100, David Hildenbrand wrote:
> On 10.01.25 07:00, Alistair Popple wrote:
> > Zone device pages are used to represent various types of device memory
> > managed by device drivers. Currently compound zone device pages are
> > not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
> > user of higher order zone device pages and have their own page
> > reference counting.
> > 
> > A future change will unify FS DAX reference counting with normal page
> > reference counting rules and remove the special FS DAX reference
> > counting. Supporting that requires compound zone device pages.
> > 
> > Supporting compound zone device pages requires compound_head() to
> > distinguish between head and tail pages whilst still preserving the
> > special struct page fields that are specific to zone device pages.
> > 
> > A tail page is distinguished by having bit zero set in
> > page->compound_head, with the remaining bits pointing to the head
> > page. For zone device pages page->compound_head is shared with
> > page->pgmap.
> > 
> > The page->pgmap field is common to all pages within a memory section.
> > Therefore pgmap is the same for both head and tail pages and can be
> > moved into the folio and we can use the standard scheme to find
> > compound_head from a tail page.
> 
> The more relevant thing is that the pgmap field must be common to all
> pages in a folio, even if a folio exceeds memory sections (e.g., 128 MiB
> on x86_64 where we have 1 GiB folios).

Thanks for pointing that out. I had assumed folios couldn't cross a
memory section. Obviously that is wrong so I've updated the commit
message accordingly.

 - Alistair

> > 
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> > 
> > ---
> > 
> > Changes for v4:
> >  - Fix build breakages reported by kernel test robot
> > 
> > Changes since v2:
> > 
> >  - Indentation fix
> >  - Rename page_dev_pagemap() to page_pgmap()
> >  - Rename folio _unused field to _unused_pgmap_compound_head
> >  - s/WARN_ON/VM_WARN_ON_ONCE_PAGE/
> > 
> > Changes since v1:
> > 
> >  - Move pgmap to the folio as suggested by Matthew Wilcox
> > ---
> 
> [...]
> 
> >   static inline bool folio_is_device_coherent(const struct folio *folio)
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 29919fa..61899ec 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -205,8 +205,8 @@ struct migrate_vma {
> >   	unsigned long		end;
> >   
> >   	/*
> > -	 * Set to the owner value also stored in page->pgmap->owner for
> > -	 * migrating out of device private memory. The flags also need to
> > +	 * Set to the owner value also stored in page_pgmap(page)->owner
> > +	 * for migrating out of device private memory. The flags also need to
> >   	 * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
> >   	 * The caller should always set this field when using mmu notifier
> >   	 * callbacks to avoid device MMU invalidations for device private
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index df8f515..54b59b8 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -129,8 +129,11 @@ struct page {
> >   				unsigned long compound_head;	/* Bit zero is set */
> >   			};
> >   			struct {	/* ZONE_DEVICE pages */
> > -				/** @pgmap: Points to the hosting device page map. */
> > -				struct dev_pagemap *pgmap;
> > +				/*
> > +				 * The first word is used for compound_head or folio
> > +				 * pgmap
> > +				 */
> > +				void *_unused_pgmap_compound_head;
> >   				void *zone_device_data;
> >   				/*
> >   				 * ZONE_DEVICE private pages are counted as being
> > @@ -299,6 +302,7 @@ typedef struct {
> >    * @_refcount: Do not access this member directly. Use folio_ref_count()
> >    *    to find how many references there are to this folio.
> >    * @memcg_data: Memory Control Group data.
> > + * @pgmap: Metadata for ZONE_DEVICE mappings
> >    * @virtual: Virtual address in the kernel direct map.
> >    * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
> >    * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
> > @@ -337,6 +341,7 @@ struct folio {
> >   		/* private: */
> >   			};
> >   		/* public: */
> > +			struct dev_pagemap *pgmap;
> 
> Agreed, that should work.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
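To put numbers on the folio-versus-section point, here is a standalone
sketch using the x86_64 values from David's example (SECTION_SIZE_BITS is
27 on x86_64 sparsemem; this is demo code, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long section_size = 1UL << 27;	/* 128 MiB sparsemem section */
	unsigned long pud_folio = 1UL << 30;	/* 1 GiB PUD-sized folio */

	/* A single 1 GiB folio spans 8 sections, so "pgmap is constant
	 * per section" is a weaker property than the per-folio constancy
	 * the encoding actually needs. */
	printf("sections per 1 GiB folio: %lu\n", pud_folio / section_size);
	return 0;
}

Storing pgmap in the folio, rather than relying on the per-section
invariant, makes the per-folio requirement hold by construction.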
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1a07256..61d0f41 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -88,7 +88,8 @@ struct nouveau_dmem {
 
 static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
 {
-	return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
+	return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
+			    pagemap);
 }
 
 static struct nouveau_drm *page_to_drm(struct page *page)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 04773a8..19214ec 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -202,7 +202,7 @@ static const struct attribute_group p2pmem_group = {
 
 static void p2pdma_page_free(struct page *page)
 {
-	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
 	/* safe to dereference while a reference is held to the percpu ref */
 	struct pci_p2pdma *p2pdma =
 		rcu_dereference_protected(pgmap->provider->p2pdma, 1);
@@ -1025,8 +1025,8 @@ enum pci_p2pdma_map_type
 pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
 		       struct scatterlist *sg)
 {
-	if (state->pgmap != sg_page(sg)->pgmap) {
-		state->pgmap = sg_page(sg)->pgmap;
+	if (state->pgmap != page_pgmap(sg_page(sg))) {
+		state->pgmap = page_pgmap(sg_page(sg));
 		state->map = pci_p2pdma_map_type(state->pgmap, dev);
 		state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
 	}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143a..0256a42 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
 		is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+		page_pgmap(page)->type == MEMORY_DEVICE_PRIVATE;
 }
 
 static inline bool folio_is_device_private(const struct folio *folio)
@@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
 		is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+		page_pgmap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
 static inline bool is_device_coherent_page(const struct page *page)
 {
 	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_COHERENT;
+		page_pgmap(page)->type == MEMORY_DEVICE_COHERENT;
 }
 
 static inline bool folio_is_device_coherent(const struct folio *folio)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 29919fa..61899ec 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -205,8 +205,8 @@ struct migrate_vma {
 	unsigned long		end;
 
 	/*
-	 * Set to the owner value also stored in page->pgmap->owner for
-	 * migrating out of device private memory. The flags also need to
+	 * Set to the owner value also stored in page_pgmap(page)->owner
+	 * for migrating out of device private memory. The flags also need to
 	 * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
 	 * The caller should always set this field when using mmu notifier
 	 * callbacks to avoid device MMU invalidations for device private
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index df8f515..54b59b8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -129,8 +129,11 @@ struct page {
 				unsigned long compound_head;	/* Bit zero is set */
 			};
 			struct {	/* ZONE_DEVICE pages */
-				/** @pgmap: Points to the hosting device page map. */
-				struct dev_pagemap *pgmap;
+				/*
+				 * The first word is used for compound_head or folio
+				 * pgmap
+				 */
+				void *_unused_pgmap_compound_head;
 				void *zone_device_data;
 				/*
 				 * ZONE_DEVICE private pages are counted as being
@@ -299,6 +302,7 @@ typedef struct {
  * @_refcount: Do not access this member directly. Use folio_ref_count()
  *    to find how many references there are to this folio.
  * @memcg_data: Memory Control Group data.
+ * @pgmap: Metadata for ZONE_DEVICE mappings
  * @virtual: Virtual address in the kernel direct map.
  * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
  * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
@@ -337,6 +341,7 @@ struct folio {
 		/* private: */
 			};
 		/* public: */
+			struct dev_pagemap *pgmap;
 		};
 		struct address_space *mapping;
 		pgoff_t index;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c7ad4d6..fd492c3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1159,6 +1159,12 @@ static inline bool is_zone_device_page(const struct page *page)
 	return page_zonenum(page) == ZONE_DEVICE;
 }
 
+static inline struct dev_pagemap *page_pgmap(const struct page *page)
+{
+	VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page);
+	return page_folio(page)->pgmap;
+}
+
 /*
  * Consecutive zone device pages should not be merged into the same sgl
  * or bvec segment with other types of pages or if they belong to different
@@ -1174,7 +1180,7 @@ static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
 		return false;
 	if (!is_zone_device_page(a))
 		return true;
-	return a->pgmap == b->pgmap;
+	return page_pgmap(a) == page_pgmap(b);
 }
 
 extern void memmap_init_zone_device(struct zone *, unsigned long,
@@ -1189,6 +1195,10 @@ static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
 {
 	return true;
 }
+static inline struct dev_pagemap *page_pgmap(const struct page *page)
+{
+	return NULL;
+}
 #endif
 
 static inline bool folio_is_zone_device(const struct folio *folio)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 056f2e4..ffd0c6f 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -195,7 +195,8 @@ static int dmirror_fops_release(struct inode *inode, struct file *filp)
 
 static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
 {
-	return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+	return container_of(page_pgmap(page), struct dmirror_chunk,
+			    pagemap);
 }
 
 static struct dmirror_device *dmirror_page_to_device(struct page *page)
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229a..082f7b7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -248,7 +248,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 * just report the PFN.
 		 */
 		if (is_device_private_entry(entry) &&
-		    pfn_swap_entry_to_page(entry)->pgmap->owner ==
+		    page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
 		    range->dev_private_owner) {
 			cpu_flags = HMM_PFN_VALID;
 			if (is_writable_device_private_entry(entry))
diff --git a/mm/memory.c b/mm/memory.c
index f09f20c..06bb29e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4316,6 +4316,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
+			struct dev_pagemap *pgmap;
 			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
 				/*
 				 * migrate_to_ram is not yet ready to operate
@@ -4340,7 +4341,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			 */
 			get_page(vmf->page);
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
-			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+			pgmap = page_pgmap(vmf->page);
+			ret = pgmap->ops->migrate_to_ram(vmf);
 			put_page(vmf->page);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
diff --git a/mm/memremap.c b/mm/memremap.c
index 07bbe0e..68099af 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -458,8 +458,8 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 void free_zone_device_folio(struct folio *folio)
 {
-	if (WARN_ON_ONCE(!folio->page.pgmap->ops ||
-			!folio->page.pgmap->ops->page_free))
+	if (WARN_ON_ONCE(!folio->pgmap->ops ||
+			!folio->pgmap->ops->page_free))
 		return;
 
 	mem_cgroup_uncharge(folio);
@@ -486,12 +486,12 @@ void free_zone_device_folio(struct folio *folio)
 	 * to clear folio->mapping.
 	 */
 	folio->mapping = NULL;
-	folio->page.pgmap->ops->page_free(folio_page(folio, 0));
+	folio->pgmap->ops->page_free(folio_page(folio, 0));
 
-	switch (folio->page.pgmap->type) {
+	switch (folio->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_COHERENT:
-		put_dev_pagemap(folio->page.pgmap);
+		put_dev_pagemap(folio->pgmap);
 		break;
 
 	case MEMORY_DEVICE_FS_DAX:
@@ -514,7 +514,7 @@ void zone_device_page_init(struct page *page)
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
 	 */
-	WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
+	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
 	set_page_count(page, 1);
 	lock_page(page);
 }
@@ -523,7 +523,7 @@ EXPORT_SYMBOL_GPL(zone_device_page_init);
 #ifdef CONFIG_FS_DAX
 bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
 {
-	if (folio->page.pgmap->type != MEMORY_DEVICE_FS_DAX)
+	if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
 		return false;
 
 	/*
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 9cf2659..2209070 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -106,6 +106,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 
 	arch_enter_lazy_mmu_mode();
 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
+		struct dev_pagemap *pgmap;
 		unsigned long mpfn = 0, pfn;
 		struct folio *folio;
 		struct page *page;
@@ -133,9 +134,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 				goto next;
 
 			page = pfn_swap_entry_to_page(entry);
+			pgmap = page_pgmap(page);
 			if (!(migrate->flags &
 			      MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
-			    page->pgmap->owner != migrate->pgmap_owner)
+			    pgmap->owner != migrate->pgmap_owner)
 				goto next;
 
 			mpfn = migrate_pfn(page_to_pfn(page)) |
@@ -151,12 +153,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 				goto next;
 			}
 			page = vm_normal_page(migrate->vma, addr, pte);
+			pgmap = page_pgmap(page);
 			if (page && !is_zone_device_page(page) &&
 			    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
 				goto next;
 			else if (page && is_device_coherent_page(page) &&
 			    (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
-			     page->pgmap->owner != migrate->pgmap_owner))
+			     pgmap->owner != migrate->pgmap_owner))
 				goto next;
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f021e63..cb73402 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -998,7 +998,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	 * and zone_device_data. It is a bug if a ZONE_DEVICE page is
 	 * ever freed or placed on a driver-private list.
 	 */
-	page->pgmap = pgmap;
+	page_folio(page)->pgmap = pgmap;
 	page->zone_device_data = NULL;
 
 	/*
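For driver authors, the conversion pattern in the nouveau and test_hmm
hunks above generalises: a driver that embeds its struct dev_pagemap inside
a larger per-chunk allocation now goes through page_pgmap() instead of the
removed page->pgmap field. A hypothetical sketch (my_chunk and
my_page_to_chunk are invented names, not part of this series):

#include <linux/memremap.h>
#include <linux/mmzone.h>

struct my_chunk {
	struct dev_pagemap pagemap;	/* the embedded pgmap member */
	void *driver_private;		/* other per-chunk state */
};

static struct my_chunk *my_page_to_chunk(struct page *page)
{
	/* page_pgmap() warns (VM_WARN_ON_ONCE_PAGE) if the page is not
	 * ZONE_DEVICE and fetches the pgmap from the owning folio, so
	 * this also works for tail pages of compound device pages. */
	return container_of(page_pgmap(page), struct my_chunk, pagemap);
}

Because page_pgmap() resolves through page_folio(), callers no longer need
to care whether they hold a head or a tail page.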