Message ID | f3ebda542373feb70ed3e5d83b276a2e8347609f.1734407924.git-series.apopple@nvidia.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fs/dax: Fix ZONE_DEVICE page reference counts | expand |
On 17.12.24 06:13, Alistair Popple wrote: > The procfs mmu files such as smaps currently ignore device dax and fs > dax pages because these pages are considered special. To maintain > existing behaviour once these pages are treated as normal pages and > returned from vm_normal_page() add tests to explicitly skip them. > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > --- > fs/proc/task_mmu.c | 18 ++++++++++++++---- > 1 file changed, 14 insertions(+), 4 deletions(-) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 38a5a3e..c9b227a 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, > > if (pte_present(ptent)) { > page = vm_normal_page(vma, addr, ptent); > + if (page && (is_device_dax_page(page) || is_fsdax_page(page))) This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme here, likely we should have a special helper? But, don't we actually want to include them in the smaps output now? I think we want. The rmap code will indicate these pages in /proc/meminfo, per-node info, in the memcg ... as "Mapped:" etc. So likely we just want to also indicate them here, or is there any downsides we know of?
On Tue, Dec 17, 2024 at 11:31:25PM +0100, David Hildenbrand wrote: > On 17.12.24 06:13, Alistair Popple wrote: > > The procfs mmu files such as smaps currently ignore device dax and fs > > dax pages because these pages are considered special. To maintain > > existing behaviour once these pages are treated as normal pages and > > returned from vm_normal_page() add tests to explicitly skip them. > > > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > > --- > > fs/proc/task_mmu.c | 18 ++++++++++++++---- > > 1 file changed, 14 insertions(+), 4 deletions(-) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index 38a5a3e..c9b227a 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, > > if (pte_present(ptent)) { > > page = vm_normal_page(vma, addr, ptent); > > + if (page && (is_device_dax_page(page) || is_fsdax_page(page))) > > This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme > here, likely we should have a special helper? Sounds good, will add is_dax_page() if there are enough callers left after any review comments. > But, don't we actually want to include them in the smaps output now? I think > we want. I'm not an expert in what callers of vm_normal_page() think of as a "normal" page. So my philosphy here was to ensure anything calling vm_normal_page() didn't accidentally start seeing DAX pages, either by checking existing filters (lots of callers already call vma_is_special_huge() or some equivalent) or explicitly filtering them out in the hope someone smarter than me could tell me it was unneccssary. That stategy seems to have worked, and so I agree we likely do want them in smaps. I just didn't want to silently do it without this kind of discussion first. > The rmap code will indicate these pages in /proc/meminfo, per-node info, in > the memcg ... as "Mapped:" etc. > > So likely we just want to also indicate them here, or is there any downsides > we know of? I don't know of any, and I think it makes sense to also indicate them so will drop this check in the respin. > -- > Cheers, > > David / dhildenb >
On 19.12.24 00:11, Alistair Popple wrote: > On Tue, Dec 17, 2024 at 11:31:25PM +0100, David Hildenbrand wrote: >> On 17.12.24 06:13, Alistair Popple wrote: >>> The procfs mmu files such as smaps currently ignore device dax and fs >>> dax pages because these pages are considered special. To maintain >>> existing behaviour once these pages are treated as normal pages and >>> returned from vm_normal_page() add tests to explicitly skip them. >>> >>> Signed-off-by: Alistair Popple <apopple@nvidia.com> >>> --- >>> fs/proc/task_mmu.c | 18 ++++++++++++++---- >>> 1 file changed, 14 insertions(+), 4 deletions(-) >>> >>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >>> index 38a5a3e..c9b227a 100644 >>> --- a/fs/proc/task_mmu.c >>> +++ b/fs/proc/task_mmu.c >>> @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, >>> if (pte_present(ptent)) { >>> page = vm_normal_page(vma, addr, ptent); >>> + if (page && (is_device_dax_page(page) || is_fsdax_page(page))) >> >> This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme >> here, likely we should have a special helper? > > Sounds good, will add is_dax_page() if there are enough callers left after any > review comments. :) > >> But, don't we actually want to include them in the smaps output now? I think >> we want. > > I'm not an expert in what callers of vm_normal_page() think of as a "normal" > page. Yeah, it's tricky. It means "this is abnormal, don't look at the struct page". We're moving away from that, such that these folios/pages will be ... mostly normal :) > So my philosphy here was to ensure anything calling vm_normal_page() > didn't accidentally start seeing DAX pages, either by checking existing filters > (lots of callers already call vma_is_special_huge() or some equivalent) or > explicitly filtering them out in the hope someone smarter than me could tell me > it was unneccssary. > > That stategy seems to have worked, and so I agree we likely do want them in > smaps. I just didn't want to silently do it without this kind of discussion > first. Yes, absolutely. > >> The rmap code will indicate these pages in /proc/meminfo, per-node info, in >> the memcg ... as "Mapped:" etc. >> >> So likely we just want to also indicate them here, or is there any downsides >> we know of? > > I don't know of any, and I think it makes sense to also indicate them so will > drop this check in the respin. It will be easy to hide them later, at least we talked about it. Thanks for doing all this!
On Fri, Dec 20, 2024 at 07:32:52PM +0100, David Hildenbrand wrote: > On 19.12.24 00:11, Alistair Popple wrote: > > On Tue, Dec 17, 2024 at 11:31:25PM +0100, David Hildenbrand wrote: > > > On 17.12.24 06:13, Alistair Popple wrote: > > > > The procfs mmu files such as smaps currently ignore device dax and fs > > > > dax pages because these pages are considered special. To maintain > > > > existing behaviour once these pages are treated as normal pages and > > > > returned from vm_normal_page() add tests to explicitly skip them. > > > > > > > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > > > > --- > > > > fs/proc/task_mmu.c | 18 ++++++++++++++---- > > > > 1 file changed, 14 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > index 38a5a3e..c9b227a 100644 > > > > --- a/fs/proc/task_mmu.c > > > > +++ b/fs/proc/task_mmu.c > > > > @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, > > > > if (pte_present(ptent)) { > > > > page = vm_normal_page(vma, addr, ptent); > > > > + if (page && (is_device_dax_page(page) || is_fsdax_page(page))) > > > > > > This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme > > > here, likely we should have a special helper? > > > > Sounds good, will add is_dax_page() if there are enough callers left after any > > review comments. > > :) In the end there was only a single caller so I will leave this open-coded. > > > But, don't we actually want to include them in the smaps output now? I think > > > we want. > > > > I'm not an expert in what callers of vm_normal_page() think of as a "normal" > > page. > > Yeah, it's tricky. It means "this is abnormal, don't look at the struct > page". We're moving away from that, such that these folios/pages will be ... > mostly normal :) > > > So my philosphy here was to ensure anything calling vm_normal_page() > > didn't accidentally start seeing DAX pages, either by checking existing filters > > (lots of callers already call vma_is_special_huge() or some equivalent) or > > explicitly filtering them out in the hope someone smarter than me could tell me > > it was unneccssary. > > > > That stategy seems to have worked, and so I agree we likely do want them in > > smaps. I just didn't want to silently do it without this kind of discussion > > first. > > Yes, absolutely. > > > > > > The rmap code will indicate these pages in /proc/meminfo, per-node info, in > > > the memcg ... as "Mapped:" etc. > > > > > > So likely we just want to also indicate them here, or is there any downsides > > > we know of? > > > > I don't know of any, and I think it makes sense to also indicate them so will > > drop this check in the respin. > > It will be easy to hide them later, at least we talked about it. Thanks for > doing all this! Not a problem. The other main thing in this patch is also hiding them from /proc/<PID>/pagemap. Based on this discussion I can't think of any good reason why we would want to hide them there so will also remove the checks in the pagemap walker. > -- > Cheers, > > David / dhildenb >
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 38a5a3e..c9b227a 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, if (pte_present(ptent)) { page = vm_normal_page(vma, addr, ptent); + if (page && (is_device_dax_page(page) || is_fsdax_page(page))) + page = NULL; young = pte_young(ptent); dirty = pte_dirty(ptent); present = true; @@ -849,6 +851,8 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, if (pmd_present(*pmd)) { page = vm_normal_page_pmd(vma, addr, *pmd); + if (page && (is_device_dax_page(page) || is_fsdax_page(page))) + page = NULL; present = true; } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) { swp_entry_t entry = pmd_to_swp_entry(*pmd); @@ -1378,7 +1382,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))) return false; folio = vm_normal_folio(vma, addr, pte); - if (!folio) + if (!folio || folio_is_device_dax(folio) || folio_is_fsdax(folio)) return false; return folio_maybe_dma_pinned(folio); } @@ -1703,6 +1707,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, frame = pte_pfn(pte); flags |= PM_PRESENT; page = vm_normal_page(vma, addr, pte); + if (page && (is_device_dax_page(page) || is_fsdax_page(page))) + page = NULL; if (pte_soft_dirty(pte)) flags |= PM_SOFT_DIRTY; if (pte_uffd_wp(pte)) @@ -2089,7 +2095,9 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p, if (p->masks_of_interest & PAGE_IS_FILE) { page = vm_normal_page(vma, addr, pte); - if (page && !PageAnon(page)) + if (page && !PageAnon(page) && + !is_device_dax_page(page) && + !is_fsdax_page(page)) categories |= PAGE_IS_FILE; } @@ -2151,7 +2159,9 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p, if (p->masks_of_interest & PAGE_IS_FILE) { page = vm_normal_page_pmd(vma, addr, pmd); - if (page && !PageAnon(page)) + if (page && !PageAnon(page) && + !is_device_dax_page(page) && + !is_fsdax_page(page)) categories |= PAGE_IS_FILE; } @@ -2914,7 +2924,7 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd, return NULL; page = vm_normal_page_pmd(vma, addr, pmd); - if (!page) + if (!page || is_device_dax_page(page) || is_fsdax_page(page)) return NULL; if (PageReserved(page))
The procfs mmu files such as smaps currently ignore device dax and fs dax pages because these pages are considered special. To maintain existing behaviour once these pages are treated as normal pages and returned from vm_normal_page() add tests to explicitly skip them. Signed-off-by: Alistair Popple <apopple@nvidia.com> --- fs/proc/task_mmu.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-)