[v4,19/25] proc/task_mmu: Ignore ZONE_DEVICE pages

Message ID: f3ebda542373feb70ed3e5d83b276a2e8347609f.1734407924.git-series.apopple@nvidia.com
State: New
Series: fs/dax: Fix ZONE_DEVICE page reference counts

Commit Message

Alistair Popple Dec. 17, 2024, 5:13 a.m. UTC
The procfs mmu files such as smaps currently ignore device dax and fs
dax pages because these pages are considered special. To maintain
existing behaviour once these pages are treated as normal pages and
returned from vm_normal_page(), add tests to explicitly skip them.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 fs/proc/task_mmu.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

Comments

David Hildenbrand Dec. 17, 2024, 10:31 p.m. UTC | #1
On 17.12.24 06:13, Alistair Popple wrote:
> The procfs mmu files such as smaps currently ignore device dax and fs
> dax pages because these pages are considered special. To maintain
> existing behaviour once these pages are treated as normal pages and
> returned from vm_normal_page() add tests to explicitly skip them.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>   fs/proc/task_mmu.c | 18 ++++++++++++++----
>   1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 38a5a3e..c9b227a 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>   
>   	if (pte_present(ptent)) {
>   		page = vm_normal_page(vma, addr, ptent);
> +		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))

This "is_device_dax_page(page) || is_fsdax_page(page)" check is a common
theme here; should we add a dedicated helper?

But don't we actually want to include them in the smaps output now? I
think we do.

The rmap code will indicate these pages in /proc/meminfo, per-node info,
in the memcg ... as "Mapped:" etc.

So we likely just want to indicate them here as well, or are there any
downsides we know of?
Alistair Popple Dec. 18, 2024, 11:11 p.m. UTC | #2
On Tue, Dec 17, 2024 at 11:31:25PM +0100, David Hildenbrand wrote:
> On 17.12.24 06:13, Alistair Popple wrote:
> > The procfs mmu files such as smaps currently ignore device dax and fs
> > dax pages because these pages are considered special. To maintain
> > existing behaviour once these pages are treated as normal pages and
> > returned from vm_normal_page() add tests to explicitly skip them.
> > 
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > ---
> >   fs/proc/task_mmu.c | 18 ++++++++++++++----
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 38a5a3e..c9b227a 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> >   	if (pte_present(ptent)) {
> >   		page = vm_normal_page(vma, addr, ptent);
> > +		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
> 
> This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme
> here, likely we should have a special helper?

Sounds good, will add is_dax_page() if there are enough callers left after any
review comments.
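
For illustration only, the helper being discussed might look roughly like the
sketch below. It is a user-space mock, not kernel code: struct mock_page, the
MOCK_* values and skip_dax_page() are made up for this example, and the two
is_*_page() predicates are stand-ins for the kernel functions named in the
patch. Only the name is_dax_page() comes from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the information the kernel keeps per page. */
enum mock_page_kind {
	MOCK_ORDINARY,
	MOCK_DEVICE_DAX,
	MOCK_FS_DAX,
};

struct mock_page {
	enum mock_page_kind kind;
};

/* Mock versions of the two predicates used in the patch. */
static inline bool is_device_dax_page(const struct mock_page *page)
{
	return page->kind == MOCK_DEVICE_DAX;
}

static inline bool is_fsdax_page(const struct mock_page *page)
{
	return page->kind == MOCK_FS_DAX;
}

/* The combined helper proposed in the thread: true for either DAX flavour. */
static inline bool is_dax_page(const struct mock_page *page)
{
	return is_device_dax_page(page) || is_fsdax_page(page);
}

/*
 * Mirrors the "if (page && ...) page = NULL;" pattern from the patch:
 * DAX pages are hidden from the caller, everything else passes through.
 */
static inline const struct mock_page *skip_dax_page(const struct mock_page *page)
{
	return (page && is_dax_page(page)) ? NULL : page;
}

/* Small self-check showing how the helpers behave on each page kind. */
static void dax_helper_selftest(void)
{
	struct mock_page ordinary = { .kind = MOCK_ORDINARY };
	struct mock_page devdax = { .kind = MOCK_DEVICE_DAX };
	struct mock_page fsdax = { .kind = MOCK_FS_DAX };

	assert(!is_dax_page(&ordinary));
	assert(is_dax_page(&devdax));
	assert(is_dax_page(&fsdax));

	assert(skip_dax_page(&ordinary) == &ordinary);
	assert(skip_dax_page(&devdax) == NULL);
	assert(skip_dax_page(&fsdax) == NULL);
	assert(skip_dax_page(NULL) == NULL);
}
```

With something like this, each open-coded test in the patch would shrink to a
single is_dax_page() call, which only pays off if several callers remain.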
 
> But, don't we actually want to include them in the smaps output now? I think
> we want.

I'm not an expert in what callers of vm_normal_page() think of as a "normal"
page. So my philosophy here was to ensure anything calling vm_normal_page()
didn't accidentally start seeing DAX pages, either by checking existing filters
(lots of callers already call vma_is_special_huge() or some equivalent) or
explicitly filtering them out in the hope someone smarter than me could tell me
it was unnecessary.

That strategy seems to have worked, and so I agree we likely do want them in
smaps. I just didn't want to silently do it without this kind of discussion
first.

> The rmap code will indicate these pages in /proc/meminfo, per-node info, in
> the memcg ... as "Mapped:" etc.
> 
> So likely we just want to also indicate them here, or is there any downsides
> we know of?

I don't know of any, and I think it makes sense to also indicate them so will
drop this check in the respin.

David Hildenbrand Dec. 20, 2024, 6:32 p.m. UTC | #3
On 19.12.24 00:11, Alistair Popple wrote:
> On Tue, Dec 17, 2024 at 11:31:25PM +0100, David Hildenbrand wrote:
>> On 17.12.24 06:13, Alistair Popple wrote:
>>> The procfs mmu files such as smaps currently ignore device dax and fs
>>> dax pages because these pages are considered special. To maintain
>>> existing behaviour once these pages are treated as normal pages and
>>> returned from vm_normal_page() add tests to explicitly skip them.
>>>
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> ---
>>>    fs/proc/task_mmu.c | 18 ++++++++++++++----
>>>    1 file changed, 14 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index 38a5a3e..c9b227a 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>>>    	if (pte_present(ptent)) {
>>>    		page = vm_normal_page(vma, addr, ptent);
>>> +		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
>>
>> This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme
>> here, likely we should have a special helper?
> 
> Sounds good, will add is_dax_page() if there are enough callers left after any
> review comments.

:)

>   
>> But, don't we actually want to include them in the smaps output now? I think
>> we want.
> 
> I'm not an expert in what callers of vm_normal_page() think of as a "normal"
> page. 

Yeah, it's tricky. Not "normal" means "this is abnormal, don't look at the
struct page". We're moving away from that, such that these folios/pages will
be ... mostly normal :)

> So my philosophy here was to ensure anything calling vm_normal_page()
> didn't accidentally start seeing DAX pages, either by checking existing filters
> (lots of callers already call vma_is_special_huge() or some equivalent) or
> explicitly filtering them out in the hope someone smarter than me could tell me
> it was unnecessary.
> 
> That strategy seems to have worked, and so I agree we likely do want them in
> smaps. I just didn't want to silently do it without this kind of discussion
> first.

Yes, absolutely.

> 
>> The rmap code will indicate these pages in /proc/meminfo, per-node info, in
>> the memcg ... as "Mapped:" etc.
>>
>> So likely we just want to also indicate them here, or is there any downsides
>> we know of?
> 
> I don't know of any, and I think it makes sense to also indicate them so will
> drop this check in the respin.

It will be easy to hide them later, at least we talked about it. Thanks 
for doing all this!
Alistair Popple Jan. 6, 2025, 6:43 a.m. UTC | #4
On Fri, Dec 20, 2024 at 07:32:52PM +0100, David Hildenbrand wrote:
> On 19.12.24 00:11, Alistair Popple wrote:
> > On Tue, Dec 17, 2024 at 11:31:25PM +0100, David Hildenbrand wrote:
> > > On 17.12.24 06:13, Alistair Popple wrote:
> > > > The procfs mmu files such as smaps currently ignore device dax and fs
> > > > dax pages because these pages are considered special. To maintain
> > > > existing behaviour once these pages are treated as normal pages and
> > > > returned from vm_normal_page() add tests to explicitly skip them.
> > > > 
> > > > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > > > ---
> > > >    fs/proc/task_mmu.c | 18 ++++++++++++++----
> > > >    1 file changed, 14 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 38a5a3e..c9b227a 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> > > >    	if (pte_present(ptent)) {
> > > >    		page = vm_normal_page(vma, addr, ptent);
> > > > +		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
> > > 
> > > This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme
> > > here, likely we should have a special helper?
> > 
> > Sounds good, will add is_dax_page() if there are enough callers left after any
> > review comments.
> 
> :)

In the end there was only a single caller so I will leave this open-coded.

> > > But, don't we actually want to include them in the smaps output now? I think
> > > we want.
> > 
> > I'm not an expert in what callers of vm_normal_page() think of as a "normal"
> > page.
> 
> Yeah, it's tricky. It means "this is abnormal, don't look at the struct
> page". We're moving away from that, such that these folios/pages will be ...
> mostly normal :)
> 
> > So my philosophy here was to ensure anything calling vm_normal_page()
> > didn't accidentally start seeing DAX pages, either by checking existing filters
> > (lots of callers already call vma_is_special_huge() or some equivalent) or
> > explicitly filtering them out in the hope someone smarter than me could tell me
> > it was unnecessary.
> > 
> > That strategy seems to have worked, and so I agree we likely do want them in
> > smaps. I just didn't want to silently do it without this kind of discussion
> > first.
> 
> Yes, absolutely.
> 
> > 
> > > The rmap code will indicate these pages in /proc/meminfo, per-node info, in
> > > the memcg ... as "Mapped:" etc.
> > > 
> > > So likely we just want to also indicate them here, or is there any downsides
> > > we know of?
> > 
> > I don't know of any, and I think it makes sense to also indicate them so will
> > drop this check in the respin.
> 
> It will be easy to hide them later, at least we talked about it. Thanks for
> doing all this!

Not a problem. The other main thing in this patch is also hiding them from
/proc/<PID>/pagemap. Based on this discussion I can't think of any good reason
why we would want to hide them there so will also remove the checks in the
pagemap walker.
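
For anyone wanting to see what "hiding" means from user space, here is a rough
sketch of decoding a /proc/<PID>/pagemap entry. The bit positions follow
Documentation/admin-guide/mm/pagemap.rst; the demo_ and DEMO_ names are
invented for this example. Judging from the pte_to_pagemap_entry() hunk,
forcing page to NULL would presumably leave the present bit set while keeping
the file bit clear for DAX pages, but that reading is an assumption rather
than something tested here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Bit layout per pagemap.rst; the DEMO_ names are local to this sketch. */
#define DEMO_PM_PFN_MASK ((UINT64_C(1) << 55) - 1)	/* bits 0-54: PFN if present */
#define DEMO_PM_FILE     (UINT64_C(1) << 61)		/* file-mapped or shared-anon */
#define DEMO_PM_SWAP     (UINT64_C(1) << 62)		/* page is swapped out */
#define DEMO_PM_PRESENT  (UINT64_C(1) << 63)		/* page is present in RAM */

struct demo_pagemap_info {
	bool present;
	bool swapped;
	bool file;
	uint64_t pfn;	/* reads back as 0 without CAP_SYS_ADMIN */
};

/* Split one 64-bit pagemap entry into the fields relevant to this thread. */
static struct demo_pagemap_info demo_decode_pagemap(uint64_t entry)
{
	struct demo_pagemap_info info;

	info.present = (entry & DEMO_PM_PRESENT) != 0;
	info.swapped = (entry & DEMO_PM_SWAP) != 0;
	info.file = (entry & DEMO_PM_FILE) != 0;
	/* The low bits only hold a PFN when the page is present. */
	info.pfn = info.present ? (entry & DEMO_PM_PFN_MASK) : 0;
	return info;
}

/*
 * Read the pagemap entry covering addr in the calling process.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
static int demo_read_pagemap_entry(const void *addr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long offset;
	FILE *f;
	int ret = -1;

	if (page_size <= 0)
		return -1;

	f = fopen("/proc/self/pagemap", "rb");
	if (!f)
		return -1;

	/* Unbuffered, so only the requested 8 bytes are read from the file. */
	setvbuf(f, NULL, _IONBF, 0);

	/* One 8-byte entry per virtual page, indexed by virtual page number. */
	offset = (long)((uintptr_t)addr / (uintptr_t)page_size * sizeof(*entry));
	if (fseek(f, offset, SEEK_SET) == 0 &&
	    fread(entry, sizeof(*entry), 1, f) == 1)
		ret = 0;

	fclose(f);
	return ret;
}
```

Calling demo_read_pagemap_entry() on an address inside a DAX mapping and
passing the result to demo_decode_pagemap() should make any difference between
the filtered and unfiltered behaviour visible in the file bit.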


Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 38a5a3e..c9b227a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -801,6 +801,8 @@  static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
 	if (pte_present(ptent)) {
 		page = vm_normal_page(vma, addr, ptent);
+		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
+			page = NULL;
 		young = pte_young(ptent);
 		dirty = pte_dirty(ptent);
 		present = true;
@@ -849,6 +851,8 @@  static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 
 	if (pmd_present(*pmd)) {
 		page = vm_normal_page_pmd(vma, addr, *pmd);
+		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
+			page = NULL;
 		present = true;
 	} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
@@ -1378,7 +1382,7 @@  static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
 	if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
 		return false;
 	folio = vm_normal_folio(vma, addr, pte);
-	if (!folio)
+	if (!folio || folio_is_device_dax(folio) || folio_is_fsdax(folio))
 		return false;
 	return folio_maybe_dma_pinned(folio);
 }
@@ -1703,6 +1707,8 @@  static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			frame = pte_pfn(pte);
 		flags |= PM_PRESENT;
 		page = vm_normal_page(vma, addr, pte);
+		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
+			page = NULL;
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
 		if (pte_uffd_wp(pte))
@@ -2089,7 +2095,9 @@  static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
 
 		if (p->masks_of_interest & PAGE_IS_FILE) {
 			page = vm_normal_page(vma, addr, pte);
-			if (page && !PageAnon(page))
+			if (page && !PageAnon(page) &&
+			    !is_device_dax_page(page) &&
+			    !is_fsdax_page(page))
 				categories |= PAGE_IS_FILE;
 		}
 
@@ -2151,7 +2159,9 @@  static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
 
 		if (p->masks_of_interest & PAGE_IS_FILE) {
 			page = vm_normal_page_pmd(vma, addr, pmd);
-			if (page && !PageAnon(page))
+			if (page && !PageAnon(page) &&
+			    !is_device_dax_page(page) &&
+			    !is_fsdax_page(page))
 				categories |= PAGE_IS_FILE;
 		}
 
@@ -2914,7 +2924,7 @@  static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 		return NULL;
 
 	page = vm_normal_page_pmd(vma, addr, pmd);
-	if (!page)
+	if (!page || is_device_dax_page(page) || is_fsdax_page(page))
 		return NULL;
 
 	if (PageReserved(page))