
[v1] mm: Add /proc/$PID/pageflags

Message ID 20211028205854.830200-1-almasrymina@google.com (mailing list archive)
State New

Commit Message

Mina Almasry Oct. 28, 2021, 8:58 p.m. UTC
From: Yu Zhao <yuzhao@google.com>

This file lets a userspace process know the page flags of each of its virtual
pages.  It contains a 64-bit set of flags for each virtual page, containing
data identical to that emitted by /proc/kpageflags.  This allows the user-space
task to learn the kpageflags for the pages backing its address-space by
consulting one file, without needing to be root.

Example use case is a performance sensitive user-space process querying the
hugepage backing of its own memory without the root access required to access
/proc/kpageflags, and without accessing /proc/self/smaps_rollup which can be
slow and needs to hold mmap_lock.

Similar to /proc/kpageflags, the flags printed out by the kernel for
each page are provided by stable_page_flags(), which exports flag bits
that are user visible and stable over time.

Signed-off-by: Mina Almasry <almasrymina@google.com>

Cc: David Rientjes <rientjes@google.com>
Cc: Paul E. McKenney <paulmckrcu@fb.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Florian Schmidt <florian.schmidt@nutanix.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org

---
 Documentation/admin-guide/mm/pagemap.rst |   9 +-
 Documentation/filesystems/proc.rst       |   5 +-
 fs/proc/base.c                           |   2 +
 fs/proc/internal.h                       |   1 +
 fs/proc/task_mmu.c                       | 158 ++++++++++++++++++-----
 5 files changed, 144 insertions(+), 31 deletions(-)

--
2.33.0.1079.g6e70778dc9-goog

Comments

David Hildenbrand Oct. 29, 2021, 7:11 a.m. UTC | #1
On 28.10.21 22:58, Mina Almasry wrote:
> From: Yu Zhao <yuzhao@google.com>
> 
> This file lets a userspace process know the page flags of each of its virtual
> pages.  It contains a 64-bit set of flags for each virtual page, containing
> data identical to that emitted by /proc/kpageflags.  This allows the user-space
> task to learn the kpageflags for the pages backing its address-space by
> consulting one file, without needing to be root.
> 
> Example use case is a performance sensitive user-space process querying the
> hugepage backing of its own memory without the root access required to access
> /proc/kpageflags, and without accessing /proc/self/smaps_rollup which can be
> slow and needs to hold mmap_lock.

Can you elaborate on

a) The target use case. Are you primarily interested in seeing whether a
given base page is head or tail?

b) Your mmap_lock comment. pagemap_read() needs to hold the mmap lock in
read mode while walking process page tables via walk_page_range().

Also, do you have a rough performance comparison?

> 
> Similar to /proc/kpageflags, the flags printed out by the kernel for
> each page are provided by stable_page_flags(), which exports flag bits
> that are user visible and stable over time.

It exports flags (documented for pageflags_read()) that are not
applicable to processes, like OFFLINE, BUDDY, SLAB, PGTABLE ... and can
never happen. Some of these kpageflags are not even page->flags, they
include abstracted types we use for physical memory pages based on other
struct page members (OFFLINE, BUDDY, MMAP, PGTABLE, ...). This feels wrong.

Also, to me it feels like we are exposing too much internal information
to the user, essentially making it ABI that user space processes will
rely on.

Did you investigate

a) Reducing the flags we expose to a bare minimum necessary for your use
case (and actually applicable to mmaped pages).

b) Extending pagemap output instead.

You seem to be interested in the "hugepage backing", which smells like
"what is mapped" as in "pagemap".
Mina Almasry Oct. 29, 2021, 8:04 p.m. UTC | #2
On Fri, Oct 29, 2021 at 12:11 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 28.10.21 22:58, Mina Almasry wrote:
> > From: Yu Zhao <yuzhao@google.com>
> >
> > This file lets a userspace process know the page flags of each of its virtual
> > pages.  It contains a 64-bit set of flags for each virtual page, containing
> > data identical to that emitted by /proc/kpageflags.  This allows the user-space
> > task to learn the kpageflags for the pages backing its address-space by
> > consulting one file, without needing to be root.
> >
> > Example use case is a performance sensitive user-space process querying the
> > hugepage backing of its own memory without the root access required to access
> > /proc/kpageflags, and without accessing /proc/self/smaps_rollup which can be
> > slow and needs to hold mmap_lock.
>
> Can you elaborate on
>
> a) The target use case. Are you primarily interested in seeing whether a
> given base page is head or tail?
>

Not quite. Generally some userspace process (most notably our network
service) has a region of performance critical memory and would like to
know if this memory is backed by hugepages or not. It uses
/proc/self/pageflags to inspect the pageflags of the pages backing
this region, and counts how many ranges are backed by hugepages and
how many are not. Generally we export this data to metrics, and if the
hugepage backing drops or is insufficient we look into the issue
postmortem.
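
As a rough illustration of the counting described above, the sketch below
takes an array of 64-bit flag words (one per virtual page, in the kpageflags
bit layout) and counts the pages whose THP bit is set. The bit numbers are
the ones in include/uapi/linux/kernel-page-flags.h; the helper names
kpf_test() and count_thp_pages() are hypothetical, and reading the words
from the proposed /proc/self/pageflags file is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit numbers as listed in include/uapi/linux/kernel-page-flags.h. */
#define KPF_COMPOUND_HEAD 15
#define KPF_COMPOUND_TAIL 16
#define KPF_HUGE          17
#define KPF_THP           22

/* Hypothetical helper: test a single bit of a kpageflags-style word. */
static int kpf_test(uint64_t flags, unsigned int bit)
{
	return (int)((flags >> bit) & 1);
}

/* Hypothetical helper: count the entries of flags[] that have KPF_THP set. */
static size_t count_thp_pages(const uint64_t *flags, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (kpf_test(flags[i], KPF_THP))
			count++;
	return count;
}
```

A metrics exporter could call count_thp_pages() on the words covering one
performance-critical range and compare the result against the range length
in pages.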

> b) Your mmap_lock comment. pagemap_read() needs to hold the mmap lock in
> read mode while walking process page tables via walk_page_range().
>

Gah, I'm _very_ sorry for the misinformation. I was (very incorrectly)
under the impression that /proc/self/smaps_rollup required holding the
mmap lock but /proc/self/pageflags didn't. I'll remove the comment
about the mmap lock from the commit message in V2.

> Also, do you have a rough performance comparison?
>

So from my tests with simple processes querying smaps/pageflags I
don't see any performance difference, but I suspect it's due to my
test cases not mapping much memory or regions.

I've CC'd Nathan who works on our network service and has run into
performance issues with smaps. Nathan, do you have a rough performance
comparison? If so please do share.

> >
> > Similar to /proc/kpageflags, the flags printed out by the kernel for
> > each page are provided by stable_page_flags(), which exports flag bits
> > that are user visible and stable over time.
>
> It exports flags (documented for pageflags_read()) that are not
> applicable to processes, like OFFLINE, BUDDY, SLAB, PGTABLE ... and can
> never happen. Some of these kpageflags are not even page->flags, they
> include abstracted types we use for physical memory pages based on other
> struct page members (OFFLINE, BUDDY, MMAP, PGTABLE, ...). This feels wrong.
>
> Also, to me it feels like we are exposing too much internal information
> to the user, essentially making it ABI that user space processes will
> rely on.
>

I'm honestly a bit surprised by this comment because AFAIU (sorry if
wrong) we are already exporting this information via /proc/kpageflags
and therefore it's already somewhat part of an ABI, and the
stable_page_flags() output already needs to be stable and backwards
compatible due to potential root users being affected by any
non-backwards compatible changes. I am, yes, extending access to this
information to non-root users.

> Did you investigate
>
> a) Reducing the flags we expose to a bare minimum necessary for your use
> case (and actually applicable to mmaped pages).
>

To be honest I haven't, but this is something that's certainly doable.
I'm not sure it's easier for processes to understand or the kernel to
maintain. My thinking:
1. Processes parsing /proc/kpageflags can also easily parse
/proc/self/pageflags and re-use code/implementations between them.
2. Userspace code can extract the flags they need and ignore the ones
they don't need or are not applicable.
3. For kernel it's maybe easier to maintain 1 set of
stable_page_flags() and keep that list backwards compatible. To
address your comment I'd need to create a subset,
stable_ps_page_flags(), and both lists now need to be backwards
compatible.

But I hear you, and if you feel strongly about this I'm more than
happy to oblige. Please confirm if this is something you would like to
see in V2.

> b) Extending pagemap output instead.
>

No I have not until you mentioned it, but even now AFAIU (and again
sorry if wrong, please correct) all the bits exposed by pagemap as
documented in pagemap.rst are in use, and it's a non-starter for me to
modify how pagemap works because it'd break backwards compatibility.
But if you see a way I'm happy to oblige :-)

Thanks for your review!

> You seem to be interested in the "hugepage backing", which smells like
> "what is mapped" as in "pagemap".
>
>
> --
> Thanks,
>
> David / dhildenb
>
David Hildenbrand Oct. 29, 2021, 9:37 p.m. UTC | #3
On 29.10.21 22:04, Mina Almasry wrote:
> On Fri, Oct 29, 2021 at 12:11 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 28.10.21 22:58, Mina Almasry wrote:
>>> From: Yu Zhao <yuzhao@google.com>
>>>
>>> This file lets a userspace process know the page flags of each of its virtual
>>> pages.  It contains a 64-bit set of flags for each virtual page, containing
>>> data identical to that emitted by /proc/kpageflags.  This allows the user-space
>>> task to learn the kpageflags for the pages backing its address-space by
>>> consulting one file, without needing to be root.
>>>
>>> Example use case is a performance sensitive user-space process querying the
>>> hugepage backing of its own memory without the root access required to access
>>> /proc/kpageflags, and without accessing /proc/self/smaps_rollup which can be
>>> slow and needs to hold mmap_lock.
>>
>> Can you elaborate on
>>
>> a) The target use case. Are you primarily interested in seeing whether a
>> given base page is head or tail?
>>
> 
> Not quite. Generally some userspace process (most notably our network
> service) has a region of performance critical memory and would like to
> know if this memory is backed by hugepages or not. It uses
> /proc/self/pageflags to inspect the pageflags of the pages backing
> this region, and counts how many ranges are backed by hugepages and
> how many are not. Generally we export this data to metrics, and if the
> hugepage backing drops or is insufficient we look into the issue
> postmortem.

Okay, so it's all about detecting if/where THPs are mapped. I assume
knowing just the number of THPs getting used by that process is not
sufficient for your use case? If you just need numbers, it might be
better to let the kernel do the counting :)

[...]

>> Also, do you have a rough performance comparison?
>>
> 
> So from my tests with simple processes querying smaps/pageflags I
> don't see any performance difference, but I suspect it's due to my
> test cases not mapping much memory or regions.
> 
> I've CC'd Nathan who works on our network service and has run into
> performance issues with smaps. Nathan, do you have a rough performance
> comparison? If so please do share.
> 

That would be great, because we tend to not add interfaces if the
information can already be obtained differently and there is no clear
benefit. Performance comparisons can help.

>>>
>>> Similar to /proc/kpageflags, the flags printed out by the kernel for
>>> each page are provided by stable_page_flags(), which exports flag bits
>>> that are user visible and stable over time.
>>
>> It exports flags (documented for pageflags_read()) that are not
>> applicable to processes, like OFFLINE, BUDDY, SLAB, PGTABLE ... and can
>> never happen. Some of these kpageflags are not even page->flags, they
>> include abstracted types we use for physical memory pages based on other
>> struct page members (OFFLINE, BUDDY, MMAP, PGTABLE, ...). This feels wrong.
>>
>> Also, to me it feels like we are exposing too much internal information
>> to the user, essentially making it ABI that user space processes will
>> rely on.
>>
> 
> I'm honestly a bit surprised by this comment because AFAIU (sorry if
> wrong) we are already exporting this information via /proc/kpageflags
> and therefore it's already somewhat part of an ABI, and the
> stable_page_flags() output already needs to be stable and backwards
> compatible due to potential root users being affected by any
> non-backwards compatible changes. I am, yes, extending access to this
> information to non-root users.

Sure, a (root) application could access these flags via /proc/kpageflags
-- in my thinking usually for debugging purposes, like how I've been
using it a couple of times.

Because for something in process context it's barely usable: once you
have the PFN via the pagemap for a virtual address and you would want to
read the flags of that PFN via kpageflags, the PFN might already have
changed for the virtual address and you'd be reading wrong data. It's racy.
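
The two-step lookup described above can be sketched as the file offsets
involved. Both /proc/pid/pagemap and /proc/kpageflags hold one 64-bit entry
per page, so the first offset is derived from the virtual page number and
the second from the PFN read in the first step; nothing stops the mapping
from changing between the two reads. The helper names are hypothetical, and
the PFN mask assumes the bits 0-54 layout documented in pagemap.rst.

```c
#include <stdint.h>

#define ENTRY_BYTES 8ULL                    /* one 64-bit entry per page */
#define PM_PFN_MASK ((1ULL << 55) - 1)      /* bits 0-54: PFN if present */

/* Step 1: offset of the pagemap entry for a virtual address. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	return (vaddr / page_size) * ENTRY_BYTES;
}

/* Extract the PFN field from a pagemap entry read in step 1. */
static uint64_t pagemap_entry_pfn(uint64_t entry)
{
	return entry & PM_PFN_MASK;
}

/*
 * Step 2: offset of the kpageflags entry for that PFN. By the time this
 * offset is read, the virtual address may be backed by a different PFN.
 */
static uint64_t kpageflags_offset(uint64_t pfn)
{
	return pfn * ENTRY_BYTES;
}
```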

I might be wrong, maybe there are some system services making use of
that information for some kind of optimizations. A quick google search
didn't reveal anything, but maybe I just gave up too early :)

Exposing this information to non-root users would most certainly let
some random libraries use this information for real and depend on it,
for whatever purpose. If that makes sense.

> 
>> Did you investigate
>>
>> a) Reducing the flags we expose to a bare minimum necessary for your use
>> case (and actually applicable to mmaped pages).
>>
> 
> To be honest I haven't, but this is something that's certainly doable.
> I'm not sure it's easier for processes to understand or the kernel to
> maintain. My thinking:
> 1. Processes parsing /proc/kpageflags can also easily parse
> /proc/self/pageflags and re-use code/implementations between them.
> 2. Userspace code can extract the flags they need and ignore the ones
> they don't need or are not applicable.
> 3. For kernel it's maybe easier to maintain 1 set of
> stable_page_flags() and keep that list backwards compatible. To
> address your comment I'd need to create a subset,
> stable_ps_page_flags(), and both lists now need to be backwards
> compatible.

I'd love to hear other opinions, because maybe I'm just being paranoid. :)

> 
> But I hear you, and if you feel strongly about this I'm more than
> happy to oblige. Please confirm if this is something you would like to
> see in V2.
> 
>> b) Extending pagemap output instead.
>>
> 
> No I have not until you mentioned it, but even now AFAIU (and again
> sorry if wrong, please correct) all the bits exposed by pagemap as
> documented in pagemap.rst are in use, and it's a non-starter for me to
> modify how pagemap works because it'd break backwards compatibility.
> But if you see a way I'm happy to oblige :-)
> 

Bits 58-60 are still free, no? Bit 57 was recently added for uffd-wp
purposes I think.

#define PM_SOFT_DIRTY		BIT_ULL(55)
#define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
#define PM_UFFD_WP		BIT_ULL(57)
#define PM_FILE			BIT_ULL(61)
#define PM_SWAP			BIT_ULL(62)
#define PM_PRESENT		BIT_ULL(63)

PM_MMAP_EXCLUSIVE and PM_FILE already go into the direction of "what is
mapped" IMHO. So just a thought if something in there (PM_HUGE? PM_THP?)
... could make sense.
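
To make the free-bit arithmetic concrete, here is a small userspace sketch
that mirrors the PM_* definitions quoted above and computes which of the
flag bits (55-63) remain unassigned. BIT_ULL is redefined locally for
userspace, and PM_FLAG_BITS, PM_USED_MASK, PM_FREE_MASK and pm_is_present()
are hypothetical names for illustration, not kernel identifiers.

```c
#include <stdint.h>

#define BIT_ULL(n)		(1ULL << (n))

/* Copied from the definitions quoted in this thread. */
#define PM_SOFT_DIRTY		BIT_ULL(55)
#define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
#define PM_UFFD_WP		BIT_ULL(57)
#define PM_FILE			BIT_ULL(61)
#define PM_SWAP			BIT_ULL(62)
#define PM_PRESENT		BIT_ULL(63)

/* Hypothetical: all flag bits, i.e. everything above the PFN field. */
#define PM_FLAG_BITS		(~(BIT_ULL(55) - 1))
/* Hypothetical: the flag bits that already have a meaning. */
#define PM_USED_MASK		(PM_SOFT_DIRTY | PM_MMAP_EXCLUSIVE | \
				 PM_UFFD_WP | PM_FILE | PM_SWAP | PM_PRESENT)
/* Hypothetical: what is left over, which works out to bits 58-60. */
#define PM_FREE_MASK		(PM_FLAG_BITS & ~PM_USED_MASK)

/* Hypothetical helper: is the page for this pagemap entry present? */
static int pm_is_present(uint64_t entry)
{
	return (entry & PM_PRESENT) != 0;
}
```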
Mina Almasry Oct. 30, 2021, 10:06 p.m. UTC | #4
On Fri, Oct 29, 2021 at 2:37 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 29.10.21 22:04, Mina Almasry wrote:
> > On Fri, Oct 29, 2021 at 12:11 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 28.10.21 22:58, Mina Almasry wrote:
> >>> From: Yu Zhao <yuzhao@google.com>
> >>>
> >>> This file lets a userspace process know the page flags of each of its virtual
> >>> pages.  It contains a 64-bit set of flags for each virtual page, containing
> >>> data identical to that emitted by /proc/kpageflags.  This allows the user-space
> >>> task to learn the kpageflags for the pages backing its address-space by
> >>> consulting one file, without needing to be root.
> >>>
> >>> Example use case is a performance sensitive user-space process querying the
> >>> hugepage backing of its own memory without the root access required to access
> >>> /proc/kpageflags, and without accessing /proc/self/smaps_rollup which can be
> >>> slow and needs to hold mmap_lock.
> >>
> >> Can you elaborate on
> >>
> >> a) The target use case. Are you primarily interested in seeing whether a
> >> given base page is head or tail?
> >>
> >
> > Not quite. Generally some userspace process (most notably our network
> > service) has a region of performance critical memory and would like to
> > know if this memory is backed by hugepages or not. It uses
> > /proc/self/pageflags to inspect the pageflags of the pages backing
> > this region, and counts how many ranges are backed by hugepages and
> > how many are not. Generally we export this data to metrics, and if the
> > hugepage backing drops or is insufficient we look into the issue
> > postmortem.
>
> Okay, so it's all about detecting if/where THPs are mapped. I assume
> knowing just the number of THPs getting used by that process is not
> sufficient for your use case? If you just need numbers, it might be
> better to let the kernel do the counting :)
>
> [...]
>

Not quite sufficient, no. The process may have lots of non performance
critical memory. The network service cares about specific memory
ranges and wants to know if those are backed by hugepages.


> >> Also, do you have a rough performance comparison?
> >>
> >
> > So from my tests with simple processes querying smaps/pageflags I
> > don't see any performance difference, but I suspect it's due to my
> > test cases not mapping much memory or regions.
> >
> > I've CC'd Nathan who works on our network service and has run into
> > performance issues with smaps. Nathan, do you have a rough performance
> > comparison? If so please do share.
> >
>
> That would be great, because we tend to not add interfaces if the
> information can already be obtained differently and there is no clear
> benefit. Performance comparisons can help.
>
> >>>
> >>> Similar to /proc/kpageflags, the flags printed out by the kernel for
> >>> each page are provided by stable_page_flags(), which exports flag bits
> >>> that are user visible and stable over time.
> >>
> >> It exports flags (documented for pageflags_read()) that are not
> >> applicable to processes, like OFFLINE, BUDDY, SLAB, PGTABLE ... and can
> >> never happen. Some of these kpageflags are not even page->flags, they
> >> include abstracted types we use for physical memory pages based on other
> >> struct page members (OFFLINE, BUDDY, MMAP, PGTABLE, ...). This feels wrong.
> >>
> >> Also, to me it feels like we are exposing too much internal information
> >> to the user, essentially making it ABI that user space processes will
> >> rely on.
> >>
> >
> > I'm honestly a bit surprised by this comment because AFAIU (sorry if
> > wrong) we are already exporting this information via /proc/kpageflags
> > and therefore it's already somewhat part of an ABI, and the
> > stable_page_flags() output already needs to be stable and backwards
> > compatible due to potential root users being affected by any
> > non-backwards compatible changes. I am, yes, extending access to this
> > information to non-root users.
>
> Sure, a (root) application could access these flags via /proc/kpageflags
> -- in my thinking usually for debugging purposes, like how I've been
> using it a couple of times.
>
> Because for something in process context it's barely usable: once you
> have the PFN via the pagemap for a virtual address and you would want to
> read the flags of that PFN via kpageflags, the PFN might already have
> changed for the virtual address and you'd be reading wrong data. It's racy.
>
> I might be wrong, maybe there are some system services making use of
> that information for some kind of optimizations. A quick google search
> didn't reveal anything, but maybe I just gave up too early :)
>
> Exposing this information to non-root users would most certainly let
> some random libraries use this information for real and depend on it,
> for whatever purpose. If that makes sense.
>

Ah, now I understand. Your concerns here make perfect sense. To be
honest I still wonder if stable_page_flags() are exposed to the
userspace 'enough' that they have to remain backwards compatible
anyway, but I can see that not being really true. Adding
/proc/pid/pageflags definitely sets them in stone.

> >
> >> Did you investigate
> >>
> >> a) Reducing the flags we expose to a bare minimum necessary for your use
> >> case (and actually applicable to mmaped pages).
> >>
> >
> > To be honest I haven't, but this is something that's certainly doable.
> > I'm not sure it's easier for processes to understand or the kernel to
> > maintain. My thinking:
> > 1. Processes parsing /proc/kpageflags can also easily parse
> > /proc/self/pageflags and re-use code/implementations between them.
> > 2. Userspace code can extract the flags they need and ignore the ones
> > they don't need or are not applicable.
> > 3. For kernel it's maybe easier to maintain 1 set of
> > stable_page_flags() and keep that list backwards compatible. To
> > address your comment I'd need to create a subset,
> > stable_ps_page_flags(), and both lists now need to be backwards
> > compatible.
>
> I'd love to hear other opinions, because maybe I'm just being paranoid. :)
>
> >
> > But I hear you, and if you feel strongly about this I'm more than
> > happy to oblige. Please confirm if this is something you would like to
> > see in V2.
> >
> >> b) Extending pagemap output instead.
> >>
> >
> > No I have not until you mentioned it, but even now AFAIU (and again
> > sorry if wrong, please correct) all the bits exposed by pagemap as
> > documented in pagemap.rst are in use, and it's a non-starter for me to
> > modify how pagemap works because it'd break backwards compatibility.
> > But if you see a way I'm happy to oblige :-)
> >
>
> Bits 58-60 are still free, no? Bit 57 was recently added for uffd-wp
> purposes I think.
>
> #define PM_SOFT_DIRTY           BIT_ULL(55)
> #define PM_MMAP_EXCLUSIVE       BIT_ULL(56)
> #define PM_UFFD_WP              BIT_ULL(57)
> #define PM_FILE                 BIT_ULL(61)
> #define PM_SWAP                 BIT_ULL(62)
> #define PM_PRESENT              BIT_ULL(63)
>
> PM_MMAP_EXCLUSIVE and PM_FILE already go into the direction of "what is
> mapped" IMHO. So just a thought if something in there (PM_HUGE? PM_THP?)
> ... could make sense.
>

Thanks! I _think_ that would work for us, I'll look into confirming.
To be honest I still wonder if eventually different folks will find
uses for other page flags and eventually we'll run out of pagemaps
bits, but I'll yield to whatever you think is best here.

> --
> Thanks,
>
> David / dhildenb
>
Matthew Wilcox Oct. 31, 2021, 2:28 a.m. UTC | #5
On Sat, Oct 30, 2021 at 03:06:31PM -0700, Mina Almasry wrote:
> Not quite sufficient, no. The process may have lots of non performance
> critical memory. The network service cares about specific memory
> ranges and wants to know if those are backed by hugepages.

Would it make sense to extend mincore() instead?  We have 7 remaining
bits per byte.

But my question is, what information do you really want?  Do you want
to know if the memory range is backed by huge pages, or do you want to
know if PMDs are being used to map the backing memory?

What information would you want to see if, say, 64kB entries are being
used on a 4kB ARM system where there's hardware support for those.
Other architectures also have support for TLB entries that are
intermediate between PTE and PMD sizes.
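
For context, mincore() today fills one byte per page, and only the least
significant bit of each byte carries meaning (page resident or not); the
other seven bits are the ones referred to above. A minimal sketch of the
existing interface on Linux follows, with count_resident() as a hypothetical
helper name. It shows current behaviour only and says nothing about how an
extension would encode mapping size.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the number of resident pages in
 * [addr, addr + len), or -1 on error. addr must be page-aligned.
 */
static long count_resident(void *addr, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t npages, i;
	long resident = 0;
	unsigned char *vec;

	if (page_size <= 0)
		return -1;
	if (len == 0)
		return 0;

	/* mincore() writes one byte per page, rounding len up to a page. */
	npages = (len + (size_t)page_size - 1) / (size_t)page_size;
	vec = malloc(npages);
	if (!vec)
		return -1;

	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}

	/* Only bit 0 is defined today; the remaining bits are ignored here. */
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;

	free(vec);
	return resident;
}
```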
David Hildenbrand Nov. 2, 2021, 11:42 a.m. UTC | #6
>> Bits 58-60 are still free, no? Bit 57 was recently added for uffd-wp
>> purposes I think.
>>
>> #define PM_SOFT_DIRTY           BIT_ULL(55)
>> #define PM_MMAP_EXCLUSIVE       BIT_ULL(56)
>> #define PM_UFFD_WP              BIT_ULL(57)
>> #define PM_FILE                 BIT_ULL(61)
>> #define PM_SWAP                 BIT_ULL(62)
>> #define PM_PRESENT              BIT_ULL(63)
>>
>> PM_MMAP_EXCLUSIVE and PM_FILE already go into the direction of "what is
>> mapped" IMHO. So just a thought if something in there (PM_HUGE? PM_THP?)
>> ... could make sense.
>>
> 
> Thanks! I _think_ that would work for us, I'll look into confirming.
> To be honest I still wonder if eventually different folks will find
> uses for other page flags and eventually we'll run out of pagemaps
> bits, but I'll yield to whatever you think is best here.

Using one of the remaining 3 bits should be fine. In the worst case,
we'll need pagemap_ext at some point that provides more bits per PFN, if
we ever run out of bits.

But as mentioned by Matthew, extending mincore() could also work: not
only indicating if the page is resident, but also in which "form" it is
resident.

We could separate the cases "cont PTE huge page" vs. "PMD huge page".

I recall that the information (THP / !THP) might be valuable for users:
there was a discussion to let user space decide where to place THP.
(IIRC madvise() extension to have something like MADV_COLLAPSE_THP /
MADV_DISSOLVE_THP)
Mina Almasry Nov. 2, 2021, 6:38 p.m. UTC | #7
On Tue, Nov 2, 2021 at 4:42 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> Bits 58-60 are still free, no? Bit 57 was recently added for uffd-wp
> >> purposes I think.
> >>
> >> #define PM_SOFT_DIRTY           BIT_ULL(55)
> >> #define PM_MMAP_EXCLUSIVE       BIT_ULL(56)
> >> #define PM_UFFD_WP              BIT_ULL(57)
> >> #define PM_FILE                 BIT_ULL(61)
> >> #define PM_SWAP                 BIT_ULL(62)
> >> #define PM_PRESENT              BIT_ULL(63)
> >>
> >> PM_MMAP_EXCLUSIVE and PM_FILE already go into the direction of "what is
> >> mapped" IMHO. So just a thought if something in there (PM_HUGE? PM_THP?)
> >> ... could make sense.
> >>
> >
> > Thanks! I _think_ that would work for us, I'll look into confirming.
> > To be honest I still wonder if eventually different folks will find
> > uses for other page flags and eventually we'll run out of pagemaps
> > bits, but I'll yield to whatever you think is best here.
>
> Using one of the remaining 3 bits should be fine. In the worst case,
> we'll need pagemap_ext at some point that provides more bits per PFN, if
> we ever run out of bits.
>

That sounds great to me. Thank you both, Matthew and David, for
patiently explaining the concerns with /proc/self/pageflags to me and
suggesting alternatives that could work :-)

> But as mentioned by Matthew, extending mincore() could also work: not
> only indicating if the page is resident, but also in which "form" it is
> resident.
>

I need to learn more about mincore() to be honest; from casually
reading some docs I didn't get a full understanding of if/why that
would work better. I'll do some investigating and upload V2 with either
/proc/self/pagemap or mincore(), along with why I chose it, and we can
go from there.

> We could separate the cases "cont PTE huge page" vs. "PMD huge page".
>

So to be completely honest (and I need to confirm), we are using this
on x86 and we essentially care that the virt address is mapped by 2MB,
so mapped by PMD. I think (but need to confirm) that's what the
pageflags HUGE bit refers to as well as does PageHuge() and
PageTransHuge(). After confirming I'll upload V2 with the precise info
we need (I think it's going to be "PMD huge page" as David says).
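
As a back-of-the-envelope check of the x86 numbers above: with 4 KiB base
pages and 2 MiB PMD mappings, one PMD covers 512 base pages, and only the
PMD-aligned, PMD-sized chunks of a virtual range can be mapped that way.
The constants and the helper name below are assumptions for x86-64 with
4 KiB pages, written for userspace rather than taken from kernel headers.

```c
#include <stdint.h>

#define BASE_PAGE_SIZE	4096ULL			/* assumed 4 KiB base page */
#define PMD_MAP_SIZE	(2ULL << 20)		/* assumed 2 MiB PMD mapping */
#define PAGES_PER_PMD	(PMD_MAP_SIZE / BASE_PAGE_SIZE)

/*
 * Hypothetical helper: number of whole, PMD-aligned 2 MiB chunks that lie
 * entirely inside [start, end). Only these chunks could be PMD-mapped.
 */
static uint64_t pmd_chunks_in_range(uint64_t start, uint64_t end)
{
	/* Round start up and end down to the nearest 2 MiB boundary. */
	uint64_t first = (start + PMD_MAP_SIZE - 1) & ~(PMD_MAP_SIZE - 1);
	uint64_t last = end & ~(PMD_MAP_SIZE - 1);

	return last > first ? (last - first) / PMD_MAP_SIZE : 0;
}
```

A range that is not 2 MiB aligned at either end therefore always has some
base pages that cannot be covered by a PMD mapping, regardless of what the
kernel decides for the aligned middle part.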

> I recall that the information (THP / !THP) might be valuable for users:
> there was a discussion to let user space decide where to place THP.
> (IIRC madvise() extension to have something like MADV_COLLAPSE_THP /
> MADV_DISSOLVE_THP)
>
> --
> Thanks,
>
> David / dhildenb
>

Patch

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index fdc19fbc10839..79a127f671436 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -8,7 +8,7 @@  pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
 userspace programs to examine the page tables and related information by
 reading files in ``/proc``.

-There are four components to pagemap:
+There are five components to pagemap:

  * ``/proc/pid/pagemap``.  This file lets a userspace process find out which
    physical frame each virtual page is mapped to.  It contains one 64-bit
@@ -82,6 +82,13 @@  number of times a page is mapped.
     25. IDLE
     26. PGTABLE

+ * ``/proc/pid/pageflags``.  This file lets a userspace process know the page
+   flags of each of its virtual pages.  It contains a 64-bit set of flags for
+   each virtual page, containing data identical to the one emitted by
+   /proc/kpageflags listed above.  The user-space task can learn the kpageflags
+   for the pages backing its address-space by consulting one file, without
+   needing to be root.
+
  * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
    memory cgroup each page is charged to, indexed by PFN. Only available when
    CONFIG_MEMCG is set.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 042c418f40906..fab84e5966b3e 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -155,6 +155,7 @@  usually fail with ESRCH.
  wchan		Present with CONFIG_KALLSYMS=y: it shows the kernel function
 		symbol the task is blocked in - or "0" if not blocked.
  pagemap	Page table
+ pageflags	Process's memory page flag information
  stack		Report full stack trace, enable via CONFIG_STACKTRACE
  smaps		An extension based on maps, showing the memory consumption of
 		each mapping and flags associated with it
@@ -619,7 +620,9 @@  Any other value written to /proc/PID/clear_refs will have no effect.

 The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
 using /proc/kpageflags and number of times a page is mapped using
-/proc/kpagecount. For detailed explanation, see
+/proc/kpagecount. /proc/pid/pageflags provides the page flags of a process's
+virtual pages, so a task can learn the kpageflags for its address space with no
+need to be root. For detailed explanation, see
 Documentation/admin-guide/mm/pagemap.rst.

 The /proc/pid/numa_maps is an extension based on maps, showing the memory
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 264509e584e3e..40febcaef6aa6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3219,6 +3219,7 @@  static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+	REG("pageflags",  S_IRUGO, proc_pageflags_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -3562,6 +3563,7 @@  static const struct pid_entry tid_base_stuff[] = {
 	REG("smaps",     S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+	REG("pageflags",  S_IRUGO, proc_pageflags_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",      S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 03415f3fb3a81..177be691a86a7 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -305,6 +305,7 @@  extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_pageflags_operations;

 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ad667dbc96f5c..4e24ff521b5f0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1291,6 +1291,10 @@  struct pagemapread {
 	int pos, len;		/* units: PM_ENTRY_BYTES, not bytes */
 	pagemap_entry_t *buffer;
 	bool show_pfn;
+	/*
+	 * If set, report per-page flags instead of mapping information.
+	 */
+	bool to_flags;
 };

 #define PAGEMAP_WALK_SIZE	(PMD_SIZE)
@@ -1331,7 +1335,8 @@  static int pagemap_pte_hole(unsigned long start, unsigned long end,

 	while (addr < end) {
 		struct vm_area_struct *vma = find_vma(walk->mm, addr);
-		pagemap_entry_t pme = make_pme(0, 0);
+		pagemap_entry_t pme =
+			make_pme(0, pm->to_flags ? stable_page_flags(NULL) : 0);
 		/* End of address space hole, which we mark as non-present. */
 		unsigned long hole_end;

@@ -1350,7 +1355,7 @@  static int pagemap_pte_hole(unsigned long start, unsigned long end,
 			break;

 		/* Addresses in the VMA. */
-		if (vma->vm_flags & VM_SOFTDIRTY)
+		if ((vma->vm_flags & VM_SOFTDIRTY) && !pm->to_flags)
 			pme = make_pme(0, PM_SOFT_DIRTY);
 		for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
 			err = add_to_pagemap(addr, &pme, pm);
@@ -1368,6 +1373,12 @@  static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 	u64 frame = 0, flags = 0;
 	struct page *page = NULL;

+	if (pm->to_flags) {
+		if (pte_present(pte))
+			page = vm_normal_page(vma, addr, pte);
+		return make_pme(0, stable_page_flags(page));
+	}
+
 	if (pte_present(pte)) {
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
@@ -1421,6 +1432,22 @@  static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		if (vma->vm_flags & VM_SOFTDIRTY)
 			flags |= PM_SOFT_DIRTY;

+		if (pm->to_flags) {
+			pagemap_entry_t pme;
+			struct page *page =
+				pmd_present(pmd) ? pmd_page(pmd) : NULL;
+
+			for (; addr != end; addr += PAGE_SIZE) {
+				unsigned long idx = (addr & ~PMD_MASK) >> PAGE_SHIFT;
+
+				/* Index from the head page on every iteration. */
+				pme = make_pme(0, stable_page_flags(page ? page + idx : NULL));
+				add_to_pagemap(addr, &pme, pm);
+			}
+			spin_unlock(ptl);
+			return 0;
+		}
+
 		if (pmd_present(pmd)) {
 			page = pmd_page(pmd);

@@ -1514,6 +1541,20 @@  static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 		flags |= PM_SOFT_DIRTY;

 	pte = huge_ptep_get(ptep);
+
+	if (pm->to_flags) {
+		pagemap_entry_t pme;
+		struct page *page = pte_present(pte) ? pte_page(pte) : NULL;
+
+		for (; addr != end; addr += PAGE_SIZE) {
+			unsigned long idx = (addr & ~hmask) >> PAGE_SHIFT;
+
+			pme = make_pme(0, stable_page_flags(page ? page + idx : NULL));
+			add_to_pagemap(addr, &pme, pm);
+		}
+		goto done;
+	}
+
 	if (pte_present(pte)) {
 		struct page *page = pte_page(pte);

@@ -1539,6 +1580,7 @@  static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 			frame++;
 	}

+done:
 	cond_resched();

 	return err;
@@ -1553,34 +1595,11 @@  static const struct mm_walk_ops pagemap_ops = {
 	.hugetlb_entry	= pagemap_hugetlb_range,
 };

-/*
- * /proc/pid/pagemap - an array mapping virtual pages to pfns
- *
- * For each page in the address space, this file contains one 64-bit entry
- * consisting of the following:
- *
- * Bits 0-54  page frame number (PFN) if present
- * Bits 0-4   swap type if swapped
- * Bits 5-54  swap offset if swapped
- * Bit  55    pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst)
- * Bit  56    page exclusively mapped
- * Bits 57-60 zero
- * Bit  61    page is file-page or shared-anon
- * Bit  62    page swapped
- * Bit  63    page present
- *
- * If the page is not present but in swap, then the PFN contains an
- * encoding of the swap file number and the page's offset into the
- * swap. Unmapped pages return a null PFN. This allows determining
- * precisely which pages are mapped (or in swap) and comparing mapped
- * pages between processes.
- *
- * Efficient users of this interface will use /proc/pid/maps to
- * determine which areas of memory are actually mapped and llseek to
- * skip over unmapped regions.
+/*
+ * If to_flags is set, emit page flags instead of pagemap entries.
  */
-static ssize_t pagemap_read(struct file *file, char __user *buf,
-			    size_t count, loff_t *ppos)
+static ssize_t pagemap_pageflags_read(struct file *file, char __user *buf,
+				      size_t count, loff_t *ppos, bool to_flags)
 {
 	struct mm_struct *mm = file->private_data;
 	struct pagemapread pm;
@@ -1602,6 +1621,8 @@  static ssize_t pagemap_read(struct file *file, char __user *buf,
 	if (!count)
 		goto out_mm;

+	pm.to_flags = to_flags;
+
 	/* do not disclose physical addresses: attack vector */
 	pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);

@@ -1668,6 +1689,78 @@  static ssize_t pagemap_read(struct file *file, char __user *buf,
 	return ret;
 }

+/*
+ * /proc/pid/pagemap - an array mapping virtual pages to pfns
+ *
+ * For each page in the address space, this file contains one 64-bit entry
+ * consisting of the following:
+ *
+ * Bits 0-54  page frame number (PFN) if present
+ * Bits 0-4   swap type if swapped
+ * Bits 5-54  swap offset if swapped
+ * Bit  55    pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst)
+ * Bit  56    page exclusively mapped
+ * Bits 57-60 zero
+ * Bit  61    page is file-page or shared-anon
+ * Bit  62    page swapped
+ * Bit  63    page present
+ *
+ * If the page is not present but in swap, then the PFN contains an
+ * encoding of the swap file number and the page's offset into the
+ * swap. Unmapped pages return a null PFN. This allows determining
+ * precisely which pages are mapped (or in swap) and comparing mapped
+ * pages between processes.
+ *
+ * Efficient users of this interface will use /proc/pid/maps to
+ * determine which areas of memory are actually mapped and llseek to
+ * skip over unmapped regions.
+ */
+static ssize_t pagemap_read(struct file *file, char __user *buf, size_t count,
+			    loff_t *ppos)
+{
+	return pagemap_pageflags_read(file, buf, count, ppos, false);
+}
+
+/*
+ * /proc/pid/pageflags - an array mapping virtual pages to pageflags
+ *
+ * For each page in the address space, this file contains one 64-bit entry
+ * whose set bits are the following flags, numbered as in /proc/kpageflags:
+ *
+ * 0. LOCKED
+ * 1. ERROR
+ * 2. REFERENCED
+ * 3. UPTODATE
+ * 4. DIRTY
+ * 5. LRU
+ * 6. ACTIVE
+ * 7. SLAB
+ * 8. WRITEBACK
+ * 9. RECLAIM
+ * 10. BUDDY
+ * 11. MMAP
+ * 12. ANON
+ * 13. SWAPCACHE
+ * 14. SWAPBACKED
+ * 15. COMPOUND_HEAD
+ * 16. COMPOUND_TAIL
+ * 17. HUGE
+ * 18. UNEVICTABLE
+ * 19. HWPOISON
+ * 20. NOPAGE
+ * 21. KSM
+ * 22. THP
+ * 23. OFFLINE
+ * 24. ZERO_PAGE
+ * 25. IDLE
+ * 26. PGTABLE
+ */
+static ssize_t pageflags_read(struct file *file, char __user *buf, size_t count,
+			      loff_t *ppos)
+{
+	return pagemap_pageflags_read(file, buf, count, ppos, true);
+}
+
 static int pagemap_open(struct inode *inode, struct file *file)
 {
 	struct mm_struct *mm;
@@ -1694,6 +1787,13 @@  const struct file_operations proc_pagemap_operations = {
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+const struct file_operations proc_pageflags_operations = {
+	.llseek		= mem_lseek,
+	.read		= pageflags_read,
+	.open		= pagemap_open,
+	.release	= pagemap_release,
+};
 #endif /* CONFIG_PROC_PAGE_MONITOR */

 #ifdef CONFIG_NUMA
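
For illustration only, a userspace reader of the proposed file might look like
the sketch below. It is not part of the patch. It assumes /proc/self/pageflags
exists (that is, a kernel carrying this patch), that entries are indexed the
same way as /proc/pid/pagemap (one 64-bit entry per virtual page, at byte
offset vpn * 8), and that bit numbers follow the list in the pageflags_read()
comment. The helper names pageflags_offset(), pageflags_test() and
addr_is_thp() are made up for this example.

```c
#define _XOPEN_SOURCE 700	/* for pread() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit positions taken from the pageflags_read() comment in the patch. */
enum {
	KPF_COMPOUND_HEAD = 15,
	KPF_COMPOUND_TAIL = 16,
	KPF_HUGE = 17,
	KPF_NOPAGE = 20,
	KPF_THP = 22,
};

#define PAGEFLAGS_ENTRY_BYTES 8

/* Byte offset of the entry for the page containing vaddr (assumed layout). */
static inline off_t pageflags_offset(uintptr_t vaddr, long page_size)
{
	assert(page_size > 0);
	return (off_t)(vaddr / (uintptr_t)page_size) * PAGEFLAGS_ENTRY_BYTES;
}

/* Returns 1 if the given flag bit is set in a 64-bit entry, else 0. */
static inline int pageflags_test(uint64_t entry, unsigned int bit)
{
	return (int)((entry >> bit) & 1);
}

/*
 * Returns 1 if the page backing addr has the THP bit set, 0 if it does
 * not, and -1 if the file is missing or the read fails (for example on
 * a kernel without this patch).
 */
static int addr_is_thp(const void *addr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	ssize_t n;
	int fd;

	if (page_size <= 0)
		return -1;

	fd = open("/proc/self/pageflags", O_RDONLY);
	if (fd < 0)
		return -1;

	n = pread(fd, &entry, sizeof(entry),
		  pageflags_offset((uintptr_t)addr, page_size));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;

	return pageflags_test(entry, KPF_THP);
}
```

A caller would touch the memory first so that it is populated, then call
addr_is_thp() on an address inside the region and treat -1 as "interface not
available" rather than "not huge".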