
[v6] mm: Add PM_THP_MAPPED to /proc/pid/pagemap

Message ID 20211117194855.398455-1-almasrymina@google.com (mailing list archive)
State New
Series [v6] mm: Add PM_THP_MAPPED to /proc/pid/pagemap

Commit Message

Mina Almasry Nov. 17, 2021, 7:48 p.m. UTC
Add a PM_THP_MAPPED bit to /proc/pid/pagemap to allow userspace to
detect whether a given virt address is currently mapped by a transparent
huge page or not.  An example use case is a process requesting THPs from
the kernel (via a huge tmpfs mount, for example) for a performance
critical region of memory.  Userspace may want to query whether the
kernel is actually backing this memory with hugepages or not.

The PM_THP_MAPPED bit is set if the virt address is mapped at the PMD
level and the underlying page is a transparent huge page.

A few options were considered:
1. Add /proc/pid/pageflags that exports the same info as
   /proc/kpageflags.  This is not appropriate because many kpageflags are
   inappropriate to expose to userspace processes.
2. Simply get this info from the existing /proc/pid/smaps interface.
   There are a couple of issues with that:
   1. /proc/pid/smaps output is human readable and unfriendly to
      parse programmatically.
   2. /proc/pid/smaps is slow.  The cost of reading /proc/pid/smaps into
      userspace buffers is about 800us per call, and this doesn't
      include parsing the output to get the information you need.  The
      cost of querying 1 virt address in /proc/pid/pagemap, however, is
      around 5-7us.
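For illustration, the kind of single-address pagemap query benchmarked
above could look like this (a minimal sketch, not part of the patch; the
helper name is_thp_mapped and its error handling are ours, and
PM_THP_MAPPED is bit 58 as this patch defines it):

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Bit 58 of each 64-bit pagemap entry, as added by this patch. */
#define PM_THP_MAPPED	(1ULL << 58)

/*
 * Illustrative helper: returns 1 if addr is mapped by a PMD-level THP,
 * 0 if not, -1 on error.  Each virtual page has one 8-byte entry in
 * /proc/self/pagemap, indexed by virtual page number.
 */
static int is_thp_mapped(const void *addr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	off_t offset = ((uintptr_t)addr / page_size) * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return -1;
	if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
		close(fd);
		return -1;
	}
	close(fd);
	return (entry & PM_THP_MAPPED) ? 1 : 0;
}
```

A process that mapped a performance critical buffer from a huge tmpfs
mount would call is_thp_mapped(buf) after faulting the memory in.  Note
that on recent kernels unprivileged readers of pagemap see the flag bits
but zeroed PFNs.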

Tested manually by adding logging into transhuge-stress, and by
allocating THP and querying the PM_THP_MAPPED flag at those
virtual addresses.

Signed-off-by: Mina Almasry <almasrymina@google.com>

Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul E. McKenney <paulmckrcu@fb.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Florian Schmidt <florian.schmidt@nutanix.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org


---

Changes in v6:
- Renamed to PM_THP_MAPPED
- Removed changes to transhuge-stress

Changes in v5:
- Added justification for this interface in the commit message!

Changes in v4:
- Removed unnecessary moving of flags variable declaration

Changes in v3:
- Renamed PM_THP to PM_HUGE_THP_MAPPING
- Fixed checks to set PM_HUGE_THP_MAPPING
- Added PM_HUGE_THP_MAPPING docs
---
 Documentation/admin-guide/mm/pagemap.rst | 3 ++-
 fs/proc/task_mmu.c                       | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

--
2.34.0.rc2.393.gf8c9666880-goog

Comments

Peter Xu Nov. 18, 2021, 12:34 a.m. UTC | #1
Hi, Mina,

On Wed, Nov 17, 2021 at 11:48:54AM -0800, Mina Almasry wrote:
> Add PM_THP_MAPPED MAPPING to allow userspace to detect whether a given virt
> address is currently mapped by a transparent huge page or not.  Example
> use case is a process requesting THPs from the kernel (via a huge tmpfs
> mount for example), for a performance critical region of memory.  The
> userspace may want to query whether the kernel is actually backing this
> memory by hugepages or not.
> 
> PM_THP_MAPPED bit is set if the virt address is mapped at the PMD
> level and the underlying page is a transparent huge page.
> 
> A few options were considered:
> 1. Add /proc/pid/pageflags that exports the same info as
>    /proc/kpageflags.  This is not appropriate because many kpageflags are
>    inappropriate to expose to userspace processes.
> 2. Simply get this info from the existing /proc/pid/smaps interface.
>    There are a couple of issues with that:
>    1. /proc/pid/smaps output is human readable and unfriendly to
>       programatically parse.
>    2. /proc/pid/smaps is slow.  The cost of reading /proc/pid/smaps into
>       userspace buffers is about ~800us per call, and this doesn't
>       include parsing the output to get the information you need. The
>       cost of querying 1 virt address in /proc/pid/pagemaps however is
>       around 5-7us.

This does not seem to be fair...  Shouldn't the "800us" value depend on the
size of the memory mapped by the process?  smaps requires walking the whole
memory range and provides all stat info, including THP accounting.

Querying 1 virt address, on the other hand, can only account for 1 single THP
at most.

It means if we want to do all the THP accounting for the whole range from
pagemap we need multiple read()s, right?  A fair comparison should compare the
sum of all the read()s on the virt addrs we care about against a single smaps
call.

And it's hard to be fair too, IMHO, because all of this depends on the size of
the mm.

Smaps is, logically, faster because of two things:

  - Smaps needs only one syscall for whatever memory range (so one
    user->kernel->user switch).

    Compared to your pagemap use case, you'll need to read 1 virt address
    for each PMD, so when the PMD-mapped size is huge, it could turn out that
    smaps is faster even counting the parsing time of the smaps output.

  - Smaps knows how to handle things in PMD level without looping into PTEs:
    see smaps_pmd_entry().

    Smaps will not duplicate the PMD entry into 512 PTE entries, because smaps
    is doing statistics (and that's exactly what your use case wants!), while
    pagemap is defined in 4K page size even for huge mappings because the
    structure is defined upon the offset of the pagemap fd; that's why it needs
    to duplicate the 2M entry into 512 identical ones, even if they're not
    really so meaningful.
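The per-PMD read pattern described in the first bullet could be sketched
like this (illustrative only; count_thp_pmds, the fixed 2MB PMD size,
and the caller-supplied pagemap fd are our assumptions):

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PMD_SIZE	(2UL << 20)	/* assuming 2MB PMD mappings (x86-64) */
#define PM_THP_MAPPED	(1ULL << 58)	/* the bit proposed by this patch */

/*
 * Illustrative sketch: one pread() per PMD over [start, start + len),
 * counting PMD-level THP mappings.  This is the multi-read() cost being
 * compared against a single read of /proc/pid/smaps for the whole mm.
 * Returns the count, or -1 on a failed read.
 */
static long count_thp_pmds(int pagemap_fd, uintptr_t start, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long count = 0;
	uintptr_t addr;

	for (addr = start & ~(PMD_SIZE - 1); addr < start + len;
	     addr += PMD_SIZE) {
		uint64_t entry;
		off_t offset = (addr / page_size) * sizeof(entry);

		if (pread(pagemap_fd, &entry, sizeof(entry), offset) !=
		    sizeof(entry))
			return -1;
		if (entry & PM_THP_MAPPED)
			count++;
	}
	return count;
}
```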

That's also why I tried to propose a smaps interface that allows querying
part of the memory range: IMHO it solves the exact problem you're describing,
and in the most efficient way; meanwhile it exposes all the rest of the smaps
data besides the THP accounting, so it could help with more than THP
accounting.

So again, no objection to this simple and working patch, but perhaps rephrase
the 2nd bullet as: "smaps is slow because it must read the whole memory range
rather than just the small range we care about"?

Thanks,
Mina Almasry Nov. 23, 2021, midnight UTC | #2
On Wed, Nov 17, 2021 at 4:34 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Mina,
>
> On Wed, Nov 17, 2021 at 11:48:54AM -0800, Mina Almasry wrote:
> > Add PM_THP_MAPPED MAPPING to allow userspace to detect whether a given virt
> > address is currently mapped by a transparent huge page or not.  Example
> > use case is a process requesting THPs from the kernel (via a huge tmpfs
> > mount for example), for a performance critical region of memory.  The
> > userspace may want to query whether the kernel is actually backing this
> > memory by hugepages or not.
> >
> > PM_THP_MAPPED bit is set if the virt address is mapped at the PMD
> > level and the underlying page is a transparent huge page.
> >
> > A few options were considered:
> > 1. Add /proc/pid/pageflags that exports the same info as
> >    /proc/kpageflags.  This is not appropriate because many kpageflags are
> >    inappropriate to expose to userspace processes.
> > 2. Simply get this info from the existing /proc/pid/smaps interface.
> >    There are a couple of issues with that:
> >    1. /proc/pid/smaps output is human readable and unfriendly to
> >       programatically parse.
> >    2. /proc/pid/smaps is slow.  The cost of reading /proc/pid/smaps into
> >       userspace buffers is about ~800us per call, and this doesn't
> >       include parsing the output to get the information you need. The
> >       cost of querying 1 virt address in /proc/pid/pagemaps however is
> >       around 5-7us.
>
> This does not seem to be fair...  Should the "800us" value relevant to the
> process memory size being mapped?  As smaps requires walking the whole memory
> range and provides all stat info including THP accountings.
>
> While querying 1 virt address can only account 1 single THP at most.
>
> It means if we want to do all THP accounting for the whole range from pagemap
> we need multiple read()s, right?  The fair comparison should compare the sum of
> all the read()s on the virt addr we care to a single smap call.
>
> And it's hard to be fair too, IMHO, because all these depend on size of mm.
>
> Smaps is, logically, faster because of two things:
>
>   - Smaps needs only one syscall for whatever memory range (so one
>     user->kernel->user switch).
>
>     Comparing to pagemap use case of yours, you'll need to read 1 virt address
>     for each PMD, so when the PMD mapped size is huge, it could turn out that
>     smaps is faster even counting in the parsing time of smaps output.
>
>   - Smaps knows how to handle things in PMD level without looping into PTEs:
>     see smaps_pmd_entry().
>
>     Smaps will not duplicate the PMD entry into 512 PTE entries, because smaps
>     is doing statistic (and that's exaxtly what your use case wants!), while
>     pagemap is defined in 4K page size even for huge mappings because the
>     structure is defined upon the offset of the pagemap fd; that's why it needs
>     to duplicate the 2M entry into 512 same ones; even if they're not really so
>     meaningful.
>
> That's also why I tried to propose the interface of smaps to allow querying
> partial of the memory range, because IMHO it solves the exact problem you're
> describing and it'll also be in the most efficient way, meanwhile I think it
> expose all the rest smaps data too besides THP accountings so it could help
> more than thp accountings.
>
> So again, no objection on this simple and working patch, but perhaps rephrasing
> the 2nd bullet as: "smaps is slow because it must read the whole memory range
> rather than a small range we care"?
>

Sure thing, added in v7.

If I'm coming across as resisting a range-based smaps, I don't mean
to. I think it addresses my use case. I'm just warning/pointing out
that:
1. It'll be a large(r than 2 line) patch and probably an added kernel
interface, and I'm not sure my use case is an acceptable justification
to do that given the problem can be equally addressed very simply like
I'm adding here, but if it is an acceptable justification then I'm
fine with a range-based smaps.
2. I'm not 100% sure what the performance would be like. But I can
prototype it and let you know if I have any issues with the performance.
I just need to know what interface you're envisioning for this.

I'll upload a v7 with the commit message change, and let me know what you think!

> Thanks,

>
> --
> Peter Xu
>
Peter Xu Nov. 23, 2021, 12:29 a.m. UTC | #3
On Mon, Nov 22, 2021 at 04:00:19PM -0800, Mina Almasry wrote:
> On Wed, Nov 17, 2021 at 4:34 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Mina,
> >
> > On Wed, Nov 17, 2021 at 11:48:54AM -0800, Mina Almasry wrote:
> > > Add PM_THP_MAPPED MAPPING to allow userspace to detect whether a given virt
> > > address is currently mapped by a transparent huge page or not.  Example
> > > use case is a process requesting THPs from the kernel (via a huge tmpfs
> > > mount for example), for a performance critical region of memory.  The
> > > userspace may want to query whether the kernel is actually backing this
> > > memory by hugepages or not.
> > >
> > > PM_THP_MAPPED bit is set if the virt address is mapped at the PMD
> > > level and the underlying page is a transparent huge page.
> > >
> > > A few options were considered:
> > > 1. Add /proc/pid/pageflags that exports the same info as
> > >    /proc/kpageflags.  This is not appropriate because many kpageflags are
> > >    inappropriate to expose to userspace processes.
> > > 2. Simply get this info from the existing /proc/pid/smaps interface.
> > >    There are a couple of issues with that:
> > >    1. /proc/pid/smaps output is human readable and unfriendly to
> > >       programatically parse.
> > >    2. /proc/pid/smaps is slow.  The cost of reading /proc/pid/smaps into
> > >       userspace buffers is about ~800us per call, and this doesn't
> > >       include parsing the output to get the information you need. The
> > >       cost of querying 1 virt address in /proc/pid/pagemaps however is
> > >       around 5-7us.
> >
> > This does not seem to be fair...  Should the "800us" value relevant to the
> > process memory size being mapped?  As smaps requires walking the whole memory
> > range and provides all stat info including THP accountings.
> >
> > While querying 1 virt address can only account 1 single THP at most.
> >
> > It means if we want to do all THP accounting for the whole range from pagemap
> > we need multiple read()s, right?  The fair comparison should compare the sum of
> > all the read()s on the virt addr we care to a single smap call.
> >
> > And it's hard to be fair too, IMHO, because all these depend on size of mm.
> >
> > Smaps is, logically, faster because of two things:
> >
> >   - Smaps needs only one syscall for whatever memory range (so one
> >     user->kernel->user switch).
> >
> >     Comparing to pagemap use case of yours, you'll need to read 1 virt address
> >     for each PMD, so when the PMD mapped size is huge, it could turn out that
> >     smaps is faster even counting in the parsing time of smaps output.
> >
> >   - Smaps knows how to handle things in PMD level without looping into PTEs:
> >     see smaps_pmd_entry().
> >
> >     Smaps will not duplicate the PMD entry into 512 PTE entries, because smaps
> >     is doing statistic (and that's exaxtly what your use case wants!), while
> >     pagemap is defined in 4K page size even for huge mappings because the
> >     structure is defined upon the offset of the pagemap fd; that's why it needs
> >     to duplicate the 2M entry into 512 same ones; even if they're not really so
> >     meaningful.
> >
> > That's also why I tried to propose the interface of smaps to allow querying
> > partial of the memory range, because IMHO it solves the exact problem you're
> > describing and it'll also be in the most efficient way, meanwhile I think it
> > expose all the rest smaps data too besides THP accountings so it could help
> > more than thp accountings.
> >
> > So again, no objection on this simple and working patch, but perhaps rephrasing
> > the 2nd bullet as: "smaps is slow because it must read the whole memory range
> > rather than a small range we care"?
> >
> 
> Sure thing, added in v7.
> 
> If I'm coming across as resisting a range-based smaps, I don't mean
> to. I think it addresses my use case. I'm just warning/pointing out
> that:
> 1. It'll be a large(r than 2 line) patch and probably an added kernel
> interface, and I'm not sure my use case is an acceptable justification
> to do that given the problem can be equally addressed very simply like
> I'm adding here, but if it is an acceptable justification then I'm
> fine with a range-based smaps.
> 2. I'm not 100% sure what the performance would be like. But I can
> protype it and let you know if I have any issues with the performance.
> I just need to know what interface you're envisioning for this.

Not sure whether an ioctl upon the smaps procfs file can work.

> 
> I'll upload a v7 with the commit message change, and let me know what you think!

Yeah that's the only thing I'm asking for now, and sorry to be picky.  Thanks.

Patch

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index fdc19fbc10839..8a0f0064ff336 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -23,7 +23,8 @@  There are four components to pagemap:
     * Bit  56    page exclusively mapped (since 4.2)
     * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
       :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
-    * Bits 57-60 zero
+    * Bit  58    page is a huge (PMD size) THP mapping
+    * Bits 59-60 zero
     * Bit  61    page is file-page or shared-anon (since 3.5)
     * Bit  62    page swapped
     * Bit  63    page present
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ad667dbc96f5c..d784a97aa209a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1302,6 +1302,7 @@  struct pagemapread {
 #define PM_SOFT_DIRTY		BIT_ULL(55)
 #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
 #define PM_UFFD_WP		BIT_ULL(57)
+#define PM_THP_MAPPED		BIT_ULL(58)
 #define PM_FILE			BIT_ULL(61)
 #define PM_SWAP			BIT_ULL(62)
 #define PM_PRESENT		BIT_ULL(63)
@@ -1456,6 +1457,8 @@  static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,

 		if (page && page_mapcount(page) == 1)
 			flags |= PM_MMAP_EXCLUSIVE;
+		if (page && is_transparent_hugepage(page))
+			flags |= PM_THP_MAPPED;

 		for (; addr != end; addr += PAGE_SIZE) {
 			pagemap_entry_t pme = make_pme(frame, flags);