mbox series

[RFC,0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER

Message ID 20210807032521.7591-1-peterx@redhat.com (mailing list archive)
Headers show
Series mm: Enable PM_SWAP for shmem with PTE_MARKER | expand

Message

Peter Xu Aug. 7, 2021, 3:25 a.m. UTC
Summary
=======

[Based on v5.14-rc4]

This patchset enables PM_SWAP of pagemap on shmem.  IOW userspace will be able
to detect whether a shmem page is swapped out, just like anonymous pages.

This feature can be enabled with CONFIG_PTE_MARKER_PAGEOUT. When enabled, it
brings 0.8% overhead on swap-in performance on a shmem page, so I didn't make
it the default yet.  However IMHO 0.8% is still in an acceptable range that we
can even make it the default at last.  Comments are welcomed here.

There's one previous series that wanted to address the same issue but in
another way by Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>, here:

https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/

In that series it's done by looking up page cache for all none ptes.  However I
raised concern on 4x performance degradation for all shmem pagemap users.

Unlike the other approach, this series has zero overhead on pagemap read
because the PM_SWAP info is consolidated into the zapped PTEs directly.

Goals
=====

One major goal of this series is to add the PM_SWAP support, the reason is as
stated by Tiberiu and Ivan in the other patchset:

https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/

As a summary: for some reason the userspace needs to scan the pages in the
background, however that scanning could misguide page reclaim on which page is
hot and which is cold.  With correct PM_SWAP information, the userspace can
correct the behavior of page reclaim by firstly fetching that info from
pagemap, and explicit madvise(MADV_PAGEOUT).  In this case, the pages are for
the guest, but it can be any shmem page.

Another major goal of this series is to do a proof-of-concept of the PTE marker
idea, and that's also the major reason why it's RFC.  So far PTE marker can
potentially be the solution for below three problems that I'm aware of:

  (a) PM_SWAP on shmem

  (b) Userfaultfd-wp on shmem/hugetlbfs

  (c) PM_SOFT_DIRTY lost for shmem over swapping

This series tries to resolve problem (a) which should be the simplest, ideally
it should solve immediate problem for the live migration issue raised by
Tiberiu and Ivan on proactive paging out unused guest pages.

Both (a) and (c) will be for performance-wise or statistic-wise.

Scenario (b) will require pte markers as part of the function to trap writes to
uffd-wp protected regions when the pages were e.g. swapped out or zapped for
any reason.

Currently, uffd-wp shmem work (still during review on the list, latest v5, [1])
used another solution called "special swap pte".  It works similarly like PTE
markers as both of the approachs are to persist information into zapped pte,
but people showed concern about that idea and it's suggested to use a safer
(swp-entry level operation, not pte level), and arch-independent approach.

Hopefully PTE markers satifsfy these demands.

Before I rework the uffd-wp series, I wanted to know whether this approach can
be accepted upstream.  So besides the swap part, comments on PTE markers will
be extremely welcomed.

What is PTE Markers?
====================

PTE markers are defined as some special PTEs that works like a "marker" just
like in normal life.  Firstly it uses a swap type, IOW it's not a valid/present
pte, so processor will trigger a page fault when it's accessed.  Meanwhile, the
format of the PTE is well-defined, so as to contain some information that we
would like to know before/during the page access happening.

In this specific case, when the shmem page is paged out, we set a marker
showing that this page was paged out, then when pagemap is read about this pte,
we know this is a swapped-out/very-cold page.

This use case is not an obvious one but the most simplest.  The uffd-wp use
case is more obvious (wr-protect is per-pte, so we can't save into page cache;
meanwhile we need that info to persist across zappings e.g. thp split or page
out of shmem pages).

So in the future, it can contain more information, e.g., whether this pte is
wr-protected by userfaultfd; whether this pte was written in this mm context
for soft-dirtying.  On 64 bit systems, we have a total of 58 bits (swp_offset).

I'm also curious whether it can be further expanded to other mm areas.  E.g.,
logically it can work too for non-RAM based memories outside shmem/hugetlbfs,
e.g. a common file system like ext4 or btrfs?  As long as there will be a need
to store some per-pte information across zapping of the ptes, then maybe it can
be considered.

Known Issues/Concerns
=====================

About THP
---------

Currently we don't need to worry about THP because paged out shmem pages will
be split when shrinking, IOW we only need to consider PTE, and the markers will
only be applied to a shmem pte not pmd or bigger.

About PM_SWAP Accuracy
----------------------

This is not an "accurate" solution to provide PM_SWAP bit.  Two exmaples:

  - When process A & B both map shmem page P somewhere, it can happen that only
    one of these ptes got marked with the pte marker.  Imagine below sequence:

    0. Process A & B both map shmem page P somewhere
    1. Process A zap pte of page P for some reason (e.g. thp split)
    2. System decides to recycle page P
    3. System replace process B's pte (pointed to P) by PTE marker
    4. System _didn't_ replace process A's pte because it was none pte, and
       it'll continue to be none pte
    5. Only process B's relevant pte has the PTE marker after P swapped out

  - When fork, we don't copy shmem vma ptes, including the pte markers.  So
    even if page P was swapped out, only the parent process has the pte marker
    installed, in child it'll be none pte if fork() happened after pageout.

Conclusion: just like it used to be, the PM_SWAP is best-effort.  But it should
work in 99.99% cases and it should already start to solve problems.

About Performance Impact
------------------------

Due to the special PTE marker, page fault logic needs to understand this pte
and there will be some extra logic to handle that.  The overhead is merely
non-observable with 0.82% perf drop.

For more information, please see the test section below where I wrote a test
for it.  When we really care about that small difference, the user can also
disable the shmem PM_SWAP support with !CONFIG_PTE_MARKER_PAGEOUT.

Tests
=====

Test case I used is here:

https://github.com/xzpeter/clibs/blob/master/bsd/pagemap.c

Functional test
---------------

Run with !CONFIG_PTE_MARKER_PAGEOUT, we'll miss the PM_SWAP when paged out (see
swap bit always being zeros):

       FAULT1 (expect swap==0): present bit 1, swap bit 0
      PAGEOUT (expect swap==1): present bit 0, swap bit 0
       FAULT2 (expect swap==0): present bit 1, swap bit 0
       REMOVE (expect swap==0): present bit 0, swap bit 0
      PAGEOUT (expect swap==1): present bit 0, swap bit 0
       REMOVE (expect swap==0): present bit 0, swap bit 0

Run with CONFIG_PTE_MARKER_PAGEOUT, we'll be able to observe correct PM_SWAP:

       FAULT1 (expect swap==0): present bit 1, swap bit 0
      PAGEOUT (expect swap==1): present bit 0, swap bit 1
       FAULT2 (expect swap==0): present bit 1, swap bit 0
       REMOVE (expect swap==0): present bit 0, swap bit 0
      PAGEOUT (expect swap==1): present bit 0, swap bit 1
       REMOVE (expect swap==0): present bit 0, swap bit 0

Performance test
----------------

The performance test is not about pagemap reading, because it should be the
same as before.  Instead there's indeed extra overhead in the fault path, when
the page is swapped in from the disk.  I did some sequential swap-in tests of
1GB range (each for 5 times in a loop) to measure the difference.

Hardware I used:

        Processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
        Memory:    32GB memory, 16GB swap (on a PERC H330 Mini 2TBi disk)
        Test Size: 1GB shmem

I only measured the time to fault-in the pages on the disk, so the measurement
does not include pageout time, one can refer to the .c file.  Results:

   |-----------------------------------+------------------+------------|
   | Config                            | Time used (us)   | Change (%) |
   |-----------------------------------+------------------+------------|
   | !PTE_MARKER                       | 519652 (+-0.73%) |        N/A |
   | PTE_MARKER && !PTE_MARKER_PAGEOUT | 519874 (+-0.40%) |     -0.04% |
   | PTE_MARKER && PTE_MARKER_PAGEOUT  | 523914 (+-0.71%) |     -0.82% |
   |-----------------------------------+------------------+------------|

Any comment would be greatly welcomed.

[1] https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

Peter Xu (4):
  mm: Introduce PTE_MARKER swap entry
  mm: Check against orig_pte for finish_fault()
  mm: Handle PTE_MARKER page faults
  mm: Install marker pte when page out for shmem pages

 fs/proc/task_mmu.c      |  1 +
 include/linux/rmap.h    |  1 +
 include/linux/swap.h    | 14 ++++++++++++-
 include/linux/swapops.h | 45 +++++++++++++++++++++++++++++++++++++++++
 mm/Kconfig              | 17 ++++++++++++++++
 mm/memory.c             | 43 ++++++++++++++++++++++++++++++++++++++-
 mm/rmap.c               | 19 +++++++++++++++++
 mm/vmscan.c             |  2 +-
 8 files changed, 139 insertions(+), 3 deletions(-)

Comments

David Hildenbrand Aug. 17, 2021, 9:04 a.m. UTC | #1
Hi Peter, a couple of comments, sorry for the late reply.

> Summary
> =======
> 
> [Based on v5.14-rc4]
> 
> This patchset enables PM_SWAP of pagemap on shmem.  IOW userspace will be able
> to detect whether a shmem page is swapped out, just like anonymous pages.
> 
> This feature can be enabled with CONFIG_PTE_MARKER_PAGEOUT. When enabled, it
> brings 0.8% overhead on swap-in performance on a shmem page, so I didn't make
> it the default yet.  However IMHO 0.8% is still in an acceptable range that we
> can even make it the default at last.  Comments are welcomed here.

Special config option and added complexity for handling a corner case 
feature partially more correct. Hm.

> 
> There's one previous series that wanted to address the same issue but in
> another way by Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>, here:
> 
> https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
> 
> In that series it's done by looking up page cache for all none ptes.  However I
> raised concern on 4x performance degradation for all shmem pagemap users.

Who cares? I am asking because for me, it's hard to imagine a workload 
that actually cares about a 4x performance degradation when querying the 
pagemap in very special cases, especially if it involves gigantic shmem 
ranges. VM live migration -- sync will be a bit slower? CRIU sync will 
be a bit slower? I mean, actual page dumping is a lot more expensive. 
Really a problem?

I read that CRIU cares about shmem via pagemap [1], at least for 
anonymous shared memory; not sure how memfd is treated, I assume in a 
similar way. But I do wonder how it even works reliably, because it 
relies on present/swapped out and sofrtdirty tracking, which are both 
essentially broken e.g., when swapping out AFAIKT. Looks like this 
really needs a proper fix.

[1] https://criu.org/Shared_memory

> 
> Unlike the other approach, this series has zero overhead on pagemap read
> because the PM_SWAP info is consolidated into the zapped PTEs directly.
> 
> Goals
> =====
> 
> One major goal of this series is to add the PM_SWAP support, the reason is as
> stated by Tiberiu and Ivan in the other patchset:
> 
> https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/
> 
> As a summary: for some reason the userspace needs to scan the pages in the
> background, however that scanning could misguide page reclaim on which page is
> hot and which is cold.  With correct PM_SWAP information, the userspace can
> correct the behavior of page reclaim by firstly fetching that info from
> pagemap, and explicit madvise(MADV_PAGEOUT).  In this case, the pages are for
> the guest, but it can be any shmem page.
> 
> Another major goal of this series is to do a proof-of-concept of the PTE marker
> idea, and that's also the major reason why it's RFC.  So far PTE marker can
> potentially be the solution for below three problems that I'm aware of:
> 
>    (a) PM_SWAP on shmem
> 
>    (b) Userfaultfd-wp on shmem/hugetlbfs
> 
>    (c) PM_SOFT_DIRTY lost for shmem over swapping
> 
> This series tries to resolve problem (a) which should be the simplest, ideally
> it should solve immediate problem for the live migration issue raised by
> Tiberiu and Ivan on proactive paging out unused guest pages.
> 
> Both (a) and (c) will be for performance-wise or statistic-wise.
> 
> Scenario (b) will require pte markers as part of the function to trap writes to
> uffd-wp protected regions when the pages were e.g. swapped out or zapped for
> any reason.
> 
> Currently, uffd-wp shmem work (still during review on the list, latest v5, [1])
> used another solution called "special swap pte".  It works similarly like PTE
> markers as both of the approachs are to persist information into zapped pte,
> but people showed concern about that idea and it's suggested to use a safer
> (swp-entry level operation, not pte level), and arch-independent approach.
> 
> Hopefully PTE markers satifsfy these demands.
> 
> Before I rework the uffd-wp series, I wanted to know whether this approach can
> be accepted upstream.  So besides the swap part, comments on PTE markers will
> be extremely welcomed.

For uffd-wp in its current form, it would certainly be the way to go I 
think. AFAIKT, the idea of special swap entries isn't new, just that 
it's limited to anonymous memory for now, which makes things like fork 
and new mappings a lot cheaper.

> 
> What is PTE Markers?
> ====================
> 
> PTE markers are defined as some special PTEs that works like a "marker" just
> like in normal life.  Firstly it uses a swap type, IOW it's not a valid/present
> pte, so processor will trigger a page fault when it's accessed.  Meanwhile, the
> format of the PTE is well-defined, so as to contain some information that we
> would like to know before/during the page access happening.
> 
> In this specific case, when the shmem page is paged out, we set a marker
> showing that this page was paged out, then when pagemap is read about this pte,
> we know this is a swapped-out/very-cold page.
> 
> This use case is not an obvious one but the most simplest.  The uffd-wp use
> case is more obvious (wr-protect is per-pte, so we can't save into page cache;
> meanwhile we need that info to persist across zappings e.g. thp split or page
> out of shmem pages).
> 
> So in the future, it can contain more information, e.g., whether this pte is
> wr-protected by userfaultfd; whether this pte was written in this mm context
> for soft-dirtying.  On 64 bit systems, we have a total of 58 bits (swp_offset).
> 
> I'm also curious whether it can be further expanded to other mm areas.  E.g.,
> logically it can work too for non-RAM based memories outside shmem/hugetlbfs,
> e.g. a common file system like ext4 or btrfs?  As long as there will be a need
> to store some per-pte information across zapping of the ptes, then maybe it can
> be considered.

As already expressed, we should try storing as little information in 
page tables as possible if we're dealing with shared memory. The 
features we design around this all seem to over-complicate the actual 
users, over-complicate fork, over-complicate handling on new mappings.

For uffd-wp in its current form, there seems to be no way around it, and 
PTE markers seem to be what you want -- but as I already raised, the 
feature itself on shmem is somewhat suboptimal, just like SOFTDIRTY 
tracking on shmem.

But I guess I'm biased at this point because the main users of these 
features actually want to query/set such properties for all sharers, not 
individual processes; so the opinion of others would be appreciated.

> 
> Known Issues/Concerns
> =====================
> 
> About THP
> ---------
> 
> Currently we don't need to worry about THP because paged out shmem pages will
> be split when shrinking, IOW we only need to consider PTE, and the markers will
> only be applied to a shmem pte not pmd or bigger.
> 
> About PM_SWAP Accuracy
> ----------------------
> 
> This is not an "accurate" solution to provide PM_SWAP bit.  Two exmaples:
> 
>    - When process A & B both map shmem page P somewhere, it can happen that only
>      one of these ptes got marked with the pte marker.  Imagine below sequence:
> 
>      0. Process A & B both map shmem page P somewhere
>      1. Process A zap pte of page P for some reason (e.g. thp split)
>      2. System decides to recycle page P
>      3. System replace process B's pte (pointed to P) by PTE marker
>      4. System _didn't_ replace process A's pte because it was none pte, and
>         it'll continue to be none pte
>      5. Only process B's relevant pte has the PTE marker after P swapped out
> 
>    - When fork, we don't copy shmem vma ptes, including the pte markers.  So
>      even if page P was swapped out, only the parent process has the pte marker
>      installed, in child it'll be none pte if fork() happened after pageout.
> 
> Conclusion: just like it used to be, the PM_SWAP is best-effort.  But it should
> work in 99.99% cases and it should already start to solve problems.

At least I don't like these semantics at all. PM_SWAP is a cached value 
which might be under-represented and consequently wrong. Take CRIU as an 
example, it has to be correct even if a process would remap a memory 
region, fork() and unmap in the parent as far as I understand, ...

If we really care about performance for users with the old semantics, 
introduce some runtime toggle that enables the new behavior (even for a 
single process?) and consequently is a bit slower in corner cases. But I 
really do wonder if we care at all about the performance degradation in 
corner cases.

If we really care about performance for users with new semantics, then 
let's do it properly and see how we can actually speed it up without 
per-process page table hacks.

Anyhow, just my 2 cents.
Peter Xu Aug. 17, 2021, 5:09 p.m. UTC | #2
On Tue, Aug 17, 2021 at 11:04:18AM +0200, David Hildenbrand wrote:
> Hi Peter, a couple of comments, sorry for the late reply.
> 
> > Summary
> > =======
> > 
> > [Based on v5.14-rc4]
> > 
> > This patchset enables PM_SWAP of pagemap on shmem.  IOW userspace will be able
> > to detect whether a shmem page is swapped out, just like anonymous pages.
> > 
> > This feature can be enabled with CONFIG_PTE_MARKER_PAGEOUT. When enabled, it
> > brings 0.8% overhead on swap-in performance on a shmem page, so I didn't make
> > it the default yet.  However IMHO 0.8% is still in an acceptable range that we
> > can even make it the default at last.  Comments are welcomed here.
> 
> Special config option and added complexity for handling a corner case
> feature partially more correct. Hm.
> 
> > 
> > There's one previous series that wanted to address the same issue but in
> > another way by Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>, here:
> > 
> > https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
> > 
> > In that series it's done by looking up page cache for all none ptes.  However I
> > raised concern on 4x performance degradation for all shmem pagemap users.
> 
> Who cares? I am asking because for me, it's hard to imagine a workload that
> actually cares about a 4x performance degradation when querying the pagemap
> in very special cases,

I won't say it's a "special case".  it affects any pagemap reading if there's
shmem, even if they're not looking for PM_SWAP it'll suffer from trying to look
up page cache if the pte is none.

Yes it needs to be a relatively large memory to show an effect, but IMHO it's
not a reason good enough that we can make it 4x slower.

> especially if it involves gigantic shmem ranges. VM
> live migration -- sync will be a bit slower? CRIU sync will be a bit slower?
> I mean, actual page dumping is a lot more expensive. Really a problem?
> 
> I read that CRIU cares about shmem via pagemap [1], at least for anonymous
> shared memory; not sure how memfd is treated, I assume in a similar way. But
> I do wonder how it even works reliably, because it relies on present/swapped
> out and sofrtdirty tracking, which are both essentially broken e.g., when
> swapping out AFAIKT. Looks like this really needs a proper fix.
> 
> [1] https://criu.org/Shared_memory

I have not enough knowledge to judge CRIU here, but I can kind of understand
Tiberiu's problem and hopefully I digested it right.  What I'm targetting with
this series is just to see whether it can fix the problem there, assuming that
could be a good summary of what it can also be used elsewhere.

We have the other solution of the page cache lookup, but I think that's slower
than this solution, so when there's the 0.8% overhead possibility I think this
one is much better.  The bad side is it's going to be added into swap-in path,
it's not ideal for fast swap devices, but I have no knowledge on that part too,
as to my understanding most swap devices should be still slow like hard disks
which can take worst case milli-seconds to read in a sector.  I think that
overhead should be mostly not measureable and lost in the white noise.

> 
> > 
> > Unlike the other approach, this series has zero overhead on pagemap read
> > because the PM_SWAP info is consolidated into the zapped PTEs directly.
> > 
> > Goals
> > =====
> > 
> > One major goal of this series is to add the PM_SWAP support, the reason is as
> > stated by Tiberiu and Ivan in the other patchset:
> > 
> > https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/
> > 
> > As a summary: for some reason the userspace needs to scan the pages in the
> > background, however that scanning could misguide page reclaim on which page is
> > hot and which is cold.  With correct PM_SWAP information, the userspace can
> > correct the behavior of page reclaim by firstly fetching that info from
> > pagemap, and explicit madvise(MADV_PAGEOUT).  In this case, the pages are for
> > the guest, but it can be any shmem page.
> > 
> > Another major goal of this series is to do a proof-of-concept of the PTE marker
> > idea, and that's also the major reason why it's RFC.  So far PTE marker can
> > potentially be the solution for below three problems that I'm aware of:
> > 
> >    (a) PM_SWAP on shmem
> > 
> >    (b) Userfaultfd-wp on shmem/hugetlbfs
> > 
> >    (c) PM_SOFT_DIRTY lost for shmem over swapping
> > 
> > This series tries to resolve problem (a) which should be the simplest, ideally
> > it should solve immediate problem for the live migration issue raised by
> > Tiberiu and Ivan on proactive paging out unused guest pages.
> > 
> > Both (a) and (c) will be for performance-wise or statistic-wise.
> > 
> > Scenario (b) will require pte markers as part of the function to trap writes to
> > uffd-wp protected regions when the pages were e.g. swapped out or zapped for
> > any reason.
> > 
> > Currently, uffd-wp shmem work (still during review on the list, latest v5, [1])
> > used another solution called "special swap pte".  It works similarly like PTE
> > markers as both of the approachs are to persist information into zapped pte,
> > but people showed concern about that idea and it's suggested to use a safer
> > (swp-entry level operation, not pte level), and arch-independent approach.
> > 
> > Hopefully PTE markers satifsfy these demands.
> > 
> > Before I rework the uffd-wp series, I wanted to know whether this approach can
> > be accepted upstream.  So besides the swap part, comments on PTE markers will
> > be extremely welcomed.
> 
> For uffd-wp in its current form, it would certainly be the way to go I
> think. AFAIKT, the idea of special swap entries isn't new, just that it's
> limited to anonymous memory for now, which makes things like fork and new
> mappings a lot cheaper.

Thanks for reviewing this series separately; yes I definitely wanted to get
comments on both sides: one on the pte marker idea, the other is whether it's
applicable to this swap+shmem use case.

I don't plan to handle both fork and new mappings for swapping.

For fork(), as we've discussed, and as I've also mentioned below in Known
Issues, that we don't copy these markers in fork().  So children may lose these
hints from pte, but I think that's fine if that's hardly used.

Here I really want to make the pte marker be flexible - it can be strict when
necessary (it will be 100% strict with uffd-wp), then it can also be a hint
just like what we have with available ptes on soft-dirty, idle, accessed bits.
Here the swap bit I wanted to make it that kind, so we add zero overhead to
fork() and we still solve problems.

Same thing to "newly mapped shmem".  Do we have a use case for that?  If that's
a hint bit, can we ignore it?

> 
> > 
> > What is PTE Markers?
> > ====================
> > 
> > PTE markers are defined as some special PTEs that works like a "marker" just
> > like in normal life.  Firstly it uses a swap type, IOW it's not a valid/present
> > pte, so processor will trigger a page fault when it's accessed.  Meanwhile, the
> > format of the PTE is well-defined, so as to contain some information that we
> > would like to know before/during the page access happening.
> > 
> > In this specific case, when the shmem page is paged out, we set a marker
> > showing that this page was paged out, then when pagemap is read about this pte,
> > we know this is a swapped-out/very-cold page.
> > 
> > This use case is not an obvious one but the most simplest.  The uffd-wp use
> > case is more obvious (wr-protect is per-pte, so we can't save into page cache;
> > meanwhile we need that info to persist across zappings e.g. thp split or page
> > out of shmem pages).
> > 
> > So in the future, it can contain more information, e.g., whether this pte is
> > wr-protected by userfaultfd; whether this pte was written in this mm context
> > for soft-dirtying.  On 64 bit systems, we have a total of 58 bits (swp_offset).
> > 
> > I'm also curious whether it can be further expanded to other mm areas.  E.g.,
> > logically it can work too for non-RAM based memories outside shmem/hugetlbfs,
> > e.g. a common file system like ext4 or btrfs?  As long as there will be a need
> > to store some per-pte information across zapping of the ptes, then maybe it can
> > be considered.
> 
> As already expressed, we should try storing as little information in page
> tables as possible if we're dealing with shared memory. The features we
> design around this all seem to over-complicate the actual users,
> over-complicate fork, over-complicate handling on new mappings.

I'll skip the last two "over-complicated" items, because as we've discussed I
don't think we need to take care of them so far.  We can revisit when they
become some kind of requirement.

To me having PM_SWAP 99% right on shmem is still a progress comparing to
completely missing it, even if it's not 100% right.  It's used for performance
reasons on PAGEOUT and doing finer-grained memory control from userspace, it's
not a strict requirement.

So I still cannot strictly follow why storing information in pte is so bad for
file-backed, which I can see you really don't like it.  Could you share some
detailed example?

> 
> For uffd-wp in its current form, there seems to be no way around it, and PTE
> markers seem to be what you want -- but as I already raised, the feature
> itself on shmem is somewhat suboptimal, just like SOFTDIRTY tracking on
> shmem.

Emm I don't know whether we should call it sub-optimal. To me it's still a
clean and self-contained way to do things.  I don't know how to do that from
page cache layer, but I bet there'll be issues coming when we go that way, then
we may figure out "oh this way is 90% beautiful, but that way is 85%".  To me
it's sometimes not easy to figure out the ideal solution for all the problems.

> 
> But I guess I'm biased at this point because the main users of these
> features actually want to query/set such properties for all sharers, not
> individual processes; so the opinion of others would be appreciated.
> 
> > 
> > Known Issues/Concerns
> > =====================
> > 
> > About THP
> > ---------
> > 
> > Currently we don't need to worry about THP because paged out shmem pages will
> > be split when shrinking, IOW we only need to consider PTE, and the markers will
> > only be applied to a shmem pte not pmd or bigger.
> > 
> > About PM_SWAP Accuracy
> > ----------------------
> > 
> > This is not an "accurate" solution to provide PM_SWAP bit.  Two exmaples:
> > 
> >    - When process A & B both map shmem page P somewhere, it can happen that only
> >      one of these ptes got marked with the pte marker.  Imagine below sequence:
> > 
> >      0. Process A & B both map shmem page P somewhere
> >      1. Process A zap pte of page P for some reason (e.g. thp split)
> >      2. System decides to recycle page P
> >      3. System replace process B's pte (pointed to P) by PTE marker
> >      4. System _didn't_ replace process A's pte because it was none pte, and
> >         it'll continue to be none pte
> >      5. Only process B's relevant pte has the PTE marker after P swapped out
> > 
> >    - When fork, we don't copy shmem vma ptes, including the pte markers.  So
> >      even if page P was swapped out, only the parent process has the pte marker
> >      installed, in child it'll be none pte if fork() happened after pageout.
> > 
> > Conclusion: just like it used to be, the PM_SWAP is best-effort.  But it should
> > work in 99.99% cases and it should already start to solve problems.
> 
> At least I don't like these semantics at all. PM_SWAP is a cached value
> which might be under-represented and consequently wrong.

Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
make it more accurate, but I think it's fine, because it's a hint.

> Take CRIU as an example, it has to be correct even if a process would remap a
> memory region, fork() and unmap in the parent as far as I understand, ...

Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
bit needs to be accurate.  Maybe you mean the dirty bit?

> 
> If we really care about performance for users with the old semantics,
> introduce some runtime toggle that enables the new behavior (even for a
> single process?) and consequently is a bit slower in corner cases. But I
> really do wonder if we care at all about the performance degradation in
> corner cases.

Yes that's okay.  If we want to go with Tiberiu's approach, we can use an
option to avoid regress users.  However I still prefer the current approach,
and I even want to make it a default option if it can prove itself with some
more real swap workloads somewhere (maybe kernel bot has some?  I didn't check
yet).

> 
> If we really care about performance for users with new semantics, then let's
> do it properly and see how we can actually speed it up without per-process
> page table hacks.
> 
> Anyhow, just my 2 cents.

Thanks for the discussion!
David Hildenbrand Aug. 17, 2021, 6:46 p.m. UTC | #3
>>
>> For uffd-wp in its current form, it would certainly be the way to go I
>> think. AFAIKT, the idea of special swap entries isn't new, just that it's
>> limited to anonymous memory for now, which makes things like fork and new
>> mappings a lot cheaper.
> 
> Thanks for reviewing this series separately; yes I definitely wanted to get
> comments on both sides: one on the pte marker idea, the other is whether it's
> applicable to this swap+shmem use case.

Exactly.

> 
> Here I really want to make the pte marker be flexible - it can be strict when
> necessary (it will be 100% strict with uffd-wp), then it can also be a hint
> just like what we have with available ptes on soft-dirty, idle, accessed bits.
> Here the swap bit I wanted to make it that kind, so we add zero overhead to
> fork() and we still solve problems.
> 
> Same thing to "newly mapped shmem".  Do we have a use case for that?  If that's
> a hint bit, can we ignore it?

I am really not a fan of taking an already broken feature (broken 
presence information for shmem in pagemap) and instead of fixing it 
properly, turning it less broken, crossing fingers that nobody will 
notice the remaining (undocumented) cases.

[...]

>> As already expressed, we should try storing as little information in page
>> tables as possible if we're dealing with shared memory. The features we
>> design around this all seem to over-complicate the actual users,
>> over-complicate fork, over-complicate handling on new mappings.
> 
> I'll skip the last two "over-complicated" items, because as we've discussed I
> don't think we need to take care of them so far.  We can revisit when they
> become some kind of requirement.
> 
> To me having PM_SWAP 99% right on shmem is still a progress comparing to
> completely missing it, even if it's not 100% right.  It's used for performance
> reasons on PAGEOUT and doing finer-grained memory control from userspace, it's
> not a strict requirement.
> 
> So I still cannot strictly follow why storing information in pte is so bad for
> file-backed, which I can see you really don't like it.  Could you share some
> detailed example?

I am not a fan of your approach of "hints", because while it might work 
for some use cases, it might not work for others (see below for CRIU); I 
would rather like to avoid such "inconsistent caches" where it's not 
really required. But again, this is just my opinion and there might be 
plenty other people that will most probably disagree.

Storing hints in page tables isn't actually that bad, because we can 
drop hints whenever we like (well, there are side effects, and once we 
drop hints too often people might complain again) -- for example, when 
reclaiming "empty" but actually "filled with hints" page tables. When we 
rely on consistent values, fork() and mmap() are a problem. Well, and 
page tables will have to stick around. At least for uffd-wp, mmap() is 
not an issue, and we don't expect too many uffd-wp users such that page 
table consumption would matter -- my guess.

So I repeat, for uffd-wp in its current form, it sounds like the right 
thing to do.

>> But I guess I'm biased at this point because the main users of these
>> features actually want to query/set such properties for all sharers, not
>> individual processes; so the opinion of others would be appreciated.
>>
>>>
>>> Known Issues/Concerns
>>> =====================
>>>
>>> About THP
>>> ---------
>>>
>>> Currently we don't need to worry about THP because paged out shmem pages will
>>> be split when shrinking, IOW we only need to consider PTE, and the markers will
>>> only be applied to a shmem pte not pmd or bigger.
>>>
>>> About PM_SWAP Accuracy
>>> ----------------------
>>>
>>> This is not an "accurate" solution to provide PM_SWAP bit.  Two exmaples:
>>>
>>>     - When process A & B both map shmem page P somewhere, it can happen that only
>>>       one of these ptes got marked with the pte marker.  Imagine below sequence:
>>>
>>>       0. Process A & B both map shmem page P somewhere
>>>       1. Process A zap pte of page P for some reason (e.g. thp split)
>>>       2. System decides to recycle page P
>>>       3. System replace process B's pte (pointed to P) by PTE marker
>>>       4. System _didn't_ replace process A's pte because it was none pte, and
>>>          it'll continue to be none pte
>>>       5. Only process B's relevant pte has the PTE marker after P swapped out
>>>
>>>     - When fork, we don't copy shmem vma ptes, including the pte markers.  So
>>>       even if page P was swapped out, only the parent process has the pte marker
>>>       installed, in child it'll be none pte if fork() happened after pageout.
>>>
>>> Conclusion: just like it used to be, the PM_SWAP is best-effort.  But it should
>>> work in 99.99% cases and it should already start to solve problems.
>>
>> At least I don't like these semantics at all. PM_SWAP is a cached value
>> which might be under-represented and consequently wrong.
> 
> Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
> accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
> make it more accurate, but I think it's fine, because it's a hint.

That inconsistency doesn't really matter as you can determine if 
something is present and worth dumping if it's either swapped or 
present. As long as it's one of both but not simply nothing.

I will shamelessly reference 
tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that 
checks exactly for that (the test case uses only private anonymous memory).

> 
>> Take CRIU as an example, it has to be correct even if a process would remap a
>> memory region, fork() and unmap in the parent as far as I understand, ...
> 
> Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
> bit needs to be accurate.  Maybe you mean the dirty bit?

https://criu.org/Shared_memory

"Dumping present pages"

"... CRIU does not dump all of the data. Instead, it determines which 
pages contain it, and only dumps those pages. This is done similarly to 
how regular memory dumping and restoring works, i.e. by looking for 
PRESENT or SWAPPED bits in owners' pagemap entries."

-> Neither PRESENT nor SWAPPED results in memory not getting dumped, 
which makes perfect sense.

1) Process A sets up shared memory and writes data to it.
2) System swaps out memory, hints are setup.
3) Process A forks Process B, hints are not copied.
4) Process A unmaps shared memory, hints are dropped.
5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in 
pagemap.
6) Process B uses memory in shared memory region. Pages were not migrated.

Just one example; feel free to correct me.


There is notion of the mincore() systemcall:

"There is one particular feature of shared memory dumps worth 
mentioning. Sometimes, a shared memory page can exist in the kernel, but 
it is not mapped to any process. CRIU detects such pages by calling 
mincore() on the shmem segment, which reports back the page in-memory 
status. The mincore bitmap is when ANDed with the per-process ones. "

Not sure if they actually mean ORed, because otherwise they'd be losing 
pages that have been swapped out. "mincore() returns a vector that 
indicates whether pages of the calling process's virtual memory are 
resident in core (RAM)"
Peter Xu Aug. 17, 2021, 8:24 p.m. UTC | #4
On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote:
> > Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
> > accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
> > node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
> > make it more accurate, but I think it's fine, because it's a hint.
> 
> That inconsistency doesn't really matter as you can determine if something
> is present and worth dumping if it's either swapped or present. As long as
> it's one of both but not simply nothing.
> 
> I will shamelessly reference
> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that
> checks exactly for that (the test case uses only private anonymous memory).

Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on
PM_SWAP for that when it goes beyond anonymous private memories - when shmem
swapped out the pte can be none, then the test case can fail even if it
shouldn't, imho.

The mincore() syscall seems to be ideally the thing you may want to make it
accurate, but again it's not a problem for current anonymous private memories.

> 
> > 
> > > Take CRIU as an example, it has to be correct even if a process would remap a
> > > memory region, fork() and unmap in the parent as far as I understand, ...
> > 
> > Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
> > bit needs to be accurate.  Maybe you mean the dirty bit?
> 
> https://criu.org/Shared_memory
> 
> "Dumping present pages"
> 
> "... CRIU does not dump all of the data. Instead, it determines which pages
> contain it, and only dumps those pages. This is done similarly to how
> regular memory dumping and restoring works, i.e. by looking for PRESENT or
> SWAPPED bits in owners' pagemap entries."
> 
> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which
> makes perfect sense.
> 
> 1) Process A sets up shared memory and writes data to it.
> 2) System swaps out memory, hints are setup.
> 3) Process A forks Process B, hints are not copied.
> 4) Process A unmaps shared memory, hints are dropped.
> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
> pagemap.
> 6) Process B uses memory in shared memory region. Pages were not migrated.
> 
> Just one example; feel free to correct me.

I think pte marker won't crash criu, what will happen is that it'll see more
ptes that used to be none that become the pte markers.  This reminded me that
maybe I should teach up mincore() syscall to also be aware of the pte marker at
least, and all non_swap_entry() callers.

> 
> 
> There is notion of the mincore() systemcall:
> 
> "There is one particular feature of shared memory dumps worth mentioning.
> Sometimes, a shared memory page can exist in the kernel, but it is not
> mapped to any process. CRIU detects such pages by calling mincore() on the
> shmem segment, which reports back the page in-memory status. The mincore
> bitmap is when ANDed with the per-process ones. "
> 
> Not sure if they actually mean ORed, because otherwise they'd be losing
> pages that have been swapped out. "mincore() returns a vector that indicates
> whether pages of the calling process's virtual memory are resident in core
> (RAM)"

I am wildly guessing they ORed the two just because PM_SWAP is not working
properly for shmem, so the OR happens only for shmem.  Criu may not only rely
on mincore() because they also want the dirty bits.

Btw, I noticed in 2016 criu switched from mincore() to lseek():

https://github.com/checkpoint-restore/criu/commit/1821acedd04b602b37b587eac5a481094b6274ae

Criu should want to know "whether this page has valid data" not "whether this
page has swapped out", so lseek() seems to be more suitable, which I'm not
aware of before.

I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
should just still be a bit slow because it'll look up the cache too, but it
should work similarly like the original proposal.

Thanks,
David Hildenbrand Aug. 18, 2021, 8:24 a.m. UTC | #5
On 17.08.21 22:24, Peter Xu wrote:
> On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote:
>>> Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
>>> accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
>>> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
>>> make it more accurate, but I think it's fine, because it's a hint.
>>
>> That inconsistency doesn't really matter as you can determine if something
>> is present and worth dumping if it's either swapped or present. As long as
>> it's one of both but not simply nothing.
>>
>> I will shamelessly reference
>> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that
>> checks exactly for that (the test case uses only private anonymous memory).
> 
> Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on
> PM_SWAP for that when it goes beyond anonymous private memories - when shmem
> swapped out the pte can be none, then the test case can fail even if it
> shouldn't, imho.

Exactly, because the pagemap is fairly completely broken for shmem.

> 
> The mincore() syscall seems to be ideally the thing you may want to make it
> accurate, but again it's not a problem for current anonymous private memories.

I haven't checked the details, but I believe the mincore() syscall won't 
report swapped out pages. At least according to its documentation:

"mincore()  returns a vector that indicates whether pages of the calling 
process's virtual memory are resident in core (RAM), and so will not 
cause a disk access  (page  fault)  if  referenced."

(to protect it from swapping and relying on mincore() we would have to 
mlock that memory; we'd want MCL_ONFAULT to be able to test 
MADV_POPULATE_READ|WRITE; or we'd just want to rely on lseek)

> 
>>
>>>
>>>> Take CRIU as an example, it has to be correct even if a process would remap a
>>>> memory region, fork() and unmap in the parent as far as I understand, ...
>>>
>>> Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
>>> bit needs to be accurate.  Maybe you mean the dirty bit?
>>
>> https://criu.org/Shared_memory
>>
>> "Dumping present pages"
>>
>> "... CRIU does not dump all of the data. Instead, it determines which pages
>> contain it, and only dumps those pages. This is done similarly to how
>> regular memory dumping and restoring works, i.e. by looking for PRESENT or
>> SWAPPED bits in owners' pagemap entries."
>>
>> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which
>> makes perfect sense.
>>
>> 1) Process A sets up shared memory and writes data to it.
>> 2) System swaps out memory, hints are setup.
>> 3) Process A forks Process B, hints are not copied.
>> 4) Process A unmaps shared memory, hints are dropped.
>> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
>> pagemap.
>> 6) Process B uses memory in shared memory region. Pages were not migrated.
>>
>> Just one example; feel free to correct me.
> 
> I think pte marker won't crash criu, what will happen is that it'll see more
> ptes that used to be none that become the pte markers.  This reminded me that
> maybe I should teach up mincore() syscall to also be aware of the pte marker at
> least, and all non_swap_entry() callers.
> 

I haven't checked what mincore() is doing, but from what I understand 
when reading the CRIU doc and the mincore() doc, it does the right thing 
without requiring any fiddling with pte marker hints. I assume you 
merely have a performance improvement in mind.

>>
>>
>> There is notion of the mincore() systemcall:
>>
>> "There is one particular feature of shared memory dumps worth mentioning.
>> Sometimes, a shared memory page can exist in the kernel, but it is not
>> mapped to any process. CRIU detects such pages by calling mincore() on the
>> shmem segment, which reports back the page in-memory status. The mincore
>> bitmap is when ANDed with the per-process ones. "
>>
>> Not sure if they actually mean ORed, because otherwise they'd be losing
>> pages that have been swapped out. "mincore() returns a vector that indicates
>> whether pages of the calling process's virtual memory are resident in core
>> (RAM)"
> 
> I am wildly guessing they ORed the two just because PM_SWAP is not working
> properly for shmem, so the OR happens only for shmem.  Criu may not only rely
> on mincore() because they also want the dirty bits.
> 
> Btw, I noticed in 2016 criu switched from mincore() to lseek():
> 
> https://github.com/checkpoint-restore/criu/commit/1821acedd04b602b37b587eac5a481094b6274ae

Interesting. That's certainly what we want when it comes to skipping 
holes in files. (before reading that, I wasn't even aware that mincore() 
existed)

> 
> Criu should want to know "whether this page has valid data" not "whether this
> page has swapped out", so lseek() seems to be more suitable, which I'm not
> aware of before.

Again, just as you, I learned a lot :)

> 
> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
> should just still be a bit slow because it'll look up the cache too, but it
> should work similarly like the original proposal.
> 

Very right, maybe we can just avoid tampering with pagemap on shmem 
completely (which sounds like an excellent idea to me) and document it 
as "On shared memory, we will never indicate SWAPPED if the pages have 
been swapped out. Further, PRESENT might be under-indicated: if a shared 
page is currently not mapped into the page table of a process.". I saw 
there was a related, proposed doc update, maybe we can finetune that.
Tiberiu A Georgescu Aug. 18, 2021, 5:52 p.m. UTC | #6
> On 18 Aug 2021, at 09:24, David Hildenbrand <david@redhat.com> wrote:
> 
> On 17.08.21 22:24, Peter Xu wrote:
>> On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote:
>>>> Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
>>>> accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
>>>> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
>>>> make it more accurate, but I think it's fine, because it's a hint.
>>> 
>>> That inconsistency doesn't really matter as you can determine if something
>>> is present and worth dumping if it's either swapped or present. As long as
>>> it's one of both but not simply nothing.
>>> 
>>> I will shamelessly reference
>>> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that
>>> checks exactly for that (the test case uses only private anonymous memory).
>> Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on
>> PM_SWAP for that when it goes beyond anonymous private memories - when shmem
>> swapped out the pte can be none, then the test case can fail even if it
>> shouldn't, imho.
> 
> Exactly, because the pagemap is fairly completely broken for shmem.
> 
>> The mincore() syscall seems to be ideally the thing you may want to make it
>> accurate, but again it's not a problem for current anonymous private memories.
> 
> I haven't checked the details, but I believe the mincore() syscall won't report swapped out pages. At least according to its documentation:
> 
> "mincore()  returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access  (page  fault)  if  referenced."
> 
> (to protect it from swapping and relying on mincore() we would have to mlock that memory; we'd want MCL_ONFAULT to be able to test MADV_POPULATE_READ|WRITE; or we'd just want to rely on lseek)

After some digging and testing, I found out that the docs for mincore() are a little outdated, and that
the RFC has an unexpected side effect on the sys call.

The output vector is supposed to set the flag to 1 if the page it indicates was present in either the
page cache or the swap cache. I would like to highlight that if a page got swapped out and reached
swap (i.e. page content no longer stored in the swap cache), the flag gets set to 0.

So indeed, mincore does not report swapped out pages, but AFAIK, it does reports pages which are
still in the swap cache.

There was an attempt to rework mincore altogether and make it retrieve mappings instead [1], but
was quickly reverted [2] because the removed functionality was necessary for some long functioning systems.

On Peter's RFC, it now looks like mincore()'s flags are set to 1 for any shared page that has been
dirtied, whether it is still present, in swap cache or it arrived in swap. AFAIK, only none pages have
their flags set to zero. For private pages, mincore still seems to behave normally.

[1] https://github.com/torvalds/linux/commit/574823bfab82d9d8fa47f422778043fbb4b4f50e
[2] https://github.com/torvalds/linux/commit/30bac164aca750892b93eef350439a0562a68647

> 
>>> 
>>>> 
>>>>> Take CRIU as an example, it has to be correct even if a process would remap a
>>>>> memory region, fork() and unmap in the parent as far as I understand, ...
>>>> 
>>>> Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
>>>> bit needs to be accurate.  Maybe you mean the dirty bit?
>>> 
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__criu.org_Shared-5Fmemory&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=rRM5dtWOv0DNo5dDxZ2U16jl4WAw6ql5szOKa9cu_RA&m=A5H_4nfdW_jAPHckF-cuCBfRHsm2aij-cr-mELX0uww&s=DZgiYWJgLokyZkBYd5VKOnr5Fbj63aAc01Fu2BPE8Lc&e= 
>>> "Dumping present pages"
>>> 
>>> "... CRIU does not dump all of the data. Instead, it determines which pages
>>> contain it, and only dumps those pages. This is done similarly to how
>>> regular memory dumping and restoring works, i.e. by looking for PRESENT or
>>> SWAPPED bits in owners' pagemap entries."
>>> 
>>> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which
>>> makes perfect sense.
>>> 
>>> 1) Process A sets up shared memory and writes data to it.
>>> 2) System swaps out memory, hints are setup.
>>> 3) Process A forks Process B, hints are not copied.
>>> 4) Process A unmaps shared memory, hints are dropped.
>>> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
>>> pagemap.
>>> 6) Process B uses memory in shared memory region. Pages were not migrated.
>>> 
>>> Just one example; feel free to correct me.
>> I think pte marker won't crash criu, what will happen is that it'll see more
>> ptes that used to be none that become the pte markers.  This reminded me that
>> maybe I should teach up mincore() syscall to also be aware of the pte marker at
>> least, and all non_swap_entry() callers.

I think in mincore_pte_range, the call to non_swap_entry(entry) could return true, setting the flag on
the vector to 1 prematurely. Please read my comment above.
> 
> I haven't checked what mincore() is doing, but from what I understand when reading the CRIU doc and the mincore() doc, it does the right thing without requiring any fiddling with pte marker hints. I assume you merely have a performance improvement in mind.
> 
>>> 
>>> 
>>> There is notion of the mincore() systemcall:
>>> 
>>> "There is one particular feature of shared memory dumps worth mentioning.
>>> Sometimes, a shared memory page can exist in the kernel, but it is not
>>> mapped to any process. CRIU detects such pages by calling mincore() on the
>>> shmem segment, which reports back the page in-memory status. The mincore
>>> bitmap is when ANDed with the per-process ones. "
>>> 
>>> Not sure if they actually mean ORed, because otherwise they'd be losing
>>> pages that have been swapped out. "mincore() returns a vector that indicates
>>> whether pages of the calling process's virtual memory are resident in core
>>> (RAM)"
>> I am wildly guessing they ORed the two just because PM_SWAP is not working
>> properly for shmem, so the OR happens only for shmem.  Criu may not only rely
>> on mincore() because they also want the dirty bits.
>> Btw, I noticed in 2016 criu switched from mincore() to lseek():
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_checkpoint-2Drestore_criu_commit_1821acedd04b602b37b587eac5a481094b6274ae&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=rRM5dtWOv0DNo5dDxZ2U16jl4WAw6ql5szOKa9cu_RA&m=A5H_4nfdW_jAPHckF-cuCBfRHsm2aij-cr-mELX0uww&s=kel85AR7AGZnvBymbM7QEwc770HGO2koka-kTiF-r5U&e= 
> 
> Interesting. That's certainly what we want when it comes to skipping holes in files. (before reading that, I wasn't even aware that mincore() existed)
> 
>> Criu should want to know "whether this page has valid data" not "whether this
>> page has swapped out", so lseek() seems to be more suitable, which I'm not
>> aware of before.
> 
> Again, just as you, I learned a lot :)

Same :)
> 
>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>> should just still be a bit slow because it'll look up the cache too, but it
>> should work similarly like the original proposal.

I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.
> 
> Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that.
> 
We could take into consideration an alternative approach to retrieving the shared page info in user
space, like storing it in sys/fs instead of per process. However, just leaving the pagemap functionality
incomplete, and not providing an alternative to retrieve the missing information, does not seem right. Updating the docs with a "can't do" should be temporary, until an alternative or fix.

Also, I think you are talking about my own doc update patch[3]. If not, please share a link with your
next reply.

[3] https://marc.info/?m=162878395426774
> 
> -- 
> Thanks,
> 
> David / dhildenb
--
Kind regards,

Tibi Georgescu
David Hildenbrand Aug. 18, 2021, 6:13 p.m. UTC | #7
>>
>>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>>> should just still be a bit slow because it'll look up the cache too, but it
>>> should work similarly like the original proposal.
> 
> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.

Well, with mincore() you could at least decide "page is present" vs. 
"page is swapped or not existent". At least for making pageout decisions 
it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop.

But I'm not 100% sure what exactly your use case is here and what you 
would really need, so you know best :)

>>
>> Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that.
>>
> We could take into consideration an alternative approach to retrieving the shared page info in user
> space, like storing it in sys/fs instead of per process. However, just leaving the pagemap functionality
> incomplete, and not providing an alternative to retrieve the missing information, does not seem right. Updating the docs with a "can't do" should be temporary, until an alternative or fix.
> 

As I stated before, making pagemap less broken is not a good idea IMHO. 
Either make it really correct or just leave it all broken -- and 
document that e.g., other interfaces (lseek) shall be used. It sounds 
like they exist and are good enough for CRUI.

And TBH, if other interfaces already exist and get the job done, I'm 
more than happy that we can avoid mixing more shmem stuff into pagemap 
and trying to compensate performance problems by introducing inconsistency.

If it has an fd and we can punch that into syscalls, we should much 
rather use that fd to lookup stuff then going via process page tables -- 
if possible of course (to be evaluated, because I haven't looked into 
the CRIU details and how they use lseek with anonymous shared memory).

> Also, I think you are talking about my own doc update patch[3]. If not, please share a link with your
> next reply.
> 
> [3] https://marc.info/?m=162878395426774

No, that's it.
Tiberiu A Georgescu Aug. 19, 2021, 2:54 p.m. UTC | #8
> On 18 Aug 2021, at 19:13, David Hildenbrand <david@redhat.com> wrote:
> 
>>> 
>>>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>>>> should just still be a bit slow because it'll look up the cache too, but it
>>>> should work similarly like the original proposal.
>> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
>> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.
> 
> Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop.

I think you are right. In the optimisation we first presented, we should be able to
send the madvise(MADV_PAGEOUT) call even if the page is none quite safely
and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none"
question can be answered by the current pagemap too. Nice catch.

However, not all use cases are the same. AFAIK, there is still no way to figure
out whether a shared page is swapped out or none unless it is directly
read/accessed after a pagemap check. Bringing a page into memory to check
if it previously was in swap does not seem ideal.

Also, we still have no mechanism to retrieve the swap offsets of shmem pages
AFAIK. There is one more QEMU optimisation we are working on that requires
these mappings available outside of kernel space.
> 
> But I'm not 100% sure what exactly your use case is here and what you would really need, so you know best :)

Maybe, but I am always open to (m)advise :)
> 
>>> 
>>> Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that.
>>> 
>> We could take into consideration an alternative approach to retrieving the shared page info in user
>> space, like storing it in sys/fs instead of per process. However, just leaving the pagemap functionality
>> incomplete, and not providing an alternative to retrieve the missing information, does not seem right. Updating the docs with a "can't do" should be temporary, until an alternative or fix.
> 
> As I stated before, making pagemap less broken is not a good idea IMHO. Either make it really correct or just leave it all broken -- and document that e.g., other interfaces (lseek) shall be used. It sounds like they exist and are good enough for CRUI.
> 
> And TBH, if other interfaces already exist and get the job done, I'm more than happy that we can avoid mixing more shmem stuff into pagemap and trying to compensate performance problems by introducing inconsistency.
> 
> If it has an fd and we can punch that into syscalls, we should much rather use that fd to lookup stuff then going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory).

I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations
using proc/pid/map_files, which is neat. Still, CRIU does not seem to care
whether a page is swapped out or just empty, only if it is present on page cache.
The holes that lseek finds would not be able to infer this difference, AFAIK. Will
test the behaviour to make sure.
> 
>> Also, I think you are talking about my own doc update patch[3]. If not, please share a link with your
>> next reply.
>> [3] https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fm-3D162878395426774&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=rRM5dtWOv0DNo5dDxZ2U16jl4WAw6ql5szOKa9cu_RA&m=T9yjhM3vhL_Ip2wxg2x-BZclbbY3WAO5Oc-y7IqNs7Y&s=HujDmGVIi1iXQ22oWF_GE-sPxvQ2ORTcCWEfnpqq35o&e= 
> 
> No, that's it.
> 
Great, I'll head right there.

--
Kind regards,

Tibi
David Hildenbrand Aug. 19, 2021, 5:26 p.m. UTC | #9
On 19.08.21 16:54, Tiberiu Georgescu wrote:
> 
>> On 18 Aug 2021, at 19:13, David Hildenbrand <david@redhat.com> wrote:
>>
>>>>
>>>>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>>>>> should just still be a bit slow because it'll look up the cache too, but it
>>>>> should work similarly like the original proposal.
>>> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
>>> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.
>>
>> Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop.
> 
> I think you are right. In the optimisation we first presented, we should be able to
> send the madvise(MADV_PAGEOUT) call even if the page is none quite safely
> and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none"
> question can be answered by the current pagemap too. Nice catch.
> 
> However, not all use cases are the same. AFAIK, there is still no way to figure
> out whether a shared page is swapped out or none unless it is directly
> read/accessed after a pagemap check. Bringing a page into memory to check
> if it previously was in swap does not seem ideal.

Well, you can lseek() to remove all the holes and use mincore() to 
remove all in-memory pages. You're left with the swapped ones. Not the 
most efficient interface maybe, but there is a way :)

> 
> Also, we still have no mechanism to retrieve the swap offsets of shmem pages
> AFAIK. There is one more QEMU optimisation we are working on that requires
> these mappings available outside of kernel space.

How exactly would the swap offset really help? IMHO that's a kernel 
internal that shouldn't be of any value to user space -- it's merely for 
debugging purposes. But I'd love to learn details.

[...]

>> If it has an fd and we can punch that into syscalls, we should much rather use that fd to lookup stuff then going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory).
> 
> I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations
> using proc/pid/map_files, which is neat. Still, CRIU does not seem to care
> whether a page is swapped out or just empty, only if it is present on page cache.
> The holes that lseek finds would not be able to infer this difference, AFAIK. Will
> test the behaviour to make sure.

CRIU wants to migrate everything. lseek() gives you the definitive 
answer what needs migration -- if it's swapped out or resident. Just 
skip the holes.
Tiberiu A Georgescu Aug. 20, 2021, 4:49 p.m. UTC | #10
> On 19 Aug 2021, at 18:26, David Hildenbrand <david@redhat.com> wrote:
> 
> On 19.08.21 16:54, Tiberiu Georgescu wrote:
>>> On 18 Aug 2021, at 19:13, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>>>> 
>>>>>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>>>>>> should just still be a bit slow because it'll look up the cache too, but it
>>>>>> should work similarly like the original proposal.
>>>> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
>>>> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.
>>> 
>>> Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop.
>> I think you are right. In the optimisation we first presented, we should be able to
>> send the madvise(MADV_PAGEOUT) call even if the page is none quite safely
>> and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none"
>> question can be answered by the current pagemap too. Nice catch.
>> However, not all use cases are the same. AFAIK, there is still no way to figure
>> out whether a shared page is swapped out or none unless it is directly
>> read/accessed after a pagemap check. Bringing a page into memory to check
>> if it previously was in swap does not seem ideal.
> 
> Well, you can lseek() to remove all the holes and use mincore() to remove all in-memory pages. You're left with the swapped ones. Not the most efficient interface maybe, but there is a way :)

Ok, that could work. Still, I have a couple of concerns.

Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from
swap, so using it would be a direct factor on its own output. If people are working
on Live Migration, this would not be ideal. I am not 100% sure this is how lseek
works, so please feel free to contradict me, but I think it would swap in some
of the pages that it seeks through, if not all, to figure out when to stop. Unless it
leverages the page cache somehow, or an internal bitmap.

Secondly, mincore() could return some "false positives" for this particular use
case. That is because it returns flag=1 for pages which are still in the swap
cache, so the output becomes ambiguous.

I am not saying this is not something that would ever be needed. Some people
could actually be looking for exactly this scenario, and lseeking during the check
could be an advantage. Just that it does not look very flexible. That is why the
pagemap would have been ideal for us.

Alternatively, to get all logically swapped out pages, the lseek with pagemap
should do the trick. As you said, we remove holes with lseek, but we remove
in-memory pages with is_present(pte) instead. This solution would still suffer from 
my first concern, but it should do the job.

> 
>> Also, we still have no mechanism to retrieve the swap offsets of shmem pages
>> AFAIK. There is one more QEMU optimisation we are working on that requires
>> these mappings available outside of kernel space.
> 
> How exactly would the swap offset really help? IMHO that's a kernel internal that shouldn't be of any value to user space -- it's merely for debugging purposes. But I'd love to learn details.

It is possible for the swap device to be network attached and shared, so multiple
hosts would need to understand its content. Then it is no longer internal to one
kernel only.

By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that
is zero-copy.
> 
> [...]
> 
>>> If it has an fd and we can punch that into syscalls, we should much rather use that fd to lookup stuff then going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory).
>> I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations
>> using proc/pid/map_files, which is neat. Still, CRIU does not seem to care
>> whether a page is swapped out or just empty, only if it is present on page cache.
>> The holes that lseek finds would not be able to infer this difference, AFAIK. Will
>> test the behaviour to make sure.
> 
> CRIU wants to migrate everything. lseek() gives you the definitive answer what needs migration -- if it's swapped out or resident. Just skip the holes.

Thank you for the summary. I see why the use case is sufficient for CRIU then.
In our case, the optimisations aim to make the migration on QEMU swap aware.

--
Kind regards,
Tibi Georgescu
Peter Xu Aug. 20, 2021, 7:12 p.m. UTC | #11
Hello, Tiberiu,

On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote:
> Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from
> swap, so using it would be a direct factor on its own output. If people are working
> on Live Migration, this would not be ideal. I am not 100% sure this is how lseek
> works, so please feel free to contradict me, but I think it would swap in some
> of the pages that it seeks through, if not all, to figure out when to stop. Unless it
> leverages the page cache somehow, or an internal bitmap.

It shouldn't.  Man page is clear on that:

       SEEK_DATA
              Adjust the file offset to the next location in the file greater
              than or equal to offset containing data.  If offset points to
              data, then the file offset is set to offset.

Again, I think your requirement is different from CRIU, so I think mincore() is
the right thing for you.

> 
> Secondly, mincore() could return some "false positives" for this particular use
> case. That is because it returns flag=1 for pages which are still in the swap
> cache, so the output becomes ambiguous.

I don't think so; mincore() should return flag=0 if it's either in swap cache
or even got dropped from it.  I think its name/doc also shows that in the fact
that "as long as it's not in RAM, the flag is cleared".  That's why I think
that should indeed be what you're looking for, if swp entry can be ignored.
More below on that.

Note that my series is as you mentioned missing the changes to support
mincore() (otherwise I'll know the existance of it!).  It'll be trivial to add
that, but let's see whether mincore() will satisfy your need.

[...]

> It is possible for the swap device to be network attached and shared, so multiple
> hosts would need to understand its content. Then it is no longer internal to one
> kernel only.
> 
> By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that
> is zero-copy.

That sounds reasonable, but I'm not aware of any user-API that exposes swap
entries to userspace, or is there one?

I.e., how do you know which swap device is which?  How do you guarantee the
kernel swp entry information won't change along with time?

Thanks,
Tiberiu A Georgescu Aug. 25, 2021, 1:40 p.m. UTC | #12
Hello Peter, sorry for the late reply,

> On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote:
>> Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from
>> swap, so using it would be a direct factor on its own output. If people are working
>> on Live Migration, this would not be ideal. I am not 100% sure this is how lseek
>> works, so please feel free to contradict me, but I think it would swap in some
>> of the pages that it seeks through, if not all, to figure out when to stop. Unless it
>> leverages the page cache somehow, or an internal bitmap.
> 
> It shouldn't.  Man page is clear on that:
> 
>       SEEK_DATA
>              Adjust the file offset to the next location in the file greater
>              than or equal to offset containing data.  If offset points to
>              data, then the file offset is set to offset.

Ok, I got to test it out and you are right. lseek does not swap in pages. That is
great news.
> 
> Again, I think your requirement is different from CRIU, so I think mincore() is
> the right thing for you.
> 
>> 
>> Secondly, mincore() could return some "false positives" for this particular use
>> case. That is because it returns flag=1 for pages which are still in the swap
>> cache, so the output becomes ambiguous.
> 
> I don't think so; mincore() should return flag=0 if it's either in swap cache
> or even got dropped from it.  I think its name/doc also shows that in the fact
> that "as long as it's not in RAM, the flag is cleared".  That's why I think
> that should indeed be what you're looking for, if swp entry can be ignored.
> More below on that.

By saying there are "false positives", I do not mean that the mincore() would
not work as expected, only that its definition is a little more subtle than that. And
that it does not suit our needs entirely by itself.

I tested mincore() compared to the pagemap, and I discovered that there are
more flags set to 1 (not necessarily contiguous) compared to the pages pagemap 
was reporting as present. By also looking through the code, I could only conclude
that pages in the swap cache were considered "still in RAM", so were set to 1 as
well. When looking into what the swap cache does, it makes sense.

We could use mincore() and pagemap to find the pages in the swap cache.

In short, mincore() is not enough because it does not differentiate between
present pages and swap-cache entries, as both are in RAM, but the latter
is also in swap. It can be used with other tools to get more specific information
though, so it is useful.
> 
> Note that my series is as you mentioned missing the changes to support
> mincore() (otherwise I'll know the existance of it!).  It'll be trivial to add
> that, but let's see whether mincore() will satisfy your need.

We are currently trying to make use of all tools that we have learned of so far
during our discussions (lseek, map_files, even mincore) to get the information
that we need about swap pages. In theory, for many of our use cases, a
combination of 2 or 3 should be enough.

It is a little more convoluted than a simple pagemap call, but it can be more
versatile (using lseek to skip multiple unallocated pages). As to whether the swap
bit (and more) should be eventually added on the pagemap, maybe this topic
makes more sense to continue on the Documentation thread.
> 
> [...]
> 
>> It is possible for the swap device to be network attached and shared, so multiple
>> hosts would need to understand its content. Then it is no longer internal to one
>> kernel only.
>> 
>> By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that
>> is zero-copy.
> 
> That sounds reasonable, but I'm not aware of any user-API that exposes swap
> entries to userspace, or is there one?

Good question. AFAIK, the swap device can be retrieved by using the swap type,
which is part of the swap entry. During our discussions, I was always assuming
that, if the pagemap entry kept track of the swap offset, it would keep track of the
swap type and, conversely, the swap device as well. Sorry if I haven't made this
assumption clear until now.

So we were relying on the pagemap to expose swap entry information. Seeing it
works for private pages, we thought it made sense to have worked on shared pages as well.
> 
> I.e., how do you know which swap device is which?  How do you guarantee the
> kernel swp entry information won't change along with time?

I don't think we can guarantee it unless we halt the guest. But QEMU does most
migration work in pre-copy using a best-effort approach anyway.

So, having a way to retrieve temporary, but accurate information about swap
entries (i.e. post-patch pagemap) should be enough to guarantee a smoother
migration process. It is intended to be repeated, unless there is no change
between iterations.

--
Kind regards,

Tibi
Peter Xu Aug. 25, 2021, 2:59 p.m. UTC | #13
On Wed, Aug 25, 2021 at 01:40:13PM +0000, Tiberiu Georgescu wrote:
> Hello Peter, sorry for the late reply,

Hi, Tiberiu,

No worries on that.

> 
> > On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote:
> >> Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from
> >> swap, so using it would be a direct factor on its own output. If people are working
> >> on Live Migration, this would not be ideal. I am not 100% sure this is how lseek
> >> works, so please feel free to contradict me, but I think it would swap in some
> >> of the pages that it seeks through, if not all, to figure out when to stop. Unless it
> >> leverages the page cache somehow, or an internal bitmap.
> > 
> > It shouldn't.  Man page is clear on that:
> > 
> >       SEEK_DATA
> >              Adjust the file offset to the next location in the file greater
> >              than or equal to offset containing data.  If offset points to
> >              data, then the file offset is set to offset.
> 
> Ok, I got to test it out and you are right. lseek does not swap in pages. That is
> great news.
> > 
> > Again, I think your requirement is different from CRIU, so I think mincore() is
> > the right thing for you.
> > 
> >> 
> >> Secondly, mincore() could return some "false positives" for this particular use
> >> case. That is because it returns flag=1 for pages which are still in the swap
> >> cache, so the output becomes ambiguous.
> > 
> > I don't think so; mincore() should return flag=0 if it's either in swap cache
> > or even got dropped from it.  I think its name/doc also shows that in the fact
> > that "as long as it's not in RAM, the flag is cleared".  That's why I think
> > that should indeed be what you're looking for, if swp entry can be ignored.
> > More below on that.
> 
> By saying there are "false positives", I do not mean that the mincore() would
> not work as expected, only that its definition is a little more subtle than that. And
> that it does not suit our needs entirely by itself.
> 
> I tested mincore() compared to the pagemap, and I discovered that there are
> more flags set to 1 (not necessarily contiguous) compared to the pages pagemap 
> was reporting as present. By also looking through the code, I could only conclude
> that pages in the swap cache were considered "still in RAM", so were set to 1 as
> well. When looking into what the swap cache does, it makes sense.

Please see mincore_page():

	/*
	 * When tmpfs swaps out a page from a file, any process mapping that
	 * file will not get a swp_entry_t in its pte, but rather it is like
	 * any other file mapping (ie. marked !present and faulted in with
	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
	 */
	page = find_get_incore_page(mapping, index);
	if (page) {
		present = PageUptodate(page);
		put_page(page);
	}

I think the "testing" means when swapped out, the page will be NULL. If it's
just the pte got zapped, the page will be returned.  The call stack should look
like:

  find_get_incore_page -> find_get_page -> pagecache_get_page(fgp_flags==0).

I think the fgp_flags guaranteed it, with no FGP_ENTRY.

Did you test mincore() without my patch (as my current patch will indeed cause
more 1's returned than it should)?  My guess is there's something else that
made your test read more 1's with mincore() than pagemap, but I have no solid
idea on that.

> 
> We could use mincore() and pagemap to find the pages in the swap cache.
> 
> In short, mincore() is not enough because it does not differentiate between
> present pages and swap-cache entries, as both are in RAM, but the latter
> is also in swap. It can be used with other tools to get more specific information
> though, so it is useful.
> > 
> > Note that my series is as you mentioned missing the changes to support
> > mincore() (otherwise I'll know the existance of it!).  It'll be trivial to add
> > that, but let's see whether mincore() will satisfy your need.
> 
> We are currently trying to make use of all tools that we have learned of so far
> during our discussions (lseek, map_files, even mincore) to get the information
> that we need about swap pages. In theory, for many of our use cases, a
> combination of 2 or 3 should be enough.
> 
> It is a little more convoluted than a simple pagemap call, but it can be more
> versatile (using lseek to skip multiple unallocated pages). As to whether the swap
> bit (and more) should be eventually added on the pagemap, maybe this topic
> makes more sense to continue on the Documentation thread.
> > 
> > [...]
> > 
> >> It is possible for the swap device to be network attached and shared, so multiple
> >> hosts would need to understand its content. Then it is no longer internal to one
> >> kernel only.
> >> 
> >> By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that
> >> is zero-copy.
> > 
> > That sounds reasonable, but I'm not aware of any user-API that exposes swap
> > entries to userspace, or is there one?
> 
> Good question. AFAIK, the swap device can be retrieved by using the swap type,
> which is part of the swap entry. During our discussions, I was always assuming
> that, if the pagemap entry kept track of the swap offset, it would keep track of the
> swap type and, conversely, the swap device as well. Sorry if I haven't made this
> assumption clear until now.
> 
> So we were relying on the pagemap to expose swap entry information. Seeing it
> works for private pages, we thought it made sense to have worked on shared pages as well.
> > 
> > I.e., how do you know which swap device is which?  How do you guarantee the
> > kernel swp entry information won't change along with time?
> 
> I don't think we can guarantee it unless we halt the guest.

Yes, halting the guest helps, though then performance start to matter because
all time consumed in either pagemap or mincore() will be counted in as downtime
of the VM live migration, and it's not "live" at all during this period.

I'm not sure how it was done with private mappings before, because I thought
that's a pre-requisite knowledge to decide whether we should migrate a page or
not, but I might have missed something.  We can stop vm, sample, start vm, but
it could become hiccups in the guest too, or otherwise contribute to downtime
when src/dst vm switches.

> But QEMU does most
> migration work in pre-copy using a best-effort approach anyway.
> 
> So, having a way to retrieve temporary, but accurate information about swap
> entries (i.e. post-patch pagemap) should be enough to guarantee a smoother
> migration process. It is intended to be repeated, unless there is no change
> between iterations.

The kernel will allocate swap device index which will be assigned as swp_type,
right?  If there're multiple swap devices, how do you know which swp_entry is
located on which device?  I wanted to look for that info in "swapon -s" but I
didn't.  Or maybe that solution only works if there's only one swap device?

Besides, I also still have a question on the accuracy. If there's no formal way
for userspace to interact with the kernel, I'm wondering how to guarantee the
page will be kept swapped out, and data located on the swap device will always
be the latest?  Because IMHO the kernel can swap in pages as wish, even if it's
not accessed from the userspace.  After all, all these operations should be
transparent to userspace.

One example in my mind is we do have page fault-around enabled for shmem, so
that even if the VM is stopped, its pages can be faulted in if (unluckily, in
this case, though) some page near the swapped out page got faulted in - it
could be some qemu malloc()ed region that (again, unluckily) was allocated to
have a virtual address very close to the mmap()ed guest memories.

I am not sure whether it's a real problem, e.g., even if some page swapped in
during guest halted for some reason, if no one is writting to that page, at
least the data on the page will still be identical to the one located on the
swap device.  However I think that still sounds too tricky, and maybe fragile.

Thanks,
Tiberiu A Georgescu Aug. 26, 2021, 3:01 p.m. UTC | #14
On 25 Aug 2021, at 15:59, Peter Xu <peterx@redhat.com<mailto:peterx@redhat.com>> wrote:

On Wed, Aug 25, 2021 at 01:40:13PM +0000, Tiberiu Georgescu wrote:
Hello Peter, sorry for the late reply,

Hi, Tiberiu,

No worries on that.


On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote:
Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from
swap, so using it would be a direct factor on its own output. If people are working
on Live Migration, this would not be ideal. I am not 100% sure this is how lseek
works, so please feel free to contradict me, but I think it would swap in some
of the pages that it seeks through, if not all, to figure out when to stop. Unless it
leverages the page cache somehow, or an internal bitmap.

It shouldn't.  Man page is clear on that:

     SEEK_DATA
            Adjust the file offset to the next location in the file greater
            than or equal to offset containing data.  If offset points to
            data, then the file offset is set to offset.

Ok, I got to test it out and you are right. lseek does not swap in pages. That is
great news.

Again, I think your requirement is different from CRIU, so I think mincore() is
the right thing for you.


Secondly, mincore() could return some "false positives" for this particular use
case. That is because it returns flag=1 for pages which are still in the swap
cache, so the output becomes ambiguous.

I don't think so; mincore() should return flag=0 if it's either in swap cache
or even got dropped from it.  I think its name/doc also shows that in the fact
that "as long as it's not in RAM, the flag is cleared".  That's why I think
that should indeed be what you're looking for, if swp entry can be ignored.
More below on that.

By saying there are "false positives", I do not mean that the mincore() would
not work as expected, only that its definition is a little more subtle than that. And
that it does not suit our needs entirely by itself.

I tested mincore() compared to the pagemap, and I discovered that there are
more flags set to 1 (not necessarily contiguous) compared to the pages pagemap
was reporting as present. By also looking through the code, I could only conclude
that pages in the swap cache were considered "still in RAM", so were set to 1 as
well. When looking into what the swap cache does, it makes sense.

Please see mincore_page():

/*
 * When tmpfs swaps out a page from a file, any process mapping that
 * file will not get a swp_entry_t in its pte, but rather it is like
 * any other file mapping (ie. marked !present and faulted in with
 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
 */
page = find_get_incore_page(mapping, index);
if (page) {
present = PageUptodate(page);
put_page(page);
}

I think the "testing" means when swapped out, the page will be NULL. If it's
just the pte got zapped, the page will be returned.  The call stack should look
like:

 find_get_incore_page -> find_get_page -> pagecache_get_page(fgp_flags==0).

I think the fgp_flags guaranteed it, with no FGP_ENTRY.

Did you test mincore() without my patch (as my current patch will indeed cause
more 1's returned than it should)?  My guess is there's something else that
made your test read more 1's with mincore() than pagemap, but I have no solid
idea on that.

I made sure to avoid any of our patches while testing mincore and counted the
flags to make sure. There are more set than they are present. I will look into the
code again and test on a non-rc kernel version, just to be safe.

I still think it makes sense for mincore to consider pages in the swap cache
to be "in RAM". I start seeing how useful it can be to differentiate between
present pages and in-swap-cache pages.


We could use mincore() and pagemap to find the pages in the swap cache.

In short, mincore() is not enough because it does not differentiate between
present pages and swap-cache entries, as both are in RAM, but the latter
is also in swap. It can be used with other tools to get more specific information
though, so it is useful.

Note that my series is as you mentioned missing the changes to support
mincore() (otherwise I'll know the existance of it!).  It'll be trivial to add
that, but let's see whether mincore() will satisfy your need.

We are currently trying to make use of all tools that we have learned of so far
during our discussions (lseek, map_files, even mincore) to get the information
that we need about swap pages. In theory, for many of our use cases, a
combination of 2 or 3 should be enough.

It is a little more convoluted than a simple pagemap call, but it can be more
versatile (using lseek to skip multiple unallocated pages). As to whether the swap
bit (and more) should be eventually added on the pagemap, maybe this topic
makes more sense to continue on the Documentation thread.

[...]

It is possible for the swap device to be network attached and shared, so multiple
hosts would need to understand its content. Then it is no longer internal to one
kernel only.

By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that
is zero-copy.

That sounds reasonable, but I'm not aware of any user-API that exposes swap
entries to userspace, or is there one?

Good question. AFAIK, the swap device can be retrieved by using the swap type,
which is part of the swap entry. During our discussions, I was always assuming
that, if the pagemap entry kept track of the swap offset, it would keep track of the
swap type and, conversely, the swap device as well. Sorry if I haven't made this
assumption clear until now.

So we were relying on the pagemap to expose swap entry information. Seeing it
works for private pages, we thought it made sense to have worked on shared pages as well.

I.e., how do you know which swap device is which?  How do you guarantee the
kernel swp entry information won't change along with time?

I don't think we can guarantee it unless we halt the guest.

Yes, halting the guest helps, though then performance start to matter because
all time consumed in either pagemap or mincore() will be counted in as downtime
of the VM live migration, and it's not "live" at all during this period.

That's true. That is why this "halting" is supposed to only happen once, and
its duration needs to be minimised by pre-copy. Live migration AFAIK is intended
to have a single imperceptible pause at some point in order to converge more
quickly and cleanly. Without this halt, the whole migration procedure could run
indefinitely, or for days instead of hours/minutes.

I'm not sure how it was done with private mappings before, because I thought
that's a pre-requisite knowledge to decide whether we should migrate a page or
not, but I might have missed something.  We can stop vm, sample, start vm, but
it could become hiccups in the guest too, or otherwise contribute to downtime
when src/dst vm switches.

I see your point. Safe to say the vm should never be stopped during pre-copy.
See my comment above.

But QEMU does most
migration work in pre-copy using a best-effort approach anyway.

So, having a way to retrieve temporary, but accurate information about swap
entries (i.e. post-patch pagemap) should be enough to guarantee a smoother
migration process. It is intended to be repeated, unless there is no change
between iterations.

The kernel will allocate swap device index which will be assigned as swp_type,
right?  If there're multiple swap devices, how do you know which swp_entry is
located on which device?  I wanted to look for that info in "swapon -s" but I
didn't.  Or maybe that solution only works if there's only one swap device?

Besides, I also still have a question on the accuracy. If there's no formal way
for userspace to interact with the kernel, I'm wondering how to guarantee the
page will be kept swapped out, and data located on the swap device will always
be the latest?  Because IMHO the kernel can swap in pages as wish, even if it's
not accessed from the userspace.  After all, all these operations should be
transparent to userspace.

One example in my mind is we do have page fault-around enabled for shmem, so
that even if the VM is stopped, its pages can be faulted in if (unluckily, in
this case, though) some page near the swapped out page got faulted in - it
could be some qemu malloc()ed region that (again, unluckily) was allocated to
have a virtual address very close to the mmap()ed guest memories.

I am not sure whether it's a real problem, e.g., even if some page swapped in
during guest halted for some reason, if no one is writting to that page, at
least the data on the page will still be identical to the one located on the
swap device.  However I think that still sounds too tricky, and maybe fragile.

Great example, and good point. Thank you for raising that up. This looks like a
very real problem that we need to take into consideration in our prototypes.

--
Kind regards,

Tibi