diff mbox series

Documentation: update pagemap with SOFT_DIRTY & UFFD_WP shmem issue

Message ID 20210812155843.236919-1-tiberiu.georgescu@nutanix.com (mailing list archive)
State New
Headers show
Series Documentation: update pagemap with SOFT_DIRTY & UFFD_WP shmem issue | expand

Commit Message

Tiberiu A Georgescu Aug. 12, 2021, 3:58 p.m. UTC
Mentioning the current missing functionality of the pagemap, in case
someone stumbles upon unexpected behaviour.

Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
---
 Documentation/admin-guide/mm/pagemap.rst | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

David Hildenbrand Aug. 18, 2021, 7:14 p.m. UTC | #1
On 12.08.21 17:58, Tiberiu A Georgescu wrote:
> Mentioning the current missing functionality of the pagemap, in case
> someone stumbles upon unexpected behaviour.
> 
> Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
> Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
> Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
> Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
> Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
> ---
>   Documentation/admin-guide/mm/pagemap.rst | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
> index fb578fbbb76c..627f3832b3a2 100644
> --- a/Documentation/admin-guide/mm/pagemap.rst
> +++ b/Documentation/admin-guide/mm/pagemap.rst
> @@ -207,3 +207,9 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
>   always 12 at most architectures). Since Linux 3.11 their meaning changes
>   after first clear of soft-dirty bits. Since Linux 4.2 they are used for
>   flags unconditionally.
> +
> +Note that the page table entries for swappable and non-syncable pages are
> +cleared when those pages are zapped or swapped out. This makes information
> +about the page disappear from the pagemap.  The location of the swapped
> +page can still be retrieved from the page cache, but flags like SOFT_DIRTY
> +and UFFD_WP are lost irretrievably.
> 

UFFD_WP is currently only supported for private anonymous memory, where 
it should just work (a swap entry with a uffd-wp marker). So can we even 
end up with UFFD_WP bits on shmem and such? (Peter is up-streaming that 
right now, but there, I think he's intending to handle it properly 
without these bits getting lost using pte_markers and such).

So regarding upstream Linux, your note regarding UFFD_WP should not be 
applicable, right?


On a related note: if we start thinking about the pagemap expressing 
which pages are currently mapped into the page tables ("state of the 
process page tables") mostly all starts making sense. We document this 
as "to examine the page tables" already.

We only get swapped information if there is a swap PTE -- which only 
makes sense for anonymous pages, because there, the page table holds the 
state ("single source of truth"). For shmem, we don't have it, because 
the page cache is the single source of truth.

We only get presence information if there is a page mapped into the page 
tables -- which, for anonymous pages, specifies if there is anything 
present at all. For shmem we only have it if it's currently mapped into 
the page table.

Losing softdirt is a bad side effect of, what you describe, just setting 
a PTE to none and not syncing back that state back to some central place 
where it could be observed even without the PTE at hand.


Maybe we should document more clearly, especially what to expect for 
anonymous pages and what to expect for shared memory etc from the 
pagemap. Once we figured out which other interfaces we have to deal with 
shared memory (minore(), lseek() as we learned), we might want to 
document that as well, to safe people some time when exploring this area.
Tiberiu A Georgescu Aug. 20, 2021, 5:10 p.m. UTC | #2
Hello David,

> On 18 Aug 2021, at 20:14, David Hildenbrand <david@redhat.com> wrote:
> 
> On 12.08.21 17:58, Tiberiu A Georgescu wrote:
>> Mentioning the current missing functionality of the pagemap, in case
>> someone stumbles upon unexpected behaviour.
>> Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
>> Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
>> Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
>> Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
>> Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
>> ---
>>  Documentation/admin-guide/mm/pagemap.rst | 6 ++++++
>>  1 file changed, 6 insertions(+)
>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>> index fb578fbbb76c..627f3832b3a2 100644
>> --- a/Documentation/admin-guide/mm/pagemap.rst
>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>> @@ -207,3 +207,9 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
>>  always 12 at most architectures). Since Linux 3.11 their meaning changes
>>  after first clear of soft-dirty bits. Since Linux 4.2 they are used for
>>  flags unconditionally.
>> +
>> +Note that the page table entries for swappable and non-syncable pages are
>> +cleared when those pages are zapped or swapped out. This makes information
>> +about the page disappear from the pagemap.  The location of the swapped
>> +page can still be retrieved from the page cache, but flags like SOFT_DIRTY
>> +and UFFD_WP are lost irretrievably.
> 
> UFFD_WP is currently only supported for private anonymous memory, where it should just work (a swap entry with a uffd-wp marker). So can we even end up with UFFD_WP bits on shmem and such? (Peter is up-streaming that right now, but there, I think he's intending to handle it properly without these bits getting lost using pte_markers and such).

If that is the case, I guess we should not end up with UFFD_WP bits on shmem
ptes yet. Sorry for the confusion.

Great to hear Peter is upstreaming his patch soon. Is it this series[1] you
mention?

[1]: https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/
> 
> So regarding upstream Linux, your note regarding UFFD_WP should not be applicable, right?
> 
Right.
> 
> On a related note: if we start thinking about the pagemap expressing which pages are currently mapped into the page tables ("state of the process page tables") mostly all starts making sense. We document this as "to examine the page tables" already.
> 
> We only get swapped information if there is a swap PTE -- which only makes sense for anonymous pages, because there, the page table holds the state ("single source of truth"). For shmem, we don't have it, because the page cache is the single source of truth.
> 
> We only get presence information if there is a page mapped into the page tables -- which, for anonymous pages, specifies if there is anything present at all. For shmem we only have it if it's currently mapped into the page table.
> 
> Losing softdirt is a bad side effect of, what you describe, just setting a PTE to none and not syncing back that state back to some central place where it could be observed even without the PTE at hand.
> 
Yeah, that seems to be the case because shared memory behaves internally
as file-backed memory, but logically needs to be swapped to a swap device, not
to the disk. This turns shmem into an odd hybrid, which does not truly adhere to
the rules the other categories comply.
> 
> Maybe we should document more clearly, especially what to expect for anonymous pages and what to expect for shared memory etc from the pagemap. Once we figured out which other interfaces we have to deal with shared memory (minore(), lseek() as we learned), we might want to document that as well, to safe people some time when exploring this area.

I agree, as I found out first hand how eluding this information can be.
Thank you for your comments and discoveries mentioned on Peter's RFC thread[4], particularly the usage of mincore(), lseek() and proc/pid/map_files in
CRIU. I learned a lot from them. We should definitely add them as alternatives for
parts of the missing information.

Currently, the missing information for shmem is this:
1. Difference between is_swap(pte) and is_none(pte).
    * is_swap(pte) is always false;
    * is_none(pte) is true when is_swap() should have been;
    * is_present(pte) is fine.
2. swp_entry(pte)
    Particularly, swp_type() and swp_offset().
3. SOFT_DIRTY_BIT
    This is not always missing for shmem. 
    Once 4 is written to clear_refs, if the page is dirtied, the bit is fine as long as it
    is still in memory. If the page is swapped out, the bit is lost. Then, if the page is
    brought back into memory, the bit is still lost.

For 1, you mentioned how lseek() and madvise() can be used to get this
information [2], and I proposed a different method with a little help from
the current pagemap[3]. They have slightly different output and applications, so
the difference should be taken into consideration.
For 2, if anyone knows of any way of retrieve the missing information cleanly,
please let us know. 
As for 3, AFAIK, we will need to leverage Peter's special PTE marker mechanism
and implement it in another patch.

[2]: https://lore.kernel.org/lkml/5766d353-6ff8-fdfa-f8f9-764e8de9b5aa@redhat.com/
[3]: https://lore.kernel.org/lkml/B130B700-B3DB-4D07-A632-73030BCBC715@nutanix.com/

============================
For completeness, I would like to mention Peter's RFC[4] and my own patch[5],
which deal with adding missing functionality to the pagemap when pages are
shmem/tmpfs.

Peter's patch[4] adds the missing information at 1 to the pagemap, with very little performance overhead. AFAIK, it is still WIP.

My patch[5] fixes both 1 and 2, at the expense of a significant loss in performance
when dealing with swapped out shared pages. This performance loss can be
reduced with batching, for use cases when high performance matters. Also, this
patch on top of Peter's RFC yields better performance[6]. Still 2x as slow on
average compared to pre-patch.

Peter's patch has a config flag, and I intend to add one to mine in the next
version. So I wanted to propose, if alternatives are not implemented yet (mincore,
lseek, map_files or otherwise are insufficient), we upstream our patches (once
they are ready), so that users can toggle them on or off, depending on whether
they need the extra functionality or not. And, of course, document their usage.

If neither sounds like a particularly useful/convenient option, we might need to
look into designs of retrieving the missing information via another mechanism
(sys/fs, ioctl, netlink etc).

That is, unless we find that we can/should place this info in the pagemap still, for
the sake of correctness and completeness. For that though, we should convene
on what do we expect the pagemap to do in the end. Is shmem/tmpfs out of
bounds for it or not?

[4]: https://lore.kernel.org/lkml/20210807032521.7591-1-peterx@redhat.com/
[5]: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
[6]: https://lore.kernel.org/lkml/C0DB3FED-F779-4838-9697-D05BE96C3514@nutanix.com/

--
Kind regards,

Tibi Georgescu
Peter Xu Aug. 20, 2021, 8:25 p.m. UTC | #3
Hi, Tiberiu,

On Fri, Aug 20, 2021 at 05:10:20PM +0000, Tiberiu Georgescu wrote:
> Currently, the missing information for shmem is this:
> 1. Difference between is_swap(pte) and is_none(pte).
>     * is_swap(pte) is always false;
>     * is_none(pte) is true when is_swap() should have been;
>     * is_present(pte) is fine.
> 2. swp_entry(pte)
>     Particularly, swp_type() and swp_offset().
> 3. SOFT_DIRTY_BIT
>     This is not always missing for shmem. 
>     Once 4 is written to clear_refs, if the page is dirtied, the bit is fine as long as it
>     is still in memory. If the page is swapped out, the bit is lost. Then, if the page is
>     brought back into memory, the bit is still lost.
> 
> For 1, you mentioned how lseek() and madvise() can be used to get this
> information [2], and I proposed a different method with a little help from
> the current pagemap[3]. They have slightly different output and applications, so
> the difference should be taken into consideration.
> For 2, if anyone knows of any way of retrieve the missing information cleanly,
> please let us know. 
> As for 3, AFAIK, we will need to leverage Peter's special PTE marker mechanism
> and implement it in another patch.
> 
> [2]: https://lore.kernel.org/lkml/5766d353-6ff8-fdfa-f8f9-764e8de9b5aa@redhat.com/
> [3]: https://lore.kernel.org/lkml/B130B700-B3DB-4D07-A632-73030BCBC715@nutanix.com/
> 
> ============================
> For completeness, I would like to mention Peter's RFC[4] and my own patch[5],
> which deal with adding missing functionality to the pagemap when pages are
> shmem/tmpfs.
> 
> Peter's patch[4] adds the missing information at 1 to the pagemap, with very little performance overhead. AFAIK, it is still WIP.
> 
> My patch[5] fixes both 1 and 2, at the expense of a significant loss in performance
> when dealing with swapped out shared pages. This performance loss can be
> reduced with batching, for use cases when high performance matters. Also, this
> patch on top of Peter's RFC yields better performance[6]. Still 2x as slow on
> average compared to pre-patch.
> 
> Peter's patch has a config flag, and I intend to add one to mine in the next
> version. So I wanted to propose, if alternatives are not implemented yet (mincore,
> lseek, map_files or otherwise are insufficient), we upstream our patches (once
> they are ready), so that users can toggle them on or off, depending on whether
> they need the extra functionality or not. And, of course, document their usage.
> 
> If neither sounds like a particularly useful/convenient option, we might need to
> look into designs of retrieving the missing information via another mechanism
> (sys/fs, ioctl, netlink etc).
> 
> That is, unless we find that we can/should place this info in the pagemap still, for
> the sake of correctness and completeness. For that though, we should convene
> on what do we expect the pagemap to do in the end. Is shmem/tmpfs out of
> bounds for it or not?
> 
> [4]: https://lore.kernel.org/lkml/20210807032521.7591-1-peterx@redhat.com/
> [5]: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
> [6]: https://lore.kernel.org/lkml/C0DB3FED-F779-4838-9697-D05BE96C3514@nutanix.com/

Thanks for summarizing the issues.

Before going further, I really would like to understand a few questions that I
already raised in the other thread here:

https://lore.kernel.org/lkml/YR%2F+gfL8RCP8XoB1@t490s/

They're:

  (1) Whether does mincore() suit your need already?

  (2) What would you like to do with swap entries in pagemap?

I'm more interested in question (2) because I never figured it out before, and
I really don't see how it would work even if the kernel can share swap format
to userspace.  E.g., right after you decided to "zero copy" that page, the page
can be faulted in right before live migration finishes, and it can be dirtied
again.  Then the page on the shared network storage will be stall, the same to
the swap entry you just scanned.

Thanks,
David Hildenbrand Aug. 23, 2021, 8:40 a.m. UTC | #4
On 20.08.21 22:25, Peter Xu wrote:
> Hi, Tiberiu,
> 
> On Fri, Aug 20, 2021 at 05:10:20PM +0000, Tiberiu Georgescu wrote:
>> Currently, the missing information for shmem is this:
>> 1. Difference between is_swap(pte) and is_none(pte).
>>      * is_swap(pte) is always false;
>>      * is_none(pte) is true when is_swap() should have been;
>>      * is_present(pte) is fine.
>> 2. swp_entry(pte)
>>      Particularly, swp_type() and swp_offset().
>> 3. SOFT_DIRTY_BIT
>>      This is not always missing for shmem.
>>      Once 4 is written to clear_refs, if the page is dirtied, the bit is fine as long as it
>>      is still in memory. If the page is swapped out, the bit is lost. Then, if the page is
>>      brought back into memory, the bit is still lost.
>>
>> For 1, you mentioned how lseek() and madvise() can be used to get this
>> information [2], and I proposed a different method with a little help from
>> the current pagemap[3]. They have slightly different output and applications, so
>> the difference should be taken into consideration.
>> For 2, if anyone knows of any way of retrieve the missing information cleanly,
>> please let us know.
>> As for 3, AFAIK, we will need to leverage Peter's special PTE marker mechanism
>> and implement it in another patch.
>>
>> [2]: https://lore.kernel.org/lkml/5766d353-6ff8-fdfa-f8f9-764e8de9b5aa@redhat.com/
>> [3]: https://lore.kernel.org/lkml/B130B700-B3DB-4D07-A632-73030BCBC715@nutanix.com/
>>
>> ============================
>> For completeness, I would like to mention Peter's RFC[4] and my own patch[5],
>> which deal with adding missing functionality to the pagemap when pages are
>> shmem/tmpfs.
>>
>> Peter's patch[4] adds the missing information at 1 to the pagemap, with very little performance overhead. AFAIK, it is still WIP.
>>
>> My patch[5] fixes both 1 and 2, at the expense of a significant loss in performance
>> when dealing with swapped out shared pages. This performance loss can be
>> reduced with batching, for use cases when high performance matters. Also, this
>> patch on top of Peter's RFC yields better performance[6]. Still 2x as slow on
>> average compared to pre-patch.
>>
>> Peter's patch has a config flag, and I intend to add one to mine in the next
>> version. So I wanted to propose, if alternatives are not implemented yet (mincore,
>> lseek, map_files or otherwise are insufficient), we upstream our patches (once
>> they are ready), so that users can toggle them on or off, depending on whether
>> they need the extra functionality or not. And, of course, document their usage.
>>
>> If neither sounds like a particularly useful/convenient option, we might need to
>> look into designs of retrieving the missing information via another mechanism
>> (sys/fs, ioctl, netlink etc).
>>
>> That is, unless we find that we can/should place this info in the pagemap still, for
>> the sake of correctness and completeness. For that though, we should convene
>> on what do we expect the pagemap to do in the end. Is shmem/tmpfs out of
>> bounds for it or not?
>>
>> [4]: https://lore.kernel.org/lkml/20210807032521.7591-1-peterx@redhat.com/
>> [5]: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
>> [6]: https://lore.kernel.org/lkml/C0DB3FED-F779-4838-9697-D05BE96C3514@nutanix.com/
> 
> Thanks for summarizing the issues.
> 
> Before going further, I really would like to understand a few questions that I
> already raised in the other thread here:
> 
> https://lore.kernel.org/lkml/YR%2F+gfL8RCP8XoB1@t490s/
> 
> They're:
> 
>    (1) Whether does mincore() suit your need already?
> 
>    (2) What would you like to do with swap entries in pagemap?
> 
> I'm more interested in question (2) because I never figured it out before, and
> I really don't see how it would work even if the kernel can share swap format
> to userspace.  E.g., right after you decided to "zero copy" that page, the page
> can be faulted in right before live migration finishes, and it can be dirtied
> again.  Then the page on the shared network storage will be stall, the same to
> the swap entry you just scanned.

I wonder if one should much rather try using shared file-backed memory 
located on a network storage instead of hacking into swap here.
David Hildenbrand Aug. 23, 2021, 8:52 a.m. UTC | #5
On 20.08.21 19:10, Tiberiu Georgescu wrote:
> Hello David,
> 
>> On 18 Aug 2021, at 20:14, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.08.21 17:58, Tiberiu A Georgescu wrote:
>>> Mentioning the current missing functionality of the pagemap, in case
>>> someone stumbles upon unexpected behaviour.
>>> Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
>>> Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
>>> Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
>>> Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
>>> Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
>>> ---
>>>   Documentation/admin-guide/mm/pagemap.rst | 6 ++++++
>>>   1 file changed, 6 insertions(+)
>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>> index fb578fbbb76c..627f3832b3a2 100644
>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>> @@ -207,3 +207,9 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
>>>   always 12 at most architectures). Since Linux 3.11 their meaning changes
>>>   after first clear of soft-dirty bits. Since Linux 4.2 they are used for
>>>   flags unconditionally.
>>> +
>>> +Note that the page table entries for swappable and non-syncable pages are
>>> +cleared when those pages are zapped or swapped out. This makes information
>>> +about the page disappear from the pagemap.  The location of the swapped
>>> +page can still be retrieved from the page cache, but flags like SOFT_DIRTY
>>> +and UFFD_WP are lost irretrievably.
>>
>> UFFD_WP is currently only supported for private anonymous memory, where it should just work (a swap entry with a uffd-wp marker). So can we even end up with UFFD_WP bits on shmem and such? (Peter is up-streaming that right now, but there, I think he's intending to handle it properly without these bits getting lost using pte_markers and such).
> 
> If that is the case, I guess we should not end up with UFFD_WP bits on shmem
> ptes yet. Sorry for the confusion.
> 
> Great to hear Peter is upstreaming his patch soon. Is it this series[1] you
> mention?
> 
> [1]: https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

Yes, and that would take care of making the uffd-wp bit persistent.

>>
>> So regarding upstream Linux, your note regarding UFFD_WP should not be applicable, right?
>>
> Right.
>>
>> On a related note: if we start thinking about the pagemap expressing which pages are currently mapped into the page tables ("state of the process page tables") mostly all starts making sense. We document this as "to examine the page tables" already.
>>
>> We only get swapped information if there is a swap PTE -- which only makes sense for anonymous pages, because there, the page table holds the state ("single source of truth"). For shmem, we don't have it, because the page cache is the single source of truth.
>>
>> We only get presence information if there is a page mapped into the page tables -- which, for anonymous pages, specifies if there is anything present at all. For shmem we only have it if it's currently mapped into the page table.
>>
>> Losing softdirt is a bad side effect of, what you describe, just setting a PTE to none and not syncing back that state back to some central place where it could be observed even without the PTE at hand.
>>
> Yeah, that seems to be the case because shared memory behaves internally
> as file-backed memory, but logically needs to be swapped to a swap device, not
> to the disk. This turns shmem into an odd hybrid, which does not truly adhere to
> the rules the other categories comply.
>>
>> Maybe we should document more clearly, especially what to expect for anonymous pages and what to expect for shared memory etc from the pagemap. Once we figured out which other interfaces we have to deal with shared memory (minore(), lseek() as we learned), we might want to document that as well, to safe people some time when exploring this area.
> 
> I agree, as I found out first hand how eluding this information can be.
> Thank you for your comments and discoveries mentioned on Peter's RFC thread[4], particularly the usage of mincore(), lseek() and proc/pid/map_files in
> CRIU. I learned a lot from them. We should definitely add them as alternatives for
> parts of the missing information.
> 
> Currently, the missing information for shmem is this:
> 1. Difference between is_swap(pte) and is_none(pte).
>      * is_swap(pte) is always false;
>      * is_none(pte) is true when is_swap() should have been;

You can also have is_none(pte) if it should be is_present(pte).

>      * is_present(pte) is fine.

is_present(pte) is always correct when set, but might be wrong when not set.

> 2. swp_entry(pte)
>      Particularly, swp_type() and swp_offset().
> 3. SOFT_DIRTY_BIT
>      This is not always missing for shmem.
>      Once 4 is written to clear_refs, if the page is dirtied, the bit is fine as long as it
>      is still in memory. If the page is swapped out, the bit is lost. Then, if the page is
>      brought back into memory, the bit is still lost.

There are other cases that don't require swapping I think (THP 
splitting). I might be wrong.

> 
> For 1, you mentioned how lseek() and madvise() can be used to get this
> information [2], and I proposed a different method with a little help from
> the current pagemap[3]. They have slightly different output and applications, so
> the difference should be taken into consideration.

At this point I am pretty sure that the pagemap is the wrong mechanism 
for that. Pagemap never made that promise; it promised to tell you how 
the page tables currently look like, not the correct state of the 
underlying file.

> For 2, if anyone knows of any way of retrieve the missing information cleanly,
> please let us know.

As raised by Peter as well, there is much likely not a sane use case 
that should really rely on this. There might be corner cases (use case 
you mentioned), but that doesn't mean that we want to support them from 
a Linux ABI POV.

> As for 3, AFAIK, we will need to leverage Peter's special PTE marker mechanism
> and implement it in another patch.

Or come to the conclusion that softdirty+shmem in the current form is 
the wrong approach and you actually want to maintain such information in 
central place, from where you can retrieve reliably if shared memory has 
been modified by any user.

pagemap never worked reliably with softdirty/swap/present on shmem, so 
it's not a regression. It was always best effort.
Tiberiu A Georgescu Aug. 25, 2021, 3:48 p.m. UTC | #6
> On 23 Aug 2021, at 09:52, David Hildenbrand <david@redhat.com> wrote:
> 
> On 20.08.21 19:10, Tiberiu Georgescu wrote:
>> Hello David,
>>> On 18 Aug 2021, at 20:14, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>> On 12.08.21 17:58, Tiberiu A Georgescu wrote:
>>>> Mentioning the current missing functionality of the pagemap, in case
>>>> someone stumbles upon unexpected behaviour.
>>>> Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
>>>> Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
>>>> Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
>>>> Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
>>>> Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
>>>> ---
>>>>  Documentation/admin-guide/mm/pagemap.rst | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>>> index fb578fbbb76c..627f3832b3a2 100644
>>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>>> @@ -207,3 +207,9 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
>>>>  always 12 at most architectures). Since Linux 3.11 their meaning changes
>>>>  after first clear of soft-dirty bits. Since Linux 4.2 they are used for
>>>>  flags unconditionally.
>>>> +
>>>> +Note that the page table entries for swappable and non-syncable pages are
>>>> +cleared when those pages are zapped or swapped out. This makes information
>>>> +about the page disappear from the pagemap.  The location of the swapped
>>>> +page can still be retrieved from the page cache, but flags like SOFT_DIRTY
>>>> +and UFFD_WP are lost irretrievably.
>>> 
>>> UFFD_WP is currently only supported for private anonymous memory, where it should just work (a swap entry with a uffd-wp marker). So can we even end up with UFFD_WP bits on shmem and such? (Peter is up-streaming that right now, but there, I think he's intending to handle it properly without these bits getting lost using pte_markers and such).
>> If that is the case, I guess we should not end up with UFFD_WP bits on shmem
>> ptes yet. Sorry for the confusion.
>> Great to hear Peter is upstreaming his patch soon. Is it this series[1] you
>> mention?
>> [1]: https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_lkml_20210715201422.211004-2D1-2Dpeterx-40redhat.com_&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=rRM5dtWOv0DNo5dDxZ2U16jl4WAw6ql5szOKa9cu_RA&m=Hx-j4mONRY1L7rSzBw2fX9uS-mxLtz9qaknDaLVoeNI&s=Kl5f106w3YaBLQe0_T5RV1nvPm2pQumiVIKoE76d0yE&e= 
> 
> Yes, and that would take care of making the uffd-wp bit persistent.

Great news!
> 
>>> 
>>> So regarding upstream Linux, your note regarding UFFD_WP should not be applicable, right?
>>> 
>> Right.
>>> 
>>> On a related note: if we start thinking about the pagemap expressing which pages are currently mapped into the page tables ("state of the process page tables") mostly all starts making sense. We document this as "to examine the page tables" already.
>>> 
>>> We only get swapped information if there is a swap PTE -- which only makes sense for anonymous pages, because there, the page table holds the state ("single source of truth"). For shmem, we don't have it, because the page cache is the single source of truth.
>>> 
>>> We only get presence information if there is a page mapped into the page tables -- which, for anonymous pages, specifies if there is anything present at all. For shmem we only have it if it's currently mapped into the page table.
>>> 
>>> Losing softdirt is a bad side effect of, what you describe, just setting a PTE to none and not syncing back that state back to some central place where it could be observed even without the PTE at hand.
>>> 
>> Yeah, that seems to be the case because shared memory behaves internally
>> as file-backed memory, but logically needs to be swapped to a swap device, not
>> to the disk. This turns shmem into an odd hybrid, which does not truly adhere to
>> the rules the other categories comply.
>>> 
>>> Maybe we should document more clearly, especially what to expect for anonymous pages and what to expect for shared memory etc from the pagemap. Once we figured out which other interfaces we have to deal with shared memory (minore(), lseek() as we learned), we might want to document that as well, to safe people some time when exploring this area.
>> I agree, as I found out first hand how eluding this information can be.
>> Thank you for your comments and discoveries mentioned on Peter's RFC thread[4], particularly the usage of mincore(), lseek() and proc/pid/map_files in
>> CRIU. I learned a lot from them. We should definitely add them as alternatives for
>> parts of the missing information.
>> Currently, the missing information for shmem is this:
>> 1. Difference between is_swap(pte) and is_none(pte).
>>     * is_swap(pte) is always false;
>>     * is_none(pte) is true when is_swap() should have been;
> 
> You can also have is_none(pte) if it should be is_present(pte).

Does this happen if the pte is accessed before the reverse mapping procedure
finishes updating swapped in pages? I thought this case was being protected
somehow.
> 
>>     * is_present(pte) is fine.
> 
> is_present(pte) is always correct when set, but might be wrong when not set.

We were not able to find any case when is_present(pte) is true when it should
have been false, nor vice-versa. Is your previous comment also an example?
> 
>> 2. swp_entry(pte)
>>     Particularly, swp_type() and swp_offset().
>> 3. SOFT_DIRTY_BIT
>>     This is not always missing for shmem.
>>     Once 4 is written to clear_refs, if the page is dirtied, the bit is fine as long as it
>>     is still in memory. If the page is swapped out, the bit is lost. Then, if the page is
>>     brought back into memory, the bit is still lost.
> 
> There are other cases that don't require swapping I think (THP splitting). I might be wrong.

That's a good point. Still, the bit can disappear at any time.
> 
>> For 1, you mentioned how lseek() and madvise() can be used to get this
>> information [2], and I proposed a different method with a little help from
>> the current pagemap[3]. They have slightly different output and applications, so
>> the difference should be taken into consideration.
> 
> At this point I am pretty sure that the pagemap is the wrong mechanism for that. Pagemap never made that promise; it promised to tell you how the page tables currently look like, not the correct state of the underlying file.

I start thinking the same thing. Also considering the implementation, when
tackling shmem pages, the information gets retrieved in a very different way.

It is clear that private and shared pages, as of their current implementation,
are two very disjoint sets, which need to be reasoned about differently.

Now, my only concern is whether being aware of this difference can/should be 
the user's job, or whether the kernel is still better suited for handling this layer 
of abstraction. Because the former would usually mean a significant difference 
between operations just to retrieve info that is common to both types of
memory (imagine an interface/class in OOP), whereas the latter could impose 
performance deficits on operations that would only target one type of memory
anyway.
> 
>> For 2, if anyone knows of any way of retrieve the missing information cleanly,
>> please let us know.
> 
> As raised by Peter as well, there is much likely not a sane use case that should really rely on this. There might be corner cases (use case you mentioned), but that doesn't mean that we want to support them from a Linux ABI POV.

Considering the swap information (offset & type) is already exposed for private
pages by the pagemap, I can only comment that not supporting it at all for
shmem seems incomplete IMHO.

I agree pagemap might not be the ideal place for shmem swap info, but an
alternative for information that AFAIK cannot be retrieved in any other way
should be taken into consideration.
> 
>> As for 3, AFAIK, we will need to leverage Peter's special PTE marker mechanism
>> and implement it in another patch.
> 
> Or come to the conclusion that softdirty+shmem in the current form is the wrong approach and you actually want to maintain such information in central place, from where you can retrieve reliably if shared memory has been modified by any user.

That would definitely be more efficient, compared to having to update a
structure in each process every time a page gets soft-dirty. And more accurate.
> 
> pagemap never worked reliably with softdirty/swap/present on shmem, so it's not a regression. It was always best effort.

--
Kind regards,

Tibi
diff mbox series

Patch

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index fb578fbbb76c..627f3832b3a2 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -207,3 +207,9 @@  Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
 always 12 at most architectures). Since Linux 3.11 their meaning changes
 after first clear of soft-dirty bits. Since Linux 4.2 they are used for
 flags unconditionally.
+
+Note that the page table entries for swappable and non-syncable pages are
+cleared when those pages are zapped or swapped out. This makes information
+about the page disappear from the pagemap.  The location of the swapped
+page can still be retrieved from the page cache, but flags like SOFT_DIRTY
+and UFFD_WP are lost irretrievably.