[RFC,0/8] mm/madvise: support process_madvise(MADV_DONTNEED)

Message ID 20210926161259.238054-1-namit@vmware.com (mailing list archive)

Message

Nadav Amit Sept. 26, 2021, 4:12 p.m. UTC
From: Nadav Amit <namit@vmware.com>

The goal of these patches is to add support for
process_madvise(MADV_DONTNEED). Along the way, some (arguably) useful
cleanups, a bug fix, and performance enhancements are performed.

The patches try to consolidate the logic across the different behaviors,
and to a certain extent overlap/conflict with an outstanding patch that
does something similar [1]. This consolidation, however, is mostly
orthogonal to the aforementioned one, and is done in order to clarify
what is done with respect to locks and TLB for each behavior, and to
batch these operations more efficiently on process_madvise().

process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
userfaultfd monitors to unmap memory from monitored processes; and (b)
it is more efficient than madvise() since it is vectored and batches TLB
flushes more aggressively.
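
For illustration, a userspace caller would look roughly like this (a
minimal sketch assuming this series is applied; the raw syscall numbers
are used since libc wrappers may be unavailable):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Zap several discontiguous ranges of a target process in one call.
 * Requires a kernel with this series; otherwise the call fails with
 * EINVAL. */
static long zap_remote(pid_t pid, const struct iovec *iov, size_t vlen)
{
	int pidfd = syscall(__NR_pidfd_open, pid, 0);
	long ret;

	if (pidfd < 0)
		return -1;

	/* One vectored call; TLB flushes are batched across all ranges. */
	ret = syscall(__NR_process_madvise, pidfd, iov, vlen,
		      MADV_DONTNEED, 0);
	close(pidfd);
	return ret;
}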

The first three patches should not interfere with the outstanding patch
[1]. The rest might need to be rebased after [1] is applied.

[1] https://lore.kernel.org/linux-mm/CAJuCfpFDBJ_W1y2tqAT4BGtPbWrjjDud_JuKO8ZbnjYfeVNvRg@mail.gmail.com/

Cc: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>

Nadav Amit (8):
  mm/madvise: propagate vma->vm_end changes
  mm/madvise: remove unnecessary check on madvise_dontneed_free()
  mm/madvise: remove unnecessary checks on madvise_free_single_vma()
  mm/madvise: define madvise behavior in a struct
  mm/madvise: perform certain operations once on process_madvise()
  mm/madvise: more aggressive TLB batching
  mm/madvise: deduplicate code in madvise_dontneed_free()
  mm/madvise: process_madvise(MADV_DONTNEED)

 mm/internal.h |   5 +
 mm/madvise.c  | 396 ++++++++++++++++++++++++++++++--------------------
 mm/memory.c   |   2 +-
 3 files changed, 248 insertions(+), 155 deletions(-)

Comments

David Hildenbrand Sept. 27, 2021, 9:24 a.m. UTC | #1
On 26.09.21 18:12, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> The goal of these patches is to add support for
> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
> useful cleanups, a bug fix and performance enhancements are performed.
> 
> The patches try to consolidate the logic across different behaviors, and
> to a certain extent overlap/conflict with an outstanding patch that does
> something similar [1]. This consolidation however is mostly orthogonal
> to the aforementioned one and done in order to clarify what is done in
> respect to locks and TLB for each behavior and to batch these operations
> more efficiently on process_madvise().
> 
> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
> userfaultfd monitors to unmap memory from monitored processes; and (b)
> it is more efficient than madvise() since it is vectored and batches TLB
> flushes more aggressively.

MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this 
is very different from all the other process_madvise() calls we allow, 
which are merely hints and cannot break the target. I don't think this 
is acceptable.
Nadav Amit Sept. 27, 2021, 10:41 a.m. UTC | #2
> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 26.09.21 18:12, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> The goal of these patches is to add support for
>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>> useful cleanups, a bug fix and performance enhancements are performed.
>> The patches try to consolidate the logic across different behaviors, and
>> to a certain extent overlap/conflict with an outstanding patch that does
>> something similar [1]. This consolidation however is mostly orthogonal
>> to the aforementioned one and done in order to clarify what is done in
>> respect to locks and TLB for each behavior and to batch these operations
>> more efficiently on process_madvise().
>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>> it is more efficient than madvise() since it is vectored and batches TLB
>> flushes more aggressively.
> 
> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable.

This is a fair point, which I expected, but did not address properly.

I guess an additional capability, such as CAP_SYS_PTRACE, would need to
be required in this case. Would that ease your mind?
David Hildenbrand Sept. 27, 2021, 10:58 a.m. UTC | #3
On 27.09.21 12:41, Nadav Amit wrote:
> 
> 
>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.09.21 18:12, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>> The goal of these patches is to add support for
>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>>> useful cleanups, a bug fix and performance enhancements are performed.
>>> The patches try to consolidate the logic across different behaviors, and
>>> to a certain extent overlap/conflict with an outstanding patch that does
>>> something similar [1]. This consolidation however is mostly orthogonal
>>> to the aforementioned one and done in order to clarify what is done in
>>> respect to locks and TLB for each behavior and to batch these operations
>>> more efficiently on process_madvise().
>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>>> it is more efficient than madvise() since it is vectored and batches TLB
>>> flushes more aggressively.
>>
>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable.
> 
> This is a fair point, which I expected, but did not address properly.
> 
> I guess an additional capability, such as CAP_SYS_PTRACE needs to be
> required in this case. Would that ease your mind?

I think it would be slightly better, but I'm still missing a clear use 
case that justifies messing with the page tables of other processes in 
that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate 
a bit on a) and b)?

Especially, why would a) make sense or be required? When would it be a 
good idea to zap random pages of a target process, especially with 
MAP_PRIVATE? How would the use case make sure that the target process 
doesn't suddenly lose data? I would have assumed that you can really 
only do something sane with uffd() if 1) the process decided to give up 
on some pages (madvise(DONTNEED)) or 2) the process hasn't touched these 
pages yet.

Can you also comment a bit more on b)? Who cares about that? And would 
we suddenly expect users of madvise() to switch to process_madvise() 
because it's more effective? It sounds a bit weird to me TBH, but most 
probably I am missing details :)
Nadav Amit Sept. 27, 2021, noon UTC | #4
> On Sep 27, 2021, at 3:58 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 27.09.21 12:41, Nadav Amit wrote:
>>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>> On 26.09.21 18:12, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> The goal of these patches is to add support for
>>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>>>> useful cleanups, a bug fix and performance enhancements are performed.
>>>> The patches try to consolidate the logic across different behaviors, and
>>>> to a certain extent overlap/conflict with an outstanding patch that does
>>>> something similar [1]. This consolidation however is mostly orthogonal
>>>> to the aforementioned one and done in order to clarify what is done in
>>>> respect to locks and TLB for each behavior and to batch these operations
>>>> more efficiently on process_madvise().
>>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>>>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>>>> it is more efficient than madvise() since it is vectored and batches TLB
>>>> flushes more aggressively.
>>> 
>>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable.
>> This is a fair point, which I expected, but did not address properly.
>> I guess an additional capability, such as CAP_SYS_PTRACE needs to be
>> required in this case. Would that ease your mind?
> 
> I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)?
> 
> Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assume that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) b) the process hasn't touched these pages yet.
> 
> Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :)

Ok, ok, your criticism is fair. I tried to hold back some details in order to
prevent the discussion from digressing. I am going to focus on (a) which is
what I really have in mind.

The use-case that I explore is a userspace memory manager with some level
of cooperation from the monitored processes.

The manager is notified of memory regions that it should monitor
(through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
using the remote-userfaultfd that you saw on the second thread. When it
wants to reclaim (anonymous) memory, it:

1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
   UFFD-WP to do so efficiently, a patch which I did not send yet).
2. Calls process_vm_readv() to read that memory from the process.
3. Write it back to “swap”.
4. Calls process_madvise(MADV_DONTNEED) to zap it.

Once the memory is accessed again, the manager uses UFFD-COPY to bring it
back. This is really work-in-progress, but eventually performance is not as
bad as you would imagine (some patches for efficient use of uffd with
io_uring are needed for that matter). A condensed sketch of steps 1-4 is
below.
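
Here it is, with error handling trimmed; it assumes `uffd` was
registered on the target's region through the remote-userfaultfd
patches, and `swap_fd`/`slot` belong to the manager's own backing store:

#define _GNU_SOURCE
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

static int reclaim_range(int uffd, int pidfd, pid_t pid, void *addr,
			 size_t len, void *buf, int swap_fd, off_t slot)
{
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	struct iovec local  = { .iov_base = buf,  .iov_len = len };
	struct iovec remote = { .iov_base = addr, .iov_len = len };

	/* 1. Write-protect the region so racing writes fault first. */
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		return -1;
	/* 2. Read the memory out of the target process. */
	if (process_vm_readv(pid, &local, 1, &remote, 1, 0) != (ssize_t)len)
		return -1;
	/* 3. Write it back to the manager's "swap". */
	if (pwrite(swap_fd, buf, len, slot) != (ssize_t)len)
		return -1;
	/* 4. Zap it; this is the proposed process_madvise(MADV_DONTNEED). */
	return syscall(__NR_process_madvise, pidfd, &remote, 1,
		       MADV_DONTNEED, 0);
}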

I am aware that there are some caveats, as zapping the memory does not
guarantee that the memory would be freed since it might be pinned for a
variety of reasons. That's the reason I mentioned the processes have "some
level of cooperation" with the manager. It is not intended to deal with
adversaries or uncommon corner cases (e.g., processes that use UFFD for
their own reasons).

Putting aside my use-case (which I am sure people would be glad to criticize),
I can imagine debuggers or emulators may also find use for similar schemes
(although I do not have concrete use-cases for them).
Michal Hocko Sept. 27, 2021, 12:16 p.m. UTC | #5
On Mon 27-09-21 05:00:11, Nadav Amit wrote:
[...]
> The manager is notified on memory regions that it should monitor
> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> using the remote-userfaultfd that you saw on the second thread. When it wants
> to reclaim (anonymous) memory, it:
> 
> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>    UFFD-WP to do so efficiently, a patch which I did not send yet).
> 2. Calls process_vm_readv() to read that memory of that process.
> 3. Write it back to “swap”.
> 4. Calls process_madvise(MADV_DONTNEED) to zap it.

Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?

MADV_DONTNEED on a remote process has been proposed several times in the
past and it has always been rejected, because it is a free ticket to all
sorts of hard-to-debug problems: in effect, a free ticket for remote
memory corruption. An additional capability requirement might reduce the
risk to some degree, but I still do not think this is a good idea.
David Hildenbrand Sept. 27, 2021, 5:05 p.m. UTC | #6
On 27.09.21 14:00, Nadav Amit wrote:
> 
> 
>> On Sep 27, 2021, at 3:58 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 27.09.21 12:41, Nadav Amit wrote:
>>>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 26.09.21 18:12, Nadav Amit wrote:
>>>>> From: Nadav Amit <namit@vmware.com>
>>>>> The goal of these patches is to add support for
>>>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>>>>> useful cleanups, a bug fix and performance enhancements are performed.
>>>>> The patches try to consolidate the logic across different behaviors, and
>>>>> to a certain extent overlap/conflict with an outstanding patch that does
>>>>> something similar [1]. This consolidation however is mostly orthogonal
>>>>> to the aforementioned one and done in order to clarify what is done in
>>>>> respect to locks and TLB for each behavior and to batch these operations
>>>>> more efficiently on process_madvise().
>>>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>>>>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>>>>> it is more efficient than madvise() since it is vectored and batches TLB
>>>>> flushes more aggressively.
>>>>
>>>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable.
>>> This is a fair point, which I expected, but did not address properly.
>>> I guess an additional capability, such as CAP_SYS_PTRACE needs to be
>>> required in this case. Would that ease your mind?
>>
>> I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)?
>>
>> Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assume that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) b) the process hasn't touched these pages yet.
>>
>> Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :)
> 
> Ok, ok, your criticism is fair. I tried to hold back some details in order to
> prevent the discussion from digressing. I am going to focus on (a) which is
> what I really have in mind.

Thanks for the details!

> 
> The use-case that I explore is a userspace memory manager with some level of
> cooperation of the monitored processes.
> 
> The manager is notified on memory regions that it should monitor
> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> using the remote-userfaultfd that you saw on the second thread. When it wants
> to reclaim (anonymous) memory, it:
> 
> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>     UFFD-WP to do so efficiently, a patch which I did not send yet).
> 2. Calls process_vm_readv() to read that memory of that process.
> 3. Write it back to “swap”.
> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> 
> Once the memory is accessed again, the manager uses UFFD-COPY to bring it
> back. This is really work-in-progress, but eventually performance is not as
> bad as you would imagine (some patches for efficient use of uffd with
> iouring are needed for that matter).

Again, thanks for the details. I guess this should basically work, 
although it involves a lot of complexity (read: all flavors of uffd on 
other processes). And I am not so sure about the performance aspects. 
"Performance is not as bad as you think" doesn't sound like the words 
you would want to hear from a car dealer ;) So there has to be another 
big benefit to doing such user-space swapping.

> 
> I am aware that there are some caveats, as zapping the memory does not
> guarantee that the memory would be freed since it might be pinned for a
> variety of reasons. That's the reason I mentioned the processes have "some
> level of cooperation" with the manager. It is not intended to deal with
> adversaries or uncommon corner cases (e.g., processes that use UFFD for
> their own reasons).

It's not only long-term pinnings. Pages could have been de-duplicated 
(COW after fork, KSM, shared zeropage). Further, you'll most probably 
lose any kind of "aging" ("accessed") information on pages, or how would 
you track that?

Although I can see that this might work, I do wonder if it's a use case 
worth supporting. As Michal correctly raised, we already have other 
infrastructure in place to trigger swapin/swapout. I recall that DAMON 
also wants to let you write advanced policies for that by monitoring 
actual access characteristics.

> 
> Putting aside my use-case (which I am sure people would be glad to criticize),
> I can imagine debuggers or emulators may also find use for similar schemes
> (although I do not have concrete use-cases for them).

I'd be curious about use cases for debuggers/emulators. Especially for 
emulators I'd guess it makes more sense to just do it within the 
process. And for debuggers, I'm having a hard time seeing why it would 
make sense to throw away a page instead of just overwriting it with 
$PATTERN (e.g., 0). But I'm sure people can be creative :)
Nadav Amit Sept. 27, 2021, 7:12 p.m. UTC | #7
> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> [...]
>> The manager is notified on memory regions that it should monitor
>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>> using the remote-userfaultfd that you saw on the second thread. When it wants
>> to reclaim (anonymous) memory, it:
>> 
>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>   UFFD-WP to do so efficiently, a patch which I did not send yet).
>> 2. Calls process_vm_readv() to read that memory of that process.
>> 3. Write it back to “swap”.
>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> 
> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?

Providing hints to the kernel only takes you so far. The kernel does not
want (for a good reason) to be completely configurable when it comes to
reclaim and prefetch policies. Doing this from userspace allows you to
be fully configurable.

> MADV_DONTNEED on a remote process has been proposed in the past several
> times and it has always been rejected because it is a free ticket to all
> sorts of hard to debug problems as it is just a free ticket for a remote
> memory corruption. An additional capability requirement might reduce the
> risk to some degree but I still do not think this is a good idea.

I would argue that there is nothing bad that remote MADV_DONTNEED can do
that process_vm_writev() cannot do as well (putting aside ptrace).

process_vm_writev() is checking:

	mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS)

Wouldn't adding such a condition suffice?
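
Concretely, I mean something like this sketch (not part of this series;
madvise_access_mode() is a name I just made up, and process_madvise()
currently passes PTRACE_MODE_READ_FSCREDS):

static unsigned int madvise_access_mode(int behavior)
{
	/* MADV_DONTNEED mutates the target, like process_vm_writev(). */
	if (behavior == MADV_DONTNEED)
		return PTRACE_MODE_ATTACH_REALCREDS;

	/* MADV_COLD, MADV_PAGEOUT and friends are mere hints. */
	return PTRACE_MODE_READ_FSCREDS;
}

/* ... and in do_process_madvise():
 *	mm = mm_access(task, madvise_access_mode(behavior));
 */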
Nadav Amit Sept. 27, 2021, 7:59 p.m. UTC | #8
> On Sep 27, 2021, at 10:05 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 27.09.21 14:00, Nadav Amit wrote:
>>> On Sep 27, 2021, at 3:58 AM, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>> On 27.09.21 12:41, Nadav Amit wrote:
>>>>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote:
>>>>> 
>>>>> On 26.09.21 18:12, Nadav Amit wrote:
>>>>>> From: Nadav Amit <namit@vmware.com>
>>>>>> The goal of these patches is to add support for
>>>>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>>>>>> useful cleanups, a bug fix and performance enhancements are performed.
>>>>>> The patches try to consolidate the logic across different behaviors, and
>>>>>> to a certain extent overlap/conflict with an outstanding patch that does
>>>>>> something similar [1]. This consolidation however is mostly orthogonal
>>>>>> to the aforementioned one and done in order to clarify what is done in
>>>>>> respect to locks and TLB for each behavior and to batch these operations
>>>>>> more efficiently on process_madvise().
>>>>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>>>>>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>>>>>> it is more efficient than madvise() since it is vectored and batches TLB
>>>>>> flushes more aggressively.
>>>>> 
>>>>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable.
>>>> This is a fair point, which I expected, but did not address properly.
>>>> I guess an additional capability, such as CAP_SYS_PTRACE needs to be
>>>> required in this case. Would that ease your mind?
>>> 
>>> I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)?
>>> 
>>> Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assume that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) b) the process hasn't touched these pages yet.
>>> 
>>> Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :)
>> Ok, ok, your criticism is fair. I tried to hold back some details in order to
>> prevent the discussion from digressing. I am going to focus on (a) which is
>> what I really have in mind.
> 
> Thanks for the details!
> 
>> The use-case that I explore is a userspace memory manager with some level of
>> cooperation of the monitored processes.
>> The manager is notified on memory regions that it should monitor
>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>> using the remote-userfaultfd that you saw on the second thread. When it wants
>> to reclaim (anonymous) memory, it:
>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>    UFFD-WP to do so efficiently, a patch which I did not send yet).
>> 2. Calls process_vm_readv() to read that memory of that process.
>> 3. Write it back to “swap”.
>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>> Once the memory is accessed again, the manager uses UFFD-COPY to bring it
>> back. This is really work-in-progress, but eventually performance is not as
>> bad as you would imagine (some patches for efficient use of uffd with
>> iouring are needed for that matter).
> 
> Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am no so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping.

There is some complexity, indeed. Worse, there are some quirks of UFFD
that make life hard for no reason, and some uffd and io_uring bugs.

As for my sales pitch - I agree that I am not the best car dealer… :(
When I say performance is not bad, I mean that the core operations of
page-fault handling, prefetch and reclaim do not induce high overhead
*after* the improvements I sent or mentioned.

The benefit of doing so from userspace is that you have full control
over the reclaim/prefetch policies, so you may be able to make better
decisions.

Some workloads have predictable access patterns (see for instance "MAGE:
Nearly Zero-Cost Virtual Memory for Secure Computation”, OSDI’21). You may
be able to handle such access patterns without requiring intrusive changes
to the workload.


> 
>> I am aware that there are some caveats, as zapping the memory does not
>> guarantee that the memory would be freed since it might be pinned for a
>> variety of reasons. That's the reason I mentioned the processes have "some
>> level of cooperation" with the manager. It is not intended to deal with
>> adversaries or uncommon corner cases (e.g., processes that use UFFD for
>> their own reasons).
> 
> It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that?

I know it’s not just long-term pinnings. That’s what “variety of reasons”
stood for. ;-)

Aging is a tool for certain types of reclamation policies. Some do not
require it (e.g., random). You can also have compiler/application-guided
reclamation policies. If you are really into “aging”, you may be able
to use PEBS or other CPU facilities to track it.

Anyhow, the access-bit by itself is not such a great solution to track
aging. Setting it can induce overheads of >500 cycles, in my (and
others’) experience.

> 
> Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics.

Hints, such as those Michal mentioned, prevent the efficient use of
userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event
when the page is brought back from swap, so using
MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom
prefetch policy, for instance. It would also require you to live
with the kernel reclamation/IO stack, for better or worse.
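
To make "custom prefetch policy" concrete, here is a minimal sketch of a
fault handler that restores the faulting page plus a few neighbors with a
single UFFDIO_COPY. It assumes 4KiB pages, and fetch_from_swap() is a
hypothetical stand-in for the manager's own I/O path:

#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>

#define PAGE_SZ  4096UL		/* assumption: 4KiB pages */
#define PREFETCH 8		/* policy knob: pages restored per fault */

/* Hypothetical helper: reads pages from the manager's backing store. */
extern void fetch_from_swap(unsigned long addr, void *buf, size_t len);

static void handle_fault(int uffd, const struct uffd_msg *msg, char *buf)
{
	unsigned long addr = msg->arg.pagefault.address & ~(PAGE_SZ - 1);
	struct uffdio_copy copy = {
		.dst = addr,
		.src = (unsigned long)buf,
		.len = PREFETCH * PAGE_SZ,
		.mode = 0,
	};

	fetch_from_swap(addr, buf, PREFETCH * PAGE_SZ);

	/* Resolve the fault and prefetch the neighbors in one ioctl. */
	ioctl(uffd, UFFDIO_COPY, &copy);
}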

As for DAMON, I am not very familiar with it, but from what I remember
it seemed to go in a similar direction. IMHO it is more intrusive
and less configurable (although it can have the advantage of better
integration with various kernel mechanisms). I was wondering for a
second why you give me such a hard time for a pretty straightforward
extension of process_madvise(), but then I remembered that DAMON got
into the kernel after >30 versions, so I’ll shut up about that. ;-)

> 
>> Putting aside my use-case (which I am sure people would be glad to criticize),
>> I can imagine debuggers or emulators may also find use for similar schemes
>> (although I do not have concrete use-cases for them).
> 
> I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :)

I have some more vague ideas, but I am afraid that you will keep
saying that it makes more sense to handle such events from within
a process. I am not sure that this is true. Even for the emulators
that we discuss, the emulated program might run in a different
address space (for sandboxing). You may be able to avoid the need
for remote-UFFD and get away with the current non-cooperative
UFFD, but zapping the memory (for atomic updates) would still
require process_madvise(MADV_DONTNEED) [putting aside various
ptrace solutions].

Anyhow, David, I really appreciate your feedback, and you make
strong points about issues I encounter. Yet, eventually, I think
that the main question in this discussion is whether enabling
process_madvise(MADV_DONTNEED) is any different - from a security
point of view - from process_vm_writev(), not to mention ptrace.
If not, then the same security guards should suffice, I would
argue.
David Hildenbrand Sept. 28, 2021, 8:53 a.m. UTC | #9
>>
>> Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am no so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping.
> 
> There is some complexity, indeed. Worse, there are some quirks of UFFD
> that make life hard for no reason and some uffd and iouring bugs.
> 
> As for my sales pitch - I agree that I am not the best car dealer… :(

:)

> When I say performance is not bad, I mean that the core operations of
> page-fault handling, prefetch and reclaim do not induce high overhead
> *after* the improvements I sent or mentioned.
> 
> The benefit of doing so from userspace is that you have full control
> over the reclaim/prefetch policies, so you may be able to make better
> decisions.
> 
> Some workloads have predictable access patterns (see for instance "MAGE:
> Nearly Zero-Cost Virtual Memory for Secure Computation”, OSDI’21). You may
> be handle such access patterns without requiring intrusive changes to the
> workload.

Thanks for the pointer.

And my question would be if something like DAMON would actually be what 
you want.

> 
> 
>>
>>> I am aware that there are some caveats, as zapping the memory does not
>>> guarantee that the memory would be freed since it might be pinned for a
>>> variety of reasons. That's the reason I mentioned the processes have "some
>>> level of cooperation" with the manager. It is not intended to deal with
>>> adversaries or uncommon corner cases (e.g., processes that use UFFD for
>>> their own reasons).
>>
>> It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that?
> 
> I know it’s not just long-term pinnings. That’s what “variety of reasons”
> stood for. ;-)
> 
> Aging is a tool for certain types of reclamation policies. Some do not
> require it (e.g., random). You can also have compiler/application-guided
> reclamation policies. If you are really into “aging”, you may be able
> to use PEBS or other CPU facilities to track it.
> 
> Anyhow, the access-bit by itself not such a great solution to track
> aging. Setting it can induce overheads of >500 cycles from my (and
> others) experience.

Well, I'm certainly no expert on that; I would assume it's relevant in 
corner cases only: if your application accesses all its memory 
permanently, a swap setup is already "broken". If you have plenty of old 
memory (VMs, databases, ...) it should work reasonably well. But yeah, 
detecting the working set size is a hard problem, and "access" bits can 
be sub-optimal.

After all, that's what the Linux kernel has been relying on for a long 
time ... and IIRC it might be extended by multiple "aging" queues soon.

> 
>>
>> Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics.
> 
> Hints, as those that Michal mentioned, prevent the efficient use of
> userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event
> when the page is brought back from swap. So using
> MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom
> prefetch policy, for instance. It would also require you to live
> with the kernel reclamation/IO stack for better and worse.

Would more uffd (or similar) events help?

> 
> As for DAMON, I am not very familiar with it, but from what I remember
> it seemed to look on a similar direction. IMHO it is more intrusive
> and less configurable (although it can have the advantage of better
> integration with various kernel mechanism). I was wondering for a
> second why you give me such a hard time for a pretty straight-forward
> extension for process_madvise(), but then I remembered that DAMON got
> into the kernel after >30 versions, so I’ll shut up about that. ;-)

It took ... quite a long time, indeed :)

> 
>>
>>> Putting aside my use-case (which I am sure people would be glad to criticize),
>>> I can imagine debuggers or emulators may also find use for similar schemes
>>> (although I do not have concrete use-cases for them).
>>
>> I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :)
> 
> I have some more vague ideas, but I am afraid that you will keep
> saying that it makes more sense to handle such events from within
> a process. I am not sure that this is true. Even for the emulators
> that we discuss, the emulated program might run in a different
> address space (for sandboxing). You may be able to avoid the need
> for remote-UFFD and get away with the current non-cooperative
> UFFD, but zapping the memory (for atomic updates) would still
> require process_madvise(MADV_DONTNEED) [putting aside various
> ptrace solutions].
> 
> Anyhow, David, I really appreciate your feedback. And you make
> strong points about issues I encounter. Yet, eventually, I think
> that the main question in this discussion is whether enabling
> process_madvise(MADV_DONTNEED) is any different - from security
> point of view - than process_vm_writev(), not to mention ptrace.
> If not, then the same security guards should suffice, I would
> argue.
> 

You raise an excellent point (and it should have been part of your 
initial sales pitch): how does it differ from process_vm_writev()?

I can say that it differs in that you can break applications in more 
extreme ways. Let me give you two examples:

1. longterm pinnings: you raised this yourself; this can break an 
application silently and there is barely a safe way your tooling could 
handle it.

2. pagemap: applications can depend on the populated (present | swap) 
information in the pagemap for correctness. For example, there was 
recently a discussion to use pagemap information to speed up live 
migration of VMs, by skipping migration of !populated pages. There is 
currently no way your tooling can fake that (see the sketch below). In 
comparison, ordinary swapping in the kernel can handle it.
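
For illustration, the kind of check such an application might do against 
/proc/<pid>/pagemap (a minimal sketch; bit 63 = present, bit 62 = 
swapped):

#include <stdint.h>
#include <unistd.h>

/* pagemap_fd is an open /proc/<pid>/pagemap; returns 1 if the page
 * backing vaddr is populated (present or swapped), 0 if not, -1 on
 * error. */
static int page_populated(int pagemap_fd, unsigned long vaddr, long psz)
{
	uint64_t entry;
	off_t off = (off_t)(vaddr / psz) * sizeof(entry);

	if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
		return -1;

	return !!(entry & ((UINT64_C(1) << 63) | (UINT64_C(1) << 62)));
}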

Is it easy to break an application with process_vm_writev()? Yes. When 
talking about dynamic debugging, it's expected that you break the target 
already -- or the target is already broken. Is it easier to break an 
application with process_madvise(MADV_DONTNEED)? I'd say yes, especially 
when implementing something way beyond debugging as you describe.


I'm giving you "a hard time" for the reason Michal raised: we already 
discussed this in the past at least two times IIRC, and "it is a free 
ticket to all sorts of hard-to-debug problems" in our opinion; especially 
when we mess around in other processes' address spaces for anything 
besides debugging.

I'm not the person to ack/nack this, I'm just asking the questions :)
Nadav Amit Sept. 28, 2021, 10:56 p.m. UTC | #10
> On Sep 28, 2021, at 1:53 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>>> 
>>> Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am no so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping.
>> There is some complexity, indeed. Worse, there are some quirks of UFFD
>> that make life hard for no reason and some uffd and iouring bugs.
>> As for my sales pitch - I agree that I am not the best car dealer… :(
> 
> :)
> 
>> When I say performance is not bad, I mean that the core operations of
>> page-fault handling, prefetch and reclaim do not induce high overhead
>> *after* the improvements I sent or mentioned.
>> The benefit of doing so from userspace is that you have full control
>> over the reclaim/prefetch policies, so you may be able to make better
>> decisions.
>> Some workloads have predictable access patterns (see for instance "MAGE:
>> Nearly Zero-Cost Virtual Memory for Secure Computation”, OSDI’21). You may
>> be handle such access patterns without requiring intrusive changes to the
>> workload.
> 
> Thanks for the pointer.
> 
> And my question would be if something like DAMON would actually be what you want.

I looked into DAMON and even with the proposed future extensions it sounds
like a different approach with certain benefits but with many limitations.

The major limitation of DAMON is that you need to predefine the logic you
want for reclamation into the kernel. You can add programmability through
some API or even eBPF, but it would never be as easy or as versatile as
what a userspace manager can achieve. We already have pretty much all the
facilities to do so from userspace, and the missing parts (at least for a
basic userspace manager) are almost already there. In contrast, see how
many iterations were needed for the basic DAMON implementation.

The second, also big, difference is that DAMON looks only at reclamation.
If you want a custom prefetch scheme or a different I/O stack for backing
storage, you cannot have one.

> 
>>> 
>>>> I am aware that there are some caveats, as zapping the memory does not
>>>> guarantee that the memory would be freed since it might be pinned for a
>>>> variety of reasons. That's the reason I mentioned the processes have "some
>>>> level of cooperation" with the manager. It is not intended to deal with
>>>> adversaries or uncommon corner cases (e.g., processes that use UFFD for
>>>> their own reasons).
>>> 
>>> It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that?
>> I know it’s not just long-term pinnings. That’s what “variety of reasons”
>> stood for. ;-)
>> Aging is a tool for certain types of reclamation policies. Some do not
>> require it (e.g., random). You can also have compiler/application-guided
>> reclamation policies. If you are really into “aging”, you may be able
>> to use PEBS or other CPU facilities to track it.
>> Anyhow, the access-bit by itself not such a great solution to track
>> aging. Setting it can induce overheads of >500 cycles from my (and
>> others) experience.
> 
> Well, I'm certainly no expert on that; I would assume it's relevant in corner cases only: if you're application accesses all it's memory permanently a swap setup is already "broken". If you have plenty of old memory (VMs, databases, ...) it should work reasonably well. But yeah, detecting the working set size is a problematic problem, and "access"
> bits can be sub-optimal.
> 
> After all, that's what the Linux kernel has been relying on for a long time ... and IIRC it might be extended by multiple "aging" queues soon.

I see your point. I do not want to waste time in our discussion by
focusing on recency-based reclamation schemes and access-bits. I was
just trying to point out that such schemes and access-bits have their
own drawbacks.

> 
>>> 
>>> Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics.
>> Hints, as those that Michal mentioned, prevent the efficient use of
>> userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event
>> when the page is brought back from swap. So using
>> MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom
>> prefetch policy, for instance. It would also require you to live
>> with the kernel reclamation/IO stack for better and worse.
> 
> Would more uffd (or similar) events help?
> 
>> As for DAMON, I am not very familiar with it, but from what I remember
>> it seemed to look on a similar direction. IMHO it is more intrusive
>> and less configurable (although it can have the advantage of better
>> integration with various kernel mechanism). I was wondering for a
>> second why you give me such a hard time for a pretty straight-forward
>> extension for process_madvise(), but then I remembered that DAMON got
>> into the kernel after >30 versions, so I’ll shut up about that. ;-)
> 
> It took ... quite a long time, indeed :)
> 
>>> 
>>>> Putting aside my use-case (which I am sure people would be glad to criticize),
>>>> I can imagine debuggers or emulators may also find use for similar schemes
>>>> (although I do not have concrete use-cases for them).
>>> 
>>> I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :)
>> I have some more vague ideas, but I am afraid that you will keep
>> saying that it makes more sense to handle such events from within
>> a process. I am not sure that this is true. Even for the emulators
>> that we discuss, the emulated program might run in a different
>> address space (for sandboxing). You may be able to avoid the need
>> for remote-UFFD and get away with the current non-cooperative
>> UFFD, but zapping the memory (for atomic updates) would still
>> require process_madvise(MADV_DONTNEED) [putting aside various
>> ptrace solutions].
>> Anyhow, David, I really appreciate your feedback. And you make
>> strong points about issues I encounter. Yet, eventually, I think
>> that the main question in this discussion is whether enabling
>> process_madvise(MADV_DONTNEED) is any different - from security
>> point of view - than process_vm_writev(), not to mention ptrace.
>> If not, then the same security guards should suffice, I would
>> argue.
> 
> You raise a very excellent point (and it should have been part of your initial sales pitch): how does it differ to process_vm_writev().
> 
> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples:
> 
> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it.
> 
> 2. pagemap: applications can depend on the populated(present |swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it.

I understand (1). As for (2): the scenario that you mention sounds
very specific, and one can argue that ignoring UFFD-registered
regions in such a case is either (1) wrong or (2) should trigger
some UFFD event.

> 
> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe.

If you do not know what you are doing, you can easily break anything.
Note that there are other APIs that can break your application even
worse, specifically ptrace().

> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging.
> 
> I'm not the person to ack/nack this, I'm just asking the questions :)

I see your points, and I am trying to find a path of least resistance.
I thought that process_madvise() was a nice interface to hook into.

But if you are concerned that it will be misused, how about instead
adding an IOCTL that zaps pages, but only in UFFD-registered regions?
A separate IOCTL for this matter has the advantage of being more
tailored to UFFD, of not notifying UFFD upon “remove”, and of being
less likely to be misused.
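
Purely as an illustration, it could have roughly this shape -- the name
and ioctl number below are invented, no such ioctl exists:

/* Hypothetical: zap pages, restricted to UFFD-registered ranges. */
struct uffdio_dontneed {
	struct uffdio_range range;
	__u64 mode;			/* reserved for flags; must be 0 */
};

/* Number chosen only for the sketch; a real one would need allocating. */
#define UFFDIO_DONTNEED _IOWR(UFFDIO, 0x0a, struct uffdio_dontneed)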
Michal Hocko Sept. 29, 2021, 7:52 a.m. UTC | #11
On Mon 27-09-21 12:12:46, Nadav Amit wrote:
> 
> > On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> > [...]
> >> The manager is notified on memory regions that it should monitor
> >> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> >> using the remote-userfaultfd that you saw on the second thread. When it wants
> >> to reclaim (anonymous) memory, it:
> >> 
> >> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> >>   UFFD-WP to do so efficiently, a patch which I did not send yet).
> >> 2. Calls process_vm_readv() to read that memory of that process.
> >> 3. Write it back to “swap”.
> >> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> > 
> > Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
> 
> Providing hints to the kernel takes you so far to a certain extent.
> The kernel does not want to (for a good reason) to be completely
> configurable when it comes to reclaim and prefetch policies. Doing
> so from userspace allows you to be fully configurable.

I am sorry but I do not follow. Your scenario describes user-space
driven reclaim - something that MADV_{COLD,PAGEOUT} have been designed
for. What are you missing in the existing functionality?

> > MADV_DONTNEED on a remote process has been proposed in the past several
> > times and it has always been rejected because it is a free ticket to all
> > sorts of hard to debug problems as it is just a free ticket for a remote
> > memory corruption. An additional capability requirement might reduce the
> > risk to some degree but I still do not think this is a good idea.
> 
> I would argue that there is nothing bad that remote MADV_DONTNEED can do
> that process_vm_writev() cannot do as well (putting aside ptrace).

I am not arguing this would be the first syscall to allow tricky and
hard to debug corruptions if used without care.

> process_vm_writev() is checking:
> 
> 	mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS)
> 
> Wouldn't adding such a condition suffice?

This would be a minimum requirement. Another one is a sensible usecase
that is not covered by existing functionality.
Nadav Amit Sept. 29, 2021, 6:31 p.m. UTC | #12
> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>> 
>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote:
>>> 
>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
>>> [...]
>>>> The manager is notified on memory regions that it should monitor
>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
>>>> to reclaim (anonymous) memory, it:
>>>> 
>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>>>  UFFD-WP to do so efficiently, a patch which I did not send yet).
>>>> 2. Calls process_vm_readv() to read that memory of that process.
>>>> 3. Write it back to “swap”.
>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>>> 
>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>> 
>> Providing hints to the kernel takes you so far to a certain extent.
>> The kernel does not want to (for a good reason) to be completely
>> configurable when it comes to reclaim and prefetch policies. Doing
>> so from userspace allows you to be fully configurable.
> 
> I am sorry but I do not follow. Your scenario is describing a user
> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
> designed for. What are you missing in the existing functionality?

Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
many aspects of paging out memory:

1. Writeback: writeback ahead of time, dynamic clustering, etc.
2. Batching (regardless, MADV_PAGEOUT does a pretty bad batching job
   on non-contiguous memory).
3. No guarantee that the page is actually reclaimed (e.g., due to
   writeback), nor over when that takes place.
4. I/O stack for swapping - you must use the kernel I/O stack (FUSE,
   non-performant as it is, cannot be used for swap AFAIK).
5. Other operations (e.g., locking, working-set tracking) that might
   not be necessary or that might interfere.

In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
of userfaultfd to trap page-faults and react accordingly, so you
are also prevented from:

6. Having your own custom prefetching policy in response to #PF.

There are additional use-cases I can try to formalize in which
MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
is pretty clear, I think: one is a hint that applies only to
page reclamation; the other enables direct userspace control
over (almost) all aspects of paging.

As I suggested before, if it is preferred, this can be a UFFD
IOCTL instead of a process_madvise() behavior, thereby lowering
the risk of misuse.

I would emphasize that this feature (i.e.,
process_madvise(MADV_DONTNEED) or a similar new UFFD feature)
has little to no effect on kernel robustness, complexity, or
security, and requires no API changes. So the impact on the
kernel is negligible.
David Hildenbrand Oct. 4, 2021, 5:58 p.m. UTC | #13
>>
>> Thanks for the pointer.
>>
>> And my question would be if something like DAMON would actually be what you want.
> 
> I looked into DAMON and even with the proposed future extensions it sounds
> as a different approach with certain benefits but with many limitations.
> 
> The major limitation of DAMON is that you need to predefine the logic you
> want for reclamation into the kernel. You can add programability through
> some API or even eBPF, but it would never be as easy or as versatile as
> what user manager can achieve. We already have pretty much all the
> facilities to do so from userspace, and the missing parts (at least for
> basic userspace manager) are almost already there. In contrast, see how
> many iterations are needed for the basic DAMON implementation.

I can see what you're saying when looking at optimizing a handful of 
special applications. I still fail to see how something like that could 
work as a full replacement for in-kernel swapping. I'm happy to learn.

> 
> The second, also big, difference is that DAMON looks only on reclamation.
> If you want a custom prefetch scheme or different I/O stack for backing
> storage, you cannot have such one.

I do wonder if it could be extended for prefetching. But I am absolutely 
not a DAMON expert.

[...]

>>
>> You raise a very excellent point (and it should have been part of your initial sales pitch): how does it differ to process_vm_writev().
>>
>> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples:
>>
>> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it.
>>
>> 2. pagemap: applications can depend on the populated(present |swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it.
> 
> I understand (1). As for (2): the scenario that you mention sound
> very specific, and one can argue that ignoring UFFD-registered
> regions in such a case is either (1) wrong or (2) should trigger
> some UFFD event.
> 
>>
>> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe.
> 
> If you do not know what you are doing, you can easily break anything.
> Note that there are other APIs that can break your application even
> worse, specifically ptrace().
> 
>> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging.
>>
>> I'm not the person to ack/nack this, I'm just asking the questions :)
> 
> I see your points and I try to look for a path of least resistance.
> I thought that process_madvise() is a nice interface to hook into.

It would be the right interface -- if the operation didn't have a bad 
smell to it. We don't really want applications to mess around in the 
page table layout of some other process; however, that is exactly what 
you require. By unlocking that interface for that use case we would 
agree that what you are proposing is a "sane use case", but ...

> 
> But if you are concerned it will be misused, how about adding instead
> an IOCTL that will zap pages but only in UFFD-registered regions?
> A separate IOCTL for this matter have an advantage of being more
> tailored for UFFD, not to notify UFFD upon “remove” and to be less
> likely to be misused.

... that won't change the fact that with your user-space swapping 
approach, which requires this interface, we can break some applications 
silently - and that's really the major concern I have.

I mean, I think there are more cases where you can harm the target 
application, for example if the target application uses SOFTDIRTY 
tracking.


To judge if this is a sane use case we want to support, it would help a 
lot if there were actual code+evaluation from actually implementing some 
of these advanced policies. You raise a lot of interesting points in 
your reply to Michal to back your use case, and naive me thinks "this 
sounds interesting but ... aren't we losing a lot of flexibility+features 
when doing this in user space? Does anyone actually want to do it like 
that?".

Again, I'm not the person to ack/nack this; I'm just questioning whether 
the use case that requires this interface is actually something that will 
get used later in real life because it has real advantages, or whether 
it's a pure research project that will get abandoned at some point, 
leaving us with an interface exposed that we really didn't want to expose 
so far (especially because all other requests so far were bogus).
Nadav Amit Oct. 7, 2021, 4:19 p.m. UTC | #14
> On Oct 4, 2021, at 10:58 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>>> 
>>> Thanks for the pointer.
>>> 
>>> And my question would be if something like DAMON would actually be what you want.
>> I looked into DAMON and even with the proposed future extensions it sounds
>> like a different approach with certain benefits but with many limitations.
>> The major limitation of DAMON is that you need to predefine the logic you
>> want for reclamation into the kernel. You can add programmability through
>> some API or even eBPF, but it would never be as easy or as versatile as
>> what a userspace manager can achieve. We already have pretty much all the
>> facilities to do so from userspace, and the missing parts (at least for
>> a basic userspace manager) are almost already there. In contrast, see how
>> many iterations are needed for the basic DAMON implementation.
> 
> I can see what you're saying when looking at optimizing a handful of special applications. I still fail to see how something like that could work as a full replacement for in-kernel swapping. I'm happy to learn.

I am not arguing it is a full replacement, at least at this stage.

> 
>> The second, also big, difference is that DAMON looks only at reclamation.
>> If you want a custom prefetch scheme or a different I/O stack for backing
>> storage, you cannot have one.
> 
> I do wonder if it could be extended for prefetching. But I am absolutely not a DAMON expert.
> 
> [...]

These are two different approaches. One is to provide some logic
for the kernel (DAMON). The other is to provide userspace full
control over paging operations (with caveats). Obviously, due to
the caveats, the kernel paging mechanism behaves as a backup.

> 
>>> 
>>> You raise an excellent point (and it should have been part of your initial sales pitch): how does it differ from process_vm_writev()?
>>> 
>>> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples:
>>> 
>>> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it.
>>> 
>>> 2. pagemap: applications can depend on the populated (present | swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it.
>> I understand (1). As for (2): the scenario that you mention sounds
>> very specific, and one can argue that ignoring UFFD-registered
>> regions in such a case either (1) is wrong or (2) should trigger
>> some UFFD event.
>>> 
>>> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe.
>> If you do not know what you are doing, you can easily break anything.
>> Note that there are other APIs that can break your application even
>> worse, specifically ptrace().
>>> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging.
>>> 
>>> I'm not the person to ack/nack this, I'm just asking the questions :)
>> I see your points and I try to look for a path of least resistance.
>> I thought that process_madvise() is a nice interface to hook into.
> 
> It would be the right interface -- iff the operation wouldn't have a bad smell to it. We don't really want applications to mess around in the page table layout of some other process: however, that is exactly what you require. By unlocking that interface for that use case we agree that what you are proposing is a "sane use case", but ...
> 
>> But if you are concerned it will be misused, how about adding instead
>> an IOCTL that will zap pages but only in UFFD-registered regions?
>> A separate IOCTL for this matter has the advantage of being more
>> tailored for UFFD, of not notifying UFFD upon “remove”, and of being
>> less likely to be misused.
> 
> ... that won't change the fact that with your user-space swapping approach, which requires this interface, we can break some applications silently, and that's really the major concern I have.
> 
> I mean, there are more cases where you can just harm the target application, I think; for example, if the target application uses SOFTDIRTY tracking.
> 
> 
> To judge if this is a sane use case we want to support, it would help a lot if there were actual code+evaluation from actually implementing some of these advanced policies. You raise a lot of interesting points in your reply to Michal to back your use case, and naive me thinks "this sounds interesting but ... aren't we losing a lot of flexibility+features when doing this in user space? Does anyone actually want to do it like that?".
> 
> Again, I'm not the person to ack/nack this, I'm just questioning if the use case that requires this interface is actually something that will get used later in real life because it has real advantages, or if it's a pure research project that will get abandoned at some point and we end up having exposed an interface we really didn't want to expose so far (especially because all other requests so far were bogus).

I do want to release the code, but it is really
incomplete/immature at this point. I would note that there are
additional use cases, such as workloads that have discardable caches
(or memoization data), which want a central or other entity to discard
the data when there is memory pressure. (You can think about it as a
userspace shrinker.)
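
As a sketch (assuming the MADV_DONTNEED support proposed in this series;
there is no glibc wrapper, so this goes through syscall(2) with headers
new enough to define SYS_process_madvise; the helper name is just for
illustration), such an entity would do roughly:

#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Zap a discardable region of the target identified by pidfd. */
static long discard_remote(int pidfd, void *start, size_t len)
{
        struct iovec iov = { .iov_base = start, .iov_len = len };

        return syscall(SYS_process_madvise, pidfd, &iov, 1,
                       MADV_DONTNEED, 0);
}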

Anyhow, as a path of least resistance, I think I would do the
following:

1. Wait for the other madvise-related patches to be applied.
2. Simplify the patches, specifically removing the data structure
   changes based on Kirill's feedback.
3. Defer the enablement of MADV_DONTNEED until I can show
   code/performance numbers.

Regards,
Nadav
David Hildenbrand Oct. 7, 2021, 4:46 p.m. UTC | #15
On 07.10.21 18:19, Nadav Amit wrote:
> 
> 
>> On Oct 4, 2021, at 10:58 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>>>>
>>>> Thanks for the pointer.
>>>>
>>>> And my question would be if something like DAMON would actually be what you want.
>>> I looked into DAMON and even with the proposed future extensions it sounds
>>> like a different approach with certain benefits but with many limitations.
>>> The major limitation of DAMON is that you need to predefine the logic you
>>> want for reclamation into the kernel. You can add programmability through
>>> some API or even eBPF, but it would never be as easy or as versatile as
>>> what a userspace manager can achieve. We already have pretty much all the
>>> facilities to do so from userspace, and the missing parts (at least for
>>> a basic userspace manager) are almost already there. In contrast, see how
>>> many iterations are needed for the basic DAMON implementation.
>>
>> I can see what you're saying when looking at optimizing a handful of special applications. I still fail to see how something like that could work as a full replacement for in-kernel swapping. I'm happy to learn.
> 
> I am not arguing it is a full replacement, at least at this stage.
> 
>>
>>> The second, also big, difference is that DAMON looks only at reclamation.
>>> If you want a custom prefetch scheme or a different I/O stack for backing
>>> storage, you cannot have one.
>>
>> I do wonder if it could be extended for prefetching. But I am absolutely not a DAMON expert.
>>
>> [...]
> 
> These are two different approaches. One is to provide some logic
> for the kernel (DAMON). The other is to provide userspace full
> control over paging operations (with caveats). Obviously, due to
> the caveats, the kernel paging mechanism behaves as a backup.
> 
>>
>>>>
>>>> You raise an excellent point (and it should have been part of your initial sales pitch): how does it differ from process_vm_writev()?
>>>>
>>>> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples:
>>>>
>>>> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it.
>>>>
>>>> 2. pagemap: applications can depend on the populated (present | swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it.
>>> I understand (1). As for (2): the scenario that you mention sounds
>>> very specific, and one can argue that ignoring UFFD-registered
>>> regions in such a case either (1) is wrong or (2) should trigger
>>> some UFFD event.
>>>>
>>>> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe.
>>> If you do not know what you are doing, you can easily break anything.
>>> Note that there are other APIs that can break your application even
>>> worse, specifically ptrace().
>>>> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging.
>>>>
>>>> I'm not the person to ack/nack this, I'm just asking the questions :)
>>> I see your points and I try to look for a path of least resistance.
>>> I thought that process_madvise() is a nice interface to hook into.
>>
>> It would be the right interface -- iff the operation wouldn't have a bad smell to it. We don't really want applications to mess around in the page table layout of some other process: however, that is exactly what you require. By unlocking that interface for that use case we agree that what you are proposing is a "sane use case", but ...
>>
>>> But if you are concerned it will be misused, how about adding instead
>>> an IOCTL that will zap pages but only in UFFD-registered regions?
>>> A separate IOCTL for this matter has the advantage of being more
>>> tailored for UFFD, of not notifying UFFD upon “remove”, and of being
>>> less likely to be misused.
>>
>> ... that won't change the fact that with your user-space swapping approach, which requires this interface, we can break some applications silently, and that's really the major concern I have.
>>
>> I mean, there are more cases where you can just harm the target application, I think; for example, if the target application uses SOFTDIRTY tracking.
>>
>>
>> To judge if this is a sane use case we want to support, it would help a lot if there were actual code+evaluation from actually implementing some of these advanced policies. You raise a lot of interesting points in your reply to Michal to back your use case, and naive me thinks "this sounds interesting but ... aren't we losing a lot of flexibility+features when doing this in user space? Does anyone actually want to do it like that?".
>>
>> Again, I'm not the person to ack/nack this, I'm just questioning if the use case that requires this interface is actually something that will get used later in real life because it has real advantages, or if it's a pure research project that will get abandoned at some point and we end up having exposed an interface we really didn't want to expose so far (especially because all other requests so far were bogus).
> 
> I do want to release the code, but it is really
> incomplete/immature at this point. I would note that there are
> additional use cases, such as workloads that have discardable caches
> (or memoization data), which want a central or other entity to discard
> the data when there is memory pressure. (You can think about it as a
> userspace shrinker.)
> 
> Anyhow, as a path of least resistance, I think I would do the
> following:
> 
> 1. Wait for the other madvise-related patches to be applied.
> 2. Simplify the patches, specifically removing the data structure
>     changes based on Kirill's feedback.
> 3. Defer the enablement of MADV_DONTNEED until I can show
>     code/performance numbers.

Sounds excellent. For your project to make progress at this stage I 
assume this stuff doesn't have to be upstream, but it's good to discuss 
upstream-ability.

Happy to learn more once you have more details to share.
Peter Xu Oct. 12, 2021, 11:14 p.m. UTC | #16
On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
> 
> 
> > On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Mon 27-09-21 12:12:46, Nadav Amit wrote:
> >> 
> >>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote:
> >>> 
> >>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> >>> [...]
> >>>> The manager is notified on memory regions that it should monitor
> >>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> >>>> using the remote-userfaultfd that you saw on the second thread. When it wants
> >>>> to reclaim (anonymous) memory, it:
> >>>> 
> >>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> >>>>  UFFD-WP to do so efficiently, a patch which I did not send yet).
> >>>> 2. Calls process_vm_readv() to read that memory of that process.
> >>>> 3. Write it back to “swap”.
> >>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> >>> 
> >>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
> >> 
> >>>> Providing hints to the kernel takes you only so far.
> >>>> The kernel does not want (for a good reason) to be completely
> >>>> configurable when it comes to reclaim and prefetch policies. Doing
> >> so from userspace allows you to be fully configurable.
> > 
> > I am sorry but I do not follow. Your scenario is describing a user
> > space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
> > designed for. What are you missing in the existing functionality?
> 
> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
> many aspects of paging out memory:
> 
> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
> 2. Batching (regardless, MADV_PAGEOUT does a pretty bad batching job
>    on non-contiguous memory).
> 3. No guarantee the page is actually reclaimed (e.g., writeback),
>    nor when that takes place.
> 4. I/O stack for swapping - you must use the kernel I/O stack (FUSE,
>    as non-performant as it is, cannot be used for swap AFAIK).
> 5. Other operations (e.g., locking, working set tracking) that
>    might not be necessary or interfere.
> 
> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
> of userfaultfd to trap page-faults and react accordingly, so you
> are also prevented from:
> 
> 6. Having your own custom prefetching policy in response to #PF.
> 
> There are additional use-cases I can try to formalize in which
> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
> is pretty clear, I think: one is a hint that applies only to
> page reclamation. The other enables the direct control of
> userspace over (almost) all aspects of paging.
> 
> As I suggested before, if it is preferred, this can be a UFFD
> IOCTL instead of process_madvise() behavior, thereby lowering
> the risk of a misuse.

(Sorry to join so late..)

Yeah, I'm wondering whether that could add one extra layer of security.  But as
you mentioned, we already have process_vm_writev(), so it's indeed not a
strong reason to reject process_madvise(DONTNEED) too, it seems.
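
For reference, the four steps quoted above would look roughly like the
following sketch, assuming the proposed MADV_DONTNEED support lands,
taking "uffd" to be the remote userfaultfd from your other series, and
with the helper name invented here (error handling omitted):

#define _GNU_SOURCE
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

static int swap_out(int uffd, int pidfd, pid_t pid, int swap_fd,
                    off_t slot, char *buf, void *addr, size_t len)
{
        struct uffdio_writeprotect wp = {
                .range = { .start = (__u64)(unsigned long)addr, .len = len },
                .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        struct iovec liov = { buf, len }, riov = { addr, len };

        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))              /* step 1 */
                return -1;
        if (process_vm_readv(pid, &liov, 1, &riov, 1, 0)        /* step 2 */
            != (ssize_t)len)
                return -1;
        if (pwrite(swap_fd, buf, len, slot) != (ssize_t)len)    /* step 3 */
                return -1;
        return syscall(SYS_process_madvise, pidfd, &riov, 1,    /* step 4 */
                       MADV_DONTNEED, 0) == (long)len ? 0 : -1;
}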

Not sure whether you're aware of the umap project from LLNL:

  https://github.com/LLNL/umap

From what I can tell, that's really doing a very similar thing to what you
proposed here, but it's just a local version of things.  IOW, in umap the
DONTNEED can be done locally with madvise() already, in the umap-maintained
threads.  That removes the need to introduce the new process_madvise()
interface, and it's definitely safer as it's per-mm and per-task.

I think you mentioned above that the tracee program will need to cooperate in
this case; I'm wondering whether some solution like umap would be fine too, as
that also requires cooperation of the tracee program. It's just that the
cooperation may be slightly more than with your solution, but frankly I think
that's still trivial, and before I understand the details of your solution I
can't really tell..

E.g. for a program to use umap, I think it needs to replace mmap() with umap()
where we want the buffers to be managed by the umap library rather than the
kernel; then linking against the umap library should work.  If the remote
solution you're proposing requires similar (or even more complicated)
cooperation, then it'll be controversial whether that can be done per-mm just
like how umap is designed and used.  So IMHO it'll be great to share more
details on those parts if umap cannot satisfy the current need - IMHO it
satisfies all the features you described for fully customized pageout and
page fault handling, it's just done within a single mm.
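
E.g., the cooperation amounts to roughly this (a sketch going by umap's
mmap-like entry points as I remember them; please check the repo above
for the real signatures and flags):

#include <fcntl.h>
#include <sys/mman.h>
#include <umap/umap.h>  /* LLNL umap library */

static void *managed_buffer(const char *path, size_t len)
{
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return NULL;
        /* umap() mirrors mmap(); faults in the region are then serviced
         * by umap's handler threads in this same mm, and eviction can
         * use plain madvise(), with no remote interface needed. */
        return umap(NULL, len, PROT_READ | PROT_WRITE, UMAP_PRIVATE,
                    fd, 0);
}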

Thanks,
Nadav Amit Oct. 13, 2021, 3:47 p.m. UTC | #17
> On Oct 12, 2021, at 4:14 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
>> 
>> 
>>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote:
>>> 
>>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>>>> 
>>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote:
>>>>> 
>>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
>>>>> [...]
>>>>>> The manager is notified on memory regions that it should monitor
>>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
>>>>>> to reclaim (anonymous) memory, it:
>>>>>> 
>>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>>>>> UFFD-WP to do so efficiently, a patch which I did not send yet).
>>>>>> 2. Calls process_vm_readv() to read that memory of that process.
>>>>>> 3. Write it back to “swap”.
>>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>>>>> 
>>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>>>> 
>>>> Providing hints to the kernel takes you only so far.
>>>> The kernel does not want (for a good reason) to be completely
>>>> configurable when it comes to reclaim and prefetch policies. Doing
>>>> so from userspace allows you to be fully configurable.
>>> 
>>> I am sorry but I do not follow. Your scenario is describing a user
>>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
>>> designed for. What are you missing in the existing functionality?
>> 
>> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
>> many aspects of paging out memory:
>> 
>> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
>> 2. Batching (regardless, MADV_PAGEOUT does a pretty bad batching job
>>  on non-contiguous memory).
>> 3. No guarantee the page is actually reclaimed (e.g., writeback),
>>  nor when that takes place.
>> 4. I/O stack for swapping - you must use the kernel I/O stack (FUSE,
>>  as non-performant as it is, cannot be used for swap AFAIK).
>> 5. Other operations (e.g., locking, working set tracking) that
>>  might not be necessary or interfere.
>> 
>> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
>> of userfaultfd to trap page-faults and react accordingly, so you
>> are also prevented from:
>> 
>> 6. Having your own custom prefetching policy in response to #PF.
>> 
>> There are additional use-cases I can try to formalize in which
>> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
>> is pretty clear, I think: one is a hint that applies only to
>> page reclamation. The other enables the direct control of
>> userspace over (almost) all aspects of paging.
>> 
>> As I suggested before, if it is preferred, this can be a UFFD
>> IOCTL instead of process_madvise() behavior, thereby lowering
>> the risk of a misuse.
> 
> (Sorry to join so late..)
> 
> Yeah, I'm wondering whether that could add one extra layer of security.  But as
> you mentioned, we already have process_vm_writev(), so it's indeed not a
> strong reason to reject process_madvise(DONTNEED) too, it seems.
> 
> Not sure whether you're aware of the umap project from LLNL:
> 
> https://github.com/LLNL/umap
> 
> From what I can tell, that's really doing a very similar thing to what you
> proposed here, but it's just a local version of things.  IOW, in umap the
> DONTNEED can be done locally with madvise() already, in the umap-maintained
> threads.  That removes the need to introduce the new process_madvise()
> interface, and it's definitely safer as it's per-mm and per-task.
> 
> I think you mentioned above that the tracee program will need to cooperate in
> this case; I'm wondering whether some solution like umap would be fine too, as
> that also requires cooperation of the tracee program. It's just that the
> cooperation may be slightly more than with your solution, but frankly I think
> that's still trivial, and before I understand the details of your solution I
> can't really tell..
> 
> E.g. for a program to use umap, I think it needs to replace mmap() with umap()
> where we want the buffers to be managed by the umap library rather than the
> kernel; then linking against the umap library should work.  If the remote
> solution you're proposing requires similar (or even more complicated)
> cooperation, then it'll be controversial whether that can be done per-mm just
> like how umap is designed and used.  So IMHO it'll be great to share more
> details on those parts if umap cannot satisfy the current need - IMHO it
> satisfies all the features you described for fully customized pageout and
> page fault handling, it's just done within a single mm.

Thanks for your feedback, Peter.

I am familiar with umap, perhaps not enough, but I am aware.

From my experience, the existing interfaces are not sufficient if you look
for a high-performance (low-overhead) solution for multiple processes. The
level of cooperation that I mentioned is something I raised preemptively
to avoid unnecessary discussion, but I believe those issues can be
resolved (I have just deferred handling them).

Specifically for performance, several new kernel features are needed, for
instance support for io_uring with async operations, a vectored
UFFDIO_WRITEPROTECT(V) that batches TLB flushes across VMAs, and a
vectored madvise(). Even if we talk in the context of a single mm, I
cannot see umap being performant for low-latency devices without those
facilities.
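
To illustrate the kind of interface I mean (purely illustrative, the
unsent patch may end up looking different): a single ioctl that takes
many ranges, so one TLB flush can cover all of them:

/* Illustrative only, not an existing uAPI. */
struct uffdio_writeprotectv {
        __u64 ranges;   /* user pointer to an array of struct uffdio_range */
        __u64 vlen;     /* number of ranges in the array */
        __u64 mode;     /* UFFDIO_WRITEPROTECT_MODE_* flags */
};

/* The kernel side would write-protect all ranges under one mmu_gather,
 * batching the TLB flush across VMAs instead of flushing once per
 * ioctl()/madvise() call. */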

Anyhow, I take your feedback, and I will resend the patch for enabling
MADV_DONTNEED with the other patches once I am done. As for the TLB
batching itself, I think it has independent value - but I am not going to
argue about it now if there is pushback against it.
Peter Xu Oct. 13, 2021, 11:09 p.m. UTC | #18
On Wed, Oct 13, 2021 at 08:47:11AM -0700, Nadav Amit wrote:
> 
> 
> > On Oct 12, 2021, at 4:14 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
> >> 
> >> 
> >>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote:
> >>> 
> >>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
> >>>> 
> >>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote:
> >>>>> 
> >>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> >>>>> [...]
> >>>>>> The manager is notified on memory regions that it should monitor
> >>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> >>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
> >>>>>> to reclaim (anonymous) memory, it:
> >>>>>> 
> >>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> >>>>>> UFFD-WP to do so efficiently, a patch which I did not send yet).
> >>>>>> 2. Calls process_vm_readv() to read that memory of that process.
> >>>>>> 3. Write it back to “swap”.
> >>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> >>>>> 
> >>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
> >>>> 
> >>>> Providing hints to the kernel takes you only so far.
> >>>> The kernel does not want (for a good reason) to be completely
> >>>> configurable when it comes to reclaim and prefetch policies. Doing
> >>>> so from userspace allows you to be fully configurable.
> >>> 
> >>> I am sorry but I do not follow. Your scenario is describing a user
> >>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
> >>> designed for. What are you missing in the existing functionality?
> >> 
> >> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
> >> many aspects of paging out memory:
> >> 
> >> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
> >> 2. Batching (regardless, MADV_PAGEOUT does a pretty bad batching job
> >>  on non-contiguous memory).
> >> 3. No guarantee the page is actually reclaimed (e.g., writeback),
> >>  nor when that takes place.
> >> 4. I/O stack for swapping - you must use the kernel I/O stack (FUSE,
> >>  as non-performant as it is, cannot be used for swap AFAIK).
> >> 5. Other operations (e.g., locking, working set tracking) that
> >>  might not be necessary or interfere.
> >> 
> >> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
> >> of userfaultfd to trap page-faults and react accordingly, so you
> >> are also prevented from:
> >> 
> >> 6. Having your own custom prefetching policy in response to #PF.
> >> 
> >> There are additional use-cases I can try to formalize in which
> >> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
> >> is pretty clear, I think: one is a hint that applies only to
> >> page reclamation. The other enables the direct control of
> >> userspace over (almost) all aspects of paging.
> >> 
> >> As I suggested before, if it is preferred, this can be a UFFD
> >> IOCTL instead of process_madvise() behavior, thereby lowering
> >> the risk of a misuse.
> > 
> > (Sorry to join so late..)
> > 
> > Yeah, I'm wondering whether that could add one extra layer of security.  But as
> > you mentioned, we already have process_vm_writev(), so it's indeed not a
> > strong reason to reject process_madvise(DONTNEED) too, it seems.
> > 
> > Not sure whether you're aware of the umap project from LLNL:
> > 
> > https://github.com/LLNL/umap
> > 
> > From what I can tell, that's really doing a very similar thing to what you
> > proposed here, but it's just a local version of things.  IOW, in umap the
> > DONTNEED can be done locally with madvise() already, in the umap-maintained
> > threads.  That removes the need to introduce the new process_madvise()
> > interface, and it's definitely safer as it's per-mm and per-task.
> > 
> > I think you mentioned above that the tracee program will need to cooperate in
> > this case; I'm wondering whether some solution like umap would be fine too, as
> > that also requires cooperation of the tracee program. It's just that the
> > cooperation may be slightly more than with your solution, but frankly I think
> > that's still trivial, and before I understand the details of your solution I
> > can't really tell..
> > 
> > E.g. for a program to use umap, I think it needs to replace mmap() with umap()
> > where we want the buffers to be managed by the umap library rather than the
> > kernel; then linking against the umap library should work.  If the remote
> > solution you're proposing requires similar (or even more complicated)
> > cooperation, then it'll be controversial whether that can be done per-mm just
> > like how umap is designed and used.  So IMHO it'll be great to share more
> > details on those parts if umap cannot satisfy the current need - IMHO it
> > satisfies all the features you described for fully customized pageout and
> > page fault handling, it's just done within a single mm.
> 
> Thanks for your feedback, Peter.
> 
> I am familiar with umap, perhaps not enough, but I am aware.
> 
> From my experience, the existing interfaces are not sufficient if you look
> for a high-performance (low-overhead) solution for multiple processes. The
> level of cooperation that I mentioned is something I raised preemptively
> to avoid unnecessary discussion, but I believe those issues can be
> resolved (I have just deferred handling them).
> 
> Specifically for performance, several new kernel features are needed, for
> instance support for io_uring with async operations, a vectored
> UFFDIO_WRITEPROTECT(V) that batches TLB flushes across VMAs, and a
> vectored madvise(). Even if we talk in the context of a single mm, I
> cannot see umap being performant for low-latency devices without those
> facilities.
> 
> Anyhow, I take your feedback, and I will resend the patch for enabling
> MADV_DONTNEED with the other patches once I am done. As for the TLB
> batching itself, I think it has independent value - but I am not going to
> argue about it now if there is pushback against it.

Fair enough.

Yes, my comment was mostly about whether a remote interface is needed or
whether we can still do it locally (frankly, I always wanted to have some
remote interface to manipulate uffd, but still, anything like that should
require some justification for sure).

I totally agree that the rest of your work, on TLB optimization (especially
the two points Andrea mentioned: reducing TLB flushes for huge pmd protection
changes, and PTE promotion) and on vectored interfaces, sounds reasonable,
and they're definitely separate issues compared to this one.

Thanks,