Message ID | 20210926161259.238054-1-namit@vmware.com (mailing list archive) |
---|---|
Headers | show |
Series | mm/madvise: support process_madvise(MADV_DONTNEED) | expand |
On 26.09.21 18:12, Nadav Amit wrote: > From: Nadav Amit <namit@vmware.com> > > The goal of these patches is to add support for > process_madvise(MADV_DONTNEED). Yet, in the process some (arguably) > useful cleanups, a bug fix and performance enhancements are performed. > > The patches try to consolidate the logic across different behaviors, and > to a certain extent overlap/conflict with an outstanding patch that does > something similar [1]. This consolidation however is mostly orthogonal > to the aforementioned one and done in order to clarify what is done in > respect to locks and TLB for each behavior and to batch these operations > more efficiently on process_madvise(). > > process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows > userfaultfd monitors to unmap memory from monitored processes; and (b) > it is more efficient than madvise() since it is vectored and batches TLB > flushes more aggressively. MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable.
> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote: > > On 26.09.21 18:12, Nadav Amit wrote: >> From: Nadav Amit <namit@vmware.com> >> The goal of these patches is to add support for >> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably) >> useful cleanups, a bug fix and performance enhancements are performed. >> The patches try to consolidate the logic across different behaviors, and >> to a certain extent overlap/conflict with an outstanding patch that does >> something similar [1]. This consolidation however is mostly orthogonal >> to the aforementioned one and done in order to clarify what is done in >> respect to locks and TLB for each behavior and to batch these operations >> more efficiently on process_madvise(). >> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows >> userfaultfd monitors to unmap memory from monitored processes; and (b) >> it is more efficient than madvise() since it is vectored and batches TLB >> flushes more aggressively. > > MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable. This is a fair point, which I expected, but did not address properly. I guess an additional capability, such as CAP_SYS_PTRACE needs to be required in this case. Would that ease your mind?
On 27.09.21 12:41, Nadav Amit wrote: > > >> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote: >> >> On 26.09.21 18:12, Nadav Amit wrote: >>> From: Nadav Amit <namit@vmware.com> >>> The goal of these patches is to add support for >>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably) >>> useful cleanups, a bug fix and performance enhancements are performed. >>> The patches try to consolidate the logic across different behaviors, and >>> to a certain extent overlap/conflict with an outstanding patch that does >>> something similar [1]. This consolidation however is mostly orthogonal >>> to the aforementioned one and done in order to clarify what is done in >>> respect to locks and TLB for each behavior and to batch these operations >>> more efficiently on process_madvise(). >>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows >>> userfaultfd monitors to unmap memory from monitored processes; and (b) >>> it is more efficient than madvise() since it is vectored and batches TLB >>> flushes more aggressively. >> >> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable. > > This is a fair point, which I expected, but did not address properly. > > I guess an additional capability, such as CAP_SYS_PTRACE needs to be > required in this case. Would that ease your mind? I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)? Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assume that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) b) the process hasn't touched these pages yet. Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :)
> On Sep 27, 2021, at 3:58 AM, David Hildenbrand <david@redhat.com> wrote: > > On 27.09.21 12:41, Nadav Amit wrote: >>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote: >>> >>> On 26.09.21 18:12, Nadav Amit wrote: >>>> From: Nadav Amit <namit@vmware.com> >>>> The goal of these patches is to add support for >>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably) >>>> useful cleanups, a bug fix and performance enhancements are performed. >>>> The patches try to consolidate the logic across different behaviors, and >>>> to a certain extent overlap/conflict with an outstanding patch that does >>>> something similar [1]. This consolidation however is mostly orthogonal >>>> to the aforementioned one and done in order to clarify what is done in >>>> respect to locks and TLB for each behavior and to batch these operations >>>> more efficiently on process_madvise(). >>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows >>>> userfaultfd monitors to unmap memory from monitored processes; and (b) >>>> it is more efficient than madvise() since it is vectored and batches TLB >>>> flushes more aggressively. >>> >>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable. >> This is a fair point, which I expected, but did not address properly. >> I guess an additional capability, such as CAP_SYS_PTRACE needs to be >> required in this case. Would that ease your mind? > > I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)? > > Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assume that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) b) the process hasn't touched these pages yet. > > Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :) Ok, ok, your criticism is fair. I tried to hold back some details in order to prevent the discussion from digressing. I am going to focus on (a) which is what I really have in mind. The use-case that I explore is a userspace memory manager with some level of cooperation of the monitored processes. The manager is notified on memory regions that it should monitor (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions using the remote-userfaultfd that you saw on the second thread. When it wants to reclaim (anonymous) memory, it: 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored UFFD-WP to do so efficiently, a patch which I did not send yet). 2. Calls process_vm_readv() to read that memory of that process. 3. Write it back to “swap”. 4. Calls process_madvise(MADV_DONTNEED) to zap it. Once the memory is accessed again, the manager uses UFFD-COPY to bring it back. This is really work-in-progress, but eventually performance is not as bad as you would imagine (some patches for efficient use of uffd with iouring are needed for that matter). I am aware that there are some caveats, as zapping the memory does not guarantee that the memory would be freed since it might be pinned for a variety of reasons. That's the reason I mentioned the processes have "some level of cooperation" with the manager. It is not intended to deal with adversaries or uncommon corner cases (e.g., processes that use UFFD for their own reasons). Putting aside my use-case (which I am sure people would be glad to criticize), I can imagine debuggers or emulators may also find use for similar schemes (although I do not have concrete use-cases for them).
On Mon 27-09-21 05:00:11, Nadav Amit wrote: [...] > The manager is notified on memory regions that it should monitor > (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions > using the remote-userfaultfd that you saw on the second thread. When it wants > to reclaim (anonymous) memory, it: > > 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored > UFFD-WP to do so efficiently, a patch which I did not send yet). > 2. Calls process_vm_readv() to read that memory of that process. > 3. Write it back to “swap”. > 4. Calls process_madvise(MADV_DONTNEED) to zap it. Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase? MADV_DONTNEED on a remote process has been proposed in the past several times and it has always been rejected because it is a free ticket to all sorts of hard to debug problems as it is just a free ticket for a remote memory corruption. An additional capability requirement might reduce the risk to some degree but I still do not think this is a good idea.
On 27.09.21 14:00, Nadav Amit wrote: > > >> On Sep 27, 2021, at 3:58 AM, David Hildenbrand <david@redhat.com> wrote: >> >> On 27.09.21 12:41, Nadav Amit wrote: >>>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 26.09.21 18:12, Nadav Amit wrote: >>>>> From: Nadav Amit <namit@vmware.com> >>>>> The goal of these patches is to add support for >>>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably) >>>>> useful cleanups, a bug fix and performance enhancements are performed. >>>>> The patches try to consolidate the logic across different behaviors, and >>>>> to a certain extent overlap/conflict with an outstanding patch that does >>>>> something similar [1]. This consolidation however is mostly orthogonal >>>>> to the aforementioned one and done in order to clarify what is done in >>>>> respect to locks and TLB for each behavior and to batch these operations >>>>> more efficiently on process_madvise(). >>>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows >>>>> userfaultfd monitors to unmap memory from monitored processes; and (b) >>>>> it is more efficient than madvise() since it is vectored and batches TLB >>>>> flushes more aggressively. >>>> >>>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable. >>> This is a fair point, which I expected, but did not address properly. >>> I guess an additional capability, such as CAP_SYS_PTRACE needs to be >>> required in this case. Would that ease your mind? >> >> I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)? >> >> Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assume that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) b) the process hasn't touched these pages yet. >> >> Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :) > > Ok, ok, your criticism is fair. I tried to hold back some details in order to > prevent the discussion from digressing. I am going to focus on (a) which is > what I really have in mind. Thanks for the details! > > The use-case that I explore is a userspace memory manager with some level of > cooperation of the monitored processes. > > The manager is notified on memory regions that it should monitor > (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions > using the remote-userfaultfd that you saw on the second thread. When it wants > to reclaim (anonymous) memory, it: > > 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored > UFFD-WP to do so efficiently, a patch which I did not send yet). > 2. Calls process_vm_readv() to read that memory of that process. > 3. Write it back to “swap”. > 4. Calls process_madvise(MADV_DONTNEED) to zap it. > > Once the memory is accessed again, the manager uses UFFD-COPY to bring it > back. This is really work-in-progress, but eventually performance is not as > bad as you would imagine (some patches for efficient use of uffd with > iouring are needed for that matter). Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am no so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping. > > I am aware that there are some caveats, as zapping the memory does not > guarantee that the memory would be freed since it might be pinned for a > variety of reasons. That's the reason I mentioned the processes have "some > level of cooperation" with the manager. It is not intended to deal with > adversaries or uncommon corner cases (e.g., processes that use UFFD for > their own reasons). It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that? Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics. > > Putting aside my use-case (which I am sure people would be glad to criticize), > I can imagine debuggers or emulators may also find use for similar schemes > (although I do not have concrete use-cases for them). I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :)
> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote: > > On Mon 27-09-21 05:00:11, Nadav Amit wrote: > [...] >> The manager is notified on memory regions that it should monitor >> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions >> using the remote-userfaultfd that you saw on the second thread. When it wants >> to reclaim (anonymous) memory, it: >> >> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored >> UFFD-WP to do so efficiently, a patch which I did not send yet). >> 2. Calls process_vm_readv() to read that memory of that process. >> 3. Write it back to “swap”. >> 4. Calls process_madvise(MADV_DONTNEED) to zap it. > > Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase? Providing hints to the kernel takes you so far to a certain extent. The kernel does not want to (for a good reason) to be completely configurable when it comes to reclaim and prefetch policies. Doing so from userspace allows you to be fully configurable. > MADV_DONTNEED on a remote process has been proposed in the past several > times and it has always been rejected because it is a free ticket to all > sorts of hard to debug problems as it is just a free ticket for a remote > memory corruption. An additional capability requirement might reduce the > risk to some degree but I still do not think this is a good idea. I would argue that there is nothing bad that remote MADV_DONTNEED can do that process_vm_writev() cannot do as well (putting aside ptrace). process_vm_writev() is checking: mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS) Wouldn't adding such a condition suffice?
> On Sep 27, 2021, at 10:05 AM, David Hildenbrand <david@redhat.com> wrote: > > On 27.09.21 14:00, Nadav Amit wrote: >>> On Sep 27, 2021, at 3:58 AM, David Hildenbrand <david@redhat.com> wrote: >>> >>> On 27.09.21 12:41, Nadav Amit wrote: >>>>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> On 26.09.21 18:12, Nadav Amit wrote: >>>>>> From: Nadav Amit <namit@vmware.com> >>>>>> The goal of these patches is to add support for >>>>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably) >>>>>> useful cleanups, a bug fix and performance enhancements are performed. >>>>>> The patches try to consolidate the logic across different behaviors, and >>>>>> to a certain extent overlap/conflict with an outstanding patch that does >>>>>> something similar [1]. This consolidation however is mostly orthogonal >>>>>> to the aforementioned one and done in order to clarify what is done in >>>>>> respect to locks and TLB for each behavior and to batch these operations >>>>>> more efficiently on process_madvise(). >>>>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows >>>>>> userfaultfd monitors to unmap memory from monitored processes; and (b) >>>>>> it is more efficient than madvise() since it is vectored and batches TLB >>>>>> flushes more aggressively. >>>>> >>>>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken . I don't think this is acceptable. >>>> This is a fair point, which I expected, but did not address properly. >>>> I guess an additional capability, such as CAP_SYS_PTRACE needs to be >>>> required in this case. Would that ease your mind? >>> >>> I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)? >>> >>> Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assume that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) b) the process hasn't touched these pages yet. >>> >>> Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :) >> Ok, ok, your criticism is fair. I tried to hold back some details in order to >> prevent the discussion from digressing. I am going to focus on (a) which is >> what I really have in mind. > > Thanks for the details! > >> The use-case that I explore is a userspace memory manager with some level of >> cooperation of the monitored processes. >> The manager is notified on memory regions that it should monitor >> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions >> using the remote-userfaultfd that you saw on the second thread. When it wants >> to reclaim (anonymous) memory, it: >> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored >> UFFD-WP to do so efficiently, a patch which I did not send yet). >> 2. Calls process_vm_readv() to read that memory of that process. >> 3. Write it back to “swap”. >> 4. Calls process_madvise(MADV_DONTNEED) to zap it. >> Once the memory is accessed again, the manager uses UFFD-COPY to bring it >> back. This is really work-in-progress, but eventually performance is not as >> bad as you would imagine (some patches for efficient use of uffd with >> iouring are needed for that matter). > > Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am no so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping. There is some complexity, indeed. Worse, there are some quirks of UFFD that make life hard for no reason and some uffd and iouring bugs. As for my sales pitch - I agree that I am not the best car dealer… :( When I say performance is not bad, I mean that the core operations of page-fault handling, prefetch and reclaim do not induce high overhead *after* the improvements I sent or mentioned. The benefit of doing so from userspace is that you have full control over the reclaim/prefetch policies, so you may be able to make better decisions. Some workloads have predictable access patterns (see for instance "MAGE: Nearly Zero-Cost Virtual Memory for Secure Computation”, OSDI’21). You may be handle such access patterns without requiring intrusive changes to the workload. > >> I am aware that there are some caveats, as zapping the memory does not >> guarantee that the memory would be freed since it might be pinned for a >> variety of reasons. That's the reason I mentioned the processes have "some >> level of cooperation" with the manager. It is not intended to deal with >> adversaries or uncommon corner cases (e.g., processes that use UFFD for >> their own reasons). > > It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that? I know it’s not just long-term pinnings. That’s what “variety of reasons” stood for. ;-) Aging is a tool for certain types of reclamation policies. Some do not require it (e.g., random). You can also have compiler/application-guided reclamation policies. If you are really into “aging”, you may be able to use PEBS or other CPU facilities to track it. Anyhow, the access-bit by itself not such a great solution to track aging. Setting it can induce overheads of >500 cycles from my (and others) experience. > > Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics. Hints, as those that Michal mentioned, prevent the efficient use of userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event when the page is brought back from swap. So using MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom prefetch policy, for instance. It would also require you to live with the kernel reclamation/IO stack for better and worse. As for DAMON, I am not very familiar with it, but from what I remember it seemed to look on a similar direction. IMHO it is more intrusive and less configurable (although it can have the advantage of better integration with various kernel mechanism). I was wondering for a second why you give me such a hard time for a pretty straight-forward extension for process_madvise(), but then I remembered that DAMON got into the kernel after >30 versions, so I’ll shut up about that. ;-) > >> Putting aside my use-case (which I am sure people would be glad to criticize), >> I can imagine debuggers or emulators may also find use for similar schemes >> (although I do not have concrete use-cases for them). > > I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :) I have some more vague ideas, but I am afraid that you will keep saying that it makes more sense to handle such events from within a process. I am not sure that this is true. Even for the emulators that we discuss, the emulated program might run in a different address space (for sandboxing). You may be able to avoid the need for remote-UFFD and get away with the current non-cooperative UFFD, but zapping the memory (for atomic updates) would still require process_madvise(MADV_DONTNEED) [putting aside various ptrace solutions]. Anyhow, David, I really appreciate your feedback. And you make strong points about issues I encounter. Yet, eventually, I think that the main question in this discussion is whether enabling process_madvise(MADV_DONTNEED) is any different - from security point of view - than process_vm_writev(), not to mention ptrace. If not, then the same security guards should suffice, I would argue.
>> >> Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am no so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping. > > There is some complexity, indeed. Worse, there are some quirks of UFFD > that make life hard for no reason and some uffd and iouring bugs. > > As for my sales pitch - I agree that I am not the best car dealer… :( :) > When I say performance is not bad, I mean that the core operations of > page-fault handling, prefetch and reclaim do not induce high overhead > *after* the improvements I sent or mentioned. > > The benefit of doing so from userspace is that you have full control > over the reclaim/prefetch policies, so you may be able to make better > decisions. > > Some workloads have predictable access patterns (see for instance "MAGE: > Nearly Zero-Cost Virtual Memory for Secure Computation”, OSDI’21). You may > be handle such access patterns without requiring intrusive changes to the > workload. Thanks for the pointer. And my question would be if something like DAMON would actually be what you want. > > >> >>> I am aware that there are some caveats, as zapping the memory does not >>> guarantee that the memory would be freed since it might be pinned for a >>> variety of reasons. That's the reason I mentioned the processes have "some >>> level of cooperation" with the manager. It is not intended to deal with >>> adversaries or uncommon corner cases (e.g., processes that use UFFD for >>> their own reasons). >> >> It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that? > > I know it’s not just long-term pinnings. That’s what “variety of reasons” > stood for. ;-) > > Aging is a tool for certain types of reclamation policies. Some do not > require it (e.g., random). You can also have compiler/application-guided > reclamation policies. If you are really into “aging”, you may be able > to use PEBS or other CPU facilities to track it. > > Anyhow, the access-bit by itself not such a great solution to track > aging. Setting it can induce overheads of >500 cycles from my (and > others) experience. Well, I'm certainly no expert on that; I would assume it's relevant in corner cases only: if you're application accesses all it's memory permanently a swap setup is already "broken". If you have plenty of old memory (VMs, databases, ...) it should work reasonably well. But yeah, detecting the working set size is a problematic problem, and "access" bits can be sub-optimal. After all, that's what the Linux kernel has been relying on for a long time ... and IIRC it might be extended by multiple "aging" queues soon. > >> >> Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics. > > Hints, as those that Michal mentioned, prevent the efficient use of > userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event > when the page is brought back from swap. So using > MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom > prefetch policy, for instance. It would also require you to live > with the kernel reclamation/IO stack for better and worse. Would more uffd (or similar) events help? > > As for DAMON, I am not very familiar with it, but from what I remember > it seemed to look on a similar direction. IMHO it is more intrusive > and less configurable (although it can have the advantage of better > integration with various kernel mechanism). I was wondering for a > second why you give me such a hard time for a pretty straight-forward > extension for process_madvise(), but then I remembered that DAMON got > into the kernel after >30 versions, so I’ll shut up about that. ;-) It took ... quite a long time, indeed :) > >> >>> Putting aside my use-case (which I am sure people would be glad to criticize), >>> I can imagine debuggers or emulators may also find use for similar schemes >>> (although I do not have concrete use-cases for them). >> >> I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :) > > I have some more vague ideas, but I am afraid that you will keep > saying that it makes more sense to handle such events from within > a process. I am not sure that this is true. Even for the emulators > that we discuss, the emulated program might run in a different > address space (for sandboxing). You may be able to avoid the need > for remote-UFFD and get away with the current non-cooperative > UFFD, but zapping the memory (for atomic updates) would still > require process_madvise(MADV_DONTNEED) [putting aside various > ptrace solutions]. > > Anyhow, David, I really appreciate your feedback. And you make > strong points about issues I encounter. Yet, eventually, I think > that the main question in this discussion is whether enabling > process_madvise(MADV_DONTNEED) is any different - from security > point of view - than process_vm_writev(), not to mention ptrace. > If not, then the same security guards should suffice, I would > argue. > You raise a very excellent point (and it should have been part of your initial sales pitch): how does it differ to process_vm_writev(). I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples: 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it. 2. pagemap: applications can depend on the populated(present |swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it. Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe. I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging. I'm not the person to ack/nack this, I'm just asking the questions :)
> On Sep 28, 2021, at 1:53 AM, David Hildenbrand <david@redhat.com> wrote: > >>> >>> Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am no so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping. >> There is some complexity, indeed. Worse, there are some quirks of UFFD >> that make life hard for no reason and some uffd and iouring bugs. >> As for my sales pitch - I agree that I am not the best car dealer… :( > > :) > >> When I say performance is not bad, I mean that the core operations of >> page-fault handling, prefetch and reclaim do not induce high overhead >> *after* the improvements I sent or mentioned. >> The benefit of doing so from userspace is that you have full control >> over the reclaim/prefetch policies, so you may be able to make better >> decisions. >> Some workloads have predictable access patterns (see for instance "MAGE: >> Nearly Zero-Cost Virtual Memory for Secure Computation”, OSDI’21). You may >> be handle such access patterns without requiring intrusive changes to the >> workload. > > Thanks for the pointer. > > And my question would be if something like DAMON would actually be what you want. I looked into DAMON and even with the proposed future extensions it sounds as a different approach with certain benefits but with many limitations. The major limitation of DAMON is that you need to predefine the logic you want for reclamation into the kernel. You can add programability through some API or even eBPF, but it would never be as easy or as versatile as what user manager can achieve. We already have pretty much all the facilities to do so from userspace, and the missing parts (at least for basic userspace manager) are almost already there. In contrast, see how many iterations are needed for the basic DAMON implementation. The second, also big, difference is that DAMON looks only on reclamation. If you want a custom prefetch scheme or different I/O stack for backing storage, you cannot have such one. > >>> >>>> I am aware that there are some caveats, as zapping the memory does not >>>> guarantee that the memory would be freed since it might be pinned for a >>>> variety of reasons. That's the reason I mentioned the processes have "some >>>> level of cooperation" with the manager. It is not intended to deal with >>>> adversaries or uncommon corner cases (e.g., processes that use UFFD for >>>> their own reasons). >>> >>> It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that? >> I know it’s not just long-term pinnings. That’s what “variety of reasons” >> stood for. ;-) >> Aging is a tool for certain types of reclamation policies. Some do not >> require it (e.g., random). You can also have compiler/application-guided >> reclamation policies. If you are really into “aging”, you may be able >> to use PEBS or other CPU facilities to track it. >> Anyhow, the access-bit by itself not such a great solution to track >> aging. Setting it can induce overheads of >500 cycles from my (and >> others) experience. > > Well, I'm certainly no expert on that; I would assume it's relevant in corner cases only: if you're application accesses all it's memory permanently a swap setup is already "broken". If you have plenty of old memory (VMs, databases, ...) it should work reasonably well. But yeah, detecting the working set size is a problematic problem, and "access" > bits can be sub-optimal. > > After all, that's what the Linux kernel has been relying on for a long time ... and IIRC it might be extended by multiple "aging" queues soon. I see your point. I do not want to waste time in our discussion by focusing on recency-based reclamation schemes and access-bits. I was just trying to point that such schemes and access-bit have their own drawbacks. > >>> >>> Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics. >> Hints, as those that Michal mentioned, prevent the efficient use of >> userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event >> when the page is brought back from swap. So using >> MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom >> prefetch policy, for instance. It would also require you to live >> with the kernel reclamation/IO stack for better and worse. > > Would more uffd (or similar) events help? > >> As for DAMON, I am not very familiar with it, but from what I remember >> it seemed to look on a similar direction. IMHO it is more intrusive >> and less configurable (although it can have the advantage of better >> integration with various kernel mechanism). I was wondering for a >> second why you give me such a hard time for a pretty straight-forward >> extension for process_madvise(), but then I remembered that DAMON got >> into the kernel after >30 versions, so I’ll shut up about that. ;-) > > It took ... quite a long time, indeed :) > >>> >>>> Putting aside my use-case (which I am sure people would be glad to criticize), >>>> I can imagine debuggers or emulators may also find use for similar schemes >>>> (although I do not have concrete use-cases for them). >>> >>> I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :) >> I have some more vague ideas, but I am afraid that you will keep >> saying that it makes more sense to handle such events from within >> a process. I am not sure that this is true. Even for the emulators >> that we discuss, the emulated program might run in a different >> address space (for sandboxing). You may be able to avoid the need >> for remote-UFFD and get away with the current non-cooperative >> UFFD, but zapping the memory (for atomic updates) would still >> require process_madvise(MADV_DONTNEED) [putting aside various >> ptrace solutions]. >> Anyhow, David, I really appreciate your feedback. And you make >> strong points about issues I encounter. Yet, eventually, I think >> that the main question in this discussion is whether enabling >> process_madvise(MADV_DONTNEED) is any different - from security >> point of view - than process_vm_writev(), not to mention ptrace. >> If not, then the same security guards should suffice, I would >> argue. > > You raise a very excellent point (and it should have been part of your initial sales pitch): how does it differ to process_vm_writev(). > > I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples: > > 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it. > > 2. pagemap: applications can depend on the populated(present |swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it. I understand (1). As for (2): the scenario that you mention sound very specific, and one can argue that ignoring UFFD-registered regions in such a case is either (1) wrong or (2) should trigger some UFFD event. > > Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe. If you do not know what you are doing, you can easily break anything. Note that there are other APIs that can break your application even worse, specifically ptrace(). > I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging. > > I'm not the person to ack/nack this, I'm just asking the questions :) I see your points and I try to look for a path of least resistance. I thought that process_madvise() is a nice interface to hook into. But if you are concerned it will be misused, how about adding instead an IOCTL that will zap pages but only in UFFD-registered regions? A separate IOCTL for this matter have an advantage of being more tailored for UFFD, not to notify UFFD upon “remove” and to be less likely to be misused.
On Mon 27-09-21 12:12:46, Nadav Amit wrote: > > > On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 27-09-21 05:00:11, Nadav Amit wrote: > > [...] > >> The manager is notified on memory regions that it should monitor > >> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions > >> using the remote-userfaultfd that you saw on the second thread. When it wants > >> to reclaim (anonymous) memory, it: > >> > >> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored > >> UFFD-WP to do so efficiently, a patch which I did not send yet). > >> 2. Calls process_vm_readv() to read that memory of that process. > >> 3. Write it back to “swap”. > >> 4. Calls process_madvise(MADV_DONTNEED) to zap it. > > > > Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase? > > Providing hints to the kernel takes you so far to a certain extent. > The kernel does not want to (for a good reason) to be completely > configurable when it comes to reclaim and prefetch policies. Doing > so from userspace allows you to be fully configurable. I am sorry but I do not follow. Your scenario is describing a user space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been designed for. What are you missing in the existing functionality? > > MADV_DONTNEED on a remote process has been proposed in the past several > > times and it has always been rejected because it is a free ticket to all > > sorts of hard to debug problems as it is just a free ticket for a remote > > memory corruption. An additional capability requirement might reduce the > > risk to some degree but I still do not think this is a good idea. > > I would argue that there is nothing bad that remote MADV_DONTNEED can do > that process_vm_writev() cannot do as well (putting aside ptrace). I am not arguing this would be the first syscall to allow tricky and hard to debug corruptions if used without care. > process_vm_writev() is checking: > > mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS) > > Wouldn't adding such a condition suffice? This would be a minimum requirement. Another one is a sensible usecase that is not covered by an existing functionality.
> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote: > > On Mon 27-09-21 12:12:46, Nadav Amit wrote: >> >>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote: >>> >>> On Mon 27-09-21 05:00:11, Nadav Amit wrote: >>> [...] >>>> The manager is notified on memory regions that it should monitor >>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions >>>> using the remote-userfaultfd that you saw on the second thread. When it wants >>>> to reclaim (anonymous) memory, it: >>>> >>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored >>>> UFFD-WP to do so efficiently, a patch which I did not send yet). >>>> 2. Calls process_vm_readv() to read that memory of that process. >>>> 3. Write it back to “swap”. >>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it. >>> >>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase? >> >> Providing hints to the kernel takes you so far to a certain extent. >> The kernel does not want to (for a good reason) to be completely >> configurable when it comes to reclaim and prefetch policies. Doing >> so from userspace allows you to be fully configurable. > > I am sorry but I do not follow. Your scenario is describing a user > space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been > designed for. What are you missing in the existing functionality? Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control many aspects of paging out memory: 1. Writeback: writeback ahead of time, dynamic clustering, etc. 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job on non-contiguous memory). 3. No guarantee the page is actually reclaimed (e.g., writeback) and the time it takes place. 4. I/O stack for swapping - you must use kernel I/O stack (FUSE as non-performant as it is cannot be used for swap AFAIK). 5. Other operations (e.g., locking, working set tracking) that might not be necessary or interfere. In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use of userfaultfd to trap page-faults and react accordingly, so you are also prevented from: 6. Having your own custom prefetching policy in response to #PF. There are additional use-cases I can try to formalize in which MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference is pretty clear, I think: one is a hint that only applied to page reclamation. The other enables the direct control of userspace over (almost) all aspects of paging. As I suggested before, if it is preferred, this can be a UFFD IOCTL instead of process_madvise() behavior, thereby lowering the risk of a misuse. I would emphasize that this feature (i.e., process_madvise(MADV_DONTNEED) or a similar new UFFD feature) has little to no effect on the kernel robustness, complexity, security or API changes. So the impact on the kernel is negligible.
>> >> Thanks for the pointer. >> >> And my question would be if something like DAMON would actually be what you want. > > I looked into DAMON and even with the proposed future extensions it sounds > as a different approach with certain benefits but with many limitations. > > The major limitation of DAMON is that you need to predefine the logic you > want for reclamation into the kernel. You can add programability through > some API or even eBPF, but it would never be as easy or as versatile as > what user manager can achieve. We already have pretty much all the > facilities to do so from userspace, and the missing parts (at least for > basic userspace manager) are almost already there. In contrast, see how > many iterations are needed for the basic DAMON implementation. I can see what you're saying when looking at optimizing a hand full of special applications. I yet fail to see how something like that could work as a full replacement for in kernel swapping. I'm happy to learn. > > The second, also big, difference is that DAMON looks only on reclamation. > If you want a custom prefetch scheme or different I/O stack for backing > storage, you cannot have such one. I do wonder if it could be extended for prefetching. But I am absolutely not a DAMON expert. [...] >> >> You raise a very excellent point (and it should have been part of your initial sales pitch): how does it differ to process_vm_writev(). >> >> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples: >> >> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it. >> >> 2. pagemap: applications can depend on the populated(present |swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it. > > I understand (1). As for (2): the scenario that you mention sound > very specific, and one can argue that ignoring UFFD-registered > regions in such a case is either (1) wrong or (2) should trigger > some UFFD event. > >> >> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe. > > If you do not know what you are doing, you can easily break anything. > Note that there are other APIs that can break your application even > worse, specifically ptrace(). > >> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging. >> >> I'm not the person to ack/nack this, I'm just asking the questions :) > > I see your points and I try to look for a path of least resistance. > I thought that process_madvise() is a nice interface to hook into. It would be the right interface -- iff the operation wouldn't have a bad smell to it. We don't really want applications to mess around in the page table layout of some other process: however, that is exactly what you require. By unlocking that interface for that use case we agree that what you are proposing is a "sane use case", but ... > > But if you are concerned it will be misused, how about adding instead > an IOCTL that will zap pages but only in UFFD-registered regions? > A separate IOCTL for this matter have an advantage of being more > tailored for UFFD, not to notify UFFD upon “remove” and to be less > likely to be misused. ... that won't change the fact that with your user-space swapping approach that requires this interface we can break some applications silently, and that's really the major concern I have. I mean, there are more cases where you can just harm the target application I think, for example if the target application uses SOFTDIRTY tracking. To judge if this is a sane use case we want to support, it would help a lot if there would be actual code+evaluation when actually implementing some of these advanced policies. Because you raise a lot of interesting points in your reply to Michal to back your use case, and naive me thinks "this sounds interesting but ... aren't we losing a lot of flexibility+features when doing this in user space? Does anyone actually want to do it like that?". Again, I'm not the person to ack/nack this, I'm just questioning if the use case that requires this interface is actually something that will get used later in real life because it has real advantages, or if it's a pure research project that will get abandoned at some point and we ended up exposing an interface we really didn't want to expose so far (especially, because all other requests so far were bogus).
> On Oct 4, 2021, at 10:58 AM, David Hildenbrand <david@redhat.com> wrote: > >>> >>> Thanks for the pointer. >>> >>> And my question would be if something like DAMON would actually be what you want. >> I looked into DAMON and even with the proposed future extensions it sounds >> as a different approach with certain benefits but with many limitations. >> The major limitation of DAMON is that you need to predefine the logic you >> want for reclamation into the kernel. You can add programability through >> some API or even eBPF, but it would never be as easy or as versatile as >> what user manager can achieve. We already have pretty much all the >> facilities to do so from userspace, and the missing parts (at least for >> basic userspace manager) are almost already there. In contrast, see how >> many iterations are needed for the basic DAMON implementation. > > I can see what you're saying when looking at optimizing a hand full of special applications. I yet fail to see how something like that could work as a full replacement for in kernel swapping. I'm happy to learn. I am not arguing it is a full replacement, at least at this stage. > >> The second, also big, difference is that DAMON looks only on reclamation. >> If you want a custom prefetch scheme or different I/O stack for backing >> storage, you cannot have such one. > > I do wonder if it could be extended for prefetching. But I am absolutely not a DAMON expert. > > [...] These are 2 different approaches. One, is to provide some logic for the kernel (DAMON). The other is to provide userspace full control over paging operations (with caveats). Obviously, due to the caveats, the kernel paging mechanism behaves as a backup. > >>> >>> You raise a very excellent point (and it should have been part of your initial sales pitch): how does it differ to process_vm_writev(). >>> >>> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples: >>> >>> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it. >>> >>> 2. pagemap: applications can depend on the populated(present |swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it. >> I understand (1). As for (2): the scenario that you mention sound >> very specific, and one can argue that ignoring UFFD-registered >> regions in such a case is either (1) wrong or (2) should trigger >> some UFFD event. >>> >>> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe. >> If you do not know what you are doing, you can easily break anything. >> Note that there are other APIs that can break your application even >> worse, specifically ptrace(). >>> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging. >>> >>> I'm not the person to ack/nack this, I'm just asking the questions :) >> I see your points and I try to look for a path of least resistance. >> I thought that process_madvise() is a nice interface to hook into. > > It would be the right interface -- iff the operation wouldn't have a bad smell to it. We don't really want applications to mess around in the page table layout of some other process: however, that is exactly what you require. By unlocking that interface for that use case we agree that what you are proposing is a "sane use case", but ... > >> But if you are concerned it will be misused, how about adding instead >> an IOCTL that will zap pages but only in UFFD-registered regions? >> A separate IOCTL for this matter have an advantage of being more >> tailored for UFFD, not to notify UFFD upon “remove” and to be less >> likely to be misused. > > ... that won't change the fact that with your user-space swapping approach that requires this interface we can break some applications silently, and that's really the major concern I have. > > I mean, there are more cases where you can just harm the target application I think, for example if the target application uses SOFTDIRTY tracking. > > > To judge if this is a sane use case we want to support, it would help a lot if there would be actual code+evaluation when actually implementing some of these advanced policies. Because you raise a lot of interesting points in your reply to Michal to back your use case, and naive me thinks "this sounds interesting but ... aren't we losing a lot of flexibility+features when doing this in user space? Does anyone actually want to do it like that?". > > Again, I'm not the person to ack/nack this, I'm just questioning if the use case that requires this interface is actually something that will get used later in real life because it has real advantages, or if it's a pure research project that will get abandoned at some point and we ended up exposing an interface we really didn't want to expose so far (especially, because all other requests so far were bogus). I do want to release the code, but it is really incomplete/immature at this point. I would not that there additional use cases, such as workloads that have discardable cache (or memoization data), which want a central/another entity to discard the data when there is memory pressure. (You can think about it as a userspace shrinker). Anyhow, as a path of least resistance, I think I would do the following: 1. Wait for the other madvise related patches to be applied. 2. Simplify the patches, specifically removing the data structure changes based on Kirill feedback. 3. Defer the enablement of the MADV_DONTNEED until I can show code/performance numbers. Regards, Nadav
On 07.10.21 18:19, Nadav Amit wrote: > > >> On Oct 4, 2021, at 10:58 AM, David Hildenbrand <david@redhat.com> wrote: >> >>>> >>>> Thanks for the pointer. >>>> >>>> And my question would be if something like DAMON would actually be what you want. >>> I looked into DAMON and even with the proposed future extensions it sounds >>> as a different approach with certain benefits but with many limitations. >>> The major limitation of DAMON is that you need to predefine the logic you >>> want for reclamation into the kernel. You can add programability through >>> some API or even eBPF, but it would never be as easy or as versatile as >>> what user manager can achieve. We already have pretty much all the >>> facilities to do so from userspace, and the missing parts (at least for >>> basic userspace manager) are almost already there. In contrast, see how >>> many iterations are needed for the basic DAMON implementation. >> >> I can see what you're saying when looking at optimizing a hand full of special applications. I yet fail to see how something like that could work as a full replacement for in kernel swapping. I'm happy to learn. > > I am not arguing it is a full replacement, at least at this stage. > >> >>> The second, also big, difference is that DAMON looks only on reclamation. >>> If you want a custom prefetch scheme or different I/O stack for backing >>> storage, you cannot have such one. >> >> I do wonder if it could be extended for prefetching. But I am absolutely not a DAMON expert. >> >> [...] > > These are 2 different approaches. One, is to provide some logic > for the kernel (DAMON). The other is to provide userspace full > control over paging operations (with caveats). Obviously, due to > the caveats, the kernel paging mechanism behaves as a backup. > >> >>>> >>>> You raise a very excellent point (and it should have been part of your initial sales pitch): how does it differ to process_vm_writev(). >>>> >>>> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples: >>>> >>>> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it. >>>> >>>> 2. pagemap: applications can depend on the populated(present |swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it. >>> I understand (1). As for (2): the scenario that you mention sound >>> very specific, and one can argue that ignoring UFFD-registered >>> regions in such a case is either (1) wrong or (2) should trigger >>> some UFFD event. >>>> >>>> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe. >>> If you do not know what you are doing, you can easily break anything. >>> Note that there are other APIs that can break your application even >>> worse, specifically ptrace(). >>>> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging. >>>> >>>> I'm not the person to ack/nack this, I'm just asking the questions :) >>> I see your points and I try to look for a path of least resistance. >>> I thought that process_madvise() is a nice interface to hook into. >> >> It would be the right interface -- iff the operation wouldn't have a bad smell to it. We don't really want applications to mess around in the page table layout of some other process: however, that is exactly what you require. By unlocking that interface for that use case we agree that what you are proposing is a "sane use case", but ... >> >>> But if you are concerned it will be misused, how about adding instead >>> an IOCTL that will zap pages but only in UFFD-registered regions? >>> A separate IOCTL for this matter have an advantage of being more >>> tailored for UFFD, not to notify UFFD upon “remove” and to be less >>> likely to be misused. >> >> ... that won't change the fact that with your user-space swapping approach that requires this interface we can break some applications silently, and that's really the major concern I have. >> >> I mean, there are more cases where you can just harm the target application I think, for example if the target application uses SOFTDIRTY tracking. >> >> >> To judge if this is a sane use case we want to support, it would help a lot if there would be actual code+evaluation when actually implementing some of these advanced policies. Because you raise a lot of interesting points in your reply to Michal to back your use case, and naive me thinks "this sounds interesting but ... aren't we losing a lot of flexibility+features when doing this in user space? Does anyone actually want to do it like that?". >> >> Again, I'm not the person to ack/nack this, I'm just questioning if the use case that requires this interface is actually something that will get used later in real life because it has real advantages, or if it's a pure research project that will get abandoned at some point and we ended up exposing an interface we really didn't want to expose so far (especially, because all other requests so far were bogus). > > I do want to release the code, but it is really > incomplete/immature at this point. I would not that there additional > use cases, such as workloads that have discardable cache (or memoization > data), which want a central/another entity to discard the data when > there is memory pressure. (You can think about it as a userspace > shrinker). > > Anyhow, as a path of least resistance, I think I would do the > following: > > 1. Wait for the other madvise related patches to be applied. > 2. Simplify the patches, specifically removing the data structure > changes based on Kirill feedback. > 3. Defer the enablement of the MADV_DONTNEED until I can show > code/performance numbers. Sounds excellent, for your project to make progress at this stage I assume this stuff doesn't have to be upstream, but it's good to discuss upstream-ability. Happy to learn more once you have more details to share.
On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote: > > > > On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 27-09-21 12:12:46, Nadav Amit wrote: > >> > >>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote: > >>> > >>> On Mon 27-09-21 05:00:11, Nadav Amit wrote: > >>> [...] > >>>> The manager is notified on memory regions that it should monitor > >>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions > >>>> using the remote-userfaultfd that you saw on the second thread. When it wants > >>>> to reclaim (anonymous) memory, it: > >>>> > >>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored > >>>> UFFD-WP to do so efficiently, a patch which I did not send yet). > >>>> 2. Calls process_vm_readv() to read that memory of that process. > >>>> 3. Write it back to “swap”. > >>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it. > >>> > >>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase? > >> > >> Providing hints to the kernel takes you so far to a certain extent. > >> The kernel does not want to (for a good reason) to be completely > >> configurable when it comes to reclaim and prefetch policies. Doing > >> so from userspace allows you to be fully configurable. > > > > I am sorry but I do not follow. Your scenario is describing a user > > space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been > > designed for. What are you missing in the existing functionality? > > Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control > many aspects of paging out memory: > > 1. Writeback: writeback ahead of time, dynamic clustering, etc. > 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job > on non-contiguous memory). > 3. No guarantee the page is actually reclaimed (e.g., writeback) > and the time it takes place. > 4. I/O stack for swapping - you must use kernel I/O stack (FUSE > as non-performant as it is cannot be used for swap AFAIK). > 5. Other operations (e.g., locking, working set tracking) that > might not be necessary or interfere. > > In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use > of userfaultfd to trap page-faults and react accordingly, so you > are also prevented from: > > 6. Having your own custom prefetching policy in response to #PF. > > There are additional use-cases I can try to formalize in which > MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference > is pretty clear, I think: one is a hint that only applied to > page reclamation. The other enables the direct control of > userspace over (almost) all aspects of paging. > > As I suggested before, if it is preferred, this can be a UFFD > IOCTL instead of process_madvise() behavior, thereby lowering > the risk of a misuse. (Sorry to join so late..) Yeah I'm wondering whether that could add one extra layer of security. But as you mentioned, we've already have process_vm_writev(), then it's indeed not strong reason to reject process_madvise(DONTNEED) too, it seems. Not sure whether you're aware of the umap project from LLNL: https://github.com/LLNL/umap From what I can tell, that's really doing very similar thing as what you proposed here, but it's just a local version of things. IOW in umap the DONTNEED can be done locally with madvise() already in the umap maintained threads. That close the need to introduce the new process_madvise() interface and it's definitely safer as it's per-mm and per-task. I think you mentioned above that the tracee program will need to cooperate in this case, I'm wondering whether some solution like umap would be fine too as that also requires cooperation of the tracee program, it's just that the cooperation may be slightly more than your solution but frankly I think that's still trivial and before I understand the details of your solution I can't really tell.. E.g. for a program to use umap, I think it needs to replace mmap() to umap() where we want the buffers to be managed by umap library rather than the kernel, then link against the umap library should work. If the remote solution you're proposing requires similar (or even more complicated) cooperation, then it'll be controversial whether that can be done per-mm just like how umap designed and used. So IMHO it'll be great to share more details on those parts if umap cannot satisfy the current need - IMHO it satisfies all the features you described on fully customized pageout and page faulting in, it's just done in a single mm. Thanks,
> On Oct 12, 2021, at 4:14 PM, Peter Xu <peterx@redhat.com> wrote: > > On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote: >> >> >>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote: >>> >>> On Mon 27-09-21 12:12:46, Nadav Amit wrote: >>>> >>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote: >>>>> >>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote: >>>>> [...] >>>>>> The manager is notified on memory regions that it should monitor >>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions >>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants >>>>>> to reclaim (anonymous) memory, it: >>>>>> >>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored >>>>>> UFFD-WP to do so efficiently, a patch which I did not send yet). >>>>>> 2. Calls process_vm_readv() to read that memory of that process. >>>>>> 3. Write it back to “swap”. >>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it. >>>>> >>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase? >>>> >>>> Providing hints to the kernel takes you so far to a certain extent. >>>> The kernel does not want to (for a good reason) to be completely >>>> configurable when it comes to reclaim and prefetch policies. Doing >>>> so from userspace allows you to be fully configurable. >>> >>> I am sorry but I do not follow. Your scenario is describing a user >>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been >>> designed for. What are you missing in the existing functionality? >> >> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control >> many aspects of paging out memory: >> >> 1. Writeback: writeback ahead of time, dynamic clustering, etc. >> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job >> on non-contiguous memory). >> 3. No guarantee the page is actually reclaimed (e.g., writeback) >> and the time it takes place. >> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE >> as non-performant as it is cannot be used for swap AFAIK). >> 5. Other operations (e.g., locking, working set tracking) that >> might not be necessary or interfere. >> >> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use >> of userfaultfd to trap page-faults and react accordingly, so you >> are also prevented from: >> >> 6. Having your own custom prefetching policy in response to #PF. >> >> There are additional use-cases I can try to formalize in which >> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference >> is pretty clear, I think: one is a hint that only applied to >> page reclamation. The other enables the direct control of >> userspace over (almost) all aspects of paging. >> >> As I suggested before, if it is preferred, this can be a UFFD >> IOCTL instead of process_madvise() behavior, thereby lowering >> the risk of a misuse. > > (Sorry to join so late..) > > Yeah I'm wondering whether that could add one extra layer of security. But as > you mentioned, we've already have process_vm_writev(), then it's indeed not > strong reason to reject process_madvise(DONTNEED) too, it seems. > > Not sure whether you're aware of the umap project from LLNL: > > https://github.com/LLNL/umap > > From what I can tell, that's really doing very similar thing as what you > proposed here, but it's just a local version of things. IOW in umap the > DONTNEED can be done locally with madvise() already in the umap maintained > threads. That close the need to introduce the new process_madvise() interface > and it's definitely safer as it's per-mm and per-task. > > I think you mentioned above that the tracee program will need to cooperate in > this case, I'm wondering whether some solution like umap would be fine too as > that also requires cooperation of the tracee program, it's just that the > cooperation may be slightly more than your solution but frankly I think that's > still trivial and before I understand the details of your solution I can't > really tell.. > > E.g. for a program to use umap, I think it needs to replace mmap() to umap() > where we want the buffers to be managed by umap library rather than the kernel, > then link against the umap library should work. If the remote solution you're > proposing requires similar (or even more complicated) cooperation, then it'll > be controversial whether that can be done per-mm just like how umap designed > and used. So IMHO it'll be great to share more details on those parts if umap > cannot satisfy the current need - IMHO it satisfies all the features you > described on fully customized pageout and page faulting in, it's just done in a > single mm. Thanks for you feedback, Peter. I am familiar with umap, perhaps not enough, but I am aware. From my experience, the existing interfaces are not sufficient if you look for high performance (low overhead) solution for multiple processes. The level of cooperation that I mentioned is something that I mentioned preemptively to avoid unnecessary discussion, but I believe they can be resolved (I have just deferred handling them). Specifically for performance, several new kernel features are needed, for instance, support for iouring with async operations, a vectored UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs and a vectored madvise(). Even if we talk on the context of a single mm, I cannot see umap being performant for low latency devices without those facilities. Anyhow, I take your feedback and I will resend the patch for enabling MADV_DONTNEED with other patches once I am done. As for the TLB batching itself, I think it has an independent value - but I am not going to argue about it now if there is a pushback against it.
On Wed, Oct 13, 2021 at 08:47:11AM -0700, Nadav Amit wrote: > > > > On Oct 12, 2021, at 4:14 PM, Peter Xu <peterx@redhat.com> wrote: > > > > On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote: > >> > >> > >>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@suse.com> wrote: > >>> > >>> On Mon 27-09-21 12:12:46, Nadav Amit wrote: > >>>> > >>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@suse.com> wrote: > >>>>> > >>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote: > >>>>> [...] > >>>>>> The manager is notified on memory regions that it should monitor > >>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions > >>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants > >>>>>> to reclaim (anonymous) memory, it: > >>>>>> > >>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored > >>>>>> UFFD-WP to do so efficiently, a patch which I did not send yet). > >>>>>> 2. Calls process_vm_readv() to read that memory of that process. > >>>>>> 3. Write it back to “swap”. > >>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it. > >>>>> > >>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase? > >>>> > >>>> Providing hints to the kernel takes you so far to a certain extent. > >>>> The kernel does not want to (for a good reason) to be completely > >>>> configurable when it comes to reclaim and prefetch policies. Doing > >>>> so from userspace allows you to be fully configurable. > >>> > >>> I am sorry but I do not follow. Your scenario is describing a user > >>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been > >>> designed for. What are you missing in the existing functionality? > >> > >> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control > >> many aspects of paging out memory: > >> > >> 1. Writeback: writeback ahead of time, dynamic clustering, etc. > >> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job > >> on non-contiguous memory). > >> 3. No guarantee the page is actually reclaimed (e.g., writeback) > >> and the time it takes place. > >> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE > >> as non-performant as it is cannot be used for swap AFAIK). > >> 5. Other operations (e.g., locking, working set tracking) that > >> might not be necessary or interfere. > >> > >> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use > >> of userfaultfd to trap page-faults and react accordingly, so you > >> are also prevented from: > >> > >> 6. Having your own custom prefetching policy in response to #PF. > >> > >> There are additional use-cases I can try to formalize in which > >> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference > >> is pretty clear, I think: one is a hint that only applied to > >> page reclamation. The other enables the direct control of > >> userspace over (almost) all aspects of paging. > >> > >> As I suggested before, if it is preferred, this can be a UFFD > >> IOCTL instead of process_madvise() behavior, thereby lowering > >> the risk of a misuse. > > > > (Sorry to join so late..) > > > > Yeah I'm wondering whether that could add one extra layer of security. But as > > you mentioned, we've already have process_vm_writev(), then it's indeed not > > strong reason to reject process_madvise(DONTNEED) too, it seems. > > > > Not sure whether you're aware of the umap project from LLNL: > > > > https://github.com/LLNL/umap > > > > From what I can tell, that's really doing very similar thing as what you > > proposed here, but it's just a local version of things. IOW in umap the > > DONTNEED can be done locally with madvise() already in the umap maintained > > threads. That close the need to introduce the new process_madvise() interface > > and it's definitely safer as it's per-mm and per-task. > > > > I think you mentioned above that the tracee program will need to cooperate in > > this case, I'm wondering whether some solution like umap would be fine too as > > that also requires cooperation of the tracee program, it's just that the > > cooperation may be slightly more than your solution but frankly I think that's > > still trivial and before I understand the details of your solution I can't > > really tell.. > > > > E.g. for a program to use umap, I think it needs to replace mmap() to umap() > > where we want the buffers to be managed by umap library rather than the kernel, > > then link against the umap library should work. If the remote solution you're > > proposing requires similar (or even more complicated) cooperation, then it'll > > be controversial whether that can be done per-mm just like how umap designed > > and used. So IMHO it'll be great to share more details on those parts if umap > > cannot satisfy the current need - IMHO it satisfies all the features you > > described on fully customized pageout and page faulting in, it's just done in a > > single mm. > > Thanks for you feedback, Peter. > > I am familiar with umap, perhaps not enough, but I am aware. > > From my experience, the existing interfaces are not sufficient if you look > for high performance (low overhead) solution for multiple processes. The > level of cooperation that I mentioned is something that I mentioned > preemptively to avoid unnecessary discussion, but I believe they can be > resolved (I have just deferred handling them). > > Specifically for performance, several new kernel features are needed, for > instance, support for iouring with async operations, a vectored > UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs and a > vectored madvise(). Even if we talk on the context of a single mm, I > cannot see umap being performant for low latency devices without those > facilities. > > Anyhow, I take your feedback and I will resend the patch for enabling > MADV_DONTNEED with other patches once I am done. As for the TLB batching > itself, I think it has an independent value - but I am not going to > argue about it now if there is a pushback against it. Fair enough. Yes my comment was mostly about whether a remote interface is needed or can we still do it locally (frankly I always wanted to have some remote interface to manipulate uffd, but still anything like that should require some justifications for sure). I totally agree your rest works on either optimizing tlb (especially on the two points Andrea mentioned for either reducing tlb for huge pmd change protection, or pte promotions) and vectored interfaces sound reasonable, and they're definitely separate issues comparing to this one. Thanks,
From: Nadav Amit <namit@vmware.com> The goal of these patches is to add support for process_madvise(MADV_DONTNEED). Yet, in the process some (arguably) useful cleanups, a bug fix and performance enhancements are performed. The patches try to consolidate the logic across different behaviors, and to a certain extent overlap/conflict with an outstanding patch that does something similar [1]. This consolidation however is mostly orthogonal to the aforementioned one and done in order to clarify what is done in respect to locks and TLB for each behavior and to batch these operations more efficiently on process_madvise(). process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows userfaultfd monitors to unmap memory from monitored processes; and (b) it is more efficient than madvise() since it is vectored and batches TLB flushes more aggressively. The first three patches should not interfere with the outstanding patch [1]. The rest might need to be rebased after [1] is applied. [1] https://lore.kernel.org/linux-mm/CAJuCfpFDBJ_W1y2tqAT4BGtPbWrjjDud_JuKO8ZbnjYfeVNvRg@mail.gmail.com/ Cc: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Colin Cross <ccross@google.com> Cc: Suren Baghdasarya <surenb@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Nadav Amit (8): mm/madvise: propagate vma->vm_end changes mm/madvise: remove unnecessary check on madvise_dontneed_free() mm/madvise: remove unnecessary checks on madvise_free_single_vma() mm/madvise: define madvise behavior in a struct mm/madvise: perform certain operations once on process_madvise() mm/madvise: more aggressive TLB batching mm/madvise: deduplicate code in madvise_dontneed_free() mm/madvise: process_madvise(MADV_DONTNEED) mm/internal.h | 5 + mm/madvise.c | 396 ++++++++++++++++++++++++++++++-------------------- mm/memory.c | 2 +- 3 files changed, 248 insertions(+), 155 deletions(-)