Message ID | 20210715201651.212134-1-peterx@redhat.com (mailing list archive) |
---|---|
State | New |
Series | userfaultfd-wp: Support shmem and hugetlbfs |
Hello Peter,

> On 15 Jul 2021, at 21:16, Peter Xu <peterx@redhat.com> wrote:
>
> This requires the pagemap code to be able to recognize the newly introduced
> swap special pte for uffd-wp, meanwhile the general case for hugetlb that we
> recently start to support. It should make pagemap uffd-wp support complete.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> fs/proc/task_mmu.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 9c5af77b5290..988e29fa1f00 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1389,6 +1389,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>                 flags |= PM_SWAP;
>                 if (is_pfn_swap_entry(entry))
>                         page = pfn_swap_entry_to_page(entry);
> +       } else if (pte_swp_uffd_wp_special(pte)) {
> +               flags |= PM_UFFD_WP;
>         }

^ Would it not be important to also add PM_SWAP to flags?

Kind regards,
Tibi

On Mon, Jul 19, 2021 at 09:53:36AM +0000, Tiberiu Georgescu wrote:
>
> Hello Peter,

Hi, Tiberiu,

>
> > On 15 Jul 2021, at 21:16, Peter Xu <peterx@redhat.com> wrote:
> >
> > This requires the pagemap code to be able to recognize the newly introduced
> > swap special pte for uffd-wp, meanwhile the general case for hugetlb that we
> > recently start to support. It should make pagemap uffd-wp support complete.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > fs/proc/task_mmu.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 9c5af77b5290..988e29fa1f00 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1389,6 +1389,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> >                 flags |= PM_SWAP;
> >                 if (is_pfn_swap_entry(entry))
> >                         page = pfn_swap_entry_to_page(entry);
> > +       } else if (pte_swp_uffd_wp_special(pte)) {
> > +               flags |= PM_UFFD_WP;
> >         }
>
> ^ Would it not be important to also add PM_SWAP to flags?

Hmm, I'm not sure; it's the same as a none pte in this case, so imho we still
can't tell if it's swapped out or simply the pte got zapped but page cache will
still hit (even if being swapped out may be the most possible case).

What we're clear is we know it's uffd wr-protected, so maybe setting PM_UFFD_WP
is still the simplest?

Thanks,

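For readers following the flags discussed above, here is a minimal userspace sketch of how a 64-bit pagemap entry could be decoded. The bit positions are the ones listed in the kernel's pagemap documentation; the helper names `pm_has()` and `pm_wp_without_location()` are hypothetical and are not part of the patch. The second helper matches the case described in the reply: an entry that reports uffd-wp but is neither present nor marked as swapped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as documented in Documentation/admin-guide/mm/pagemap.rst. */
#define PM_SOFT_DIRTY (1ULL << 55)
#define PM_UFFD_WP    (1ULL << 57)
#define PM_SWAP       (1ULL << 62)
#define PM_PRESENT    (1ULL << 63)

/* Hypothetical helper: test one flag in a raw pagemap entry. */
bool pm_has(uint64_t entry, uint64_t flag)
{
    return (entry & flag) != 0;
}

/*
 * Hypothetical helper: true when the entry is write-protected by uffd but
 * carries no location information (neither present nor swapped). This is
 * the shape of entry a special uffd-wp pte would produce with the patch
 * as posted, since only PM_UFFD_WP is set for it.
 */
bool pm_wp_without_location(uint64_t entry)
{
    return pm_has(entry, PM_UFFD_WP) &&
           !pm_has(entry, PM_PRESENT) &&
           !pm_has(entry, PM_SWAP);
}
```

A monitoring tool would read each entry from `/proc/<pid>/pagemap` at offset `(vaddr / page_size) * 8` and pass it to helpers like these.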
> On 19 Jul 2021, at 17:03, Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Jul 19, 2021 at 09:53:36AM +0000, Tiberiu Georgescu wrote:
>>
>> Hello Peter,
>
> Hi, Tiberiu,
>
>>
>>> On 15 Jul 2021, at 21:16, Peter Xu <peterx@redhat.com> wrote:
>>>
>>> This requires the pagemap code to be able to recognize the newly introduced
>>> swap special pte for uffd-wp, meanwhile the general case for hugetlb that we
>>> recently start to support. It should make pagemap uffd-wp support complete.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> fs/proc/task_mmu.c | 7 +++++++
>>> 1 file changed, 7 insertions(+)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index 9c5af77b5290..988e29fa1f00 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -1389,6 +1389,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>>                 flags |= PM_SWAP;
>>>                 if (is_pfn_swap_entry(entry))
>>>                         page = pfn_swap_entry_to_page(entry);
>>> +       } else if (pte_swp_uffd_wp_special(pte)) {
>>> +               flags |= PM_UFFD_WP;
>>>         }
>>
>> ^ Would it not be important to also add PM_SWAP to flags?
>
> Hmm, I'm not sure; it's the same as a none pte in this case, so imho we still
> can't tell if it's swapped out or simply the pte got zapped but page cache will
> still hit (even if being swapped out may be the most possible case).

Yeah, that's true. Come to think of it, we also can't tell none pte from
swapped out shmem pages (all bits are cleared out).

>
> What we're clear is we know it's uffd wr-protected, so maybe setting PM_UFFD_WP
> is still the simplest?

That's right, but if we were to require any of the differentiations above, how
does keeping another bit on the special pte sound to you? One to signal the
location on swap or otherwise (none or zapped).

Is there any other clearer way to do it? We wouldn't want to overload the
special pte unnecessarily.

Thanks,
--
Tibi

On Mon, Jul 19, 2021 at 05:23:14PM +0000, Tiberiu Georgescu wrote:
> > What we're clear is we know it's uffd wr-protected, so maybe setting PM_UFFD_WP
> > is still the simplest?
>
> That's right, but if we were to require any of the differentiations above, how
> does keeping another bit on the special pte sound to you? One to signal the
> location on swap or otherwise (none or zapped).

I don't know how to do it even with an extra bit in the pte. The thing is we
need some mechanism to trigger the tweak of that bit in the pte when switching
from "present" to "swapped out", while I don't see how that could be done.

Consider when page reclaim happens, we'll unmap and zap the ptes first before
swapping the pages out, then when we do the pageout() we've already released the
rmap so no way to figure out which pte to tweak, afaiu. It also looks
complicated just for maintaining this information.

>
> Is there any other clearer way to do it? We wouldn't want to overload the
> special pte unnecessarily.

I feel like the solution you proposed in the other patch for soft dirty might
work. It's just that it seems heavier, especially because we'll try to look up
the page cache for every single pte_none() (and after this patch including the
swap special pte) even if the page is never accessed.

I expect it will regress the case of a normal soft-dirty user when the memory is
sparsely used, because there'll be plenty of page cache look up operations that
are destined to be useless.

I'm also curious what would be the real use to have an accurate PM_SWAP
accounting. To me current implementation may not provide accurate value but
should be good enough for most cases. However not sure whether it's also true
for your use case.

Thanks,

On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> I'm also curious what would be the real use to have an accurate PM_SWAP
> accounting. To me current implementation may not provide accurate value but
> should be good enough for most cases. However not sure whether it's also true
> for your use case.

We want the PM_SWAP bit implemented (for shared memory in the pagemap
interface) to enhance the live migration for some fraction of the guest
VMs that have their pages swapped out to the host swap. Once those pages
are paged in and transferred over network, we then want to release them
with madvise(MADV_PAGEOUT) and preserve the working set of the guest VMs
to reduce the thrashing of the host swap.

At this point, we don't really need the PM_UFFD_WP or PM_SOFT_DIRTY bits
in the pagemap report and were considering them only if they were easy to
retrieve. The latter one seems to require some plumbing through the variety
of use cases in the kernel, so our intention at the moment is to capture it
in the pagemap docs as the known issue, presumably to handle by CRIU users.

(Cc Pavel Emelyanov CRIU chief maintainer)

Thanks,
Ivan

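The release step in the use case above (page in, transfer, then `madvise(MADV_PAGEOUT)`) could look roughly like the sketch below. It is only an illustration under stated assumptions: `page_span()` and `release_after_transfer()` are hypothetical names, the range is rounded outward to page boundaries because `madvise()` needs a page-aligned start address, and the fallback value 21 for `MADV_PAGEOUT` is the Linux UAPI value for systems whose libc headers predate Linux 5.4.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes madvise() under strict -std= modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21 /* Linux UAPI value, available since Linux 5.4 */
#endif

/* Hypothetical helper: round [addr, addr + len) outward to page boundaries. */
void page_span(uintptr_t addr, size_t len, size_t page_size,
               uintptr_t *start, size_t *span)
{
    uintptr_t mask = ~(uintptr_t)(page_size - 1);
    uintptr_t end = (addr + len + page_size - 1) & mask;

    *start = addr & mask;
    *span = (size_t)(end - *start);
}

/*
 * Hypothetical helper: after a buffer was paged in and sent over the
 * network, ask the kernel to reclaim its pages again. Returns the
 * madvise() result (0 on success, -1 with errno set on failure).
 */
int release_after_transfer(void *buf, size_t len, size_t page_size)
{
    uintptr_t start;
    size_t span;

    page_span((uintptr_t)buf, len, page_size, &start, &span);
    return madvise((void *)start, span, MADV_PAGEOUT);
}
```

Whether a page was swapped before the transfer is exactly what the PM_SWAP bit is meant to tell the caller, so this step would only be applied to pages that pagemap reported as swapped.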
On 21.07.21 16:38, Ivan Teterevkov wrote:
> On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
>> I'm also curious what would be the real use to have an accurate PM_SWAP
>> accounting. To me current implementation may not provide accurate value but
>> should be good enough for most cases. However not sure whether it's also true
>> for your use case.
>
> We want the PM_SWAP bit implemented (for shared memory in the pagemap
> interface) to enhance the live migration for some fraction of the guest
> VMs that have their pages swapped out to the host swap. Once those pages
> are paged in and transferred over network, we then want to release them
> with madvise(MADV_PAGEOUT) and preserve the working set of the guest VMs
> to reduce the thrashing of the host swap.

There are 3 possibilities I think (swap is just another variant of the page cache):

1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
   pte_none().
2) The page is in the page cache and is not mapped into the page table.
   pte_none().
3) The page is in the page cache and mapped into the page table.
   !pte_none().

Do I understand correctly that you want to identify 1) and indicate it via
PM_SWAP?

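The three possibilities listed above can be written down as a small model, which makes the ambiguity explicit: states 1) and 2) both leave a none pte, so the pte value alone cannot say whether PM_SWAP should be reported. This is a toy sketch, not kernel code; the enum and both function names are hypothetical, and state 1) is assumed here to mean "swapped out" for simplicity.

```c
#include <stdbool.h>

/* The three states from the list above (names are hypothetical). */
enum shmem_page_state {
    PAGE_NOT_CACHED = 1,  /* 1) not in the page cache, pte_none()  */
    PAGE_CACHED_UNMAPPED, /* 2) in the page cache, pte_none()      */
    PAGE_CACHED_MAPPED,   /* 3) in the page cache, !pte_none()     */
};

/* What the page table shows for each state. */
bool model_pte_none(enum shmem_page_state s)
{
    return s != PAGE_CACHED_MAPPED;
}

/* What the requested PM_SWAP semantics would report for each state. */
bool model_want_pm_swap(enum shmem_page_state s)
{
    return s == PAGE_NOT_CACHED;
}
```

Because `model_pte_none()` returns the same value for states 1) and 2) while `model_want_pm_swap()` differs, some source of information other than the pte is needed to tell them apart.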
On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
> On 21.07.21 16:38, Ivan Teterevkov wrote:
> > On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> >> I'm also curious what would be the real use to have an accurate
> >> PM_SWAP accounting. To me current implementation may not provide
> >> accurate value but should be good enough for most cases. However not
> >> sure whether it's also true for your use case.
> >
> > We want the PM_SWAP bit implemented (for shared memory in the pagemap
> > interface) to enhance the live migration for some fraction of the
> > guest VMs that have their pages swapped out to the host swap. Once
> > those pages are paged in and transferred over network, we then want to
> > release them with madvise(MADV_PAGEOUT) and preserve the working set
> > of the guest VMs to reduce the thrashing of the host swap.
>
> There are 3 possibilities I think (swap is just another variant of the page cache):
>
> 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
>    pte_none().
> 2) The page is in the page cache and is not mapped into the page table.
>    pte_none().
> 3) The page is in the page cache and mapped into the page table.
>    !pte_none().
>
> Do I understand correctly that you want to identify 1) and indicate it via
> PM_SWAP?

Yes, and I also want to outline the context so we're on the same page.

This series introduces the support for userfaultfd-wp for shared memory
because once a shared page is swapped, its PTE is cleared. Upon retrieval
from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
because unlike private memory it's not kept in PTE or elsewhere.

We came across the same issue with PM_SWAP in the pagemap interface, but
fortunately, there's the place that we could query: the i_pages field of
the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
we do it similarly to what shmem_fault() does when it handles #PF.

Now, in the context of this series, we were exploring whether it makes
any practical sense to introduce more brand new flags to the special
PTE to populate the pagemap flags "on the spot" from the given PTE.

However, I can't see how (and why) to achieve that specifically for
PM_SWAP even with an extra bit: the XArray is precisely what we need for
the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
need it at the moment.

Hope that clarification makes sense?

The only outstanding note I have is about the compatibility of our
patches around pte_to_pagemap_entry(). I think the resulting code
should look like this:

    static pagemap_entry_t pte_to_pagemap_entry(...)
    {
        if (pte_present(pte)) {
            ...
        } else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
            ...
            if (pte_swp_uffd_wp_special(pte)) {
                flags |= PM_UFFD_WP;
            }
        }
    }

The is_swap_pte() branch will be taken for the swapped out shared pages,
thanks to shmem_file(), so the pte_swp_uffd_wp_special() can be checked
inside.

Alternatively, we could just remove "else" statement:

    static pagemap_entry_t pte_to_pagemap_entry(...)
    {
        if (pte_present(pte)) {
            ...
        } else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
            ...
        }

        if (pte_swp_uffd_wp_special(pte)) {
            flags |= PM_UFFD_WP;
        }
    }

What do you reckon?

Thanks,
Ivan

Hi, Ivan,

On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
> On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
> > On 21.07.21 16:38, Ivan Teterevkov wrote:
> > > On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> > >> I'm also curious what would be the real use to have an accurate
> > >> PM_SWAP accounting. To me current implementation may not provide
> > >> accurate value but should be good enough for most cases. However not
> > >> sure whether it's also true for your use case.
> > >
> > > We want the PM_SWAP bit implemented (for shared memory in the pagemap
> > > interface) to enhance the live migration for some fraction of the
> > > guest VMs that have their pages swapped out to the host swap. Once
> > > those pages are paged in and transferred over network, we then want to
> > > release them with madvise(MADV_PAGEOUT) and preserve the working set
> > > of the guest VMs to reduce the thrashing of the host swap.
> >
> > There are 3 possibilities I think (swap is just another variant of the page cache):
> >
> > 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
> >    pte_none().
> > 2) The page is in the page cache and is not mapped into the page table.
> >    pte_none().
> > 3) The page is in the page cache and mapped into the page table.
> >    !pte_none().
> >
> > Do I understand correctly that you want to identify 1) and indicate it via
> > PM_SWAP?
>
> Yes, and I also want to outline the context so we're on the same page.
>
> This series introduces the support for userfaultfd-wp for shared memory
> because once a shared page is swapped, its PTE is cleared. Upon retrieval
> from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
> because unlike private memory it's not kept in PTE or elsewhere.
>
> We came across the same issue with PM_SWAP in the pagemap interface, but
> fortunately, there's the place that we could query: the i_pages field of
> the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
> we do it similarly to what shmem_fault() does when it handles #PF.
>
> Now, in the context of this series, we were exploring whether it makes
> any practical sense to introduce more brand new flags to the special
> PTE to populate the pagemap flags "on the spot" from the given PTE.
>
> However, I can't see how (and why) to achieve that specifically for
> PM_SWAP even with an extra bit: the XArray is precisely what we need for
> the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
> problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
> need it at the moment.
>
> Hope that clarification makes sense?

Yes it helps, thanks.

So I can understand now on how that patch comes initially, even if it may not
work for PM_SOFT_DIRTY but it seems working indeed for PM_SWAP.

However I have a concern that I raised also in the other thread: I think
there'll be an extra and meaningless xa_load() for all the real pte_none()s
that aren't swapped out but just having no page at the back from the very
beginning. That happens much more frequent when the memory being observed by
pagemap is mapped in a huge chunk and sparsely mapped.

With old code we'll simply skip those ptes, but now I have no idea how much
overhead would a xa_load() brings.

Btw, I think there's a way to implement such an idea similar to the swap
special uffd-wp pte - when page reclaim of shmem pages, instead of putting a
none pte there maybe we can also have one bit set in the none pte showing that
this pte is swapped out. When the page faulted back we just drop that bit.

That bit could be also scanned by pagemap code to know that this page was
swapped out. That should be much lighter than xa_load(), and that identifies
immediately from a real none pte just by reading the value.

Do you think this would work?

>
> The only outstanding note I have is about the compatibility of our
> patches around pte_to_pagemap_entry(). I think the resulting code
> should look like this:
>
>     static pagemap_entry_t pte_to_pagemap_entry(...)
>     {
>         if (pte_present(pte)) {
>             ...
>         } else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>             ...
>             if (pte_swp_uffd_wp_special(pte)) {
>                 flags |= PM_UFFD_WP;
>             }
>         }
>     }
>
> The is_swap_pte() branch will be taken for the swapped out shared pages,
> thanks to shmem_file(), so the pte_swp_uffd_wp_special() can be checked
> inside.
>
> Alternatively, we could just remove "else" statement:
>
>     static pagemap_entry_t pte_to_pagemap_entry(...)
>     {
>         if (pte_present(pte)) {
>             ...
>         } else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>             ...
>         }
>
>         if (pte_swp_uffd_wp_special(pte)) {
>             flags |= PM_UFFD_WP;
>         }
>     }
>
> What do you reckon?

I don't worry too much on how we implement those in details yet. Both look
fine to me.

Thanks,

On Wed, Jul 21, 2021 at 06:28:03PM -0400, Peter Xu wrote:
> Hi, Ivan,
>
> On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
> > On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
> > > On 21.07.21 16:38, Ivan Teterevkov wrote:
> > > > On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> > > >> I'm also curious what would be the real use to have an accurate
> > > >> PM_SWAP accounting. To me current implementation may not provide
> > > >> accurate value but should be good enough for most cases. However not
> > > >> sure whether it's also true for your use case.
> > > >
> > > > We want the PM_SWAP bit implemented (for shared memory in the pagemap
> > > > interface) to enhance the live migration for some fraction of the
> > > > guest VMs that have their pages swapped out to the host swap. Once
> > > > those pages are paged in and transferred over network, we then want to
> > > > release them with madvise(MADV_PAGEOUT) and preserve the working set
> > > > of the guest VMs to reduce the thrashing of the host swap.
> > >
> > > There are 3 possibilities I think (swap is just another variant of the page cache):
> > >
> > > 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
> > >    pte_none().
> > > 2) The page is in the page cache and is not mapped into the page table.
> > >    pte_none().
> > > 3) The page is in the page cache and mapped into the page table.
> > >    !pte_none().
> > >
> > > Do I understand correctly that you want to identify 1) and indicate it via
> > > PM_SWAP?
> >
> > Yes, and I also want to outline the context so we're on the same page.
> >
> > This series introduces the support for userfaultfd-wp for shared memory
> > because once a shared page is swapped, its PTE is cleared. Upon retrieval
> > from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
> > because unlike private memory it's not kept in PTE or elsewhere.
> >
> > We came across the same issue with PM_SWAP in the pagemap interface, but
> > fortunately, there's the place that we could query: the i_pages field of
> > the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
> > we do it similarly to what shmem_fault() does when it handles #PF.
> >
> > Now, in the context of this series, we were exploring whether it makes
> > any practical sense to introduce more brand new flags to the special
> > PTE to populate the pagemap flags "on the spot" from the given PTE.
> >
> > However, I can't see how (and why) to achieve that specifically for
> > PM_SWAP even with an extra bit: the XArray is precisely what we need for
> > the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
> > problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
> > need it at the moment.
> >
> > Hope that clarification makes sense?
>
> Yes it helps, thanks.
>
> So I can understand now on how that patch comes initially, even if it may not
> work for PM_SOFT_DIRTY but it seems working indeed for PM_SWAP.
>
> However I have a concern that I raised also in the other thread: I think
> there'll be an extra and meaningless xa_load() for all the real pte_none()s
> that aren't swapped out but just having no page at the back from the very
> beginning. That happens much more frequent when the memory being observed by
> pagemap is mapped in a huge chunk and sparsely mapped.
>
> With old code we'll simply skip those ptes, but now I have no idea how much
> overhead would a xa_load() brings.
>
> Btw, I think there's a way to implement such an idea similar to the swap
> special uffd-wp pte - when page reclaim of shmem pages, instead of putting a
> none pte there maybe we can also have one bit set in the none pte showing that
> this pte is swapped out. When the page faulted back we just drop that bit.
>
> That bit could be also scanned by pagemap code to know that this page was
> swapped out. That should be much lighter than xa_load(), and that identifies
> immediately from a real none pte just by reading the value.
>
> Do you think this would work?

Btw, I think that's what Tiberiu used to mention, but I think I just changed my
mind.. Sorry to have brought such a confusion.

So what I think now is: we can set it (instead of zeroing the pte) right at
unmapping the pte of page reclaim. Code-wise, that can be a special flag
(maybe, TTU_PAGEOUT?) passed over to try_to_unmap() of shrink_page_list() to
differentiate from other try_to_unmap()s.

I think that bit can also be dropped correctly e.g. when punching a hole in the
file, then rmap_walk() can find and drop the marker (I used to suspect uffd-wp
bit could get left-overs, but after a second thought here similarly, it seems
it won't; as long as hole punching and vma unmapping will always be able to
scan those marker ptes, then it seems all right to drop them correctly).

But that's my wild thoughts; I could have missed something too.

Thanks,

On 22.07.21 00:57, Peter Xu wrote:
> On Wed, Jul 21, 2021 at 06:28:03PM -0400, Peter Xu wrote:
>> Hi, Ivan,
>>
>> On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
>>> On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
>>>> On 21.07.21 16:38, Ivan Teterevkov wrote:
>>>>> On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
>>>>>> I'm also curious what would be the real use to have an accurate
>>>>>> PM_SWAP accounting. To me current implementation may not provide
>>>>>> accurate value but should be good enough for most cases. However not
>>>>>> sure whether it's also true for your use case.
>>>>>
>>>>> We want the PM_SWAP bit implemented (for shared memory in the pagemap
>>>>> interface) to enhance the live migration for some fraction of the
>>>>> guest VMs that have their pages swapped out to the host swap. Once
>>>>> those pages are paged in and transferred over network, we then want to
>>>>> release them with madvise(MADV_PAGEOUT) and preserve the working set
>>>>> of the guest VMs to reduce the thrashing of the host swap.
>>>>
>>>> There are 3 possibilities I think (swap is just another variant of the page cache):
>>>>
>>>> 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
>>>>    pte_none().
>>>> 2) The page is in the page cache and is not mapped into the page table.
>>>>    pte_none().
>>>> 3) The page is in the page cache and mapped into the page table.
>>>>    !pte_none().
>>>>
>>>> Do I understand correctly that you want to identify 1) and indicate it via
>>>> PM_SWAP?
>>>
>>> Yes, and I also want to outline the context so we're on the same page.
>>>
>>> This series introduces the support for userfaultfd-wp for shared memory
>>> because once a shared page is swapped, its PTE is cleared. Upon retrieval
>>> from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
>>> because unlike private memory it's not kept in PTE or elsewhere.
>>>
>>> We came across the same issue with PM_SWAP in the pagemap interface, but
>>> fortunately, there's the place that we could query: the i_pages field of
>>> the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
>>> we do it similarly to what shmem_fault() does when it handles #PF.
>>>
>>> Now, in the context of this series, we were exploring whether it makes
>>> any practical sense to introduce more brand new flags to the special
>>> PTE to populate the pagemap flags "on the spot" from the given PTE.
>>>
>>> However, I can't see how (and why) to achieve that specifically for
>>> PM_SWAP even with an extra bit: the XArray is precisely what we need for
>>> the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
>>> problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
>>> need it at the moment.
>>>
>>> Hope that clarification makes sense?
>>
>> Yes it helps, thanks.
>>
>> So I can understand now on how that patch comes initially, even if it may not
>> work for PM_SOFT_DIRTY but it seems working indeed for PM_SWAP.
>>
>> However I have a concern that I raised also in the other thread: I think
>> there'll be an extra and meaningless xa_load() for all the real pte_none()s
>> that aren't swapped out but just having no page at the back from the very
>> beginning. That happens much more frequent when the memory being observed by
>> pagemap is mapped in a huge chunk and sparsely mapped.
>>
>> With old code we'll simply skip those ptes, but now I have no idea how much
>> overhead would a xa_load() brings.

Let's benchmark it then. I feel like we really shouldn't be storing
unnecessarily data in page tables if they are readily available somewhere
else, because ...

>>
>> Btw, I think there's a way to implement such an idea similar to the swap
>> special uffd-wp pte - when page reclaim of shmem pages, instead of putting a
>> none pte there maybe we can also have one bit set in the none pte showing that
>> this pte is swapped out. When the page faulted back we just drop that bit.
>>
>> That bit could be also scanned by pagemap code to know that this page was
>> swapped out. That should be much lighter than xa_load(), and that identifies
>> immediately from a real none pte just by reading the value.

... we are optimizing a corner case feature (pagemap) by affecting other
system parts. Just imagine

1. Forking: will always have to copy the whole page tables for shmem
   instead of optimizing.
2. New shmem mappings: will always have to sync back that bit from the
   pagecache

And these are just the things that immediately come to mind. There is
certainly more (e.g., page table reclaim [1]).

>>
>> Do you think this would work?
>
> Btw, I think that's what Tiberiu used to mention, but I think I just changed my
> mind.. Sorry to have brought such a confusion.
>
> So what I think now is: we can set it (instead of zeroing the pte) right at
> unmapping the pte of page reclaim. Code-wise, that can be a special flag
> (maybe, TTU_PAGEOUT?) passed over to try_to_unmap() of shrink_page_list() to
> differentiate from other try_to_unmap()s.
>
> I think that bit can also be dropped correctly e.g. when punching a hole in the
> file, then rmap_walk() can find and drop the marker (I used to suspect uffd-wp
> bit could get left-overs, but after a second thought here similarly, it seems
> it won't; as long as hole punching and vma unmapping will always be able to
> scan those marker ptes, then it seems all right to drop them correctly).
>
> But that's my wild thoughts; I could have missed something too.
>

Adding to that, Peter can you enlighten me how uffd-wp on shmem combined
with the uffd-wp bit in page tables is supposed to work in general when
talking about multiple processes?

Shmem means any process can modify any memory. To be able to properly catch
writes to such memory, the only way I can see it working is

1. All processes register uffd-wp on the shmem VMA
2. All processes arm uffd-wp by setting the same uffd-wp bits in their page
   tables for the affected shmem
3. All processes synchronize, sending each other uffd-wp events when they
   receive one

This is quite ... suboptimal I have to say. This is really the only way I
can imagine uffd-wp to work reliably. Is there any obvious way to make this
work I am missing?

But then, all page tables are already supposed to contain the uffd-wp bit.
Which makes me think that we can actually get rid of the uffd-wp bit in the
page table for pte_none() entries and instead store this information
somewhere else (in the page cache?) for all entries combined.

So that simplification would result in

1. All processes register uffd-wp on the shmem VMA
2. One process wp-protects uffd-wp via the page cache (we can update all
   PTEs in other processes)
3. All processes synchronize, sending each other uffd-wp events when they
   receive one

The semantics of uffd-wp on shmem would be different to what we have so far
... which would be just fine as we never had uffd-wp on shared memory.

In an ideal world, 1. and 3. wouldn't be required and all registered uffd
listeners would be notified when any process writes to it.

Sure, for single-user shmem it would work just like !shmem, but then, maybe
that user really shouldn't be using shmem. But maybe I am missing something
important :)

> Thanks,
>

[1] https://lkml.kernel.org/r/20210718043034.76431-1-zhengqi.arch@bytedance.com

On Thu, Jul 22, 2021 at 08:27:07AM +0200, David Hildenbrand wrote:
> On 22.07.21 00:57, Peter Xu wrote:
> > On Wed, Jul 21, 2021 at 06:28:03PM -0400, Peter Xu wrote:
> > > Hi, Ivan,
> > >
> > > On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
> > > > On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
> > > > > On 21.07.21 16:38, Ivan Teterevkov wrote:
> > > > > > On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> > > > > > > I'm also curious what would be the real use to have an accurate
> > > > > > > PM_SWAP accounting. To me current implementation may not provide
> > > > > > > accurate value but should be good enough for most cases. However not
> > > > > > > sure whether it's also true for your use case.
> > > > > >
> > > > > > We want the PM_SWAP bit implemented (for shared memory in the pagemap
> > > > > > interface) to enhance the live migration for some fraction of the
> > > > > > guest VMs that have their pages swapped out to the host swap. Once
> > > > > > those pages are paged in and transferred over network, we then want to
> > > > > > release them with madvise(MADV_PAGEOUT) and preserve the working set
> > > > > > of the guest VMs to reduce the thrashing of the host swap.
> > > > >
> > > > > There are 3 possibilities I think (swap is just another variant of the page cache):
> > > > >
> > > > > 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
> > > > >    pte_none().
> > > > > 2) The page is in the page cache and is not mapped into the page table.
> > > > >    pte_none().
> > > > > 3) The page is in the page cache and mapped into the page table.
> > > > >    !pte_none().
> > > > >
> > > > > Do I understand correctly that you want to identify 1) and indicate it via
> > > > > PM_SWAP?
> > > >
> > > > Yes, and I also want to outline the context so we're on the same page.
> > > >
> > > > This series introduces the support for userfaultfd-wp for shared memory
> > > > because once a shared page is swapped, its PTE is cleared. Upon retrieval
> > > > from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
> > > > because unlike private memory it's not kept in PTE or elsewhere.
> > > >
> > > > We came across the same issue with PM_SWAP in the pagemap interface, but
> > > > fortunately, there's the place that we could query: the i_pages field of
> > > > the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
> > > > we do it similarly to what shmem_fault() does when it handles #PF.
> > > >
> > > > Now, in the context of this series, we were exploring whether it makes
> > > > any practical sense to introduce more brand new flags to the special
> > > > PTE to populate the pagemap flags "on the spot" from the given PTE.
> > > >
> > > > However, I can't see how (and why) to achieve that specifically for
> > > > PM_SWAP even with an extra bit: the XArray is precisely what we need for
> > > > the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
> > > > problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
> > > > need it at the moment.
> > > >
> > > > Hope that clarification makes sense?
> > >
> > > Yes it helps, thanks.
> > >
> > > So I can understand now on how that patch comes initially, even if it may not
> > > work for PM_SOFT_DIRTY but it seems working indeed for PM_SWAP.
> > >
> > > However I have a concern that I raised also in the other thread: I think
> > > there'll be an extra and meaningless xa_load() for all the real pte_none()s
> > > that aren't swapped out but just having no page at the back from the very
> > > beginning. That happens much more frequent when the memory being observed by
> > > pagemap is mapped in a huge chunk and sparsely mapped.
> > >
> > > With old code we'll simply skip those ptes, but now I have no idea how much
> > > overhead would a xa_load() brings.
>
> Let's benchmark it then.

Yes that's a good idea. The goal should be that we won't regress any existing
pagemap users due to too slow sampling.

> I feel like we really shouldn't be storing
> unnecessarily data in page tables if they are readily available somewhere
> else, because ...
>
> > >
> > > Btw, I think there's a way to implement such an idea similar to the swap
> > > special uffd-wp pte - when page reclaim of shmem pages, instead of putting a
> > > none pte there maybe we can also have one bit set in the none pte showing that
> > > this pte is swapped out. When the page faulted back we just drop that bit.
> > >
> > > That bit could be also scanned by pagemap code to know that this page was
> > > swapped out. That should be much lighter than xa_load(), and that identifies
> > > immediately from a real none pte just by reading the value.
>
> ... we are optimizing a corner case feature (pagemap) by affecting other
> system parts. Just imagine
>
> 1. Forking: will always have to copy the whole page tables for shmem
>    instead of optimizing.
> 2. New shmem mappings: will always have to sync back that bit from the
>    pagecache

Both points seem valid. Not sure whether we can still keep the behavior
single-process only, that should satisfy e.g. the guest tracking use case and I
believe most of the cases. Then for fork() we can ignore this bit, and for new
mappings we ignore too. But I do confess that's a new limitation even if so.

>
> And these are just the things that immediately come to mind. There is
> certainly more (e.g., page table reclaim [1]).
>
> > >
> > > Do you think this would work?
> >
> > Btw, I think that's what Tiberiu used to mention, but I think I just changed my
> > mind.. Sorry to have brought such a confusion.
> >
> > So what I think now is: we can set it (instead of zeroing the pte) right at
> > unmapping the pte of page reclaim. Code-wise, that can be a special flag
> > (maybe, TTU_PAGEOUT?) passed over to try_to_unmap() of shrink_page_list() to
> > differentiate from other try_to_unmap()s.
> >
> > I think that bit can also be dropped correctly e.g. when punching a hole in the
> > file, then rmap_walk() can find and drop the marker (I used to suspect uffd-wp
> > bit could get left-overs, but after a second thought here similarly, it seems
> > it won't; as long as hole punching and vma unmapping will always be able to
> > scan those marker ptes, then it seems all right to drop them correctly).
> >
> > But that's my wild thoughts; I could have missed something too.
> >
>
> Adding to that, Peter can you enlighten me how uffd-wp on shmem combined
> with the uffd-wp bit in page tables is supposed to work in general when
> talking about multiple processes?
>
> Shmem means any process can modify any memory. To be able to properly catch
> writes to such memory, the only way I can see it working is
>
> 1. All processes register uffd-wp on the shmem VMA
> 2. All processes arm uffd-wp by setting the same uffd-wp bits in their page
>    tables for the affected shmem
> 3. All processes synchronize, sending each other uffd-wp events when they
>    receive one
>
> This is quite ... suboptimal I have to say. This is really the only way I
> can imagine uffd-wp to work reliably. Is there any obvious way to make this
> work I am missing?
>
> But then, all page tables are already supposed to contain the uffd-wp bit.
> Which makes me think that we can actually get rid of the uffd-wp bit in the
> page table for pte_none() entries and instead store this information
> somewhere else (in the page cache?) for all entries combined.
>
> So that simplification would result in
>
> 1. All processes register uffd-wp on the shmem VMA
> 2.
One processes wp-protects uffd-wp via the page cache (we can update all > PTEs in other processes) > 3. All processes synchronize, sending each other uffd-wp events when they > receive one > > The semantics of uffd-wp on shmem would be different to what we have so far > ... which would be just fine as we never had uffd-wp on shared memory. > > In an ideal world, 1. and 3. wouldn't be required and all registered uffd > listeners would be notified when any process writes to it. This is also a good point at least to be fully discussed, I wished there could be the ideal world, but I just don't yet know whether it exists.. As you see even if we want to do some trick in page cache it'll still need at least step 1 to have that process be okay and be aware of such trick otherwise there'll start to evolve more complicated privilege relationships. It's also easier to do like what we have right now, e.g. one vma is bind to only one uffd, and that uffd is definitely bind to the current mm. If we allow cross-process operations that'll be hard to tell how that works - say processes A,B,C shared the memory and wanted to wr-protect it, who (which uffd, as there can be three) should be servicing the fault from someone? Why that process has that privilege? I can't quickly tell. So the per-mm idea and keep things within pgtables do sound okay, and simplify things like this so all things are at least self-contained within one process, for either privilege or the rest. I also don't know whether dropping the uffd-wp bit will be easy for shmem. The page can be wr-protected for other reasons (e.g. soft dirty), even if we check the vma VM_UFFD_WP flag we won't be able to identify whether this is wr-protected by uffd. I don't see how we can avoid false positives otherwise. > > > Sure, for single-user shmem it would work just like !shmem, but then, maybe > that user really shouldn't be using shmem. 
But maybe I am missing something > important :) I see this in a bit different perspective; normally shmem and uffd-wp come from two different requirements: (1) shmem is definitely for data sharing, while (2) uffd-wp is for tracking page changes. When the requriements meet each other, we'll need something like this series. I don't know how it will go at last, but I hope we can fill up the gap so uffd-wp can be close to a complete stage. Thanks!
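For readers following the thread, the three possibilities David lists, plus the "real" none pte that Peter is worried about, can be written down as a small toy model. This is only a sketch: `cache_slot`, `shmem_state`, `classify()` and `needs_cache_lookup()` are hypothetical names invented for illustration, not kernel types or functions, and the model assumes (as a simplification) that the shmem page cache slot is either empty, holds a page, or records a swap entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these enums are hypothetical, not kernel types. */
enum cache_slot {
	SLOT_EMPTY,       /* nothing was ever stored at this file offset */
	SLOT_PAGE,        /* a page sits in the page cache */
	SLOT_SWAP_ENTRY,  /* the page was written out to swap */
};

enum shmem_state {
	STATE_MAPPED,          /* case 3: !pte_none() */
	STATE_CACHED_UNMAPPED, /* case 2: pte_none(), page in page cache */
	STATE_SWAPPED_OUT,     /* case 1: pte_none(), page in swap */
	STATE_NEVER_POPULATED, /* a "real" none pte with no page behind it */
};

/* Combine what the pte says with what the page cache slot says. */
enum shmem_state classify(bool pte_is_none, enum cache_slot slot)
{
	assert(slot == SLOT_EMPTY || slot == SLOT_PAGE ||
	       slot == SLOT_SWAP_ENTRY);

	if (!pte_is_none)
		return STATE_MAPPED;
	if (slot == SLOT_PAGE)
		return STATE_CACHED_UNMAPPED;
	if (slot == SLOT_SWAP_ENTRY)
		return STATE_SWAPPED_OUT;
	return STATE_NEVER_POPULATED;
}

/*
 * The pte alone cannot separate the three none cases, so in this model
 * every none pte costs one page cache lookup, including the sparse
 * never-populated ones. That is the overhead the thread wants benchmarked.
 */
bool needs_cache_lookup(bool pte_is_none)
{
	return pte_is_none;
}
```

In this model only `STATE_SWAPPED_OUT` would be reported with PM_SWAP. A marker bit left in the pte at reclaim time, as discussed above, would let the classification skip the lookup at the cost of keeping extra state in the page tables.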
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9c5af77b5290..988e29fa1f00 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1389,6 +1389,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			flags |= PM_SWAP;
 		if (is_pfn_swap_entry(entry))
 			page = pfn_swap_entry_to_page(entry);
+	} else if (pte_swp_uffd_wp_special(pte)) {
+		flags |= PM_UFFD_WP;
 	}
 
 	if (page && !PageAnon(page))
@@ -1522,10 +1524,15 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 		if (page_mapcount(page) == 1)
 			flags |= PM_MMAP_EXCLUSIVE;
 
+		if (huge_pte_uffd_wp(pte))
+			flags |= PM_UFFD_WP;
+
 		flags |= PM_PRESENT;
 		if (pm->show_pfn)
 			frame = pte_pfn(pte) +
 				((addr & ~hmask) >> PAGE_SHIFT);
+	} else if (pte_swp_uffd_wp_special(pte)) {
+		flags |= PM_UFFD_WP;
 	}
 
 	for (; addr != end; addr += PAGE_SIZE) {
This requires the pagemap code to be able to recognize the newly introduced
swap special pte for uffd-wp, meanwhile the general case for hugetlb that we
recently start to support. It should make pagemap uffd-wp support complete.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/proc/task_mmu.c | 7 +++++++
 1 file changed, 7 insertions(+)
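From user space, the effect of the patch would show up in the 64-bit entries read from /proc/<pid>/pagemap. Below is a minimal sketch of how a tool might test for it, assuming the bit layout documented in Documentation/admin-guide/mm/pagemap.rst (bit 57 uffd-wp, bit 62 swapped, bit 63 present). The helpers `pm_test()` and `pagemap_read()` are hypothetical names for illustration, not an existing API, and the `PM_*` macros are redefined locally because the kernel does not export them to user space.

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Flag bits as documented in Documentation/admin-guide/mm/pagemap.rst. */
#define PM_UFFD_WP  (1ULL << 57)  /* pte is uffd-wp write-protected */
#define PM_SWAP     (1ULL << 62)  /* page is swapped */
#define PM_PRESENT  (1ULL << 63)  /* page is present in RAM */

/* Hypothetical helper: test one flag in a 64-bit pagemap entry. */
int pm_test(uint64_t entry, uint64_t flag)
{
	return (entry & flag) != 0;
}

/*
 * Hypothetical helper: read the pagemap entry covering vaddr from an
 * already opened /proc/<pid>/pagemap descriptor. Returns 0 on success
 * and -1 on any error or short read.
 */
int pagemap_read(int fd, uintptr_t vaddr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset;
	ssize_t n;

	assert(entry != NULL);
	if (page_size <= 0)
		return -1;

	/* One 64-bit entry per virtual page, indexed by page number. */
	offset = (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(*entry);
	n = pread(fd, entry, sizeof(*entry), offset);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

With the patch applied, an entry backed by the swap special pte would be expected to satisfy `pm_test(e, PM_UFFD_WP)` while both `pm_test(e, PM_PRESENT)` and `pm_test(e, PM_SWAP)` are false, which matches the answer given to Tiberiu earlier in the thread: the kernel knows the range is write-protected but cannot tell whether the page was swapped out.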