Message ID | 20220622185038.71740-3-namit@vmware.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | userfaultfd: support access/write hints | expand |
On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote: > From: Nadav Amit <namit@vmware.com> > > Using a PTE on x86 with cleared access-bit (aka young-bit) > takes ~600 cycles more than when the access bit is set. At the same > time, setting the access-bit for memory that is not used (e.g., > prefetched) can introduce greater overheads, as the prefetched memory is > reclaimed later than it should be. > > Userfaultfd currently does not set the access-bit (excluding the > huge-pages case). Arguably, it is best to let the user control whether > the access bit should be set or not. The expected use is to request > userfaultfd to set the access-bit when the copy/wp operation is done to > resolve a page-fault, and not to set the access-bit when the memory is > prefetched. > > Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the > young bit to be set. > > Cc: Mike Kravetz <mike.kravetz@oracle.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Axel Rasmussen <axelrasmussen@google.com> > Cc: Peter Xu <peterx@redhat.com> > Cc: David Hildenbrand <david@redhat.com> > Cc: Mike Rapoport <rppt@linux.ibm.com> > Signed-off-by: Nadav Amit <namit@vmware.com> Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I missed it? Do we need to cover them too? Thanks,
On Jun 23, 2022, at 4:24 PM, Peter Xu <peterx@redhat.com> wrote: > ⚠ External Email > > On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote: >> From: Nadav Amit <namit@vmware.com> >> >> Using a PTE on x86 with cleared access-bit (aka young-bit) >> takes ~600 cycles more than when the access bit is set. At the same >> time, setting the access-bit for memory that is not used (e.g., >> prefetched) can introduce greater overheads, as the prefetched memory is >> reclaimed later than it should be. >> >> Userfaultfd currently does not set the access-bit (excluding the >> huge-pages case). Arguably, it is best to let the user control whether >> the access bit should be set or not. The expected use is to request >> userfaultfd to set the access-bit when the copy/wp operation is done to >> resolve a page-fault, and not to set the access-bit when the memory is >> prefetched. >> >> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the >> young bit to be set. >> >> Cc: Mike Kravetz <mike.kravetz@oracle.com> >> Cc: Hugh Dickins <hughd@google.com> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: Axel Rasmussen <axelrasmussen@google.com> >> Cc: Peter Xu <peterx@redhat.com> >> Cc: David Hildenbrand <david@redhat.com> >> Cc: Mike Rapoport <rppt@linux.ibm.com> >> Signed-off-by: Nadav Amit <namit@vmware.com> > > Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I > missed it? Do we need to cover them too? Thanks, I do not know what the value of not setting the PTE’s access/dirty when it comes to performance. The pages won’t be swapped out, just as you wrote in your comment in hugetlb_mcopy_atomic_pte(): /* * Always mark UFFDIO_COPY page dirty; note that this may not be * extremely important for hugetlbfs for now since swapping is not * supported, but we should still be clear in that this page cannot be * thrown away at will, even if write bit not set. */ _dst_pte = huge_pte_mkdirty(_dst_pte); _dst_pte = pte_mkyoung(_dst_pte); If you want for consistency/robustness not to set dirty on read-only entries, that’s something that I can do.
On Thu, Jun 23, 2022 at 11:35:00PM +0000, Nadav Amit wrote: > On Jun 23, 2022, at 4:24 PM, Peter Xu <peterx@redhat.com> wrote: > > > ⚠ External Email > > > > On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote: > >> From: Nadav Amit <namit@vmware.com> > >> > >> Using a PTE on x86 with cleared access-bit (aka young-bit) > >> takes ~600 cycles more than when the access bit is set. At the same > >> time, setting the access-bit for memory that is not used (e.g., > >> prefetched) can introduce greater overheads, as the prefetched memory is > >> reclaimed later than it should be. > >> > >> Userfaultfd currently does not set the access-bit (excluding the > >> huge-pages case). Arguably, it is best to let the user control whether > >> the access bit should be set or not. The expected use is to request > >> userfaultfd to set the access-bit when the copy/wp operation is done to > >> resolve a page-fault, and not to set the access-bit when the memory is > >> prefetched. > >> > >> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the > >> young bit to be set. > >> > >> Cc: Mike Kravetz <mike.kravetz@oracle.com> > >> Cc: Hugh Dickins <hughd@google.com> > >> Cc: Andrew Morton <akpm@linux-foundation.org> > >> Cc: Axel Rasmussen <axelrasmussen@google.com> > >> Cc: Peter Xu <peterx@redhat.com> > >> Cc: David Hildenbrand <david@redhat.com> > >> Cc: Mike Rapoport <rppt@linux.ibm.com> > >> Signed-off-by: Nadav Amit <namit@vmware.com> > > > > Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I > > missed it? Do we need to cover them too? Thanks, > > I do not know what the value of not setting the PTE’s access/dirty when it > comes to performance. The pages won’t be swapped out, just as you wrote in > your comment in hugetlb_mcopy_atomic_pte(): > > /* > * Always mark UFFDIO_COPY page dirty; note that this may not be > * extremely important for hugetlbfs for now since swapping is not > * supported, but we should still be clear in that this page cannot be > * thrown away at will, even if write bit not set. > */ > _dst_pte = huge_pte_mkdirty(_dst_pte); > _dst_pte = pte_mkyoung(_dst_pte); > > If you want for consistency/robustness not to set dirty on read-only > entries, that’s something that I can do. We have more than one options here I guess. We could have the hints not available for hugetlbfs, but then we'll both need to add special document for hugetlbfs (when you write the man-page, let's say) and also it'll be probably better to fail the ioctls with hints when applying to hugetlb vmas. Leaving them silent to hugetlbfs is another option, it just sounds weird to me without any kind of documentation so I like this one least. Or we could make them work for hugetlbfs too. Not saying that swap will work some day (but it still could?!) it just seems all consistent as you said. So I'd prefer the last one, but only if you agree.
On Jun 23, 2022, at 4:49 PM, Peter Xu <peterx@redhat.com> wrote: > On Thu, Jun 23, 2022 at 11:35:00PM +0000, Nadav Amit wrote: >> On Jun 23, 2022, at 4:24 PM, Peter Xu <peterx@redhat.com> wrote: >> >>> ⚠ External Email >>> >>> On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote: >>>> From: Nadav Amit <namit@vmware.com> >>>> >>>> Using a PTE on x86 with cleared access-bit (aka young-bit) >>>> takes ~600 cycles more than when the access bit is set. At the same >>>> time, setting the access-bit for memory that is not used (e.g., >>>> prefetched) can introduce greater overheads, as the prefetched memory is >>>> reclaimed later than it should be. >>>> >>>> Userfaultfd currently does not set the access-bit (excluding the >>>> huge-pages case). Arguably, it is best to let the user control whether >>>> the access bit should be set or not. The expected use is to request >>>> userfaultfd to set the access-bit when the copy/wp operation is done to >>>> resolve a page-fault, and not to set the access-bit when the memory is >>>> prefetched. >>>> >>>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the >>>> young bit to be set. >>>> >>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>>> Cc: Hugh Dickins <hughd@google.com> >>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>> Cc: Axel Rasmussen <axelrasmussen@google.com> >>>> Cc: Peter Xu <peterx@redhat.com> >>>> Cc: David Hildenbrand <david@redhat.com> >>>> Cc: Mike Rapoport <rppt@linux.ibm.com> >>>> Signed-off-by: Nadav Amit <namit@vmware.com> >>> >>> Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I >>> missed it? Do we need to cover them too? Thanks, >> >> I do not know what the value of not setting the PTE’s access/dirty when it >> comes to performance. The pages won’t be swapped out, just as you wrote in >> your comment in hugetlb_mcopy_atomic_pte(): >> >> /* >> * Always mark UFFDIO_COPY page dirty; note that this may not be >> * extremely important for hugetlbfs for now since swapping is not >> * supported, but we should still be clear in that this page cannot be >> * thrown away at will, even if write bit not set. >> */ >> _dst_pte = huge_pte_mkdirty(_dst_pte); >> _dst_pte = pte_mkyoung(_dst_pte); >> >> If you want for consistency/robustness not to set dirty on read-only >> entries, that’s something that I can do. > > We have more than one options here I guess. > > We could have the hints not available for hugetlbfs, but then we'll both > need to add special document for hugetlbfs (when you write the man-page, > let's say) and also it'll be probably better to fail the ioctls with hints > when applying to hugetlb vmas. > > Leaving them silent to hugetlbfs is another option, it just sounds weird to > me without any kind of documentation so I like this one least. > > Or we could make them work for hugetlbfs too. Not saying that swap will > work some day (but it still could?!) it just seems all consistent as you > said. > > So I'd prefer the last one, but only if you agree. My take is that hints are hints. Following David’s (or was it yours?) feedback, I fixed the description to indicate that this is merely a hint and removed all references to dirty/access bits. The kernel therefore can ignore the hint when it wants to or use it in any other way. I fully agree that this gives the kernel the ability to change the behavior as needed. Note that for write-protected 4KB zero-page (where we share the zero-page) we always set the access-bit, regardless of the hint, because it makes sense: the zero-page is not swappable and therefore the access-bit is set. I think that the lesser user-facing documentation there is on how the feature is *exactly* used by the kernel - is better from an API point of view. So I see no reason to fail or be forced not to set a page as young, just because a hint was *not* provided. This would even be a regression in the behavior. The hint is actually always respected right now, it is just that even if you do not provide the hint, the access/dirty is set. The only consistency I think worth thinking about is with the dirty-bit, and I can add it if you want. Note that the access-bit (in x86) might be set speculatively in contrast to the dirty-bit is only set atomically with a real access. That’s the reason I think it may make sense not to set the dirty without a hint. Is that acceptable? Access-bit always set, dirty-bit according to hint?
On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote: > My take is that hints are hints. Following David’s (or was it yours?) > feedback, I fixed the description to indicate that this is merely a hint and > removed all references to dirty/access bits. The kernel therefore can ignore > the hint when it wants to or use it in any other way. I fully agree that > this gives the kernel the ability to change the behavior as needed. > > Note that for write-protected 4KB zero-page (where we share the zero-page) > we always set the access-bit, regardless of the hint, because it makes > sense: the zero-page is not swappable and therefore the access-bit is set. The zero-page example makes sense, and yeah that makes the hugetlb behavior making more sense too. > > I think that the lesser user-facing documentation there is on how the > feature is *exactly* used by the kernel - is better from an API point of > view. > > So I see no reason to fail or be forced not to set a page as young, just > because a hint was *not* provided. This would even be a regression in the > behavior. The hint is actually always respected right now, it is just that > even if you do not provide the hint, the access/dirty is set. > > The only consistency I think worth thinking about is with the dirty-bit, and > I can add it if you want. Note that the access-bit (in x86) might be set > speculatively in contrast to the dirty-bit is only set atomically with a > real access. That’s the reason I think it may make sense not to set the > dirty without a hint. Sorry to ask if this is (another) naive question: any link/help to explain the speculative behavior on access bit? Is it part of speculative execution (which, iiuc, would it be reverted if the speculation failed)? > > Is that acceptable? Access-bit always set, dirty-bit according to hint? I'm still trying to digest what you said above, sorry. Aren't both access and dirty bits need an atomic op to be set anyway? Then from perf pov should we simply keep setting them both too like what you did with this version? because it seems that'll always avoid an extra pgtable update access?
> On Jun 23, 2022, at 7:05 PM, Peter Xu <peterx@redhat.com> wrote: > > On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote: >> My take is that hints are hints. Following David’s (or was it yours?) >> feedback, I fixed the description to indicate that this is merely a hint and >> removed all references to dirty/access bits. The kernel therefore can ignore >> the hint when it wants to or use it in any other way. I fully agree that >> this gives the kernel the ability to change the behavior as needed. >> >> Note that for write-protected 4KB zero-page (where we share the zero-page) >> we always set the access-bit, regardless of the hint, because it makes >> sense: the zero-page is not swappable and therefore the access-bit is set. > > The zero-page example makes sense, and yeah that makes the hugetlb behavior > making more sense too. > >> >> I think that the lesser user-facing documentation there is on how the >> feature is *exactly* used by the kernel - is better from an API point of >> view. >> >> So I see no reason to fail or be forced not to set a page as young, just >> because a hint was *not* provided. This would even be a regression in the >> behavior. The hint is actually always respected right now, it is just that >> even if you do not provide the hint, the access/dirty is set. >> >> The only consistency I think worth thinking about is with the dirty-bit, and >> I can add it if you want. Note that the access-bit (in x86) might be set >> speculatively in contrast to the dirty-bit is only set atomically with a >> real access. That’s the reason I think it may make sense not to set the >> dirty without a hint. > > Sorry to ask if this is (another) naive question: any link/help to explain > the speculative behavior on access bit? Is it part of speculative > execution (which, iiuc, would it be reverted if the speculation failed)? Oh man, it is hard to find a reference. I made this claim it based on my recollection (and logic). The access-bit on Intel is set when the PTE is loaded into the TLB, so if you allow speculative loading of the TLB, that’s what you get. Googling shows Yu Zhao saying: "IIRC, there are also false positives, i.e., the accessed bit is set on entries used by speculative execution only.” [1] Intel SDM says: "Whenever the processor uses a paging-structure entry as part of linear-address translation, it sets the accessed flag in that entry... Whenever there is a write to a linear address, the processor sets the dirty flag (if it is not already set) in the paging- structure entry..." You can argue that this indicates that the access-bit is updated speculatively (translations can be speculative) and dirty-bit is on actual write. But it is somewhat of a creative reading. Googling further did not help much, but I found a relevant discussion on RISC-V, in which they actually consider a similar behavior. [2] If you want (and care), we can cc Dave Hansen to get a clear answer. [1] https://lore.kernel.org/lkml/YE7Rk%2FYA1Uj7yFn2@google.com/ [2] https://lists.riscv.org/g/tech-virt-mem/topic/accessed_bit/77699883?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,1,80,77699883 > >> >> Is that acceptable? Access-bit always set, dirty-bit according to hint? > > I'm still trying to digest what you said above, sorry. > > Aren't both access and dirty bits need an atomic op to be set anyway? Then > from perf pov should we simply keep setting them both too like what you did > with this version? because it seems that'll always avoid an extra pgtable > update access? I guess by atomic-op you mean atomic-update by the hardware AD-assist. I agree that if a page is written, the bits would need to be updated and these would introduce an overhead. However, if the page cannot be written, well, the dirty bit would never be set. hugetlb_mcopy_atomic_pte() currently does the following: _dst_pte = huge_pte_mkdirty(_dst_pte); _dst_pte = pte_mkyoung(_dst_pte); if (wp_copy) _dst_pte = huge_pte_mkuffd_wp(_dst_pte); Since you asked to update hugetlb_mcopy_atomic_pte(), I can offer three options: 1. Do not set dirty if (wp_copy). 2. Do not set dirty if (wp_copy || !write_hint) 3. Keep it as is. I am fine with whatever would make you happy. :)
[Sorry for replying late] On Fri, Jun 24, 2022 at 02:42:21AM +0000, Nadav Amit wrote: > > > > On Jun 23, 2022, at 7:05 PM, Peter Xu <peterx@redhat.com> wrote: > > > > On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote: > >> My take is that hints are hints. Following David’s (or was it yours?) > >> feedback, I fixed the description to indicate that this is merely a hint and > >> removed all references to dirty/access bits. The kernel therefore can ignore > >> the hint when it wants to or use it in any other way. I fully agree that > >> this gives the kernel the ability to change the behavior as needed. > >> > >> Note that for write-protected 4KB zero-page (where we share the zero-page) > >> we always set the access-bit, regardless of the hint, because it makes > >> sense: the zero-page is not swappable and therefore the access-bit is set. > > > > The zero-page example makes sense, and yeah that makes the hugetlb behavior > > making more sense too. > > > >> > >> I think that the lesser user-facing documentation there is on how the > >> feature is *exactly* used by the kernel - is better from an API point of > >> view. > >> > >> So I see no reason to fail or be forced not to set a page as young, just > >> because a hint was *not* provided. This would even be a regression in the > >> behavior. The hint is actually always respected right now, it is just that > >> even if you do not provide the hint, the access/dirty is set. > >> > >> The only consistency I think worth thinking about is with the dirty-bit, and > >> I can add it if you want. Note that the access-bit (in x86) might be set > >> speculatively in contrast to the dirty-bit is only set atomically with a > >> real access. That’s the reason I think it may make sense not to set the > >> dirty without a hint. > > > > Sorry to ask if this is (another) naive question: any link/help to explain > > the speculative behavior on access bit? Is it part of speculative > > execution (which, iiuc, would it be reverted if the speculation failed)? > > Oh man, it is hard to find a reference. I made this claim it based on my > recollection (and logic). > > The access-bit on Intel is set when the PTE is loaded into the TLB, so if you > allow speculative loading of the TLB, that’s what you get. > > Googling shows Yu Zhao saying: "IIRC, there are also false positives, i.e., > the accessed bit is set on entries used by speculative execution only.” [1] > > Intel SDM says: "Whenever the processor uses a paging-structure entry as part > of linear-address translation, it sets the accessed flag in that entry... > Whenever there is a write to a linear address, the processor sets the dirty > flag (if it is not already set) in the paging- structure entry..." > > You can argue that this indicates that the access-bit is updated > speculatively (translations can be speculative) and dirty-bit is on actual > write. But it is somewhat of a creative reading. > > Googling further did not help much, but I found a relevant discussion on > RISC-V, in which they actually consider a similar behavior. [2] > > If you want (and care), we can cc Dave Hansen to get a clear answer. > > [1] https://lore.kernel.org/lkml/YE7Rk%2FYA1Uj7yFn2@google.com/ > [2] https://lists.riscv.org/g/tech-virt-mem/topic/accessed_bit/77699883?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,1,80,77699883 I thought even writes can be speculatively executed too? Though I think when the speculation was proved wrong the write needs to be reverted along with making sure D bit cleared if it was cleared before the speculative operation. So I think I get you if you meant the access bit may not be reverted even if we hit a speculative failure (though without solid proofs, afaict..). IOW we could have false positive access bits set even if not accessed, but not to D bits which should be accurate. > > > > >> > >> Is that acceptable? Access-bit always set, dirty-bit according to hint? > > > > I'm still trying to digest what you said above, sorry. > > > > Aren't both access and dirty bits need an atomic op to be set anyway? Then > > from perf pov should we simply keep setting them both too like what you did > > with this version? because it seems that'll always avoid an extra pgtable > > update access? > > I guess by atomic-op you mean atomic-update by the hardware AD-assist. Yes. Btw, since I looked at the SDM as you quoted I think that may not strictly be like an atomic op from processor pov, I guess, since there's a NOTE: The accesses used by the processor to set these flags may or may not be exposed to the processor’s self-modifying code detection logic. If the processor is executing code from the same memory area that is being used for the paging structures, the setting of these flags may or may not result in an immediate change to the executing code stream. So I read it as: even if it'll be an atomic, the op can be postponed. > > I agree that if a page is written, the bits would need to be updated and > these would introduce an overhead. However, if the page cannot be written, > well, the dirty bit would never be set. Ok I see what you mean now. But honestly, I don't think it's anything related to the speculative access bit behavior described above.. or is it? > > hugetlb_mcopy_atomic_pte() currently does the following: > > _dst_pte = huge_pte_mkdirty(_dst_pte); > _dst_pte = pte_mkyoung(_dst_pte); > > if (wp_copy) > _dst_pte = huge_pte_mkuffd_wp(_dst_pte); > > Since you asked to update hugetlb_mcopy_atomic_pte(), I can offer three > options: > > 1. Do not set dirty if (wp_copy). > 2. Do not set dirty if (wp_copy || !write_hint) > 3. Keep it as is. AFAICT you already go somewhere at least not (3) with non-hugetlb pages in current series.. because dirty bit is not always set already for them, so I'd say we'd make them match? Hugetlbfs shouldn't be special in this aspect, IMHO. Said that, I think it doesn't really necessary need to be that complex, since make_huge_pte() already sets dirty bit when "writable=1", so IIUC what you need to do is simply make sure dirty bit set when write_hint=1. Does it sounds correct to you?
On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote: > [Sorry for replying late] > > On Fri, Jun 24, 2022 at 02:42:21AM +0000, Nadav Amit wrote: > > > > > > > On Jun 23, 2022, at 7:05 PM, Peter Xu <peterx@redhat.com> wrote: > > > > > > On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote: > > >> My take is that hints are hints. Following David’s (or was it yours?) > > >> feedback, I fixed the description to indicate that this is merely a hint and > > >> removed all references to dirty/access bits. The kernel therefore can ignore > > >> the hint when it wants to or use it in any other way. I fully agree that > > >> this gives the kernel the ability to change the behavior as needed. > > >> > > >> Note that for write-protected 4KB zero-page (where we share the zero-page) > > >> we always set the access-bit, regardless of the hint, because it makes > > >> sense: the zero-page is not swappable and therefore the access-bit is set. > > > > > > The zero-page example makes sense, and yeah that makes the hugetlb behavior > > > making more sense too. > > > > > >> > > >> I think that the lesser user-facing documentation there is on how the > > >> feature is *exactly* used by the kernel - is better from an API point of > > >> view. > > >> > > >> So I see no reason to fail or be forced not to set a page as young, just > > >> because a hint was *not* provided. This would even be a regression in the > > >> behavior. The hint is actually always respected right now, it is just that > > >> even if you do not provide the hint, the access/dirty is set. > > >> > > >> The only consistency I think worth thinking about is with the dirty-bit, and > > >> I can add it if you want. Note that the access-bit (in x86) might be set > > >> speculatively in contrast to the dirty-bit is only set atomically with a > > >> real access. That’s the reason I think it may make sense not to set the > > >> dirty without a hint. > > > > > > Sorry to ask if this is (another) naive question: any link/help to explain > > > the speculative behavior on access bit? Is it part of speculative > > > execution (which, iiuc, would it be reverted if the speculation failed)? > > > > Oh man, it is hard to find a reference. I made this claim it based on my > > recollection (and logic). > > > > The access-bit on Intel is set when the PTE is loaded into the TLB, so if you > > allow speculative loading of the TLB, that’s what you get. > > > > Googling shows Yu Zhao saying: "IIRC, there are also false positives, i.e., > > the accessed bit is set on entries used by speculative execution only.” [1] > > > > Intel SDM says: "Whenever the processor uses a paging-structure entry as part > > of linear-address translation, it sets the accessed flag in that entry... > > Whenever there is a write to a linear address, the processor sets the dirty > > flag (if it is not already set) in the paging- structure entry..." > > > > You can argue that this indicates that the access-bit is updated > > speculatively (translations can be speculative) and dirty-bit is on actual > > write. But it is somewhat of a creative reading. > > > > Googling further did not help much, but I found a relevant discussion on > > RISC-V, in which they actually consider a similar behavior. [2] > > > > If you want (and care), we can cc Dave Hansen to get a clear answer. > > > > [1] https://lore.kernel.org/lkml/YE7Rk%2FYA1Uj7yFn2@google.com/ > > [2] https://lists.riscv.org/g/tech-virt-mem/topic/accessed_bit/77699883?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,1,80,77699883 > > I thought even writes can be speculatively executed too? Though I think > when the speculation was proved wrong the write needs to be reverted along > with making sure D bit cleared if it was cleared before the speculative > operation. > > So I think I get you if you meant the access bit may not be reverted even > if we hit a speculative failure (though without solid proofs, afaict..). > IOW we could have false positive access bits set even if not accessed, but > not to D bits which should be accurate. > > > > > > > > >> > > >> Is that acceptable? Access-bit always set, dirty-bit according to hint? > > > > > > I'm still trying to digest what you said above, sorry. > > > > > > Aren't both access and dirty bits need an atomic op to be set anyway? Then > > > from perf pov should we simply keep setting them both too like what you did > > > with this version? because it seems that'll always avoid an extra pgtable > > > update access? > > > > I guess by atomic-op you mean atomic-update by the hardware AD-assist. > > Yes. > > Btw, since I looked at the SDM as you quoted I think that may not strictly > be like an atomic op from processor pov, I guess, since there's a NOTE: > > The accesses used by the processor to set these flags may or may not be > exposed to the processor’s self-modifying code detection logic. If the > processor is executing code from the same memory area that is being used > for the paging structures, the setting of these flags may or may not > result in an immediate change to the executing code stream. > > So I read it as: even if it'll be an atomic, the op can be postponed. > > > > > I agree that if a page is written, the bits would need to be updated and > > these would introduce an overhead. However, if the page cannot be written, > > well, the dirty bit would never be set. > > Ok I see what you mean now. But honestly, I don't think it's anything > related to the speculative access bit behavior described above.. or is it? > > > > > hugetlb_mcopy_atomic_pte() currently does the following: > > > > _dst_pte = huge_pte_mkdirty(_dst_pte); > > _dst_pte = pte_mkyoung(_dst_pte); > > > > if (wp_copy) > > _dst_pte = huge_pte_mkuffd_wp(_dst_pte); > > > > Since you asked to update hugetlb_mcopy_atomic_pte(), I can offer three > > options: > > > > 1. Do not set dirty if (wp_copy). > > 2. Do not set dirty if (wp_copy || !write_hint) > > 3. Keep it as is. > > AFAICT you already go somewhere at least not (3) with non-hugetlb pages in > current series.. because dirty bit is not always set already for them, so > I'd say we'd make them match? Hugetlbfs shouldn't be special in this > aspect, IMHO. > > Said that, I think it doesn't really necessary need to be that complex, > since make_huge_pte() already sets dirty bit when "writable=1", so IIUC > what you need to do is simply make sure dirty bit set when write_hint=1. > > Does it sounds correct to you? Hmm, hold on... I failed to figure out how that write-likely hint could help us for either huge or non-huge pages, since: (1) Old code always set dirty, so no perf degrade anyway with/without the hint (2) If we want to rework dirty bit (which I'm totally fine with..), then we don't apply it when we shouldn't, and afaict we should set D bit whenever we should... if the user assumes this page is likely to be written but made it read-only, say, with UFFDIO_COPY(wp_mode=1), setting D bit will not help, instead, the user should simply use an UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1.. It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one COW. But that seems to be it. In short: I'm wondering whether we only really need the ACCESS_LIKELY hint as you proposed earlier. We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE separately, but keep that only for zeropage op (and it shouldn't really be called WRITE_LIKELY)? Or did I miss something?
> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote: > > On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote: >> [Sorry for replying late] >> >> Said that, I think it doesn't really necessary need to be that complex, >> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC >> what you need to do is simply make sure dirty bit set when write_hint=1. >> >> Does it sounds correct to you? > > Hmm, hold on... I failed to figure out how that write-likely hint could > help us for either huge or non-huge pages, since: > > (1) Old code always set dirty, so no perf degrade anyway with/without the > hint > > (2) If we want to rework dirty bit (which I'm totally fine with..), then > we don't apply it when we shouldn't, and afaict we should set D bit > whenever we should... if the user assumes this page is likely to be > written but made it read-only, say, with UFFDIO_COPY(wp_mode=1), > setting D bit will not help, instead, the user should simply use an > UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1.. > > It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one > COW. But that seems to be it. > > In short: I'm wondering whether we only really need the ACCESS_LIKELY hint > as you proposed earlier. We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE > separately, but keep that only for zeropage op (and it shouldn't really be > called WRITE_LIKELY)? Or did I miss something? Let’s see if I get you correctly. I am not sure whether we had this discussion before. We are talking about a scenario in which WP=0. You argue that if the page is already set as dirty, what is the benefit of not setting the dirty-bit, right? So first, IIUC, there are cases in which the page would not be set as dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this use-case, so I say it based on the comments. ] Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it is not written by the user after UFFDI_COPY, marking the PTE as dirty when it is mapped would induce overhead, as we discussed before, since if/when the PTE is unmapped, TLB flush batching might not be possible. So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover, I would reiterate my position (which you guys convinced me in!) that having hints that indicate what the user does (WRITE_LIKELY) is a better API than something that indicates directly what the kernel should do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE). But this discussion made me think that there are two somewhat related matters that we may want to address: 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp to proactively make entries writable and save . 2. The WRITE_LIKELY hint should be propagated from mwriteprotect_range() to change_pte_range() through cp_flags, and the entry should be set dirty accordingly. With that, and with leaving hugetlbfs as it is (I meant before - leaving it as it is in the patchset that I sent), would it be ok with you?
On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote: > > > > On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote: > > > > On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote: > >> [Sorry for replying late] > >> > >> Said that, I think it doesn't really necessary need to be that complex, > >> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC > >> what you need to do is simply make sure dirty bit set when write_hint=1. > >> > >> Does it sounds correct to you? > > > > Hmm, hold on... I failed to figure out how that write-likely hint could > > help us for either huge or non-huge pages, since: > > > > (1) Old code always set dirty, so no perf degrade anyway with/without the > > hint > > > > (2) If we want to rework dirty bit (which I'm totally fine with..), then > > we don't apply it when we shouldn't, and afaict we should set D bit > > whenever we should... if the user assumes this page is likely to be > > written but made it read-only, say, with UFFDIO_COPY(wp_mode=1), > > setting D bit will not help, instead, the user should simply use an > > UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1.. > > > > It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one > > COW. But that seems to be it. > > > > In short: I'm wondering whether we only really need the ACCESS_LIKELY hint > > as you proposed earlier. We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE > > separately, but keep that only for zeropage op (and it shouldn't really be > > called WRITE_LIKELY)? Or did I miss something? > > Let’s see if I get you correctly. I am not sure whether we had this > discussion before. > > We are talking about a scenario in which WP=0. You argue that if the page > is already set as dirty, what is the benefit of not setting the dirty-bit, > right? > > So first, IIUC, there are cases in which the page would not be set as > dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this > use-case, so I say it based on the comments. ] > > Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it > is not written by the user after UFFDI_COPY, marking the PTE as dirty > when it is mapped would induce overhead, as we discussed before, since > if/when the PTE is unmapped, TLB flush batching might not be possible. I'd hope we don't make an interface design just to service that purpose of when write=0 and dirty=1 use case that is internal to the kernel so far, and I still think it's the tlb flush code to change.. or do we have other use case for this WRITE_LIKELY hint? For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the user writes to the page in the other mapping, where PageDirty() will start to be true already even if the pte that to be CONTINUEd will have dirty=0 in the pte entry. From that pov I still don't see why we need to grant the user on the dirty bit control, no matter with a hint only, or explicit. > > So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover, > I would reiterate my position (which you guys convinced me in!) David convinced you I think :) > that having hints that indicate what the user does (WRITE_LIKELY) is a > better API than something that indicates directly what the kernel should > do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE). The hint idea sounds good to me, it's just that we actually have two steps here: (1) We think providing user the control of dirty bit makes sense, then, (2) We think the flag should be a hint not explicit "set dirty bit" I agree with (2) in this case if (1) is applicable. And now I think I'm questioning myself on (1). Fundamentally, access bit has more meaningful context (0 means cold, 1 means hot), for dirty it's really more a perf thing to me (when clear, it'll take extra cycles to set it when memory write happens to it; being clear _may_ help only for the tlb flush example you mentioned but I'm not fully convinced that's correct). Maybe with the to be proposed RFC patch for tlb flush we can know whether that should be something we can rely on. It'll add more dependency on this work which I'm sorry to say. It's just that IMHO we should think carefully for the write-hint because this is a solid new uABI we're talking about. The other option is we can introduce the access hint first and think more on the dirty one (we can always add it when proper). What do you think? Also, David please chim in anytime if I missed the whole point when you proposed the idea. > > But this discussion made me think that there are two somewhat related > matters that we may want to address: > > 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp > to proactively make entries writable and save . I'm not sure I'm right here, but I think David's patch should have covered that case? The new helper only checks pte_uffd_wp() based on my memory, and when resolving page faults uffd-wp bit should have been gone, so it should be treated the same as normal ptes. > > 2. The WRITE_LIKELY hint should be propagated from mwriteprotect_range() > to change_pte_range() through cp_flags, and the entry should be set > dirty accordingly. Sounds correct. Though again I hope we can think more thoroughly on whether we need the write-hint first. Thanks,
> Fundamentally, access bit has more meaningful context (0 means cold, 1 > means hot), for dirty it's really more a perf thing to me (when clear, > it'll take extra cycles to set it when memory write happens to it; being > clear _may_ help only for the tlb flush example you mentioned but I'm not > fully convinced that's correct). > > Maybe with the to be proposed RFC patch for tlb flush we can know whether > that should be something we can rely on. It'll add more dependency on this > work which I'm sorry to say. It's just that IMHO we should think carefully > for the write-hint because this is a solid new uABI we're talking about. > > The other option is we can introduce the access hint first and think more > on the dirty one (we can always add it when proper). What do you think? > Also, David please chim in anytime if I missed the whole point when you > proposed the idea. Well, if we have an ABI that places pages without further information *why* we're doing that makes us having to guess what to do or what not to do, and I think the zeropage placement is a prime example for that. Personally, I think communicating the intention in forms of hints is something that doesn't leak implementation details into an ABI. So "no planned access" vs. "read_likely" vs. "write_likely" conceptually makes sense to me. As I raised previously, *if* we want to let the user affect the dirty bit setting (1) is then a pure implementation detail. Or whatever else we might want to do. But I also want to raise awareness that architectures that don't have a hw-set dirty bit have to use page faults to mimic dirty tracking. IIRC, s390x is a prime example for that: pte_mkclean() sets the WP bit and marks the page dirty from the write fault. So it's even more expensive than on other architectures.
On Mon, Jun 27, 2022 at 03:27:49PM +0200, David Hildenbrand wrote: > > Fundamentally, access bit has more meaningful context (0 means cold, 1 > > means hot), for dirty it's really more a perf thing to me (when clear, > > it'll take extra cycles to set it when memory write happens to it; being > > clear _may_ help only for the tlb flush example you mentioned but I'm not > > fully convinced that's correct). > > > > Maybe with the to be proposed RFC patch for tlb flush we can know whether > > that should be something we can rely on. It'll add more dependency on this > > work which I'm sorry to say. It's just that IMHO we should think carefully > > for the write-hint because this is a solid new uABI we're talking about. > > > > The other option is we can introduce the access hint first and think more > > on the dirty one (we can always add it when proper). What do you think? > > Also, David please chim in anytime if I missed the whole point when you > > proposed the idea. > > Well, if we have an ABI that places pages without further information > *why* we're doing that makes us having to guess what to do or what not > to do, and I think the zeropage placement is a prime example for that. > Personally, I think communicating the intention in forms of hints is > something that doesn't leak implementation details into an ABI. > > So "no planned access" vs. "read_likely" vs. "write_likely" conceptually > makes sense to me. > > As I raised previously, *if* we want to let the user affect the dirty > bit setting (1) is then a pure implementation detail. Or whatever else > we might want to do. > > But I also want to raise awareness that architectures that don't have a > hw-set dirty bit have to use page faults to mimic dirty tracking. IIRC, > s390x is a prime example for that: pte_mkclean() sets the WP bit and > marks the page dirty from the write fault. So it's even more expensive > than on other architectures. The last input seems to be supporting that we'd better even have redundant dirty bit in ptes rather than accidentally not having it, even when both are safe. So to me WRITE_LIKELY was still mostly around dirty bit besides the ZEROPAGE case. I don't have a strong opinion on how we should name that flag, if we want to insist on WRITE_LIKELY but only on ZEROPAGE I think it's fine, it's just that if I'm the user app I prefer making sure the page is allocated after UFFDIO_ZEROPAGE returned, rather than only providing a hint and then the kernels says "we'll do something but nothing is guaranteed". I also fully agree we don't want to expose impl details but my question was more on whether we want that hint at all as a generic one, and in what case that hint helps outside ZEROPAGE. For "it can be accessed" hint, I have an answer and it seems to apply to most of the uffd ioctls; but not so generic for a "it can be written" hint. Thanks,
[ +Dave Hansen to say how wrong I am ] > On Jun 27, 2022, at 6:12 AM, Peter Xu <peterx@redhat.com> wrote: > > ⚠ External Email > > On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote: >> >> >>> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote: >>> >>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote: >>>> [Sorry for replying late] >>>> >>>> Said that, I think it doesn't really necessary need to be that complex, >>>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC >>>> what you need to do is simply make sure dirty bit set when write_hint=1. >>>> >>>> Does it sounds correct to you? >>> >>> Hmm, hold on... I failed to figure out how that write-likely hint could >>> help us for either huge or non-huge pages, since: >>> >>> (1) Old code always set dirty, so no perf degrade anyway with/without the >>> hint >>> >>> (2) If we want to rework dirty bit (which I'm totally fine with..), then >>> we don't apply it when we shouldn't, and afaict we should set D bit >>> whenever we should... if the user assumes this page is likely to be >>> written but made it read-only, say, with UFFDIO_COPY(wp_mode=1), >>> setting D bit will not help, instead, the user should simply use an >>> UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1.. >>> >>> It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one >>> COW. But that seems to be it. >>> >>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint >>> as you proposed earlier. We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE >>> separately, but keep that only for zeropage op (and it shouldn't really be >>> called WRITE_LIKELY)? Or did I miss something? >> >> Let’s see if I get you correctly. I am not sure whether we had this >> discussion before. >> >> We are talking about a scenario in which WP=0. You argue that if the page >> is already set as dirty, what is the benefit of not setting the dirty-bit, >> right? >> >> So first, IIUC, there are cases in which the page would not be set as >> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this >> use-case, so I say it based on the comments. ] >> >> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it >> is not written by the user after UFFDI_COPY, marking the PTE as dirty >> when it is mapped would induce overhead, as we discussed before, since >> if/when the PTE is unmapped, TLB flush batching might not be possible. > > I'd hope we don't make an interface design just to service that purpose of > when write=0 and dirty=1 use case that is internal to the kernel so far, > and I still think it's the tlb flush code to change.. or do we have other > use case for this WRITE_LIKELY hint? > > For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then > IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the > user writes to the page in the other mapping, where PageDirty() will start > to be true already even if the pte that to be CONTINUEd will have dirty=0 > in the pte entry. From that pov I still don't see why we need to grant the > user on the dirty bit control, no matter with a hint only, or explicit. > >> >> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover, >> I would reiterate my position (which you guys convinced me in!) > > David convinced you I think :) > >> that having hints that indicate what the user does (WRITE_LIKELY) is a >> better API than something that indicates directly what the kernel should >> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE). > > The hint idea sounds good to me, it's just that we actually have two steps > here: > > (1) We think providing user the control of dirty bit makes sense, then, > (2) We think the flag should be a hint not explicit "set dirty bit" > > I agree with (2) in this case if (1) is applicable. And now I think I'm > questioning myself on (1). > > Fundamentally, access bit has more meaningful context (0 means cold, 1 > means hot), for dirty it's really more a perf thing to me (when clear, > it'll take extra cycles to set it when memory write happens to it; being > clear _may_ help only for the tlb flush example you mentioned but I'm not > fully convinced that's correct). I am not sure we understand each other. I think the benefit of not setting a dirty-bit when a page is not actually written is fundamental, and has inherit performance benefit. When I did x86’s pte_flags_need_flush(), I was defensive, but there is a basic optimization that is possible to avoid a TLB flush on non-dirty writable PTEs. In x86, consider a situation in which you use ptep_modify_prot_start() to remove a PTE and load its old value using xchg. (A similar case happens on reclaim). Assume you want to write-protect the entry. If the PTE is non-dirty then you should be able to avoid a flush, even if the PTE is writable. In x86, a write and the change of the dirty-bit are performed both atomically. Therefore, if the dirty-bit on the old PTE was clear, you can avoid a TLB flush. Besides the benefit of avoiding a TLB flush, there is also the benefit of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be preceded by memory write to the shared memory, but that does not have to be the case. Similarly, if in the future userfaultfd would also support memory-backed private mappings, that does not have to be the case either. Putting all of the above aside, there is a bug in my code, but this bug also points why dirty should not be set unconditionally. If someone uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and soft-dirty) might be misleading, causing unnecessary userspace writeback of memory. So I do need to fix my code so it would not write-unprotect memory if soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY. > > Maybe with the to be proposed RFC patch for tlb flush we can know whether > that should be something we can rely on. It'll add more dependency on this > work which I'm sorry to say. It's just that IMHO we should think carefully > for the write-hint because this is a solid new uABI we're talking about. > > The other option is we can introduce the access hint first and think more > on the dirty one (we can always add it when proper). What do you think? > Also, David please chim in anytime if I missed the whole point when you > proposed the idea. > >> >> But this discussion made me think that there are two somewhat related >> matters that we may want to address: >> >> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp >> to proactively make entries writable and save . > > I'm not sure I'm right here, but I think David's patch should have covered > that case? The new helper only checks pte_uffd_wp() based on my memory, > and when resolving page faults uffd-wp bit should have been gone, so it > should be treated the same as normal ptes. Let’s see we get to the same page: mwriteprotect_range() does: change_protection(&tlb, dst_vma, start, start + len, newprot, enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE) As you see no use of MM_CP_TRY_CHANGE_WRITABLE. And then change_pte_range() does: if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pte_write(ptent) && can_change_pte_writable(vma, addr, ptent)) ptent = pte_mkwrite(ptent); If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be writable. Now for the record, we may want an additional optimization, so if a fault handler is woken because a PTE become writable, we do not retry the page fault (since the PTE is already writable). It is a small change, but let’s see first we agree on the patches so far…
On 28.06.22 01:37, Nadav Amit wrote: > [ +Dave Hansen to say how wrong I am ] > >> On Jun 27, 2022, at 6:12 AM, Peter Xu <peterx@redhat.com> wrote: >> >> ⚠ External Email >> >> On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote: >>> >>> >>>> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote: >>>> >>>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote: >>>>> [Sorry for replying late] >>>>> >>>>> Said that, I think it doesn't really necessary need to be that complex, >>>>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC >>>>> what you need to do is simply make sure dirty bit set when write_hint=1. >>>>> >>>>> Does it sounds correct to you? >>>> >>>> Hmm, hold on... I failed to figure out how that write-likely hint could >>>> help us for either huge or non-huge pages, since: >>>> >>>> (1) Old code always set dirty, so no perf degrade anyway with/without the >>>> hint >>>> >>>> (2) If we want to rework dirty bit (which I'm totally fine with..), then >>>> we don't apply it when we shouldn't, and afaict we should set D bit >>>> whenever we should... if the user assumes this page is likely to be >>>> written but made it read-only, say, with UFFDIO_COPY(wp_mode=1), >>>> setting D bit will not help, instead, the user should simply use an >>>> UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1.. >>>> >>>> It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one >>>> COW. But that seems to be it. >>>> >>>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint >>>> as you proposed earlier. We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE >>>> separately, but keep that only for zeropage op (and it shouldn't really be >>>> called WRITE_LIKELY)? Or did I miss something? >>> >>> Let’s see if I get you correctly. I am not sure whether we had this >>> discussion before. >>> >>> We are talking about a scenario in which WP=0. You argue that if the page >>> is already set as dirty, what is the benefit of not setting the dirty-bit, >>> right? >>> >>> So first, IIUC, there are cases in which the page would not be set as >>> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this >>> use-case, so I say it based on the comments. ] >>> >>> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it >>> is not written by the user after UFFDI_COPY, marking the PTE as dirty >>> when it is mapped would induce overhead, as we discussed before, since >>> if/when the PTE is unmapped, TLB flush batching might not be possible. >> >> I'd hope we don't make an interface design just to service that purpose of >> when write=0 and dirty=1 use case that is internal to the kernel so far, >> and I still think it's the tlb flush code to change.. or do we have other >> use case for this WRITE_LIKELY hint? >> >> For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then >> IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the >> user writes to the page in the other mapping, where PageDirty() will start >> to be true already even if the pte that to be CONTINUEd will have dirty=0 >> in the pte entry. From that pov I still don't see why we need to grant the >> user on the dirty bit control, no matter with a hint only, or explicit. >> >>> >>> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover, >>> I would reiterate my position (which you guys convinced me in!) >> >> David convinced you I think :) >> >>> that having hints that indicate what the user does (WRITE_LIKELY) is a >>> better API than something that indicates directly what the kernel should >>> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE). >> >> The hint idea sounds good to me, it's just that we actually have two steps >> here: >> >> (1) We think providing user the control of dirty bit makes sense, then, >> (2) We think the flag should be a hint not explicit "set dirty bit" >> >> I agree with (2) in this case if (1) is applicable. And now I think I'm >> questioning myself on (1). >> >> Fundamentally, access bit has more meaningful context (0 means cold, 1 >> means hot), for dirty it's really more a perf thing to me (when clear, >> it'll take extra cycles to set it when memory write happens to it; being >> clear _may_ help only for the tlb flush example you mentioned but I'm not >> fully convinced that's correct). > > I am not sure we understand each other. I think the benefit of not setting > a dirty-bit when a page is not actually written is fundamental, and has > inherit performance benefit. > > When I did x86’s pte_flags_need_flush(), I was defensive, but there is a > basic optimization that is possible to avoid a TLB flush on non-dirty > writable PTEs. > > In x86, consider a situation in which you use ptep_modify_prot_start() > to remove a PTE and load its old value using xchg. (A similar case happens > on reclaim). Assume you want to write-protect the entry. > > If the PTE is non-dirty then you should be able to avoid a flush, even if > the PTE is writable. In x86, a write and the change of the dirty-bit are > performed both atomically. Therefore, if the dirty-bit on the old PTE was > clear, you can avoid a TLB flush. > > Besides the benefit of avoiding a TLB flush, there is also the benefit > of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be > preceded by memory write to the shared memory, but that does not have to > be the case. Similarly, if in the future userfaultfd would also support > memory-backed private mappings, that does not have to be the case either. > > Putting all of the above aside, there is a bug in my code, but this > bug also points why dirty should not be set unconditionally. If someone > uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and > soft-dirty) might be misleading, causing unnecessary userspace writeback > of memory. > > So I do need to fix my code so it would not write-unprotect memory if > soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But > I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY. > >> >> Maybe with the to be proposed RFC patch for tlb flush we can know whether >> that should be something we can rely on. It'll add more dependency on this >> work which I'm sorry to say. It's just that IMHO we should think carefully >> for the write-hint because this is a solid new uABI we're talking about. >> >> The other option is we can introduce the access hint first and think more >> on the dirty one (we can always add it when proper). What do you think? >> Also, David please chim in anytime if I missed the whole point when you >> proposed the idea. >> >>> >>> But this discussion made me think that there are two somewhat related >>> matters that we may want to address: >>> >>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp >>> to proactively make entries writable and save . >> >> I'm not sure I'm right here, but I think David's patch should have covered >> that case? The new helper only checks pte_uffd_wp() based on my memory, >> and when resolving page faults uffd-wp bit should have been gone, so it >> should be treated the same as normal ptes. > > Let’s see we get to the same page: > > mwriteprotect_range() does: > > change_protection(&tlb, dst_vma, start, start + len, newprot, > enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE) > > As you see no use of MM_CP_TRY_CHANGE_WRITABLE. > > And then change_pte_range() does: > > if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && > !pte_write(ptent) && > can_change_pte_writable(vma, addr, ptent)) > ptent = pte_mkwrite(ptent); Right, I think in a previous version of my patch (before you guys convinced me to introduce MM_CP_TRY_CHANGE_WRITABLE :P ) it would have done it automatically (for private mappings). We might have to add it to some callers now manually to not only consider mprotect.
On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote: > [ +Dave Hansen to say how wrong I am ] > > > On Jun 27, 2022, at 6:12 AM, Peter Xu <peterx@redhat.com> wrote: > > > > ⚠ External Email > > > > On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote: > >> > >> > >>> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote: > >>> > >>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote: > >>>> [Sorry for replying late] > >>>> > >>>> Said that, I think it doesn't really necessary need to be that complex, > >>>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC > >>>> what you need to do is simply make sure dirty bit set when write_hint=1. > >>>> > >>>> Does it sounds correct to you? > >>> > >>> Hmm, hold on... I failed to figure out how that write-likely hint could > >>> help us for either huge or non-huge pages, since: > >>> > >>> (1) Old code always set dirty, so no perf degrade anyway with/without the > >>> hint > >>> > >>> (2) If we want to rework dirty bit (which I'm totally fine with..), then > >>> we don't apply it when we shouldn't, and afaict we should set D bit > >>> whenever we should... if the user assumes this page is likely to be > >>> written but made it read-only, say, with UFFDIO_COPY(wp_mode=1), > >>> setting D bit will not help, instead, the user should simply use an > >>> UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1.. > >>> > >>> It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one > >>> COW. But that seems to be it. > >>> > >>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint > >>> as you proposed earlier. We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE > >>> separately, but keep that only for zeropage op (and it shouldn't really be > >>> called WRITE_LIKELY)? Or did I miss something? > >> > >> Let’s see if I get you correctly. I am not sure whether we had this > >> discussion before. > >> > >> We are talking about a scenario in which WP=0. You argue that if the page > >> is already set as dirty, what is the benefit of not setting the dirty-bit, > >> right? > >> > >> So first, IIUC, there are cases in which the page would not be set as > >> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this > >> use-case, so I say it based on the comments. ] > >> > >> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it > >> is not written by the user after UFFDI_COPY, marking the PTE as dirty > >> when it is mapped would induce overhead, as we discussed before, since > >> if/when the PTE is unmapped, TLB flush batching might not be possible. > > > > I'd hope we don't make an interface design just to service that purpose of > > when write=0 and dirty=1 use case that is internal to the kernel so far, > > and I still think it's the tlb flush code to change.. or do we have other > > use case for this WRITE_LIKELY hint? > > > > For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then > > IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the > > user writes to the page in the other mapping, where PageDirty() will start > > to be true already even if the pte that to be CONTINUEd will have dirty=0 > > in the pte entry. From that pov I still don't see why we need to grant the > > user on the dirty bit control, no matter with a hint only, or explicit. > > > >> > >> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover, > >> I would reiterate my position (which you guys convinced me in!) > > > > David convinced you I think :) > > > >> that having hints that indicate what the user does (WRITE_LIKELY) is a > >> better API than something that indicates directly what the kernel should > >> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE). > > > > The hint idea sounds good to me, it's just that we actually have two steps > > here: > > > > (1) We think providing user the control of dirty bit makes sense, then, > > (2) We think the flag should be a hint not explicit "set dirty bit" > > > > I agree with (2) in this case if (1) is applicable. And now I think I'm > > questioning myself on (1). > > > > Fundamentally, access bit has more meaningful context (0 means cold, 1 > > means hot), for dirty it's really more a perf thing to me (when clear, > > it'll take extra cycles to set it when memory write happens to it; being > > clear _may_ help only for the tlb flush example you mentioned but I'm not > > fully convinced that's correct). > > I am not sure we understand each other. I think the benefit of not setting > a dirty-bit when a page is not actually written is fundamental, and has > inherit performance benefit. > > When I did x86’s pte_flags_need_flush(), I was defensive, but there is a > basic optimization that is possible to avoid a TLB flush on non-dirty > writable PTEs. > > In x86, consider a situation in which you use ptep_modify_prot_start() > to remove a PTE and load its old value using xchg. (A similar case happens > on reclaim). Assume you want to write-protect the entry. > > If the PTE is non-dirty then you should be able to avoid a flush, even if > the PTE is writable. In x86, a write and the change of the dirty-bit are > performed both atomically. Therefore, if the dirty-bit on the old PTE was > clear, you can avoid a TLB flush. > > Besides the benefit of avoiding a TLB flush, there is also the benefit > of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be > preceded by memory write to the shared memory, but that does not have to > be the case. Similarly, if in the future userfaultfd would also support > memory-backed private mappings, that does not have to be the case either. Could I ask what's the not supported private mapping you're talking about (and I think you meant "file-backed")? I thought all kinds of private mappings are supported on all modes already? Thanks for explaining, so yes either false positive or false negative on pte dirty bit can bring a side effect on things, and the user knows the best. Makes sense. I think what I overlooked is the downside of having both write==dirty==1 when it doesn't need to. > > Putting all of the above aside, there is a bug in my code, but this > bug also points why dirty should not be set unconditionally. If someone > uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and > soft-dirty) might be misleading, causing unnecessary userspace writeback > of memory. Well I never thought anyone would be using soft-dirty with uffd because logically uffd-wp was somehow trying to replace it with a better interface, hopefully. If to talk about it, IMHO the most important thing is really not the dirty bit but the write bit: even if you leave both the dirty bits cleared (so the user specified !WRITE_HINT then we grant write bit only), it means when the write happens for sure dirty bit will be set on demand but not soft-dirty since we won't generate a page fault at all. AFAICT it'll be a false negative. The old code is safe in that respect on that we always set dirty so we can only have false positive (which is acceptable in this case) but not false negative (which is not). > > So I do need to fix my code so it would not write-unprotect memory if > soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But > I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY. Sorry to say that, but to me it seems a proof that dirty bit is even more tricky than access bit so we need to have better thoughts, even if the WRITE_LIKELY hint sounds straightforward. I now can buy in all the tlb flush points you explained, but still IMHO that's too fine granule (some random ptes in a batched tlb range flush may not even matter!) and I'm just naively wondering whether we need more thoughts for an uABI to be proposed like that. I won't ask for some use cases and especially numbers to prove assuming I'm asking too much, but really that'll be very persuasive if we can get. I'd not worry for ACCESS_LIKELY on it because that's much more straightforward to me. > > > > > Maybe with the to be proposed RFC patch for tlb flush we can know whether > > that should be something we can rely on. It'll add more dependency on this > > work which I'm sorry to say. It's just that IMHO we should think carefully > > for the write-hint because this is a solid new uABI we're talking about. > > > > The other option is we can introduce the access hint first and think more > > on the dirty one (we can always add it when proper). What do you think? > > Also, David please chim in anytime if I missed the whole point when you > > proposed the idea. > > > >> > >> But this discussion made me think that there are two somewhat related > >> matters that we may want to address: > >> > >> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp > >> to proactively make entries writable and save . > > > > I'm not sure I'm right here, but I think David's patch should have covered > > that case? The new helper only checks pte_uffd_wp() based on my memory, > > and when resolving page faults uffd-wp bit should have been gone, so it > > should be treated the same as normal ptes. > > Let’s see we get to the same page: > > mwriteprotect_range() does: > > change_protection(&tlb, dst_vma, start, start + len, newprot, > enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE) > > As you see no use of MM_CP_TRY_CHANGE_WRITABLE. > > And then change_pte_range() does: > > if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && > !pte_write(ptent) && > can_change_pte_writable(vma, addr, ptent)) > ptent = pte_mkwrite(ptent); > > If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be > writable. > > Now for the record, we may want an additional optimization, so if a > fault handler is woken because a PTE become writable, we do not retry > the page fault (since the PTE is already writable). It is a small change, > but let’s see first we agree on the patches so far… Ah I see what I missed, thanks. Another option is we call can_change_pte_writable() somehow in the unprotect branch.
> On Jun 28, 2022, at 12:15 PM, Peter Xu <peterx@redhat.com> wrote: > > ⚠ External Email > > On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote: >> >> Besides the benefit of avoiding a TLB flush, there is also the benefit >> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be >> preceded by memory write to the shared memory, but that does not have to >> be the case. Similarly, if in the future userfaultfd would also support >> memory-backed private mappings, that does not have to be the case either. > > Could I ask what's the not supported private mapping you're talking about > (and I think you meant "file-backed")? I thought all kinds of private > mappings are supported on all modes already? Yes, I meant file-backed. See vma_can_userfault() for the supported types of memory: static inline bool vma_can_userfault(struct vm_area_struct *vma, unsigned long vm_flags) { if (vm_flags & VM_UFFD_MINOR) return is_vm_hugetlb_page(vma) || vma_is_shmem(vma); #ifndef CONFIG_PTE_MARKER_UFFD_WP /* * If user requested uffd-wp but not enabled pte markers for * uffd-wp, then shmem & hugetlbfs are not supported but only * anonymous. */ if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma)) return false; #endif return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || vma_is_shmem(vma); } > > Thanks for explaining, so yes either false positive or false negative on > pte dirty bit can bring a side effect on things, and the user knows the > best. Makes sense. I think what I overlooked is the downside of having > both write==dirty==1 when it doesn't need to. > >> >> Putting all of the above aside, there is a bug in my code, but this >> bug also points why dirty should not be set unconditionally. If someone >> uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and >> soft-dirty) might be misleading, causing unnecessary userspace writeback >> of memory. > > Well I never thought anyone would be using soft-dirty with uffd because > logically uffd-wp was somehow trying to replace it with a better interface, > hopefully. I have heard about some who does (not me). So I do not make it up, unfortunately. > > If to talk about it, IMHO the most important thing is really not the dirty > bit but the write bit: even if you leave both the dirty bits cleared (so > the user specified !WRITE_HINT then we grant write bit only), it means when > the write happens for sure dirty bit will be set on demand but not > soft-dirty since we won't generate a page fault at all. AFAICT it'll be a > false negative. > > The old code is safe in that respect on that we always set dirty so we can > only have false positive (which is acceptable in this case) but not false > negative (which is not). I agree that the PTE should be left RO in this case. I was just pointing that the user might not be aware of soft-dirty being used, and then it makes sense to treat the (lack of) write-hint appropriately. In the current code, when PTEs are unconditionally marked as dirty, even when they are write-protected, soft-dirty would produce false-positives. Nothing would break, but it is not good for performance either. > >> >> So I do need to fix my code so it would not write-unprotect memory if >> soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But >> I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY. > > Sorry to say that, but to me it seems a proof that dirty bit is even more > tricky than access bit so we need to have better thoughts, even if the > WRITE_LIKELY hint sounds straightforward. I now can buy in all the tlb > flush points you explained, but still IMHO that's too fine granule (some > random ptes in a batched tlb range flush may not even matter!) and I'm just > naively wondering whether we need more thoughts for an uABI to be proposed > like that. I won't ask for some use cases and especially numbers to prove > assuming I'm asking too much, but really that'll be very persuasive if we > can get. I'd not worry for ACCESS_LIKELY on it because that's much more > straightforward to me. Let me get back to you with numbers. > >> >>> >>> Maybe with the to be proposed RFC patch for tlb flush we can know whether >>> that should be something we can rely on. It'll add more dependency on this >>> work which I'm sorry to say. It's just that IMHO we should think carefully >>> for the write-hint because this is a solid new uABI we're talking about. >>> >>> The other option is we can introduce the access hint first and think more >>> on the dirty one (we can always add it when proper). What do you think? >>> Also, David please chim in anytime if I missed the whole point when you >>> proposed the idea. >>> >>>> >>>> But this discussion made me think that there are two somewhat related >>>> matters that we may want to address: >>>> >>>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp >>>> to proactively make entries writable and save . >>> >>> I'm not sure I'm right here, but I think David's patch should have covered >>> that case? The new helper only checks pte_uffd_wp() based on my memory, >>> and when resolving page faults uffd-wp bit should have been gone, so it >>> should be treated the same as normal ptes. >> >> Let’s see we get to the same page: >> >> mwriteprotect_range() does: >> >> change_protection(&tlb, dst_vma, start, start + len, newprot, >> enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE) >> >> As you see no use of MM_CP_TRY_CHANGE_WRITABLE. >> >> And then change_pte_range() does: >> >> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && >> !pte_write(ptent) && >> can_change_pte_writable(vma, addr, ptent)) >> ptent = pte_mkwrite(ptent); >> >> If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be >> writable. >> >> Now for the record, we may want an additional optimization, so if a >> fault handler is woken because a PTE become writable, we do not retry >> the page fault (since the PTE is already writable). It is a small change, >> but let’s see first we agree on the patches so far… > > Ah I see what I missed, thanks. > > Another option is we call can_change_pte_writable() somehow in the > unprotect branch. I don’t think that I got this one. tl;dr: I’ll run some numbers (access when PTE-dirty/clear), address the aforementioned issues and send v2. Based on the results we can decide whether it makes sense.
On Tue, Jun 28, 2022 at 08:30:46PM +0000, Nadav Amit wrote: > > > > On Jun 28, 2022, at 12:15 PM, Peter Xu <peterx@redhat.com> wrote: > > > > ⚠ External Email > > > > On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote: > >> > >> Besides the benefit of avoiding a TLB flush, there is also the benefit > >> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be > >> preceded by memory write to the shared memory, but that does not have to > >> be the case. Similarly, if in the future userfaultfd would also support > >> memory-backed private mappings, that does not have to be the case either. > > > > Could I ask what's the not supported private mapping you're talking about > > (and I think you meant "file-backed")? I thought all kinds of private > > mappings are supported on all modes already? > > Yes, I meant file-backed. See vma_can_userfault() for the supported > types of memory: > > static inline bool vma_can_userfault(struct vm_area_struct *vma, > unsigned long vm_flags) > { > if (vm_flags & VM_UFFD_MINOR) > return is_vm_hugetlb_page(vma) || vma_is_shmem(vma); > > #ifndef CONFIG_PTE_MARKER_UFFD_WP > /* > * If user requested uffd-wp but not enabled pte markers for > * uffd-wp, then shmem & hugetlbfs are not supported but only > * anonymous. > */ > if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma)) > return false; > #endif > return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || > vma_is_shmem(vma); > } I'm confused. Let me ask in another way: do you mean !VM_SHARED when you say "private"? Both shmem & hugetlb support private mappings for all three modes, afaict. > > Well I never thought anyone would be using soft-dirty with uffd because > > logically uffd-wp was somehow trying to replace it with a better interface, > > hopefully. > > I have heard about some who does (not me). So I do not make it up, > unfortunately. If in any form you can get the reason of using soft-dirty in that use case please kindly share more. I think it could be that uffd-wp is not working as good as soft-dirty in some way but we can think about it when there's a valid use case, so ultimately it's more possible for a full replacement. Sync messages can be a problem and they're indeed slow, soft-dirty is by nature async. But still I'd like to check in case you know. > > > > > If to talk about it, IMHO the most important thing is really not the dirty > > bit but the write bit: even if you leave both the dirty bits cleared (so > > the user specified !WRITE_HINT then we grant write bit only), it means when > > the write happens for sure dirty bit will be set on demand but not > > soft-dirty since we won't generate a page fault at all. AFAICT it'll be a > > false negative. > > > > The old code is safe in that respect on that we always set dirty so we can > > only have false positive (which is acceptable in this case) but not false > > negative (which is not). > > I agree that the PTE should be left RO in this case. I was just pointing > that the user might not be aware of soft-dirty being used, and then it > makes sense to treat the (lack of) write-hint appropriately. Yes please. Thanks for raising this topic btw, I never thought about that side effect. [...] > > Another option is we call can_change_pte_writable() somehow in the > > unprotect branch. > > I don’t think that I got this one. No worry, feel free to ignore it. It's probably not ideal anyway because we need an extra pte_mkwrite() there too when can_change_pte_writable() returns true. > tl;dr: I’ll run some numbers (access when PTE-dirty/clear), address the > aforementioned issues and send v2. Based on the results we can decide > whether it makes sense. That'll be very appreciated. Thanks a lot!
On Jun 28, 2022, at 1:56 PM, Peter Xu <peterx@redhat.com> wrote: > On Tue, Jun 28, 2022 at 08:30:46PM +0000, Nadav Amit wrote: >>> On Jun 28, 2022, at 12:15 PM, Peter Xu <peterx@redhat.com> wrote: >>> >>> ⚠ External Email >>> >>> On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote: >>>> Besides the benefit of avoiding a TLB flush, there is also the benefit >>>> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be >>>> preceded by memory write to the shared memory, but that does not have to >>>> be the case. Similarly, if in the future userfaultfd would also support >>>> memory-backed private mappings, that does not have to be the case either. >>> >>> Could I ask what's the not supported private mapping you're talking about >>> (and I think you meant "file-backed")? I thought all kinds of private >>> mappings are supported on all modes already? >> >> Yes, I meant file-backed. See vma_can_userfault() for the supported >> types of memory: >> >> static inline bool vma_can_userfault(struct vm_area_struct *vma, >> unsigned long vm_flags) >> { >> if (vm_flags & VM_UFFD_MINOR) >> return is_vm_hugetlb_page(vma) || vma_is_shmem(vma); >> >> #ifndef CONFIG_PTE_MARKER_UFFD_WP >> /* >> * If user requested uffd-wp but not enabled pte markers for >> * uffd-wp, then shmem & hugetlbfs are not supported but only >> * anonymous. >> */ >> if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma)) >> return false; >> #endif >> return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || >> vma_is_shmem(vma); >> } > > I'm confused. Let me ask in another way: do you mean !VM_SHARED when you > say "private"? > > Both shmem & hugetlb support private mappings for all three modes, afaict. Sorry, let me more clear. I meant private (!VM_SHARED) file-backed memory. > >>> Well I never thought anyone would be using soft-dirty with uffd because >>> logically uffd-wp was somehow trying to replace it with a better interface, >>> hopefully. >> >> I have heard about some who does (not me). So I do not make it up, >> unfortunately. > > If in any form you can get the reason of using soft-dirty in that use case > please kindly share more. I think it could be that uffd-wp is not working > as good as soft-dirty in some way but we can think about it when there's a > valid use case, so ultimately it's more possible for a full replacement. > > Sync messages can be a problem and they're indeed slow, soft-dirty is by > nature async. But still I'd like to check in case you know. It’s not my thing (not VMware’s either), so I do not feel free of sharing, but anyhow I know very little. I think that project started before uffd-wp was implemented and that’s the reason they use this strange combination. It should not be our main motivation - just to say that users are “creative”. Again, I will get back to you with the rest.
On Tue, Jun 28, 2022 at 09:03:02PM +0000, Nadav Amit wrote: > > Both shmem & hugetlb support private mappings for all three modes, afaict. > > Sorry, let me more clear. I meant private (!VM_SHARED) file-backed memory. I normally use "file-backed" to stand for "shmem + hugetlb" in any uffd context. If you meant anything outside shmem/hugetlb for uffd (e.g. on xfs), then we never support them anyway, do we? Hmm.. What am I ultimately missing?
On Jun 28, 2022, at 2:12 PM, Peter Xu <peterx@redhat.com> wrote: > ⚠ External Email > > On Tue, Jun 28, 2022 at 09:03:02PM +0000, Nadav Amit wrote: >>> Both shmem & hugetlb support private mappings for all three modes, afaict. >> >> Sorry, let me more clear. I meant private (!VM_SHARED) file-backed memory. > > I normally use "file-backed" to stand for "shmem + hugetlb" in any uffd > context. > > If you meant anything outside shmem/hugetlb for uffd (e.g. on xfs), then we > never support them anyway, do we? > > Hmm.. What am I ultimately missing? You are not missing anything. We just have communication problems. I meant - we do not support uffd+xfs today, but maybe one day we will. I do not know/think of any implications WRITE_HINT should have in such case, but it would be useful to have the option to decide how to handle something like that in the future (and not bind ourselves to certain behavior now).
On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@gmail.com> wrote: > From: Nadav Amit <namit@vmware.com> > > Using a PTE on x86 with cleared access-bit (aka young-bit) > takes ~600 cycles more than when the access bit is set. At the same > time, setting the access-bit for memory that is not used (e.g., > prefetched) can introduce greater overheads, as the prefetched memory is > reclaimed later than it should be. > > Userfaultfd currently does not set the access-bit (excluding the > huge-pages case). Arguably, it is best to let the user control whether > the access bit should be set or not. The expected use is to request > userfaultfd to set the access-bit when the copy/wp operation is done to > resolve a page-fault, and not to set the access-bit when the memory is > prefetched. > > Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the > young bit to be set. I reply to my own email, but this mostly addresses the concerns that Peter has raised. So I ran the test below on my Haswell (x86), which showed two things: 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles more than with dirty+young (depending on the access, of course: dirty does not matter for read, dirty+young both matter for write). 2. I made a mistake in my implementation. PTEs are - at least on x86 - created as young with mk_pte(). So the logic should be similar to do_set_pte(): if (prefault && arch_wants_old_prefaulted_pte()) entry = pte_mkold(entry); else entry = pte_sw_mkyoung(entry); Based on these results, I will send another version for both young and dirty. Let me know if these results are not convincing. I will add, as we discussed (well, I think I raised these things, so hopefully you agree): 1. On x86, avoid flush if changing WP->RO and PTE is clean. 2. When write-unprotecting entry, if PTE is exclusive, set it as writable. [ I considered not setting it as writable if write-hint is not provided, but with the change in (1), it does not provide any real value. ] --- #define _GNU_SOURCE #include <sys/types.h> #include <stdio.h> #include <linux/userfaultfd.h> #include <pthread.h> #include <errno.h> #include <unistd.h> #include <stdlib.h> #include <fcntl.h> #include <signal.h> #include <poll.h> #include <string.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/ioctl.h> #include <stdint.h> #include <stdbool.h> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \ } while (0) static inline uint64_t rdtscp(void) { uint64_t rax, rdx; uint32_t aux; asm volatile ("rdtscp" : "=a" (rax), "=d" (rdx), "=c" (aux):: "memory"); } int main(int argc, char *argv[]) { long uffd; /* userfaultfd file descriptor */ char *addr; /* Start of region handled by userfaultfd */ unsigned long len; /* Length of region handled by userfaultfd */ pthread_t thr; /* ID of thread that handles page faults */ bool young, dirty, write; struct uffdio_api uffdio_api; struct uffdio_register uffdio_register; int l; static char *page = NULL; struct uffdio_copy uffdio_copy; ssize_t nread; int page_size; if (argc != 5) { fprintf(stderr, "Usage: %s [num-pages] [write] [young] [dirty]\n", argv[0]); exit(EXIT_FAILURE); } page_size = sysconf(_SC_PAGE_SIZE); len = strtoul(argv[1], NULL, 0) * page_size; write = !!strtoul(argv[2], NULL, 0); young = !!strtoul(argv[3], NULL, 0); dirty = !!strtoul(argv[4], NULL, 0); page = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (page == MAP_FAILED) errExit("mmap"); uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); if (uffd == -1) errExit("userfaultfd"); uffdio_api.api = UFFD_API; uffdio_api.features = (1<<11); //UFFD_FEATURE_EXACT_ADDRESS; if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) errExit("ioctl-UFFDIO_API"); addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (addr == MAP_FAILED) errExit("mmap"); uffdio_register.range.start = (unsigned long) addr; uffdio_register.range.len = len; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) errExit("ioctl-UFFDIO_REGISTER"); uffdio_copy.src = (unsigned long) page; uffdio_copy.mode = 0; if (young) uffdio_copy.mode |= (1ul << 2); if (dirty) uffdio_copy.mode |= (1ul << 3); uffdio_copy.len = page_size; uffdio_copy.copy = 0; for (l = 0; l < len; l += page_size) { uffdio_copy.dst = (unsigned long)(&addr[l]); if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1) errExit("ioctl-UFFDIO_COPY"); } for (l = 0; l < len; l += page_size) { char c; uint64_t start; start = rdtscp(); if (write) addr[l] = 5; else c = *(volatile char *)(&addr[l]); printf("%ld\n", rdtscp() - start); } exit(EXIT_SUCCESS); }
Hi, Nadav, On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote: > On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@gmail.com> wrote: > > > From: Nadav Amit <namit@vmware.com> > > > > Using a PTE on x86 with cleared access-bit (aka young-bit) > > takes ~600 cycles more than when the access bit is set. At the same > > time, setting the access-bit for memory that is not used (e.g., > > prefetched) can introduce greater overheads, as the prefetched memory is > > reclaimed later than it should be. > > > > Userfaultfd currently does not set the access-bit (excluding the > > huge-pages case). Arguably, it is best to let the user control whether > > the access bit should be set or not. The expected use is to request > > userfaultfd to set the access-bit when the copy/wp operation is done to > > resolve a page-fault, and not to set the access-bit when the memory is > > prefetched. > > > > Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the > > young bit to be set. > > I reply to my own email, but this mostly addresses the concerns that Peter > has raised. > > So I ran the test below on my Haswell (x86), which showed two things: > > 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles > more than with dirty+young (depending on the access, of course: dirty > does not matter for read, dirty+young both matter for write). > > 2. I made a mistake in my implementation. PTEs are - at least on x86 - > created as young with mk_pte(). So the logic should be similar to > do_set_pte(): > > if (prefault && arch_wants_old_prefaulted_pte()) > entry = pte_mkold(entry); > else > entry = pte_sw_mkyoung(entry); > > Based on these results, I will send another version for both young and > dirty. Let me know if these results are not convincing. Thanks for trying to verify this idea, but I'm not fully sure this is what my concern was on WRITE_LIKELY. AFAICT the test below was trying to measure the overhead of hardware setting either access or dirty or both bits when they're not set for read/write. What I wanted as a justification is whether WRITE_LIKELY would be helpful in any real world scenario at all. AFAIK the only way to prove it so far is to measure any tlb flush difference (probably only on x86, since that tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not trigger with W=0,D=0 (where W stands for "write bit", and D stands for "dirty bit"). It's not about the slowness when D is cleared. The core thing is (sorry to rephrase, but just hope we're on the same page) we'll set D bit always for all uffd pages so far. Even if we want to change that behavior so we skip setting D bit for RO pages (we'll need to convert the dirty bit into PageDirty though), we'll still always set D bit for writable pages. So we always set D bit as long as possible and we'll never suffer from hardware overhead on setting D bit for uffd pages. The other worry of having WRITE_HINT is, after we have it we probably need to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a very light ABI change since we used to always set it), then I'll start to worry the hardware setting D bit overhead you just measured because we'll have that overhead when user didn't specify WRITE_HINT with the old code. So again, I'm totally fine if you want to start with ACCESS_HINT only, but I still don't see why we should need WRITE_HINT too.. Thanks, > > I will add, as we discussed (well, I think I raised these things, so > hopefully you agree): > > 1. On x86, avoid flush if changing WP->RO and PTE is clean. > > 2. When write-unprotecting entry, if PTE is exclusive, set it as writable. > [ I considered not setting it as writable if write-hint is not provided, but > with the change in (1), it does not provide any real value. ] > > --- > > #define _GNU_SOURCE > #include <sys/types.h> > #include <stdio.h> > #include <linux/userfaultfd.h> > #include <pthread.h> > #include <errno.h> > #include <unistd.h> > #include <stdlib.h> > #include <fcntl.h> > #include <signal.h> > #include <poll.h> > #include <string.h> > #include <sys/mman.h> > #include <sys/syscall.h> > #include <sys/ioctl.h> > #include <stdint.h> > #include <stdbool.h> > > #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \ > } while (0) > > static inline uint64_t rdtscp(void) > { > uint64_t rax, rdx; > uint32_t aux; > asm volatile ("rdtscp" : "=a" (rax), "=d" (rdx), "=c" (aux):: "memory"); > } > > int main(int argc, char *argv[]) > { > long uffd; /* userfaultfd file descriptor */ > char *addr; /* Start of region handled by userfaultfd */ > unsigned long len; /* Length of region handled by userfaultfd */ > pthread_t thr; /* ID of thread that handles page faults */ > bool young, dirty, write; > struct uffdio_api uffdio_api; > struct uffdio_register uffdio_register; > int l; > static char *page = NULL; > struct uffdio_copy uffdio_copy; > ssize_t nread; > int page_size; > > if (argc != 5) { > fprintf(stderr, "Usage: %s [num-pages] [write] [young] [dirty]\n", argv[0]); > exit(EXIT_FAILURE); > } > > page_size = sysconf(_SC_PAGE_SIZE); > len = strtoul(argv[1], NULL, 0) * page_size; > write = !!strtoul(argv[2], NULL, 0); > young = !!strtoul(argv[3], NULL, 0); > dirty = !!strtoul(argv[4], NULL, 0); > > page = mmap(NULL, page_size, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > if (page == MAP_FAILED) > errExit("mmap"); > > uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); > if (uffd == -1) > errExit("userfaultfd"); > > uffdio_api.api = UFFD_API; > uffdio_api.features = (1<<11); //UFFD_FEATURE_EXACT_ADDRESS; > if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) > errExit("ioctl-UFFDIO_API"); > > addr = mmap(NULL, len, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > if (addr == MAP_FAILED) > errExit("mmap"); > > uffdio_register.range.start = (unsigned long) addr; > uffdio_register.range.len = len; > uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; > if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) > errExit("ioctl-UFFDIO_REGISTER"); > > uffdio_copy.src = (unsigned long) page; > uffdio_copy.mode = 0; > if (young) > uffdio_copy.mode |= (1ul << 2); > if (dirty) > uffdio_copy.mode |= (1ul << 3); > > uffdio_copy.len = page_size; > uffdio_copy.copy = 0; > > for (l = 0; l < len; l += page_size) { > uffdio_copy.dst = (unsigned long)(&addr[l]); > if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1) > errExit("ioctl-UFFDIO_COPY"); > } > > for (l = 0; l < len; l += page_size) { > char c; > uint64_t start; > > start = rdtscp(); > if (write) > addr[l] = 5; > else > c = *(volatile char *)(&addr[l]); > printf("%ld\n", rdtscp() - start); > } > > exit(EXIT_SUCCESS); > } >
On Jul 12, 2022, at 7:56 AM, Peter Xu <peterx@redhat.com> wrote: > Hi, Nadav, > > On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote: >> On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@gmail.com> wrote: >> >>> From: Nadav Amit <namit@vmware.com> >>> >>> Using a PTE on x86 with cleared access-bit (aka young-bit) >>> takes ~600 cycles more than when the access bit is set. At the same >>> time, setting the access-bit for memory that is not used (e.g., >>> prefetched) can introduce greater overheads, as the prefetched memory is >>> reclaimed later than it should be. >>> >>> Userfaultfd currently does not set the access-bit (excluding the >>> huge-pages case). Arguably, it is best to let the user control whether >>> the access bit should be set or not. The expected use is to request >>> userfaultfd to set the access-bit when the copy/wp operation is done to >>> resolve a page-fault, and not to set the access-bit when the memory is >>> prefetched. >>> >>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the >>> young bit to be set. >> >> I reply to my own email, but this mostly addresses the concerns that Peter >> has raised. >> >> So I ran the test below on my Haswell (x86), which showed two things: >> >> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles >> more than with dirty+young (depending on the access, of course: dirty >> does not matter for read, dirty+young both matter for write). >> >> 2. I made a mistake in my implementation. PTEs are - at least on x86 - >> created as young with mk_pte(). So the logic should be similar to >> do_set_pte(): >> >> if (prefault && arch_wants_old_prefaulted_pte()) >> entry = pte_mkold(entry); >> else >> entry = pte_sw_mkyoung(entry); >> >> Based on these results, I will send another version for both young and >> dirty. Let me know if these results are not convincing. > > Thanks for trying to verify this idea, but I'm not fully sure this is what > my concern was on WRITE_LIKELY. > > AFAICT the test below was trying to measure the overhead of hardware > setting either access or dirty or both bits when they're not set for > read/write. Indeed. > > What I wanted as a justification is whether WRITE_LIKELY would be helpful > in any real world scenario at all. AFAIK the only way to prove it so far > is to measure any tlb flush difference (probably only on x86, since that > tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not > trigger with W=0,D=0 (where W stands for "write bit", and D stands for > "dirty bit"). > > It's not about the slowness when D is cleared. > > The core thing is (sorry to rephrase, but just hope we're on the same page) > we'll set D bit always for all uffd pages so far. Even if we want to > change that behavior so we skip setting D bit for RO pages (we'll need to > convert the dirty bit into PageDirty though), we'll still always set D bit > for writable pages. So we always set D bit as long as possible and we'll > never suffer from hardware overhead on setting D bit for uffd pages. Thanks as usual for your clarifications. As you see, I also do my best to be on the same page with, even if from time to time I fail. I had some recent communication challenges on lkml, so I hope that you understand that everything I say is said with full respect, and if I use double-quotes while arguing with you, it is in good spirit, and I really appreciate your feedback. Ok. So there is a lot to digest in what you just said, and I politely disagree with some of the assertions that you made. You focus on discussing the issue of whether we set the dirty bit for RO pages, which in my opinion is so intuitively wrong. But, I think that discussing this issue really digress us from the benefits of not setting the D-bit when it is unnecessary for RW pages, which is the main question. But before we get to it, I want to argue with some of the “facts” that you present: 1. "D bit always for all uffd pages” - This is true for almost all UFFD operations, but not true to write-unprotect. The moment we use David’s MM_CP_TRY_CHANGE_WRITABLE in UFFD, the PTE would be writable but not dirty. [Yes, the patches that I sent do not deal with that: as I noted before, I want to send it as part of v2.] Arguably, one of the places that setting the D-bit matters the most if when you change PTE from RO->RW. And anyhow, as you can see the API is inconsistent. 2. "we'll still always set D bit for writable pages” - To continue my previous bullet - why? Why would we? Besides uffd-wrunprotect path, why would we always set it for MCOPY_CONTINUE? (more to follow on this one) 3. "measure any tlb flush difference … that may trigger with W=0,D=1 but may not trigger with W=0,D=0” - This is really a boring case, which even if was underoptimized could have been resolved by its own. This is certainly not the motivation for the write hints. So please allow me to go back to the reasons for why you want write-hints, and RO entries would be mostly left out and implicitly included in the other arguments - which are mainly about RW entries: 1. You can avoid TLB flushes when write-protecting clean writable PTEs (on x86). Such PTEs can be created when userspace monitor prefaults or speculatively write-unprotects memory that might be needed. When a monitor removes the mapping, using MADV_DONTNEED, it would not need to flush anything. 2. Hopefully you agree that write-hints are needed if during uffd-write-unprotect we actually write-unprotect the PTE (not just clearing the logical flag). So you do need write-hint for zero-page (to get a clear-page) and for write-unprotect, so why not to be consistent and provide it for all operations? 3. For UFFDIO_CONTINUE. Admittedly, I am not very familiar with UFFDIO_CONTINUE, but from the discussion (and skimming the code) I understood that you can be used for prefetching. In such case, why would you assume that the page is dirty? 4. It allows you to treat softdirty properly. If softdirty is on, presumably, if you did not get a WRITE_HINT, you would keep it writeprotected. 5. Consistency with UFFDIO_ZERO that needs a write-hint to clear a page. Now I will “kill myself” over support of write-hints. Not everything that I mentioned here is in v1 that I sent. But, if you decide you do not want write-hints, I would still need to leave the UFFDIO_ZERO write-hints (that clear the page) and to find some solution for UFFDIO_WRITEPROTECT (that should set the dirty-bit for better performance of WP page-fault handling). If you decide you do want it, I would run some tests to check that indeed the access-time of UFFDIO_WRITEPROTECT reflects the fact no TLB flush was needed. > The other worry of having WRITE_HINT is, after we have it we probably need > to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a > very light ABI change since we used to always set it), then I'll start to > worry the hardware setting D bit overhead you just measured because we'll > have that overhead when user didn't specify WRITE_HINT with the old code. Good point, which I have already got to. So, because I was mistaken on the ACCESS_HINT understanding (young-bit is set by default), this would mean that the hints should only be regarded when the “hints” feature is enabled for backward compatibility. I got this code ready for v2. > > So again, I'm totally fine if you want to start with ACCESS_HINT only, but > I still don't see why we should need WRITE_HINT too.. I really hate how the “features” evolve that I think it is better to decide now. I think that in the lack of agreement, the best thing to do is to put all of the write-hints now in the API, and only regard them for UFFDIO_ZERO (which would clear the page) and UFFDIO_WRITE(un)PROTECT (that would set the old bit). Again, thanks for the feedback, and hopefully (at least) I understood you this time.
On Tue, Jul 12, 2022 at 06:09:35PM -0700, Nadav Amit wrote: > On Jul 12, 2022, at 7:56 AM, Peter Xu <peterx@redhat.com> wrote: > > > Hi, Nadav, > > > > On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote: > >> On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@gmail.com> wrote: > >> > >>> From: Nadav Amit <namit@vmware.com> > >>> > >>> Using a PTE on x86 with cleared access-bit (aka young-bit) > >>> takes ~600 cycles more than when the access bit is set. At the same > >>> time, setting the access-bit for memory that is not used (e.g., > >>> prefetched) can introduce greater overheads, as the prefetched memory is > >>> reclaimed later than it should be. > >>> > >>> Userfaultfd currently does not set the access-bit (excluding the > >>> huge-pages case). Arguably, it is best to let the user control whether > >>> the access bit should be set or not. The expected use is to request > >>> userfaultfd to set the access-bit when the copy/wp operation is done to > >>> resolve a page-fault, and not to set the access-bit when the memory is > >>> prefetched. > >>> > >>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the > >>> young bit to be set. > >> > >> I reply to my own email, but this mostly addresses the concerns that Peter > >> has raised. > >> > >> So I ran the test below on my Haswell (x86), which showed two things: > >> > >> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles > >> more than with dirty+young (depending on the access, of course: dirty > >> does not matter for read, dirty+young both matter for write). > >> > >> 2. I made a mistake in my implementation. PTEs are - at least on x86 - > >> created as young with mk_pte(). So the logic should be similar to > >> do_set_pte(): > >> > >> if (prefault && arch_wants_old_prefaulted_pte()) > >> entry = pte_mkold(entry); > >> else > >> entry = pte_sw_mkyoung(entry); > >> > >> Based on these results, I will send another version for both young and > >> dirty. Let me know if these results are not convincing. > > > > Thanks for trying to verify this idea, but I'm not fully sure this is what > > my concern was on WRITE_LIKELY. > > > > AFAICT the test below was trying to measure the overhead of hardware > > setting either access or dirty or both bits when they're not set for > > read/write. > > Indeed. > > > > > What I wanted as a justification is whether WRITE_LIKELY would be helpful > > in any real world scenario at all. AFAIK the only way to prove it so far > > is to measure any tlb flush difference (probably only on x86, since that > > tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not > > trigger with W=0,D=0 (where W stands for "write bit", and D stands for > > "dirty bit"). > > > > It's not about the slowness when D is cleared. > > > > The core thing is (sorry to rephrase, but just hope we're on the same page) > > we'll set D bit always for all uffd pages so far. Even if we want to > > change that behavior so we skip setting D bit for RO pages (we'll need to > > convert the dirty bit into PageDirty though), we'll still always set D bit > > for writable pages. So we always set D bit as long as possible and we'll > > never suffer from hardware overhead on setting D bit for uffd pages. > > Thanks as usual for your clarifications. As you see, I also do my best to be > on the same page with, even if from time to time I fail. I had some recent > communication challenges on lkml, so I hope that you understand that > everything I say is said with full respect, and if I use double-quotes while > arguing with you, it is in good spirit, and I really appreciate your > feedback. > > Ok. So there is a lot to digest in what you just said, and I politely > disagree with some of the assertions that you made. You focus on discussing > the issue of whether we set the dirty bit for RO pages, which in my opinion > is so intuitively wrong. But, I think that discussing this issue really > digress us from the benefits of not setting the D-bit when it is unnecessary > for RW pages, which is the main question. > > But before we get to it, I want to argue with some of the “facts” that you > present: > > 1. "D bit always for all uffd pages” - This is true for almost all UFFD > operations, but not true to write-unprotect. The moment we use David’s > MM_CP_TRY_CHANGE_WRITABLE in UFFD, the PTE would be writable but not dirty. > [Yes, the patches that I sent do not deal with that: as I noted before, I > want to send it as part of v2.] Arguably, one of the places that setting the > D-bit matters the most if when you change PTE from RO->RW. And anyhow, as > you can see the API is inconsistent. Yes. This point is WRITE_HINT irrelevant, afaict, but I fully agree we should fix it. It'll be great if you'll cover that upon David's patch. [1] > > 2. "we'll still always set D bit for writable pages” - To continue my > previous bullet - why? Why would we? Besides uffd-wrunprotect path, why > would we always set it for MCOPY_CONTINUE? (more to follow on this one) I wanted to mean that we used to always set both D bits when W=1. IIUC that applies to not only uffd but any general pgtable accesses. E.g. in do_anonymous_page() we'll set both D if W=1. Same to anywhere I can find across the kernel. > > 3. "measure any tlb flush difference … that may trigger with W=0,D=1 but may > not trigger with W=0,D=0” - This is really a boring case, which even if was > underoptimized could have been resolved by its own. This is certainly not > the motivation for the write hints. Hmm? I don't see what benefit we can get from the new WRITE_HINT besides avoiding tlb flushes, or do we? See right below.. > > > So please allow me to go back to the reasons for why you want write-hints, > and RO entries would be mostly left out and implicitly included in the > other arguments - which are mainly about RW entries: > > 1. You can avoid TLB flushes when write-protecting clean writable PTEs (on > x86). Such PTEs can be created when userspace monitor prefaults or > speculatively write-unprotects memory that might be needed. When a monitor > removes the mapping, using MADV_DONTNEED, it would not need to flush > anything. Yep, but afaict this is the "avoid tlb flush benefit" we talked above.. > > 2. Hopefully you agree that write-hints are needed if during > uffd-write-unprotect I think I agree on the unprotect above [1] on fixing D bit set if unprotected. If with that fix then I think whether WRITE_HINT would help or not here for unprotect is the same as whether WRITE_HINT would help in other cases like UFFDIO_COPY. So these are two problems to me: (1) Whether we should fix change_pte_range() to set D properly when unprotect (W=1), again I fully agree. (2) Whether WRITE_HINT helps for unprotect is to be discussed here, and we're trying to reach a consensus.. > we actually write-unprotect the PTE (not just clearing > the logical flag). So you do need write-hint for zero-page (to get a > clear-page) and for write-unprotect, so why not to be consistent and provide > it for all operations? I always agree on the zeropage case. But IMHO that's the only special case here. > > 3. For UFFDIO_CONTINUE. Admittedly, I am not very familiar with > UFFDIO_CONTINUE, but from the discussion (and skimming the code) I > understood that you can be used for prefetching. In such case, why would you > assume that the page is dirty? Again, I think it's the same as when we faulting in a normal anonymous page: when we grant write bit we always assume it's dirty. Yes it may not really be written, but don't you think that's a more common question to ask rather than uffd only? IOW, why we always set D bit in do_anonymous_page() even if there's possibility that the page may not be written at all? Actually I'd say we'll have problem if we don't set dirty, because then we won't be able to track soft-dirty anymore just like below - we need an explicit page fault to trap soft-dirty. With W=1, how do we do that? > > 4. It allows you to treat softdirty properly. If softdirty is on, > presumably, if you did not get a WRITE_HINT, you would keep it > writeprotected. Sorry I don't get why WRITE_HINT helps soft-dirty. As I mentioned before, I think it will start to complicate soft-dirty rather than making it better. Basically when the hints are enabled in feature bits and when the user has !WRITE_HINT, we'll need to wr-protect the page for soft-dirty if it's enabled. We don't need that extra complexity if we don't apply WRITE_HINT outside ZEROPAGE. That's really even more than "hardware setting D bit overhead" because it'll be a full path page fault just to trap soft-dirty. I am not sure we're really on the same page here, but I think I'll just wait and read your code if it's coming and comment there if I see anything wrong. Maybe that'll be more straightforward. > > 5. Consistency with UFFDIO_ZERO that needs a write-hint to clear a page. This one is actually point 2 above. Taken. > > Now I will “kill myself” over support of write-hints. Not everything that > I mentioned here is in v1 that I sent. > > But, if you decide you do not want write-hints, I would still need to leave > the UFFDIO_ZERO write-hints (that clear the page) and to find some solution > for UFFDIO_WRITEPROTECT (that should set the dirty-bit for better performance > of WP page-fault handling). > > If you decide you do want it, I would run some tests to check that indeed > the access-time of UFFDIO_WRITEPROTECT reflects the fact no TLB flush was > needed. > > > The other worry of having WRITE_HINT is, after we have it we probably need > > to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a > > very light ABI change since we used to always set it), then I'll start to > > worry the hardware setting D bit overhead you just measured because we'll > > have that overhead when user didn't specify WRITE_HINT with the old code. > > Good point, which I have already got to. So, because I was mistaken on the > ACCESS_HINT understanding (young-bit is set by default), this would mean > that the hints should only be regarded when the “hints” feature is enabled > for backward compatibility. I got this code ready for v2. Sounds good. > > > > > So again, I'm totally fine if you want to start with ACCESS_HINT only, but > > I still don't see why we should need WRITE_HINT too.. > > I really hate how the “features” evolve that I think it is better to decide > now. > > I think that in the lack of agreement, the best thing to do is to put all of > the write-hints now in the API, and only regard them for UFFDIO_ZERO (which > would clear the page) and UFFDIO_WRITE(un)PROTECT (that would set the old > bit). Yes let's fix unprotect on D bit first, because that's inconsistent. Please put it as the 1st patch if you plan to post them together, I think we should have it even earlier if not together with the uffd-hint series. Then, there's one thing we can do as you said - we can introduce WRITE_HINT but only apply it to ZEROPAGE only (note: I still don't see why unprotect is special here, it just has one perf bug to fix). It can be bound to ACCESS_HINT with the same feature bit then we think about whether we want to extend it to other uffd ioctls besides ZEROPAGE. That's probably a good split on the differences. > > Again, thanks for the feedback, and hopefully (at least) I understood you > this time. What you suggested at the end sounds good to me, thanks for being persistent. The only thing uncertain seems to be whether we need WRITE_HINT for UFFDIO_WRITEPROTECT, all the rest sounds a good plan. Thanks,
On Jul 13, 2022, at 9:02 AM, Peter Xu <peterx@redhat.com> wrote: > On Tue, Jul 12, 2022 at 06:09:35PM -0700, Nadav Amit wrote: >> On Jul 12, 2022, at 7:56 AM, Peter Xu <peterx@redhat.com> wrote: > > > What you suggested at the end sounds good to me, thanks for being > persistent. The only thing uncertain seems to be whether we need > WRITE_HINT for UFFDIO_WRITEPROTECT, all the rest sounds a good plan. Peter, thanks for the detailed feedback. I understand and agree with most of what you wrote. I think we still have the mprotect()/UFFDIO_WRITE(un)PROTECT issue, which I either do not understand your position or disagree with it. :) Anyhow, instead of too many words, I’ll send v2 based on your feedback, which would help to refine the behavior and clear any misunderstanding we might have.
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index a44e46f8249f..abf176bd0349 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1726,12 +1726,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx, ret = -EINVAL; if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src) goto out; - if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP)) + if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP| + UFFDIO_COPY_MODE_ACCESS_LIKELY)) goto out; mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP; uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE; + if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY) + uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY; if (mmget_not_zero(ctx->mm)) { ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src, @@ -1783,9 +1786,13 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx, if (ret) goto out; ret = -EINVAL; - if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE) + if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE| + UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)) goto out; + if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY) + uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY; + if (mmget_not_zero(ctx->mm)) { ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start, uffdio_zeropage.range.len, @@ -1835,7 +1842,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx, return ret; if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE | - UFFDIO_WRITEPROTECT_MODE_WP)) + UFFDIO_WRITEPROTECT_MODE_WP | + UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY)) return -EINVAL; mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP; @@ -1845,6 +1853,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx, return -EINVAL; uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE; + if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY) + uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY; if (mmget_not_zero(ctx->mm)) { ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start, @@ -1872,6 +1882,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) struct uffdio_continue uffdio_continue; struct uffdio_continue __user *user_uffdio_continue; struct userfaultfd_wake_range range; + uffd_flags_t uffd_flags = UFFD_FLAGS_NONE; user_uffdio_continue = (struct uffdio_continue __user *)arg; @@ -1896,13 +1907,17 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) uffdio_continue.range.start) { goto out; } - if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE) + if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE| + UFFDIO_CONTINUE_MODE_ACCESS_LIKELY)) goto out; + if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_ACCESS_LIKELY) + uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY; + if (mmget_not_zero(ctx->mm)) { ret = mcopy_continue(ctx->mm, uffdio_continue.range.start, uffdio_continue.range.len, - &ctx->mmap_changing, 0); + &ctx->mmap_changing, uffd_flags); mmput(ctx->mm); } else { return -ESRCH; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index d5b3dff48a87..af268b2c2b27 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -59,6 +59,7 @@ typedef unsigned int __bitwise uffd_flags_t; #define UFFD_FLAGS_NONE ((__force uffd_flags_t)0) #define UFFD_FLAGS_WP ((__force uffd_flags_t)BIT(0)) +#define UFFD_FLAGS_ACCESS_LIKELY ((__force uffd_flags_t)BIT(1)) extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 005e5e306266..ff7150c878bb 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -38,7 +38,8 @@ UFFD_FEATURE_MINOR_HUGETLBFS | \ UFFD_FEATURE_MINOR_SHMEM | \ UFFD_FEATURE_EXACT_ADDRESS | \ - UFFD_FEATURE_WP_HUGETLBFS_SHMEM) + UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \ + UFFD_FEATURE_ACCESS_HINTS) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -203,6 +204,9 @@ struct uffdio_api { * * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd * write-protection mode is supported on both shmem and hugetlbfs. + * + * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations + * support the UFFDIO_*_MODE_ACCESS_LIKELY hints. */ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -217,6 +221,7 @@ struct uffdio_api { #define UFFD_FEATURE_MINOR_SHMEM (1<<10) #define UFFD_FEATURE_EXACT_ADDRESS (1<<11) #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12) +#define UFFD_FEATURE_ACCESS_HINTS (1<<13) __u64 features; __u64 ioctls; @@ -251,8 +256,14 @@ struct uffdio_copy { * the fly. UFFDIO_COPY_MODE_WP is available only if the * write protected ioctl is implemented for the range * according to the uffdio_register.ioctls. + * + * UFFDIO_COPY_MODE_ACCESS_LIKELY provides a hint to the kernel that the + * page is likely to be access in the near future. Providing the hint + * properly can improve performance. + * */ #define UFFDIO_COPY_MODE_WP ((__u64)1<<1) +#define UFFDIO_COPY_MODE_ACCESS_LIKELY ((__u64)1<<2) __u64 mode; /* @@ -265,6 +276,7 @@ struct uffdio_copy { struct uffdio_zeropage { struct uffdio_range range; #define UFFDIO_ZEROPAGE_MODE_DONTWAKE ((__u64)1<<0) +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY ((__u64)1<<1) __u64 mode; /* @@ -284,6 +296,10 @@ struct uffdio_writeprotect { * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up * any wait thread after the operation succeeds. * + * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY provides a hint to the kernel + * that the page is likely to be access in the near future. Providing + * the hint properly can improve performance. + * * NOTE: Write protecting a region (WP=1) is unrelated to page faults, * therefore DONTWAKE flag is meaningless with WP=1. Removing write * protection (WP=0) in response to a page fault wakes the faulting @@ -291,12 +307,14 @@ struct uffdio_writeprotect { */ #define UFFDIO_WRITEPROTECT_MODE_WP ((__u64)1<<0) #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE ((__u64)1<<1) +#define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY ((__u64)1<<2) __u64 mode; }; struct uffdio_continue { struct uffdio_range range; #define UFFDIO_CONTINUE_MODE_DONTWAKE ((__u64)1<<0) +#define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY ((__u64)1<<1) __u64 mode; /* diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 734de6aa0b8e..5051b9028722 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -92,6 +92,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, */ _dst_pte = pte_wrprotect(_dst_pte); + if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY) + _dst_pte = pte_mkyoung(_dst_pte); + dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); if (vma_is_shmem(dst_vma)) { @@ -202,7 +205,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, static int mfill_zeropage_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, - unsigned long dst_addr) + unsigned long dst_addr, + uffd_flags_t uffd_flags) { pte_t _dst_pte, *dst_pte; spinlock_t *ptl; @@ -225,6 +229,10 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm, ret = -EEXIST; if (!pte_none(*dst_pte)) goto out_unlock; + + if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY) + _dst_pte = pte_mkyoung(_dst_pte); + set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); @@ -498,7 +506,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, uffd_flags); else err = mfill_zeropage_pte(dst_mm, dst_pmd, - dst_vma, dst_addr); + dst_vma, dst_addr, uffd_flags); } else { err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, src_addr, @@ -692,7 +700,7 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start, uffd_flags_t uffd_flags) { return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE, - mmap_changing, 0); + mmap_changing, uffd_flags); } ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start, @@ -700,7 +708,7 @@ ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start, uffd_flags_t uffd_flags) { return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE, - mmap_changing, 0); + mmap_changing, uffd_flags); } int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,