Message ID | 20221109102303.851281-1-usama.anjum@collabora.com (mailing list archive) |
---|---|
Headers | show |
Series | Implement IOCTL to get and/or the clear info about PTEs | expand |
On 09.11.22 11:23, Muhammad Usama Anjum wrote: > Changes in v6: > - Updated the interface and made cosmetic changes > > Original Cover Letter in v5: > Hello, > > This patch series implements IOCTL on the pagemap procfs file to get the > information about the page table entries (PTEs). The following operations > are supported in this ioctl: > - Get the information if the pages are soft-dirty, file mapped, present > or swapped. > - Clear the soft-dirty PTE bit of the pages. > - Get and clear the soft-dirty PTE bit of the pages atomically. > > Soft-dirty PTE bit of the memory pages can be read by using the pagemap > procfs file. The soft-dirty PTE bit for the whole memory range of the > process can be cleared by writing to the clear_refs file. There are other > methods to mimic this information entirely in userspace with poor > performance: > - The mprotect syscall and SIGSEGV handler for bookkeeping > - The userfaultfd syscall with the handler for bookkeeping > Some benchmarks can be seen here[1]. This series adds features that weren't > present earlier: > - There is no atomic get soft-dirty PTE bit status and clear operation > possible. > - The soft-dirty PTE bit of only a part of memory cannot be cleared. > > Historically, soft-dirty PTE bit tracking has been used in the CRIU > project. The procfs interface is enough for finding the soft-dirty bit > status and clearing the soft-dirty bit of all the pages of a process. > We have the use case where we need to track the soft-dirty PTE bit for > only specific pages on demand. We need this tracking and clear mechanism > of a region of memory while the process is running to emulate the > getWriteWatch() syscall of Windows. This syscall is used by games to > keep track of dirty pages to process only the dirty pages. > > The information related to pages if the page is file mapped, present and > swapped is required for the CRIU project[2][3]. The addition of the > required mask, any mask, excluded mask and return masks are also required > for the CRIU project[2]. > > The IOCTL returns the addresses of the pages which match the specific masks. > The page addresses are returned in struct page_region in a compact form. > The max_pages is needed to support a use case where user only wants to get > a specific number of pages. So there is no need to find all the pages of > interest in the range when max_pages is specified. The IOCTL returns when > the maximum number of the pages are found. The max_pages is optional. If > max_pages is specified, it must be equal or greater than the vec_size. > This restriction is needed to handle worse case when one page_region only > contains info of one page and it cannot be compacted. This is needed to > emulate the Windows getWriteWatch() syscall. > > Some non-dirty pages get marked as dirty because of the kernel's > internal activity (such as VMA merging as soft-dirty bit difference isn't > considered while deciding to merge VMAs). The dirty bit of the pages is > stored in the VMA flags and in the per page flags. If any of these two bits > are set, the page is considered to be soft dirty. Suppose you have cleared > the soft dirty bit of half of VMA which will be done by splitting the VMA > and clearing soft dirty bit flag in the half VMA and the pages in it. Now > kernel may decide to merge the VMAs again. So the half VMA becomes dirty > again. This splitting/merging costs performance. The application receives > a lot of pages which aren't dirty in reality but marked as dirty. > Performance is lost again here. Also sometimes user doesn't want the newly > allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag > solves both the problems. It is used to not depend on the soft dirty flag > in the VMA flags. So VMA splitting and merging doesn't happen. It only > depends on the soft dirty bit of the individual pages. Thus by using this > flag, there may be a scenerio such that the new memory regions which are > just created, doesn't look dirty when seen with the IOCTL, but look dirty > when seen from procfs. This seems okay as the user of this flag know the > implication of using it. Please separate that part out from the other changes; I am still not convinced that we want this and what the semantical implications are. Let's take a look at an example: can_change_pte_writable() /* Do we need write faults for softdirty tracking? */ if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) return false; We care about PTE softdirty tracking, if it is enabled for the VMA. Tracking is enabled if: vma_soft_dirty_enabled() /* * Soft-dirty is kind of special: its tracking is enabled when * the vma flags not set. */ return !(vma->vm_flags & VM_SOFTDIRTY); Consequently, if VM_SOFTDIRTY is set, we are not considering the soft_dirty PTE bits accordingly. I'd suggest moving forward without this controversial PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a clear add-on we can discuss separately.
On 11/9/22 3:34 PM, David Hildenbrand wrote: > On 09.11.22 11:23, Muhammad Usama Anjum wrote: [...] >> Some non-dirty pages get marked as dirty because of the kernel's >> internal activity (such as VMA merging as soft-dirty bit difference isn't >> considered while deciding to merge VMAs). The dirty bit of the pages is >> stored in the VMA flags and in the per page flags. If any of these two bits >> are set, the page is considered to be soft dirty. Suppose you have cleared >> the soft dirty bit of half of VMA which will be done by splitting the VMA >> and clearing soft dirty bit flag in the half VMA and the pages in it. Now >> kernel may decide to merge the VMAs again. So the half VMA becomes dirty >> again. This splitting/merging costs performance. The application receives >> a lot of pages which aren't dirty in reality but marked as dirty. >> Performance is lost again here. Also sometimes user doesn't want the newly >> allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag >> solves both the problems. It is used to not depend on the soft dirty flag >> in the VMA flags. So VMA splitting and merging doesn't happen. It only >> depends on the soft dirty bit of the individual pages. Thus by using this >> flag, there may be a scenerio such that the new memory regions which are >> just created, doesn't look dirty when seen with the IOCTL, but look dirty >> when seen from procfs. This seems okay as the user of this flag know the >> implication of using it. The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the dirtiness for reused regions. Clearing the soft-dirty status of whole process is straight forward. When we want to clear/monitor the soft-dirtiness of a part of the virtual memory, there is a lot of internal noise. We don't want the non-dirty pages to become dirty because of how the soft-dirty feature has been working. Soft-dirty feature wasn't being used the way we want to use now. While monitoring a part of memory, it is not acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty when the two VMAs are merged without considering if they both are dirty or not (34228d473efe). To monitor changes over the memory, sometimes VMAs are split to clear the soft-dirty bit in the VMA flags. But sometimes kernel decide to merge them backup. It is so waste of resources. To keep things consistent, the default behavior of the IOCTL is to output even the extra non-dirty pages as dirty from the kernel noise. A optional PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't tolerant of extra non-dirty pages. This flag can be considered as something which is by-passing the already present buggy implementation in the kernel. It is not buggy per say as the issue can be solved if we don't allow the two VMA which have different soft-dirty bits to get merged. But we are allowing that so that the total number of VMAs doesn't increase. This was acceptable at the time, but now with the use case of monitoring a part of memory for soft-dirty doesn't want this merging. So either we need to revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to ignore the extra dirty pages which aren't dirty in reality. When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to find if the pages are dirty. So re-used regions cannot be detected. This has the only side-effect of not checking the VMAs. So this is limitation of using this flag which should be acceptable in the current state of code. This limitation is okay for the users as they can clear the soft-dirty bit of the VMA before starting to monitor a range of memory for soft-dirtiness. > Please separate that part out from the other changes; I am still not > convinced that we want this and what the semantical implications are. > > Let's take a look at an example: can_change_pte_writable() > > /* Do we need write faults for softdirty tracking? */ > if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) > return false; > > We care about PTE softdirty tracking, if it is enabled for the VMA. > Tracking is enabled if: vma_soft_dirty_enabled() > > /* > * Soft-dirty is kind of special: its tracking is enabled when > * the vma flags not set. > */ > return !(vma->vm_flags & VM_SOFTDIRTY); > > Consequently, if VM_SOFTDIRTY is set, we are not considering the soft_dirty > PTE bits accordingly. Sorry, I'm unable to completely grasp the meaning of the example. We have followed clear_refs_write() to write the soft-dirty bit clearing code in the current patch. Dirtiness of the VMA and the PTE may be set independently. Newer allocated memory has dirty bit set in the VMA. When something is written the memory, the soft dirty bit is set in the PTEs as well regardless if the soft dirty bit is set in the VMA or not. > > > I'd suggest moving forward without this controversial > PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a > clear add-on we can discuss separately.Like I've described above, I've only added this flag to not get the non-dirty pages as dirty. Can there be some alternative to adding this flag? Please suggest. -- BR, Muhammad Usama Anjum
> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the > dirtiness for reused regions. Clearing the soft-dirty status of whole > process is straight forward. When we want to clear/monitor the > soft-dirtiness of a part of the virtual memory, there is a lot of internal > noise. We don't want the non-dirty pages to become dirty because of how the > soft-dirty feature has been working. Soft-dirty feature wasn't being used > the way we want to use now. While monitoring a part of memory, it is not > acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty > when the two VMAs are merged without considering if they both are dirty or > not (34228d473efe). To monitor changes over the memory, sometimes VMAs are > split to clear the soft-dirty bit in the VMA flags. But sometimes kernel > decide to merge them backup. It is so waste of resources. Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY property differs. But that might be just one alternative for handling this case. > > To keep things consistent, the default behavior of the IOCTL is to output > even the extra non-dirty pages as dirty from the kernel noise. A optional > PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't > tolerant of extra non-dirty pages. This flag can be considered as something > which is by-passing the already present buggy implementation in the kernel. > It is not buggy per say as the issue can be solved if we don't allow the > two VMA which have different soft-dirty bits to get merged. But we are > allowing that so that the total number of VMAs doesn't increase. This was > acceptable at the time, but now with the use case of monitoring a part of > memory for soft-dirty doesn't want this merging. So either we need to > revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed > or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to ignore > the extra dirty pages which aren't dirty in reality. > > When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to > find if the pages are dirty. So re-used regions cannot be detected. This > has the only side-effect of not checking the VMAs. So this is limitation of > using this flag which should be acceptable in the current state of code. > This limitation is okay for the users as they can clear the soft-dirty bit > of the VMA before starting to monitor a range of memory for soft-dirtiness. > > >> Please separate that part out from the other changes; I am still not >> convinced that we want this and what the semantical implications are. >> >> Let's take a look at an example: can_change_pte_writable() >> >> /* Do we need write faults for softdirty tracking? */ >> if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) >> return false; >> >> We care about PTE softdirty tracking, if it is enabled for the VMA. >> Tracking is enabled if: vma_soft_dirty_enabled() >> >> /* >> * Soft-dirty is kind of special: its tracking is enabled when >> * the vma flags not set. >> */ >> return !(vma->vm_flags & VM_SOFTDIRTY); >> >> Consequently, if VM_SOFTDIRTY is set, we are not considering the soft_dirty >> PTE bits accordingly. > Sorry, I'm unable to completely grasp the meaning of the example. We have > followed clear_refs_write() to write the soft-dirty bit clearing code in > the current patch. Dirtiness of the VMA and the PTE may be set > independently. Newer allocated memory has dirty bit set in the VMA. When > something is written the memory, the soft dirty bit is set in the PTEs as > well regardless if the soft dirty bit is set in the VMA or not. > Let me try to find a simple explanation: After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set, there are ways that PTE could get written to and it could become dirty, without the PTE becoming softdirty. Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values might be stale: there might be entries that are softdirty even though the PTE is *not* marked softdirty. These are, AFAIU, the current semantics, and I am not sure if we want user space to explicitly work around that. >> >> >> I'd suggest moving forward without this controversial >> PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a >> clear add-on we can discuss separately.Like I've described above, I've only added this flag to not get the > non-dirty pages as dirty. Can there be some alternative to adding this > flag? Please suggest. Please split it out into a separate patch for now. We can discuss further what the semantics are and if there are better alternatives for that. In the meantime, you could move forward without PAGEMAP_NO_REUSED_REGIONS while we are discussing them further.
Hello, Thank you for replying. On 11/14/22 8:46 PM, David Hildenbrand wrote: >> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the >> dirtiness for reused regions. Clearing the soft-dirty status of whole >> process is straight forward. When we want to clear/monitor the >> soft-dirtiness of a part of the virtual memory, there is a lot of internal >> noise. We don't want the non-dirty pages to become dirty because of how the >> soft-dirty feature has been working. Soft-dirty feature wasn't being used >> the way we want to use now. While monitoring a part of memory, it is not >> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty >> when the two VMAs are merged without considering if they both are dirty or >> not (34228d473efe). To monitor changes over the memory, sometimes VMAs are >> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel >> decide to merge them backup. It is so waste of resources. > > Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY > property differs. But that might be just one alternative for handling this > case. > >> >> To keep things consistent, the default behavior of the IOCTL is to output >> even the extra non-dirty pages as dirty from the kernel noise. A optional >> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't >> tolerant of extra non-dirty pages. This flag can be considered as something >> which is by-passing the already present buggy implementation in the kernel. >> It is not buggy per say as the issue can be solved if we don't allow the >> two VMA which have different soft-dirty bits to get merged. But we are >> allowing that so that the total number of VMAs doesn't increase. This was >> acceptable at the time, but now with the use case of monitoring a part of >> memory for soft-dirty doesn't want this merging. So either we need to >> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed >> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to ignore >> the extra dirty pages which aren't dirty in reality. >> >> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to >> find if the pages are dirty. So re-used regions cannot be detected. This >> has the only side-effect of not checking the VMAs. So this is limitation of >> using this flag which should be acceptable in the current state of code. >> This limitation is okay for the users as they can clear the soft-dirty bit >> of the VMA before starting to monitor a range of memory for soft-dirtiness. >> >> >>> Please separate that part out from the other changes; I am still not >>> convinced that we want this and what the semantical implications are. >>> >>> Let's take a look at an example: can_change_pte_writable() >>> >>> /* Do we need write faults for softdirty tracking? */ >>> if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) >>> return false; >>> >>> We care about PTE softdirty tracking, if it is enabled for the VMA. >>> Tracking is enabled if: vma_soft_dirty_enabled() >>> >>> /* >>> * Soft-dirty is kind of special: its tracking is enabled when >>> * the vma flags not set. >>> */ >>> return !(vma->vm_flags & VM_SOFTDIRTY); >>> >>> Consequently, if VM_SOFTDIRTY is set, we are not considering the soft_dirty >>> PTE bits accordingly. >> Sorry, I'm unable to completely grasp the meaning of the example. We have >> followed clear_refs_write() to write the soft-dirty bit clearing code in >> the current patch. Dirtiness of the VMA and the PTE may be set >> independently. Newer allocated memory has dirty bit set in the VMA. When >> something is written the memory, the soft dirty bit is set in the PTEs as >> well regardless if the soft dirty bit is set in the VMA or not. >> > > Let me try to find a simple explanation: > > After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set, > there are ways that PTE could get written to and it could become dirty, > without the PTE becoming softdirty. > > Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values > might be stale: there might be entries that are softdirty even though the > PTE is *not* marked softdirty. Can someone please share the example to reproduce this? In all of my testing, even if I ignore VM_SOFTDIRTY and only base my decision of soft-dirtiness on individual pages, it always passes. > > These are, AFAIU, the current semantics, and I am not sure if we want user > space to explicitly work around that. > >>> >>> >>> I'd suggest moving forward without this controversial >>> PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a >>> clear add-on we can discuss separately.Like I've described above, I've >>> only added this flag to not get the >> non-dirty pages as dirty. Can there be some alternative to adding this >> flag? Please suggest. > > Please split it out into a separate patch for now. We can discuss further > what the semantics are and if there are better alternatives for that. In > the meantime, you could move forward without PAGEMAP_NO_REUSED_REGIONS > while we are discussing them further. >
On 21.11.22 16:00, Muhammad Usama Anjum wrote: > Hello, > > Thank you for replying. > > On 11/14/22 8:46 PM, David Hildenbrand wrote: >>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the >>> dirtiness for reused regions. Clearing the soft-dirty status of whole >>> process is straight forward. When we want to clear/monitor the >>> soft-dirtiness of a part of the virtual memory, there is a lot of internal >>> noise. We don't want the non-dirty pages to become dirty because of how the >>> soft-dirty feature has been working. Soft-dirty feature wasn't being used >>> the way we want to use now. While monitoring a part of memory, it is not >>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty >>> when the two VMAs are merged without considering if they both are dirty or >>> not (34228d473efe). To monitor changes over the memory, sometimes VMAs are >>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel >>> decide to merge them backup. It is so waste of resources. >> >> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY >> property differs. But that might be just one alternative for handling this >> case. >> >>> >>> To keep things consistent, the default behavior of the IOCTL is to output >>> even the extra non-dirty pages as dirty from the kernel noise. A optional >>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't >>> tolerant of extra non-dirty pages. This flag can be considered as something >>> which is by-passing the already present buggy implementation in the kernel. >>> It is not buggy per say as the issue can be solved if we don't allow the >>> two VMA which have different soft-dirty bits to get merged. But we are >>> allowing that so that the total number of VMAs doesn't increase. This was >>> acceptable at the time, but now with the use case of monitoring a part of >>> memory for soft-dirty doesn't want this merging. So either we need to >>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed >>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to ignore >>> the extra dirty pages which aren't dirty in reality. >>> >>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to >>> find if the pages are dirty. So re-used regions cannot be detected. This >>> has the only side-effect of not checking the VMAs. So this is limitation of >>> using this flag which should be acceptable in the current state of code. >>> This limitation is okay for the users as they can clear the soft-dirty bit >>> of the VMA before starting to monitor a range of memory for soft-dirtiness. >>> >>> >>>> Please separate that part out from the other changes; I am still not >>>> convinced that we want this and what the semantical implications are. >>>> >>>> Let's take a look at an example: can_change_pte_writable() >>>> >>>> /* Do we need write faults for softdirty tracking? */ >>>> if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) >>>> return false; >>>> >>>> We care about PTE softdirty tracking, if it is enabled for the VMA. >>>> Tracking is enabled if: vma_soft_dirty_enabled() >>>> >>>> /* >>>> * Soft-dirty is kind of special: its tracking is enabled when >>>> * the vma flags not set. >>>> */ >>>> return !(vma->vm_flags & VM_SOFTDIRTY); >>>> >>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the soft_dirty >>>> PTE bits accordingly. >>> Sorry, I'm unable to completely grasp the meaning of the example. We have >>> followed clear_refs_write() to write the soft-dirty bit clearing code in >>> the current patch. Dirtiness of the VMA and the PTE may be set >>> independently. Newer allocated memory has dirty bit set in the VMA. When >>> something is written the memory, the soft dirty bit is set in the PTEs as >>> well regardless if the soft dirty bit is set in the VMA or not. >>> >> >> Let me try to find a simple explanation: >> >> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set, >> there are ways that PTE could get written to and it could become dirty, >> without the PTE becoming softdirty. >> >> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values >> might be stale: there might be entries that are softdirty even though the >> PTE is *not* marked softdirty. > Can someone please share the example to reproduce this? In all of my > testing, even if I ignore VM_SOFTDIRTY and only base my decision of > soft-dirtiness on individual pages, it always passes. Quick reproducer (the first and easiest one that triggered :) ) attached. With no kernel changes, it works as expected. # ./softdirty_mprotect With the following kernel change to simulate what you propose it fails: diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index d22687d2e81e..f2c682bf7f64 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, flags |= PM_FILE; if (page && !migration && page_mapcount(page) == 1) flags |= PM_MMAP_EXCLUSIVE; - if (vma->vm_flags & VM_SOFTDIRTY) - flags |= PM_SOFT_DIRTY; + //if (vma->vm_flags & VM_SOFTDIRTY) + // flags |= PM_SOFT_DIRTY; return make_pme(frame, flags); } # ./softdirty_mprotect Page #1 should be softdirty
On Wed, Nov 09, 2022 at 03:23:00PM +0500, Muhammad Usama Anjum wrote: > Soft-dirty PTE bit of the memory pages can be read by using the pagemap > procfs file. The soft-dirty PTE bit for the whole memory range of the > process can be cleared by writing to the clear_refs file. There are other > methods to mimic this information entirely in userspace with poor > performance: > - The mprotect syscall and SIGSEGV handler for bookkeeping > - The userfaultfd syscall with the handler for bookkeeping Userfaultfd is definitely slow in this case because it needs the messaging roundtrip that happens in two different threads synchronously, so at least more schedule effort even than mprotect. I saw the other patch on vma merging with SOFTDIRTY, didn't look deeper there but IIUC it won't really help much if the other commit (34228d47) can't be reverted then it seems to help nothing. And, it does looks risky to revert that because in the same commit it mentioned the case where one can clear ref right before a vma merge, so definitely worth more thoughts and testings, which I agree with you. I'm thinking whether the vma issue can be totally avoided. For example by providing an async version of uffd-wp. Currently uffd-wp must be synchronous and it'll be slow but it services specific purposes. And this is definitely not the 1st time any of us thinking about uffd-wp being async, it's just that we need to solve the problem of storage on the dirty information. Actually we can also use other storage form but so far I didn't think of anything that's easy and clean. Current soft-dirty bit also has its defects (e.g. the need to take mmap lock and walk the pgtables), but that part will be the same as soft-dirty for now. Now I'm wildly thinking whether we can just reuse the soft-dirty bit in the ptes already defined. The GET interface could be similar as proposed here, or at least a separate issue. So _maybe_ we can have a feature (bound to the uffd context) for uffd that enables async uffd-wp, in which case the wr-protect fault is not sending any message anymore (nor enqueuing) but instead setting the soft-dirty then quickly resolving the write bit immediately and continue the fault. Clearing of the soft-dirty bit needs to be done in UFFDIO_WRITEPROTECT alongside of clearing uffd-wp bit. So on that part the current GET+CLEAR interface for pagemap may need to be replaced. And frankly, it feels weird to me to allow change mm layout in pagemap ioctls.. With this we can keep the pagemap interface to only fetch information, like before. A major benefit of using uffd is that uffd is by nature pte-based, so no fiddling with vma needed at all. Firstly, no need to worry about merging vmas with tons of false positives. Meanwhile, one can wr-protect in page-size granule easily. All the wr-protect is not governed by vma flag anymore but based on uffd-wp flag, so no extra overhead too on any page that the monitor is not interested. There's already infrastructure code for persisting uffd-wp bit, so it'll naturally work similarly for an async mode if to come to the world. It's just that we'll also need to consider exclusive use of the bit, so we'll need to fail clear_refs on vmas where we have VM_UFFD_WP and also the async feature enabled. I would hope that's very rare, but worth thinking about its side effect. The same will need to apply to UFFDIO_REGISTER on async wp mode when soft-dirty enabled, we'll need to bailout too. Said that, this is not a suggestion of a new design, but just something I thought about when reading this, and quickly writting this down. Thanks,
On 11/21/22 8:55 PM, David Hildenbrand wrote: > On 21.11.22 16:00, Muhammad Usama Anjum wrote: >> Hello, >> >> Thank you for replying. >> >> On 11/14/22 8:46 PM, David Hildenbrand wrote: >>>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the >>>> dirtiness for reused regions. Clearing the soft-dirty status of whole >>>> process is straight forward. When we want to clear/monitor the >>>> soft-dirtiness of a part of the virtual memory, there is a lot of internal >>>> noise. We don't want the non-dirty pages to become dirty because of how >>>> the >>>> soft-dirty feature has been working. Soft-dirty feature wasn't being used >>>> the way we want to use now. While monitoring a part of memory, it is not >>>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty >>>> when the two VMAs are merged without considering if they both are dirty or >>>> not (34228d473efe). To monitor changes over the memory, sometimes VMAs are >>>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel >>>> decide to merge them backup. It is so waste of resources. >>> >>> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY >>> property differs. But that might be just one alternative for handling this >>> case. >>> >>>> >>>> To keep things consistent, the default behavior of the IOCTL is to output >>>> even the extra non-dirty pages as dirty from the kernel noise. A optional >>>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't >>>> tolerant of extra non-dirty pages. This flag can be considered as >>>> something >>>> which is by-passing the already present buggy implementation in the >>>> kernel. >>>> It is not buggy per say as the issue can be solved if we don't allow the >>>> two VMA which have different soft-dirty bits to get merged. But we are >>>> allowing that so that the total number of VMAs doesn't increase. This was >>>> acceptable at the time, but now with the use case of monitoring a part of >>>> memory for soft-dirty doesn't want this merging. So either we need to >>>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed >>>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to >>>> ignore >>>> the extra dirty pages which aren't dirty in reality. >>>> >>>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to >>>> find if the pages are dirty. So re-used regions cannot be detected. This >>>> has the only side-effect of not checking the VMAs. So this is >>>> limitation of >>>> using this flag which should be acceptable in the current state of code. >>>> This limitation is okay for the users as they can clear the soft-dirty bit >>>> of the VMA before starting to monitor a range of memory for >>>> soft-dirtiness. >>>> >>>> >>>>> Please separate that part out from the other changes; I am still not >>>>> convinced that we want this and what the semantical implications are. >>>>> >>>>> Let's take a look at an example: can_change_pte_writable() >>>>> >>>>> /* Do we need write faults for softdirty tracking? */ >>>>> if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) >>>>> return false; >>>>> >>>>> We care about PTE softdirty tracking, if it is enabled for the VMA. >>>>> Tracking is enabled if: vma_soft_dirty_enabled() >>>>> >>>>> /* >>>>> * Soft-dirty is kind of special: its tracking is enabled when >>>>> * the vma flags not set. >>>>> */ >>>>> return !(vma->vm_flags & VM_SOFTDIRTY); >>>>> >>>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the >>>>> soft_dirty >>>>> PTE bits accordingly. >>>> Sorry, I'm unable to completely grasp the meaning of the example. We have >>>> followed clear_refs_write() to write the soft-dirty bit clearing code in >>>> the current patch. Dirtiness of the VMA and the PTE may be set >>>> independently. Newer allocated memory has dirty bit set in the VMA. When >>>> something is written the memory, the soft dirty bit is set in the PTEs as >>>> well regardless if the soft dirty bit is set in the VMA or not. >>>> >>> >>> Let me try to find a simple explanation: >>> >>> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set, >>> there are ways that PTE could get written to and it could become dirty, >>> without the PTE becoming softdirty. >>> >>> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values >>> might be stale: there might be entries that are softdirty even though the >>> PTE is *not* marked softdirty. >> Can someone please share the example to reproduce this? In all of my >> testing, even if I ignore VM_SOFTDIRTY and only base my decision of >> soft-dirtiness on individual pages, it always passes. > > Quick reproducer (the first and easiest one that triggered :) ) > attached. > > With no kernel changes, it works as expected. > > # ./softdirty_mprotect > > > With the following kernel change to simulate what you propose it fails: > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index d22687d2e81e..f2c682bf7f64 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct > pagemapread *pm, > flags |= PM_FILE; > if (page && !migration && page_mapcount(page) == 1) > flags |= PM_MMAP_EXCLUSIVE; > - if (vma->vm_flags & VM_SOFTDIRTY) > - flags |= PM_SOFT_DIRTY; > + //if (vma->vm_flags & VM_SOFTDIRTY) > + // flags |= PM_SOFT_DIRTY; > > return make_pme(frame, flags); > } > > > # ./softdirty_mprotect > Page #1 should be softdirty > Thank you so much for sharing the issue and reproducer. After remapping the second part of the memory and m-protecting + m-unprotecting the whole memory, the PTE of the first half of the memory doesn't get marked as soft dirty even after writing multiple times to it. Even if soft-dirtiness is cleared on the whole process, the PTE of the first half memory doesn't get dirty. This seems like more of a bug in mprotect. The mprotect should not mess up with the soft-dirty flag in the PTEs. I'm debugging this. I hope to find the issue soon. Soft-dirty tracking in PTEs should be working correctly irrespective of the VM_SOFTDIRTY is set or not on the VMA. Cyrill has said in [1]: > ioctl might be an option indeed It brings some hope to this patch-set. [1] https://lore.kernel.org/all/Y4W0axw0ZgORtfkt@grain/
On 30.11.22 12:42, Muhammad Usama Anjum wrote: > On 11/21/22 8:55 PM, David Hildenbrand wrote: >> On 21.11.22 16:00, Muhammad Usama Anjum wrote: >>> Hello, >>> >>> Thank you for replying. >>> >>> On 11/14/22 8:46 PM, David Hildenbrand wrote: >>>>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the >>>>> dirtiness for reused regions. Clearing the soft-dirty status of whole >>>>> process is straight forward. When we want to clear/monitor the >>>>> soft-dirtiness of a part of the virtual memory, there is a lot of internal >>>>> noise. We don't want the non-dirty pages to become dirty because of how >>>>> the >>>>> soft-dirty feature has been working. Soft-dirty feature wasn't being used >>>>> the way we want to use now. While monitoring a part of memory, it is not >>>>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty >>>>> when the two VMAs are merged without considering if they both are dirty or >>>>> not (34228d473efe). To monitor changes over the memory, sometimes VMAs are >>>>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel >>>>> decide to merge them backup. It is so waste of resources. >>>> >>>> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY >>>> property differs. But that might be just one alternative for handling this >>>> case. >>>> >>>>> >>>>> To keep things consistent, the default behavior of the IOCTL is to output >>>>> even the extra non-dirty pages as dirty from the kernel noise. A optional >>>>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't >>>>> tolerant of extra non-dirty pages. This flag can be considered as >>>>> something >>>>> which is by-passing the already present buggy implementation in the >>>>> kernel. >>>>> It is not buggy per say as the issue can be solved if we don't allow the >>>>> two VMA which have different soft-dirty bits to get merged. But we are >>>>> allowing that so that the total number of VMAs doesn't increase. This was >>>>> acceptable at the time, but now with the use case of monitoring a part of >>>>> memory for soft-dirty doesn't want this merging. So either we need to >>>>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed >>>>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to >>>>> ignore >>>>> the extra dirty pages which aren't dirty in reality. >>>>> >>>>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to >>>>> find if the pages are dirty. So re-used regions cannot be detected. This >>>>> has the only side-effect of not checking the VMAs. So this is >>>>> limitation of >>>>> using this flag which should be acceptable in the current state of code. >>>>> This limitation is okay for the users as they can clear the soft-dirty bit >>>>> of the VMA before starting to monitor a range of memory for >>>>> soft-dirtiness. >>>>> >>>>> >>>>>> Please separate that part out from the other changes; I am still not >>>>>> convinced that we want this and what the semantical implications are. >>>>>> >>>>>> Let's take a look at an example: can_change_pte_writable() >>>>>> >>>>>> /* Do we need write faults for softdirty tracking? */ >>>>>> if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) >>>>>> return false; >>>>>> >>>>>> We care about PTE softdirty tracking, if it is enabled for the VMA. >>>>>> Tracking is enabled if: vma_soft_dirty_enabled() >>>>>> >>>>>> /* >>>>>> * Soft-dirty is kind of special: its tracking is enabled when >>>>>> * the vma flags not set. >>>>>> */ >>>>>> return !(vma->vm_flags & VM_SOFTDIRTY); >>>>>> >>>>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the >>>>>> soft_dirty >>>>>> PTE bits accordingly. >>>>> Sorry, I'm unable to completely grasp the meaning of the example. We have >>>>> followed clear_refs_write() to write the soft-dirty bit clearing code in >>>>> the current patch. Dirtiness of the VMA and the PTE may be set >>>>> independently. Newer allocated memory has dirty bit set in the VMA. When >>>>> something is written the memory, the soft dirty bit is set in the PTEs as >>>>> well regardless if the soft dirty bit is set in the VMA or not. >>>>> >>>> >>>> Let me try to find a simple explanation: >>>> >>>> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set, >>>> there are ways that PTE could get written to and it could become dirty, >>>> without the PTE becoming softdirty. >>>> >>>> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values >>>> might be stale: there might be entries that are softdirty even though the >>>> PTE is *not* marked softdirty. >>> Can someone please share the example to reproduce this? In all of my >>> testing, even if I ignore VM_SOFTDIRTY and only base my decision of >>> soft-dirtiness on individual pages, it always passes. >> >> Quick reproducer (the first and easiest one that triggered :) ) >> attached. >> >> With no kernel changes, it works as expected. >> >> # ./softdirty_mprotect >> >> >> With the following kernel change to simulate what you propose it fails: >> >> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >> index d22687d2e81e..f2c682bf7f64 100644 >> --- a/fs/proc/task_mmu.c >> +++ b/fs/proc/task_mmu.c >> @@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct >> pagemapread *pm, >> flags |= PM_FILE; >> if (page && !migration && page_mapcount(page) == 1) >> flags |= PM_MMAP_EXCLUSIVE; >> - if (vma->vm_flags & VM_SOFTDIRTY) >> - flags |= PM_SOFT_DIRTY; >> + //if (vma->vm_flags & VM_SOFTDIRTY) >> + // flags |= PM_SOFT_DIRTY; >> >> return make_pme(frame, flags); >> } >> >> >> # ./softdirty_mprotect >> Page #1 should be softdirty >> > Thank you so much for sharing the issue and reproducer. > > After remapping the second part of the memory and m-protecting + > m-unprotecting the whole memory, the PTE of the first half of the memory > doesn't get marked as soft dirty even after writing multiple times to it. > Even if soft-dirtiness is cleared on the whole process, the PTE of the > first half memory doesn't get dirty. This seems like more of a bug in > mprotect. The mprotect should not mess up with the soft-dirty flag in the PTEs. > > I'm debugging this. I hope to find the issue soon. Soft-dirty tracking in > PTEs should be working correctly irrespective of the VM_SOFTDIRTY is set or > not on the VMA. No, it's not a bug and these are not the VM_SOFTDIRTY semantics -- just because you think they should be like this. As people explained, VM_SOFTDIRTY implies *until now* that any PTE is consideres softdirty. And there are other scenarios that can similarly trigger something like that, besides mprotect(). Sorry if I sound annoyed, but please 1) factor out that from your patch set for now 2) find a way to handle this cleanly, for example, not merging VMAs that differ in VM_SOFTDIRTY
On 11/30/22 5:10 PM, David Hildenbrand wrote: > On 30.11.22 12:42, Muhammad Usama Anjum wrote: >> On 11/21/22 8:55 PM, David Hildenbrand wrote: >>> On 21.11.22 16:00, Muhammad Usama Anjum wrote: >>>> Hello, >>>> >>>> Thank you for replying. >>>> >>>> On 11/14/22 8:46 PM, David Hildenbrand wrote: >>>>>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store >>>>>> the >>>>>> dirtiness for reused regions. Clearing the soft-dirty status of whole >>>>>> process is straight forward. When we want to clear/monitor the >>>>>> soft-dirtiness of a part of the virtual memory, there is a lot of >>>>>> internal >>>>>> noise. We don't want the non-dirty pages to become dirty because of how >>>>>> the >>>>>> soft-dirty feature has been working. Soft-dirty feature wasn't being >>>>>> used >>>>>> the way we want to use now. While monitoring a part of memory, it is not >>>>>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty >>>>>> when the two VMAs are merged without considering if they both are >>>>>> dirty or >>>>>> not (34228d473efe). To monitor changes over the memory, sometimes >>>>>> VMAs are >>>>>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel >>>>>> decide to merge them backup. It is so waste of resources. >>>>> >>>>> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY >>>>> property differs. But that might be just one alternative for handling >>>>> this >>>>> case. >>>>> >>>>>> >>>>>> To keep things consistent, the default behavior of the IOCTL is to >>>>>> output >>>>>> even the extra non-dirty pages as dirty from the kernel noise. A >>>>>> optional >>>>>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't >>>>>> tolerant of extra non-dirty pages. This flag can be considered as >>>>>> something >>>>>> which is by-passing the already present buggy implementation in the >>>>>> kernel. >>>>>> It is not buggy per say as the issue can be solved if we don't allow the >>>>>> two VMA which have different soft-dirty bits to get merged. But we are >>>>>> allowing that so that the total number of VMAs doesn't increase. This >>>>>> was >>>>>> acceptable at the time, but now with the use case of monitoring a >>>>>> part of >>>>>> memory for soft-dirty doesn't want this merging. So either we need to >>>>>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be >>>>>> needed >>>>>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to >>>>>> ignore >>>>>> the extra dirty pages which aren't dirty in reality. >>>>>> >>>>>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are >>>>>> checked to >>>>>> find if the pages are dirty. So re-used regions cannot be detected. This >>>>>> has the only side-effect of not checking the VMAs. So this is >>>>>> limitation of >>>>>> using this flag which should be acceptable in the current state of code. >>>>>> This limitation is okay for the users as they can clear the >>>>>> soft-dirty bit >>>>>> of the VMA before starting to monitor a range of memory for >>>>>> soft-dirtiness. >>>>>> >>>>>> >>>>>>> Please separate that part out from the other changes; I am still not >>>>>>> convinced that we want this and what the semantical implications are. >>>>>>> >>>>>>> Let's take a look at an example: can_change_pte_writable() >>>>>>> >>>>>>> /* Do we need write faults for softdirty tracking? */ >>>>>>> if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) >>>>>>> return false; >>>>>>> >>>>>>> We care about PTE softdirty tracking, if it is enabled for the VMA. >>>>>>> Tracking is enabled if: vma_soft_dirty_enabled() >>>>>>> >>>>>>> /* >>>>>>> * Soft-dirty is kind of special: its tracking is enabled when >>>>>>> * the vma flags not set. >>>>>>> */ >>>>>>> return !(vma->vm_flags & VM_SOFTDIRTY); >>>>>>> >>>>>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the >>>>>>> soft_dirty >>>>>>> PTE bits accordingly. >>>>>> Sorry, I'm unable to completely grasp the meaning of the example. We >>>>>> have >>>>>> followed clear_refs_write() to write the soft-dirty bit clearing code in >>>>>> the current patch. Dirtiness of the VMA and the PTE may be set >>>>>> independently. Newer allocated memory has dirty bit set in the VMA. When >>>>>> something is written the memory, the soft dirty bit is set in the >>>>>> PTEs as >>>>>> well regardless if the soft dirty bit is set in the VMA or not. >>>>>> >>>>> >>>>> Let me try to find a simple explanation: >>>>> >>>>> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set, >>>>> there are ways that PTE could get written to and it could become dirty, >>>>> without the PTE becoming softdirty. >>>>> >>>>> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values >>>>> might be stale: there might be entries that are softdirty even though the >>>>> PTE is *not* marked softdirty. >>>> Can someone please share the example to reproduce this? In all of my >>>> testing, even if I ignore VM_SOFTDIRTY and only base my decision of >>>> soft-dirtiness on individual pages, it always passes. >>> >>> Quick reproducer (the first and easiest one that triggered :) ) >>> attached. >>> >>> With no kernel changes, it works as expected. >>> >>> # ./softdirty_mprotect >>> >>> >>> With the following kernel change to simulate what you propose it fails: >>> >>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >>> index d22687d2e81e..f2c682bf7f64 100644 >>> --- a/fs/proc/task_mmu.c >>> +++ b/fs/proc/task_mmu.c >>> @@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct >>> pagemapread *pm, >>> flags |= PM_FILE; >>> if (page && !migration && page_mapcount(page) == 1) >>> flags |= PM_MMAP_EXCLUSIVE; >>> - if (vma->vm_flags & VM_SOFTDIRTY) >>> - flags |= PM_SOFT_DIRTY; >>> + //if (vma->vm_flags & VM_SOFTDIRTY) >>> + // flags |= PM_SOFT_DIRTY; >>> return make_pme(frame, flags); >>> } >>> >>> >>> # ./softdirty_mprotect >>> Page #1 should be softdirty >>> >> Thank you so much for sharing the issue and reproducer. >> >> After remapping the second part of the memory and m-protecting + >> m-unprotecting the whole memory, the PTE of the first half of the memory >> doesn't get marked as soft dirty even after writing multiple times to it. >> Even if soft-dirtiness is cleared on the whole process, the PTE of the >> first half memory doesn't get dirty. This seems like more of a bug in >> mprotect. The mprotect should not mess up with the soft-dirty flag in the >> PTEs. >> >> I'm debugging this. I hope to find the issue soon. Soft-dirty tracking in >> PTEs should be working correctly irrespective of the VM_SOFTDIRTY is set or >> not on the VMA. > > No, it's not a bug and these are not the VM_SOFTDIRTY semantics -- just > because you think they should be like this. As people explained, > VM_SOFTDIRTY implies *until now* that any PTE is consideres softdirty. And > there are other scenarios that can similarly trigger something like that, > besides mprotect(). > > Sorry if I sound annoyed, but please > > 1) factor out that from your patch set for now > 2) find a way to handle this cleanly, for example, not merging VMAs that > differ in VM_SOFTDIRTY > I'm extremely sorry for the annoyance. I absolutely understand your point. The problem is that the half of this IOCTL wouldn't be useful without solving the extra soft-dirty pages issue. We don't want to upstream something which we wouldn't be using until 2 is solved. This is why we are trying to solve the point 2 before upstreaming the 1. I'm working on ideas on how this can be resolved or redesigned entirely. Maybe Cyril will share the ideas soon once he has some time. He was involved in the soft-dirty feature development.
On 05.12.22 16:29, Muhammad Usama Anjum wrote: > On 11/30/22 5:10 PM, David Hildenbrand wrote: >> On 30.11.22 12:42, Muhammad Usama Anjum wrote: >>> On 11/21/22 8:55 PM, David Hildenbrand wrote: >>>> On 21.11.22 16:00, Muhammad Usama Anjum wrote: >>>>> Hello, >>>>> >>>>> Thank you for replying. >>>>> >>>>> On 11/14/22 8:46 PM, David Hildenbrand wrote: >>>>>>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store >>>>>>> the >>>>>>> dirtiness for reused regions. Clearing the soft-dirty status of whole >>>>>>> process is straight forward. When we want to clear/monitor the >>>>>>> soft-dirtiness of a part of the virtual memory, there is a lot of >>>>>>> internal >>>>>>> noise. We don't want the non-dirty pages to become dirty because of how >>>>>>> the >>>>>>> soft-dirty feature has been working. Soft-dirty feature wasn't being >>>>>>> used >>>>>>> the way we want to use now. While monitoring a part of memory, it is not >>>>>>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty >>>>>>> when the two VMAs are merged without considering if they both are >>>>>>> dirty or >>>>>>> not (34228d473efe). To monitor changes over the memory, sometimes >>>>>>> VMAs are >>>>>>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel >>>>>>> decide to merge them backup. It is so waste of resources. >>>>>> >>>>>> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY >>>>>> property differs. But that might be just one alternative for handling >>>>>> this >>>>>> case. >>>>>> >>>>>>> >>>>>>> To keep things consistent, the default behavior of the IOCTL is to >>>>>>> output >>>>>>> even the extra non-dirty pages as dirty from the kernel noise. A >>>>>>> optional >>>>>>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't >>>>>>> tolerant of extra non-dirty pages. This flag can be considered as >>>>>>> something >>>>>>> which is by-passing the already present buggy implementation in the >>>>>>> kernel. >>>>>>> It is not buggy per say as the issue can be solved if we don't allow the >>>>>>> two VMA which have different soft-dirty bits to get merged. But we are >>>>>>> allowing that so that the total number of VMAs doesn't increase. This >>>>>>> was >>>>>>> acceptable at the time, but now with the use case of monitoring a >>>>>>> part of >>>>>>> memory for soft-dirty doesn't want this merging. So either we need to >>>>>>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be >>>>>>> needed >>>>>>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to >>>>>>> ignore >>>>>>> the extra dirty pages which aren't dirty in reality. >>>>>>> >>>>>>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are >>>>>>> checked to >>>>>>> find if the pages are dirty. So re-used regions cannot be detected. This >>>>>>> has the only side-effect of not checking the VMAs. So this is >>>>>>> limitation of >>>>>>> using this flag which should be acceptable in the current state of code. >>>>>>> This limitation is okay for the users as they can clear the >>>>>>> soft-dirty bit >>>>>>> of the VMA before starting to monitor a range of memory for >>>>>>> soft-dirtiness. >>>>>>> >>>>>>> >>>>>>>> Please separate that part out from the other changes; I am still not >>>>>>>> convinced that we want this and what the semantical implications are. >>>>>>>> >>>>>>>> Let's take a look at an example: can_change_pte_writable() >>>>>>>> >>>>>>>> /* Do we need write faults for softdirty tracking? */ >>>>>>>> if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) >>>>>>>> return false; >>>>>>>> >>>>>>>> We care about PTE softdirty tracking, if it is enabled for the VMA. >>>>>>>> Tracking is enabled if: vma_soft_dirty_enabled() >>>>>>>> >>>>>>>> /* >>>>>>>> * Soft-dirty is kind of special: its tracking is enabled when >>>>>>>> * the vma flags not set. >>>>>>>> */ >>>>>>>> return !(vma->vm_flags & VM_SOFTDIRTY); >>>>>>>> >>>>>>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the >>>>>>>> soft_dirty >>>>>>>> PTE bits accordingly. >>>>>>> Sorry, I'm unable to completely grasp the meaning of the example. We >>>>>>> have >>>>>>> followed clear_refs_write() to write the soft-dirty bit clearing code in >>>>>>> the current patch. Dirtiness of the VMA and the PTE may be set >>>>>>> independently. Newer allocated memory has dirty bit set in the VMA. When >>>>>>> something is written the memory, the soft dirty bit is set in the >>>>>>> PTEs as >>>>>>> well regardless if the soft dirty bit is set in the VMA or not. >>>>>>> >>>>>> >>>>>> Let me try to find a simple explanation: >>>>>> >>>>>> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set, >>>>>> there are ways that PTE could get written to and it could become dirty, >>>>>> without the PTE becoming softdirty. >>>>>> >>>>>> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values >>>>>> might be stale: there might be entries that are softdirty even though the >>>>>> PTE is *not* marked softdirty. >>>>> Can someone please share the example to reproduce this? In all of my >>>>> testing, even if I ignore VM_SOFTDIRTY and only base my decision of >>>>> soft-dirtiness on individual pages, it always passes. >>>> >>>> Quick reproducer (the first and easiest one that triggered :) ) >>>> attached. >>>> >>>> With no kernel changes, it works as expected. >>>> >>>> # ./softdirty_mprotect >>>> >>>> >>>> With the following kernel change to simulate what you propose it fails: >>>> >>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >>>> index d22687d2e81e..f2c682bf7f64 100644 >>>> --- a/fs/proc/task_mmu.c >>>> +++ b/fs/proc/task_mmu.c >>>> @@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct >>>> pagemapread *pm, >>>> flags |= PM_FILE; >>>> if (page && !migration && page_mapcount(page) == 1) >>>> flags |= PM_MMAP_EXCLUSIVE; >>>> - if (vma->vm_flags & VM_SOFTDIRTY) >>>> - flags |= PM_SOFT_DIRTY; >>>> + //if (vma->vm_flags & VM_SOFTDIRTY) >>>> + // flags |= PM_SOFT_DIRTY; >>>> return make_pme(frame, flags); >>>> } >>>> >>>> >>>> # ./softdirty_mprotect >>>> Page #1 should be softdirty >>>> >>> Thank you so much for sharing the issue and reproducer. >>> >>> After remapping the second part of the memory and m-protecting + >>> m-unprotecting the whole memory, the PTE of the first half of the memory >>> doesn't get marked as soft dirty even after writing multiple times to it. >>> Even if soft-dirtiness is cleared on the whole process, the PTE of the >>> first half memory doesn't get dirty. This seems like more of a bug in >>> mprotect. The mprotect should not mess up with the soft-dirty flag in the >>> PTEs. >>> >>> I'm debugging this. I hope to find the issue soon. Soft-dirty tracking in >>> PTEs should be working correctly irrespective of the VM_SOFTDIRTY is set or >>> not on the VMA. >> >> No, it's not a bug and these are not the VM_SOFTDIRTY semantics -- just >> because you think they should be like this. As people explained, >> VM_SOFTDIRTY implies *until now* that any PTE is consideres softdirty. And >> there are other scenarios that can similarly trigger something like that, >> besides mprotect(). >> >> Sorry if I sound annoyed, but please >> >> 1) factor out that from your patch set for now >> 2) find a way to handle this cleanly, for example, not merging VMAs that >> differ in VM_SOFTDIRTY >> > > I'm extremely sorry for the annoyance. I absolutely understand your point. No need to be sorry :) > The problem is that the half of this IOCTL wouldn't be useful without > solving the extra soft-dirty pages issue. We don't want to upstream > something which we wouldn't be using until 2 is solved. This is why we are > trying to solve the point 2 before upstreaming the 1. I'm working on ideas > on how this can be resolved or redesigned entirely. Maybe Cyril will share > the ideas soon once he has some time. He was involved in the soft-dirty > feature development. Got it, thanks for the info on usability without this feature. Let me make my point clearer: exposing VM_SOFTDIRTY details to user space and providing different kinds of "soft dirty" really is sub-optimal from an ABI POV. It would be really preferred if we could just find a way to improve the soft-dirty implementation in a way such that no such ABI hacks are required and that the existing interface will provide the semantics you want. For example, if we could rework the VMA merging case that would be really preferable. I understand that we might want more fine-grained soft-dirty clearing IOCTLs. My primary concern is regarding the VM_SOFTDIRTY special-casing just when observing whether a PTE is softdirty.