Message ID | 20210629023959.4ZAFiI8oZ%akpm@linux-foundation.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [001/192] mm/gup: fix try_grab_compound_head() race with split_huge_page() | expand |
On Mon, Jun 28, 2021 at 7:40 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > - /* Avoid taking write faults for known dirty pages */ > - if (dirty_accountable && pte_dirty(ptent) && > - (pte_soft_dirty(ptent) || > - !(vma->vm_flags & VM_SOFTDIRTY))) { > + if (may_avoid_write_fault(ptent, vma, cp_flags)) > ptent = pte_mkwrite(ptent); > - } Hmm. I don't think this is correct. As fat as I can tell, may_avoid_write_fault() doesn't even check if the vma is writable! Am I misreading it? Because I think you just made even a shared mmap with "mprotect(PROT_READ)" turn the pte's writable. Which is a "slight" security issue. Maybe the new code is fine, and I'm missing something. The old code looks strange too, which makes me think that the MM_CP_DIRTY_ACCT test ends up saving us and depend on VM_WRITE. But it's very much not obvious. And even if I _am_ missing something, I really would like a very obvious and direct test for "this vma is writable", ie maybe a if (!(vma->vm_flags & VM_WRITE)) return false; at the very top of the function. And no, "pte_dirty()" is not a reason to make something writable, it might have started out as a writable mapping, and we dirtied the page, and we made it read-only. The page stays dirty, but it shouldn't become writable just because of that. So please make me get the warm and fuzzies about this code. Because as-is, it just looks scary. Linus
On Tue, Jun 29, 2021 at 10:50:12AM -0700, Linus Torvalds wrote: > On Mon, Jun 28, 2021 at 7:40 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > - /* Avoid taking write faults for known dirty pages */ > > - if (dirty_accountable && pte_dirty(ptent) && > > - (pte_soft_dirty(ptent) || > > - !(vma->vm_flags & VM_SOFTDIRTY))) { > > + if (may_avoid_write_fault(ptent, vma, cp_flags)) > > ptent = pte_mkwrite(ptent); > > - } > > Hmm. I don't think this is correct. > > As fat as I can tell, may_avoid_write_fault() doesn't even check if > the vma is writable! > > Am I misreading it? Because I think you just made even a shared mmap > with "mprotect(PROT_READ)" turn the pte's writable. > > Which is a "slight" security issue. > > Maybe the new code is fine, and I'm missing something. The old code > looks strange too, which makes me think that the MM_CP_DIRTY_ACCT test > ends up saving us and depend on VM_WRITE. But it's very much not > obvious. vma_wants_writenotify() checks first VM_WRITE|VM_SHARED, otherwise MM_CP_DIRTY_ACCT will not be set. While for anonymous vmas the newly introduced may_avoid_write_fault() checks VM_WRITE explicitly. Agreed even if it's checked it's not straightforward. Maybe it'll be a bonus to have a comment above may_avoid_write_fault() about it in a follow up. > > And even if I _am_ missing something, I really would like a very > obvious and direct test for "this vma is writable", ie maybe a > > if (!(vma->vm_flags & VM_WRITE)) > return false; > > at the very top of the function. Yes looks okay too; I think using MM_CP_DIRTY_ACCT flag has a slight advantage in that it checks VM_WRITE only once before calling change_protection(), rather than doing the check for every pte even if we know it'll have the same result. However it indeed hides the facts deeper.. > > And no, "pte_dirty()" is not a reason to make something writable, it > might have started out as a writable mapping, and we dirtied the page, > and we made it read-only. The page stays dirty, but it shouldn't > become writable just because of that. I think the dirty bit checks are only to make sure we don't need those extra write faults. It should definitely be based on the fact that VM_WRITE being set already. Thanks,
On Tue, Jun 29, 2021 at 08:12:12PM -0400, Peter Xu wrote: > On Tue, Jun 29, 2021 at 10:50:12AM -0700, Linus Torvalds wrote: > > On Mon, Jun 28, 2021 at 7:40 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > > > > - /* Avoid taking write faults for known dirty pages */ > > > - if (dirty_accountable && pte_dirty(ptent) && > > > - (pte_soft_dirty(ptent) || > > > - !(vma->vm_flags & VM_SOFTDIRTY))) { > > > + if (may_avoid_write_fault(ptent, vma, cp_flags)) > > > ptent = pte_mkwrite(ptent); > > > - } > > > > Hmm. I don't think this is correct. > > > > As fat as I can tell, may_avoid_write_fault() doesn't even check if > > the vma is writable! > > > > Am I misreading it? Because I think you just made even a shared mmap > > with "mprotect(PROT_READ)" turn the pte's writable. > > > > Which is a "slight" security issue. > > > > Maybe the new code is fine, and I'm missing something. The old code > > looks strange too, which makes me think that the MM_CP_DIRTY_ACCT test > > ends up saving us and depend on VM_WRITE. But it's very much not > > obvious. > > vma_wants_writenotify() checks first VM_WRITE|VM_SHARED, otherwise > MM_CP_DIRTY_ACCT will not be set. While for anonymous vmas the newly > introduced may_avoid_write_fault() checks VM_WRITE explicitly. Sorry, this statement is unclear. It's not about anonymous or not, it's just that a hidden check against VM_WRITE is there already.. Say, below chunk of the patch: if (!(cp_flags & MM_CP_DIRTY_ACCT)) { /* Otherwise, we must have exclusive access to the page. */ if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE))) return false; if (page_count(pte_page(pte)) != 1) return false; } Should be the same as: if (!(cp_flags & MM_CP_DIRTY_ACCT)) { if (!vma_is_anonymous(vma)) return false; if (!(vma->vm_flags & VM_WRITE)) return false; if (page_count(pte_page(pte)) != 1) return false; } And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should be a slightly faster version of below: /* This just never trigger if MM_CP_DIRTY_ACCT set */ if (!(vma->vm_flags & VM_WRITE)) return false; if (!(cp_flags & MM_CP_DIRTY_ACCT)) { if (!vma_is_anonymous(vma)) return false; if (page_count(pte_page(pte)) != 1) return false; } It's just that we avoid checking "vma->vm_flags & VM_WRITE" when MM_CP_DIRTY_ACCT set. Again, I think in all cases some more comment should be good indeed.. > > Agreed even if it's checked it's not straightforward. Maybe it'll be a bonus > to have a comment above may_avoid_write_fault() about it in a follow up. > > > > > And even if I _am_ missing something, I really would like a very > > obvious and direct test for "this vma is writable", ie maybe a > > > > if (!(vma->vm_flags & VM_WRITE)) > > return false; > > > > at the very top of the function. > > Yes looks okay too; I think using MM_CP_DIRTY_ACCT flag has a slight advantage > in that it checks VM_WRITE only once before calling change_protection(), rather > than doing the check for every pte even if we know it'll have the same result. > However it indeed hides the facts deeper.. > > > > > And no, "pte_dirty()" is not a reason to make something writable, it > > might have started out as a writable mapping, and we dirtied the page, > > and we made it read-only. The page stays dirty, but it shouldn't > > become writable just because of that. > > I think the dirty bit checks are only to make sure we don't need those extra > write faults. It should definitely be based on the fact that VM_WRITE being > set already. > > Thanks, > > -- > Peter Xu
On Tue, Jun 29, 2021 at 6:39 PM Peter Xu <peterx@redhat.com> wrote: > > And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should > be a slightly faster version of below: That's way too subtle, particularly since the MM_CP_DIRTY_ACCT logic comes from another file entirely. I don't think it's even faster, considering that presumably the anonymous mapping case is the common one, and that's the one that needs all the extra tests, it's likely better to _not_ test that very subtle flag at all, and just doing the straightforward and obvious tests that are understandable _locally_. So I claim that it's (a) not an optimization at all (b) completely locally unintuitive and unreadable > Again, I think in all cases some more comment should be good indeed.. I really want more than a comment. I want that MM_CP_DIRTY_ACCT bit testing gone. The only point where it makes sense to check MM_CP_DIRTY_ACCT is within the context of "is the page already dirty". So I think the logic should be something along the lines of - first: if (!(vma->vm_flags & VM_WRITE)) return false; because that logic is set in stone, and true regardless of anything else. If the vma isn't writable, we're not going to set the write bit. End of story. - then, check the vma_is_anonumous() case: if (vma_is_anonymous(vma)) return page_count(pte_page(pte)) == 1; because if it's a writable mapping, and anonymous, then we can mark it writable if we're the exclusive owners of that page. - and THEN we can handle the "ok, shared mapping, now let's start thinking about dirty accounting" cases. Make it obvious and correct. This is not a sequence where you should try to (incorrectly) optimize away individual instructions. Linus
On Tue, Jun 29, 2021 at 07:25:42PM -0700, Linus Torvalds wrote: > On Tue, Jun 29, 2021 at 6:39 PM Peter Xu <peterx@redhat.com> wrote: > > > > And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should > > be a slightly faster version of below: > > That's way too subtle, particularly since the MM_CP_DIRTY_ACCT logic > comes from another file entirely. > > I don't think it's even faster, considering that presumably the > anonymous mapping case is the common one, and that's the one that > needs all the extra tests, it's likely better to _not_ test that very > subtle flag at all, and just doing the straightforward and obvious > tests that are understandable _locally_. > > So I claim that it's > > (a) not an optimization at all > > (b) completely locally unintuitive and unreadable > > > Again, I think in all cases some more comment should be good indeed.. > > I really want more than a comment. I want that MM_CP_DIRTY_ACCT bit > testing gone. My understanding is that MM_CP_DIRTY_ACCT contains all check results from vma_wants_writenotify(), so if we drop it we'd need to have something like that to be checked within change_pte_range(), which is again slower (I have totally no idea how slow to check vma->vm_flags & VM_WRITE, but moving the whole vma_wants_writenotify here is definitely even slower). > > The only point where it makes sense to check MM_CP_DIRTY_ACCT is > within the context of "is the page already dirty". > > So I think the logic should be something along the lines of > > - first: > > if (!(vma->vm_flags & VM_WRITE)) > return false; > > because that logic is set in stone, and true regardless of anything > else. If the vma isn't writable, we're not going to set the write bit. > End of story. > > - then, check the vma_is_anonumous() case: > > if (vma_is_anonymous(vma)) > return page_count(pte_page(pte)) == 1; > > because if it's a writable mapping, and anonymous, then we can > mark it writable if we're the exclusive owners of that page. Shouldn't we still at least checks [soft-]dirty bits and uffd-wp bits to make sure it's either not dirty tracked or uffd wr-protected? Say, IMHO it's possible that soft-dirty tracking enabled on this anonymous vma range, then we still depend on the write bit removed to set the soft-dirty later in the fault handler. > > - and THEN we can handle the "ok, shared mapping, now let's start > thinking about dirty accounting" cases. > > Make it obvious and correct. This is not a sequence where you should > try to (incorrectly) optimize away individual instructions. Yes I still fully agree it's very un-obvious. So far the best thing I can come up with is something like below (patch attached too but not yet tested). I moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the checks so the final outcome looks like below: static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma, unsigned long cp_flags) { /* * It is unclear whether this optimization can be done safely for NUMA * pages. */ if (cp_flags & MM_CP_PROT_NUMA) return false; /* * Never apply write bit if VM_WRITE not set. Note that this is * actually checked for VM_SHARED when MM_CP_DIRTY_ACCT is set, so * logically we only need to check it for !MM_CP_DIRTY_ACCT, but just * make it even more obvious. */ if (!(vma->vm_flags & VM_WRITE)) return false; /* * Don't do this optimization for clean pages as we need to be notified * of the transition from clean to dirty. */ if (!pte_dirty(pte)) return false; /* Same for softdirty. */ if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY)) return false; /* * For userfaultfd the user program needs to monitor write faults so we * can't do this optimization. */ if (pte_uffd_wp(pte)) return false; /* * MM_CP_DIRTY_ACCT indicates that we can always make the page writable * regardless of the number of references. Time to set the write bit. */ if (cp_flags & MM_CP_DIRTY_ACCT) return true; /* * Othewise it means !MM_CP_DIRTY_ACCT. We can only apply write bit * early if it's anonymous page and we exclusively own it. */ if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1)) return true; /* Don't play any trick */ return false; } The logic should be the same as before, it's just that we'll do an extra check on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok. Another side note is that I still think the VM_SOFTDIRTY check is wrong in may_avoid_write_fault() and even in the old code (I mentioned it previously when reviewing the patch), as !VM_SOFTDIRTY should mean soft dirty tracking enabled while VM_SOFTDIRTY means disabled. So I wonder whether it should be: - if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY)) + if (!pte_soft_dirty(pte) && !(vma->vm_flags & VM_SOFTDIRTY)) However I didn't touch it up there as it may need more justifications (I feel it's okay in the old code, as vma_wants_writenotify actually checks it too and in the right way; however after the anonymous fast path it seems to prone to error if it's anonymous; I'll check later). Thanks,
On Wed, Jun 30, 2021 at 9:42 AM Peter Xu <peterx@redhat.com> wrote: > > Yes I still fully agree it's very un-obvious. So far the best thing I can come > up with is something like below (patch attached too but not yet tested). I > moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the > checks so the final outcome looks like below: > > static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma, > unsigned long cp_flags) > { > /* > * It is unclear whether this optimization can be done safely for NUMA > * pages. > */ > if (cp_flags & MM_CP_PROT_NUMA) > return false; Please just put that VM_WRITE test first. It's the one that really *matters*. There's no "it's unclear if" about that part. Just handle the obvious and important check first. Yeah, yeah, they both return false, so order doesn't matter from a semantic standpoint, but from a clarity standpoint just do the clear and unambiguous and security-relevant test first. The rest of the tests are implementation details, the VM_WRITE test is fundamental behavior. It's the one that made me worry about this patch in the first place. > /* > * Don't do this optimization for clean pages as we need to be notified > * of the transition from clean to dirty. > */ > if (!pte_dirty(pte)) > return false; > > /* Same for softdirty. */ > if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY)) > return false; > > /* > * For userfaultfd the user program needs to monitor write faults so we > * can't do this optimization. > */ > if (pte_uffd_wp(pte)) > return false; So all of these are a bit special. Why? Because if I look at the actual page fault path, these are not the tests there. I'd really like to have some obvious situation where we keep this "make it writable" in sync with what would actually happen on a write fault when it's not writable. And it's not at all obvious to me for these cases. The do_wp_page() code doesn't even use pte_uffd_wp(). It uses userfaultfd_pte_wp(vma, pte), and I don't even know why. Yes, I can see the code (it additionally tests the VM_UFFD_WP flag in the vma), but a number of other paths then only do that pte_uffd_wp() test. I get the feeling that we really should try to match what the do_wp_page() path does, though. Which brings up another issue: the do_wp_page() path treats PageKsm() pages differently. And it locks the page before double-checking the page count. Why does mprotect() not need to do the same thing? I think this has come up before, and "change_protection()" can get called with the mmap_sem held just for reading - see userfaultfd - so it has all the same issues as a page fault does, afaik. > /* > * MM_CP_DIRTY_ACCT indicates that we can always make the page writable > * regardless of the number of references. Time to set the write bit. > */ > if (cp_flags & MM_CP_DIRTY_ACCT) > return true; > > /* > * Othewise it means !MM_CP_DIRTY_ACCT. We can only apply write bit > * early if it's anonymous page and we exclusively own it. > */ > if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1)) > return true; > > /* Don't play any trick */ > return false; > } > > The logic should be the same as before, it's just that we'll do an extra check > on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok. See above. I don't think the logic before was all that clear either. The one case that is clear is that if it's a shared mapping, and MM_CP_DIRTY_ACCT is set, and it was already dirty (and softdirty), then it's ok., That's the old code. I don't like how the old code was written (because I think that MM_CP_DIRTY_ACCT bit wasx too subtle), but I think the old code was at least correct. The new code, it just worries me. It adds all those new cases for when we can make the page writable early - that's the whole point of the patch, after all - but my point here is that it's not at all obvious that those new cases are actually correct. MAYBE it's all correct. I'm not saying it's wrong. I'm just saying it's not _obvious_ that it's correct. What about that page_count() test, for example: it has a comment, it looks obvious, but it's very different from what do_wp_page() does. So what happens if we have a page-out at the same time that turns that page into a swap cache page, and increments the page count? What about that race? Do we end up with a writable page that is shared with a swap cache entry? Is that ok? Why isn't it ok in the page fault case? See why this patch worries me so much? Linus
On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote: > On Wed, Jun 30, 2021 at 9:42 AM Peter Xu <peterx@redhat.com> wrote: > > > > Yes I still fully agree it's very un-obvious. So far the best thing I can come > > up with is something like below (patch attached too but not yet tested). I > > moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the > > checks so the final outcome looks like below: > > > > static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma, > > unsigned long cp_flags) > > { > > /* > > * It is unclear whether this optimization can be done safely for NUMA > > * pages. > > */ > > if (cp_flags & MM_CP_PROT_NUMA) > > return false; > > Please just put that VM_WRITE test first. It's the one that really > *matters*. There's no "it's unclear if" about that part. Just handle > the obvious and important check first. > > Yeah, yeah, they both return false, so order doesn't matter from a > semantic standpoint, but from a clarity standpoint just do the clear > and unambiguous and security-relevant test first. > > The rest of the tests are implementation details, the VM_WRITE test is > fundamental behavior. It's the one that made me worry about this patch > in the first place. Sure. > > > /* > > * Don't do this optimization for clean pages as we need to be notified > > * of the transition from clean to dirty. > > */ > > if (!pte_dirty(pte)) > > return false; > > > > /* Same for softdirty. */ > > if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY)) > > return false; > > > > /* > > * For userfaultfd the user program needs to monitor write faults so we > > * can't do this optimization. > > */ > > if (pte_uffd_wp(pte)) > > return false; > > So all of these are a bit special. > > Why? Because if I look at the actual page fault path, these are not > the tests there. > > I'd really like to have some obvious situation where we keep this > "make it writable" in sync with what would actually happen on a write > fault when it's not writable. > > And it's not at all obvious to me for these cases. > > The do_wp_page() code doesn't even use pte_uffd_wp(). It uses > userfaultfd_pte_wp(vma, pte), and I don't even know why. Yes, I can > see the code (it additionally tests the VM_UFFD_WP flag in the vma), > but a number of other paths then only do that pte_uffd_wp() test. The vma check is a safety net to make sure it's not the case e.g. when the vma has already unregistered from uffd-wp while there's uffd-wp bit left over. E.g., currently UFFDIO_UNREGISTER is lazy on removing uffd-wp bits that applied to ptes, so even vma is unregistered there could still have pte_uffd_wp() being true for some ptes. That vma check makes sure when it happens the uffd-wp bit will be auto-removed. > > I get the feeling that we really should try to match what the > do_wp_page() path does, though. Makes sense. > > Which brings up another issue: the do_wp_page() path treats PageKsm() > pages differently. And it locks the page before double-checking the > page count. > > Why does mprotect() not need to do the same thing? I think this has > come up before, and "change_protection()" can get called with the > mmap_sem held just for reading - see userfaultfd - so it has all the > same issues as a page fault does, afaik. Good point.. I overlooked ksm when reviewing the patch, while I should really have looked at do_wp_page() as you suggested (hmm.. the truth is I wasn't even aware of this patch and never planned to try to review it, until it breaks the uffd-wp anonymous tests in its initial versions when I was testing mmots...). Maybe something like this (to be squashed into the previously attached patch): ---8<--- diff --git a/mm/mprotect.c b/mm/mprotect.c index 3977bfd55f62..7aab30ac9c9f 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -39,12 +39,8 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma, unsigned long cp_flags) { - /* - * It is unclear whether this optimization can be done safely for NUMA - * pages. - */ - if (cp_flags & MM_CP_PROT_NUMA) - return false; + struct page *page; + bool ret = false; /* * Never apply write bit if VM_WRITE not set. Note that this is @@ -55,6 +51,13 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma, if (!(vma->vm_flags & VM_WRITE)) return false; + /* + * It is unclear whether this optimization can be done safely for NUMA + * pages. + */ + if (cp_flags & MM_CP_PROT_NUMA) + return false; + /* * Don't do this optimization for clean pages as we need to be notified * of the transition from clean to dirty. @@ -80,15 +83,22 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma, if (cp_flags & MM_CP_DIRTY_ACCT) return true; + page = pte_page(pte); + /* Best effort to take page lock, don't play trick if failed */ + if (!trylock_page(page)) + return false; + /* KSM pages needs COW; leave them be */ + if (PageKsm(page)) + goto unlock_fail; /* - * Othewise it means !MM_CP_DIRTY_ACCT. We can only apply write bit - * early if it's anonymous page and we exclusively own it. + * Othewise it means !MM_CP_DIRTY_ACCT and !KSM. We can only apply + * write bit early if it's anonymous page and we exclusively own it. */ - if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1)) - return true; - - /* Don't play any trick */ - return false; + if (vma_is_anonymous(vma) && (page_count(page) == 1)) + ret = true; +unlock_fail: + unlock_page(page); + return ret; } static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, ---8<--- I hope I didn't overlook something else.. Today when I was looking at ksm code, I found that I got lost on why we say "PageKsm() doesn't necessarily raise the page refcount", as in do_wp_page(). I was looking at replace_page() where, afaict, we still do proper refcounting for the stable nodes with "get_page(kpage)". I know I must missed something but I can't quickly tell. In all cases with above PageKsm check I think it'll be safe, and it's definitely clear that page lock will stablize PageKsm()'s return value, like do_wp_page(). > > > /* > > * MM_CP_DIRTY_ACCT indicates that we can always make the page writable > > * regardless of the number of references. Time to set the write bit. > > */ > > if (cp_flags & MM_CP_DIRTY_ACCT) > > return true; > > > > /* > > * Othewise it means !MM_CP_DIRTY_ACCT. We can only apply write bit > > * early if it's anonymous page and we exclusively own it. > > */ > > if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1)) > > return true; > > > > /* Don't play any trick */ > > return false; > > } > > > > The logic should be the same as before, it's just that we'll do an extra check > > on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok. > > See above. I don't think the logic before was all that clear either. > > The one case that is clear is that if it's a shared mapping, and > MM_CP_DIRTY_ACCT is set, and it was already dirty (and softdirty), > then it's ok., > > That's the old code. I don't like how the old code was written > (because I think that MM_CP_DIRTY_ACCT bit wasx too subtle), but I > think the old code was at least correct. > > The new code, it just worries me. It adds all those new cases for when > we can make the page writable early - that's the whole point of the > patch, after all - but my point here is that it's not at all obvious > that those new cases are actually correct. Yes agreed the MM_CP_DIRTY_ACCT bit is very subtle and not easy to get. It's just that I don't have a good idea to make it better, yet.. > > MAYBE it's all correct. I'm not saying it's wrong. I'm just saying > it's not _obvious_ that it's correct. > > What about that page_count() test, for example: it has a comment, it > looks obvious, but it's very different from what do_wp_page() does. So > what happens if we have a page-out at the same time that turns that > page into a swap cache page, and increments the page count? What about > that race? Do we end up with a writable page that is shared with a > swap cache entry? Is that ok? Why isn't it ok in the page fault case? That looks fine to me: when the race happens we must have checked page_count==1 first and granted the write bit, then add_to_swap_cache() happens after the page_count==1 check (as it takes another refcount, so >2 otherwise). Then it also means unmap mappings should happen even after that point. If my above understanding is correct, our newly installed pte will be zapped safely soon, but of course after we release the pgtable lock in change_pte_range(). Thanks,
On Wed, Jun 30, 2021 at 6:27 PM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote: > > > > What about that page_count() test, for example: it has a comment, it > > looks obvious, but it's very different from what do_wp_page() does. So > > what happens if we have a page-out at the same time that turns that > > page into a swap cache page, and increments the page count? What about > > that race? Do we end up with a writable page that is shared with a > > swap cache entry? Is that ok? Why isn't it ok in the page fault case? > > That looks fine to me: when the race happens we must have checked page_count==1 > first and granted the write bit, then add_to_swap_cache() happens after the > page_count==1 check (as it takes another refcount, so >2 otherwise). Then it > also means unmap mappings should happen even after that point. If my above > understanding is correct, our newly installed pte will be zapped safely soon, > but of course after we release the pgtable lock in change_pte_range(). So if this is fine, then maybe we should just remove the page lock in the do_wp_page() path (and remove the PageKSM check while at it)? If it's not required by mprotect() to say "I can make the page writable directly", then it really shouldn't be required by the page fault path either. Which I'd love to do, and was really itching to do (it's a nasty lock), but I worried about it.. I'd hate to have mprotect do one thing, and page faulting do another thing, and not have some logic to why they have to be different. Linus
(sorry for a very late reply) On Thu, Jul 01, 2021 at 11:29:50AM -0700, Linus Torvalds wrote: > On Wed, Jun 30, 2021 at 6:27 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote: > > > > > > What about that page_count() test, for example: it has a comment, it > > > looks obvious, but it's very different from what do_wp_page() does. So > > > what happens if we have a page-out at the same time that turns that > > > page into a swap cache page, and increments the page count? What about > > > that race? Do we end up with a writable page that is shared with a > > > swap cache entry? Is that ok? Why isn't it ok in the page fault case? > > > > That looks fine to me: when the race happens we must have checked page_count==1 > > first and granted the write bit, then add_to_swap_cache() happens after the > > page_count==1 check (as it takes another refcount, so >2 otherwise). Then it > > also means unmap mappings should happen even after that point. If my above > > understanding is correct, our newly installed pte will be zapped safely soon, > > but of course after we release the pgtable lock in change_pte_range(). > > So if this is fine, then maybe we should just remove the page lock in > the do_wp_page() path (and remove the PageKSM check while at it)? I could be wrong, but I thought the page lock in do_wp_page() is more for the PageKsm() race (e.g., to make sure we don't grant write to a page that is becoming a ksm page in parallel). > > If it's not required by mprotect() to say "I can make the page > writable directly", then it really shouldn't be required by the page > fault path either. > > Which I'd love to do, and was really itching to do (it's a nasty > lock), but I worried about it.. > > I'd hate to have mprotect do one thing, and page faulting do another > thing, and not have some logic to why they have to be different. Agreed; perhaps no need to be identical - I think the mprotect path can be even stricter than the fault page, as it's a fast-path only. It should never apply the write bit when the page fault path won't. So I think the original patch does need a justification on why it didn't handle ksm page while do_wp_page handled it. Thanks,
--- a/mm/mprotect.c~mm-improve-mprotectrw-efficiency-on-pages-referenced-once +++ a/mm/mprotect.c @@ -35,6 +35,51 @@ #include "internal.h" +/* Determine whether we can avoid taking write faults for known dirty pages. */ +static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma, + unsigned long cp_flags) +{ + /* + * The dirty accountable bit indicates that we can always make the page + * writable regardless of the number of references. + */ + if (!(cp_flags & MM_CP_DIRTY_ACCT)) { + /* Otherwise, we must have exclusive access to the page. */ + if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE))) + return false; + + if (page_count(pte_page(pte)) != 1) + return false; + } + + /* + * Don't do this optimization for clean pages as we need to be notified + * of the transition from clean to dirty. + */ + if (!pte_dirty(pte)) + return false; + + /* Same for softdirty. */ + if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY)) + return false; + + /* + * For userfaultfd the user program needs to monitor write faults so we + * can't do this optimization. + */ + if (pte_uffd_wp(pte)) + return false; + + /* + * It is unclear whether this optimization can be done safely for NUMA + * pages. + */ + if (cp_flags & MM_CP_PROT_NUMA) + return false; + + return true; +} + static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) @@ -43,7 +88,6 @@ static unsigned long change_pte_range(st spinlock_t *ptl; unsigned long pages = 0; int target_node = NUMA_NO_NODE; - bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT; bool prot_numa = cp_flags & MM_CP_PROT_NUMA; bool uffd_wp = cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; @@ -131,12 +175,8 @@ static unsigned long change_pte_range(st ptent = pte_clear_uffd_wp(ptent); } - /* Avoid taking write faults for known dirty pages */ - if (dirty_accountable && pte_dirty(ptent) && - (pte_soft_dirty(ptent) || - !(vma->vm_flags & VM_SOFTDIRTY))) { + if (may_avoid_write_fault(ptent, vma, cp_flags)) ptent = pte_mkwrite(ptent); - } ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); pages++; } else if (is_swap_pte(oldpte)) {