Message ID | 20210915181456.10739-2-peterx@redhat.com (mailing list archive) |
---|---|
State | New |
Series | mm: A few cleanup patches around zap, shmem and uffd |
On Wed, 15 Sep 2021, Peter Xu wrote:
> It was conditionally done previously, as there's one shmem special case that we
> use SetPageDirty() instead. However that's not necessary and it should be
> easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
>
> The most recent discussion about this is here, where Hugh explained the history
> of SetPageDirty() and why it's possible that it's not required at all:
>
> https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/
>
> Currently mfill_atomic_install_pte() has three callers:
>
> 1. shmem_mfill_atomic_pte
> 2. mcopy_atomic_pte
> 3. mcontinue_atomic_pte
>
> After the change: case (1) should have its SetPageDirty replaced by the dirty
> bit on pte (so we unify them together, finally), case (2) should have no
> functional change at all as it has page_in_cache==false, case (3) may add a
> dirty bit to the pte. However since case (3) is UFFDIO_CONTINUE for shmem,
> it's almost 100% sure the page is dirty after all because UFFDIO_CONTINUE
> normally requires another process to modify the page cache and kick the faulted
> thread, so should not make a real difference either.
>
> This should make it much easier to follow on which case will set dirty for
> uffd, as we'll simply set it all now for all uffd related ioctls. Meanwhile,
> no special handling of SetPageDirty() if there's no need.
>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

I'm not going to NAK this, but you and I have different ideas of
"very nice cleanups". Generally, you appear (understandably) to be
trying to offload pieces of work from your larger series, but often
I don't see the sense of them, here in isolation anyway.

Is this a safe transformation of the existing code?
Yes, I believe so
(at least until someone adds some PTESAN checker which looks to see
if any ptes are dirty in vmas to which user never had write access).
But it took quite a lot of lawyering to arrive at that conclusion.

Is this a cleanup? No, it's a dirtyup.

shmem_mfill_atomic_pte() does SetPageDirty (before unlocking page)
because that's where the page contents are made dirty. You could
criticise it for doing SetPageDirty even in the zeropage case:
yes, we've been lazy there; but that's a different argument.

If someone is faulting this page into a read-only vma, it's
surprising to make the pte dirty there. What would be most correct
would be to keep the SetPageDirty in shmem_mfill_atomic_pte()
(with or without zeropage optimization), and probably SetPageDirty
in some other places in mm/userfaultfd.c (I didn't look where) when
the page is filled with supplied data, and mfill_atomic_install_pte()
only do that pte_mkdirty() when it's serving a FAULT_FLAG_WRITE.

I haven't looked again (I have a pile of mails to respond to!),
but when I looked before I think I found that the vmf flags are
not available to the userfaultfd ioctler. If so, then it would
be more appropriate to just leave the mkdirty to the hardware on
return from fault (except - and again I cannot spend time researching
this - perhaps I'm too x86-centric, and there are other architectures
on which the software *must* do the mkdirty fixup to avoid refaulting
forever - though probably userfaultfd state would itself prevent that).

But you seem to think that doing the dirtying in an unnatural place
helps somehow; and for all I know, that may be so in your larger
series, though this change certainly raises suspicions of that.

I'm sorry to be so discouraging, but you have asked for my opinion,
and here at last you have it. Not a NAK, but no enthusiasm at all.
Hugh

> ---
>  mm/shmem.c       | 1 -
>  mm/userfaultfd.c | 3 +--
>  2 files changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 88742953532c..96ccf6e941aa 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2424,7 +2424,6 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
>  	shmem_recalc_inode(inode);
>  	spin_unlock_irq(&info->lock);
>
> -	SetPageDirty(page);
>  	unlock_page(page);
>  	return 0;
>  out_delete_from_cache:
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 7a9008415534..caf6dfff2a60 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  	pgoff_t offset, max_off;
>
>  	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> +	_dst_pte = pte_mkdirty(_dst_pte);
>  	if (page_in_cache && !vm_shared)
>  		writable = false;
> -	if (writable || !page_in_cache)
> -		_dst_pte = pte_mkdirty(_dst_pte);
>  	if (writable) {
>  		if (wp_copy)
>  			_dst_pte = pte_mkuffd_wp(_dst_pte);
> --
> 2.31.1
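For readers following the diff, the condition being removed can be modelled in plain userspace C. This is only a sketch: `struct install_args`, `pte_dirty_before()` and `pte_dirty_after()` are hypothetical names invented here, and the assumption that `writable` starts out as "the destination vma is writable" is not shown in the quoted hunk. It compares when the pte would be marked dirty before and after the patch.

```c
#include <stdbool.h>

/* Toy inputs named after the locals visible in the quoted hunk. */
struct install_args {
	bool vma_writable;  /* assumed initial value of "writable" */
	bool vm_shared;     /* destination vma is a shared mapping */
	bool page_in_cache; /* page belongs to the page cache (shmem) */
};

/* Before the patch: dirty only for writable ptes or non-page-cache pages. */
static bool pte_dirty_before(struct install_args a)
{
	bool writable = a.vma_writable;

	/* A page cache page in a private mapping is installed read-only. */
	if (a.page_in_cache && !a.vm_shared)
		writable = false;
	return writable || !a.page_in_cache;
}

/* After the patch: pte_mkdirty() is applied unconditionally. */
static bool pte_dirty_after(struct install_args a)
{
	(void)a;
	return true;
}
```

In this model the two functions differ only for page cache pages installed without write permission, which is the read-only case Hugh objects to above.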
Hi, Hugh,

On Thu, Sep 23, 2021 at 08:56:33PM -0700, Hugh Dickins wrote:
> I'm not going to NAK this, but you and I have different ideas of
> "very nice cleanups". Generally, you appear (understandably) to be
> trying to offload pieces of work from your larger series, but often
> I don't see the sense of them, here in isolation anyway.
>
> Is this a safe transformation of the existing code? Yes, I believe so
> (at least until someone adds some PTESAN checker which looks to see
> if any ptes are dirty in vmas to which user never had write access).
> But it took quite a lot of lawyering to arrive at that conclusion.

I can get your point there, but I keep a skeptical view if there'll be a tool
called PTESAN that asserts VM_WRITE for pte_dirty.

After we've noticed the arm64 implementation of pte_mkdirty() last time, I've
already started to not bind the ideas on VM_WRITE or pte_write() for pte dirty.

As I said before, that's quite natural when I think "the uffd-way", because
uffd can easily arm a page with read-only but while the page is dirty. I think
you'll answer that with "we should mark the page dirty instead" in this case,
as you stated below. I also agree. However if we see pte_dirty a major way to
track data dirty information, and at last when it'll be converged into the
PageDirty, I think it doesn't really matter a huge lot to us if we set pte or
page dirty, or is it?

>
> Is this a cleanup? No, it's a dirtyup.
>
> shmem_mfill_atomic_pte() does SetPageDirty (before unlocking page)
> because that's where the page contents are made dirty. You could
> criticise it for doing SetPageDirty even in the zeropage case:
> yes, we've been lazy there; but that's a different argument.
>
> If someone is faulting this page into a read-only vma, it's
> surprising to make the pte dirty there.
> What would be most correct
> would be to keep the SetPageDirty in shmem_mfill_atomic_pte()
> (with or without zeropage optimization), and probably SetPageDirty
> in some other places in mm/userfaultfd.c (I didn't look where) when
> the page is filled with supplied data, and mfill_atomic_install_pte()
> only do that pte_mkdirty() when it's serving a FAULT_FLAG_WRITE.

That's a good point, and yeah if we can unconditionally mark PageDirty it'll be
great too; I think what bothered me most in the past was that the condition to
check dirty is too complicated, for which myself has been debugging for two
cases where we should apply the dirty bit but we forgot; each of the debugging
process took me a few days or more to figure out, thanks to my awkward
debugging skills.

Then I noticed, why not we do the way around if for 99% of the cases they're
dirty in real systems? Say, let's set dirty unconditionally and see when there
(could have, which I still doubt) is a negative effect on having some page
dirty, we track that from a "degraded" performance results. Then we convert
some hard-to-debug data corrupt issues into "oh previously this programs runs
at speed 100x, now it runs 99x, why I got 1% performance lost?" I even highly
doubt whether it'll come true: for the uffd case (which is the only case I
modified in this patch), I can hardly tell how many people would like to use
the mappings read-only, and how much they'll suffer from that extra dirty bit
or PageDirty.

That's why I really like this patch to happen, I want to save time for myself,
and for anyone who will be fighting for another dirty lost issues.

>
> I haven't looked again (I have a pile of mails to respond to!),
> but when I looked before I think I found that the vmf flags are
> not available to the userfaultfd ioctler.
> If so, then it would
> be more appropriate to just leave the mkdirty to the hardware on
> return from fault (except - and again I cannot spend time researching
> this - perhaps I'm too x86-centric, and there are other architectures
> on which the software *must* do the mkdirty fixup to avoid refaulting
> forever - though probably userfaultfd state would itself prevent that).

If it's based on the fact that we'll set PageDirty for file-backed, then it
looks okay, but not sure.

One thing to mention is pte_mkdirty() also counts in soft dirty by nature. I'm
imagining a program that was soft-dirty tracked and somehow using UFFDIO_COPY
as the major data filler (so the task itself may not write to the page directly
hence HW won't set dirty bit there). If with pte_mkdirty the other userspace
tracker with soft-dirty can still detect this, while with PageDirty I believe
it can't. From that POV I'm not sure whether I can say that as proactively
doing pte_mkdirty is a safer approach just in case such a use case exists, as
myself can't say they're illegal, so pte_dirty is a superset of PageDirty not
vice versa.

>
> But you seem to think that doing the dirtying in an unnatural place
> helps somehow; and for all I know, that may be so in your larger
> series, though this change certainly raises suspicions of that.
>
> I'm sorry to be so discouraging, but you have asked for my opinion,
> and here at last you have it. Not a NAK, but no enthusiasm at all.

Thanks a lot for still looking at these patches; even if most of them are
negative and they come a bit late for sure.. I still appreciate your time. As
you mentioned you're busy with all the things and I'm aware of it. And that's
really what this patch wants to achieve too - to save time for all, where my
point stands at "maintaining 100% accurate dirty bit is not worth it here".

Thanks,
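The soft-dirty argument above can be pictured with a toy model. Everything in this sketch is hypothetical: the `TOY_*` bit values, `struct toy_page` and the helper names are made up for illustration, and the only thing taken from the thread is the claim that pte_mkdirty() also sets the soft-dirty bit while a page-level dirty flag does not touch the pte.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for this sketch only. */
#define TOY_PTE_DIRTY      (1u << 0)
#define TOY_PTE_SOFT_DIRTY (1u << 1)

/* Stand-in for struct page with only a dirty flag. */
struct toy_page {
	bool page_dirty;
};

/* Models a pte_mkdirty() that sets soft-dirty together with dirty. */
static uint32_t toy_pte_mkdirty(uint32_t pte)
{
	return pte | TOY_PTE_DIRTY | TOY_PTE_SOFT_DIRTY;
}

/* Models SetPageDirty(): the pte is left untouched. */
static void toy_set_page_dirty(struct toy_page *page)
{
	page->page_dirty = true;
}

/* What a soft-dirty tracker that only inspects the pte would report. */
static bool toy_tracker_sees_write(uint32_t pte)
{
	return (pte & TOY_PTE_SOFT_DIRTY) != 0;
}
```

Under these assumptions a page filled and marked only via `toy_set_page_dirty()` stays invisible to the tracker, while one installed with `toy_pte_mkdirty()` is reported. Whether that extra report is desirable is exactly what the thread goes on to dispute.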
On Mon, 27 Sep 2021, Peter Xu wrote:
> On Thu, Sep 23, 2021 at 08:56:33PM -0700, Hugh Dickins wrote:
> > I'm not going to NAK this, but you and I have different ideas of
> > "very nice cleanups". Generally, you appear (understandably) to be
> > trying to offload pieces of work from your larger series, but often
> > I don't see the sense of them, here in isolation anyway.
> >
> > Is this a safe transformation of the existing code? Yes, I believe so
> > (at least until someone adds some PTESAN checker which looks to see
> > if any ptes are dirty in vmas to which user never had write access).
> > But it took quite a lot of lawyering to arrive at that conclusion.
>
> I can get your point there, but I keep a skeptical view if there'll be a tool
> called PTESAN that asserts VM_WRITE for pte_dirty.
>
> After we've noticed the arm64 implementation of pte_mkdirty() last time, I've
> already started to not bind the ideas on VM_WRITE or pte_write() for pte dirty.

Yes, I know that there are good cases of pte_dirty() without pte_write():
that's why I said "never had write access" - never.

>
> As I said before, that's quite natural when I think "the uffd-way", because
> uffd can easily arm a page with read-only but while the page is dirty. I think
> you'll answer that with "we should mark the page dirty instead" in this case,
> as you stated below. I also agree. However if we see pte_dirty a major way to
> track data dirty information, and at last when it'll be converged into the
> PageDirty, I think it doesn't really matter a huge lot to us if we set pte or
> page dirty, or is it?

Please, imagine the faulting process has a PROT_READ,MAP_SHARED mmap of
/etc/passwd, or any of a million files you would not want it to write to.
The process serving that fault by doing the ioctl has (in my perhaps
mistaken mental model of userfaultfd) greater privilege, and is able to
fill in the contents of what /etc/passwd should contain: it fills the right
data into the page, which is preserved by being marked PageDirty.
The faulting process can never write to that page, and that pte ought never
to be marked dirty.

Marking the pte dirty is not a security problem: doing so does not grant
write access (though it's easy to imagine incorrect code elsewhere that
"deduces" pte_write() from pte_dirty() for some reason - I'm pretty sure
we have had such instances in the past, if not now).

But it is an anomaly that would be better avoided. Which is precisely why
there is the "writable" check before doing it at present ("writable" being
a stand-in for the FAULT_FLAG_WRITE not visible at this end).

(There is also "|| !page_in_cache" - that's to allow for a read-only mmap
of a page supplied by mcopy_atomic_pte(): which I argue would be better
with a SetPageDirty() in the caller than a pte_mkdirty() here; but
anonymous pages are less of a worry than shared file pages.)

So, as I said before, I believe what you're doing in this patch happens
to be a safe transformation of existing code; but not a nice cleanup.

>
> >
> > Is this a cleanup? No, it's a dirtyup.
> >
> > shmem_mfill_atomic_pte() does SetPageDirty (before unlocking page)
> > because that's where the page contents are made dirty. You could
> > criticise it for doing SetPageDirty even in the zeropage case:
> > yes, we've been lazy there; but that's a different argument.
> >
> > If someone is faulting this page into a read-only vma, it's
> > surprising to make the pte dirty there.
> > What would be most correct
> > would be to keep the SetPageDirty in shmem_mfill_atomic_pte()
> > (with or without zeropage optimization), and probably SetPageDirty
> > in some other places in mm/userfaultfd.c (I didn't look where) when
> > the page is filled with supplied data, and mfill_atomic_install_pte()
> > only do that pte_mkdirty() when it's serving a FAULT_FLAG_WRITE.
>
> That's a good point, and yeah if we can unconditionally mark PageDirty it'll be
> great too; I think what bothered me most in the past was that the condition to
> check dirty is too complicated, for which myself has been debugging for two
> cases where we should apply the dirty bit but we forgot; each of the debugging
> process took me a few days or more to figure out, thanks to my awkward
> debugging skills.
>
> Then I noticed, why not we do the way around if for 99% of the cases they're
> dirty in real systems? Say, let's set dirty unconditionally and see when there
> (could have, which I still doubt) is a negative effect on having some page
> dirty, we track that from a "degraded" performance results. Then we convert
> some hard-to-debug data corrupt issues into "oh previously this programs runs
> at speed 100x, now it runs 99x, why I got 1% performance lost?" I even highly
> doubt whether it'll come true: for the uffd case (which is the only case I
> modified in this patch), I can hardly tell how many people would like to use
> the mappings read-only, and how much they'll suffer from that extra dirty bit
> or PageDirty.
>
> That's why I really like this patch to happen, I want to save time for myself,
> and for anyone who will be fighting for another dirty lost issues.

I've lost time on missed dirties too, so I ought to be more sympathetic
to your argument than I am: I'm afraid I read it as saying that you don't
really understand "dirty", so want to do it more often to be "safe".
Not a persuasive argument.
>
> >
> > I haven't looked again (I have a pile of mails to respond to!),
> > but when I looked before I think I found that the vmf flags are
> > not available to the userfaultfd ioctler. If so, then it would
> > be more appropriate to just leave the mkdirty to the hardware on
> > return from fault (except - and again I cannot spend time researching
> > this - perhaps I'm too x86-centric, and there are other architectures
> > on which the software *must* do the mkdirty fixup to avoid refaulting
> > forever - though probably userfaultfd state would itself prevent that).
>
> If it's based on the fact that we'll set PageDirty for file-backed, then it
> looks okay, but not sure.
>
> One thing to mention is pte_mkdirty() also counts in soft dirty by nature. I'm
> imagining a program that was soft-dirty tracked and somehow using UFFDIO_COPY
> as the major data filler (so the task itself may not write to the page directly
> hence HW won't set dirty bit there). If with pte_mkdirty the other userspace
> tracker with soft-dirty can still detect this, while with PageDirty I believe
> it can't. From that POV I'm not sure whether I can say that as proactively
> doing pte_mkdirty is a safer approach just in case such a use case exists, as
> myself can't say they're illegal, so pte_dirty is a superset of PageDirty not
> vice versa.

And this is not persuasive either: without much deeper analysis
(which I'll decline to do!), it's impossible to tell whether an excess of
pte_mkdirty()s is good or bad for the hypothetical uffd+softdirty tracker:
you're guessing good, I'm guessing bad.

How about a compromise (if you really want to continue with this patch):
you leave the SetPageDirty(page) in shmem_mfill_atomic_pte(), where I
feel a responsibility for it; but you do whatever works for you with
pte_mkdirty() at the mm/userfaultfd.c end?
(In the course of writing this, it has occurred to me that a much nicer
solution might be to delete mfill_atomic_install_pte() altogether, and
change the userfaultfd protocol so that handle_userfault() returns the
page supplied by ioctler, for process to map into its own userspace
in its usual way. But that's a big change, that neither of us would be
keen to make; and it would not be surprising if it turned out actually
to be a very bad change - perhaps tried and abandoned before the "atomic"
functions were decided on. I wouldn't even dare mention it, unless that
direction might happen to fit in with something else you're planning.)

Hugh
On Tue, Sep 28, 2021 at 12:28:40PM -0700, Hugh Dickins wrote:
> > That's why I really like this patch to happen, I want to save time for myself,
> > and for anyone who will be fighting for another dirty lost issues.
>
> I've lost time on missed dirties too, so I ought to be more sympathetic
> to your argument than I am: I'm afraid I read it as saying that you don't
> really understand "dirty", so want to do it more often to be "safe".
> Not a persuasive argument.

I'd hope I didn't misunderstand the dirty bit; please let me know otherwise,
then I must have been making a very serious mistake, and also then I'll be
more than glad to learn to make things right, as long as you could help me to
achieve it.

FWICT, I was trying to argue it's not worth it to worry the corner case, but
obviously I failed. :) But that's fine and understandable, because it happens.

>
> >
> > >
> > > I haven't looked again (I have a pile of mails to respond to!),
> > > but when I looked before I think I found that the vmf flags are
> > > not available to the userfaultfd ioctler. If so, then it would
> > > be more appropriate to just leave the mkdirty to the hardware on
> > > return from fault (except - and again I cannot spend time researching
> > > this - perhaps I'm too x86-centric, and there are other architectures
> > > on which the software *must* do the mkdirty fixup to avoid refaulting
> > > forever - though probably userfaultfd state would itself prevent that).
> >
> > If it's based on the fact that we'll set PageDirty for file-backed, then it
> > looks okay, but not sure.
> >
> > One thing to mention is pte_mkdirty() also counts in soft dirty by nature. I'm
> > imagining a program that was soft-dirty tracked and somehow using UFFDIO_COPY
> > as the major data filler (so the task itself may not write to the page directly
> > hence HW won't set dirty bit there).
> > If with pte_mkdirty the other userspace
> > tracker with soft-dirty can still detect this, while with PageDirty I believe
> > it can't. From that POV I'm not sure whether I can say that as proactively
> > doing pte_mkdirty is a safer approach just in case such a use case exists, as
> > myself can't say they're illegal, so pte_dirty is a superset of PageDirty not
> > vice versa.
>
> And this is not persuasive either: without much deeper analysis
> (which I'll decline to do!), it's impossible to tell whether an excess of
> pte_mkdirty()s is good or bad for the hypothetical uffd+softdirty tracker:
> you're guessing good, I'm guessing bad.
>
> How about a compromise (if you really want to continue with this patch):
> you leave the SetPageDirty(page) in shmem_mfill_atomic_pte(), where I
> feel a responsibility for it; but you do whatever works for you with
> pte_mkdirty() at the mm/userfaultfd.c end?

Sure. Duplicating dirty bit is definitely fine to me as it achieves the same
goal as I hoped - we're still 100% clear we won't free a uffd page without
being noticed, then that's enough to me for the goal of this patch. I won't
initiate that NACK myself since I still think duplicating is unnecessary no
matter it resides in shmem or uffd code, but please go ahead doing that and
I'll be fine with it, just in case Andrew didn't follow the details.

Actually, I can feel how strong your opinion is at this point on keeping the
old way on this patch, and yes I definitely touched the shmem code (your
preference has its reasoning; I have mine too and I had a preference on the
other side, but I failed to persuade you..). So also feel free to NACK the
whole patch with both the SetPageDirty and condition checks on pte_mkdirty(),
I'm fine with it too. I can keep the conditions in my future series too.

Not like the other zap_details patch where I believe I must need at some point,
this patch is something I hoped a lot to have, but it's not a must.
I know this
"hole" already so I won't fall into it so hard the 3rd time (actually on the
2nd time I debugged faster and quickly noticed I just fell into this same hole
I've used to fall). Let's wish the same to the others.

>
> (In the course of writing this, it has occurred to me that a much nicer
> solution might be to delete mfill_atomic_install_pte() altogether, and
> change the userfaultfd protocol so that handle_userfault() returns the
> page supplied by ioctler, for process to map into its own userspace
> in its usual way. But that's a big change, that neither of us would be
> keen to make; and it would not be surprising if it turned out actually
> to be a very bad change - perhaps tried and abandoned before the "atomic"
> functions were decided on. I wouldn't even dare mention it, unless that
> direction might happen to fit in with something else you're planning.)

Dangling pages will be hard to make right, imho. E.g., please consider one
thread faulted in with uffd-missing, we'll request UFFDIO_COPY and deliver the
page to the faulted thread, then a 2nd thread faulted on the same address
before the 1st faulted thread resolved the page fault in the pgtables. It'll
start to be challenging from that point, imho. From that POV I think current
uffdio-copy is doing great on using pgtable and the pgtable lock to keep all
these things very well serialized. We may have false positives as the 2nd
faulted thread will generate a dup message, but that can be easily resolved by
the user app.

Thanks,
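The serialization described above can be sketched in userspace C with a mutex standing in for the page table lock. The names `struct toy_pte_slot` and `toy_install_page()` are hypothetical, and returning -EEXIST for the losing installer is an assumption about how a duplicate resolution would be reported; the sketch only illustrates "check that the slot is empty and install, both under one lock".

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* One page table slot; a NULL page plays the role of pte_none(). */
struct toy_pte_slot {
	pthread_mutex_t lock; /* stands in for the page table lock */
	const void *page;     /* currently installed page, if any */
};

/*
 * Install a page only if the slot is still empty. The check and the
 * store happen under the same lock, so two racing resolvers cannot
 * both succeed: the second one sees the slot populated and backs off.
 */
static int toy_install_page(struct toy_pte_slot *slot, const void *page)
{
	int ret = 0;

	pthread_mutex_lock(&slot->lock);
	if (slot->page != NULL)
		ret = -EEXIST; /* assumed error code for "already resolved" */
	else
		slot->page = page;
	pthread_mutex_unlock(&slot->lock);
	return ret;
}
```

A handler that receives a duplicate fault message for an already-resolved address would, in this model, simply treat the -EEXIST result as success, which matches the "easily resolved by the user app" remark.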
On Tue, 28 Sep 2021 17:37:31 -0400 Peter Xu <peterx@redhat.com> wrote:

> > How about a compromise (if you really want to continue with this patch):
> > you leave the SetPageDirty(page) in shmem_mfill_atomic_pte(), where I
> > feel a responsibility for it; but you do whatever works for you with
> > pte_mkdirty() at the mm/userfaultfd.c end?
>
> Sure. Duplicating dirty bit is definitely fine to me as it achieves the same
> goal as I hoped - we're still 100% clear we won't free a uffd page without
> being noticed, then that's enough to me for the goal of this patch. I won't
> initiate that NACK myself since I still think duplicating is unnecessary no
> matter it resides in shmem or uffd code, but please go ahead doing that and
> I'll be fine with it, just in case Andrew didn't follow the details.

I think Hugh was asking you to implement this...

I guess I'll send this patch upstream. But it does sound like Hugh
would prefer a followon patch for this kernel release which makes the
above change, please.
On Thu, Nov 04, 2021 at 02:34:40PM -0700, Andrew Morton wrote:
> On Tue, 28 Sep 2021 17:37:31 -0400 Peter Xu <peterx@redhat.com> wrote:
>
> > > How about a compromise (if you really want to continue with this patch):
> > > you leave the SetPageDirty(page) in shmem_mfill_atomic_pte(), where I
> > > feel a responsibility for it; but you do whatever works for you with
> > > pte_mkdirty() at the mm/userfaultfd.c end?
> >
> > Sure. Duplicating dirty bit is definitely fine to me as it achieves the same
> > goal as I hoped - we're still 100% clear we won't free a uffd page without
> > being noticed, then that's enough to me for the goal of this patch. I won't
> > initiate that NACK myself since I still think duplicating is unnecessary no
> > matter it resides in shmem or uffd code, but please go ahead doing that and
> > I'll be fine with it, just in case Andrew didn't follow the details.
>
> I think Hugh was asking you to implement this...
>
> I guess I'll send this patch upstream. But it does sound like Hugh
> would prefer a followon patch for this kernel release which makes the
> above change, please.

Thanks Andrew for helping.

But as I mentioned I still think that's odd to set dirty in both places.
That's why I don't want to draft the patch because I am not very willing to
sign-off..

If Hugh agrees, I can post the patch with Hugh's sign-off, adding the PageDirty
back too. I am on holiday so I cannot follow up the whole thing today, but if
it's easier for you to drop that patch or even drop the whole series, please
feel free to do. I can rework everything too, then I'll try to get Hugh's ack
again on every single patch, as long as Hugh will have time to look at it in
the future.

Thanks,
diff --git a/mm/shmem.c b/mm/shmem.c
index 88742953532c..96ccf6e941aa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2424,7 +2424,6 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 
-	SetPageDirty(page);
 	unlock_page(page);
 	return 0;
 out_delete_from_cache:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7a9008415534..caf6dfff2a60 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
-	if (writable || !page_in_cache)
-		_dst_pte = pte_mkdirty(_dst_pte);
 	if (writable) {
 		if (wp_copy)
 			_dst_pte = pte_mkuffd_wp(_dst_pte);