Message ID | 20211217113049.23850-7-david@redhat.com (mailing list archive)
---|---
State | New
Series | mm: COW fixes part 1: fix the COW security issue for THP and hugetlb
On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>
> + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
> + * set) and then unmaps the target page, we have:
> + *
> + * * page has mapcount == 1 and refcount > 1

All these games with mapcount make me think this is still broken.

mapcount has been a horribly broken thing in the past, and I'm not convinced it's not a broken thing now.

> + vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
> + if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
> +     page_mapcount(vmf->page) > 1) {

What keeps the mapcount stable in here?

And I still believe that the whole notion that "COW should use mapcount" is pure and utter garbage.

If we are doing a COW, we need an *exclusive* access to the page. That is not mapcount, that is the page ref.

mapcount is insane, and I think this is making this worse again.

Linus
On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> If we are doing a COW, we need an *exclusive* access to the page. That
> is not mapcount, that is the page ref.
>
> mapcount is insane, and I think this is making this worse again.

Maybe I'm misreading this, but afaik:

 - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.

   This just increments the page count, because mapcount == 1.

 - fork()

 - unmap in the original

 - child now has "mapcount == 1" on a page again, but refcount is elevated, and child HAS TO COW before writing.

Notice? "mapcount" is complete BS. The number of times a page is mapped is irrelevant for COW. All that matters is that we get an exclusive access to the page before we can write to it.

Anybody who takes mapcount into account at COW time is broken, and it worries me how this is all mixing up with the COW logic.

Now, maybe this "unshare" case is sufficiently different from COW that it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it doesn't happen for a real COW.

But honestly, for "unshare", I still don't see that the mapcount matters. What does "mapcount == 1" mean? Why is it meaningful?

Because if COW does things right, and always breaks a COW based on refcount, then what's the problem with taking a read-only ref to the page whether it is mapped multiple times or mapped just once? Anybody who already had write access to the page can write to it regardless, and any new writers go through COW and get a new page.

I must be missing something really fundamental here, but to me it really reads like "mapcount can fundamentally never be relevant for COW, and if it's not relevant for COW, how can it be relevant for a read-only copy?"

Linus
On 17.12.21 20:22, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> If we are doing a COW, we need an *exclusive* access to the page. That
>> is not mapcount, that is the page ref.
>>
>> mapcount is insane, and I think this is making this worse again.
>
> Maybe I'm misreading this, but afaik:
>
> - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
>
>   This just increments the page count, because mapcount == 1.
>
> - fork()
>
> - unmap in the original
>
> - child now has "mapcount == 1" on a page again, but refcount is
>   elevated, and child HAS TO COW before writing.

Hi Linus,

This is just GUP before fork(), which is in general problematic/incompatible with sharing. What we're concerned about in the context of this series (see the security issue) is GUP after fork(). And we're not changing GUP before fork() or even the COW logic in the context of this series.

I agree that GUP before fork() has to be handled differently: during fork(), don't share something that cannot possibly be shared in a safe way. Don't allow COW semantics for something that is just broken with COW.

> Notice? "mapcount" is complete BS. The number of times a page is
> mapped is irrelevant for COW. All that matters is that we get an
> exclusive access to the page before we can write to it.

We have to be very careful about the two sides of the story: GUP before fork and GUP after fork.

> Anybody who takes mapcount into account at COW time is broken, and it
> worries me how this is all mixing up with the COW logic.
>
> Now, maybe this "unshare" case is sufficiently different from COW that
> it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it
> doesn't happen for a real COW.
>
> But honestly, for "unshare", I still don't see that the mapcount
> matters. What does "mapcount == 1" mean? Why is it meaningful?

I'll reply to your first mail in a sec. GUP is the problem with COW, not ordinary processes mapping a page (mapcount), where you will only get new sharers during fork() -- in a very controlled way.

So GUP has to take care to unshare *before* taking a reference, such that we can never reach the point of missed COW. GUP really is the problematic bit with it all. Without GUP, we'd be living in a wonderful world in regards to COW.

> Because if COW does things right, and always breaks a COW based on
> refcount, then what's the problem with taking a read-only ref to the
> page whether it is mapped multiple times or mapped just once? Anybody
> who already had write access to the page can write to it regardless,
> and any new writers go through COW and get a new page.

Let's just take a look at what refcount does *wrong*. Let's use an adjusted version of your example above, because it's a perfect fit:

1. mem = mmap(pagesize, MAP_PRIVATE)
   -> refcount == 1

2. memset(mem, 0, pagesize); /* Page is mapped R/W */

3. fork() /* Page gets mapped R/O */
   -> refcount > 1

4. child quits
   -> refcount == 1

5. Take a R/O pin (RDMA, VFIO, ...)
   -> refcount > 1

6. memset(mem, 0xff, pagesize);
   -> Write fault -> COW

And GUP sees something different than our MM -- and this is perfectly valid, the R/O pin is just reading page content we might be modifying afterwards.

Take out 3. and 4. and it works as expected. This wouldn't happen when relying on the mapcount.

And 3+4 can really be anything that results in a R/O mapping of an anonymous page, even if it's just swapout followed by a read fault that maps the page R/O.
>
> I must be missing something really fundamental here, but to me it
> really reads like "mapcount can fundamentally never be relevant for
> COW, and if it's not relevant for COW, how can it be relevant for a
> read-only copy?"

It really is the right value to use. Only GUP is the problematic bit that has to trigger unsharing to not mess up COW logic later.

Take GUP out of the equation and COW just works as expected with the mapcount -- as long as we can read an atomic value and synchronize against fork.

(again, still composing the other mail :) )
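[Editor's note: for readers who want to see the six steps David lists above concretely, here is a minimal userspace sketch of the scenario. vmsplice() is used here as a readily available way to take a R/O GUP reference from userspace; the RDMA/VFIO pin in the example would play the same role. Error handling is omitted.]

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            size_t pagesize = getpagesize();
            struct iovec iov = { .iov_len = pagesize };
            int fds[2];
            char *mem;

            /* 1. MAP_PRIVATE anonymous page -> refcount == 1 once faulted in. */
            mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            /* 2. Fault the page in, mapped R/W. */
            memset(mem, 0, pagesize);
            /* 3. + 4. fork() maps the page R/O in both processes; child quits. */
            if (fork() == 0)
                    _exit(0);
            wait(NULL);
            /* 5. Take a R/O reference: vmsplice() GUPs the page into the pipe. */
            pipe(fds);
            iov.iov_base = mem;
            vmsplice(fds[1], &iov, 1, 0);
            /* 6. Write fault: a page_count() > 1 check triggers COW here, so
             *    the pipe keeps the old page while the mapping gets a copy --
             *    the inconsistency David is pointing at. */
            memset(mem, 0xff, pagesize);
            return 0;
    }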
On Fri, Dec 17, 2021 at 12:18 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 17.12.21 20:22, Linus Torvalds wrote:
> > On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> > - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
> >
> >   This just increments the page count, because mapcount == 1.
> >
> > - fork()
> >
> > - unmap in the original
> >
> > - child now has "mapcount == 1" on a page again, but refcount is
> >   elevated, and child HAS TO COW before writing.
>
> Hi Linus,
>
> This is just GUP before fork(), which is in general
> problematic/incompatible with sharing.

Note that my example was not meant to be an example of a problem per se, but purely as an example of how meaningless 'mapcount' is, and how 'mapcount==1' isn't really a very meaningful test.

So it wasn't meant to show "look, GUP before fork is problematic". We have that problem already solved at least for regular pages.

It was purely meant to show how "mapcount==1" isn't a meaningful thing to test, and my worry about how you're adding that nonsensical test to the new code.

> Let's just take a look at what refcount does *wrong*. Let's use an
> adjusted version of your example above, because it's a perfect fit:
>
> 1. mem = mmap(pagesize, MAP_PRIVATE)
>    -> refcount == 1
>
> 2. memset(mem, 0, pagesize); /* Page is mapped R/W */
>
> 3. fork() /* Page gets mapped R/O */
>    -> refcount > 1
>
> 4. child quits
>    -> refcount == 1
>
> 5. Take a R/O pin (RDMA, VFIO, ...)
>    -> refcount > 1
>
> 6. memset(mem, 0xff, pagesize);
>    -> Write fault -> COW

I do not believe this is actually a bug.

You asked for a R/O pin, and you got one.

Then somebody else modified that page, and you got exactly what you asked for - a COW event. The original R/O pin has the original page that it asked for, and can read it just fine.

So what is your argument?

Linus
On Fri, Dec 17, 2021 at 12:36 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > 5. Take a R/O pin (RDMA, VFIO, ...)
> >    -> refcount > 1
> >
> > 6. memset(mem, 0xff, pagesize);
> >    -> Write fault -> COW
>
> I do not believe this is actually a bug.
>
> You asked for a R/O pin, and you got one.

If you want a shared pin that actually follows the changes of your process around, then that is what you should have asked for.

At the time of such a shared pin, you can do what we already do: re-use the page if it has a refcount of 1. Or do an early COW event (feel free to avoid the "mark it writable and dirty").

But note: *refcount* of 1. Not "mapcount". Because mapcount would be broken garbage.

Linus
On 17.12.21 21:36, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:18 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 17.12.21 20:22, Linus Torvalds wrote:
>>> On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>> - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
>>>
>>>   This just increments the page count, because mapcount == 1.
>>>
>>> - fork()
>>>
>>> - unmap in the original
>>>
>>> - child now has "mapcount == 1" on a page again, but refcount is
>>>   elevated, and child HAS TO COW before writing.
>>
>> Hi Linus,
>>
>> This is just GUP before fork(), which is in general
>> problematic/incompatible with sharing.
>
> Note that my example was not meant to be an example of a problem per
> se, but purely as an example of how meaningless 'mapcount' is, and how
> 'mapcount==1' isn't really a very meaningful test.
>
> So it wasn't meant to show "look, GUP before fork is problematic". We
> have that problem already solved at least for regular pages.
>
> It was purely meant to show how "mapcount==1" isn't a meaningful thing
> to test, and my worry about how you're adding that nonsensical test to
> the new code.
>
>> Let's just take a look at what refcount does *wrong*. Let's use an
>> adjusted version of your example above, because it's a perfect fit:
>>
>> 1. mem = mmap(pagesize, MAP_PRIVATE)
>>    -> refcount == 1
>>
>> 2. memset(mem, 0, pagesize); /* Page is mapped R/W */
>>
>> 3. fork() /* Page gets mapped R/O */
>>    -> refcount > 1
>>
>> 4. child quits
>>    -> refcount == 1
>>
>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>    -> refcount > 1
>>
>> 6. memset(mem, 0xff, pagesize);
>>    -> Write fault -> COW
>
> I do not believe this is actually a bug.

It's debatable if it's a BUG or not (I think it is one). It's for sure inconsistent.

> You asked for a R/O pin, and you got one.
>
> Then somebody else modified that page, and you got exactly what you
> asked for - a COW event. The original R/O pin has the original page
> that it asked for, and can read it just fine.

Where in the code did I ask for a COW event? I asked for a R/O pin, not any kind of memory protection.
On Fri, Dec 17, 2021 at 12:39 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> At the time of such a shared pin, you can do what we already do:
> re-use the page if it has a refcount of 1. Or do an early COW event
> (feel free to avoid the "mark it writable and dirty").

Note that this also depends on fork() doing the right thing, marking things for "a fork() can not share this page any more". Which it does for regular pages, and is exactly what that page_needs_cow_for_dma() logic is all about (and the special write_protect_seq around gup/fork).

I do believe that huge-pages don't do it right. But I think that as you try to fix hugepages, you are now breaking the normal case.

If all your logic was only about hugepages, I wouldn't care so much. But you are playing questionable games with code that I think is correct. Please explain why.

Linus
On 17.12.21 20:04, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
>> + * set) and then unmaps the target page, we have:
>> + *
>> + * * page has mapcount == 1 and refcount > 1

Hi Linus,

> All these games with mapcount make me think this is still broken.
>
> mapcount has been a horribly broken thing in the past, and I'm not
> convinced it's not a broken thing now.

It all started when Jann detected the security issue in GUP – and this patch set is fixing the problem exactly there, in GUP itself.

Are you aware of COW issues regarding the mapcount if we would remove GUP from the equation? My point is that COW without GUP works just perfectly fine, but I’ll be happy to learn about other cases I was ignoring so far.

Unfortunately, page_count() is even more unreliable, and the issues we're just detecting (see the link in the cover letter: memory corruptions inside user space -- e.g., lost DMA writes) are even worse than what we had before -- IMHO of course.

>> + vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
>> + if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
>> +     page_mapcount(vmf->page) > 1) {
>
> What keeps the mapcount stable in here?

So, we're reading an atomic value here. It’s read via atomic_read for regular pages, and the THP mapcount case has also been made atomic (as lockless as page_count) in patch #5.

If a page is mapped exactly once, page_mapcount(page) == 1 and there is nothing to do. If the page is mapped more than once, page_mapcount(page) > 1 and we would have to trigger unsharing. And it’s true that the value is unstable in this case, but we really only care about page_mapcount(page) > 1 vs. page_mapcount(page) == 1. In this respect, there is no difference from the instability of the page_count and the mapcount – we still only care if it’s >1 or == 1.

So the only case we could care about is concurrent additional mappings that can increment the mapcount -- which can only happen due to concurrent fork. So if we're reading page_mapcount(page) == 1, the only way we can get page_mapcount(page) > 1 is due to fork(). But we're holding the mmap_lock in read mode during faults, and fork requires the mmap_lock in write mode.

> And I still believe that the whole notion that "COW should use
> mapcount" is pure and utter garbage.
>
> If we are doing a COW, we need an *exclusive* access to the page. That
> is not mapcount, that is the page ref.

I thought about this a lot, because initially I had the same opinion. But really, we don't care about any speculative references (pagecache, migration, daemon, pagevec, ...) or any short-term "I just want to grab this reference real quick such that the page can't get freed" references. All we care about are GUP references, and we attack that problem at the root by triggering unsharing exactly at the point where GUP comes into play.

So IMHO GUP is the problem and needs unsharing either:

* On write access to a shared anonymous page, which is just COW as we know it.

* On read access to a shared anonymous page, which is what we’re proposing in this patch set.

So as soon as GUP comes into play, even if only pinning R/O, we have to trigger unsharing. Doing so enforces the invariant that it is impossible to take a GUP pin on an anonymous page with a mapcount > 1.
In turn, the COW logic does not need to worry about the GUP-after-fork() security issue anymore, and it can focus on handling COW faults optimally, as if GUP just didn't exist.
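[Editor's note: pieced together from the hunks quoted in this thread, the unshare trigger David describes looks roughly like the sketch below. FAULT_FLAG_UNSHARE comes from the series under discussion; the surrounding fault-handler plumbing (locking, return values) is simplified here.]

    /* Sketch of the series' unshare logic in the write/unshare fault path: */
    if (vmf->flags & FAULT_FLAG_UNSHARE) {
            vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
            if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
                page_mapcount(vmf->page) > 1) {
                    /*
                     * Shared anonymous page: replace it by an exclusive copy
                     * before GUP takes its reference, so that a pin can never
                     * end up on a page with mapcount > 1.
                     */
                    return wp_page_copy(vmf);
            }
            /* mapcount == 1: the page is exclusive to us, nothing to unshare. */
            return 0;
    }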
On Fri, Dec 17, 2021 at 12:42 PM David Hildenbrand <david@redhat.com> wrote:
>
> > Then somebody else modified that page, and you got exactly what you
> > asked for - a COW event. The original R/O pin has the original page
> > that it asked for, and can read it just fine.
>
> Where in the code did I ask for a COW event? I asked for a R/O pin, not
> any kind of memory protection.

Why didn't you ask for a shared pin, if that is what you want?

We already support that.

If you don't like the read-only pins, don't use them. It's that simple.

Linus
On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
> > 5. Take a R/O pin (RDMA, VFIO, ...)
> >    -> refcount > 1
> >
> > 6. memset(mem, 0xff, pagesize);
> >    -> Write fault -> COW
>
> I do not believe this is actually a bug.
>
> You asked for a R/O pin, and you got one.
>
> Then somebody else modified that page, and you got exactly what you
> asked for - a COW event. The original R/O pin has the original page
> that it asked for, and can read it just fine.

To remind all, the GUP users, like RDMA, VFIO use FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the COW breaking the coherence. In these cases 'r/o pin' does not mean "snapshot the data", but it's only a promise not to write to the pages and still desires coherence with the memory map.

Eg in RDMA we know of apps asking for a R/O pin of something in .bss, then filling that something with data, and finally doing the actual DMA. Breaking COW after pin breaks those apps.

The above #5 can occur for O_DIRECT read, and in that case the 'snapshot the data' is perfectly fine, as racing the COW with the O_DIRECT read just resolves the race toward the read() direction.

IIRC there is some other scenario that motivated this patch?

Jason
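[Editor's note: the 'coherent r/o pin' idiom Jason describes boils down to the flag combination below; the call site itself is schematic.]

    /*
     * Long-term pin that stays coherent with the memory map: force the
     * COW break up front (FOLL_WRITE|FOLL_FORCE), then only ever read
     * through the pin.
     */
    ret = pin_user_pages_fast(start, npages,
                              FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM, pages);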
On Fri, Dec 17, 2021 at 12:45 PM David Hildenbrand <david@redhat.com> wrote:
>
> If a page is mapped exactly once, page_mapcount(page) == 1 and there is
> nothing to do.

Why?

You state that, but you stating that doesn't magically make it so.

What makes "mapcount==1" stable and special? Your "it's an atomic_read()" argument is nonsense - it implies that the count can be changing, but you will get _one_ answer. What makes that one answer of a changing count special?

What if there are other references to that same page, gotten with vmsplice(), and just about to be mapped into another address space?

This is the meat of my argument. You claim that "mapcount==1" is special. I claim that you haven't explained why it would be. And I do not believe it is.

Linus
On 17.12.21 21:51, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:45 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> If a page is mapped exactly once, page_mapcount(page) == 1 and there is
>> nothing to do.
>
> Why?
>
> You state that, but you stating that doesn't magically make it so.
>
> What makes "mapcount==1" stable and special? Your "it's an
> atomic_read()" argument is nonsense - it implies that the count can be
> changing, but you will get _one_ answer.

And I explained how it cannot increment. And the only way is via fork(), which cannot run concurrently.

> What makes that one answer of a changing count special?
>
> What if there are other references to that same page, gotten with
> vmsplice(), and just about to be mapped into another address space?

If we have a shared anonymous page we cannot have GUP references, not even R/O ones. Because GUP would have unshared and copied the page, resulting in a R/O mapped anonymous page.

What am I missing?
On Fri, Dec 17, 2021 at 12:47 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> To remind all, the GUP users, like RDMA, VFIO use
> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
> COW breaking the coherence. In these cases 'r/o pin' does not mean
> "snapshot the data", but it's only a promise not to write to the pages
> and still desires coherence with the memory map.
>
> Eg in RDMA we know of apps asking for a R/O pin of something in .bss,
> then filling that something with data, and finally doing the actual
> DMA. Breaking COW after pin breaks those apps.

I agree.

And my argument is that those kinds of things that ask for a R/O pin are broken, and should just make sure to use the shared pins.

If the page was already writable, you can just re-use the page directly (marking it pinned, so that any subsequent fork() does the right pre-cow thing).

And if the page was *NOT* already writable, you do a COW - which might be sharing the page directly too, if it has no other references.

What's the downside of just doing this properly?

Again: if a DMA user wants coherent memory, then it should use the coherent pinning. Not some kind of "read-only sharing that looks at crazy mapcounts that have absolutely zero relevance to whether the page is coherent or not".

Linus
On 17.12.21 21:47, Jason Gunthorpe wrote:
> On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
>
>>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>>    -> refcount > 1
>>>
>>> 6. memset(mem, 0xff, pagesize);
>>>    -> Write fault -> COW
>>
>> I do not believe this is actually a bug.
>>
>> You asked for a R/O pin, and you got one.
>>
>> Then somebody else modified that page, and you got exactly what you
>> asked for - a COW event. The original R/O pin has the original page
>> that it asked for, and can read it just fine.

Hi Jason,

> To remind all, the GUP users, like RDMA, VFIO use
> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the

I heard that statement often. Can you point me at the code?

VFIO: drivers/vfio/vfio_iommu_type1.c

vaddr_get_pfns() will end up doing a pin_user_pages_remote(FOLL_LONGTERM) without FOLL_FORCE|FOLL_WRITE. Is that added automatically internally?

Note the comment in the next patch:

+ *
+ * TODO: although the security issue described does no longer apply in any case,
+ * the full consistency between the pinned pages and the pages mapped into the
+ * page tables of the MM only apply to short-term pinnings only. For
+ * FOLL_LONGTERM, FOLL_WRITE|FOLL_FORCE is required for now, which can be
+ * inefficient and still result in some consistency issues. Extend this
+ * mechanism to also provide full synchronicity to FOLL_LONGTERM, avoiding
+ * FOLL_WRITE|FOLL_FORCE.

> Eg in RDMA we know of apps asking for a R/O pin of something in .bss,
> then filling that something with data, and finally doing the actual
> DMA. Breaking COW after pin breaks those apps.
>
> The above #5 can occur for O_DIRECT read, and in that case the
> 'snapshot the data' is perfectly fine, as racing the COW with the
> O_DIRECT read just resolves the race toward the read() direction.
>
> IIRC there is some other scenario that motivated this patch?

1. I want to fix the COW security issue as documented. Reproducers in patch #11.

2. I want to fix all of the other issues as documented and linked in the cover letter that result from the imprecise page_count check in COW code. Especially the ones where we have memory corruptions, because this is just not acceptable. There are reproducers as well for everybody who doesn't believe me.

But this series really just wants to fix the security issue as "part 1". Without any more breakages.

I'm sorry, but it's all described in the cover letter. Maybe it was TL;DR.
> On Dec 17, 2021, at 12:47 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
>
>>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>>    -> refcount > 1
>>>
>>> 6. memset(mem, 0xff, pagesize);
>>>    -> Write fault -> COW
>>
>> I do not believe this is actually a bug.
>>
>> You asked for a R/O pin, and you got one.
>>
>> Then somebody else modified that page, and you got exactly what you
>> asked for - a COW event. The original R/O pin has the original page
>> that it asked for, and can read it just fine.
>
> To remind all, the GUP users, like RDMA, VFIO use
> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
> COW breaking the coherence. In these cases 'r/o pin' does not mean
> "snapshot the data", but it's only a promise not to write to the pages
> and still desires coherence with the memory map.
>
> Eg in RDMA we know of apps asking for a R/O pin of something in .bss,
> then filling that something with data, and finally doing the actual
> DMA. Breaking COW after pin breaks those apps.
>
> The above #5 can occur for O_DIRECT read, and in that case the
> 'snapshot the data' is perfectly fine, as racing the COW with the
> O_DIRECT read just resolves the race toward the read() direction.
>
> IIRC there is some other scenario that motivated this patch?

I think that there is an assumption that once a page is COW-broken, it would never have another write-fault that might lead to COW breaking later.

AFAIK at least after userfaultfd-WP followed by userfaultfd-write-unprotect a page might be write-protected and go through do_wp_page() a second time to be COW-broken again. In such a case, I think the FOLL_FORCE|FOLL_WRITE would not help.

I suspect (not sure) that this might even happen with mprotect() since I do not see all code-paths preserving whether the page was writable.
On 17.12.21 21:56, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:47 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>> To remind all, the GUP users, like RDMA, VFIO use
>> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
>> COW breaking the coherence. In these cases 'r/o pin' does not mean
>> "snapshot the data", but it's only a promise not to write to the pages
>> and still desires coherence with the memory map.
>>
>> Eg in RDMA we know of apps asking for a R/O pin of something in .bss,
>> then filling that something with data, and finally doing the actual
>> DMA. Breaking COW after pin breaks those apps.
>
> I agree.

I agree that breaking COW after a pin should never be done. Therefore, break the COW before the pin -> unsharing as implemented here.

> And my argument is that those kinds of things that ask for a R/O pin
> are broken, and should just make sure to use the shared pins.

And trigger a write fault although they are just reading. To me this is just a band aid instead of eventually ...

...

> What's the downside of just doing this properly?

Doing it properly by fixing GUP and not the COW logic. GUP and COW are just incompatible and we should unshare early.

Honestly, the memory corruptions we can trigger in user space due to the current COW logic *especially* with FOLL_WRITE users such as O_DIRECT, io_uring or vfio are not a joke anymore. (again, link in the cover letter)
On 17.12.21 22:15, Nadav Amit wrote:
>> On Dec 17, 2021, at 12:47 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>> On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
>>
>>>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>>>    -> refcount > 1
>>>>
>>>> 6. memset(mem, 0xff, pagesize);
>>>>    -> Write fault -> COW
>>>
>>> I do not believe this is actually a bug.
>>>
>>> You asked for a R/O pin, and you got one.
>>>
>>> Then somebody else modified that page, and you got exactly what you
>>> asked for - a COW event. The original R/O pin has the original page
>>> that it asked for, and can read it just fine.
>>
>> To remind all, the GUP users, like RDMA, VFIO use
>> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
>> COW breaking the coherence. In these cases 'r/o pin' does not mean
>> "snapshot the data", but it's only a promise not to write to the pages
>> and still desires coherence with the memory map.
>>
>> Eg in RDMA we know of apps asking for a R/O pin of something in .bss,
>> then filling that something with data, and finally doing the actual
>> DMA. Breaking COW after pin breaks those apps.
>>
>> The above #5 can occur for O_DIRECT read, and in that case the
>> 'snapshot the data' is perfectly fine, as racing the COW with the
>> O_DIRECT read just resolves the race toward the read() direction.
>>
>> IIRC there is some other scenario that motivated this patch?
>
> I think that there is an assumption that once a page is COW-broken,
> it would never have another write-fault that might lead to COW
> breaking later.
>
> AFAIK at least after userfaultfd-WP followed by
> userfaultfd-write-unprotect a page might be write-protected and
> go through do_wp_page() a second time to be COW-broken again. In
> such a case, I think the FOLL_FORCE|FOLL_WRITE would not help.
>
> I suspect (not sure) that this might even happen with mprotect()
> since I do not see all code-paths preserving whether the page
> was writable.

uffd-wp and mprotect() are broken as well, yes. But the easiest example is just swap + read fault. Sections 2 and 3 in [1], along with reproducers.

Note that I didn't mention uffd-wp and mprotect(), because these require "manual intervention". With swap, it's not your application doing something "special".

[1] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
On Fri, Dec 17, 2021 at 12:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> If we have a shared anonymous page we cannot have GUP references, not
> even R/O ones. Because GUP would have unshared and copied the page,
> resulting in a R/O mapped anonymous page.

Doing a GUP on an actual shared page is wrong to begin with.

You even know that, you try to use "page_mapcount() > 1" to disallow it.

My point is that it's wrong regardless, and that "mapcount" is dubious, and that COW cannot - and must not - use mapcount, and that I think your shared case should strive to avoid it for the exact same reason.

So, what I think should happen is:

 (a) GUP makes sure that it only ever looks up pages that can be shared with this VM. This may involve breaking COW early with any past fork().

 (b) it marks such pages so that any future work will not cause them to COW either

Note that (a) is not necessarily "always COW and have to allocate and copy new page". In particular, if the page is already writable, you know you already have exclusive access to it and don't need to COW.

And if it isn't writable, then the other common case is "the cow has only one user, and it's us" - that's the "refcount == 1" case.

And (b) is what we do with that page_maybe_dma_pinned() logic for fork(), but also for things like swap cache creation (eg see commit feb889fb40fa: "mm: don't put pinned pages into the swap cache").

Note that this code all already exists, and already works - even without getting the (very expensive) mmap_sem. So it works with fast-GUP and it can race with concurrent forking by another thread, which is why we also have that seqcount thing.

As far as I can tell, your "mapcount" logic fundamentally requires mmap_sem for the fork() race avoidance, for example.

So this is why I don't like the mapcount games - I think they are very fragile, and not at all as logical as the two simple rules a/b above.

I believe you can make mapcount games _work_ - we used to have something like that. It was incredibly fragile, and it had its own set of bugs, but with enough care it's doable. But my argument really is that I think it's the wrong approach, and that we should simply strive to follow the two simple conceptual rules above.

Linus
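[Editor's note: rules (a) and (b) correspond to what do_wp_page() already does for regular anonymous pages; schematically, with the trylock/swap-cache details left out:]

    /* Schematic refcount-based COW decision for an anonymous page: */
    if (PageAnon(page) && page_count(page) == 1) {
            /* Exclusive reference: safe to reuse the page in place. */
            return wp_page_reuse(vmf);      /* returns VM_FAULT_WRITE */
    }
    /*
     * Any other reference -- a GUP pin, the swap cache, a speculative
     * lookup -- forces a copy, which is always safe.
     */
    return wp_page_copy(vmf);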
On 17.12.21 22:36, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:55 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> If we have a shared anonymous page we cannot have GUP references, not
>> even R/O ones. Because GUP would have unshared and copied the page,
>> resulting in a R/O mapped anonymous page.
>
> Doing a GUP on an actual shared page is wrong to begin with.
>
> You even know that, you try to use "page_mapcount() > 1" to disallow it.

GUP is incompatible with shared anonymous pages, therefore it has to trigger unsharing, correct.

> My point is that it's wrong regardless, and that "mapcount" is
> dubious, and that COW cannot - and must not - use mapcount, and that I
> think your shared case should strive to avoid it for the exact same
> reason.

For now I have not heard a compelling argument why the mapcount is dubious, I repeat:

* mapcount can only increase due to fork()
* mapcount can decrease due to unmap / zap

We can protect from the transition == 1 -> >1 using the mmap_lock.

For COW the mapcount is the only thing that matters *if we take GUP* out of the equation. And that's exactly what we do with unsharing.

OTOH, take a look at which issues resulted from the page_count changes. That's what I call dubious, sorry to say.

> So, what I think should happen is:
>
> (a) GUP makes sure that it only ever looks up pages that can be
> shared with this VM. This may involve breaking COW early with any
> past fork().

Is that unsharing as we propose it?

> (b) it marks such pages so that any future work will not cause them
> to COW either

Right, exactly. GUP before fork does not result in a page getting shared again.

> Note that (a) is not necessarily "always COW and have to allocate and
> copy new page". In particular, if the page is already writable, you
> know you already have exclusive access to it and don't need to COW.
>
> And if it isn't writable, then the other common case is "the cow has
> only one user, and it's us" - that's the "refcount == 1" case.
>
> And (b) is what we do with that page_maybe_dma_pinned() logic for
> fork(), but also for things like swap cache creation (eg see commit
> feb889fb40fa: "mm: don't put pinned pages into the swap cache").

I fully agree with b). GUP before fork is a totally different set of problems than GUP after fork.

> Note that this code all already exists, and already works - even
> without getting the (very expensive) mmap_sem. So it works with
> fast-GUP and it can race with concurrent forking by another thread,
> which is why we also have that seqcount thing.

I know, I studied it intensively :)

> As far as I can tell, your "mapcount" logic fundamentally requires
> mmap_sem for the fork() race avoidance, for example.

Yes. Or any other more lightweight synchronization in the future. For now this is just perfect.

> So this is why I don't like the mapcount games - I think they are very
> fragile, and not at all as logical as the two simple rules a/b above.

I don't really see anything fragile, really. I'm happy to learn as always.

> I believe you can make mapcount games _work_ - we used to have
> something like that. It was incredibly fragile, and it had its own set
> of bugs, but with enough care it's doable.

We made it work, and it was comparatively simple.
On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <david@redhat.com> wrote:
>
> For now I have not heard a compelling argument why the mapcount is
> dubious, I repeat:
>
> * mapcount can only increase due to fork()
> * mapcount can decrease due to unmap / zap
>
> We can protect from the transition == 1 -> >1 using the mmap_lock.
>
> For COW the mapcount is the only thing that matters *if we take GUP* out
> of the equation. And that's exactly what we do with unsharing.

What do you have against just doing what we already do in other parts, that a/b thing?

Which avoids the whole mmap_sem issue. That was a big issue for the rdma people, afaik.

Linus
On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <david@redhat.com> wrote:
>
> For now I have not heard a compelling argument why the mapcount is
> dubious, I repeat:
>
> * mapcount can only increase due to fork()
> * mapcount can decrease due to unmap / zap

And to answer the "why is this dubious", let's just look at your actual code that I reacted to:

+ vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
+ if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
+     page_mapcount(vmf->page) > 1) {

Note how you don't just check page_mapcount(). Why not? Because mapcount is completely immaterial if it's not a PageAnon page, so you test for that.

So even when you do the mapcount read as one atomic thing, it's one atomic thing that depends on _other_ things, and all these checks are not atomic.

But a PageAnon() page can actually become a swap-backed page, and as far as I can tell, your code doesn't have any locking to protect against that.

So now you need not only the mmap_sem (to protect against fork), you also need the page lock (to protect against rmap changing the type of page).

I don't see you taking the page lock anywhere. Maybe the page table lock ends up serializing sufficiently with the rmap code that it ends up working.

In the do_wp_page() path, we currently do those kinds of racy checks too, but then we do a trylock_page, and re-do them. And at any time there is any question about things, we fall back to copying - because a copy is always safe.

Well, it's always safe if we have the rule that "once we've pinned things, we don't cause them to be COW again".

But that "it's safe if" was exactly my (b) case.

That's why I much prefer the model I'm trying to push - it's conceptually quite simple. I can literally explain mine at a conceptual level with that "break pre-existing COW, make sure no future COW" model.

In contrast, I look at your page_mapcount() code, and I go "there are no conceptual rules here, and the actual implementation details look dodgy".

I personally like having clear conceptual rules - as opposed to random implementation details.

Linus
On 17.12.21 22:50, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> For now I have not heard a compelling argument why the mapcount is
>> dubious, I repeat:
>>
>> * mapcount can only increase due to fork()
>> * mapcount can decrease due to unmap / zap
>>
>> We can protect from the transition == 1 -> >1 using the mmap_lock.
>>
>> For COW the mapcount is the only thing that matters *if we take GUP* out
>> of the equation. And that's exactly what we do with unsharing.
>
> What do you have against just doing what we already do in other parts,
> that a/b thing?

Let me put it that way: I just want to get all of this fixed for good. I don't particularly care how. *But* I will fight for something that is superior, logically makes sense (at least to me :) ) and is not super complicated. And I call also avoiding unnecessary COW "superior".

I do know that what this series proposes fixes the CVE: GUP after fork. I do know that the part 2 we'll be sending out next year will fix everything else we discovered so far, and it will rely on this as a basis, to not reintroduce any other COW issues we've seen so far.

If someone can propose something comparable that makes all discovered problems go away I'll be *extremely* happy. We have reproducers for all issues, so it's easy to verify, and I'm planning on extending the selftests to cover even more corner cases.

So far, I am not convinced that using the mapcount is dubious or problematic, I just don't see how. COW is all about sharing pages between processes, each expressed in the mapcount. It's a pure optimization for exactly that purpose.

GUP is the problem, not COW, not the mapcount. To me the mapcount is the only thing that makes sense in COW+unsharing logic, and GUP has to be taught to identify it and resolve it -> unshare when it detects a shared anonymous page.

> Which avoids the whole mmap_sem issue. That was a big issue for the
> rdma people, afaik.

While I do care about future use cases, I cannot possibly see fork() not requiring the mmap_lock in the foreseeable future. Just so much depends on it as of now. And after all, fixing everything we discovered so far is more important to me than something like that for the future. We have other problems to solve in that regard.

----------------------------------------------------------------------

I didn't want to talk about hugetlb here, but I will, just because it's a good example of why using the refcount is just wrong -- because unnecessary COWs are just absolutely problematic. Assume you have a R/O mapped huge page that is only mapped into your process and you get a write fault. What should your COW logic do?

a) Rely on the mapcount? Yes, iff GUP has been taught to unshare properly, because then it expresses exactly what we want to know. mapcount == 1 -> reuse. mapcount > 1 -> COW.

b) Rely on the refcount? If we have a speculative reference on the page we would COW. As huge pages are a scarce resource we can easily just not have a free huge page anymore and crash the application. The app didn't do anything wrong.

So teaching the hugetlb COW code to rely on the refcount would just be highly fragile.
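[Editor's note: schematically, the two options David contrasts for the hugetlb write-fault path. Option (a) is close to what hugetlb_cow() in mainline did at the time; option (b) is the refcount variant he argues against, and its `copy` label is purely illustrative.]

    /* (a) mapcount-based reuse, roughly as in hugetlb_cow(): */
    if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
            page_move_anon_rmap(old_page, vma);
            set_huge_ptep_writable(vma, haddr, ptep);
            return 0;       /* reused in place, no huge page allocated */
    }

    /* (b) refcount-based reuse: any transient reference forces a copy,
     *     and allocating that copy can fail on a scarce hugetlb reserve. */
    if (page_count(old_page) != 1)
            goto copy;      /* alloc_huge_page() may have nothing left */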
On 17.12.21 23:18, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> For now I have not heard a compelling argument why the mapcount is
>> dubious, I repeat:
>>
>> * mapcount can only increase due to fork()
>> * mapcount can decrease due to unmap / zap
>
> And to answer the "why is this dubious", let's just look at your
> actual code that I reacted to:
>
> + vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
> + if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
> +     page_mapcount(vmf->page) > 1) {
>
> Note how you don't just check page_mapcount(). Why not? Because
> mapcount is completely immaterial if it's not a PageAnon page, so you
> test for that.
>
> So even when you do the mapcount read as one atomic thing, it's one
> atomic thing that depends on _other_ things, and all these checks are
> not atomic.
>
> But a PageAnon() page can actually become a swap-backed page, and as
> far as I can tell, your code doesn't have any locking to protect
> against that.

The pages stay PageAnon(). Swap-backed pages simply set a bit IIRC. The mapcount still applies.

> So now you need not only the mmap_sem (to protect against fork), you
> also need the page lock (to protect against rmap changing the type of
> page).

No, I don't think so. But I'm happy to be proven wrong because I might just be missing something important.

> I don't see you taking the page lock anywhere. Maybe the page table
> lock ends up serializing sufficiently with the rmap code that it ends
> up working.
>
> In the do_wp_page() path, we currently do those kinds of racy checks
> too, but then we do a trylock_page, and re-do them. And at any time
> there is any question about things, we fall back to copying - because
> a copy is always safe.

Yes, I studied that code in detail as well.

> Well, it's always safe if we have the rule that "once we've pinned
> things, we don't cause them to be COW again".

We should also be handling FOLL_GET, but that's a completely different discussion.

> But that "it's safe if" was exactly my (b) case.
>
> That's why I much prefer the model I'm trying to push - it's
> conceptually quite simple. I can literally explain mine at a
> conceptual level with that "break pre-existing COW, make sure no
> future COW" model.

:) We really might be talking about the same thing, just that my point is that the mapcount is the right thing to use for making the decision whether to break COW -> trigger unsharing.

> In contrast, I look at your page_mapcount() code, and I go "there are
> no conceptual rules here, and the actual implementation details look
> dodgy".
>
> I personally like having clear conceptual rules - as opposed to random
> implementation details.

Oh, don't get me wrong, me too. But for me it all just makes perfect sense. What we document is: "The fault is an unsharing request to unshare a shared anonymous page (-> mapped R/O). Does not apply to KSM." And the code checks for exactly that. And in that context the mapcount just expresses exactly what we want.

Again, unless I am missing something important that you raise above.

Anyhow, it's late in Germany. Thanks for the discussion, Linus!
On Fri, Dec 17, 2021 at 2:29 PM David Hildenbrand <david@redhat.com> wrote:
>
> While I do care about future use cases, I cannot possibly see fork() not
> requiring the mmap_lock in the foreseeable future. Just so much depends
> on it as of now.

It's not that *fork()* depends on it.

Of course fork() takes the mmap_sem.

It's that fast-gup really really doesn't want it, and can't take it.

So any fast-gup user fundamentally cannot look at mapcount(), because that would be fundamentally wrong and racy, and could race with fork.

And yet, as far as I can tell, that's *exactly* what your gup patches do, with gup_pte_range() adding

+ if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
+         put_compound_head(head, 1, flags);
+         goto pte_unmap;
+ }

which looks at the page mapcount without holding the mmap sem at all.

And see my other email - I think there are other examples of your patches looking at data that isn't stable because you don't hold the right locks.

And you can't even do the optimistic case without taking the lock, because in your world, a COW that optimistically copies in the case of a race condition is fundamentally *wrong* and buggy. Because in your world-view, GUP and COW are very different and have different rules, but you need things to be *exact*, and they aren't.

And none of this is anything at least I can think about, because I don't see what the "design" is.

I really have a hard time following what the rules actually are. You seem to think that "page_mapcount()" is a really simple rule, and I fundamentally disagree. It's a _very_ complicated thing indeed, with locking issues, AND YOU ACTIVELY VIOLATE THE LOCKING RULES!

See why I'm so unhappy?

We *did* do the page_mapcount() thing. It was bad. It forced COW to always take the page lock. There's a very real reason why I'm pushing my "let's have a _design_ here", instead of your "let's look at page_mapcount without even doing the locking".

And yes, I *know* that fork() takes the mmap_sem, and likely always will. That really isn't the problem here. The problem is that your page_mapcount() paths DO NOT take that lock.

Btw, maybe I'm misreading things. I looked at the individual patches, I didn't apply them, maybe I missed something. But I don't think I am.

Linus
On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand <david@redhat.com> wrote:
>
> The pages stay PageAnon(). Swap-backed pages simply set a bit IIRC.
> The mapcount still applies.

Our code-base is too large for me to remember all the details, but if we still end up having PageAnon for swapbacked pages, then mapcount can increase from another process faulting in a pte with that swap entry.

And mmap_sem doesn't protect against that. Again, page_lock() does. And taking the page lock was a big performance issue.

One of the reasons that new COW handling is so nice is that you can do things like

        if (!trylock_page(page))
                goto copy;

exactly because in the a/b world order, the copy case is always safe.

In your model, as far as I can tell, you leave the page read-only and a subsequent COW fault _can_ happen, which means that now the subsequent COW needs to be very careful, because if it ever copies a page that was GUP'ed, you just broke the rules.

So COWing too much is a bug (because it breaks the page from the GUP), but COWing too little is an even worse problem (because it means that now the GUP user can see data it shouldn't have seen).

Our old code literally COWed too little. It's why all those changes happened in the first place.

This is why I'm pushing that whole story line of

 (1) COW is based purely on refcounting, because that's the only thing that obviously can never COW too little.

 (2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then makes sure to not mark pinned pages COW again (that "(b)" rule).

and here "don't use page_mapcount()" really is about that (1).

You do seem to have kept (1) in that your COW rules don't seem to change (but maybe I missed it), but because your GUP-vs-COW semantics are very different indeed, I'm not at all convinced about (2).

Linus
On 17.12.21 23:58, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 2:29 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> While I do care about future use cases, I cannot possibly see fork() not
>> requiring the mmap_lock in the foreseeable future. Just so much depends
>> on it as of now.
>
> It's not that *fork()* depends on it.
>
> Of course fork() takes the mmap_sem.
>
> It's that fast-gup really really doesn't want it, and can't take it.

Oh, sorry, I was misreading your mail.

> So any fast-gup user fundamentally cannot look at mapcount(), because
> that would be fundamentally wrong and racy, and could race with fork.

So we're concerned about fork() racing with gup-fast-only. gup-fast-only runs essentially lockless.

As we read an atomic mapcount, the relevant thing to happen would be that we read a mapcount of 1 and decide to share the page, but there is a concurrent fork() such that the mapcount is increased. So the parent process has to be the only process owning that page for this to trigger (mapcount == 1).

In that situation, we would pin the page in gup-fast-only. BUT this is just like GUP before fork() and caught using the mm->write_protect_seq, so we'd immediately unpin it and not actually return it from get-user-pages-fast. No harm done AFAIKS.

> And yet, as far as I can tell, that's *exactly* what your gup patches
> do, with gup_pte_range() adding
>
> + if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
> +         put_compound_head(head, 1, flags);
> +         goto pte_unmap;
> + }
>
> which looks at the page mapcount without holding the mmap sem at all.
>
> And see my other email - I think there are other examples of your
> patches looking at data that isn't stable because you don't hold the
> right locks.

We rely on PageAnon(), PageKsm() and the mapcount. To my understanding, they are stable for our use in page fault handling code under the mmap_lock and in gup-fast because of the above reasoning.

> And you can't even do the optimistic case without taking the lock,
> because in your world, a COW that optimistically copies in the case of
> a race condition is fundamentally *wrong* and buggy. Because in your
> world-view, GUP and COW are very different and have different rules,
> but you need things to be *exact*, and they aren't.
>
> And none of this is anything at least I can think about, because I
> don't see what the "design" is.
>
> I really have a hard time following what the rules actually are. You
> seem to think that "page_mapcount()" is a really simple rule, and I
> fundamentally disagree. It's a _very_ complicated thing indeed, with
> locking issues, AND YOU ACTIVELY VIOLATE THE LOCKING RULES!
>
> See why I'm so unhappy?

I see why you're unhappy, and I appreciate the productive discussion :) But I think we just have to complete the big picture of what we're proposing and how the mapcount is safe to be used for this purpose. I mean, I'm happy if you actually find a flaw in the current design proposal.

> We *did* do the page_mapcount() thing. It was bad. It forced COW to
> always take the page lock. There's a very real reason why I'm pushing
> my "let's have a _design_ here", instead of your "let's look at
> page_mapcount without even doing the locking".

The locking semantics just have to be clarified and written in stone -- if we don't find any flaws.

But this will be my last mail for today, have a nice weekend, Linus!
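[Editor's note: the fork() race David refers to is the one gup-fast already handles for FOLL_PIN via mm->write_protect_seq; condensed from lockless_pages_from_mm() in mm/gup.c, with the irq disabling around the walk omitted:]

    /* Before the lockless walk: refuse to race with an in-flight fork(). */
    if (gup_flags & FOLL_PIN) {
            seq = raw_read_seqcount(&current->mm->write_protect_seq);
            if (seq & 1)
                    return 0;
    }

    gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);

    /* After the walk: if fork() ran concurrently, drop the pins and let
     * the caller fall back to the slow, mmap_lock-taking path. */
    if (gup_flags & FOLL_PIN) {
            if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
                    unpin_user_pages(pages, nr_pinned);
                    return 0;
            }
    }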
> On Dec 17, 2021, at 3:29 PM, David Hildenbrand <david@redhat.com> wrote:
>
> On 17.12.21 23:58, Linus Torvalds wrote:
>>
>> And you can't even do the optimistic case without taking the lock,
>> because in your world, a COW that optimistically copies in the case of
>> a race condition is fundamentally *wrong* and buggy. Because in your
>> world-view, GUP and COW are very different and have different rules,
>> but you need things to be *exact*, and they aren’t.

I understand the discussion mainly revolves around correctness, which is obviously the most important property, but I would like to mention that having transient get_page() calls causing unnecessary COWs can cause hard-to-analyze and hard-to-avoid performance degradation.

COW means a page copy, a TLB flush and potentially a TLB shootdown, which is the most painful, specifically on VMs. So I think that any solution should be able to limit the cases/number of unnecessary COW operations to be minimal.
On Fri, Dec 17, 2021 at 10:04:11PM +0100, David Hildenbrand wrote:
> > To remind all, the GUP users, like RDMA, VFIO use
> > FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
>
> I heard that statement often. Can you point me at the code?
>
> VFIO: drivers/vfio/vfio_iommu_type1.c
>
> vaddr_get_pfns() will end up doing a
> pin_user_pages_remote(FOLL_LONGTERM) without
> FOLL_FORCE|FOLL_WRITE.
> Is that added automatically internally?

No, it is just that VFIO is broken in this regard. AFAIK VFIO users rarely use the read-only mode and haven't noticed this bug yet.

You can search for FOLL_FORCE and see the drivers that do it this way:

drivers/misc/habanalabs/common/memory.c: FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
drivers/infiniband/core/umem.c: gup_flags |= FOLL_FORCE;
drivers/media/v4l2-core/videobuf-dma-sg.c: unsigned int flags = FOLL_FORCE;
drivers/media/common/videobuf2/frame_vector.c: FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
[etc]

No doubt there are others that do it right and wrong; this is badly documented and misunderstood.

> But this series really just wants to fix the security issue as "part 1".
> Without any more breakages.

Sure, I'm just trying to understand why this example was brought up.

Jason
On Fri, Dec 17, 2021 at 09:15:45PM +0000, Nadav Amit wrote:
> I think that there is an assumption that once a page is COW-broken,
> it would never have another write-fault that might lead to COW
> breaking later.

Yes, that is what Linus has been explaining, AFAICT.

> AFAIK at least after userfaultfd-WP followed by
> userfaultfd-write-unprotect a page might be write-protected and
> go through do_wp_page() a second time to be COW-broken again. In
> such a case, I think the FOLL_FORCE|FOLL_WRITE would not help.

Right, and this is a good reason why refcount is running into trouble; it COWs too much in cases like that, because userfaultfd-WP doesn't align to the first assumption.

Jason
[ Going back in the thread to this one ]

On Fri, Dec 17, 2021 at 1:15 PM Nadav Amit <namit@vmware.com> wrote:
>
> I think that there is an assumption that once a page is COW-broken,
> it would never have another write-fault that might lead to COW
> breaking later.

Right.

I do think there are problems in the current code, I just think that the patches are a step back.

The problems with the current code are of two kinds:

 - I think the largepage code (either THP or explicit hugetlb) doesn't do as good a job of this whole COW handling as the regular pages do

 - some of the "you can make pages read-only again explicitly" kinds of loads.

But honestly, at least for the second case, if somebody does a GUP, and then starts playing mprotect games on the same virtual memory area that they did a GUP on, and is surprised when they get another COW fault that breaks their own connection with a page they did a GUP on earlier, that's their own fault.

So I think there's some of "If you broke it, you get to keep both pieces". Literally, in this case. You have your GUP page that you looked up, and you have your virtual address page that you caused COW on with mprotect() by making it read-only and then read-write again; then you have two different pages, and at some point it really is just "Well, don't do that then".

But yes, there's also some of "some code probably didn't get fully converted to the new world order".

So if VFIO only uses FOLL_LONGTERM, and didn't ask for the COW breaking, then yes, VFIO will see page incoherencies. But that should be an issue of "VFIO should do the right thing".

So part of it is a combination of "if you do crazy things, you'll get crazy results". And some of it is some kernel pinning code that doesn't do the right thing to actually make sure it gets a shared page to be pinned.

And then there's THP and HUGETLB, that I do think needs fixing and aren't about those two kinds of cases. I think we never got around to just doing the same thing we did for regular pages. I think the hugepage code simply doesn't follow that "COW on GUP, mark to not COW later" pattern.

Linus
On Fri, Dec 17, 2021 at 5:53 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And then there's THP and HUGETLB, that I do think needs fixing and
> aren't about those two kinds of cases.
>
> I think we never got around to just doing the same thing we did for
> regular pages. I think the hugepage code simply doesn't follow that
> "COW on GUP, mark to not COW later" pattern.

In particular, do_huge_pmd_wp_page() has this pattern:

        /*
         * We can only reuse the page if nobody else maps the huge page or it's
         * part.
         */
        if (reuse_swap_page(page, NULL)) {
                ... mark it writable ...

and that never got converted to "only mark it writable if we actually have exclusive access to this huge page".

So the problem is literally that reuse_swap_page() uses that "page_mapcount()" logic, and doesn't take into account that the page is actually used by a GUP reference.

Which is exactly why David then sees that "oh, we got a GUP reference to it, and now we're seeing the writes come through". Because that code looks at mapcount, and it shouldn't.

I think the hugepage code should use the exact same logic that the regular wp fault code does.

Linus
On Fri, Dec 17, 2021 at 6:17 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I think the hugepage code should use the exact same logic that the > regular wp fault code does. IOW, I think that this stupid (AND UNTESTED) patch should likely just fix David's test-case with the hugepage and splice thing.. Or at least be somewhat close. But it should be paired with the GUP side doing the right thing too, of course. Maybe it already does, maybe it doesn't, I didn't check... And maybe there's something subtle about the page_count() of a THP entry. Again, I can't really claim to have tested this all, but I'm hoping this makes somebody go "Ahh, now I see what Linus means" Linus
On Fri, Dec 17, 2021 at 05:53:45PM -0800, Linus Torvalds wrote:

> But honestly, at least for the second case, if somebody does a GUP,
> and then starts playing mprotect games on the same virtual memory area
> that they did a GUP on, and are surprised when they get another COW
> fault that breaks their own connection with a page they did a GUP on
> earlier, that's their own fault.

I've been told there are real workloads that do this.

Something like qemu will use GUP with VFIO to insert PCI devices into the guest and GUP with RDMA to do fast network copy of VM memory during VM migration.

qemu also uses the WP games to implement dirty tracking of VM memory during migration (and more? I'm not sure). It expects that during all of this nothing will COW the pages, as the two kinds of DMA must always go to the pages mapped to KVM.

The big trouble here is this all worked before, so it is a userspace visible regression.

Can this be made to work at all?

I wonder: if qemu uses MAP_SHARED, e.g. via a memfd or something, does the COW then go away naturally?

Jason
> On Dec 17, 2021, at 7:05 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Dec 17, 2021 at 05:53:45PM -0800, Linus Torvalds wrote:
>
>> But honestly, at least for the second case, if somebody does a GUP,
>> and then starts playing mprotect games on the same virtual memory area
>> that they did a GUP on, and are surprised when they get another COW
>> fault that breaks their own connection with a page they did a GUP on
>> earlier, that's their own fault.
>
> I've been told there are real workloads that do this.
>
> Something like qemu will use GUP with VFIO to insert PCI devices into
> the guest and GUP with RDMA to do fast network copy of VM memory
> during VM migration.
>
> qemu also uses the WP games to implement dirty tracking of VM memory
> during migration (and more? I'm not sure). It expects that during all
> of this nothing will COW the pages, as the two kinds of DMA must
> always go to the pages mapped to KVM.
>
> The big trouble here is this all worked before, so it is a userspace
> visible regression.

In such a case, I do think it makes sense to fail uffd-wp (when page_count() > 1), and in a prototype I am working on I do something like that. Otherwise, if the page is written and you use uffd for dirty tracking, what do you actually achieve?

You can return EAGAIN (which is documented and actually returned while “mmap_changing”) in such a case. This would not break userspace, but would indeed still be likely to cause a performance regression.
On Fri, Dec 17, 2021 at 6:42 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> IOW, I think that this stupid (AND UNTESTED) patch should likely just
> fix David's test-case with the hugepage and splice thing..

Looking at that patch, the page lock is entirely pointless. It existed because of that broken reuse_swap_page() that tried to count page mappings etc, but once you get rid of that - like we got rid of it for the regular pages - it's not even needed.

So as we hold the page table lock, and see a page_count() of 1, we could be done without any page lock at all. So that whole trylock/unlock is actually unnecessary.

That said, it's possibly worth re-using any swap cache pages at this point, and that would want the page lock. So some complexity in this area is likely worth it. Similar to how we did it in commit f4c4a3f48480 ("mm: free idle swap cache page after COW") for regular pages.

So that patch is not great, but I think it works as a guiding one. And notice how *simple* it is. It doesn't require careful counting of swap entries that depends on page locking.

Linus
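The regular-page precedent referenced here, commit f4c4a3f48480 ("mm: free idle swap cache page after COW"), boils down to a couple of lines at the end of wp_page_copy(). Paraphrased from memory, so treat it as a sketch rather than the verbatim commit:

	/*
	 * After a successful COW, try to drop the now-idle swap cache
	 * page, so a later COW fault on it can see page_count() == 1.
	 */
	if (page_copied)
		free_swap_cache(old_page);	/* no-op unless PageSwapCache */
	put_page(old_page);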
On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <namit@vmware.com> wrote:
>
> In such a case, I do think it makes sense to fail uffd-wp (when
> page_count() > 1), and in a prototype I am working on I do something
> like that.

Ack. If uffd-wp finds a page that is pinned, just skip it as not write-protectable.

Because some of the pinners might be writing to it, of course - just not through the page tables.

So that sounds like the right thing to do. I _think_ we discussed this the last time this came up. I have some dim memory of that. Jason, ring a bell?

Linus
On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit <namit@vmware.com> wrote:
>
> I understand the discussion mainly revolves around correctness, which is
> obviously the most important property, but I would like to mention
> that having transient get_page() calls causing unnecessary COWs can
> cause hard-to-analyze and hard-to-avoid performance degradation.

Note that the COW itself is pretty cheap. Yes, there's the page allocation and copy, but it's mostly a local thing. So that falls under the "good to avoid" heading, but in the end it's not an immense deal.

In contrast, the page lock has been an actual big user-visible latency issue, to the point of becoming a correctness problem. A couple of years ago, we literally had NMI watchdog timeouts due to the page wait-queues growing basically boundlessly. This was some customer internal benchmark code that I never saw, so it wasn't *quite* clear exactly what was going on, but we ended up having to split up the page wait list traversal using bookmark entries, because it was such a huge latency issue. That was mostly NUMA balancing faults, I think, but the point I'm making is that avoiding the page lock can be a *much* bigger deal than avoiding some local allocation and copying of a page of data.

There are real loads where the page-lock gets insanely bad, and I think it's because we use it much too much. See commit 2554db916586 ("sched/wait: Break up long wake list walk") for some of that saga.

So I really think that having to serialize with the page lock in order to do some "exact page use counting" is a false economy. Yes, maybe you'd be able to avoid a COW or two, but at what locking cost?

Linus
> On Dec 17, 2021, at 8:02 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit <namit@vmware.com> wrote:
>>
>> I understand the discussion mainly revolves around correctness, which is
>> obviously the most important property, but I would like to mention
>> that having transient get_page() calls causing unnecessary COWs can
>> cause hard-to-analyze and hard-to-avoid performance degradation.
>
> Note that the COW itself is pretty cheap. Yes, there's the page
> allocation and copy, but it's mostly a local thing.

I don’t know about the page-lock overhead, but I understand your argument.

Having said that, I do know a bit about TLB flushes, which you did not mention as overheads of COW. Such flushes can be quite expensive on multithreaded workloads (specifically on VMs, but let's put those aside).

Take for instance memcached and assume you overcommit memory with a very fast swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached often accesses a page first for read and shortly after for write. I encountered, in a similar scenario, that the page reference that lru_cache_add() takes during the first faultin event (for read) causes a COW on a write page-fault that happens shortly after [1]. So on memcached I assume this would also trigger frequent unnecessary COWs.

Besides page allocation and copy, COW would then require a TLB flush, which, when performed locally, might not be too bad (~200 cycles). But if memcached has many threads, as it usually does, then you need a TLB shootdown, and this one can be expensive (microseconds). If you start getting a TLB shootdown storm, you may avoid some IPIs since you see that other CPUs already queued IPIs for the target CPU. But then the kernel would flush the entire TLB on the target CPU, as it realizes that multiple TLB flushes were queued, and as it assumes that a full TLB flush would be cheaper.

[ I can try to run a benchmark during the weekend to measure the impact, as I did not really measure the impact on memcached before/after 5.8. ]

So I am in no position to prioritize one overhead over the other, but I do not think that COW can be characterized as mostly-local and cheap in the case of multithreaded workloads.

[1] https://lore.kernel.org/linux-mm/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com/
On Sat, Dec 18, 2021 at 04:52:13AM +0000, Nadav Amit wrote: > Take for instance memcached and assume you overcommit memory with a very fast > swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached > often accesses a page first for read and shortly after for write. I > encountered, in a similar scenario, that the page reference that > lru_cache_add() takes during the first faultin event (for read), causes a COW > on a write page-fault that happens shortly after [1]. So on memcached I > assume this would also trigger frequent unnecessary COWs. Why are we comparing page_count() against 1 and not 1 + PageLRU(page)? Having a reference from the LRU should be expected. Is it because of some race that we'd need to take the page lock to protect against?
> On Dec 17, 2021, at 9:03 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Dec 18, 2021 at 04:52:13AM +0000, Nadav Amit wrote:
>> Take for instance memcached and assume you overcommit memory with a very fast
>> swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached
>> often accesses a page first for read and shortly after for write. I
>> encountered, in a similar scenario, that the page reference that
>> lru_cache_add() takes during the first faultin event (for read), causes a COW
>> on a write page-fault that happens shortly after [1]. So on memcached I
>> assume this would also trigger frequent unnecessary COWs.
>
> Why are we comparing page_count() against 1 and not 1 + PageLRU(page)?
> Having a reference from the LRU should be expected. Is it because of
> some race that we'd need to take the page lock to protect against?

IIUC, the reference that is taken on the page is taken before SetPageLRU() is called, and the reference is later dropped:

	lru_add_drain()
	  lru_add_drain_cpu()
	    __pagevec_lru_add()
	      __pagevec_lru_add_fn()
	        SetPageLRU()	<- sets the LRU
	release_pages()		<- drops the reference

It is one scenario I encountered. There might be others that take transient references on pages that cause unnecessary COWs. I think David and Andrea had a few in mind. To trigger a COW bug I once used mlock()/munlock(), which take such a transient reference. But who knows how many other cases exist (KSM? vmscan?)
On 18.12.21 00:20, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> The pages stay PageAnon(). swap-backed pages simply set a bit IIRC.
>> mapcount still applies.
>
> Our code-base is too large for me to remember all the details, but if
> we still end up having PageAnon for swapbacked pages, then mapcount
> can increase from another process faulting in a pte with that swap
> entry.

"Our code-base is too large for me to remember all the details". I second that.

You might have a valid point with the mapcount regarding concurrent swapin in the current code. I'll have to think further about whether it could be a problem and whether it can be handled without heavy synchronization (I think the concern is that gup unsharing could miss doing an unshare because it doesn't detect that there are other page sharers, expressed not via the mapcount but via the swap code, when seeing mapcount == 1).

Do you have any other concerns regarding the semantics/stability of the following points (as discussed, fork() is not the issue because it can be handled via write_protect_seq or something comparable; handling per-process thingies is not the problem):

a) Using PageAnon(): It cannot possibly change in the pagefault path or in the gup-fast-only path (otherwise there would be use-after-free already).

b) Using PageKsm(): It cannot possibly change in the pagefault path or in the gup-fast path (otherwise there would be use-after-free already).

c) Using mapcount: It cannot possibly change in the way we care about or cannot detect (mapcount going from == 1 to > 1 concurrently) in the pagefault path or in the gup-fast path due to fork().

Your point for c) is that we might currently not handle swap correctly. Any other concerns, especially regarding the mapcount, or is that it?

IIUC, any GUP approach to detect necessary unsharing would at least require a check for a) and b). What we're arguing about is c).

>
> And mmap_sem doesn't protect against that. Again, page_lock() does.
>
> And taking the page lock was a big performance issue.
>
> One of the reasons that the new COW handling is so nice is that you can do
> things like
>
> if (!trylock_page(page))
> goto copy;
>
> exactly because in the a/b world order, the copy case is always safe.
>
> In your model, as far as I can tell, you leave the page read-only and
> a subsequent COW fault _can_ happen, which means that now the
> subsequent COW needs to be very careful, because if it ever
> copies a page that was GUP'ed, you just broke the rules.
>
> So COWing too much is a bug (because it breaks the page from the GUP),
> but COWing too little is an even worse problem (because it means that
> now the GUP user can see data it shouldn't have seen).

Good summary, I'll extend below.

>
> Our old code literally COWed too little. It's why all those changes
> happened in the first place.

Let's see if we can agree on some things to get a common understanding. What can happen with COW is:

1) Missed COW

We miss a COW, therefore someone has access to a wrong page. This is the security issue as in patch #11. The security issue documented in [1].

2) Unnecessary COW

We do a COW, but there are no other valid users, so it's just overhead + noise. The performance issue documented in section 5 in [1].

3) Wrong COW

We do a COW but there are other valid users (-> GUP). The memory corruption issue documented in sections 2 and 3 in [1].
Most notably, the io_uring reproducer which races with the page_maybe_dma_pinned() check in current code can trigger this easily, and exactly this issue is what gives me nightmares. [2]

Does that make sense? If we agree on the above, then here is how the currently discussed approaches differ:

page_count != 1:
* 1) cannot happen
* 2) can happen easily (speculative references due to pagecache, migration, daemon, pagevec, ...)
* 3) can happen in the current code

mapcount > 1:
* 1) your concern is that this can happen due to concurrent swapin
* 2) cannot happen.
* 3) your concern is that this can happen due to concurrent swapin

If we can agree on that, I can see why you dislike mapcount; can you see why I dislike page_count?

Ideally we'd really have a fast and reliable check for "is this page shared and could get used by multiple processes -- either multiple processes are already mapping it R/O or could map it via the swap R/O later".

> This is why I'm pushing that whole story line of
>
> (1) COW is based purely on refcounting, because that's the only thing
> that obviously can never COW too little.

I am completely missing how 2) or 3) could *ever* be handled properly for page_count != 1. 3) is obviously more important and gives me nightmares. And that's what I'm trying to communicate the whole time: page_count is absolutely fragile, because anything that results in a page getting mapped R/O into a page table can trigger 3). And as [2] proves, that can even happen with *swap*. (see how we're running into the same swap issues with both approaches? Stupid swap :) )

>
> (2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then
> makes sure to not mark pinned pages COW again (that "(b)" rule).
>
> and here "don't use page_mapcount()" really is about that (1).
>
> You do seem to have kept (1) in that your COW rules don't seem to
> change (but maybe I missed it), but because your GUP-vs-COW semantics
> are very different indeed, I'm not at all convinced about (2).

Oh yes, sorry, not in the context of this series. The point is that the current page_count != 1 covers mapcount > 1, so we can adjust that separately later.

You mentioned "design", so let's assume we have a nice function:

	/*
	 * Check if an anon page is shared or exclusively used by a single
	 * process: if shared, the page is shared by multiple processes either
	 * mapping the page R/O ("active sharing") or having swap entries that
	 * could result in the page getting mapped R/O ("inactive sharing").
	 *
	 * This function is safe to be called under mmap_lock in read/write mode
	 * because it prevents concurrent fork() sharing the page.
	 * This function is safe to be called from gup-fast-only in IRQ context,
	 * as it detects concurrent fork() sharing the page.
	 */
	bool page_anon_shared();

Can we agree that that would be a suitable function for (1) and (2) instead of using either the page_count or the mapcount directly? (yes, how to actually make it reliable in the face of swapin is to be discussed, but it might be a problem worth solving if that's the way to go)

For hugetlb, this would really have to use the mapcount as explained (after all, fortunately, there is no swap ...).

[1] https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/
[2] https://gitlab.com/aarcange/kernel-testcases-for-v5.11/-/blob/main/io_uring_swap.c
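A minimal sketch of what the body of such a helper might look like. page_anon_shared() is only proposed in this thread, not merged code, and the concurrent-swapin hole being debated is precisely what this naive version does not close:

	static bool page_anon_shared(struct page *page)
	{
		VM_BUG_ON_PAGE(!PageAnon(page), page);

		if (PageKsm(page))
			return true;	/* KSM pages are shared by definition */
		if (page_mapcount(page) > 1)
			return true;	/* actively shared via page tables */
		if (PageSwapCache(page) && page_swapcount(page) > 0)
			return true;	/* inactively shared via swap entries */
		return false;
	}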
On Fri, Dec 17, 2021 at 9:03 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Why are we comparing page_count() against 1 and not 1 + PageLRU(page)?
> Having a reference from the LRU should be expected. Is it because of
> some race that we'd need to take the page lock to protect against?

The LRU doesn't actually count towards a reference - the LRU list is maintained independently of the lifetime of the page (and is torn down on last release - which wouldn't work if the LRU list itself held a ref to the page).

But at least some of the code that gathers up pages to then put them on the LRU list takes a ref to the pages before passing them off, just to guarantee to keep them around during the operation.

So yes, various things can increment page counts in a transitory manner.

I still *much* prefer a reliable COW over one that doesn't happen enough. The page count can have these (on the whole fairly rare) blips. That's ok. The page count is still *reliable*, in ways that the mapcount can never be. The mapcount fundamentally doesn't show "other non-mapped users".

So Nadav is correct that unnecessary cow events will cause extra work (and the TLB flush is a good point - just marking a page writable as-is is much cheaper). But I'm looking at the actual code, and the actual logic, and I am dismissing the whole mapcount games completely.

David has a 10-patch series (plus one test) of complex, grotty, hard-to-understand code with new flags. I posted a patch that removed 10 lines, and fixes the problem case his test-case was designed for. I think that really speaks to the issues. My approach is *simpler* and a hell of a lot more robust. And as mentioned, I can explain it.

And christ the thing I'm advocating for is WHAT WE ALREADY DO FOR 99.99% of all cases. Why? Because it's literally how the regular COW paths work TODAY. And we had benchmarks show performance improvements (or no movement at all) from when we made those changes. Not the downsides that people claim. It's only the THP paths that are broken (and possibly some individual mis-uses of GUP - people have mentioned virtio).

So now people are trying to do a fragile, complex thing that was shown to be problematic for the common case, and they are doing it for the insanely rare case? When a ten-line removal patch fixes that one too?

Linus

PS. Yes, yes, that 10-line removal patch is obviously still not tested, it's still likely incomplete because the THP case needs to do the page-pinning logic on the other side too, so I'm very obviously over-simplifying. But the fact that the *normal* pages already do this correctly - and don't use mapcount - should really make people go "Hmm".
On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <namit@vmware.com> wrote:
> >
> > In such a case, I do think it makes sense to fail uffd-wp (when
> > page_count() > 1), and in a prototype I am working on I do something
> > like that.
>
> Ack. If uffd-wp finds a page that is pinned, just skip it as not
> write-protectable.
>
> Because some of the pinners might be writing to it, of course - just
> not through the page tables.

That doesn't address the qemu use case though.

The RDMA pin is the 'coherent r/o pin' we discussed before, which requires that the pages remain un-write-protected and the HW DMA is read only.

The VFIO pin will enable dirty page tracking in the system IOMMU so it gets the same effect from qemu's perspective as the CPU WP is doing.

In these operations every single page of the guest will be pinned, so skipping them just means userfaultfd wp doesn't work at all.

Qemu needs some solution to be able to dirty track the CPU memory for migration..

> So that sounds like the right thing to do. I _think_ we discussed this
> the last time this came up. I have some dim memory of that. Jason,
> ring a bell?

We talked about clear_refs a lot, but the use case was never really clear, I think. Plus that discussion never finalized to anything.

David's latest summary seems accurate, if I paraphrase at a high level: Linus's approach always does enough COWs but might do extra, and David's approach tries to do exactly the right number of COWs.

It looks like, to have the same functionality with Linus's approach, we need to have a way for userspace to opt out of COW and work in an entirely deterministic non-COW world. WP&GUP can never work together otherwise, which leaves qemu stranded.

Or, we follow David's approach and make COW be precise and accept the complexity..

Jason
[ Cutting down ruthlessly to the core of the issue ]

On Sat, Dec 18, 2021 at 1:58 AM David Hildenbrand <david@redhat.com> wrote:
>
> 1) Missed COW
>
> 2) Unnecessary COW
>
> 3) Wrong COW

> Does that make sense? If we agree on the above, then here is how the
> currently discussed approaches differ:
>
> page_count != 1:
> * 1) cannot happen
> * 2) can happen easily (speculative references due to pagecache,
> migration, daemon, pagevec, ...)
> * 3) can happen in the current code

I claim that (1) "cannot happen" is a huge mother of a deal. It's *LITERALLY* the bug you are chasing, and it's the security issue, so on a bug scale, it's about the worst there is.

I further then claim that (2) "happen easily" is you just making things up. Yes, it can happen. But no, it's not actually that common, and since (2) is harmless from a correctness standpoint, it is purely about performance. And as mentioned, not using the mapcount actually makes *common* operations much simpler and faster. You don't need the page lock to serialize the mapcount.

So (2) is a performance argument, and you haven't actually shown it to be a problem.

Which really only leaves (3). And I've already explained what the fix is: don't ever mark pages that shouldn't be COW'ed as being COW pages.

(3) is really that simple, although it ended up depending on Jason and John Hubbard and others doing that FOLL_PIN logic to distinguish "I just want to see a random page, and I don't care about COW" from "I want to get a page, and that page needs to be coherent with this VM and not be COW'ed away".

So I'm not claiming (3) is "trivial", but at the same time it's certainly not some fundamentally complicated thing, and it's easy to explain what is going on.

> mapcount > 1:
> * 1) your concern is that this can happen due to concurrent swapin
> * 2) cannot happen.
> * 3) your concern is that this can happen due to concurrent swapin

No, my concern about (1) is that IT IS WRONG. "mapcount" means nothing for COW. I even gave you an example of exactly where it means nothing. It's crazy. It's illogical. And it's complicated as hell.

The fact that only one user maps a page is simply not meaningful. That page can have other users that you don't know anything about, and that don't show up in the mapcount. That page can be swapcached, in which case mapcount can change radically in ways that you earlier indicated cannot happen. You were wrong.

But even if you fix that - by taking the page lock in every single place - there are still *other* users that for all you know may want the old contents. You don't know. The only thing that says "no other users" is the page count. Not the mapcount.

In other words, I claim that

(a) mapcount is fundamentally the wrong thing to test. You can be the only mapper, without being the "owner" of the page.

(b) it's *LITERALLY* the direct and present source of that bug in the testcase you added, where a page with a mapcount of 1 has other concurrent users and needs to be COW'ed but isn't.

(c) it's complicated and expensive to calculate (where one big part of the expense is the page lock synchronization requirements, but there are others)

And this all happens for that "case (1)", which is the worst and scariest of them all.

In contrast to that, your argument that "(2) cannot happen" is a total non-argument. (2) isn't the problem. And I claim that (3) can happen because you're testing the wrong counter, so who knows if the COW is wrong or not?
> I am completely missing how 2) or 3) could *ever* be handled properly
> for page_count != 1. 3) is obviously more important and gives me nightmares.

Ok, so if I tell you how (2) and (3) are handled properly, you will just admit you were wrong?

Here's how they are handled properly with page counts. I have told you this before, but I'll summarize:

(2) is handled semantically properly by definition - it may be "unnecessary", but it has no semantic meaning.

This is an IMPORTANT thing to realize. The fact is, (2) is not in the same class as (1) or (3).

And honestly - we've been doing this for all the common cases already since at least 5.9, and your performance argument simply has not really reared its head. Which makes the whole argument moot.

I claim that it simplifies lots of common operations and avoids having to serialize on a lock that has been a real and major problem. You claim it's extra overhead and can cause extra COW events. Neither of us has any numbers worth anything, but at least I can point to the fact that all the *normal* VM paths have been doing the thing I advocate for many releases now, and the sky most definitely is NOT falling.

So that only leaves (3). Handling (3) really is so conceptually simple that I feel silly for repeating it: if you don't want a COW to happen, then you mark the page as being not-COW.

That sounds so simple as to be stupid. But it really is the solution. It's what that pinning logic does, and keeps that "page may be pinned" state around, and then operations like fork() that would otherwise create a COW mapping of it will just not do it.

So that incredibly simple approach does require actual code: it requires that explicit "fork() needs to copy instead of COW" code, it requires that "if it's pinned, we don't make a new swapcache entry out of it". So it's real code, and it's a real issue, but it's conceptually absolutely trivial, and the code is usually really simple to understand too.

So you have a *trivial* concept, and you have simple code that could be described to a slightly developmentally challenged waterfowl. If you're one of the programmers doing the "explain your code to a rubber ducky", you can look at code like this:

	/*
	 * Anonymous process memory has backing store?
	 * Try to allocate it some swap space here.
	 * Lazyfree page could be freed directly
	 */
	if (PageAnon(page) && PageSwapBacked(page)) {
		if (!PageSwapCache(page)) {
			if (!(sc->gfp_mask & __GFP_IO))
				goto keep_locked;
			if (page_maybe_dma_pinned(page))
				goto keep_locked;

and you can explain that page_maybe_dma_pinned() test to your rubber ducky, and that rubber ducky will literally nod its head. It gets it.

To recap:

 (1) is important, and page_count() is the only thing that guarantees "you get full access to a page only when it's *obviously* exclusively yours".

 (2) is NOT important, but could be a performance issue, but we have real data from the past year that it isn't.

 (3) is important, and has a really spectacularly simple conceptual fix with quite simple code too.

In contrast, with the "mapcount" games you can't even explain why they should work, and the patches I see are actively buggy because everything is so subtle.

Linus
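The fork()-side half of that "don't mark it for COW" rule is equally duck-friendly. This is the gist of page_needs_cow_for_dma() as it sits in include/linux/mm.h around this time, paraphrased from memory, so check the tree for the authoritative version:

	/*
	 * fork() keeps the page writable in the parent and copies it for
	 * the child when the page may be pinned.
	 */
	static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
						  struct page *page)
	{
		if (!is_cow_mapping(vma->vm_flags))
			return false;
		if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
			return false;
		return page_maybe_dma_pinned(page);
	}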
On Sat, Dec 18, 2021 at 11:21 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> To recap:
> (1) is important, and page_count() is the only thing that guarantees
> "you get full access to a page only when it's *obviously* exclusively
> yours".
> (2) is NOT important, but could be a performance issue, but we have
> real data from the past year that it isn't.
> (3) is important, and has a really spectacularly simple conceptual
> fix with quite simple code too.
>
> In contrast, with the "mapcount" games you can't even explain why they
> should work, and the patches I see are actively buggy because
> everything is so subtle.

So to challenge you, please explain exactly how mapcount works to solve (1) and (3), and how it incidentally guarantees that (2) doesn't happen.

And that really involves explaining the actual code too. I can explain the high-level concepts in literally a couple of sentences.

For (1), "the page_count()==1 guarantees you are the only owner, so a COW event can re-use the page" really explains it. And the code is pretty simple too. There's nothing subtle about "goto copy" when pagecount is not 1. And even the locking is simple: "we hold the page table lock, we found a page, it has only one ref to it, we own it"

Our VM is *incredibly* complicated. There really are serious advantages to having simple rules in place.

And for (2), the simple rule is "yeah, we can cause spurious cow events". That's not only simple to explain, it's simple to code for. Suddenly you don't need to worry. "Copying the page is always safe". That's a really really powerful statement.

Now, admittedly (3) is the one that ends up being more complicated, but the *concept* sure is simple. "If you don't want to COW this page, then don't mark it for COW".

The *code* for (3) is admittedly a bit more complicated. The "don't mark it for COW" is simple to say, but we do have that fairly odd locking thing with fork() doing a seqcount_write_begin/end, and then GUP does the read-seqcount thing with retry. So it's a bit unusual, and I don't think we have that particular pattern anywhere else, but it's one well-defined lock and while unusual it's not *complicated* as far as kernel locking rules go. It's unusual and perhaps not trivial, but in the end those seqcount code sequences are maybe 10 lines total, and they don't interact with anything else.

And yes, the "don't mark it for COW" means that write-protecting something is special, mainly because we sadly do not have extra bits in the page tables. It would be *really* easy if we could just hide this "don't COW this page" in the page table. Truly trivial. We don't, because of portability across different architectures ;(

So I'll freely give you that my (3) is somewhat painful, but it's painful with a really simple concept.

And the places that get (3) wrong are generally places that nobody has been able to care about. I didn't realize the problem with creating a swap page after the fact for a while, so that commit feb889fb40fa ("mm: don't put pinned pages into the swap cache") came later, but it's literally a very simple two-liner.

The commit message for commit feb889fb40fa may be worth reading. It very much explains the spirit of the thing, and is much longer than the trivial patch itself.

Simple and clear concepts matter. Code gets complicated even then, but complex code with complex concepts is a bad combination.

Linus
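For reference, the seqcount pattern being counted here lives around mm->write_protect_seq. Paraphrased from mm/memory.c and mm/gup.c of this era, so treat it as a sketch rather than verbatim kernel code:

	/* fork() side, in copy_page_range(): */
	if (is_cow) {
		mmap_assert_write_locked(src_mm);
		raw_write_seqcount_begin(&src_mm->write_protect_seq);
	}
	/* ... copy page tables, write-protecting COW pages ... */
	if (is_cow)
		raw_write_seqcount_end(&src_mm->write_protect_seq);

	/* gup-fast side, in lockless_pages_from_mm(): */
	if (gup_flags & FOLL_PIN) {
		seq = raw_read_seqcount(&current->mm->write_protect_seq);
		if (seq & 1)	/* fork() is mid-copy: fall back to slow path */
			return 0;
	}
	/* ... lockless page table walk, pinning pages ... */
	if (gup_flags & FOLL_PIN) {
		if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
			/* raced with fork(): unpin and use the slow path */
			unpin_user_pages(pages, nr_pinned);
			return 0;
		}
	}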
> On Dec 18, 2021, at 10:42 AM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
>> On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <namit@vmware.com> wrote:
>>>
>>> In such a case, I do think it makes sense to fail uffd-wp (when
>>> page_count() > 1), and in a prototype I am working on I do something
>>> like that.
>>
>> Ack. If uffd-wp finds a page that is pinned, just skip it as not
>> write-protectable.
>>
>> Because some of the pinners might be writing to it, of course - just
>> not through the page tables.
>
> That doesn't address the qemu use case though. The RDMA pin is the
> 'coherent r/o pin' we discussed before, which requires that the pages
> remain un-write-protected and the HW DMA is read only.
>
> The VFIO pin will enable dirty page tracking in the system IOMMU so it
> gets the same effect from qemu's perspective as the CPU WP is doing.
>
> In these operations every single page of the guest will be pinned, so
> skipping them just means userfaultfd wp doesn't work at all.
>
> Qemu needs some solution to be able to dirty track the CPU memory for
> migration..

My bad. I misunderstood the scenario. Yes, I guess that you pin the pages early for RDMA registration, which is also something you may do for IO-uring buffers. This would render userfaultfd unusable.

I do not see how it can be solved without custom, potentially complicated logic, which the page_count() approach wants to avoid.

The only thing I can think of is requiring the pinned regions to be first madvise’d with MADV_DONTFORK and not COW’ing in such a case. But that would break existing code.
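For completeness, the opt-out being referred to is a one-liner from userspace. A minimal sketch (len and the registration step are placeholders):

	#include <sys/mman.h>

	/*
	 * Keep fork() from ever mapping this buffer into a child, so no
	 * COW can separate the pin from the VM.
	 */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	madvise(buf, len, MADV_DONTFORK);
	/* ... then register buf with RDMA / VFIO / io_uring ... */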
On Fri, Dec 17, 2021 at 12:45:45PM -0800, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:42 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > > Then somebody else modified that page, and you got exactly what you
> > > asked for - a COW event. The original R/O pin has the original page
> > > that it asked for, and can read it just fine.
> >
> > Where in the code did I ask for a COW event? I asked for a R/O pin, not
> > any kind of memory protection.
>
> Why didn't you ask for a shared pin, if that is what you want?
>
> We already support that.
>
> If you don't like the read-only pins, don't use them. It's that simple.

So you are saying that if a GUP user wants to see changes made by userspace to the page after the GUP, it must ask for FOLL_WRITE, even if it doesn't intend to write to the page?

That's news to me. Or did I misunderstand you?
On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit <namit@vmware.com> wrote:
>
> Yes, I guess that you pin the pages early for RDMA registration, which
> is also something you may do for IO-uring buffers. This would render
> userfaultfd unusable.

I think this is all on userfaultfd.

That code literally stole two of the bits from the page table layout - bits that we could have used for better things.

And guess what? Because it required those two bits in the page tables, and because that's non-portable, it turns out that UFFD_WP can only be enabled and only works on x86-64 in the first place.

So UFFD_WP is fundamentally non-portable. Don't use it.

Anyway, the good news is that I think that exactly because uffd_wp stole two bits from the page table layout, it already has all the knowledge it needs to handle this entirely on its own. It's just too lazy to do so now.

In particular, it has that special UFFD_WP bit that basically says "this page is actually writable, but I've made it read-only just to get the fault for soft-dirty".

And the hint here is that if the page truly *was* writable, then COW just shouldn't happen, and all that the page fault code should do is set soft-dirty and return with the page set writable.

And if the page was *not* writable, then UFFD_WP wasn't actually needed in the first place, but the uffd code just sets it blindly.

Notice? It _should_ be just an operation based purely on the page table contents, never even looking at the page AT ALL. Not even the page count, much less some mapcount thing.

Done right, that soft-dirty thing could work even with no page backing at all, I think.

But as far as I know, we've actually never seen a workload that does all this, so.. Does anybody even have a test-case?

Because I do think that UFFD_WP really should never really look at the page, and this issue is actually independent of the "page_count() vs page_mapcount()" discussion.

(Somewhat related aside: Looking up the page is actually one of the more expensive operations of a page fault and a lot of other page table manipulation functions - it's where most of the cache misses happen. That's true on the page fault side, but it's also true for things like copy_page_range() etc)

Linus
On Sat, Dec 18, 2021 at 2:52 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> So you are saying that if a GUP user wants to see changes made by
> userspace to the page after the GUP, it must ask for FOLL_WRITE, even if it
> doesn't intend to write to the page?

Yup. Put the onus very clearly on GUP.

It's a very special operation, and it's one of the operations that cause a lot of problems for the VM code. It's by no means the _only_ one - we've got a lot of other things that cause issues - but I think it's very clear that GUP is very very special, and nobody should say "I want GUP to do whatever".

There are two cases for GUP:

 - look up the page as it is *NOW*

 - look up the page in order to see any future state on it (and possibly modify it)

that "any future state" is a fundamentally much heavier operation. It's the one that really *has* to get rid of any sharing, and it has to make sure no sharing happens in the future either.

So if it is an anonymous page, it basically needs to act like a write. Even if that page is then used only for reading.

Note here that "if it's anonymous" is kind of a big deal. If it's a shared file-mapped page, at that point it's "just another reference". It's potentially problematic even in that case (think of "truncate()" that tries to force-unmap all pages from VM's), but for the shared case the answer is "if you truncate it and disassociate the page from the mapping, it's _your_ problem".

And once it acts as a write, and once it does that COW and we have exclusive access to it, it might as well be just writable and dirty. You've done the expensive part already. You've forced it to be private to that VM. And this was all triggered by the user doing something very special, so no amount of "but POSIX" or whatever matters.

GUP is not great. If you use GUP, you get to deal with the downsides.

Linus
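In terms of the actual GUP flags, the two cases map roughly onto this (a sketch; see mm/gup.c for the authoritative semantics):

	/* (a) the page as it is NOW: a plain R/O pin, snapshot semantics */
	ret = pin_user_pages(addr, 1, 0, &page, NULL);

	/*
	 * (b) any future state: break the COW up front so the pinned page
	 * stays coherent with the VM -- acts like a write, even for readers.
	 */
	ret = pin_user_pages(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &page, NULL);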
> On Dec 18, 2021, at 2:53 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit <namit@vmware.com> wrote:
>>
>> Yes, I guess that you pin the pages early for RDMA registration, which
>> is also something you may do for IO-uring buffers. This would render
>> userfaultfd unusable.
>
> I think this is all on userfaultfd.
>
> That code literally stole two of the bits from the page table layout -
> bits that we could have used for better things.
>
> And guess what? Because it required those two bits in the page tables,
> and because that's non-portable, it turns out that UFFD_WP can only be
> enabled and only works on x86-64 in the first place.
>
> So UFFD_WP is fundamentally non-portable. Don't use it.

I have always felt that the PTE software-bits limit is very artificial. We can just allocate two adjacent pages when needed, one for PTEs and one for extra software bits. A software bit in the PTE can indicate “extra software bits” are relevant (to save cache-misses), and a bit in the PTEs' page-struct indicates whether there is an adjacent “extra software bits” page. I’ve done something similar once in a research project. It is rather similar to what is done for PTI at the PGD level.

>
> Anyway, the good news is that I think that exactly because uffd_wp
> stole two bits from the page table layout, it already has all the
> knowledge it needs to handle this entirely on its own. It's just too
> lazy to do so now.
>
> In particular, it has that special UFFD_WP bit that basically says
> "this page is actually writable, but I've made it read-only just to
> get the fault for soft-dirty".
>
> And the hint here is that if the page truly *was* writable, then COW
> just shouldn't happen, and all that the page fault code should do is
> set soft-dirty and return with the page set writable.
>
> And if the page was *not* writable, then UFFD_WP wasn't actually
> needed in the first place, but the uffd code just sets it blindly.

I don’t think that I am following. The write-protection of UFFD means that the userspace wants to intervene before anything else (including COW). UFFD_WP indications are recorded per PTE (i.e., not VMA). So if userspace wants to intervene on write, it must use UFFD_WP even if the page is write-protected. The kernel then has to keep the UFFD_WP indication to call userspace upon a write.

>
> Notice? It _should_ be just an operation based purely on the page
> table contents, never even looking at the page AT ALL. Not even the
> page count, much less some mapcount thing.
>
> Done right, that soft-dirty thing could work even with no page backing
> at all, I think.
>
> But as far as I know, we've actually never seen a workload that does
> all this, so.. Does anybody even have a test-case?
>
> Because I do think that UFFD_WP really should never really look at the
> page, and this issue is actually independent of the "page_count() vs
> page_mapcount()" discussion.

I can think of two examples for reasonable flows of UFFD:

[ M = Monitor thread; F = Faulting thread ]

(A) Userspace page-fault tracking (e.g., for memory migration):

1. M: WP memory.
2. F: WP page-fault: provide UFFD notification.
3. M: Unprotect the page.
4. M: Wake the faulting thread (usually as part of the unprotect)
5. F: Retry the page-fault (and handle COW).

(B) Userspace memory snapshots:

1. M: Write-protect memory.
2. M: Copy the memory to disk.
3. M: Write-unprotect memory (e.g., speculatively as you expect a page to be written to and do not want to pay the #PF price).
[ notice that the un-protection is logical, not really in the PTEs ]

4. F: Get a page-fault (later) and handle it (because it might or might not need COW)

There may be “crazier” flows (e.g., wake the faulting thread and emulate the instruction that triggered the write with ptrace), but let’s put those aside.

IIUC, the solution you propose tries to address flows such as (A). I am not sure whether the proposal is to change the write-protection API to only provide notifications (i.e., not block until after the page-fault is handled, as done today), but I do not see how it addresses (B). I am not saying it is impossible, but I think that the solution would complicate the code by making UFFD a special case.
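To make flow (A) concrete, the monitor side is only a few lines of userspace. A minimal sketch, assuming an uffd file descriptor already registered with UFFDIO_REGISTER_MODE_WP and with all error handling omitted:

	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	static void handle_wp_fault(int uffd, long page_size)
	{
		struct uffd_msg msg;
		struct uffdio_writeprotect wp;

		read(uffd, &msg, sizeof(msg));		/* 2. WP fault notification */
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			return;

		wp.range.start = msg.arg.pagefault.address & ~(page_size - 1);
		wp.range.len = page_size;
		wp.mode = 0;				/* un-protect, and ... */
		ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);	/* 3.+4. ... wake the faulter */
	}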
On Sat, Dec 18, 2021 at 4:19 PM Nadav Amit <namit@vmware.com> wrote:
>
> I have always felt that the PTE software-bits limit is very artificial.
> We can just allocate two adjacent pages when needed, one for PTEs and
> one for extra software bits. A software bit in the PTE can indicate
> “extra software bits” are relevant (to save cache-misses), and a bit
> in the PTEs' page-struct indicates whether there is an adjacent “extra
> software bits” page.

Hmm. That doesn't sound very bad, no. And it would be nice to have more software bits (and have them portably).

> I don’t think that I am following. The write-protection of UFFD means
> that the userspace wants to intervene before anything else (including
> COW).

The point I was making (badly) is that UFFD_WP is only needed for the case where the pte isn't already non-writable for other reasons.

> UFFD_WP indications are recorded per PTE (i.e., not VMA).

The changing of those bits is basically a bastardized 'mprotect()', and does already require the vma to be marked VM_UFFD_WP. And the way you set (or clear) the bits is with a range operation.

It really could have been done with mprotect(), and with actual explicit vma bits. The fact that it now uses the page table bit is rather random. I think it would actually be cleaner to make that userfaultfd_writeprotect truly *be* a vma range. Right now it's kind of "half this, half that".

Of course, it's possible that because of this situation, some users do a lot of fine-grained VM_UFFD_WP setting, and they kind of expect to not have issues with lots of vma fragments. So practical concerns may have made the implementation set in stone.

(I have only ever seen the kernel side of uffd, not the actual user side, so I'm not sure about the use patterns).

That said, your suggestion of a shadow sw page table bit thing would also work. And it would solve some problems we have in core areas (notably "page_special()" which right now has that ARCH_HAS_PTE_SPECIAL thing).

It would make it really easy to have that "this page table entry is pinned" flag too.

Linus
> On Dec 18, 2021, at 4:35 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> (I have only ever seen the kernel side of uffd, not the actual user
> side, so I'm not sure about the use patterns).

I use it at a very fine granularity, and I suspect QEMU and CRIU do so too.

>
> That said, your suggestion of a shadow sw page table bit thing would
> also work. And it would solve some problems we have in core areas
> (notably "page_special()" which right now has that
> ARCH_HAS_PTE_SPECIAL thing).
>
> It would make it really easy to have that "this page table entry is
> pinned" flag too.

I found my old messy code for the software-PTE thing.

I see that eventually I decided to hold a pointer to the “extra PTEs” of each page in the PMD-page-struct. [ I also implemented the 2-adjacent pages approach but this code is long gone. ]

My rationale was that:

1. It does not bind you to have the same size for PTE and “extra-PTE”
2. The PMD-page struct is anyhow hot (since you acquired the PTL)
3. Allocating “extra-PTE” dynamically does not require rewiring the page-tables, which requires a TLB flush.

I think there is a place to hold a pointer in the PMD-page-struct (_pt_pad_1, we just need to keep the lowest bit clear so the kernel won’t mistake it for a compound page).

I still don’t know what exactly you have in mind for making use out of it for the COW issue. Keeping a pin-count (which requires internal API changes for unpin_user_page() and friends?) or having a “was ever pinned” sticky bit? And then changing page_needs_cow_for_dma() to look at the PTE so copy_present_pte() would break the COW eagerly?

Anyhow, I can clean it up and send (although it is rather simple and I ignored many things, such as THP, remap, etc), but I am not sure I have the time now to fully address the COW problem. I will wait for Monday for David’s response.
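Purely as an illustration of the layout being described (every name below is hypothetical; none of this exists upstream): the PTE table's struct page, reached through the PMD entry, would carry a pointer to a side page holding one word of extra software bits per PTE:

	/* Hypothetical sketch, not kernel code. */
	static unsigned long *pte_extra_bits(pmd_t *pmd, unsigned long addr)
	{
		/* struct page of the PTE table -- hot anyway, it holds the PTL */
		struct page *table = pmd_page(*pmd);
		unsigned long *extra = (unsigned long *)table->_pt_pad_1;

		if (!extra)	/* no side page allocated for this table */
			return NULL;
		return &extra[pte_index(addr)];
	}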
On 12/18/21 22:02, Nadav Amit wrote:
>
>> On Dec 18, 2021, at 4:35 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> (I have only ever seen the kernel side of uffd, not the actual user
>> side, so I'm not sure about the use patterns).
>
> I use it at a very fine granularity, and I suspect QEMU and CRIU do so
> too.
>
>>
>> That said, your suggestion of a shadow sw page table bit thing would
>> also work. And it would solve some problems we have in core areas
>> (notably "page_special()" which right now has that
>> ARCH_HAS_PTE_SPECIAL thing).
>>
>> It would make it really easy to have that "this page table entry is
>> pinned" flag too.
>
> I found my old messy code for the software-PTE thing.
>
> I see that eventually I decided to hold a pointer to the “extra PTEs”
> of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> pages approach but this code is long gone. ]
>
> My rationale was that:
>
> 1. It does not bind you to have the same size for PTE and “extra-PTE”
> 2. The PMD-page struct is anyhow hot (since you acquired the PTL)
> 3. Allocating “extra-PTE” dynamically does not require rewiring the
> page-tables, which requires a TLB flush.
>
> I think there is a place to hold a pointer in the PMD-page-struct
> (_pt_pad_1, we just need to keep the lowest bit clear so the kernel
> won’t mistake it for a compound page).
>
> I still don’t know what exactly you have in mind for making use
> out of it for the COW issue. Keeping a pin-count (which requires
> internal API changes for unpin_user_page() and friends?) or having
> a “was ever pinned” sticky bit? And then changing
> page_needs_cow_for_dma() to look at the PTE so copy_present_pte()
> would break the COW eagerly?
>
> Anyhow, I can clean it up and send (although it is rather simple
> and I ignored many things, such as THP, remap, etc), but I am not
> sure I have the time now to fully address the COW problem. I will
> wait for Monday for David’s response.
>

Hi Nadav,

A couple of thoughts about this part of the design:

a) The PMD-page-struct approach won't help as much, because (assuming that we're using it in an attempt to get a true, perfect pin count), you are combining the pin counts of a PMD's worth of pages. OTOH...maybe that actually *is* OK, assuming you don't overflow--except that you can only answer the "is it dma-pinned?" question at a PMD level. That's a contradiction of your stated desire above to have very granular control.

Also, because of not having bit 0 available in page._pt_pad_1, I think the count would have to be implemented as adding and subtracting 2, instead of 1 (in order to keep the value even), further reducing the counter range.

b) If, instead, you used your older 2-adjacent pages approach, then Linus' comment makes more sense here: we could use the additional struct page to hold an exact pin count, per page. That way, you can actually build a wrapper function such as:

    page_really_is_dma_pinned()

...and/or simply get a stronger "maybe" for page_maybe_dma_pinned().

Furthermore, this direction is extensible and supports solving other "I am out of space in struct page" problems, at the cost of more pages, of course.

As an aside, I find it instructive that we're talking about this approach, instead of extending struct page. The lesson I'm taking away is: allocating more space for some cases (2x pages) is better than having *all* struct pages be larger than they are now.
Anyway, the pin count implementation would look somewhat like the existing hpage_pincount, which similarly has ample space for a separate, exact pin count. In other words, this sort of thing (mostly-pseudo code): diff --git a/include/linux/mm.h b/include/linux/mm.h index a7e4a9e7d807..646761388025 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -938,6 +938,16 @@ static inline bool hpage_pincount_available(struct page *page) return PageCompound(page) && compound_order(page) > 1; } +static inline bool shadow_page_pincount_available(struct page *page) +{ + /* + * TODO: Nadav: connect this up with the shadow page table + * implementation, and return an appropriate answer. + */ + + return false; // hardcoded for now, for compile testing +} + static inline int head_compound_pincount(struct page *head) { return atomic_read(compound_pincount_ptr(head)); @@ -950,6 +960,13 @@ static inline int compound_pincount(struct page *page) return head_compound_pincount(page); } +static inline int shadow_page_pincount(struct page *page) +{ + VM_BUG_ON_PAGE(!shadow_page_pincount_available(page), page); + + return atomic_read(shadow_page_pincount_ptr(page)); +} + static inline void set_compound_order(struct page *page, unsigned int order) { page[1].compound_order = order; @@ -1326,6 +1343,9 @@ static inline bool page_maybe_dma_pinned(struct page *page) if (hpage_pincount_available(page)) return compound_pincount(page) > 0; + if (shadow_page_pincount_available(page)) + return shadow_page_pincount(page) > 0; + /* * page_ref_count() is signed. If that refcount overflows, then * page_ref_count() returns a negative value, and callers will avoid c) The "was it ever pinned" sticky bit is not a workable concept, at the struct page level. A counter is required, in order to allow pages to live out their normal lives to their fullest potential. The only time we even temporarily got away with this kind of stickiness was at a higher level, and only per-process, not per-physical-page. Processes come and go, but the struct pages are more or less forever, so once you mark one sticky like this, it's out of play. thanks,
On 18.12.21 20:52, Linus Torvalds wrote:
> On Sat, Dec 18, 2021 at 11:21 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> To recap:
>> (1) is important, and page_count() is the only thing that guarantees
>> "you get full access to a page only when it's *obviously* exclusively
>> yours".
>> (2) is NOT important, but could be a performance issue, but we have
>> real data from the past year that it isn't.
>> (3) is important, and has a really spectacularly simple conceptual
>> fix with quite simple code too.
>>
>> In contrast, with the "mapcount" games you can't even explain why they
>> should work, and the patches I see are actively buggy because
>> everything is so subtle.
>
> So to challenge you, please explain exactly how mapcount works to
> solve (1) and (3), and how it incidentally guarantees that (2) doesn't
> happen.

Oh, there is no need for additional challenges, I've been challenged with this problem for way too long already ;)

And again, I appreciate this discussion and your feedback. I want to get all of this fixed ASAP, even if it's not going to be the way I proposed. Any input is appreciated (as long as people don't scream at me).

To get to your point: I thought about your remarks with the swapcount and it makes sense. The mapcount itself is not expressive enough to catch mapcount == 1 vs mapcount > 1. What *would* work for precise COW, I think, is having the active and inactive counts instead of just the active (mapcount) part, whereby:

  active: page tables mapping this page -> mapcount
  inactive: page tables referencing this page via swap/migration entry

An alternative would be to just know that there are inactive users. We'd have to read both values atomically in one shot. There would be ways to store that information in the _mapcount but it certainly adds a lot of complexity, and ...

>
> And that really involves explaining the actual code too. I can explain
> the high-level concepts in literally a couple of sentences.
>
> For (1), "the page_count()==1 guarantees you are the only owner, so a
> COW event can re-use the page" really explains it. And the code is
> pretty simple too. There's nothing subtle about "goto copy" when
> pagecount is not 1. And even the locking is simple: "we hold the page
> table lock, we found a page, it has only one ref to it, we own it"
>
> Our VM is *incredibly* complicated. There really are serious
> advantages to having simple rules in place.

... you have a point here. Having that said, I hope we can agree that the "page_count" is not the perfect solution. I hope we can at least tweak it for now to get rid of 3) Wrong COW.

>
> And for (2), the simple rule is "yeah, we can cause spurious cow
> events". That's not only simple to explain, it's simple to code for.
> Suddenly you don't need to worry. "Copying the page is always safe".
> That's a really really powerful statement.
>
> Now, admittedly (3) is the one that ends up being more complicated,
> but the *concept* sure is simple. "If you don't want to COW this page,
> then don't mark it for COW".
>
> The *code* for (3) is admittedly a bit more complicated. The "don't
> mark it for COW" is simple to say, but we do have that fairly odd
> locking thing with fork() doing a seqcount_write_begin/end, and then
> GUP does the read-seqcount thing with retry. So it's a bit unusual,
> and I don't think we have that particular pattern anywhere else, but
> it's one well-defined lock and while unusual it's not *complicated* as
> far as kernel locking rules go.
> It's unusual and perhaps not trivial,
> but in the end those seqcount code sequences are maybe 10 lines total,
> and they don't interact with anything else.
>
> And yes, the "don't mark it for COW" means that write-protecting
> something is special, mainly because we sadly do not have extra bits
> in the page tables. It would be *really* easy if we could just hide
> this "don't COW this page" in the page table. Truly trivial. We don't,
> because of portability across different architectures ;(
>
> So I'll freely give you that my (3) is somewhat painful, but it's
> painful with a really simple concept.

Thanks for admitting that!

I might have had an idea yesterday on how to fix most of the issues without relying on the mapcount, doing it in a similar (but slightly different) way to what you propose here. Let's call it a mixture of the unsharing approach and your approach. I cannot promise anything, so ...

... I'll go playing with it and share some details ASAP. At least it sounds comparatively simple in my head.

>
> And the places that get (3) wrong are generally places that nobody has
> been able to care about. I didn't realize the problem with creating a
> swap page after the fact for a while, so that commit feb889fb40fa
> ("mm: don't put pinned pages into the swap cache") came later, but
> it's literally a very simple two-liner.
>

Just to give you my perspective:

Personally I don't care too much about 2). The only reasons why I somehow care about "Unnecessary COW" are:

* Challenging for hugetlb use as I explained. We might still want to use the mapcount there.
* It's mostly a symptom of our eventually too simple COW logic that effectively leads to 3).

While I do care about 1) (Missed COW) for our customers, I *especially* care about 3) (Wrong COW), simply because silent memory corruptions in user space are not acceptable.

As you say, fixing 1) the "page_count" way might be easy, at least for THP. Simple example: Have swapping enabled and register a fixed io_uring buffer at the wrong time. Fixed io_uring buffers are now a commodity feature for unprivileged user space ...

So that's why I so deeply care about all of this.
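The example sketched above is easy to reach from unprivileged userspace: registering a fixed io_uring buffer long-term FOLL_PINs every page of the buffer. A sketch using the raw syscall:

	#include <linux/io_uring.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/*
	 * After this returns, every page of buf is long-term pinned; if one
	 * of those pages is later mapped R/O (e.g. after swapout) and then
	 * COW'ed, the pin and the VM silently diverge.
	 */
	static int register_fixed_buffer(int ring_fd, void *buf, size_t len)
	{
		struct iovec iov = { .iov_base = buf, .iov_len = len };

		return syscall(__NR_io_uring_register, ring_fd,
			       IORING_REGISTER_BUFFERS, &iov, 1);
	}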
On Sun, Dec 19, 2021 at 12:01:59AM -0800, John Hubbard wrote: > On 12/18/21 22:02, Nadav Amit wrote: > > I found my old messy code for the software-PTE thing. > > > > I see that eventually I decided to hold a pointer to the “extra PTEs” > > of each page in the PMD-page-struct. [ I also implemented the 2-adjacent > > pages approach but this code is long gone. ] > > a) The PMD-page-struct approach won't help as much, because (assuming > that we're using it in an attempt to get a true, perfect pin count), you > are combining the pin counts of a PMD's worth of pages. OTOH...maybe > that actually *is* OK, assuming you don't overflow--except that you can > only answer the "is it dma-pinned?" question at a PMD level. That's a > contradiction of your stated desire above to have very granular control. > > Also, because of not having bit 0 available in page._pt_pad_1, I think > the count would have to be implemented as adding and subtracting 2, > instead of 1 (in order to keep the value even), further reducing the > counter range. I think you misunderstood Nadav's approach. He's talking about making an extra side-allocation per PMD if you're using uffd, and storing extra information in it. I think it's a worthwhile approach.
On Sat, Dec 18, 2021 at 10:02 PM Nadav Amit <namit@vmware.com> wrote: > > I found my old messy code for the software-PTE thing. > > I see that eventually I decided to hold a pointer to the “extra PTEs” > of each page in the PMD-page-struct. [ I also implemented the 2-adjacent > pages approach but this code is long gone. ] Ok, I understand why that ends up being the choice, but it makes it too ugly and messy to look up to be worth it, I think. > I still don’t know what exactly you have in mind for making use > out of it for the COW issue. So the truly fundamental question for COW (and for a long-term GUP) is fairly simple: - Is the page I have truly owned exclusively by this VM? If that _isn't_ the case, you absolutely have to COW. If that _is_ the case, you can re-use the page. That is really it, boiled down to the pure basics. And if you aren't sure whether you are the ultimate and only authority over the page, then COW is the "safer" option, in that breaking sharing is fundamentally better than over-sharing. Now, the reason I like "page_count()==1" is that it is a 100% certain way to know that you own the page absolutely and clearly. There is no question what-so-ever about it. And the reason I hate "page_mapcount()==1" with a passion is that it is NOTHING OF THE KIND. It is an entirely meaningless number. It doesn't mean anything at all. Even if the page mapcount is exactly right, it could easily and trivially be a result of "fork, then unmap in either parent or child". Now that page_mapcount() is unquestionably 1, but despite that, at some point the page was shared by another VM, and you can not know whether you really have exclusive access. And that "even if page mapcount is exactly right" is a big issue in itself, as I hope I've explained. It requires page locking, it requires that you take swapcache users into account, it is just a truly messy and messed up thing. There really is absolutely no reason for page_mapcount to exist. It's a mistake. We have it for completely broken historical reasons. It's WRONG. Now, if "page_count()==1" is so great, what is the issue? Problem solved. No, while page_count()==1 is one really fundamental marker (unlike the mapcount), it does have problems too. Because yes, "page_count()==1" does mean that you have truly exclusive ownership of the page, but the reverse is not true. The way the current regular VM code handles that "the reverse is not true" is by making "the page is writable" be the second way you can say "you clearly have full ownership of the page". So that's why you then have the "maybe_pinned()" thing in fork() and in swap cache creation that keeps such a page writable, and doesn't do the virtual copy and make it read-only again. But that's also why it has problems with write-protect (whether mprotect or uffd_wp). Anyway, that was a long explanation to make the thinking clear, and finally come to the actual answer to your question: Adding another bit in the page tables - *purely* to say "this VM owns the page outright" - would be fairly powerful. And fairly simple. Then any COW event will set that bit - because when you actually COW, the page you install is *yours*. No questions asked. And fork() would simply clear that bit (unless the page was one of the pinned pages that we simply copy). See how simple that kind of concept is. And please, see how INCREDIBLY BROKEN page_mapcount() is. It really fundamentally is pure and utter garbage. 
It in no way says "I have exclusive ownership of this page", because even if the mapcount is 1 *now*, it could have been something else earlier, and some other VM could have gotten a reference to it before the current VM did so. This is why I will categorically NAK any stupid attempt to re-introduce page_mapcount() for COW or GUP handling. It's unacceptably fundamentally broken. Btw, the extra bit doesn't really have to be in the page tables. It could be a bit in the page itself. We could add another page bit that we just clear when we do the "add ref to page as you make a virtual copy during fork() etc". And no, we can't use "pincount" either, because it's not exact. The fact that the page count is so elevated that we think it's pinned is a _heuristic_, and that's ok when you have the opposite problem, and ask "*might* this page be pinned". You want to never get a false negative, but it can get a false positive. Linus
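[Editorial note: for concreteness, this is essentially the reuse rule do_wp_page() has implemented since commit 09854ba94c6a ("mm: do_wp_page() simplification"); a condensed sketch, not a verbatim quote of mm/memory.c:]

	if (PageAnon(vmf->page)) {
		struct page *page = vmf->page;

		/* only reuse if we are provably the sole owner */
		if (PageKsm(page) || page_count(page) != 1)
			goto copy;
		if (!trylock_page(page))
			goto copy;
		if (PageKsm(page) || page_mapcount(page) != 1 ||
		    page_count(page) != 1) {
			unlock_page(page);
			goto copy;
		}
		/* sole owner: reuse the page and make it writable */
		unlock_page(page);
		wp_page_reuse(vmf);
		return VM_FAULT_WRITE;
	}
copy:
	/* anything else: break COW by copying the page */
	return wp_page_copy(vmf);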
> > Btw, the extra bit doesn't really have to be in the page tables. It > could be a bit in the page itself. We could add another page bit that > we just clear when we do the "add ref to page as you make a virtual > copy during fork() etc". ^ I'm playing with the idea of using a page bit to express: "This page is exclusive". On a CoW fault, if that bit is set, I can simply reuse the page. The semantics under which to set the bit are slightly different than what you describe, and I'm playing with additional unsharing (on GUP R/O) that avoids mapping the copied page similarly R/O and simply sets the bit. But the general idea could fly I think, devil's in the detail ...
David, you said that you were working on some alternative model. Is it perhaps along these same lines below? I was thinking that a bit in the page tables to say "this page is exclusive to this VM" would be a really simple thing to deal with for fork() and swapout and friends. But we don't have such a bit in general, since many architectures have very limited sets of SW bits, and even when they exist we've spent them on things like UFFD_WP. But the more I think about the "bit doesn't even have to be in the page tables", the more I think maybe that's the solution. A bit in the 'struct page' itself. For hugepages, you'd have to distribute said bit when you split the hugepage. But other than that it looks quite simple: anybody who does a virtual copy will inevitably be messing with the page refcount, so clearing the "exclusive ownership" bit wouldn't be costly: the 'struct page' cacheline is already getting dirtied. Or what was the model you were implying in your other email? You said "I might have had an idea yesterday on how to fix most of the issues without relying on the mapcount, doing it similar [..]" but I didn't then reply to that email because I had just written this other long email to Nadav. Linus On Sun, Dec 19, 2021 at 9:27 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Adding another bit in the page tables - *purely* to say "this VM owns > the page outright" - would be fairly powerful. And fairly simple. > > Then any COW event will set that bit - because when you actually COW, > the page you install is *yours*. No questions asked. > [ snip snip ] > > Btw, the extra bit doesn't really have to be in the page tables. It > could be a bit in the page itself. We could add another page bit that > we just clear when we do the "add ref to page as you make a virtual > copy during fork() etc". > > And no, we can't use "pincount" either, because it's not exact. The > fact that the page count is so elevated that we think it's pinned is a > _heuristic_, and that's ok when you have the opposite problem, and ask > "*might* this page be pinned". You want to never get a false negative, > but it can get a false positive. > > Linus
On 19.12.21 18:44, Linus Torvalds wrote: > David, you said that you were working on some alternative model. Is it > perhaps along these same lines below? > > I was thinking that a bit in the page tables to say "this page is > exclusive to this VM" would be a really simple thing to deal with for > fork() and swapout and friends. > > But we don't have such a bit in general, since many architectures have > very limited sets of SW bits, and even when they exist we've spent > them on things like UFFD_WP. > > But the more I think about the "bit doesn't even have to be in the > page tables", the more I think maybe that's the solution. > > A bit in the 'struct page' itself. > Exactly what I am prototyping right now. > For hugepages, you'd have to distribute said bit when you split the hugepage. Yes, that's one tricky part ... > > But other than that it looks quite simple: anybody who does a virtual > copy will inevitably be messing with the page refcount, so clearing > the "exclusive ownership" bit wouldn't be costly: the 'struct page' > cacheline is already getting dirtied. > > Or what was the model you were implying in > your other email? You said I'm playing with the idea of not setting the bit always during COW but only on GUP request to set the bit (either manually if possible or via FOLL_UNSHARE). That's a bit more tricky but allows for decoupling that approach completely from the page pin counter. fork() is allowed to clear the bit if page_count() == 1 and share the page. So no GUP -> no fork() performance changes (!). Otherwise the bit can only vanish if we swapout/migrate the page: in which case there are no additional GUP/references on the page that rely on it! The bit can be set directly if we have to copy the page in the fault handler (COW or unshare). Outside of COW/Unshare code, the bit can only be set if page_count() == 1 and we sync against fork(). (and that's the problem for gup-fast-only that I'm investigating right now, because it would then always have to fall back to the slow variant if the bit isn't already set) So the bit can "vanish" whenever there is no additional reference on the page. GUP syncs against fork() and can thereby set the bit/request to set the bit. I'm trying to decouple it completely from the page pin counter to also be able to handle FOLL_GET (O_DIRECT reproducers unfortunately) correctly. Not set in stone, just an idea I'm playing with right now ... and I have to triple-check if * page is PTE mapped in the page table I'm walking * page_count() == 1 Really means that "this is the only reference.". I do strongly believe so .. :)
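[Editorial note: a condensed sketch of the rules David describes, with hypothetical PageAnonExclusive()/SetPageAnonExclusive()/ClearPageAnonExclusive() accessors; these names are made up and this is not existing kernel code:]

	/* fault path: a freshly created anon page is ours by construction */
	SetPageAnonExclusive(page);

	/*
	 * fork() path, PTL held, synchronized against GUP via mmap_lock
	 * and mm->write_protect_seq:
	 */
	static bool fork_can_share_anon_page(struct page *page)
	{
		if (!PageAnonExclusive(page))
			return true;		/* already shared */
		if (page_count(page) != 1)
			return false;		/* maybe GUPed: copy it now */
		ClearPageAnonExclusive(page);	/* turn exclusive -> shared */
		return true;
	}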
On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote: > On 19.12.21 18:44, Linus Torvalds wrote: > > David, you said that you were working on some alternative model. Is it > > perhaps along these same lines below? > > > > I was thinking that a bit in the page tables to say "this page is > > exclusive to this VM" would be a really simple thing to deal with for > > fork() and swapout and friends. > > > > But we don't have such a bit in general, since many architectures have > > very limited sets of SW bits, and even when they exist we've spent > > them on things like UFFD_WP. > > > > But the more I think about the "bit doesn't even have to be in the > > page tables", the more I think maybe that's the solution. > > > > A bit in the 'struct page' itself. > > > > Exactly what I am prototyping right now. > > > For hugepages, you'd have to distribute said bit when you split the hugepage. > > Yes, that's one tricky part ... That part shouldn't be that tricky ... Can we get rid of ->mapcount altogether? Three states: - Not mapped - Mapped exactly once - Possibly mapped more than once I appreciate "Not mapped" is not a state that anon pages can meaningfully have (maybe when they go into the swap cache?) And this information would only be present on the head page (ie stored per folio). If one VMA has multiple PTEs that map the same folio, then hopefully that only counts as mapped once. I must admit about half this conversation is going over my head. I need more time to understand all the constraints than exists between emails :-)
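[Editorial note: purely as an illustration of the tri-state idea proposed above; the names are made up, this is not a proposal of actual code:]

	enum folio_mapped_state {
		FOLIO_UNMAPPED,		/* not mapped by any page table */
		FOLIO_MAPPED_ONCE,	/* mapped by exactly one page table */
		FOLIO_MAPPED_SHARED,	/* possibly mapped more than once */
	};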
On Sun, Dec 19, 2021 at 1:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > Can we get rid of ->mapcount altogether? Three states: > - Not mapped > - Mapped exactly once > - Possibly mapped more than once I don't think even that is useful. We should get rid of mapcount entirely. It doesn't actually help to know "mapped exactly once", exactly because even when that's true, there may be non-mapped references to the page. Mapped references really aren't that special in general. One case where it *can* be special is on virtually indexed cache architectures, where "is this mapped anywhere else" can be an issue for cache flushing. There the page_mapcount() can actually really matter, but it's such an odd case that I'm not convinced it should be something the kernel VM code should bend over backwards for. And the count could be useful for 'rmap' operations, where you can stop walking the rmap once you've found all mapped cases (page migration being one case of this). But again, I'm not convinced the serialization is worth it, or that it's a noticeable win. However, I'm not 100% convinced it's worth it even there, and I'm not sure we necessarily use it there. So in general, I think page_mapcount() can be useful as a count for those things that are _literally_ about "where is this page mapped". Page migration, virtual cache coherency, things like that can literally be about "how many different virtual mappings can we find". It's just that pages can have a number of non-mapped users too, so mapcount isn't all that meaningful in general. And you can look it up with rmap too, and so I do think it would be worth discussing whether we really should strive to maintain 'mapcount' at all. > I appreciate "Not mapped" is not a state that anon pages can > meaningfully have (maybe when they go into the swap cache?) Absolutely. And we can keep references around to an anonymous page even without it having any mapping or swap cache at all (ie "gup + unmap"). So "Not mapped at all" is a possible case, without the page being free'd. Linus
On Sun, Dec 19, 2021 at 01:27:07PM -0800, Linus Torvalds wrote: > On Sun, Dec 19, 2021 at 1:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > Can we get rid of ->mapcount altogether? Three states: > > - Not mapped > > - Mapped exactly once > > - Possibly mapped more than once > > I don't think even that is useful. We should get rid of mapcount entirely. > > It doesn't actually help to know "mapped exactly once", exactly > because even when that's true, there may be non-mapped references to > the page. > > Mapped references really aren't that special in general. > > One case where it *can* be special is on virtually indexed cache > architectures, where "is this mapped anywhere else" can be an issue > for cache flushing. > > There the page_mapcount() can actually really matter, but it's such an > odd case that I'm not convinced it should be something the kernel VM > code should bend over backwards for. > > And the count could be useful for 'rmap' operations, where you can > stop walking the rmap once you've found all mapped cases (page > migration being one case of this). But again, I'm not convinced the > serialization is worth it, or that it's a noticeable win. > > However, I'm not 100% convinced it's worth it even there, and I'm not > sure we necessarily use it there. > > So in general, I think page_mapcount() can be useful as a count for > those things that are _literally_ about "where is this page mapped". > Page migration, virtual cache coherency, things like that can > literally be about "how many different virtual mappings can we find". > > It's just that pages can have a number of non-mapped users too, so > mapcount isn't all that meaningful in general. > > And you can look it up with rmap too, and so I do think it would be > worth discussing whether we really should strive to maintain > 'mapcount' at all. Yes, agreed, I was thinking that we could use "not mapped at all" as an optimisation to avoid doing rmap walks. eg __unmap_and_move(). Perhaps more interestingly in truncate_cleanup_page(): if (page_mapped(page)) unmap_mapping_page(page); where we can skip the i_mmap rbtree walk if we know the page isn't mapped. I'd be willing to give up that optimisation if we had "this page was never mapped" (ie if page_mapped() was allowed to return false positives).
On Sun, Dec 19, 2021 at 1:48 PM Matthew Wilcox <willy@infradead.org> wrote: > > Yes, agreed, I was thinking that we could use "not mapped at all" > as an optimisation to avoid doing rmap walks. eg __unmap_and_move(). So the thing is, it's a very dodgy optimization for a rather simple reason: what if somebody pages the page in? So even "not mapped at all" is questionable. You have to check that it's also not a swapcache page, and hold the page lock for that check, at the very least. And by then, you're really in a very unusual situation - and my gut feel says not one worth optimizing for (because anon pages are _usually_ mapped at least once). But I dunno - it might depend on your load. Maybe you have some very special load that happens to trigger this case a lot? Linus
On Sun, Dec 19, 2021 at 01:53:36PM -0800, Linus Torvalds wrote: > On Sun, Dec 19, 2021 at 1:48 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > Yes, agreed, I was thinking that we could use "not mapped at all" > > as an optimisation to avoid doing rmap walks. eg __unmap_and_move(). > > So the thing is, it's a very dodgy optimization for a rather simple > reason: what if somebody pages the page in? > > So even "not mapped at all" is questionable. > > You have to check that it's also not a swapcache page, and hold the > page lock for that check, at the very least. > > And by then, you're really in a very unusual situation - and my gut > feel says not one worth optimizing for (because anon pages are > _usually_ mapped at least once). I'd like to get rid of ->mapcount for file pages too. And those are definitely never mapped in the majority of cases.
On Sun, Dec 19, 2021 at 2:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > I'd like to get rid of ->mapcount for file pages too. And those are > definitely never mapped in the majority of cases. Fair enough. You'd probably be better off checking "is this mapping mapped" though. Because otherwise you have to get the page lock to serialize each page. Linus
On Sun, Dec 19, 2021 at 02:12:04PM -0800, Linus Torvalds wrote: > On Sun, Dec 19, 2021 at 2:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > I'd like to get rid of ->mapcount for file pages too. And those are > > definitely never mapped in the majority of cases. > > Fair enough. > > You'd probably be better off checking "is this mapping mapped" though. > Because otherwise you have to get the page lock to serialize each > page. Truncate already has the page locked, eg

	truncate_inode_pages_range()
	  find_lock_entries()
	  truncate_cleanup_page()
	    if (page_mapped(page))
	      unmap_mapping_page(page)

I think anyone calling unmap_mapping_page() really ought to have the page lock. Oh, we actually have an assert already to that effect ;-)

	VM_BUG_ON(!PageLocked(page));
On Sun, Dec 19, 2021 at 09:12:01PM +0000, Matthew Wilcox wrote: > Can we get rid of ->mapcount altogether? Three states: This might be a step in the right direction?

From f723fb7cf2519428eee75e9e779907f80258f302 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Mon, 20 Dec 2021 13:24:04 -0500
Subject: [PATCH] mm: reuse_swap_page() no longer needs to return map_swapcount

All callers of reuse_swap_page() currently pass NULL, indicating that they don't use the painstakingly calculated map_swapcount. That lets us further remove it from page_trans_huge_map_swapcount() and page_trans_huge_mapcount().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
I don't know if this is a helpful patch to add at this point or whether it will get in the way of David's much more extensive work.

 include/linux/mm.h   | 10 +++-------
 include/linux/swap.h |  6 +++---
 mm/huge_memory.c     | 32 +++++++++++---------------------
 mm/khugepaged.c      |  2 +-
 mm/memory.c          |  2 +-
 mm/swapfile.c        | 33 +++++++++++----------------------
 6 files changed, 30 insertions(+), 55 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..286eb4155c80 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -840,19 +840,15 @@ static inline int page_mapcount(struct page *page)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int total_mapcount(struct page *page);
-int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
+int page_trans_huge_mapcount(struct page *page);
 #else
 static inline int total_mapcount(struct page *page)
 {
 	return page_mapcount(page);
 }
-static inline int page_trans_huge_mapcount(struct page *page,
-					   int *total_mapcount)
+static inline int page_trans_huge_mapcount(struct page *page)
 {
-	int mapcount = page_mapcount(page);
-	if (total_mapcount)
-		*total_mapcount = mapcount;
-	return mapcount;
+	return page_mapcount(page);
 }
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1ea44b31f19..1d38d9475c4d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -514,7 +514,7 @@ extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
-extern bool reuse_swap_page(struct page *, int *);
+extern bool reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
@@ -680,8 +680,8 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page, total_map_swapcount) \
-	(page_trans_huge_mapcount(page, total_map_swapcount) == 1)
+#define reuse_swap_page(page) \
+	(page_trans_huge_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..6ed86a8f6a5b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,7 +1322,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	 * We can only reuse the page if nobody else maps the huge page or it's
 	 * part.
 	 */
-	if (reuse_swap_page(page, NULL)) {
+	if (reuse_swap_page(page)) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -2542,38 +2542,28 @@ int total_mapcount(struct page *page)
  * need full accuracy to avoid breaking page pinning, because
  * page_trans_huge_mapcount() is slower than page_mapcount().
  */
-int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
+int page_trans_huge_mapcount(struct page *page)
 {
-	int i, ret, _total_mapcount, mapcount;
+	int i, ret;
 
 	/* hugetlbfs shouldn't call it */
 	VM_BUG_ON_PAGE(PageHuge(page), page);
 
-	if (likely(!PageTransCompound(page))) {
-		mapcount = atomic_read(&page->_mapcount) + 1;
-		if (total_mapcount)
-			*total_mapcount = mapcount;
-		return mapcount;
-	}
+	if (likely(!PageTransCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
 
 	page = compound_head(page);
 
-	_total_mapcount = ret = 0;
+	ret = 0;
 	for (i = 0; i < thp_nr_pages(page); i++) {
-		mapcount = atomic_read(&page[i]._mapcount) + 1;
+		int mapcount = atomic_read(&page[i]._mapcount) + 1;
 		ret = max(ret, mapcount);
-		_total_mapcount += mapcount;
 	}
-	if (PageDoubleMap(page)) {
+
+	if (PageDoubleMap(page))
 		ret -= 1;
-		_total_mapcount -= thp_nr_pages(page);
-	}
-	mapcount = compound_mapcount(page);
-	ret += mapcount;
-	_total_mapcount += mapcount;
-	if (total_mapcount)
-		*total_mapcount = _total_mapcount;
-	return ret;
+
+	return ret + compound_mapcount(page);
 }
 
 /* Racy check whether the huge page can be split */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e99101162f1a..11794bdf513a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -681,7 +681,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		if (!pte_write(pteval) && PageSwapCache(page) &&
-		    !reuse_swap_page(page, NULL)) {
+		    !reuse_swap_page(page)) {
 			/*
 			 * Page is in the swap cache and cannot be re-used.
 			 * It cannot be collapsed into a THP.
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..dd85fd07cb24 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3626,7 +3626,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
+	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e59e08ef46e1..bc0810c3b2a5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1601,31 +1601,30 @@ static bool page_swapped(struct page *page)
 	return false;
 }
 
-static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
+static int page_trans_huge_map_swapcount(struct page *page,
 					 int *total_swapcount)
 {
-	int i, map_swapcount, _total_mapcount, _total_swapcount;
+	int i, map_swapcount, _total_swapcount;
 	unsigned long offset = 0;
 	struct swap_info_struct *si;
 	struct swap_cluster_info *ci = NULL;
 	unsigned char *map = NULL;
-	int mapcount, swapcount = 0;
+	int swapcount = 0;
 
 	/* hugetlbfs shouldn't call it */
 	VM_BUG_ON_PAGE(PageHuge(page), page);
 
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) {
-		mapcount = page_trans_huge_mapcount(page, total_mapcount);
 		if (PageSwapCache(page))
 			swapcount = page_swapcount(page);
 		if (total_swapcount)
 			*total_swapcount = swapcount;
-		return mapcount + swapcount;
+		return swapcount + page_trans_huge_mapcount(page);
 	}
 
 	page = compound_head(page);
 
-	_total_mapcount = _total_swapcount = map_swapcount = 0;
+	_total_swapcount = map_swapcount = 0;
 	if (PageSwapCache(page)) {
 		swp_entry_t entry;
 
@@ -1639,8 +1638,7 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
 	if (map)
 		ci = lock_cluster(si, offset);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		mapcount = atomic_read(&page[i]._mapcount) + 1;
-		_total_mapcount += mapcount;
+		int mapcount = atomic_read(&page[i]._mapcount) + 1;
 		if (map) {
 			swapcount = swap_count(map[offset + i]);
 			_total_swapcount += swapcount;
@@ -1648,15 +1646,9 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
 		map_swapcount = max(map_swapcount, mapcount + swapcount);
 	}
 	unlock_cluster(ci);
-	if (PageDoubleMap(page)) {
+	if (PageDoubleMap(page))
 		map_swapcount -= 1;
-		_total_mapcount -= HPAGE_PMD_NR;
-	}
-	mapcount = compound_mapcount(page);
-	map_swapcount += mapcount;
-	_total_mapcount += mapcount;
-	if (total_mapcount)
-		*total_mapcount = _total_mapcount;
+	map_swapcount += compound_mapcount(page);
 	if (total_swapcount)
 		*total_swapcount = _total_swapcount;
 
@@ -1673,17 +1665,14 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
  * reuse_swap_page() returns false, but it may be always overwritten
  * (see the other implementation for CONFIG_SWAP=n).
  */
-bool reuse_swap_page(struct page *page, int *total_map_swapcount)
+bool reuse_swap_page(struct page *page)
 {
-	int count, total_mapcount, total_swapcount;
+	int count, total_swapcount;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return false;
-	count = page_trans_huge_map_swapcount(page, &total_mapcount,
-					      &total_swapcount);
-	if (total_map_swapcount)
-		*total_map_swapcount = total_mapcount + total_swapcount;
+	count = page_trans_huge_map_swapcount(page, &total_swapcount);
 	if (count == 1 && PageSwapCache(page) &&
 	    (likely(!PageTransCompound(page)) ||
 	     /* The remaining swap count will be freed soon */
On Mon, Dec 20, 2021 at 06:37:30PM +0000, Matthew Wilcox wrote:
> +++ b/mm/memory.c
> @@ -3626,7 +3626,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
>  	pte = mk_pte(page, vma->vm_page_prot);
> -	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;

[...]

> @@ -1673,17 +1665,14 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
>   * reuse_swap_page() returns false, but it may be always overwritten
>   * (see the other implementation for CONFIG_SWAP=n).
>   */
> -bool reuse_swap_page(struct page *page, int *total_map_swapcount)
> +bool reuse_swap_page(struct page *page)
>  {
> -	int count, total_mapcount, total_swapcount;
> +	int count, total_swapcount;
> 
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	if (unlikely(PageKsm(page)))
>  		return false;
> -	count = page_trans_huge_map_swapcount(page, &total_mapcount,
> -					      &total_swapcount);
> -	if (total_map_swapcount)
> -		*total_map_swapcount = total_mapcount + total_swapcount;
> +	count = page_trans_huge_map_swapcount(page, &total_swapcount);
>  	if (count == 1 && PageSwapCache(page) &&
>  	    (likely(!PageTransCompound(page)) ||
>  	     /* The remaining swap count will be freed soon */

It makes me wonder if reuse_swap_page() can also be based on refcount instead of mapcount?
On Mon, Dec 20, 2021 at 10:37 AM Matthew Wilcox <willy@infradead.org> wrote: > > This might be a step in the right direction? > > Subject: [PATCH] mm: reuse_swap_page() no longer needs to return map_swapcount Well, that patch seems to be a no-op removal of dead code, so absolutely yes. That said, I think it would be good to split it up. I looked at that patch and went "is that really a no-op" to the point of recreating it. I think it would be good to make it multiple patches that are each individually trivial. IOW, start with (1) remove second argument to reuse_swap_page() that is always NULL, without making any other changes (2) that now made 'total_mapcount' unused in reuse_swap_page(), remove it as an argument from page_trans_huge_map_swapcount() (3) that now made 'total_mapcount' unused in page_trans_huge_mapcount(), remove it as an argument there too. because as it stands, that patch of yours looks like it is changing a lot of things, and I think it would be clearer to remove one thing at a time as it becomes obviously not used. Hmm? Linus
On Mon, Dec 20, 2021 at 10:53 AM Matthew Wilcox <willy@infradead.org> wrote: > > It makes me wonder if reuse_swap_page() can also be based on refcount > instead of mapcount? I suspect it doesn't even need refcount. For regular pages, after we've copied the page, all we do right now is

	if (page_copied)
		free_swap_cache(old_page);

which is basically just an optimistic trylock_page() followed by try_to_free_swap(). And that then pretty much simply checks "are there any swap users left" and deletes it from the swap cache if not. The "free_swap_cache()" thing is actually just an optimization to avoid having memory pressure do it later. So it doesn't have to be exact. In fact, I thought that swap is so unusual that it's not even needed at all, but I was wrong. See how this was re-introduced in commit f4c4a3f48480 ("mm: free idle swap cache page after COW") because yes, some loads still have swap space allocated. In theory, it would probably be a good idea at COW time to see if the page ref is 2, and if it's a swap cache page, and try to do that swap cache removal even earlier, so that the page actually gets re-used (instead of copied and then the swap entry removed). But swap is such a non-issue these days that I doubt it matters, and it's probably better to keep the swap handling in the unusual path. So mapcount and refcount aren't what matters for the swap cache. The swap count obviously *does* matter - because it means that some mapping has a reference to this swap entry (not as a page, but as an actual swap pointer). But the mapcount is irrelevant - any users that have the swap page actually mapped don't actually need it to be a swapcache page. Even the refcount doesn't really matter, afaik. The only "refcount" we care about is that swapcount - that's what actually reference counts the swap cases. try_to_free_swap() does check for one particular kind of reference: it does a check for PageWriteback(). We don't want to remove the thing from the swap cache if it's under active IO. (This codepath does need the page lock, though, thus all those "trylock_page()" things). Linus
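[Editorial note: the checks Linus enumerates live in try_to_free_swap() in mm/swapfile.c; a condensed sketch of the v5.16-era code, with a hibernation corner case elided:]

	int try_to_free_swap(struct page *page)
	{
		VM_BUG_ON_PAGE(!PageLocked(page), page);

		if (!PageSwapCache(page))
			return 0;
		if (PageWriteback(page))	/* under active swap IO */
			return 0;
		if (page_swapped(page))		/* swap entries still referenced */
			return 0;

		delete_from_swap_cache(page);
		SetPageDirty(page);
		return 1;
	}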
On Mon, Dec 20, 2021 at 11:15:14AM -0800, Linus Torvalds wrote: > Well, that patch seems to be a no-op removal of dead code, so absolutely yes. > > That said, I think it would be good to split it up. I looked at that > patch and went "is that really a no-op" to the point of recreating it. > > I think it would be good to make it multiple patches that are each > individually trivial. IOW, start with > > (1) remove second argument to reuse_swap_page() that is always NULL, > without making any other changes > > (2) that now made 'total_mapcount' unused in reuse_swap_page(), > remove it as an argument from page_trans_huge_map_swapcount() > > (3) that now made 'total_mapcount' unused in > page_trans_huge_mapcount(), remove it as an argument there too. Hah, that was actually how I did it originally (without actually committing at each step, and with a few "Oh, hang on, now we can avoid calculating this too" stops and restarts along the way), but I thought it all hung together logically as a single change. It's hard to see things from the other person's perspective at times.
On Mon, Dec 20, 2021 at 1:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > Hah, that was actually how I did it originally (without actually > committing at each step, and with a few "Oh, hang on, now we can avoid > calculating this too" stops and restarts along the way), but I thought > it all hung together logically as a single change. It's hard to see > things from the other person's perspective at times. In just about any other area, I wouldn't mind one bigger patch that just removes code that isn't used. But when it's in the vm code, and it's pretty grotty, I do prefer seeing three patches that individually are much easier to see that "yeah, this doesn't actually change anything at all". The combined patch may be exactly the same thing, it's just much harder to see that "oh, now it's not used any more". That was perhaps especially true since a number of the changes also ended up doing statement simplification when the old layout made no sense any more with part of the results not used. So your 3-patch series was much easier to look at and go "Yeah, I believe each of these patches is a no-op". So ACK on all those patches. Linus
On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote: > handler (COW or unshare). Outside of COW/Unshare code, the bit can only > be set if page_count() == 1 and we sync against fork(). (and that's the > problem for gup-fast-only that I'm investigating right now, because it > would then always have to fall back to the slow variant if the bit isn't > already set) I'm having a hard time imagining how gup_fast can maintain any sort of bit - it lacks all forms of locks so how can we do an atomic test and set between two pieces of data? I think the point of Linus's plan really is that the new bit is derived from page_count, we get to set the new bit when we observe page_count==1 in various situations and we clear the new bit whenever we write protect with the intent to copy. GUP doesn't interact with this bit. A writable page would still be the second way you can say "you clearly have full ownership of the page", so GUP just checks writability and bumps the refcount. The challenge of GUP remains to sanely sequence it with things that are doing WP. The elevated GUP refcount prevents the page from getting the bit set again, regardless of what happens to it. Then on the WP sides.. Obviously we clear the bit when applying a WP for copy. So all the bad GUP cases are halted now, as with a cleared bit and a != 1 refcount COW must happen. Then, it is also the case that most often a read-only page will have this bit cleared, UFFD WP being the exception. UFFD WP works fine as it will have the bit set in the cases we care about and COW will not happen. If the bit is not set then everything works as is today and you get extra COWs. We still have to fix the things that are missing the extra COWs to check the page ref. It seems this new bit is acting as a 'COW disable', so, the accuracy of COW vs GUP & speculative pagerefs now relies on setting the bit as aggressively as possible when it is safe and cheap to do so. If I got it right this is why it is not just mapcount reduced to 1 bit. It is quite different, even though "this VM owns the page outright" sure sounds like "mapcount == 1".. It seems like an interesting direction - the security properties seem good as we only have to take care of sites applying WP to decide what to do with the extra bit, and all places that set the bit to 1 do so after testing refcount under various locks preventing PTE WP. That just leaves the THP splitting.. I suppose we get the PTL, then compute the current value of the new bit based on refcount and diffuse it to all tail pages, then update the PMD and release the PTL. Safe against concurrent WP - don't need DoubleMap horrors because it isn't a counter. > Not set in stone, just an idea I'm playing with right now ... and I > have to triple-check if > * page is PTE mapped in the page table I'm walking > * page_count() == 1 > Really means that "this is the only reference.". I do strongly believe > so .. :) AFAIK the only places that can break this are places putting struct page memory into special PTEs. Which is horrific and is just bugs, but I think I've seen it from time to time :( ZONE_DEVICE is also messed up, of course, but that is just more reasons ZONE_DEVICE refcounting needs fixing and you should ignore it. Jason
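[Editorial note: a hypothetical sketch of the splitting step Jason describes; the helper name and the PageAnonExclusive() accessors are made up, not existing kernel API:]

	/*
	 * Under the PTL: decide exclusivity once for the head page and
	 * copy the result to every tail page. Because this is a flag and
	 * not a counter, no DoubleMap-style accounting is needed.
	 */
	static void thp_split_distribute_exclusive(struct page *head)
	{
		bool exclusive = PageAnonExclusive(head);
		int i;

		for (i = 1; i < HPAGE_PMD_NR; i++) {
			if (exclusive)
				SetPageAnonExclusive(head + i);
			else
				ClearPageAnonExclusive(head + i);
		}
	}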
On Mon, Dec 20, 2021 at 09:03:12PM -0400, Jason Gunthorpe wrote: > That just leaves the THP splitting.. I suppose we get the PTL, then > compute the current value of the new bit based on refcount and diffuse > it to all tail pages, then update the PMD and release the PTL. Safe > against concurrent WP - don't need DoubleMap horrors because it isn't > a counter. One of the things I've been trying to figure out is how we do can_split_huge_page(). Maybe an rmap walk to figure out how many refcounts we would subtract if we did unmap it from everywhere it's currently mapped? (just to be clear, we call unmap_page() as the next thing, so I don't mind warming up the rbtree cachelines if it's mapped anywhere)
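[Editorial note: for comparison, the existing check avoids an rmap walk by comparing the total mapcount against the refcount; condensed from can_split_huge_page() in mm/huge_memory.c (v5.16-era, check the source for details):]

	/* Racy check whether the huge page can be split */
	bool can_split_huge_page(struct page *page, int *pextra_pins)
	{
		int extra_pins;

		/* additional pins from the swap (or page) cache */
		if (PageAnon(page))
			extra_pins = PageSwapCache(page) ? thp_nr_pages(page) : 0;
		else
			extra_pins = thp_nr_pages(page);
		if (pextra_pins)
			*pextra_pins = extra_pins;
		return total_mapcount(page) == page_count(page) - extra_pins - 1;
	}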
On 21.12.21 02:03, Jason Gunthorpe wrote: > On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote: > >> handler (COW or unshare). Outside of COW/Unshare code, the bit can only >> be set if page_count() == 1 and we sync against fork(). (and that's the >> problem for gup-fast-only that I'm investigating right now, because it >> would then always have to fall back to the slow variant if the bit isn't >> already set) > [in the meantime I figured out which pageflag we can reuse for anon pages, which is at least one step in the right direction] > I'm having a hard time imagining how gup_fast can maintain any sort of > bit - it lacks all forms of locks so how can we do an atomic test and > set between two pieces of data? And exactly that is to be figured out. Note that I am trying to make any kind of R/O pin on an anonymous page work as expected as well, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!). In the pagefault handler we can then always reuse a PageAnonExclusive() page, because we know it's exclusive and it will stay exclusive because concurrent fork() isn't possible. > > I think the point of Linus's plan really is that the new bit is > derived from page_count, we get to set the new bit when we observe > page_count==1 in various situations and we clear the new bit whenever > we write protect with the intent to copy. Here is one problem I'm fighting with: Assume we set the bit whenever we create a new anon page (either due to COW, ordinary fault, unsharing request, ..., even if it's mapped R/O first). We know the page is exclusive at that point because we created it and fork() could not happen yet. fork() is the only code that can share the page between processes and turn it non-exclusive. We can only clear the bit during fork() -- to turn the exclusive page shared and map it R/O into both processes -- when we are sure that *nobody* concurrently takes a reference on the page that would be problematic (-> GUP). So to clear the bit during fork, we have to (1) Check against page_count == 1 (2) Synchronize against GUP (2) is easy using the mmap_lock and the mm->write_protect_seq BUT, it would mean that whenever we fork() and there is one additional reference on a page (even if it's from the swapcache), we would slow down fork() even if there was never any GUP. This would apply to any process out there that does a fork() ... So the idea is to mark a page exclusive only once someone needs the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN or selected FOLL_GET like O_DIRECT). This can happen in my current approach in two ways: (1) Set the bit when we know we are the only users We can set PageAnonExclusive() if *we sync against fork* and the page cannot get unmapped (pt lock) when: * The page is mapped writable * The page is mapped readable and page_count == 1 This should work during ordinary GUP in many cases. If we cannot set the page exclusive, we have to trigger a page fault. (2) During pagefaults when FOLL_FAULT_UNSHARE is set. GUP will set FOLL_FAULT_UNSHARE for a pagefault when required (again e.g., FOLL_PIN or selected FOLL_GET users) and manual setting of the bit has failed. The page fault will then try once again to set the bit if there is a page mapped, and if that fails, do the COW/unshare and set the bit. The above should work fairly reliably with GUP. 
But indeed, gup-fast-only is the problem. I'm still investigating what kind of lightweight synchronization we could do against fork() such that we wouldn't try setting a page PageAnonExclusive() while fork() concurrently shares the page. We could eventually use the page lock and do a try_lock(), both in fork() and in gup-fast-only. fork() would only clear the bit if the try_lock() succeeded. gup-fast-only would only be able to set the bit and not fall back to the slow path if try_lock() succeeded. But I'm still investigating if there are better alternatives ... > > GUP doesn't interact with this bit. A writable page would still be the > second way you can say "you clearly have full ownership of the page", > so GUP just checks writability and bumps the refcount. The challenge > of GUP remains to sanely sequence it with things that are doing WP. > > The elevated GUP refcount prevents the page from getting the bit set > again, regardless of what happens to it. > > Then on the WP sides.. Obviously we clear the bit when applying a WP > for copy. So all the bad GUP cases are halted now, as with a cleared > bit and a != 1 refcount COW must happen. > > Then, it is also the case that most often a read-only page will have > this bit cleared, UFFD WP being the exception. > > UFFD WP works fine as it will have the bit set in the cases we care > about and COW will not happen. > > If the bit is not set then everything works as is today and you get > extra COWs. We still have to fix the things that are missing the extra > COWs to check the page ref. > > It seems this new bit is acting as a 'COW disable', so, the accuracy > of COW vs GUP & speculative pagerefs now relies on setting the bit as > aggressively as possible when it is safe and cheap to do so. But we really want to avoid degrading fork() for everybody that doesn't do heavy GUP ... > > If I got it right this is why it is not just mapcount reduced to 1 > bit. It is quite different, even though "this VM owns the page > outright" sure sounds like "mapcount == 1".. > > It seems like an interesting direction - the security properties seem > good as we only have to take care of sites applying WP to decide what > to do with the extra bit, and all places that set the bit to 1 do so > after testing refcount under various locks preventing PTE WP. > > That just leaves the THP splitting.. I suppose we get the PTL, then > compute the current value of the new bit based on refcount and diffuse > it to all tail pages, then update the PMD and release the PTL. Safe > against concurrent WP - don't need DoubleMap horrors because it isn't > a counter. > >> Not set in stone, just an idea I'm playing with right now ... and I >> have to triple-check if >> * page is PTE mapped in the page table I'm walking >> * page_count() == 1 >> Really means that "this is the only reference.". I do strongly believe >> so .. :) > > AFAIK the only places that can break this are places putting struct > page memory into special PTEs. Which is horrific and is just bugs, but > I think I've seen it from time to time :( As we only care about anon pages, I think that doesn't apply. At least that's what I hope.
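[Editorial note: a hypothetical sketch of the try_lock() scheme David floats here; PageAnonExclusive() and friends are made-up names, and this is not existing kernel code:]

	/* fork() side, PTL held: only share if we can prove it is safe */
	if (page_count(page) == 1 && trylock_page(page)) {
		ClearPageAnonExclusive(page);	/* share the page COW-style */
		unlock_page(page);
	} else {
		/* cannot rule out a concurrent GUP: copy immediately */
	}

	/* gup-fast-only side: only set the bit if we win the same lock;
	 * otherwise fall back to the slow path. */
	if (!PageAnonExclusive(page)) {
		if (page_count(page) != 1 || !trylock_page(page))
			return 0;		/* fall back to slow GUP */
		SetPageAnonExclusive(page);
		unlock_page(page);
	}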
On Tue, Dec 21, 2021 at 09:58:32AM +0100, David Hildenbrand wrote: > > I'm having a hard time imagining how gup_fast can maintain any sort of > > bit - it lacks all forms of locks so how can we do an atomic test and > > set between two pieces of data? > > And exactly that is to be figured out. > > Note that I am trying to make any kind of R/O pin on an anonymous > page work as expected as well, to fix any kind of GUP after fork() and > GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page > similarly has to make sure that the page is exclusive -- even if it's > mapped R/O (!). Why? AFAIK we don't have bugs here. If the page is RO and has an elevated refcount it cannot be 'PageAnonExclusive' and so any place that wants to drop the WP just cannot. What is the issue? > BUT, it would mean that whenever we fork() and there is one additional > reference on a page (even if it's from the swapcache), we would slow > down fork() even if there was never any GUP. This would apply to any > process out there that does a fork() ... You mean because we'd copy? Is this common? Linus' prior email was talking as though swap is so rare we shouldn't optimize for it? > So the idea is to mark a page exclusive only once someone needs > the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN > or selected FOLL_GET like O_DIRECT). This can happen in my current > approach in two ways: > > (1) Set the bit when we know we are the only users > > We can set PageAnonExclusive() if *we sync against fork* and the > page cannot get unmapped (pt lock) when: > * The page is mapped writable > * The page is mapped readable and page_count == 1 I'm still not sure I see that all this complexity is netting a gain? > If we cannot set the page exclusive, we have to trigger a page fault. > > (2) During pagefaults when FOLL_FAULT_UNSHARE is set. Why do we need FOLL_FAULT_UNSHARE? AFAICT that was part of this series because of mapcount; once the hugetlb COW is fixed to use refcount properly, as Linus showed, the bugs this was trying to fix go away. And as discussed before it is OK if READ gup becomes incoherent, that is its defined semantic. > The above should work fairly reliably with GUP. But indeed, > gup-fast-only is the problem. I'm still investigating what kind of > lightweight synchronization we could do against fork() such that we > wouldn't try setting a page PageAnonExclusive() while fork() > concurrently shares the page. > > We could eventually use the page lock and do a try_lock(), both in > fork() and in gup-fast-only. fork() would only clear the bit if the > try_lock() succeeded. gup-fast-only would only be able to set the bit > and not fall back to the slow path if try_lock() succeeded. I suspect that is worse than just having fork clear the bit and leave GUP as-is. A trylock is an atomic; clearing PageAnonExclusive does not need to be atomic, it is protected by the PTL. > > Then on the WP sides.. Obviously we clear the bit when applying a WP > > for copy. So all the bad GUP cases are halted now, as with a cleared > > bit and a != 1 refcount COW must happen. > But we really want to avoid degrading fork() for everybody that doesn't > do heavy GUP ... fork() already has to dirty the struct page cache line for refcount, setting a flag seems minor at that point? At least we shouldn't discard this nice understandable approach without a measurement.... 
Remember fork is already incrementing mapcount so if we kill mapcount it is a win for fork to replace the mapcount atomic with a non-atomic flag. > > AFAIK the only places that can break this are places putting struct > page memory into special PTEs. Which is horrific and is just bugs, but > I think I've seen it from time to time :( > > As we only care about anon pages, I think that doesn't apply. At least > that's what I hope. You are optimistic :) Jason
On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand <david@redhat.com> wrote: > > On 21.12.21 02:03, Jason Gunthorpe wrote: > > > I'm having a hard time imagining how gup_fast can maintain any sort of > > bit - it lacks all forms of locks so how can we do an atomic test and > > set between two pieces of data? > > And exactly that is to be figured out. So my preference would be to just always maintain the "exclusive to this VM" bit in the 'struct page', because that makes things easier to think about. [ Of course - the bit could be reversed, and be a 'not exclusive to this VM' bit, semantically the set-or-cleared issue doesn't matter. Also, when I talk about some "exclusive to this VM" bit, I'm purely talking about pages that are marked PageAnon(), so the bit may or may not even exist for other page types ] And then all GUP-fast would need to do is to refuse to look up a page that isn't exclusive to that VM. We already have the situation that GUP-fast can fail for non-writable pages etc, so it's just another test. > Note that I am trying to make any kind of R/O pin on an anonymous > page work as expected as well, to fix any kind of GUP after fork() and > GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page > similarly has to make sure that the page is exclusive -- even if it's > mapped R/O (!). I do think the existing "maybe_pinned()" logic is fine for that. The "exclusive to this VM" bit can be used to *help* that decision - because only an exclusive page can be pinned - but I don't think it should _replace_ that logic. There's a quite fundamental difference between (a) COW and GUP: these two operations _have_ to know that they get an exclusive page in order to re-use or look up the page respectively (b) the pre-cow logic in fork() or the "add this to the swap cache" logic in vmscan that decides whether a page can be turned into a COW page by adding a reference count to it (whether due to fork or swap cache doesn't matter - the end result is the same). The difference is that in (a) the thing we *have* to get right is whether a page is exclusively owned by that VM or not. We can COW too much, but we can never share a page unless it's exclusive. That's true whether it's pinned or not. In (b), the "have to get right" is different. In (b), it's perfectly ok to COW an exclusive page and turn it non-exclusive. But we must never COW a pinned page. So (a) and (b) are very different situations, and have different logic. If we always maintain an exclusive bit for AnonPage pages, then both (a) and (b) can use that bit, but they'll use it very differently. In (a) we'll refuse to look it up and will force a 'handle_mm_fault()' to get an exclusive copy. And in (b), we just use it as a "we know only exclusive pages can be pinned", so it's just another check for page_needs_cow_for_dma(), the same way we currently check "MMF_HAS_PINNED" to narrow down the whole "page count indicates this may be a pinned page" question. And the "page is exclusive" would actually be the *common* case for almost all pages. Any time you've written to a page and you haven't forked after the write (and it hasn't been turned into a swap page), that page would be exclusive to that VM. Doesn't this seem like really straightforward semantics to maintain (and think about)? I'd like the exclusive page bit to *not* be directly about "has this page been pinned" exactly because we already have too many special cases for GUP. 
It would be nicer to have a page bit that has very clear semantics even in the absence of GUP. Linus
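[Editorial note: for context, the fork()-side heuristic Linus refers to looks roughly like this in the v5.16-era tree; condensed from include/linux/mm.h:]

	/* must the child get its own copy at fork() time instead of
	 * sharing the page read-only? */
	static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
						  struct page *page)
	{
		if (!is_cow_mapping(vma->vm_flags))
			return false;
		if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
			return false;
		return page_maybe_dma_pinned(page);	/* refcount heuristic */
	}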
On 21.12.21 18:05, Linus Torvalds wrote: > On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 21.12.21 02:03, Jason Gunthorpe wrote: >> >>> I'm having a hard time imagining how gup_fast can maintain any sort of >>> bit - it lacks all forms of locks so how can we do an atomic test and >>> set between two pieces of data? >> >> And exactly that is to be figured out. > > So my preference would be to just always maintain the "exclusive to > this VM" bit in the 'struct page', because that makes things easier to > think about. > > [ Of course - the bit could be reversed, and be a 'not exclusive to > this VM' bit, semantically the set-or-cleared issue doesn't matter. > Also, when I talk about some "exclusive to this VM" bit, I'm purely > talking about pages that are marked PageAnon(), so the bit may or may > not even exist for other page types ] Yes, whatever I say applies to PageAnon() only -- including the (overloaded) bit, called PG_anon_exclusive now. > > And then all GUP-fast would need to do is to refuse to look up a page > that isn't exclusive to that VM. We already have the situation that > GUP-fast can fail for non-writable pages etc, so it's just another > test. Right, the simplest way is simply failing GUP fast if the bit isn't set, forcing it into the slow path. If that primarily happens for R/O pins after fork(), fine with me. > >> Note that I am trying to make any kind of R/O pin on an anonymous >> page work as expected as well, to fix any kind of GUP after fork() and >> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page >> similarly has to make sure that the page is exclusive -- even if it's >> mapped R/O (!). > > I do think the existing "maybe_pinned()" logic is fine for that. The > "exclusive to this VM" bit can be used to *help* that decision - > because only an exclusive page can be pinned - but I don't think it > should _replace_ that logic. The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to FOLL_PIN unfortunately. So I'm *trying* to make it more generic such that such corner cases can be handled correctly as well. But yeah, I'll see where this goes ... O_DIRECT has to be fixed one way or the other. John H. mentioned that he wants to look into converting that to FOLL_PIN. So maybe that will work eventually. > > There's a quite fundamental difference between > > (a) COW and GUP: these two operations _have_ to know that they get an > exclusive page in order to re-use or look up the page respectively > > (b) the pre-cow logic in fork() or the "add this to the swap cache" > logic in vmscan that decides whether a page can be turned into a COW > page by adding a reference count to it (whether due to fork or swap > cache doesn't matter - the end result is the same). > > The difference is that in (a) the thing we *have* to get right is > whether a page is exclusively owned by that VM or not. We can COW too > much, but we can never share a page unless it's exclusive. That's true > whether it's pinned or not. Exactly. Once a page is "exclusive" it must not get shared *unless* we can turn it into a "shared" page during fork(). There are some ugly corner cases that will require some thought. > > In (b), the "have to get right" is different. In (b), it's perfectly > ok to COW an exclusive page and turn it non-exclusive. But we must > never COW a pinned page. > > So (a) and (b) are very different situations, and have different logic. 
> > If we always maintain an exclusive bit for AnonPage pages, then both > (a) and (b) can use that bit, but they'll use it very differently. In > (a) we'll refuse to look it up and will force a 'handle_mm_fault()' to > get an exclusive copy. And in (b), we just use it as a "we know only > exclusive pages can be pinned", so it's just another check for > page_needs_cow_for_dma(), the same way we currently check > "MMF_HAS_PINNED" to narrow down the whole "page count indicates this > may be a pinned page" question. If we use page_needs_cow_for_dma() for that purpose, we can still have other references from our process referencing the page, including right now O_DIRECT ones. So the safest thing to do would be relying on the same logic as we do in the COW path regarding the page count ... but that might result in unnecessary copies, as I mentioned. It would be perfect if anything that modifies page content used FOLL_PIN; unfortunately, that's not reality ... > > And the "page is exclusive" would actually be the *common* case for > almost all pages. Any time you've written to a page and you haven't > forked after the write (and it hasn't been turned into a swap page), > that page would be exclusive to that VM. Yes. Essentially every time we create a new anonymous page it would end up as exclusive. Or if we're in a fault and can convert the "exclusive" page into a "shared" page (essentially the COW reuse logic). > > Doesn't this seem like really straightforward semantics to maintain > (and think about)? > > I'd like the exclusive page bit to *not* be directly about "has this > page been pinned" exactly because we already have too many special > cases for GUP. It would be nicer to have a page bit that has very > clear semantics even in the absence of GUP. What adds complexity to correctly maintaining the "exclusive" state are at least:

* KSM (might be harder, have to think about it)
* migration (might be easy to just copy the bit)
* fork() with migration/swap entries that reference a page that is "exclusive". I'll have to think about that more.

So I have plenty of stuff to look into. Just so we're on the same page, here is what I'd like to achieve with anonymous pages:

1) If we take a R/W pin on an anonymous page, we will always pin an "exclusive page".
2) If we take a R/O pin on an anonymous page, we will always pin an "exclusive page", even if the page is mapped R/O.
3) "Exclusive" pages cannot be turned "shared" during fork (and ksm? :/ ) if pinned.
4) "Exclusive" pages can be turned "shared" during fork if not pinned.
5) "Exclusive" pages will never be COWed but remain there for all eternity, until unmapped ... well, or until converted into "shared" again if possible.

Ideally we'd handle O_DIRECT ... :(

2) is certainly the cherry on top. But it just means that R/O pins don't have to be the weird kid. And yes, achieving 2) would require FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do what existing COW logic does, just bypass the "map writable" and "trigger write fault" semantics. I hope we agree that R/O pins don't have to be the weird kid if we can "get it right" with the same approach.
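To make the GUP-fast side of this concrete, here is a minimal sketch of the refusal test being discussed. It is purely illustrative: PageAnonExclusive() is the bit named in this thread, and the helper name and call site are made up.

/*
 * Sketch: called from a gup_pte_range()-style walk, after the usual
 * pte_write()/FOLL_WRITE checks. Returning false sends the caller to
 * the GUP slow path, which can unshare under the proper locks.
 */
static bool gup_fast_can_pin(struct page *page, unsigned int flags)
{
	if (!(flags & FOLL_PIN))
		return true;
	/* Never pin an anonymous page that isn't exclusive to this MM. */
	if (PageAnon(page) && !PageAnonExclusive(page))
		return false;
	return true;
}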
On Tue, Dec 21, 2021 at 9:40 AM David Hildenbrand <david@redhat.com> wrote: > > > I do think the existing "maybe_pinned()" logic is fine for that. The > > "exclusive to this VM" bit can be used to *help* that decision - > > because only an exclusive page can be pinned - but I don't think it > > should _replace_ that logic. > > The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to > FOLL_PIN unfortunately. So I'm *trying* to make it more generic so > that such corner cases can be handled correctly as well. But yeah, I'll > see where this goes ... O_DIRECT has to be fixed one way or the other. > > John H. mentioned that he wants to look into converting that to > FOLL_PIN. So maybe that will work eventually. I'd really prefer that as the plan. What exactly is the issue with O_DIRECT? Is it purely that it uses "put_page()" instead of "unpin", or what? I really think that if people look up pages and expect those pages to stay coherent with the VM they looked them up for, they _have_ to actively tell the VM layer - which means using FOLL_PIN. Note that this is in absolutely no way a "new" issue. It has *always* been true. If some O_DIRECT path depends on pinning behavior, it has never worked correctly, and it is entirely on O_DIRECT, and not at all a VM issue. We've had people doing GUP games forever, and being burnt by those games not working reliably. GUP (before we even had the notion of pinning) would always just take a reference to the page, but it would not guarantee that that exact page then kept an association with the VM. Now, in *practice* this all works if:

(a) the GUP user had always written to the page since the fork (either explicitly, or with FOLL_WRITE obviously acting as such)
(b) the GUP user never forks afterwards until the IO is done
(c) the GUP user plays no other VM games on that address

and it's also very possible that it has worked by pure luck (ie we've had a lot of random code that actively mis-used things and it would work in practice just because COW would happen to cut the right direction etc). Is there some particular GUP user you happen to care about more than others? I think it's a valid option to try to fix things up one by one, even if you don't perhaps fix _all_ cases. Linus
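For illustration, conditions (a) and (b) correspond to a userspace pattern like this sketch (not from the thread; the file name is made up and O_DIRECT alignment constraints are glossed over):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/data", O_RDONLY | O_DIRECT);
	char *buf = aligned_alloc(4096, 4096);

	if (fd < 0 || !buf)
		return 1;
	memset(buf, 0, 4096);		/* (a) write to the page before the I/O */
	if (read(fd, buf, 4096) < 0)	/* GUP references buf's page here */
		return 1;
	if (fork() == 0)		/* (b) fork only after the I/O is done */
		_exit(0);
	return 0;
}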
On Tue 21-12-21 18:40:30, David Hildenbrand wrote: > On 21.12.21 18:05, Linus Torvalds wrote: > > On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand <david@redhat.com> wrote: > >> Note that I am trying to make also any kind of R/O pins on an anonymous > >> page work as expected as well, to fix any kind of GUP after fork() and > >> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page > >> similarly has to make sure that the page is exclusive -- even if it's > >> mapped R/O (!). > > > > I do think the existing "maybe_pinned()" logic is fine for that. The > > "exclusive to this VM" bit can be used to *help* that decision - > > because only an exclusive page can be pinned - but I don't think it > > should _replace_ that logic. > > The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to > FOLL_PIN unfortunately. So I'm *trying* to make it more generic so > that such corner cases can be handled correctly as well. But yeah, I'll > see where this goes ... O_DIRECT has to be fixed one way or the other. > > John H. mentioned that he wants to look into converting that to > FOLL_PIN. So maybe that will work eventually. For the record, we always intended (and still intend) to make O_DIRECT use FOLL_PIN. It is just tricky because some users mix pages pinned with GUP and pages acquired through get_page() in a single bio (such as the zero page), and thus it is non-trivial to do the right thing on IO completion (unpin or just put_page). Honza
On Tue, Dec 21, 2021 at 10:07 AM Jan Kara <jack@suse.cz> wrote: > > For record we always intended (and still intend) to make O_DIRECT use > FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP > and pages acquired through get_page() in a single bio (such as zero page) > and thus it is non-trivial to do the right thing on IO completion (unpin or > just put_page). Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but it might make it much easier to debug and catch. If we only set the exclusive VM bit on pages that get mapped into user space, and we guarantee that GUP only looks up such pages, then we can also add a debug test to the "unpin" case that the bit is still set. And that would catch anybody who ends up using other pages for unpin(), and you could have a WARN_ON() for it (obviously also trigger on the page count being too small to unpin). That way, at least from a kernel debugging and development standpoint it would make it easy to see "ok, this unpinning got a page that wasn't pinned", and it would help find these cases where some situation had used just a get_page() rather than a pin to get a page pointer. No? Linus
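A minimal sketch of such an unpin-time check (hypothetical helper; assumes the PG_anon_exclusive bit from this thread, and only handles the order-0 case where the pin is expressed via GUP_PIN_COUNTING_BIAS):

static void sanity_check_on_unpin(struct page *page)
{
	/* A page handed out by GUP should still carry the exclusive bit. */
	VM_WARN_ON_ONCE_PAGE(PageAnon(page) && !PageAnonExclusive(page), page);
	/* Catch unpinning a page that was never pinned to begin with. */
	VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < GUP_PIN_COUNTING_BIAS, page);
}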
On Tue, Dec 21, 2021 at 10:51 AM David Hildenbrand <david@redhat.com> wrote: > > For that purpose the pincount would already kind-of work. Not precise, > but at least something ("this page cannot possibly have been pinned"). That part actually exists already, ie put_page_refs() has this:

#ifdef CONFIG_DEBUG_VM
	if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
		return;
#endif

And yeah, it shouldn't have that '#ifdef CONFIG_DEBUG_VM' there, but I think it's because the non-CONFIG_DEBUG_VM #define for VM_WARN_ON_ONCE_PAGE() is broken, and doesn't return 0. Linus
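The fix implied here is to make the !CONFIG_DEBUG_VM stub still evaluate to 0 so it can be used inside an 'if' -- roughly this sketch, not the exact upstream macro:

#ifndef CONFIG_DEBUG_VM
/* Type-check 'cond' but always evaluate to 0 and warn about nothing. */
#define VM_WARN_ON_ONCE_PAGE(cond, page)  (BUILD_BUG_ON_INVALID(cond), 0)
#endif

With a stub like that, the '#ifdef CONFIG_DEBUG_VM' around the put_page_refs() check could simply be dropped.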
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote: > 2) is certainly the cherry on top. But it just means that R/O pins don't > have to be the weird kid. And yes, achieving 2) would require > FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do > what existing COW logic does, just bypass the "map writable" and > "trigger write fault" semantics. I still don't agree with this - when you come to patches can you have this work at the end and under a good cover letter? Maybe it will make more sense then. Thanks, Jason
On 12/21/21 10:28, David Hildenbrand wrote: ... > I'd appreciate if someone could work on the O_DIRECT FOLL_PIN conversion > while I struggle with PageAnonExclusive() and R/W pins :) > Yes, I'll sign up for that (unless someone else who is faster is already working on it). I've tried a couple times in the past, but without the proper level of determination to see it through. So this time will be different. :) > [noting that I'll not get too much done within the next 2 weeks] > Likewise. Starting in early January. thanks,
On 12/21/21 10:30, Linus Torvalds wrote: > On Tue, Dec 21, 2021 at 10:07 AM Jan Kara <jack@suse.cz> wrote: >> >> For record we always intended (and still intend) to make O_DIRECT use >> FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP >> and pages acquired through get_page() in a single bio (such as zero page) >> and thus it is non-trivial to do the right thing on IO completion (unpin or >> just put_page). > > Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but > it might make it much easier to debug and catch. > > If we only set the exclusive VM bit on pages that get mapped into user > space, and we guarantee that GUP only looks up such pages, then we can > also add a debug test to the "unpin" case that the bit is still set. > > And that would catch anybody who ends up using other pages for > unpin(), and you could have a WARN_ON() for it (obviously also trigger > on the page count being too small to unpin). > > That way, at least from a kernel debugging and development standpoint > it would make it easy to see "ok, this unpinning got a page that > wasn't pinned", and it would help find these cases where some > situation had used just a get_page() rather than a pin to get a page > pointer. > > No? > > Linus Yes, this is especially welcome, because it means that after enough time sitting in the -mm tree, we can reasonably expect to catch the most important cases, if any were missed. That makes it a whole other level of useful, as compared to local testing hacks. thanks,
On Tue, Dec 21, 2021 at 04:19:33PM +0100, David Hildenbrand wrote: > >> Note that I am trying to make also any kind of R/O pins on an anonymous > >> page work as expected as well, to fix any kind of GUP after fork() and > >> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page > >> similarly has to make sure that the page is exclusive -- even if it's > >> mapped R/O (!). > > > > Why? AFAIK we don't have bugs here. If the page is RO and has an > > elevated refcount it cannot be 'PageAnonExclusive' and so any place > > that wants to drop the WP just cannot. What is the issue? > But what I think you actually mean is if we want to get R/O pins > right. What I meant was a page that is GUP'd RO, is not PageAnonExclusive and has an elevated refcount. Those cannot be transformed to PageAnonExclusive, or re-used during COW, but also they don't have problems today. Such places are either like O_DIRECT read, and tolerant of a false COW, or they are broken like VFIO and should be using FOLL_FORCE|FOLL_WRITE, which turns them into a WRITE and then we know they get PageAnonExclusive. So, the swap issue is fixed directly with PageAnonExclusive and no change to READ GUP is required, at least in your #1 scenario, AFAICT.. > There are 2 models, leaving FOLL_FAULT_UNSHARE out of the picture for now: > > 1) Whenever mapping an anonymous page R/W (after COW, during ordinary > fault, on swapin), we mark the page exclusive. We must never lose the > PageAnonExclusive bit, not during migration, not during swapout. I prefer this one as well. It allows us to keep Linus's simple logic that refcount == 1 means always safe to re-use, no matter what. And refcount != 1 goes on to consider the additional bit to decide what to do. The simple bit really means 'we know this page has one PTE so ignore the refcount for COW reuse decisions'. > fork() will process the bit for each and every process, even if there > was no GUP, and will copy if there are additional references. Yes, just like it does today already for mapcount. > 2) Whenever GUP wants to pin/ref a page, we try marking it exclusive. We > can lose the PageAnonExclusive bit during migration and swapout, because > that can only happen when there are no additional references. I haven't thought of a way this is achievable. At least not without destroying GUP fast.. Idea #2 is really a "this page is GUP'd" flag with some sneaky logic to clear it. That comes along with all the races too, because as an idea it is fundamentally about GUP, which runs without locks. Jason
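Under model 1, the reuse decision Jason prefers would look roughly like this sketch (helper name made up; the real COW path has more cases to consider):

static bool can_reuse_anon_page(struct page *page)
{
	/* One PTE, no GUP, no speculative references: always reusable. */
	if (page_count(page) == 1)
		return true;
	/*
	 * Elevated refcount: only reuse if the page is still marked
	 * exclusive to this MM, i.e. the extra references (e.g. pins)
	 * belong to this MM rather than to sharers.
	 */
	return PageAnonExclusive(page);
}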
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote: > What adds complexity to correctly maintain the "exclusive" state are at > least: > * KSM (might be harder, have to think about it) I know little about it, but isn't KSM like fork where it is trying to WP pages with the intention of copying them? Shouldn't KSM completely reject WP'ing a page that is under any kind of writable GUP? Jason
On Wed 22-12-21 10:58:36, David Hildenbrand wrote: > On 22.12.21 09:51, David Hildenbrand wrote: > > On 21.12.21 20:07, Jason Gunthorpe wrote: > >> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote: > >> > >>> 2) is certainly the cherry on top. But it just means that R/O pins don't > >>> have to be the weird kid. And yes, achieving 2) would require > >>> FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do > >>> what existing COW logic does, just bypass the "map writable" and > >>> "trigger write fault" semantics. > >> > >> I still don't agree with this - when you come to patches can you have > >> this work at the end and under a good cover letter? Maybe it will make > >> more sense then. > > > > Yes. But really, I think it's the logical consequence of what Linus said > > [1]: > > > > "And then all GUP-fast would need to do is to refuse to look up a page > > that isn't exclusive to that VM. We already have the situation that > > GUP-fast can fail for non-writable pages etc, so it's just another > > test." > > > > We must not FOLL_PIN a page that is not exclusive (not only on gup-fast, > > but really, on any gup). If we special case R/O FOLL_PIN, we cannot > > enable the sanity check on unpin as suggested by Linus [2]: > > > > "If we only set the exclusive VM bit on pages that get mapped into > > user space, and we guarantee that GUP only looks up such pages, then > > we can also add a debug test to the "unpin" case that the bit is > > still set." > > > > There are really only two feasible options I see when we want to take a > > R/O FOLL_PIN on a !PageAnonExclusive() anon page > > > > (1) Fail the pinning completely. This implies that we'll have to fail > > O_DIRECT once converted to FOLL_PIN. > > (2) Request to mark the page PageAnonExclusive() via a > > FAULT_FLAG_UNSHARE and let it succeed. > > > > > > Anything else would require additional accounting that, as we already > > discussed in the past, is hard -- for example, to differentiate R/O from > > R/W pins requiring two pin counters. > > > > The only impact would be that FOLL_PIN after fork() has to go via a > > FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this > > is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would > > be nice to optimize this, to *not* do that, but again ... this would > > require even more counters I think, for example, to differentiate > > between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin". > > > > So unless we discover a way to do additional accounting for ordinary 4k > > pages, I think we really can only do (1) or (2) to make sure we never > > ever pin a !PageAnonExclusive() page. > > BTW, I just wondered if the optimization should actually be that R/O > short-term FOLL_PIN users use FOLL_GET instead. So > O_DIRECT with R/O would already be doing the right thing. > > And it somewhat aligns with what we found: only R/W short-term FOLL_GET > is problematic, where we can lose writes to the page from the device via > O_DIRECT. > > IIUC, our COW logic makes sure that a shared anonymous page that might > still be used by a R/O FOLL_GET cannot be modified, because any attempt > to modify it would result in a copy. Well, we defined FOLL_PIN to mean the intent that the caller wants to access not only page state (for which FOLL_GET is enough, and there are some users - mostly inside mm - who need this) but also page data. 
Eventually, we even wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP. For file pages we need this data vs no-data access distinction so that filesystems can detect when someone can be accessing page data although the page is unmapped. Practically, filesystems care most about when someone can be *modifying* page data (we need to make sure data is stable e.g. when writing back data to disk or doing data checksumming or other operations) so using FOLL_GET when wanting to only read page data should be OK for filesystems but honestly I would be reluctant to break the rule of "use FOLL_PIN when wanting to access page data" to keep things simple and reasonably easy to understand for parties such as filesystem developers or driver developers who all need to interact with pinned pages... Honza
On Tue 21-12-21 19:59:16, Jason Gunthorpe wrote: > On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote: > > > What adds complexity to correctly maintain the "exclusive" state are at > > least: > > * KSM (might be harder, have to think about it) > > I know little about it, but isn't KSM like fork where it is trying to > WP pages with the intention of copying them? Shouldn't KSM completely > reject WP'ing a page that is under any kind of writable GUP? I know little about KSM as well but I think fundamentally it has similar requirements for anon pages as filesystems have for page cache pages e.g. when doing block deduplication or data checksumming... I.e., it needs to make sure data in the page is stable and nobody can modify it. Honza
On Wed 22-12-21 14:09:41, David Hildenbrand wrote: > >> IIUC, our COW logic makes sure that a shared anonymous page that might > >> still be used by a R/O FOLL_GET cannot be modified, because any attempt > >> to modify it would result in a copy. > > > > Well, we defined FOLL_PIN to mean the intent that the caller wants to access > > not only page state (for which is enough FOLL_GET and there are some users > > - mostly inside mm - who need this) but also page data. Eventually, we even > > wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it > > internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP. > > > > For file pages we need this data vs no-data access distinction so that > > filesystems can detect when someone can be accessing page data although the > > page is unmapped. Practically, filesystems care most about when someone > > can be *modifying* page data (we need to make sure data is stable e.g. when > > writing back data to disk or doing data checksumming or other operations) > > so using FOLL_GET when wanting to only read page data should be OK for > > filesystems but honestly I would be reluctant to break the rule of "use > > FOLL_PIN when wanting to access page data" to keep things simple and > > reasonably easy to understand for parties such as filesystem developers or > > driver developers who all need to interact with pinned pages... > > Right, from an API perspective we really want people to use FOLL_PIN. > > To optimize this case in particular it would help if we would have the > FOLL flags on the unpin path. Then we could just decide internally > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat > this like a FOLL_GET instead". And we would need that as well if we were > to keep different counters for R/O vs. R/W pinned. Well, I guess the question here is: Which GUP user needs only R/O access to page data and is so performance critical that it would be worth it to sacrifice API clarity for speed? I'm not aware of any but I was not looking really hard... Honza
On Wed 22-12-21 15:48:34, David Hildenbrand wrote: > On 22.12.21 15:42, Jan Kara wrote: > > On Wed 22-12-21 14:09:41, David Hildenbrand wrote: > >>>> IIUC, our COW logic makes sure that a shared anonymous page that might > >>>> still be used by a R/O FOLL_GET cannot be modified, because any attempt > >>>> to modify it would result in a copy. > >>> > >>> Well, we defined FOLL_PIN to mean the intent that the caller wants to access > >>> not only page state (for which is enough FOLL_GET and there are some users > >>> - mostly inside mm - who need this) but also page data. Eventually, we even > >>> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it > >>> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP. > >>> > >>> For file pages we need this data vs no-data access distinction so that > >>> filesystems can detect when someone can be accessing page data although the > >>> page is unmapped. Practically, filesystems care most about when someone > >>> can be *modifying* page data (we need to make sure data is stable e.g. when > >>> writing back data to disk or doing data checksumming or other operations) > >>> so using FOLL_GET when wanting to only read page data should be OK for > >>> filesystems but honestly I would be reluctant to break the rule of "use > >>> FOLL_PIN when wanting to access page data" to keep things simple and > >>> reasonably easy to understand for parties such as filesystem developers or > >>> driver developers who all need to interact with pinned pages... > >> > >> Right, from an API perspective we really want people to use FOLL_PIN. > >> > >> To optimize this case in particular it would help if we would have the > >> FOLL flags on the unpin path. Then we could just decide internally > >> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat > >> this like a FOLL_GET instead". And we would need that as well if we were > >> to keep different counters for R/O vs. R/W pinned. > > > > Well, I guess the question here is: Which GUP user needs only R/O access to > > page data and is so performance critical that it would be worth it to > > sacrifice API clarity for speed? I'm not aware of any but I was not looking > > really hard... > > I'd be interested in examples as well. Maybe databases that use O_DIRECT > after fork()? Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify page data (and so we need to detect them both for COW and filesystem needs). O_DIRECT writes could use FOLL_GET but at this point I'm not convinced it is worth it. Honza
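In code, the rule Jan states just determines which flags an I/O path passes to GUP; a sketch (the wrapper is illustrative, pin_user_pages() is the real API of this era):

/* A direct I/O READ means the device *writes* into the pages. */
static long dio_pin_pages(unsigned long start, unsigned long nr,
			  bool is_read, struct page **pages)
{
	unsigned int gup_flags = FOLL_PIN | (is_read ? FOLL_WRITE : 0);

	return pin_user_pages(start, nr, gup_flags, pages, NULL);
}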
On Wed, Dec 22, 2021 at 05:08:46PM +0100, Jan Kara wrote: > On Wed 22-12-21 15:48:34, David Hildenbrand wrote: > > On 22.12.21 15:42, Jan Kara wrote: > > > On Wed 22-12-21 14:09:41, David Hildenbrand wrote: > > >>>> IIUC, our COW logic makes sure that a shared anonymous page that might > > >>>> still be used by a R/O FOLL_GET cannot be modified, because any attempt > > >>>> to modify it would result in a copy. > > >>> > > >>> Well, we defined FOLL_PIN to mean the intent that the caller wants to access > > >>> not only page state (for which is enough FOLL_GET and there are some users > > >>> - mostly inside mm - who need this) but also page data. Eventually, we even > > >>> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it > > >>> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP. > > >>> > > >>> For file pages we need this data vs no-data access distinction so that > > >>> filesystems can detect when someone can be accessing page data although the > > >>> page is unmapped. Practically, filesystems care most about when someone > > >>> can be *modifying* page data (we need to make sure data is stable e.g. when > > >>> writing back data to disk or doing data checksumming or other operations) > > >>> so using FOLL_GET when wanting to only read page data should be OK for > > >>> filesystems but honestly I would be reluctant to break the rule of "use > > >>> FOLL_PIN when wanting to access page data" to keep things simple and > > >>> reasonably easy to understand for parties such as filesystem developers or > > >>> driver developers who all need to interact with pinned pages... > > >> > > >> Right, from an API perspective we really want people to use FOLL_PIN. > > >> > > >> To optimize this case in particular it would help if we would have the > > >> FOLL flags on the unpin path. Then we could just decide internally > > >> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat > > >> this like a FOLL_GET instead". And we would need that as well if we were > > >> to keep different counters for R/O vs. R/W pinned. > > > > > > Well, I guess the question here is: Which GUP user needs only R/O access to > > > page data and is so performance critical that it would be worth it to > > > sacrifice API clarity for speed? I'm not aware of any but I was not looking > > > really hard... > > > > I'd be interested in examples as well. Maybe databases that use O_DIRECT > > after fork()? > > Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify > page data (and so we need to detect them both for COW and filesystem needs). > O_DIRECT writes could use FOLL_GET but at this point I'm not convinced it > is worth it. Wow, I didn't realise the plan was to make FOLL_PIN the "default". I hoped it was weird crap that was going away soon. Looks like we'd better fix all the bugs in it then ...
On Wed, Dec 22, 2021 at 8:08 AM Jan Kara <jack@suse.cz> wrote: > > Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify > page data (and so we need to detect them both for COW and filesystem needs). Well, O_DIRECT reads do, but not necessarily writes. And hey, even reads have been dodgy in the past when we didn't really have the pinning logic - there's been a lot of users that just wanted it to work for their particular use-case rather than in general and in all situations.. Linus
On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote: > Right, from an API perspective we really want people to use FOLL_PIN. > > To optimize this case in particular it would help if we would have the > FOLL flags on the unpin path. Then we could just decide internally > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat > this like a FOLL_GET instead". And we would need that as well if we were > to keep different counters for R/O vs. R/W pinned. FYI, in my current tree, there's a gup_put_folio() which replaces put_compound_head:

static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
	if (flags & FOLL_PIN) {
		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
		if (hpage_pincount_available(&folio->page))
			hpage_pincount_sub(&folio->page, refs);
		else
			refs *= GUP_PIN_COUNTING_BIAS;
	}

	folio_put_refs(folio, refs);
}

That can become non-static if it's needed. I'm still working on that series, because I'd like to get it to a point where we return one folio pointer instead of N page pointers. Not quite there yet.
On Wed 22-12-21 10:40:18, Linus Torvalds wrote: > On Wed, Dec 22, 2021 at 8:08 AM Jan Kara <jack@suse.cz> wrote: > > > > Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify > > page data (and so we need to detect them both for COW and filesystem needs). > > Well, O_DIRECT reads do, but not necessarily writes. I agree. > And hey, even reads have been dodgy in the past when we didn't really > have the pinning logic - there's been a lot of users that just wanted > it to work for their particular use-case rather than in general and in > all situations.. Yes, but currently a malicious user can take the system down (BUG_ON) or cause DIF/DIX failures if he is nasty and tries hard enough with O_DIRECT reads (practically, the window is small so I haven't really seen a report that I could trace to O_DIRECT reads, but in principle the problem is the same as with pinning & dirtying done e.g. by video capture drivers, and there we've seen these problems happen). So forcing pinning for O_DIRECT reads is IMO mandatory. Honza
On Thu, Dec 23, 2021 at 4:54 AM Jan Kara <jack@suse.cz> wrote: > > So forcing pinning for O_DIRECT reads is IMO mandatory. I don't disagree. And I do think the eventual aim should be to do it for writes too even if they don't necessarily require it (since they write _from_ the VM data, not _to_ the VM data - the "read-vs-write" direction has always been confusing when it comes to GUP). Partly just for consistency in the IO paths - I think people want to share as much as possible in there - but also just to make sure that we're all done with the "wrong-way-cow" kind of issues for good. If we get to the point where the legacy GUP is used only for very special things (looking up physical pages for debug and trace purposes etc), I think that would be lovely. That may be a pretty long-range goal, though. Linus
On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote: > On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote: > > Right, from an API perspective we really want people to use FOLL_PIN. > > > > To optimize this case in particular it would help if we would have the > > FOLL flags on the unpin path. Then we could just decide internally > > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat > > this like a FOLL_GET instead". And we would need that as well if we were > > to keep different counters for R/O vs. R/W pinned. > > FYI, in my current tree, there's a gup_put_folio() which replaces > put_compound_head: > > static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) > { > if (flags & FOLL_PIN) { > node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs); > if (hpage_pincount_available(&folio->page)) > hpage_pincount_sub(&folio->page, refs); > else > refs *= GUP_PIN_COUNTING_BIAS; > } > > folio_put_refs(folio, refs); > } > > That can become non-static if it's needed. I'm still working on that > series, because I'd like to get it to a point where we return one > folio pointer instead of N page pointers. Not quite there yet. I'm keen to see what that looks like, every driver I'm working on that calls PUP goes through gyrations to recover contiguous pages, so this is most welcomed! Jason
On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote: > On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote: > > On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote: > > > Right, from an API perspective we really want people to use FOLL_PIN. > > > > > > To optimize this case in particular it would help if we would have the > > > FOLL flags on the unpin path. Then we could just decide internally > > > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat > > > this like a FOLL_GET instead". And we would need that as well if we were > > > to keep different counters for R/O vs. R/W pinned. > > > > FYI, in my current tree, there's a gup_put_folio() which replaces > > put_compound_head: > > > > static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) > > { > > if (flags & FOLL_PIN) { > > node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs); > > if (hpage_pincount_available(&folio->page)) > > hpage_pincount_sub(&folio->page, refs); > > else > > refs *= GUP_PIN_COUNTING_BIAS; > > } > > > > folio_put_refs(folio, refs); > > } > > > > That can become non-static if it's needed. I'm still working on that > > series, because I'd like to get it to a point where we return one > > folio pointer instead of N page pointers. Not quite there yet. > > I'm keen to see what that looks like, every driver I'm working on that > calls PUP goes through gyrations to recover contiguous pages, so this > is most welcomed! I'm about to take some time off, so alas, you won't see it any time soon. It'd be good to talk with some of the interested users because it's actually a pretty tricky problem. We can't just return an array of the struct folios because the actual memory you want to access might be anywhere in that folio, and you don't want to have to redo the lookup just to find out which subpages of the folio are meant. So I'm currently thinking about returning a bio_vec:

struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

In the iomap patchset which should go upstream in the next merge window, you can iterate over a bio like this:

	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio)
		iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);

There aren't any equivalent helpers for a bvec yet, but obviously we can add them so that you can iterate over each folio in a contiguous range. But now that each component in it is variable length, the caller can't know how large an array of bio_vecs to allocate.

1. The callee can allocate the array and let the caller free it when it's finished
2. The caller passes in a (small, fixed-size, on-stack) array of bio_vecs over (potentially) multiple calls.
3. The caller can overallocate and ignore that most of the array isn't used.

Any preferences? I don't like #3.
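As a caller-side sketch of option 2 (pin_user_bvecs() is hypothetical; error handling is minimal):

#define NR_STACK_BVECS 16

static int pin_range(unsigned long start, unsigned long len)
{
	struct bio_vec bvecs[NR_STACK_BVECS];
	long i, filled;

	while (len) {
		/* hypothetical: fills up to NR_STACK_BVECS entries */
		filled = pin_user_bvecs(start, len, FOLL_PIN | FOLL_WRITE,
					bvecs, NR_STACK_BVECS);
		if (filled <= 0)
			return filled ? filled : -EFAULT;
		for (i = 0; i < filled; i++) {
			/* ... hand bvecs[i] to the consumer here ... */
			start += bvecs[i].bv_len;
			len -= bvecs[i].bv_len;
		}
	}
	return 0;
}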
On Fri, Dec 24, 2021 at 04:53:38AM +0000, Matthew Wilcox wrote: > On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote: > > On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote: > > > On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote: > > > > Right, from an API perspective we really want people to use FOLL_PIN. > > > > > > > > To optimize this case in particular it would help if we would have the > > > > FOLL flags on the unpin path. Then we could just decide internally > > > > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat > > > > this like a FOLL_GET instead". And we would need that as well if we were > > > > to keep different counters for R/O vs. R/W pinned. > > > > > > FYI, in my current tree, there's a gup_put_folio() which replaces > > > put_compound_head: > > > > > > static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) > > > { > > > if (flags & FOLL_PIN) { > > > node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs); > > > if (hpage_pincount_available(&folio->page)) > > > hpage_pincount_sub(&folio->page, refs); > > > else > > > refs *= GUP_PIN_COUNTING_BIAS; > > > } > > > > > > folio_put_refs(folio, refs); > > > } > > > > > > That can become non-static if it's needed. I'm still working on that > > > series, because I'd like to get it to a point where we return one > > > folio pointer instead of N page pointers. Not quite there yet. > > > > I'm keen to see what that looks like, every driver I'm working on that > > calls PUP goes through gyrations to recover contiguous pages, so this > > is most welcomed! > > I'm about to take some time off, so alas, you won't see it any time > soon. It'd be good to talk with some of the interested users because > it's actually a pretty tricky problem. Sure, it is a good idea > We can't just return an array of the struct folios because the > actual memory you want to access might be anywhere in that folio, > and you don't want to have to redo the lookup just to find out which > subpages of the folio are meant. Yep > So I'm currently thinking about returning a bio_vec: > > struct bio_vec { > struct page *bv_page; > unsigned int bv_len; > unsigned int bv_offset; > }; The cases I'm looking at basically want an efficient list of physical addresses + lengths. They don't care about pages or folios, eg often the next step is to build a SGL and DMA map it which largely ignores all of that. As the memory used to hold the output of pin_user_pages() is all temporary there is a sensitivity to allocate the memory quickly, but also to have enough of it so that we don't have to do redundant work in pin_user_pages() - eg traversing to the same PMD table again and again. > But now that each component in it is variable length, the caller can't > know how large an array of bio_vecs to allocate. And the array entry is now 2x the size and there is no way to scatter the array to 4k segments? > 1. The callee can allocate the array and let the caller free it when it's > finished It is not bad, but a bit tricky, a lot of the GUP code executes in an IRQ disabled state, so it has to use a pre-allocating scheme. We also can't scan everything twice and hope it didn't change, so exact preallocation doesn't seem likely either. > 2. The caller passes in a (small, fixed-size, on-stack) array of bio_vecs > over (potentially) multiple calls. It is slow, because we do redundant work traversing the same locks and page tables again and again.. > 3. 
The caller can overallocate and ignore that most of the array isn't > used. > > Any preferences? I don't like #3. #3 is OK for my applications because we immediately turn around and copy the output to something else and free the memory anyhow... However, being an array means we can't reliably allocate more than 4k and with 16 bytes per entry that isn't even able to store a full PTE table. What would be nice for these cases is if the caller can supply an array of 4k pages and GUP will fill them in. In many cases we'd probably pass in up to 2M worth of pages or something. There is some batching balance here where we want to minimize the temporary memory consumed by GUP's output (and the time to allocate it!) but also minimize the redundant work inside GUP repeatedly walking the same tables and locks. eg ideally GUP would stop at some natural alignment boundary if it could tell it can't fully fill the buffer. Then the next iteration would not redo the same locks. I was once thinking about something like storing an array of PFNs and using the high bits to encode that the PFN is not 4k. It would allow efficient packing of the common fragmented cases. To make it work you'd need to have each 4k page grow the pfn list up from the start and the pfn sizes down from the end. A size entry is only consumed if the pfn bits can't encode the size directly so the packing can be a perfect 8 bytes per PFN for the common 4k and 2M aligned cases. Jason
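A sketch of that packing scheme (all names and bit assignments invented; overflow and bounds checks elided):

#define PFN_HAS_SIZE	BIT_ULL(63)	/* explicit size stored at the end */
#define PFN_ORDER_PMD	BIT_ULL(62)	/* common case: 2M aligned chunk   */

struct pfn_buf {
	u64 *head;	/* PFN entries grow up from the buffer start  */
	u64 *tail;	/* size entries grow down from the buffer end */
};

static void pfn_buf_add(struct pfn_buf *buf, u64 pfn, u64 size)
{
	if (size == SZ_4K)
		*buf->head++ = pfn;			/* 8 bytes/PFN   */
	else if (size == SZ_2M)
		*buf->head++ = pfn | PFN_ORDER_PMD;	/* still 8 bytes */
	else {
		*buf->head++ = pfn | PFN_HAS_SIZE;	/* odd size:      */
		*--buf->tail = size;			/* consume a slot */
	}
}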
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..37d1fb2f865e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -436,6 +436,9 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare a
+ *			shared anonymous page (-> mapped R/O). Does not apply
+ *			to KSM.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -467,6 +470,7 @@ enum fault_flag {
	FAULT_FLAG_REMOTE =		1 << 7,
	FAULT_FLAG_INSTRUCTION =	1 << 8,
	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
+	FAULT_FLAG_UNSHARE =		1 << 10,
 };
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..7253a2ad4320 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2707,8 +2707,9 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
  * read non-atomically. Before making any commitment, on those architectures
  * or configurations (e.g. i386 with PAE) which might give a mix of unmatched
  * parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
+ * proceeding (but do_wp_page_cow/do_wp_page_unshare is only called after
+ * already making such a check; and do_anonymous_page can safely check later
+ * on).
  */
 static inline int pte_unmap_same(struct vm_fault *vmf)
 {
@@ -2726,8 +2727,8 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
	return same;
 }
 
-static inline bool cow_user_page(struct page *dst, struct page *src,
-				 struct vm_fault *vmf)
+static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
+				       struct vm_fault *vmf)
 {
	bool ret;
	void *kaddr;
@@ -2952,7 +2953,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 }
 
 /*
- * Handle the case of a page which we actually need to copy to a new page.
+ * Handle the case of a page which we actually need to copy to a new page,
+ * either due to COW or unsharing.
  *
  * Called with mmap_lock locked and the old page referenced, but
  * without the ptl held.
@@ -2967,7 +2969,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
  * held to the old page, as well as updating the rmap.
  * - In any case, unlock the PTL and drop the reference we took to the old page.
  */
-static vm_fault_t wp_page_copy(struct vm_fault *vmf)
+static vm_fault_t wp_page_copy(struct vm_fault *vmf, bool unshare)
 {
	struct vm_area_struct *vma = vmf->vma;
	struct mm_struct *mm = vma->vm_mm;
@@ -2991,7 +2993,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
		if (!new_page)
			goto oom;
 
-		if (!cow_user_page(new_page, old_page, vmf)) {
+		if (!__wp_page_copy_user(new_page, old_page, vmf)) {
			/*
			 * COW failed, if the fault was solved by other,
			 * it's fine. If not, userspace would re-fault on
@@ -3033,7 +3035,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
		entry = mk_pte(new_page, vma->vm_page_prot);
		entry = pte_sw_mkyoung(entry);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (unlikely(unshare)) {
+			if (pte_soft_dirty(vmf->orig_pte))
+				entry = pte_mksoft_dirty(entry);
+			if (pte_uffd_wp(vmf->orig_pte))
+				entry = pte_mkuffd_wp(entry);
+		} else {
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		}
 
		/*
		 * Clear the pte entry and flush it first, before updating the
@@ -3050,6 +3059,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
		 * mmu page tables (such as kvm shadow page tables), we want the
		 * new page to be mapped directly into the secondary page table.
		 */
+		BUG_ON(unshare && pte_write(entry));
		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
		update_mmu_cache(vma, vmf->address, vmf->pte);
		if (old_page) {
@@ -3109,6 +3119,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
			free_swap_cache(old_page);
		put_page(old_page);
	}
+	if (unlikely(unshare))
+		return 0;
	return page_copied ? VM_FAULT_WRITE : 0;
 oom_free_new:
	put_page(new_page);
@@ -3118,6 +3130,70 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
	return VM_FAULT_OOM;
 }
 
+static __always_inline vm_fault_t wp_page_cow(struct vm_fault *vmf)
+{
+	return wp_page_copy(vmf, false);
+}
+
+static __always_inline vm_fault_t wp_page_unshare(struct vm_fault *vmf)
+{
+	return wp_page_copy(vmf, true);
+}
+
+/*
+ * This routine handles present pages, when GUP tries to take a read-only
+ * pin on a shared anonymous page. It's similar to do_wp_page_cow(), except that
+ * it keeps the pages mapped read-only and doesn't apply to KSM pages.
+ *
+ * If a parent process forks a child process, we share anonymous pages between
+ * both processes with COW semantics. Both processes will map these now shared
+ * anonymous pages read-only, and any write access triggers unsharing via COW.
+ *
+ * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
+ * set) and then unmaps the target page, we have:
+ *
+ * * page has mapcount == 1 and refcount > 1
+ * * page is mapped read-only into the parent
+ * * page is pinned by the child and can still be read
+ *
+ * For now, we rely on refcount > 1 to perform the COW and trigger unsharing.
+ * However, that leads to other hard-to fix issues.
+ *
+ * GUP-triggered unsharing provides a parallel approach to trigger unsharing
+ * early, still allowing for relying on mapcount > 1 in COW code instead of on
+ * imprecise refcount > 1. Note that when we don't actually take a reference
+ * on the target page but instead use memory notifiers to synchronize to changes
+ * in the process page tables, unsharing is not required.
+ *
+ * Note that in the above scenario, it's impossible to distinguish during the
+ * write fault between:
+ *
+ * a) The parent process performed the pin and the child no longer has access
+ *    to the page.
+ *
+ * b) The child process performed the pin and the child still has access to the
+ *    page.
+ *
+ * In case of a), if we're dealing with a long-term read-only pin, the COW
+ * in the parent will result the pinned page differing from the page actually
+ * mapped into the process page tables in the parent: loss of synchronicity.
+ * Therefore, we really want to perform the copy when the read-only pin happens.
+ */
+static vm_fault_t do_wp_page_unshare(struct vm_fault *vmf)
+	__releases(vmf->ptl)
+{
+	vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
+	if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
+	    page_mapcount(vmf->page) > 1) {
+		get_page(vmf->page);
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return wp_page_unshare(vmf);
+	}
+	vmf->page = NULL;
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return 0;
+}
+
 /**
  * finish_mkwrite_fault - finish page fault for a shared mapping, making PTE
  * writeable once the page is prepared
@@ -3226,7 +3302,7 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
  * but allow concurrent faults), with pte both mapped and locked.
  * We return with mmap_lock still held, but pte unmapped and unlocked.
  */
-static vm_fault_t do_wp_page(struct vm_fault *vmf)
+static vm_fault_t do_wp_page_cow(struct vm_fault *vmf)
	__releases(vmf->ptl)
 {
	struct vm_area_struct *vma = vmf->vma;
@@ -3258,7 +3334,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
			return wp_pfn_shared(vmf);
 
		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return wp_page_copy(vmf);
+		return wp_page_cow(vmf);
	}
 
	/*
@@ -3296,7 +3372,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
	get_page(vmf->page);
 
	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return wp_page_copy(vmf);
+	return wp_page_cow(vmf);
 }
 
 static void unmap_mapping_range_vma(struct vm_area_struct *vma,
@@ -3670,7 +3746,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
	}
 
	if (vmf->flags & FAULT_FLAG_WRITE) {
-		ret |= do_wp_page(vmf);
+		ret |= do_wp_page_cow(vmf);
		if (ret & VM_FAULT_ERROR)
			ret &= VM_FAULT_ERROR;
		goto out;
@@ -4428,6 +4504,16 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 /* `inline' is required to avoid gcc 4.1.2 build error */
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
+	if (vmf->flags & FAULT_FLAG_UNSHARE) {
+		/*
+		 * We'll simply split the THP and handle unsharing on the
+		 * PTE level. Unsharing only applies to anon THPs and we
+		 * shouldn't ever find them inside shared mappings.
+		 */
+		if (WARN_ON_ONCE(vmf->vma->vm_flags & VM_SHARED))
+			return 0;
+		goto split_fallback;
+	}
	if (vma_is_anonymous(vmf->vma)) {
		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
			return handle_userfault(vmf, VM_UFFD_WP);
@@ -4440,7 +4526,8 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
			return ret;
	}
 
-	/* COW or write-notify handled on pte level: split pmd. */
+split_fallback:
+	/* COW, unsharing or write-notify handled on pte level: split pmd. */
	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
 
	return VM_FAULT_FALLBACK;
@@ -4551,8 +4638,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
			return do_fault(vmf);
	}
 
-	if (!pte_present(vmf->orig_pte))
-		return do_swap_page(vmf);
+	if (!pte_present(vmf->orig_pte)) {
+		if (likely(!(vmf->flags & FAULT_FLAG_UNSHARE)))
+			return do_swap_page(vmf);
+		return 0;
+	}
 
	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
		return do_numa_page(vmf);
@@ -4564,9 +4654,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
		goto unlock;
	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
-			return do_wp_page(vmf);
+	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
+		if (!pte_write(entry)) {
+			if (vmf->flags & FAULT_FLAG_WRITE)
+				return do_wp_page_cow(vmf);
+			else
+				return do_wp_page_unshare(vmf);
+		}
		entry = pte_mkdirty(entry);
	}
	entry = pte_mkyoung(entry);
@@ -4607,7 +4701,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
		.pgoff = linear_page_index(vma, address),
		.gfp_mask = __get_fault_gfp_mask(vma),
	};
-	unsigned int dirty = flags & FAULT_FLAG_WRITE;
	struct mm_struct *mm = vma->vm_mm;
	pgd_t *pgd;
	p4d_t *p4d;
@@ -4634,7 +4727,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
		/* NUMA case for anonymous PUDs would go here */
 
-		if (dirty && !pud_write(orig_pud)) {
+		if ((flags & FAULT_FLAG_WRITE) && !pud_write(orig_pud)) {
			ret = wp_huge_pud(&vmf, orig_pud);
			if (!(ret & VM_FAULT_FALLBACK))
				return ret;
@@ -4672,7 +4765,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
		if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
			return do_huge_pmd_numa_page(&vmf);
 
-		if (dirty && !pmd_write(vmf.orig_pmd)) {
+		if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
+		    !pmd_write(vmf.orig_pmd)) {
			ret = wp_huge_pmd(&vmf);
			if (!(ret & VM_FAULT_FALLBACK))
				return ret;
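For context, here is a sketch (not part of the patch above) of how a GUP slow-path caller could drive the new flag: on a read-only FOLL_PIN of an anonymous page that is not exclusive, trigger the unshare fault and retry the lookup. PageAnonExclusive() is the bit discussed earlier in the thread, not something this patch introduces.

static int maybe_unshare_before_pin(struct vm_area_struct *vma,
				    unsigned long address, struct page *page,
				    unsigned int gup_flags)
{
	vm_fault_t ret;

	if (!(gup_flags & FOLL_PIN) || (gup_flags & FOLL_WRITE))
		return 0;
	if (!PageAnon(page) || PageAnonExclusive(page))
		return 0;
	ret = handle_mm_fault(vma, address, FAULT_FLAG_UNSHARE, NULL);
	if (ret & VM_FAULT_ERROR)
		return -EFAULT;
	return -EAGAIN;	/* caller re-walks the page tables */
}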