diff mbox series

[v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW

Message ID 20220808073232.8808-1-david@redhat.com (mailing list archive)
State New
Headers show
Series [v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW | expand

Commit Message

David Hildenbrand Aug. 8, 2022, 7:32 a.m. UTC
Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
that FOLL_FORCE can be possibly dangerous, especially if there are races
that can be exploited by user space.

Right now, it would be sufficient to have some code that sets a PTE of
a R/O-mapped shared page dirty, in order for it to erroneously become
writable by FOLL_FORCE. The implications of setting a write-protected PTE
dirty might not be immediately obvious to everyone.

And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set
pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
a shmem page R/O while marking the pte dirty. This can be used by
unprivileged user space to modify tmpfs/shmem file content even if the user
does not have write permissions to the file -- Dirty COW restricted to
tmpfs/shmem (CVE-2022-2590).

To fix such security issues for good, the insight is that we really only
need that fancy retry logic (FOLL_COW) for COW mappings that are not
writable (!VM_WRITE). And in a COW mapping, we really only broke COW if
we have an exclusive anonymous page mapped. If we have something else
mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
we have to trigger a write fault to break COW. If we don't find an
exclusive anonymous page when we retry, we have to trigger COW breaking
once again because something intervened.

Let's move away from this mandatory-retry + dirty handling and rely on
our PageAnonExclusive() flag for making a similar decision, to use the
same COW logic as in other kernel parts here as well. In case we stumble
over a PTE in a COW mapping that does not map an exclusive anonymous page,
COW was not properly broken and we have to trigger a fake write-fault to
break COW.

Just like we do in can_change_pte_writable() added via
commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive
anonymous pages when changing protection") and commit 76aefad628aa
("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take
care of softdirty and uffd-wp manually.

For example, a write() via /proc/self/mem to a uffd-wp-protected range has
to fail instead of silently granting write access and bypassing the
userspace fault handler. Note that FOLL_FORCE is not only used for debug
access, but also triggered by applications without debug intentions, for
example, when pinning pages via RDMA.

This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.

Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
let's just get rid of it.

Note 1: We don't check for the PTE being dirty because it doesn't matter
	for making a "was COWed" decision anymore, and whoever modifies the
	page has to set the page dirty either way.

Note 2: Kernels before extended uffd-wp support and before
	PageAnonExclusive (< 5.19) can simply revert the problematic
	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
	v5.19 requires minor adjustments due to lack of
	vma_soft_dirty_enabled().

Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
Cc: <stable@vger.kernel.org> # 5.16+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

Against upstream from yesterday instead of v5.19 because I wanted to
reference the mprotect commit IDs and can_change_pte_writable(), and I
wanted to directly use vma_soft_dirty_enabled().

I have a working reproducer that I'll post to oss-security in one week. Of
course, that reproducer no longer triggers with that commit and my ptrace
testing indicated that FOLL_FORCE seems to continue working as expected.

---
 include/linux/mm.h |  1 -
 mm/gup.c           | 62 +++++++++++++++++++++++++++++-----------------
 mm/huge_memory.c   | 45 +++++++++++++++++----------------
 3 files changed, 63 insertions(+), 45 deletions(-)


base-commit: 1612c382ffbdf1f673caec76502b1c00e6d35363

Comments

David Hildenbrand Aug. 8, 2022, 4:02 p.m. UTC | #1
On 08.08.22 09:32, David Hildenbrand wrote:
> Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
> that FOLL_FORCE can be possibly dangerous, especially if there are races
> that can be exploited by user space.
> 
> Right now, it would be sufficient to have some code that sets a PTE of
> a R/O-mapped shared page dirty, in order for it to erroneously become
> writable by FOLL_FORCE. The implications of setting a write-protected PTE
> dirty might not be immediately obvious to everyone.
> 
> And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set
> pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
> a shmem page R/O while marking the pte dirty. This can be used by
> unprivileged user space to modify tmpfs/shmem file content even if the user
> does not have write permissions to the file -- Dirty COW restricted to
> tmpfs/shmem (CVE-2022-2590).
> 
> To fix such security issues for good, the insight is that we really only
> need that fancy retry logic (FOLL_COW) for COW mappings that are not
> writable (!VM_WRITE). And in a COW mapping, we really only broke COW if
> we have an exclusive anonymous page mapped. If we have something else
> mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
> we have to trigger a write fault to break COW. If we don't find an
> exclusive anonymous page when we retry, we have to trigger COW breaking
> once again because something intervened.
> 
> Let's move away from this mandatory-retry + dirty handling and rely on
> our PageAnonExclusive() flag for making a similar decision, to use the
> same COW logic as in other kernel parts here as well. In case we stumble
> over a PTE in a COW mapping that does not map an exclusive anonymous page,
> COW was not properly broken and we have to trigger a fake write-fault to
> break COW.
> 
> Just like we do in can_change_pte_writable() added via
> commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive
> anonymous pages when changing protection") and commit 76aefad628aa
> ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take
> care of softdirty and uffd-wp manually.
> 
> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
> to fail instead of silently granting write access and bypassing the
> userspace fault handler. Note that FOLL_FORCE is not only used for debug
> access, but also triggered by applications without debug intentions, for
> example, when pinning pages via RDMA.
> 
> This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
> affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
> 
> Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
> let's just get rid of it.

I have to add here:

"Thanks to Nadav Amit for pointing out that the pte_dirty() check in
FOLL_FORCE code is problematic and might be exploitable."
Linus Torvalds Aug. 9, 2022, 6:27 p.m. UTC | #2
I'm still reading through this, but

 STOP DOING THIS

On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));

Using BUG_ON() for debugging is simply not ok.

And saying "but it's just a VM_BUG_ON()" does not change *anything*.
At least Fedora enables that unconditionally for normal people, it is
not some kind of "only VM people do this".

Really. BUG_ON() IS NOT FOR DEBUGGING.

Stop it. Now.

If you have a condition that must not happen, you either write that
condition into the code, or - if you are convinced it cannot happen -
you make it a WARN_ON_ONCE() so that people can report it to you.

The BUG_ON() will just make the machine die.

And for the facebooks and googles of the world, the WARN_ON() will be
sufficient.

               Linus
Linus Torvalds Aug. 9, 2022, 6:40 p.m. UTC | #3
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
> to fail instead of silently granting write access and bypassing the
> userspace fault handler. Note that FOLL_FORCE is not only used for debug
> access, but also triggered by applications without debug intentions, for
> example, when pinning pages via RDMA.

So this made me go "Whaa?"

I didn't even realize that the media drivers and rdma used FOLL_FORCE.

That's just completely bogus.

Why do they do that?

It seems to be completely bogus, and seems to have no actual valid
reason for it. Looking through the history, it goes back to the
original code submission in 2006, and doesn't have a mention of why.

I think the original reason was that the code didn't have pinning, so
it used "do a write" as a pin mechanism - even for reads.

IOW, I think the non-ptrace use of FOLL_FORCE should just be removed.

                 Linus
David Hildenbrand Aug. 9, 2022, 6:45 p.m. UTC | #4
On 09.08.22 20:27, Linus Torvalds wrote:
> I'm still reading through this, but
> 
>  STOP DOING THIS
> 
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
> 
> Using BUG_ON() for debugging is simply not ok.
> 
> And saying "but it's just a VM_BUG_ON()" does not change *anything*.
> At least Fedora enables that unconditionally for normal people, it is
> not some kind of "only VM people do this".
> 
> Really. BUG_ON() IS NOT FOR DEBUGGING.

I totally agree with BUG_ON ... but if I get talked to in all-caps on a
Thursday evening and feel like I just touched the forbidden fruit, I
have to ask for details about VM_BUG_ON nowadays.

VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some
kind of debugging at least to me. I *know* that Fedora enables it and I
*know* that this will make Fedora crash.

I know why Fedora enables this debug option, but it somewhat destorys
the whole purpose of VM_BUG_ON kind of nowadays?

For this case, this condition will never trigger and I consider it much
more a hint to the reader that we can rest assured that this condition
holds. And on production systems, it will get optimized out.

Should we forbid any new usage of VM_BUG_ON just like we mostly do with
BUG_ON?

> 
> Stop it. Now.
> 
> If you have a condition that must not happen, you either write that
> condition into the code, or - if you are convinced it cannot happen -
> you make it a WARN_ON_ONCE() so that people can report it to you.

I can just turn that into a WARN_ON_ONCE() or even a VM_WARN_ON_ONCE().
Linus Torvalds Aug. 9, 2022, 6:48 p.m. UTC | #5
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
> to fail instead of silently granting write access and bypassing the
> userspace fault handler.

This, btw, just makes me go "uffd-wp is broken garbage" once more.

It also makes me go "if uffd-wp can disallow ptrace writes, then why
doesn't regular write protect do it"?

IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that
absolutely must go away), but I get the strong feelign that we instead
should try to get rid of FOLL_FORCE entirely.

If some other user action can stop FOLL_FORCE anyway, then why do we
support it at all?

             Linus
Jason Gunthorpe Aug. 9, 2022, 6:48 p.m. UTC | #6
On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote:
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > For example, a write() via /proc/self/mem to a uffd-wp-protected range has
> > to fail instead of silently granting write access and bypassing the
> > userspace fault handler. Note that FOLL_FORCE is not only used for debug
> > access, but also triggered by applications without debug intentions, for
> > example, when pinning pages via RDMA.
> 
> So this made me go "Whaa?"
> 
> I didn't even realize that the media drivers and rdma used FOLL_FORCE.
> 
> That's just completely bogus.
> 
> Why do they do that?
> 
> It seems to be completely bogus, and seems to have no actual valid
> reason for it. Looking through the history, it goes back to the
> original code submission in 2006, and doesn't have a mention of why.

It is because of all this madness with COW.

Lets say an app does:

 buf = mmap(MAP_PRIVATE)
 rdma_pin_for_read(buf)
 buf[0] = 1 

Then the store to buf[0] will COW the page and the pin will become
decoherent.

The purpose of the FORCE is to force COW to happen early so this is
avoided.

Sadly there are real apps that end up working this way, eg because
they are using buffer in .data or something.

I've hoped David's new work on this provided some kind of path to a
proper solution, as the need is very similar to all the other places
where we need to ensure there is no possiblity of future COW.

So, these usage can be interpreted as a FOLL flag we don't have - some
kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive
sort of idea.

Jason
David Hildenbrand Aug. 9, 2022, 6:53 p.m. UTC | #7
On 09.08.22 20:48, Jason Gunthorpe wrote:
> On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote:
>> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
>>> to fail instead of silently granting write access and bypassing the
>>> userspace fault handler. Note that FOLL_FORCE is not only used for debug
>>> access, but also triggered by applications without debug intentions, for
>>> example, when pinning pages via RDMA.
>>
>> So this made me go "Whaa?"
>>
>> I didn't even realize that the media drivers and rdma used FOLL_FORCE.
>>
>> That's just completely bogus.
>>
>> Why do they do that?
>>
>> It seems to be completely bogus, and seems to have no actual valid
>> reason for it. Looking through the history, it goes back to the
>> original code submission in 2006, and doesn't have a mention of why.
> 
> It is because of all this madness with COW.
> 
> Lets say an app does:
> 
>  buf = mmap(MAP_PRIVATE)
>  rdma_pin_for_read(buf)
>  buf[0] = 1 
> 
> Then the store to buf[0] will COW the page and the pin will become
> decoherent.
> 
> The purpose of the FORCE is to force COW to happen early so this is
> avoided.
> 
> Sadly there are real apps that end up working this way, eg because
> they are using buffer in .data or something.
> 
> I've hoped David's new work on this provided some kind of path to a
> proper solution, as the need is very similar to all the other places
> where we need to ensure there is no possiblity of future COW.
> 
> So, these usage can be interpreted as a FOLL flag we don't have - some
> kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive
> sort of idea.

Thanks Jason for the explanation.

I have patches in the works to no longer use FOLL_FORCE|FOLL_WRITE for
taking a reliable longerterm R/O pin in a MAP_PRIVATE mapping. The
patches are mostly done (and comparably simple), I merely deferred
sending them out because I stumbled over this issue first.

Some ugly corner cases of MAP_SHARED remain, but for most prominent use
cases, my upcoming patches should allow us to just have longterm R/O
pins working as expected.
Linus Torvalds Aug. 9, 2022, 6:59 p.m. UTC | #8
On Tue, Aug 9, 2022 at 11:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> I totally agree with BUG_ON ... but if I get talked to in all-caps on a
> Thursday evening and feel like I just touched the forbidden fruit, I
> have to ask for details about VM_BUG_ON nowadays.
>
> VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some
> kind of debugging at least to me. I *know* that Fedora enables it and I
> *know* that this will make Fedora crash.

No.

VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally no
different, the only difference is "we can make the code smaller
because these are less important".

The only possible case where BUG_ON can validly be used is "I have
some fundamental data corruption and cannot possibly return an error".

This kind of "I don't think this can happen" is _never_ an excuse for it.

Honestly, 99% of our existing BUG_ON() ones are completely bogus, and
left-over debug code that wasn't removed because they never triggered.
I've several times considered just using a coccinelle script to remove
every single BUG_ON() (and VM_BUG_ON()) as simply bogus. Because they
are pure noise.

I just tried to find a valid BUG_ON() that would make me go "yeah,
that's actually worth it", and couldn't really find one. Yeah, there
are several ones in the scheduler that make me go "ok, if that
triggers, the machine is dead anyway", so in that sense there are
certainly BUG_ON()s that don't _hurt_.

But as a very good approximation, the rule is "absolutely no new
BUG_ON() calls _ever_". Because I really cannot see a single case
where "proper error handling and WARN_ON_ONCE()" isn't the right
thing.

Now, that said, there is one very valid sub-form of BUG_ON():
BUILD_BUG_ON() is absolutely 100% fine.

                Linus
Linus Torvalds Aug. 9, 2022, 7:07 p.m. UTC | #9
On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> It is because of all this madness with COW.

Yes, yes, but we have the proper long-term pinning now with
PG_anon_exclusive, and it actually gets the pinning right not just
over COW, but even over a fork - which that early write never did.

David, I thought all of that got properly merged? Is there something
still missing?

                     Linus
Jason Gunthorpe Aug. 9, 2022, 7:07 p.m. UTC | #10
On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote:

> But as a very good approximation, the rule is "absolutely no new
> BUG_ON() calls _ever_". Because I really cannot see a single case
> where "proper error handling and WARN_ON_ONCE()" isn't the right
> thing.

Parallel to this discussion I've had ones where people more or less
say

 Since BUG_ON crashes the machine and Linus says that crashing the
 machine is bad, WARN_ON will also crash the machine if you set the
 panic_on_warn parameter, so it is also bad, thus we shouldn't use
 anything.

I've generally maintained that people who set the panic_on_warn *want*
these crashes, because that is the entire point of it. So we should
use WARN_ON with an error recovery for "can't happen" assertions like
these. I think it is what you are saying here.

Jason
David Hildenbrand Aug. 9, 2022, 7:09 p.m. UTC | #11
On 09.08.22 20:48, Linus Torvalds wrote:
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> For example, a write() via /proc/self/mem to a uffd-wp-protected range has
>> to fail instead of silently granting write access and bypassing the
>> userspace fault handler.
> 
> This, btw, just makes me go "uffd-wp is broken garbage" once more.
> 
> It also makes me go "if uffd-wp can disallow ptrace writes, then why
> doesn't regular write protect do it"?

I remember that it's not just uffd-wp, it's also ordinary userfaultfd if
we have no page mapped, because we'd have to drop the mmap lock in order
to notify user space about the fault and wait for a resolution.

IIUC, we cannot tell what exactly user-space will do as a response to a
user write fault here (for example, QEMU VM snapshots have to copy page
content away such that the VM snapshot remains consistent and we won't
corrupt the snapshot), so we have to back off and fail the GUP. I'd say,
for ptrace that's even the right thing to do because one might deadlock
waiting on the user space thread that handles faults ... but that's a
little off-topic to this fix here. I'm just trying to keep the semantics
unchanged, as weird as they might be.


> 
> IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that
> absolutely must go away), but I get the strong feelign that we instead
> should try to get rid of FOLL_FORCE entirely.

I can resend v2 soonish, taking care of the VM_BUG_ON as you requested
if there are no other comments.

> 
> If some other user action can stop FOLL_FORCE anyway, then why do we
> support it at all?

My humble opinion is that debugging userfaultfd-managed memory is a
corner case and that we can hopefully stop using FOLL_FORCE outside of
debugging context soon.

Having that said, I do enjoy having the uffd and uffd-wp feature
available in user space (especially in QEMU). I don't always enjoy
having to handle such corner cases in the kernel.
David Hildenbrand Aug. 9, 2022, 7:20 p.m. UTC | #12
On 09.08.22 21:07, Linus Torvalds wrote:
> On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>> It is because of all this madness with COW.
> 
> Yes, yes, but we have the proper long-term pinning now with
> PG_anon_exclusive, and it actually gets the pinning right not just
> over COW, but even over a fork - which that early write never did.
> 
> David, I thought all of that got properly merged? Is there something
> still missing?

The only thing to get R/O longterm pins in MAP_PRIVATE correct that's
missing is that we have to break COW when taking a R/O longterm pin when
*not* finding an anon page inside a private mapping. Regarding anon
pages I am not aware of issues (due to PG_anon_exclusive).

If anybody here wants to stare at a webpage, the following commit
explains the rough idea for MAP_PRIVATE:

https://github.com/davidhildenbrand/linux/commit/cd7989fb76d2513c86f01e6f7a74415eee5d3150

Once we have that in place, we can mostly get rid of
FOLL_FORCE|FOLL_WRITE for R/O longterm pins. There are some corner cases
though that need some additional thought which i am still working on.
FS-handled COW in MAP_SHARED mappings is just nasty (hello DAX).

(the wrong use of FOLL_GET instead of FOLL_PIN for O_DIRECT and friends
still persists, but that's a different thing to handle and it's only
problematic with concurrent fork() IIRC)
Linus Torvalds Aug. 9, 2022, 7:21 p.m. UTC | #13
On Tue, Aug 9, 2022 at 12:09 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>  Since BUG_ON crashes the machine and Linus says that crashing the
>  machine is bad, WARN_ON will also crash the machine if you set the
>  panic_on_warn parameter, so it is also bad, thus we shouldn't use
>  anything.

If you set 'panic_on_warn' you get to keep both pieces when something breaks.

The thing is, there are people who *do* want to stop immediately when
something goes wrong in the kernel.

Anybody doing large-scale virtualization presumably has all the
infrastructure to get debug info out of the virtual environment.

And people who run controlled loads in big server machine setups and
have a MIS department to manage said machines typically also prefer
for a machine to just crash over continuing.

So in those situations, a dead machine is still a dead machine, but
you get the information out, and panic_on_warn is fine, because panic
and reboot is fine.

And yes, that's actually a fairly common case. Things like syzkaller
etc *wants* to abort on the first warning, because that's kind of the
point.

But while that kind of virtualized automation machinery is very very
common, and is a big deal, it's by no means the only deal, and the
most important thing to the point where nothing else matters.

And if you are *not* in a farm, and if you are *not* using
virtualization, a dead machine is literally a useless brick. Nobody
has serial lines on individual machines any more. In most cases, the
hardware literally doesn't even exist any more.

So in that situation, you really cannot afford to take the approach of
"just kill the machine". If you are on a laptop and are doing power
management code, you generally cannot do that in a virtual
environment, and you already have enough problems with suspend and
resume being hard to debug, without people also going "oh, let's just
BUG_ON() and kill the machine".

Because the other side of that "we have a lot of machine farms doing
automated testing" is that those machine farms do not generally find a
lot of the exciting cases.

Almost every single merge window, I end up having to bisect and report
an oops or a WARN_ON(), because I actually run on real hardware. And
said problem was never seen in linux-next.

So we have two very different cases: the "virtual machine with good
logging where a dead machine is fine" - use 'panic_on_warn'. And the
actual real hardware with real drivers, running real loads by users.

Both are valid. But the second case means that BUG_ON() is basically
_never_ valid.

              Linus
Linus Torvalds Aug. 9, 2022, 8 p.m. UTC | #14
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>

So I've read through the patch several times, and it seems fine, but
this function (and the pmd version of it) just read oddly to me.

> +static inline bool can_follow_write_pte(pte_t pte, struct page *page,
> +                                       struct vm_area_struct *vma,
> +                                       unsigned int flags)
> +{
> +       if (pte_write(pte))
> +               return true;
> +       if (!(flags & FOLL_FORCE))
> +               return false;
> +
> +       /*
> +        * See check_vma_flags(): only COW mappings need that special
> +        * "force" handling when they lack VM_WRITE.
> +        */
> +       if (vma->vm_flags & VM_WRITE)
> +               return false;
> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));

So apart from the VM_BUG_ON(), this code just looks really strange -
even despite the comment. Just conceptually, the whole "if it's
writable, return that you cannot follow it for a write" just looks so
very very strange.

That doesn't make the code _wrong_, but considering how many times
this has had subtle bugs, let's not write code that looks strange.

So I would suggest that to protect against future bugs, we try to make
it be fairly clear and straightforward, and maybe even a bit overly
protective.

For example, let's kill the "shared mapping that you don't have write
permissions to" very explicitly and without any subtle code at all.
The vm_flags tests are cheap and easy, and we could very easily just
add some core ones to make any mistakes much less critical.

Now, making that 'is_cow_mapping()' check explicit at the very top of
this would already go a long way:

        /* FOLL_FORCE for writability only affects COW mappings */
        if (!is_cow_mapping(vma->vm_flags))
                return false;

but I'd actually go even further: in this case that "is_cow_mapping()"
helper to some degree actually hides what is going on.

So I'd actually prefer for that function to be written something like

        /* If the pte is writable, we can write to the page */
        if (pte_write(pte))
                return true;

        /* Maybe FOLL_FORCE is set to override it? */
        if (flags & FOLL_FORCE)
                return false;

        /* But FOLL_FORCE has no effect on shared mappings */
        if (vma->vm_flags & MAP_SHARED)
                return false;

        /* .. or read-only private ones */
        if (!(vma->vm_flags & MAP_MAYWRITE))
                return false;

        /* .. or already writable ones that just need to take a write fault */
        if (vma->vm_flags & MAP_WRITE)
                return false;

and the two first vm_flags tests above are basically doing tat
"is_cow_mapping()", and maybe we could even have a comment to that
effect, but wouldn't it be nice to just write it out that way?

And after you've written it out like the above, now that

        if (!page || !PageAnon(page) || !PageAnonExclusive(page))
                return false;

makes you pretty safe from a data sharing perspective: it's most
definitely not a shared page at that point.

So if you write it that way, the only remaining issues are the magic
special soft-dirty and uffd ones, but at that point it's purely about
the semantics of those features, no longer about any possible "oh, we
fooled some shared page to be writable".

And I think the above is fairly legible without any subtle cases, and
the one-liner comments make it all fairly clear that it's testing.

Is any of this in any _technical_ way different from what your patch
did? No. It's literally just rewriting it to be a bit more explicit in
what it is doing, I think, and it makes that odd "it's not writable if
VM_WRITE is set" case a bit more explicit.

Hmm?

                  Linus
David Hildenbrand Aug. 9, 2022, 8:06 p.m. UTC | #15
On 09.08.22 22:00, Linus Torvalds wrote:
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
> 
> So I've read through the patch several times, and it seems fine, but
> this function (and the pmd version of it) just read oddly to me.
> 
>> +static inline bool can_follow_write_pte(pte_t pte, struct page *page,
>> +                                       struct vm_area_struct *vma,
>> +                                       unsigned int flags)
>> +{
>> +       if (pte_write(pte))
>> +               return true;
>> +       if (!(flags & FOLL_FORCE))
>> +               return false;
>> +
>> +       /*
>> +        * See check_vma_flags(): only COW mappings need that special
>> +        * "force" handling when they lack VM_WRITE.
>> +        */
>> +       if (vma->vm_flags & VM_WRITE)
>> +               return false;
>> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
> 
> So apart from the VM_BUG_ON(), this code just looks really strange -
> even despite the comment. Just conceptually, the whole "if it's
> writable, return that you cannot follow it for a write" just looks so
> very very strange.
> 
> That doesn't make the code _wrong_, but considering how many times
> this has had subtle bugs, let's not write code that looks strange.
> 
> So I would suggest that to protect against future bugs, we try to make
> it be fairly clear and straightforward, and maybe even a bit overly
> protective.
> 
> For example, let's kill the "shared mapping that you don't have write
> permissions to" very explicitly and without any subtle code at all.
> The vm_flags tests are cheap and easy, and we could very easily just
> add some core ones to make any mistakes much less critical.
> 
> Now, making that 'is_cow_mapping()' check explicit at the very top of
> this would already go a long way:
> 
>         /* FOLL_FORCE for writability only affects COW mappings */
>         if (!is_cow_mapping(vma->vm_flags))
>                 return false;

I actually put the is_cow_mapping() mapping check in there because
check_vma_flags() should make sure that we cannot possibly end up here
in that case. But we can spell it out with comments, doesn't hurt.

> 
> but I'd actually go even further: in this case that "is_cow_mapping()"
> helper to some degree actually hides what is going on.
> 
> So I'd actually prefer for that function to be written something like
> 
>         /* If the pte is writable, we can write to the page */
>         if (pte_write(pte))
>                 return true;
> 
>         /* Maybe FOLL_FORCE is set to override it? */
>         if (flags & FOLL_FORCE)
>                 return false;
> 
>         /* But FOLL_FORCE has no effect on shared mappings */
>         if (vma->vm_flags & MAP_SHARED)
>                 return false;
> 
>         /* .. or read-only private ones */
>         if (!(vma->vm_flags & MAP_MAYWRITE))
>                 return false;
> 
>         /* .. or already writable ones that just need to take a write fault */
>         if (vma->vm_flags & MAP_WRITE)
>                 return false;
> 
> and the two first vm_flags tests above are basically doing tat
> "is_cow_mapping()", and maybe we could even have a comment to that
> effect, but wouldn't it be nice to just write it out that way?
> 
> And after you've written it out like the above, now that
> 
>         if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>                 return false;
> 
> makes you pretty safe from a data sharing perspective: it's most
> definitely not a shared page at that point.
> 
> So if you write it that way, the only remaining issues are the magic
> special soft-dirty and uffd ones, but at that point it's purely about
> the semantics of those features, no longer about any possible "oh, we
> fooled some shared page to be writable".
> 
> And I think the above is fairly legible without any subtle cases, and
> the one-liner comments make it all fairly clear that it's testing.
> 
> Is any of this in any _technical_ way different from what your patch
> did? No. It's literally just rewriting it to be a bit more explicit in
> what it is doing, I think, and it makes that odd "it's not writable if
> VM_WRITE is set" case a bit more explicit.
> 
> Hmm?

No strong opinion. I'm happy as long as it's fixed, and the fix is robust.
David Hildenbrand Aug. 9, 2022, 8:07 p.m. UTC | #16
On 09.08.22 22:00, Linus Torvalds wrote:
> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
> 
> So I've read through the patch several times, and it seems fine, but
> this function (and the pmd version of it) just read oddly to me.
> 
>> +static inline bool can_follow_write_pte(pte_t pte, struct page *page,
>> +                                       struct vm_area_struct *vma,
>> +                                       unsigned int flags)
>> +{
>> +       if (pte_write(pte))
>> +               return true;
>> +       if (!(flags & FOLL_FORCE))
>> +               return false;
>> +
>> +       /*
>> +        * See check_vma_flags(): only COW mappings need that special
>> +        * "force" handling when they lack VM_WRITE.
>> +        */
>> +       if (vma->vm_flags & VM_WRITE)
>> +               return false;
>> +       VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
> 
> So apart from the VM_BUG_ON(), this code just looks really strange -
> even despite the comment. Just conceptually, the whole "if it's
> writable, return that you cannot follow it for a write" just looks so
> very very strange.
> 
> That doesn't make the code _wrong_, but considering how many times
> this has had subtle bugs, let's not write code that looks strange.
> 
> So I would suggest that to protect against future bugs, we try to make
> it be fairly clear and straightforward, and maybe even a bit overly
> protective.
> 
> For example, let's kill the "shared mapping that you don't have write
> permissions to" very explicitly and without any subtle code at all.
> The vm_flags tests are cheap and easy, and we could very easily just
> add some core ones to make any mistakes much less critical.
> 
> Now, making that 'is_cow_mapping()' check explicit at the very top of
> this would already go a long way:
> 
>         /* FOLL_FORCE for writability only affects COW mappings */
>         if (!is_cow_mapping(vma->vm_flags))
>                 return false;
> 
> but I'd actually go even further: in this case that "is_cow_mapping()"
> helper to some degree actually hides what is going on.
> 
> So I'd actually prefer for that function to be written something like
> 
>         /* If the pte is writable, we can write to the page */
>         if (pte_write(pte))
>                 return true;
> 
>         /* Maybe FOLL_FORCE is set to override it? */
>         if (flags & FOLL_FORCE)
>                 return false;
> 
>         /* But FOLL_FORCE has no effect on shared mappings */
>         if (vma->vm_flags & MAP_SHARED)
>                 return false;

I'd actually rather check for MAP_MAYSHARE here, which is even stronger.
Thoughts?
Linus Torvalds Aug. 9, 2022, 8:14 p.m. UTC | #17
On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand <david@redhat.com> wrote:
>
> >         /* But FOLL_FORCE has no effect on shared mappings */
> >         if (vma->vm_flags & MAP_SHARED)
> >                 return false;
>
> I'd actually rather check for MAP_MAYSHARE here, which is even stronger.
> Thoughts?

Hmm. Adding the test for both is basically free (all those vm_flags
checks end up being a bit mask and compare), so no objections.

For some reason I though VM_SHARED and VM_MAYSHARE end up always being
the same bits, and it was a mistake to make them two bits in the first
place (unlike the read/write/exec bits that can are about mprotect),

But as there are two bits, I'm sure somebody ends up touching one and
not the other.

So yeah, I don't see any downside to just checking both bits.

[ That said, is_cow_mapping() only checks VM_SHARED, so if they are
ever different, that's a potential source of confusion ]

               Linus
David Hildenbrand Aug. 9, 2022, 8:20 p.m. UTC | #18
On 09.08.22 22:14, Linus Torvalds wrote:
> On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>         /* But FOLL_FORCE has no effect on shared mappings */
>>>         if (vma->vm_flags & MAP_SHARED)
>>>                 return false;
>>
>> I'd actually rather check for MAP_MAYSHARE here, which is even stronger.
>> Thoughts?
> 
> Hmm. Adding the test for both is basically free (all those vm_flags
> checks end up being a bit mask and compare), so no objections.
> 
> For some reason I though VM_SHARED and VM_MAYSHARE end up always being
> the same bits, and it was a mistake to make them two bits in the first
> place (unlike the read/write/exec bits that can are about mprotect),
> 
> But as there are two bits, I'm sure somebody ends up touching one and
> not the other.
> 
> So yeah, I don't see any downside to just checking both bits.
> 
> [ That said, is_cow_mapping() only checks VM_SHARED, so if they are
> ever different, that's a potential source of confusion ]

IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file
mappings we only set VM_SHARED if the file allows for writes (and we can
set VM_MAYWRITE or even VM_WRITE). [don't ask me why, it's a steady
source of confusion]

That's why is_cow_mapping() works even due to the weird MAP_SHARED vs.
VM_MAYSHARE logic.


I'd actually only check for VM_MAYSHARE here, which implies MAP_SHARED.
If someone would ever mess that up, I guess we'd be in bigger trouble.

But whatever you prefer.
Linus Torvalds Aug. 9, 2022, 8:20 p.m. UTC | #19
On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But as there are two bits, I'm sure somebody ends up touching one and
> not the other.

Ugh. The nommu code certainly does odd things here. That just looks wrong.

And the hugetlb code does something horrible too, but never *sets* it,
so it looks like some odd subset of VM_SHARED.

                 Linus
David Hildenbrand Aug. 9, 2022, 8:23 p.m. UTC | #20
On 09.08.22 22:20, Linus Torvalds wrote:
> On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> But as there are two bits, I'm sure somebody ends up touching one and
>> not the other.
> 
> Ugh. The nommu code certainly does odd things here. That just looks wrong.
> 
> And the hugetlb code does something horrible too, but never *sets* it,
> so it looks like some odd subset of VM_SHARED.

mm/mmap.c:do_mmap() contains the magic bit

switch (flags & MAP_TYPE) {
case MAP_SHARED:
...
case MAP_SHARED_VALIDATE:
...
vm_flags |= VM_SHARED | VM_MAYSHARE;
if (!(file->f_mode & FMODE_WRITE))
	vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
fallthrough;


VM_SHARED semantics are confusing.
Linus Torvalds Aug. 9, 2022, 8:30 p.m. UTC | #21
On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
>
> IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file
> mappings we only set VM_SHARED if the file allows for writes

Heh.

This is a horrific hack, and probably should go away.

Yeah, we have that

                        if (!(file->f_mode & FMODE_WRITE))
                                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);


but I think that's _entirely_ historical.

Long long ago, in a galaxy far away, we didn't handle shared mmap()
very well. In fact, we used to not handle it at all.

But nntpd would use write() to update the spool file, adn them read it
through a shared mmap.

And since our mmap() *was* coherent with people doing write() system
calls, but didn't handle actual dirty shared mmap, what Linux used to
do was to just say "Oh, you want a read-only shared file mmap? I can
do that - I'll just downgrade it to a read-only _private_ mapping, and
it actually ends up with the same semantics".

And here we are, 30 years later, and it still does that, but it leaves
the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a
shared mapping.

                 Linus
Linus Torvalds Aug. 9, 2022, 8:38 p.m. UTC | #22
On Tue, Aug 9, 2022 at 1:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And here we are, 30 years later, and it still does that, but it leaves
> the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a
> shared mapping.

.. thinking about it, we end up still having some things that this helps.

For example, because we clear the VM_SHARED flags for read-only shared
mappings, they don't end up going through mapping_{un}map_writable(),
and don't update i_mmap_writable, and don't cause issues with
mapping_deny_writable() or mapping_writably_mapped().

So it ends up actually having random small semantic details due to
those almost three decades of history.

I'm sure there are other odd pieces like that.

                  Linus
David Hildenbrand Aug. 9, 2022, 8:42 p.m. UTC | #23
On 09.08.22 22:30, Linus Torvalds wrote:
> On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file
>> mappings we only set VM_SHARED if the file allows for writes
> 
> Heh.
> 
> This is a horrific hack, and probably should go away.
> 
> Yeah, we have that
> 
>                         if (!(file->f_mode & FMODE_WRITE))
>                                 vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
> 
> 
> but I think that's _entirely_ historical.
> 
> Long long ago, in a galaxy far away, we didn't handle shared mmap()
> very well. In fact, we used to not handle it at all.
> 
> But nntpd would use write() to update the spool file, adn them read it
> through a shared mmap.
> 
> And since our mmap() *was* coherent with people doing write() system
> calls, but didn't handle actual dirty shared mmap, what Linux used to
> do was to just say "Oh, you want a read-only shared file mmap? I can
> do that - I'll just downgrade it to a read-only _private_ mapping, and
> it actually ends up with the same semantics".
> 
> And here we are, 30 years later, and it still does that, but it leaves
> the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a
> shared mapping.

I was suspecting that this code is full of legacy :)

What would make sense to me is to just have VM_SHARED and make it
correspond to MAP_SHARED, that would at least confuse me less. Once I
have some spare cycles I'll see how easy that might be to achieve.
David Laight Aug. 9, 2022, 9:16 p.m. UTC | #24
From: Jason Gunthorpe
> Sent: 09 August 2022 20:08
> 
> On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote:
> 
> > But as a very good approximation, the rule is "absolutely no new
> > BUG_ON() calls _ever_". Because I really cannot see a single case
> > where "proper error handling and WARN_ON_ONCE()" isn't the right
> > thing.
> 
> Parallel to this discussion I've had ones where people more or less
> say
> 
>  Since BUG_ON crashes the machine and Linus says that crashing the
>  machine is bad, WARN_ON will also crash the machine if you set the
>  panic_on_warn parameter, so it is also bad, thus we shouldn't use
>  anything.
> 
> I've generally maintained that people who set the panic_on_warn *want*
> these crashes, because that is the entire point of it. So we should
> use WARN_ON with an error recovery for "can't happen" assertions like
> these. I think it is what you are saying here.

They don't necessarily want the crashes, it is more the people who
built the distribution think they want the crashes.

I have had issues with a customer system (with our drivers) randomly
locking up.
Someone had decided that 'PANIC_ON_OOPS' was a good idea but hadn't
enabled anything to actually take the dump.

So instead of a diagnosable problem (and a 'doh' moment) you
get several weeks of head scratching and a very annoyed user.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 18e01474cf6b..2222ed598112 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2885,7 +2885,6 @@  struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
diff --git a/mm/gup.c b/mm/gup.c
index 732825157430..7a0b207f566f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -478,14 +478,34 @@  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 	return -EEXIST;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
-{
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
+{
+	if (pte_write(pte))
+		return true;
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/*
+	 * See check_vma_flags(): only COW mappings need that special
+	 * "force" handling when they lack VM_WRITE.
+	 */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+	VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page and a write-fault
+	 * isn't require for other reasons.
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
+		return false;
+	return !userfaultfd_pte_wp(vma, pte);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -528,12 +548,19 @@  static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
-		pte_unmap_unlock(ptep, ptl);
-		return NULL;
-	}
 
 	page = vm_normal_page(vma, address, pte);
+
+	/*
+	 * We only care about anon pages in can_follow_write_pte() and don't
+	 * have to worry about pte_devmap() because they are never anon.
+	 */
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pte(pte, page, vma, flags)) {
+		page = NULL;
+		goto out;
+	}
+
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -986,17 +1013,6 @@  static int faultin_page(struct vm_area_struct *vma,
 		return -EBUSY;
 	}
 
-	/*
-	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
-	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
-	 * can thus safely do subsequent page lookups as if they were reads.
-	 * But only do so when looping for pte_write is futile: in some cases
-	 * userspace may also be wanting to write to the gotten user page,
-	 * which a read fault here might prevent (a readonly page might get
-	 * reCOWed by userspace write).
-	 */
-	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
-		*flags |= FOLL_COW;
 	return 0;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..352b5220e95e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1040,12 +1040,6 @@  struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	/*
-	 * When we COW a devmap PMD entry, we split it into PTEs, so we should
-	 * not be in this function with `flags & FOLL_COW` set.
-	 */
-	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
-
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
@@ -1395,14 +1389,23 @@  vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	return VM_FAULT_FALLBACK;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+/* See can_follow_write_pte() on FOLL_FORCE details. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	if (pmd_write(pmd))
+		return true;
+	if (!(flags & FOLL_FORCE))
+		return false;
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+	VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+	if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+		return false;
+	return !userfaultfd_huge_pmd_wp(vma, pmd);
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1411,12 +1414,16 @@  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct page *page = NULL;
+	struct page *page;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
-		goto out;
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pmd(*pmd, page, vma, flags))
+		return NULL;
 
 	/* Avoid dumping huge zero page */
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
@@ -1424,10 +1431,7 @@  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
-		goto out;
-
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+		return NULL;
 
 	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
 		return ERR_PTR(-EMLINK);
@@ -1444,7 +1448,6 @@  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
 
-out:
 	return page;
 }