diff mbox series

[v3,3/4] mm: don't expose non-hugetlb page to fast gup prematurely

Message ID 20190924232459.214097-3-yuzhao@google.com (mailing list archive)
State New, archived
Series [v3,1/4] mm: remove unnecessary smp_wmb() in collapse_huge_page()

Commit Message

Yu Zhao Sept. 24, 2019, 11:24 p.m. UTC
We don't want to expose a non-hugetlb page to the fast gup running
on a remote CPU before all local non-atomic ops on the page flags
are visible first.

For an anon page that isn't in swap cache, we need to make sure all
prior non-atomic ops, especially __SetPageSwapBacked() in
page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
the following race:

	CPU 1				CPU 2
	set_pte_at()			get_user_pages_fast()
	  page_add_new_anon_rmap()	  gup_pte_range()
	  __SetPageSwapBacked()		    SetPageReferenced()

This demonstrates a non-fatal scenario. Though they haven't been
directly observed, fatal ones can exist, e.g., PG_locked set by the
fast gup caller and then overwritten by __SetPageSwapBacked().

For an anon page that is already in swap cache or a file page, we
don't need smp_wmb() before set_pte_at() because adding the page to
the swap or file cache serves as a valid write barrier. Using
non-atomic ops thereafter is a bug, obviously.
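
To illustrate the pairing, here is a minimal sketch (simplified, not the
exact code at any call site; the reader side stands in for gup_pte_range()):

	/* writer (page fault path) */
	page_add_new_anon_rmap(page, vma, addr, false);	/* __SetPageSwapBacked() inside */
	smp_wmb();				/* order non-atomic flag updates... */
	set_pte_at(mm, addr, ptep, pte);	/* ...before the PTE becomes visible */

	/* reader (fast gup) */
	pte = READ_ONCE(*ptep);			/* may observe the new PTE */
	page = pte_page(pte);
	SetPageReferenced(page);		/* atomic RMW on page flags */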

smp_wmb() is added following 11 of the 12 page_add_new_anon_rmap()
call sites, the only exception being do_huge_pmd_wp_page_fallback(),
which already has an smp_wmb().

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 kernel/events/uprobes.c |  2 ++
 mm/huge_memory.c        |  6 ++++++
 mm/khugepaged.c         |  2 ++
 mm/memory.c             | 10 +++++++++-
 mm/migrate.c            |  2 ++
 mm/swapfile.c           |  6 ++++--
 mm/userfaultfd.c        |  2 ++
 7 files changed, 27 insertions(+), 3 deletions(-)

Comments

Peter Zijlstra Sept. 25, 2019, 8:25 a.m. UTC | #1
On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> We don't want to expose a non-hugetlb page to the fast gup running
> on a remote CPU before all local non-atomic ops on the page flags
> are visible first.
> 
> For an anon page that isn't in swap cache, we need to make sure all
> prior non-atomic ops, especially __SetPageSwapBacked() in
> page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> the following race:
> 
> 	CPU 1				CPU1
> 	set_pte_at()			get_user_pages_fast()
> 	  page_add_new_anon_rmap()	  gup_pte_range()
> 	  __SetPageSwapBacked()		    SetPageReferenced()
> 
> This demonstrates a non-fatal scenario. Though haven't been directly
> observed, the fatal ones can exist, e.g., PG_lock set by fast gup
> caller and then overwritten by __SetPageSwapBacked().
> 
> For an anon page that is already in swap cache or a file page, we
> don't need smp_wmb() before set_pte_at() because adding to swap or
> file cach serves as a valid write barrier. Using non-atomic ops
> thereafter is a bug, obviously.
> 
> smp_wmb() is added following 11 of total 12 page_add_new_anon_rmap()
> call sites, with the only exception being
> do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> 

I'm thinking this patch makes stuff rather fragile.. Should we instead
stick the barrier in set_p*d_at()? Or rather, make that store a
store-release?
Yu Zhao Sept. 25, 2019, 10:26 p.m. UTC | #2
On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> > We don't want to expose a non-hugetlb page to the fast gup running
> > on a remote CPU before all local non-atomic ops on the page flags
> > are visible first.
> > 
> > For an anon page that isn't in swap cache, we need to make sure all
> > prior non-atomic ops, especially __SetPageSwapBacked() in
> > page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> > the following race:
> > 
> > 	CPU 1				CPU1
> > 	set_pte_at()			get_user_pages_fast()
> > 	  page_add_new_anon_rmap()	  gup_pte_range()
> > 	  __SetPageSwapBacked()		    SetPageReferenced()
> > 
> > This demonstrates a non-fatal scenario. Though haven't been directly
> > observed, the fatal ones can exist, e.g., PG_lock set by fast gup
> > caller and then overwritten by __SetPageSwapBacked().
> > 
> > For an anon page that is already in swap cache or a file page, we
> > don't need smp_wmb() before set_pte_at() because adding to swap or
> > file cach serves as a valid write barrier. Using non-atomic ops
> > thereafter is a bug, obviously.
> > 
> > smp_wmb() is added following 11 of total 12 page_add_new_anon_rmap()
> > call sites, with the only exception being
> > do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> > 
> 
> I'm thinking this patch make stuff rather fragile.. Should we instead
> stick the barrier in set_p*d_at() instead? Or rather, make that store a
> store-release?

I prefer it this way too, but I suspected the majority would be
concerned with the performance implications, especially those
looping set_pte_at()s in mm/huge_memory.c.

If we have a consensus that smp_wmb() would be better off wrapped
together with set_p*d_at() in a macro, I'd be glad to make the change.

And it seems to me applying smp_store_release() would have to happen
in each arch, which might be just as fragile.
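
For the record, on an arch where pte_t wraps a single word (e.g. x86_64,
where the member happens to be named 'pte'), the store-release variant
might look roughly like the sketch below; this is hypothetical and ignores
multi-word PTE configurations such as 32-bit PAE:

	static inline void set_pte_release(pte_t *ptep, pte_t pte)
	{
		/*
		 * Release: all prior stores (e.g. non-atomic page flag
		 * updates) are visible to any CPU that observes the new PTE.
		 */
		smp_store_release(&ptep->pte, pte.pte);
	}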
Kirill A . Shutemov Sept. 26, 2019, 10:20 a.m. UTC | #3
On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> > > We don't want to expose a non-hugetlb page to the fast gup running
> > > on a remote CPU before all local non-atomic ops on the page flags
> > > are visible first.
> > > 
> > > For an anon page that isn't in swap cache, we need to make sure all
> > > prior non-atomic ops, especially __SetPageSwapBacked() in
> > > page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> > > the following race:
> > > 
> > > 	CPU 1				CPU1
> > > 	set_pte_at()			get_user_pages_fast()
> > > 	  page_add_new_anon_rmap()	  gup_pte_range()
> > > 	  __SetPageSwapBacked()		    SetPageReferenced()
> > > 
> > > This demonstrates a non-fatal scenario. Though haven't been directly
> > > observed, the fatal ones can exist, e.g., PG_lock set by fast gup
> > > caller and then overwritten by __SetPageSwapBacked().
> > > 
> > > For an anon page that is already in swap cache or a file page, we
> > > don't need smp_wmb() before set_pte_at() because adding to swap or
> > > file cach serves as a valid write barrier. Using non-atomic ops
> > > thereafter is a bug, obviously.
> > > 
> > > smp_wmb() is added following 11 of total 12 page_add_new_anon_rmap()
> > > call sites, with the only exception being
> > > do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> > > 
> > 
> > I'm thinking this patch make stuff rather fragile.. Should we instead
> > stick the barrier in set_p*d_at() instead? Or rather, make that store a
> > store-release?
> 
> I prefer it this way too, but I suspected the majority would be
> concerned with the performance implications, especially those
> looping set_pte_at()s in mm/huge_memory.c.

We can rename the current set_pte_at() to __set_pte_at() or something and
leave it in places where the barrier is not needed. The new set_pte_at() will
be used in the rest of the places, with the barrier inside.
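
As a rough sketch of that idea (hypothetical naming and placement; some
architectures already have a __set_pte_at() internally, so the real change
would need per-arch care):

	/* raw store, no ordering against earlier page flag updates */
	static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
					pte_t *ptep, pte_t pte)
	{
		/* existing arch-specific PTE store */
	}

	/* publishing variant for paths that expose a new page to fast gup */
	static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pte_t pte)
	{
		smp_wmb();	/* commit non-atomic ops before the PTE store */
		__set_pte_at(mm, addr, ptep, pte);
	}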

BTW, have you looked at the other levels of the page table hierarchy? Do we
have the same issue for PMD/PUD/... pages?
John Hubbard Sept. 27, 2019, 3:26 a.m. UTC | #4
On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
>> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
...
>>> I'm thinking this patch make stuff rather fragile.. Should we instead
>>> stick the barrier in set_p*d_at() instead? Or rather, make that store a
>>> store-release?
>>
>> I prefer it this way too, but I suspected the majority would be
>> concerned with the performance implications, especially those
>> looping set_pte_at()s in mm/huge_memory.c.
> 
> We can rename current set_pte_at() to __set_pte_at() or something and
> leave it in places where barrier is not needed. The new set_pte_at()( will
> be used in the rest of the places with the barrier inside.

+1, sounds nice. I was unhappy about the wide-ranging changes that would have
to be maintained. So this seems much better.

> 
> BTW, have you looked at other levels of page table hierarchy. Do we have
> the same issue for PMD/PUD/... pages?
> 

Along the lines of "what other memory barriers might be missing for
get_user_pages_fast(), I'm also concerned that the synchronization between
get_user_pages_fast() and freeing the page tables might be technically broken,
due to missing memory barriers on the get_user_pages_fast() side. Details:

gup_fast() disables interrupts, but I think it also needs some sort of
memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
etc) from speculatively happening before the interrupts are disabled. 

Leonardo Bras's recent patchset brought this to my attention. Here, he's
recommending adding atomic counting inc/dec before and after the gup_fast()
irq disable/enable points:

   https://lore.kernel.org/r/20190920195047.7703-4-leonardo@linux.ibm.com

...and that led to noticing a general lack of barriers there.

thanks,
Yu Zhao Sept. 27, 2019, 5:06 a.m. UTC | #5
On Thu, Sep 26, 2019 at 08:26:46PM -0700, John Hubbard wrote:
> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> > On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
> >> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> >>> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> ...
> >>> I'm thinking this patch make stuff rather fragile.. Should we instead
> >>> stick the barrier in set_p*d_at() instead? Or rather, make that store a
> >>> store-release?
> >>
> >> I prefer it this way too, but I suspected the majority would be
> >> concerned with the performance implications, especially those
> >> looping set_pte_at()s in mm/huge_memory.c.
> > 
> > We can rename current set_pte_at() to __set_pte_at() or something and
> > leave it in places where barrier is not needed. The new set_pte_at()( will
> > be used in the rest of the places with the barrier inside.
> 
> +1, sounds nice. I was unhappy about the wide-ranging changes that would have
> to be maintained. So this seems much better.

Just to be clear, doing so will add unnecessary barriers to one
of the two paths that share set_pte_at().

> > BTW, have you looked at other levels of page table hierarchy. Do we have
> > the same issue for PMD/PUD/... pages?
> > 
> 
> Along the lines of "what other memory barriers might be missing for
> get_user_pages_fast(), I'm also concerned that the synchronization between
> get_user_pages_fast() and freeing the page tables might be technically broken,
> due to missing memory barriers on the get_user_pages_fast() side. Details:
> 
> gup_fast() disables interrupts, but I think it also needs some sort of
> memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
> etc) from speculatively happening before the interrupts are disabled. 

I was under the impression that switching back from interrupt context is a
full barrier (otherwise wouldn't we be vulnerable to some side-channel
attacks?), so the reader side wouldn't need an explicit rmb.
Michal Hocko Sept. 27, 2019, 12:33 p.m. UTC | #6
On Thu 26-09-19 20:26:46, John Hubbard wrote:
> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> > BTW, have you looked at other levels of page table hierarchy. Do we have
> > the same issue for PMD/PUD/... pages?
> > 
> 
> Along the lines of "what other memory barriers might be missing for
> get_user_pages_fast(), I'm also concerned that the synchronization between
> get_user_pages_fast() and freeing the page tables might be technically broken,
> due to missing memory barriers on the get_user_pages_fast() side. Details:
> 
> gup_fast() disables interrupts, but I think it also needs some sort of
> memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
> etc) from speculatively happening before the interrupts are disabled. 

Could you be more specific about the race scenario please? I thought
that the unmap path will be serialized by the pte lock.
Yu Zhao Sept. 27, 2019, 6:31 p.m. UTC | #7
On Fri, Sep 27, 2019 at 02:33:00PM +0200, Michal Hocko wrote:
> On Thu 26-09-19 20:26:46, John Hubbard wrote:
> > On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> > > BTW, have you looked at other levels of page table hierarchy. Do we have
> > > the same issue for PMD/PUD/... pages?
> > > 
> > 
> > Along the lines of "what other memory barriers might be missing for
> > get_user_pages_fast(), I'm also concerned that the synchronization between
> > get_user_pages_fast() and freeing the page tables might be technically broken,
> > due to missing memory barriers on the get_user_pages_fast() side. Details:
> > 
> > gup_fast() disables interrupts, but I think it also needs some sort of
> > memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
> > etc) from speculatively happening before the interrupts are disabled. 
> 
> Could you be more specific about the race scenario please? I thought
> that the unmap path will be serialized by the pte lock.

Yes, the unmap path is protected by ptl, but the fast gup isn't.
Please correct me if I'm wrong, John. This is the hypothetical race:

CPU 1 (gup)				CPU 2 (zap)
speculatively load a pmd val
					zap the pte table pointed by the pmd val
					flush tlb by ipi
<handle ipi>
					free the pte table
local_irq_disable()
use the stale pmd val
use-after-free the pte table
local_irq_enable()

I don't think it would happen because the interrupt context on CPU 1
would act as a full mb and enforce a reload of the pmd val. But I'm
not entirely sure.
John Hubbard Sept. 27, 2019, 7:31 p.m. UTC | #8
On 9/27/19 5:33 AM, Michal Hocko wrote:
> On Thu 26-09-19 20:26:46, John Hubbard wrote:
>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
>>> BTW, have you looked at other levels of page table hierarchy. Do we have
>>> the same issue for PMD/PUD/... pages?
>>>
>>
>> Along the lines of "what other memory barriers might be missing for
>> get_user_pages_fast(), I'm also concerned that the synchronization between
>> get_user_pages_fast() and freeing the page tables might be technically broken,
>> due to missing memory barriers on the get_user_pages_fast() side. Details:
>>
>> gup_fast() disables interrupts, but I think it also needs some sort of
>> memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
>> etc) from speculatively happening before the interrupts are disabled.
> 
> Could you be more specific about the race scenario please? I thought
> that the unmap path will be serialized by the pte lock.
> 

I don't see a pte lock anywhere here.

This case is really pretty far out there, but without run-time memory barriers
I don't think we can completely rule it out:

CPU 0                              CPU 1
--------                           ---------------
                                    get_user_pages_fast()

do_unmap()
  unmap_region()
   free_pgtables()
                                    /*
                                     * speculative reads, re-ordered
                                     * by the CPU at run time, not
                                     * compile time. The function calls
                                     * are not really in this order, but
                                     * the corresponding reads could be.
                                     */
                                    gup_pgd_range()
                                     gup_p4d_range()
                                      gup_pud_range()
                                       gup_pmd_range()
                                        pmd = READ_ONCE(*pmdp)
                                         /* This is stale */

   tlb_finish_mmu()
     tlb_flush_mmu()
      tlb_flush_mmu_tlbonly()
        tlb_flush()
         flush_tlb_mm
          flush_tlb_mm_range
           flush_tlb_others
            native_flush_tlb_others
             smp_call_function_many: IPIs
              ...blocks until CPU1 reenables
                 interrupts

                                         local_irq_disable()
                                          ...continue the page walk based
                                             on stale/freed data...

thanks,
John Hubbard Sept. 29, 2019, 10:47 p.m. UTC | #9
On 9/27/19 12:31 PM, John Hubbard wrote:
> On 9/27/19 5:33 AM, Michal Hocko wrote:
>> On Thu 26-09-19 20:26:46, John Hubbard wrote:
>>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
>>>> BTW, have you looked at other levels of page table hierarchy. Do we have
>>>> the same issue for PMD/PUD/... pages?
>>>>
>>>
>>> Along the lines of "what other memory barriers might be missing for
>>> get_user_pages_fast(), I'm also concerned that the synchronization between
>>> get_user_pages_fast() and freeing the page tables might be technically broken,
>>> due to missing memory barriers on the get_user_pages_fast() side. Details:
>>>
>>> gup_fast() disables interrupts, but I think it also needs some sort of
>>> memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
>>> etc) from speculatively happening before the interrupts are disabled.
>>
>> Could you be more specific about the race scenario please? I thought
>> that the unmap path will be serialized by the pte lock.
>>
> 
> I don't see a pte lock anywhere here.
> 
> This case is really pretty far out there, but without run-time memory barriers
> I don't think we can completely rule it out:
> 
> CPU 0                              CPU 1
> --------                           ---------------
>                                     get_user_pages_fast()
> 
> do_unmap()
>   unmap_region()
>    free_pgtables()
>                                     /*
>                                      * speculative reads, re-ordered
>                                      * by the CPU at run time, not
>                                      * compile time. The function calls
>                                      * are not really in this order, but
>                                      * the corresponding reads could be.
>                                      */
>                                     gup_pgd_range()
>                                      gup_p4d_range()
>                                       gup_pud_range()
>                                        gup_pmd_range()
>                                         pmd = READ_ONCE(*pmdp)
>                                          /* This is stale */
> 
>    tlb_finish_mmu()
>      tlb_flush_mmu()
>       tlb_flush_mmu_tlbonly()
>         tlb_flush()
>          flush_tlb_mm
>           flush_tlb_mm_range
>            flush_tlb_others
>             native_flush_tlb_others
>              smp_call_function_many: IPIs
>               ...blocks until CPU1 reenables
>                  interrupts
> 
>                                          local_irq_disable()
>                                           ...continue the page walk based
>                                              on stale/freed data...
> 


...and note that I just asked Leonardo Bras (+CC) to include "the" fix[1] for that (I proposed
a snippet that attempts a fix anyway), in his upcoming feature that speeds up PPC THP
operations.

So I hope that doesn't muddy this thread too much, but the other one is awfully quiet
and I was worried that the overlapping changes might go unnoticed.

[1] https://lore.kernel.org/r/4ff1e8e8-929b-9cfc-9bf8-ee88e34de888@nvidia.com

thanks,
Jan Kara Sept. 30, 2019, 9:20 a.m. UTC | #10
On Fri 27-09-19 12:31:41, John Hubbard wrote:
> On 9/27/19 5:33 AM, Michal Hocko wrote:
> > On Thu 26-09-19 20:26:46, John Hubbard wrote:
> > > On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> > > > BTW, have you looked at other levels of page table hierarchy. Do we have
> > > > the same issue for PMD/PUD/... pages?
> > > > 
> > > 
> > > Along the lines of "what other memory barriers might be missing for
> > > get_user_pages_fast(), I'm also concerned that the synchronization between
> > > get_user_pages_fast() and freeing the page tables might be technically broken,
> > > due to missing memory barriers on the get_user_pages_fast() side. Details:
> > > 
> > > gup_fast() disables interrupts, but I think it also needs some sort of
> > > memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
> > > etc) from speculatively happening before the interrupts are disabled.
> > 
> > Could you be more specific about the race scenario please? I thought
> > that the unmap path will be serialized by the pte lock.
> > 
> 
> I don't see a pte lock anywhere here.
> 
> This case is really pretty far out there, but without run-time memory barriers
> I don't think we can completely rule it out:
> 
> CPU 0                              CPU 1
> --------                           ---------------
>                                    get_user_pages_fast()
> 
> do_unmap()
>  unmap_region()
>   free_pgtables()
>                                    /*
>                                     * speculative reads, re-ordered
>                                     * by the CPU at run time, not
>                                     * compile time. The function calls
>                                     * are not really in this order, but
>                                     * the corresponding reads could be.
>                                     */
>                                    gup_pgd_range()
>                                     gup_p4d_range()
>                                      gup_pud_range()
>                                       gup_pmd_range()
>                                        pmd = READ_ONCE(*pmdp)
>                                         /* This is stale */
> 
>   tlb_finish_mmu()
>     tlb_flush_mmu()
>      tlb_flush_mmu_tlbonly()
>        tlb_flush()
>         flush_tlb_mm
>          flush_tlb_mm_range
>           flush_tlb_others
>            native_flush_tlb_others
>             smp_call_function_many: IPIs
>              ...blocks until CPU1 reenables
>                 interrupts
> 
>                                         local_irq_disable()
>                                          ...continue the page walk based
>                                             on stale/freed data...

Yes, but then we have:

                head = try_get_compound_head(page, 1);

which has an atomic operation providing barrier in it and then we have:

                if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                        put_page(head);
                        goto pte_unmap;
                }

So we reload PTE again and check the value didn't change. Which should
prevent the race you explain above. Or do I miss anything?

								Honza
John Hubbard Sept. 30, 2019, 5:57 p.m. UTC | #11
On 9/30/19 2:20 AM, Jan Kara wrote:
> On Fri 27-09-19 12:31:41, John Hubbard wrote:
>> On 9/27/19 5:33 AM, Michal Hocko wrote:
>>> On Thu 26-09-19 20:26:46, John Hubbard wrote:
>>>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
...
>> CPU 0                              CPU 1
>> --------                           ---------------
>>                                     get_user_pages_fast()
>>
>> do_unmap()
>>   unmap_region()
>>    free_pgtables()
>>                                     /*
>>                                      * speculative reads, re-ordered
>>                                      * by the CPU at run time, not
>>                                      * compile time. The function calls
>>                                      * are not really in this order, but
>>                                      * the corresponding reads could be.
>>                                      */
>>                                     gup_pgd_range()
>>                                      gup_p4d_range()
>>                                       gup_pud_range()
>>                                        gup_pmd_range()
>>                                         pmd = READ_ONCE(*pmdp)
>>                                          /* This is stale */
>>
>>    tlb_finish_mmu()
>>      tlb_flush_mmu()
>>       tlb_flush_mmu_tlbonly()
>>         tlb_flush()
>>          flush_tlb_mm
>>           flush_tlb_mm_range
>>            flush_tlb_others
>>             native_flush_tlb_others
>>              smp_call_function_many: IPIs
>>               ...blocks until CPU1 reenables
>>                  interrupts
>>
>>                                          local_irq_disable()
>>                                           ...continue the page walk based
>>                                              on stale/freed data...
> 
> Yes, but then we have:
> 
>                  head = try_get_compound_head(page, 1);
> 
> which has an atomic operation providing barrier in it and then we have:
> 
>                  if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>                          put_page(head);
>                          goto pte_unmap;
>                  }
> 
> So we reload PTE again and check the value didn't change. Which should
> prevent the race you explain above. Or do I miss anything?


Well, a couple of questions:

1. Is there *really* a memory barrier in try_get_compound_head()? Because
I only see a READ_ONCE compile barrier, which would mean that run time
reordering is still possible.

2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then,
it's already found the pte based on reading a stale pmd. So checking the
pte seems like it's checking the wrong thing--it's too late, for this case,
right?


thanks,
Jan Kara Oct. 1, 2019, 7:10 a.m. UTC | #12
On Mon 30-09-19 10:57:08, John Hubbard wrote:
> On 9/30/19 2:20 AM, Jan Kara wrote:
> > On Fri 27-09-19 12:31:41, John Hubbard wrote:
> > > On 9/27/19 5:33 AM, Michal Hocko wrote:
> > > > On Thu 26-09-19 20:26:46, John Hubbard wrote:
> > > > > On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> ...
> > > CPU 0                              CPU 1
> > > --------                           ---------------
> > >                                     get_user_pages_fast()
> > > 
> > > do_unmap()
> > >   unmap_region()
> > >    free_pgtables()
> > >                                     /*
> > >                                      * speculative reads, re-ordered
> > >                                      * by the CPU at run time, not
> > >                                      * compile time. The function calls
> > >                                      * are not really in this order, but
> > >                                      * the corresponding reads could be.
> > >                                      */
> > >                                     gup_pgd_range()
> > >                                      gup_p4d_range()
> > >                                       gup_pud_range()
> > >                                        gup_pmd_range()
> > >                                         pmd = READ_ONCE(*pmdp)
> > >                                          /* This is stale */
> > > 
> > >    tlb_finish_mmu()
> > >      tlb_flush_mmu()
> > >       tlb_flush_mmu_tlbonly()
> > >         tlb_flush()
> > >          flush_tlb_mm
> > >           flush_tlb_mm_range
> > >            flush_tlb_others
> > >             native_flush_tlb_others
> > >              smp_call_function_many: IPIs
> > >               ...blocks until CPU1 reenables
> > >                  interrupts
> > > 
> > >                                          local_irq_disable()
> > >                                           ...continue the page walk based
> > >                                              on stale/freed data...
> > 
> > Yes, but then we have:
> > 
> >                  head = try_get_compound_head(page, 1);
> > 
> > which has an atomic operation providing barrier in it and then we have:
> > 
> >                  if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> >                          put_page(head);
> >                          goto pte_unmap;
> >                  }
> > 
> > So we reload PTE again and check the value didn't change. Which should
> > prevent the race you explain above. Or do I miss anything?
> 
> 
> Well, a couple of questions:
> 
> 1. Is there *really* a memory barrier in try_get_compound_head()? Because
> I only see a READ_ONCE compile barrier, which would mean that run time
> reordering is still possible.

try_get_compound_head() has page_cache_add_speculative() which is
atomic_add_unless() which is guaranteed to provide ordering.

> 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then,
> it's already found the pte based on reading a stale pmd. So checking the
> pte seems like it's checking the wrong thing--it's too late, for this case,
> right?

Well, if PMD is getting freed, all PTEs in it should be cleared by that
time, shouldn't they? So although we read from stale PMD, either we already
see cleared PTE or the check pte_val(pte) != pte_val(*ptep) will fail and
so we never actually succeed in getting stale PTE entry (at least unless
the page table page that used to be PMD can get freed and reused
- which is not the case in the example you've shown above).

So I still don't see a problem. That being said, I don't feel like an expert
in this particular area. I just always thought GUP prevents races like this
by the scheme I describe, so I'd like to see what I'm missing :).

								Honza
Peter Zijlstra Oct. 1, 2019, 8:36 a.m. UTC | #13
On Tue, Oct 01, 2019 at 09:10:08AM +0200, Jan Kara wrote:
> On Mon 30-09-19 10:57:08, John Hubbard wrote:

> > Well, a couple of questions:
> > 
> > 1. Is there *really* a memory barrier in try_get_compound_head()? Because
> > I only see a READ_ONCE compile barrier, which would mean that run time
> > reordering is still possible.
> 
> try_get_compound_head() has page_cache_add_speculative() which is
> atomic_add_unless() which is guaranteed to provide ordering.

atomic_add_unless() only provides ordering on SUCCESS; looking at
the code in question that seems sufficient, so all should be well. Just
wanted to let you know that if atomic_add_unless() FAILs there is no
implied ordering whatsoever.
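
Schematically (not the actual gup code, just the semantics):

	if (atomic_add_unless(&page->_refcount, 1, 0)) {
		/* succeeded: fully ordered, the later PTE recheck is ordered after it */
	} else {
		/* failed: no ordering implied at all */
	}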
Jan Kara Oct. 1, 2019, 8:40 a.m. UTC | #14
On Tue 01-10-19 10:36:04, Peter Zijlstra wrote:
> On Tue, Oct 01, 2019 at 09:10:08AM +0200, Jan Kara wrote:
> > On Mon 30-09-19 10:57:08, John Hubbard wrote:
> 
> > > Well, a couple of questions:
> > > 
> > > 1. Is there *really* a memory barrier in try_get_compound_head()? Because
> > > I only see a READ_ONCE compile barrier, which would mean that run time
> > > reordering is still possible.
> > 
> > try_get_compound_head() has page_cache_add_speculative() which is
> > atomic_add_unless() which is guaranteed to provide ordering.
> 
> atomic_add_unless() only provides ordering on SUCCESS, now looking at
> the code in question that seems sufficient, so all should be well. Just
> wanted to let you know that if atomic_add_unless() FAILs there is no
> implied ordering what so ever.

Yeah, right, thanks for correction!

								Honza
John Hubbard Oct. 1, 2019, 6:43 p.m. UTC | #15
On 10/1/19 12:10 AM, Jan Kara wrote:
> On Mon 30-09-19 10:57:08, John Hubbard wrote:
>> On 9/30/19 2:20 AM, Jan Kara wrote:
>>> On Fri 27-09-19 12:31:41, John Hubbard wrote:
>>>> On 9/27/19 5:33 AM, Michal Hocko wrote:
>>>>> On Thu 26-09-19 20:26:46, John Hubbard wrote:
>>>>>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
>> ...
>> 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then,
>> it's already found the pte based on reading a stale pmd. So checking the
>> pte seems like it's checking the wrong thing--it's too late, for this case,
>> right?
> 
> Well, if PMD is getting freed, all PTEs in it should be cleared by that
> time, shouldn't they? So although we read from stale PMD, either we already
> see cleared PTE or the check pte_val(pte) != pte_val(*ptep) will fail and
> so we never actually succeed in getting stale PTE entry (at least unless
> the page table page that used to be PMD can get freed and reused
> - which is not the case in the example you've shown above).
>

Right, that's not what the example shows, but there is nothing here to prevent
the page table pages from being freed and re-used.

 
> So I still don't see a problem. That being said I don't feel being expert
> in this particular area. I just always thought GUP prevents races like this
> by the scheme I describe so I'd like to see what I'm missing :).
> 

I'm very much still "in training" here, so I hope I'm not wasting everyone's 
time. But I feel confident in stating at least this much:

There are two distinct lockless synchronization mechanisms here, each protecting
against a different issue, and it's important not to conflate them and think that
one protects against the other. I still see a hole in (2) below. The mechanisms 
are:

1) Protection against a page (not the page table itself) getting freed
while get_user_pages*() is trying to take a reference to it. This is avoided
by the try-get and the re-checking of pte values that you mention above.
It's an elegant little thing, too. :)

2) Protection against page tables getting freed while a get_user_pages_fast()
call is in progress. This relies on disabling interrupts in gup_fast(), while
firing interrupts in the freeing path (via tlb flushing, IPIs). And on memory
barriers (doh--missing!) to avoid leaking memory references outside of the
irq disabling. 

This one has a problem, because Documentation/memory-barriers.txt points out:

INTERRUPT DISABLING FUNCTIONS
-----------------------------

Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
barriers are required in such a situation, they must be provided from some
other means.


...and so I'm suggesting that we need something approximately like this:

diff --git a/mm/gup.c b/mm/gup.c
index 23a9f9c9d377..1678d50a2d8b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2415,7 +2415,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
        if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
            gup_fast_permitted(start, end)) {
                local_irq_disable();
+               smp_mb();
                gup_pgd_range(addr, end, gup_flags, pages, &nr);
+               smp_mb();
                local_irq_enable();
                ret = nr;


thanks,
John Hubbard Oct. 1, 2019, 10:31 p.m. UTC | #16
On 9/26/19 10:06 PM, Yu Zhao wrote:
> On Thu, Sep 26, 2019 at 08:26:46PM -0700, John Hubbard wrote:
>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
>>> On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
>>>> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
>>>>> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
>> ...
>>>>> I'm thinking this patch make stuff rather fragile.. Should we instead
>>>>> stick the barrier in set_p*d_at() instead? Or rather, make that store a
>>>>> store-release?
>>>>
>>>> I prefer it this way too, but I suspected the majority would be
>>>> concerned with the performance implications, especially those
>>>> looping set_pte_at()s in mm/huge_memory.c.
>>>
>>> We can rename current set_pte_at() to __set_pte_at() or something and
>>> leave it in places where barrier is not needed. The new set_pte_at()( will
>>> be used in the rest of the places with the barrier inside.
>>
>> +1, sounds nice. I was unhappy about the wide-ranging changes that would have
>> to be maintained. So this seems much better.
> 
> Just to be clear that doing so will add unnecessary barriers to one
> of the two paths that share set_pte_at().

Good point, maybe there's a better place to do it...


> 
>>> BTW, have you looked at other levels of page table hierarchy. Do we have
>>> the same issue for PMD/PUD/... pages?
>>>
>>
>> Along the lines of "what other memory barriers might be missing for
>> get_user_pages_fast(), I'm also concerned that the synchronization between
>> get_user_pages_fast() and freeing the page tables might be technically broken,
>> due to missing memory barriers on the get_user_pages_fast() side. Details:
>>
>> gup_fast() disables interrupts, but I think it also needs some sort of
>> memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
>> etc) from speculatively happening before the interrupts are disabled. 
> 
> I was under impression switching back from interrupt context is a
> full barrier (otherwise wouldn't we be vulnerable to some side
> channel attacks?), so the reader side wouldn't need explicit rmb.
> 

Documentation/memory-barriers.txt points out:

INTERRUPT DISABLING FUNCTIONS
-----------------------------

Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
barriers are required in such a situation, they must be provided from some
other means.

btw, I'm really sorry I missed your responses over the last 3 or 4 days.
I just tracked down something in our email system that was sometimes
moving some emails to spam (just few enough to escape immediate attention, argghh!).
I think I killed it off for good now. I wasn't ignoring you. :)


thanks,
Yu Zhao Oct. 2, 2019, midnight UTC | #17
On Tue, Oct 01, 2019 at 03:31:51PM -0700, John Hubbard wrote:
> On 9/26/19 10:06 PM, Yu Zhao wrote:
> > On Thu, Sep 26, 2019 at 08:26:46PM -0700, John Hubbard wrote:
> >> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> >>> On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
> >>>> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> >>>>> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> >> ...
> >>>>> I'm thinking this patch make stuff rather fragile.. Should we instead
> >>>>> stick the barrier in set_p*d_at() instead? Or rather, make that store a
> >>>>> store-release?
> >>>>
> >>>> I prefer it this way too, but I suspected the majority would be
> >>>> concerned with the performance implications, especially those
> >>>> looping set_pte_at()s in mm/huge_memory.c.
> >>>
> >>> We can rename current set_pte_at() to __set_pte_at() or something and
> >>> leave it in places where barrier is not needed. The new set_pte_at()( will
> >>> be used in the rest of the places with the barrier inside.
> >>
> >> +1, sounds nice. I was unhappy about the wide-ranging changes that would have
> >> to be maintained. So this seems much better.
> > 
> > Just to be clear that doing so will add unnecessary barriers to one
> > of the two paths that share set_pte_at().
> 
> Good point, maybe there's a better place to do it...
> 
> 
> > 
> >>> BTW, have you looked at other levels of page table hierarchy. Do we have
> >>> the same issue for PMD/PUD/... pages?
> >>>
> >>
> >> Along the lines of "what other memory barriers might be missing for
> >> get_user_pages_fast(), I'm also concerned that the synchronization between
> >> get_user_pages_fast() and freeing the page tables might be technically broken,
> >> due to missing memory barriers on the get_user_pages_fast() side. Details:
> >>
> >> gup_fast() disables interrupts, but I think it also needs some sort of
> >> memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
> >> etc) from speculatively happening before the interrupts are disabled. 
> > 
> > I was under impression switching back from interrupt context is a
> > full barrier (otherwise wouldn't we be vulnerable to some side
> > channel attacks?), so the reader side wouldn't need explicit rmb.
> > 
> 
> Documentation/memory-barriers.txt points out:
> 
> INTERRUPT DISABLING FUNCTIONS
> -----------------------------
> 
> Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> (RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
> barriers are required in such a situation, they must be provided from some
> other means.
> 
> btw, I'm really sorry I missed your responses over the last 3 or 4 days.
> I just tracked down something in our email system that was sometimes
> moving some emails to spam (just few enough to escape immediate attention, argghh!).
> I think I killed it off for good now. I wasn't ignoring you. :)

Thanks, John. I agree with all you said, including the irq disabling
function not being a sufficient smp_rmb().

I was hoping somebody could clarify whether ipi handlers used by tlb
flush are sufficient to prevent CPU 1 from seeing any stale data from
freed page tables on all supported archs.

	CPU 1			CPU 2

				flush remote tlb by ipi
				wait for the ipi handler
	<ipi handler>
				free page table
	disable irq
	walk page table
	enable irq

I think they should because otherwise tlb flush wouldn't work if CPU 1
still sees stale data from the freed page table, unless there is a
really strange CPU cache design I'm not aware of.

Quoting comments from x86 ipi handler flush_tlb_func_common():
 * read active_mm's tlb_gen.  We don't need any explicit barriers
 * because all x86 flush operations are serializing and the
 * atomic64_read operation won't be reordered by the compiler.

For ppc64 ipi handler radix__flush_tlb_range(), there is an "eieio"
instruction:
  https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/assembler/idalangref_eieio_instrs.html
I'm not sure why it's not "sync" -- I'd guess something implicitly
works as "sync" already (or it's a bug).

I didn't find an ipi handler for tlb flush on arm64. There should be
one, otherwise fast gup on arm64 would be broken. Mark?
Jan Kara Oct. 2, 2019, 9:24 a.m. UTC | #18
On Tue 01-10-19 11:43:30, John Hubbard wrote:
> On 10/1/19 12:10 AM, Jan Kara wrote:
> > On Mon 30-09-19 10:57:08, John Hubbard wrote:
> >> On 9/30/19 2:20 AM, Jan Kara wrote:
> >>> On Fri 27-09-19 12:31:41, John Hubbard wrote:
> >>>> On 9/27/19 5:33 AM, Michal Hocko wrote:
> >>>>> On Thu 26-09-19 20:26:46, John Hubbard wrote:
> >>>>>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> >> ...
> >> 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then,
> >> it's already found the pte based on reading a stale pmd. So checking the
> >> pte seems like it's checking the wrong thing--it's too late, for this case,
> >> right?
> > 
> > Well, if PMD is getting freed, all PTEs in it should be cleared by that
> > time, shouldn't they? So although we read from stale PMD, either we already
> > see cleared PTE or the check pte_val(pte) != pte_val(*ptep) will fail and
> > so we never actually succeed in getting stale PTE entry (at least unless
> > the page table page that used to be PMD can get freed and reused
> > - which is not the case in the example you've shown above).
> 
> Right, that's not what the example shows, but there is nothing here to prevent
> the page table pages from being freed and re-used.
> 
> > So I still don't see a problem. That being said I don't feel being expert
> > in this particular area. I just always thought GUP prevents races like this
> > by the scheme I describe so I'd like to see what I'm missing :).
> 
> I'm very much still "in training" here, so I hope I'm not wasting everyone's 
> time. But I feel confident in stating at least this much:
> 
> There are two distinct lockless synchronization mechanisms here, each
> protecting against a different issue, and it's important not to conflate
> them and think that one protects against the other. I still see a hole in
> (2) below. The mechanisms are:
> 
> 1) Protection against a page (not the page table itself) getting freed
> while get_user_pages*() is trying to take a reference to it. This is avoided
> by the try-get and the re-checking of pte values that you mention above.
> It's an elegant little thing, too. :)
> 
> 2) Protection against page tables getting freed while a get_user_pages_fast()
> call is in progress. This relies on disabling interrupts in gup_fast(), while
> firing interrupts in the freeing path (via tlb flushing, IPIs). And on memory
> barriers (doh--missing!) to avoid leaking memory references outside of the
> irq disabling. 

OK, so you are concerned that the page table walk of pgd->p4d->pud->pmd
will get prefetched by the CPU before interrupts are disabled; somewhere in
the middle of the walk an IPI flushing TLBs on the CPU will happen, allowing
munmap() on another CPU to proceed and free page tables, so the rest of the
walk will happen on freed and possibly reused pages. Do I understand you
right?

Realistically, I don't think this can happen as I'd expect the CPU to throw
away the speculation state on interrupt. But that's just my expectation and
I agree that I don't see anything in Documentation/memory-barriers.txt that
would prevent what you are concerned about. Let's ask Paul :)

Paul, we are discussing here a possible races between
mm/gup.c:__get_user_pages_fast() and mm/mmap.c:unmap_region(). The first
has a code like:

local_irq_save(flags);
load pgd from current->mm
load p4d from pgd
load pud from p4d
load pmd from pud
...
local_irq_restore(flags);

while the second has code like:

unmap_region()
  walk pgd
    walk p4d
      walk pud
        walk pmd
          clear ptes
          flush tlb
  free page tables

Now the serialization between these two relies on the fact that flushing
TLBs from unmap_region() requires an IPI to be served on each CPU, so in a
naive understanding unmap_region() shouldn't be able to get to the 'free
unused page tables' part until __get_user_pages_fast() enables interrupts
again. Now, as John points out, we don't see anything in
Documentation/memory-barriers.txt that would actually guarantee this. So do
we really need something like smp_rmb() after disabling interrupts in
__get_user_pages_fast(), or is the race John is concerned about impossible?
Thanks!

> This one has a problem, because Documentation/memory-barriers.txt points out:
> 
> INTERRUPT DISABLING FUNCTIONS
> -----------------------------
> 
> Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> (RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
> barriers are required in such a situation, they must be provided from some
> other means.
> 
> 
> ...and so I'm suggesting that we need something approximately like this:
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 23a9f9c9d377..1678d50a2d8b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2415,7 +2415,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>         if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>             gup_fast_permitted(start, end)) {
>                 local_irq_disable();
> +               smp_mb();
>                 gup_pgd_range(addr, end, gup_flags, pages, &nr);
> +               smp_mb();
>                 local_irq_enable();
>                 ret = nr;

								Honza
John Hubbard Oct. 2, 2019, 5:33 p.m. UTC | #19
On 10/2/19 2:24 AM, Jan Kara wrote:
> 
> OK, so you are concerned that the page table walk of pgd->p4d->pud->pmd
> will get prefetched by the CPU before interrupts are disabled, somewhere in
> the middle of the walk IPI flushing TLBs on the cpu will happen allowing
> munmap() on another CPU to proceed and free page tables so the rest of the
> walk will happen on freed and possibly reused pages. Do I understand you
> right?
> 

Yes, that's it.

thanks,

Patch

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 84fa00497c49..7069785e2e52 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -194,6 +194,8 @@  static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
 	ptep_clear_flush_notify(vma, addr, pvmw.pte);
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	set_pte_at_notify(mm, addr, pvmw.pte,
 			mk_pte(new_page, vma->vm_page_prot));
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de1f15969e27..21d271a29d96 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -616,6 +616,8 @@  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(vma->vm_mm);
@@ -1276,7 +1278,9 @@  static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 	}
 	kfree(pages);
 
+	/* commit non-atomic ops before exposing to fast gup */
 	smp_wmb(); /* make pte visible before pmd */
+
 	pmd_populate(vma->vm_mm, vmf->pmd, pgtable);
 	page_remove_rmap(page, true);
 	spin_unlock(vmf->ptl);
@@ -1423,6 +1427,8 @@  vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		if (!page) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 70ff98e1414d..f2901edce6de 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1074,6 +1074,8 @@  static void collapse_huge_page(struct mm_struct *mm,
 	count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
diff --git a/mm/memory.c b/mm/memory.c
index aa86852d9ec2..6dabbc3cd3b7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2367,6 +2367,8 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
 		update_mmu_cache(vma, vmf->address, vmf->pte);
 		if (old_page) {
@@ -2877,7 +2879,6 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 	vmf->orig_pte = pte;
 
@@ -2886,12 +2887,15 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 	} else {
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true, false);
 		activate_page(page);
 	}
 
+	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	swap_free(entry);
 	if (mem_cgroup_swap_full(page) ||
 	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -3034,6 +3038,8 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3297,6 +3303,8 @@  vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index 9f4ed4e985c1..943d147ecc3e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2783,6 +2783,8 @@  static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		lru_cache_add_active_or_unevictable(page, vma);
 	get_page(page);
 
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	if (flush) {
 		flush_cache_page(vma, addr, pte_pfn(*ptep));
 		ptep_clear_flush_notify(vma, addr, ptep);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dab43523afdd..5c5547053ee0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1880,8 +1880,6 @@  static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
-	set_pte_at(vma->vm_mm, addr, pte,
-		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, true, false);
@@ -1889,7 +1887,11 @@  static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 	}
+	set_pte_at(vma->vm_mm, addr, pte,
+		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	swap_free(entry);
 	/*
 	 * Move the page to the active list so it is not
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c7ae74ce5ff3..4f92913242a1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -92,6 +92,8 @@  static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, dst_vma);
 
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
 	/* No need to invalidate - it was non-present before */