
mm: gup: fix the fast GUP race against THP collapse

Message ID 20220901222707.477402-1-shy828301@gmail.com (mailing list archive)
State New
Series mm: gup: fix the fast GUP race against THP collapse

Commit Message

Yang Shi Sept. 1, 2022, 10:27 p.m. UTC
Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
sufficient to handle concurrent GUP-fast in all cases; it only handles
traditional IPI-based GUP-fast correctly.  On architectures that send
an IPI broadcast on TLB flush, it works as expected.  But on
architectures that do not use an IPI to broadcast the TLB flush, the
following race is possible:

   CPU A                                          CPU B
THP collapse                                     fast GUP
                                              gup_pmd_range() <-- see valid pmd
                                                  gup_pte_range() <-- work on pte
pmdp_collapse_flush() <-- clear pmd and flush
__collapse_huge_page_isolate()
    check page pinned <-- before GUP bump refcount
                                                      pin the page
                                                      check PTE <-- no change
__collapse_huge_page_copy()
    copy data to huge page
    ptep_clear()
install huge pmd for the huge page
                                                      return the stale page
discard the stale page

The race can be fixed by checking whether the PMD has changed after
taking the page pin in fast GUP, just as is already done for the PTE.
If the PMD has changed, there may be a parallel THP collapse, so GUP
should back off.
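
In other words, after taking the pin, gup_pte_range() re-checks both
levels and backs off if either one changed underneath it.  A condensed
sketch of the check added below (see the diff for the real thing):

	folio = try_grab_folio(page, 1, flags);	/* take the pin first */
	if (!folio)
		goto pte_unmap;

	/* Did THP collapse clear the PMD, or did the PTE change? */
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(*ptep))) {
		gup_put_folio(folio, 1, flags);	/* drop the pin ...           */
		goto pte_unmap;			/* ... and let slow GUP retry */
	}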

Also update the stale comment about serializing against fast GUP in
khugepaged.

Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/gup.c        | 30 ++++++++++++++++++++++++------
 mm/khugepaged.c | 10 ++++++----
 2 files changed, 30 insertions(+), 10 deletions(-)

Comments

Peter Xu Sept. 1, 2022, 11:26 p.m. UTC | #1
Hi, Yang,

On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> sufficient to handle concurrent GUP-fast in all cases, it only handles
> traditional IPI-based GUP-fast correctly.

If a TLB flush (or rather, the IPI broadcast) used to be enough to protect
against gup-fast, I'm kind of confused why it's not sufficient with RCU gup.
Won't it keep working as long as interrupts are disabled (which the current
fast-gup still does)?

IIUC the issue is you suspect not all archs correctly implemented
pmdp_collapse_flush(), or am I wrong?

> On architectures that send
> an IPI broadcast on TLB flush, it works as expected.  But on the
> architectures that do not use IPI to broadcast TLB flush, it may have
> the below race:
> 
>    CPU A                                          CPU B
> THP collapse                                     fast GUP
>                                               gup_pmd_range() <-- see valid pmd
>                                                   gup_pte_range() <-- work on pte
> pmdp_collapse_flush() <-- clear pmd and flush
> __collapse_huge_page_isolate()
>     check page pinned <-- before GUP bump refcount
>                                                       pin the page
>                                                       check PTE <-- no change
> __collapse_huge_page_copy()
>     copy data to huge page
>     ptep_clear()
> install huge pmd for the huge page
>                                                       return the stale page
> discard the stale page
> 
> The race could be fixed by checking whether PMD is changed or not after
> taking the page pin in fast GUP, just like what it does for PTE.  If the
> PMD is changed it means there may be parallel THP collapse, so GUP
> should back off.

Could the race also be fixed by implementing pmdp_collapse_flush() correctly
on the arch(s) that are missing it?  Do you know which arch(s) are broken?

It's just not clear to me whether this patch is an optimization or a fix,
and if it's a fix, whether the IPI broadcast in the ppc
pmdp_collapse_flush() would still be needed.

Thanks,

> 
> Also update the stale comment about serializing against fast GUP in
> khugepaged.
> 
> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/gup.c        | 30 ++++++++++++++++++++++++------
>  mm/khugepaged.c | 10 ++++++----
>  2 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f3fc1f08d90c..4365b2811269 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>  }
>  
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> +			 unsigned long end, unsigned int flags,
> +			 struct page **pages, int *nr)
>  {
>  	struct dev_pagemap *pgmap = NULL;
>  	int nr_start = *nr, ret = 0;
> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			goto pte_unmap;
>  		}
>  
> -		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +		/*
> +		 * THP collapse conceptually does:
> +		 *   1. Clear and flush PMD
> +		 *   2. Check the base page refcount
> +		 *   3. Copy data to huge page
> +		 *   4. Clear PTE
> +		 *   5. Discard the base page
> +		 *
> +		 * So fast GUP may race with THP collapse then pin and
> +		 * return an old page since TLB flush is no longer sufficient
> +		 * to serialize against fast GUP.
> +		 *
> +		 * Check PMD, if it is changed just back off since it
> +		 * means there may be parallel THP collapse.
> +		 */
> +		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> +		    unlikely(pte_val(pte) != pte_val(*ptep))) {
>  			gup_put_folio(folio, 1, flags);
>  			goto pte_unmap;
>  		}
> @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>   * useful to have gup_huge_pmd even if we can't operate on ptes.
>   */
> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> +			 unsigned long end, unsigned int flags,
> +			 struct page **pages, int *nr)
>  {
>  	return 0;
>  }
> @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
>  			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
>  					 PMD_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> +		} else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (pmdp++, addr = next, addr != end);
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2d74cf01f694..518b49095db3 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>  	/*
> -	 * After this gup_fast can't run anymore. This also removes
> -	 * any huge TLB entry from the CPU so we won't allow
> -	 * huge and small TLB entries for the same virtual address
> -	 * to avoid the risk of CPU bugs in that area.
> +	 * This removes any huge TLB entry from the CPU so we won't allow
> +	 * huge and small TLB entries for the same virtual address to
> +	 * avoid the risk of CPU bugs in that area.
> +	 *
> +	 * Parallel fast GUP is fine since fast GUP will back off when
> +	 * it detects PMD is changed.
>  	 */
>  	_pmd = pmdp_collapse_flush(vma, address, pmd);
>  	spin_unlock(pmd_ptl);
> -- 
> 2.26.3
>
Yang Shi Sept. 1, 2022, 11:50 p.m. UTC | #2
On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Yang,
>
> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
> > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> > sufficient to handle concurrent GUP-fast in all cases, it only handles
> > traditional IPI-based GUP-fast correctly.
>
> If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
> I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
> that'll keep working as long as interrupt disabled (which current fast-gup
> will still do)?

Actually the wording was copied from David's commit log for his
PageAnonExclusive fix. My understanding is that the IPI broadcast still
works, but it may not be supported by all architectures and is not
preferred anymore. So we should avoid depending on the IPI broadcast, IIUC.

>
> IIUC the issue is you suspect not all archs correctly implemented
> pmdp_collapse_flush(), or am I wrong?

This is a possible fix; please see below for details.

>
> > On architectures that send
> > an IPI broadcast on TLB flush, it works as expected.  But on the
> > architectures that do not use IPI to broadcast TLB flush, it may have
> > the below race:
> >
> >    CPU A                                          CPU B
> > THP collapse                                     fast GUP
> >                                               gup_pmd_range() <-- see valid pmd
> >                                                   gup_pte_range() <-- work on pte
> > pmdp_collapse_flush() <-- clear pmd and flush
> > __collapse_huge_page_isolate()
> >     check page pinned <-- before GUP bump refcount
> >                                                       pin the page
> >                                                       check PTE <-- no change
> > __collapse_huge_page_copy()
> >     copy data to huge page
> >     ptep_clear()
> > install huge pmd for the huge page
> >                                                       return the stale page
> > discard the stale page
> >
> > The race could be fixed by checking whether PMD is changed or not after
> > taking the page pin in fast GUP, just like what it does for PTE.  If the
> > PMD is changed it means there may be parallel THP collapse, so GUP
> > should back off.
>
> Could the race also be fixed by impl pmdp_collapse_flush() correctly for
> the archs that are missing? Do you know which arch(s) is broken with it?

Yes, and this was suggested by me in the first place, but per the
suggestions from John and David it is not the preferred way.  I think that
is because:

Firstly, using an IPI to serialize against fast GUP is not recommended
anymore, since fast GUP already checks the PTE and backs off, so we should
avoid it.
Secondly, if checking the PMD and then backing off solves the problem, why
would we still need the broadcast IPI?  It doesn't sound performant.

>
> It's just not clear to me whether this patch is an optimization or a fix,
> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
> still be needed.

It is a fix, and the fix will make the IPI broadcast not useful anymore.

>
> Thanks,
>
> >
> > Also update the stale comment about serializing against fast GUP in
> > khugepaged.
> >
> > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/gup.c        | 30 ++++++++++++++++++++++++------
> >  mm/khugepaged.c | 10 ++++++----
> >  2 files changed, 30 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index f3fc1f08d90c..4365b2811269 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
> >  }
> >
> >  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > -                      unsigned int flags, struct page **pages, int *nr)
> > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> > +                      unsigned long end, unsigned int flags,
> > +                      struct page **pages, int *nr)
> >  {
> >       struct dev_pagemap *pgmap = NULL;
> >       int nr_start = *nr, ret = 0;
> > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >                       goto pte_unmap;
> >               }
> >
> > -             if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > +             /*
> > +              * THP collapse conceptually does:
> > +              *   1. Clear and flush PMD
> > +              *   2. Check the base page refcount
> > +              *   3. Copy data to huge page
> > +              *   4. Clear PTE
> > +              *   5. Discard the base page
> > +              *
> > +              * So fast GUP may race with THP collapse then pin and
> > +              * return an old page since TLB flush is no longer sufficient
> > +              * to serialize against fast GUP.
> > +              *
> > +              * Check PMD, if it is changed just back off since it
> > +              * means there may be parallel THP collapse.
> > +              */
> > +             if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> > +                 unlikely(pte_val(pte) != pte_val(*ptep))) {
> >                       gup_put_folio(folio, 1, flags);
> >                       goto pte_unmap;
> >               }
> > @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> >   * useful to have gup_huge_pmd even if we can't operate on ptes.
> >   */
> > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > -                      unsigned int flags, struct page **pages, int *nr)
> > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> > +                      unsigned long end, unsigned int flags,
> > +                      struct page **pages, int *nr)
> >  {
> >       return 0;
> >  }
> > @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
> >                       if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
> >                                        PMD_SHIFT, next, flags, pages, nr))
> >                               return 0;
> > -             } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> > +             } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
> >                       return 0;
> >       } while (pmdp++, addr = next, addr != end);
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2d74cf01f694..518b49095db3 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> >       /*
> > -      * After this gup_fast can't run anymore. This also removes
> > -      * any huge TLB entry from the CPU so we won't allow
> > -      * huge and small TLB entries for the same virtual address
> > -      * to avoid the risk of CPU bugs in that area.
> > +      * This removes any huge TLB entry from the CPU so we won't allow
> > +      * huge and small TLB entries for the same virtual address to
> > +      * avoid the risk of CPU bugs in that area.
> > +      *
> > +      * Parallel fast GUP is fine since fast GUP will back off when
> > +      * it detects PMD is changed.
> >        */
> >       _pmd = pmdp_collapse_flush(vma, address, pmd);
> >       spin_unlock(pmd_ptl);
> > --
> > 2.26.3
> >
>
> --
> Peter Xu
>
David Hildenbrand Sept. 2, 2022, 6:39 a.m. UTC | #3
On 02.09.22 01:50, Yang Shi wrote:
> On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> Hi, Yang,
>>
>> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
>>> sufficient to handle concurrent GUP-fast in all cases, it only handles
>>> traditional IPI-based GUP-fast correctly.
>>
>> If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
>> I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
>> that'll keep working as long as interrupt disabled (which current fast-gup
>> will still do)?
> 
> Actually the wording was copied from David's commit log for his
> PageAnonExclusive fix. My understanding is the IPI broadcast still
> works, but it may not be supported by all architectures and not
> preferred anymore. So we should avoid depending on IPI broadcast IIUC.

Right. Not all architectures perform an IPI broadcast on TLB flush.

IPI broadcasts will continue working until we use RCU instead of
disabling local interrupts in GUP-fast.
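
To spell that out: the serialization comes from the interrupts-off window
around the lockless walk.  A condensed sketch of lockless_pages_from_mm()
(error handling and the seqcount logic omitted):

	unsigned long flags;

	local_irq_save(flags);
	/*
	 * While IRQs are off, this CPU cannot take the IPI used by an
	 * IPI-based TLB flush, so such a flush cannot complete until
	 * the walk below has finished.
	 */
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);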


>>>    CPU A                                          CPU B
>>> THP collapse                                     fast GUP
>>>                                               gup_pmd_range() <-- see valid pmd
>>>                                                   gup_pte_range() <-- work on pte
>>> pmdp_collapse_flush() <-- clear pmd and flush
>>> __collapse_huge_page_isolate()
>>>     check page pinned <-- before GUP bump refcount
>>>                                                       pin the page
>>>                                                       check PTE <-- no change
>>> __collapse_huge_page_copy()
>>>     copy data to huge page
>>>     ptep_clear()
>>> install huge pmd for the huge page
>>>                                                       return the stale page
>>> discard the stale page
>>>
>>> The race could be fixed by checking whether PMD is changed or not after
>>> taking the page pin in fast GUP, just like what it does for PTE.  If the
>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>> should back off.
>>
>> Could the race also be fixed by impl pmdp_collapse_flush() correctly for
>> the archs that are missing? Do you know which arch(s) is broken with it?
> 
> Yes, and this was suggested by me in the first place, but per the
> suggestion from John and David, this is not the preferred way. I think
> it is because:
> 
> Firstly, using IPI to serialize against fast GUP is not recommended
> anymore since fast GUP does check PTE then back off so we should avoid
> it.
> Secondly, if checking PMD then backing off could solve the problem,
> why do we still need broadcast IPI? It doesn't sound performant.

I'd say using an IPI is the old-style way of doing things. Sure, using
an IPI broadcast will work (and IMHO it's a lot easier to not get wrong).
But it somewhat contradicts the new way of doing things.

>>
>> It's just not clear to me whether this patch is an optimization or a fix,
>> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
>> still be needed.
> 
> It is a fix and the fix will make IPI broadcast not useful anymore.

I'd wonder how "easy" adding the IPI broadcast would be -- IOW, if the
IPI fix has a real advantage.
David Hildenbrand Sept. 2, 2022, 6:42 a.m. UTC | #4
On 02.09.22 00:27, Yang Shi wrote:
> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> sufficient to handle concurrent GUP-fast in all cases, it only handles
> traditional IPI-based GUP-fast correctly.  On architectures that send
> an IPI broadcast on TLB flush, it works as expected.  But on the
> architectures that do not use IPI to broadcast TLB flush, it may have
> the below race:
> 
>    CPU A                                          CPU B
> THP collapse                                     fast GUP
>                                               gup_pmd_range() <-- see valid pmd
>                                                   gup_pte_range() <-- work on pte
> pmdp_collapse_flush() <-- clear pmd and flush
> __collapse_huge_page_isolate()
>     check page pinned <-- before GUP bump refcount
>                                                       pin the page
>                                                       check PTE <-- no change
> __collapse_huge_page_copy()
>     copy data to huge page
>     ptep_clear()
> install huge pmd for the huge page
>                                                       return the stale page
> discard the stale page
> 
> The race could be fixed by checking whether PMD is changed or not after
> taking the page pin in fast GUP, just like what it does for PTE.  If the
> PMD is changed it means there may be parallel THP collapse, so GUP
> should back off.
> 
> Also update the stale comment about serializing against fast GUP in
> khugepaged.
> 
> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/gup.c        | 30 ++++++++++++++++++++++++------
>  mm/khugepaged.c | 10 ++++++----
>  2 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f3fc1f08d90c..4365b2811269 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>  }
>  
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> +			 unsigned long end, unsigned int flags,
> +			 struct page **pages, int *nr)
>  {
>  	struct dev_pagemap *pgmap = NULL;
>  	int nr_start = *nr, ret = 0;
> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			goto pte_unmap;
>  		}
>  
> -		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +		/*
> +		 * THP collapse conceptually does:
> +		 *   1. Clear and flush PMD
> +		 *   2. Check the base page refcount
> +		 *   3. Copy data to huge page
> +		 *   4. Clear PTE
> +		 *   5. Discard the base page
> +		 *
> +		 * So fast GUP may race with THP collapse then pin and
> +		 * return an old page since TLB flush is no longer sufficient
> +		 * to serialize against fast GUP.
> +		 *
> +		 * Check PMD, if it is changed just back off since it
> +		 * means there may be parallel THP collapse.
> +		 */
> +		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> +		    unlikely(pte_val(pte) != pte_val(*ptep))) {
>  			gup_put_folio(folio, 1, flags);
>  			goto pte_unmap;
>  		}
> @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>   * useful to have gup_huge_pmd even if we can't operate on ptes.
>   */
> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> +			 unsigned long end, unsigned int flags,
> +			 struct page **pages, int *nr)
>  {
>  	return 0;
>  }
> @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
>  			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
>  					 PMD_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> +		} else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (pmdp++, addr = next, addr != end);
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2d74cf01f694..518b49095db3 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>  	/*
> -	 * After this gup_fast can't run anymore. This also removes
> -	 * any huge TLB entry from the CPU so we won't allow
> -	 * huge and small TLB entries for the same virtual address
> -	 * to avoid the risk of CPU bugs in that area.
> +	 * This removes any huge TLB entry from the CPU so we won't allow
> +	 * huge and small TLB entries for the same virtual address to
> +	 * avoid the risk of CPU bugs in that area.
> +	 *
> +	 * Parallel fast GUP is fine since fast GUP will back off when
> +	 * it detects PMD is changed.
>  	 */
>  	_pmd = pmdp_collapse_flush(vma, address, pmd);
>  	spin_unlock(pmd_ptl);


As long as pmdp_collapse_flush() implies a full memory barrier (which I
think it does), this should work as expected.
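
The pairing I have in mind is essentially the classic store-buffering
pattern.  Very roughly (just a sketch using kernel-style helpers; "pinned",
"changed" and "expected_refs" are illustrative locals, not real code):

	/* CPU A: THP collapse */
	pmd_clear(pmdp);			/* pmdp_collapse_flush() clears the PMD ... */
	smp_mb();				/* ... and must imply a full barrier        */
	pinned = page_count(page) > expected_refs;	/* then look for GUP pins   */

	/* CPU B: fast GUP */
	atomic_inc(&page->_refcount);		/* take the pin ...                         */
	smp_mb__after_atomic();			/* ... with full ordering                   */
	changed = pmd_val(pmd) != pmd_val(*pmdp);	/* then re-check the PMD            */

With full barriers on both sides, CPU A cannot miss the pin while CPU B
also misses the cleared PMD, so at least one side backs off.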

Hopefully someone with experience on RCU GUP-fast (Jason, John?  :) )
can double-check.

To me this sounds sane.

Acked-by: David Hildenbrand <david@redhat.com>
Yang Shi Sept. 2, 2022, 3:23 p.m. UTC | #5
On Thu, Sep 1, 2022 at 11:39 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.09.22 01:50, Yang Shi wrote:
> > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
> >>
> >> Hi, Yang,
> >>
> >> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
> >>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> >>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> >>> sufficient to handle concurrent GUP-fast in all cases, it only handles
> >>> traditional IPI-based GUP-fast correctly.
> >>
> >> If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
> >> I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
> >> that'll keep working as long as interrupt disabled (which current fast-gup
> >> will still do)?
> >
> > Actually the wording was copied from David's commit log for his
> > PageAnonExclusive fix. My understanding is the IPI broadcast still
> > works, but it may not be supported by all architectures and not
> > preferred anymore. So we should avoid depending on IPI broadcast IIUC.
>
> Right. Not all architectures perform an IPI broadcast on TLB flush.
>
> IPI broadcasts will continue working until we use RCU instead of
> disabling local interrupts in GUP-fast.
>
>
> >>>    CPU A                                          CPU B
> >>> THP collapse                                     fast GUP
> >>>                                               gup_pmd_range() <-- see valid pmd
> >>>                                                   gup_pte_range() <-- work on pte
> >>> pmdp_collapse_flush() <-- clear pmd and flush
> >>> __collapse_huge_page_isolate()
> >>>     check page pinned <-- before GUP bump refcount
> >>>                                                       pin the page
> >>>                                                       check PTE <-- no change
> >>> __collapse_huge_page_copy()
> >>>     copy data to huge page
> >>>     ptep_clear()
> >>> install huge pmd for the huge page
> >>>                                                       return the stale page
> >>> discard the stale page
> >>>
> >>> The race could be fixed by checking whether PMD is changed or not after
> >>> taking the page pin in fast GUP, just like what it does for PTE.  If the
> >>> PMD is changed it means there may be parallel THP collapse, so GUP
> >>> should back off.
> >>
> >> Could the race also be fixed by impl pmdp_collapse_flush() correctly for
> >> the archs that are missing? Do you know which arch(s) is broken with it?
> >
> > Yes, and this was suggested by me in the first place, but per the
> > suggestion from John and David, this is not the preferred way. I think
> > it is because:
> >
> > Firstly, using IPI to serialize against fast GUP is not recommended
> > anymore since fast GUP does check PTE then back off so we should avoid
> > it.
> > Secondly, if checking PMD then backing off could solve the problem,
> > why do we still need broadcast IPI? It doesn't sound performant.
>
> I'd say, using an IPI is the old-styled way of doing things. Sure, using
> an IPI broadcast will work (and IMHO it's a lot easier to
> not-get-wrong). But it somewhat contradicts to the new way of doing things.
>
> >>
> >> It's just not clear to me whether this patch is an optimization or a fix,
> >> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
> >> still be needed.
> >
> > It is a fix and the fix will make IPI broadcast not useful anymore.
>
> I'd wonder how "easy" adding the IPI broadcast would be -- IOW, if the
> IPI fix has a real advantage.

Not sure either, but I guess calling a dummy function via an IPI
broadcast should just work. Powerpc does so.
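
Something along these lines should do (illustrative only, not the actual
powerpc code; the function names here are made up):

	static void dummy_ipi(void *unused)
	{
	}

	static void serialize_against_fast_gup(void)
	{
		/*
		 * wait == 1: do not return until every CPU has executed
		 * dummy_ipi(), i.e. until no CPU is still inside an
		 * interrupts-disabled fast GUP walk that started before
		 * the PMD was cleared.
		 */
		on_each_cpu(dummy_ipi, NULL, 1);
	}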

>
>
> --
> Thanks,
>
> David / dhildenb
>
Peter Xu Sept. 2, 2022, 3:59 p.m. UTC | #6
On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote:
> On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Yang,
> >
> > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
> > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> > > sufficient to handle concurrent GUP-fast in all cases, it only handles
> > > traditional IPI-based GUP-fast correctly.
> >
> > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
> > I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
> > that'll keep working as long as interrupt disabled (which current fast-gup
> > will still do)?
> 
> Actually the wording was copied from David's commit log for his
> PageAnonExclusive fix. My understanding is the IPI broadcast still
> works, but it may not be supported by all architectures and not
> preferred anymore. So we should avoid depending on IPI broadcast IIUC.
> 
> >
> > IIUC the issue is you suspect not all archs correctly implemented
> > pmdp_collapse_flush(), or am I wrong?
> 
> This is a possible fix, please see below for details.
> 
> >
> > > On architectures that send
> > > an IPI broadcast on TLB flush, it works as expected.  But on the
> > > architectures that do not use IPI to broadcast TLB flush, it may have
> > > the below race:
> > >
> > >    CPU A                                          CPU B
> > > THP collapse                                     fast GUP
> > >                                               gup_pmd_range() <-- see valid pmd
> > >                                                   gup_pte_range() <-- work on pte
> > > pmdp_collapse_flush() <-- clear pmd and flush
> > > __collapse_huge_page_isolate()
> > >     check page pinned <-- before GUP bump refcount
> > >                                                       pin the page
> > >                                                       check PTE <-- no change
> > > __collapse_huge_page_copy()
> > >     copy data to huge page
> > >     ptep_clear()
> > > install huge pmd for the huge page
> > >                                                       return the stale page
> > > discard the stale page
> > >
> > > The race could be fixed by checking whether PMD is changed or not after
> > > taking the page pin in fast GUP, just like what it does for PTE.  If the
> > > PMD is changed it means there may be parallel THP collapse, so GUP
> > > should back off.
> >
> > Could the race also be fixed by impl pmdp_collapse_flush() correctly for
> > the archs that are missing? Do you know which arch(s) is broken with it?
> 
> Yes, and this was suggested by me in the first place, but per the
> suggestion from John and David, this is not the preferred way. I think
> it is because:
> 
> Firstly, using IPI to serialize against fast GUP is not recommended
> anymore since fast GUP does check PTE then back off so we should avoid
> it.
> Secondly, if checking PMD then backing off could solve the problem,
> why do we still need broadcast IPI? It doesn't sound performant.
> 
> >
> > It's just not clear to me whether this patch is an optimization or a fix,
> > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
> > still be needed.
> 
> It is a fix and the fix will make IPI broadcast not useful anymore.

How about another patch to remove the ppc implementation too?  Then it can
be a two-patch series.

That way the ppc developers can be copied, and maybe it helps to have the
ppc people look at the current approach too.

Then the last piece of it is the s390 pmdp_collapse_flush().  I'm wondering
whether the generic pmdp_collapse_flush() would be good enough, since the
only addition compared to the s390 one would be flush_tlb_range() (which is
a further __tlb_flush_mm_lazy).  David may have some thoughts.
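
IOW, the generic helper is roughly the following (a from-memory sketch of
mm/pgtable-generic.c, so please double-check against the real code):

	pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
				  unsigned long address, pmd_t *pmdp)
	{
		/* clear the PMD, then flush the TLB for the huge range */
		pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);

		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		return pmd;
	}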

The patch itself looks good to me, one trivial nit below.

> 
> >
> > Thanks,
> >
> > >
> > > Also update the stale comment about serializing against fast GUP in
> > > khugepaged.
> > >
> > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> > >  mm/gup.c        | 30 ++++++++++++++++++++++++------
> > >  mm/khugepaged.c | 10 ++++++----
> > >  2 files changed, 30 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index f3fc1f08d90c..4365b2811269 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
> > >  }
> > >
> > >  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > > -                      unsigned int flags, struct page **pages, int *nr)
> > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> > > +                      unsigned long end, unsigned int flags,
> > > +                      struct page **pages, int *nr)
> > >  {
> > >       struct dev_pagemap *pgmap = NULL;
> > >       int nr_start = *nr, ret = 0;
> > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > >                       goto pte_unmap;
> > >               }
> > >
> > > -             if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > > +             /*
> > > +              * THP collapse conceptually does:
> > > +              *   1. Clear and flush PMD
> > > +              *   2. Check the base page refcount
> > > +              *   3. Copy data to huge page
> > > +              *   4. Clear PTE
> > > +              *   5. Discard the base page
> > > +              *
> > > +              * So fast GUP may race with THP collapse then pin and
> > > +              * return an old page since TLB flush is no longer sufficient
> > > +              * to serialize against fast GUP.
> > > +              *
> > > +              * Check PMD, if it is changed just back off since it
> > > +              * means there may be parallel THP collapse.

Would you mind rewording this comment a bit?  I feel it is a bit weird to
suddenly mention THP collapse, especially its details.

Maybe some statement on the whole history of why we check the pte, and in
which case the pmd check is needed (where the thp collapse example can be
moved to, imho)?

One of my attempts, for reference:

		/*
		 * Fast-gup relies on pte change detection to avoid
		 * concurrent pgtable operations.
		 *
		 * To pin the page, fast-gup needs to do below in order:
		 * (1) pin the page (by prefetching pte), then (2) check
		 * pte not changed.
		 *
		 * For the rest of pgtable operations where pgtable updates
		 * can be racy with fast-gup, we need to do (1) clear pte,
		 * then (2) check whether page is pinned.
		 *
		 * Above will work for all pte-level operations, including
		 * thp split.
		 *
		 * For thp collapse, it's a bit more complicated because
		 * with RCU pgtable free fast-gup can be walking a pgtable
		 * page that is being freed (so pte is still valid but pmd
		 * can be cleared already).  To avoid race in such
		 * condition, we need to also check pmd here to make sure
		 * pmd doesn't change (corresponds to pmdp_collapse_flush()
		 * in the thp collapse code path).
		 */

If you agree with the comment change, feel free to add:

Acked-by: Peter Xu <peterx@redhat.com>

Thanks,
Peter Xu Sept. 2, 2022, 4:04 p.m. UTC | #7
On Fri, Sep 02, 2022 at 11:59:56AM -0400, Peter Xu wrote:
> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote:
> > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, Yang,
> > >
> > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
> > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> > > > sufficient to handle concurrent GUP-fast in all cases, it only handles
> > > > traditional IPI-based GUP-fast correctly.
> > >
> > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
> > > I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
> > > that'll keep working as long as interrupt disabled (which current fast-gup
> > > will still do)?
> > 
> > Actually the wording was copied from David's commit log for his
> > PageAnonExclusive fix. My understanding is the IPI broadcast still
> > works, but it may not be supported by all architectures and not
> > preferred anymore. So we should avoid depending on IPI broadcast IIUC.
> > 
> > >
> > > IIUC the issue is you suspect not all archs correctly implemented
> > > pmdp_collapse_flush(), or am I wrong?
> > 
> > This is a possible fix, please see below for details.
> > 
> > >
> > > > On architectures that send
> > > > an IPI broadcast on TLB flush, it works as expected.  But on the
> > > > architectures that do not use IPI to broadcast TLB flush, it may have
> > > > the below race:
> > > >
> > > >    CPU A                                          CPU B
> > > > THP collapse                                     fast GUP
> > > >                                               gup_pmd_range() <-- see valid pmd
> > > >                                                   gup_pte_range() <-- work on pte
> > > > pmdp_collapse_flush() <-- clear pmd and flush
> > > > __collapse_huge_page_isolate()
> > > >     check page pinned <-- before GUP bump refcount
> > > >                                                       pin the page
> > > >                                                       check PTE <-- no change
> > > > __collapse_huge_page_copy()
> > > >     copy data to huge page
> > > >     ptep_clear()
> > > > install huge pmd for the huge page
> > > >                                                       return the stale page
> > > > discard the stale page
> > > >
> > > > The race could be fixed by checking whether PMD is changed or not after
> > > > taking the page pin in fast GUP, just like what it does for PTE.  If the
> > > > PMD is changed it means there may be parallel THP collapse, so GUP
> > > > should back off.
> > >
> > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for
> > > the archs that are missing? Do you know which arch(s) is broken with it?
> > 
> > Yes, and this was suggested by me in the first place, but per the
> > suggestion from John and David, this is not the preferred way. I think
> > it is because:
> > 
> > Firstly, using IPI to serialize against fast GUP is not recommended
> > anymore since fast GUP does check PTE then back off so we should avoid
> > it.
> > Secondly, if checking PMD then backing off could solve the problem,
> > why do we still need broadcast IPI? It doesn't sound performant.
> > 
> > >
> > > It's just not clear to me whether this patch is an optimization or a fix,
> > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
> > > still be needed.
> > 
> > It is a fix and the fix will make IPI broadcast not useful anymore.
> 
> How about another patch to remove the ppc impl too?  Then it can be a two
> patches series.
> 
> So that ppc developers can be copied and maybe it helps to have the ppc
> people looking at current approach too.
> 
> Then the last piece of it is the s390 pmdp_collapse_flush().  I'm wondering
> whether generic pmdp_collapse_flush() would be good enough, since the only
> addition comparing to the s390 one will be flush_tlb_range() (which is a
> further __tlb_flush_mm_lazy).  David may have some thoughts.
> 
> The patch itself looks good to me, one trivial nit below.
> 
> > 
> > >
> > > Thanks,
> > >
> > > >
> > > > Also update the stale comment about serializing against fast GUP in
> > > > khugepaged.
> > > >
> > > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > ---
> > > >  mm/gup.c        | 30 ++++++++++++++++++++++++------
> > > >  mm/khugepaged.c | 10 ++++++----
> > > >  2 files changed, 30 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index f3fc1f08d90c..4365b2811269 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
> > > >  }
> > > >
> > > >  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> > > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > > > -                      unsigned int flags, struct page **pages, int *nr)
> > > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> > > > +                      unsigned long end, unsigned int flags,
> > > > +                      struct page **pages, int *nr)
> > > >  {
> > > >       struct dev_pagemap *pgmap = NULL;
> > > >       int nr_start = *nr, ret = 0;
> > > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > > >                       goto pte_unmap;
> > > >               }
> > > >
> > > > -             if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > > > +             /*
> > > > +              * THP collapse conceptually does:
> > > > +              *   1. Clear and flush PMD
> > > > +              *   2. Check the base page refcount
> > > > +              *   3. Copy data to huge page
> > > > +              *   4. Clear PTE
> > > > +              *   5. Discard the base page
> > > > +              *
> > > > +              * So fast GUP may race with THP collapse then pin and
> > > > +              * return an old page since TLB flush is no longer sufficient
> > > > +              * to serialize against fast GUP.
> > > > +              *
> > > > +              * Check PMD, if it is changed just back off since it
> > > > +              * means there may be parallel THP collapse.
> 
> Would you mind rewording this comment a bit?  I feel it a bit weird to
> suddenly mention about thp collapse especially its details.
> 
> Maybe some statement on the whole history of why check pte, and in what
> case pmd check is needed (where the thp collapse example can be moved to,
> imho)?
> 
> One of my attempt for reference..
> 
> 		/*
> 		 * Fast-gup relies on pte change detection to avoid
> 		 * concurrent pgtable operations.
> 		 *
> 		 * To pin the page, fast-gup needs to do below in order:
> 		 * (1) pin the page (by prefetching pte), then (2) check
> 		 * pte not changed.
> 		 *
> 		 * For the rest of pgtable operations where pgtable updates
> 		 * can be racy with fast-gup, we need to do (1) clear pte,
> 		 * then (2) check whether page is pinned.
> 		 *
> 		 * Above will work for all pte-level operations, including
> 		 * thp split.

Here a slight amendment could be:

		 * Above will work for all pte-level operations, including
		 * thp split, which applies the same logic but does all of
		 * the above at the pmd level rather than the pte level.

To be clearer.

> 		 *
> 		 * For thp collapse, it's a bit more complicated because
> 		 * with RCU pgtable free fast-gup can be walking a pgtable
> 		 * page that is being freed (so pte is still valid but pmd
> 		 * can be cleared already).  To avoid race in such
> 		 * condition, we need to also check pmd here to make sure
> 		 * pmd doesn't change (corresponds to pmdp_collapse_flush()
> 		 * in the thp collide code path).
> 		 */
> 
> If you agree with the comment change, feel free to add:
> 
> Acked-by: Peter Xu <peterx@redhat.com>
> 
> Thanks,
> 
> -- 
> Peter Xu
Yang Shi Sept. 2, 2022, 5:30 p.m. UTC | #8
On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote:
> > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, Yang,
> > >
> > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
> > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> > > > sufficient to handle concurrent GUP-fast in all cases, it only handles
> > > > traditional IPI-based GUP-fast correctly.
> > >
> > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
> > > I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
> > > that'll keep working as long as interrupt disabled (which current fast-gup
> > > will still do)?
> >
> > Actually the wording was copied from David's commit log for his
> > PageAnonExclusive fix. My understanding is the IPI broadcast still
> > works, but it may not be supported by all architectures and not
> > preferred anymore. So we should avoid depending on IPI broadcast IIUC.
> >
> > >
> > > IIUC the issue is you suspect not all archs correctly implemented
> > > pmdp_collapse_flush(), or am I wrong?
> >
> > This is a possible fix, please see below for details.
> >
> > >
> > > > On architectures that send
> > > > an IPI broadcast on TLB flush, it works as expected.  But on the
> > > > architectures that do not use IPI to broadcast TLB flush, it may have
> > > > the below race:
> > > >
> > > >    CPU A                                          CPU B
> > > > THP collapse                                     fast GUP
> > > >                                               gup_pmd_range() <-- see valid pmd
> > > >                                                   gup_pte_range() <-- work on pte
> > > > pmdp_collapse_flush() <-- clear pmd and flush
> > > > __collapse_huge_page_isolate()
> > > >     check page pinned <-- before GUP bump refcount
> > > >                                                       pin the page
> > > >                                                       check PTE <-- no change
> > > > __collapse_huge_page_copy()
> > > >     copy data to huge page
> > > >     ptep_clear()
> > > > install huge pmd for the huge page
> > > >                                                       return the stale page
> > > > discard the stale page
> > > >
> > > > The race could be fixed by checking whether PMD is changed or not after
> > > > taking the page pin in fast GUP, just like what it does for PTE.  If the
> > > > PMD is changed it means there may be parallel THP collapse, so GUP
> > > > should back off.
> > >
> > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for
> > > the archs that are missing? Do you know which arch(s) is broken with it?
> >
> > Yes, and this was suggested by me in the first place, but per the
> > suggestion from John and David, this is not the preferred way. I think
> > it is because:
> >
> > Firstly, using IPI to serialize against fast GUP is not recommended
> > anymore since fast GUP does check PTE then back off so we should avoid
> > it.
> > Secondly, if checking PMD then backing off could solve the problem,
> > why do we still need broadcast IPI? It doesn't sound performant.
> >
> > >
> > > It's just not clear to me whether this patch is an optimization or a fix,
> > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
> > > still be needed.
> >
> > It is a fix and the fix will make IPI broadcast not useful anymore.
>
> How about another patch to remove the ppc impl too?  Then it can be a two
> patches series.
>
> So that ppc developers can be copied and maybe it helps to have the ppc
> people looking at current approach too.

I was thinking about that before sending this patch, but I thought it
might be better to wait for the ppc developers to answer some questions
first. I think Aneesh was copied on another thread.

>
> Then the last piece of it is the s390 pmdp_collapse_flush().  I'm wondering
> whether generic pmdp_collapse_flush() would be good enough, since the only
> addition comparing to the s390 one will be flush_tlb_range() (which is a
> further __tlb_flush_mm_lazy).  David may have some thoughts.
>
> The patch itself looks good to me, one trivial nit below.
>
> >
> > >
> > > Thanks,
> > >
> > > >
> > > > Also update the stale comment about serializing against fast GUP in
> > > > khugepaged.
> > > >
> > > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > ---
> > > >  mm/gup.c        | 30 ++++++++++++++++++++++++------
> > > >  mm/khugepaged.c | 10 ++++++----
> > > >  2 files changed, 30 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index f3fc1f08d90c..4365b2811269 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
> > > >  }
> > > >
> > > >  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> > > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > > > -                      unsigned int flags, struct page **pages, int *nr)
> > > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> > > > +                      unsigned long end, unsigned int flags,
> > > > +                      struct page **pages, int *nr)
> > > >  {
> > > >       struct dev_pagemap *pgmap = NULL;
> > > >       int nr_start = *nr, ret = 0;
> > > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > > >                       goto pte_unmap;
> > > >               }
> > > >
> > > > -             if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > > > +             /*
> > > > +              * THP collapse conceptually does:
> > > > +              *   1. Clear and flush PMD
> > > > +              *   2. Check the base page refcount
> > > > +              *   3. Copy data to huge page
> > > > +              *   4. Clear PTE
> > > > +              *   5. Discard the base page
> > > > +              *
> > > > +              * So fast GUP may race with THP collapse then pin and
> > > > +              * return an old page since TLB flush is no longer sufficient
> > > > +              * to serialize against fast GUP.
> > > > +              *
> > > > +              * Check PMD, if it is changed just back off since it
> > > > +              * means there may be parallel THP collapse.
>
> Would you mind rewording this comment a bit?  I feel it a bit weird to
> suddenly mention about thp collapse especially its details.
>
> Maybe some statement on the whole history of why check pte, and in what
> case pmd check is needed (where the thp collapse example can be moved to,
> imho)?
>
> One of my attempt for reference..
>
>                 /*
>                  * Fast-gup relies on pte change detection to avoid
>                  * concurrent pgtable operations.
>                  *
>                  * To pin the page, fast-gup needs to do below in order:
>                  * (1) pin the page (by prefetching pte), then (2) check
>                  * pte not changed.
>                  *
>                  * For the rest of pgtable operations where pgtable updates
>                  * can be racy with fast-gup, we need to do (1) clear pte,
>                  * then (2) check whether page is pinned.
>                  *
>                  * Above will work for all pte-level operations, including
>                  * thp split.
>                  *
>                  * For thp collapse, it's a bit more complicated because
>                  * with RCU pgtable free fast-gup can be walking a pgtable
>                  * page that is being freed (so pte is still valid but pmd
>                  * can be cleared already).  To avoid race in such
>                  * condition, we need to also check pmd here to make sure
>                  * pmd doesn't change (corresponds to pmdp_collapse_flush()
>                  * in the thp collide code path).
>                  */
>
> If you agree with the comment change, feel free to add:

Thanks a lot, it looks good to me; I will update the patch to incorporate it.

>
> Acked-by: Peter Xu <peterx@redhat.com>
>
> Thanks,
>
> --
> Peter Xu
>
Yang Shi Sept. 2, 2022, 5:45 p.m. UTC | #9
On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote:
> > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, Yang,
> > >
> > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
> > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> > > > sufficient to handle concurrent GUP-fast in all cases, it only handles
> > > > traditional IPI-based GUP-fast correctly.
> > >
> > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
> > > I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
> > > that'll keep working as long as interrupt disabled (which current fast-gup
> > > will still do)?
> >
> > Actually the wording was copied from David's commit log for his
> > PageAnonExclusive fix. My understanding is the IPI broadcast still
> > works, but it may not be supported by all architectures and not
> > preferred anymore. So we should avoid depending on IPI broadcast IIUC.
> >
> > >
> > > IIUC the issue is you suspect not all archs correctly implemented
> > > pmdp_collapse_flush(), or am I wrong?
> >
> > This is a possible fix, please see below for details.
> >
> > >
> > > > On architectures that send
> > > > an IPI broadcast on TLB flush, it works as expected.  But on the
> > > > architectures that do not use IPI to broadcast TLB flush, it may have
> > > > the below race:
> > > >
> > > >    CPU A                                          CPU B
> > > > THP collapse                                     fast GUP
> > > >                                               gup_pmd_range() <-- see valid pmd
> > > >                                                   gup_pte_range() <-- work on pte
> > > > pmdp_collapse_flush() <-- clear pmd and flush
> > > > __collapse_huge_page_isolate()
> > > >     check page pinned <-- before GUP bump refcount
> > > >                                                       pin the page
> > > >                                                       check PTE <-- no change
> > > > __collapse_huge_page_copy()
> > > >     copy data to huge page
> > > >     ptep_clear()
> > > > install huge pmd for the huge page
> > > >                                                       return the stale page
> > > > discard the stale page
> > > >
> > > > The race could be fixed by checking whether PMD is changed or not after
> > > > taking the page pin in fast GUP, just like what it does for PTE.  If the
> > > > PMD is changed it means there may be parallel THP collapse, so GUP
> > > > should back off.
> > >
> > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for
> > > the archs that are missing? Do you know which arch(s) is broken with it?
> >
> > Yes, and this was suggested by me in the first place, but per the
> > suggestion from John and David, this is not the preferred way. I think
> > it is because:
> >
> > Firstly, using IPI to serialize against fast GUP is not recommended
> > anymore since fast GUP does check PTE then back off so we should avoid
> > it.
> > Secondly, if checking PMD then backing off could solve the problem,
> > why do we still need broadcast IPI? It doesn't sound performant.
> >
> > >
> > > It's just not clear to me whether this patch is an optimization or a fix,
> > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
> > > still be needed.
> >
> > It is a fix and the fix will make IPI broadcast not useful anymore.
>
> How about another patch to remove the ppc impl too?  Then it can be a two
> patches series.

BTW, I don't think we could remove the ppc implementation since it is
different from the generic pmdp_collapse_flush(), particularly for the
hash part IIUC.

The generic version calls flush_tlb_range() -> hash__flush_tlb_range()
for hash, but the hash call is actually no-op. The ppc version calls
hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does
something useful.
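
For clarity, the two call chains described above, as I understand them (a
simplified sketch based on this thread, not the exact source):

  /* generic pmdp_collapse_flush() on a hash MMU (simplified): */
  pmdp_collapse_flush()
    -> flush_tlb_range()
       -> hash__flush_tlb_range()   /* effectively a no-op for hash */

  /* ppc64 pmdp_collapse_flush() (simplified): */
  pmdp_collapse_flush()
    -> hash__pmdp_collapse_flush()
       -> flush_tlb_pmd_range(vma->vm_mm, &pmd, address)
          /* flushes the hash page table entries for the range */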

But, as I mentioned in another thread, we should be able to remove the
IPI call, which just calls a dummy function.

>
> So that ppc developers can be copied and maybe it helps to have the ppc
> people looking at current approach too.
>
> Then the last piece of it is the s390 pmdp_collapse_flush().  I'm wondering
> whether generic pmdp_collapse_flush() would be good enough, since the only
> addition comparing to the s390 one will be flush_tlb_range() (which is a
> further __tlb_flush_mm_lazy).  David may have some thoughts.
>
> The patch itself looks good to me, one trivial nit below.
>
> >
> > >
> > > Thanks,
> > >
> > > >
> > > > Also update the stale comment about serializing against fast GUP in
> > > > khugepaged.
> > > >
> > > > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > ---
> > > >  mm/gup.c        | 30 ++++++++++++++++++++++++------
> > > >  mm/khugepaged.c | 10 ++++++----
> > > >  2 files changed, 30 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index f3fc1f08d90c..4365b2811269 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
> > > >  }
> > > >
> > > >  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> > > > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > > > -                      unsigned int flags, struct page **pages, int *nr)
> > > > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> > > > +                      unsigned long end, unsigned int flags,
> > > > +                      struct page **pages, int *nr)
> > > >  {
> > > >       struct dev_pagemap *pgmap = NULL;
> > > >       int nr_start = *nr, ret = 0;
> > > > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > > >                       goto pte_unmap;
> > > >               }
> > > >
> > > > -             if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > > > +             /*
> > > > +              * THP collapse conceptually does:
> > > > +              *   1. Clear and flush PMD
> > > > +              *   2. Check the base page refcount
> > > > +              *   3. Copy data to huge page
> > > > +              *   4. Clear PTE
> > > > +              *   5. Discard the base page
> > > > +              *
> > > > +              * So fast GUP may race with THP collapse then pin and
> > > > +              * return an old page since TLB flush is no longer sufficient
> > > > +              * to serialize against fast GUP.
> > > > +              *
> > > > +              * Check PMD, if it is changed just back off since it
> > > > +              * means there may be parallel THP collapse.
>
> Would you mind rewording this comment a bit?  I feel it a bit weird to
> suddenly mention about thp collapse especially its details.
>
> Maybe some statement on the whole history of why check pte, and in what
> case pmd check is needed (where the thp collapse example can be moved to,
> imho)?
>
> One of my attempt for reference..
>
>                 /*
>                  * Fast-gup relies on pte change detection to avoid
>                  * concurrent pgtable operations.
>                  *
>                  * To pin the page, fast-gup needs to do below in order:
>                  * (1) pin the page (by prefetching pte), then (2) check
>                  * pte not changed.
>                  *
>                  * For the rest of pgtable operations where pgtable updates
>                  * can be racy with fast-gup, we need to do (1) clear pte,
>                  * then (2) check whether page is pinned.
>                  *
>                  * Above will work for all pte-level operations, including
>                  * thp split.
>                  *
>                  * For thp collapse, it's a bit more complicated because
>                  * with RCU pgtable free fast-gup can be walking a pgtable
>                  * page that is being freed (so pte is still valid but pmd
>                  * can be cleared already).  To avoid race in such
>                  * condition, we need to also check pmd here to make sure
>                  * pmd doesn't change (corresponds to pmdp_collapse_flush()
>                  * in the thp collapse code path).
>                  */
>
> If you agree with the comment change, feel free to add:
>
> Acked-by: Peter Xu <peterx@redhat.com>
>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Sept. 2, 2022, 8:33 p.m. UTC | #10
On Fri, Sep 02, 2022 at 10:45:20AM -0700, Yang Shi wrote:
> > How about another patch to remove the ppc impl too?  Then it can be a two
> > patches series.
> 
> BTW, I don't think we could remove the ppc implementation since it is
> different from the generic pmdp_collapse_flush(), particularly for the
> hash part IIUC.
> 
> The generic version calls flush_tlb_range() -> hash__flush_tlb_range()
> for hash, but the hash call is actually no-op. The ppc version calls
> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does
> something useful.

One thing I found interesting (and also a bit confusing...) is that the ppc
code uses the name flush_tlb_pmd_range() to "flush a tlb range at the pte
level", which is kind of against the tlb API design...

The generic tlb API has a very similarly named function, flush_pmd_tlb_range(),
which is only used for pmd-level flushing, while here the ppc version of
flush_tlb_pmd_range() is actually flush_tlb_range() in the generic API.

Agreed that it may be worth having the ppc developers take a look.
John Hubbard Sept. 4, 2022, 10:21 p.m. UTC | #11
On 9/2/22 08:59, Peter Xu wrote:
...
> Would you mind rewording this comment a bit?  I feel it a bit weird to
> suddenly mention about thp collapse especially its details.
> 
> Maybe some statement on the whole history of why check pte, and in what
> case pmd check is needed (where the thp collapse example can be moved to,
> imho)?
> 
> One of my attempt for reference..
> 
> 		/*
> 		 * Fast-gup relies on pte change detection to avoid
> 		 * concurrent pgtable operations.
> 		 *
> 		 * To pin the page, fast-gup needs to do below in order:
> 		 * (1) pin the page (by prefetching pte), then (2) check
> 		 * pte not changed.
> 		 *
> 		 * For the rest of pgtable operations where pgtable updates
> 		 * can be racy with fast-gup, we need to do (1) clear pte,
> 		 * then (2) check whether page is pinned.
> 		 *
> 		 * Above will work for all pte-level operations, including
> 		 * thp split.
> 		 *
> 		 * For thp collapse, it's a bit more complicated because
> 		 * with RCU pgtable free fast-gup can be walking a pgtable
> 		 * page that is being freed (so pte is still valid but pmd
> 		 * can be cleared already).  To avoid race in such
> 		 * condition, we need to also check pmd here to make sure
> 		 * pmd doesn't change (corresponds to pmdp_collapse_flush()
> 		 * in the thp collapse code path).
> 		 */
> 
> If you agree with the comment change, feel free to add:
> 
> Acked-by: Peter Xu <peterx@redhat.com>
> 

This seems fine, but I'd like to additionally request that we move it
to function-level documentation, because it expands the length of the
function, which previously fit neatly on my screen. So I think it's
time to move it up a level.
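
Something like the below is what I have in mind (placement sketch only,
reusing the gist of the proposed comment; the body is elided and this is
not meant as the actual patch):

/*
 * Fast-gup relies on pte change detection: pin the page first, then
 * re-check that neither the PTE nor (for the THP collapse case) the PMD
 * changed underneath us, and back off if either did.
 */
static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
			 unsigned long end, unsigned int flags,
			 struct page **pages, int *nr)
{
	/* ... function body as in the patch, with only a short note at
	 * the check itself ... */
}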

thanks,
John Hubbard Sept. 4, 2022, 10:29 p.m. UTC | #12
On 9/1/22 15:27, Yang Shi wrote:
> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> sufficient to handle concurrent GUP-fast in all cases, it only handles
> traditional IPI-based GUP-fast correctly.  On architectures that send
> an IPI broadcast on TLB flush, it works as expected.  But on the
> architectures that do not use IPI to broadcast TLB flush, it may have
> the below race:
> 
>    CPU A                                          CPU B
> THP collapse                                     fast GUP
>                                               gup_pmd_range() <-- see valid pmd
>                                                   gup_pte_range() <-- work on pte
> pmdp_collapse_flush() <-- clear pmd and flush
> __collapse_huge_page_isolate()
>     check page pinned <-- before GUP bump refcount
>                                                       pin the page
>                                                       check PTE <-- no change
> __collapse_huge_page_copy()
>     copy data to huge page
>     ptep_clear()
> install huge pmd for the huge page
>                                                       return the stale page
> discard the stale page

Hi Yang,

Thanks for taking the trouble to write down these notes. I always
forget which race we are dealing with, and this is a great help. :)

More...

> 
> The race could be fixed by checking whether PMD is changed or not after
> taking the page pin in fast GUP, just like what it does for PTE.  If the
> PMD is changed it means there may be parallel THP collapse, so GUP
> should back off.
> 
> Also update the stale comment about serializing against fast GUP in
> khugepaged.
> 
> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/gup.c        | 30 ++++++++++++++++++++++++------
>  mm/khugepaged.c | 10 ++++++----
>  2 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f3fc1f08d90c..4365b2811269 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>  }
>  
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> +			 unsigned long end, unsigned int flags,
> +			 struct page **pages, int *nr)
>  {
>  	struct dev_pagemap *pgmap = NULL;
>  	int nr_start = *nr, ret = 0;
> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			goto pte_unmap;
>  		}
>  
> -		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +		/*
> +		 * THP collapse conceptually does:
> +		 *   1. Clear and flush PMD
> +		 *   2. Check the base page refcount
> +		 *   3. Copy data to huge page
> +		 *   4. Clear PTE
> +		 *   5. Discard the base page
> +		 *
> +		 * So fast GUP may race with THP collapse then pin and
> +		 * return an old page since TLB flush is no longer sufficient
> +		 * to serialize against fast GUP.
> +		 *
> +		 * Check PMD, if it is changed just back off since it
> +		 * means there may be parallel THP collapse.
> +		 */

As I mentioned in the other thread, it would be a nice touch to move
such discussion into the comment header.

> +		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> +		    unlikely(pte_val(pte) != pte_val(*ptep))) {


That should be READ_ONCE() for the *pmdp and *ptep reads, because this
whole lockless house of cards may fall apart if we try reading the
page table values without READ_ONCE().

That's a rather vague statement, and in fact, the READ_ONCE() should
be paired with a page table write somewhere else, to make that claim
more precise.
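
For illustration, the check with that suggestion applied might look like
the below (a sketch of the idea only, not the patch as posted; whether
READ_ONCE() is really needed here is discussed further down the thread):

		if (unlikely(pmd_val(pmd) != pmd_val(READ_ONCE(*pmdp))) ||
		    unlikely(pte_val(pte) != pte_val(READ_ONCE(*ptep)))) {
			gup_put_folio(folio, 1, flags);
			goto pte_unmap;
		}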


>  			gup_put_folio(folio, 1, flags);
>  			goto pte_unmap;
>  		}
> @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>   * useful to have gup_huge_pmd even if we can't operate on ptes.
>   */
> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> +			 unsigned long end, unsigned int flags,
> +			 struct page **pages, int *nr)
>  {
>  	return 0;
>  }
> @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
>  			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
>  					 PMD_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> +		} else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (pmdp++, addr = next, addr != end);
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2d74cf01f694..518b49095db3 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>  	/*
> -	 * After this gup_fast can't run anymore. This also removes
> -	 * any huge TLB entry from the CPU so we won't allow
> -	 * huge and small TLB entries for the same virtual address
> -	 * to avoid the risk of CPU bugs in that area.
> +	 * This removes any huge TLB entry from the CPU so we won't allow
> +	 * huge and small TLB entries for the same virtual address to
> +	 * avoid the risk of CPU bugs in that area.
> +	 *
> +	 * Parallel fast GUP is fine since fast GUP will back off when
> +	 * it detects PMD is changed.
>  	 */
>  	_pmd = pmdp_collapse_flush(vma, address, pmd);

To follow up on David Hildenbrand's note about this in the nearby thread...
I'm also not sure if pmdp_collapse_flush() implies a memory barrier on 
all arches. It definitely does do an atomic op with a return value on x86,
but that's just one arch.


thanks,
David Hildenbrand Sept. 5, 2022, 7:59 a.m. UTC | #13
On 05.09.22 00:29, John Hubbard wrote:
> On 9/1/22 15:27, Yang Shi wrote:
>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
>> sufficient to handle concurrent GUP-fast in all cases, it only handles
>> traditional IPI-based GUP-fast correctly.  On architectures that send
>> an IPI broadcast on TLB flush, it works as expected.  But on the
>> architectures that do not use IPI to broadcast TLB flush, it may have
>> the below race:
>>
>>     CPU A                                          CPU B
>> THP collapse                                     fast GUP
>>                                                gup_pmd_range() <-- see valid pmd
>>                                                    gup_pte_range() <-- work on pte
>> pmdp_collapse_flush() <-- clear pmd and flush
>> __collapse_huge_page_isolate()
>>      check page pinned <-- before GUP bump refcount
>>                                                        pin the page
>>                                                        check PTE <-- no change
>> __collapse_huge_page_copy()
>>      copy data to huge page
>>      ptep_clear()
>> install huge pmd for the huge page
>>                                                        return the stale page
>> discard the stale page
> 
> Hi Yang,
> 
> Thanks for taking the trouble to write down these notes. I always
> forget which race we are dealing with, and this is a great help. :)
> 
> More...
> 
>>
>> The race could be fixed by checking whether PMD is changed or not after
>> taking the page pin in fast GUP, just like what it does for PTE.  If the
>> PMD is changed it means there may be parallel THP collapse, so GUP
>> should back off.
>>
>> Also update the stale comment about serializing against fast GUP in
>> khugepaged.
>>
>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>> ---
>>   mm/gup.c        | 30 ++++++++++++++++++++++++------
>>   mm/khugepaged.c | 10 ++++++----
>>   2 files changed, 30 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index f3fc1f08d90c..4365b2811269 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>>   }
>>   
>>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>> -			 unsigned int flags, struct page **pages, int *nr)
>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>> +			 unsigned long end, unsigned int flags,
>> +			 struct page **pages, int *nr)
>>   {
>>   	struct dev_pagemap *pgmap = NULL;
>>   	int nr_start = *nr, ret = 0;
>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>   			goto pte_unmap;
>>   		}
>>   
>> -		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>> +		/*
>> +		 * THP collapse conceptually does:
>> +		 *   1. Clear and flush PMD
>> +		 *   2. Check the base page refcount
>> +		 *   3. Copy data to huge page
>> +		 *   4. Clear PTE
>> +		 *   5. Discard the base page
>> +		 *
>> +		 * So fast GUP may race with THP collapse then pin and
>> +		 * return an old page since TLB flush is no longer sufficient
>> +		 * to serialize against fast GUP.
>> +		 *
>> +		 * Check PMD, if it is changed just back off since it
>> +		 * means there may be parallel THP collapse.
>> +		 */
> 
> As I mentioned in the other thread, it would be a nice touch to move
> such discussion into the comment header.
> 
>> +		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>> +		    unlikely(pte_val(pte) != pte_val(*ptep))) {
> 
> 
> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
> whole lockless house of cards may fall apart if we try reading the
> page table values without READ_ONCE().

I came to the conclusion that the implicit memory barrier when grabbing 
a reference on the page is sufficient such that we don't need READ_ONCE 
here.

If we still intend to change that code, we should fixup all GUP-fast 
functions in a similar way. But again, I don't think we need a change here.
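
Spelling out the ordering I'm relying on (a sketch of my reading, assuming
the pinning step in gup_pte_range(), i.e. try_grab_folio(), performs a
fully ordered atomic RmW on the folio refcount):

	/*
	 * (1) pte = ptep_get_lockless(ptep);    read the pte
	 * (2) try_grab_folio(...);              atomic RmW -> full barrier
	 * (3) re-read *pmdp / *ptep and compare against (1)
	 *
	 * The barrier implied by (2) prevents both the compiler and the CPU
	 * from hoisting the re-reads in (3) before the pin, which is what
	 * makes the recheck meaningful without an explicit READ_ONCE().
	 */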


>> -	 * After this gup_fast can't run anymore. This also removes
>> -	 * any huge TLB entry from the CPU so we won't allow
>> -	 * huge and small TLB entries for the same virtual address
>> -	 * to avoid the risk of CPU bugs in that area.
>> +	 * This removes any huge TLB entry from the CPU so we won't allow
>> +	 * huge and small TLB entries for the same virtual address to
>> +	 * avoid the risk of CPU bugs in that area.
>> +	 *
>> +	 * Parallel fast GUP is fine since fast GUP will back off when
>> +	 * it detects PMD is changed.
>>   	 */
>>   	_pmd = pmdp_collapse_flush(vma, address, pmd);
> 
> To follow up on David Hildenbrand's note about this in the nearby thread...
> I'm also not sure if pmdp_collapse_flush() implies a memory barrier on
> all arches. It definitely does do an atomic op with a return value on x86,
> but that's just one arch.
> 

I think a ptep/pmdp clear + TLB flush really has to imply a memory 
barrier, otherwise TLB flushing code could easily get reordered against 
surrounding code. But we had better double-check.

s390x executes an IDTE instruction, which performs serialization (-> 
memory barrier). arm64 seems to use DSB instructions to enforce memory 
ordering.
Aneesh Kumar K.V Sept. 5, 2022, 8:54 a.m. UTC | #14
Yang Shi <shy828301@gmail.com> writes:

>
> On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote:
>> > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
>> > >
>> > > Hi, Yang,
>> > >
>> > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
>> > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>> > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
>> > > > sufficient to handle concurrent GUP-fast in all cases, it only handles
>> > > > traditional IPI-based GUP-fast correctly.
>> > >
>> > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
>> > > I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
>> > > that'll keep working as long as interrupt disabled (which current fast-gup
>> > > will still do)?
>> >
>> > Actually the wording was copied from David's commit log for his
>> > PageAnonExclusive fix. My understanding is the IPI broadcast still
>> > works, but it may not be supported by all architectures and not
>> > preferred anymore. So we should avoid depending on IPI broadcast IIUC.
>> >
>> > >
>> > > IIUC the issue is you suspect not all archs correctly implemented
>> > > pmdp_collapse_flush(), or am I wrong?
>> >
>> > This is a possible fix, please see below for details.
>> >
>> > >
>> > > > On architectures that send
>> > > > an IPI broadcast on TLB flush, it works as expected.  But on the
>> > > > architectures that do not use IPI to broadcast TLB flush, it may have
>> > > > the below race:
>> > > >
>> > > >    CPU A                                          CPU B
>> > > > THP collapse                                     fast GUP
>> > > >                                               gup_pmd_range() <-- see valid pmd
>> > > >                                                   gup_pte_range() <-- work on pte
>> > > > pmdp_collapse_flush() <-- clear pmd and flush
>> > > > __collapse_huge_page_isolate()
>> > > >     check page pinned <-- before GUP bump refcount
>> > > >                                                       pin the page
>> > > >                                                       check PTE <-- no change
>> > > > __collapse_huge_page_copy()
>> > > >     copy data to huge page
>> > > >     ptep_clear()
>> > > > install huge pmd for the huge page
>> > > >                                                       return the stale page
>> > > > discard the stale page
>> > > >
>> > > > The race could be fixed by checking whether PMD is changed or not after
>> > > > taking the page pin in fast GUP, just like what it does for PTE.  If the
>> > > > PMD is changed it means there may be parallel THP collapse, so GUP
>> > > > should back off.
>> > >
>> > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for
>> > > the archs that are missing? Do you know which arch(s) is broken with it?
>> >
>> > Yes, and this was suggested by me in the first place, but per the
>> > suggestion from John and David, this is not the preferred way. I think
>> > it is because:
>> >
>> > Firstly, using IPI to serialize against fast GUP is not recommended
>> > anymore since fast GUP does check PTE then back off so we should avoid
>> > it.
>> > Secondly, if checking PMD then backing off could solve the problem,
>> > why do we still need broadcast IPI? It doesn't sound performant.
>> >
>> > >
>> > > It's just not clear to me whether this patch is an optimization or a fix,
>> > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
>> > > still be needed.
>> >
>> > It is a fix and the fix will make IPI broadcast not useful anymore.
>>
>> How about another patch to remove the ppc impl too?  Then it can be a two
>> patches series.
>
> BTW, I don't think we could remove the ppc implementation since it is
> different from the generic pmdp_collapse_flush(), particularly for the
> hash part IIUC.
>
> The generic version calls flush_tlb_range() -> hash__flush_tlb_range()
> for hash, but the hash call is actually no-op. The ppc version calls
> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does
> something useful.
>

We should actually rename flush_tlb_pmd_range(); it actually flushes the
hash page table entries.

I will do the below patch for ppc64 to clarify this better:

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 8b762f282190..fd30fa20c392 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -112,13 +112,12 @@ static inline void hash__flush_tlb_kernel_range(unsigned long start,
 
 struct mmu_gather;
 extern void hash__tlb_flush(struct mmu_gather *tlb);
-void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
 /* Private function for use by PCI IO mapping code */
 extern void __flush_hash_table_range(unsigned long start, unsigned long end);
-extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd,
-				unsigned long addr);
+extern void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd,
+				       unsigned long addr);
 #else
 static inline void __flush_hash_table_range(unsigned long start, unsigned long end) { }
 #endif
diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index ae008b9df0e6..f30131933a01 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -256,7 +256,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres
 	 * the __collapse_huge_page_copy can result in copying
 	 * the old content.
 	 */
-	flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
+	flush_hash_table_pmd_range(vma->vm_mm, &pmd, address);
 	return pmd;
 }
 
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index eb0bccaf221e..a64ea0a7ef96 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -221,7 +221,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
 	local_irq_restore(flags);
 }
 
-void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
 {
 	pte_t *pte;
 	pte_t *start_pte;
Aneesh Kumar K.V Sept. 5, 2022, 8:56 a.m. UTC | #15
Peter Xu <peterx@redhat.com> writes:

>
> On Fri, Sep 02, 2022 at 10:45:20AM -0700, Yang Shi wrote:
>> > How about another patch to remove the ppc impl too?  Then it can be a two
>> > patches series.
>> 
>> BTW, I don't think we could remove the ppc implementation since it is
>> different from the generic pmdp_collapse_flush(), particularly for the
>> hash part IIUC.
>> 
>> The generic version calls flush_tlb_range() -> hash__flush_tlb_range()
>> for hash, but the hash call is actually no-op. The ppc version calls
>> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does
>> something useful.
>
> One thing I found interesting (and also a bit confused..) is that the ppc
> code used the name flush_tlb_pmd_range() to "flush tlb range in pte level",
> which is kind of against the tlb API design..
>
> The generic tlb API has a very close function called flush_pmd_tlb_range()
> which is only used to do pmd-level flushing, while here the ppc version of
> flush_tlb_pmd_range() is actually flush_tlb_range() in the generic API.
>
> Agreed that it may worth having a look from ppc developers.
>

It is actually flushing the hash page table entries. I will rename
flush_tlb_pmd_range to flush_hash_table_pmd_range().

-aneesh
Baolin Wang Sept. 5, 2022, 9:03 a.m. UTC | #16
On 9/5/2022 6:29 AM, John Hubbard wrote:
> On 9/1/22 15:27, Yang Shi wrote:
>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
>> sufficient to handle concurrent GUP-fast in all cases, it only handles
>> traditional IPI-based GUP-fast correctly.  On architectures that send
>> an IPI broadcast on TLB flush, it works as expected.  But on the
>> architectures that do not use IPI to broadcast TLB flush, it may have
>> the below race:
>>
>>     CPU A                                          CPU B
>> THP collapse                                     fast GUP
>>                                                gup_pmd_range() <-- see valid pmd
>>                                                    gup_pte_range() <-- work on pte
>> pmdp_collapse_flush() <-- clear pmd and flush
>> __collapse_huge_page_isolate()
>>      check page pinned <-- before GUP bump refcount
>>                                                        pin the page
>>                                                        check PTE <-- no change
>> __collapse_huge_page_copy()
>>      copy data to huge page
>>      ptep_clear()
>> install huge pmd for the huge page
>>                                                        return the stale page
>> discard the stale page
> 
> Hi Yang,
> 
> Thanks for taking the trouble to write down these notes. I always
> forget which race we are dealing with, and this is a great help. :)
> 
> More...
> 
>>
>> The race could be fixed by checking whether PMD is changed or not after
>> taking the page pin in fast GUP, just like what it does for PTE.  If the
>> PMD is changed it means there may be parallel THP collapse, so GUP
>> should back off.
>>
>> Also update the stale comment about serializing against fast GUP in
>> khugepaged.
>>
>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>> ---
>>   mm/gup.c        | 30 ++++++++++++++++++++++++------
>>   mm/khugepaged.c | 10 ++++++----
>>   2 files changed, 30 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index f3fc1f08d90c..4365b2811269 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>>   }
>>   
>>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>> -			 unsigned int flags, struct page **pages, int *nr)
>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>> +			 unsigned long end, unsigned int flags,
>> +			 struct page **pages, int *nr)
>>   {
>>   	struct dev_pagemap *pgmap = NULL;
>>   	int nr_start = *nr, ret = 0;
>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>   			goto pte_unmap;
>>   		}
>>   
>> -		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>> +		/*
>> +		 * THP collapse conceptually does:
>> +		 *   1. Clear and flush PMD
>> +		 *   2. Check the base page refcount
>> +		 *   3. Copy data to huge page
>> +		 *   4. Clear PTE
>> +		 *   5. Discard the base page
>> +		 *
>> +		 * So fast GUP may race with THP collapse then pin and
>> +		 * return an old page since TLB flush is no longer sufficient
>> +		 * to serialize against fast GUP.
>> +		 *
>> +		 * Check PMD, if it is changed just back off since it
>> +		 * means there may be parallel THP collapse.
>> +		 */
> 
> As I mentioned in the other thread, it would be a nice touch to move
> such discussion into the comment header.
> 
>> +		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>> +		    unlikely(pte_val(pte) != pte_val(*ptep))) {
> 
> 
> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
> whole lockless house of cards may fall apart if we try reading the
> page table values without READ_ONCE().
> 
> That's a rather vague statement, and in fact, the READ_ONCE() should
> be paired with a page table write somewhere else, to make that claim
> more precise.

Agreed. We also talked before about using READ_ONCE() for *ptep (or 
using ptep_get_lockless()) in a concurrent scenario of GUP-fast and 
migration [1].

 >>> CPU 0:				CPU 1:
 >>> get_user_pages_fast()
 >>> lockless_pages_from_mm()
 >>> local_irq_save()
 >>> gup_pgd_range()
 >>> gup_p4d_range()
 >>> gup_pud_range()
 >>> gup_pmd_range()
 >>> gup_pte_range()
 >>> pte_t pte = ptep_get_lockless(ptep);
 >>> 				migrate_vma_collect_pmd()
 >>> 				ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl)
 >>> 				ptep_get_and_clear(mm, addr, ptep);
 >>> page = pte_page(pte);
 >>> 				set_pte_at(mm, addr, ptep, swp_pte);
 >>> 				migrate_page_move_mapping()
 >>> head = try_grab_compound_head(page, 1, flags);

[1] 
https://lore.kernel.org/all/7ec1d098-0021-ae82-8d73-dd9c2eb80dab@linux.alibaba.com/

>>   			gup_put_folio(folio, 1, flags);
>>   			goto pte_unmap;
>>   		}
>> @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>    * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>>    * useful to have gup_huge_pmd even if we can't operate on ptes.
>>    */
>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>> -			 unsigned int flags, struct page **pages, int *nr)
>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>> +			 unsigned long end, unsigned int flags,
>> +			 struct page **pages, int *nr)
>>   {
>>   	return 0;
>>   }
>> @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
>>   			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
>>   					 PMD_SHIFT, next, flags, pages, nr))
>>   				return 0;
>> -		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
>> +		} else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
>>   			return 0;
>>   	} while (pmdp++, addr = next, addr != end);
>>   
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 2d74cf01f694..518b49095db3 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>   
>>   	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>   	/*
>> -	 * After this gup_fast can't run anymore. This also removes
>> -	 * any huge TLB entry from the CPU so we won't allow
>> -	 * huge and small TLB entries for the same virtual address
>> -	 * to avoid the risk of CPU bugs in that area.
>> +	 * This removes any huge TLB entry from the CPU so we won't allow
>> +	 * huge and small TLB entries for the same virtual address to
>> +	 * avoid the risk of CPU bugs in that area.
>> +	 *
>> +	 * Parallel fast GUP is fine since fast GUP will back off when
>> +	 * it detects PMD is changed.
>>   	 */
>>   	_pmd = pmdp_collapse_flush(vma, address, pmd);
> 
> To follow up on David Hildenbrand's note about this in the nearby thread...
> I'm also not sure if pmdp_collapse_flush() implies a memory barrier on
> all arches. It definitely does do an atomic op with a return value on x86,
> but that's just one arch.
> 
> 
> thanks,
>
Baolin Wang Sept. 5, 2022, 10:16 a.m. UTC | #17
On 9/5/2022 3:59 PM, David Hildenbrand wrote:
> On 05.09.22 00:29, John Hubbard wrote:
>> On 9/1/22 15:27, Yang Shi wrote:
>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no 
>>> longer
>>> sufficient to handle concurrent GUP-fast in all cases, it only handles
>>> traditional IPI-based GUP-fast correctly.  On architectures that send
>>> an IPI broadcast on TLB flush, it works as expected.  But on the
>>> architectures that do not use IPI to broadcast TLB flush, it may have
>>> the below race:
>>>
>>>     CPU A                                          CPU B
>>> THP collapse                                     fast GUP
>>>                                                gup_pmd_range() <-- 
>>> see valid pmd
>>>                                                    gup_pte_range() 
>>> <-- work on pte
>>> pmdp_collapse_flush() <-- clear pmd and flush
>>> __collapse_huge_page_isolate()
>>>      check page pinned <-- before GUP bump refcount
>>>                                                        pin the page
>>>                                                        check PTE <-- 
>>> no change
>>> __collapse_huge_page_copy()
>>>      copy data to huge page
>>>      ptep_clear()
>>> install huge pmd for the huge page
>>>                                                        return the 
>>> stale page
>>> discard the stale page
>>
>> Hi Yang,
>>
>> Thanks for taking the trouble to write down these notes. I always
>> forget which race we are dealing with, and this is a great help. :)
>>
>> More...
>>
>>>
>>> The race could be fixed by checking whether PMD is changed or not after
>>> taking the page pin in fast GUP, just like what it does for PTE.  If the
>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>> should back off.
>>>
>>> Also update the stale comment about serializing against fast GUP in
>>> khugepaged.
>>>
>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU 
>>> get_user_pages_fast()")
>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>> ---
>>>   mm/gup.c        | 30 ++++++++++++++++++++++++------
>>>   mm/khugepaged.c | 10 ++++++----
>>>   2 files changed, 30 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index f3fc1f08d90c..4365b2811269 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int 
>>> *nr, int nr_start,
>>>   }
>>>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned 
>>> long end,
>>> -             unsigned int flags, struct page **pages, int *nr)
>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>>> +             unsigned long end, unsigned int flags,
>>> +             struct page **pages, int *nr)
>>>   {
>>>       struct dev_pagemap *pgmap = NULL;
>>>       int nr_start = *nr, ret = 0;
>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned 
>>> long addr, unsigned long end,
>>>               goto pte_unmap;
>>>           }
>>> -        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>> +        /*
>>> +         * THP collapse conceptually does:
>>> +         *   1. Clear and flush PMD
>>> +         *   2. Check the base page refcount
>>> +         *   3. Copy data to huge page
>>> +         *   4. Clear PTE
>>> +         *   5. Discard the base page
>>> +         *
>>> +         * So fast GUP may race with THP collapse then pin and
>>> +         * return an old page since TLB flush is no longer sufficient
>>> +         * to serialize against fast GUP.
>>> +         *
>>> +         * Check PMD, if it is changed just back off since it
>>> +         * means there may be parallel THP collapse.
>>> +         */
>>
>> As I mentioned in the other thread, it would be a nice touch to move
>> such discussion into the comment header.
>>
>>> +        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>> +            unlikely(pte_val(pte) != pte_val(*ptep))) {
>>
>>
>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>> whole lockless house of cards may fall apart if we try reading the
>> page table values without READ_ONCE().
> 
> I came to the conclusion that the implicit memory barrier when grabbing 
> a reference on the page is sufficient such that we don't need READ_ONCE 
> here.

IMHO the compiler may optimize the code 'pte_val(*ptep)' so that the 
value is always taken from a register, and then we can get an old value 
if another thread did set_pte(). I am not sure how the implicit memory 
barrier can prevent this compiler optimization. Please correct me if I 
missed something.

> If we still intend to change that code, we should fixup all GUP-fast 
> functions in a similar way. But again, I don't think we need a change here.
> 
> 
>>> -     * After this gup_fast can't run anymore. This also removes
>>> -     * any huge TLB entry from the CPU so we won't allow
>>> -     * huge and small TLB entries for the same virtual address
>>> -     * to avoid the risk of CPU bugs in that area.
>>> +     * This removes any huge TLB entry from the CPU so we won't allow
>>> +     * huge and small TLB entries for the same virtual address to
>>> +     * avoid the risk of CPU bugs in that area.
>>> +     *
>>> +     * Parallel fast GUP is fine since fast GUP will back off when
>>> +     * it detects PMD is changed.
>>>        */
>>>       _pmd = pmdp_collapse_flush(vma, address, pmd);
>>
>> To follow up on David Hildenbrand's note about this in the nearby 
>> thread...
>> I'm also not sure if pmdp_collapse_flush() implies a memory barrier on
>> all arches. It definitely does do an atomic op with a return value on 
>> x86,
>> but that's just one arch.
>>
> 
> I think a ptep/pmdp clear + TLB flush really has to imply a memory 
> barrier, otherwise TLB flushing code might easily mess up with 
> surrounding code. But we should better double-check.
> 
> s390x executes an IDTE instruction, which performs serialization (-> 
> memory barrier). arm64 seems to use DSB instructions to enforce memory 
> ordering.
>
David Hildenbrand Sept. 5, 2022, 10:24 a.m. UTC | #18
On 05.09.22 12:16, Baolin Wang wrote:
> 
> 
> On 9/5/2022 3:59 PM, David Hildenbrand wrote:
>> On 05.09.22 00:29, John Hubbard wrote:
>>> On 9/1/22 15:27, Yang Shi wrote:
>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no
>>>> longer
>>>> sufficient to handle concurrent GUP-fast in all cases, it only handles
>>>> traditional IPI-based GUP-fast correctly.  On architectures that send
>>>> an IPI broadcast on TLB flush, it works as expected.  But on the
>>>> architectures that do not use IPI to broadcast TLB flush, it may have
>>>> the below race:
>>>>
>>>>      CPU A                                          CPU B
>>>> THP collapse                                     fast GUP
>>>>                                                 gup_pmd_range() <--
>>>> see valid pmd
>>>>                                                     gup_pte_range()
>>>> <-- work on pte
>>>> pmdp_collapse_flush() <-- clear pmd and flush
>>>> __collapse_huge_page_isolate()
>>>>       check page pinned <-- before GUP bump refcount
>>>>                                                         pin the page
>>>>                                                         check PTE <--
>>>> no change
>>>> __collapse_huge_page_copy()
>>>>       copy data to huge page
>>>>       ptep_clear()
>>>> install huge pmd for the huge page
>>>>                                                         return the
>>>> stale page
>>>> discard the stale page
>>>
>>> Hi Yang,
>>>
>>> Thanks for taking the trouble to write down these notes. I always
>>> forget which race we are dealing with, and this is a great help. :)
>>>
>>> More...
>>>
>>>>
>>>> The race could be fixed by checking whether PMD is changed or not after
>>>> taking the page pin in fast GUP, just like what it does for PTE.  If the
>>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>>> should back off.
>>>>
>>>> Also update the stale comment about serializing against fast GUP in
>>>> khugepaged.
>>>>
>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU
>>>> get_user_pages_fast()")
>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>>> ---
>>>>    mm/gup.c        | 30 ++++++++++++++++++++++++------
>>>>    mm/khugepaged.c | 10 ++++++----
>>>>    2 files changed, 30 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index f3fc1f08d90c..4365b2811269 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int
>>>> *nr, int nr_start,
>>>>    }
>>>>    #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
>>>> long end,
>>>> -             unsigned int flags, struct page **pages, int *nr)
>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>>>> +             unsigned long end, unsigned int flags,
>>>> +             struct page **pages, int *nr)
>>>>    {
>>>>        struct dev_pagemap *pgmap = NULL;
>>>>        int nr_start = *nr, ret = 0;
>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned
>>>> long addr, unsigned long end,
>>>>                goto pte_unmap;
>>>>            }
>>>> -        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>> +        /*
>>>> +         * THP collapse conceptually does:
>>>> +         *   1. Clear and flush PMD
>>>> +         *   2. Check the base page refcount
>>>> +         *   3. Copy data to huge page
>>>> +         *   4. Clear PTE
>>>> +         *   5. Discard the base page
>>>> +         *
>>>> +         * So fast GUP may race with THP collapse then pin and
>>>> +         * return an old page since TLB flush is no longer sufficient
>>>> +         * to serialize against fast GUP.
>>>> +         *
>>>> +         * Check PMD, if it is changed just back off since it
>>>> +         * means there may be parallel THP collapse.
>>>> +         */
>>>
>>> As I mentioned in the other thread, it would be a nice touch to move
>>> such discussion into the comment header.
>>>
>>>> +        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>>> +            unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>
>>>
>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>>> whole lockless house of cards may fall apart if we try reading the
>>> page table values without READ_ONCE().
>>
>> I came to the conclusion that the implicit memory barrier when grabbing
>> a reference on the page is sufficient such that we don't need READ_ONCE
>> here.
> 
> IMHO the compiler may optimize the code 'pte_val(*ptep)' so that the
> value is always taken from a register, and then we can get an old value
> if another thread did set_pte(). I am not sure how the implicit memory
> barrier can prevent this compiler optimization. Please correct me if I
> missed something.

IIUC, a memory barrier always implies a compiler barrier.
David Hildenbrand Sept. 5, 2022, 11:11 a.m. UTC | #19
On 05.09.22 12:24, David Hildenbrand wrote:
> On 05.09.22 12:16, Baolin Wang wrote:
>>
>>
>> On 9/5/2022 3:59 PM, David Hildenbrand wrote:
>>> On 05.09.22 00:29, John Hubbard wrote:
>>>> On 9/1/22 15:27, Yang Shi wrote:
>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no
>>>>> longer
>>>>> sufficient to handle concurrent GUP-fast in all cases, it only handles
>>>>> traditional IPI-based GUP-fast correctly.  On architectures that send
>>>>> an IPI broadcast on TLB flush, it works as expected.  But on the
>>>>> architectures that do not use IPI to broadcast TLB flush, it may have
>>>>> the below race:
>>>>>
>>>>>       CPU A                                          CPU B
>>>>> THP collapse                                     fast GUP
>>>>>                                                  gup_pmd_range() <--
>>>>> see valid pmd
>>>>>                                                      gup_pte_range()
>>>>> <-- work on pte
>>>>> pmdp_collapse_flush() <-- clear pmd and flush
>>>>> __collapse_huge_page_isolate()
>>>>>        check page pinned <-- before GUP bump refcount
>>>>>                                                          pin the page
>>>>>                                                          check PTE <--
>>>>> no change
>>>>> __collapse_huge_page_copy()
>>>>>        copy data to huge page
>>>>>        ptep_clear()
>>>>> install huge pmd for the huge page
>>>>>                                                          return the
>>>>> stale page
>>>>> discard the stale page
>>>>
>>>> Hi Yang,
>>>>
>>>> Thanks for taking the trouble to write down these notes. I always
>>>> forget which race we are dealing with, and this is a great help. :)
>>>>
>>>> More...
>>>>
>>>>>
>>>>> The race could be fixed by checking whether PMD is changed or not after
>>>>> taking the page pin in fast GUP, just like what it does for PTE.  If the
>>>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>>>> should back off.
>>>>>
>>>>> Also update the stale comment about serializing against fast GUP in
>>>>> khugepaged.
>>>>>
>>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU
>>>>> get_user_pages_fast()")
>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>>>> ---
>>>>>     mm/gup.c        | 30 ++++++++++++++++++++++++------
>>>>>     mm/khugepaged.c | 10 ++++++----
>>>>>     2 files changed, 30 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>> index f3fc1f08d90c..4365b2811269 100644
>>>>> --- a/mm/gup.c
>>>>> +++ b/mm/gup.c
>>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int
>>>>> *nr, int nr_start,
>>>>>     }
>>>>>     #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
>>>>> long end,
>>>>> -             unsigned int flags, struct page **pages, int *nr)
>>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>>>>> +             unsigned long end, unsigned int flags,
>>>>> +             struct page **pages, int *nr)
>>>>>     {
>>>>>         struct dev_pagemap *pgmap = NULL;
>>>>>         int nr_start = *nr, ret = 0;
>>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned
>>>>> long addr, unsigned long end,
>>>>>                 goto pte_unmap;
>>>>>             }
>>>>> -        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>> +        /*
>>>>> +         * THP collapse conceptually does:
>>>>> +         *   1. Clear and flush PMD
>>>>> +         *   2. Check the base page refcount
>>>>> +         *   3. Copy data to huge page
>>>>> +         *   4. Clear PTE
>>>>> +         *   5. Discard the base page
>>>>> +         *
>>>>> +         * So fast GUP may race with THP collapse then pin and
>>>>> +         * return an old page since TLB flush is no longer sufficient
>>>>> +         * to serialize against fast GUP.
>>>>> +         *
>>>>> +         * Check PMD, if it is changed just back off since it
>>>>> +         * means there may be parallel THP collapse.
>>>>> +         */
>>>>
>>>> As I mentioned in the other thread, it would be a nice touch to move
>>>> such discussion into the comment header.
>>>>
>>>>> +        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>>>> +            unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>
>>>>
>>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>>>> whole lockless house of cards may fall apart if we try reading the
>>>> page table values without READ_ONCE().
>>>
>>> I came to the conclusion that the implicit memory barrier when grabbing
>>> a reference on the page is sufficient such that we don't need READ_ONCE
>>> here.
>>
>> IMHO the compiler may optimize the code 'pte_val(*ptep)' so that the
>> value is always taken from a register, and then we can get an old value
>> if another thread did set_pte(). I am not sure how the implicit memory
>> barrier can prevent this compiler optimization. Please correct me if I
>> missed something.
> 
> IIUC, a memory barrier always implies a compiler barrier.
> 

To clarify what I mean, Documentation/atomic_t.txt documents

NOTE: when the atomic RmW ops are fully ordered, they should also imply 
a compiler barrier.
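
A minimal userspace sketch of that point (illustration only, not kernel
code; fake_pte merely stands in for *ptep):

#include <stdio.h>

unsigned long fake_pte;		/* stands in for *ptep */

/* No barrier: the compiler may keep the first load in a register and
 * never re-read fake_pte, so the comparison can degenerate into
 * 'old == old'. */
int recheck_without_barrier(void)
{
	unsigned long old = fake_pte;
	return old == fake_pte;
}

/* With a compiler barrier in between (as implied by a fully ordered
 * atomic RmW such as grabbing the page reference), the compiler has to
 * reload fake_pte from memory for the comparison. */
int recheck_with_barrier(void)
{
	unsigned long old = fake_pte;
	__asm__ __volatile__("" ::: "memory");
	return old == fake_pte;
}

int main(void)
{
	printf("%d %d\n", recheck_without_barrier(), recheck_with_barrier());
	return 0;
}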
Baolin Wang Sept. 5, 2022, 2:35 p.m. UTC | #20
On 9/5/2022 7:11 PM, David Hildenbrand wrote:
> On 05.09.22 12:24, David Hildenbrand wrote:
>> On 05.09.22 12:16, Baolin Wang wrote:
>>>
>>>
>>> On 9/5/2022 3:59 PM, David Hildenbrand wrote:
>>>> On 05.09.22 00:29, John Hubbard wrote:
>>>>> On 9/1/22 15:27, Yang Shi wrote:
>>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 
>>>>>> ("mm:
>>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no
>>>>>> longer
>>>>>> sufficient to handle concurrent GUP-fast in all cases, it only 
>>>>>> handles
>>>>>> traditional IPI-based GUP-fast correctly.  On architectures that send
>>>>>> an IPI broadcast on TLB flush, it works as expected.  But on the
>>>>>> architectures that do not use IPI to broadcast TLB flush, it may have
>>>>>> the below race:
>>>>>>
>>>>>>       CPU A                                          CPU B
>>>>>> THP collapse                                     fast GUP
>>>>>>                                                  gup_pmd_range() <--
>>>>>> see valid pmd
>>>>>>                                                      gup_pte_range()
>>>>>> <-- work on pte
>>>>>> pmdp_collapse_flush() <-- clear pmd and flush
>>>>>> __collapse_huge_page_isolate()
>>>>>>        check page pinned <-- before GUP bump refcount
>>>>>>                                                          pin the page
>>>>>>                                                          check PTE 
>>>>>> <--
>>>>>> no change
>>>>>> __collapse_huge_page_copy()
>>>>>>        copy data to huge page
>>>>>>        ptep_clear()
>>>>>> install huge pmd for the huge page
>>>>>>                                                          return the
>>>>>> stale page
>>>>>> discard the stale page
>>>>>
>>>>> Hi Yang,
>>>>>
>>>>> Thanks for taking the trouble to write down these notes. I always
>>>>> forget which race we are dealing with, and this is a great help. :)
>>>>>
>>>>> More...
>>>>>
>>>>>>
>>>>>> The race could be fixed by checking whether PMD is changed or not 
>>>>>> after
>>>>>> taking the page pin in fast GUP, just like what it does for PTE.  
>>>>>> If the
>>>>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>>>>> should back off.
>>>>>>
>>>>>> Also update the stale comment about serializing against fast GUP in
>>>>>> khugepaged.
>>>>>>
>>>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU
>>>>>> get_user_pages_fast()")
>>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>>>>> ---
>>>>>>     mm/gup.c        | 30 ++++++++++++++++++++++++------
>>>>>>     mm/khugepaged.c | 10 ++++++----
>>>>>>     2 files changed, 30 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>> index f3fc1f08d90c..4365b2811269 100644
>>>>>> --- a/mm/gup.c
>>>>>> +++ b/mm/gup.c
>>>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int
>>>>>> *nr, int nr_start,
>>>>>>     }
>>>>>>     #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
>>>>>> long end,
>>>>>> -             unsigned int flags, struct page **pages, int *nr)
>>>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>>>>>> +             unsigned long end, unsigned int flags,
>>>>>> +             struct page **pages, int *nr)
>>>>>>     {
>>>>>>         struct dev_pagemap *pgmap = NULL;
>>>>>>         int nr_start = *nr, ret = 0;
>>>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned
>>>>>> long addr, unsigned long end,
>>>>>>                 goto pte_unmap;
>>>>>>             }
>>>>>> -        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>>> +        /*
>>>>>> +         * THP collapse conceptually does:
>>>>>> +         *   1. Clear and flush PMD
>>>>>> +         *   2. Check the base page refcount
>>>>>> +         *   3. Copy data to huge page
>>>>>> +         *   4. Clear PTE
>>>>>> +         *   5. Discard the base page
>>>>>> +         *
>>>>>> +         * So fast GUP may race with THP collapse then pin and
>>>>>> +         * return an old page since TLB flush is no longer 
>>>>>> sufficient
>>>>>> +         * to serialize against fast GUP.
>>>>>> +         *
>>>>>> +         * Check PMD, if it is changed just back off since it
>>>>>> +         * means there may be parallel THP collapse.
>>>>>> +         */
>>>>>
>>>>> As I mentioned in the other thread, it would be a nice touch to move
>>>>> such discussion into the comment header.
>>>>>
>>>>>> +        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>>>>> +            unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>>
>>>>>
>>>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>>>>> whole lockless house of cards may fall apart if we try reading the
>>>>> page table values without READ_ONCE().
>>>>
>>>> I came to the conclusion that the implicit memory barrier when grabbing
>>>> a reference on the page is sufficient such that we don't need READ_ONCE
>>>> here.
>>>
>>> IMHO the compiler may optimize the code 'pte_val(*ptep)' so that the
>>> value is always taken from a register, and then we can get an old value
>>> if another thread did set_pte(). I am not sure how the implicit memory
>>> barrier can prevent this compiler optimization. Please correct me if I
>>> missed something.
>>
>> IIUC, a memory barrier always implies a compiler barrier.
>>
> 
> To clarify what I mean, Documentation/atomic_t.txt documents
> 
> NOTE: when the atomic RmW ops are fully ordered, they should also imply 
> a compiler barrier.

Right, I agree. That means the compiler cannot reorder the 
'pte_val(*ptep)' read; however, what I am still confused about is whether 
the compiler can still keep the value of *ptep in a register or on the 
stack instead of reloading it from memory.

A similar issue was addressed in commit d6c1f098f2a7 ("mm/swap_state: fix 
a data race in swapin_nr_pages"):

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -509,10 +509,11 @@ static unsigned long swapin_nr_pages(unsigned long 
offset)
                 return 1;

         hits = atomic_xchg(&swapin_readahead_hits, 0);
-       pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
+       pages = __swapin_nr_pages(READ_ONCE(prev_offset), offset, hits,
+                                 max_pages,
                                   atomic_read(&last_readahead_pages));
         if (!hits)
-               prev_offset = offset;
+               WRITE_ONCE(prev_offset, offset);
         atomic_set(&last_readahead_pages, pages);

         return pages;
David Hildenbrand Sept. 5, 2022, 2:40 p.m. UTC | #21
On 05.09.22 16:35, Baolin Wang wrote:
> 
> 
> On 9/5/2022 7:11 PM, David Hildenbrand wrote:
>> On 05.09.22 12:24, David Hildenbrand wrote:
>>> On 05.09.22 12:16, Baolin Wang wrote:
>>>>
>>>>
>>>> On 9/5/2022 3:59 PM, David Hildenbrand wrote:
>>>>> On 05.09.22 00:29, John Hubbard wrote:
>>>>>> On 9/1/22 15:27, Yang Shi wrote:
>>>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81
>>>>>>> ("mm:
>>>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no
>>>>>>> longer
>>>>>>> sufficient to handle concurrent GUP-fast in all cases, it only
>>>>>>> handles
>>>>>>> traditional IPI-based GUP-fast correctly.  On architectures that send
>>>>>>> an IPI broadcast on TLB flush, it works as expected.  But on the
>>>>>>> architectures that do not use IPI to broadcast TLB flush, it may have
>>>>>>> the below race:
>>>>>>>
>>>>>>>        CPU A                                          CPU B
>>>>>>> THP collapse                                     fast GUP
>>>>>>>                                                   gup_pmd_range() <--
>>>>>>> see valid pmd
>>>>>>>                                                       gup_pte_range()
>>>>>>> <-- work on pte
>>>>>>> pmdp_collapse_flush() <-- clear pmd and flush
>>>>>>> __collapse_huge_page_isolate()
>>>>>>>         check page pinned <-- before GUP bump refcount
>>>>>>>                                                           pin the page
>>>>>>>                                                           check PTE
>>>>>>> <--
>>>>>>> no change
>>>>>>> __collapse_huge_page_copy()
>>>>>>>         copy data to huge page
>>>>>>>         ptep_clear()
>>>>>>> install huge pmd for the huge page
>>>>>>>                                                           return the
>>>>>>> stale page
>>>>>>> discard the stale page
>>>>>>
>>>>>> Hi Yang,
>>>>>>
>>>>>> Thanks for taking the trouble to write down these notes. I always
>>>>>> forget which race we are dealing with, and this is a great help. :)
>>>>>>
>>>>>> More...
>>>>>>
>>>>>>>
>>>>>>> The race could be fixed by checking whether PMD is changed or not
>>>>>>> after
>>>>>>> taking the page pin in fast GUP, just like what it does for PTE.
>>>>>>> If the
>>>>>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>>>>>> should back off.
>>>>>>>
>>>>>>> Also update the stale comment about serializing against fast GUP in
>>>>>>> khugepaged.
>>>>>>>
>>>>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU
>>>>>>> get_user_pages_fast()")
>>>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>>>>>> ---
>>>>>>>      mm/gup.c        | 30 ++++++++++++++++++++++++------
>>>>>>>      mm/khugepaged.c | 10 ++++++----
>>>>>>>      2 files changed, 30 insertions(+), 10 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>>> index f3fc1f08d90c..4365b2811269 100644
>>>>>>> --- a/mm/gup.c
>>>>>>> +++ b/mm/gup.c
>>>>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int
>>>>>>> *nr, int nr_start,
>>>>>>>      }
>>>>>>>      #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>>>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
>>>>>>> long end,
>>>>>>> -             unsigned int flags, struct page **pages, int *nr)
>>>>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>>>>>>> +             unsigned long end, unsigned int flags,
>>>>>>> +             struct page **pages, int *nr)
>>>>>>>      {
>>>>>>>          struct dev_pagemap *pgmap = NULL;
>>>>>>>          int nr_start = *nr, ret = 0;
>>>>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned
>>>>>>> long addr, unsigned long end,
>>>>>>>                  goto pte_unmap;
>>>>>>>              }
>>>>>>> -        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>>>> +        /*
>>>>>>> +         * THP collapse conceptually does:
>>>>>>> +         *   1. Clear and flush PMD
>>>>>>> +         *   2. Check the base page refcount
>>>>>>> +         *   3. Copy data to huge page
>>>>>>> +         *   4. Clear PTE
>>>>>>> +         *   5. Discard the base page
>>>>>>> +         *
>>>>>>> +         * So fast GUP may race with THP collapse then pin and
>>>>>>> +         * return an old page since TLB flush is no longer
>>>>>>> sufficient
>>>>>>> +         * to serialize against fast GUP.
>>>>>>> +         *
>>>>>>> +         * Check PMD, if it is changed just back off since it
>>>>>>> +         * means there may be parallel THP collapse.
>>>>>>> +         */
>>>>>>
>>>>>> As I mentioned in the other thread, it would be a nice touch to move
>>>>>> such discussion into the comment header.
>>>>>>
>>>>>>> +        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>>>>>> +            unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>>>
>>>>>>
>>>>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>>>>>> whole lockless house of cards may fall apart if we try reading the
>>>>>> page table values without READ_ONCE().
>>>>>
>>>>> I came to the conclusion that the implicit memory barrier when grabbing
>>>>> a reference on the page is sufficient such that we don't need READ_ONCE
>>>>> here.
>>>>
>>>> IMHO the compiler may optimize the code 'pte_val(*ptep)' to be always
>>>> get from a register, then we can get an old value if other thread did
>>>> set_pte(). I am not sure how the implicit memory barrier can pervent the
>>>> compiler optimization? Please correct me if I missed something.
>>>
>>> IIUC, an memory barrier always implies a compiler barrier.
>>>
>>
>> To clarify what I mean, Documentation/atomic_t.txt documents
>>
>> NOTE: when the atomic RmW ops are fully ordered, they should also imply
>> a compiler barrier.
> 
> Right, I agree. That means the complier can not optimize the order of
> the 'pte_val(*ptep)', however what I am confusing is that the complier
> can still save the value of *ptep into a register or stack instead of
> reloading from memory?

After the memory+compiler barrier, the value has to be reloaded. 
Documentation/memory-barriers.txt documents under "COMPILER BARRIERS":

"READ_ONCE() and WRITE_ONCE() can be thought of as weak forms of 
barrier() that affect only the specific accesses flagged by the 
READ_ONCE() or WRITE_ONCE()."

Consequently, if there already is a compile barrier, additional 
READ_ONCE/WRITE_ONCE isn't required.
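
As a rough sketch of the ordering being relied upon here, simplified from
the gup_pte_range() flow in the patch (not the literal code):

	pte = ptep_get_lockless(ptep);		/* snapshot the PTE */
	...
	folio = try_grab_folio(page, 1, flags);	/* atomic RMW on the refcount:
						 * full memory barrier, hence
						 * also a compiler barrier */
	...
	/*
	 * Because of that barrier, *pmdp and *ptep below have to be
	 * reloaded from memory; a value cached before the reference
	 * was taken cannot be reused.
	 */
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(*ptep))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}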

> 
> A similar issue in commit d6c1f098f2a7 ("mm/swap_state: fix a data race
> in swapin_nr_pages").
> 
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -509,10 +509,11 @@ static unsigned long swapin_nr_pages(unsigned long
> offset)
>                   return 1;
> 
>           hits = atomic_xchg(&swapin_readahead_hits, 0);
> -       pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
> +       pages = __swapin_nr_pages(READ_ONCE(prev_offset), offset, hits,
> +                                 max_pages,
>                                     atomic_read(&last_readahead_pages));
>           if (!hits)
> -               prev_offset = offset;
> +               WRITE_ONCE(prev_offset, offset);
>           atomic_set(&last_readahead_pages, pages);
> 
>           return pages;
> 

IIUC the difference here is that there is not other implicit 
memory+compile barrier in between.
John Hubbard Sept. 6, 2022, 2:12 a.m. UTC | #22
On 9/5/22 00:59, David Hildenbrand wrote:
...
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index f3fc1f08d90c..4365b2811269 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>>>   }
>>>   
>>>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>> -			 unsigned int flags, struct page **pages, int *nr)
>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>>> +			 unsigned long end, unsigned int flags,
>>> +			 struct page **pages, int *nr)
>>>   {
>>>   	struct dev_pagemap *pgmap = NULL;
>>>   	int nr_start = *nr, ret = 0;
>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>>   			goto pte_unmap;
>>>   		}
>>>   
>>> -		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>> +		/*
>>> +		 * THP collapse conceptually does:
>>> +		 *   1. Clear and flush PMD
>>> +		 *   2. Check the base page refcount
>>> +		 *   3. Copy data to huge page
>>> +		 *   4. Clear PTE
>>> +		 *   5. Discard the base page
>>> +		 *
>>> +		 * So fast GUP may race with THP collapse then pin and
>>> +		 * return an old page since TLB flush is no longer sufficient
>>> +		 * to serialize against fast GUP.
>>> +		 *
>>> +		 * Check PMD, if it is changed just back off since it
>>> +		 * means there may be parallel THP collapse.
>>> +		 */
>>
>> As I mentioned in the other thread, it would be a nice touch to move
>> such discussion into the comment header.
>>
>>> +		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>> +		    unlikely(pte_val(pte) != pte_val(*ptep))) {
>>
>>
>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>> whole lockless house of cards may fall apart if we try reading the
>> page table values without READ_ONCE().
> 
> I came to the conclusion that the implicit memory barrier when grabbing 
> a reference on the page is sufficient such that we don't need READ_ONCE 
> here.

OK, I believe you're referring to this:

	folio = try_grab_folio(page, 1, flags);

just earlier in gup_pte_range(). Yes that's true...but it's hidden, which
is unfortunate. Maybe a comment could help.

> 
> If we still intend to change that code, we should fixup all GUP-fast 
> functions in a similar way. But again, I don't think we need a change here.
> 

It's really rough, having to play this hide-and-seek game of "who did
the memory barrier". And I'm tempted to suggest adding READ_ONCE() to
any and all reads of the page table entries, just to help stay out of
trouble. It's a visual reminder that page table reads are always
lockless reads and are inherently volatile.

Of course, I realize that adding extra READ_ONCE() calls is not a good
thing. It might be a performance hit, although, again, these are
volatile reads by nature, so you probably had a membar anyway.

And looking in reverse, there are actually a number of places here where
we could probably get away with *removing* READ_ONCE()!

Overall, I would be inclined to load up on READ_ONCE() calls, yes. But I
sort of expect to be overridden on that, due to potential performance
concerns, and that's reasonable.

At a minimum we should add a few short comments about what memory
barriers are used, and why we don't need a READ_ONCE() or something
stronger when reading the pte.
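
For example, a short comment right above the recheck could read (the
wording is just a sketch):

		/*
		 * try_grab_folio() above did a fully ordered atomic RMW on
		 * the folio refcount, which implies a memory barrier and
		 * therefore a compiler barrier, so *pmdp and *ptep are
		 * reloaded from memory for the check below; that is why
		 * no READ_ONCE() is needed here.
		 */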


> 
>>> -	 * After this gup_fast can't run anymore. This also removes
>>> -	 * any huge TLB entry from the CPU so we won't allow
>>> -	 * huge and small TLB entries for the same virtual address
>>> -	 * to avoid the risk of CPU bugs in that area.
>>> +	 * This removes any huge TLB entry from the CPU so we won't allow
>>> +	 * huge and small TLB entries for the same virtual address to
>>> +	 * avoid the risk of CPU bugs in that area.
>>> +	 *
>>> +	 * Parallel fast GUP is fine since fast GUP will back off when
>>> +	 * it detects PMD is changed.
>>>   	 */
>>>   	_pmd = pmdp_collapse_flush(vma, address, pmd);
>>
>> To follow up on David Hildenbrand's note about this in the nearby thread...
>> I'm also not sure if pmdp_collapse_flush() implies a memory barrier on
>> all arches. It definitely does do an atomic op with a return value on x86,
>> but that's just one arch.
>>
> 
> I think a ptep/pmdp clear + TLB flush really has to imply a memory 
> barrier, otherwise TLB flushing code might easily mess up with 
surrounding code. But we'd better double-check.

Let's document the function as such, once it's verified: "This is a
guaranteed memory barrier".
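
For example, something like the following above the generic definition,
once that verification has been done (wording is only a sketch):

	/*
	 * pmdp_collapse_flush() clears the PMD and flushes the TLB for the
	 * covered range.  Callers may rely on it implying a full memory
	 * barrier (and therefore a compiler barrier) on all architectures.
	 */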

> 
> s390x executes an IDTE instruction, which performs serialization (-> 
> memory barrier). arm64 seems to use DSB instructions to enforce memory 
> ordering.
> 

thanks,
Baolin Wang Sept. 6, 2022, 5:53 a.m. UTC | #23
On 9/5/2022 10:40 PM, David Hildenbrand wrote:
> On 05.09.22 16:35, Baolin Wang wrote:
>>
>>
>> On 9/5/2022 7:11 PM, David Hildenbrand wrote:
>>> On 05.09.22 12:24, David Hildenbrand wrote:
>>>> On 05.09.22 12:16, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 9/5/2022 3:59 PM, David Hildenbrand wrote:
>>>>>> On 05.09.22 00:29, John Hubbard wrote:
>>>>>>> On 9/1/22 15:27, Yang Shi wrote:
>>>>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81
>>>>>>>> ("mm:
>>>>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no
>>>>>>>> longer
>>>>>>>> sufficient to handle concurrent GUP-fast in all cases, it only
>>>>>>>> handles
>>>>>>>> traditional IPI-based GUP-fast correctly.  On architectures that 
>>>>>>>> send
>>>>>>>> an IPI broadcast on TLB flush, it works as expected.  But on the
>>>>>>>> architectures that do not use IPI to broadcast TLB flush, it may 
>>>>>>>> have
>>>>>>>> the below race:
>>>>>>>>
>>>>>>>>        CPU A                                          CPU B
>>>>>>>> THP collapse                                     fast GUP
>>>>>>>>                                                   
>>>>>>>> gup_pmd_range() <--
>>>>>>>> see valid pmd
>>>>>>>>                                                       
>>>>>>>> gup_pte_range()
>>>>>>>> <-- work on pte
>>>>>>>> pmdp_collapse_flush() <-- clear pmd and flush
>>>>>>>> __collapse_huge_page_isolate()
>>>>>>>>         check page pinned <-- before GUP bump refcount
>>>>>>>>                                                           pin 
>>>>>>>> the page
>>>>>>>>                                                           check PTE
>>>>>>>> <--
>>>>>>>> no change
>>>>>>>> __collapse_huge_page_copy()
>>>>>>>>         copy data to huge page
>>>>>>>>         ptep_clear()
>>>>>>>> install huge pmd for the huge page
>>>>>>>>                                                           return 
>>>>>>>> the
>>>>>>>> stale page
>>>>>>>> discard the stale page
>>>>>>>
>>>>>>> Hi Yang,
>>>>>>>
>>>>>>> Thanks for taking the trouble to write down these notes. I always
>>>>>>> forget which race we are dealing with, and this is a great help. :)
>>>>>>>
>>>>>>> More...
>>>>>>>
>>>>>>>>
>>>>>>>> The race could be fixed by checking whether PMD is changed or not
>>>>>>>> after
>>>>>>>> taking the page pin in fast GUP, just like what it does for PTE.
>>>>>>>> If the
>>>>>>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>>>>>>> should back off.
>>>>>>>>
>>>>>>>> Also update the stale comment about serializing against fast GUP in
>>>>>>>> khugepaged.
>>>>>>>>
>>>>>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU
>>>>>>>> get_user_pages_fast()")
>>>>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>>>>>>> ---
>>>>>>>>      mm/gup.c        | 30 ++++++++++++++++++++++++------
>>>>>>>>      mm/khugepaged.c | 10 ++++++----
>>>>>>>>      2 files changed, 30 insertions(+), 10 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>>>> index f3fc1f08d90c..4365b2811269 100644
>>>>>>>> --- a/mm/gup.c
>>>>>>>> +++ b/mm/gup.c
>>>>>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused 
>>>>>>>> undo_dev_pagemap(int
>>>>>>>> *nr, int nr_start,
>>>>>>>>      }
>>>>>>>>      #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>>>>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
>>>>>>>> long end,
>>>>>>>> -             unsigned int flags, struct page **pages, int *nr)
>>>>>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long 
>>>>>>>> addr,
>>>>>>>> +             unsigned long end, unsigned int flags,
>>>>>>>> +             struct page **pages, int *nr)
>>>>>>>>      {
>>>>>>>>          struct dev_pagemap *pgmap = NULL;
>>>>>>>>          int nr_start = *nr, ret = 0;
>>>>>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned
>>>>>>>> long addr, unsigned long end,
>>>>>>>>                  goto pte_unmap;
>>>>>>>>              }
>>>>>>>> -        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>>>>> +        /*
>>>>>>>> +         * THP collapse conceptually does:
>>>>>>>> +         *   1. Clear and flush PMD
>>>>>>>> +         *   2. Check the base page refcount
>>>>>>>> +         *   3. Copy data to huge page
>>>>>>>> +         *   4. Clear PTE
>>>>>>>> +         *   5. Discard the base page
>>>>>>>> +         *
>>>>>>>> +         * So fast GUP may race with THP collapse then pin and
>>>>>>>> +         * return an old page since TLB flush is no longer
>>>>>>>> sufficient
>>>>>>>> +         * to serialize against fast GUP.
>>>>>>>> +         *
>>>>>>>> +         * Check PMD, if it is changed just back off since it
>>>>>>>> +         * means there may be parallel THP collapse.
>>>>>>>> +         */
>>>>>>>
>>>>>>> As I mentioned in the other thread, it would be a nice touch to move
>>>>>>> such discussion into the comment header.
>>>>>>>
>>>>>>>> +        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>>>>>>> +            unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>>>>
>>>>>>>
>>>>>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because 
>>>>>>> this
>>>>>>> whole lockless house of cards may fall apart if we try reading the
>>>>>>> page table values without READ_ONCE().
>>>>>>
>>>>>> I came to the conclusion that the implicit memory barrier when 
>>>>>> grabbing
>>>>>> a reference on the page is sufficient such that we don't need 
>>>>>> READ_ONCE
>>>>>> here.
>>>>>
>>>>> IMHO the compiler may optimize the code 'pte_val(*ptep)' to be always
>>>>> get from a register, then we can get an old value if other thread did
>>>>> set_pte(). I am not sure how the implicit memory barrier can 
>>>>> pervent the
>>>>> compiler optimization? Please correct me if I missed something.
>>>>
>>>> IIUC, an memory barrier always implies a compiler barrier.
>>>>
>>>
>>> To clarify what I mean, Documentation/atomic_t.txt documents
>>>
>>> NOTE: when the atomic RmW ops are fully ordered, they should also imply
>>> a compiler barrier.
>>
>> Right, I agree. That means the complier can not optimize the order of
>> the 'pte_val(*ptep)', however what I am confusing is that the complier
>> can still save the value of *ptep into a register or stack instead of
>> reloading from memory?
> 
> After the memory+compiler barrier, the value has to be reloaded. 
> Documentation/memory-barriers.txt documents under "COMPILER BARRIERS":

After some investigation, I realized you are totally right. The GCC 
Extended Asm manual [1] also says:
"To ensure memory contains correct values, GCC may need to flush 
specific register values to memory before executing the asm. Further, 
the compiler does not assume that any values read from memory before an 
asm remain unchanged after that asm; it reloads them as needed. Using 
the "memory" clobber effectively forms a read/write memory barrier for 
the compiler."

So as you said, the value will be reloaded after the memory+compiler 
barrier. Thanks for your explanation.

[1] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
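
A minimal sketch of that effect, using the kernel's barrier() and a made-up
shared variable 'slot':

	extern unsigned long *slot;	/* stands in for a shared location */
	unsigned long x, y;

	x = *slot;	/* may reuse an earlier load of *slot */
	barrier();	/* expands to asm volatile("" ::: "memory") */
	y = *slot;	/* the "memory" clobber forces a fresh load, so y can
			 * observe a store made by another CPU after x was
			 * read */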

> 
> "READ_ONCE() and WRITE_ONCE() can be thought of as weak forms of 
> barrier() that affect only the specific accesses flagged by the 
> READ_ONCE() or WRITE_ONCE()."
> 
> Consequently, if there already is a compile barrier, additional 
> READ_ONCE/WRITE_ONCE isn't required.
> 
>>
>> A similar issue in commit d6c1f098f2a7 ("mm/swap_state: fix a data race
>> in swapin_nr_pages").
>>
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -509,10 +509,11 @@ static unsigned long swapin_nr_pages(unsigned long
>> offset)
>>                   return 1;
>>
>>           hits = atomic_xchg(&swapin_readahead_hits, 0);
>> -       pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
>> +       pages = __swapin_nr_pages(READ_ONCE(prev_offset), offset, hits,
>> +                                 max_pages,
>>                                     atomic_read(&last_readahead_pages));
>>           if (!hits)
>> -               prev_offset = offset;
>> +               WRITE_ONCE(prev_offset, offset);
>>           atomic_set(&last_readahead_pages, pages);
>>
>>           return pages;
>>
> 
> IIUC the difference here is that there is not other implicit 
> memory+compile barrier in between.

Yes, I see the difference.
David Hildenbrand Sept. 6, 2022, 12:50 p.m. UTC | #24
> OK, I believe you're referring to this:
> 
> 	folio = try_grab_folio(page, 1, flags);
> 
> just earlier in gup_pte_range(). Yes that's true...but it's hidden, which
> is unfortunate. Maybe a comment could help.
> 

Most certainly.

>>
>> If we still intend to change that code, we should fixup all GUP-fast
>> functions in a similar way. But again, I don't think we need a change here.
>>
> 
> It's really rough, having to play this hide-and-seek game of "who did
> the memory barrier". And I'm tempted to suggest adding READ_ONCE() to
> any and all reads of the page table entries, just to help stay out of
> trouble. It's a visual reminder that page table reads are always a
> lockless read and are inherently volatile.
> 
> Of course, I realize that adding extra READ_ONCE() calls is not a good
> thing. It might be a performance hit, although, again, these are
> volatile reads by nature, so you probably had a membar anyway.
> 
> And looking in reverse, there are actually a number of places here where
> we could probably get away with *removing* READ_ONCE()!
> 
> Overall, I would be inclined to load up on READ_ONCE() calls, yes. But I
> sort of expect to be overridden on that, due to potential performance
> concerns, and that's reasonable.
> 
> At a minimum we should add a few short comments about what memory
> barriers are used, and why we don't need a READ_ONCE() or something
> stronger when reading the pte.

Adding more unnecessary memory barriers doesn't necessarily improve the 
situation.

Messing with memory barriers is and remains absolutely disgusting.

IMHO, only clear documentation and ASCII art can keep it somehow 
maintainable for human beings.

> 
> 
>>
>>>> -	 * After this gup_fast can't run anymore. This also removes
>>>> -	 * any huge TLB entry from the CPU so we won't allow
>>>> -	 * huge and small TLB entries for the same virtual address
>>>> -	 * to avoid the risk of CPU bugs in that area.
>>>> +	 * This removes any huge TLB entry from the CPU so we won't allow
>>>> +	 * huge and small TLB entries for the same virtual address to
>>>> +	 * avoid the risk of CPU bugs in that area.
>>>> +	 *
>>>> +	 * Parallel fast GUP is fine since fast GUP will back off when
>>>> +	 * it detects PMD is changed.
>>>>    	 */
>>>>    	_pmd = pmdp_collapse_flush(vma, address, pmd);
>>>
>>> To follow up on David Hildenbrand's note about this in the nearby thread...
>>> I'm also not sure if pmdp_collapse_flush() implies a memory barrier on
>>> all arches. It definitely does do an atomic op with a return value on x86,
>>> but that's just one arch.
>>>
>>
>> I think a ptep/pmdp clear + TLB flush really has to imply a memory
>> barrier, otherwise TLB flushing code might easily mess up with
>> surrounding code. But we should better double-check.
> 
> Let's document the function as such, once it's verified: "This is a
> guaranteed memory barrier".

Yes. Hopefully it indeed is on all architectures :)
Jason Gunthorpe Sept. 6, 2022, 1:47 p.m. UTC | #25
On Mon, Sep 05, 2022 at 09:59:47AM +0200, David Hildenbrand wrote:

> > That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
> > whole lockless house of cards may fall apart if we try reading the
> > page table values without READ_ONCE().
> 
> I came to the conclusion that the implicit memory barrier when grabbing a
> reference on the page is sufficient such that we don't need READ_ONCE here.

READ_ONCE is not about barriers or ordering; you still need the
acquire inside the atomic to make the algorithm work.

READ_ONCE primarily is a marker that the data being read is unstable
and that the compiler must avoid all instability when reading it. eg
in this case the compiler could insanely double read the value, even
though the 'if' requires only a single read. This would result in
corrupt calculation.
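
A hypothetical illustration of that point ('ptr' and 'snap' stand in for the
page table slot and its snapshot, back_off() for the bail-out path; all of
them made up):

	/* As written, one read of *ptr should decide the branch ... */
	if (*ptr != snap)
		back_off();

	/*
	 * ... but without READ_ONCE() the compiler may legally emit more
	 * than one load of *ptr for that single expression (for example by
	 * re-reading the value instead of spilling a register), so the
	 * comparison can effectively be computed from two different
	 * values of *ptr.  READ_ONCE() pins it to exactly one load:
	 */
	if (READ_ONCE(*ptr) != snap)
		back_off();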

> If we still intend to change that code, we should fixup all GUP-fast
> functions in a similar way.

This is correct; IMHO we have been slowly modernizing the mm approach
to the memory model to include things like this. While it would be
nice to do everything, I think small bits are welcome as they are
discovered.

Jason
David Hildenbrand Sept. 6, 2022, 1:57 p.m. UTC | #26
On 06.09.22 15:47, Jason Gunthorpe wrote:
> On Mon, Sep 05, 2022 at 09:59:47AM +0200, David Hildenbrand wrote:
> 
>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>>> whole lockless house of cards may fall apart if we try reading the
>>> page table values without READ_ONCE().
>>
>> I came to the conclusion that the implicit memory barrier when grabbing a
>> reference on the page is sufficient such that we don't need READ_ONCE here.
> 
> READ_ONCE is not about barriers or ordering, you still need the
> acquire inside the atomic to make the algorithm work.


While I don't disagree with what you say, I'll refer to 
Documentation/memory-barriers.txt "COMPILER BARRIER".

As discussed somewhere in this thread, if we already have an atomic RMW 
that implies a full ordering, it implies a compile barrier.


> 
> READ_ONCE primarily is a marker that the data being read is unstable
> and that the compiler must avoid all instability when reading it. eg
> in this case the compiler could insanely double read the value, even
> though the 'if' requires only a single read. This would result in
> corrupt calculation.

As we have a full memory barrier + compile barrier, the compiler might 
indeed do double reads and all that stuff. BUT, it has to re-read after 
we incremented the refcount, and IMHO that's the important part to 
detect the change.
Jason Gunthorpe Sept. 6, 2022, 2:30 p.m. UTC | #27
On Tue, Sep 06, 2022 at 03:57:30PM +0200, David Hildenbrand wrote:

> > READ_ONCE primarily is a marker that the data being read is unstable
> > and that the compiler must avoid all instability when reading it. eg
> > in this case the compiler could insanely double read the value, even
> > though the 'if' requires only a single read. This would result in
> > corrupt calculation.
> 
> As we have a full memory barrier + compile barrier, the compiler might
> indeed do double reads and all that stuff. BUT, it has to re-read after we
> incremented the refcount, and IMHO that's the important part to detect the
> change.

Yes, it is important, but it is not the only important part.

The compiler still has to execute "if (*a != b)" *correctly*.

This is what READ_ONCE is for. It doesn't set order, it doesn't
implement a barrier, it tells the compiler that '*a' is unstable data
and the compiler cannot make assumptions based on the idea that
reading '*a' multiple times will always return the same value.

If the compiler makes those assumptions then maybe even though 'if (*a
!= b)' is the reality, it could mis-compute '*a == b'. You enter into
undefined behavior here.

Though it is all very unlikely, the general memory model standard is
to annotate with READ_ONCE.

Jason
David Hildenbrand Sept. 6, 2022, 2:44 p.m. UTC | #28
On 06.09.22 16:30, Jason Gunthorpe wrote:
> On Tue, Sep 06, 2022 at 03:57:30PM +0200, David Hildenbrand wrote:
> 
>>> READ_ONCE primarily is a marker that the data being read is unstable
>>> and that the compiler must avoid all instability when reading it. eg
>>> in this case the compiler could insanely double read the value, even
>>> though the 'if' requires only a single read. This would result in
>>> corrupt calculation.
>>
>> As we have a full memory barrier + compile barrier, the compiler might
>> indeed do double reads and all that stuff. BUT, it has to re-read after we
>> incremented the refcount, and IMHO that's the important part to detect the
>> change.
> 
> Yes, it is important, but it is not the only important part.
> 
> The compiler still has to exectute "if (*a != b)" *correctly*.
> 
> This is what READ_ONCE is for. It doesn't set order, it doesn't
> implement a barrier, it tells the compiler that '*a' is unstable data
> and the compiler cannot make assumptions based on the idea that
> reading '*a' multiple times will always return the same value.
> 
> If the compiler makes those assumptions then maybe even though 'if (*a
> != b)' is the reality, it could mis-compute '*a == b'. You enter into
> undefined behavior here.
> 
> Though it is all very unlikely, the general memory model standard is
> to annotate with READ_ONCE.

The only thing I could see going wrong in the comparison once the stars 
align would be something like the following:

if (*a != b)

implemented as

if ((*a).lower != b.lower || (*a).higher != b.higher)


This could only go wrong if we have more than one change such that:

Original:

*a = 0x00000000ffffffffull;


First modification:
*a = 0xffffffffffffffffull;

Second modification:
*a = 0x00000000eeeeeeeeull;


If we race with both modifications, we could see that ffffffff matches, 
and could see that 00000000 matches as well.


So I agree that we should change it, but not necessarily as an urgent 
fix and not necessarily in this patch. It's best to adjust all gup_* 
functions in one patch.

... I do wonder if we want to reuse ptep_get_lockless() instead of the 
READ_ONCE(). CONFIG_GUP_GET_PTE_LOW_HIGH is confusing.
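
Spelling that out as a sketch (the .lower/.higher split and the values are
the ones from the example above, assuming little endian; b is the stale
snapshot 0x00000000ffffffffull):

	lower  = (*a).lower;	/* *a is 0xffffffffffffffff (1st modification):
				 * reads 0xffffffff, which matches b.lower */
	/* *a changes to 0x00000000eeeeeeee (2nd modification) */
	higher = (*a).higher;	/* reads 0x00000000, which matches b.higher */

	/*
	 * Both halves match the stale snapshot, so a split "*a != b" check
	 * sees no change even though *a never held b's value at any single
	 * point in time.  This is exactly the case where the access cannot
	 * be a single load, and why reusing ptep_get_lockless() for those
	 * configs is being considered here.
	 */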
Jason Gunthorpe Sept. 6, 2022, 3:33 p.m. UTC | #29
On Tue, Sep 06, 2022 at 04:44:20PM +0200, David Hildenbrand wrote:

> ... I do wonder if we want to reuse ptep_get_lockless() instead of the
> READ_ONCE(). CONFIG_GUP_GET_PTE_LOW_HIGH is confusing.

Yes, that is a whole other topic, and you may be quite right on that.

Jason
Yang Shi Sept. 6, 2022, 6:50 p.m. UTC | #30
On Sun, Sep 4, 2022 at 3:29 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 9/1/22 15:27, Yang Shi wrote:
> > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> > sufficient to handle concurrent GUP-fast in all cases, it only handles
> > traditional IPI-based GUP-fast correctly.  On architectures that send
> > an IPI broadcast on TLB flush, it works as expected.  But on the
> > architectures that do not use IPI to broadcast TLB flush, it may have
> > the below race:
> >
> >    CPU A                                          CPU B
> > THP collapse                                     fast GUP
> >                                               gup_pmd_range() <-- see valid pmd
> >                                                   gup_pte_range() <-- work on pte
> > pmdp_collapse_flush() <-- clear pmd and flush
> > __collapse_huge_page_isolate()
> >     check page pinned <-- before GUP bump refcount
> >                                                       pin the page
> >                                                       check PTE <-- no change
> > __collapse_huge_page_copy()
> >     copy data to huge page
> >     ptep_clear()
> > install huge pmd for the huge page
> >                                                       return the stale page
> > discard the stale page
>
> Hi Yang,
>
> Thanks for taking the trouble to write down these notes. I always
> forget which race we are dealing with, and this is a great help. :)

My pleasure, I'm glad it is helpful.

>
> More...
>
> >
> > The race could be fixed by checking whether PMD is changed or not after
> > taking the page pin in fast GUP, just like what it does for PTE.  If the
> > PMD is changed it means there may be parallel THP collapse, so GUP
> > should back off.
> >
> > Also update the stale comment about serializing against fast GUP in
> > khugepaged.
> >
> > Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/gup.c        | 30 ++++++++++++++++++++++++------
> >  mm/khugepaged.c | 10 ++++++----
> >  2 files changed, 30 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index f3fc1f08d90c..4365b2811269 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
> >  }
> >
> >  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > -                      unsigned int flags, struct page **pages, int *nr)
> > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> > +                      unsigned long end, unsigned int flags,
> > +                      struct page **pages, int *nr)
> >  {
> >       struct dev_pagemap *pgmap = NULL;
> >       int nr_start = *nr, ret = 0;
> > @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >                       goto pte_unmap;
> >               }
> >
> > -             if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > +             /*
> > +              * THP collapse conceptually does:
> > +              *   1. Clear and flush PMD
> > +              *   2. Check the base page refcount
> > +              *   3. Copy data to huge page
> > +              *   4. Clear PTE
> > +              *   5. Discard the base page
> > +              *
> > +              * So fast GUP may race with THP collapse then pin and
> > +              * return an old page since TLB flush is no longer sufficient
> > +              * to serialize against fast GUP.
> > +              *
> > +              * Check PMD, if it is changed just back off since it
> > +              * means there may be parallel THP collapse.
> > +              */
>
> As I mentioned in the other thread, it would be a nice touch to move
> such discussion into the comment header.

Sure, you mean the comment before gup_pte_range() so that the real
code stays succinct, right?

>
> > +             if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> > +                 unlikely(pte_val(pte) != pte_val(*ptep))) {
>
>
> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
> whole lockless house of cards may fall apart if we try reading the
> page table values without READ_ONCE().
>
> That's a rather vague statement, and in fact, the READ_ONCE() should
> be paired with a page table write somewhere else, to make that claim
> more precise.

Thanks for the suggestion. Per the discussion later (mainly from David
and Jason), I think we are going to have a separate patch to clean up
all the page table access for GUP.

>
>
> >                       gup_put_folio(folio, 1, flags);
> >                       goto pte_unmap;
> >               }
> > @@ -2470,8 +2487,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> >   * useful to have gup_huge_pmd even if we can't operate on ptes.
> >   */
> > -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > -                      unsigned int flags, struct page **pages, int *nr)
> > +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> > +                      unsigned long end, unsigned int flags,
> > +                      struct page **pages, int *nr)
> >  {
> >       return 0;
> >  }
> > @@ -2791,7 +2809,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
> >                       if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
> >                                        PMD_SHIFT, next, flags, pages, nr))
> >                               return 0;
> > -             } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> > +             } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
> >                       return 0;
> >       } while (pmdp++, addr = next, addr != end);
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2d74cf01f694..518b49095db3 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1049,10 +1049,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> >       /*
> > -      * After this gup_fast can't run anymore. This also removes
> > -      * any huge TLB entry from the CPU so we won't allow
> > -      * huge and small TLB entries for the same virtual address
> > -      * to avoid the risk of CPU bugs in that area.
> > +      * This removes any huge TLB entry from the CPU so we won't allow
> > +      * huge and small TLB entries for the same virtual address to
> > +      * avoid the risk of CPU bugs in that area.
> > +      *
> > +      * Parallel fast GUP is fine since fast GUP will back off when
> > +      * it detects PMD is changed.
> >        */
> >       _pmd = pmdp_collapse_flush(vma, address, pmd);
>
> To follow up on David Hildenbrand's note about this in the nearby thread...
> I'm also not sure if pmdp_collapse_flush() implies a memory barrier on
> all arches. It definitely does do an atomic op with a return value on x86,
> but that's just one arch.

Will reply in detail to David's thread.

>
>
> thanks,
>
> --
> John Hubbard
> NVIDIA
>
> >       spin_unlock(pmd_ptl);
>
Yang Shi Sept. 6, 2022, 7:01 p.m. UTC | #31
On Mon, Sep 5, 2022 at 12:59 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.09.22 00:29, John Hubbard wrote:
> > On 9/1/22 15:27, Yang Shi wrote:
> >> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> >> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> >> sufficient to handle concurrent GUP-fast in all cases, it only handles
> >> traditional IPI-based GUP-fast correctly.  On architectures that send
> >> an IPI broadcast on TLB flush, it works as expected.  But on the
> >> architectures that do not use IPI to broadcast TLB flush, it may have
> >> the below race:
> >>
> >>     CPU A                                          CPU B
> >> THP collapse                                     fast GUP
> >>                                                gup_pmd_range() <-- see valid pmd
> >>                                                    gup_pte_range() <-- work on pte
> >> pmdp_collapse_flush() <-- clear pmd and flush
> >> __collapse_huge_page_isolate()
> >>      check page pinned <-- before GUP bump refcount
> >>                                                        pin the page
> >>                                                        check PTE <-- no change
> >> __collapse_huge_page_copy()
> >>      copy data to huge page
> >>      ptep_clear()
> >> install huge pmd for the huge page
> >>                                                        return the stale page
> >> discard the stale page
> >
> > Hi Yang,
> >
> > Thanks for taking the trouble to write down these notes. I always
> > forget which race we are dealing with, and this is a great help. :)
> >
> > More...
> >
> >>
> >> The race could be fixed by checking whether PMD is changed or not after
> >> taking the page pin in fast GUP, just like what it does for PTE.  If the
> >> PMD is changed it means there may be parallel THP collapse, so GUP
> >> should back off.
> >>
> >> Also update the stale comment about serializing against fast GUP in
> >> khugepaged.
> >>
> >> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> >> Signed-off-by: Yang Shi <shy828301@gmail.com>
> >> ---
> >>   mm/gup.c        | 30 ++++++++++++++++++++++++------
> >>   mm/khugepaged.c | 10 ++++++----
> >>   2 files changed, 30 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index f3fc1f08d90c..4365b2811269 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
> >>   }
> >>
> >>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >> -                     unsigned int flags, struct page **pages, int *nr)
> >> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> >> +                     unsigned long end, unsigned int flags,
> >> +                     struct page **pages, int *nr)
> >>   {
> >>      struct dev_pagemap *pgmap = NULL;
> >>      int nr_start = *nr, ret = 0;
> >> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >>                      goto pte_unmap;
> >>              }
> >>
> >> -            if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> >> +            /*
> >> +             * THP collapse conceptually does:
> >> +             *   1. Clear and flush PMD
> >> +             *   2. Check the base page refcount
> >> +             *   3. Copy data to huge page
> >> +             *   4. Clear PTE
> >> +             *   5. Discard the base page
> >> +             *
> >> +             * So fast GUP may race with THP collapse then pin and
> >> +             * return an old page since TLB flush is no longer sufficient
> >> +             * to serialize against fast GUP.
> >> +             *
> >> +             * Check PMD, if it is changed just back off since it
> >> +             * means there may be parallel THP collapse.
> >> +             */
> >
> > As I mentioned in the other thread, it would be a nice touch to move
> > such discussion into the comment header.
> >
> >> +            if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> >> +                unlikely(pte_val(pte) != pte_val(*ptep))) {
> >
> >
> > That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
> > whole lockless house of cards may fall apart if we try reading the
> > page table values without READ_ONCE().
>
> I came to the conclusion that the implicit memory barrier when grabbing
> a reference on the page is sufficient such that we don't need READ_ONCE
> here.
>
> If we still intend to change that code, we should fixup all GUP-fast
> functions in a similar way. But again, I don't think we need a change here.
>
>
> >> -     * After this gup_fast can't run anymore. This also removes
> >> -     * any huge TLB entry from the CPU so we won't allow
> >> -     * huge and small TLB entries for the same virtual address
> >> -     * to avoid the risk of CPU bugs in that area.
> >> +     * This removes any huge TLB entry from the CPU so we won't allow
> >> +     * huge and small TLB entries for the same virtual address to
> >> +     * avoid the risk of CPU bugs in that area.
> >> +     *
> >> +     * Parallel fast GUP is fine since fast GUP will back off when
> >> +     * it detects PMD is changed.
> >>       */
> >>      _pmd = pmdp_collapse_flush(vma, address, pmd);
> >
> > To follow up on David Hildenbrand's note about this in the nearby thread...
> > I'm also not sure if pmdp_collapse_flush() implies a memory barrier on
> > all arches. It definitely does do an atomic op with a return value on x86,
> > but that's just one arch.
> >
>
> I think a ptep/pmdp clear + TLB flush really has to imply a memory
> barrier, otherwise TLB flushing code might easily mess up with
> surrounding code. But we should better double-check.
>
> s390x executes an IDTE instruction, which performs serialization (->
> memory barrier). arm64 seems to use DSB instructions to enforce memory
> ordering.

IIUC we just need to care about the architectures which support both
THP and fast-GUP, right? If so, I got the below list:

Loongarch - I didn't see explicit memory barrier, but it does IPI
MIPS - as same as Loongarch
PowerPC - has memory barrier
x86 - atomic operations imply memory barrier
Arm - I didn't see explicit memory barrier, not sure if ARM's tlb
flush instructions imply memory barrier or not, but it does IPI
Arm64 - uses DSB as David mentioned
s390 - executes IDTE as David mentioned

So I think the questionable architectures are Loongarch, MIPS and ARM.
And IPI is not called if the PMD entry is not present on a remote CPU.
So they may have a race against fast GUP IIUC. But we'd better double
check with the maintainers.

>
> --
> Thanks,
>
> David / dhildenb
>
Yang Shi Sept. 6, 2022, 7:07 p.m. UTC | #32
On Mon, Sep 5, 2022 at 1:56 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> >
> > On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote:
> >>
> >> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote:
> >> > On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
> >> > >
> >> > > Hi, Yang,
> >> > >
> >> > > On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
> >> > > > Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> >> > > > introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> >> > > > sufficient to handle concurrent GUP-fast in all cases, it only handles
> >> > > > traditional IPI-based GUP-fast correctly.
> >> > >
> >> > > If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
> >> > > I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
> >> > > that'll keep working as long as interrupt disabled (which current fast-gup
> >> > > will still do)?
> >> >
> >> > Actually the wording was copied from David's commit log for his
> >> > PageAnonExclusive fix. My understanding is the IPI broadcast still
> >> > works, but it may not be supported by all architectures and not
> >> > preferred anymore. So we should avoid depending on IPI broadcast IIUC.
> >> >
> >> > >
> >> > > IIUC the issue is you suspect not all archs correctly implemented
> >> > > pmdp_collapse_flush(), or am I wrong?
> >> >
> >> > This is a possible fix, please see below for details.
> >> >
> >> > >
> >> > > > On architectures that send
> >> > > > an IPI broadcast on TLB flush, it works as expected.  But on the
> >> > > > architectures that do not use IPI to broadcast TLB flush, it may have
> >> > > > the below race:
> >> > > >
> >> > > >    CPU A                                          CPU B
> >> > > > THP collapse                                     fast GUP
> >> > > >                                               gup_pmd_range() <-- see valid pmd
> >> > > >                                                   gup_pte_range() <-- work on pte
> >> > > > pmdp_collapse_flush() <-- clear pmd and flush
> >> > > > __collapse_huge_page_isolate()
> >> > > >     check page pinned <-- before GUP bump refcount
> >> > > >                                                       pin the page
> >> > > >                                                       check PTE <-- no change
> >> > > > __collapse_huge_page_copy()
> >> > > >     copy data to huge page
> >> > > >     ptep_clear()
> >> > > > install huge pmd for the huge page
> >> > > >                                                       return the stale page
> >> > > > discard the stale page
> >> > > >
> >> > > > The race could be fixed by checking whether PMD is changed or not after
> >> > > > taking the page pin in fast GUP, just like what it does for PTE.  If the
> >> > > > PMD is changed it means there may be parallel THP collapse, so GUP
> >> > > > should back off.
> >> > >
> >> > > Could the race also be fixed by impl pmdp_collapse_flush() correctly for
> >> > > the archs that are missing? Do you know which arch(s) is broken with it?
> >> >
> >> > Yes, and this was suggested by me in the first place, but per the
> >> > suggestion from John and David, this is not the preferred way. I think
> >> > it is because:
> >> >
> >> > Firstly, using IPI to serialize against fast GUP is not recommended
> >> > anymore since fast GUP does check PTE then back off so we should avoid
> >> > it.
> >> > Secondly, if checking PMD then backing off could solve the problem,
> >> > why do we still need broadcast IPI? It doesn't sound performant.
> >> >
> >> > >
> >> > > It's just not clear to me whether this patch is an optimization or a fix,
> >> > > if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
> >> > > still be needed.
> >> >
> >> > It is a fix and the fix will make IPI broadcast not useful anymore.
> >>
> >> How about another patch to remove the ppc impl too?  Then it can be a two
> >> patches series.
> >
> > BTW, I don't think we could remove the ppc implementation since it is
> > different from the generic pmdp_collapse_flush(), particularly for the
> > hash part IIUC.
> >
> > The generic version calls flush_tlb_range() -> hash__flush_tlb_range()
> > for hash, but the hash call is actually no-op. The ppc version calls
> > hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does
> > something useful.
> >
>
> We should actually rename flush_tlb_pmd_range(). It actually flush the
> hash page table entries.
>
> I will do the below patch for ppc64 to clarify this better

Thanks, Aneesh. It looks more readable. A follow-up question: I think
we could remove serialize_against_pte_lookup(), which just issues an IPI
broadcast to run a dummy function. This IPI should not be needed
anymore with my patch. Of course, we need to keep the memory barrier.
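
For reference, the general shape of that kind of IPI serialization is
something like the following; this is only an illustrative sketch with
made-up names, not the actual powerpc implementation:

	#include <linux/smp.h>
	#include <linux/mm.h>

	static void do_nothing(void *unused)
	{
	}

	/*
	 * Wait until every CPU that might be inside an interrupt-disabled
	 * lockless walk (such as fast GUP) has left it: the IPI cannot be
	 * taken while the remote CPU has interrupts disabled, so once the
	 * synchronous call returns, no walker can still be looking at the
	 * old page table.
	 */
	static void serialize_against_lockless_walkers(struct mm_struct *mm)
	{
		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
	}

With the PMD recheck in place in fast GUP, that dummy-function IPI is what
should no longer be required, while the barrier semantics of the flush
itself still are.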

>
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> index 8b762f282190..fd30fa20c392 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> @@ -112,13 +112,12 @@ static inline void hash__flush_tlb_kernel_range(unsigned long start,
>
>  struct mmu_gather;
>  extern void hash__tlb_flush(struct mmu_gather *tlb);
> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
>
>  #ifdef CONFIG_PPC_64S_HASH_MMU
>  /* Private function for use by PCI IO mapping code */
>  extern void __flush_hash_table_range(unsigned long start, unsigned long end);
> -extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd,
> -                               unsigned long addr);
> +extern void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd,
> +                                      unsigned long addr);
>  #else
>  static inline void __flush_hash_table_range(unsigned long start, unsigned long end) { }
>  #endif
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index ae008b9df0e6..f30131933a01 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -256,7 +256,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres
>          * the __collapse_huge_page_copy can result in copying
>          * the old content.
>          */
> -       flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
> +       flush_hash_table_pmd_range(vma->vm_mm, &pmd, address);
>         return pmd;
>  }
>
> diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
> index eb0bccaf221e..a64ea0a7ef96 100644
> --- a/arch/powerpc/mm/book3s64/hash_tlb.c
> +++ b/arch/powerpc/mm/book3s64/hash_tlb.c
> @@ -221,7 +221,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
>         local_irq_restore(flags);
>  }
>
> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
> +void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
>  {
>         pte_t *pte;
>         pte_t *start_pte;
>
Yang Shi Sept. 6, 2022, 7:11 p.m. UTC | #33
On Tue, Sep 6, 2022 at 7:44 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.09.22 16:30, Jason Gunthorpe wrote:
> > On Tue, Sep 06, 2022 at 03:57:30PM +0200, David Hildenbrand wrote:
> >
> >>> READ_ONCE primarily is a marker that the data being read is unstable
> >>> and that the compiler must avoid all instability when reading it. eg
> >>> in this case the compiler could insanely double read the value, even
> >>> though the 'if' requires only a single read. This would result in
> >>> corrupt calculation.
> >>
> >> As we have a full memory barrier + compile barrier, the compiler might
> >> indeed do double reads and all that stuff. BUT, it has to re-read after we
> >> incremented the refcount, and IMHO that's the important part to detect the
> >> change.
> >
> > Yes, it is important, but it is not the only important part.
> >
> > The compiler still has to exectute "if (*a != b)" *correctly*.
> >
> > This is what READ_ONCE is for. It doesn't set order, it doesn't
> > implement a barrier, it tells the compiler that '*a' is unstable data
> > and the compiler cannot make assumptions based on the idea that
> > reading '*a' multiple times will always return the same value.
> >
> > If the compiler makes those assumptions then maybe even though 'if (*a
> > != b)' is the reality, it could mis-compute '*a == b'. You enter into
> > undefined behavior here.
> >
> > Though it is all very unlikely, the general memory model standard is
> > to annotate with READ_ONCE.
>
> The only thing I could see going wrong in the comparison once the stars
> alingn would be something like the following:
>
> if (*a != b)
>
> implemented as
>
> if ((*a).lower != b.lower && (*a).higher != b.higher)
>
>
> This could only go wrong if we have more than one change such that:
>
> Original:
>
> *a = 0x00000000ffffffffull;
>
>
> First modification:
> *a = 0xffffffffffffffffull;
>
> Second modification:
> *a = 0x00000000eeeeeeeeull;

IIUC this is typically a 32-bit thing.

>
>
> If we race with both modifications, we could see that ffffffff matches,
> and could see that 00000000 matches as well.
>
>
> So I agree that we should change it, but not necessarily as an urgent
> fix and not necessarily in this patch. It's best to adjust all gup_*
> functions in one patch.
>
> ... I do wonder if we want to reuse ptep_get_lockless() instead of the
> READ_ONCE(). CONFIG_GUP_GET_PTE_LOW_HIGH is confusing.
>
> --
> Thanks,
>
> David / dhildenb
>
John Hubbard Sept. 6, 2022, 9:27 p.m. UTC | #34
On 9/6/22 11:50, Yang Shi wrote:
>>> -             if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>> +             /*
>>> +              * THP collapse conceptually does:
>>> +              *   1. Clear and flush PMD
>>> +              *   2. Check the base page refcount
>>> +              *   3. Copy data to huge page
>>> +              *   4. Clear PTE
>>> +              *   5. Discard the base page
>>> +              *
>>> +              * So fast GUP may race with THP collapse then pin and
>>> +              * return an old page since TLB flush is no longer sufficient
>>> +              * to serialize against fast GUP.
>>> +              *
>>> +              * Check PMD, if it is changed just back off since it
>>> +              * means there may be parallel THP collapse.
>>> +              */
>>
>> As I mentioned in the other thread, it would be a nice touch to move
>> such discussion into the comment header.
> 
> Sure, you mean the comment before gup_pte_range() so that the real
> code stays succinct, right?
Yes.

thanks,
John Hubbard Sept. 6, 2022, 11:16 p.m. UTC | #35
On 9/6/22 07:44, David Hildenbrand wrote:
>> Though it is all very unlikely, the general memory model standard is
>> to annotate with READ_ONCE.
> 
> The only thing I could see going wrong in the comparison once the stars 
> alingn would be something like the following:
> 
> if (*a != b)
> 
> implemented as
> 
> if ((*a).lower != b.lower && (*a).higher != b.higher)
> 
> 
> This could only go wrong if we have more than one change such that:
> 
> Original:
> 
> *a = 0x00000000ffffffffull;
> 
> 
> First modification:
> *a = 0xffffffffffffffffull;
> 
> Second modification:
> *a = 0x00000000eeeeeeeeull;
> 
> 
> If we race with both modifications, we could see that ffffffff matches, 
> and could see that 00000000 matches as well.
> 
> 
> So I agree that we should change it, but not necessarily as an urgent 
> fix and not necessarily in this patch. It's best to adjust all gup_* 
> functions in one patch.
> 

We had a long thread with Paul McKenney back in May [1] about this exact
sort of problem.

In that thread, I recall that "someone" tried to claim that a bare
one-byte read was safe, and even that innocent-sounding claim got
basically torn apart! :)  Because the kernel memory model simply does
not cover you for bare reads and writes to shared mutable memory.

Unfortunately, until now, I'd only really remembered the conclusion:
"use READ_ONCE() and WRITE_ONCE() for any touching of shared mutable
memory", and not the point about other memory barriers not covering this
aspect. Thanks to Jason for reminding us of this.  This time I think I
will remember it well enough to avoid another long thread. Maybe.


[1] https://lore.kernel.org/all/20220524163728.GO1790663@paulmck-ThinkPad-P17-Gen-1/

thanks,
Aneesh Kumar K.V Sept. 7, 2022, 4:50 a.m. UTC | #36
On 9/7/22 12:37 AM, Yang Shi wrote:
> On Mon, Sep 5, 2022 at 1:56 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> Yang Shi <shy828301@gmail.com> writes:
>>
>>>
>>> On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote:
>>>>> On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>
>>>>>> Hi, Yang,
>>>>>>
>>>>>> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
>>>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>>>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
>>>>>>> sufficient to handle concurrent GUP-fast in all cases, it only handles
>>>>>>> traditional IPI-based GUP-fast correctly.
>>>>>>
>>>>>> If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
>>>>>> I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
>>>>>> that'll keep working as long as interrupt disabled (which current fast-gup
>>>>>> will still do)?
>>>>>
>>>>> Actually the wording was copied from David's commit log for his
>>>>> PageAnonExclusive fix. My understanding is the IPI broadcast still
>>>>> works, but it may not be supported by all architectures and not
>>>>> preferred anymore. So we should avoid depending on IPI broadcast IIUC.
>>>>>
>>>>>>
>>>>>> IIUC the issue is you suspect not all archs correctly implemented
>>>>>> pmdp_collapse_flush(), or am I wrong?
>>>>>
>>>>> This is a possible fix, please see below for details.
>>>>>
>>>>>>
>>>>>>> On architectures that send
>>>>>>> an IPI broadcast on TLB flush, it works as expected.  But on the
>>>>>>> architectures that do not use IPI to broadcast TLB flush, it may have
>>>>>>> the below race:
>>>>>>>
>>>>>>>    CPU A                                          CPU B
>>>>>>> THP collapse                                     fast GUP
>>>>>>>                                               gup_pmd_range() <-- see valid pmd
>>>>>>>                                                   gup_pte_range() <-- work on pte
>>>>>>> pmdp_collapse_flush() <-- clear pmd and flush
>>>>>>> __collapse_huge_page_isolate()
>>>>>>>     check page pinned <-- before GUP bump refcount
>>>>>>>                                                       pin the page
>>>>>>>                                                       check PTE <-- no change
>>>>>>> __collapse_huge_page_copy()
>>>>>>>     copy data to huge page
>>>>>>>     ptep_clear()
>>>>>>> install huge pmd for the huge page
>>>>>>>                                                       return the stale page
>>>>>>> discard the stale page
>>>>>>>
>>>>>>> The race could be fixed by checking whether PMD is changed or not after
>>>>>>> taking the page pin in fast GUP, just like what it does for PTE.  If the
>>>>>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>>>>>> should back off.
>>>>>>
>>>>>> Could the race also be fixed by impl pmdp_collapse_flush() correctly for
>>>>>> the archs that are missing? Do you know which arch(s) is broken with it?
>>>>>
>>>>> Yes, and this was suggested by me in the first place, but per the
>>>>> suggestion from John and David, this is not the preferred way. I think
>>>>> it is because:
>>>>>
>>>>> Firstly, using IPI to serialize against fast GUP is not recommended
>>>>> anymore since fast GUP does check PTE then back off so we should avoid
>>>>> it.
>>>>> Secondly, if checking PMD then backing off could solve the problem,
>>>>> why do we still need broadcast IPI? It doesn't sound performant.
>>>>>
>>>>>>
>>>>>> It's just not clear to me whether this patch is an optimization or a fix,
>>>>>> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
>>>>>> still be needed.
>>>>>
>>>>> It is a fix and the fix will make IPI broadcast not useful anymore.
>>>>
>>>> How about another patch to remove the ppc impl too?  Then it can be a two
>>>> patches series.
>>>
>>> BTW, I don't think we could remove the ppc implementation since it is
>>> different from the generic pmdp_collapse_flush(), particularly for the
>>> hash part IIUC.
>>>
>>> The generic version calls flush_tlb_range() -> hash__flush_tlb_range()
>>> for hash, but the hash call is actually a no-op. The ppc version calls
>>> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does
>>> something useful.
>>>
>>
>> We should actually rename flush_tlb_pmd_range(). It actually flushes the
>> hash page table entries.
>>
>> I will do the below patch for ppc64 to clarify this better
> 
> Thanks, Aneesh. It looks more readable. A follow-up question: I think
> we could remove serialize_against_pte_lookup(), which just issues an IPI
> broadcast to run a dummy function. This IPI should not be needed
> anymore with my patch. Of course, we need to keep the memory barrier.
> 


For radix translation yes. For hash we still need that. W.r.t. the memory barrier,
radix does use radix__flush_tlb_collapsed_pmd(), which does a TLB invalidate.
IIUC that will enforce the required memory barrier there. So you should be able
to just remove

modified   arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -937,15 +937,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre
 	pmd = *pmdp;
 	pmd_clear(pmdp);
 
-	/*
-	 * pmdp collapse_flush need to ensure that there are no parallel gup
-	 * walk after this call. This is needed so that we can have stable
-	 * page ref count when collapsing a page. We don't allow a collapse page
-	 * if we have gup taken on the page. We can ensure that by sending IPI
-	 * because gup walk happens with IRQ disabled.
-	 */
-	serialize_against_pte_lookup(vma->vm_mm);
-
 	radix__flush_tlb_collapsed_pmd(vma->vm_mm, address);
 
 	return pmd;

in your patch. This will also consolidate all the related changes together.

>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> index 8b762f282190..fd30fa20c392 100644
>> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> @@ -112,13 +112,12 @@ static inline void hash__flush_tlb_kernel_range(unsigned long start,
>>
>>  struct mmu_gather;
>>  extern void hash__tlb_flush(struct mmu_gather *tlb);
>> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
>>
>>  #ifdef CONFIG_PPC_64S_HASH_MMU
>>  /* Private function for use by PCI IO mapping code */
>>  extern void __flush_hash_table_range(unsigned long start, unsigned long end);
>> -extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd,
>> -                               unsigned long addr);
>> +extern void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd,
>> +                                      unsigned long addr);
>>  #else
>>  static inline void __flush_hash_table_range(unsigned long start, unsigned long end) { }
>>  #endif
>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> index ae008b9df0e6..f30131933a01 100644
>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> @@ -256,7 +256,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres
>>          * the __collapse_huge_page_copy can result in copying
>>          * the old content.
>>          */
>> -       flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
>> +       flush_hash_table_pmd_range(vma->vm_mm, &pmd, address);
>>         return pmd;
>>  }
>>
>> diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
>> index eb0bccaf221e..a64ea0a7ef96 100644
>> --- a/arch/powerpc/mm/book3s64/hash_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/hash_tlb.c
>> @@ -221,7 +221,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
>>         local_irq_restore(flags);
>>  }
>>
>> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
>> +void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
>>  {
>>         pte_t *pte;
>>         pte_t *start_pte;
>>
Yang Shi Sept. 7, 2022, 5:08 p.m. UTC | #37
On Tue, Sep 6, 2022 at 9:51 PM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 9/7/22 12:37 AM, Yang Shi wrote:
> > On Mon, Sep 5, 2022 at 1:56 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> Yang Shi <shy828301@gmail.com> writes:
> >>
> >>>
> >>> On Fri, Sep 2, 2022 at 9:00 AM Peter Xu <peterx@redhat.com> wrote:
> >>>>
> >>>> On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote:
> >>>>> On Thu, Sep 1, 2022 at 4:26 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>
> >>>>>> Hi, Yang,
> >>>>>>
> >>>>>> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
> >>>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
> >>>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
> >>>>>>> sufficient to handle concurrent GUP-fast in all cases, it only handles
> >>>>>>> traditional IPI-based GUP-fast correctly.
> >>>>>>
> >>>>>> If TLB flush (or, IPI broadcasts) used to work to protect against gup-fast,
> >>>>>> I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
> >>>>>> that'll keep working as long as interrupt disabled (which current fast-gup
> >>>>>> will still do)?
> >>>>>
> >>>>> Actually the wording was copied from David's commit log for his
> >>>>> PageAnonExclusive fix. My understanding is the IPI broadcast still
> >>>>> works, but it may not be supported by all architectures and not
> >>>>> preferred anymore. So we should avoid depending on IPI broadcast IIUC.
> >>>>>
> >>>>>>
> >>>>>> IIUC the issue is you suspect not all archs correctly implemented
> >>>>>> pmdp_collapse_flush(), or am I wrong?
> >>>>>
> >>>>> This is a possible fix, please see below for details.
> >>>>>
> >>>>>>
> >>>>>>> On architectures that send
> >>>>>>> an IPI broadcast on TLB flush, it works as expected.  But on the
> >>>>>>> architectures that do not use IPI to broadcast TLB flush, it may have
> >>>>>>> the below race:
> >>>>>>>
> >>>>>>>    CPU A                                          CPU B
> >>>>>>> THP collapse                                     fast GUP
> >>>>>>>                                               gup_pmd_range() <-- see valid pmd
> >>>>>>>                                                   gup_pte_range() <-- work on pte
> >>>>>>> pmdp_collapse_flush() <-- clear pmd and flush
> >>>>>>> __collapse_huge_page_isolate()
> >>>>>>>     check page pinned <-- before GUP bump refcount
> >>>>>>>                                                       pin the page
> >>>>>>>                                                       check PTE <-- no change
> >>>>>>> __collapse_huge_page_copy()
> >>>>>>>     copy data to huge page
> >>>>>>>     ptep_clear()
> >>>>>>> install huge pmd for the huge page
> >>>>>>>                                                       return the stale page
> >>>>>>> discard the stale page
> >>>>>>>
> >>>>>>> The race could be fixed by checking whether PMD is changed or not after
> >>>>>>> taking the page pin in fast GUP, just like what it does for PTE.  If the
> >>>>>>> PMD is changed it means there may be parallel THP collapse, so GUP
> >>>>>>> should back off.
> >>>>>>
> >>>>>> Could the race also be fixed by impl pmdp_collapse_flush() correctly for
> >>>>>> the archs that are missing? Do you know which arch(s) is broken with it?
> >>>>>
> >>>>> Yes, and this was suggested by me in the first place, but per the
> >>>>> suggestion from John and David, this is not the preferred way. I think
> >>>>> it is because:
> >>>>>
> >>>>> Firstly, using IPI to serialize against fast GUP is not recommended
> >>>>> anymore since fast GUP does check PTE then back off so we should avoid
> >>>>> it.
> >>>>> Secondly, if checking PMD then backing off could solve the problem,
> >>>>> why do we still need broadcast IPI? It doesn't sound performant.
> >>>>>
> >>>>>>
> >>>>>> It's just not clear to me whether this patch is an optimization or a fix,
> >>>>>> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() would
> >>>>>> still be needed.
> >>>>>
> >>>>> It is a fix and the fix will make IPI broadcast not useful anymore.
> >>>>
> >>>> How about another patch to remove the ppc impl too?  Then it can be a two
> >>>> patches series.
> >>>
> >>> BTW, I don't think we could remove the ppc implementation since it is
> >>> different from the generic pmdp_collapse_flush(), particularly for the
> >>> hash part IIUC.
> >>>
> >>> The generic version calls flush_tlb_range() -> hash__flush_tlb_range()
> >>> for hash, but the hash call is actually a no-op. The ppc version calls
> >>> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does
> >>> something useful.
> >>>
> >>
> >> We should actually rename flush_tlb_pmd_range(). It actually flushes the
> >> hash page table entries.
> >>
> >> I will do the below patch for ppc64 to clarify this better
> >
> > Thanks, Aneesh. It looks more readable. A follow-up question: I think
> > we could remove serialize_against_pte_lookup(), which just issues an IPI
> > broadcast to run a dummy function. This IPI should not be needed
> > anymore with my patch. Of course, we need to keep the memory barrier.
> >
>
>
> For radix translation yes. For hash we still need that. W.r.t. the memory barrier,
> radix does use radix__flush_tlb_collapsed_pmd(), which does a TLB invalidate.
> IIUC that will enforce the required memory barrier there. So you should be able
> to just remove
>
> modified   arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -937,15 +937,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre
>         pmd = *pmdp;
>         pmd_clear(pmdp);
>
> -       /*
> -        * pmdp collapse_flush need to ensure that there are no parallel gup
> -        * walk after this call. This is needed so that we can have stable
> -        * page ref count when collapsing a page. We don't allow a collapse page
> -        * if we have gup taken on the page. We can ensure that by sending IPI
> -        * because gup walk happens with IRQ disabled.
> -        */
> -       serialize_against_pte_lookup(vma->vm_mm);
> -
>         radix__flush_tlb_collapsed_pmd(vma->vm_mm, address);
>
>         return pmd;
>
> in your patch. This will also consolidate all the related changes together.

Thanks, Aneesh. It may be better to have the ppc cleanup in a separate patch.

>
> >>
> >> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> >> index 8b762f282190..fd30fa20c392 100644
> >> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> >> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> >> @@ -112,13 +112,12 @@ static inline void hash__flush_tlb_kernel_range(unsigned long start,
> >>
> >>  struct mmu_gather;
> >>  extern void hash__tlb_flush(struct mmu_gather *tlb);
> >> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
> >>
> >>  #ifdef CONFIG_PPC_64S_HASH_MMU
> >>  /* Private function for use by PCI IO mapping code */
> >>  extern void __flush_hash_table_range(unsigned long start, unsigned long end);
> >> -extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd,
> >> -                               unsigned long addr);
> >> +extern void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd,
> >> +                                      unsigned long addr);
> >>  #else
> >>  static inline void __flush_hash_table_range(unsigned long start, unsigned long end) { }
> >>  #endif
> >> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> >> index ae008b9df0e6..f30131933a01 100644
> >> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> >> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> >> @@ -256,7 +256,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres
> >>          * the __collapse_huge_page_copy can result in copying
> >>          * the old content.
> >>          */
> >> -       flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
> >> +       flush_hash_table_pmd_range(vma->vm_mm, &pmd, address);
> >>         return pmd;
> >>  }
> >>
> >> diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
> >> index eb0bccaf221e..a64ea0a7ef96 100644
> >> --- a/arch/powerpc/mm/book3s64/hash_tlb.c
> >> +++ b/arch/powerpc/mm/book3s64/hash_tlb.c
> >> @@ -221,7 +221,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
> >>         local_irq_restore(flags);
> >>  }
> >>
> >> -void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
> >> +void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
> >>  {
> >>         pte_t *pte;
> >>         pte_t *start_pte;
> >>
>
diff mbox series

Patch

diff --git a/mm/gup.c b/mm/gup.c
index f3fc1f08d90c..4365b2811269 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2380,8 +2380,9 @@  static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
 }
 
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
-static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
-			 unsigned int flags, struct page **pages, int *nr)
+static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+			 unsigned long end, unsigned int flags,
+			 struct page **pages, int *nr)
 {
 	struct dev_pagemap *pgmap = NULL;
 	int nr_start = *nr, ret = 0;
@@ -2423,7 +2424,23 @@  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			goto pte_unmap;
 		}
 
-		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+		/*
+		 * THP collapse conceptually does:
+		 *   1. Clear and flush PMD
+		 *   2. Check the base page refcount
+		 *   3. Copy data to huge page
+		 *   4. Clear PTE
+		 *   5. Discard the base page
+		 *
+		 * So fast GUP may race with THP collapse then pin and
+		 * return an old page since TLB flush is no longer sufficient
+		 * to serialize against fast GUP.
+		 *
+		 * Check PMD, if it is changed just back off since it
+		 * means there may be parallel THP collapse.
+		 */
+		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
+		    unlikely(pte_val(pte) != pte_val(*ptep))) {
 			gup_put_folio(folio, 1, flags);
 			goto pte_unmap;
 		}
@@ -2470,8 +2487,9 @@  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
  * useful to have gup_huge_pmd even if we can't operate on ptes.
  */
-static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
-			 unsigned int flags, struct page **pages, int *nr)
+static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+			 unsigned long end, unsigned int flags,
+			 struct page **pages, int *nr)
 {
 	return 0;
 }
@@ -2791,7 +2809,7 @@  static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
 			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
 					 PMD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
+		} else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
 			return 0;
 	} while (pmdp++, addr = next, addr != end);
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2d74cf01f694..518b49095db3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1049,10 +1049,12 @@  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
-	 * After this gup_fast can't run anymore. This also removes
-	 * any huge TLB entry from the CPU so we won't allow
-	 * huge and small TLB entries for the same virtual address
-	 * to avoid the risk of CPU bugs in that area.
+	 * This removes any huge TLB entry from the CPU so we won't allow
+	 * huge and small TLB entries for the same virtual address to
+	 * avoid the risk of CPU bugs in that area.
+	 *
+	 * Parallel fast GUP is fine since fast GUP will back off when
+	 * it detects PMD is changed.
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
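
For readers skimming the diff, here is a condensed, self-contained sketch (not
the kernel code; the types and helpers below are simulated stand-ins) of the
ordering the patched gup_pte_range() relies on: snapshot the PMD and PTE, take
the page reference, then re-check both values and back off if either changed,
since a change means a parallel THP collapse may already have sampled the
refcount:

/*
 * Simulated stand-ins: snapshot, pin, re-check, back off.  Not kernel
 * code; "volatile" merely stands in for READ_ONCE()-style re-reads.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct fake_page { int refcount; };

static bool try_pin(struct fake_page *page)
{
	page->refcount++;		/* stands in for try_grab_folio() */
	return true;
}

static void unpin(struct fake_page *page)
{
	page->refcount--;		/* stands in for gup_put_folio() */
}

static bool pin_page_fast(volatile uint64_t *pmdp, uint64_t pmd_snap,
			  volatile uint64_t *ptep, uint64_t pte_snap,
			  struct fake_page *page)
{
	if (!try_pin(page))
		return false;
	/*
	 * Re-check only after the reference is taken: if the PMD changed,
	 * a parallel collapse may have cleared it and checked the refcount
	 * before our pin landed; if the PTE changed, the page may have
	 * been unmapped.  Either way, back off.
	 */
	if (*pmdp != pmd_snap || *ptep != pte_snap) {
		unpin(page);
		return false;
	}
	return true;
}

int main(void)
{
	uint64_t pmd = 0x1000, pte = 0x2000;
	struct fake_page page = { .refcount = 1 };

	bool ok = pin_page_fast(&pmd, 0x1000, &pte, 0x2000, &page);
	printf("pinned=%d refcount=%d\n", ok, page.refcount);

	pmd = 0;	/* simulate pmdp_collapse_flush() clearing the PMD */
	ok = pin_page_fast(&pmd, 0x1000, &pte, 0x2000, &page);
	printf("pinned=%d refcount=%d (backed off)\n", ok, page.refcount);
	return 0;
}
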