
[v2] mm/gup: fix try_grab_compound_head() race with split_huge_page()

Message ID 20210615012014.1100672-1-jannh@google.com (mailing list archive)
State New, archived
Series [v2] mm/gup: fix try_grab_compound_head() race with split_huge_page()

Commit Message

Jann Horn June 15, 2021, 1:20 a.m. UTC
try_grab_compound_head() is used to grab a reference to a page from
get_user_pages_fast(), which is only protected against concurrent
freeing of page tables (via local_irq_save()), but not against
concurrent TLB flushes, freeing of data pages, or splitting of compound
pages.
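
For orientation, the lockless fast path has roughly this shape (paraphrased
from lockless_pages_from_mm(); names and details may differ slightly):

	/*
	 * Sketch of the GUP-fast walk: IRQs are disabled so that page tables
	 * cannot be freed under us (on configurations where TLB flushes are
	 * broadcast via IPI), but nothing here stops the data pages themselves
	 * from being freed, reallocated, or split concurrently.
	 */
	local_irq_save(flags);
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);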

Because no reference is held to the page when try_grab_compound_head()
is called, the page may have been freed and reallocated by the time its
refcount has been elevated; therefore, once we're holding a stable
reference to the page, the caller re-checks whether the PTE still points
to the same page (with the same access rights).

The problem is that try_grab_compound_head() has to grab a reference on
the head page; but between the time we look up what the head page is and
the time we actually grab a reference on the head page, the compound
page may have been split up (either explicitly through split_huge_page()
or by freeing the compound page to the buddy allocator and then
allocating its individual order-0 pages).
If that happens, get_user_pages_fast() may end up returning the right
page but lifting the refcount on a now-unrelated page, leading to
use-after-free of pages.
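
An illustrative interleaving (simplified; "H" is the original compound page):

	gup-fast (CPU 0)                        split/free (CPU 1)
	----------------                        ------------------
	read PTE -> page (a tail page of H)
	head = compound_head(page)
	                                        split H, or free H and reallocate
	                                        its order-0 pages; "head" is now
	                                        unrelated to "page"
	page_cache_add_speculative(head, refs)  <- refcount raised on the wrong page
	re-check PTE: unchanged, still maps
	"page", so gup-fast returns "page"      -> no reference actually held on
	                                           "page"; stray reference on "head"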

To fix it:
Re-check whether the pages still belong together after lifting the
refcount on the head page.
Move anything else that checks compound_head(page) below the refcount
increment.

This can't actually happen on bare-metal x86 (because there, disabling
IRQs locks out remote TLB flushes), but it can happen on virtualized x86
(e.g. under KVM) and probably also on arm64. The race window is pretty
narrow, and constantly allocating and shattering hugepages isn't exactly
fast; for now I've only managed to reproduce this in an x86 KVM guest with
an artificially widened timing window (by adding a loop that repeatedly
calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
so that PV TLB flushes are used instead of IPIs).
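
For reference, the artificial widening was along these lines (illustrative
only, not part of the patch; the loop count is arbitrary):

	int i;

	/*
	 * Hypothetical reproducer aid: force repeated VM exits between the
	 * compound_head() lookup and the refcount increment. Reading the COM1
	 * line status register traps to the hypervisor on every iteration.
	 */
	for (i = 0; i < 10000; i++)
		inl(0x3f8 + 5);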

As requested on the list, also replace the existing VM_BUG_ON_PAGE()
with a warning and bailout. Since the existing code only performed the
BUG_ON check on DEBUG_VM kernels, ensure that the new code also only
performs the check under that configuration - I don't want to mix two
logically separate changes together too much.
The macro VM_WARN_ON_ONCE_PAGE() doesn't return a value on !DEBUG_VM,
so wrap the whole check in an #ifdef block.
An alternative would be to change the VM_WARN_ON_ONCE_PAGE() definition
for !DEBUG_VM such that it always returns false, but since that would
differ from the behavior of the normal WARN macros, it might be too
confusing for readers.
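
Such an alternative would look roughly like this (a sketch of the rejected
idea, not part of this patch):

	#ifndef CONFIG_DEBUG_VM
	#define VM_WARN_ON_ONCE_PAGE(cond, page) ({	\
		BUILD_BUG_ON_INVALID(cond);		\
		false;					\
	})
	#endif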

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
Signed-off-by: Jann Horn <jannh@google.com>
---
 mm/gup.c | 58 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 15 deletions(-)


base-commit: 614124bea77e452aa6df7a8714e8bc820b489922

Comments

Andrew Morton June 15, 2021, 2 a.m. UTC | #1
On Tue, 15 Jun 2021 03:20:14 +0200 Jann Horn <jannh@google.com> wrote:

> try_grab_compound_head() is used to grab a reference to a page from
> get_user_pages_fast(), which is only protected against concurrent
> freeing of page tables (via local_irq_save()), but not against
> concurrent TLB flushes, freeing of data pages, or splitting of compound
> pages.
> 
> Because no reference is held to the page when try_grab_compound_head()
> is called, the page may have been freed and reallocated by the time its
> refcount has been elevated; therefore, once we're holding a stable
> reference to the page, the caller re-checks whether the PTE still points
> to the same page (with the same access rights).
> 
> The problem is that try_grab_compound_head() has to grab a reference on
> the head page; but between the time we look up what the head page is and
> the time we actually grab a reference on the head page, the compound
> page may have been split up (either explicitly through split_huge_page()
> or by freeing the compound page to the buddy allocator and then
> allocating its individual order-0 pages).
> If that happens, get_user_pages_fast() may end up returning the right
> page but lifting the refcount on a now-unrelated page, leading to
> use-after-free of pages.
> 
> To fix it:
> Re-check whether the pages still belong together after lifting the
> refcount on the head page.
> Move anything else that checks compound_head(page) below the refcount
> increment.
> 
> This can't actually happen on bare-metal x86 (because there, disabling
> IRQs locks out remote TLB flushes), but it can happen on virtualized x86
> (e.g. under KVM) and probably also on arm64. The race window is pretty
> narrow, and constantly allocating and shattering hugepages isn't exactly
> fast; for now I've only managed to reproduce this in an x86 KVM guest with
> an artificially widened timing window (by adding a loop that repeatedly
> calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
> so that PV TLB flushes are used instead of IPIs).
> 
> As requested on the list, also replace the existing VM_BUG_ON_PAGE()
> with a warning and bailout. Since the existing code only performed the
> BUG_ON check on DEBUG_VM kernels, ensure that the new code also only
> performs the check under that configuration - I don't want to mix two
> logically separate changes together too much.
> The macro VM_WARN_ON_ONCE_PAGE() doesn't return a value on !DEBUG_VM,
> so wrap the whole check in an #ifdef block.
> An alternative would be to change the VM_WARN_ON_ONCE_PAGE() definition
> for !DEBUG_VM such that it always returns false, but since that would
> differ from the behavior of the normal WARN macros, it might be too
> confusing for readers.
> 
> ...
>
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -43,8 +43,25 @@ static void hpage_pincount_sub(struct page *page, int refs)
>  
>  	atomic_sub(refs, compound_pincount_ptr(page));
>  }
>  
> +/* Equivalent to calling put_page() @refs times. */
> +static void put_page_refs(struct page *page, int refs)
> +{
> +#ifdef CONFIG_DEBUG_VM
> +	if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
> +		return;
> +#endif

Well dang those ifdefs.

With CONFIG_DEBUG_VM=n, this expands to

	if (((void)(sizeof((__force long)(page_ref_count(page) < refs))))
		return;

which will fail with "void value not ignored as it ought to be". 
Because VM_WARN_ON_ONCE_PAGE() is an rval with CONFIG_DEBUG_VM=y and is
not an rval with CONFIG_DEBUG_VM=n.    So the ifdefs are needed.

I know we've been around this loop before, but it still sucks!  Someone
please remind me of the reasoning?

Can we do

#define VM_WARN_ON_ONCE_PAGE(cond, page) {
	BUILD_BUG_ON_INVALID(cond);
	cond;
}

?
Jann Horn June 15, 2021, 2:36 a.m. UTC | #2
On Tue, Jun 15, 2021 at 4:00 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 15 Jun 2021 03:20:14 +0200 Jann Horn <jannh@google.com> wrote:
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -43,8 +43,25 @@ static void hpage_pincount_sub(struct page *page, int refs)
> >
> >       atomic_sub(refs, compound_pincount_ptr(page));
> >  }
> >
> > +/* Equivalent to calling put_page() @refs times. */
> > +static void put_page_refs(struct page *page, int refs)
> > +{
> > +#ifdef CONFIG_DEBUG_VM
> > +     if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
> > +             return;
> > +#endif
>
> Well dang those ifdefs.
>
> With CONFIG_DEBUG_VM=n, this expands to
>
>         if (((void)(sizeof((__force long)(page_ref_count(page) < refs))))
>                 return;
>
> which will fail with "void value not ignored as it ought to be".
> Because VM_WARN_ON_ONCE_PAGE() is an rval with CONFIG_DEBUG_VM=y and is
> not an rval with CONFIG_DEBUG_VM=n.    So the ifdefs are needed.
>
> I know we've been around this loop before, but it still sucks!  Someone
> please remind me of the reasoning?
>
> Can we do
>
> #define VM_WARN_ON_ONCE_PAGE(cond, page) {
>         BUILD_BUG_ON_INVALID(cond);
>         cond;
> }
>
> ?

See also <https://lore.kernel.org/linux-mm/CAG48ez3Vb1BxaZ0EHhR9ctkjCCygMWOQqFMGqt-=Ea2yXrvKiw@mail.gmail.com/>.

I see basically two issues with your proposal:

1. Even if you use it without "if (...)", the compiler has to generate
code to evaluate the condition unless it can prove that the condition
has no side effects. (It can't prove that for stuff like atomic_read()
or READ_ONCE(), because those are volatile loads and C says you can't
eliminate those. There are compiler builtins that are more
fine-grained, but the kernel doesn't use those.)

2. The current version generates no code at all for !DEBUG_VM builds.
Your proposal would still have the conditional bailout even in
!DEBUG_VM builds - and if the compiler already has to evaluate the
condition and generate a conditional branch, I don't know whether
there is much of a point in omitting the code that prints a warning
under !DEBUG_VM (although I guess it could impact register spilling
and assignment).
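
As a standalone illustration of point 1 (hypothetical snippet, not from the
patch):

	static void example(struct page *page, int refs)
	{
		/*
		 * page_ref_count() boils down to atomic_read(), i.e. a volatile
		 * load, so the load still has to be performed even when the
		 * surrounding warning machinery is compiled away and the result
		 * goes unused.
		 */
		(void)(page_ref_count(page) < refs);
	}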


If you don't like the ifdeffery in this patch, can you please merge
the v1 patch? It's not like I was adding a new BUG_ON(), I was just
refactoring an existing BUG_ON() into a helper function, so I wasn't
making things worse; and I don't want to think about how to best
design WARN/BUG macros for the VM subsystem in order to land this
bugfix.

(Also, this patch is intended for stable-backporting, so mixing in
more changes unrelated to the issue being fixed might make backporting
more annoying. This v2 patch will already require more fixup than the
v1 one, because VM_WARN_ON_ONCE_PAGE() was only added recently.)
Jann Horn June 15, 2021, 2:38 a.m. UTC | #3
On Tue, Jun 15, 2021 at 4:36 AM Jann Horn <jannh@google.com> wrote:
> If you don't like the ifdeffery in this patch, can you please merge
> the v1 patch? It's not like I was adding a new BUG_ON(), I was just
> refactoring an existing BUG_ON() into a helper function, so I wasn't
> making things worse; and I don't want to think about how to best
> design WARN/BUG macros for the VM subsystem in order to land this
> bugfix.

Ah, nevermind, I hadn't seen that you already merged this one.
John Hubbard June 15, 2021, 6:37 a.m. UTC | #4
On 6/14/21 6:20 PM, Jann Horn wrote:
> try_grab_compound_head() is used to grab a reference to a page from
> get_user_pages_fast(), which is only protected against concurrent
> freeing of page tables (via local_irq_save()), but not against
> concurrent TLB flushes, freeing of data pages, or splitting of compound
> pages.
> 
> Because no reference is held to the page when try_grab_compound_head()
> is called, the page may have been freed and reallocated by the time its
> refcount has been elevated; therefore, once we're holding a stable
> reference to the page, the caller re-checks whether the PTE still points
> to the same page (with the same access rights).
> 
> The problem is that try_grab_compound_head() has to grab a reference on
> the head page; but between the time we look up what the head page is and
> the time we actually grab a reference on the head page, the compound
> page may have been split up (either explicitly through split_huge_page()
> or by freeing the compound page to the buddy allocator and then
> allocating its individual order-0 pages).
> If that happens, get_user_pages_fast() may end up returning the right
> page but lifting the refcount on a now-unrelated page, leading to
> use-after-free of pages.
> 
> To fix it:
> Re-check whether the pages still belong together after lifting the
> refcount on the head page.
> Move anything else that checks compound_head(page) below the refcount
> increment.
> 
> This can't actually happen on bare-metal x86 (because there, disabling
> IRQs locks out remote TLB flushes), but it can happen on virtualized x86
> (e.g. under KVM) and probably also on arm64. The race window is pretty
> narrow, and constantly allocating and shattering hugepages isn't exactly
> fast; for now I've only managed to reproduce this in an x86 KVM guest with
> an artificially widened timing window (by adding a loop that repeatedly
> calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
> so that PV TLB flushes are used instead of IPIs).
> 
> As requested on the list, also replace the existing VM_BUG_ON_PAGE()
> with a warning and bailout. Since the existing code only performed the
> BUG_ON check on DEBUG_VM kernels, ensure that the new code also only
> performs the check under that configuration - I don't want to mix two
> logically separate changes together too much.
> The macro VM_WARN_ON_ONCE_PAGE() doesn't return a value on !DEBUG_VM,
> so wrap the whole check in an #ifdef block.
> An alternative would be to change the VM_WARN_ON_ONCE_PAGE() definition
> for !DEBUG_VM such that it always returns false, but since that would
> differ from the behavior of the normal WARN macros, it might be too
> confusing for readers.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: stable@vger.kernel.org
> Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
> Signed-off-by: Jann Horn <jannh@google.com>

Looks good. I'll poke around maybe tomorrow and see if there is anything
that might possibly improve the VM_WARN*() macro situation, as a follow up.

One small question below, but in any case,

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

> ---
>   mm/gup.c | 58 +++++++++++++++++++++++++++++++++++++++++---------------
>   1 file changed, 43 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 3ded6a5f26b2..90262e448552 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -43,8 +43,25 @@ static void hpage_pincount_sub(struct page *page, int refs)
>   
>   	atomic_sub(refs, compound_pincount_ptr(page));
>   }
>   
> +/* Equivalent to calling put_page() @refs times. */
> +static void put_page_refs(struct page *page, int refs)
> +{
> +#ifdef CONFIG_DEBUG_VM
> +	if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
> +		return;
> +#endif
> +
> +	/*
> +	 * Calling put_page() for each ref is unnecessarily slow. Only the last
> +	 * ref needs a put_page().
> +	 */
> +	if (refs > 1)
> +		page_ref_sub(page, refs - 1);
> +	put_page(page);
> +}
> +
>   /*
>    * Return the compound head page with ref appropriately incremented,
>    * or NULL if that failed.
>    */
> @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
>   	if (WARN_ON_ONCE(page_ref_count(head) < 0))
>   		return NULL;
>   	if (unlikely(!page_cache_add_speculative(head, refs)))
>   		return NULL;
> +
> +	/*
> +	 * At this point we have a stable reference to the head page; but it
> +	 * could be that between the compound_head() lookup and the refcount
> +	 * increment, the compound page was split, in which case we'd end up
> +	 * holding a reference on a page that has nothing to do with the page
> +	 * we were given anymore.
> +	 * So now that the head page is stable, recheck that the pages still
> +	 * belong together.
> +	 */
> +	if (unlikely(compound_head(page) != head)) {

I was just wondering about what all could happen here. Such as: page gets split,
reallocated into a different-sized compound page, one that still has page pointing
to head. I think that's OK, because we don't look at or change other huge page
fields.

But I thought I'd mention the idea in case anyone else has any clever ideas about
how this simple check might be insufficient here. It seems fine to me, but I
routinely lack enough imagination about concurrent operations. :)

thanks,
Jann Horn June 15, 2021, 12:09 p.m. UTC | #5
On Tue, Jun 15, 2021 at 8:37 AM John Hubbard <jhubbard@nvidia.com> wrote:
> On 6/14/21 6:20 PM, Jann Horn wrote:
> > try_grab_compound_head() is used to grab a reference to a page from
> > get_user_pages_fast(), which is only protected against concurrent
> > freeing of page tables (via local_irq_save()), but not against
> > concurrent TLB flushes, freeing of data pages, or splitting of compound
> > pages.
[...]
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thanks!

[...]
> > @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
> >       if (WARN_ON_ONCE(page_ref_count(head) < 0))
> >               return NULL;
> >       if (unlikely(!page_cache_add_speculative(head, refs)))
> >               return NULL;
> > +
> > +     /*
> > +      * At this point we have a stable reference to the head page; but it
> > +      * could be that between the compound_head() lookup and the refcount
> > +      * increment, the compound page was split, in which case we'd end up
> > +      * holding a reference on a page that has nothing to do with the page
> > +      * we were given anymore.
> > +      * So now that the head page is stable, recheck that the pages still
> > +      * belong together.
> > +      */
> > +     if (unlikely(compound_head(page) != head)) {
>
> I was just wondering about what all could happen here. Such as: page gets split,
> reallocated into a different-sized compound page, one that still has page pointing
> to head. I think that's OK, because we don't look at or change other huge page
> fields.
>
> But I thought I'd mention the idea in case anyone else has any clever ideas about
> how this simple check might be insufficient here. It seems fine to me, but I
> routinely lack enough imagination about concurrent operations. :)

Hmmm... I think the scariest aspect here is probably the interaction
with concurrent allocation of a compound page on architectures with
store-store reordering (like ARM). *If* the page allocator handled
compound pages with lockless, non-atomic percpu freelists, I think it
might be possible that the zeroing of tail_page->compound_head in
put_page() could be reordered after the page has been freed,
reallocated and set to refcount 1 again?

That shouldn't be possible at the moment, but it is still a bit scary.


I think the lockless page cache code also has to deal with somewhat
similar ordering concerns when it uses page_cache_get_speculative(),
e.g. in mapping_get_entry() - first it looks up a page pointer with
xas_load(), and any access to the page later on would be a _dependent
load_, but if the page then gets freed, reallocated, and inserted into
the page cache again before the refcount increment and the re-check
using xas_reload(), then there would be no data dependency from
xas_reload() to the following use of the page...
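
For comparison, that lookup pattern is roughly (paraphrasing
mapping_get_entry(); error handling and shadow-entry checks elided):

	page = xas_load(&xas);			/* 1. look up the page pointer  */
	if (!page_cache_get_speculative(page))	/* 2. speculative refcount grab */
		goto repeat;
	if (unlikely(page != xas_reload(&xas))) { /* 3. re-check after the grab */
		put_page(page);
		goto repeat;
	}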
Yang Shi June 15, 2021, 11:10 p.m. UTC | #6
On Tue, Jun 15, 2021 at 5:10 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jun 15, 2021 at 8:37 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > On 6/14/21 6:20 PM, Jann Horn wrote:
> > > try_grab_compound_head() is used to grab a reference to a page from
> > > get_user_pages_fast(), which is only protected against concurrent
> > > freeing of page tables (via local_irq_save()), but not against
> > > concurrent TLB flushes, freeing of data pages, or splitting of compound
> > > pages.
> [...]
> > Reviewed-by: John Hubbard <jhubbard@nvidia.com>
>
> Thanks!
>
> [...]
> > > @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
> > >       if (WARN_ON_ONCE(page_ref_count(head) < 0))
> > >               return NULL;
> > >       if (unlikely(!page_cache_add_speculative(head, refs)))
> > >               return NULL;
> > > +
> > > +     /*
> > > +      * At this point we have a stable reference to the head page; but it
> > > +      * could be that between the compound_head() lookup and the refcount
> > > +      * increment, the compound page was split, in which case we'd end up
> > > +      * holding a reference on a page that has nothing to do with the page
> > > +      * we were given anymore.
> > > +      * So now that the head page is stable, recheck that the pages still
> > > +      * belong together.
> > > +      */
> > > +     if (unlikely(compound_head(page) != head)) {
> >
> > I was just wondering about what all could happen here. Such as: page gets split,
> > reallocated into a different-sized compound page, one that still has page pointing
> > to head. I think that's OK, because we don't look at or change other huge page
> > fields.
> >
> > But I thought I'd mention the idea in case anyone else has any clever ideas about
> > how this simple check might be insufficient here. It seems fine to me, but I
> > routinely lack enough imagination about concurrent operations. :)
>
> Hmmm... I think the scariest aspect here is probably the interaction
> with concurrent allocation of a compound page on architectures with
> store-store reordering (like ARM). *If* the page allocator handled
> compound pages with lockless, non-atomic percpu freelists, I think it
> might be possible that the zeroing of tail_page->compound_head in
> put_page() could be reordered after the page has been freed,
> reallocated and set to refcount 1 again?
>
> That shouldn't be possible at the moment, but it is still a bit scary.

It might be possible after Mel's "mm/page_alloc: Allow high-order
pages to be stored on the per-cpu lists" patch
(https://patchwork.kernel.org/project/linux-mm/patch/20210611135753.GC30378@techsingularity.net/).

>
>
> I think the lockless page cache code also has to deal with somewhat
> similar ordering concerns when it uses page_cache_get_speculative(),
> e.g. in mapping_get_entry() - first it looks up a page pointer with
> xas_load(), and any access to the page later on would be a _dependent
> load_, but if the page then gets freed, reallocated, and inserted into
> the page cache again before the refcount increment and the re-check
> using xas_reload(), then there would be no data dependency from
> xas_reload() to the following use of the page...
>
Vlastimil Babka June 16, 2021, 5:27 p.m. UTC | #7
On 6/16/21 1:10 AM, Yang Shi wrote:
> On Tue, Jun 15, 2021 at 5:10 AM Jann Horn <jannh@google.com> wrote:
>>
>> On Tue, Jun 15, 2021 at 8:37 AM John Hubbard <jhubbard@nvidia.com> wrote:
>> > On 6/14/21 6:20 PM, Jann Horn wrote:
>> > > try_grab_compound_head() is used to grab a reference to a page from
>> > > get_user_pages_fast(), which is only protected against concurrent
>> > > freeing of page tables (via local_irq_save()), but not against
>> > > concurrent TLB flushes, freeing of data pages, or splitting of compound
>> > > pages.
>> [...]
>> > Reviewed-by: John Hubbard <jhubbard@nvidia.com>
>>
>> Thanks!
>>
>> [...]
>> > > @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
>> > >       if (WARN_ON_ONCE(page_ref_count(head) < 0))
>> > >               return NULL;
>> > >       if (unlikely(!page_cache_add_speculative(head, refs)))
>> > >               return NULL;
>> > > +
>> > > +     /*
>> > > +      * At this point we have a stable reference to the head page; but it
>> > > +      * could be that between the compound_head() lookup and the refcount
>> > > +      * increment, the compound page was split, in which case we'd end up
>> > > +      * holding a reference on a page that has nothing to do with the page
>> > > +      * we were given anymore.
>> > > +      * So now that the head page is stable, recheck that the pages still
>> > > +      * belong together.
>> > > +      */
>> > > +     if (unlikely(compound_head(page) != head)) {
>> >
>> > I was just wondering about what all could happen here. Such as: page gets split,
>> > reallocated into a different-sized compound page, one that still has page pointing
>> > to head. I think that's OK, because we don't look at or change other huge page
>> > fields.
>> >
>> > But I thought I'd mention the idea in case anyone else has any clever ideas about
>> > how this simple check might be insufficient here. It seems fine to me, but I
>> > routinely lack enough imagination about concurrent operations. :)
>>
>> Hmmm... I think the scariest aspect here is probably the interaction
>> with concurrent allocation of a compound page on architectures with
>> store-store reordering (like ARM). *If* the page allocator handled
>> compound pages with lockless, non-atomic percpu freelists, I think it
>> might be possible that the zeroing of tail_page->compound_head in
>> put_page() could be reordered after the page has been freed,
>> reallocated and set to refcount 1 again?
>>
>> That shouldn't be possible at the moment, but it is still a bit scary.
> 
> It might be possible after Mel's "mm/page_alloc: Allow high-order
> pages to be stored on the per-cpu lists" patch
> (https://patchwork.kernel.org/project/linux-mm/patch/20210611135753.GC30378@techsingularity.net/).

Those would be percpu indeed, but not "lockless, non-atomic", no? They are
protected by a local_lock.

>>
>>
>> I think the lockless page cache code also has to deal with somewhat
>> similar ordering concerns when it uses page_cache_get_speculative(),
>> e.g. in mapping_get_entry() - first it looks up a page pointer with
>> xas_load(), and any access to the page later on would be a _dependent
>> load_, but if the page then gets freed, reallocated, and inserted into
>> the page cache again before the refcount increment and the re-check
>> using xas_reload(), then there would be no data dependency from
>> xas_reload() to the following use of the page...
>>
>
Yang Shi June 16, 2021, 6:40 p.m. UTC | #8
On Wed, Jun 16, 2021 at 10:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 6/16/21 1:10 AM, Yang Shi wrote:
> > On Tue, Jun 15, 2021 at 5:10 AM Jann Horn <jannh@google.com> wrote:
> >>
> >> On Tue, Jun 15, 2021 at 8:37 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >> > On 6/14/21 6:20 PM, Jann Horn wrote:
> >> > > try_grab_compound_head() is used to grab a reference to a page from
> >> > > get_user_pages_fast(), which is only protected against concurrent
> >> > > freeing of page tables (via local_irq_save()), but not against
> >> > > concurrent TLB flushes, freeing of data pages, or splitting of compound
> >> > > pages.
> >> [...]
> >> > Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> >>
> >> Thanks!
> >>
> >> [...]
> >> > > @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
> >> > >       if (WARN_ON_ONCE(page_ref_count(head) < 0))
> >> > >               return NULL;
> >> > >       if (unlikely(!page_cache_add_speculative(head, refs)))
> >> > >               return NULL;
> >> > > +
> >> > > +     /*
> >> > > +      * At this point we have a stable reference to the head page; but it
> >> > > +      * could be that between the compound_head() lookup and the refcount
> >> > > +      * increment, the compound page was split, in which case we'd end up
> >> > > +      * holding a reference on a page that has nothing to do with the page
> >> > > +      * we were given anymore.
> >> > > +      * So now that the head page is stable, recheck that the pages still
> >> > > +      * belong together.
> >> > > +      */
> >> > > +     if (unlikely(compound_head(page) != head)) {
> >> >
> >> > I was just wondering about what all could happen here. Such as: page gets split,
> >> > reallocated into a different-sized compound page, one that still has page pointing
> >> > to head. I think that's OK, because we don't look at or change other huge page
> >> > fields.
> >> >
> >> > But I thought I'd mention the idea in case anyone else has any clever ideas about
> >> > how this simple check might be insufficient here. It seems fine to me, but I
> >> > routinely lack enough imagination about concurrent operations. :)
> >>
> >> Hmmm... I think the scariest aspect here is probably the interaction
> >> with concurrent allocation of a compound page on architectures with
> >> store-store reordering (like ARM). *If* the page allocator handled
> >> compound pages with lockless, non-atomic percpu freelists, I think it
> >> might be possible that the zeroing of tail_page->compound_head in
> >> put_page() could be reordered after the page has been freed,
> >> reallocated and set to refcount 1 again?
> >>
> >> That shouldn't be possible at the moment, but it is still a bit scary.
> >
> > It might be possible after Mel's "mm/page_alloc: Allow high-order
> > pages to be stored on the per-cpu lists" patch
> > (https://patchwork.kernel.org/project/linux-mm/patch/20210611135753.GC30378@techsingularity.net/).
>
> Those would be percpu indeed, but not "lockless, non-atomic", no? They are
> protected by a local_lock.

The local_lock is *not* a lock on non-PREEMPT_RT kernel IIUC. It
disables preempt and IRQ. But preempt disable is no-op on non-preempt
kernel. IRQ disable can guarantee it is atomic context, but I'm not
sure if it is equivalent to "atomic freelists" in Jann's context.

>
> >>
> >>
> >> I think the lockless page cache code also has to deal with somewhat
> >> similar ordering concerns when it uses page_cache_get_speculative(),
> >> e.g. in mapping_get_entry() - first it looks up a page pointer with
> >> xas_load(), and any access to the page later on would be a _dependent
> >> load_, but if the page then gets freed, reallocated, and inserted into
> >> the page cache again before the refcount increment and the re-check
> >> using xas_reload(), then there would be no data dependency from
> >> xas_reload() to the following use of the page...
> >>
> >
>
Vlastimil Babka June 17, 2021, 4:09 p.m. UTC | #9
On 6/16/21 8:40 PM, Yang Shi wrote:
> On Wed, Jun 16, 2021 at 10:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 6/16/21 1:10 AM, Yang Shi wrote:
>> > On Tue, Jun 15, 2021 at 5:10 AM Jann Horn <jannh@google.com> wrote:
>> >>
>> >
>> > It might be possible after Mel's "mm/page_alloc: Allow high-order
>> > pages to be stored on the per-cpu lists" patch
>> > (https://patchwork.kernel.org/project/linux-mm/patch/20210611135753.GC30378@techsingularity.net/).
>>
>> Those would be percpu indeed, but not "lockless, non-atomic", no? They are
>> protected by a local_lock.
> 
> The local_lock is *not* a lock on non-PREEMPT_RT kernel IIUC. It
> disables preempt and IRQ. But preempt disable is no-op on non-preempt
> kernel. IRQ disable can guarantee it is atomic context, but I'm not
> sure if it is equivalent to "atomic freelists" in Jann's context.

Hm right, I thought all locks had the acquire/release semantics, but this one is
just a barrier().
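
On !PREEMPT_RT the lock/unlock pair reduces to roughly the following (a
sketch, ignoring the lockdep annotations), so it provides no cross-CPU
acquire/release ordering of its own:

	#define local_lock_irqsave(lock, flags)		local_irq_save(flags)
	#define local_unlock_irqrestore(lock, flags)	local_irq_restore(flags)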
Jason Gunthorpe June 18, 2021, 1:25 p.m. UTC | #10
On Tue, Jun 15, 2021 at 02:09:38PM +0200, Jann Horn wrote:
> On Tue, Jun 15, 2021 at 8:37 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > On 6/14/21 6:20 PM, Jann Horn wrote:
> > > try_grab_compound_head() is used to grab a reference to a page from
> > > get_user_pages_fast(), which is only protected against concurrent
> > > freeing of page tables (via local_irq_save()), but not against
> > > concurrent TLB flushes, freeing of data pages, or splitting of compound
> > > pages.
> [...]
> > Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> Thanks!
> 
> [...]
> > > @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
> > >       if (WARN_ON_ONCE(page_ref_count(head) < 0))
> > >               return NULL;
> > >       if (unlikely(!page_cache_add_speculative(head, refs)))
> > >               return NULL;
> > > +
> > > +     /*
> > > +      * At this point we have a stable reference to the head page; but it
> > > +      * could be that between the compound_head() lookup and the refcount
> > > +      * increment, the compound page was split, in which case we'd end up
> > > +      * holding a reference on a page that has nothing to do with the page
> > > +      * we were given anymore.
> > > +      * So now that the head page is stable, recheck that the pages still
> > > +      * belong together.
> > > +      */
> > > +     if (unlikely(compound_head(page) != head)) {
> >
> > I was just wondering about what all could happen here. Such as: page gets split,
> > reallocated into a different-sized compound page, one that still has page pointing
> > to head. I think that's OK, because we don't look at or change other huge page
> > fields.
> >
> > But I thought I'd mention the idea in case anyone else has any clever ideas about
> > how this simple check might be insufficient here. It seems fine to me, but I
> > routinely lack enough imagination about concurrent operations. :)
> 
> Hmmm... I think the scariest aspect here is probably the interaction
> with concurrent allocation of a compound page on architectures with
> store-store reordering (like ARM). *If* the page allocator handled
> compound pages with lockless, non-atomic percpu freelists, I think it
> might be possible that the zeroing of tail_page->compound_head in
> put_page() could be reordered after the page has been freed,
> reallocated and set to refcount 1 again?

Oh wow, yes, this all looks sketchy! Doing a RCU access to page->head
is a really challenging thing :\

On the simplified store side:

  page->head = my_compound
  *ptep = page

There must be some kind of release barrier between those two
operations or this is all broken.. That definitely deserves a comment.

Ideally we'd use smp_store_release to install the *pte :\

Assuming we cover the release barrier, I would think the algorithm
should be broadly:

 struct page *target_page = READ_ONCE(pte)
 struct page *target_folio = READ_ONCE(target_page->head)

 page_cache_add_speculative(target_folio, refs)

 if (target_folio != READ_ONCE(target_page->head) ||
     target_page != READ_ONCE(pte))
    goto abort

Which is what this patch does but I would like to see the
READ_ONCE's.

And there possibly should be two try_grab_compound_head()'s since we
don't need this overhead on the fully locked path, especially the
double atomic on page_ref_add()

> I think the lockless page cache code also has to deal with somewhat
> similar ordering concerns when it uses page_cache_get_speculative(),
> e.g. in mapping_get_entry() - first it looks up a page pointer with
> xas_load(), and any access to the page later on would be a _dependent
> load_, but if the page then gets freed, reallocated, and inserted into
> the page cache again before the refcount increment and the re-check
> using xas_reload(), then there would be no data dependency from
> xas_reload() to the following use of the page...

xas_store() should have the smp_store_release() inside it at least..

Even so it doesn't seem to do page->head, so this is not quite the
same thing

Jason
Matthew Wilcox June 18, 2021, 1:50 p.m. UTC | #11
On Fri, Jun 18, 2021 at 10:25:56AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 15, 2021 at 02:09:38PM +0200, Jann Horn wrote:
> > On Tue, Jun 15, 2021 at 8:37 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > > On 6/14/21 6:20 PM, Jann Horn wrote:
> > > > @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
> > > >       if (WARN_ON_ONCE(page_ref_count(head) < 0))
> > > >               return NULL;
> > > >       if (unlikely(!page_cache_add_speculative(head, refs)))
> > > >               return NULL;
> > > > +
> > > > +     /*
> > > > +      * At this point we have a stable reference to the head page; but it
> > > > +      * could be that between the compound_head() lookup and the refcount
> > > > +      * increment, the compound page was split, in which case we'd end up
> > > > +      * holding a reference on a page that has nothing to do with the page
> > > > +      * we were given anymore.
> > > > +      * So now that the head page is stable, recheck that the pages still
> > > > +      * belong together.
> > > > +      */
> > > > +     if (unlikely(compound_head(page) != head)) {
> > >
> > > I was just wondering about what all could happen here. Such as: page gets split,
> > > reallocated into a different-sized compound page, one that still has page pointing
> > > to head. I think that's OK, because we don't look at or change other huge page
> > > fields.
> > >
> > > But I thought I'd mention the idea in case anyone else has any clever ideas about
> > > how this simple check might be insufficient here. It seems fine to me, but I
> > > routinely lack enough imagination about concurrent operations. :)
> > 
> > Hmmm... I think the scariest aspect here is probably the interaction
> > with concurrent allocation of a compound page on architectures with
> > store-store reordering (like ARM). *If* the page allocator handled
> > compound pages with lockless, non-atomic percpu freelists, I think it
> > might be possible that the zeroing of tail_page->compound_head in
> > put_page() could be reordered after the page has been freed,
> > reallocated and set to refcount 1 again?
> 
> Oh wow, yes, this all looks sketchy! Doing a RCU access to page->head
> is a really challenging thing :\
> 
> On the simplified store side:
> 
>   page->head = my_compound
>   *ptep = page
> 
> There must be some kind of release barrier between those two
> operations or this is all broken.. That definitely deserves a comment.

set_compound_head() includes a WRITE_ONCE.  Is that enough, or does it
need an smp_wmb()?

> Ideally we'd use smp_store_release to install the *pte :\
> 
> Assuming we cover the release barrier, I would think the algorithm
> should be broadly:
> 
>  struct page *target_page = READ_ONCE(pte)
>  struct page *target_folio = READ_ONCE(target_page->head)

compound_head() includes a READ_ONCE already.

>  page_cache_add_speculative(target_folio, refs)

That's spelled folio_ref_try_add_rcu() right now.

>  if (target_folio != READ_ONCE(target_page->head) ||
>      target_page != READ_ONCE(pte))
>     goto abort
> 
> Which is what this patch does but I would like to see the
> READ_ONCE's.

... you want them to be uninlined from compound_head(), et al?

> And there possibly should be two try_grab_compound_head()'s since we
> don't need this overhead on the fully locked path, especially the
> double atomic on page_ref_add()

There's only one atomic on page_ref_add().  And you need more of this
overhead on the fully locked path than you realise; the page might be
split without holding the mmap_sem, for example.

> > I think the lockless page cache code also has to deal with somewhat
> > similar ordering concerns when it uses page_cache_get_speculative(),
> > e.g. in mapping_get_entry() - first it looks up a page pointer with
> > xas_load(), and any access to the page later on would be a _dependent
> > load_, but if the page then gets freed, reallocated, and inserted into
> > the page cache again before the refcount increment and the re-check
> > using xas_reload(), then there would be no data dependency from
> > xas_reload() to the following use of the page...
> 
> xas_store() should have the smp_store_release() inside it at least..
> 
> Even so it doesn't seem to do page->head, so this is not quite the
> same thing

The page cache only stores head pages, so it's a little simpler than
lookup from PTE.  I have ideas for making PFN->folio lookup go directly
to the folio without passing through a page on the way, but that's for
much, much later.
Jason Gunthorpe June 18, 2021, 2:58 p.m. UTC | #12
On Fri, Jun 18, 2021 at 02:50:00PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 18, 2021 at 10:25:56AM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 15, 2021 at 02:09:38PM +0200, Jann Horn wrote:
> > > On Tue, Jun 15, 2021 at 8:37 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > > > On 6/14/21 6:20 PM, Jann Horn wrote:
> > > > > @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
> > > > >       if (WARN_ON_ONCE(page_ref_count(head) < 0))
> > > > >               return NULL;
> > > > >       if (unlikely(!page_cache_add_speculative(head, refs)))
> > > > >               return NULL;
> > > > > +
> > > > > +     /*
> > > > > +      * At this point we have a stable reference to the head page; but it
> > > > > +      * could be that between the compound_head() lookup and the refcount
> > > > > +      * increment, the compound page was split, in which case we'd end up
> > > > > +      * holding a reference on a page that has nothing to do with the page
> > > > > +      * we were given anymore.
> > > > > +      * So now that the head page is stable, recheck that the pages still
> > > > > +      * belong together.
> > > > > +      */
> > > > > +     if (unlikely(compound_head(page) != head)) {
> > > >
> > > > I was just wondering about what all could happen here. Such as: page gets split,
> > > > reallocated into a different-sized compound page, one that still has page pointing
> > > > to head. I think that's OK, because we don't look at or change other huge page
> > > > fields.
> > > >
> > > > But I thought I'd mention the idea in case anyone else has any clever ideas about
> > > > how this simple check might be insufficient here. It seems fine to me, but I
> > > > routinely lack enough imagination about concurrent operations. :)
> > > 
> > > Hmmm... I think the scariest aspect here is probably the interaction
> > > with concurrent allocation of a compound page on architectures with
> > > store-store reordering (like ARM). *If* the page allocator handled
> > > compound pages with lockless, non-atomic percpu freelists, I think it
> > > might be possible that the zeroing of tail_page->compound_head in
> > > put_page() could be reordered after the page has been freed,
> > > reallocated and set to refcount 1 again?
> > 
> > Oh wow, yes, this all looks sketchy! Doing a RCU access to page->head
> > is a really challenging thing :\
> > 
> > On the simplified store side:
> > 
> >   page->head = my_compound
> >   *ptep = page
> > 
> > There must be some kind of release barrier between those two
> > operations or this is all broken.. That definitely deserves a comment.
> 
> set_compound_head() includes a WRITE_ONCE.  Is that enough, or does it
> need an smp_wmb()?

Probably, at least the generic code maps smp_store_release() to
__smp_wmb.

I think Jann was making the argument that there is going to be some
other release operation due to locking between the two above, eg a
lock unlock or something.

> > Ideally we'd use smp_store_release to install the *pte :\
> > 
> > Assuming we cover the release barrier, I would think the algorithm
> > should be broadly:
> > 
> >  struct page *target_page = READ_ONCE(pte)
> >  struct page *target_folio = READ_ONCE(target_page->head)
> 
> compound_head() includes a READ_ONCE already.

Ah, see I obviously haven't memorized that detail :\

> >  page_cache_add_speculative(target_folio, refs)
> 
> That's spelled folio_ref_try_add_rcu() right now.

That seems a much better name

> >  if (target_folio != READ_ONCE(target_page->head) ||
> >      target_page != READ_ONCE(pte))
> >     goto abort
> > 
> > Which is what this patch does but I would like to see the
> > READ_ONCE's.
> 
> ... you want them to be uninlined from compound_head(), et al?

Not really (though see below), I was mostly looking at the pte which
just does pte_val(), no READ_ONCE in there

> > And there possibly should be two try_grab_compound_head()'s since we
> > don't need this overhead on the fully locked path, especially the
> > double atomic on page_ref_add()
> 
> There's only one atomic on page_ref_add(). 

Look at the original patch, it adds this:

+		else
+			page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));

Where page is the folio, which is now two atomics to do the same
ref. This is happening because we can't do hpage_pincount_available()
before having initially locked the folio, thus we can no longer
precompute what 'ref' to give to the first folio_ref_try_add_rcu()

> And you need more of this overhead on the fully locked path than you
> realise; the page might be split without holding the mmap_sem, for
> example.

Fully locked here means holding the PTL spinlocks, so we know the pte
cannot change and particularly the refcount of a folio can't go to
zero. We can't change compound_head if the refcount is
elevated.

Keep in mind we also do this in gup:

 folio_ref_try_add_rcu(READ_ONCE(target_page->head), 1)
 [..]
 folio_put_refs(READ_ONCE(target_page->head), 1)

Which makes me wonder why we have READ_ONCE inside compound_head?

I'm reading the commit message of 1d798ca3f164 ("mm: make
compound_head() robust"), and to me that looks like another special
lockless algorithm that should have the READ_ONCE in it, not the
general code.

Jason

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 3ded6a5f26b2..90262e448552 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -43,8 +43,25 @@  static void hpage_pincount_sub(struct page *page, int refs)
 
 	atomic_sub(refs, compound_pincount_ptr(page));
 }
 
+/* Equivalent to calling put_page() @refs times. */
+static void put_page_refs(struct page *page, int refs)
+{
+#ifdef CONFIG_DEBUG_VM
+	if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
+		return;
+#endif
+
+	/*
+	 * Calling put_page() for each ref is unnecessarily slow. Only the last
+	 * ref needs a put_page().
+	 */
+	if (refs > 1)
+		page_ref_sub(page, refs - 1);
+	put_page(page);
+}
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
  */
@@ -55,8 +72,23 @@  static inline struct page *try_get_compound_head(struct page *page, int refs)
 	if (WARN_ON_ONCE(page_ref_count(head) < 0))
 		return NULL;
 	if (unlikely(!page_cache_add_speculative(head, refs)))
 		return NULL;
+
+	/*
+	 * At this point we have a stable reference to the head page; but it
+	 * could be that between the compound_head() lookup and the refcount
+	 * increment, the compound page was split, in which case we'd end up
+	 * holding a reference on a page that has nothing to do with the page
+	 * we were given anymore.
+	 * So now that the head page is stable, recheck that the pages still
+	 * belong together.
+	 */
+	if (unlikely(compound_head(page) != head)) {
+		put_page_refs(head, refs);
+		return NULL;
+	}
+
 	return head;
 }
 
 /*
@@ -94,25 +126,28 @@  __maybe_unused struct page *try_grab_compound_head(struct page *page,
 		if (unlikely((flags & FOLL_LONGTERM) &&
 			     !is_pinnable_page(page)))
 			return NULL;
 
+		/*
+		 * CAUTION: Don't use compound_head() on the page before this
+		 * point, the result won't be stable.
+		 */
+		page = try_get_compound_head(page, refs);
+		if (!page)
+			return NULL;
+
 		/*
 		 * When pinning a compound page of order > 1 (which is what
 		 * hpage_pincount_available() checks for), use an exact count to
 		 * track it, via hpage_pincount_add/_sub().
 		 *
 		 * However, be sure to *also* increment the normal page refcount
 		 * field at least once, so that the page really is pinned.
 		 */
-		if (!hpage_pincount_available(page))
-			refs *= GUP_PIN_COUNTING_BIAS;
-
-		page = try_get_compound_head(page, refs);
-		if (!page)
-			return NULL;
-
 		if (hpage_pincount_available(page))
 			hpage_pincount_add(page, refs);
+		else
+			page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
 
 		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
 				    orig_refs);
 
@@ -134,16 +169,9 @@  static void put_compound_head(struct page *page, int refs, unsigned int flags)
 		else
 			refs *= GUP_PIN_COUNTING_BIAS;
 	}
 
-	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
-	/*
-	 * Calling put_page() for each ref is unnecessarily slow. Only the last
-	 * ref needs a put_page().
-	 */
-	if (refs > 1)
-		page_ref_sub(page, refs - 1);
-	put_page(page);
+	put_page_refs(page, refs);
 }
 
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount