
[RFC,13/37] mm: implement speculative handling in __handle_mm_fault().

Message ID 20210407014502.24091-14-michel@lespinasse.org (mailing list archive)
State New, archived
Series [RFC,01/37] mmap locking API: mmap_lock_is_contended returns a bool

Commit Message

Michel Lespinasse April 7, 2021, 1:44 a.m. UTC
The page table tree is walked with local irqs disabled, which prevents
page table reclamation (similarly to what fast GUP does). The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wiring new page tables.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
---
 include/linux/mm.h |  4 +++
 mm/memory.c        | 77 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 79 insertions(+), 2 deletions(-)
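
(For illustration only: the sketch below shows the general shape of a page table walk done with local irqs disabled, as the description above refers to. The function name and structure are assumptions made up for this sketch, not the actual code of the patch; as the discussion below points out, disabling irqs alone is only sufficient on architectures whose TLB flushes use IPI broadcast.)

/*
 * Illustrative sketch only, not the actual patch code: a speculative
 * pmd-level walk done with local irqs disabled, bailing out (so the
 * caller retries the fault under mmap_lock) for missing page tables
 * and for huge pages.
 */
static vm_fault_t speculative_pmd_walk_sketch(struct mm_struct *mm,
					       unsigned long address)
{
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd, pmdval;
	vm_fault_t ret = VM_FAULT_RETRY;	/* fall back to the non-speculative path */

	/* On IPI-flush architectures, this blocks page table freeing. */
	local_irq_disable();

	pgd = pgd_offset(mm, address);
	if (pgd_none(READ_ONCE(*pgd)))
		goto out;	/* never wire new page tables speculatively */
	p4d = p4d_offset(pgd, address);
	if (p4d_none(READ_ONCE(*p4d)))
		goto out;
	pud = pud_offset(p4d, address);
	if (pud_none(READ_ONCE(*pud)))
		goto out;
	pmd = pmd_offset(pud, address);
	pmdval = READ_ONCE(*pmd);
	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
		goto out;	/* huge pages are not handled speculatively */

	/* ... map the pte and handle the small-page fault here ... */
	ret = 0;
out:
	local_irq_enable();
	return ret;
}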

Comments

Andy Lutomirski April 7, 2021, 3:36 p.m. UTC | #1
On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> The page table tree is walked with local irqs disabled, which prevents
> page table reclamation (similarly to what fast GUP does). The logic is
> otherwise similar to the non-speculative path, but with additional
> restrictions: in the speculative path, we do not handle huge pages or
> wiring new pages tables.

Not on most architectures.  Quoting the actual comment in mm/gup.c:

>  * Before activating this code, please be aware that the following assumptions
>  * are currently made:
>  *
>  *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
>  *  free pages containing page tables or TLB flushing requires IPI broadcast.

On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
assumption that it is safe to dereference a pointer in a page table just
because irqs are off.  You need RCU protection, too.

You have the same error in the cover letter.

--Andy
Michel Lespinasse April 28, 2021, 2:58 p.m. UTC | #2
On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > The page table tree is walked with local irqs disabled, which prevents
> > page table reclamation (similarly to what fast GUP does). The logic is
> > otherwise similar to the non-speculative path, but with additional
> > restrictions: in the speculative path, we do not handle huge pages or
> > wiring new pages tables.
> 
> Not on most architectures.  Quoting the actual comment in mm/gup.c:
> 
> >  * Before activating this code, please be aware that the following assumptions
> >  * are currently made:
> >  *
> >  *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> >  *  free pages containing page tables or TLB flushing requires IPI broadcast.
> 
> On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> assumption that it is safe to dereference a pointer in a page table just
> because irqs are off.  You need RCU protection, too.
> 
> You have the same error in the cover letter.

Hi Andy,

Thanks for your comment. At first I thought it did not matter, because we
only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
with paravirt. So I took another look at fast GUP to make sure I
actually understand it.

This brings a question about lockless_pages_from_mm() - I see it
disabling interrupts, which it explains is necessary for disabling THP
splitting IPIs, but I do not see it taking an RCU read lock as would
be necessary for preventing page table freeing on
MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
indirectly takes an rcu read lock somehow ? I think this is something
I should also mention in my explanation, and I have not seen a good
description of this on the fast GUP side...

Thanks,

--
Michel "walken" Lespinasse
Andy Lutomirski April 28, 2021, 3:13 p.m. UTC | #3
On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <michel@lespinasse.org> wrote:
>
> On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > The page table tree is walked with local irqs disabled, which prevents
> > > page table reclamation (similarly to what fast GUP does). The logic is
> > > otherwise similar to the non-speculative path, but with additional
> > > restrictions: in the speculative path, we do not handle huge pages or
> > > wiring new pages tables.
> >
> > Not on most architectures.  Quoting the actual comment in mm/gup.c:
> >
> > >  * Before activating this code, please be aware that the following assumptions
> > >  * are currently made:
> > >  *
> > >  *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > >  *  free pages containing page tables or TLB flushing requires IPI broadcast.
> >
> > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > assumption that it is safe to dereference a pointer in a page table just
> > because irqs are off.  You need RCU protection, too.
> >
> > You have the same error in the cover letter.
>
> Hi Andy,
>
> Thanks for your comment. At first I thought did not matter, because we
> only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> with paravirt. So I took another look at fast GUP to make sure I
> actually understand it.
>
> This brings a question about lockless_pages_from_mm() - I see it
> disabling interrupts, which it explains is necessary for disabling THP
> splitting IPIs, but I do not see it taking an RCU read lock as would
> be necessary for preventing paga table freeing on
> MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> indirectly takes an rcu read lock somehow ? I think this is something
> I should also mention in my explanation, and I have not seen a good
> description of this on the fast GUP side...

Sounds like a bug!  That being said, based on my extremely limited
understanding of how the common RCU modes work, local_irq_save()
probably implies an RCU lock in at least some cases.  Hi Paul!

--Andy
Paul E. McKenney April 28, 2021, 4:11 p.m. UTC | #4
On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <michel@lespinasse.org> wrote:
> >
> > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > The page table tree is walked with local irqs disabled, which prevents
> > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > otherwise similar to the non-speculative path, but with additional
> > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > wiring new pages tables.
> > >
> > > Not on most architectures.  Quoting the actual comment in mm/gup.c:
> > >
> > > >  * Before activating this code, please be aware that the following assumptions
> > > >  * are currently made:
> > > >  *
> > > >  *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > >  *  free pages containing page tables or TLB flushing requires IPI broadcast.
> > >
> > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > assumption that it is safe to dereference a pointer in a page table just
> > > because irqs are off.  You need RCU protection, too.
> > >
> > > You have the same error in the cover letter.
> >
> > Hi Andy,
> >
> > Thanks for your comment. At first I thought did not matter, because we
> > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > with paravirt. So I took another look at fast GUP to make sure I
> > actually understand it.
> >
> > This brings a question about lockless_pages_from_mm() - I see it
> > disabling interrupts, which it explains is necessary for disabling THP
> > splitting IPIs, but I do not see it taking an RCU read lock as would
> > be necessary for preventing paga table freeing on
> > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > indirectly takes an rcu read lock somehow ? I think this is something
> > I should also mention in my explanation, and I have not seen a good
> > description of this on the fast GUP side...
> 
> Sounds like a bug!  That being said, based on my extremely limited
> understanding of how the common RCU modes work, local_irq_save()
> probably implies an RCU lock in at least some cases.  Hi Paul!

In modern kernels, local_irq_save() does have RCU reader semantics,
meaning that synchronize_rcu() will wait for pre-existing irq-disabled
regions.  It will also wait for pre-existing bh-disable, preempt-disable,
and of course rcu_read_lock() sections of code.

But don't try this in older kernels, that is, in kernels in which
synchronize_sched() is still defined!

							Thanx, Paul
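
(For illustration only: a minimal, hypothetical example of the guarantee described above, with an irq-disabled region acting as an RCU reader that a later synchronize_rcu() waits for. The structure and names are invented for the example.)

struct foo {
	int data;
};

static struct foo __rcu *gp;

/* Reader: no rcu_read_lock(), only an irq-disabled region. */
static int reader_example(void)
{
	struct foo *p;
	unsigned long flags;
	int val = -1;

	local_irq_save(flags);
	/*
	 * On v5.0 and later, this irq-disabled region is itself an RCU
	 * read-side critical section, so the updater's synchronize_rcu()
	 * cannot return (and the old object cannot be freed) while we
	 * are dereferencing it here.
	 */
	p = rcu_dereference_check(gp, irqs_disabled());
	if (p)
		val = p->data;
	local_irq_restore(flags);
	return val;
}

/* Updater: publish a new object, wait for readers, then free the old one. */
static void updater_example(struct foo *newp)
{
	struct foo *old;

	old = rcu_replace_pointer(gp, newp, true);
	synchronize_rcu();	/* also waits for irq-disabled readers */
	kfree(old);
}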
Michel Lespinasse April 29, 2021, 12:02 a.m. UTC | #5
On Wed, Apr 28, 2021 at 09:11:08AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> > On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <michel@lespinasse.org> wrote:
> > >
> > > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > > The page table tree is walked with local irqs disabled, which prevents
> > > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > > otherwise similar to the non-speculative path, but with additional
> > > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > > wiring new pages tables.
> > > >
> > > > Not on most architectures.  Quoting the actual comment in mm/gup.c:
> > > >
> > > > >  * Before activating this code, please be aware that the following assumptions
> > > > >  * are currently made:
> > > > >  *
> > > > >  *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > > >  *  free pages containing page tables or TLB flushing requires IPI broadcast.
> > > >
> > > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > > assumption that it is safe to dereference a pointer in a page table just
> > > > because irqs are off.  You need RCU protection, too.
> > > >
> > > > You have the same error in the cover letter.
> > >
> > > Hi Andy,
> > >
> > > Thanks for your comment. At first I thought did not matter, because we
> > > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > > with paravirt. So I took another look at fast GUP to make sure I
> > > actually understand it.
> > >
> > > This brings a question about lockless_pages_from_mm() - I see it
> > > disabling interrupts, which it explains is necessary for disabling THP
> > > splitting IPIs, but I do not see it taking an RCU read lock as would
> > > be necessary for preventing paga table freeing on
> > > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > > indirectly takes an rcu read lock somehow ? I think this is something
> > > I should also mention in my explanation, and I have not seen a good
> > > description of this on the fast GUP side...
> > 
> > Sounds like a bug!  That being said, based on my extremely limited
> > understanding of how the common RCU modes work, local_irq_save()
> > probably implies an RCU lock in at least some cases.  Hi Paul!
> 
> In modern kernels, local_irq_save() does have RCU reader semantics,
> meaning that synchronize_rcu() will wait for pre-exiting irq-disabled
> regions.  It will also wait for pre-existing bh-disable, preempt-disable,
> and of course rcu_read_lock() sections of code.

Thanks Paul for confirming / clarifying this. BTW, it would be good to
add this to the rcu header files, just so people have something to
reference to when they depend on such behavior (like fast GUP
currently does).

Going back to my patch. I don't need to protect against THP splitting
here, as I'm only handling the small page case. So when
MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
using only an rcu read lock, instead of disabling interrupts which
implicitly creates the rcu read lock. I'm not sure which way to go -
fast GUP always disables interrupts regardless of the
MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
made for following the fast GUP steps rather than trying to be smarter.

Andy, do you have any opinion on this ? Or anyone else really ?

Thanks,

--
Michel "walken" Lespinasse
Andy Lutomirski April 29, 2021, 12:05 a.m. UTC | #6
On Wed, Apr 28, 2021 at 5:02 PM Michel Lespinasse <michel@lespinasse.org> wrote:
>
> On Wed, Apr 28, 2021 at 09:11:08AM -0700, Paul E. McKenney wrote:
> > On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> > > On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <michel@lespinasse.org> wrote:
> > > >
> > > > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > > > The page table tree is walked with local irqs disabled, which prevents
> > > > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > > > otherwise similar to the non-speculative path, but with additional
> > > > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > > > wiring new pages tables.
> > > > >
> > > > > Not on most architectures.  Quoting the actual comment in mm/gup.c:
> > > > >
> > > > > >  * Before activating this code, please be aware that the following assumptions
> > > > > >  * are currently made:
> > > > > >  *
> > > > > >  *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > > > >  *  free pages containing page tables or TLB flushing requires IPI broadcast.
> > > > >
> > > > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > > > assumption that it is safe to dereference a pointer in a page table just
> > > > > because irqs are off.  You need RCU protection, too.
> > > > >
> > > > > You have the same error in the cover letter.
> > > >
> > > > Hi Andy,
> > > >
> > > > Thanks for your comment. At first I thought did not matter, because we
> > > > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > > > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > > > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > > > with paravirt. So I took another look at fast GUP to make sure I
> > > > actually understand it.
> > > >
> > > > This brings a question about lockless_pages_from_mm() - I see it
> > > > disabling interrupts, which it explains is necessary for disabling THP
> > > > splitting IPIs, but I do not see it taking an RCU read lock as would
> > > > be necessary for preventing paga table freeing on
> > > > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > > > indirectly takes an rcu read lock somehow ? I think this is something
> > > > I should also mention in my explanation, and I have not seen a good
> > > > description of this on the fast GUP side...
> > >
> > > Sounds like a bug!  That being said, based on my extremely limited
> > > understanding of how the common RCU modes work, local_irq_save()
> > > probably implies an RCU lock in at least some cases.  Hi Paul!
> >
> > In modern kernels, local_irq_save() does have RCU reader semantics,
> > meaning that synchronize_rcu() will wait for pre-exiting irq-disabled
> > regions.  It will also wait for pre-existing bh-disable, preempt-disable,
> > and of course rcu_read_lock() sections of code.
>
> Thanks Paul for confirming / clarifying this. BTW, it would be good to
> add this to the rcu header files, just so people have something to
> reference to when they depend on such behavior (like fast GUP
> currently does).

Or, even better, fast GUP could add an explicit RCU read lock.

>
> Going back to my patch. I don't need to protect against THP splitting
> here, as I'm only handling the small page case. So when
> MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
> using only an rcu read lock, instead of disabling interrupts which
> implicitly creates the rcu read lock. I'm not sure which way to go -
> fast GUP always disables interrupts regardless of the
> MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
> made for following the fast GUP stes rather than trying to be smarter.

How about adding some little helpers:

lockless_page_walk_begin();

lockless_page_walk_end();

these turn into RCU read locks if MMU_GATHER_RCU_TABLE_FREE and into
irqsave otherwise.  And they're somewhat self-documenting.

--Andy
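
(For illustration only: a minimal sketch of what the helpers suggested above could look like. The names are the ones proposed in the message; the bodies are an assumption, not an existing kernel API.)

/* Sketch only: not an existing kernel API. */
static inline void lockless_page_walk_begin(unsigned long *flags)
{
	if (IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE))
		rcu_read_lock();		/* page tables are freed via RCU */
	else
		local_irq_save(*flags);		/* block the freeing IPI */
}

static inline void lockless_page_walk_end(unsigned long *flags)
{
	if (IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE))
		rcu_read_unlock();
	else
		local_irq_restore(*flags);
}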
Paul E. McKenney April 29, 2021, 3:52 p.m. UTC | #7
On Wed, Apr 28, 2021 at 05:02:25PM -0700, Michel Lespinasse wrote:
> On Wed, Apr 28, 2021 at 09:11:08AM -0700, Paul E. McKenney wrote:
> > On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> > > On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <michel@lespinasse.org> wrote:
> > > >
> > > > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > > > The page table tree is walked with local irqs disabled, which prevents
> > > > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > > > otherwise similar to the non-speculative path, but with additional
> > > > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > > > wiring new pages tables.
> > > > >
> > > > > Not on most architectures.  Quoting the actual comment in mm/gup.c:
> > > > >
> > > > > >  * Before activating this code, please be aware that the following assumptions
> > > > > >  * are currently made:
> > > > > >  *
> > > > > >  *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > > > >  *  free pages containing page tables or TLB flushing requires IPI broadcast.
> > > > >
> > > > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > > > assumption that it is safe to dereference a pointer in a page table just
> > > > > because irqs are off.  You need RCU protection, too.
> > > > >
> > > > > You have the same error in the cover letter.
> > > >
> > > > Hi Andy,
> > > >
> > > > Thanks for your comment. At first I thought did not matter, because we
> > > > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > > > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > > > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > > > with paravirt. So I took another look at fast GUP to make sure I
> > > > actually understand it.
> > > >
> > > > This brings a question about lockless_pages_from_mm() - I see it
> > > > disabling interrupts, which it explains is necessary for disabling THP
> > > > splitting IPIs, but I do not see it taking an RCU read lock as would
> > > > be necessary for preventing paga table freeing on
> > > > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > > > indirectly takes an rcu read lock somehow ? I think this is something
> > > > I should also mention in my explanation, and I have not seen a good
> > > > description of this on the fast GUP side...
> > > 
> > > Sounds like a bug!  That being said, based on my extremely limited
> > > understanding of how the common RCU modes work, local_irq_save()
> > > probably implies an RCU lock in at least some cases.  Hi Paul!
> > 
> > In modern kernels, local_irq_save() does have RCU reader semantics,
> > meaning that synchronize_rcu() will wait for pre-exiting irq-disabled
> > regions.  It will also wait for pre-existing bh-disable, preempt-disable,
> > and of course rcu_read_lock() sections of code.
> 
> Thanks Paul for confirming / clarifying this. BTW, it would be good to
> add this to the rcu header files, just so people have something to
> reference to when they depend on such behavior (like fast GUP
> currently does).

There is this in the synchronize_rcu() header block comment:

 * synchronize_rcu() was waiting.  RCU read-side critical sections are
 * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
 * In addition, regions of code across which interrupts, preemption, or
 * softirqs have been disabled also serve as RCU read-side critical
 * sections.  This includes hardware interrupt handlers, softirq handlers,
 * and NMI handlers.

I have pulled this into a separate paragraph to increase its visibility,
and will check out other locations in comments and documentation.

							Thanx, Paul

> Going back to my patch. I don't need to protect against THP splitting
> here, as I'm only handling the small page case. So when
> MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
> using only an rcu read lock, instead of disabling interrupts which
> implicitly creates the rcu read lock. I'm not sure which way to go -
> fast GUP always disables interrupts regardless of the
> MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
> made for following the fast GUP stes rather than trying to be smarter.
> 
> Andy, do you have any opinion on this ? Or anyone else really ?
> 
> Thanks,
> 
> --
> Michel "walken" Lespinasse
Matthew Wilcox April 29, 2021, 4:12 p.m. UTC | #8
On Wed, Apr 28, 2021 at 05:05:17PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 28, 2021 at 5:02 PM Michel Lespinasse <michel@lespinasse.org> wrote:
> > Thanks Paul for confirming / clarifying this. BTW, it would be good to
> > add this to the rcu header files, just so people have something to
> > reference to when they depend on such behavior (like fast GUP
> > currently does).
> 
> Or, even better, fast GUP could add an explicit RCU read lock.
> 
> >
> > Going back to my patch. I don't need to protect against THP splitting
> > here, as I'm only handling the small page case. So when
> > MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
> > using only an rcu read lock, instead of disabling interrupts which
> > implicitly creates the rcu read lock. I'm not sure which way to go -
> > fast GUP always disables interrupts regardless of the
> > MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
> > made for following the fast GUP stes rather than trying to be smarter.
> 
> How about adding some little helpers:
> 
> lockless_page_walk_begin();
> 
> lockless_page_walk_end();
> 
> these turn into RCU read locks if MMU_GATHER_RCU_TABLE_FREE and into
> irqsave otherwise.  And they're somewhat self-documenting.

One of the worst things we can do while holding a spinlock is take a
cache miss because we then delay for several thousand cycles to wait for
the cache line.  That gives every other CPU a really long opportunity
to slam into the spinlock and things go downhill fast at that point.
We've even seen patches to do things like read A, take lock L, then read
A to avoid the cache miss while holding the lock.

What sort of performance effect would it have to free page tables
under RCU for all architectures?  It's painful on s390 & powerpc because
different tables share the same struct page, but I have to believe that's
a solvable problem.
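
(For illustration only: a small sketch of the "read A, take lock L, then read A" trick mentioned above; the first read takes the cache miss before the lock is acquired, so the cache line is likely still hot while the lock is held. The names are invented for the example.)

struct counter {
	spinlock_t lock;
	long value;
};

static long read_lock_reread(struct counter *c)
{
	long v;

	/* Warm the cache line before taking the lock... */
	v = READ_ONCE(c->value);
	spin_lock(&c->lock);
	/* ...so this read under the lock is likely a cache hit. */
	v = c->value;
	spin_unlock(&c->lock);
	return v;
}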
Andy Lutomirski April 29, 2021, 6:04 p.m. UTC | #9
> On Apr 29, 2021, at 9:12 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Wed, Apr 28, 2021 at 05:05:17PM -0700, Andy Lutomirski wrote:
>>> On Wed, Apr 28, 2021 at 5:02 PM Michel Lespinasse <michel@lespinasse.org> wrote:
>>> Thanks Paul for confirming / clarifying this. BTW, it would be good to
>>> add this to the rcu header files, just so people have something to
>>> reference to when they depend on such behavior (like fast GUP
>>> currently does).
>> 
>> Or, even better, fast GUP could add an explicit RCU read lock.
>> 
>>> 
>>> Going back to my patch. I don't need to protect against THP splitting
>>> here, as I'm only handling the small page case. So when
>>> MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
>>> using only an rcu read lock, instead of disabling interrupts which
>>> implicitly creates the rcu read lock. I'm not sure which way to go -
>>> fast GUP always disables interrupts regardless of the
>>> MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
>>> made for following the fast GUP stes rather than trying to be smarter.
>> 
>> How about adding some little helpers:
>> 
>> lockless_page_walk_begin();
>> 
>> lockless_page_walk_end();
>> 
>> these turn into RCU read locks if MMU_GATHER_RCU_TABLE_FREE and into
>> irqsave otherwise.  And they're somewhat self-documenting.
> 
> One of the worst things we can do while holding a spinlock is take a
> cache miss because we then delay for several thousand cycles to wait for
> the cache line.  That gives every other CPU a really long opportunity
> to slam into the spinlock and things go downhill fast at that point.
> We've even seen patches to do things like read A, take lock L, then read
> A to avoid the cache miss while holding the lock.
> 
> What sort of performance effect would it have to free page tables
> under RCU for all architectures?  It's painful on s390 & powerpc because
> different tables share the same struct page, but I have to believe that's
> a solvable problem.

The IPI locking mechanism is entirely useless on any architecture that wants to do paravirt shootdowns, so this seems like a good strategy to me.
Paul E. McKenney April 29, 2021, 6:34 p.m. UTC | #10
On Thu, Apr 29, 2021 at 08:52:50AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 28, 2021 at 05:02:25PM -0700, Michel Lespinasse wrote:
> > On Wed, Apr 28, 2021 at 09:11:08AM -0700, Paul E. McKenney wrote:
> > > On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> > > > On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <michel@lespinasse.org> wrote:
> > > > >
> > > > > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > > > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > > > > The page table tree is walked with local irqs disabled, which prevents
> > > > > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > > > > otherwise similar to the non-speculative path, but with additional
> > > > > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > > > > wiring new pages tables.
> > > > > >
> > > > > > Not on most architectures.  Quoting the actual comment in mm/gup.c:
> > > > > >
> > > > > > >  * Before activating this code, please be aware that the following assumptions
> > > > > > >  * are currently made:
> > > > > > >  *
> > > > > > >  *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > > > > >  *  free pages containing page tables or TLB flushing requires IPI broadcast.
> > > > > >
> > > > > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > > > > assumption that it is safe to dereference a pointer in a page table just
> > > > > > because irqs are off.  You need RCU protection, too.
> > > > > >
> > > > > > You have the same error in the cover letter.
> > > > >
> > > > > Hi Andy,
> > > > >
> > > > > Thanks for your comment. At first I thought did not matter, because we
> > > > > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > > > > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > > > > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > > > > with paravirt. So I took another look at fast GUP to make sure I
> > > > > actually understand it.
> > > > >
> > > > > This brings a question about lockless_pages_from_mm() - I see it
> > > > > disabling interrupts, which it explains is necessary for disabling THP
> > > > > splitting IPIs, but I do not see it taking an RCU read lock as would
> > > > > be necessary for preventing paga table freeing on
> > > > > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > > > > indirectly takes an rcu read lock somehow ? I think this is something
> > > > > I should also mention in my explanation, and I have not seen a good
> > > > > description of this on the fast GUP side...
> > > > 
> > > > Sounds like a bug!  That being said, based on my extremely limited
> > > > understanding of how the common RCU modes work, local_irq_save()
> > > > probably implies an RCU lock in at least some cases.  Hi Paul!
> > > 
> > > In modern kernels, local_irq_save() does have RCU reader semantics,
> > > meaning that synchronize_rcu() will wait for pre-exiting irq-disabled
> > > regions.  It will also wait for pre-existing bh-disable, preempt-disable,
> > > and of course rcu_read_lock() sections of code.
> > 
> > Thanks Paul for confirming / clarifying this. BTW, it would be good to
> > add this to the rcu header files, just so people have something to
> > reference to when they depend on such behavior (like fast GUP
> > currently does).
> 
> There is this in the synchronize_rcu() header block comment:
> 
>  * synchronize_rcu() was waiting.  RCU read-side critical sections are
>  * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
>  * In addition, regions of code across which interrupts, preemption, or
>  * softirqs have been disabled also serve as RCU read-side critical
>  * sections.  This includes hardware interrupt handlers, softirq handlers,
>  * and NMI handlers.
> 
> I have pulled this into a separate paragraph to increase its visibility,
> and will check out other locations in comments and documentation.

Ditto for call_rcu() and the separate paragraph.

The rcu_read_lock_bh() and rcu_read_lock_sched() header comments noted
that these act as RCU read-side critical sections, but I added similar
verbiage to rcu_dereference_bh_check() and rcu_dereference_sched_check().

Please see below for the resulting commit.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 97262c64c2cf807bf06825e454c4bedd228fadfb
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu Apr 29 11:18:01 2021 -0700

    rcu: Improve comments describing RCU read-side critical sections
    
    There are a number of places that call out the fact that preempt-disable
    regions of code now act as RCU read-side critical sections, where
    preempt-disable regions of code include irq-disable regions of code,
    bh-disable regions of code, hardirq handlers, and NMI handlers.  However,
    someone relying solely on (for example) the call_rcu() header comment
    might well have no idea that preempt-disable regions of code have RCU
    semantics.
    
    This commit therefore updates the header comments for
    call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
    rcu_dereference_sched_check() to call out these new(ish) forms of RCU
    readers.
    
    Reported-by: Michel Lespinasse <michel@lespinasse.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index a10480f2b4ef..c01b04ad64c4 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -532,7 +532,10 @@ do {									      \
  * @p: The pointer to read, prior to dereferencing
  * @c: The conditions under which the dereference will take place
  *
- * This is the RCU-bh counterpart to rcu_dereference_check().
+ * This is the RCU-bh counterpart to rcu_dereference_check().  However,
+ * please note that in recent kernels, synchronize_rcu() waits for
+ * local_bh_disable() regions of code in addition to regions of code
+ * demarked by rcu_read_lock() and rcu_read_unlock().
  */
 #define rcu_dereference_bh_check(p, c) \
 	__rcu_dereference_check((p), (c) || rcu_read_lock_bh_held(), __rcu)
@@ -543,6 +546,9 @@ do {									      \
  * @c: The conditions under which the dereference will take place
  *
  * This is the RCU-sched counterpart to rcu_dereference_check().
+ * However, please note that in recent kernels, synchronize_rcu() waits
+ * for preemption-disabled regions of code in addition to regions of code
+ * demarked by rcu_read_lock() and rcu_read_unlock().
  */
 #define rcu_dereference_sched_check(p, c) \
 	__rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
@@ -634,6 +640,12 @@ do {									      \
  * sections, invocation of the corresponding RCU callback is deferred
  * until after the all the other CPUs exit their critical sections.
  *
+ * In recent kernels, synchronize_rcu() and call_rcu() also wait for
+ * regions of code with preemption disabled, including regions of code
+ * with interrupts or softirqs disabled.  If your kernel is old enough
+ * for synchronize_sched() to be defined, only code enclosed within
+ * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
+ *
  * Note, however, that RCU callbacks are permitted to run concurrently
  * with new RCU read-side critical sections.  One way that this can happen
  * is via the following sequence of events: (1) CPU 0 enters an RCU
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9ea1d4eef1ad..0e76bf47d92b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3071,12 +3071,13 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
  * period elapses, in other words after all pre-existing RCU read-side
  * critical sections have completed.  However, the callback function
  * might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked.  RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(), and
- * may be nested.  In addition, regions of code across which interrupts,
- * preemption, or softirqs have been disabled also serve as RCU read-side
- * critical sections.  This includes hardware interrupt handlers, softirq
- * handlers, and NMI handlers.
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock() and
+ * rcu_read_unlock(), and may be nested.  In addition, regions of code
+ * across which interrupts, preemption, or softirqs have been disabled
+ * also serve as RCU read-side critical sections.  This includes hardware
+ * interrupt handlers, softirq handlers, and NMI handlers.
  *
  * Note that all CPUs must agree that the grace period extended beyond
  * all pre-existing RCU read-side critical section.  On systems with more
@@ -3771,12 +3772,13 @@ static int rcu_blocking_is_gp(void)
  * read-side critical sections have completed.  Note, however, that
  * upon return from synchronize_rcu(), the caller might well be executing
  * concurrently with new RCU read-side critical sections that began while
- * synchronize_rcu() was waiting.  RCU read-side critical sections are
- * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
- * In addition, regions of code across which interrupts, preemption, or
- * softirqs have been disabled also serve as RCU read-side critical
- * sections.  This includes hardware interrupt handlers, softirq handlers,
- * and NMI handlers.
+ * synchronize_rcu() was waiting.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock() and
+ * rcu_read_unlock(), and may be nested.  In addition, regions of code
+ * across which interrupts, preemption, or softirqs have been disabled
+ * also serve as RCU read-side critical sections.  This includes hardware
+ * interrupt handlers, softirq handlers, and NMI handlers.
  *
  * Note that this guarantee implies further memory-ordering guarantees.
  * On systems with more than one CPU, when synchronize_rcu() returns,
Matthew Wilcox April 29, 2021, 6:49 p.m. UTC | #11
On Thu, Apr 29, 2021 at 11:34:12AM -0700, Paul E. McKenney wrote:
> +++ b/include/linux/rcupdate.h
> @@ -532,7 +532,10 @@ do {									      \
>   * @p: The pointer to read, prior to dereferencing
>   * @c: The conditions under which the dereference will take place
>   *
> - * This is the RCU-bh counterpart to rcu_dereference_check().
> + * This is the RCU-bh counterpart to rcu_dereference_check().  However,
> + * please note that in recent kernels, synchronize_rcu() waits for
> + * local_bh_disable() regions of code in addition to regions of code
> + * demarked by rcu_read_lock() and rcu_read_unlock().
>   */

I've been trying to get rid of "please note that" in my own documentation
recently.  It doesn't add any value.  Also, "recent kernels" is going to
go stale quickly, "Since v5.8" (or whatever) is good because it lets us
know in ten years that we can just delete the reference.

So I'd make this:

 * This is the RCU-bh equivalent of rcu_dereference_check().  Since v5.8,
 * synchronize_rcu() waits for code with bottom halves disabled as well
 * as code between rcu_read_lock() and rcu_read_unlock().
Michel Lespinasse April 29, 2021, 7:14 p.m. UTC | #12
On Thu, Apr 29, 2021 at 05:12:34PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 28, 2021 at 05:05:17PM -0700, Andy Lutomirski wrote:
> > On Wed, Apr 28, 2021 at 5:02 PM Michel Lespinasse <michel@lespinasse.org> wrote:
> > > Thanks Paul for confirming / clarifying this. BTW, it would be good to
> > > add this to the rcu header files, just so people have something to
> > > reference to when they depend on such behavior (like fast GUP
> > > currently does).
> > 
> > Or, even better, fast GUP could add an explicit RCU read lock.
> > 
> > >
> > > Going back to my patch. I don't need to protect against THP splitting
> > > here, as I'm only handling the small page case. So when
> > > MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
> > > using only an rcu read lock, instead of disabling interrupts which
> > > implicitly creates the rcu read lock. I'm not sure which way to go -
> > > fast GUP always disables interrupts regardless of the
> > > MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
> > > made for following the fast GUP stes rather than trying to be smarter.
> > 
> > How about adding some little helpers:
> > 
> > lockless_page_walk_begin();
> > 
> > lockless_page_walk_end();
> > 
> > these turn into RCU read locks if MMU_GATHER_RCU_TABLE_FREE and into
> > irqsave otherwise.  And they're somewhat self-documenting.
> 
> One of the worst things we can do while holding a spinlock is take a
> cache miss because we then delay for several thousand cycles to wait for
> the cache line.  That gives every other CPU a really long opportunity
> to slam into the spinlock and things go downhill fast at that point.
> We've even seen patches to do things like read A, take lock L, then read
> A to avoid the cache miss while holding the lock.

I understand the effect you are describing, but I do not see how it
applies here - what cacheline are we likely to miss on when using
local_irq_disable() that we wouldn't touch if using rcu_read_lock() ?

> What sort of performance effect would it have to free page tables
> under RCU for all architectures?  It's painful on s390 & powerpc because
> different tables share the same struct page, but I have to believe that's
> a solvable problem.

I agree using RCU to free page tables would be a good thing to try.
I am afraid of adding that to this patchset though, as it seems
somewhat unrelated and adds risk. IMO we are most likely to find
justification for pushing this if/when we try accessing remote mm's without
taking the mmap lock, since disabling IPIs clearly wouldn't work there.

--
Michel "walken" Lespinasse
Matthew Wilcox April 29, 2021, 7:34 p.m. UTC | #13
On Thu, Apr 29, 2021 at 12:14:28PM -0700, Michel Lespinasse wrote:
> On Thu, Apr 29, 2021 at 05:12:34PM +0100, Matthew Wilcox wrote:
> > One of the worst things we can do while holding a spinlock is take a
> > cache miss because we then delay for several thousand cycles to wait for
> > the cache line.  That gives every other CPU a really long opportunity
> > to slam into the spinlock and things go downhill fast at that point.
> > We've even seen patches to do things like read A, take lock L, then read
> > A to avoid the cache miss while holding the lock.
> 
> I understand the effect your are describing, but I do not see how it
> applies here - what cacheline are we likely to miss on when using
> local_irq_disable() that we wouldn't touch if using rcu_read_lock() ?

It's the same cache lines in both cases.  The difference is that in one
case we have interrupts disabled (and a spinlock held?  i wasn't clear
on that) and in the other case, we just have preemption disabled.

> > What sort of performance effect would it have to free page tables
> > under RCU for all architectures?  It's painful on s390 & powerpc because
> > different tables share the same struct page, but I have to believe that's
> > a solvable problem.
> 
> I agree using RCU to free page tables would be a good thing to try.
> I am afraid of adding that to this patchset though, as it seems
> somewhate unrelated and adds risk. IMO we are most likely to find
> justification for pushing this if/when we try accessing remote mm's without
> taking the mmap lock, since disabling IPIs clearly wouldn't work there.

I think that needs to happen _before_ this patchset.  Creating a mess and
then trying to clean it up later isn't a great way to do development.
Michel Lespinasse April 29, 2021, 9:17 p.m. UTC | #14
On Thu, Apr 29, 2021 at 11:34:12AM -0700, Paul E. McKenney wrote:
> ------------------------------------------------------------------------
> 
> commit 97262c64c2cf807bf06825e454c4bedd228fadfb
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Thu Apr 29 11:18:01 2021 -0700
> 
>     rcu: Improve comments describing RCU read-side critical sections
>     
>     There are a number of places that call out the fact that preempt-disable
>     regions of code now act as RCU read-side critical sections, where
>     preempt-disable regions of code include irq-disable regions of code,
>     bh-disable regions of code, hardirq handlers, and NMI handlers.  However,
>     someone relying solely on (for example) the call_rcu() header comment
>     might well have no idea that preempt-disable regions of code have RCU
>     semantics.
>     
>     This commit therefore updates the header comments for
>     call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
>     rcu_dereference_sched_check() to call out these new(ish) forms of RCU
>     readers.
>     
>     Reported-by: Michel Lespinasse <michel@lespinasse.org>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index a10480f2b4ef..c01b04ad64c4 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -532,7 +532,10 @@ do {									      \
>   * @p: The pointer to read, prior to dereferencing
>   * @c: The conditions under which the dereference will take place
>   *
> - * This is the RCU-bh counterpart to rcu_dereference_check().
> + * This is the RCU-bh counterpart to rcu_dereference_check().  However,
> + * please note that in recent kernels, synchronize_rcu() waits for
> + * local_bh_disable() regions of code in addition to regions of code
> + * demarked by rcu_read_lock() and rcu_read_unlock().

Two things:
- "recent kernels" could be clarified, as Matthew pointed out
- The above does not make it 100% clear whether call_rcu() also waits
  for local_bh_disable() regions of code (you did clarify this in tree.c,
  but I think it's better to have that here as well)

>   */
>  #define rcu_dereference_bh_check(p, c) \
>  	__rcu_dereference_check((p), (c) || rcu_read_lock_bh_held(), __rcu)
> @@ -543,6 +546,9 @@ do {									      \
>   * @c: The conditions under which the dereference will take place
>   *
>   * This is the RCU-sched counterpart to rcu_dereference_check().
> + * However, please note that in recent kernels, synchronize_rcu() waits
> + * for preemption-disabled regions of code in addition to regions of code
> + * demarked by rcu_read_lock() and rcu_read_unlock().

Same comments regarding "recent kernels" and call_rcu() here.

>   */
>  #define rcu_dereference_sched_check(p, c) \
>  	__rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
> @@ -634,6 +640,12 @@ do {									      \
>   * sections, invocation of the corresponding RCU callback is deferred
>   * until after the all the other CPUs exit their critical sections.
>   *
> + * In recent kernels, synchronize_rcu() and call_rcu() also wait for
> + * regions of code with preemption disabled, including regions of code
> + * with interrupts or softirqs disabled.  If your kernel is old enough
> + * for synchronize_sched() to be defined, only code enclosed within
> + * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
> + *

Thanks, this is the quote I was looking for, and also I think it's
important for it to be in rcupdate.h rather than any .c implementation
(I think it's more natural to look at headers for this kind of stuff).

Same comment regarding "old enough" / "recent kernels" though.

>   * Note, however, that RCU callbacks are permitted to run concurrently
>   * with new RCU read-side critical sections.  One way that this can happen
>   * is via the following sequence of events: (1) CPU 0 enters an RCU
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c

The tree.c changes look fine to me.

Thanks a lot for looking into this !

--
Michel "walken" Lespinasse
Michel Lespinasse April 29, 2021, 11:56 p.m. UTC | #15
On Thu, Apr 29, 2021 at 08:34:56PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 29, 2021 at 12:14:28PM -0700, Michel Lespinasse wrote:
> > On Thu, Apr 29, 2021 at 05:12:34PM +0100, Matthew Wilcox wrote:
> > > One of the worst things we can do while holding a spinlock is take a
> > > cache miss because we then delay for several thousand cycles to wait for
> > > the cache line.  That gives every other CPU a really long opportunity
> > > to slam into the spinlock and things go downhill fast at that point.
> > > We've even seen patches to do things like read A, take lock L, then read
> > > A to avoid the cache miss while holding the lock.
> > 
> > I understand the effect your are describing, but I do not see how it
> > applies here - what cacheline are we likely to miss on when using
> > local_irq_disable() that we wouldn't touch if using rcu_read_lock() ?
> 
> It's the same cache lines in both cases.  The difference is that in one
> case we have interrupts disabled (and a spinlock held?  i wasn't clear
> on that) and in the other case, we just have preemption disabled.

To add some context - the problem we are trying to solve here (and a
different instance of it in the next commit) is that we are trying to
map and/or lock the page table, but need to prevent it from being
freed while we are trying to do so. Disabling interrupts or taking an
rcu read lock are two mechanisms for preventing that race, but the
structures accessed are the same in either case.

> > > What sort of performance effect would it have to free page tables
> > > under RCU for all architectures?  It's painful on s390 & powerpc because
> > > different tables share the same struct page, but I have to believe that's
> > > a solvable problem.
> > 
> > I agree using RCU to free page tables would be a good thing to try.
> > I am afraid of adding that to this patchset though, as it seems
> > somewhate unrelated and adds risk. IMO we are most likely to find
> > justification for pushing this if/when we try accessing remote mm's without
> > taking the mmap lock, since disabling IPIs clearly wouldn't work there.
> 
> I think that needs to happen _before_ this patchset.  Creating a mess and
> then trying to clean it up later isn't a great way to do development.

Agree we don't want to create a mess... but I see that as an argument for
not hastily changing the way page tables are reclaimed...

--
Michel "walken" Lespinasse
Paul E. McKenney May 3, 2021, 3:14 a.m. UTC | #16
On Thu, Apr 29, 2021 at 07:49:08PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 29, 2021 at 11:34:12AM -0700, Paul E. McKenney wrote:
> > +++ b/include/linux/rcupdate.h
> > @@ -532,7 +532,10 @@ do {									      \
> >   * @p: The pointer to read, prior to dereferencing
> >   * @c: The conditions under which the dereference will take place
> >   *
> > - * This is the RCU-bh counterpart to rcu_dereference_check().
> > + * This is the RCU-bh counterpart to rcu_dereference_check().  However,
> > + * please note that in recent kernels, synchronize_rcu() waits for
> > + * local_bh_disable() regions of code in addition to regions of code
> > + * demarked by rcu_read_lock() and rcu_read_unlock().
> >   */
> 
> I've been trying to get rid of "please note that" in my own documentation
> recently.  It doesn't add any value.  Also, "recent kernels" is going to
> go stale quickly, "Since v5.8" (or whatever) is good because it lets us
> know in ten years that we can just delete the reference.
> 
> So I'd make this:
> 
>  * This is the RCU-bh equivalent of rcu_dereference_check().  Since v5.8,
>  * synchronize_rcu() waits for code with bottom halves disabled as well
>  * as code between rcu_read_lock() and rcu_read_unlock().

Normally, I would be right there with you on the "less is more"
approach to writing.  But in this particular case:

1.	I added comments to rcu_read_lock_bh(), rcu_read_lock_sched(),
	call_rcu(), and synchronize_rcu().

2.	I included a section entitled "RCU flavor consolidation" in the
	2019 edition of the RCU API: https://lwn.net/Articles/777036/

3.	I presented on this topic at LCA:
	https://www.youtube.com/watch?v=hZX1aokdNiY

4.	I published a paper on this topic:
	https://dl.acm.org/doi/10.1145/3319647.3325836
	http://www.rdrop.com/~paulmck/RCU/rcu-exploit.2019.05.01a.pdf

All of these, even taken together, have proven to be insufficient.
This therefore does not appear to be the place to economize on words.  :-/

Your point on the version (v5.0, as it turns out) is right on, and I
will make that change.

							Thanx, Paul
Paul E. McKenney May 3, 2021, 3:40 a.m. UTC | #17
On Thu, Apr 29, 2021 at 02:17:58PM -0700, Michel Lespinasse wrote:
> On Thu, Apr 29, 2021 at 11:34:12AM -0700, Paul E. McKenney wrote:
> > ------------------------------------------------------------------------
> > 
> > commit 97262c64c2cf807bf06825e454c4bedd228fadfb
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Thu Apr 29 11:18:01 2021 -0700
> > 
> >     rcu: Improve comments describing RCU read-side critical sections
> >     
> >     There are a number of places that call out the fact that preempt-disable
> >     regions of code now act as RCU read-side critical sections, where
> >     preempt-disable regions of code include irq-disable regions of code,
> >     bh-disable regions of code, hardirq handlers, and NMI handlers.  However,
> >     someone relying solely on (for example) the call_rcu() header comment
> >     might well have no idea that preempt-disable regions of code have RCU
> >     semantics.
> >     
> >     This commit therefore updates the header comments for
> >     call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
> >     rcu_dereference_sched_check() to call out these new(ish) forms of RCU
> >     readers.
> >     
> >     Reported-by: Michel Lespinasse <michel@lespinasse.org>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index a10480f2b4ef..c01b04ad64c4 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -532,7 +532,10 @@ do {									      \
> >   * @p: The pointer to read, prior to dereferencing
> >   * @c: The conditions under which the dereference will take place
> >   *
> > - * This is the RCU-bh counterpart to rcu_dereference_check().
> > + * This is the RCU-bh counterpart to rcu_dereference_check().  However,
> > + * please note that in recent kernels, synchronize_rcu() waits for
> > + * local_bh_disable() regions of code in addition to regions of code
> > + * demarked by rcu_read_lock() and rcu_read_unlock().
> 
> Two things:
> - "recent kernels" could be clarified, as Matthew pointed out
> - The above is not 100% clear if call_rcu() also waits for
>   local_bh_disable() regions of code ? (you did clarify this in tree.c
>   but I think it's better to have that here as well)

Good points, I updated both.

> >   */
> >  #define rcu_dereference_bh_check(p, c) \
> >  	__rcu_dereference_check((p), (c) || rcu_read_lock_bh_held(), __rcu)
> > @@ -543,6 +546,9 @@ do {									      \
> >   * @c: The conditions under which the dereference will take place
> >   *
> >   * This is the RCU-sched counterpart to rcu_dereference_check().
> > + * However, please note that in recent kernels, synchronize_rcu() waits
> > + * for preemption-disabled regions of code in addition to regions of code
> > + * demarked by rcu_read_lock() and rcu_read_unlock().
> 
> Same comments regarding "recent kernels" and call_rcu() here.

And here as well.

> >   */
> >  #define rcu_dereference_sched_check(p, c) \
> >  	__rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
> > @@ -634,6 +640,12 @@ do {									      \
> >   * sections, invocation of the corresponding RCU callback is deferred
> >   * until after the all the other CPUs exit their critical sections.
> >   *
> > + * In recent kernels, synchronize_rcu() and call_rcu() also wait for
> > + * regions of code with preemption disabled, including regions of code
> > + * with interrupts or softirqs disabled.  If your kernel is old enough
> > + * for synchronize_sched() to be defined, only code enclosed within
> > + * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
> > + *
> 
> Thanks, this is the quote I was looking for, and also I think it's
> important for it to be in rcupdate.h rather than any .c implementation
> (I think it's more natural to look at headers for this kind of stuff).
> 
> Same comment regarding "old enough" / "recent kernels" though.
> 
> >   * Note, however, that RCU callbacks are permitted to run concurrently
> >   * with new RCU read-side critical sections.  One way that this can happen
> >   * is via the following sequence of events: (1) CPU 0 enters an RCU
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> 
> The tree.c changes look fine to me.

I added the version here also.

> Thanks a lot for looking into this !

And here is the updated commit.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit cc5a0ad5aa52d26379d5cd04d0a8f0917caf7365
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu Apr 29 11:18:01 2021 -0700

    rcu: Improve comments describing RCU read-side critical sections
    
    There are a number of places that call out the fact that preempt-disable
    regions of code now act as RCU read-side critical sections, where
    preempt-disable regions of code include irq-disable regions of code,
    bh-disable regions of code, hardirq handlers, and NMI handlers.  However,
    someone relying solely on (for example) the call_rcu() header comment
    might well have no idea that preempt-disable regions of code have RCU
    semantics.
    
    This commit therefore updates the header comments for
    call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
    rcu_dereference_sched_check() to call out these new(ish) forms of RCU
    readers.
    
    Reported-by: Michel Lespinasse <michel@lespinasse.org>
    [ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ]
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index a10480f2b4ef..adc2043e92db 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -532,7 +532,12 @@ do {									      \
  * @p: The pointer to read, prior to dereferencing
  * @c: The conditions under which the dereference will take place
  *
- * This is the RCU-bh counterpart to rcu_dereference_check().
+ * This is the RCU-bh counterpart to rcu_dereference_check().  However,
+ * please note that starting in v5.0 kernels, vanilla RCU grace periods
+ * wait for local_bh_disable() regions of code in addition to regions of
+ * code demarked by rcu_read_lock() and rcu_read_unlock().  This means
+ * that synchronize_rcu(), call_rcu, and friends all take not only
+ * rcu_read_lock() but also rcu_read_lock_bh() into account.
  */
 #define rcu_dereference_bh_check(p, c) \
 	__rcu_dereference_check((p), (c) || rcu_read_lock_bh_held(), __rcu)
@@ -543,6 +548,11 @@ do {									      \
  * @c: The conditions under which the dereference will take place
  *
  * This is the RCU-sched counterpart to rcu_dereference_check().
+ * However, please note that starting in v5.0 kernels, vanilla RCU grace
+ * periods wait for preempt_disable() regions of code in addition to
+ * regions of code demarked by rcu_read_lock() and rcu_read_unlock().
+ * This means that synchronize_rcu(), call_rcu(), and friends all take not
+ * only rcu_read_lock() but also rcu_read_lock_sched() into account.
  */
 #define rcu_dereference_sched_check(p, c) \
 	__rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
@@ -634,6 +644,12 @@ do {									      \
  * sections, invocation of the corresponding RCU callback is deferred
  * until after the all the other CPUs exit their critical sections.
  *
+ * In recent kernels, synchronize_rcu() and call_rcu() also wait for
+ * regions of code with preemption disabled, including regions of code
+ * with interrupts or softirqs disabled.  If your kernel is old enough
+ * for synchronize_sched() to be defined, only code enclosed within
+ * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
+ *
  * Note, however, that RCU callbacks are permitted to run concurrently
  * with new RCU read-side critical sections.  One way that this can happen
  * is via the following sequence of events: (1) CPU 0 enters an RCU
@@ -728,9 +744,11 @@ static inline void rcu_read_unlock(void)
 /**
  * rcu_read_lock_bh() - mark the beginning of an RCU-bh critical section
  *
- * This is equivalent of rcu_read_lock(), but also disables softirqs.
- * Note that anything else that disables softirqs can also serve as
- * an RCU read-side critical section.
+ * This is equivalent to rcu_read_lock(), but also disables softirqs.
+ * Note that anything else that disables softirqs can also serve as an RCU
+ * read-side critical section.  However, please note that this equivalence
+ * applies only to v5.0 and later.  Before v5.0, rcu_read_lock() and
+ * rcu_read_lock_bh() were unrelated.
  *
  * Note that rcu_read_lock_bh() and the matching rcu_read_unlock_bh()
  * must occur in the same context, for example, it is illegal to invoke
@@ -763,9 +781,12 @@ static inline void rcu_read_unlock_bh(void)
 /**
  * rcu_read_lock_sched() - mark the beginning of a RCU-sched critical section
  *
- * This is equivalent of rcu_read_lock(), but disables preemption.
- * Read-side critical sections can also be introduced by anything else
- * that disables preemption, including local_irq_disable() and friends.
+ * This is equivalent to rcu_read_lock(), but also disables preemption.
+ * Read-side critical sections can also be introduced by anything else that
+ * disables preemption, including local_irq_disable() and friends.  However,
+ * please note that the equivalence to rcu_read_lock() applies only to
+ * v5.0 and later.  Before v5.0, rcu_read_lock() and rcu_read_lock_sched()
+ * were unrelated.
  *
  * Note that rcu_read_lock_sched() and the matching rcu_read_unlock_sched()
  * must occur in the same context, for example, it is illegal to invoke
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9ea1d4eef1ad..9089c23e80dc 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3071,12 +3071,14 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
  * period elapses, in other words after all pre-existing RCU read-side
  * critical sections have completed.  However, the callback function
  * might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked.  RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(), and
- * may be nested.  In addition, regions of code across which interrupts,
- * preemption, or softirqs have been disabled also serve as RCU read-side
- * critical sections.  This includes hardware interrupt handlers, softirq
- * handlers, and NMI handlers.
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested.  In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
+ * sections.  This includes hardware interrupt handlers, softirq handlers,
+ * and NMI handlers.
  *
  * Note that all CPUs must agree that the grace period extended beyond
  * all pre-existing RCU read-side critical section.  On systems with more
@@ -3771,10 +3773,12 @@ static int rcu_blocking_is_gp(void)
  * read-side critical sections have completed.  Note, however, that
  * upon return from synchronize_rcu(), the caller might well be executing
  * concurrently with new RCU read-side critical sections that began while
- * synchronize_rcu() was waiting.  RCU read-side critical sections are
- * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
- * In addition, regions of code across which interrupts, preemption, or
- * softirqs have been disabled also serve as RCU read-side critical
+ * synchronize_rcu() was waiting.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested.  In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
  * sections.  This includes hardware interrupt handlers, softirq handlers,
  * and NMI handlers.
  *
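
To make the documented guarantee concrete, here is a minimal sketch of the
pattern these comments describe; it is illustration only, not part of the
patch, and the struct and function names are invented.  The reader marks its
critical section with nothing but preempt_disable(), and in v5.0 and later
the updater's synchronize_rcu() is still guaranteed to wait for it:

#include <linux/preempt.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_cfg {
	int val;
};

static struct demo_cfg __rcu *demo_cfg_ptr;

/* Reader: a preempt-disable region, with no rcu_read_lock(). */
static int demo_read(void)
{
	struct demo_cfg *cfg;
	int val = -1;

	preempt_disable();
	cfg = rcu_dereference_sched(demo_cfg_ptr);
	if (cfg)
		val = cfg->val;
	preempt_enable();
	return val;
}

/* Updater: in v5.0+, synchronize_rcu() also waits out the reader above.
 * Updates are assumed to be serialized by the caller. */
static void demo_update(int val)
{
	struct demo_cfg *new_cfg, *old_cfg;

	new_cfg = kmalloc(sizeof(*new_cfg), GFP_KERNEL);
	if (!new_cfg)
		return;
	new_cfg->val = val;

	old_cfg = rcu_dereference_protected(demo_cfg_ptr, 1);
	rcu_assign_pointer(demo_cfg_ptr, new_cfg);
	synchronize_rcu();	/* waits for the preempt-disabled reader too */
	kfree(old_cfg);
}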
Michel Lespinasse May 3, 2021, 4:34 a.m. UTC | #18
On Sun, May 02, 2021 at 08:40:49PM -0700, Paul E. McKenney wrote:
> @@ -634,6 +644,12 @@ do {									      \
>   * sections, invocation of the corresponding RCU callback is deferred
>   * until after the all the other CPUs exit their critical sections.
>   *
> + * In recent kernels, synchronize_rcu() and call_rcu() also wait for
> + * regions of code with preemption disabled, including regions of code
> + * with interrupts or softirqs disabled.  If your kernel is old enough
> + * for synchronize_sched() to be defined, only code enclosed within
> + * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
> + *
>   * Note, however, that RCU callbacks are permitted to run concurrently
>   * with new RCU read-side critical sections.  One way that this can happen
>   * is via the following sequence of events: (1) CPU 0 enters an RCU

You still have "old enough" / "recent kernels" here. But maybe it's OK
given that you added relevant version numbers elsewhere.

Everything else looks great to me.

Thanks,

--
Michel "walken" Lespinasse
Paul E. McKenney May 3, 2021, 4:32 p.m. UTC | #19
On Sun, May 02, 2021 at 09:34:30PM -0700, Michel Lespinasse wrote:
> On Sun, May 02, 2021 at 08:40:49PM -0700, Paul E. McKenney wrote:
> > @@ -634,6 +644,12 @@ do {									      \
> >   * sections, invocation of the corresponding RCU callback is deferred
> >   * until after the all the other CPUs exit their critical sections.
> >   *
> > + * In recent kernels, synchronize_rcu() and call_rcu() also wait for
> > + * regions of code with preemption disabled, including regions of code
> > + * with interrupts or softirqs disabled.  If your kernel is old enough
> > + * for synchronize_sched() to be defined, only code enclosed within
> > + * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
> > + *
> >   * Note, however, that RCU callbacks are permitted to run concurrently
> >   * with new RCU read-side critical sections.  One way that this can happen
> >   * is via the following sequence of events: (1) CPU 0 enters an RCU
> 
> You still have "old enough" / "recent kernels" here. But maybe it's OK
> given that you added relevant version numbers elsewhere.
> 
> Everything else looks great to me.

Good point!  Like this?

							Thanx, Paul

------------------------------------------------------------------------

commit fd8393a2a8a5ffd25d0766abb262137c36bda9f3
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Mon May 3 09:32:14 2021 -0700

    fixup! rcu: Improve comments describing RCU read-side critical sections
    
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 901ab6fa252b..323954363389 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -644,11 +644,11 @@ do {									      \
  * sections, invocation of the corresponding RCU callback is deferred
  * until after the all the other CPUs exit their critical sections.
  *
- * In recent kernels, synchronize_rcu() and call_rcu() also wait for
- * regions of code with preemption disabled, including regions of code
- * with interrupts or softirqs disabled.  If your kernel is old enough
- * for synchronize_sched() to be defined, only code enclosed within
- * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
+ * In v5.0 and later kernels, synchronize_rcu() and call_rcu() also
+ * wait for regions of code with preemption disabled, including regions of
+ * code with interrupts or softirqs disabled.  In pre-v5.0 kernels, which
+ * define synchronize_sched(), only code enclosed within rcu_read_lock()
+ * and rcu_read_unlock() is guaranteed to be waited for.
  *
  * Note, however, that RCU callbacks are permitted to run concurrently
  * with new RCU read-side critical sections.  One way that this can happen
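
Stated as code, the version distinction amounts to which grace-period
primitive pairs with a preempt-disable reader.  A minimal sketch, with an
invented struct name, of what out-of-tree code spanning both eras might do:

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/version.h>

struct demo_entry {
	int data;
};

static void demo_retire(struct demo_entry *old)
{
#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 0, 0)
	/*
	 * Pre-v5.0: vanilla grace periods are not guaranteed to cover
	 * preempt-disabled readers, so use synchronize_sched() for them.
	 */
	synchronize_sched();
#else
	/*
	 * v5.0 and later: a single grace period covers rcu_read_lock(),
	 * preempt_disable(), local_irq_disable() and local_bh_disable()
	 * readers alike.
	 */
	synchronize_rcu();
#endif
	kfree(old);
}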
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5988e78e6ab..dee8a4833779 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -525,6 +525,10 @@  struct vm_fault {
 	};
 	unsigned int flags;		/* FAULT_FLAG_xxx flags
 					 * XXX: should really be 'const' */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	unsigned long seq;
+	pmd_t orig_pmd;
+#endif
 	pmd_t *pmd;			/* Pointer to pmd entry matching
 					 * the 'address' */
 	pud_t *pud;			/* Pointer to pud entry matching
diff --git a/mm/memory.c b/mm/memory.c
index 66e7a4554c54..a17704aac019 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4307,7 +4307,7 @@  static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
-		unsigned long address, unsigned int flags)
+		unsigned long address, unsigned int flags, unsigned long seq)
 {
 	struct vm_fault vmf = {
 		.vma = vma,
@@ -4322,6 +4322,79 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	p4d_t *p4d;
 	vm_fault_t ret;
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	if (flags & FAULT_FLAG_SPECULATIVE) {
+		pgd_t pgdval;
+		p4d_t p4dval;
+		pud_t pudval;
+
+		vmf.seq = seq;
+
+		local_irq_disable();
+		pgd = pgd_offset(mm, address);
+		pgdval = READ_ONCE(*pgd);
+		if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
+			goto spf_fail;
+
+		p4d = p4d_offset(pgd, address);
+		p4dval = READ_ONCE(*p4d);
+		if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
+			goto spf_fail;
+
+		vmf.pud = pud_offset(p4d, address);
+		pudval = READ_ONCE(*vmf.pud);
+		if (pud_none(pudval) || unlikely(pud_bad(pudval)) ||
+		    unlikely(pud_trans_huge(pudval)) ||
+		    unlikely(pud_devmap(pudval)))
+			goto spf_fail;
+
+		vmf.pmd = pmd_offset(vmf.pud, address);
+		vmf.orig_pmd = READ_ONCE(*vmf.pmd);
+
+		/*
+		 * pmd_none could mean that a hugepage collapse is in
+		 * progress behind our back, as collapse_huge_page()
+		 * marks it before invalidating the pte (which is done
+		 * once the IPI is caught by all CPUs and we have
+		 * interrupts disabled).  For this reason we cannot
+		 * handle THP in a speculative way, since we can't
+		 * safely identify an in-progress collapse operation
+		 * done behind our back on that PMD.
+		 */
+		if (unlikely(pmd_none(vmf.orig_pmd) ||
+			     is_swap_pmd(vmf.orig_pmd) ||
+			     pmd_trans_huge(vmf.orig_pmd) ||
+			     pmd_devmap(vmf.orig_pmd)))
+			goto spf_fail;
+
+		/*
+		 * The above does not allocate/instantiate page-tables because
+		 * doing so would lead to the possibility of instantiating
+		 * page-tables after free_pgtables() -- and consequently
+		 * leaking them.
+		 *
+		 * The result is that we take at least one non-speculative
+		 * fault per PMD in order to instantiate it.
+		 */
+
+		vmf.pte = pte_offset_map(vmf.pmd, address);
+		vmf.orig_pte = READ_ONCE(*vmf.pte);
+		barrier();
+		if (pte_none(vmf.orig_pte)) {
+			pte_unmap(vmf.pte);
+			vmf.pte = NULL;
+		}
+
+		local_irq_enable();
+
+		return handle_pte_fault(&vmf);
+
+spf_fail:
+		local_irq_enable();
+		return VM_FAULT_RETRY;
+	}
+#endif	/* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 	pgd = pgd_offset(mm, address);
 	p4d = p4d_alloc(mm, pgd, address);
 	if (!p4d)
@@ -4541,7 +4614,7 @@  vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
 	else
-		ret = __handle_mm_fault(vma, address, flags);
+		ret = __handle_mm_fault(vma, address, flags, seq);
 
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
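
To put the VM_FAULT_RETRY bail-outs above in context, here is a rough sketch
of the arch-side calling convention this is meant to support: try the fault
speculatively first, and fall back to the classical mmap_lock path when the
speculative walk refuses.  mmap_seq_read_start() and find_vma_speculative()
are hypothetical placeholders, and the do_handle_mm_fault() argument list is
assumed from the hunk above rather than quoted from the rest of the series:

#include <linux/mm.h>
#include <linux/mm_types.h>
#include <linux/sched.h>

static vm_fault_t demo_fault(struct mm_struct *mm, unsigned long address,
			     unsigned int flags, struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	unsigned long seq;
	vm_fault_t ret = VM_FAULT_RETRY;

	/* Hypothetical helpers: snapshot the mmap sequence count and look
	 * up the vma without taking mmap_lock. */
	seq = mmap_seq_read_start(mm);
	vma = find_vma_speculative(mm, address);
	if (vma)
		ret = do_handle_mm_fault(vma, address,
					 flags | FAULT_FLAG_SPECULATIVE,
					 seq, regs);
	if (!(ret & VM_FAULT_RETRY))
		return ret;

	/* Speculation refused or failed: retry under mmap_lock.  This is
	 * deliberately simplified (no stack expansion, no access checks). */
	mmap_read_lock(mm);
	vma = find_vma(mm, address);
	ret = vma ? handle_mm_fault(vma, address, flags, regs)
		  : VM_FAULT_SIGSEGV;
	mmap_read_unlock(mm);
	return ret;
}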