
[v2,2/2] mm: prevent gup_fast from racing with COW during fork

Message ID 2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com (mailing list archive)
State New, archived
Series: Add a seqcount between gup_fast and copy_page_range()

Commit Message

Jason Gunthorpe Oct. 30, 2020, 2:46 p.m. UTC
Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
fork() for ptes") pages under a FOLL_PIN will not be write protected
during COW for fork. This means that pages returned from
pin_user_pages(FOLL_WRITE) should not become write protected while the pin
is active.

However, there is a small race where get_user_pages_fast(FOLL_PIN) can
establish a FOLL_PIN at the same time copy_present_page() is write
protecting it:

        CPU 0                             CPU 1
   get_user_pages_fast()
    internal_get_user_pages_fast()
                                       copy_page_range()
                                         pte_alloc_map_lock()
                                           copy_present_page()
                                             atomic_read(has_pinned) == 0
					     page_maybe_dma_pinned() == false
     atomic_set(has_pinned, 1);
     gup_pgd_range()
      gup_pte_range()
       pte_t pte = gup_get_pte(ptep)
       pte_access_permitted(pte)
       try_grab_compound_head()
                                             pte = pte_wrprotect(pte)
	                                     set_pte_at();
                                         pte_unmap_unlock()
      // GUP now returns with a write protected page

The first attempt to resolve this by using the write protect caused
problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid
early COW write protect games during fork()")

Instead wrap copy_p4d_range() with the write side of a seqcount and check
the read side around gup_pgd_range(). If there is a collision then
get_user_pages_fast() fails and falls back to slow GUP.

Slow GUP is safe against this race because copy_page_range() is only
called while holding the exclusive side of the mmap_lock on the src
mm_struct.

Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 arch/x86/kernel/tboot.c    |  1 +
 drivers/firmware/efi/efi.c |  1 +
 include/linux/mm_types.h   |  7 +++++++
 kernel/fork.c              |  1 +
 mm/gup.c                   | 19 +++++++++++++++++++
 mm/init-mm.c               |  1 +
 mm/memory.c                | 10 +++++++++-
 7 files changed, 39 insertions(+), 1 deletion(-)
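
As a rough sketch of the fast-GUP read side described above (the mm/gup.c
hunk is not quoted here, so the function layout, helper placement, and the
nr_pinned bookkeeping below are illustrative rather than the literal patch;
only the seqcount and unpin calls are existing kernel APIs):

#include <linux/mm.h>
#include <linux/seqlock.h>

static int gup_fast_sketch(struct mm_struct *mm, unsigned int gup_flags,
			   struct page **pages)
{
	unsigned int seq = 0;
	int nr_pinned = 0;

	if (gup_flags & FOLL_PIN) {
		/*
		 * An odd count means copy_page_range() is currently write
		 * protecting ptes in this mm.  Don't spin; returning 0 makes
		 * the caller fall back to slow GUP, which serializes against
		 * fork() via the exclusive mmap_lock.
		 */
		seq = raw_read_seqcount(&mm->write_protect_seq);
		if (seq & 1)
			return 0;
	}

	/* ... lockless page table walk, i.e. gup_pgd_range() ... */

	if (gup_flags & FOLL_PIN) {
		/*
		 * A collision with fork() means some pins may now sit on
		 * write protected pages; drop them and fall back as well.
		 */
		if (read_seqcount_retry(&mm->write_protect_seq, seq)) {
			unpin_user_pages(pages, nr_pinned);
			return 0;
		}
	}
	return nr_pinned;
}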

Comments

Jan Kara Oct. 30, 2020, 4:51 p.m. UTC | #1
On Fri 30-10-20 11:46:21, Jason Gunthorpe wrote:
> Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
> fork() for ptes") pages under a FOLL_PIN will not be write protected
> during COW for fork. This means that pages returned from
> pin_user_pages(FOLL_WRITE) should not become write protected while the pin
> is active.
> 
> However, there is a small race where get_user_pages_fast(FOLL_PIN) can
> establish a FOLL_PIN at the same time copy_present_page() is write
> protecting it:
> 
>         CPU 0                             CPU 1
>    get_user_pages_fast()
>     internal_get_user_pages_fast()
>                                        copy_page_range()
>                                          pte_alloc_map_lock()
>                                            copy_present_page()
>                                              atomic_read(has_pinned) == 0
> 					     page_maybe_dma_pinned() == false
>      atomic_set(has_pinned, 1);
>      gup_pgd_range()
>       gup_pte_range()
>        pte_t pte = gup_get_pte(ptep)
>        pte_access_permitted(pte)
>        try_grab_compound_head()
>                                              pte = pte_wrprotect(pte)
> 	                                     set_pte_at();
>                                          pte_unmap_unlock()
>       // GUP now returns with a write protected page
> 
> The first attempt to resolve this by using the write protect caused
> problems (and was missing a barrrier), see commit f3c64eda3e50 ("mm: avoid
> early COW write protect games during fork()")
> 
> Instead wrap copy_p4d_range() with the write side of a seqcount and check
> the read side around gup_pgd_range(). If there is a collision then
> get_user_pages_fast() fails and falls back to slow GUP.
> 
> Slow GUP is safe against this race because copy_page_range() is only
> called while holding the exclusive side of the mmap_lock on the src
> mm_struct.
> 
> Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Looks good to me. Just one nit below. With that fixed feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> @@ -446,6 +447,12 @@ struct mm_struct {
>  		 */
>  		atomic_t has_pinned;
>  
> +		/**
> +		 * @write_protect_seq: Odd when any thread is write protecting
> +		 * pages in this mm, for instance during fork().
> +		 */
> +		seqcount_t write_protect_seq;
> +

So this comment isn't quite true. We can be writeprotecting pages due to
many other reasons and not touch write_protect_seq. E.g. for shared
mappings or due to explicit mprotect() calls. So the write_protect_seq
protection has to be about something more than pure write protection. One
requirement certainly is that the VMA has to be is_cow_mapping(). What
about mprotect(2) calls? I guess the application would have only itself to
blame so we don't care?

Anyway my point is just that the comment should tell more what this is
about. I'd even go as far as making it "page table copying during fork in
progress".

								Honza
Jason Gunthorpe Oct. 30, 2020, 5:02 p.m. UTC | #2
On Fri, Oct 30, 2020 at 05:51:05PM +0100, Jan Kara wrote:

> Looks good to me. Just one nit below. With that fixed feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks!

> > @@ -446,6 +447,12 @@ struct mm_struct {
> >  		 */
> >  		atomic_t has_pinned;
> >  
> > +		/**
> > +		 * @write_protect_seq: Odd when any thread is write protecting
> > +		 * pages in this mm, for instance during fork().
> > +		 */
> > +		seqcount_t write_protect_seq;
> > +
> 
> So this comment isn't quite true. We can be writeprotecting pages due to
> many other reasons and not touch write_protect_seq. E.g. for shared
> mappings or due to explicit mprotect() calls. So the write_protect_seq
> protection has to be about something more than pure write protection. One
> requirement certainly is that the VMA has to be is_cow_mapping(). What
> about mprotect(2) calls? I guess the application would have only itself to
> blame so we don't care?

Yes, that sounds right. How about

/**
 * @write_protect_seq: Locked when any thread is write protecting
 * pages for COW in this mm, for instance during page table copying
 * for fork().
 */

mprotect and shared mappings cause faults on write access not COW?

Jason
John Hubbard Oct. 30, 2020, 9:20 p.m. UTC | #3
On 10/30/20 7:46 AM, Jason Gunthorpe wrote:
> Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
> fork() for ptes") pages under a FOLL_PIN will not be write protected
> during COW for fork. This means that pages returned from
> pin_user_pages(FOLL_WRITE) should not become write protected while the pin
> is active.
> 
> However, there is a small race where get_user_pages_fast(FOLL_PIN) can
> establish a FOLL_PIN at the same time copy_present_page() is write
> protecting it:
> 
>          CPU 0                             CPU 1
>     get_user_pages_fast()
>      internal_get_user_pages_fast()
>                                         copy_page_range()
>                                           pte_alloc_map_lock()
>                                             copy_present_page()
>                                               atomic_read(has_pinned) == 0
> 					     page_maybe_dma_pinned() == false
>       atomic_set(has_pinned, 1);
>       gup_pgd_range()
>        gup_pte_range()
>         pte_t pte = gup_get_pte(ptep)
>         pte_access_permitted(pte)
>         try_grab_compound_head()
>                                               pte = pte_wrprotect(pte)
> 	                                     set_pte_at();
>                                           pte_unmap_unlock()
>        // GUP now returns with a write protected page
> 
> The first attempt to resolve this by using the write protect caused
> problems (and was missing a barrrier), see commit f3c64eda3e50 ("mm: avoid
> early COW write protect games during fork()")
> 
> Instead wrap copy_p4d_range() with the write side of a seqcount and check
> the read side around gup_pgd_range(). If there is a collision then
> get_user_pages_fast() fails and falls back to slow GUP.
> 
> Slow GUP is safe against this race because copy_page_range() is only
> called while holding the exclusive side of the mmap_lock on the src
> mm_struct.
> 
> Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

This updated version still looks good to me. Love-love-love the new raw_seqcount
approach! No more bare memory barriers, and it gains some kcsan coverage too.
Sweet. :)

thanks,
Peter Xu Oct. 30, 2020, 10:52 p.m. UTC | #4
Hi, Jason,

I think the patch mostly looks good to me, but I have a few pure questions,
mostly not directly related to the patch itself, but around the context.
Since I _feel_ like there'll be a new version to update the comments below,
maybe I can still ask aloud... Please bear with me. :)

On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote:
> Slow GUP is safe against this race because copy_page_range() is only
> called while holding the exclusive side of the mmap_lock on the src
> mm_struct.

Pure question: I understand that this patch requires this, but... Could anyone
remind me why read lock of mmap_sem is not enough for fork() before this one?

> 
> Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  arch/x86/kernel/tboot.c    |  1 +
>  drivers/firmware/efi/efi.c |  1 +
>  include/linux/mm_types.h   |  7 +++++++
>  kernel/fork.c              |  1 +
>  mm/gup.c                   | 19 +++++++++++++++++++
>  mm/init-mm.c               |  1 +
>  mm/memory.c                | 10 +++++++++-
>  7 files changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
> index 992fb1415c0f1f..6a2f542d9588a4 100644
> --- a/arch/x86/kernel/tboot.c
> +++ b/arch/x86/kernel/tboot.c
> @@ -93,6 +93,7 @@ static struct mm_struct tboot_mm = {
>  	.pgd            = swapper_pg_dir,
>  	.mm_users       = ATOMIC_INIT(2),
>  	.mm_count       = ATOMIC_INIT(1),
> +	.write_protect_seq = SEQCNT_ZERO(tboot_mm.write_protect_seq),
>  	MMAP_LOCK_INITIALIZER(init_mm)
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
>  	.mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 5e5480a0a32d7d..2520f6e05f4d44 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -57,6 +57,7 @@ struct mm_struct efi_mm = {
>  	.mm_rb			= RB_ROOT,
>  	.mm_users		= ATOMIC_INIT(2),
>  	.mm_count		= ATOMIC_INIT(1),
> +	.write_protect_seq      = SEQCNT_ZERO(efi_mm.write_protect_seq),
>  	MMAP_LOCK_INITIALIZER(efi_mm)
>  	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
>  	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),

Another pure question: I'm just curious how you found all the statically
defined mm_structs, and how you made sure all of them are covered (just in
case an uninitialized seqcount could fail strangely).

Actually I'm wondering whether we should have one place to keep all the init
vars for all the statically defined mm_structs, so we don't need to find them
every time, but only change that one place.

> diff --git a/mm/memory.c b/mm/memory.c
> index c48f8df6e50268..294c2c3c4fe00d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
>  					0, src_vma, src_mm, addr, end);
>  		mmu_notifier_invalidate_range_start(&range);
> +		/*
> +		 * The read side doesn't spin, it goes to the mmap_lock, so the
> +		 * raw version is used to avoid disabling preemption here
> +		 */
> +		mmap_assert_write_locked(src_mm);
> +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);

Would raw_write_seqcount_begin() be better here?

My understanding is that we used raw_write_seqcount_t_begin() because we're
under a spin lock, so we assume preemption is already disabled.  However I'm
wondering whether raw_write_seqcount_begin() would be even better to
guarantee that.  I have no idea how the rt kernel merging topic is going, but
if the rt kernel is merged into mainline then IIUC preemption is allowed here
(since the pgtable spin lock should be an rt_spin_lock, not a raw spin lock).

An even further pure question on __seqcount_preemptible() (feel free to ignore
this question!): I saw that __seqcount_preemptible() seems to have been
constantly defined as "return false".  Not sure what happened there..

Thanks,
Jason Gunthorpe Oct. 30, 2020, 11:51 p.m. UTC | #5
On Fri, Oct 30, 2020 at 06:52:50PM -0400, Peter Xu wrote:
> Hi, Jason,
> 
> I think majorly the patch looks good to me, but I have a few pure questions
> majorly not directly related to the patch itself, but around the contexts.
> Since I _feel_ like there'll be a new version to update the comments below,
> maybe I can still ask aloud... Please bare with me. :)

No problem
 
> On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote:
> > Slow GUP is safe against this race because copy_page_range() is only
> > called while holding the exclusive side of the mmap_lock on the src
> > mm_struct.
> 
> Pure question: I understand that this patch requires this, but... Could anyone
> remind me why read lock of mmap_sem is not enough for fork() before this one?

I do not know why fork uses the exclusive lock, however the write side
of the seqcount must be exclusive or it doesn't work properly. Was
happy to find we already had the locking to support this.

> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index 5e5480a0a32d7d..2520f6e05f4d44 100644
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -57,6 +57,7 @@ struct mm_struct efi_mm = {
> >  	.mm_rb			= RB_ROOT,
> >  	.mm_users		= ATOMIC_INIT(2),
> >  	.mm_count		= ATOMIC_INIT(1),
> > +	.write_protect_seq      = SEQCNT_ZERO(efi_mm.write_protect_seq),
> >  	MMAP_LOCK_INITIALIZER(efi_mm)
> >  	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
> >  	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
> 
> Another pure question: I'm just curious how you find all the statically
> definied mm_structs, and to make sure all of them are covered (just in case
> un-initialized seqcount could fail strangely).

I searched for all MMAP_LOCK_INITIALIZER() places and assumed that
Michel got them all when he added it :)

> Actually I'm thinking whether we should have one place to keep all the init
> vars for all the statically definied mm_structs, so we don't need to find them
> everytime, but only change that one place.

I was thinking that as well; most of the places are all the same.
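
A rough sketch of what that could look like, using a hypothetical
INIT_MM_COMMON() helper (no such macro exists in the tree today; this is
purely illustrative), collecting the fields every statically defined
mm_struct currently repeats, including the new write_protect_seq:

#define INIT_MM_COMMON(name)						\
	.mm_users	   = ATOMIC_INIT(2),				\
	.mm_count	   = ATOMIC_INIT(1),				\
	.write_protect_seq = SEQCNT_ZERO(name.write_protect_seq),	\
	MMAP_LOCK_INITIALIZER(name)					\
	.page_table_lock   = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \
	.mmlist		   = LIST_HEAD_INIT(name.mmlist),

/* Each static definition then only spells out its unique fields, e.g.
 * (other efi_mm fields omitted): */
struct mm_struct efi_mm = {
	.mm_rb = RB_ROOT,
	INIT_MM_COMMON(efi_mm)
};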

> > diff --git a/mm/memory.c b/mm/memory.c
> > index c48f8df6e50268..294c2c3c4fe00d 100644
> > +++ b/mm/memory.c
> > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> >  					0, src_vma, src_mm, addr, end);
> >  		mmu_notifier_invalidate_range_start(&range);
> > +		/*
> > +		 * The read side doesn't spin, it goes to the mmap_lock, so the
> > +		 * raw version is used to avoid disabling preemption here
> > +		 */
> > +		mmap_assert_write_locked(src_mm);
> > +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
> 
> Would raw_write_seqcount_begin() be better here?

Hum..

I felt no because it had the preempt stuff added into it, however it
would work - __seqcount_lock_preemptible() == false for the seqcount_t
case (see below)

Looking more closely, maybe the right API to pick is
write_seqcount_t_begin() and write_seqcount_t_end() ??

However, no idea what the intention of the '*_seqcount_t_*' family is
- it only seems to be used to implement the seqlock..

Let's add Ahmed, perhaps he can give some guidance (see next section)?

> My understanding is that we used raw_write_seqcount_t_begin() because we're
> with spin lock so assuming we disabled preemption already.

Here we rely on the exclusive mmap_lock, not a spinlock. This ensures
only one write side is running concurrently as required by seqcount.

The concern about preemption disable is that it wasn't held for fork()
before, and we don't need it.  I understand preemption-disabled regions
must be short or the RT people will not be happy; holding one across
all of copy_page_range() sounds bad.

Ahmed explained in commit 8117ab508f that the seqcount_t write side
has preemption disabled because it can livelock RT kernels if the read
side is spinning after preempting the write side, e.g. look at how
__read_seqcount_begin() is implemented:

	while ((seq = __seqcount_sequence(s)) & 1)			\
		cpu_relax();						\

However, in this patch, we don't spin on the read side.

If the read side collides with a writer it immediately goes to the
mmap_lock, which is sleeping, and so it will sort itself out properly,
even if it was preempted.

> An even further pure question on __seqcount_preemptible() (feel free to ignore
> this question!): I saw that __seqcount_preemptible() seems to have been
> constantly defined as "return false".  Not sure what happened there..

The new code has a range of seqcount_t types; see
Documentation/locking/seqlock.rst, 'Sequence counters with associated
locks'.

It uses _Generic to do a bit of meta-programming and creates a compile
time table of lock properties:

SEQCOUNT_LOCKNAME(raw_spinlock, raw_spinlock_t,  false,    s->lock,        raw_spin, raw_spin_lock(s->lock))
SEQCOUNT_LOCKNAME(spinlock,     spinlock_t,      __SEQ_RT, s->lock,        spin,     spin_lock(s->lock))
SEQCOUNT_LOCKNAME(rwlock,       rwlock_t,        __SEQ_RT, s->lock,        read,     read_lock(s->lock))
SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     s->lock,        mutex,    mutex_lock(s->lock))
SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mutex, ww_mutex_lock(s->lock, NULL))

As well as a default set of properties for normal seqcount_t. The
__seqcount_preemptible() is selected by the _Generic() for seqcount_t:

#define __seqprop(s, prop) _Generic(*(s),				\
	seqcount_t:		__seqprop_##prop((void *)(s)),		\

And it says preemption must be disabled before using the lock:

static inline void __seqprop_assert(const seqcount_t *s)
{
	lockdep_assert_preemption_disabled();
}

And thus no need to have an automatic disable preemption:

static inline bool __seqprop_preemptible(const seqcount_t *s)
{
	return false;
}

Other lock subtypes are different, eg the codegen for mutex will use
lockdep_assert_held(s->lock) for _assert and true for _preemptible()

So if we map the 'write begin' entry points:

 write_seqcount_begin - Enforces preemption off
 raw_write_seqcount_begin - Auto disable preemption if required (false)
 raw_write_seqcount_t_begin - No preemption stuff
 write_seqcount_t_begin - No preemption stuff
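
A tiny sketch of how that mapping plays out (not from the patch, just an
illustration using the seqlock.h API above; the two writer functions and the
data names are made up):

#include <linux/mutex.h>
#include <linux/seqlock.h>

static seqcount_t plain_seq = SEQCNT_ZERO(plain_seq);

static DEFINE_MUTEX(data_mutex);
static seqcount_mutex_t locked_seq = SEQCNT_MUTEX_ZERO(locked_seq, &data_mutex);

static void write_plain(void)
{
	/*
	 * Plain seqcount_t: __seqcount_lock_preemptible() resolves to false,
	 * so this is just the counter increment plus smp_wmb(), with no
	 * preempt_disable() added.
	 */
	raw_write_seqcount_begin(&plain_seq);
	/* ... update the protected data ... */
	raw_write_seqcount_end(&plain_seq);
}

static void write_locked(void)
{
	mutex_lock(&data_mutex);
	/*
	 * seqcount_mutex_t: the associated lock is preemptible, so the same
	 * macro additionally disables preemption across the write section
	 * (and write_seqcount_begin() would also lockdep-assert that
	 * data_mutex is held).
	 */
	raw_write_seqcount_begin(&locked_seq);
	/* ... update the protected data ... */
	raw_write_seqcount_end(&locked_seq);
	mutex_unlock(&data_mutex);
}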

Jason
Peter Xu Oct. 31, 2020, 3:26 p.m. UTC | #6
On Fri, Oct 30, 2020 at 08:51:21PM -0300, Jason Gunthorpe wrote:
> > Another pure question: I'm just curious how you find all the statically
> > definied mm_structs, and to make sure all of them are covered (just in case
> > un-initialized seqcount could fail strangely).
> 
> I searched for all MMAP_LOCK_INITIALIZER() places and assumed that
> Michel got them all when he added it :)

Hmm, I should have noticed that before I ask.. :)

> 
> > Actually I'm thinking whether we should have one place to keep all the init
> > vars for all the statically definied mm_structs, so we don't need to find them
> > everytime, but only change that one place.
> 
> I was thinking that as well, most of the places are all the same

Yes, we can work on top.

> 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index c48f8df6e50268..294c2c3c4fe00d 100644
> > > +++ b/mm/memory.c
> > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > >  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> > >  					0, src_vma, src_mm, addr, end);
> > >  		mmu_notifier_invalidate_range_start(&range);
> > > +		/*
> > > +		 * The read side doesn't spin, it goes to the mmap_lock, so the
> > > +		 * raw version is used to avoid disabling preemption here
> > > +		 */
> > > +		mmap_assert_write_locked(src_mm);
> > > +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
> > 
> > Would raw_write_seqcount_begin() be better here?
> 
> Hum..
> 
> I felt no because it had the preempt stuff added into it, however it
> would work - __seqcount_lock_preemptible() == false for the seqcount_t
> case (see below)
> 
> Looking more closely, maybe the right API to pick is
> write_seqcount_t_begin() and write_seqcount_t_end() ??
> 
> However, no idea what the intention of the '*_seqcount_t_*' family is
> - it only seems to be used to implement the seqlock..
> 
> Lets add Amhed, perhaps he can give some guidance (see next section)?

IMHO we shouldn't directly use these helpers since they seem to only be used by
lock-associated versions of seqcount types.  But yeah, Ahmed would be the best
one to answer...

> 
> > My understanding is that we used raw_write_seqcount_t_begin() because we're
> > with spin lock so assuming we disabled preemption already.
> 
> Here we rely on the exclusive mmap_lock, not a spinlock. This ensures
> only one write side is running concurrently as required by seqcount.

So imho here we have these things to consider during one thread updating the
seqcount_t:

  0. Concurrent read is perfectly welcomed, for sure.

  1. Concurrent writes on seqcount_t: mm sem protects it.

  2. Preempted write (if possible, maybe on RT?): I think it's also protected
     by mm sem, so looks ok too to me.

  3. Preempted/interrupted read on seqcount_t.  Seems to be the one discussed
     below.  Looks safe to me now with below explanation.  However...

> 
> The concern about preemption disable is that it wasn't held for fork()
> before, and we don't need it.. I understand preemption disable regions
> must be short or the RT people will not be happy, holding one across
> all of copy_page_range() sounds bad.
> 
> Ahmed explained in commit 8117ab508f the reason the seqcount_t write
> side has preemption disabled is because it can livelock RT kernels if
> the read side is spinning after preempting the write side. eg look at
> how __read_seqcount_begin() is implemented:
> 
> 	while ((seq = __seqcount_sequence(s)) & 1)			\
> 		cpu_relax();						\
> 
> However, in this patch, we don't spin on the read side.

... Shall we document this explicitly (if this patch still needs a repost)?
It seems not straightforward, since that is not the usual way to use
seqcount; not sure whether I'm the only one who feels this way, though.

> 
> If the read side collides with a writer it immediately goes to the
> mmap_lock, which is sleeping, and so it will sort itself out properly,
> even if it was preempted.
> 
> > An even further pure question on __seqcount_preemptible() (feel free to ignore
> > this question!): I saw that __seqcount_preemptible() seems to have been
> > constantly defined as "return false".  Not sure what happened there..
> 
> The new code has a range of seqcount_t types see
> Documentation/locking/seqlock.rst 'Sequence counters with associated
> locks'
> 
> It uses _Generic to do a bit of meta-programming and creates a compile
> time table of lock properties:
> 
> SEQCOUNT_LOCKNAME(raw_spinlock, raw_spinlock_t,  false,    s->lock,        raw_spin, raw_spin_lock(s->lock))
> SEQCOUNT_LOCKNAME(spinlock,     spinlock_t,      __SEQ_RT, s->lock,        spin,     spin_lock(s->lock))
> SEQCOUNT_LOCKNAME(rwlock,       rwlock_t,        __SEQ_RT, s->lock,        read,     read_lock(s->lock))
> SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     s->lock,        mutex,    mutex_lock(s->lock))
> SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mutex, ww_mutex_lock(s->lock, NULL))
> 
> As well as as default set of properties for normal seqcount_t. The
> __seqcount_preemptible() is selected by the _Generic for seqcount_t:
> 
> #define __seqprop(s, prop) _Generic(*(s),				\
> 	seqcount_t:		__seqprop_##prop((void *)(s)),		\
> 
> And it says preemption must be disabled before using the lock:
> 
> static inline void __seqprop_assert(const seqcount_t *s)
> {
> 	lockdep_assert_preemption_disabled();
> }
> 
> And thus no need to have an automatic disable preemption:
> 
> static inline bool __seqprop_preemptible(const seqcount_t *s)
> {
> 	return false;
> }
> 
> Other lock subtypes are different, eg the codegen for mutex will use
> lockdep_assert_held(s->lock) for _assert and true for _preemptible()
> 
> So if we map the 'write begin' entry points:
> 
>  write_seqcount_begin - Enforces preemption off
>  raw_write_seqcount_begin - Auto disable preemption if required (false)
>  raw_write_seqcount_t_begin - No preemption stuff
>  write_seqcount_t_begin - No preemption stuff

Thanks for listing these details.

As a summary, I think I'm convinced we can have this work without disabling
preemption.  It's just that some more comments might be even better.

The other thing is, considering this use of seqcount seems to be quite special
as explained below, I'm just not sure whether this would confuse lockdep or
kcsan, etc., if we decide to use write_seqcount_t_begin().

Thanks,
Jan Kara Nov. 2, 2020, 8:31 a.m. UTC | #7
On Fri 30-10-20 14:02:26, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 05:51:05PM +0100, Jan Kara wrote:
> > > @@ -446,6 +447,12 @@ struct mm_struct {
> > >  		 */
> > >  		atomic_t has_pinned;
> > >  
> > > +		/**
> > > +		 * @write_protect_seq: Odd when any thread is write protecting
> > > +		 * pages in this mm, for instance during fork().
> > > +		 */
> > > +		seqcount_t write_protect_seq;
> > > +
> > 
> > So this comment isn't quite true. We can be writeprotecting pages due to
> > many other reasons and not touch write_protect_seq. E.g. for shared
> > mappings or due to explicit mprotect() calls. So the write_protect_seq
> > protection has to be about something more than pure write protection. One
> > requirement certainly is that the VMA has to be is_cow_mapping(). What
> > about mprotect(2) calls? I guess the application would have only itself to
> > blame so we don't care?
> 
> Yes, that sounds right, How about
> 
> /**
>  * @write_protect_seq: Locked when any thread is write protecting
>  * pages for COW in this mm, for instance during page table copying
           ^^^ maybe I'd write a bit more explicitly "... write protecting
pages mapped by this mm to enforce later COW, ..."

>  * for fork().
>  */
> 
> mprotect and shared mappings cause faults on write access not COW?

Correct.

								Honza
Ahmed S. Darwish Nov. 2, 2020, 11:58 p.m. UTC | #8
On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote:

...

> diff --git a/mm/memory.c b/mm/memory.c
> index c48f8df6e50268..294c2c3c4fe00d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
>  					0, src_vma, src_mm, addr, end);
>  		mmu_notifier_invalidate_range_start(&range);

> +		/*
> +		 * The read side doesn't spin, it goes to the mmap_lock, so the
> +		 * raw version is used to avoid disabling preemption here
> +		 */
> +		mmap_assert_write_locked(src_mm);
> +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
>  	}
>

Please, s/raw_write_seqcount_t_begin()/raw_write_seqcount_begin()/g. For
plain seqcount_t, it's the same, while still respecting the seqlock.h
API boundaries.

Let's make the comment also a bit more clear (IMHO, "lockdep" needs to
be mentioned somewhere):

		/*
		 * Disabling preemption is not needed for the write side, as
		 * the read side doesn't spin, but goes to the mmap_lock.
		 *
		 * Use the raw variant of the seqcount_t write API to avoid
		 * lockdep complaining about preemptibility.
		 */
		mmap_assert_write_locked(src_mm);
		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);

>  	ret = 0;
> @@ -1187,8 +1193,10 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  		}
>  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
>
> -	if (is_cow)
> +	if (is_cow) {
> +		raw_write_seqcount_t_end(&src_mm->write_protect_seq);

ditto.

s/raw_write_seqcount_t_end()/raw_write_seqcount_end()/g

>  		mmu_notifier_invalidate_range_end(&range);
> +	}
>  	return ret;
>  }
>

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH
Ahmed S. Darwish Nov. 3, 2020, 12:17 a.m. UTC | #9
On Fri, Oct 30, 2020 at 08:51:21PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 06:52:50PM -0400, Peter Xu wrote:

...

>
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index c48f8df6e50268..294c2c3c4fe00d 100644
> > > +++ b/mm/memory.c
> > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > >  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> > >  					0, src_vma, src_mm, addr, end);
> > >  		mmu_notifier_invalidate_range_start(&range);
> > > +		/*
> > > +		 * The read side doesn't spin, it goes to the mmap_lock, so the
> > > +		 * raw version is used to avoid disabling preemption here
> > > +		 */
> > > +		mmap_assert_write_locked(src_mm);
> > > +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
> >
> > Would raw_write_seqcount_begin() be better here?
>
> Hum..
>
> I felt no because it had the preempt stuff added into it, however it
> would work - __seqcount_lock_preemptible() == false for the seqcount_t
> case (see below)
>
> Looking more closely, maybe the right API to pick is
> write_seqcount_t_begin() and write_seqcount_t_end() ??
>

No, that's not the right API: it is also internal to seqlock.h.

Please stick with the official exported API: raw_write_seqcount_begin().

It should satisfy your needs, and the raw_*() variant is created exactly
for contexts wishing to avoid the lockdep checks (e.g. NMI handlers
cannot invoke lockdep, etc.)

> However, no idea what the intention of the '*_seqcount_t_*' family is
> - it only seems to be used to implement the seqlock..
>

Exactly. '*_seqcount_t_*' is a seqlock.h implementation detail, and it
has _zero_ relevance to what is discussed in this thread actually.

...

> Ahmed explained in commit 8117ab508f the reason the seqcount_t write
> side has preemption disabled is because it can livelock RT kernels if
> the read side is spinning after preempting the write side. eg look at
> how __read_seqcount_begin() is implemented:
>
> 	while ((seq = __seqcount_sequence(s)) & 1)			\
> 		cpu_relax();						\
>
> However, in this patch, we don't spin on the read side.
>
> If the read side collides with a writer it immediately goes to the
> mmap_lock, which is sleeping, and so it will sort itself out properly,
> even if it was preempted.
>

Correct.

Thanks,

--
Ahmed Darwish
Linutronix GmbH
Jason Gunthorpe Nov. 3, 2020, 12:25 a.m. UTC | #10
On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote:

> Please stick with the official exported API: raw_write_seqcount_begin().

How did you know this was 'official exported API'??

> Exactly. '*_seqcount_t_*' is a seqlock.h implementation detail, and it
> has _zero_ relevance to what is discussed in this thread actually.

Add some leading __'s to them?

Jason
Ahmed S. Darwish Nov. 3, 2020, 12:33 a.m. UTC | #11
On Sat, Oct 31, 2020 at 11:26:05AM -0400, Peter Xu wrote:

...

> Shall we document this explicitly (if this patch still needs a repost)?

Yes, this patch series needs a v3 :)

> Seems not straightforward since that seems not the usual way to use seqcount,
> not sure whether I'm the only one that feels this way, though.

Yes, this usage is correct but not common. I've proposed a more explicit
comment above the write section code, in my reply to patch #2.

...

> The other thing is, considering this use of seqcount seems to be quite special
> as explained below, I'm just not sure whether this would confuse lockdep or
> kcsan, etc., if we decide to use write_seqcount_t_begin().
>

Lockdep won't be confused as it's not used in the raw_*() variant of the
seqcount APIs.

AFAIK KCSAN also has some margin to protect itself from this: see
seqlock.h KCSAN_SEQLOCK_REGION_MAX.

Thanks,

> Peter Xu

Ahmed Darwish
Linutronix GmbH
Ahmed S. Darwish Nov. 3, 2020, 12:41 a.m. UTC | #12
On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote:
>
> > Please stick with the official exported API: raw_write_seqcount_begin().
>
> How did you know this was 'offical exported API' ??
>

All the official exported seqlock.h APIs are marked with verbose
kernel-doc annotations on top. The rest are internal...

> > Exactly. '*_seqcount_t_*' is a seqlock.h implementation detail, and it
> > has _zero_ relevance to what is discussed in this thread actually.
>
> Add some leading __'s to them?
>

It's a bit more complicated than just adding some "__" prefixes, due to
the massive compile-time polymorphism done through _Generic().

The '*_seqcount_t_*' format was the best we could come up with to
distinguish (again, for internal seqlock.h code) between macros taking
all seqcount_LOCKNAME_t types, and macros/functions taking only plain
seqcount_t.

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH
John Hubbard Nov. 3, 2020, 2:20 a.m. UTC | #13
On 11/2/20 4:41 PM, Ahmed S. Darwish wrote:
> On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote:
>> On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote:
>>
>>> Please stick with the official exported API: raw_write_seqcount_begin().
>>
>> How did you know this was 'offical exported API' ??
>>
> 
> All the official exported seqlock.h APIs are marked with verbose
> kernel-doc annotations on top. The rest are internal...
> 

OK, but no one here was able to deduce that, probably because there is not
enough consistency throughout the kernel to be able to assume such things--even
though your seqlock project is internally consistent. It's just not *quite*
enough communication.

I think if we added the following it would be very nice:

a) Short comments to the "unofficial and internal" routines, identifying them as
such, and

b) Short comments to the "official API for general use", also identifying
those as such.

c) A comment about what makes "raw" actually raw, for seqlock.


Since I'm proposing new work, I'll also offer to help, perhaps by putting together
a small patch to get it kicked off, if you approve of the idea.

thanks,
Ahmed S. Darwish Nov. 3, 2020, 6:52 a.m. UTC | #14
On Mon, Nov 02, 2020 at 06:20:45PM -0800, John Hubbard wrote:
> On 11/2/20 4:41 PM, Ahmed S. Darwish wrote:
> > On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote:
> > >
> > > > Please stick with the official exported API: raw_write_seqcount_begin().
> > >
> > > How did you know this was 'offical exported API' ??
> > >
> >
> > All the official exported seqlock.h APIs are marked with verbose
> > kernel-doc annotations on top. The rest are internal...
> >
>
> OK, but no one here was able to deduce that, probably because there is not
> enough consistency throughout the kernel to be able to assume such things--even
> though your seqlock project is internally consistent. It's just not *quite*
> enough communication.
>
> I think if we added the following it would be very nice:
>

The problem is, I've already documented seqlock.h to death.... There are
more comments than code in there, and there is "seqlock.rst" under
Documentation/ to further describe the big picture.

There comes a point where you decide what level of documentation to add,
and what level to skip.

Because in the end, you don't want to confuse "Joe, the general driver
developer" with too much details that's not relevant to their task at
hand.  (I work in the Embedded domain, and I've seen so much ugly code
from embedded drivers/SoC developers already, sorry)

See for example my reply to Linus, where any talk about the lockdep-free
and barrier-free parts of the API was explicitly not mentioned in
seqlock.rst. This was done on purpose: 1) you want to keep the generic
case simple, but the special case do-able, 2) you want to encourage
people to use the standard entry/exit points as much as possible.

> a) Short comments to the "unofficial and internal" routines, identifying them as
> such, and
>
> b) Short comments to the "official API for general use", also identifying
> those as such.
>

I really think the already added kernel-doc is sufficient...

See for example __read_seqcount_begin() and __read_seqcount_retry().
They begin with "__", but they are semi-external seqlock.h API that are
used by VFS to avoid barriers. And these APIs are then polymorphised
according to the write serialization lock type, and so on.

So the most consistent way for seqlock.h was to use kernel-doc as *the*
marker for exported functions.

This is not unique to seqlock.h by the way. The same pattern is heavily
used by the DRM folks.

Yes, of course, we can add even more comments to seqlock.h, but then, I
honestly think it would be so much that people might just skip
reading the whole thing altogether...

> c) A comment about what makes "raw" actually raw, for seqlock.
>

That's already documented.

What more can really be written than what's in seqlock.h below??

  /**
   * raw_read_seqcount_begin() - begin a seqcount_t read section w/o lockdep

  /**
   * raw_seqcount_begin() - begin a seqcount_t read critical section w/o
   *                        lockdep and w/o counter stabilization

  /**
   * raw_write_seqcount_begin() - start a seqcount_t write section w/o lockdep

  /**
   * raw_write_seqcount_end() - end a seqcount_t write section w/o lockdep

>
> Since I'm proposing new work, I'll also offer to help, perhaps by putting together
> a small patch to get it kicked off, if you approve of the idea.
>

Patches are always welcome of course, and please put me in Cc. I don't
approve or deny anything though, that's the locking maintainers' job :)

Kind regards,

> John Hubbard
> NVIDIA

--
Ahmed S. Darwish
Linutronix GmbH
John Hubbard Nov. 3, 2020, 7:05 a.m. UTC | #15
On 11/2/20 10:52 PM, Ahmed S. Darwish wrote:
> On Mon, Nov 02, 2020 at 06:20:45PM -0800, John Hubbard wrote:
>> On 11/2/20 4:41 PM, Ahmed S. Darwish wrote:
>>> On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote:
>>>> On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote:
>>>>
>>>>> Please stick with the official exported API: raw_write_seqcount_begin().
>>>>
>>>> How did you know this was 'offical exported API' ??
>>>>
>>>
>>> All the official exported seqlock.h APIs are marked with verbose
>>> kernel-doc annotations on top. The rest are internal...
>>>
>>
>> OK, but no one here was able to deduce that, probably because there is not
>> enough consistency throughout the kernel to be able to assume such things--even
>> though your seqlock project is internally consistent. It's just not *quite*
>> enough communication.
>>
>> I think if we added the following it would be very nice:
>>
> 
> The problem is, I've already documented seqlock.h to death.... There are
> more comments than code in there, and there is "seqlock.rst" under
> Documentation/ to further describe the big picture.
> 
> There comes a point where you decide what level of documentation to add,
> and what level to skip.
> 
> Because in the end, you don't want to confuse "Joe, the general driver
> developer" with too much details that's not relevant to their task at
> hand.  (I work in the Embedded domain, and I've seen so much ugly code
> from embedded drivers/SoC developers already, sorry)
> 
> See for example my reply to Linus, where any talk about the lockdep-free
> and barrier-free parts of the API was explicitly not mentioned in
> seqlock.rst. This was done on purpose: 1) you want to keep the generic
> case simple, but the special case do-able, 2) you want to encourage
> people to use the standard entry/exit points as much as possible.
> 

Well, OK, I'll leave it alone. This was not a normal case for the documentation,
anyway. We are using seqcount-ing in a little bit different way than usual,
so I guess an email thread like this is actually the proper way to work it out.

Just wanted to be sure that we didn't miss an opportunity to clarify anything
that needed clarifying. Thanks for the detailed responses throughout!


thanks,
Peter Xu Nov. 3, 2020, 5:03 p.m. UTC | #16
On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index c48f8df6e50268..294c2c3c4fe00d 100644
> > > > +++ b/mm/memory.c
> > > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > >  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> > > >  					0, src_vma, src_mm, addr, end);
> > > >  		mmu_notifier_invalidate_range_start(&range);
> > > > +		/*
> > > > +		 * The read side doesn't spin, it goes to the mmap_lock, so the
> > > > +		 * raw version is used to avoid disabling preemption here
> > > > +		 */
> > > > +		mmap_assert_write_locked(src_mm);
> > > > +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
> > >
> > > Would raw_write_seqcount_begin() be better here?
> >
> > Hum..
> >
> > I felt no because it had the preempt stuff added into it, however it
> > would work - __seqcount_lock_preemptible() == false for the seqcount_t
> > case (see below)
> >
> > Looking more closely, maybe the right API to pick is
> > write_seqcount_t_begin() and write_seqcount_t_end() ??
> >
> 
> No, that's not the right API: it is also internal to seqlock.h.
> 
> Please stick with the official exported API: raw_write_seqcount_begin().
> 
> It should satisfy your needs, and the raw_*() variant is created exactly
> for contexts wishing to avoid the lockdep checks (e.g. NMI handlers
> cannot invoke lockdep, etc.)

Ahmed, Jason, feel free to correct me - but I feel like what Jason wanted here
is indeed the version that does not require disabling of preemption, a.k.a.,
write_seqcount_t_begin() and write_seqcount_t_end(), since it's preempt-safe if
the read side does not retry.  Not sure whether there's a non-"*_t_*"
version of it, though.

Another idea is that maybe we can use the raw_write_seqcount_begin() version,
instead of in copy_page_range() but move it to copy_pte_range().  That would
not affect preemption on normal Linux I think, since when we reach pte level we
should have disabled preemption already after all (by taking the pgtable spin
lock).  But again there could be extra overhead since we'll need to take the
write seqcount very often (rather than once per fork(), so maybe there's some
perf influence), also that means it'll be an extra/real disable_preempt() for
the future RT code if it'll land some day (since again rt_spin_lock should not
need to disable preemption, iiuc, which is used here).  So seems it's still
better to do in copy_page_range() as Jason proposed.

Thanks,
Linus Torvalds Nov. 3, 2020, 5:40 p.m. UTC | #17
On Mon, Nov 2, 2020 at 10:52 PM Ahmed S. Darwish
<a.darwish@linutronix.de> wrote:
>
> The problem is, I've already documented seqlock.h to death.... There are
> more comments than code in there, and there is "seqlock.rst" under
> Documentation/ to further describe the big picture.

Well, honestly, I think the correct thing to do is to get rid of the
*_seqcount_t_*() functions entirely.

They add nothing but confusion, and they are entirely misnamed. That's
not the pattern we use for "internal use only" functions, and they are
*very* confusing.

They have other issues too: like raw_write_seqcount_end() not being
usable on its own when preemptibility isn't an issue like here. You
basically _have_ to use raw_write_seqcount_t_end(), because otherwise
it tries to re-enable preemption that was never there.

                   Linus
Ahmed S. Darwish Nov. 4, 2020, 1:32 a.m. UTC | #18
On Tue, Nov 03, 2020 at 09:40:22AM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2020 at 10:52 PM Ahmed S. Darwish
> <a.darwish@linutronix.de> wrote:
> >
> > The problem is, I've already documented seqlock.h to death.... There are
> > more comments than code in there, and there is "seqlock.rst" under
> > Documentation/ to further describe the big picture.
>
> Well, honestly, I think the correct thing to do is to get rid of the
> *_seqcount_t_*() functions entirely.
>
> They add nothing but confusion, and they are entirely misnamed. That's
> not the pattern we use for "internal use only" functions, and they are
> *very* confusing.
>

I see. Would the enclosed patch #1 be OK? It basically uses the "__do_"
prefix instead, with some rationale.

>
> They have other issues too: like raw_write_seqcount_end() not being
> usable on its own when preemptibility isn't an issue like here. You
> basically _have_ to use raw_write_seqcount_t_end(), because otherwise
> it tries to re-enable preemption that was never there.
>

Hmmm, raw_write_seqcount_{begin,end}() *never* disable/enable preemption
for plain seqcount_t. This is why I kept recommending those for this
patch series instead of internal raw_write_seqcount_*t*_{begin,end}().

But..... given that multiple people made the exact same remark by now, I
guess that's due to:

#define raw_write_seqcount_begin(s)		\
do {						\
	if (__seqcount_lock_preemptible(s))	\
		preempt_disable();		\
						\
	...					\
} while (0);

#define raw_write_seqcount_end(s)		\
do {						\
	...					\
						\
	if (__seqcount_lock_preemptible(s))	\
		preempt_enable();		\
} while (0);

The tricky part is that __seqcount_lock_preemptible() is always false
for plain "seqcount_t".  With that data type, the _Generic() selection
makes it resolve to __seqprop_preemptible(), which just returns false.

Originally, __seqcount_lock_preemptible() was called:

  __seqcount_associated_lock_exists_and_is_preemptible()

but it got transformed to its current short form in the process of some
pre-mainline refactorings. Looking at it now after all the dust has
settled, maybe the verbose form was much more clear.

Please see the enclosed patch #2... Would that be OK too?

(I will submit these two patches in their own thread after some common
 ground is reached.)

Patches
-------

====>
====> patch #1:
====>

Subject: [RFC][PATCH 1/2] seqlock: Use __do_ prefix instead of non-standard
 _seqcount_t_ marker

The use of "*_seqcount_t_*" as a marker to denote internal seqlock.h
functions taking only plain seqcount_t instead of the whole
seqcount_LOCKNAME_t family is confusing users, as it's also not the
standard kernel pattern for denoting header file internal functions.

Use the __do_ prefix instead.

Note, a plain "__" prefix is not used since seqlock.h already uses it
for some of its exported functions; e.g. __read_seqcount_begin() and
__read_seqcount_retry().

Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Reported-by: John Hubbard <jhubbard@nvidia.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 include/linux/seqlock.h | 62 ++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index cbfc78b92b65..5de043841d33 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -425,9 +425,9 @@ SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mu
  * Return: true if a read section retry is required, else false
  */
 #define __read_seqcount_retry(s, start)					\
-	__read_seqcount_t_retry(__seqcount_ptr(s), start)
+	__do___read_seqcount_retry(__seqcount_ptr(s), start)

-static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start)
+static inline int __do___read_seqcount_retry(const seqcount_t *s, unsigned start)
 {
 	kcsan_atomic_next(0);
 	return unlikely(READ_ONCE(s->sequence) != start);
@@ -445,12 +445,12 @@ static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start)
  * Return: true if a read section retry is required, else false
  */
 #define read_seqcount_retry(s, start)					\
-	read_seqcount_t_retry(__seqcount_ptr(s), start)
+	__do_read_seqcount_retry(__seqcount_ptr(s), start)

-static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start)
+static inline int __do_read_seqcount_retry(const seqcount_t *s, unsigned start)
 {
 	smp_rmb();
-	return __read_seqcount_t_retry(s, start);
+	return __do___read_seqcount_retry(s, start);
 }

 /**
@@ -462,10 +462,10 @@ do {									\
 	if (__seqcount_lock_preemptible(s))				\
 		preempt_disable();					\
 									\
-	raw_write_seqcount_t_begin(__seqcount_ptr(s));			\
+	__do_raw_write_seqcount_begin(__seqcount_ptr(s));		\
 } while (0)

-static inline void raw_write_seqcount_t_begin(seqcount_t *s)
+static inline void __do_raw_write_seqcount_begin(seqcount_t *s)
 {
 	kcsan_nestable_atomic_begin();
 	s->sequence++;
@@ -478,13 +478,13 @@ static inline void raw_write_seqcount_t_begin(seqcount_t *s)
  */
 #define raw_write_seqcount_end(s)					\
 do {									\
-	raw_write_seqcount_t_end(__seqcount_ptr(s));			\
+	__do_raw_write_seqcount_end(__seqcount_ptr(s));			\
 									\
 	if (__seqcount_lock_preemptible(s))				\
 		preempt_enable();					\
 } while (0)

-static inline void raw_write_seqcount_t_end(seqcount_t *s)
+static inline void __do_raw_write_seqcount_end(seqcount_t *s)
 {
 	smp_wmb();
 	s->sequence++;
@@ -506,12 +506,12 @@ do {									\
 	if (__seqcount_lock_preemptible(s))				\
 		preempt_disable();					\
 									\
-	write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass);	\
+	__do_write_seqcount_begin_nested(__seqcount_ptr(s), subclass);	\
 } while (0)

-static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass)
+static inline void __do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
 {
-	raw_write_seqcount_t_begin(s);
+	__do_raw_write_seqcount_begin(s);
 	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
 }

@@ -533,12 +533,12 @@ do {									\
 	if (__seqcount_lock_preemptible(s))				\
 		preempt_disable();					\
 									\
-	write_seqcount_t_begin(__seqcount_ptr(s));			\
+	__do_write_seqcount_begin(__seqcount_ptr(s));			\
 } while (0)

-static inline void write_seqcount_t_begin(seqcount_t *s)
+static inline void __do_write_seqcount_begin(seqcount_t *s)
 {
-	write_seqcount_t_begin_nested(s, 0);
+	__do_write_seqcount_begin_nested(s, 0);
 }

 /**
@@ -549,16 +549,16 @@ static inline void write_seqcount_t_begin(seqcount_t *s)
  */
 #define write_seqcount_end(s)						\
 do {									\
-	write_seqcount_t_end(__seqcount_ptr(s));			\
+	__do_write_seqcount_end(__seqcount_ptr(s));			\
 									\
 	if (__seqcount_lock_preemptible(s))				\
 		preempt_enable();					\
 } while (0)

-static inline void write_seqcount_t_end(seqcount_t *s)
+static inline void __do_write_seqcount_end(seqcount_t *s)
 {
 	seqcount_release(&s->dep_map, _RET_IP_);
-	raw_write_seqcount_t_end(s);
+	__do_raw_write_seqcount_end(s);
 }

 /**
@@ -603,9 +603,9 @@ static inline void write_seqcount_t_end(seqcount_t *s)
  *      }
  */
 #define raw_write_seqcount_barrier(s)					\
-	raw_write_seqcount_t_barrier(__seqcount_ptr(s))
+	__do_raw_write_seqcount_barrier(__seqcount_ptr(s))

-static inline void raw_write_seqcount_t_barrier(seqcount_t *s)
+static inline void __do_raw_write_seqcount_barrier(seqcount_t *s)
 {
 	kcsan_nestable_atomic_begin();
 	s->sequence++;
@@ -623,9 +623,9 @@ static inline void raw_write_seqcount_t_barrier(seqcount_t *s)
  * will complete successfully and see data older than this.
  */
 #define write_seqcount_invalidate(s)					\
-	write_seqcount_t_invalidate(__seqcount_ptr(s))
+	__do_write_seqcount_invalidate(__seqcount_ptr(s))

-static inline void write_seqcount_t_invalidate(seqcount_t *s)
+static inline void __do_write_seqcount_invalidate(seqcount_t *s)
 {
 	smp_wmb();
 	kcsan_nestable_atomic_begin();
@@ -865,7 +865,7 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
 }

 /*
- * For all seqlock_t write side functions, use write_seqcount_*t*_begin()
+ * For all seqlock_t write side functions, use __do_write_seqcount_begin()
  * instead of the generic write_seqcount_begin(). This way, no redundant
  * lockdep_assert_held() checks are added.
  */
@@ -886,7 +886,7 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
 static inline void write_seqlock(seqlock_t *sl)
 {
 	spin_lock(&sl->lock);
-	write_seqcount_t_begin(&sl->seqcount.seqcount);
+	__do_write_seqcount_begin(&sl->seqcount.seqcount);
 }

 /**
@@ -898,7 +898,7 @@ static inline void write_seqlock(seqlock_t *sl)
  */
 static inline void write_sequnlock(seqlock_t *sl)
 {
-	write_seqcount_t_end(&sl->seqcount.seqcount);
+	__do_write_seqcount_end(&sl->seqcount.seqcount);
 	spin_unlock(&sl->lock);
 }

@@ -912,7 +912,7 @@ static inline void write_sequnlock(seqlock_t *sl)
 static inline void write_seqlock_bh(seqlock_t *sl)
 {
 	spin_lock_bh(&sl->lock);
-	write_seqcount_t_begin(&sl->seqcount.seqcount);
+	__do_write_seqcount_begin(&sl->seqcount.seqcount);
 }

 /**
@@ -925,7 +925,7 @@ static inline void write_seqlock_bh(seqlock_t *sl)
  */
 static inline void write_sequnlock_bh(seqlock_t *sl)
 {
-	write_seqcount_t_end(&sl->seqcount.seqcount);
+	__do_write_seqcount_end(&sl->seqcount.seqcount);
 	spin_unlock_bh(&sl->lock);
 }

@@ -939,7 +939,7 @@ static inline void write_sequnlock_bh(seqlock_t *sl)
 static inline void write_seqlock_irq(seqlock_t *sl)
 {
 	spin_lock_irq(&sl->lock);
-	write_seqcount_t_begin(&sl->seqcount.seqcount);
+	__do_write_seqcount_begin(&sl->seqcount.seqcount);
 }

 /**
@@ -951,7 +951,7 @@ static inline void write_seqlock_irq(seqlock_t *sl)
  */
 static inline void write_sequnlock_irq(seqlock_t *sl)
 {
-	write_seqcount_t_end(&sl->seqcount.seqcount);
+	__do_write_seqcount_end(&sl->seqcount.seqcount);
 	spin_unlock_irq(&sl->lock);
 }

@@ -960,7 +960,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl)
 	unsigned long flags;

 	spin_lock_irqsave(&sl->lock, flags);
-	write_seqcount_t_begin(&sl->seqcount.seqcount);
+	__do_write_seqcount_begin(&sl->seqcount.seqcount);
 	return flags;
 }

@@ -989,7 +989,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl)
 static inline void
 write_sequnlock_irqrestore(seqlock_t *sl, unsigned long flags)
 {
-	write_seqcount_t_end(&sl->seqcount.seqcount);
+	__do_write_seqcount_end(&sl->seqcount.seqcount);
 	spin_unlock_irqrestore(&sl->lock, flags);
 }

====>
====> patch #2:
====>

Subject: [PATCH 2/2] seqlock: seqcount_LOCKNAME_t: Use more verbose macro names

As evidenced by multiple discussions over LKML so far, it's not clear
that __seqcount_lock_preemptible() is always false for plain seqcount_t.
For that data type, the _Generic() selection resolves to
__seqprop_preemptible(), which just returns false.

Use __seqcount_associated_lock_exists_and_is_preemptible() instead,
which hints that "preemptibility" is for the associated write
serialization lock (if any), not for the seqcount itself.

Similarly, rename __seqcount_assert_lock_held() to
__seqcount_assert_associated_lock_held().

Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20201030235121.GQ2620339@nvidia.com
Link: https://lkml.kernel.org/r/20201103170327.GJ20600@xz-x1
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 include/linux/seqlock.h | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5de043841d33..eb1e5a822e44 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -307,10 +307,10 @@ SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mu
 	__seqprop_case((s),	mutex,		prop),			\
 	__seqprop_case((s),	ww_mutex,	prop))

-#define __seqcount_ptr(s)		__seqprop(s, ptr)
-#define __seqcount_sequence(s)		__seqprop(s, sequence)
-#define __seqcount_lock_preemptible(s)	__seqprop(s, preemptible)
-#define __seqcount_assert_lock_held(s)	__seqprop(s, assert)
+#define __seqcount_ptr(s)					__seqprop(s, ptr)
+#define __seqcount_sequence(s)					__seqprop(s, sequence)
+#define __seqcount_associated_lock_exists_and_is_preemptible(s)	__seqprop(s, preemptible)
+#define __seqcount_assert_associated_lock_held(s)		__seqprop(s, assert)

 /**
  * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier
@@ -459,7 +459,7 @@ static inline int __do_read_seqcount_retry(const seqcount_t *s, unsigned start)
  */
 #define raw_write_seqcount_begin(s)					\
 do {									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (__seqcount_associated_lock_exists_and_is_preemptible(s))	\
 		preempt_disable();					\
 									\
 	__do_raw_write_seqcount_begin(__seqcount_ptr(s));		\
@@ -480,7 +480,7 @@ static inline void __do_raw_write_seqcount_begin(seqcount_t *s)
 do {									\
 	__do_raw_write_seqcount_end(__seqcount_ptr(s));			\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (__seqcount_associated_lock_exists_and_is_preemptible(s))	\
 		preempt_enable();					\
 } while (0)

@@ -501,9 +501,9 @@ static inline void __do_raw_write_seqcount_end(seqcount_t *s)
  */
 #define write_seqcount_begin_nested(s, subclass)			\
 do {									\
-	__seqcount_assert_lock_held(s);					\
+	__seqcount_assert_associated_lock_held(s);			\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (__seqcount_associated_lock_exists_and_is_preemptible(s))	\
 		preempt_disable();					\
 									\
 	__do_write_seqcount_begin_nested(__seqcount_ptr(s), subclass);	\
@@ -528,9 +528,9 @@ static inline void __do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
  */
 #define write_seqcount_begin(s)						\
 do {									\
-	__seqcount_assert_lock_held(s);					\
+	__seqcount_assert_associated_lock_held(s);			\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (__seqcount_associated_lock_exists_and_is_preemptible(s))	\
 		preempt_disable();					\
 									\
 	__do_write_seqcount_begin(__seqcount_ptr(s));			\
@@ -551,7 +551,7 @@ static inline void __do_write_seqcount_begin(seqcount_t *s)
 do {									\
 	__do_write_seqcount_end(__seqcount_ptr(s));			\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (__seqcount_associated_lock_exists_and_is_preemptible(s))	\
 		preempt_enable();					\
 } while (0)

>                    Linus

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH
John Hubbard Nov. 4, 2020, 2:01 a.m. UTC | #19
On 11/3/20 5:32 PM, Ahmed S. Darwish wrote:
> On Tue, Nov 03, 2020 at 09:40:22AM -0800, Linus Torvalds wrote:
>> On Mon, Nov 2, 2020 at 10:52 PM Ahmed S. Darwish
>> <a.darwish@linutronix.de> wrote:
...
> 
> ====>
> ====> patch #1:
> ====>
> 
> Subject: [RFC][PATCH 1/2] seqlock: Use __do_ prefix instead of non-standard
>   _seqcount_t_ marker
> 
> The use of "*_seqcount_t_*" as a marker to denote internal seqlock.h
> functions taking only plain seqcount_t instead of the whole
> seqcount_LOCKNAME_t family is confusing users, as it's also not the
> standard kernel pattern for denoting header file internal functions.
> 
> Use the __do_ prefix instead.
> 
> Note, a plain "__" prefix is not used since seqlock.h already uses it
> for some of its exported functions; e.g. __read_seqcount_begin() and
> __read_seqcount_retry().
> 
> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> Reported-by: John Hubbard <jhubbard@nvidia.com>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com
> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> ---
>   include/linux/seqlock.h | 62 ++++++++++++++++++++---------------------
>   1 file changed, 31 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index cbfc78b92b65..5de043841d33 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -425,9 +425,9 @@ SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mu
>    * Return: true if a read section retry is required, else false
>    */
>   #define __read_seqcount_retry(s, start)					\
> -	__read_seqcount_t_retry(__seqcount_ptr(s), start)
> +	__do___read_seqcount_retry(__seqcount_ptr(s), start)

Looking better. "do_" is clearly an internal function name prefix, so that's good.

A nit: while various numbers of leading underscores are sometimes used, it's a lot
less common to use, say, 3 consecutive underscores (as above) *within* the name. And
I don't think you need it for uniqueness, at least from a quick look around here.

In fact, if you actually need 3 vs 2 vs 1 underscore within the name, I'm going to
claim that that's too far afield, and the naming should be re-revisited. :)

So why not just:

     __do_read_seqcount_retry()

?

...or, if you're feeling bold:

	do_read_seqcount_retry()

...thus taking further advantage of the "do" convention, in order to get rid of some
more underscores.

And similarly for other __do___*() functions.

But again, either way, I think "do" is helping a *lot* here (as is getting rid
of the _t_ idea).


thanks,
Ahmed S. Darwish Nov. 4, 2020, 3:17 a.m. UTC | #20
On Tue, Nov 03, 2020 at 06:01:30PM -0800, John Hubbard wrote:
> On 11/3/20 5:32 PM, Ahmed S. Darwish wrote:
...
> >   #define __read_seqcount_retry(s, start)					\
> > -	__read_seqcount_t_retry(__seqcount_ptr(s), start)
> > +	__do___read_seqcount_retry(__seqcount_ptr(s), start)
>
...
> A nit: while various numbers of leading underscores are sometimes used, it's a lot
> less common to use, say, 3 consecutive underscores (as above) *within* the name. And
> I don't think you need it for uniqueness, at least from a quick look around here.
>
...
> But again, either way, I think "do" is helping a *lot* here (as is getting rid
> of the _t_ idea).

The three underscores are needed because there's a do_ version for
read_seqcount_retry(), and another for __read_seqcount_retry().

Similarly for {__,}read_seqcount_begin(). You want to be very careful
with this, and never mistakenly mix the two, because it affects some VFS
hot paths.
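
That is, there are (roughly) two distinct internal helpers behind the
two external entry points, e.g.:

  #define read_seqcount_retry(s, start)   \
          __do_read_seqcount_retry(__seqcount_ptr(s), start)

  #define __read_seqcount_retry(s, start) \
          __do___read_seqcount_retry(__seqcount_ptr(s), start)

and collapsing the extra underscores would collide the barrier-less __
variant with the full one.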

Nonetheless, as you mentioned in the later (dropped) part of your
message, I think do_ is better than __do_, so the final result will be:

  do___read_seqcount_retry()
  do_read_seqcount_retry()
  do_raw_write_seqcount_begin()
  do_raw_write_seqcount_end()
  do_write_seqcount_begin()
  ...

and so on.

I'll wait for some further feedback on the two patches (possibly from
Linus or PeterZ), then send a mini patch series.

(This shouldn't block a v3 of Jason's mm patch series though, as it will
 be using the external seqlock.h APIs anyway...).
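
(Concretely, the external-API form of that copy_page_range() hunk would
presumably look something like:

          mmap_assert_write_locked(src_mm);
          /*
           * Plain seqcount_t: the conditional preempt_disable() inside
           * the macro resolves to a no-op, so this stays preemption
           * neutral.
           */
          raw_write_seqcount_begin(&src_mm->write_protect_seq);
          ...
          raw_write_seqcount_end(&src_mm->write_protect_seq);

 with the matching read_seqcount_retry() on the gup_fast side.)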

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH
Linus Torvalds Nov. 4, 2020, 6:38 p.m. UTC | #21
On Tue, Nov 3, 2020 at 7:17 PM Ahmed S. Darwish <a.darwish@linutronix.de> wrote:
>
> Nonetheless, as you mentioned in the later (dropped) part of your
> message, I think do_ is better than __do_, so the final result will be:
>
>   do___read_seqcount_retry()
>   do_read_seqcount_retry()
>   do_raw_write_seqcount_begin()
>   do_raw_write_seqcount_end()
>   do_write_seqcount_begin()
>   ...
>
> and so on.

Looks reasonable to me.

And can you add a few comments to the magic type macros, so that it's
a lot more obvious what the end result was. I clearly wasn't able to
follow all the _Generic() cases from the seqcount_t to the final end
result. It's a really odd combination of subtle _Generic() macro and
token pasting to get from seqcount_t to "false" in
__seqcount_lock_preemptible().

I can see it when I really look, but when looking at the actual use,
it's very non-obvious indeed.

                 Linus
Peter Zijlstra Nov. 10, 2020, 11:53 a.m. UTC | #22
On Wed, Nov 04, 2020 at 04:17:11AM +0100, Ahmed S. Darwish wrote:
> On Tue, Nov 03, 2020 at 06:01:30PM -0800, John Hubbard wrote:
> > On 11/3/20 5:32 PM, Ahmed S. Darwish wrote:
> ...
> > >   #define __read_seqcount_retry(s, start)					\
> > > -	__read_seqcount_t_retry(__seqcount_ptr(s), start)
> > > +	__do___read_seqcount_retry(__seqcount_ptr(s), start)
> >
> ...
> > A nit: while various numbers of leading underscores are sometimes used, it's a lot
> > less common to use, say, 3 consecutive underscores (as above) *within* the name. And
> > I don't think you need it for uniqueness, at least from a quick look around here.
> >
> ...
> > But again, either way, I think "do" is helping a *lot* here (as is getting rid
> > of the _t_ idea).
> 
> The three underscores are needed because there's a do_ version for
> read_seqcount_retry(), and another for __read_seqcount_retry().
> 
> Similarly for {__,}read_seqcount_begin(). You want to be very careful
> with this, and never mistakenly mix the two, because it affects some VFS
> hot paths.
> 
> Nonetheless, as you mentioned in the later (dropped) part of your
> message, I think do_ is better than __do_, so the final result will be:
> 
>   do___read_seqcount_retry()
>   do_read_seqcount_retry()
>   do_raw_write_seqcount_begin()
>   do_raw_write_seqcount_end()
>   do_write_seqcount_begin()
>   ...
> 
> and so on.
> 
> I'll wait for some further feedback on the two patches (possibly from
> Linus or PeterZ), then send a mini patch series.

I'm Ok with that. The change I have issues with is:

-#define __seqcount_lock_preemptible(s) __seqprop(s, preemptible)
+#define __seqcount_associated_lock_exists_and_is_preemptible(s)        __seqprop(s, preemptible)

That's just _realllllllly_ long.

Would something like the below make it easier to follow?

seqprop_preemptible() is 'obviously' related to __seqprop_preemptible()
without having to follow through the _Generic token pasting maze.

---
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 8d8552474c64..576594add921 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -307,10 +307,10 @@ SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mu
 	__seqprop_case((s),	mutex,		prop),			\
 	__seqprop_case((s),	ww_mutex,	prop))
 
-#define __seqcount_ptr(s)		__seqprop(s, ptr)
-#define __seqcount_sequence(s)		__seqprop(s, sequence)
-#define __seqcount_lock_preemptible(s)	__seqprop(s, preemptible)
-#define __seqcount_assert_lock_held(s)	__seqprop(s, assert)
+#define seqprop_ptr(s)			__seqprop(s, ptr)
+#define seqprop_sequence(s)		__seqprop(s, sequence)
+#define seqprop_preemptible(s)		__seqprop(s, preemptible)
+#define seqprop_assert(s)	__seqprop(s, assert)
 
 /**
  * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier
@@ -330,7 +330,7 @@ SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mu
 ({									\
 	unsigned __seq;							\
 									\
-	while ((__seq = __seqcount_sequence(s)) & 1)			\
+	while ((__seq = seqprop_sequence(s)) & 1)			\
 		cpu_relax();						\
 									\
 	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
@@ -359,7 +359,7 @@ SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mu
  */
 #define read_seqcount_begin(s)						\
 ({									\
-	seqcount_lockdep_reader_access(__seqcount_ptr(s));		\
+	seqcount_lockdep_reader_access(seqprop_ptr(s));			\
 	raw_read_seqcount_begin(s);					\
 })
 
@@ -376,7 +376,7 @@ SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mu
  */
 #define raw_read_seqcount(s)						\
 ({									\
-	unsigned __seq = __seqcount_sequence(s);			\
+	unsigned __seq = seqprop_sequence(s);				\
 									\
 	smp_rmb();							\
 	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
@@ -425,7 +425,7 @@ SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mu
  * Return: true if a read section retry is required, else false
  */
 #define __read_seqcount_retry(s, start)					\
-	__read_seqcount_t_retry(__seqcount_ptr(s), start)
+	__read_seqcount_t_retry(seqprop_ptr(s), start)
 
 static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start)
 {
@@ -445,7 +445,7 @@ static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start)
  * Return: true if a read section retry is required, else false
  */
 #define read_seqcount_retry(s, start)					\
-	read_seqcount_t_retry(__seqcount_ptr(s), start)
+	read_seqcount_t_retry(seqprop_ptr(s), start)
 
 static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start)
 {
@@ -459,10 +459,10 @@ static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start)
  */
 #define raw_write_seqcount_begin(s)					\
 do {									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_disable();					\
 									\
-	raw_write_seqcount_t_begin(__seqcount_ptr(s));			\
+	raw_write_seqcount_t_begin(seqprop_ptr(s));			\
 } while (0)
 
 static inline void raw_write_seqcount_t_begin(seqcount_t *s)
@@ -478,9 +478,9 @@ static inline void raw_write_seqcount_t_begin(seqcount_t *s)
  */
 #define raw_write_seqcount_end(s)					\
 do {									\
-	raw_write_seqcount_t_end(__seqcount_ptr(s));			\
+	raw_write_seqcount_t_end(seqprop_ptr(s));			\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_enable();					\
 } while (0)
 
@@ -501,12 +501,12 @@ static inline void raw_write_seqcount_t_end(seqcount_t *s)
  */
 #define write_seqcount_begin_nested(s, subclass)			\
 do {									\
-	__seqcount_assert_lock_held(s);					\
+	seqprop_assert(s);						\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_disable();					\
 									\
-	write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass);	\
+	write_seqcount_t_begin_nested(seqprop_ptr(s), subclass);	\
 } while (0)
 
 static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass)
@@ -528,12 +528,12 @@ static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass)
  */
 #define write_seqcount_begin(s)						\
 do {									\
-	__seqcount_assert_lock_held(s);					\
+	seqprop_assert(s);						\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_disable();					\
 									\
-	write_seqcount_t_begin(__seqcount_ptr(s));			\
+	write_seqcount_t_begin(seqprop_ptr(s));				\
 } while (0)
 
 static inline void write_seqcount_t_begin(seqcount_t *s)
@@ -549,9 +549,9 @@ static inline void write_seqcount_t_begin(seqcount_t *s)
  */
 #define write_seqcount_end(s)						\
 do {									\
-	write_seqcount_t_end(__seqcount_ptr(s));			\
+	write_seqcount_t_end(seqprop_ptr(s));				\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_enable();					\
 } while (0)
 
@@ -603,7 +603,7 @@ static inline void write_seqcount_t_end(seqcount_t *s)
  *      }
  */
 #define raw_write_seqcount_barrier(s)					\
-	raw_write_seqcount_t_barrier(__seqcount_ptr(s))
+	raw_write_seqcount_t_barrier(seqprop_ptr(s))
 
 static inline void raw_write_seqcount_t_barrier(seqcount_t *s)
 {
@@ -623,7 +623,7 @@ static inline void raw_write_seqcount_t_barrier(seqcount_t *s)
  * will complete successfully and see data older than this.
  */
 #define write_seqcount_invalidate(s)					\
-	write_seqcount_t_invalidate(__seqcount_ptr(s))
+	write_seqcount_t_invalidate(seqprop_ptr(s))
 
 static inline void write_seqcount_t_invalidate(seqcount_t *s)
 {
diff mbox series

Patch

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 992fb1415c0f1f..6a2f542d9588a4 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -93,6 +93,7 @@  static struct mm_struct tboot_mm = {
 	.pgd            = swapper_pg_dir,
 	.mm_users       = ATOMIC_INIT(2),
 	.mm_count       = ATOMIC_INIT(1),
+	.write_protect_seq = SEQCNT_ZERO(tboot_mm.write_protect_seq),
 	MMAP_LOCK_INITIALIZER(init_mm)
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 5e5480a0a32d7d..2520f6e05f4d44 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -57,6 +57,7 @@  struct mm_struct efi_mm = {
 	.mm_rb			= RB_ROOT,
 	.mm_users		= ATOMIC_INIT(2),
 	.mm_count		= ATOMIC_INIT(1),
+	.write_protect_seq      = SEQCNT_ZERO(efi_mm.write_protect_seq),
 	MMAP_LOCK_INITIALIZER(efi_mm)
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5a9238f6caad97..f13beacda6fd23 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@ 
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
+#include <linux/seqlock.h>
 
 #include <asm/mmu.h>
 
@@ -446,6 +447,12 @@  struct mm_struct {
 		 */
 		atomic_t has_pinned;
 
+		/**
+		 * @write_protect_seq: Odd when any thread is write protecting
+		 * pages in this mm, for instance during fork().
+		 */
+		seqcount_t write_protect_seq;
+
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 32083db7a2a23e..6fd934a6d60d96 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1007,6 +1007,7 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->vmacache_seqnum = 0;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
+	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->core_state = NULL;
diff --git a/mm/gup.c b/mm/gup.c
index 150cc962c99201..7944749ed7252f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2678,11 +2678,18 @@  static unsigned long lockless_pages_from_mm(unsigned long start,
 {
 	unsigned long flags;
 	int nr_pinned = 0;
+	unsigned seq;
 
 	if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
 	    !gup_fast_permitted(start, end))
 		return 0;
 
+	if (gup_flags & FOLL_PIN) {
+		seq = raw_read_seqcount(&current->mm->write_protect_seq);
+		if (seq & 1)
+			return 0;
+	}
+
 	/*
 	 * Disable interrupts. The nested form is used, in order to allow full,
 	 * general purpose use of this routine.
@@ -2697,6 +2704,18 @@  static unsigned long lockless_pages_from_mm(unsigned long start,
 	local_irq_save(flags);
 	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
 	local_irq_restore(flags);
+
+	/*
+	 * When pinning pages for DMA there could be a concurrent write protect
+	 * from fork() via copy_page_range(), in this case always fail fast GUP.
+	 */
+	if (gup_flags & FOLL_PIN) {
+		if (read_seqcount_t_retry(&current->mm->write_protect_seq,
+					  seq)) {
+			unpin_user_pages(pages, nr_pinned);
+			return 0;
+		}
+	}
 	return nr_pinned;
 }
 
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 3a613c85f9ede2..153162669f8062 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -31,6 +31,7 @@  struct mm_struct init_mm = {
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
+	.write_protect_seq = SEQCNT_ZERO(init_mm.write_protect_seq),
 	MMAP_LOCK_INITIALIZER(init_mm)
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e50268..294c2c3c4fe00d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1171,6 +1171,12 @@  copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
 					0, src_vma, src_mm, addr, end);
 		mmu_notifier_invalidate_range_start(&range);
+		/*
+		 * The read side doesn't spin, it goes to the mmap_lock, so the
+		 * raw version is used to avoid disabling preemption here
+		 */
+		mmap_assert_write_locked(src_mm);
+		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
 	}
 
 	ret = 0;
@@ -1187,8 +1193,10 @@  copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
-	if (is_cow)
+	if (is_cow) {
+		raw_write_seqcount_t_end(&src_mm->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
+	}
 	return ret;
 }