Message ID | 2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com (mailing list archive)
State | New, archived
Series | Add a seqcount between gup_fast and copy_page_range()
On Fri 30-10-20 11:46:21, Jason Gunthorpe wrote: > Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during > fork() for ptes") pages under a FOLL_PIN will not be write protected > during COW for fork. This means that pages returned from > pin_user_pages(FOLL_WRITE) should not become write protected while the pin > is active. > > However, there is a small race where get_user_pages_fast(FOLL_PIN) can > establish a FOLL_PIN at the same time copy_present_page() is write > protecting it: > > CPU 0 CPU 1 > get_user_pages_fast() > internal_get_user_pages_fast() > copy_page_range() > pte_alloc_map_lock() > copy_present_page() > atomic_read(has_pinned) == 0 > page_maybe_dma_pinned() == false > atomic_set(has_pinned, 1); > gup_pgd_range() > gup_pte_range() > pte_t pte = gup_get_pte(ptep) > pte_access_permitted(pte) > try_grab_compound_head() > pte = pte_wrprotect(pte) > set_pte_at(); > pte_unmap_unlock() > // GUP now returns with a write protected page > > The first attempt to resolve this by using the write protect caused > problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid > early COW write protect games during fork()") > > Instead wrap copy_p4d_range() with the write side of a seqcount and check > the read side around gup_pgd_range(). If there is a collision then > get_user_pages_fast() fails and falls back to slow GUP. > > Slow GUP is safe against this race because copy_page_range() is only > called while holding the exclusive side of the mmap_lock on the src > mm_struct. > > Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()") > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com > Reviewed-by: John Hubbard <jhubbard@nvidia.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Looks good to me. Just one nit below. With that fixed feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> > @@ -446,6 +447,12 @@ struct mm_struct { > */ > atomic_t has_pinned; > > + /** > + * @write_protect_seq: Odd when any thread is write protecting > + * pages in this mm, for instance during fork(). > + */ > + seqcount_t write_protect_seq; > + So this comment isn't quite true. We can be write protecting pages due to many other reasons and not touch write_protect_seq. E.g. for shared mappings or due to explicit mprotect() calls. So the write_protect_seq protection has to be about something more than pure write protection. One requirement certainly is that the VMA has to be is_cow_mapping(). What about mprotect(2) calls? I guess the application would have only itself to blame so we don't care? Anyway my point is just that the comment should say more about what this is about. I'd even go as far as making it "page table copying during fork in progress". Honza
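To make the pairing concrete before the detailed review continues: a minimal sketch of how the gup_fast read side is meant to bracket the lockless walk. This is not the literal patch text; the helper name is invented, and error handling and the unpinning of already-grabbed pages are simplified away.

#include <linux/mm.h>
#include <linux/seqlock.h>

/* Read side, as in internal_get_user_pages_fast() (sketch only) */
static int gup_fast_read_side_sketch(struct mm_struct *mm,
				     unsigned int gup_flags)
{
	unsigned int seq = 0;

	if (gup_flags & FOLL_PIN) {
		seq = raw_read_seqcount(&mm->write_protect_seq);
		if (seq & 1)		/* fork() is mid write-protect */
			return -EAGAIN;	/* caller falls back to slow GUP */
	}

	/* ... lockless page table walk: gup_pgd_range() ... */

	if ((gup_flags & FOLL_PIN) &&
	    read_seqcount_retry(&mm->write_protect_seq, seq))
		return -EAGAIN;		/* collided with copy_page_range() */

	return 0;
}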
On Fri, Oct 30, 2020 at 05:51:05PM +0100, Jan Kara wrote: > Looks good to me. Just one nit below. With that fixed feel free to add: > > Reviewed-by: Jan Kara <jack@suse.cz> Thanks! > > @@ -446,6 +447,12 @@ struct mm_struct { > > */ > > atomic_t has_pinned; > > > > + /** > > + * @write_protect_seq: Odd when any thread is write protecting > > + * pages in this mm, for instance during fork(). > > + */ > > + seqcount_t write_protect_seq; > > + > > So this comment isn't quite true. We can be write protecting pages due to > many other reasons and not touch write_protect_seq. E.g. for shared > mappings or due to explicit mprotect() calls. So the write_protect_seq > protection has to be about something more than pure write protection. One > requirement certainly is that the VMA has to be is_cow_mapping(). What > about mprotect(2) calls? I guess the application would have only itself to > blame so we don't care? Yes, that sounds right. How about /** * @write_protect_seq: Locked when any thread is write protecting * pages for COW in this mm, for instance during page table copying * for fork(). */ mprotect and shared mappings cause faults on write access, not COW? Jason
On 10/30/20 7:46 AM, Jason Gunthorpe wrote: > Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during > fork() for ptes") pages under a FOLL_PIN will not be write protected > during COW for fork. This means that pages returned from > pin_user_pages(FOLL_WRITE) should not become write protected while the pin > is active. > > However, there is a small race where get_user_pages_fast(FOLL_PIN) can > establish a FOLL_PIN at the same time copy_present_page() is write > protecting it: > > CPU 0 CPU 1 > get_user_pages_fast() > internal_get_user_pages_fast() > copy_page_range() > pte_alloc_map_lock() > copy_present_page() > atomic_read(has_pinned) == 0 > page_maybe_dma_pinned() == false > atomic_set(has_pinned, 1); > gup_pgd_range() > gup_pte_range() > pte_t pte = gup_get_pte(ptep) > pte_access_permitted(pte) > try_grab_compound_head() > pte = pte_wrprotect(pte) > set_pte_at(); > pte_unmap_unlock() > // GUP now returns with a write protected page > > The first attempt to resolve this by using the write protect caused > problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid > early COW write protect games during fork()") > > Instead wrap copy_p4d_range() with the write side of a seqcount and check > the read side around gup_pgd_range(). If there is a collision then > get_user_pages_fast() fails and falls back to slow GUP. > > Slow GUP is safe against this race because copy_page_range() is only > called while holding the exclusive side of the mmap_lock on the src > mm_struct. > > Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()") > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com > Reviewed-by: John Hubbard <jhubbard@nvidia.com> This updated version still looks good to me. Love-love-love the new raw_seqcount approach! No more bare memory barriers, and it gains some kcsan coverage too. Sweet. :) thanks,
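For readers wondering what the "bare memory barriers" alternative would have looked like, a rough before/after, purely illustrative (write_protect_count is a made-up field name for the open-coded variant):

/* Open-coded odd/even protocol (the style being avoided) */
mm->write_protect_count++;	/* count becomes odd: writer active */
smp_wmb();
/* ... write protect page tables ... */
smp_wmb();
mm->write_protect_count++;	/* count becomes even again */

/* seqcount_t API: the same protocol, but encapsulated and visible
 * to instrumentation such as KCSAN */
raw_write_seqcount_begin(&mm->write_protect_seq);
/* ... write protect page tables ... */
raw_write_seqcount_end(&mm->write_protect_seq);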
Hi, Jason, I think the patch mostly looks good to me, but I have a few pure questions, mostly not directly related to the patch itself, but around the contexts. Since I _feel_ like there'll be a new version to update the comments below, maybe I can still ask aloud... Please bear with me. :) On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote: > Slow GUP is safe against this race because copy_page_range() is only > called while holding the exclusive side of the mmap_lock on the src > mm_struct. Pure question: I understand that this patch requires this, but... Could anyone remind me why read lock of mmap_sem is not enough for fork() before this one? > > Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()") > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com > Reviewed-by: John Hubbard <jhubbard@nvidia.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > --- > arch/x86/kernel/tboot.c | 1 + > drivers/firmware/efi/efi.c | 1 + > include/linux/mm_types.h | 7 +++++++ > kernel/fork.c | 1 + > mm/gup.c | 19 +++++++++++++++++++ > mm/init-mm.c | 1 + > mm/memory.c | 10 +++++++++- > 7 files changed, 39 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c > index 992fb1415c0f1f..6a2f542d9588a4 100644 > --- a/arch/x86/kernel/tboot.c > +++ b/arch/x86/kernel/tboot.c > @@ -93,6 +93,7 @@ static struct mm_struct tboot_mm = { > .pgd = swapper_pg_dir, > .mm_users = ATOMIC_INIT(2), > .mm_count = ATOMIC_INIT(1), > + .write_protect_seq = SEQCNT_ZERO(tboot_mm.write_protect_seq), > MMAP_LOCK_INITIALIZER(init_mm) > .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), > .mmlist = LIST_HEAD_INIT(init_mm.mmlist), > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > index 5e5480a0a32d7d..2520f6e05f4d44 100644 > --- a/drivers/firmware/efi/efi.c > +++ b/drivers/firmware/efi/efi.c > @@ -57,6 +57,7 @@ struct mm_struct efi_mm = { > .mm_rb = RB_ROOT, > .mm_users = ATOMIC_INIT(2), > .mm_count = ATOMIC_INIT(1), > + .write_protect_seq = SEQCNT_ZERO(efi_mm.write_protect_seq), > MMAP_LOCK_INITIALIZER(efi_mm) > .page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock), > .mmlist = LIST_HEAD_INIT(efi_mm.mmlist), Another pure question: I'm just curious how you find all the statically defined mm_structs, and to make sure all of them are covered (just in case an un-initialized seqcount could fail strangely). Actually I'm thinking whether we should have one place to keep all the init vars for all the statically defined mm_structs, so we don't need to find them every time, but only change that one place. > diff --git a/mm/memory.c b/mm/memory.c > index c48f8df6e50268..294c2c3c4fe00d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > 0, src_vma, src_mm, addr, end); > mmu_notifier_invalidate_range_start(&range); > + /* > + * The read side doesn't spin, it goes to the mmap_lock, so the > + * raw version is used to avoid disabling preemption here > + */ > + mmap_assert_write_locked(src_mm); > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); Would raw_write_seqcount_begin() be better here? My understanding is that we used raw_write_seqcount_t_begin() because we're under a spin lock, so we're assuming preemption is already disabled. 
However I'm thinking whether raw_write_seqcount_begin() would be even better to guarantee that. I have no idea how the rt kernel merging effort is going, but if the rt kernel is merged into mainline then IIUC preemption is allowed here (since the pgtable spin lock should be an rt_spin_lock, not a raw spin lock). An even further pure question on __seqcount_preemptible() (feel free to ignore this question!): I saw that __seqcount_preemptible() seems to have been constantly defined as "return false". Not sure what happened there.. Thanks,
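A sketch of the consolidation idea Peter raises, as one possible shape (MM_STRUCT_INITIALIZER is a hypothetical macro name, not an existing kernel symbol; the fields are the ones visible in the hunks quoted above):

/* Hypothetical shared initializer for statically defined mm_structs */
#define MM_STRUCT_INITIALIZER(name)					\
	.mm_users	   = ATOMIC_INIT(2),				\
	.mm_count	   = ATOMIC_INIT(1),				\
	.write_protect_seq = SEQCNT_ZERO(name.write_protect_seq),	\
	MMAP_LOCK_INITIALIZER(name)					\
	.page_table_lock   = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \
	.mmlist		   = LIST_HEAD_INIT(name.mmlist),

struct mm_struct efi_mm = {
	.mm_rb = RB_ROOT,
	MM_STRUCT_INITIALIZER(efi_mm)
};

With something like this, a new field such as write_protect_seq would only need to be added in one place instead of in every static definition.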
On Fri, Oct 30, 2020 at 06:52:50PM -0400, Peter Xu wrote: > Hi, Jason, > > I think the patch mostly looks good to me, but I have a few pure questions, > mostly not directly related to the patch itself, but around the contexts. > Since I _feel_ like there'll be a new version to update the comments below, > maybe I can still ask aloud... Please bear with me. :) No problem > On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote: > > Slow GUP is safe against this race because copy_page_range() is only > > called while holding the exclusive side of the mmap_lock on the src > > mm_struct. > > Pure question: I understand that this patch requires this, but... Could anyone > remind me why read lock of mmap_sem is not enough for fork() before this one? I do not know why fork uses the exclusive lock, however the write side of the seqcount must be exclusive or it doesn't work properly. Was happy to find we already had the locking to support this. > > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > > index 5e5480a0a32d7d..2520f6e05f4d44 100644 > > +++ b/drivers/firmware/efi/efi.c > > @@ -57,6 +57,7 @@ struct mm_struct efi_mm = { > > .mm_rb = RB_ROOT, > > .mm_users = ATOMIC_INIT(2), > > .mm_count = ATOMIC_INIT(1), > > + .write_protect_seq = SEQCNT_ZERO(efi_mm.write_protect_seq), > > MMAP_LOCK_INITIALIZER(efi_mm) > > .page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock), > > .mmlist = LIST_HEAD_INIT(efi_mm.mmlist), > > Another pure question: I'm just curious how you find all the statically > defined mm_structs, and to make sure all of them are covered (just in case > an un-initialized seqcount could fail strangely). I searched for all MMAP_LOCK_INITIALIZER() places and assumed that Michel got them all when he added it :) > Actually I'm thinking whether we should have one place to keep all the init > vars for all the statically defined mm_structs, so we don't need to find them > every time, but only change that one place. I was thinking that as well, most of the places are all the same > > diff --git a/mm/memory.c b/mm/memory.c > > index c48f8df6e50268..294c2c3c4fe00d 100644 > > +++ b/mm/memory.c > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > > 0, src_vma, src_mm, addr, end); > > mmu_notifier_invalidate_range_start(&range); > > + /* > > + * The read side doesn't spin, it goes to the mmap_lock, so the > > + * raw version is used to avoid disabling preemption here > > + */ > > + mmap_assert_write_locked(src_mm); > > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > > Would raw_write_seqcount_begin() be better here? Hum.. I felt no because it had the preempt stuff added into it, however it would work - __seqcount_lock_preemptible() == false for the seqcount_t case (see below) Looking more closely, maybe the right API to pick is write_seqcount_t_begin() and write_seqcount_t_end() ?? However, no idea what the intention of the '*_seqcount_t_*' family is - it only seems to be used to implement the seqlock.. Let's add Ahmed, perhaps he can give some guidance (see next section)? > My understanding is that we used raw_write_seqcount_t_begin() because we're > under a spin lock, so we're assuming preemption is already disabled. Here we rely on the exclusive mmap_lock, not a spinlock. This ensures only one write side is running concurrently as required by seqcount. 
The concern about preemption disable is that it wasn't held for fork() before, and we don't need it.. I understand preemption disable regions must be short or the RT people will not be happy, holding one across all of copy_page_range() sounds bad. Ahmed explained in commit 8117ab508f the reason the seqcount_t write side has preemption disabled is because it can livelock RT kernels if the read side is spinning after preempting the write side. eg look at how __read_seqcount_begin() is implemented: while ((seq = __seqcount_sequence(s)) & 1) \ cpu_relax(); \ However, in this patch, we don't spin on the read side. If the read side collides with a writer it immediately goes to the mmap_lock, which is sleeping, and so it will sort itself out properly, even if it was preempted. > An even further pure question on __seqcount_preemptible() (feel free to ignore > this question!): I saw that __seqcount_preemptible() seems to have been > constantly defined as "return false". Not sure what happened there.. The new code has a range of seqcount_t types see Documentation/locking/seqlock.rst 'Sequence counters with associated locks' It uses _Generic to do a bit of meta-programming and creates a compile time table of lock properties: SEQCOUNT_LOCKNAME(raw_spinlock, raw_spinlock_t, false, s->lock, raw_spin, raw_spin_lock(s->lock)) SEQCOUNT_LOCKNAME(spinlock, spinlock_t, __SEQ_RT, s->lock, spin, spin_lock(s->lock)) SEQCOUNT_LOCKNAME(rwlock, rwlock_t, __SEQ_RT, s->lock, read, read_lock(s->lock)) SEQCOUNT_LOCKNAME(mutex, struct mutex, true, s->lock, mutex, mutex_lock(s->lock)) SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mutex, ww_mutex_lock(s->lock, NULL)) As well as a default set of properties for normal seqcount_t. The __seqcount_preemptible() is selected by the _Generic for seqcount_t: #define __seqprop(s, prop) _Generic(*(s), \ seqcount_t: __seqprop_##prop((void *)(s)), \ And it says preemption must be disabled before using the lock: static inline void __seqprop_assert(const seqcount_t *s) { lockdep_assert_preemption_disabled(); } And thus no need to have an automatic disable preemption: static inline bool __seqprop_preemptible(const seqcount_t *s) { return false; } Other lock subtypes are different, eg the codegen for mutex will use lockdep_assert_held(s->lock) for _assert and true for _preemptible() So if we map the 'write begin' entry points: write_seqcount_begin - Enforces preemption off raw_write_seqcount_begin - Auto disable preemption if required (false) raw_write_seqcount_t_begin - No preemption stuff write_seqcount_t_begin - No preemption stuff Jason
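The _Generic() dispatch described above can be demonstrated outside the kernel. A self-contained C11 toy, with two counter types instead of the kernel's six and only the "preemptible" property:

#include <stdio.h>

typedef struct { unsigned sequence; } seqcount_t;
typedef struct { seqcount_t seqcount; /* struct mutex *lock; */ } seqcount_mutex_t;

static int preemptible_plain(const void *s) { (void)s; return 0; }
static int preemptible_mutex(const void *s) { (void)s; return 1; }

/* Compile-time dispatch on the static type of the counter */
#define seqcount_preemptible(s)					\
	_Generic(*(s),						\
		 seqcount_t:	   preemptible_plain,		\
		 seqcount_mutex_t: preemptible_mutex)(s)

int main(void)
{
	seqcount_t plain = { 0 };
	seqcount_mutex_t with_mutex = { { 0 } };

	/* prints 0: plain seqcount_t never auto-handles preemption */
	printf("%d\n", seqcount_preemptible(&plain));
	/* prints 1: a sleeping-lock variant would be preemptible */
	printf("%d\n", seqcount_preemptible(&with_mutex));
	return 0;
}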
On Fri, Oct 30, 2020 at 08:51:21PM -0300, Jason Gunthorpe wrote: > > Another pure question: I'm just curious how you find all the statically > > defined mm_structs, and to make sure all of them are covered (just in case > > an un-initialized seqcount could fail strangely). > > I searched for all MMAP_LOCK_INITIALIZER() places and assumed that > Michel got them all when he added it :) Hmm, I should have noticed that before asking.. :) > > > Actually I'm thinking whether we should have one place to keep all the init > > vars for all the statically defined mm_structs, so we don't need to find them > > every time, but only change that one place. > > I was thinking that as well, most of the places are all the same Yes, we can work on top. > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index c48f8df6e50268..294c2c3c4fe00d 100644 > > > +++ b/mm/memory.c > > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > > > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > > > 0, src_vma, src_mm, addr, end); > > > mmu_notifier_invalidate_range_start(&range); > > > + /* > > > + * The read side doesn't spin, it goes to the mmap_lock, so the > > > + * raw version is used to avoid disabling preemption here > > > + */ > > > + mmap_assert_write_locked(src_mm); > > > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > > > > Would raw_write_seqcount_begin() be better here? > > Hum.. > > I felt no because it had the preempt stuff added into it, however it > would work - __seqcount_lock_preemptible() == false for the seqcount_t > case (see below) > > Looking more closely, maybe the right API to pick is > write_seqcount_t_begin() and write_seqcount_t_end() ?? > > However, no idea what the intention of the '*_seqcount_t_*' family is > - it only seems to be used to implement the seqlock.. > > Let's add Ahmed, perhaps he can give some guidance (see next section)? IMHO we shouldn't directly use these helpers since they seem to only be used by lock-associated versions of seqcount types. But yeah, Ahmed would be the best one to answer... > > > My understanding is that we used raw_write_seqcount_t_begin() because we're > > under a spin lock, so we're assuming preemption is already disabled. > > Here we rely on the exclusive mmap_lock, not a spinlock. This ensures > only one write side is running concurrently as required by seqcount. So imho here we have these things to consider during one thread updating the seqcount_t: 0. Concurrent read is perfectly welcome, for sure. 1. Concurrent writes on seqcount_t: mm sem protects it. 2. Preempted write (if possible, maybe on RT?): I think it's also protected by mm sem, so looks ok too to me. 3. Preempted/interrupted read on seqcount_t. Seems to be the one discussed below. Looks safe to me now with below explanation. However... 
Shall we document this explicitly (if this patch still needs a repost)? It seems not straightforward, since this is not the usual way to use seqcount; not sure whether I'm the only one who feels this way, though. > > If the read side collides with a writer it immediately goes to the > mmap_lock, which is sleeping, and so it will sort itself out properly, > even if it was preempted. > > > An even further pure question on __seqcount_preemptible() (feel free to ignore > > this question!): I saw that __seqcount_preemptible() seems to have been > > constantly defined as "return false". Not sure what happened there.. > > The new code has a range of seqcount_t types see > Documentation/locking/seqlock.rst 'Sequence counters with associated > locks' > > It uses _Generic to do a bit of meta-programming and creates a compile > time table of lock properties: > > SEQCOUNT_LOCKNAME(raw_spinlock, raw_spinlock_t, false, s->lock, raw_spin, raw_spin_lock(s->lock)) > SEQCOUNT_LOCKNAME(spinlock, spinlock_t, __SEQ_RT, s->lock, spin, spin_lock(s->lock)) > SEQCOUNT_LOCKNAME(rwlock, rwlock_t, __SEQ_RT, s->lock, read, read_lock(s->lock)) > SEQCOUNT_LOCKNAME(mutex, struct mutex, true, s->lock, mutex, mutex_lock(s->lock)) > SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mutex, ww_mutex_lock(s->lock, NULL)) > > As well as a default set of properties for normal seqcount_t. The > __seqcount_preemptible() is selected by the _Generic for seqcount_t: > > #define __seqprop(s, prop) _Generic(*(s), \ > seqcount_t: __seqprop_##prop((void *)(s)), \ > > And it says preemption must be disabled before using the lock: > > static inline void __seqprop_assert(const seqcount_t *s) > { > lockdep_assert_preemption_disabled(); > } > > And thus no need to have an automatic disable preemption: > > static inline bool __seqprop_preemptible(const seqcount_t *s) > { > return false; > } > > Other lock subtypes are different, eg the codegen for mutex will use > lockdep_assert_held(s->lock) for _assert and true for _preemptible() > > So if we map the 'write begin' entry points: > > write_seqcount_begin - Enforces preemption off > raw_write_seqcount_begin - Auto disable preemption if required (false) > raw_write_seqcount_t_begin - No preemption stuff > write_seqcount_t_begin - No preemption stuff Thanks for listing these details. As a summary, I think I'm convinced maybe we can have this work without disabling preemption. It's just that some more comments might be even better. The other thing is, considering this use of seqcount seems to be quite special as explained below, I'm just not sure whether this would confuse lockdep or kcsan, etc., if we decide to use write_seqcount_t_begin(). Thanks,
On Fri 30-10-20 14:02:26, Jason Gunthorpe wrote: > On Fri, Oct 30, 2020 at 05:51:05PM +0100, Jan Kara wrote: > > > @@ -446,6 +447,12 @@ struct mm_struct { > > > */ > > > atomic_t has_pinned; > > > > > > + /** > > > + * @write_protect_seq: Odd when any thread is write protecting > > > + * pages in this mm, for instance during fork(). > > > + */ > > > + seqcount_t write_protect_seq; > > > + > > > > So this comment isn't quite true. We can be write protecting pages due to > > many other reasons and not touch write_protect_seq. E.g. for shared > > mappings or due to explicit mprotect() calls. So the write_protect_seq > > protection has to be about something more than pure write protection. One > > requirement certainly is that the VMA has to be is_cow_mapping(). What > > about mprotect(2) calls? I guess the application would have only itself to > > blame so we don't care? > > Yes, that sounds right. How about > > /** > * @write_protect_seq: Locked when any thread is write protecting > * pages for COW in this mm, for instance during page table copying ^^^ maybe I'd write a bit more explicitly "... write protecting pages mapped by this mm to enforce later COW, ..." > * for fork(). > */ > > mprotect and shared mappings cause faults on write access, not COW? Correct. Honza
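Merging Jason's wording with Jan's tweak, the field comment would come out roughly as:

/**
 * @write_protect_seq: Locked when any thread is write protecting
 * pages mapped by this mm to enforce a later COW, for instance
 * during page table copying for fork().
 */
seqcount_t write_protect_seq;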
On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote: ... > diff --git a/mm/memory.c b/mm/memory.c > index c48f8df6e50268..294c2c3c4fe00d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > 0, src_vma, src_mm, addr, end); > mmu_notifier_invalidate_range_start(&range); > + /* > + * The read side doesn't spin, it goes to the mmap_lock, so the > + * raw version is used to avoid disabling preemption here > + */ > + mmap_assert_write_locked(src_mm); > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > } > Please, s/raw_write_seqcount_t_begin()/raw_write_seqcount_begin()/g. For plain seqcount_t, it's the same, while still respecting the seqlock.h API boundaries. Let's make the comment also a bit more clear (IMHO, "lockdep" needs to be mentioned somewhere): /* * Disabling preemption is not needed for the write side, as * the read side doesn't spin, but goes to the mmap_lock. * * Use the raw variant of the seqcount_t write API to avoid * lockdep complaining about preemptibility. */ mmap_assert_write_locked(src_mm); raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > ret = 0; > @@ -1187,8 +1193,10 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > } > } while (dst_pgd++, src_pgd++, addr = next, addr != end); > > - if (is_cow) > + if (is_cow) { > + raw_write_seqcount_t_end(&src_mm->write_protect_seq); ditto. s/raw_write_seqcount_t_end()/raw_write_seqcount_end()/g > mmu_notifier_invalidate_range_end(&range); > + } > return ret; > } > Thanks, -- Ahmed S. Darwish Linutronix GmbH
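Applying Ahmed's substitution to the hunks above, the write section in copy_page_range() would read as follows (a sketch of the v3 direction assembled from this review, not a quoted patch):

	if (is_cow) {
		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
					0, src_vma, src_mm, addr, end);
		mmu_notifier_invalidate_range_start(&range);
		/*
		 * Disabling preemption is not needed for the write side,
		 * as the read side doesn't spin, but goes to the mmap_lock.
		 *
		 * Use the raw variant of the seqcount_t write API to avoid
		 * lockdep complaining about preemptibility.
		 */
		mmap_assert_write_locked(src_mm);
		raw_write_seqcount_begin(&src_mm->write_protect_seq);
	}

	/* ... the copy_p4d_range() loop ... */

	if (is_cow) {
		raw_write_seqcount_end(&src_mm->write_protect_seq);
		mmu_notifier_invalidate_range_end(&range);
	}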
On Fri, Oct 30, 2020 at 08:51:21PM -0300, Jason Gunthorpe wrote: > On Fri, Oct 30, 2020 at 06:52:50PM -0400, Peter Xu wrote: ... > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index c48f8df6e50268..294c2c3c4fe00d 100644 > > > +++ b/mm/memory.c > > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > > > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > > > 0, src_vma, src_mm, addr, end); > > > mmu_notifier_invalidate_range_start(&range); > > > + /* > > > + * The read side doesn't spin, it goes to the mmap_lock, so the > > > + * raw version is used to avoid disabling preemption here > > > + */ > > > + mmap_assert_write_locked(src_mm); > > > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > > > > Would raw_write_seqcount_begin() be better here? > > Hum.. > > I felt no because it had the preempt stuff added into it, however it > would work - __seqcount_lock_preemptible() == false for the seqcount_t > case (see below) > > Looking more closely, maybe the right API to pick is > write_seqcount_t_begin() and write_seqcount_t_end() ?? > No, that's not the right API: it is also internal to seqlock.h. Please stick with the official exported API: raw_write_seqcount_begin(). It should satisfy your needs, and the raw_*() variant is created exactly for contexts wishing to avoid the lockdep checks (e.g. NMI handlers cannot invoke lockdep, etc.) > However, no idea what the intention of the '*_seqcount_t_*' family is > - it only seems to be used to implement the seqlock.. > Exactly. '*_seqcount_t_*' is a seqlock.h implementation detail, and it has _zero_ relevance to what is discussed in this thread actually. ... > Ahmed explained in commit 8117ab508f the reason the seqcount_t write > side has preemption disabled is because it can livelock RT kernels if > the read side is spinning after preempting the write side. eg look at > how __read_seqcount_begin() is implemented: > > while ((seq = __seqcount_sequence(s)) & 1) \ > cpu_relax(); \ > > However, in this patch, we don't spin on the read side. > > If the read side collides with a writer it immediately goes to the > mmap_lock, which is sleeping, and so it will sort itself out properly, > even if it was preempted. > Correct. Thanks, -- Ahmed Darwish Linutronix GmbH
On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: > Please stick with the official exported API: raw_write_seqcount_begin(). How did you know this was 'official exported API' ?? > Exactly. '*_seqcount_t_*' is a seqlock.h implementation detail, and it > has _zero_ relevance to what is discussed in this thread actually. Add some leading __'s to them? Jason
On Sat, Oct 31, 2020 at 11:26:05AM -0400, Peter Xu wrote: ... > Shall we document this explicitly (if this patch still needs a repost)? Yes, this patch series needs a v3 :) > It seems not straightforward, since this is not the usual way to use seqcount; > not sure whether I'm the only one who feels this way, though. Yes, this usage is correct but not common. I've proposed a more explicit comment above the write section code, in my reply to patch #2. ... > The other thing is, considering this use of seqcount seems to be quite special > as explained below, I'm just not sure whether this would confuse lockdep or > kcsan, etc., if we decide to use write_seqcount_t_begin(). > Lockdep won't be confused as it's not used in the raw_*() variant of the seqcount APIs. AFAIK KCSAN also has some margin to protect itself from this: see seqlock.h KCSAN_SEQLOCK_REGION_MAX. Thanks, > Peter Xu Ahmed Darwish Linutronix GmbH
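The lockdep point is visible when the two write-side entry points are set side by side. A condensed sketch (KCSAN hooks omitted; the full versions appear in patch #1 later in this thread):

/* raw variant: only the counter protocol, so it is usable from
 * contexts where lockdep must not run (NMI handlers, etc.) */
static inline void raw_begin_sketch(seqcount_t *s)
{
	s->sequence++;	/* count becomes odd: writer active */
	smp_wmb();
}

/* non-raw variant: same protocol plus a lockdep acquire, so the
 * associated lock and preemption state get validated */
static inline void begin_sketch(seqcount_t *s)
{
	raw_begin_sketch(s);
	seqcount_acquire(&s->dep_map, 0, 0, _RET_IP_);
}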
On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote: > On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: > > > Please stick with the official exported API: raw_write_seqcount_begin(). > > How did you know this was 'official exported API' ?? > All the official exported seqlock.h APIs are marked with verbose kernel-doc annotations on top. The rest are internal... > > Exactly. '*_seqcount_t_*' is a seqlock.h implementation detail, and it > > has _zero_ relevance to what is discussed in this thread actually. > > Add some leading __'s to them? > It's a bit more complicated than just adding some "__" prefixes, due to the massive compile-time polymorphism done through _Generic(). The '*_seqcount_t_*' format was the best we could come up with to distinguish (again, for internal seqlock.h code) between macros taking all seqcount_LOCKNAME_t types, and macros/functions taking only plain seqcount_t. Thanks, -- Ahmed S. Darwish Linutronix GmbH
On 11/2/20 4:41 PM, Ahmed S. Darwish wrote: > On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote: >> On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: >> >>> Please stick with the official exported API: raw_write_seqcount_begin(). >> >> How did you know this was 'official exported API' ?? >> > > All the official exported seqlock.h APIs are marked with verbose > kernel-doc annotations on top. The rest are internal... > OK, but no one here was able to deduce that, probably because there is not enough consistency throughout the kernel to be able to assume such things--even though your seqlock project is internally consistent. It's just not *quite* enough communication. I think if we added the following it would be very nice: a) Short comments to the "unofficial and internal" routines, identifying them as such, and b) Short comments to the "official API for general use", also identifying those as such. c) A comment about what makes "raw" actually raw, for seqlock. Since I'm proposing new work, I'll also offer to help, perhaps by putting together a small patch to get it kicked off, if you approve of the idea. thanks,
On Mon, Nov 02, 2020 at 06:20:45PM -0800, John Hubbard wrote: > On 11/2/20 4:41 PM, Ahmed S. Darwish wrote: > > On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote: > > > On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: > > > > > > > Please stick with the official exported API: raw_write_seqcount_begin(). > > > > > > How did you know this was 'official exported API' ?? > > > > > > > All the official exported seqlock.h APIs are marked with verbose > > kernel-doc annotations on top. The rest are internal... > > > > OK, but no one here was able to deduce that, probably because there is not > enough consistency throughout the kernel to be able to assume such things--even > though your seqlock project is internally consistent. It's just not *quite* > enough communication. > > I think if we added the following it would be very nice: > The problem is, I've already documented seqlock.h to death.... There are more comments than code in there, and there is "seqlock.rst" under Documentation/ to further describe the big picture. There comes a point where you decide what level of documentation to add, and what level to skip. Because in the end, you don't want to confuse "Joe, the general driver developer" with too much details that's not relevant to their task at hand. (I work in the Embedded domain, and I've seen so much ugly code from embedded drivers/SoC developers already, sorry) See for example my reply to Linus, where any talk about the lockdep-free and barrier-free parts of the API was explicitly not mentioned in seqlock.rst. This was done on purpose: 1) you want to keep the generic case simple, but the special case do-able, 2) you want to encourage people to use the standard entry/exit points as much as possible. > a) Short comments to the "unofficial and internal" routines, identifying them as > such, and > > b) Short comments to the "official API for general use", also identifying > those as such. > I really think the already added kernel-doc is sufficient... See for example __read_seqcount_begin() and __read_seqcount_retry(). They begin with "__", but they are semi-external seqlock.h API that are used by VFS to avoid barriers. And these APIs are then polymorphised according to the write serialization lock type, and so on. So the most consistent way for seqlock.h was to use kernel-doc as *the* marker for exported functions. This is not unique to seqlock.h by the way. The same pattern is heavily used by the DRM folks. Yes, of course, we can add even more comments to seqlock.h, but then, I honestly think it would be too much that maybe people will just skip reading the whole thing altogether... > c) A comment about what makes "raw" actually raw, for seqlock. > That's already documented. What more can really be written than what's in seqlock.h below?? /** * raw_read_seqcount_begin() - begin a seqcount_t read section w/o lockdep /** * raw_seqcount_begin() - begin a seqcount_t read critical section w/o * lockdep and w/o counter stabilization /** * raw_write_seqcount_begin() - start a seqcount_t write section w/o lockdep /** * raw_write_seqcount_end() - end a seqcount_t write section w/o lockdep > > Since I'm proposing new work, I'll also offer to help, perhaps by putting together > a small patch to get it kicked off, if you approve of the idea. > Patches are always welcome of course, and please put me in Cc. I don't approve or deny anything though, that's the locking maintainers job :) Kind regards, > John Hubbard > NVIDIA -- Ahmed S. Darwish Linutronix GmbH
On 11/2/20 10:52 PM, Ahmed S. Darwish wrote: > On Mon, Nov 02, 2020 at 06:20:45PM -0800, John Hubbard wrote: >> On 11/2/20 4:41 PM, Ahmed S. Darwish wrote: >>> On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote: >>>> On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: >>>> >>>>> Please stick with the official exported API: raw_write_seqcount_begin(). >>>> >>>> How did you know this was 'official exported API' ?? >>>> >>> >>> All the official exported seqlock.h APIs are marked with verbose >>> kernel-doc annotations on top. The rest are internal... >>> >> >> OK, but no one here was able to deduce that, probably because there is not >> enough consistency throughout the kernel to be able to assume such things--even >> though your seqlock project is internally consistent. It's just not *quite* >> enough communication. >> >> I think if we added the following it would be very nice: >> > > The problem is, I've already documented seqlock.h to death.... There are > more comments than code in there, and there is "seqlock.rst" under > Documentation/ to further describe the big picture. > > There comes a point where you decide what level of documentation to add, > and what level to skip. > > Because in the end, you don't want to confuse "Joe, the general driver > developer" with too much details that's not relevant to their task at > hand. (I work in the Embedded domain, and I've seen so much ugly code > from embedded drivers/SoC developers already, sorry) > > See for example my reply to Linus, where any talk about the lockdep-free > and barrier-free parts of the API was explicitly not mentioned in > seqlock.rst. This was done on purpose: 1) you want to keep the generic > case simple, but the special case do-able, 2) you want to encourage > people to use the standard entry/exit points as much as possible. > Well, OK, I'll leave it alone. This was not a normal case for the documentation, anyway. We are using seqcount-ing in a little bit different way than usual, so I guess an email thread like this is actually the proper way to work it out. Just wanted to be sure that we didn't miss an opportunity to clarify anything that needed clarifying it. Thanks for the detailed responses throughout! thanks,
On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > index c48f8df6e50268..294c2c3c4fe00d 100644 > > > > +++ b/mm/memory.c > > > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > > > > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > > > > 0, src_vma, src_mm, addr, end); > > > > mmu_notifier_invalidate_range_start(&range); > > > > + /* > > > > + * The read side doesn't spin, it goes to the mmap_lock, so the > > > > + * raw version is used to avoid disabling preemption here > > > > + */ > > > > + mmap_assert_write_locked(src_mm); > > > > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > > > > > > Would raw_write_seqcount_begin() be better here? > > > > Hum.. > > > > I felt no because it had the preempt stuff added into it, however it > > would work - __seqcount_lock_preemptible() == false for the seqcount_t > > case (see below) > > > > Looking more closely, maybe the right API to pick is > > write_seqcount_t_begin() and write_seqcount_t_end() ?? > > > > No, that's not the right API: it is also internal to seqlock.h. > > Please stick with the official exported API: raw_write_seqcount_begin(). > > It should satisfy your needs, and the raw_*() variant is created exactly > for contexts wishing to avoid the lockdep checks (e.g. NMI handlers > cannot invoke lockdep, etc.) Ahmed, Jason, feel free to correct me - but I feel like what Jason wanted here is indeed the version that does not require disabling of preemption, a.k.a., write_seqcount_t_begin() and write_seqcount_t_end(), since it's preempt-safe if the read side does not retry. Not sure whether there's a variant of it without the "*_t_*" naming, though. Another idea is that maybe we can use the raw_write_seqcount_begin() version, but instead of in copy_page_range(), move it to copy_pte_range(). That would not affect normal Linux regarding preemption, I think, since when we reach the pte level we should have disabled preemption already after all (by taking the pgtable spin lock). But again there could be extra overhead since we'll need to take the write seqcount very often (rather than once per fork(), so maybe there's some perf impact), also that means it'll be an extra/real preempt_disable() for the future RT code if it'll land some day (since again rt_spin_lock, which is used here, should not need to disable preemption, iiuc). So it seems it's still better to do in copy_page_range() as Jason proposed. Thanks,
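To make the trade-off concrete, the alternative placement would look roughly like this (an illustration, not proposed patch code):

/* Entering the write section under the pte spinlock: preemption is
 * already off on !RT, but the section is entered once per pte table
 * rather than once per fork() */
static void copy_ptes_sketch(struct mm_struct *src_mm, pmd_t *src_pmd,
			     unsigned long addr)
{
	spinlock_t *src_ptl = pte_lockptr(src_mm, src_pmd);
	pte_t *src_pte = pte_offset_map(src_pmd, addr);

	spin_lock(src_ptl);	/* disables preemption on !RT */

	raw_write_seqcount_begin(&src_mm->write_protect_seq);
	/* ... copy and write-protect this pte table ... */
	raw_write_seqcount_end(&src_mm->write_protect_seq);

	spin_unlock(src_ptl);
	pte_unmap(src_pte);
}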
On Mon, Nov 2, 2020 at 10:52 PM Ahmed S. Darwish <a.darwish@linutronix.de> wrote: > > The problem is, I've already documented seqlock.h to death.... There are > more comments than code in there, and there is "seqlock.rst" under > Documentation/ to further describe the big picture. Well, honestly, I think the correct thing to do is to get rid of the *_seqcount_t_*() functions entirely. They add nothing but confusion, and they are entirely misnamed. That's not the pattern we use for "internal use only" functions, and they are *very* confusing. They have other issues too: like raw_write_seqcount_end() not being usable on its own when preemptibility isn't an issue like here. You basically _have_ to use raw_write_seqcount_t_end(), because otherwise it tries to re-enable preemption that was never there. Linus
On Tue, Nov 03, 2020 at 09:40:22AM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2020 at 10:52 PM Ahmed S. Darwish > <a.darwish@linutronix.de> wrote: > > > > The problem is, I've already documented seqlock.h to death.... There are > > more comments than code in there, and there is "seqlock.rst" under > > Documentation/ to further describe the big picture. > > Well, honestly, I think the correct thing to do is to get rid of the > *_seqcount_t_*() functions entirely. > > They add nothing but confusion, and they are entirely misnamed. That's > not the pattern we use for "internal use only" functions, and they are > *very* confusing. > I see. Would the enclosed patch #1 be OK? It basically uses the "__do_" prefix instead, with some rationale. > > They have other issues too: like raw_write_seqcount_end() not being > usable on its own when preemptibility isn't an issue like here. You > basically _have_ to use raw_write_seqcount_t_end(), because otherwise > it tries to re-enable preemption that was never there. > Hmmm, raw_write_seqcount_{begin,end}() *never* disable/enable preemption for plain seqcount_t. This is why I kept recommending those for this patch series instead of internal raw_write_seqcount_*t*_{begin,end}(). But..... given that multiple people made the exact same remark by now, I guess that's due to: #define raw_write_seqcount_begin(s) \ do { \ if (__seqcount_lock_preemptible(s)) \ preempt_disable(); \ \ ... \ } while (0); #define raw_write_seqcount_end(s) \ do { \ ... \ \ if (__seqcount_lock_preemptible(s)) \ preempt_enable(); \ } while (0); The tricky part is that __seqcount_lock_preemptible() is always false for plain "seqcount_t". With that data type, the _Generic() selection makes it resolve to __seqprop_preemptible(), which just returns false. Originally, __seqcount_lock_preemptible() was called: __seqcount_associated_lock_exists_and_is_preemptible() but it got transformed to its current short form in the process of some pre-mainline refactorings. Looking at it now after all the dust has settled, maybe the verbose form was much more clear. Please see the enclosed patch #2... Would that be OK too? (I will submit these two patches in their own thread after some common ground is reached.) Patches ------- ====> ====> patch #1: ====> Subject: [RFC][PATCH 1/2] seqlock: Use __do_ prefix instead of non-standard _seqcount_t_ marker The use of "*_seqcount_t_*" as a marker to denote internal seqlock.h functions taking only plain seqcount_t instead of the whole seqcount_LOCKNAME_t family is confusing users, as it's also not the standard kernel pattern for denoting header file internal functions. Use the __do_ prefix instead. Note, a plain "__" prefix is not used since seqlock.h already uses it for some of its exported functions; e.g. __read_seqcount_begin() and __read_seqcount_retry(). Reported-by: Jason Gunthorpe <jgg@nvidia.com> Reported-by: John Hubbard <jhubbard@nvidia.com> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com Signed-off-by: Ahmed S. 
Darwish <a.darwish@linutronix.de> --- include/linux/seqlock.h | 62 ++++++++++++++++++++--------------------- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index cbfc78b92b65..5de043841d33 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -425,9 +425,9 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu * Return: true if a read section retry is required, else false */ #define __read_seqcount_retry(s, start) \ - __read_seqcount_t_retry(__seqcount_ptr(s), start) + __do___read_seqcount_retry(__seqcount_ptr(s), start) -static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) +static inline int __do___read_seqcount_retry(const seqcount_t *s, unsigned start) { kcsan_atomic_next(0); return unlikely(READ_ONCE(s->sequence) != start); @@ -445,12 +445,12 @@ static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) * Return: true if a read section retry is required, else false */ #define read_seqcount_retry(s, start) \ - read_seqcount_t_retry(__seqcount_ptr(s), start) + __do_read_seqcount_retry(__seqcount_ptr(s), start) -static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start) +static inline int __do_read_seqcount_retry(const seqcount_t *s, unsigned start) { smp_rmb(); - return __read_seqcount_t_retry(s, start); + return __do___read_seqcount_retry(s, start); } /** @@ -462,10 +462,10 @@ do { \ if (__seqcount_lock_preemptible(s)) \ preempt_disable(); \ \ - raw_write_seqcount_t_begin(__seqcount_ptr(s)); \ + __do_raw_write_seqcount_begin(__seqcount_ptr(s)); \ } while (0) -static inline void raw_write_seqcount_t_begin(seqcount_t *s) +static inline void __do_raw_write_seqcount_begin(seqcount_t *s) { kcsan_nestable_atomic_begin(); s->sequence++; @@ -478,13 +478,13 @@ static inline void raw_write_seqcount_t_begin(seqcount_t *s) */ #define raw_write_seqcount_end(s) \ do { \ - raw_write_seqcount_t_end(__seqcount_ptr(s)); \ + __do_raw_write_seqcount_end(__seqcount_ptr(s)); \ \ if (__seqcount_lock_preemptible(s)) \ preempt_enable(); \ } while (0) -static inline void raw_write_seqcount_t_end(seqcount_t *s) +static inline void __do_raw_write_seqcount_end(seqcount_t *s) { smp_wmb(); s->sequence++; @@ -506,12 +506,12 @@ do { \ if (__seqcount_lock_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass); \ + __do_write_seqcount_begin_nested(__seqcount_ptr(s), subclass); \ } while (0) -static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass) +static inline void __do_write_seqcount_begin_nested(seqcount_t *s, int subclass) { - raw_write_seqcount_t_begin(s); + __do_raw_write_seqcount_begin(s); seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_); } @@ -533,12 +533,12 @@ do { \ if (__seqcount_lock_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin(__seqcount_ptr(s)); \ + __do_write_seqcount_begin(__seqcount_ptr(s)); \ } while (0) -static inline void write_seqcount_t_begin(seqcount_t *s) +static inline void __do_write_seqcount_begin(seqcount_t *s) { - write_seqcount_t_begin_nested(s, 0); + __do_write_seqcount_begin_nested(s, 0); } /** @@ -549,16 +549,16 @@ static inline void write_seqcount_t_begin(seqcount_t *s) */ #define write_seqcount_end(s) \ do { \ - write_seqcount_t_end(__seqcount_ptr(s)); \ + __do_write_seqcount_end(__seqcount_ptr(s)); \ \ if (__seqcount_lock_preemptible(s)) \ preempt_enable(); \ } while (0) -static inline void write_seqcount_t_end(seqcount_t *s) 
+static inline void __do_write_seqcount_end(seqcount_t *s) { seqcount_release(&s->dep_map, _RET_IP_); - raw_write_seqcount_t_end(s); + __do_raw_write_seqcount_end(s); } /** @@ -603,9 +603,9 @@ static inline void write_seqcount_t_end(seqcount_t *s) * } */ #define raw_write_seqcount_barrier(s) \ - raw_write_seqcount_t_barrier(__seqcount_ptr(s)) + __do_raw_write_seqcount_barrier(__seqcount_ptr(s)) -static inline void raw_write_seqcount_t_barrier(seqcount_t *s) +static inline void __do_raw_write_seqcount_barrier(seqcount_t *s) { kcsan_nestable_atomic_begin(); s->sequence++; @@ -623,9 +623,9 @@ static inline void raw_write_seqcount_t_barrier(seqcount_t *s) * will complete successfully and see data older than this. */ #define write_seqcount_invalidate(s) \ - write_seqcount_t_invalidate(__seqcount_ptr(s)) + __do_write_seqcount_invalidate(__seqcount_ptr(s)) -static inline void write_seqcount_t_invalidate(seqcount_t *s) +static inline void __do_write_seqcount_invalidate(seqcount_t *s) { smp_wmb(); kcsan_nestable_atomic_begin(); @@ -865,7 +865,7 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) } /* - * For all seqlock_t write side functions, use write_seqcount_*t*_begin() + * For all seqlock_t write side functions, use __do_write_seqcount_begin() * instead of the generic write_seqcount_begin(). This way, no redundant * lockdep_assert_held() checks are added. */ @@ -886,7 +886,7 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) static inline void write_seqlock(seqlock_t *sl) { spin_lock(&sl->lock); - write_seqcount_t_begin(&sl->seqcount.seqcount); + __do_write_seqcount_begin(&sl->seqcount.seqcount); } /** @@ -898,7 +898,7 @@ static inline void write_seqlock(seqlock_t *sl) */ static inline void write_sequnlock(seqlock_t *sl) { - write_seqcount_t_end(&sl->seqcount.seqcount); + __do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock(&sl->lock); } @@ -912,7 +912,7 @@ static inline void write_sequnlock(seqlock_t *sl) static inline void write_seqlock_bh(seqlock_t *sl) { spin_lock_bh(&sl->lock); - write_seqcount_t_begin(&sl->seqcount.seqcount); + __do_write_seqcount_begin(&sl->seqcount.seqcount); } /** @@ -925,7 +925,7 @@ static inline void write_seqlock_bh(seqlock_t *sl) */ static inline void write_sequnlock_bh(seqlock_t *sl) { - write_seqcount_t_end(&sl->seqcount.seqcount); + __do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock_bh(&sl->lock); } @@ -939,7 +939,7 @@ static inline void write_sequnlock_bh(seqlock_t *sl) static inline void write_seqlock_irq(seqlock_t *sl) { spin_lock_irq(&sl->lock); - write_seqcount_t_begin(&sl->seqcount.seqcount); + __do_write_seqcount_begin(&sl->seqcount.seqcount); } /** @@ -951,7 +951,7 @@ static inline void write_seqlock_irq(seqlock_t *sl) */ static inline void write_sequnlock_irq(seqlock_t *sl) { - write_seqcount_t_end(&sl->seqcount.seqcount); + __do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock_irq(&sl->lock); } @@ -960,7 +960,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) unsigned long flags; spin_lock_irqsave(&sl->lock, flags); - write_seqcount_t_begin(&sl->seqcount.seqcount); + __do_write_seqcount_begin(&sl->seqcount.seqcount); return flags; } @@ -989,7 +989,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) static inline void write_sequnlock_irqrestore(seqlock_t *sl, unsigned long flags) { - write_seqcount_t_end(&sl->seqcount.seqcount); + __do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock_irqrestore(&sl->lock, flags); } ====> ====> 
patch #2: ====> Subject: [PATCH 2/2] seqlock: seqcount_LOCKNAME_t: Use more verbose macro names As evidenced by multiple discussions over LKML so far, it's not clear that __seqcount_lock_preemptible() is always false for plain seqcount_t. For that data type, the _Generic() selection resolves to __seqprop_preemptible(), which just returns false. Use __seqcount_associated_lock_exists_and_is_preemptible() instead, which hints that "preemptibility" is for the associated write serialization lock (if any), not for the seqcount itself. Similarly, rename __seqcount_assert_lock_held() to __seqcount_assert_associated_lock_held(). Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201030235121.GQ2620339@nvidia.com Link: https://lkml.kernel.org/r/20201103170327.GJ20600@xz-x1 Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> --- include/linux/seqlock.h | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 5de043841d33..eb1e5a822e44 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -307,10 +307,10 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu __seqprop_case((s), mutex, prop), \ __seqprop_case((s), ww_mutex, prop)) -#define __seqcount_ptr(s) __seqprop(s, ptr) -#define __seqcount_sequence(s) __seqprop(s, sequence) -#define __seqcount_lock_preemptible(s) __seqprop(s, preemptible) -#define __seqcount_assert_lock_held(s) __seqprop(s, assert) +#define __seqcount_ptr(s) __seqprop(s, ptr) +#define __seqcount_sequence(s) __seqprop(s, sequence) +#define __seqcount_associated_lock_exists_and_is_preemptible(s) __seqprop(s, preemptible) +#define __seqcount_assert_associated_lock_held(s) __seqprop(s, assert) /** * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier @@ -459,7 +459,7 @@ static inline int __do_read_seqcount_retry(const seqcount_t *s, unsigned start) */ #define raw_write_seqcount_begin(s) \ do { \ - if (__seqcount_lock_preemptible(s)) \ + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_disable(); \ \ __do_raw_write_seqcount_begin(__seqcount_ptr(s)); \ @@ -480,7 +480,7 @@ static inline void __do_raw_write_seqcount_begin(seqcount_t *s) do { \ __do_raw_write_seqcount_end(__seqcount_ptr(s)); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_enable(); \ } while (0) @@ -501,9 +501,9 @@ static inline void __do_raw_write_seqcount_end(seqcount_t *s) */ #define write_seqcount_begin_nested(s, subclass) \ do { \ - __seqcount_assert_lock_held(s); \ + __seqcount_assert_associated_lock_held(s); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_disable(); \ \ __do_write_seqcount_begin_nested(__seqcount_ptr(s), subclass); \ @@ -528,9 +528,9 @@ static inline void __do_write_seqcount_begin_nested(seqcount_t *s, int subclass) */ #define write_seqcount_begin(s) \ do { \ - __seqcount_assert_lock_held(s); \ + __seqcount_assert_associated_lock_held(s); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_disable(); \ \ __do_write_seqcount_begin(__seqcount_ptr(s)); \ @@ -551,7 +551,7 @@ static inline void __do_write_seqcount_begin(seqcount_t *s) do { \ __do_write_seqcount_end(__seqcount_ptr(s)); \ \ - if 
(__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_enable(); \ } while (0) > Linus Thanks, -- Ahmed S. Darwish Linutronix GmbH
On 11/3/20 5:32 PM, Ahmed S. Darwish wrote: > On Tue, Nov 03, 2020 at 09:40:22AM -0800, Linus Torvalds wrote: >> On Mon, Nov 2, 2020 at 10:52 PM Ahmed S. Darwish >> <a.darwish@linutronix.de> wrote: ... > > ====> > ====> patch #1: > ====> > > Subject: [RFC][PATCH 1/2] seqlock: Use __do_ prefix instead of non-standard > _seqcount_t_ marker > > The use of "*_seqcount_t_*" as a marker to denote internal seqlock.h > functions taking only plain seqcount_t instead of the whole > seqcount_LOCKNAME_t family is confusing users, as it's also not the > standard kernel pattern for denoting header file internal functions. > > Use the __do_ prefix instead. > > Note, a plain "__" prefix is not used since seqlock.h already uses it > for some of its exported functions; e.g. __read_seqcount_begin() and > __read_seqcount_retry(). > > Reported-by: Jason Gunthorpe <jgg@nvidia.com> > Reported-by: John Hubbard <jhubbard@nvidia.com> > Reported-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > --- > include/linux/seqlock.h | 62 ++++++++++++++++++++--------------------- > 1 file changed, 31 insertions(+), 31 deletions(-) > > diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h > index cbfc78b92b65..5de043841d33 100644 > --- a/include/linux/seqlock.h > +++ b/include/linux/seqlock.h > @@ -425,9 +425,9 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu > * Return: true if a read section retry is required, else false > */ > #define __read_seqcount_retry(s, start) \ > - __read_seqcount_t_retry(__seqcount_ptr(s), start) > + __do___read_seqcount_retry(__seqcount_ptr(s), start) Looking better. "do_" is clearly an internal function name prefix, so that's good. A nit: while various numbers of leading underscores are sometimes used, it's a lot less common to use, say, 3 consecutive underscores (as above) *within* the name. And I don't think you need it for uniqueness, at least from a quick look around here. In fact, if you actually need 3 vs 2 vs 1 underscore within the name, I'm going to claim that that's too far afield, and the naming should be re-revisited. :) So why not just: __do_read_seqcount_retry() ? ...or, if you're feeling bold: do_read_seqcount_retry() ...thus taking further advantage of the "do" convention, in order to get rid of some more underscores. And similarly for other __do___*() functions. But again, either way, I think "do" is helping a *lot* here (as is getting rid of the _t_ idea). thanks,
On Tue, Nov 03, 2020 at 06:01:30PM -0800, John Hubbard wrote:
> On 11/3/20 5:32 PM, Ahmed S. Darwish wrote:
...
> >  #define __read_seqcount_retry(s, start) \
> > -	__read_seqcount_t_retry(__seqcount_ptr(s), start)
> > +	__do___read_seqcount_retry(__seqcount_ptr(s), start)
> >
...
> A nit: while various numbers of leading underscores are sometimes used,
> it's a lot less common to use, say, 3 consecutive underscores (as above)
> *within* the name. And I don't think you need it for uniqueness, at
> least from a quick look around here.
>
...
> But again, either way, I think "do" is helping a *lot* here (as is
> getting rid of the _t_ idea).

The three underscores are needed because there's a do_ version for
read_seqcount_retry(), and another for __read_seqcount_retry().

Similarly for {__,}read_seqcount_begin(). You want to be very careful
with this, and never mistakenly mix the two, because it affects some
VFS hot paths.

Nonetheless, as you mentioned in the later (dropped) part of your
message, I think do_ is better than __do_, so the final result will be:

  do___read_seqcount_retry()
  do_read_seqcount_retry()
  do_raw_write_seqcount_begin()
  do_raw_write_seqcount_end()
  do_write_seqcount_begin()
  ...

and so on.

I'll wait for some further feedback on the two patches (possibly from
Linus or PeterZ), then send a mini patch series.

(This shouldn't block a v3 of Jason's mm patch series though, as it
will be using the external seqlock.h APIs anyway...)

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH
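As a sketch of where that naming lands for the retry pair, assuming the
function bodies stay as in today's seqlock.h internals (illustration
only, not the actual patch):

  /* Internal: operate on a plain seqcount_t only.
   * Bodies mirror the current __read_seqcount_t_retry() and
   * read_seqcount_t_retry(); only the names are the proposed ones. */
  static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start)
  {
	  kcsan_atomic_next(0);
	  return unlikely(READ_ONCE(s->sequence) != start);
  }

  static inline int do_read_seqcount_retry(const seqcount_t *s, unsigned start)
  {
	  smp_rmb();	/* full read barrier before validating the sequence */
	  return do___read_seqcount_retry(s, start);
  }

  /* External: accept the whole seqcount_LOCKNAME_t family via _Generic() */
  #define __read_seqcount_retry(s, start) \
	  do___read_seqcount_retry(__seqcount_ptr(s), start)
  #define read_seqcount_retry(s, start) \
	  do_read_seqcount_retry(__seqcount_ptr(s), start)

The two internal helpers must stay distinct: the barrier-less variant is
what the VFS hot paths rely on, hence the do___ spelling.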
On Tue, Nov 3, 2020 at 7:17 PM Ahmed S. Darwish <a.darwish@linutronix.de> wrote:
>
> Nonetheless, as you mentioned in the later (dropped) part of your
> message, I think do_ is better than __do_, so the final result will be:
>
>   do___read_seqcount_retry()
>   do_read_seqcount_retry()
>   do_raw_write_seqcount_begin()
>   do_raw_write_seqcount_end()
>   do_write_seqcount_begin()
>   ...
>
> and so on.

Looks reasonable to me.

And can you add a few comments to the magic type macros, so that it's a
lot more obvious what the end result is? I clearly wasn't able to
follow all the _Generic() cases from the seqcount_t to the final end
result.

It's a really odd combination of subtle _Generic() macro use and token
pasting to get from seqcount_t to "false" in
__seqcount_lock_preemptible(). I can see it when I really look, but
when looking at the actual use, it's very non-obvious indeed.

          Linus
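For readers trying to trace that path: the dispatch looks roughly like
this (heavily simplified from seqlock.h; the real __seqprop() lists
every seqcount_LOCKNAME_t case via __seqprop_case() token pasting):

  /* Simplified: the real macro also covers raw_spinlock, spinlock,
   * rwlock and ww_mutex. */
  #define __seqprop(s, prop) _Generic(*(s),				\
	  seqcount_t:		__seqprop_##prop((void *)(s)),		\
	  seqcount_mutex_t:	__seqprop_mutex_##prop((void *)(s)))

  /* So for a plain seqcount_t, __seqcount_lock_preemptible(s) ends at: */
  static inline bool __seqprop_preemptible(const seqcount_t *s)
  {
	  return false;	/* no associated lock, nothing preemptible */
  }

A seqcount_mutex_t instead reaches __seqprop_mutex_preemptible(), which
returns true, so write_seqcount_begin() adds a preempt_disable() to keep
the writer atomic even though the serializing lock itself can sleep.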
On Wed, Nov 04, 2020 at 04:17:11AM +0100, Ahmed S. Darwish wrote:
> On Tue, Nov 03, 2020 at 06:01:30PM -0800, John Hubbard wrote:
> > On 11/3/20 5:32 PM, Ahmed S. Darwish wrote:
> ...
> > >  #define __read_seqcount_retry(s, start) \
> > > -	__read_seqcount_t_retry(__seqcount_ptr(s), start)
> > > +	__do___read_seqcount_retry(__seqcount_ptr(s), start)
> > >
> ...
> > A nit: while various numbers of leading underscores are sometimes used,
> > it's a lot less common to use, say, 3 consecutive underscores (as above)
> > *within* the name. And I don't think you need it for uniqueness, at
> > least from a quick look around here.
> >
> ...
> > But again, either way, I think "do" is helping a *lot* here (as is
> > getting rid of the _t_ idea).
>
> The three underscores are needed because there's a do_ version for
> read_seqcount_retry(), and another for __read_seqcount_retry().
>
> Similarly for {__,}read_seqcount_begin(). You want to be very careful
> with this, and never mistakenly mix the two, because it affects some
> VFS hot paths.
>
> Nonetheless, as you mentioned in the later (dropped) part of your
> message, I think do_ is better than __do_, so the final result will be:
>
>   do___read_seqcount_retry()
>   do_read_seqcount_retry()
>   do_raw_write_seqcount_begin()
>   do_raw_write_seqcount_end()
>   do_write_seqcount_begin()
>   ...
>
> and so on.
>
> I'll wait for some further feedback on the two patches (possibly from
> Linus or PeterZ), then send a mini patch series.

I'm OK with that. The change I have issues with is:

-#define __seqcount_lock_preemptible(s)	__seqprop(s, preemptible)
+#define __seqcount_associated_lock_exists_and_is_preemptible(s)	__seqprop(s, preemptible)

That's just _realllllllly_ long.

Would something like the below make it easier to follow?
seqprop_preemptible() is 'obviously' related to __seqprop_preemptible()
without having to follow through the _Generic token pasting maze.
---
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 8d8552474c64..576594add921 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -307,10 +307,10 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
 		__seqprop_case((s),	mutex,		prop),		\
 		__seqprop_case((s),	ww_mutex,	prop))

-#define __seqcount_ptr(s)		__seqprop(s, ptr)
-#define __seqcount_sequence(s)		__seqprop(s, sequence)
-#define __seqcount_lock_preemptible(s)	__seqprop(s, preemptible)
-#define __seqcount_assert_lock_held(s)	__seqprop(s, assert)
+#define seqprop_ptr(s)			__seqprop(s, ptr)
+#define seqprop_sequence(s)		__seqprop(s, sequence)
+#define seqprop_preemptible(s)		__seqprop(s, preemptible)
+#define seqprop_assert(s)		__seqprop(s, assert)

 /**
  * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier
@@ -330,7 +330,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
 ({									\
 	unsigned __seq;							\
 									\
-	while ((__seq = __seqcount_sequence(s)) & 1)			\
+	while ((__seq = seqprop_sequence(s)) & 1)			\
 		cpu_relax();						\
 									\
 	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
@@ -359,7 +359,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
  */
 #define read_seqcount_begin(s)						\
 ({									\
-	seqcount_lockdep_reader_access(__seqcount_ptr(s));		\
+	seqcount_lockdep_reader_access(seqprop_ptr(s));			\
 	raw_read_seqcount_begin(s);					\
 })
@@ -376,7 +376,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
  */
 #define raw_read_seqcount(s)						\
 ({									\
-	unsigned __seq = __seqcount_sequence(s);			\
+	unsigned __seq = seqprop_sequence(s);				\
 									\
 	smp_rmb();							\
 	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
@@ -425,7 +425,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
  * Return: true if a read section retry is required, else false
  */
 #define __read_seqcount_retry(s, start)					\
-	__read_seqcount_t_retry(__seqcount_ptr(s), start)
+	__read_seqcount_t_retry(seqprop_ptr(s), start)

 static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start)
 {
@@ -445,7 +445,7 @@ static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start)
  * Return: true if a read section retry is required, else false
  */
 #define read_seqcount_retry(s, start)					\
-	read_seqcount_t_retry(__seqcount_ptr(s), start)
+	read_seqcount_t_retry(seqprop_ptr(s), start)

 static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start)
 {
@@ -459,10 +459,10 @@ static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start)
  */
 #define raw_write_seqcount_begin(s)					\
 do {									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_disable();					\
 									\
-	raw_write_seqcount_t_begin(__seqcount_ptr(s));			\
+	raw_write_seqcount_t_begin(seqprop_ptr(s));			\
 } while (0)

 static inline void raw_write_seqcount_t_begin(seqcount_t *s)
@@ -478,9 +478,9 @@ static inline void raw_write_seqcount_t_begin(seqcount_t *s)
  */
 #define raw_write_seqcount_end(s)					\
 do {									\
-	raw_write_seqcount_t_end(__seqcount_ptr(s));			\
+	raw_write_seqcount_t_end(seqprop_ptr(s));			\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_enable();					\
 } while (0)
@@ -501,12 +501,12 @@ static inline void raw_write_seqcount_t_end(seqcount_t *s)
  */
 #define write_seqcount_begin_nested(s, subclass)			\
 do {									\
-	__seqcount_assert_lock_held(s);					\
+	seqprop_assert(s);						\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_disable();					\
 									\
-	write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass);	\
+	write_seqcount_t_begin_nested(seqprop_ptr(s), subclass);	\
 } while (0)

 static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass)
@@ -528,12 +528,12 @@ static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass)
  */
 #define write_seqcount_begin(s)						\
 do {									\
-	__seqcount_assert_lock_held(s);					\
+	seqprop_assert(s);						\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_disable();					\
 									\
-	write_seqcount_t_begin(__seqcount_ptr(s));			\
+	write_seqcount_t_begin(seqprop_ptr(s));				\
 } while (0)

 static inline void write_seqcount_t_begin(seqcount_t *s)
@@ -549,9 +549,9 @@ static inline void write_seqcount_t_begin(seqcount_t *s)
  */
 #define write_seqcount_end(s)						\
 do {									\
-	write_seqcount_t_end(__seqcount_ptr(s));			\
+	write_seqcount_t_end(seqprop_ptr(s));				\
 									\
-	if (__seqcount_lock_preemptible(s))				\
+	if (seqprop_preemptible(s))					\
 		preempt_enable();					\
 } while (0)
@@ -603,7 +603,7 @@ static inline void write_seqcount_t_end(seqcount_t *s)
  * }
  */
 #define raw_write_seqcount_barrier(s)					\
-	raw_write_seqcount_t_barrier(__seqcount_ptr(s))
+	raw_write_seqcount_t_barrier(seqprop_ptr(s))

 static inline void raw_write_seqcount_t_barrier(seqcount_t *s)
 {
@@ -623,7 +623,7 @@ static inline void raw_write_seqcount_t_barrier(seqcount_t *s)
  * will complete successfully and see data older than this.
  */
 #define write_seqcount_invalidate(s)					\
-	write_seqcount_t_invalidate(__seqcount_ptr(s))
+	write_seqcount_t_invalidate(seqprop_ptr(s))

 static inline void write_seqcount_t_invalidate(seqcount_t *s)
 {
diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 992fb1415c0f1f..6a2f542d9588a4 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -93,6 +93,7 @@ static struct mm_struct tboot_mm = {
 	.pgd            = swapper_pg_dir,
 	.mm_users       = ATOMIC_INIT(2),
 	.mm_count       = ATOMIC_INIT(1),
+	.write_protect_seq = SEQCNT_ZERO(tboot_mm.write_protect_seq),
 	MMAP_LOCK_INITIALIZER(init_mm)
 	.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 5e5480a0a32d7d..2520f6e05f4d44 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -57,6 +57,7 @@ struct mm_struct efi_mm = {
 	.mm_rb			= RB_ROOT,
 	.mm_users		= ATOMIC_INIT(2),
 	.mm_count		= ATOMIC_INIT(1),
+	.write_protect_seq	= SEQCNT_ZERO(efi_mm.write_protect_seq),
 	MMAP_LOCK_INITIALIZER(efi_mm)
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5a9238f6caad97..f13beacda6fd23 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
+#include <linux/seqlock.h>

 #include <asm/mmu.h>

@@ -446,6 +447,12 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;

+		/**
+		 * @write_protect_seq: Odd when any thread is write protecting
+		 * pages in this mm, for instance during fork().
+		 */
+		seqcount_t write_protect_seq;
+
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 32083db7a2a23e..6fd934a6d60d96 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1007,6 +1007,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->vmacache_seqnum = 0;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
+	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->core_state = NULL;
diff --git a/mm/gup.c b/mm/gup.c
index 150cc962c99201..7944749ed7252f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2678,11 +2678,18 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
 {
 	unsigned long flags;
 	int nr_pinned = 0;
+	unsigned seq;

 	if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
 	    !gup_fast_permitted(start, end))
 		return 0;

+	if (gup_flags & FOLL_PIN) {
+		seq = raw_read_seqcount(&current->mm->write_protect_seq);
+		if (seq & 1)
+			return 0;
+	}
+
 	/*
 	 * Disable interrupts. The nested form is used, in order to allow full,
 	 * general purpose use of this routine.
@@ -2697,6 +2704,18 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
 	local_irq_save(flags);
 	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
 	local_irq_restore(flags);
+
+	/*
+	 * When pinning pages for DMA there could be a concurrent write protect
+	 * from fork() via copy_page_range(), in this case always fail fast GUP.
+	 */
+	if (gup_flags & FOLL_PIN) {
+		if (read_seqcount_t_retry(&current->mm->write_protect_seq,
+					  seq)) {
+			unpin_user_pages(pages, nr_pinned);
+			return 0;
+		}
+	}
 	return nr_pinned;
 }
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 3a613c85f9ede2..153162669f8062 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -31,6 +31,7 @@ struct mm_struct init_mm = {
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
+	.write_protect_seq = SEQCNT_ZERO(init_mm.write_protect_seq),
 	MMAP_LOCK_INITIALIZER(init_mm)
 	.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.arg_lock	= __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e50268..294c2c3c4fe00d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
 					0, src_vma, src_mm, addr, end);
 		mmu_notifier_invalidate_range_start(&range);
+		/*
+		 * The read side doesn't spin, it goes to the mmap_lock, so the
+		 * raw version is used to avoid disabling preemption here
+		 */
+		mmap_assert_write_locked(src_mm);
+		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
 	}

 	ret = 0;
@@ -1187,8 +1193,10 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);

-	if (is_cow)
+	if (is_cow) {
+		raw_write_seqcount_t_end(&src_mm->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
+	}
 	return ret;
 }
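Condensing the hunks above, the resulting fork()/gup_fast pairing is
(sketch only; the config checks, IRQ handling and surrounding functions
from the patch are elided):

  /* Writer, fork() path: serialized by the exclusive mmap_lock on src_mm */
  mmap_assert_write_locked(src_mm);
  raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
  /* ... copy page tables, write-protecting COW ptes ... */
  raw_write_seqcount_t_end(&src_mm->write_protect_seq);

  /* Reader, lockless_pages_from_mm(), FOLL_PIN only */
  seq = raw_read_seqcount(&current->mm->write_protect_seq);
  if (seq & 1)		/* fork() already in progress: go to slow GUP */
	  return 0;
  /* ... gup_pgd_range() walk with interrupts disabled ... */
  if (read_seqcount_t_retry(&current->mm->write_protect_seq, seq)) {
	  unpin_user_pages(pages, nr_pinned);	/* collided with fork() */
	  return 0;		/* caller falls back to slow GUP */
  }

Slow GUP needs no check because it takes the mmap_lock itself, which
fork() already holds exclusively while the count is odd.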