[v2,1/1] mm: introduce mmap_lock_speculation_{start|end}

Message ID	20240912210222.186542-1-surenb@google.com (mailing list archive)
State	Handled Elsewhere
Headers	show Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 93E0B1BB691 for <linux-trace-kernel@vger.kernel.org>; Thu, 12 Sep 2024 21:02:27 +0000 (UTC) Date: Thu, 12 Sep 2024 14:02:22 -0700 In-Reply-To: <CAJuCfpFFqqUWYOob_WYG_aY=PurnKvZjxznnx7V0=ESbNzHr_w@mail.gmail.com> Precedence: bulk Mime-Version: 1.0 References: <CAJuCfpFFqqUWYOob_WYG_aY=PurnKvZjxznnx7V0=ESbNzHr_w@mail.gmail.com> Message-ID: <20240912210222.186542-1-surenb@google.com> Subject: [PATCH v2 1/1] mm: introduce mmap_lock_speculation_{start\|end} From: Suren Baghdasaryan <surenb@google.com> To: linux-trace-kernel@vger.kernel.org, peterz@infradead.org, oleg@redhat.com Cc: rostedt@goodmis.org, mhiramat@kernel.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, willy@infradead.org, surenb@google.com, akpm@linux-foundation.org, linux-mm@kvack.org, mjguzik@gmail.com, brauner@kernel.org, jannh@google.com, andrii@kernel.org Content-Type: text/plain; charset="UTF-8"
Series	[v2,1/1] mm: introduce mmap_lock_speculation_{start\|end} \| expand [v2,1/1] mm: introduce mmap_lock_speculation_{start\|end}

Suren Baghdasaryan Sept. 12, 2024, 9:02 p.m. UTC

Add helper functions to speculatively perform operations without
read-locking mmap_lock, expecting that mmap_lock will not be
write-locked and mm is not modified from under us.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
Changes since v1 [1]:
- Made memory barriers in inc_mm_lock_seq and mmap_lock_speculation_end
more strict, per Jann Horn

[1] https://lore.kernel.org/all/20240906051205.530219-2-andrii@kernel.org/

 include/linux/mm_types.h  |  3 ++
 include/linux/mmap_lock.h | 74 ++++++++++++++++++++++++++++++++-------
 kernel/fork.c             |  3 --
 3 files changed, 65 insertions(+), 15 deletions(-)


base-commit: 015bdfcb183759674ba1bd732c3393014e35708b

Suren Baghdasaryan Sept. 12, 2024, 9:04 p.m. UTC | #1

On Thu, Sep 12, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.

Here you go. I hope I got the ordering right this time around, but I
would feel much better if Jann reviewed it before it's included in
your next patchset :)
Thanks,
Suren.

>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> Changes since v1 [1]:
> - Made memory barriers in inc_mm_lock_seq and mmap_lock_speculation_end
> more strict, per Jann Horn
>
> [1] https://lore.kernel.org/all/20240906051205.530219-2-andrii@kernel.org/
>
>  include/linux/mm_types.h  |  3 ++
>  include/linux/mmap_lock.h | 74 ++++++++++++++++++++++++++++++++-------
>  kernel/fork.c             |  3 --
>  3 files changed, 65 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8e38bc..5d8cdebd42bc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -887,6 +887,9 @@ struct mm_struct {
>                  * Roughly speaking, incrementing the sequence number is
>                  * equivalent to releasing locks on VMAs; reading the sequence
>                  * number can be part of taking a read lock on a VMA.
> +                * Incremented every time mmap_lock is write-locked/unlocked.
> +                * Initialized to 0, therefore odd values indicate mmap_lock
> +                * is write-locked and even values that it's released.
>                  *
>                  * Can be modified under write mmap_lock using RELEASE
>                  * semantics.
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index de9dc20b01ba..a281519d0c12 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -71,39 +71,86 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
>  }
>
>  #ifdef CONFIG_PER_VMA_LOCK
> +static inline void init_mm_lock_seq(struct mm_struct *mm)
> +{
> +       mm->mm_lock_seq = 0;
> +}
> +
>  /*
> - * Drop all currently-held per-VMA locks.
> - * This is called from the mmap_lock implementation directly before releasing
> - * a write-locked mmap_lock (or downgrading it to read-locked).
> - * This should normally NOT be called manually from other places.
> - * If you want to call this manually anyway, keep in mind that this will release
> - * *all* VMA write locks, including ones from further up the stack.
> + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> + * or write-unlocked (RELEASE semantics).
>   */
> -static inline void vma_end_write_all(struct mm_struct *mm)
> +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
>  {
>         mmap_assert_write_locked(mm);
>         /*
>          * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
>          * mmap_lock being held.
> -        * We need RELEASE semantics here to ensure that preceding stores into
> -        * the VMA take effect before we unlock it with this store.
> -        * Pairs with ACQUIRE semantics in vma_start_read().
>          */
> -       smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +
> +       if (acquire) {
> +               WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +               /*
> +                * For ACQUIRE semantics we should ensure no following stores are
> +                * reordered to appear before the mm->mm_lock_seq modification.
> +                */
> +               smp_wmb();
> +       } else {
> +               /*
> +                * We need RELEASE semantics here to ensure that preceding stores
> +                * into the VMA take effect before we unlock it with this store.
> +                * Pairs with ACQUIRE semantics in vma_start_read().
> +                */
> +               smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +       }
> +}
> +
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> +{
> +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> +       *seq = smp_load_acquire(&mm->mm_lock_seq);
> +       /* Allow speculation if mmap_lock is not write-locked */
> +       return (*seq & 1) == 0;
> +}
> +
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> +{
> +       /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
> +       smp_rmb();
> +       return seq == READ_ONCE(mm->mm_lock_seq);
>  }
> +
>  #else
> -static inline void vma_end_write_all(struct mm_struct *mm) {}
> +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
>  #endif
>
> +/*
> + * Drop all currently-held per-VMA locks.
> + * This is called from the mmap_lock implementation directly before releasing
> + * a write-locked mmap_lock (or downgrading it to read-locked).
> + * This should normally NOT be called manually from other places.
> + * If you want to call this manually anyway, keep in mind that this will release
> + * *all* VMA write locks, including ones from further up the stack.
> + */
> +static inline void vma_end_write_all(struct mm_struct *mm)
> +{
> +       inc_mm_lock_seq(mm, false);
> +}
> +
>  static inline void mmap_init_lock(struct mm_struct *mm)
>  {
>         init_rwsem(&mm->mmap_lock);
> +       init_mm_lock_seq(mm);
>  }
>
>  static inline void mmap_write_lock(struct mm_struct *mm)
>  {
>         __mmap_lock_trace_start_locking(mm, true);
>         down_write(&mm->mmap_lock);
> +       inc_mm_lock_seq(mm, true);
>         __mmap_lock_trace_acquire_returned(mm, true, true);
>  }
>
> @@ -111,6 +158,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
>  {
>         __mmap_lock_trace_start_locking(mm, true);
>         down_write_nested(&mm->mmap_lock, subclass);
> +       inc_mm_lock_seq(mm, true);
>         __mmap_lock_trace_acquire_returned(mm, true, true);
>  }
>
> @@ -120,6 +168,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
>
>         __mmap_lock_trace_start_locking(mm, true);
>         ret = down_write_killable(&mm->mmap_lock);
> +       if (!ret)
> +               inc_mm_lock_seq(mm, true);
>         __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
>         return ret;
>  }
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 61070248a7d3..c86e87ed172b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>         seqcount_init(&mm->write_protect_seq);
>         mmap_init_lock(mm);
>         INIT_LIST_HEAD(&mm->mmlist);
> -#ifdef CONFIG_PER_VMA_LOCK
> -       mm->mm_lock_seq = 0;
> -#endif
>         mm_pgtables_bytes_init(mm);
>         mm->map_count = 0;
>         mm->locked_vm = 0;
>
> base-commit: 015bdfcb183759674ba1bd732c3393014e35708b
> --
> 2.46.0.662.g92d0881bb0-goog
>

Andrii Nakryiko Sept. 12, 2024, 10:19 p.m. UTC | #2

On Thu, Sep 12, 2024 at 2:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Sep 12, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Add helper functions to speculatively perform operations without
> > read-locking mmap_lock, expecting that mmap_lock will not be
> > write-locked and mm is not modified from under us.
>
> Here you go. I hope I got the ordering right this time around, but I
> would feel much better if Jann reviewed it before it's included in
> your next patchset :)

Thanks a lot! And yes, I'll give it a bit of time for reviews before
sending a new revision.

Did you by any chance get any new ideas for possible benchmarks
(beyond what I did with will-it-scale)?


> Thanks,
> Suren.
>
> >
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> > Changes since v1 [1]:
> > - Made memory barriers in inc_mm_lock_seq and mmap_lock_speculation_end
> > more strict, per Jann Horn
> >
> > [1] https://lore.kernel.org/all/20240906051205.530219-2-andrii@kernel.org/
> >
> >  include/linux/mm_types.h  |  3 ++
> >  include/linux/mmap_lock.h | 74 ++++++++++++++++++++++++++++++++-------
> >  kernel/fork.c             |  3 --
> >  3 files changed, 65 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6e3bdf8e38bc..5d8cdebd42bc 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -887,6 +887,9 @@ struct mm_struct {
> >                  * Roughly speaking, incrementing the sequence number is
> >                  * equivalent to releasing locks on VMAs; reading the sequence
> >                  * number can be part of taking a read lock on a VMA.
> > +                * Incremented every time mmap_lock is write-locked/unlocked.
> > +                * Initialized to 0, therefore odd values indicate mmap_lock
> > +                * is write-locked and even values that it's released.
> >                  *
> >                  * Can be modified under write mmap_lock using RELEASE
> >                  * semantics.
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index de9dc20b01ba..a281519d0c12 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -71,39 +71,86 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
> >  }
> >
> >  #ifdef CONFIG_PER_VMA_LOCK
> > +static inline void init_mm_lock_seq(struct mm_struct *mm)
> > +{
> > +       mm->mm_lock_seq = 0;
> > +}
> > +
> >  /*
> > - * Drop all currently-held per-VMA locks.
> > - * This is called from the mmap_lock implementation directly before releasing
> > - * a write-locked mmap_lock (or downgrading it to read-locked).
> > - * This should normally NOT be called manually from other places.
> > - * If you want to call this manually anyway, keep in mind that this will release
> > - * *all* VMA write locks, including ones from further up the stack.
> > + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> > + * or write-unlocked (RELEASE semantics).
> >   */
> > -static inline void vma_end_write_all(struct mm_struct *mm)
> > +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
> >  {
> >         mmap_assert_write_locked(mm);
> >         /*
> >          * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> >          * mmap_lock being held.
> > -        * We need RELEASE semantics here to ensure that preceding stores into
> > -        * the VMA take effect before we unlock it with this store.
> > -        * Pairs with ACQUIRE semantics in vma_start_read().
> >          */
> > -       smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +
> > +       if (acquire) {
> > +               WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +               /*
> > +                * For ACQUIRE semantics we should ensure no following stores are
> > +                * reordered to appear before the mm->mm_lock_seq modification.
> > +                */
> > +               smp_wmb();
> > +       } else {
> > +               /*
> > +                * We need RELEASE semantics here to ensure that preceding stores
> > +                * into the VMA take effect before we unlock it with this store.
> > +                * Pairs with ACQUIRE semantics in vma_start_read().
> > +                */
> > +               smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +       }
> > +}
> > +
> > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> > +{
> > +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > +       *seq = smp_load_acquire(&mm->mm_lock_seq);
> > +       /* Allow speculation if mmap_lock is not write-locked */
> > +       return (*seq & 1) == 0;
> > +}
> > +
> > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > +{
> > +       /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
> > +       smp_rmb();
> > +       return seq == READ_ONCE(mm->mm_lock_seq);
> >  }
> > +
> >  #else
> > -static inline void vma_end_write_all(struct mm_struct *mm) {}
> > +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> > +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
> > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
> >  #endif
> >
> > +/*
> > + * Drop all currently-held per-VMA locks.
> > + * This is called from the mmap_lock implementation directly before releasing
> > + * a write-locked mmap_lock (or downgrading it to read-locked).
> > + * This should normally NOT be called manually from other places.
> > + * If you want to call this manually anyway, keep in mind that this will release
> > + * *all* VMA write locks, including ones from further up the stack.
> > + */
> > +static inline void vma_end_write_all(struct mm_struct *mm)
> > +{
> > +       inc_mm_lock_seq(mm, false);
> > +}
> > +
> >  static inline void mmap_init_lock(struct mm_struct *mm)
> >  {
> >         init_rwsem(&mm->mmap_lock);
> > +       init_mm_lock_seq(mm);
> >  }
> >
> >  static inline void mmap_write_lock(struct mm_struct *mm)
> >  {
> >         __mmap_lock_trace_start_locking(mm, true);
> >         down_write(&mm->mmap_lock);
> > +       inc_mm_lock_seq(mm, true);
> >         __mmap_lock_trace_acquire_returned(mm, true, true);
> >  }
> >
> > @@ -111,6 +158,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
> >  {
> >         __mmap_lock_trace_start_locking(mm, true);
> >         down_write_nested(&mm->mmap_lock, subclass);
> > +       inc_mm_lock_seq(mm, true);
> >         __mmap_lock_trace_acquire_returned(mm, true, true);
> >  }
> >
> > @@ -120,6 +168,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
> >
> >         __mmap_lock_trace_start_locking(mm, true);
> >         ret = down_write_killable(&mm->mmap_lock);
> > +       if (!ret)
> > +               inc_mm_lock_seq(mm, true);
> >         __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
> >         return ret;
> >  }
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 61070248a7d3..c86e87ed172b 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >         seqcount_init(&mm->write_protect_seq);
> >         mmap_init_lock(mm);
> >         INIT_LIST_HEAD(&mm->mmlist);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -       mm->mm_lock_seq = 0;
> > -#endif
> >         mm_pgtables_bytes_init(mm);
> >         mm->map_count = 0;
> >         mm->locked_vm = 0;
> >
> > base-commit: 015bdfcb183759674ba1bd732c3393014e35708b
> > --
> > 2.46.0.662.g92d0881bb0-goog
> >

Suren Baghdasaryan Sept. 12, 2024, 10:24 p.m. UTC | #3

On Thu, Sep 12, 2024 at 3:20 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Sep 12, 2024 at 2:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Sep 12, 2024 at 2:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > Add helper functions to speculatively perform operations without
> > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > write-locked and mm is not modified from under us.
> >
> > Here you go. I hope I got the ordering right this time around, but I
> > would feel much better if Jann reviewed it before it's included in
> > your next patchset :)
>
> Thanks a lot! And yes, I'll give it a bit of time for reviews before
> sending a new revision.
>
> Did you by any chance get any new ideas for possible benchmarks
> (beyond what I did with will-it-scale)?

Hmm. You could use Mel Gorman's mmtests suite I guess.

>
>
> > Thanks,
> > Suren.
> >
> > >
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > > Changes since v1 [1]:
> > > - Made memory barriers in inc_mm_lock_seq and mmap_lock_speculation_end
> > > more strict, per Jann Horn
> > >
> > > [1] https://lore.kernel.org/all/20240906051205.530219-2-andrii@kernel.org/
> > >
> > >  include/linux/mm_types.h  |  3 ++
> > >  include/linux/mmap_lock.h | 74 ++++++++++++++++++++++++++++++++-------
> > >  kernel/fork.c             |  3 --
> > >  3 files changed, 65 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 6e3bdf8e38bc..5d8cdebd42bc 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -887,6 +887,9 @@ struct mm_struct {
> > >                  * Roughly speaking, incrementing the sequence number is
> > >                  * equivalent to releasing locks on VMAs; reading the sequence
> > >                  * number can be part of taking a read lock on a VMA.
> > > +                * Incremented every time mmap_lock is write-locked/unlocked.
> > > +                * Initialized to 0, therefore odd values indicate mmap_lock
> > > +                * is write-locked and even values that it's released.
> > >                  *
> > >                  * Can be modified under write mmap_lock using RELEASE
> > >                  * semantics.
> > > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > > index de9dc20b01ba..a281519d0c12 100644
> > > --- a/include/linux/mmap_lock.h
> > > +++ b/include/linux/mmap_lock.h
> > > @@ -71,39 +71,86 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
> > >  }
> > >
> > >  #ifdef CONFIG_PER_VMA_LOCK
> > > +static inline void init_mm_lock_seq(struct mm_struct *mm)
> > > +{
> > > +       mm->mm_lock_seq = 0;
> > > +}
> > > +
> > >  /*
> > > - * Drop all currently-held per-VMA locks.
> > > - * This is called from the mmap_lock implementation directly before releasing
> > > - * a write-locked mmap_lock (or downgrading it to read-locked).
> > > - * This should normally NOT be called manually from other places.
> > > - * If you want to call this manually anyway, keep in mind that this will release
> > > - * *all* VMA write locks, including ones from further up the stack.
> > > + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> > > + * or write-unlocked (RELEASE semantics).
> > >   */
> > > -static inline void vma_end_write_all(struct mm_struct *mm)
> > > +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
> > >  {
> > >         mmap_assert_write_locked(mm);
> > >         /*
> > >          * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> > >          * mmap_lock being held.
> > > -        * We need RELEASE semantics here to ensure that preceding stores into
> > > -        * the VMA take effect before we unlock it with this store.
> > > -        * Pairs with ACQUIRE semantics in vma_start_read().
> > >          */
> > > -       smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > > +
> > > +       if (acquire) {
> > > +               WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > > +               /*
> > > +                * For ACQUIRE semantics we should ensure no following stores are
> > > +                * reordered to appear before the mm->mm_lock_seq modification.
> > > +                */
> > > +               smp_wmb();
> > > +       } else {
> > > +               /*
> > > +                * We need RELEASE semantics here to ensure that preceding stores
> > > +                * into the VMA take effect before we unlock it with this store.
> > > +                * Pairs with ACQUIRE semantics in vma_start_read().
> > > +                */
> > > +               smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > > +       }
> > > +}
> > > +
> > > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> > > +{
> > > +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> > > +       *seq = smp_load_acquire(&mm->mm_lock_seq);
> > > +       /* Allow speculation if mmap_lock is not write-locked */
> > > +       return (*seq & 1) == 0;
> > > +}
> > > +
> > > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> > > +{
> > > +       /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */
> > > +       smp_rmb();
> > > +       return seq == READ_ONCE(mm->mm_lock_seq);
> > >  }
> > > +
> > >  #else
> > > -static inline void vma_end_write_all(struct mm_struct *mm) {}
> > > +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> > > +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
> > > +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> > > +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
> > >  #endif
> > >
> > > +/*
> > > + * Drop all currently-held per-VMA locks.
> > > + * This is called from the mmap_lock implementation directly before releasing
> > > + * a write-locked mmap_lock (or downgrading it to read-locked).
> > > + * This should normally NOT be called manually from other places.
> > > + * If you want to call this manually anyway, keep in mind that this will release
> > > + * *all* VMA write locks, including ones from further up the stack.
> > > + */
> > > +static inline void vma_end_write_all(struct mm_struct *mm)
> > > +{
> > > +       inc_mm_lock_seq(mm, false);
> > > +}
> > > +
> > >  static inline void mmap_init_lock(struct mm_struct *mm)
> > >  {
> > >         init_rwsem(&mm->mmap_lock);
> > > +       init_mm_lock_seq(mm);
> > >  }
> > >
> > >  static inline void mmap_write_lock(struct mm_struct *mm)
> > >  {
> > >         __mmap_lock_trace_start_locking(mm, true);
> > >         down_write(&mm->mmap_lock);
> > > +       inc_mm_lock_seq(mm, true);
> > >         __mmap_lock_trace_acquire_returned(mm, true, true);
> > >  }
> > >
> > > @@ -111,6 +158,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
> > >  {
> > >         __mmap_lock_trace_start_locking(mm, true);
> > >         down_write_nested(&mm->mmap_lock, subclass);
> > > +       inc_mm_lock_seq(mm, true);
> > >         __mmap_lock_trace_acquire_returned(mm, true, true);
> > >  }
> > >
> > > @@ -120,6 +168,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
> > >
> > >         __mmap_lock_trace_start_locking(mm, true);
> > >         ret = down_write_killable(&mm->mmap_lock);
> > > +       if (!ret)
> > > +               inc_mm_lock_seq(mm, true);
> > >         __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
> > >         return ret;
> > >  }
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 61070248a7d3..c86e87ed172b 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -1259,9 +1259,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> > >         seqcount_init(&mm->write_protect_seq);
> > >         mmap_init_lock(mm);
> > >         INIT_LIST_HEAD(&mm->mmlist);
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > -       mm->mm_lock_seq = 0;
> > > -#endif
> > >         mm_pgtables_bytes_init(mm);
> > >         mm->map_count = 0;
> > >         mm->locked_vm = 0;
> > >
> > > base-commit: 015bdfcb183759674ba1bd732c3393014e35708b
> > > --
> > > 2.46.0.662.g92d0881bb0-goog
> > >

Jann Horn Sept. 12, 2024, 10:52 p.m. UTC | #4

On Thu, Sep 12, 2024 at 11:02 PM Suren Baghdasaryan <surenb@google.com> wrote:
> Add helper functions to speculatively perform operations without
> read-locking mmap_lock, expecting that mmap_lock will not be
> write-locked and mm is not modified from under us.

I think this is okay now, except for some comments that should be
fixed up. (Plus my gripe about the sequence count being 32-bit.)

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8e38bc..5d8cdebd42bc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -887,6 +887,9 @@ struct mm_struct {
>                  * Roughly speaking, incrementing the sequence number is
>                  * equivalent to releasing locks on VMAs; reading the sequence
>                  * number can be part of taking a read lock on a VMA.
> +                * Incremented every time mmap_lock is write-locked/unlocked.
> +                * Initialized to 0, therefore odd values indicate mmap_lock
> +                * is write-locked and even values that it's released.

FWIW, I would still feel happier if this was a 64-bit number, though I
guess at least with uprobes the attack surface is not that large even
if you can wrap that counter... 2^31 counter increments are not all
that much, especially if someone introduces a kernel path in the
future that lets you repeatedly take the mmap_lock for writing within
a single syscall without doing much work, or maybe on some machine
where syscalls are really fast. I really don't like hinging memory
safety on how fast or slow some piece of code can run, unless we can
make strong arguments about it based on how many memory writes a CPU
core is capable of doing per second or stuff like that.

> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index de9dc20b01ba..a281519d0c12 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -71,39 +71,86 @@ static inline void mmap_assert_write_locked(const struct mm_struct *mm)
>  }
>
>  #ifdef CONFIG_PER_VMA_LOCK
> +static inline void init_mm_lock_seq(struct mm_struct *mm)
> +{
> +       mm->mm_lock_seq = 0;
> +}
> +
>  /*
> - * Drop all currently-held per-VMA locks.
> - * This is called from the mmap_lock implementation directly before releasing
> - * a write-locked mmap_lock (or downgrading it to read-locked).
> - * This should normally NOT be called manually from other places.
> - * If you want to call this manually anyway, keep in mind that this will release
> - * *all* VMA write locks, including ones from further up the stack.
> + * Increment mm->mm_lock_seq when mmap_lock is write-locked (ACQUIRE semantics)
> + * or write-unlocked (RELEASE semantics).
>   */
> -static inline void vma_end_write_all(struct mm_struct *mm)
> +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire)
>  {
>         mmap_assert_write_locked(mm);

Not a memory barriers thing, but maybe you could throw in some kind of
VM_WARN_ON() in the branches below that checks that the sequence
number is odd/even as expected, just to make extra sure...

>         /*
>          * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
>          * mmap_lock being held.
> -        * We need RELEASE semantics here to ensure that preceding stores into
> -        * the VMA take effect before we unlock it with this store.
> -        * Pairs with ACQUIRE semantics in vma_start_read().
>          */
> -       smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +
> +       if (acquire) {
> +               WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +               /*
> +                * For ACQUIRE semantics we should ensure no following stores are
> +                * reordered to appear before the mm->mm_lock_seq modification.
> +                */
> +               smp_wmb();

This is not really a full ACQUIRE; smp_wmb() only orders *stores*, not
loads, while a real ACQUIRE also prevents reads from being reordered
up above the atomic access. Please reword the comment to make it clear
that we don't have a full ACQUIRE here.

We can still have subsequent loads reordered up before the
mm->mm_lock_seq increment. But I guess that's probably fine as long as
nobody does anything exceedingly weird that involves lockless users
*writing* data that we have to read consistently, which wouldn't
really make sense...

So yeah, I guess this is probably fine, and it matches what
do_raw_write_seqcount_begin() is doing.

> +       } else {
> +               /*
> +                * We need RELEASE semantics here to ensure that preceding stores
> +                * into the VMA take effect before we unlock it with this store.
> +                * Pairs with ACQUIRE semantics in vma_start_read().
> +                */
> +               smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +       }
> +}
> +
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
> +{
> +       /* Pairs with RELEASE semantics in inc_mm_lock_seq(). */
> +       *seq = smp_load_acquire(&mm->mm_lock_seq);
> +       /* Allow speculation if mmap_lock is not write-locked */
> +       return (*seq & 1) == 0;
> +}
> +
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
> +{
> +       /* Pairs with ACQUIRE semantics in inc_mm_lock_seq(). */

(see above, it's not actually a full ACQUIRE)

> +       smp_rmb();
> +       return seq == READ_ONCE(mm->mm_lock_seq);
>  }
> +
>  #else
> -static inline void vma_end_write_all(struct mm_struct *mm) {}
> +static inline void init_mm_lock_seq(struct mm_struct *mm) {}
> +static inline void inc_mm_lock_seq(struct mm_struct *mm, bool acquire) {}
> +static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq) { return false; }
> +static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq) { return false; }
>  #endif
>
> +/*
> + * Drop all currently-held per-VMA locks.
> + * This is called from the mmap_lock implementation directly before releasing
> + * a write-locked mmap_lock (or downgrading it to read-locked).
> + * This should normally NOT be called manually from other places.
> + * If you want to call this manually anyway, keep in mind that this will release
> + * *all* VMA write locks, including ones from further up the stack.

Outdated comment - now you are absolutely not allowed to call
vma_end_write_all() manually anymore, it would mess up the odd/even
state of the counter.

> + */
> +static inline void vma_end_write_all(struct mm_struct *mm)
> +{
> +       inc_mm_lock_seq(mm, false);
> +}

Matthew Wilcox Sept. 24, 2024, 5:15 p.m. UTC | #5

On Fri, Sep 13, 2024 at 12:52:39AM +0200, Jann Horn wrote:
> FWIW, I would still feel happier if this was a 64-bit number, though I
> guess at least with uprobes the attack surface is not that large even
> if you can wrap that counter... 2^31 counter increments are not all
> that much, especially if someone introduces a kernel path in the
> future that lets you repeatedly take the mmap_lock for writing within
> a single syscall without doing much work, or maybe on some machine
> where syscalls are really fast. I really don't like hinging memory
> safety on how fast or slow some piece of code can run, unless we can
> make strong arguments about it based on how many memory writes a CPU
> core is capable of doing per second or stuff like that.

You could repeatedly call munmap(1, 0) which will take the
mmap_write_lock, do no work and call mmap_write_unlock().  We could
fix that by moving the start/len validation outside the
mmap_write_lock(), but it won't increase the path length by much.
How many syscalls can we do per second?
https://blogs.oracle.com/linux/post/syscall-latency suggests 217ns per
syscall, so we'll be close to 4.6m syscalls/second or 466 seconds (7
minutes, 46 seconds).

Jann Horn Sept. 24, 2024, 6 p.m. UTC | #6

On Tue, Sep 24, 2024 at 7:15 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Sep 13, 2024 at 12:52:39AM +0200, Jann Horn wrote:
> > FWIW, I would still feel happier if this was a 64-bit number, though I
> > guess at least with uprobes the attack surface is not that large even
> > if you can wrap that counter... 2^31 counter increments are not all
> > that much, especially if someone introduces a kernel path in the
> > future that lets you repeatedly take the mmap_lock for writing within
> > a single syscall without doing much work, or maybe on some machine
> > where syscalls are really fast. I really don't like hinging memory
> > safety on how fast or slow some piece of code can run, unless we can
> > make strong arguments about it based on how many memory writes a CPU
> > core is capable of doing per second or stuff like that.
>
> You could repeatedly call munmap(1, 0) which will take the
> mmap_write_lock, do no work and call mmap_write_unlock().  We could
> fix that by moving the start/len validation outside the
> mmap_write_lock(), but it won't increase the path length by much.
> How many syscalls can we do per second?
> https://blogs.oracle.com/linux/post/syscall-latency suggests 217ns per
> syscall, so we'll be close to 4.6m syscalls/second or 466 seconds (7
> minutes, 46 seconds).

Yeah, that seems like a pretty reasonable guess.

One method that may or may not be faster would be to use an io-uring
worker to dispatch a bunch of IORING_OP_MADVISE operations - that
would save on syscall entry overhead but in exchange you'd have to
worry about feeding a constant stream of work into the worker thread
in a cache-efficient way, maybe by having one CPU constantly switch
back and forth between a userspace thread and a uring worker or
something like that.

[v2,1/1] mm: introduce mmap_lock_speculation_{start|end}

Commit Message

Comments

Patch