diff mbox series

Add mmap_assert_locked() annotations to find_vma*()

Message ID 20210731175341.3458608-1-lrizzo@google.com (mailing list archive)
State New
Headers show
Series Add mmap_assert_locked() annotations to find_vma*() | expand

Commit Message

Luigi Rizzo July 31, 2021, 5:53 p.m. UTC
find_vma() and variants need protection when used.
This patch adds mmap_assert_lock() calls in the functions.

To make sure the invariant is satisfied, we also need to add a
mmap_read_loc() around the get_user_pages_remote() call in
get_arg_page(). The lock is not strictly necessary because the mm
has been newly created, but the extra cost is limited because
the same mutex was also acquired shortly before in __bprm_mm_init(),
so it is hot and uncontended.

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 fs/exec.c | 2 ++
 mm/mmap.c | 2 ++
 2 files changed, 4 insertions(+)

Comments

Andrew Morton Aug. 1, 2021, 7:33 p.m. UTC | #1
On Sat, 31 Jul 2021 10:53:41 -0700 Luigi Rizzo <lrizzo@google.com> wrote:

> find_vma() and variants need protection when used.
> This patch adds mmap_assert_lock() calls in the functions.
> 
> To make sure the invariant is satisfied, we also need to add a
> mmap_read_loc() around the get_user_pages_remote() call in
> get_arg_page(). The lock is not strictly necessary because the mm
> has been newly created, but the extra cost is limited because
> the same mutex was also acquired shortly before in __bprm_mm_init(),
> so it is hot and uncontended.
> 

Well, it isn't cost-free.  find_vma() is called a lot and a surprising
number of systems apparently run with CONFIG_DEBUG_VM.  Why do you
think this cost is justified?
Luigi Rizzo Aug. 2, 2021, 12:11 a.m. UTC | #2
On Sun, Aug 1, 2021 at 9:33 PM Andrew Morton <akpm@linux-foundation.org>
wrote:

> On Sat, 31 Jul 2021 10:53:41 -0700 Luigi Rizzo <lrizzo@google.com> wrote:
>
> > find_vma() and variants need protection when used.
> > This patch adds mmap_assert_lock() calls in the functions.
> >
> > To make sure the invariant is satisfied, we also need to add a
> > mmap_read_loc() around the get_user_pages_remote() call in
> > get_arg_page(). The lock is not strictly necessary because the mm
> > has been newly created, but the extra cost is limited because
> > the same mutex was also acquired shortly before in __bprm_mm_init(),
> > so it is hot and uncontended.
> >
>
> Well, it isn't cost-free.  find_vma() is called a lot and a surprising
> number of systems apparently run with CONFIG_DEBUG_VM.  Why do you
> think this cost is justified?
>

I assume you are concerned with the cost of mmap_assert_locked() ?

I'd say the justification is the same as for all asserts:
at some point some code change may miss the required lock, and the
asserts are there to catch elusive race conditions,

There are in fact already instances of mmap_locked_assert()
right before find_vma() in walk_page_range(), and a couple before
calls to __get_user_pages().

As for the cost, I'd think that if CONFIG_DEBUG_VM is set,
one does it on purpose to catch errors and is prepared to pay
the cost (in this case the atomic_read(counter) in rwsem_is_locked(),
the counter should be hot).

FWIW I have instrumented find_vma() on a fast machine using kstats

   https://github.com/luigirizzo/lr-cstats

(load the module then enable the trace with
  echo "trace pcpu:find_vma bits 3" > /sys/kernel/debug/kstats/_control
and monitor the time with
  watch "grep CPUS /sys/kernel/debug/kstats/find_vma"

I didn't run anything especially intensive except some network
benchmarks, but I have collected ~2M samples with the following
distribution of find_vma() time in nanoseconds in 3 configs:

CONFIGURATION         p10   p50   p90   p95   p98

no-debug               89   109   214   332    605
debug                 331   369   603   862   1338
debug+this patch      337   369   603   863   1339

As you can see, just compiling a debug kernel, even without this patch,
makes the function 3x more expensive. The effect of this patch is
not measurable (the differences are below measurement error).

cheers
luigi
Luigi Rizzo Aug. 2, 2021, 12:16 a.m. UTC | #3
(repost in plain text)

On Sun, Aug 1, 2021 at 9:33 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sat, 31 Jul 2021 10:53:41 -0700 Luigi Rizzo <lrizzo@google.com> wrote:
>
> > find_vma() and variants need protection when used.
> > This patch adds mmap_assert_lock() calls in the functions.
> >
> > To make sure the invariant is satisfied, we also need to add a
> > mmap_read_loc() around the get_user_pages_remote() call in
> > get_arg_page(). The lock is not strictly necessary because the mm
> > has been newly created, but the extra cost is limited because
> > the same mutex was also acquired shortly before in __bprm_mm_init(),
> > so it is hot and uncontended.
> >
>
> Well, it isn't cost-free.  find_vma() is called a lot and a surprising
> number of systems apparently run with CONFIG_DEBUG_VM.  Why do you
> think this cost is justified?

I assume you are concerned with the cost of mmap_assert_locked() ?

I'd say the justification is the same as for all asserts:
at some point some code change may miss the required lock, and the
asserts are there to catch elusive race conditions,

There are in fact already instances of mmap_locked_assert()
right before find_vma() in walk_page_range(), and a couple before
calls to __get_user_pages().

As for the cost, I'd think that if CONFIG_DEBUG_VM is set,
one does it on purpose to catch errors and is prepared to pay
the cost (in this case the atomic_read(counter) in rwsem_is_locked(),
the counter should be hot).

FWIW I have instrumented find_vma() on a fast machine using kstats

   https://github.com/luigirizzo/lr-cstats

(load the module then enable the trace with
  echo "trace pcpu:find_vma bits 3" > /sys/kernel/debug/kstats/_control
and monitor the time with
  watch "grep CPUS /sys/kernel/debug/kstats/find_vma"

I didn't run anything especially intensive except some network
benchmarks, but I have collected ~2M samples with the following
distribution of find_vma() time in nanoseconds in 3 configs:

CONFIGURATION         p10   p50   p90   p95   p98

no-debug               89   109   214   332    605
debug                 331   369   603   862   1338
debug+this patch      337   369   603   863   1339

As you can see, just compiling a debug kernel, even without this patch,
makes the function 3x more expensive. The effect of this patch is
not measurable (the differences are below measurement error).

cheers
luigi
Andrew Morton Aug. 2, 2021, 9:11 p.m. UTC | #4
On Mon, 2 Aug 2021 02:16:14 +0200 Luigi Rizzo <lrizzo@google.com> wrote:

> > Well, it isn't cost-free.  find_vma() is called a lot and a surprising
> > number of systems apparently run with CONFIG_DEBUG_VM.  Why do you
> > think this cost is justified?
> 
> I assume you are concerned with the cost of mmap_assert_locked() ?
> 
> I'd say the justification is the same as for all asserts:
> at some point some code change may miss the required lock, and the
> asserts are there to catch elusive race conditions,
> 
> There are in fact already instances of mmap_locked_assert()
> right before find_vma() in walk_page_range(), and a couple before
> calls to __get_user_pages().
> 
> As for the cost, I'd think that if CONFIG_DEBUG_VM is set,
> one does it on purpose to catch errors and is prepared to pay
> the cost (in this case the atomic_read(counter) in rwsem_is_locked(),
> the counter should be hot).
> 
> FWIW I have instrumented find_vma() on a fast machine using kstats
> 
>    https://github.com/luigirizzo/lr-cstats
> 
> (load the module then enable the trace with
>   echo "trace pcpu:find_vma bits 3" > /sys/kernel/debug/kstats/_control
> and monitor the time with
>   watch "grep CPUS /sys/kernel/debug/kstats/find_vma"
> 
> I didn't run anything especially intensive except some network
> benchmarks, but I have collected ~2M samples with the following
> distribution of find_vma() time in nanoseconds in 3 configs:
> 
> CONFIGURATION         p10   p50   p90   p95   p98
> 
> no-debug               89   109   214   332    605
> debug                 331   369   603   862   1338
> debug+this patch      337   369   603   863   1339
> 
> As you can see, just compiling a debug kernel, even without this patch,
> makes the function 3x more expensive. The effect of this patch is
> not measurable (the differences are below measurement error).

Cool, thanks, that's convincing.
Jason Gunthorpe Aug. 3, 2021, 4:08 p.m. UTC | #5
On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> find_vma() and variants need protection when used.
> This patch adds mmap_assert_lock() calls in the functions.
> 
> To make sure the invariant is satisfied, we also need to add a
> mmap_read_loc() around the get_user_pages_remote() call in
> get_arg_page(). The lock is not strictly necessary because the mm
> has been newly created, but the extra cost is limited because
> the same mutex was also acquired shortly before in __bprm_mm_init(),
> so it is hot and uncontended.
> 
> Signed-off-by: Luigi Rizzo <lrizzo@google.com>
>  fs/exec.c | 2 ++
>  mm/mmap.c | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 38f63451b928..ac7603e985b4 100644
> +++ b/fs/exec.c
> @@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
>  	 * We are doing an exec().  'current' is the process
>  	 * doing the exec and bprm->mm is the new process's mm.
>  	 */
> +	mmap_read_lock(bprm->mm);
>  	ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
>  			&page, NULL, NULL);
> +	mmap_read_unlock(bprm->mm);
>  	if (ret <= 0)
>  		return NULL;

Wasn't Jann Horn working on something like this too?

https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/

IIRC it was very tricky here, are you sure it is OK to obtain this lock
here?

I would much rather see Jann's complete solution be merged then
hacking at the exec problem on the side..

Jason
Luigi Rizzo Aug. 3, 2021, 9:48 p.m. UTC | #6
On Tue, Aug 3, 2021 at 6:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> > find_vma() and variants need protection when used.
> > This patch adds mmap_assert_lock() calls in the functions.
> >
> > To make sure the invariant is satisfied, we also need to add a
> > mmap_read_loc() around the get_user_pages_remote() call in
> > get_arg_page(). The lock is not strictly necessary because the mm
> > has been newly created, but the extra cost is limited because
> > the same mutex was also acquired shortly before in __bprm_mm_init(),
> > so it is hot and uncontended.
> >
> > Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> >  fs/exec.c | 2 ++
> >  mm/mmap.c | 2 ++
> >  2 files changed, 4 insertions(+)
> >
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 38f63451b928..ac7603e985b4 100644
> > +++ b/fs/exec.c
> > @@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> >        * We are doing an exec().  'current' is the process
> >        * doing the exec and bprm->mm is the new process's mm.
> >        */
> > +     mmap_read_lock(bprm->mm);
> >       ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
> >                       &page, NULL, NULL);
> > +     mmap_read_unlock(bprm->mm);
> >       if (ret <= 0)
> >               return NULL;
>
> Wasn't Jann Horn working on something like this too?
>
> https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/
>
> IIRC it was very tricky here, are you sure it is OK to obtain this lock
> here?

I cannot comment on Jann's patch series but no other thread knows
about this mm at this point in the code so the lock is definitely
safe to acquire (shortly before there was also a write lock acquired
on the same mm, in the same conditions).

cheers
luigi

>
> I would much rather see Jann's complete solution be merged then
> hacking at the exec problem on the side..
>
> Jason
Liam Howlett Aug. 3, 2021, 11:07 p.m. UTC | #7
* Luigi Rizzo <lrizzo@google.com> [210803 17:49]:
> On Tue, Aug 3, 2021 at 6:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> > > find_vma() and variants need protection when used.
> > > This patch adds mmap_assert_lock() calls in the functions.
> > >
> > > To make sure the invariant is satisfied, we also need to add a
> > > mmap_read_loc() around the get_user_pages_remote() call in
> > > get_arg_page(). The lock is not strictly necessary because the mm
> > > has been newly created, but the extra cost is limited because
> > > the same mutex was also acquired shortly before in __bprm_mm_init(),
> > > so it is hot and uncontended.
> > >
> > > Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> > >  fs/exec.c | 2 ++
> > >  mm/mmap.c | 2 ++
> > >  2 files changed, 4 insertions(+)
> > >
> > > diff --git a/fs/exec.c b/fs/exec.c
> > > index 38f63451b928..ac7603e985b4 100644
> > > +++ b/fs/exec.c
> > > @@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> > >        * We are doing an exec().  'current' is the process
> > >        * doing the exec and bprm->mm is the new process's mm.
> > >        */
> > > +     mmap_read_lock(bprm->mm);
> > >       ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
> > >                       &page, NULL, NULL);
> > > +     mmap_read_unlock(bprm->mm);
> > >       if (ret <= 0)
> > >               return NULL;
> >
> > Wasn't Jann Horn working on something like this too?
> >
> > https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/
> >
> > IIRC it was very tricky here, are you sure it is OK to obtain this lock
> > here?
> 
> I cannot comment on Jann's patch series but no other thread knows
> about this mm at this point in the code so the lock is definitely
> safe to acquire (shortly before there was also a write lock acquired
> on the same mm, in the same conditions).

If there is no other code that knows about this mm, then does one need
the lock at all?  Is this just to satisfy the new check you added?

If you want to make this change, I would suggest writing it in a way to
ensure the call to expand_downwards() in the same function also holds
the lock.  I believe this is technically required as well?  What do you
think?

Thanks,
Liam

> 
> cheers
> luigi
> 
> >
> > I would much rather see Jann's complete solution be merged then
> > hacking at the exec problem on the side..
> >
> > Jason
>
Jason Gunthorpe Aug. 3, 2021, 11:35 p.m. UTC | #8
On Tue, Aug 03, 2021 at 11:07:35PM +0000, Liam Howlett wrote:
> * Luigi Rizzo <lrizzo@google.com> [210803 17:49]:
> > On Tue, Aug 3, 2021 at 6:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> > > > find_vma() and variants need protection when used.
> > > > This patch adds mmap_assert_lock() calls in the functions.
> > > >
> > > > To make sure the invariant is satisfied, we also need to add a
> > > > mmap_read_loc() around the get_user_pages_remote() call in
> > > > get_arg_page(). The lock is not strictly necessary because the mm
> > > > has been newly created, but the extra cost is limited because
> > > > the same mutex was also acquired shortly before in __bprm_mm_init(),
> > > > so it is hot and uncontended.
> > > >
> > > > Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> > > >  fs/exec.c | 2 ++
> > > >  mm/mmap.c | 2 ++
> > > >  2 files changed, 4 insertions(+)
> > > >
> > > > diff --git a/fs/exec.c b/fs/exec.c
> > > > index 38f63451b928..ac7603e985b4 100644
> > > > +++ b/fs/exec.c
> > > > @@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> > > >        * We are doing an exec().  'current' is the process
> > > >        * doing the exec and bprm->mm is the new process's mm.
> > > >        */
> > > > +     mmap_read_lock(bprm->mm);
> > > >       ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
> > > >                       &page, NULL, NULL);
> > > > +     mmap_read_unlock(bprm->mm);
> > > >       if (ret <= 0)
> > > >               return NULL;
> > >
> > > Wasn't Jann Horn working on something like this too?
> > >
> > > https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/
> > >
> > > IIRC it was very tricky here, are you sure it is OK to obtain this lock
> > > here?
> > 
> > I cannot comment on Jann's patch series but no other thread knows
> > about this mm at this point in the code so the lock is definitely
> > safe to acquire (shortly before there was also a write lock acquired
> > on the same mm, in the same conditions).
> 
> If there is no other code that knows about this mm, then does one need
> the lock at all?  Is this just to satisfy the new check you added?
> 
> If you want to make this change, I would suggest writing it in a way to
> ensure the call to expand_downwards() in the same function also holds
> the lock.  I believe this is technically required as well?  What do you
> think?

This is essentially what Jann was doing. Since the mm is newly created
we can create it write locked and then we can add proper locking tests
to many of the functions called along this path.

Adding useless locks around each troublesome callsite just seems
really confusing to me.

Jason
Luigi Rizzo Aug. 3, 2021, 11:57 p.m. UTC | #9
On Wed, Aug 4, 2021 at 1:35 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Aug 03, 2021 at 11:07:35PM +0000, Liam Howlett wrote:
> > * Luigi Rizzo <lrizzo@google.com> [210803 17:49]:
> > > On Tue, Aug 3, 2021 at 6:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> > > > > find_vma() and variants need protection when used.
> > > > > This patch adds mmap_assert_lock() calls in the functions.
> > > > >
> > > > > To make sure the invariant is satisfied, we also need to add a
> > > > > mmap_read_loc() around the get_user_pages_remote() call in
> > > > > get_arg_page(). The lock is not strictly necessary because the mm
> > > > > has been newly created, but the extra cost is limited because
> > > > > the same mutex was also acquired shortly before in __bprm_mm_init(),
> > > > > so it is hot and uncontended.
> > > > >
> > > > > Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> > > > >  fs/exec.c | 2 ++
> > > > >  mm/mmap.c | 2 ++
> > > > >  2 files changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/fs/exec.c b/fs/exec.c
> > > > > index 38f63451b928..ac7603e985b4 100644
> > > > > +++ b/fs/exec.c
> > > > > @@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> > > > >        * We are doing an exec().  'current' is the process
> > > > >        * doing the exec and bprm->mm is the new process's mm.
> > > > >        */
> > > > > +     mmap_read_lock(bprm->mm);
> > > > >       ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
> > > > >                       &page, NULL, NULL);
> > > > > +     mmap_read_unlock(bprm->mm);
> > > > >       if (ret <= 0)
> > > > >               return NULL;
> > > >
> > > > Wasn't Jann Horn working on something like this too?
> > > >
> > > > https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/
> > > >
> > > > IIRC it was very tricky here, are you sure it is OK to obtain this lock
> > > > here?
> > >
> > > I cannot comment on Jann's patch series but no other thread knows
> > > about this mm at this point in the code so the lock is definitely
> > > safe to acquire (shortly before there was also a write lock acquired
> > > on the same mm, in the same conditions).
> >
> > If there is no other code that knows about this mm, then does one need
> > the lock at all?  Is this just to satisfy the new check you added?
> >
> > If you want to make this change, I would suggest writing it in a way to
> > ensure the call to expand_downwards() in the same function also holds
> > the lock.  I believe this is technically required as well?  What do you
> > think?
>
> This is essentially what Jann was doing. Since the mm is newly created
> we can create it write locked and then we can add proper locking tests
> to many of the functions called along this path.
>
> Adding useless locks around each troublesome callsite just seems
> really confusing to me.

Uhm... by that reasoning, even creating the mm locked (and unlocking
at the end) is equally unnecessary.

My goal was to add asserts and invariants that are easy
to understand and get right, rather than optimize a path
that does not appear to be critical.

Adding one read lock pair around the one function we annotate
is easy to understand and clearly a leaf lock.

Having alloc_bprm return a locked object is a bit unconventional,
and also passing it to other methods raises the question of whether
they take other lock possibly causing lock order reversals
in the future.

cheers
luigi
Liam Howlett Aug. 4, 2021, 5:12 a.m. UTC | #10
* Luigi Rizzo <lrizzo@google.com> [210803 19:58]:
> On Wed, Aug 4, 2021 at 1:35 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Aug 03, 2021 at 11:07:35PM +0000, Liam Howlett wrote:
> > > * Luigi Rizzo <lrizzo@google.com> [210803 17:49]:
> > > > On Tue, Aug 3, 2021 at 6:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> > > > > > find_vma() and variants need protection when used.
> > > > > > This patch adds mmap_assert_lock() calls in the functions.
> > > > > >
> > > > > > To make sure the invariant is satisfied, we also need to add a
> > > > > > mmap_read_loc() around the get_user_pages_remote() call in
> > > > > > get_arg_page(). The lock is not strictly necessary because the mm
> > > > > > has been newly created, but the extra cost is limited because
> > > > > > the same mutex was also acquired shortly before in __bprm_mm_init(),
> > > > > > so it is hot and uncontended.
> > > > > >
> > > > > > Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> > > > > >  fs/exec.c | 2 ++
> > > > > >  mm/mmap.c | 2 ++
> > > > > >  2 files changed, 4 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/exec.c b/fs/exec.c
> > > > > > index 38f63451b928..ac7603e985b4 100644
> > > > > > +++ b/fs/exec.c
> > > > > > @@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> > > > > >        * We are doing an exec().  'current' is the process
> > > > > >        * doing the exec and bprm->mm is the new process's mm.
> > > > > >        */
> > > > > > +     mmap_read_lock(bprm->mm);
> > > > > >       ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
> > > > > >                       &page, NULL, NULL);
> > > > > > +     mmap_read_unlock(bprm->mm);
> > > > > >       if (ret <= 0)
> > > > > >               return NULL;
> > > > >
> > > > > Wasn't Jann Horn working on something like this too?
> > > > >
> > > > > https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/
> > > > >
> > > > > IIRC it was very tricky here, are you sure it is OK to obtain this lock
> > > > > here?
> > > >
> > > > I cannot comment on Jann's patch series but no other thread knows
> > > > about this mm at this point in the code so the lock is definitely
> > > > safe to acquire (shortly before there was also a write lock acquired
> > > > on the same mm, in the same conditions).
> > >
> > > If there is no other code that knows about this mm, then does one need
> > > the lock at all?  Is this just to satisfy the new check you added?
> > >
> > > If you want to make this change, I would suggest writing it in a way to
> > > ensure the call to expand_downwards() in the same function also holds
> > > the lock.  I believe this is technically required as well?  What do you
> > > think?
> >
> > This is essentially what Jann was doing. Since the mm is newly created
> > we can create it write locked and then we can add proper locking tests
> > to many of the functions called along this path.

That sounds good.  Jann has left the patch as pending a fix since
November 2020.  Can't the removal of the lock/unlock be added to the
next iteration of the patch?  Was there a v4 of that patch?

> >
> > Adding useless locks around each troublesome callsite just seems
> > really confusing to me.
> 
> Uhm... by that reasoning, even creating the mm locked (and unlocking
> at the end) is equally unnecessary.

I think taking the lock is more clear than leaving it the way it's
currently written.  It is actually confusing to see the lock taken after
calling expand_downwards() which explicitly mentions the lock as
required in the comments though.  This should at least have a comment
about early creation not requiring the lock.

> 
> My goal was to add asserts and invariants that are easy
> to understand and get right, rather than optimize a path
> that does not appear to be critical.
> 
> Adding one read lock pair around the one function we annotate
> is easy to understand and clearly a leaf lock.
> 
> Having alloc_bprm return a locked object is a bit unconventional,
> and also passing it to other methods raises the question of whether
> they take other lock possibly causing lock order reversals
> in the future.

We are (probably?) okay as the usual order right now is to take the mmap
sem before the pte and interval tree.  It's also just for the set up, so
unless there is a special case that could cause trouble... or maybe I
should ask which cases will cause trouble?

Thanks,
Liam
Jann Horn Aug. 4, 2021, 2:42 p.m. UTC | #11
On Wed, Aug 4, 2021 at 1:07 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> * Luigi Rizzo <lrizzo@google.com> [210803 17:49]:
> > On Tue, Aug 3, 2021 at 6:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> > > > find_vma() and variants need protection when used.
> > > > This patch adds mmap_assert_lock() calls in the functions.
> > > >
> > > > To make sure the invariant is satisfied, we also need to add a
> > > > mmap_read_loc() around the get_user_pages_remote() call in
> > > > get_arg_page(). The lock is not strictly necessary because the mm
> > > > has been newly created, but the extra cost is limited because
> > > > the same mutex was also acquired shortly before in __bprm_mm_init(),
> > > > so it is hot and uncontended.
> > > >
> > > > Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> > > >  fs/exec.c | 2 ++
> > > >  mm/mmap.c | 2 ++
> > > >  2 files changed, 4 insertions(+)
> > > >
> > > > diff --git a/fs/exec.c b/fs/exec.c
> > > > index 38f63451b928..ac7603e985b4 100644
> > > > +++ b/fs/exec.c
> > > > @@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> > > >        * We are doing an exec().  'current' is the process
> > > >        * doing the exec and bprm->mm is the new process's mm.
> > > >        */
> > > > +     mmap_read_lock(bprm->mm);
> > > >       ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
> > > >                       &page, NULL, NULL);
> > > > +     mmap_read_unlock(bprm->mm);
> > > >       if (ret <= 0)
> > > >               return NULL;
> > >
> > > Wasn't Jann Horn working on something like this too?
> > >
> > > https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/
> > >
> > > IIRC it was very tricky here, are you sure it is OK to obtain this lock
> > > here?
> >
> > I cannot comment on Jann's patch series but no other thread knows
> > about this mm at this point in the code so the lock is definitely
> > safe to acquire (shortly before there was also a write lock acquired
> > on the same mm, in the same conditions).
>
> If there is no other code that knows about this mm, then does one need
> the lock at all?  Is this just to satisfy the new check you added?
>
> If you want to make this change, I would suggest writing it in a way to
> ensure the call to expand_downwards() in the same function also holds
> the lock.  I believe this is technically required as well?  What do you
> think?

The call to expand_downwards() takes a VMA pointer as argument, and
the mmap lock is the only thing that normally prevents concurrent
freeing of VMA structs. Taking a lock there would be of limited utility - either
the lock is not necessary because nobody else can access the MM, or
the lock is insufficient because someone could have freed the VMA
pointer before the lock was taken. So I think that taking a lock
around the expand_downwards() call would just be obfuscating things,
unless you specifically want to prevent concurrent *reads* while
concurrent *writes* are impossible.

Since I haven't sent a new version of my old series for almost a year,
I think it'd be fine to take Luigi's patch for now, and undo it at a
later point when/if we want to actually use proper locking here
because we're worried about concurrent access to the MM.
Jason Gunthorpe Aug. 4, 2021, 3:21 p.m. UTC | #12
On Wed, Aug 04, 2021 at 04:42:23PM +0200, Jann Horn wrote:
> Since I haven't sent a new version of my old series for almost a year,
> I think it'd be fine to take Luigi's patch for now, and undo it at a
> later point when/if we want to actually use proper locking here
> because we're worried about concurrent access to the MM.

IIRC one of the major points of that work was not "proper locking" but
to have enough locking to be complatible with lockdep so we could add
assertions like in get_user_pages and find_vma.

Jason
Liam Howlett Aug. 4, 2021, 5:32 p.m. UTC | #13
* Jann Horn <jannh@google.com> [210804 10:42]:
> On Wed, Aug 4, 2021 at 1:07 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> > * Luigi Rizzo <lrizzo@google.com> [210803 17:49]:
> > > On Tue, Aug 3, 2021 at 6:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> > > > > find_vma() and variants need protection when used.
> > > > > This patch adds mmap_assert_lock() calls in the functions.
> > > > >
> > > > > To make sure the invariant is satisfied, we also need to add a
> > > > > mmap_read_loc() around the get_user_pages_remote() call in
> > > > > get_arg_page(). The lock is not strictly necessary because the mm
> > > > > has been newly created, but the extra cost is limited because
> > > > > the same mutex was also acquired shortly before in __bprm_mm_init(),
> > > > > so it is hot and uncontended.
> > > > >
> > > > > Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> > > > >  fs/exec.c | 2 ++
> > > > >  mm/mmap.c | 2 ++
> > > > >  2 files changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/fs/exec.c b/fs/exec.c
> > > > > index 38f63451b928..ac7603e985b4 100644
> > > > > +++ b/fs/exec.c
> > > > > @@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> > > > >        * We are doing an exec().  'current' is the process
> > > > >        * doing the exec and bprm->mm is the new process's mm.
> > > > >        */
> > > > > +     mmap_read_lock(bprm->mm);
> > > > >       ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
> > > > >                       &page, NULL, NULL);
> > > > > +     mmap_read_unlock(bprm->mm);
> > > > >       if (ret <= 0)
> > > > >               return NULL;
> > > >
> > > > Wasn't Jann Horn working on something like this too?
> > > >
> > > > https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/
> > > >
> > > > IIRC it was very tricky here, are you sure it is OK to obtain this lock
> > > > here?
> > >
> > > I cannot comment on Jann's patch series but no other thread knows
> > > about this mm at this point in the code so the lock is definitely
> > > safe to acquire (shortly before there was also a write lock acquired
> > > on the same mm, in the same conditions).
> >
> > If there is no other code that knows about this mm, then does one need
> > the lock at all?  Is this just to satisfy the new check you added?
> >
> > If you want to make this change, I would suggest writing it in a way to
> > ensure the call to expand_downwards() in the same function also holds
> > the lock.  I believe this is technically required as well?  What do you
> > think?
> 
> The call to expand_downwards() takes a VMA pointer as argument, and
> the mmap lock is the only thing that normally prevents concurrent
> freeing of VMA structs. Taking a lock there would be of limited utility - either
> the lock is not necessary because nobody else can access the MM, or
> the lock is insufficient because someone could have freed the VMA
> pointer before the lock was taken. So I think that taking a lock
> around the expand_downwards() call would just be obfuscating things,
> unless you specifically want to prevent concurrent *reads* while
> concurrent *writes* are impossible.

Good point on the VMA being passed in, that certainly points to your
previous patch being a better approach.  That resolves my questions
around the patch.

> 
> Since I haven't sent a new version of my old series for almost a year,
> I think it'd be fine to take Luigi's patch for now, and undo it at a
> later point when/if we want to actually use proper locking here
> because we're worried about concurrent access to the MM.
Jann Horn Aug. 4, 2021, 9:22 p.m. UTC | #14
On Wed, Aug 4, 2021 at 5:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Wed, Aug 04, 2021 at 04:42:23PM +0200, Jann Horn wrote:
> > Since I haven't sent a new version of my old series for almost a year,
> > I think it'd be fine to take Luigi's patch for now, and undo it at a
> > later point when/if we want to actually use proper locking here
> > because we're worried about concurrent access to the MM.
>
> IIRC one of the major points of that work was not "proper locking" but
> to have enough locking to be complatible with lockdep so we could add
> assertions like in get_user_pages and find_vma.

That's part of it; but it's also for making the code more clearly
correct and future-proofing it. Looking at it now, I think
process_madvise() might actually already be able to race with execve()
to some degree; and if you made a change like this to the current
kernel:

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d3d348b17f4..3648c198673c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1043,12 +1043,14 @@ madvise_behavior_valid(int behavior)
 static bool
 process_madvise_behavior_valid(int behavior)
 {
        switch (behavior) {
        case MADV_COLD:
        case MADV_PAGEOUT:
+       case MADV_DOFORK:
+       case MADV_DONTFORK:
                return true;
        default:
                return false;
        }
 }

it would probably introduce a memory corruption bug, because then
someone might be able to destroy the stack VMA between
setup_new_exec() and setup_arg_pages() by using process_madvise() to
trigger VMA splitting/merging in the right pattern.
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 38f63451b928..ac7603e985b4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -217,8 +217,10 @@  static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 	 * We are doing an exec().  'current' is the process
 	 * doing the exec and bprm->mm is the new process's mm.
 	 */
+	mmap_read_lock(bprm->mm);
 	ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
 			&page, NULL, NULL);
+	mmap_read_unlock(bprm->mm);
 	if (ret <= 0)
 		return NULL;
 
diff --git a/mm/mmap.c b/mm/mmap.c
index ca54d36d203a..79f4f8ae43ec 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -534,6 +534,7 @@  static int find_vma_links(struct mm_struct *mm, unsigned long addr,
 {
 	struct rb_node **__rb_link, *__rb_parent, *rb_prev;
 
+	mmap_assert_locked(mm);
 	__rb_link = &mm->mm_rb.rb_node;
 	rb_prev = __rb_parent = NULL;
 
@@ -2303,6 +2304,7 @@  struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 	struct rb_node *rb_node;
 	struct vm_area_struct *vma;
 
+	mmap_assert_locked(mm);
 	/* Check the cache first. */
 	vma = vmacache_find(mm, addr);
 	if (likely(vma))