
binder_alloc: Move alloc_page() out of mmap_rwsem to reduce the lock duration

Message ID 20240902225009.34576-1-21cnbao@gmail.com (mailing list archive)
State New
Series binder_alloc: Move alloc_page() out of mmap_rwsem to reduce the lock duration

Commit Message

Barry Song Sept. 2, 2024, 10:50 p.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

The mmap_write_lock() can block all access to the VMAs, for example page
faults. Performing memory allocation while holding this lock may trigger
direct reclaim, leading to other tasks being queued on the rwsem for an
extended period.
We've observed that the allocation can sometimes take more than 300ms,
significantly blocking other threads. The user interface sometimes
becomes less responsive as a result. To prevent this, let's move the
allocation outside of the write lock.
A potential side effect is an extra alloc_page() by a second thread
executing binder_install_single_page() when the first thread has already
installed the page. However, according to Tangquan's 48-hour profiling
using monkey, the likelihood of this occurring is minimal, with a ratio
of only 1 in 2400. Compared to the significantly costly rwsem, this is
negligible.
On the other hand, holding a write lock without making any VMA
modifications appears questionable and likely incorrect. While this
patch focuses on reducing the lock duration, future updates may aim
to eliminate the write lock entirely.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Arve Hjønnevåg" <arve@android.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Martijn Coenen <maco@android.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 drivers/android/binder_alloc.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
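
For orientation, a heavily simplified sketch of the before/after ordering this
series implements (not the literal driver code; error handling and freeing of
the unused page are omitted):

	/* before: allocation (and possible direct reclaim) happens while
	 * every other mmap_rwsem user is blocked */
	mmap_write_lock(alloc->mm);
	if (!binder_get_installed_page(lru_page)) {
		page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
		vm_insert_page(alloc->vma, addr, page);
	}
	mmap_write_unlock(alloc->mm);

	/* after: the potentially slow allocation is done up front, so the
	 * write lock only covers the check-and-insert step */
	page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
	mmap_write_lock(alloc->mm);
	if (!binder_get_installed_page(lru_page))
		vm_insert_page(alloc->vma, addr, page);
	mmap_write_unlock(alloc->mm);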

Comments

Greg KH Sept. 3, 2024, 8:57 a.m. UTC | #1
On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> The mmap_write_lock() can block all access to the VMAs, for example page
> faults. Performing memory allocation while holding this lock may trigger
> direct reclamation, leading to others being queued in the rwsem for an
> extended period.
> We've observed that the allocation can sometimes take more than 300ms,
> significantly blocking other threads. The user interface sometimes
> becomes less responsive as a result. To prevent this, let's move the
> allocation outside of the write lock.
> A potential side effect could be an extra alloc_page() for the second
> thread executing binder_install_single_page() while the first thread
> has done it earlier. However, according to Tangquan's 48-hour profiling
> using monkey, the likelihood of this occurring is minimal, with a ratio
> of only 1 in 2400. Compared to the significantly costly rwsem, this is
> negligible.
> On the other hand, holding a write lock without making any VMA
> modifications appears questionable and likely incorrect. While this
> patch focuses on reducing the lock duration, future updates may aim
> to eliminate the write lock entirely.
> 
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "Arve Hjønnevåg" <arve@android.com>
> Cc: Todd Kjos <tkjos@android.com>
> Cc: Martijn Coenen <maco@android.com>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Carlos Llamas <cmllamas@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  drivers/android/binder_alloc.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> index b3acbc4174fb..f20074e23a7c 100644
> --- a/drivers/android/binder_alloc.c
> +++ b/drivers/android/binder_alloc.c
> @@ -227,13 +227,23 @@ static int binder_install_single_page(struct binder_alloc *alloc,
>  	if (!mmget_not_zero(alloc->mm))
>  		return -ESRCH;
>  
> +	/*
> +	 * Don't allocate page in mmap_write_lock, this can block
> +	 * mmap_rwsem for a long time; Meanwhile, allocation failure
> +	 * doesn't necessarily need to return -ENOMEM, if lru_page
> +	 * has been installed, we can still return 0(success).
> +	 */
> +	page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);

But now you are allocating new pages even if binder_get_installed_page()
is an error, right?  Doesn't that slow things down?

How was this benchmarked?

> +
>  	/*
>  	 * Protected with mmap_sem in write mode as multiple tasks
>  	 * might race to install the same page.
>  	 */
>  	mmap_write_lock(alloc->mm);
> -	if (binder_get_installed_page(lru_page))
> +	if (binder_get_installed_page(lru_page)) {
> +		ret = 1;

That is not a valid error value :(

>  		goto out;
> +	}
>  
>  	if (!alloc->vma) {
>  		pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> @@ -241,7 +251,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
>  		goto out;
>  	}
>  
> -	page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
>  	if (!page) {
>  		pr_err("%d: failed to allocate page\n", alloc->pid);
>  		ret = -ENOMEM;
> @@ -252,7 +261,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
>  	if (ret) {
>  		pr_err("%d: %s failed to insert page at offset %lx with %d\n",
>  		       alloc->pid, __func__, addr - alloc->buffer, ret);
> -		__free_page(page);
>  		ret = -ENOMEM;
>  		goto out;
>  	}
> @@ -262,7 +270,9 @@ static int binder_install_single_page(struct binder_alloc *alloc,
>  out:
>  	mmap_write_unlock(alloc->mm);
>  	mmput_async(alloc->mm);
> -	return ret;
> +	if (ret && page)
> +		__free_page(page);
> +	return ret < 0 ? ret : 0;

Please only use ? : for when you have to, otherwise please spell it out
with a normal if statement:
	if (ret < 0)
		return ret;
	return 0;

But, this is abusing the fact that you set "ret = 1" above, which is
going to trip someone up in the future as that is NOT a normal coding
pattern we have in the kernel, sorry.

If you insist on this change, please rework it to not have that type of
"positive means one thing, 0 means another, and negative means yet
something else" please.

thanks,

greg k-h
Barry Song Sept. 3, 2024, 9:07 a.m. UTC | #2
On Tue, Sep 3, 2024 at 4:57 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > The mmap_write_lock() can block all access to the VMAs, for example page
> > faults. Performing memory allocation while holding this lock may trigger
> > direct reclamation, leading to others being queued in the rwsem for an
> > extended period.
> > We've observed that the allocation can sometimes take more than 300ms,
> > significantly blocking other threads. The user interface sometimes
> > becomes less responsive as a result. To prevent this, let's move the
> > allocation outside of the write lock.
> > A potential side effect could be an extra alloc_page() for the second
> > thread executing binder_install_single_page() while the first thread
> > has done it earlier. However, according to Tangquan's 48-hour profiling
> > using monkey, the likelihood of this occurring is minimal, with a ratio
> > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > negligible.
> > On the other hand, holding a write lock without making any VMA
> > modifications appears questionable and likely incorrect. While this
> > patch focuses on reducing the lock duration, future updates may aim
> > to eliminate the write lock entirely.
> >
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: "Arve Hjønnevåg" <arve@android.com>
> > Cc: Todd Kjos <tkjos@android.com>
> > Cc: Martijn Coenen <maco@android.com>
> > Cc: Joel Fernandes <joel@joelfernandes.org>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Carlos Llamas <cmllamas@google.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  drivers/android/binder_alloc.c | 18 ++++++++++++++----
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> > index b3acbc4174fb..f20074e23a7c 100644
> > --- a/drivers/android/binder_alloc.c
> > +++ b/drivers/android/binder_alloc.c
> > @@ -227,13 +227,23 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> >       if (!mmget_not_zero(alloc->mm))
> >               return -ESRCH;
> >
> > +     /*
> > +      * Don't allocate page in mmap_write_lock, this can block
> > +      * mmap_rwsem for a long time; Meanwhile, allocation failure
> > +      * doesn't necessarily need to return -ENOMEM, if lru_page
> > +      * has been installed, we can still return 0(success).
> > +      */
> > +     page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
>
> But now you are allocating new pages even if binder_get_installed_page()
> is an error, right?  Doesn't that slow things down?

Very, very unlikely: the ratio is only 1 in 2400, while the write lock blocks
everyone 100% of the time.

>
> How was this benchmarked?
>

I actually included Tangquan's profiling result:
"
However, according to Tangquan's 48-hour profiling
 using monkey, the likelihood of this occurring is minimal, with a ratio
 of only 1 in 2400. Compared to the significantly costly rwsem, this is
 negligible."

> > +
> >       /*
> >        * Protected with mmap_sem in write mode as multiple tasks
> >        * might race to install the same page.
> >        */
> >       mmap_write_lock(alloc->mm);
> > -     if (binder_get_installed_page(lru_page))
> > +     if (binder_get_installed_page(lru_page)) {
> > +             ret = 1;
>
> That is not a valid error value :(
>
> >               goto out;
> > +     }
> >
> >       if (!alloc->vma) {
> >               pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> > @@ -241,7 +251,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> >               goto out;
> >       }
> >
> > -     page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> >       if (!page) {
> >               pr_err("%d: failed to allocate page\n", alloc->pid);
> >               ret = -ENOMEM;
> > @@ -252,7 +261,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> >       if (ret) {
> >               pr_err("%d: %s failed to insert page at offset %lx with %d\n",
> >                      alloc->pid, __func__, addr - alloc->buffer, ret);
> > -             __free_page(page);
> >               ret = -ENOMEM;
> >               goto out;
> >       }
> > @@ -262,7 +270,9 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> >  out:
> >       mmap_write_unlock(alloc->mm);
> >       mmput_async(alloc->mm);
> > -     return ret;
> > +     if (ret && page)
> > +             __free_page(page);
> > +     return ret < 0 ? ret : 0;
>
> Please only use ? : for when you have to, otherwise please spell it out
> with a normal if statement:
>         if (ret < 0)
>                 return ret;
>         return 0;
>
> But, this is abusing the fact that you set "ret = 1" above, which is
> going to trip someone up in the future as that is NOT a normal coding
> pattern we have in the kernel, sorry.
>
> If you insist on this change, please rework it to not have that type of
> "positive means one thing, 0 means another, and negative means yet
> something else" please.

I was trying to consolidate all the free_page() calls into one place;
otherwise we would need multiple free_page() calls. I'm perfectly fine with
having separate free_page() calls in both the ret = 1 and ret < 0 paths.
In that case, the ret = 1 path can be removed if you prefer.
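
Roughly, the variant without the ret = 1 sentinel could look like this (a
rough, untested sketch with an explicit free on the already-installed path,
so ret never needs a positive value):

	mmap_write_lock(alloc->mm);
	if (binder_get_installed_page(lru_page)) {
		/* lost the race: another task already installed the page */
		if (page)
			__free_page(page);
		page = NULL;
		goto out;	/* ret stays 0 -- this is still success */
	}
	...
out:
	mmap_write_unlock(alloc->mm);
	mmput_async(alloc->mm);
	if (ret && page)
		__free_page(page);
	return ret;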

>
> thanks,
>
> greg k-h

Thanks
Barry
Greg KH Sept. 3, 2024, 10:04 a.m. UTC | #3
On Tue, Sep 03, 2024 at 05:07:23PM +0800, Barry Song wrote:
> On Tue, Sep 3, 2024 at 4:57 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > The mmap_write_lock() can block all access to the VMAs, for example page
> > > faults. Performing memory allocation while holding this lock may trigger
> > > direct reclamation, leading to others being queued in the rwsem for an
> > > extended period.
> > > We've observed that the allocation can sometimes take more than 300ms,
> > > significantly blocking other threads. The user interface sometimes
> > > becomes less responsive as a result. To prevent this, let's move the
> > > allocation outside of the write lock.
> > > A potential side effect could be an extra alloc_page() for the second
> > > thread executing binder_install_single_page() while the first thread
> > > has done it earlier. However, according to Tangquan's 48-hour profiling
> > > using monkey, the likelihood of this occurring is minimal, with a ratio
> > > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > > negligible.
> > > On the other hand, holding a write lock without making any VMA
> > > modifications appears questionable and likely incorrect. While this
> > > patch focuses on reducing the lock duration, future updates may aim
> > > to eliminate the write lock entirely.
> > >
> > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > Cc: "Arve Hjønnevåg" <arve@android.com>
> > > Cc: Todd Kjos <tkjos@android.com>
> > > Cc: Martijn Coenen <maco@android.com>
> > > Cc: Joel Fernandes <joel@joelfernandes.org>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Carlos Llamas <cmllamas@google.com>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  drivers/android/binder_alloc.c | 18 ++++++++++++++----
> > >  1 file changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> > > index b3acbc4174fb..f20074e23a7c 100644
> > > --- a/drivers/android/binder_alloc.c
> > > +++ b/drivers/android/binder_alloc.c
> > > @@ -227,13 +227,23 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > >       if (!mmget_not_zero(alloc->mm))
> > >               return -ESRCH;
> > >
> > > +     /*
> > > +      * Don't allocate page in mmap_write_lock, this can block
> > > +      * mmap_rwsem for a long time; Meanwhile, allocation failure
> > > +      * doesn't necessarily need to return -ENOMEM, if lru_page
> > > +      * has been installed, we can still return 0(success).
> > > +      */
> > > +     page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> >
> > But now you are allocating new pages even if binder_get_installed_page()
> > is an error, right?  Doesn't that slow things down?
> 
> very very unlikely, as the ratio is only 1/2400 while write lock 100% blocks
> everyone.

Ok, but:

> > How was this benchmarked?
> 
> i actually put Tangquan's profiling result:
> "
> However, according to Tangquan's 48-hour profiling
>  using monkey, the likelihood of this occurring is minimal, with a ratio
>  of only 1 in 2400. Compared to the significantly costly rwsem, this is
>  negligible."

That's not a benchmark, nor any real numbers showing how this saves any
time overall.

> > >       /*
> > >        * Protected with mmap_sem in write mode as multiple tasks
> > >        * might race to install the same page.
> > >        */
> > >       mmap_write_lock(alloc->mm);
> > > -     if (binder_get_installed_page(lru_page))
> > > +     if (binder_get_installed_page(lru_page)) {
> > > +             ret = 1;
> >
> > That is not a valid error value :(
> >
> > >               goto out;
> > > +     }
> > >
> > >       if (!alloc->vma) {
> > >               pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> > > @@ -241,7 +251,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > >               goto out;
> > >       }
> > >
> > > -     page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> > >       if (!page) {
> > >               pr_err("%d: failed to allocate page\n", alloc->pid);
> > >               ret = -ENOMEM;
> > > @@ -252,7 +261,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > >       if (ret) {
> > >               pr_err("%d: %s failed to insert page at offset %lx with %d\n",
> > >                      alloc->pid, __func__, addr - alloc->buffer, ret);
> > > -             __free_page(page);
> > >               ret = -ENOMEM;
> > >               goto out;
> > >       }
> > > @@ -262,7 +270,9 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > >  out:
> > >       mmap_write_unlock(alloc->mm);
> > >       mmput_async(alloc->mm);
> > > -     return ret;
> > > +     if (ret && page)
> > > +             __free_page(page);
> > > +     return ret < 0 ? ret : 0;
> >
> > Please only use ? : for when you have to, otherwise please spell it out
> > with a normal if statement:
> >         if (ret < 0)
> >                 return ret;
> >         return 0;
> >
> > But, this is abusing the fact that you set "ret = 1" above, which is
> > going to trip someone up in the future as that is NOT a normal coding
> > pattern we have in the kernel, sorry.
> >
> > If you insist on this change, please rework it to not have that type of
> > "positive means one thing, 0 means another, and negative means yet
> > something else" please.
> 
> I was trying to consolidate all free_page() calls into one place. Otherwise,
> we would need multiple free_page() calls. I'm perfectly fine with having more
> free_page() calls in both the ret = 1 and ret < 0 paths. In that case,
> the ret = 1
> path can be removed if you prefer.

Remember, we write code for people first, and compilers second.  You
have to maintain this code for the next 10+ years, make it _VERY_
obvious what is happening and how it works as you will be coming back to
it and not remembering what was made for what reason at all.

thanks,

greg k-h
Barry Song Sept. 3, 2024, 11:01 a.m. UTC | #4
On Tue, Sep 3, 2024 at 6:37 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 03, 2024 at 05:07:23PM +0800, Barry Song wrote:
> > On Tue, Sep 3, 2024 at 4:57 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > The mmap_write_lock() can block all access to the VMAs, for example page
> > > > faults. Performing memory allocation while holding this lock may trigger
> > > > direct reclamation, leading to others being queued in the rwsem for an
> > > > extended period.
> > > > We've observed that the allocation can sometimes take more than 300ms,
> > > > significantly blocking other threads. The user interface sometimes
> > > > becomes less responsive as a result. To prevent this, let's move the
> > > > allocation outside of the write lock.
> > > > A potential side effect could be an extra alloc_page() for the second
> > > > thread executing binder_install_single_page() while the first thread
> > > > has done it earlier. However, according to Tangquan's 48-hour profiling
> > > > using monkey, the likelihood of this occurring is minimal, with a ratio
> > > > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > > > negligible.
> > > > On the other hand, holding a write lock without making any VMA
> > > > modifications appears questionable and likely incorrect. While this
> > > > patch focuses on reducing the lock duration, future updates may aim
> > > > to eliminate the write lock entirely.
> > > >
> > > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > > Cc: "Arve Hjønnevåg" <arve@android.com>
> > > > Cc: Todd Kjos <tkjos@android.com>
> > > > Cc: Martijn Coenen <maco@android.com>
> > > > Cc: Joel Fernandes <joel@joelfernandes.org>
> > > > Cc: Christian Brauner <brauner@kernel.org>
> > > > Cc: Carlos Llamas <cmllamas@google.com>
> > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
> > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > ---
> > > >  drivers/android/binder_alloc.c | 18 ++++++++++++++----
> > > >  1 file changed, 14 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> > > > index b3acbc4174fb..f20074e23a7c 100644
> > > > --- a/drivers/android/binder_alloc.c
> > > > +++ b/drivers/android/binder_alloc.c
> > > > @@ -227,13 +227,23 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > > >       if (!mmget_not_zero(alloc->mm))
> > > >               return -ESRCH;
> > > >
> > > > +     /*
> > > > +      * Don't allocate page in mmap_write_lock, this can block
> > > > +      * mmap_rwsem for a long time; Meanwhile, allocation failure
> > > > +      * doesn't necessarily need to return -ENOMEM, if lru_page
> > > > +      * has been installed, we can still return 0(success).
> > > > +      */
> > > > +     page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> > >
> > > But now you are allocating new pages even if binder_get_installed_page()
> > > is an error, right?  Doesn't that slow things down?
> >
> > very very unlikely, as the ratio is only 1/2400 while write lock 100% blocks
> > everyone.
>
> Ok, but:
>
> > > How was this benchmarked?
> >
> > i actually put Tangquan's profiling result:
> > "
> > However, according to Tangquan's 48-hour profiling
> >  using monkey, the likelihood of this occurring is minimal, with a ratio
> >  of only 1 in 2400. Compared to the significantly costly rwsem, this is
> >  negligible."
>
> That's not a benchmark, or any real numbers of how this overall saves
> any time.

My point is that mmap_rwsem is one of the most notorious locks impacting
system performance. Our profiling indicates that this binder's rwsem lock
is among the top two write locks, blocking page faults and other accesses
to the VMA.

From the memory management perspective, I believe the improvement will
be quite evident. However, I agree that we should include more data to show
how reducing lock contention can enhance overall user experience and
minimize UI lag.

>
> > > >       /*
> > > >        * Protected with mmap_sem in write mode as multiple tasks
> > > >        * might race to install the same page.
> > > >        */
> > > >       mmap_write_lock(alloc->mm);
> > > > -     if (binder_get_installed_page(lru_page))
> > > > +     if (binder_get_installed_page(lru_page)) {
> > > > +             ret = 1;
> > >
> > > That is not a valid error value :(
> > >
> > > >               goto out;
> > > > +     }
> > > >
> > > >       if (!alloc->vma) {
> > > >               pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> > > > @@ -241,7 +251,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > > >               goto out;
> > > >       }
> > > >
> > > > -     page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> > > >       if (!page) {
> > > >               pr_err("%d: failed to allocate page\n", alloc->pid);
> > > >               ret = -ENOMEM;
> > > > @@ -252,7 +261,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > > >       if (ret) {
> > > >               pr_err("%d: %s failed to insert page at offset %lx with %d\n",
> > > >                      alloc->pid, __func__, addr - alloc->buffer, ret);
> > > > -             __free_page(page);
> > > >               ret = -ENOMEM;
> > > >               goto out;
> > > >       }
> > > > @@ -262,7 +270,9 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > > >  out:
> > > >       mmap_write_unlock(alloc->mm);
> > > >       mmput_async(alloc->mm);
> > > > -     return ret;
> > > > +     if (ret && page)
> > > > +             __free_page(page);
> > > > +     return ret < 0 ? ret : 0;
> > >
> > > Please only use ? : for when you have to, otherwise please spell it out
> > > with a normal if statement:
> > >         if (ret < 0)
> > >                 return ret;
> > >         return 0;
> > >
> > > But, this is abusing the fact that you set "ret = 1" above, which is
> > > going to trip someone up in the future as that is NOT a normal coding
> > > pattern we have in the kernel, sorry.
> > >
> > > If you insist on this change, please rework it to not have that type of
> > > "positive means one thing, 0 means another, and negative means yet
> > > something else" please.
> >
> > I was trying to consolidate all free_page() calls into one place. Otherwise,
> > we would need multiple free_page() calls. I'm perfectly fine with having more
> > free_page() calls in both the ret = 1 and ret < 0 paths. In that case,
> > the ret = 1
> > path can be removed if you prefer.
>
> Remember, we write code for people first, and compilers second.  You
> have to maintain this code for the next 10+ years, make it _VERY_
> obvious what is happening and how it works as you will be coming back to
> it and not remembering what was made for what reason at all.

no objection to this.

>
> thanks,
>
> greg k-h

Thanks
Barry
Hillf Danton Sept. 3, 2024, 11:01 a.m. UTC | #5
On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> The mmap_write_lock() can block all access to the VMAs, for example page
> faults. Performing memory allocation while holding this lock may trigger
> direct reclamation, leading to others being queued in the rwsem for an
> extended period.
> We've observed that the allocation can sometimes take more than 300ms,
> significantly blocking other threads. The user interface sometimes
> becomes less responsive as a result. To prevent this, let's move the
> allocation outside of the write lock.

I suspect concurrent allocators make things better wrt response, cutting
alloc latency down to 10ms for instance in your scenario. Feel free to
show figures given Tangquan's 48-hour profiling.

> A potential side effect could be an extra alloc_page() for the second
> thread executing binder_install_single_page() while the first thread
> has done it earlier. However, according to Tangquan's 48-hour profiling
> using monkey, the likelihood of this occurring is minimal, with a ratio
> of only 1 in 2400. Compared to the significantly costly rwsem, this is
> negligible.
> On the other hand, holding a write lock without making any VMA
> modifications appears questionable and likely incorrect. While this
> patch focuses on reducing the lock duration, future updates may aim
> to eliminate the write lock entirely.

If you mean a spinlock, better not before taking a look at vm_insert_page().
Barry Song Sept. 3, 2024, 11:45 a.m. UTC | #6
On Tue, Sep 3, 2024 at 7:01 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > The mmap_write_lock() can block all access to the VMAs, for example page
> > faults. Performing memory allocation while holding this lock may trigger
> > direct reclamation, leading to others being queued in the rwsem for an
> > extended period.
> > We've observed that the allocation can sometimes take more than 300ms,
> > significantly blocking other threads. The user interface sometimes
> > becomes less responsive as a result. To prevent this, let's move the
> > allocation outside of the write lock.
>
> I suspect concurrent allocators make things better wrt response, cutting
> alloc latency down to 10ms for instance in your scenario. Feel free to
> show figures given Tangquan's 48-hour profiling.

Likely.

Concurrent allocators are quite common in page faults that hit the same
PTE: whoever gets the PTL sets the PTE, and the others free the pages they
allocated.
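
(Roughly this pattern, simplified from the anonymous-fault path; this is
illustrative only, with rmap/LRU bookkeeping omitted:)

	/* allocate before taking the page-table lock */
	page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte_none(*pte)) {
		/* lost the race: someone else populated the PTE first */
		pte_unmap_unlock(pte, ptl);
		__free_page(page);	/* the "wasted" allocation */
		return 0;
	}
	set_pte_at(mm, addr, pte, mk_pte(page, vma->vm_page_prot));
	pte_unmap_unlock(pte, ptl);
	return 0;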

>
> > A potential side effect could be an extra alloc_page() for the second
> > thread executing binder_install_single_page() while the first thread
> > has done it earlier. However, according to Tangquan's 48-hour profiling
> > using monkey, the likelihood of this occurring is minimal, with a ratio
> > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > negligible.
> > On the other hand, holding a write lock without making any VMA
> > modifications appears questionable and likely incorrect. While this
> > patch focuses on reducing the lock duration, future updates may aim
> > to eliminate the write lock entirely.
>
> If spin, better not before taking a look at vm_insert_page().

I have patch 2/3 transitioning to mmap_read_lock, and per_vma_lock is
currently in the testing queue. At the moment, the alloc->lock spinlock is
in place, but I'm not entirely convinced it's the best replacement for the
write lock. Let's wait for Tangquan's test results.

Patch 2 is detailed below, but it has only passed the build-test phase so
far, so its result is uncertain. I'm sharing it early in case you find it
interesting. And I am not convinced that commit d1d8875c8c13 ("binder: fix
UAF of alloc->vma in race with munmap()") is a correct fix to really avoid
all UAF of alloc->vma.

[PATCH]  binder_alloc: Don't use mmap_write_lock for installing page

Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in race with
munmap()") uses the mmap_rwsem write lock to protect against a race
condition with munmap, where the vma is detached by the write lock,
but pages are zapped by the read lock. This approach is extremely
expensive for the system, though perhaps less so for binder itself,
as the write lock can block all other operations.

As an alternative, we could hold only the read lock and re-check
that the vma hasn't been detached. To protect simultaneous page
installation, we could use alloc->lock instead.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 drivers/android/binder_alloc.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index f20074e23a7c..a2281dfacbbc 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -228,24 +228,17 @@ static int binder_install_single_page(struct binder_alloc *alloc,
                return -ESRCH;

        /*
-        * Don't allocate page in mmap_write_lock, this can block
-        * mmap_rwsem for a long time; Meanwhile, allocation failure
-        * doesn't necessarily need to return -ENOMEM, if lru_page
-        * has been installed, we can still return 0(success).
+        * Allocation failure doesn't necessarily need to return -ENOMEM,
+        * if lru_page has been installed, we can still return 0(success).
+        * So, defer the !page check until after binder_get_installed_page()
+        * is completed.
         */
        page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);

-       /*
-        * Protected with mmap_sem in write mode as multiple tasks
-        * might race to install the same page.
-        */
-       mmap_write_lock(alloc->mm);
-       if (binder_get_installed_page(lru_page)) {
-               ret = 1;
-               goto out;
-       }
+       mmap_read_lock(alloc->mm);

-       if (!alloc->vma) {
+       /* vma might have been dropped or deattached */
+       if (!alloc->vma || !find_vma(alloc->mm, addr)) {
                pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
                ret = -ESRCH;
                goto out;
@@ -257,18 +250,27 @@ static int binder_install_single_page(struct binder_alloc *alloc,
                goto out;
        }

+       spin_lock(&alloc->lock);
+       if (binder_get_installed_page(lru_page)) {
+               spin_unlock(&alloc->lock);
+               ret = 1;
+               goto out;
+       }
+
        ret = vm_insert_page(alloc->vma, addr, page);
        if (ret) {
                pr_err("%d: %s failed to insert page at offset %lx with %d\n",
                       alloc->pid, __func__, addr - alloc->buffer, ret);
+               spin_unlock(&alloc->lock);
                ret = -ENOMEM;
                goto out;
        }

        /* Mark page installation complete and safe to use */
        binder_set_installed_page(lru_page, page);
+       spin_unlock(&alloc->lock);
 out:
-       mmap_write_unlock(alloc->mm);
+       mmap_read_unlock(alloc->mm);
        mmput_async(alloc->mm);
        if (ret && page)
                __free_page(page);
--
2.39.3 (Apple Git-146)
Carlos Llamas Sept. 3, 2024, 1:55 p.m. UTC | #7
On Tue, Sep 03, 2024 at 07:45:12PM +0800, Barry Song wrote:
> On Tue, Sep 3, 2024 at 7:01 PM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > The mmap_write_lock() can block all access to the VMAs, for example page
> > > faults. Performing memory allocation while holding this lock may trigger
> > > direct reclamation, leading to others being queued in the rwsem for an
> > > extended period.
> > > We've observed that the allocation can sometimes take more than 300ms,
> > > significantly blocking other threads. The user interface sometimes
> > > becomes less responsive as a result. To prevent this, let's move the
> > > allocation outside of the write lock.

Thanks for your patch, Barry. So, we are aware of this contention and I've
been working on a fix for it. See more about this below.

> >
> > I suspect concurrent allocators make things better wrt response, cutting
> > alloc latency down to 10ms for instance in your scenario. Feel free to
> > show figures given Tangquan's 48-hour profiling.
> 
> Likely.
> 
> Concurrent allocators are quite common in PFs which occur
> in the same PTE. whoever gets PTL sets PTE, others free the allocated
> pages.
> 
> >
> > > A potential side effect could be an extra alloc_page() for the second
> > > thread executing binder_install_single_page() while the first thread
> > > has done it earlier. However, according to Tangquan's 48-hour profiling
> > > using monkey, the likelihood of this occurring is minimal, with a ratio
> > > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > > negligible.

This is not negligible. In fact, it is the exact reason for the page
allocation to be done with the mmap sem. If the first thread sleeps on
vm_insert_page(), then binder gets into a bad state of multiple threads
trying to reclaim pages that won't really be used. Memory pressure goes
from bad to worse pretty quickly.

FWIW, I believe this was first talked about here:
https://lore.kernel.org/all/ZWmNpxPXZSxdmDE1@google.com/


> > > On the other hand, holding a write lock without making any VMA
> > > modifications appears questionable and likely incorrect. While this
> > > patch focuses on reducing the lock duration, future updates may aim
> > > to eliminate the write lock entirely.
> >
> > If spin, better not before taking a look at vm_insert_page().
> 
> I have patch 2/3 transitioning to mmap_read_lock, and per_vma_lock is
> currently in the
> testing queue. At the moment, alloc->spin is in place, but I'm not
> entirely convinced
> it's the best replacement for the write lock. Let's wait for
> Tangquan's test results.
> 
> Patch 2 is detailed below, but it has only passed the build-test phase
> so far, so
> its result is uncertain. I'm sharing it early in case you find it
> interesting. And I
> am not convinced Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in
> race with munmap()") is a correct fix to really avoid all UAF of alloc->vma.
> 
> [PATCH]  binder_alloc: Don't use mmap_write_lock for installing page
> 
> Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in race with
> munmap()") uses the mmap_rwsem write lock to protect against a race
> condition with munmap, where the vma is detached by the write lock,
> but pages are zapped by the read lock. This approach is extremely
> expensive for the system, though perhaps less so for binder itself,
> as the write lock can block all other operations.
> 
> As an alternative, we could hold only the read lock and re-check
> that the vma hasn't been detached. To protect simultaneous page
> installation, we could use alloc->lock instead.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  drivers/android/binder_alloc.c | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> index f20074e23a7c..a2281dfacbbc 100644
> --- a/drivers/android/binder_alloc.c
> +++ b/drivers/android/binder_alloc.c
> @@ -228,24 +228,17 @@ static int binder_install_single_page(struct
> binder_alloc *alloc,
>                 return -ESRCH;
> 
>         /*
> -        * Don't allocate page in mmap_write_lock, this can block
> -        * mmap_rwsem for a long time; Meanwhile, allocation failure
> -        * doesn't necessarily need to return -ENOMEM, if lru_page
> -        * has been installed, we can still return 0(success).
> +        * Allocation failure doesn't necessarily need to return -ENOMEM,
> +        * if lru_page has been installed, we can still return 0(success).
> +        * So, defer the !page check until after binder_get_installed_page()
> +        * is completed.
>          */
>         page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> 
> -       /*
> -        * Protected with mmap_sem in write mode as multiple tasks
> -        * might race to install the same page.
> -        */
> -       mmap_write_lock(alloc->mm);
> -       if (binder_get_installed_page(lru_page)) {
> -               ret = 1;
> -               goto out;
> -       }
> +       mmap_read_lock(alloc->mm);
> 
> -       if (!alloc->vma) {
> +       /* vma might have been dropped or deattached */
> +       if (!alloc->vma || !find_vma(alloc->mm, addr)) {
>                 pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
>                 ret = -ESRCH;
>                 goto out;
> @@ -257,18 +250,27 @@ static int binder_install_single_page(struct
> binder_alloc *alloc,
>                 goto out;
>         }
> 
> +       spin_lock(&alloc->lock);

You can't hold a spinlock and then call vm_insert_page().

> +       if (binder_get_installed_page(lru_page)) {
> +               spin_unlock(&alloc->lock);
> +               ret = 1;
> +               goto out;
> +       }
> +
>         ret = vm_insert_page(alloc->vma, addr, page);
>         if (ret) {
>                 pr_err("%d: %s failed to insert page at offset %lx with %d\n",
>                        alloc->pid, __func__, addr - alloc->buffer, ret);
> +               spin_unlock(&alloc->lock);
>                 ret = -ENOMEM;
>                 goto out;
>         }
> 
>         /* Mark page installation complete and safe to use */
>         binder_set_installed_page(lru_page, page);
> +       spin_unlock(&alloc->lock);
>  out:
> -       mmap_write_unlock(alloc->mm);
> +       mmap_read_unlock(alloc->mm);
>         mmput_async(alloc->mm);
>         if (ret && page)
>                 __free_page(page);
> --
> 2.39.3 (Apple Git-146)


Sorry, but as I mentioned, I've been working on fixing this contention
by supporting concurrent "faults" in binder_install_single_page(). This
is the appropriate fix. I should be sending a patch soon after working
out the conflicts with the shrinker's callback.

Thanks,
--
Carlos Llamas
Barry Song Sept. 3, 2024, 9:10 p.m. UTC | #8
On Wed, Sep 4, 2024 at 1:55 AM Carlos Llamas <cmllamas@google.com> wrote:
>
> On Tue, Sep 03, 2024 at 07:45:12PM +0800, Barry Song wrote:
> > On Tue, Sep 3, 2024 at 7:01 PM Hillf Danton <hdanton@sina.com> wrote:
> > >
> > > On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > The mmap_write_lock() can block all access to the VMAs, for example page
> > > > faults. Performing memory allocation while holding this lock may trigger
> > > > direct reclamation, leading to others being queued in the rwsem for an
> > > > extended period.
> > > > We've observed that the allocation can sometimes take more than 300ms,
> > > > significantly blocking other threads. The user interface sometimes
> > > > becomes less responsive as a result. To prevent this, let's move the
> > > > allocation outside of the write lock.
>
> Thanks for you patch Barry. So, we are aware of this contention and I've
> been working on a fix for it. See more about this below.

Cool, Carlos!

>
> > >
> > > I suspect concurrent allocators make things better wrt response, cutting
> > > alloc latency down to 10ms for instance in your scenario. Feel free to
> > > show figures given Tangquan's 48-hour profiling.
> >
> > Likely.
> >
> > Concurrent allocators are quite common in PFs which occur
> > in the same PTE. whoever gets PTL sets PTE, others free the allocated
> > pages.
> >
> > >
> > > > A potential side effect could be an extra alloc_page() for the second
> > > > thread executing binder_install_single_page() while the first thread
> > > > has done it earlier. However, according to Tangquan's 48-hour profiling
> > > > using monkey, the likelihood of this occurring is minimal, with a ratio
> > > > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > > > negligible.
>
> This is not negligible. In fact, it is the exact reason for the page
> allocation to be done with the mmap sem. If the first thread sleeps on
> vm_insert_page(), then binder gets into a bad state of multiple threads
> trying to reclaim pages that won't really be used. Memory pressure goes
> from bad to worst pretty quick.
>
> FWIW, I believe this was first talked about here:
> https://lore.kernel.org/all/ZWmNpxPXZSxdmDE1@google.com/

However, I'm not entirely convinced that this is a problem :-) Concurrent
allocations like this can occur in many places, especially in PFs. Reclamation
is not useless because it helps free up memory for others; it's not
without value.
I also don't believe binder is one of the largest users executing concurrent
allocations.

>
>
> > > > On the other hand, holding a write lock without making any VMA
> > > > modifications appears questionable and likely incorrect. While this
> > > > patch focuses on reducing the lock duration, future updates may aim
> > > > to eliminate the write lock entirely.
> > >
> > > If spin, better not before taking a look at vm_insert_page().
> >
> > I have patch 2/3 transitioning to mmap_read_lock, and per_vma_lock is
> > currently in the
> > testing queue. At the moment, alloc->spin is in place, but I'm not
> > entirely convinced
> > it's the best replacement for the write lock. Let's wait for
> > Tangquan's test results.
> >
> > Patch 2 is detailed below, but it has only passed the build-test phase
> > so far, so
> > its result is uncertain. I'm sharing it early in case you find it
> > interesting. And I
> > am not convinced Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in
> > race with munmap()") is a correct fix to really avoid all UAF of alloc->vma.
> >
> > [PATCH]  binder_alloc: Don't use mmap_write_lock for installing page
> >
> > Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in race with
> > munmap()") uses the mmap_rwsem write lock to protect against a race
> > condition with munmap, where the vma is detached by the write lock,
> > but pages are zapped by the read lock. This approach is extremely
> > expensive for the system, though perhaps less so for binder itself,
> > as the write lock can block all other operations.
> >
> > As an alternative, we could hold only the read lock and re-check
> > that the vma hasn't been detached. To protect simultaneous page
> > installation, we could use alloc->lock instead.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  drivers/android/binder_alloc.c | 32 +++++++++++++++++---------------
> >  1 file changed, 17 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> > index f20074e23a7c..a2281dfacbbc 100644
> > --- a/drivers/android/binder_alloc.c
> > +++ b/drivers/android/binder_alloc.c
> > @@ -228,24 +228,17 @@ static int binder_install_single_page(struct
> > binder_alloc *alloc,
> >                 return -ESRCH;
> >
> >         /*
> > -        * Don't allocate page in mmap_write_lock, this can block
> > -        * mmap_rwsem for a long time; Meanwhile, allocation failure
> > -        * doesn't necessarily need to return -ENOMEM, if lru_page
> > -        * has been installed, we can still return 0(success).
> > +        * Allocation failure doesn't necessarily need to return -ENOMEM,
> > +        * if lru_page has been installed, we can still return 0(success).
> > +        * So, defer the !page check until after binder_get_installed_page()
> > +        * is completed.
> >          */
> >         page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> >
> > -       /*
> > -        * Protected with mmap_sem in write mode as multiple tasks
> > -        * might race to install the same page.
> > -        */
> > -       mmap_write_lock(alloc->mm);
> > -       if (binder_get_installed_page(lru_page)) {
> > -               ret = 1;
> > -               goto out;
> > -       }
> > +       mmap_read_lock(alloc->mm);
> >
> > -       if (!alloc->vma) {
> > +       /* vma might have been dropped or deattached */
> > +       if (!alloc->vma || !find_vma(alloc->mm, addr)) {
> >                 pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> >                 ret = -ESRCH;
> >                 goto out;
> > @@ -257,18 +250,27 @@ static int binder_install_single_page(struct
> > binder_alloc *alloc,
> >                 goto out;
> >         }
> >
> > +       spin_lock(&alloc->lock);
>
> You can't hold a spinlock and then call vm_insert_page().

Thanks! This patch has only passed the build test so far. It seems like
we can hold off on further testing for now.

>
> > +       if (binder_get_installed_page(lru_page)) {
> > +               spin_unlock(&alloc->lock);
> > +               ret = 1;
> > +               goto out;
> > +       }
> > +
> >         ret = vm_insert_page(alloc->vma, addr, page);
> >         if (ret) {
> >                 pr_err("%d: %s failed to insert page at offset %lx with %d\n",
> >                        alloc->pid, __func__, addr - alloc->buffer, ret);
> > +               spin_unlock(&alloc->lock);
> >                 ret = -ENOMEM;
> >                 goto out;
> >         }
> >
> >         /* Mark page installation complete and safe to use */
> >         binder_set_installed_page(lru_page, page);
> > +       spin_unlock(&alloc->lock);
> >  out:
> > -       mmap_write_unlock(alloc->mm);
> > +       mmap_read_unlock(alloc->mm);
> >         mmput_async(alloc->mm);
> >         if (ret && page)
> >                 __free_page(page);
> > --
> > 2.39.3 (Apple Git-146)
>
>
> Sorry, but as I mentioned, I've been working on fixing this contention
> by supporting concurrent "faults" in binder_install_single_page(). This
> is the appropriate fix. I should be sending a patch soon after working
> out the conflicts with the shrinker's callback.

Awesome! I’m eager to see your patch, and we’re ready to help with testing.
I strongly recommend dropping the write lock entirely. Using
`mmap_write_lock()` isn’t just a binder-specific concern; it has the
potential to affect the entire Android system.

In patch 3, I experimented with using `per_vma_lock` as well. I’m _not_
proposing it for merging since you’re already working on it, but I wanted
to share the idea. (Just like patch 2, it has only passed the build test.)

[PATCH] binder_alloc: Further move to per_vma_lock from mmap_read_lock

To further reduce the read lock duration, let's try using per_vma_lock
first. If that fails, we can take the read lock, similar to how page
fault handlers operate.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 drivers/android/binder_alloc.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index a2281dfacbbc..b40a5dd650c8 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -221,6 +221,8 @@ static int binder_install_single_page(struct binder_alloc *alloc,
                                      struct binder_lru_page *lru_page,
                                      unsigned long addr)
 {
+       struct vm_area_struct *vma;
+       bool per_vma_lock = true;
        struct page *page;
        int ret = 0;

@@ -235,10 +237,15 @@ static int binder_install_single_page(struct binder_alloc *alloc,
         */
        page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);

-       mmap_read_lock(alloc->mm);
+       vma = lock_vma_under_rcu(alloc->mm, addr);
+       if (!vma) {
+               per_vma_lock = false;
+               mmap_read_lock(alloc->mm);
+               vma = find_vma(alloc->mm, addr);
+       }

-       /* vma might have been dropped or deattached */
-       if (!alloc->vma || !find_vma(alloc->mm, addr)) {
+       /* vma might have been dropped, deattached or changed to new one */
+       if (!alloc->vma || !vma || vma != alloc->vma) {
                pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
                ret = -ESRCH;
                goto out;
@@ -270,7 +277,10 @@ static int binder_install_single_page(struct binder_alloc *alloc,
        binder_set_installed_page(lru_page, page);
        spin_unlock(&alloc->lock);
 out:
-       mmap_read_unlock(alloc->mm);
+       if (per_vma_lock)
+               vma_end_read(vma);
+       else
+               mmap_read_unlock(alloc->mm);
        mmput_async(alloc->mm);
        if (ret && page)
                __free_page(page);
Carlos Llamas Sept. 3, 2024, 10:23 p.m. UTC | #9
On Wed, Sep 04, 2024 at 09:10:20AM +1200, Barry Song wrote:
> However, I'm not entirely convinced that this is a problem :-) Concurrent
> allocations like this can occur in many places, especially in PFs. Reclamation
> is not useless because it helps free up memory for others; it's not
> without value.

Yeah, for binder though there is a high chance of multiple callers
trying to allocate the _same_ page concurrently. What I observed is that
pages being reclaimed "in vain" are often in the same vma and this means
subsequent callers will need to allocate these pages.

But even if this wasn't an issue, the correct solution would still be to
support concurrent faults. So, in reality it doesn't matter and we still
need to refactor the call to be non-exclusive.

> I also don't believe binder is one of the largest users executing concurrent
> allocations.

Oh, I have no statistics on this.

> Awesome! I’m eager to see your patch, and we’re ready to help with testing.
> I strongly recommend dropping the write lock entirely. Using
> `mmap_write_lock()` isn’t just a binder-specific concern; it has the
> potential to affect the entire Android system.
> 
> In patch 3, I experimented with using `per_vma_lock` as well. I’m _not_
> proposing it for merging since you’re already working on it, but I wanted
> to share the idea. (just like patch2, it has only passed build-test)

Yeap, I'm using the per-vma-lock too per Suren's suggestion.

> 
> [PATCH] binder_alloc: Further move to per_vma_lock from mmap_read_lock
> 
> To further reduce the read lock duration, let's try using per_vma_lock
> first. If that fails, we can take the read lock, similar to how page
> fault handlers operate.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  drivers/android/binder_alloc.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> index a2281dfacbbc..b40a5dd650c8 100644
> --- a/drivers/android/binder_alloc.c
> +++ b/drivers/android/binder_alloc.c
> @@ -221,6 +221,8 @@ static int binder_install_single_page(struct
> binder_alloc *alloc,
>                                       struct binder_lru_page *lru_page,
>                                       unsigned long addr)
>  {
> +       struct vm_area_struct *vma;
> +       bool per_vma_lock = true;
>         struct page *page;
>         int ret = 0;
> 
> @@ -235,10 +237,15 @@ static int binder_install_single_page(struct
> binder_alloc *alloc,
>          */
>         page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> 
> -       mmap_read_lock(alloc->mm);
> +       vma = lock_vma_under_rcu(alloc->mm, addr);
> +       if (!vma) {
> +               per_vma_lock = false;
> +               mmap_read_lock(alloc->mm);
> +               vma = find_vma(alloc->mm, addr);
> +       }
> 
> -       /* vma might have been dropped or deattached */
> -       if (!alloc->vma || !find_vma(alloc->mm, addr)) {
> +       /* vma might have been dropped, deattached or changed to new one */
> +       if (!alloc->vma || !vma || vma != alloc->vma) {
>                 pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
>                 ret = -ESRCH;
>                 goto out;
> @@ -270,7 +277,10 @@ static int binder_install_single_page(struct
> binder_alloc *alloc,
>         binder_set_installed_page(lru_page, page);
>         spin_unlock(&alloc->lock);
>  out:
> -       mmap_read_unlock(alloc->mm);
> +       if (per_vma_lock)
> +               vma_end_read(vma);
> +       else
> +               mmap_read_unlock(alloc->mm);
>         mmput_async(alloc->mm);
>         if (ret && page)
>                 __free_page(page);
> -- 
> 2.39.3 (Apple Git-146)

This looks fairly similar to my patch. However, you would need to handle
the case where vm_insert_page() returns -EBUSY (a success that raced) and
also sync with the shrinker callbacks in binder_alloc_free_page(), which
is the biggest concern.
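
For example, tolerating the raced insert could look roughly like this (an
untested illustration, not the actual upcoming patch):

	ret = vm_insert_page(alloc->vma, addr, page);
	if (ret == -EBUSY) {
		/* another thread mapped its own page at addr first;
		 * treat the race as success and drop our copy */
		__free_page(page);
		page = NULL;
		ret = 0;
	}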

Let's not duplicate efforts though. Can we please wait for my patch?
I'll add you as Co-Developed-by, since you've posted this already?

Regards,
Carlos Llamas
Barry Song Sept. 3, 2024, 10:43 p.m. UTC | #10
On Wed, Sep 4, 2024 at 10:23 AM Carlos Llamas <cmllamas@google.com> wrote:
>
> On Wed, Sep 04, 2024 at 09:10:20AM +1200, Barry Song wrote:
> > However, I'm not entirely convinced that this is a problem :-) Concurrent
> > allocations like this can occur in many places, especially in PFs. Reclamation
> > is not useless because it helps free up memory for others; it's not
> > without value.
>
> Yeah, for binder though there is a high chance of multiple callers
> trying to allocate the _same_ page concurrently. What I observed is that
> pages being reclaimed "in vain" are often in the same vma and this means
> subsequent callers will need to allocate these pages.
>
> But even if this wasn't an issue, the correct solution would still be to
> support concurrent faults. So, in reality it doesn't matter and we still
> need to refactor the call to be non-exclusive.
>
> > I also don't believe binder is one of the largest users executing concurrent
> > allocations.
>
> Oh, I have no statistics on this.
>
> > Awesome! I’m eager to see your patch, and we’re ready to help with testing.
> > I strongly recommend dropping the write lock entirely. Using
> > `mmap_write_lock()` isn’t just a binder-specific concern; it has the
> > potential to affect the entire Android system.
> >
> > In patch 3, I experimented with using `per_vma_lock` as well. I’m _not_
> > proposing it for merging since you’re already working on it, but I wanted
> > to share the idea. (just like patch2, it has only passed build-test)
>
> Yeap, I'm using the per-vma-lock too per Suren's suggestion.
>
> >
> > [PATCH] binder_alloc: Further move to per_vma_lock from mmap_read_lock
> >
> > To further reduce the read lock duration, let's try using per_vma_lock
> > first. If that fails, we can take the read lock, similar to how page
> > fault handlers operate.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  drivers/android/binder_alloc.c | 18 ++++++++++++++----
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> > index a2281dfacbbc..b40a5dd650c8 100644
> > --- a/drivers/android/binder_alloc.c
> > +++ b/drivers/android/binder_alloc.c
> > @@ -221,6 +221,8 @@ static int binder_install_single_page(struct
> > binder_alloc *alloc,
> >                                       struct binder_lru_page *lru_page,
> >                                       unsigned long addr)
> >  {
> > +       struct vm_area_struct *vma;
> > +       bool per_vma_lock = true;
> >         struct page *page;
> >         int ret = 0;
> >
> > @@ -235,10 +237,15 @@ static int binder_install_single_page(struct
> > binder_alloc *alloc,
> >          */
> >         page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> >
> > -       mmap_read_lock(alloc->mm);
> > +       vma = lock_vma_under_rcu(alloc->mm, addr);
> > +       if (!vma) {
> > +               per_vma_lock = false;
> > +               mmap_read_lock(alloc->mm);
> > +               vma = find_vma(alloc->mm, addr);
> > +       }
> >
> > -       /* vma might have been dropped or deattached */
> > -       if (!alloc->vma || !find_vma(alloc->mm, addr)) {
> > +       /* vma might have been dropped, deattached or changed to new one */
> > +       if (!alloc->vma || !vma || vma != alloc->vma) {
> >                 pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> >                 ret = -ESRCH;
> >                 goto out;
> > @@ -270,7 +277,10 @@ static int binder_install_single_page(struct
> > binder_alloc *alloc,
> >         binder_set_installed_page(lru_page, page);
> >         spin_unlock(&alloc->lock);
> >  out:
> > -       mmap_read_unlock(alloc->mm);
> > +       if (per_vma_lock)
> > +               vma_end_read(vma);
> > +       else
> > +               mmap_read_unlock(alloc->mm);
> >         mmput_async(alloc->mm);
> >         if (ret && page)
> >                 __free_page(page);
> > --
> > 2.39.3 (Apple Git-146)
>
> This looks fairly similar to my patch. However, you would need to handle
> the case were vm_insert_page() returns -EBUSY (success that raced) and
> also sync with the shrinker callbacks in binder_alloc_free_page() which
> is the biggest concern.
>
> Let's not duplicate efforts though. Can we please wait for my patch?
> I'll add you as Co-Developed-by, since you've posted this already?

Yes, I’d be more than happy to wait for your patch, as I believe you have
much, much more experience with binder.

>
> Regards,
> Carlos Llamas

Thanks
Barry
Patch

diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index b3acbc4174fb..f20074e23a7c 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -227,13 +227,23 @@  static int binder_install_single_page(struct binder_alloc *alloc,
 	if (!mmget_not_zero(alloc->mm))
 		return -ESRCH;
 
+	/*
+	 * Don't allocate page in mmap_write_lock, this can block
+	 * mmap_rwsem for a long time; Meanwhile, allocation failure
+	 * doesn't necessarily need to return -ENOMEM, if lru_page
+	 * has been installed, we can still return 0(success).
+	 */
+	page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+
 	/*
 	 * Protected with mmap_sem in write mode as multiple tasks
 	 * might race to install the same page.
 	 */
 	mmap_write_lock(alloc->mm);
-	if (binder_get_installed_page(lru_page))
+	if (binder_get_installed_page(lru_page)) {
+		ret = 1;
 		goto out;
+	}
 
 	if (!alloc->vma) {
 		pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
@@ -241,7 +251,6 @@  static int binder_install_single_page(struct binder_alloc *alloc,
 		goto out;
 	}
 
-	page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
 	if (!page) {
 		pr_err("%d: failed to allocate page\n", alloc->pid);
 		ret = -ENOMEM;
@@ -252,7 +261,6 @@  static int binder_install_single_page(struct binder_alloc *alloc,
 	if (ret) {
 		pr_err("%d: %s failed to insert page at offset %lx with %d\n",
 		       alloc->pid, __func__, addr - alloc->buffer, ret);
-		__free_page(page);
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -262,7 +270,9 @@  static int binder_install_single_page(struct binder_alloc *alloc,
 out:
 	mmap_write_unlock(alloc->mm);
 	mmput_async(alloc->mm);
-	return ret;
+	if (ret && page)
+		__free_page(page);
+	return ret < 0 ? ret : 0;
 }
 
 static int binder_install_buffer_pages(struct binder_alloc *alloc,