mbox series

[rfc,0/2] mm: cma: make cma_release() non-blocking

Message ID 20201016225254.3853109-1-guro@fb.com (mailing list archive)
Headers show
Series mm: cma: make cma_release() non-blocking | expand

Message

Roman Gushchin Oct. 16, 2020, 10:52 p.m. UTC
This small patchset makes cma_release() non-blocking and simplifies
the code in hugetlbfs, where previously we had to temporarily drop
hugetlb_lock around the cma_release() call.

It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
THP under a memory pressure requires a cma_release() call. If it's
a blocking function, it complicates the already complicated code.
Because there are at least two use cases like this (hugetlbfs is
another example), I believe it's just better to make cma_release()
non-blocking.

It also makes it more consistent with other memory releasing functions
in the kernel: most of them are non-blocking.


Roman Gushchin (2):
  mm: cma: make cma_release() non-blocking
  mm: hugetlb: don't drop hugetlb_lock around cma_release() call

 mm/cma.c     | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/hugetlb.c |  6 ------
 2 files changed, 49 insertions(+), 8 deletions(-)

Comments

Mike Kravetz Oct. 22, 2020, 12:15 a.m. UTC | #1
On 10/16/20 3:52 PM, Roman Gushchin wrote:
> This small patchset makes cma_release() non-blocking and simplifies
> the code in hugetlbfs, where previously we had to temporarily drop
> hugetlb_lock around the cma_release() call.
> 
> It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
> THP under a memory pressure requires a cma_release() call. If it's
> a blocking function, it complicates the already complicated code.
> Because there are at least two use cases like this (hugetlbfs is
> another example), I believe it's just better to make cma_release()
> non-blocking.
> 
> It also makes it more consistent with other memory releasing functions
> in the kernel: most of them are non-blocking.

Thanks for looking into this Roman.

I may be missing something, but why does cma_release have to be blocking
today?  Certainly, it takes the bitmap in cma_clear_bitmap and could
block.  However, I do not see why cma->lock has to be a mutex.  I may be
missing something, but I do not see any code protected by the mutex doing
anything that could sleep?

Could we simply change that mutex to a spinlock?
Xiaqing (A) Oct. 22, 2020, 1:54 a.m. UTC | #2
On 2020/10/17 6:52, Roman Gushchin wrote:

> This small patchset makes cma_release() non-blocking and simplifies
> the code in hugetlbfs, where previously we had to temporarily drop
> hugetlb_lock around the cma_release() call.
>
> It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
> THP under a memory pressure requires a cma_release() call. If it's
> a blocking function, it complicates the already complicated code.
> Because there are at least two use cases like this (hugetlbfs is
> another example), I believe it's just better to make cma_release()
> non-blocking.
>
> It also makes it more consistent with other memory releasing functions
> in the kernel: most of them are non-blocking.
>
>
> Roman Gushchin (2):
>    mm: cma: make cma_release() non-blocking
>    mm: hugetlb: don't drop hugetlb_lock around cma_release() call
>
>   mm/cma.c     | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
>   mm/hugetlb.c |  6 ------
>   2 files changed, 49 insertions(+), 8 deletions(-)
>
I don't think this patch is a good idea.It transfers part or even all of the time of
cma_release to cma_alloc, which is more concerned by performance indicators.
On Android phones, CPU resource competition is intense in many scenarios,
As a result, kernel threads and workers can be scheduled only after some ticks or more.
In this case, the performance of cma_alloc will deteriorate significantly,
which is not good news for many services on Android.
Roman Gushchin Oct. 22, 2020, 2:33 a.m. UTC | #3
On Wed, Oct 21, 2020 at 05:15:53PM -0700, Mike Kravetz wrote:
> On 10/16/20 3:52 PM, Roman Gushchin wrote:
> > This small patchset makes cma_release() non-blocking and simplifies
> > the code in hugetlbfs, where previously we had to temporarily drop
> > hugetlb_lock around the cma_release() call.
> > 
> > It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
> > THP under a memory pressure requires a cma_release() call. If it's
> > a blocking function, it complicates the already complicated code.
> > Because there are at least two use cases like this (hugetlbfs is
> > another example), I believe it's just better to make cma_release()
> > non-blocking.
> > 
> > It also makes it more consistent with other memory releasing functions
> > in the kernel: most of them are non-blocking.
> 
> Thanks for looking into this Roman.

Hi Mike,

> 
> I may be missing something, but why does cma_release have to be blocking
> today?  Certainly, it takes the bitmap in cma_clear_bitmap and could
> block.  However, I do not see why cma->lock has to be a mutex.  I may be
> missing something, but I do not see any code protected by the mutex doing
> anything that could sleep?
> 
> Could we simply change that mutex to a spinlock?

I actually have suggested it few months ago, but the idea was
opposed by Joonsoo: https://lkml.org/lkml/2020/4/3/12 .

The time of a bitmap operation is definitely not an issue in my context,
but I can't speak for something like an embedded/rt case.

Thanks!
Roman Gushchin Oct. 22, 2020, 2:45 a.m. UTC | #4
On Thu, Oct 22, 2020 at 09:54:53AM +0800, Xiaqing (A) wrote:
> 
> 
> On 2020/10/17 6:52, Roman Gushchin wrote:
> 
> > This small patchset makes cma_release() non-blocking and simplifies
> > the code in hugetlbfs, where previously we had to temporarily drop
> > hugetlb_lock around the cma_release() call.
> > 
> > It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
> > THP under a memory pressure requires a cma_release() call. If it's
> > a blocking function, it complicates the already complicated code.
> > Because there are at least two use cases like this (hugetlbfs is
> > another example), I believe it's just better to make cma_release()
> > non-blocking.
> > 
> > It also makes it more consistent with other memory releasing functions
> > in the kernel: most of them are non-blocking.
> > 
> > 
> > Roman Gushchin (2):
> >    mm: cma: make cma_release() non-blocking
> >    mm: hugetlb: don't drop hugetlb_lock around cma_release() call
> > 
> >   mm/cma.c     | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
> >   mm/hugetlb.c |  6 ------
> >   2 files changed, 49 insertions(+), 8 deletions(-)
> > 
> I don't think this patch is a good idea.It transfers part or even all of the time of
> cma_release to cma_alloc, which is more concerned by performance indicators.

I'm not quite sure: if cma_alloc() is racing with cma_release(), cma_alloc() will
wait for the cma_lock mutex anyway. So we don't really transfer anything to cma_alloc().

> On Android phones, CPU resource competition is intense in many scenarios,
> As a result, kernel threads and workers can be scheduled only after some ticks or more.
> In this case, the performance of cma_alloc will deteriorate significantly,
> which is not good news for many services on Android.

Ok, I agree, if the cpu is heavily loaded, it might affect the total execution time.

If we aren't going into the mutex->spinlock conversion direction (as Mike suggested),
we can address the performance concerns by introducing a cma_release_nowait() function,
so that the default cma_release() would work in the old way.
cma_release_nowait() can set an atomic flag on a cma area, which will cause following
cma_alloc()'s to flush the release queue. In this case there will be no performance
penalty unless somebody is using cma_release_nowait().
Will it work for you?

Thank you!
Xiaqing (A) Oct. 22, 2020, 3:47 a.m. UTC | #5
On 2020/10/22 10:45, Roman Gushchin wrote:

> On Thu, Oct 22, 2020 at 09:54:53AM +0800, Xiaqing (A) wrote:
>>
>> On 2020/10/17 6:52, Roman Gushchin wrote:
>>
>>> This small patchset makes cma_release() non-blocking and simplifies
>>> the code in hugetlbfs, where previously we had to temporarily drop
>>> hugetlb_lock around the cma_release() call.
>>>
>>> It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
>>> THP under a memory pressure requires a cma_release() call. If it's
>>> a blocking function, it complicates the already complicated code.
>>> Because there are at least two use cases like this (hugetlbfs is
>>> another example), I believe it's just better to make cma_release()
>>> non-blocking.
>>>
>>> It also makes it more consistent with other memory releasing functions
>>> in the kernel: most of them are non-blocking.
>>>
>>>
>>> Roman Gushchin (2):
>>>     mm: cma: make cma_release() non-blocking
>>>     mm: hugetlb: don't drop hugetlb_lock around cma_release() call
>>>
>>>    mm/cma.c     | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
>>>    mm/hugetlb.c |  6 ------
>>>    2 files changed, 49 insertions(+), 8 deletions(-)
>>>
>> I don't think this patch is a good idea.It transfers part or even all of the time of
>> cma_release to cma_alloc, which is more concerned by performance indicators.
> I'm not quite sure: if cma_alloc() is racing with cma_release(), cma_alloc() will
> wait for the cma_lock mutex anyway. So we don't really transfer anything to cma_alloc().
>
>> On Android phones, CPU resource competition is intense in many scenarios,
>> As a result, kernel threads and workers can be scheduled only after some ticks or more.
>> In this case, the performance of cma_alloc will deteriorate significantly,
>> which is not good news for many services on Android.
> Ok, I agree, if the cpu is heavily loaded, it might affect the total execution time.
>
> If we aren't going into the mutex->spinlock conversion direction (as Mike suggested),
> we can address the performance concerns by introducing a cma_release_nowait() function,
> so that the default cma_release() would work in the old way.
> cma_release_nowait() can set an atomic flag on a cma area, which will cause following
> cma_alloc()'s to flush the release queue. In this case there will be no performance
> penalty unless somebody is using cma_release_nowait().
> Will it work for you?

That looks good to me.

Thanks!

>
> Thank you!
>
>
Mike Kravetz Oct. 22, 2020, 4:42 p.m. UTC | #6
On 10/21/20 7:33 PM, Roman Gushchin wrote:
> On Wed, Oct 21, 2020 at 05:15:53PM -0700, Mike Kravetz wrote:
>> On 10/16/20 3:52 PM, Roman Gushchin wrote:
>>> This small patchset makes cma_release() non-blocking and simplifies
>>> the code in hugetlbfs, where previously we had to temporarily drop
>>> hugetlb_lock around the cma_release() call.
>>>
>>> It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
>>> THP under a memory pressure requires a cma_release() call. If it's
>>> a blocking function, it complicates the already complicated code.
>>> Because there are at least two use cases like this (hugetlbfs is
>>> another example), I believe it's just better to make cma_release()
>>> non-blocking.
>>>
>>> It also makes it more consistent with other memory releasing functions
>>> in the kernel: most of them are non-blocking.
>>
>> Thanks for looking into this Roman.
> 
> Hi Mike,
> 
>>
>> I may be missing something, but why does cma_release have to be blocking
>> today?  Certainly, it takes the bitmap in cma_clear_bitmap and could
>> block.  However, I do not see why cma->lock has to be a mutex.  I may be
>> missing something, but I do not see any code protected by the mutex doing
>> anything that could sleep?
>>
>> Could we simply change that mutex to a spinlock?
> 
> I actually have suggested it few months ago, but the idea was
> opposed by Joonsoo: https://lkml.org/lkml/2020/4/3/12 .
> 
> The time of a bitmap operation is definitely not an issue in my context,
> but I can't speak for something like an embedded/rt case.
> 

I wonder if it may be time to look into replacing the cma area bitmap
with some other data structure?  Joonsoo was concerned about the time
required to traverse the bitmap for an 8GB area.  With new support for
allocating 1GB hugetlb pages from cma, I can imagine someone setting
up a cma area that is hundreds of GB if not TB in size.  It is going
take some time to traverse a bitmap describing such a huge area.
Roman Gushchin Oct. 22, 2020, 5:16 p.m. UTC | #7
On Thu, Oct 22, 2020 at 09:42:11AM -0700, Mike Kravetz wrote:
> On 10/21/20 7:33 PM, Roman Gushchin wrote:
> > On Wed, Oct 21, 2020 at 05:15:53PM -0700, Mike Kravetz wrote:
> >> On 10/16/20 3:52 PM, Roman Gushchin wrote:
> >>> This small patchset makes cma_release() non-blocking and simplifies
> >>> the code in hugetlbfs, where previously we had to temporarily drop
> >>> hugetlb_lock around the cma_release() call.
> >>>
> >>> It should help Zi Yan on his work on 1 GB THPs: splitting a gigantic
> >>> THP under a memory pressure requires a cma_release() call. If it's
> >>> a blocking function, it complicates the already complicated code.
> >>> Because there are at least two use cases like this (hugetlbfs is
> >>> another example), I believe it's just better to make cma_release()
> >>> non-blocking.
> >>>
> >>> It also makes it more consistent with other memory releasing functions
> >>> in the kernel: most of them are non-blocking.
> >>
> >> Thanks for looking into this Roman.
> > 
> > Hi Mike,
> > 
> >>
> >> I may be missing something, but why does cma_release have to be blocking
> >> today?  Certainly, it takes the bitmap in cma_clear_bitmap and could
> >> block.  However, I do not see why cma->lock has to be a mutex.  I may be
> >> missing something, but I do not see any code protected by the mutex doing
> >> anything that could sleep?
> >>
> >> Could we simply change that mutex to a spinlock?
> > 
> > I actually have suggested it few months ago, but the idea was
> > opposed by Joonsoo: https://lkml.org/lkml/2020/4/3/12 .
> > 
> > The time of a bitmap operation is definitely not an issue in my context,
> > but I can't speak for something like an embedded/rt case.
> > 
> 
> I wonder if it may be time to look into replacing the cma area bitmap
> with some other data structure?  Joonsoo was concerned about the time
> required to traverse the bitmap for an 8GB area.  With new support for
> allocating 1GB hugetlb pages from cma, I can imagine someone setting
> up a cma area that is hundreds of GB if not TB in size.  It is going
> take some time to traverse a bitmap describing such a huge area.

If the cma area is used exclusively for 1 GB allocations, the bitmap can
have only 1 bit per GB, so it shouldn't be a big problem.

Long-term I have some hopes to be able to allocate 1 GB pages without
a need to reserve a cma area: we can try to group pages based on their mobility
on a 1 GB scale, so that all non-movable pages will reside in few 1 GB blocks.
I'm looking into that direction, but don't have any results yet.

If this idea fails and we'll end up allocating a large cma area unconditionally
and shrink it on demand (I think Rik suggested something like this),
replacing the bitmap with something else sounds like a good idea to me.

As now, I want to unblock Zi Yan on his work on 1 GB THPs, so maybe
we can go with introducing cma_release_nowait(), as I suggested in
the other e-mail in this thread? Do you have any objections?

Thanks!
Mike Kravetz Oct. 22, 2020, 5:25 p.m. UTC | #8
On 10/22/20 10:16 AM, Roman Gushchin wrote:
> As now, I want to unblock Zi Yan on his work on 1 GB THPs, so maybe
> we can go with introducing cma_release_nowait(), as I suggested in
> the other e-mail in this thread? Do you have any objections?

No objections from me.  And, I like that it could be used by the hugetlb
code to make it simpler.