
[v2] mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection

Message ID 20181122101241.7965-1-richard.weiyang@gmail.com (mailing list archive)
State New, archived
Series [v2] mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection

Commit Message

Wei Yang Nov. 22, 2018, 10:12 a.m. UTC
During the online_pages phase, pgdat->nr_zones will be updated in case
the zone being onlined is empty.

Currently the online_pages phase is protected by the global lock taken
in mem_hotplug_begin(), which ensures there is no contention during the
update of nr_zones. But this global lock introduces scalability issues.

This patch is a preparation for removing the global lock during the
online_pages phase. It also changes the documentation of node_size_lock
to include the protection of nr_zones.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
---
v2:
  * commit log changes
  * modify the code in move_pfn_range_to_zone() instead of in
    init_currently_empty_zone()
  * documentation change

---
 include/linux/mmzone.h | 7 ++++---
 mm/memory_hotplug.c    | 5 ++---
 2 files changed, 6 insertions(+), 6 deletions(-)

Comments

Wei Yang Nov. 22, 2018, 10:15 a.m. UTC | #1
On Thu, Nov 22, 2018 at 06:12:41PM +0800, Wei Yang wrote:
>During online_pages phase, pgdat->nr_zones will be updated in case this
>zone is empty.
>
>Currently the online_pages phase is protected by the global lock
>mem_hotplug_begin(), which ensures there is no contention during the
>update of nr_zones. But this global lock introduces scalability issues.
>
>This patch is a preparation for removing the global lock during
>online_pages phase. Also this patch changes the documentation of
>node_size_lock to include the protection of nr_zones.
>
>Signed-off-by: Wei Yang <richard.weiyang@gmail.com>

Missed this, if I am correct. :-)

Acked-by: Michal Hocko <mhocko@suse.com>
Michal Hocko Nov. 22, 2018, 10:29 a.m. UTC | #2
On Thu 22-11-18 10:15:57, Wei Yang wrote:
> On Thu, Nov 22, 2018 at 06:12:41PM +0800, Wei Yang wrote:
> >During online_pages phase, pgdat->nr_zones will be updated in case this
> >zone is empty.
> >
> >Currently the online_pages phase is protected by the global lock
> >mem_hotplug_begin(), which ensures there is no contention during the
> >update of nr_zones. But this global lock introduces scalability issues.
> >
> >This patch is a preparation for removing the global lock during
> >online_pages phase. Also this patch changes the documentation of
> >node_size_lock to include the protection of nr_zones.

I would just add that the patch moves init_currently_empty_zone under
both zone_span_writelock and pgdat_resize_lock because both the pgdat
state (nr_zones) and the zone's start_pfn are changed.
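
The resulting critical section then looks roughly like this (a sketch of
what the hunk ends up looking like, not the complete function; the final
unlock comes from the unchanged context around the hunk):

	pgdat_resize_lock(pgdat, &flags);
	zone_span_writelock(zone);
	if (zone_is_empty(zone))
		init_currently_empty_zone(zone, start_pfn, nr_pages);
	resize_zone_range(zone, start_pfn, nr_pages);
	zone_span_writeunlock(zone);
	resize_pgdat_range(pgdat, start_pfn, nr_pages);
	pgdat_resize_unlock(pgdat, &flags);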

> >
> >Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> 
> Missed this, if I am correct. :-)
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Yes, thank you.
Oscar Salvador Nov. 22, 2018, 10:37 a.m. UTC | #3
On Thu, 2018-11-22 at 18:12 +0800, Wei Yang wrote:
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>

Thanks ;-)

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Wei Yang Nov. 22, 2018, 2:27 p.m. UTC | #4
On Thu, Nov 22, 2018 at 11:29:22AM +0100, Michal Hocko wrote:
>On Thu 22-11-18 10:15:57, Wei Yang wrote:
>> On Thu, Nov 22, 2018 at 06:12:41PM +0800, Wei Yang wrote:
>> >During online_pages phase, pgdat->nr_zones will be updated in case this
>> >zone is empty.
>> >
>> >Currently the online_pages phase is protected by the global lock
>> >mem_hotplug_begin(), which ensures there is no contention during the
>> >update of nr_zones. But this global lock introduces scalability issues.
>> >
>> >This patch is a preparation for removing the global lock during
>> >online_pages phase. Also this patch changes the documentation of
>> >node_size_lock to include the protection of nr_zones.
>
>I would just add that the patch moves init_currently_empty_zone under
>both zone_span_writelock and pgdat_resize_lock because both the pgdat
>state (nr_zones) and the zone's start_pfn are changed.
>
>> >
>> >Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> 
>> Missed this, if I am correct. :-)
>> 
>> Acked-by: Michal Hocko <mhocko@suse.com>
>
>Yes, thank you.

My pleasure :-)

>-- 
>Michal Hocko
>SUSE Labs
Wei Yang Nov. 22, 2018, 2:28 p.m. UTC | #5
On Thu, Nov 22, 2018 at 11:37:38AM +0100, osalvador wrote:
>On Thu, 2018-11-22 at 18:12 +0800, Wei Yang wrote:
>> 
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>
>Thanks ;-)
>

Thanks for your suggestions :-)

>Reviewed-by: Oscar Salvador <osalvador@suse.de>
David Hildenbrand Nov. 22, 2018, 3:26 p.m. UTC | #6
On 22.11.18 11:12, Wei Yang wrote:
> During online_pages phase, pgdat->nr_zones will be updated in case this
> zone is empty.
> 
> Currently the online_pages phase is protected by the global lock
> mem_hotplug_begin(), which ensures there is no contention during the
> update of nr_zones. But this global lock introduces scalability issues.
> 
> This patch is a preparation for removing the global lock during
> online_pages phase. Also this patch changes the documentation of
> node_size_lock to include the protection of nr_zones.

I looked into locking recently, and there is more to it.

Please read:

commit dee6da22efac451d361f5224a60be2796d847b51
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Oct 30 15:10:44 2018 -0700

    memory-hotplug.rst: add some details about locking internals
    
    Let's document the magic a bit, especially why device_hotplug_lock is
    required when adding/removing memory and how it all play together with
    requests to online/offline memory from user space.

Short summary: Onlining/offlining of memory requires the device_hotplug_lock
as of now.

mem_hotplug_begin() is just an internal optimization. (we don't want
 everybody to take the device lock)
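
(Roughly: the online/offline paths bracket their work with
mem_hotplug_begin()/mem_hotplug_done(), while readers use
get_online_mems()/put_online_mems(). A simplified sketch of the writer
side, not the exact call sites:)

	mem_hotplug_begin();		/* takes mem_hotplug_lock for write */
	/* ... online_pages()/offline_pages() work ... */
	mem_hotplug_done();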



> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> ---
> v2:
>   * commit log changes
>   * modify the code in move_pfn_range_to_zone() instead of in
>     init_currently_empty_zone()
>   * documentation change
> 
> ---
>  include/linux/mmzone.h | 7 ++++---
>  mm/memory_hotplug.c    | 5 ++---
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 68d7b558924b..1bb749bee284 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -636,9 +636,10 @@ typedef struct pglist_data {
>  #endif
>  #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
>  	/*
> -	 * Must be held any time you expect node_start_pfn, node_present_pages
> -	 * or node_spanned_pages stay constant.  Holding this will also
> -	 * guarantee that any pfn_valid() stays that way.
> +	 * Must be held any time you expect node_start_pfn,
> +	 * node_present_pages, node_spanned_pages or nr_zones stay constant.
> +	 * Holding this will also guarantee that any pfn_valid() stays that
> +	 * way.
>  	 *
>  	 * pgdat_resize_lock() and pgdat_resize_unlock() are provided to
>  	 * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 61972da38d93..f626e7e5f57b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -742,14 +742,13 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>  	int nid = pgdat->node_id;
>  	unsigned long flags;
>  
> -	if (zone_is_empty(zone))
> -		init_currently_empty_zone(zone, start_pfn, nr_pages);
> -
>  	clear_zone_contiguous(zone);
>  
>  	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
>  	pgdat_resize_lock(pgdat, &flags);
>  	zone_span_writelock(zone);
> +	if (zone_is_empty(zone))
> +		init_currently_empty_zone(zone, start_pfn, nr_pages);
>  	resize_zone_range(zone, start_pfn, nr_pages);
>  	zone_span_writeunlock(zone);
>  	resize_pgdat_range(pgdat, start_pfn, nr_pages);
>
Wei Yang Nov. 22, 2018, 9:28 p.m. UTC | #7
On Thu, Nov 22, 2018 at 04:26:40PM +0100, David Hildenbrand wrote:
>On 22.11.18 11:12, Wei Yang wrote:
>> During online_pages phase, pgdat->nr_zones will be updated in case this
>> zone is empty.
>> 
>> Currently the online_pages phase is protected by the global lock
>> mem_hotplug_begin(), which ensures there is no contention during the
>> update of nr_zones. But this global lock introduces scalability issues.
>> 
>> This patch is a preparation for removing the global lock during
>> online_pages phase. Also this patch changes the documentation of
>> node_size_lock to include the protection of nr_zones.
>
>I looked into locking recently, and there is more to it.
>
>Please read:
>
>commit dee6da22efac451d361f5224a60be2796d847b51
>Author: David Hildenbrand <david@redhat.com>
>Date:   Tue Oct 30 15:10:44 2018 -0700
>
>    memory-hotplug.rst: add some details about locking internals
>    
>    Let's document the magic a bit, especially why device_hotplug_lock is
>    required when adding/removing memory and how it all play together with
>    requests to online/offline memory from user space.
>
>Short summary: Onlining/offlining of memory requires the device_hotplug_lock
>as of now.
>
>mem_hotplug_begin() is just an internal optimization. (we don't want
> everybody to take the device lock)
>

Hi, David

Thanks for your comment.

Hmm... I didn't catch your point.

Related to memory hot-plug, there are (at least) three locks,

  * device_hotplug_lock    (global)
  * device lock            (device scope)
  * mem_hotplug_lock       (global)

But with two different hold sequences in two cases (rough sketch below):

  * device_online()

    device_hotplug_lock
    device_lock
    mem_hotplug_lock

  * add_memory_resource()

    device_hotplug_lock
    mem_hotplug_lock
    device_lock
       ^
       |
       I can't find where this is held in add_memory_resource().
       Would you mind giving me a hint?
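
In code form, the two orders above are roughly (a simplified sketch to
show only the nesting, not the real call chains; error handling is
omitted and dev stands for the memory block device):

	/* device_online() path */
	lock_device_hotplug();		/* global device_hotplug_lock */
	device_lock(dev);		/* per-device lock */
	mem_hotplug_begin();		/* global mem_hotplug_lock */
	/* ... online work ... */
	mem_hotplug_done();
	device_unlock(dev);
	unlock_device_hotplug();

	/* add_memory_resource() path */
	lock_device_hotplug();
	mem_hotplug_begin();
	/* device_lock()? -- this is the part I cannot find, see above */
	/* ... add memory, create memory block devices ... */
	mem_hotplug_done();
	unlock_device_hotplug();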

If my understanding is correct, what is your point?

I guess your point is: just removing mem_hotplug_lock is not enough to
resolve the scalability issue?

Please correct me, if I am not. :-)
David Hildenbrand Nov. 22, 2018, 9:53 p.m. UTC | #8
On 22.11.18 22:28, Wei Yang wrote:
> On Thu, Nov 22, 2018 at 04:26:40PM +0100, David Hildenbrand wrote:
>> On 22.11.18 11:12, Wei Yang wrote:
>>> During online_pages phase, pgdat->nr_zones will be updated in case this
>>> zone is empty.
>>>
>>> Currently the online_pages phase is protected by the global lock
>>> mem_hotplug_begin(), which ensures there is no contention during the
>>> update of nr_zones. But this global lock introduces scalability issues.
>>>
>>> This patch is a preparation for removing the global lock during
>>> online_pages phase. Also this patch changes the documentation of
>>> node_size_lock to include the protection of nr_zones.
>>
>> I looked into locking recently, and there is more to it.
>>
>> Please read:
>>
>> commit dee6da22efac451d361f5224a60be2796d847b51
>> Author: David Hildenbrand <david@redhat.com>
>> Date:   Tue Oct 30 15:10:44 2018 -0700
>>
>>    memory-hotplug.rst: add some details about locking internals
>>    
>>    Let's document the magic a bit, especially why device_hotplug_lock is
>>    required when adding/removing memory and how it all play together with
>>    requests to online/offline memory from user space.
>>
>> Short summary: Onlining/offlining of memory requires the device_hotplug_lock
>> as of now.
>>
>> mem_hotplug_begin() is just an internal optimization. (we don't want
>> everybody to take the device lock)
>>
> 
> Hi, David
> 
> Thanks for your comment.

My last sentence should have been "we don't want everybody to take the
device hotplug lock" :) That caused confusion.

> 
> Hmm... I didn't catch your point.
> 
> Related to memory hot-plug, there are (at least) three locks,
> 
>   * device_hotplug_lock    (global)
>   * device lock            (device scope)
>   * mem_hotplug_lock       (global)
> 
> But with two different hold sequence in two cases:
> 
>   * device_online()
> 
>     device_hotplug_lock
>     device_lock
>     mem_hotplug_lock
> 
>   * add_memory_resource()
> 
>     device_hotplug_lock
>     mem_hotplug_lock
>     device_lock
>        ^
>        |
>        I don't find where this is hold in add_memory_resource(). 
>        Would you mind giving me a hint?
> 
> If my understanding is correct, what is your point?
> 

The point I was trying to make:

Right now all onlining/offlining/adding/removing is protected by the
device_hotplug_lock (and that's a good thing, things are fragile enough
already).

mem_hotplug_lock() is used in addition for get_online_mems().

"This patch is a preparation for removing the global lock during
online_pages phase." - is more like "one global lock".

> I guess your point is : just remove mem_hotplug_lock is not enough to
> resolve the scalability issue?

Depends on which scalability issue :)

Getting rid of / removing the impact of mem_hotplug_lock is certainly a
very good idea. And improves scalability of all callers of
get_online_mems(). If that is the intention, very good :)
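
(To be concrete about what such a caller looks like -- a minimal sketch,
walk_sections() is a made-up name:)

	static void walk_sections(void)
	{
		/* blocks memory online/offline via mem_hotplug_lock */
		get_online_mems();
		/* ... read zone/section state that must not change ... */
		put_online_mems();
	}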

If the intention is to make onlining/offlining more scalable (e.g. in
parallel or such), then scalability is limited by device_hotplug_lock.


> 
> Please correct me, if I am not. :-)
> 

Guess I was just wondering which scalability issue we are trying to solve :)
Wei Yang Nov. 22, 2018, 11:53 p.m. UTC | #9
On Thu, Nov 22, 2018 at 10:53:31PM +0100, David Hildenbrand wrote:
>On 22.11.18 22:28, Wei Yang wrote:
>> On Thu, Nov 22, 2018 at 04:26:40PM +0100, David Hildenbrand wrote:
>>> On 22.11.18 11:12, Wei Yang wrote:
>>>> During online_pages phase, pgdat->nr_zones will be updated in case this
>>>> zone is empty.
>>>>
>>>> Currently the online_pages phase is protected by the global lock
>>>> mem_hotplug_begin(), which ensures there is no contention during the
>>>> update of nr_zones. But this global lock introduces scalability issues.
>>>>
>>>> This patch is a preparation for removing the global lock during
>>>> online_pages phase. Also this patch changes the documentation of
>>>> node_size_lock to include the protection of nr_zones.
>>>
>>> I looked into locking recently, and there is more to it.
>>>
>>> Please read:
>>>
>>> commit dee6da22efac451d361f5224a60be2796d847b51
>>> Author: David Hildenbrand <david@redhat.com>
>>> Date:   Tue Oct 30 15:10:44 2018 -0700
>>>
>>>    memory-hotplug.rst: add some details about locking internals
>>>    
>>>    Let's document the magic a bit, especially why device_hotplug_lock is
>>>    required when adding/removing memory and how it all play together with
>>>    requests to online/offline memory from user space.
>>>
>>> Short summary: Onlining/offlining of memory requires the device_hotplug_lock
>>> as of now.
>>>
>>> mem_hotplug_begin() is just an internal optimization. (we don't want
>>> everybody to take the device lock)
>>>
>> 
>> Hi, David
>> 
>> Thanks for your comment.
>
>My last sentence should have been "we don't want everybody to take the
>device hotplug lock" :) That caused confusion.
>
>> 
>> Hmm... I didn't catch your point.
>> 
>> Related to memory hot-plug, there are (at least) three locks,
>> 
>>   * device_hotplug_lock    (global)
>>   * device lock            (device scope)
>>   * mem_hotplug_lock       (global)
>> 
>> But with two different hold sequence in two cases:
>> 
>>   * device_online()
>> 
>>     device_hotplug_lock
>>     device_lock
>>     mem_hotplug_lock
>> 
>>   * add_memory_resource()
>> 
>>     device_hotplug_lock
>>     mem_hotplug_lock
>>     device_lock
>>        ^
>>        |
>>        I don't find where this is hold in add_memory_resource(). 
>>        Would you mind giving me a hint?
>> 
>> If my understanding is correct, what is your point?
>> 
>
>The point I was trying to make:
>
>Right now all onlining/offlining/adding/removing is protected by the
>device_hotplug_lock (and that's a good thing, things are fragile enough
>already).
>
>mem_hotplug_lock() is used in addition for get_online_mems().
>
>"This patch is a preparation for removing the global lock during
>online_pages phase." - is more like "one global lock".
>

Thanks for reminding. You are right.

>> I guess your point is : just remove mem_hotplug_lock is not enough to
>> resolve the scalability issue?
>
>Depends on which scalability issue :)
>
>Getting rid of / removing the impact of mem_hotplug_lock is certainly a
>very good idea. And improves scalability of all callers of
>get_online_mems(). If that is the intention, very good :)
>

Maybe not exact.

The intention is to get rid of mem_hotplug_begin/done, if I am correct.

>If the intention is to make onlining/offlining more scalable (e.g. in
>parallel or such), then scalability is limited by device_hotplug_lock.
>

I didn't notice this lock.

Still, this is a step-by-step improvement.

>
>> 
>> Please correct me, if I am not. :-)
>> 
>
>Guess I was just wondering which scalability issue we are trying to solve :)
>
>-- 
>
>Thanks,
>
>David / dhildenb
Michal Hocko Nov. 23, 2018, 8:42 a.m. UTC | #10
On Thu 22-11-18 16:26:40, David Hildenbrand wrote:
> On 22.11.18 11:12, Wei Yang wrote:
> > During online_pages phase, pgdat->nr_zones will be updated in case this
> > zone is empty.
> > 
> > Currently the online_pages phase is protected by the global lock
> > mem_hotplug_begin(), which ensures there is no contention during the
> > update of nr_zones. But this global lock introduces scalability issues.
> > 
> > This patch is a preparation for removing the global lock during
> > online_pages phase. Also this patch changes the documentation of
> > node_size_lock to include the protection of nr_zones.
> 
> I looked into locking recently, and there is more to it.
> 
> Please read:
> 
> commit dee6da22efac451d361f5224a60be2796d847b51
> Author: David Hildenbrand <david@redhat.com>
> Date:   Tue Oct 30 15:10:44 2018 -0700
> 
>     memory-hotplug.rst: add some details about locking internals
>     
>     Let's document the magic a bit, especially why device_hotplug_lock is
>     required when adding/removing memory and how it all play together with
>     requests to online/offline memory from user space.
> 
> Short summary: Onlining/offlining of memory requires the device_hotplug_lock
> as of now.

Well, I would tend to disagree here. You might be describing the current
state of the art but the device_hotplug_lock doesn't make much sense for the
memory hotplug in principle. There is absolutely nothing in the core MM
that would require this lock. The current state just uses a BKL in some
sense and we really want to get rid of that longterm. This patch is a tiny
step in that direction and I suspect many more will need to come on the
way. We really want to end up with a clear scope of each lock being
taken. A project for a brave soul...
David Hildenbrand Nov. 23, 2018, 8:46 a.m. UTC | #11
On 23.11.18 09:42, Michal Hocko wrote:
> On Thu 22-11-18 16:26:40, David Hildenbrand wrote:
>> On 22.11.18 11:12, Wei Yang wrote:
>>> During online_pages phase, pgdat->nr_zones will be updated in case this
>>> zone is empty.
>>>
>>> Currently the online_pages phase is protected by the global lock
>>> mem_hotplug_begin(), which ensures there is no contention during the
>>> update of nr_zones. But this global lock introduces scalability issues.
>>>
>>> This patch is a preparation for removing the global lock during
>>> online_pages phase. Also this patch changes the documentation of
>>> node_size_lock to include the protection of nr_zones.
>>
>> I looked into locking recently, and there is more to it.
>>
>> Please read:
>>
>> commit dee6da22efac451d361f5224a60be2796d847b51
>> Author: David Hildenbrand <david@redhat.com>
>> Date:   Tue Oct 30 15:10:44 2018 -0700
>>
>>     memory-hotplug.rst: add some details about locking internals
>>     
>>     Let's document the magic a bit, especially why device_hotplug_lock is
>>     required when adding/removing memory and how it all play together with
>>     requests to online/offline memory from user space.
>>
>> Short summary: Onlining/offlining of memory requires the device_hotplug_lock
>> as of now.
> 
> Well, I would tend to disagree here. You might be describing the current
> state of art but the device_hotplug_lock doesn't make much sense for the
> memory hotplug in principle. There is absolutely nothing in the core MM

There are collisions with CPU hotplug that require this lock (when nodes
come and go as far as I remember). And there is the problematic lock
inversion that can happen when adding/removing memory. This all has to be
sorted out, we'll have to see if we really need it for
onlining/offlining, though, however ...

> that would require this lock. The current state just uses a BKL in some
> sense and we really want to get rid of that longterm. This patch is a tiny
> step in that direction and I suspect many more will need to come on the
> way. We really want to end up with a clear scope of each lock being
> taken. A project for a brave soul...

... for now I don't consider "optimize for parallel
onlining/offlining/adding/removing of memory and cpus" really necessary.
What is necessary indeed is to not slow down the whole system just
because some memory is coming/going. Therefore I agree, this patch is a
step in the right direction.
Wei Yang Nov. 26, 2018, 1:44 a.m. UTC | #12
On Fri, Nov 23, 2018 at 09:46:52AM +0100, David Hildenbrand wrote:
>On 23.11.18 09:42, Michal Hocko wrote:
>> On Thu 22-11-18 16:26:40, David Hildenbrand wrote:
>>> On 22.11.18 11:12, Wei Yang wrote:
>>>> During online_pages phase, pgdat->nr_zones will be updated in case this
>>>> zone is empty.
>>>>
>>>> Currently the online_pages phase is protected by the global lock
>>>> mem_hotplug_begin(), which ensures there is no contention during the
>>>> update of nr_zones. But this global lock introduces scalability issues.
>>>>
>>>> This patch is a preparation for removing the global lock during
>>>> online_pages phase. Also this patch changes the documentation of
>>>> node_size_lock to include the protection of nr_zones.
>>>
>>> I looked into locking recently, and there is more to it.
>>>
>>> Please read:
>>>
>>> commit dee6da22efac451d361f5224a60be2796d847b51
>>> Author: David Hildenbrand <david@redhat.com>
>>> Date:   Tue Oct 30 15:10:44 2018 -0700
>>>
>>>     memory-hotplug.rst: add some details about locking internals
>>>     
>>>     Let's document the magic a bit, especially why device_hotplug_lock is
>>>     required when adding/removing memory and how it all play together with
>>>     requests to online/offline memory from user space.
>>>
>>> Short summary: Onlining/offlining of memory requires the device_hotplug_lock
>>> as of now.
>> 
>> Well, I would tend to disagree here. You might be describing the current
>> state of art but the device_hotplug_lock doesn't make much sense for the
>> memory hotplug in principle. There is absolutely nothing in the core MM
>
>There are collisions with CPU hotplug that require this lock (when nodes
>come and go as far as I remember). And there is the problematic lock
>inversion that can happen when adding/removing memory. This all has to be
>sorted out, we'll have to see if we really need it for
>onlining/offlining, though, however ...
>

Seems I now have a little better understanding of this part.

There are two kinds of hotplug:
   * CPU hotplug
   * Memory hotplug.

There are two phases for Memory hotplug:
   * physical add/remove 
   * logical online/offline

All of them are protected by device_hotplug_lock now, so we need to be
careful to release this in any case. Is my understanding correct?

>> that would require this lock. The current state just uses a BKL in some
>> sense and we really want to get rid of that longterm. This patch is a tiny
>> step in that direction and I suspect many more will need to come on the
>> way. We really want to end up with a clear scope of each lock being
>> taken. A project for a brave soul...
>
>... for now I don't consider "optimize for parallel
>onlining/offlining/adding/removing of memory and cpus" really necessary.
>What is necessary indeed is to not slowdown the whole system just
>because some memory is coming/going. Therefore I agree, this patch is a
>step into the right direction.
>

Agree.

The target is to accelerate hot-plug without slowing down the normal
process. 

>-- 
>
>Thanks,
>
>David / dhildenb
David Hildenbrand Nov. 26, 2018, 9:24 a.m. UTC | #13
On 26.11.18 02:44, Wei Yang wrote:
> On Fri, Nov 23, 2018 at 09:46:52AM +0100, David Hildenbrand wrote:
>> On 23.11.18 09:42, Michal Hocko wrote:
>>> On Thu 22-11-18 16:26:40, David Hildenbrand wrote:
>>>> On 22.11.18 11:12, Wei Yang wrote:
>>>>> During online_pages phase, pgdat->nr_zones will be updated in case this
>>>>> zone is empty.
>>>>>
>>>>> Currently the online_pages phase is protected by the global lock
>>>>> mem_hotplug_begin(), which ensures there is no contention during the
>>>>> update of nr_zones. But this global lock introduces scalability issues.
>>>>>
>>>>> This patch is a preparation for removing the global lock during
>>>>> online_pages phase. Also this patch changes the documentation of
>>>>> node_size_lock to include the protection of nr_zones.
>>>>
>>>> I looked into locking recently, and there is more to it.
>>>>
>>>> Please read:
>>>>
>>>> commit dee6da22efac451d361f5224a60be2796d847b51
>>>> Author: David Hildenbrand <david@redhat.com>
>>>> Date:   Tue Oct 30 15:10:44 2018 -0700
>>>>
>>>>     memory-hotplug.rst: add some details about locking internals
>>>>     
>>>>     Let's document the magic a bit, especially why device_hotplug_lock is
>>>>     required when adding/removing memory and how it all play together with
>>>>     requests to online/offline memory from user space.
>>>>
>>>> Short summary: Onlining/offlining of memory requires the device_hotplug_lock
>>>> as of now.
>>>
>>> Well, I would tend to disagree here. You might be describing the current
>>> state of art but the device_hotplug_lock doesn't make much sense for the
>>> memory hotplug in principle. There is absolutely nothing in the core MM
>>
>> There are collisions with CPU hotplug that require this lock (when nodes
>> come and go as far as I remember). And there is the problematic lock
>> inversion that can happen when adding/removing memory. This all has to be
>> sorted out, we'll have to see if we really need it for
>> onlining/offlining, though, however ...
>>
> 
> Seems I get a little understanding on this part.
> 
> There are two hotplug:
>    * CPU hotplug 
>    * Memory hotplug.
> 
> There are two phase for Memory hotplug:
>    * physical add/remove 
>    * logical online/offline
> 
> All of them are protected by device_hotplug_lock now, so we need to be
> careful to release this in any case. Is my understanding correct?

Yes, e.g. the acpi driver always held the device_hotplug_lock (due to
possible problems with concurrent cpu/memory hot(un)plug). Onlining/
offlining of devices (including cpu/memory) from user space always held
the device_hotplug_lock. So this part was executed sequentially for a
long time.

I recently made sure that any adding/removing/onlining/offlining
correctly grabs the device_hotplug_lock AND the mem_hotplug_lock in any
case (because it was inconsistent and broken), so it is all executed
sequentially.

So when getting rid of mem_hotplug_lock we only have to care about all
users that don't take the device_hotplug_lock.

> 
>>> that would require this lock. The current state just uses a BKL in some
>>> sense and we really want to get rid of that longterm. This patch is a tiny
>>> step in that direction and I suspect many more will need to come on the
>>> way. We really want to end up with a clear scope of each lock being
>>> taken. A project for a brave soul...
>>
>> ... for now I don't consider "optimize for parallel
>> onlining/offlining/adding/removing of memory and cpus" really necessary.
>> What is necessary indeed is to not slowdown the whole system just
>> because some memory is coming/going. Therefore I agree, this patch is a
>> step into the right direction.
>>
> 
> Agree.
> 
> The target is to accelerate the hot-plug without slow down the normal
> process. 

Indeed!
Wei Yang Nov. 27, 2018, 12:23 a.m. UTC | #14
On Mon, Nov 26, 2018 at 10:24:26AM +0100, David Hildenbrand wrote:
>On 26.11.18 02:44, Wei Yang wrote:
>> On Fri, Nov 23, 2018 at 09:46:52AM +0100, David Hildenbrand wrote:
>>> On 23.11.18 09:42, Michal Hocko wrote:
>>>> On Thu 22-11-18 16:26:40, David Hildenbrand wrote:
>>>>> On 22.11.18 11:12, Wei Yang wrote:
>>>>>> During online_pages phase, pgdat->nr_zones will be updated in case this
>>>>>> zone is empty.
>>>>>>
>>>>>> Currently the online_pages phase is protected by the global lock
>>>>>> mem_hotplug_begin(), which ensures there is no contention during the
>>>>>> update of nr_zones. But this global lock introduces scalability issues.
>>>>>>
>>>>>> This patch is a preparation for removing the global lock during
>>>>>> online_pages phase. Also this patch changes the documentation of
>>>>>> node_size_lock to include the protection of nr_zones.
>>>>>
>>>>> I looked into locking recently, and there is more to it.
>>>>>
>>>>> Please read:
>>>>>
>>>>> commit dee6da22efac451d361f5224a60be2796d847b51
>>>>> Author: David Hildenbrand <david@redhat.com>
>>>>> Date:   Tue Oct 30 15:10:44 2018 -0700
>>>>>
>>>>>     memory-hotplug.rst: add some details about locking internals
>>>>>     
>>>>>     Let's document the magic a bit, especially why device_hotplug_lock is
>>>>>     required when adding/removing memory and how it all play together with
>>>>>     requests to online/offline memory from user space.
>>>>>
>>>>> Short summary: Onlining/offlining of memory requires the device_hotplug_lock
>>>>> as of now.
>>>>
>>>> Well, I would tend to disagree here. You might be describing the current
>>>> state of art but the device_hotplug_lock doesn't make much sense for the
>>>> memory hotplug in principle. There is absolutely nothing in the core MM
>>>
>>> There are collisions with CPU hotplug that require this lock (when nodes
>>> come and go as far as I remember). And there is the problematic lock
>>> inversion that can happen when adding/removing memory. This all has to be
>>> sorted out, we'll have to see if we really need it for
>>> onlining/offlining, though, however ...
>>>
>> 
>> Seems I get a little understanding on this part.
>> 
>> There are two hotplug:
>>    * CPU hotplug 
>>    * Memory hotplug.
>> 
>> There are two phase for Memory hotplug:
>>    * physical add/remove 
>>    * logical online/offline
>> 
>> All of them are protected by device_hotplug_lock now, so we need to be
>> careful to release this in any case. Is my understanding correct?
>
>Yes, e.g. the acpi driver always held the device_hotplug_lock (due to
>possible problems with concurrent cpu/memory hot(un)plug). Onlining
>offlining of devices (including cpu/memory) from user space always held
>the device_hotplug_lock. So this part was executed sequentially for a
>long time.
>
>I recently made sure that any adding/removing/onlining/offlining
>correctly grabs the device_hotplug_lock AND the mem_hotplug_lock in any
>case (because it was inconsistent and broken), so it is all executed
>sequentially.
>
>So when getting rid of mem_hotplug_lock we only have to care about all
>users that don't take the device_hotplug_lock.
>

Thanks for sharing.

So there are two global locks to protect the hotplug procedure.

  * device_hotplug_lock
  * mem_hotplug_lock

Seems even removing mem_hotplug_lock doesn't help much with scalability?

Maybe we need to tackle them one by one.

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 68d7b558924b..1bb749bee284 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -636,9 +636,10 @@  typedef struct pglist_data {
 #endif
 #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
 	/*
-	 * Must be held any time you expect node_start_pfn, node_present_pages
-	 * or node_spanned_pages stay constant.  Holding this will also
-	 * guarantee that any pfn_valid() stays that way.
+	 * Must be held any time you expect node_start_pfn,
+	 * node_present_pages, node_spanned_pages or nr_zones stay constant.
+	 * Holding this will also guarantee that any pfn_valid() stays that
+	 * way.
 	 *
 	 * pgdat_resize_lock() and pgdat_resize_unlock() are provided to
 	 * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 61972da38d93..f626e7e5f57b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -742,14 +742,13 @@  void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 	int nid = pgdat->node_id;
 	unsigned long flags;
 
-	if (zone_is_empty(zone))
-		init_currently_empty_zone(zone, start_pfn, nr_pages);
-
 	clear_zone_contiguous(zone);
 
 	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
 	pgdat_resize_lock(pgdat, &flags);
 	zone_span_writelock(zone);
+	if (zone_is_empty(zone))
+		init_currently_empty_zone(zone, start_pfn, nr_pages);
 	resize_zone_range(zone, start_pfn, nr_pages);
 	zone_span_writeunlock(zone);
 	resize_pgdat_range(pgdat, start_pfn, nr_pages);