diff mbox series

mm, hotplug: protect nr_zones with pgdat_resize_lock()

Message ID 20181120014822.27968-1-richard.weiyang@gmail.com (mailing list archive)
State New, archived
Headers show
Series mm, hotplug: protect nr_zones with pgdat_resize_lock() | expand

Commit Message

Wei Yang Nov. 20, 2018, 1:48 a.m. UTC
After memory hot-added, users could online pages through sysfs, and this
could be done in parallel.

In case two threads online pages in two different empty zones at the
same time, there would be a contention to update the nr_zones.

The patch use pgdat_resize_lock() to protect this critical section.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
---
 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Michal Hocko Nov. 20, 2018, 7:31 a.m. UTC | #1
On Tue 20-11-18 09:48:22, Wei Yang wrote:
> After memory hot-added, users could online pages through sysfs, and this
> could be done in parallel.
> 
> In case two threads online pages in two different empty zones at the
> same time, there would be a contention to update the nr_zones.

No, this shouldn't be the case as I've explained in the original thread.
We use memory hotplug lock over the online phase. So there shouldn't be
any race possible.

On the other hand I would like to see the global lock to go away because
it causes scalability issues and I would like to change it to a range
lock. This would make this race possible.

That being said this is more of a preparatory work than a fix. One could
argue that pgdat resize lock is abused here but I am not convinced a
dedicated lock is much better. We do take this lock already and spanning
its scope seems reasonable. An update to the documentation is due.

> The patch use pgdat_resize_lock() to protect this critical section.
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>

After the changelog is updated to reflect the above, feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e13987c2e1c4..525a5344a13b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5796,9 +5796,12 @@ void __meminit init_currently_empty_zone(struct zone *zone,
>  {
>  	struct pglist_data *pgdat = zone->zone_pgdat;
>  	int zone_idx = zone_idx(zone) + 1;
> +	unsigned long flags;
>  
> +	pgdat_resize_lock(pgdat, &flags);
>  	if (zone_idx > pgdat->nr_zones)
>  		pgdat->nr_zones = zone_idx;
> +	pgdat_resize_unlock(pgdat, &flags);
>  
>  	zone->zone_start_pfn = zone_start_pfn;
>  
> -- 
> 2.15.1
Oscar Salvador Nov. 20, 2018, 7:58 a.m. UTC | #2
> On the other hand I would like to see the global lock to go away 
> because
> it causes scalability issues and I would like to change it to a range
> lock. This would make this race possible.
> 
> That being said this is more of a preparatory work than a fix. One 
> could
> argue that pgdat resize lock is abused here but I am not convinced a
> dedicated lock is much better. We do take this lock already and 
> spanning
> its scope seems reasonable. An update to the documentation is due.

Would not make more sense to move it within the pgdat lock
in move_pfn_range_to_zone?
The call from free_area_init_core is safe as we are single-thread there.

And if we want to move towards a range locking, I even think it would be 
more
consistent if we move it within the zone's span lock (which is already 
wrapped with a pgdat lock).
Michal Hocko Nov. 20, 2018, 8:48 a.m. UTC | #3
On Tue 20-11-18 08:58:11, osalvador@suse.de wrote:
> > On the other hand I would like to see the global lock to go away because
> > it causes scalability issues and I would like to change it to a range
> > lock. This would make this race possible.
> > 
> > That being said this is more of a preparatory work than a fix. One could
> > argue that pgdat resize lock is abused here but I am not convinced a
> > dedicated lock is much better. We do take this lock already and spanning
> > its scope seems reasonable. An update to the documentation is due.
> 
> Would not make more sense to move it within the pgdat lock
> in move_pfn_range_to_zone?

yes, that was what I meant originally and I haven't really looked closer
to the diff itself because I've stopped right at the description.

> The call from free_area_init_core is safe as we are single-thread there.
> 
> And if we want to move towards a range locking, I even think it would be
> more
> consistent if we move it within the zone's span lock (which is already
> wrapped with a pgdat lock).

Agreed!
Wei Yang Nov. 21, 2018, 2:44 a.m. UTC | #4
On Tue, Nov 20, 2018 at 08:31:41AM +0100, Michal Hocko wrote:
>On Tue 20-11-18 09:48:22, Wei Yang wrote:
>> After memory hot-added, users could online pages through sysfs, and this
>> could be done in parallel.
>> 
>> In case two threads online pages in two different empty zones at the
>> same time, there would be a contention to update the nr_zones.
>
>No, this shouldn't be the case as I've explained in the original thread.
>We use memory hotplug lock over the online phase. So there shouldn't be
>any race possible.

Sorry for misunderstanding your point.

>
>On the other hand I would like to see the global lock to go away because
>it causes scalability issues and I would like to change it to a range
>lock. This would make this race possible.

The global lock you want to remove is mem_hotplug_begin() ?

Hmm... my understanding may not correct. While mem_hotplug_begin() use
percpu lock, which means if there are two threads running on two
different cpus to online pages at the same time, they could get their
own lock?

If this is the case, will we face the race condition here?

>
>That being said this is more of a preparatory work than a fix. One could
>argue that pgdat resize lock is abused here but I am not convinced a
>dedicated lock is much better. We do take this lock already and spanning
>its scope seems reasonable. An update to the documentation is due.

Agree, I will try to update the documentation in next verstion. 

>
>> The patch use pgdat_resize_lock() to protect this critical section.
>> 
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>
>After the changelog is updated to reflect the above, feel free to add
>Acked-by: Michal Hocko <mhocko@suse.com>
>
>> ---
>>  mm/page_alloc.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e13987c2e1c4..525a5344a13b 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5796,9 +5796,12 @@ void __meminit init_currently_empty_zone(struct zone *zone,
>>  {
>>  	struct pglist_data *pgdat = zone->zone_pgdat;
>>  	int zone_idx = zone_idx(zone) + 1;
>> +	unsigned long flags;
>>  
>> +	pgdat_resize_lock(pgdat, &flags);
>>  	if (zone_idx > pgdat->nr_zones)
>>  		pgdat->nr_zones = zone_idx;
>> +	pgdat_resize_unlock(pgdat, &flags);
>>  
>>  	zone->zone_start_pfn = zone_start_pfn;
>>  
>> -- 
>> 2.15.1
>
>-- 
>Michal Hocko
>SUSE Labs
Wei Yang Nov. 21, 2018, 2:52 a.m. UTC | #5
On Tue, Nov 20, 2018 at 08:58:11AM +0100, osalvador@suse.de wrote:
>> On the other hand I would like to see the global lock to go away because
>> it causes scalability issues and I would like to change it to a range
>> lock. This would make this race possible.
>> 
>> That being said this is more of a preparatory work than a fix. One could
>> argue that pgdat resize lock is abused here but I am not convinced a
>> dedicated lock is much better. We do take this lock already and spanning
>> its scope seems reasonable. An update to the documentation is due.
>
>Would not make more sense to move it within the pgdat lock
>in move_pfn_range_to_zone?
>The call from free_area_init_core is safe as we are single-thread there.
>

Agree. This would be better.

>And if we want to move towards a range locking, I even think it would be more
>consistent if we move it within the zone's span lock (which is already
>wrapped with a pgdat lock).
>

I lost a little here, just want to confirm with you.

Instead of call pgdat_resize_lock() around init_currently_empty_zone()
in move_pfn_range_to_zone(), we move init_currently_empty_zone() before
resize_zone_range()?

This sounds reasonable.

>
>
Michal Hocko Nov. 21, 2018, 7:14 a.m. UTC | #6
On Wed 21-11-18 02:44:35, Wei Yang wrote:
> On Tue, Nov 20, 2018 at 08:31:41AM +0100, Michal Hocko wrote:
> >On Tue 20-11-18 09:48:22, Wei Yang wrote:
> >> After memory hot-added, users could online pages through sysfs, and this
> >> could be done in parallel.
> >> 
> >> In case two threads online pages in two different empty zones at the
> >> same time, there would be a contention to update the nr_zones.
> >
> >No, this shouldn't be the case as I've explained in the original thread.
> >We use memory hotplug lock over the online phase. So there shouldn't be
> >any race possible.
> 
> Sorry for misunderstanding your point.
> 
> >
> >On the other hand I would like to see the global lock to go away because
> >it causes scalability issues and I would like to change it to a range
> >lock. This would make this race possible.
> 
> The global lock you want to remove is mem_hotplug_begin() ?

Yes

> 
> Hmm... my understanding may not correct. While mem_hotplug_begin() use
> percpu lock, which means if there are two threads running on two
> different cpus to online pages at the same time, they could get their
> own lock?

No. The per-cpu is a mere implementation detail on how the
synchronization is done. Only one path might aquire the exclusive lock.
Michal Hocko Nov. 21, 2018, 7:15 a.m. UTC | #7
On Wed 21-11-18 02:52:31, Wei Yang wrote:
> On Tue, Nov 20, 2018 at 08:58:11AM +0100, osalvador@suse.de wrote:
> >> On the other hand I would like to see the global lock to go away because
> >> it causes scalability issues and I would like to change it to a range
> >> lock. This would make this race possible.
> >> 
> >> That being said this is more of a preparatory work than a fix. One could
> >> argue that pgdat resize lock is abused here but I am not convinced a
> >> dedicated lock is much better. We do take this lock already and spanning
> >> its scope seems reasonable. An update to the documentation is due.
> >
> >Would not make more sense to move it within the pgdat lock
> >in move_pfn_range_to_zone?
> >The call from free_area_init_core is safe as we are single-thread there.
> >
> 
> Agree. This would be better.
> 
> >And if we want to move towards a range locking, I even think it would be more
> >consistent if we move it within the zone's span lock (which is already
> >wrapped with a pgdat lock).
> >
> 
> I lost a little here, just want to confirm with you.
> 
> Instead of call pgdat_resize_lock() around init_currently_empty_zone()
> in move_pfn_range_to_zone(), we move init_currently_empty_zone() before
> resize_zone_range()?
> 
> This sounds reasonable.

Btw. resolving the existing TODO would be nice as well, now that you are
looking that direction...

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c6c42a7425e5..c75fca900044 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -743,13 +743,12 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 	int nid = pgdat->node_id;
 	unsigned long flags;
 
+	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
+	pgdat_resize_lock(pgdat, &flags);
 	if (zone_is_empty(zone))
 		init_currently_empty_zone(zone, start_pfn, nr_pages);
 
 	clear_zone_contiguous(zone);
-
-	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
-	pgdat_resize_lock(pgdat, &flags);
 	zone_span_writelock(zone);
 	resize_zone_range(zone, start_pfn, nr_pages);
 	zone_span_writeunlock(zone);
Oscar Salvador Nov. 21, 2018, 8:24 a.m. UTC | #8
On Wed, 2018-11-21 at 02:52 +0000, Wei Yang wrote:
> On Tue, Nov 20, 2018 at 08:58:11AM +0100, osalvador@suse.de wrote:
> > > On the other hand I would like to see the global lock to go away
> > > because
> > > it causes scalability issues and I would like to change it to a
> > > range
> > > lock. This would make this race possible.
> > > 
> > > That being said this is more of a preparatory work than a fix.
> > > One could
> > > argue that pgdat resize lock is abused here but I am not
> > > convinced a
> > > dedicated lock is much better. We do take this lock already and
> > > spanning
> > > its scope seems reasonable. An update to the documentation is
> > > due.
> > 
> > Would not make more sense to move it within the pgdat lock
> > in move_pfn_range_to_zone?
> > The call from free_area_init_core is safe as we are single-thread
> > there.
> > 
> 
> Agree. This would be better.
> 
> > And if we want to move towards a range locking, I even think it
> > would be more
> > consistent if we move it within the zone's span lock (which is
> > already
> > wrapped with a pgdat lock).
> > 
> 
> I lost a little here, just want to confirm with you.
> 
> Instead of call pgdat_resize_lock() around
> init_currently_empty_zone()
> in move_pfn_range_to_zone(), we move init_currently_empty_zone()
> before
> resize_zone_range()?
> 
> This sounds reasonable.

Yeah.
spanned pages are being touched in:

- shrink_pgdat_span
- resize_zone_range
- init_currently_emty_zone

The first two are already protected by the span lock.

In init_currently_empty_zone, we also touch zone_start_pfn, which is
part of the spanned pages (beginning), so I think it makes sense to
also protect it with the span lock.
We just call init_currently_empty_zone in case the zone is empty, so
the race should be not existent to be honest.

But I just think it is more consistent, and since moving it under
spanlock would imply to also have it under pgdat lock, which was the
main point of this, I think we do not have anything to lose.
Wei Yang Nov. 22, 2018, 1:52 a.m. UTC | #9
On Wed, Nov 21, 2018 at 08:15:49AM +0100, Michal Hocko wrote:
>On Wed 21-11-18 02:52:31, Wei Yang wrote:
>> On Tue, Nov 20, 2018 at 08:58:11AM +0100, osalvador@suse.de wrote:
>> >> On the other hand I would like to see the global lock to go away because
>> >> it causes scalability issues and I would like to change it to a range
>> >> lock. This would make this race possible.
>> >> 
>> >> That being said this is more of a preparatory work than a fix. One could
>> >> argue that pgdat resize lock is abused here but I am not convinced a
>> >> dedicated lock is much better. We do take this lock already and spanning
>> >> its scope seems reasonable. An update to the documentation is due.
>> >
>> >Would not make more sense to move it within the pgdat lock
>> >in move_pfn_range_to_zone?
>> >The call from free_area_init_core is safe as we are single-thread there.
>> >
>> 
>> Agree. This would be better.
>> 
>> >And if we want to move towards a range locking, I even think it would be more
>> >consistent if we move it within the zone's span lock (which is already
>> >wrapped with a pgdat lock).
>> >
>> 
>> I lost a little here, just want to confirm with you.
>> 
>> Instead of call pgdat_resize_lock() around init_currently_empty_zone()
>> in move_pfn_range_to_zone(), we move init_currently_empty_zone() before
>> resize_zone_range()?
>> 
>> This sounds reasonable.
>
>Btw. resolving the existing TODO would be nice as well, now that you are
>looking that direction...

I took a look at that commit, seems I need some time to understand this
TODO. :-)

>
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index c6c42a7425e5..c75fca900044 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -743,13 +743,12 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> 	int nid = pgdat->node_id;
> 	unsigned long flags;
> 
>+	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
>+	pgdat_resize_lock(pgdat, &flags);
> 	if (zone_is_empty(zone))
> 		init_currently_empty_zone(zone, start_pfn, nr_pages);
> 
> 	clear_zone_contiguous(zone);
>-
>-	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
>-	pgdat_resize_lock(pgdat, &flags);
> 	zone_span_writelock(zone);

Just want to make sure, Oscar suggests to move the code here to protect
this under zone_span_lock.

If this the correct, I would spin a v2 and try to address the TODO.

> 	resize_zone_range(zone, start_pfn, nr_pages);
> 	zone_span_writeunlock(zone);
>-- 
>Michal Hocko
>SUSE Labs
Michal Hocko Nov. 22, 2018, 8:39 a.m. UTC | #10
On Thu 22-11-18 01:52:39, Wei Yang wrote:
> On Wed, Nov 21, 2018 at 08:15:49AM +0100, Michal Hocko wrote:
[...]
> >diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >index c6c42a7425e5..c75fca900044 100644
> >--- a/mm/memory_hotplug.c
> >+++ b/mm/memory_hotplug.c
> >@@ -743,13 +743,12 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> > 	int nid = pgdat->node_id;
> > 	unsigned long flags;
> > 
> >+	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
> >+	pgdat_resize_lock(pgdat, &flags);
> > 	if (zone_is_empty(zone))
> > 		init_currently_empty_zone(zone, start_pfn, nr_pages);
> > 
> > 	clear_zone_contiguous(zone);
> >-
> >-	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
> >-	pgdat_resize_lock(pgdat, &flags);
> > 	zone_span_writelock(zone);
> 
> Just want to make sure, Oscar suggests to move the code here to protect
> this under zone_span_lock.

Yes, both locks held is probably safer. Because there is both pgdat and
zone state updated.

Sorry to confuse you
Wei Yang Nov. 26, 2018, 2:28 a.m. UTC | #11
On Wed, Nov 21, 2018 at 3:15 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 21-11-18 02:52:31, Wei Yang wrote:
> > On Tue, Nov 20, 2018 at 08:58:11AM +0100, osalvador@suse.de wrote:
> > >> On the other hand I would like to see the global lock to go away because
> > >> it causes scalability issues and I would like to change it to a range
> > >> lock. This would make this race possible.
> > >>
> > >> That being said this is more of a preparatory work than a fix. One could
> > >> argue that pgdat resize lock is abused here but I am not convinced a
> > >> dedicated lock is much better. We do take this lock already and spanning
> > >> its scope seems reasonable. An update to the documentation is due.
> > >
> > >Would not make more sense to move it within the pgdat lock
> > >in move_pfn_range_to_zone?
> > >The call from free_area_init_core is safe as we are single-thread there.
> > >
> >
> > Agree. This would be better.
> >
> > >And if we want to move towards a range locking, I even think it would be more
> > >consistent if we move it within the zone's span lock (which is already
> > >wrapped with a pgdat lock).
> > >
> >
> > I lost a little here, just want to confirm with you.
> >
> > Instead of call pgdat_resize_lock() around init_currently_empty_zone()
> > in move_pfn_range_to_zone(), we move init_currently_empty_zone() before
> > resize_zone_range()?
> >
> > This sounds reasonable.
>
> Btw. resolving the existing TODO would be nice as well, now that you are
> looking that direction...
>

Michal,

I took a look at commit f1dd2cd13c4b ("mm, memory_hotplug: do not
associate hotadded memory to zones until online"), and try to understand
this TODO.

The reason to acquire these lock is before this commit, the memory is
associated with zone at physical adding phase, while after this commit,
this is delayed to logical online stage.

But I get some difficulty to understand this TODO. You want to get rid of
these lock? While these locks seem necessary to protect those data of
pgdat/zone. Would you mind sharing more on this statement?
Michal Hocko Nov. 26, 2018, 8:16 a.m. UTC | #12
On Mon 26-11-18 10:28:40, Wei Yang wrote:
[...]
> But I get some difficulty to understand this TODO. You want to get rid of
> these lock? While these locks seem necessary to protect those data of
> pgdat/zone. Would you mind sharing more on this statement?

Why do we need this lock to be irqsave? Is there any caller that uses
the lock from the IRQ context?
Wei Yang Nov. 26, 2018, 9:06 a.m. UTC | #13
On Mon, Nov 26, 2018 at 09:16:08AM +0100, Michal Hocko wrote:
>On Mon 26-11-18 10:28:40, Wei Yang wrote:
>[...]
>> But I get some difficulty to understand this TODO. You want to get rid of
>> these lock? While these locks seem necessary to protect those data of
>> pgdat/zone. Would you mind sharing more on this statement?
>
>Why do we need this lock to be irqsave? Is there any caller that uses
>the lock from the IRQ context?

I see you put the comment 'irqsave' in code, I thought this is the
requirement bringing in by this commit. So this is copyed from somewhere
else?

From my understanding, we don't access pgdat from interrupt context.

BTW, one more confirmation. One irqsave lock means we can't do something
during holding the lock, like sleep. Is my understanding correct?

>-- 
>Michal Hocko
>SUSE Labs
Michal Hocko Nov. 26, 2018, 10:03 a.m. UTC | #14
On Mon 26-11-18 09:06:54, Wei Yang wrote:
> On Mon, Nov 26, 2018 at 09:16:08AM +0100, Michal Hocko wrote:
> >On Mon 26-11-18 10:28:40, Wei Yang wrote:
> >[...]
> >> But I get some difficulty to understand this TODO. You want to get rid of
> >> these lock? While these locks seem necessary to protect those data of
> >> pgdat/zone. Would you mind sharing more on this statement?
> >
> >Why do we need this lock to be irqsave? Is there any caller that uses
> >the lock from the IRQ context?
> 
> I see you put the comment 'irqsave' in code, I thought this is the
> requirement bringing in by this commit. So this is copyed from somewhere
> else?

No, the irqsave lock has been there for a long time but it was not clear
to me whether it is still required. Maybe it never was. I just didn't
have time to look into that and put a TODO there. The code wouldn't be
less correct if I kept it.

> >From my understanding, we don't access pgdat from interrupt context.
> 
> BTW, one more confirmation. One irqsave lock means we can't do something
> during holding the lock, like sleep. Is my understanding correct?

You cannot sleep in any atomic context. IRQ safe lock only means that
IRQs are disabled along with the lock. The irqsave variant should be
taken when an IRQ context itself can take the lock. There is a lot of
documentation to clarify this e.g. Linux Device Drivers. I would
recommend to read through that.
Wei Yang Nov. 27, 2018, 12:18 a.m. UTC | #15
On Mon, Nov 26, 2018 at 11:03:30AM +0100, Michal Hocko wrote:
>On Mon 26-11-18 09:06:54, Wei Yang wrote:
>> On Mon, Nov 26, 2018 at 09:16:08AM +0100, Michal Hocko wrote:
>> >On Mon 26-11-18 10:28:40, Wei Yang wrote:
>> >[...]
>> >> But I get some difficulty to understand this TODO. You want to get rid of
>> >> these lock? While these locks seem necessary to protect those data of
>> >> pgdat/zone. Would you mind sharing more on this statement?
>> >
>> >Why do we need this lock to be irqsave? Is there any caller that uses
>> >the lock from the IRQ context?
>> 
>> I see you put the comment 'irqsave' in code, I thought this is the
>> requirement bringing in by this commit. So this is copyed from somewhere
>> else?
>
>No, the irqsave lock has been there for a long time but it was not clear
>to me whether it is still required. Maybe it never was. I just didn't
>have time to look into that and put a TODO there. The code wouldn't be
>less correct if I kept it.
>

Let me summarize what you expect to do.

Go through all the users of pgdat_resize_lock, if none of them is called
from IRQ context, we could do the following change:

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ffd9cd10fcf3..45a5affcab8a 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -272,14 +272,14 @@ static inline bool movable_node_is_enabled(void)
  * pgdat resizing functions
  */
 static inline
-void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
+void pgdat_resize_lock(struct pglist_data *pgdat)
 {
-	spin_lock_irqsave(&pgdat->node_size_lock, *flags);
+	spin_lock(&pgdat->node_size_lock);
 }
 static inline
-void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
+void pgdat_resize_unlock(struct pglist_data *pgdat)
 {
-	spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
+	spin_unlock(&pgdat->node_size_lock);
 }
 static inline
 void pgdat_resize_init(struct pglist_data *pgdat)

>> >From my understanding, we don't access pgdat from interrupt context.
>> 
>> BTW, one more confirmation. One irqsave lock means we can't do something
>> during holding the lock, like sleep. Is my understanding correct?
>
>You cannot sleep in any atomic context. IRQ safe lock only means that
>IRQs are disabled along with the lock. The irqsave variant should be
>taken when an IRQ context itself can take the lock. There is a lot of
>documentation to clarify this e.g. Linux Device Drivers. I would
>recommend to read through that.
>

Thanks.

I took a look at this one which seems to resolve my confusion.

https://www.kernel.org/doc/Documentation/locking/spinlocks.txt

>-- 
>Michal Hocko
>SUSE Labs
Wei Yang Nov. 27, 2018, 3:12 a.m. UTC | #16
On Mon, Nov 26, 2018 at 09:16:08AM +0100, Michal Hocko wrote:
>On Mon 26-11-18 10:28:40, Wei Yang wrote:
>[...]
>> But I get some difficulty to understand this TODO. You want to get rid of
>> these lock? While these locks seem necessary to protect those data of
>> pgdat/zone. Would you mind sharing more on this statement?
>
>Why do we need this lock to be irqsave? Is there any caller that uses
>the lock from the IRQ context?

Went through the code, we have totally 9 place acquire
pgdat_resize_lock:

   lib/show_mem.c:         1    show_mem()
   mm/memory_hotplug.c:    4    online/offline_pages/__remove_zone()
   mm/page_alloc.c:        2    defer_init
   mm/sparse.c:            2    not necessary

Two places I am not sure:

   * show_mem() would be called from __alloc_pages_slowpath()
   * __remove_zone() is related to acpi_scan() on x86, may related to
     other method on different arch

I am not 100% for sure, while they looks like to be called in IRQ
context.

My ugly idea is:

   * drop pgdat_resize_lock in show_mem(), we don't change the value
     here. or replace this with a read/write lock?
   * can we adjust pgdat's range in offline_pages()? This would be
     consistent since we adjust them in online_pages().


>-- 
>Michal Hocko
>SUSE Labs
Michal Hocko Nov. 27, 2018, 1:16 p.m. UTC | #17
On Tue 27-11-18 03:12:00, Wei Yang wrote:
> On Mon, Nov 26, 2018 at 09:16:08AM +0100, Michal Hocko wrote:
> >On Mon 26-11-18 10:28:40, Wei Yang wrote:
> >[...]
> >> But I get some difficulty to understand this TODO. You want to get rid of
> >> these lock? While these locks seem necessary to protect those data of
> >> pgdat/zone. Would you mind sharing more on this statement?
> >
> >Why do we need this lock to be irqsave? Is there any caller that uses
> >the lock from the IRQ context?
> 
> Went through the code, we have totally 9 place acquire
> pgdat_resize_lock:
> 
>    lib/show_mem.c:         1    show_mem()
>    mm/memory_hotplug.c:    4    online/offline_pages/__remove_zone()
>    mm/page_alloc.c:        2    defer_init
>    mm/sparse.c:            2    not necessary
> 
> Two places I am not sure:
> 
>    * show_mem() would be called from __alloc_pages_slowpath()

This shouldn't really need the lock. It is a mostly debugging aid rather
than something that cannot tolarate racing with hotplug. What is the
worst case that can happen?

>    * __remove_zone() is related to acpi_scan() on x86, may related to
>      other method on different arch

This one really needs a lock qwith a larger scope anyway.
Wei Yang Nov. 27, 2018, 11:56 p.m. UTC | #18
On Tue, Nov 27, 2018 at 02:16:58PM +0100, Michal Hocko wrote:
>On Tue 27-11-18 03:12:00, Wei Yang wrote:
>> On Mon, Nov 26, 2018 at 09:16:08AM +0100, Michal Hocko wrote:
>> >On Mon 26-11-18 10:28:40, Wei Yang wrote:
>> >[...]
>> >> But I get some difficulty to understand this TODO. You want to get rid of
>> >> these lock? While these locks seem necessary to protect those data of
>> >> pgdat/zone. Would you mind sharing more on this statement?
>> >
>> >Why do we need this lock to be irqsave? Is there any caller that uses
>> >the lock from the IRQ context?
>> 
>> Went through the code, we have totally 9 place acquire
>> pgdat_resize_lock:
>> 
>>    lib/show_mem.c:         1    show_mem()
>>    mm/memory_hotplug.c:    4    online/offline_pages/__remove_zone()
>>    mm/page_alloc.c:        2    defer_init
>>    mm/sparse.c:            2    not necessary
>> 
>> Two places I am not sure:
>> 
>>    * show_mem() would be called from __alloc_pages_slowpath()
>
>This shouldn't really need the lock. It is a mostly debugging aid rather
>than something that cannot tolarate racing with hotplug. What is the
>worst case that can happen?
>

Agree.

The worst case is debug information is not exact in case defer init or
hotplug happens at the same time. While this is a rare case.

If you think it is ok, I would suggest to remove the lock here.

>>    * __remove_zone() is related to acpi_scan() on x86, may related to
>>      other method on different arch
>
>This one really needs a lock qwith a larger scope anyway.

Based on my understanding, __remove_zone() happens at physical memory
remove phase. While for currently logic, we adjust zone information at
logic memory online phase.

They looks not consistent?

If we could do this at logical memory offline phase, we are sure this is
not in IRQ context.

>-- 
>Michal Hocko
>SUSE Labs
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e13987c2e1c4..525a5344a13b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5796,9 +5796,12 @@  void __meminit init_currently_empty_zone(struct zone *zone,
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int zone_idx = zone_idx(zone) + 1;
+	unsigned long flags;
 
+	pgdat_resize_lock(pgdat, &flags);
 	if (zone_idx > pgdat->nr_zones)
 		pgdat->nr_zones = zone_idx;
+	pgdat_resize_unlock(pgdat, &flags);
 
 	zone->zone_start_pfn = zone_start_pfn;