Message ID | 20240722021059.1076399-1-lizhijian@fujitsu.com (mailing list archive) |
---|---|
State | New |
Series | [v2] mm/page_alloc: Fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist() |
On 7/22/24 4:10 AM, Li Zhijian wrote:
> It's expected that no page should be left in pcp_list after calling
> zone_pcp_disable() in offline_pages(). Previously, it's observed that
> offline_pages() gets stuck [1] due to some pages remaining in pcp_list.
>
> Cause:
> There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
> involving the pcp->count variable. See below scenario:
>
>         CPU0                              CPU1
>     ----------------                  ---------------
>                                       spin_lock(&pcp->lock);
>                                       __rmqueue_pcplist() {
>     zone_pcp_disable() {
>                                         /* list is empty */
>                                         if (list_empty(list)) {
>                                           /* add pages to pcp_list */
>                                           alloced = rmqueue_bulk()
>       mutex_lock(&pcp_batch_high_lock)
>       ...
>       __drain_all_pages() {
>         drain_pages_zone() {
>           /* read pcp->count, it's 0 here */
>           count = READ_ONCE(pcp->count)
>           /* 0 means nothing to drain */
>                                           /* update pcp->count */
>                                           pcp->count += alloced << order;
>       ...
>       ...
>                                       spin_unlock(&pcp->lock);
>
> In this case, after calling zone_pcp_disable() though, there are still some
> pages in pcp_list. And these pages in pcp_list are neither movable nor
> isolated, offline_pages() gets stuck as a result.
>
> Solution:
> Expand the scope of the pcp->lock to also protect pcp->count in
> drain_pages_zone(), to ensure no pages are left in the pcp list after
> zone_pcp_disable()
>
> [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/
>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>

Can we find a breaking commit for Fixes: ?

> ---
> V2:
> - Narrow down the scope of the spin_lock() to limit the draining latency. # Vlastimil and David
> - In above scenario, it's sufficient to read pcp->count once with lock held, and it fully fixed
>   my issue [1] in thousands of runs (it happened in more than 5% of runs before).

That should be ok indeed, but...

> RFC:
> https://lore.kernel.org/linux-mm/20240716073929.843277-1-lizhijian@fujitsu.com/
> ---
>  mm/page_alloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ecf99190ea2..5388a35c4e9c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2323,8 +2323,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>  {
>  	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> -	int count = READ_ONCE(pcp->count);
> +	int count;
>
> +	spin_lock(&pcp->lock);
> +	count = pcp->count;
> +	spin_unlock(&pcp->lock);
>  	while (count) {
>  		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
>  		count -= to_drain;

It's wasteful to do a lock/unlock cycle just to read the count. It could
rather look something like this:

	while (true)
		spin_lock(&pcp->lock);
		count = pcp->count;
		...
		count -= to_drain;
		if (to_drain)
			drain_zone_pages(...)
		...
		spin_unlock(&pcp->lock);
		if (count)
			break;
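For readers following the thread, the fragment above filled out as a complete function could look roughly like the sketch below. This is an interpretation, not Vlastimil's patch: it assumes free_pcppages_bulk() is the intended drain primitive (he confirms downthread that drain_zone_pages() was a slip for it), and it takes the final "if (count) break" to mean "break once nothing is left to drain":

	static void drain_pages_zone(unsigned int cpu, struct zone *zone)
	{
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
		int count, to_drain;

		while (true) {
			spin_lock(&pcp->lock);
			/* read pcp->count under the same lock that protects its updates */
			count = pcp->count;
			to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
			count -= to_drain;
			if (to_drain)
				free_pcppages_bulk(zone, to_drain, pcp, 0);
			spin_unlock(&pcp->lock);
			/* break once this batch drained everything that was queued */
			if (!count)
				break;
		}
	}

The point is the shape rather than the details: the read of pcp->count and the drain of the corresponding batch happen within a single lock hold, closing the race described in the commit message, while the lock is still dropped between batches to bound the latency seen by concurrent allocators.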
Hi David

Thanks for your quick reply.

On 22/07/2024 14:44, Vlastimil Babka (SUSE) wrote:
> On 7/22/24 4:10 AM, Li Zhijian wrote:
>> It's expected that no page should be left in pcp_list after calling
>> zone_pcp_disable() in offline_pages(). Previously, it's observed that
>> offline_pages() gets stuck [1] due to some pages remaining in pcp_list.
>>
[...]
>> Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>
> Can we find a breaking commit for Fixes: ?

I haven't confirmed the first bad commit because my reproducer cannot run
on older kernels for some reasons, but I noticed the count started to be
read without the lock held in the commit below:

4b23a68f9536 mm/page_alloc: protect PCP lists with a spinlock

>
>> ---
>> V2:
>> - Narrow down the scope of the spin_lock() to limit the draining latency. # Vlastimil and David
>> - In above scenario, it's sufficient to read pcp->count once with lock held, and it fully fixed
>>   my issue [1] in thousands of runs (it happened in more than 5% of runs before).
>
> That should be ok indeed, but...
>
[...]
>> @@ -2323,8 +2323,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>>  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>>  {
>>  	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>> -	int count = READ_ONCE(pcp->count);
>> +	int count;
>>
>> +	spin_lock(&pcp->lock);
>> +	count = pcp->count;
>> +	spin_unlock(&pcp->lock);
>>  	while (count) {
>>  		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
>>  		count -= to_drain;
>
> It's wasteful to do a lock/unlock cycle just to read the count.

How about,

static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
	int count, to_drain;

	do {
		spin_lock(&pcp->lock);
		to_drain = min(pcp->count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
		free_pcppages_bulk(zone, to_drain, pcp, 0);
		spin_unlock(&pcp->lock);
	} while (to_drain);
}

> It could
> rather look something like this:

Sorry, I don't follow your code...

> while (true)
> 	spin_lock(&pcp->lock);
> 	count = pcp->count;
> 	...
> 	count -= to_drain;
> 	if (to_drain)
> 		drain_zone_pages(...)

Which subroutine does this code belong to, and why does it involve drain_zone_pages()?

> ...
> spin_unlock(&pcp->lock);
> if (count)
> 	break;

Thanks
Zhijian
On 7/22/24 11:15 AM, Zhijian Li (Fujitsu) wrote:
> Hi David
>
> Thanks for your quick reply.
>
> On 22/07/2024 14:44, Vlastimil Babka (SUSE) wrote:
>> On 7/22/24 4:10 AM, Li Zhijian wrote:
[...]
>> Can we find a breaking commit for Fixes: ?
>
> I haven't confirmed the first bad commit because my reproducer cannot run
> on older kernels for some reasons, but I noticed the count started to be
> read without the lock held in the commit below:
>
> 4b23a68f9536 mm/page_alloc: protect PCP lists with a spinlock
>
[...]
>> It's wasteful to do a lock/unlock cycle just to read the count.
>
> How about,
>
> static void drain_pages_zone(unsigned int cpu, struct zone *zone)
> {
> 	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> 	int count, to_drain;
>
> 	do {
> 		spin_lock(&pcp->lock);
> 		to_drain = min(pcp->count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
> 		free_pcppages_bulk(zone, to_drain, pcp, 0);
> 		spin_unlock(&pcp->lock);
> 	} while (to_drain);

Yeah better than break. But I think you still should use

	count = pcp->count;
	...
	count -= to_drain;
	} while (count);

or you make one extra wasteful iteration to find to_drain is zero.
(assuming "it's sufficient to read pcp->count once with lock held", that I
agree with)

> }
>> It could
>> rather look something like this:
>
> Sorry, I don't follow your code...
>
>> while (true)
>> 	spin_lock(&pcp->lock);
>> 	count = pcp->count;
>> 	...
>> 	count -= to_drain;
>> 	if (to_drain)
>> 		drain_zone_pages(...)
>
> Which subroutine does this code belong to, and why does it involve drain_zone_pages()?

Yeah sorry I meant free_pcppages_bulk()

>> ...
>> spin_unlock(&pcp->lock);
>> if (count)
>> 	break;
>
> Thanks
> Zhijian
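Folding that refinement into the do/while shape lands at roughly the following. This is a sketch of where the thread converges, not the posted V3; it assumes free_pcppages_bulk() also decrements pcp->count as it frees pages, which is why a single locked read per iteration stays accurate:

	static void drain_pages_zone(unsigned int cpu, struct zone *zone)
	{
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
		int count;

		do {
			spin_lock(&pcp->lock);
			/* the read and the drain of one batch share a lock hold */
			count = pcp->count;
			if (count) {
				int to_drain = min(count,
					pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);

				free_pcppages_bulk(zone, to_drain, pcp, 0);
				count -= to_drain;
			}
			spin_unlock(&pcp->lock);
		} while (count);
	}

Tracking the remaining count locally lets the loop terminate without one extra lock/unlock round whose only outcome would be finding to_drain == 0.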
On 22.07.24 11:15, Zhijian Li (Fujitsu) wrote:
> Hi David
>
> Thanks for your quick reply.

Heh, Vlastimil replied, but I agree with his feedback :)
On 22/07/2024 17:28, Vlastimil Babka (SUSE) wrote:
> On 7/22/24 11:15 AM, Zhijian Li (Fujitsu) wrote:
>> Hi David
>>
>> Thanks for your quick reply.
>>
[...]
>>> It's wasteful to do a lock/unlock cycle just to read the count.
>>
>> How about,
>>
>> static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>> {
>> 	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>> 	int count, to_drain;
>>
>> 	do {
>> 		spin_lock(&pcp->lock);
>> 		to_drain = min(pcp->count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
>> 		free_pcppages_bulk(zone, to_drain, pcp, 0);
>> 		spin_unlock(&pcp->lock);
>> 	} while (to_drain);
>
> Yeah better than break. But I think you still should use

Okay, I will update it in V3

Thanks
Zhijian

> 	count = pcp->count;
> 	...
> 	count -= to_drain;
> 	} while (count);
>
> or you make one extra wasteful iteration to find to_drain is zero.
> (assuming "it's sufficient to read pcp->count once with lock held", that I
> agree with)
>
[...]
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..5388a35c4e9c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2323,8 +2323,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
 	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-	int count = READ_ONCE(pcp->count);
+	int count;
 
+	spin_lock(&pcp->lock);
+	count = pcp->count;
+	spin_unlock(&pcp->lock);
 	while (count) {
 		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
 		count -= to_drain;
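For context, drain_pages_zone() with this hunk applied would read roughly as below. The tail of the while loop is outside the hunk and is reconstructed from the surrounding upstream source of the time, so treat it as a sketch rather than the verbatim file:

	static void drain_pages_zone(unsigned int cpu, struct zone *zone)
	{
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
		int count;

		/* V2 change: read pcp->count under pcp->lock so the read cannot
		 * race with __rmqueue_pcplist() updating it on another CPU.
		 */
		spin_lock(&pcp->lock);
		count = pcp->count;
		spin_unlock(&pcp->lock);

		while (count) {
			int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
			count -= to_drain;

			/* each batch is freed under the lock, as before the patch */
			spin_lock(&pcp->lock);
			free_pcppages_bulk(zone, to_drain, pcp, 0);
			spin_unlock(&pcp->lock);
		}
	}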
It's expected that no page should be left in pcp_list after calling
zone_pcp_disable() in offline_pages(). Previously, it's observed that
offline_pages() gets stuck [1] due to some pages remaining in pcp_list.

Cause:
There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
involving the pcp->count variable. See below scenario:

        CPU0                              CPU1
    ----------------                  ---------------
                                      spin_lock(&pcp->lock);
                                      __rmqueue_pcplist() {
    zone_pcp_disable() {
                                        /* list is empty */
                                        if (list_empty(list)) {
                                          /* add pages to pcp_list */
                                          alloced = rmqueue_bulk()
      mutex_lock(&pcp_batch_high_lock)
      ...
      __drain_all_pages() {
        drain_pages_zone() {
          /* read pcp->count, it's 0 here */
          count = READ_ONCE(pcp->count)
          /* 0 means nothing to drain */
                                          /* update pcp->count */
                                          pcp->count += alloced << order;
      ...
      ...
                                      spin_unlock(&pcp->lock);

In this case, after calling zone_pcp_disable() though, there are still some
pages in pcp_list. And these pages in pcp_list are neither movable nor
isolated, offline_pages() gets stuck as a result.

Solution:
Expand the scope of the pcp->lock to also protect pcp->count in
drain_pages_zone(), to ensure no pages are left in the pcp list after
zone_pcp_disable()

[1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/

Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
V2:
- Narrow down the scope of the spin_lock() to limit the draining latency. # Vlastimil and David
- In above scenario, it's sufficient to read pcp->count once with lock held, and it fully fixed
  my issue [1] in thousands of runs (it happened in more than 5% of runs before).

RFC:
https://lore.kernel.org/linux-mm/20240716073929.843277-1-lizhijian@fujitsu.com/
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)