diff mbox series

[RFC] mm/page_alloc: Fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()

Message ID 20240716073929.843277-1-lizhijian@fujitsu.com (mailing list archive)
State New
Headers show
Series [RFC] mm/page_alloc: Fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist() | expand

Commit Message

Zhijian Li (Fujitsu) July 16, 2024, 7:39 a.m. UTC
It's expected that no page should be left in pcp_list after calling
zone_pcp_disable() in offline_pages(). Previously, it's observed that
offline_pages() gets stuck [1] due to some pages remaining in pcp_list.

Cause:
There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
involving the pcp->count variable. See below scenario:

         CPU0                              CPU1
    ----------------                    ---------------
                                      spin_lock(&pcp->lock);
                                      __rmqueue_pcplist() {
zone_pcp_disable() {
                                        /* list is empty */
                                        if (list_empty(list)) {
                                          /* add pages to pcp_list */
                                          alloced = rmqueue_bulk()
  mutex_lock(&pcp_batch_high_lock)
  ...
  __drain_all_pages() {
    drain_pages_zone() {
      /* read pcp->count, it's 0 here */
      count = READ_ONCE(pcp->count)
      /* 0 means nothing to drain */
                                          /* update pcp->count */
                                          pcp->count += alloced << order;
      ...
                                      ...
                                      spin_unlock(&pcp->lock);

In this case, after calling zone_pcp_disable() though, there are still some
pages in pcp_list. And these pages in pcp_list are neither movable nor
isolated, offline_pages() gets stuck as a result.

Solution:
Expand the scope of the pcp->lock to also protect pcp->count in
drain_pages_zone(), ensuring no pages are left in the pcp list.

[1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/

Cc: David Hildenbrand <david@redhat.com>
Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 mm/page_alloc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Vlastimil Babka (SUSE) July 18, 2024, 11:16 a.m. UTC | #1
On 7/16/24 9:39 AM, Li Zhijian wrote:
> It's expected that no page should be left in pcp_list after calling
> zone_pcp_disable() in offline_pages(). Previously, it's observed that
> offline_pages() gets stuck [1] due to some pages remaining in pcp_list.
> 
> Cause:
> There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
> involving the pcp->count variable. See below scenario:
> 
>          CPU0                              CPU1
>     ----------------                    ---------------
>                                       spin_lock(&pcp->lock);
>                                       __rmqueue_pcplist() {
> zone_pcp_disable() {
>                                         /* list is empty */
>                                         if (list_empty(list)) {
>                                           /* add pages to pcp_list */
>                                           alloced = rmqueue_bulk()
>   mutex_lock(&pcp_batch_high_lock)
>   ...
>   __drain_all_pages() {
>     drain_pages_zone() {
>       /* read pcp->count, it's 0 here */
>       count = READ_ONCE(pcp->count)
>       /* 0 means nothing to drain */
>                                           /* update pcp->count */
>                                           pcp->count += alloced << order;
>       ...
>                                       ...
>                                       spin_unlock(&pcp->lock);
> 
> In this case, after calling zone_pcp_disable() though, there are still some
> pages in pcp_list. And these pages in pcp_list are neither movable nor
> isolated, offline_pages() gets stuck as a result.
> 
> Solution:
> Expand the scope of the pcp->lock to also protect pcp->count in
> drain_pages_zone(), ensuring no pages are left in the pcp list.
> 
> [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/
> 
> Cc: David Hildenbrand <david@redhat.com>
> Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>  mm/page_alloc.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ecf99190ea2..1780df31d5f5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2323,16 +2323,17 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>  {
>  	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> -	int count = READ_ONCE(pcp->count);
> +	int count;
>  
> +	spin_lock(&pcp->lock);
> +	count = pcp->count;
>  	while (count) {
>  		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
>  		count -= to_drain;
>  
> -		spin_lock(&pcp->lock);
>  		free_pcppages_bulk(zone, to_drain, pcp, 0);
> -		spin_unlock(&pcp->lock);
>  	}
> +	spin_unlock(&pcp->lock);

This way seems to be partially going against the purpose of 55f77df7d715
("mm: page_alloc: control latency caused by zone PCP draining") - the zone
lock hold time will still be limited by the batch, but not the pcp lock
time. It should still be possible to relock between the iterations? To
prevent the race I think the main part is determining pcp->count under the
lock, but release/retake should still be ok if the pcp->count is reread
after relocking.

>  }
>  
>  /*
David Hildenbrand July 18, 2024, 2:02 p.m. UTC | #2
On 18.07.24 13:16, Vlastimil Babka (SUSE) wrote:
> On 7/16/24 9:39 AM, Li Zhijian wrote:
>> It's expected that no page should be left in pcp_list after calling
>> zone_pcp_disable() in offline_pages(). Previously, it's observed that
>> offline_pages() gets stuck [1] due to some pages remaining in pcp_list.
>>
>> Cause:
>> There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
>> involving the pcp->count variable. See below scenario:
>>
>>           CPU0                              CPU1
>>      ----------------                    ---------------
>>                                        spin_lock(&pcp->lock);
>>                                        __rmqueue_pcplist() {
>> zone_pcp_disable() {
>>                                          /* list is empty */
>>                                          if (list_empty(list)) {
>>                                            /* add pages to pcp_list */
>>                                            alloced = rmqueue_bulk()
>>    mutex_lock(&pcp_batch_high_lock)
>>    ...
>>    __drain_all_pages() {
>>      drain_pages_zone() {
>>        /* read pcp->count, it's 0 here */
>>        count = READ_ONCE(pcp->count)
>>        /* 0 means nothing to drain */
>>                                            /* update pcp->count */
>>                                            pcp->count += alloced << order;
>>        ...
>>                                        ...
>>                                        spin_unlock(&pcp->lock);
>>
>> In this case, after calling zone_pcp_disable() though, there are still some
>> pages in pcp_list. And these pages in pcp_list are neither movable nor
>> isolated, offline_pages() gets stuck as a result.
>>
>> Solution:
>> Expand the scope of the pcp->lock to also protect pcp->count in
>> drain_pages_zone(), ensuring no pages are left in the pcp list.
>>
>> [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/
>>
>> Cc: David Hildenbrand <david@redhat.com>
>> Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>   mm/page_alloc.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 9ecf99190ea2..1780df31d5f5 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2323,16 +2323,17 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>>   static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>>   {
>>   	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>> -	int count = READ_ONCE(pcp->count);
>> +	int count;
>>   
>> +	spin_lock(&pcp->lock);
>> +	count = pcp->count;
>>   	while (count) {
>>   		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
>>   		count -= to_drain;
>>   
>> -		spin_lock(&pcp->lock);
>>   		free_pcppages_bulk(zone, to_drain, pcp, 0);
>> -		spin_unlock(&pcp->lock);
>>   	}
>> +	spin_unlock(&pcp->lock);
> 
> This way seems to be partially going against the purpose of 55f77df7d715
> ("mm: page_alloc: control latency caused by zone PCP draining") - the zone
> lock hold time will still be limited by the batch, but not the pcp lock
> time. It should still be possible to relock between the iterations? To
> prevent the race I think the main part is determining pcp->count under the
> lock, but release/retake should still be ok if the pcp->count is reread
> after relocking.

Agreed, had the smame thing in mind when skimming over this patch.

@Li, with this patch the problems you have been seeing are fully 
resolved, correct?
Zhijian Li (Fujitsu) July 19, 2024, 2:28 a.m. UTC | #3
On 18/07/2024 22:02, David Hildenbrand wrote:
> On 18.07.24 13:16, Vlastimil Babka (SUSE) wrote:
>> On 7/16/24 9:39 AM, Li Zhijian wrote:
>>> It's expected that no page should be left in pcp_list after calling
>>> zone_pcp_disable() in offline_pages(). Previously, it's observed that
>>> offline_pages() gets stuck [1] due to some pages remaining in pcp_list.
>>>
>>> Cause:
>>> There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
>>> involving the pcp->count variable. See below scenario:
>>>
>>>           CPU0                              CPU1
>>>      ----------------                    ---------------
>>>                                        spin_lock(&pcp->lock);
>>>                                        __rmqueue_pcplist() {
>>> zone_pcp_disable() {
>>>                                          /* list is empty */
>>>                                          if (list_empty(list)) {
>>>                                            /* add pages to pcp_list */
>>>                                            alloced = rmqueue_bulk()
>>>    mutex_lock(&pcp_batch_high_lock)
>>>    ...
>>>    __drain_all_pages() {
>>>      drain_pages_zone() {
>>>        /* read pcp->count, it's 0 here */
>>>        count = READ_ONCE(pcp->count)
>>>        /* 0 means nothing to drain */
>>>                                            /* update pcp->count */
>>>                                            pcp->count += alloced << order;
>>>        ...
>>>                                        ...
>>>                                        spin_unlock(&pcp->lock);
>>>
>>> In this case, after calling zone_pcp_disable() though, there are still some
>>> pages in pcp_list. And these pages in pcp_list are neither movable nor
>>> isolated, offline_pages() gets stuck as a result.
>>>
>>> Solution:
>>> Expand the scope of the pcp->lock to also protect pcp->count in
>>> drain_pages_zone(), ensuring no pages are left in the pcp list.
>>>
>>> [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/
>>>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>>> ---
>>>   mm/page_alloc.c | 7 ++++---
>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 9ecf99190ea2..1780df31d5f5 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2323,16 +2323,17 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>>>   static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>>>   {
>>>       struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>>> -    int count = READ_ONCE(pcp->count);
>>> +    int count;
>>> +    spin_lock(&pcp->lock);
>>> +    count = pcp->count;
>>>       while (count) {
>>>           int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
>>>           count -= to_drain;
>>> -        spin_lock(&pcp->lock);
>>>           free_pcppages_bulk(zone, to_drain, pcp, 0);
>>> -        spin_unlock(&pcp->lock);
>>>       }
>>> +    spin_unlock(&pcp->lock);
>>
>> This way seems to be partially going against the purpose of 55f77df7d715
>> ("mm: page_alloc: control latency caused by zone PCP draining") - the zone
>> lock hold time will still be limited by the batch, but not the pcp lock
>> time. It should still be possible to relock between the iterations? To
>> prevent the race I think the main part is determining pcp->count under the
>> lock, but release/retake should still be ok if the pcp->count is reread
>> after relocking.

Okay, I will try it.

> 
> Agreed, had the smame thing in mind when skimming over this patch.


> 
> @Li, with this patch the problems you have been seeing are fully resolved, correct?
> 

Yeah, It worked in my thousand test runs.

P.S, Previously, It occurred more than 5%.
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..1780df31d5f5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2323,16 +2323,17 @@  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
 	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-	int count = READ_ONCE(pcp->count);
+	int count;
 
+	spin_lock(&pcp->lock);
+	count = pcp->count;
 	while (count) {
 		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
 		count -= to_drain;
 
-		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock(&pcp->lock);
 	}
+	spin_unlock(&pcp->lock);
 }
 
 /*