Message ID | 1597150703-19003-1-git-send-email-charante@codeaurora.org (mailing list archive)
---|---
State | New, archived
Series | [V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
On 11.08.20 14:58, Charan Teja Reddy wrote:
> The following race is observed with the repeated online, offline and a
> delay between two successive online of memory blocks of movable zone.
[...]
> Since the number of pages in the pcp lists is less than ->batch,
> it gets stuck in the while (list_empty(list)) loop with interrupts
> disabled, thus a core hang.
>
> Avoid this by ensuring free_pcppages_bulk() is called with a proper
> count of pcp list pages.
[...]
> [1]: https://patchwork.kernel.org/patch/11696389/

IIUC, this is not limited to the movable zone, it could also happen in
corner cases with the normal zone (e.g., hotplug to a node that only has
DMA memory, or no other memory yet).

> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
> ---
[...]
> +	/*
> +	 * Ensure proper count is passed which otherwise would get stuck in
> +	 * the below while (list_empty(list)) loop.
> +	 */
> +	count = min(pcp->count, count);
> 	while (count) {
> 		struct list_head *list;

Fixes: and Cc: stable... tags?
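For context, the hang being discussed lives in free_pcppages_bulk()'s round-robin selection of a non-empty pcp list. Below is a simplified sketch of that loop as it looked around v5.8 (statistics, prefetching and the actual freeing are trimmed; illustrative, not verbatim kernel source). With the proposed count = min(pcp->count, count) clamp, the outer loop can never ask for more pages than the lists actually hold, so the inner scan always finds one.

	/*
	 * Illustrative sketch of the pre-fix free_pcppages_bulk() loop
	 * (approximately v5.8); trimmed, not verbatim kernel code.
	 * The caller has already disabled interrupts.
	 */
	static void free_pcppages_bulk(struct zone *zone, int count,
				       struct per_cpu_pages *pcp)
	{
		int migratetype = 0;
		int batch_free = 0;

		while (count) {
			struct list_head *list;

			/*
			 * Pick the next non-empty pcp list, round-robin.
			 * If count exceeds the pages actually on the lists
			 * (a stale pcp->batch was passed in), every list is
			 * empty here and this inner loop never exits: a
			 * hard hang with IRQs off.
			 */
			do {
				batch_free++;
				if (++migratetype == MIGRATE_PCPTYPES)
					migratetype = 0;
				list = &pcp->lists[migratetype];
			} while (list_empty(list));

			/* ... detach up to batch_free pages from this list,
			 * decrementing count per page detached ... */
		}

		/* ... take zone->lock and free the detached pages ... */
	}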
Thanks David for the inputs.

On 8/12/2020 2:35 AM, David Hildenbrand wrote:
> On 11.08.20 14:58, Charan Teja Reddy wrote:
>> The following race is observed with the repeated online, offline and a
>> delay between two successive online of memory blocks of movable zone.
[...]
>> [1]: https://patchwork.kernel.org/patch/11696389/
>
> IIUC, this is not limited to the movable zone, it could also happen in
> corner cases with the normal zone (e.g., hotplug to a node that only has
> DMA memory, or no other memory yet).

Yes, this is my understanding too. I explained the above race in terms of
just the movable zone, for which it was observed. We can add the below
line at the end of the patch commit message:

"This is not limited to the movable zone, it could also happen in cases
with the normal zone (e.g., hotplug to a node that only has DMA memory,
or no other memory yet)."

Just curious, do there exist systems where just a DMA zone is present and
we hot add the normal zone? I am not aware of such a thing in the
embedded world.

> Fixes: and Cc: stable... tags?

Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
one-list-per-migrate-type")
Cc: <stable@vger.kernel.org> [2.6+]

I am not sure if I should raise a V3 including these?
On 12.08.20 11:46, Charan Teja Kalla wrote:
> Thanks David for the inputs.
[...]
> Yes, this is my understanding too. I explained the above race in terms of
> just the movable zone, for which it was observed. We can add the below
> line at the end of the patch commit message:
>
> "This is not limited to the movable zone, it could also happen in cases
> with the normal zone (e.g., hotplug to a node that only has DMA memory,
> or no other memory yet)."

Yeah, that makes sense!

> Just curious, do there exist systems where just a DMA zone is present and
> we hot add the normal zone? I am not aware of such a thing in the
> embedded world.

You can easily create such setups using QEMU.

IIRC, just specify a QEMU guest with 2G initial memory and a single NUMA
node, or 4G initial memory and two NUMA nodes. Then hotplug memory.

(IIRC kata containers always start a VM with 2G and then hotplug memory)

[...]
> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
> one-list-per-migrate-type")
> Cc: <stable@vger.kernel.org> [2.6+]

Did we have memory hotplug support then already?

> I am not sure if I should raise a V3 including these?

Maybe Andrew can fixup when applying.
On 8/12/2020 3:30 PM, David Hildenbrand wrote:
> On 12.08.20 11:46, Charan Teja Kalla wrote:
[...]
>> Just curious, do there exist systems where just a DMA zone is present and
>> we hot add the normal zone? I am not aware of such a thing in the
>> embedded world.
>
> You can easily create such setups using QEMU.
>
> IIRC, just specify a QEMU guest with 2G initial memory and a single NUMA
> node, or 4G initial memory and two NUMA nodes. Then hotplug memory.
>
> (IIRC kata containers always start a VM with 2G and then hotplug memory)

I see. Thanks for letting me know this.

[...]
> Did we have memory hotplug support then already?

Yes, it exists:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/mm/memory_hotplug.c?h=v2.6.39

>> I am not sure if I should raise a V3 including these?
>
> Maybe Andrew can fixup when applying.

Okay, let Andrew decide on this. Meanwhile, if you find that this patch
looks correct, an ACK from you would help here.
On Wed, 12 Aug 2020, Charan Teja Kalla wrote:
[...]
> >> +	/*
> >> +	 * Ensure proper count is passed which otherwise would get stuck in
> >> +	 * the below while (list_empty(list)) loop.
> >> +	 */
> >> +	count = min(pcp->count, count);
> >> 	while (count) {
> >> 		struct list_head *list;
[...]
> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
> one-list-per-migrate-type")
> Cc: <stable@vger.kernel.org> [2.6+]

Acked-by: David Rientjes <rientjes@google.com>
On 12.08.20 12:11, Charan Teja Kalla wrote:
[...]
>> Did we have memory hotplug support then already?
>
> Yes, it exists:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/mm/memory_hotplug.c?h=v2.6.39

Okay, so I guess these tags make sense.

>>> I am not sure if I should raise a V3 including these?
>>
>> Maybe Andrew can fixup when applying.
>
> Okay, let Andrew decide on this. Meanwhile, if you find that this patch
> looks correct, an ACK from you would help here.

Sure, I think this is good enough as a simple fix.

Acked-by: David Hildenbrand <david@redhat.com>
On Tue 11-08-20 18:28:23, Charan Teja Reddy wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e4896e6..839039f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> 	struct page *page, *tmp;
> 	LIST_HEAD(head);
>
> +	/*
> +	 * Ensure proper count is passed which otherwise would get stuck in
> +	 * the below while (list_empty(list)) loop.
> +	 */
> +	count = min(pcp->count, count);
> 	while (count) {
> 		struct list_head *list;

How does this prevent the race actually? Don't we need something like the
following instead?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87ce294..45bcc7ba37c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1317,9 +1317,16 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	 * lists
 	 */
 	do {
+		bool looped = false;
+
 		batch_free++;
-		if (++migratetype == MIGRATE_PCPTYPES)
+		if (++migratetype == MIGRATE_PCPTYPES) {
+			if (looped)
+				goto free;
+
 			migratetype = 0;
+			looped = true;
+		}
 		list = &pcp->lists[migratetype];
 	} while (list_empty(list));
 
@@ -1352,6 +1359,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	} while (--count && --batch_free && !list_empty(list));
 	}
 
+free:
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
Thanks Michal for the comments.

On 8/13/2020 5:11 PM, Michal Hocko wrote:
> On Tue 11-08-20 18:28:23, Charan Teja Reddy wrote:
> [...]
>> +	/*
>> +	 * Ensure proper count is passed which otherwise would get stuck in
>> +	 * the below while (list_empty(list)) loop.
>> +	 */
>> +	count = min(pcp->count, count);
>> 	while (count) {
>> 		struct list_head *list;
>
> How does this prevent the race actually?

This doesn't prevent the race. It only fixes the core hang (as this is
called with spin_lock_irq()) caused by the race condition. The core hang
is because an incorrect count value is passed to the
free_pcppages_bulk() function. The actual race should be fixed by
David's suggestion (isolate, online and undo isolation).

Does something in the commit message need to be improved? Maybe like below:
s/The following race is observed with the repeated .../CPU core hang is
observed as a result of a race with the use case of repeated.../

> Don't we need something like the following instead?
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
[...]
> 	do {
> +		bool looped = false;

IIUC, this `looped` will always be initialized to false, thus we never
jump to free. But I think I got your idea: loop over the pcp lists looking
for any pages, and if none are found despite all MIGRATE_PCPTYPES lists
being traversed, just break the loop. Is this check really required?
Shouldn't pcp->count tell the same thing, i.e. whether any pages are left
in the pcp lists?

> +
> 	batch_free++;
> -	if (++migratetype == MIGRATE_PCPTYPES)
> +	if (++migratetype == MIGRATE_PCPTYPES) {
> +		if (looped)
> +			goto free;
> +
> 		migratetype = 0;
> +		looped = true;
> +	}
> 	list = &pcp->lists[migratetype];
> } while (list_empty(list));
[...]
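A hypothetical rework of Michal's idea, with the flag hoisted out of the do/while so it survives across iterations, is sketched below for clarity. It was never posted to the thread and is not what was merged; its exact placement is an assumption.

	/*
	 * Hypothetical variant (not from the thread, not merged): `looped`
	 * lives outside the do/while, so wrapping through all
	 * MIGRATE_PCPTYPES lists a second time means every list was
	 * scanned empty and we bail out.
	 */
	while (count) {
		struct list_head *list;
		bool looped = false;

		do {
			batch_free++;
			if (++migratetype == MIGRATE_PCPTYPES) {
				if (looped)
					goto free; /* all lists scanned empty */
				migratetype = 0;
				looped = true;
			}
			list = &pcp->lists[migratetype];
		} while (list_empty(list));

		/* ... detach pages as before ... */
	}
free:
	spin_lock(&zone->lock);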
On Thu 13-08-20 21:51:29, Charan Teja Kalla wrote:
> Thanks Michal for the comments.
[...]
>> How does this prevent the race actually?
>
> This doesn't prevent the race. It only fixes the core hang (as this is
> called with spin_lock_irq()) caused by the race condition. The core hang
> is because an incorrect count value is passed to the
> free_pcppages_bulk() function.

Let me ask differently. What enforces that the count and lists do not get
out of sync in the loop? Your changelog says that the fix is to use the
proper value, without any specifics.
Thanks Michal.

On 8/13/2020 10:00 PM, Michal Hocko wrote:
> On Thu 13-08-20 21:51:29, Charan Teja Kalla wrote:
[...]
> Let me ask differently. What enforces that the count and lists do not get
> out of sync in the loop?

The count value is updated whenever an order-0 page is added to the pcp
lists through free_unref_page_commit(), which is called with both
interrupts and preemption disabled:

static void free_unref_page_commit(struct page *page, ...)
{
	....
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
}

As these are pcp lists, they only get touched by another process when
this process is context switched, which happens only after enabling
preemption or interrupts. So, as long as a process is operating on these
pcp lists in free_unref_page_commit(), the count and lists are always in
sync.

However, the problem here is not that the count and lists are out of
sync. They are always in sync, as explained above. The problem is asking
free_pcppages_bulk() to free more pages than are present in the pcp
lists, which ends up in while (list_empty()).

> Your changelog says that the fix is to use the proper value, without any
> specifics.

Will change this to: "Ensure the count value passed is not greater than
the pcp lists count." Anything better you suggest?
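To make the caller side concrete: below is a simplified sketch of the free_unref_page_commit() path around v5.8 (trimmed, not verbatim kernel source). The list and count always move together with IRQs off, as Charan explains; what a concurrent zone_pcp_update() can change under P2's feet is pcp->high and pcp->batch, so the batch handed to free_pcppages_bulk() can exceed pcp->count.

	/*
	 * Simplified sketch of free_unref_page_commit() (~v5.8); trimmed,
	 * not verbatim. list_add() and pcp->count++ always move together
	 * under IRQs-off, so count and lists stay in sync; the stale
	 * values are pcp->high/pcp->batch, racing with zone_pcp_update().
	 */
	static void free_unref_page_commit(struct page *page, unsigned long pfn)
	{
		struct zone *zone = page_zone(page);
		struct per_cpu_pages *pcp;
		int migratetype = get_pcppage_migratetype(page);

		pcp = &this_cpu_ptr(zone->pageset)->pcp;
		list_add(&page->lru, &pcp->lists[migratetype]);
		pcp->count++;
		if (pcp->count >= pcp->high) {
			unsigned long batch = READ_ONCE(pcp->batch);

			/*
			 * With the default pcp->high = 0, this fires with
			 * pcp->count = 1, while a concurrent
			 * zone_pcp_update() may have just raised pcp->batch
			 * to e.g. 63 -- so batch > count.
			 */
			free_pcppages_bulk(zone, batch, pcp);
		}
	}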
On Thu 13-08-20 22:57:32, Charan Teja Kalla wrote:
> Thanks Michal.
[...]
> However, the problem here is not that the count and lists are out of
> sync. They are always in sync, as explained above. The problem is asking
> free_pcppages_bulk() to free more pages than are present in the pcp
> lists, which ends up in while (list_empty()).

You are right. I managed to confuse myself. The thing is that the batch
count is out of sync.

> Will change this to: "Ensure the count value passed is not greater than
> the pcp lists count." Anything better you suggest?

Yes, this makes it more clear. Feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4896e6..839039f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	struct page *page, *tmp;
 	LIST_HEAD(head);
 
+	/*
+	 * Ensure proper count is passed which otherwise would get stuck in
+	 * the below while (list_empty(list)) loop.
+	 */
+	count = min(pcp->count, count);
 	while (count) {
 		struct list_head *list;
The following race is observed with the repeated online, offline and a
delay between two successive online of memory blocks of movable zone.

P1                                   P2

Online the first memory block in
the movable zone. The pcp struct
values are initialized to default
values, i.e., pcp->high = 0 &
pcp->batch = 1.

                                     Allocate the pages from the
                                     movable zone.

Try to online the second memory
block in the movable zone, thus it
entered online_pages() but is yet
to call zone_pcp_update().

                                     This process entered the exit
                                     path, thus it tries to release
                                     the order-0 pages to pcp lists
                                     through free_unref_page_commit().
                                     As pcp->high = 0, pcp->count = 1,
                                     it proceeds to call
                                     free_pcppages_bulk().

Update the pcp values, thus the
new pcp values are, say,
pcp->high = 378, pcp->batch = 63.

                                     Read the pcp's batch value using
                                     READ_ONCE() and pass the same to
                                     free_pcppages_bulk(); the values
                                     passed here are batch = 63,
                                     count = 1.

                                     Since the number of pages in the
                                     pcp lists is less than ->batch,
                                     it gets stuck in the
                                     while (list_empty(list)) loop
                                     with interrupts disabled, thus
                                     a core hang.

Avoid this by ensuring free_pcppages_bulk() is called with a proper count
of pcp list pages.

The mentioned race is somewhat easily reproducible without [1] because
pcp's are not updated for the first memory block online, and thus there
is a wide enough race window for P2 between alloc+free and the pcp struct
values update through onlining of the second memory block.

With [1], the race still exists but is much narrower, as we update the
pcp struct values for the first memory block online itself.

[1]: https://patchwork.kernel.org/patch/11696389/

Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
---

v1: https://patchwork.kernel.org/patch/11707637/

 mm/page_alloc.c | 5 +++++
 1 file changed, 5 insertions(+)
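As a standalone illustration of the arithmetic, here is a small hypothetical userspace model (plain C, not kernel code) of the selection loop, bounded so it reports instead of spinning: with batch = 63 against a single queued page the unfixed loop would spin forever, while clamping to min(pcp->count, count) terminates.

	/*
	 * Hypothetical userspace model of the hang: lists[] stands in for
	 * pcp->lists, drain() mimics free_pcppages_bulk()'s list-selection
	 * loop, bounded so it reports instead of spinning.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	#define MIGRATE_PCPTYPES 3

	static bool drain(int lists[MIGRATE_PCPTYPES], int count)
	{
		int migratetype = 0;

		while (count) {
			int scanned = 0;

			/* models: do { ... } while (list_empty(list)); */
			while (lists[migratetype] == 0) {
				if (++scanned > MIGRATE_PCPTYPES)
					return false; /* all empty: the kernel spins here */
				if (++migratetype == MIGRATE_PCPTYPES)
					migratetype = 0;
			}
			lists[migratetype]--; /* "free" one page */
			count--;
		}
		return true;
	}

	int main(void)
	{
		int pcp_count = 1, batch = 63; /* the values from the race */
		int lists_a[MIGRATE_PCPTYPES] = { 1, 0, 0 };
		int lists_b[MIGRATE_PCPTYPES] = { 1, 0, 0 };
		int fixed = pcp_count < batch ? pcp_count : batch; /* min() */

		printf("unfixed (count=63): %s\n",
		       drain(lists_a, batch) ? "ok" : "hangs");
		printf("fixed   (count=1):  %s\n",
		       drain(lists_b, fixed) ? "ok" : "hangs");
		return 0;
	}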