mm,page_owner: Fix recursion

Message ID 20240313234245.18824-1-osalvador@suse.de (mailing list archive)
State New

Commit Message

Oscar Salvador March 13, 2024, 11:42 p.m. UTC
Prior to 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count")
the only place where page_owner could potentially go into recursion due to
its need of allocating more memory was in save_stack(), which ends up calling
into stackdepot code with the possibility of allocating memory.

We made sure to guard against that by signaling that the current task was
already in page_owner code, so in case a recursion attempt was made, we
could catch that and return dummy_handle.

After the above commit, page_owner code gained a second place where it can
allocate memory, meaning we could go into recursion should we take that
path.

Make sure to signal that we are in page_owner in that codepath as well.
Move the guard code into two helpers, {un}set_current_in_page_owner(),
and use them around the calls to the two functions that might allocate
memory.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count")
---
 mm/page_owner.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)
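
For readers unfamiliar with the guard described in the commit message, the following is a minimal userspace sketch of the idea, not kernel code. It assumes a hypothetical `struct task_model` standing in for the `in_page_owner` bit of the current task, and hypothetical `model_alloc()` and `model_track()` functions standing in for an allocation and the page_owner tracking hook. Only the two helper names are taken from the patch.

```c
/* Userspace model of a per-task recursion guard; all model_* names are hypothetical. */
#include <assert.h>

/* Stand-in for the in_page_owner bit of the current task. */
struct task_model {
	unsigned int in_page_owner : 1;
};

static struct task_model current_model;

static int hook_entries;	/* times the tracking hook did real work */
static int hook_bailouts;	/* times the hook returned early because of the guard */

static void set_current_in_page_owner(void)
{
	current_model.in_page_owner = 1;
}

static void unset_current_in_page_owner(void)
{
	/* Catch an unbalanced unset in this model. */
	assert(current_model.in_page_owner);
	current_model.in_page_owner = 0;
}

static void model_alloc(void);

/* Tracking hook: it needs memory for its own bookkeeping. */
static void model_track(void)
{
	if (current_model.in_page_owner) {
		/* Recursion attempt: the kernel code returns dummy_handle here. */
		hook_bailouts++;
		return;
	}

	set_current_in_page_owner();
	hook_entries++;
	model_alloc();		/* bookkeeping allocation re-enters the hook */
	unset_current_in_page_owner();
}

/* Every allocation is tracked, including the hook's own allocations. */
static void model_alloc(void)
{
	model_track();
}
```

In this model, one outer `model_alloc()` call does the tracking work once, and the nested call made from inside the hook returns early instead of recursing without bound.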

Comments

Oscar Salvador March 13, 2024, 11:46 p.m. UTC | #1
On Thu, Mar 14, 2024 at 12:42:45AM +0100, Oscar Salvador wrote:
> @@ -232,6 +241,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
>  	alloc_handle = page_owner->handle;
>  
>  	handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
> +
>  	for (i = 0; i < (1 << order); i++) {

Sigh, a last-minute unnoticed change.

@Andrew: Do you want me to send v2 fixing that up?
Tetsuo Handa March 14, 2024, 3:01 a.m. UTC | #2
On 2024/03/14 8:42, Oscar Salvador wrote:
> Prior to 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count")
> the only place where page_owner could potentially go into recursion due to
> its need of allocating more memory was in save_stack(), which ends up calling
> into stackdepot code with the possibility of allocating memory.
> 
> We made sure to guard against that by signaling that the current task was
> already in page_owner code, so in case a recursion attempt was made, we
> could catch that and return dummy_handle.
> 
> After above commit, a new place in page_owner code was introduced where we
> could allocate memory, meaning we could go into recursion would we take that
> path.
> 
> Make sure to signal that we are in page_owner in that codepath as well.
> Move the guard code into two helpers {un}set_current_in_page_owner()
> and use them prior to calling in the two functions that might allocate
> memory.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count")

Could this be the culprit for the page owner refcount bug reported at
https://syzkaller.appspot.com/bug?id=8e4e66dfe299a2a00204ad220c641daaf1486a00 ?
That commit went into next-20240214, and syzbot started failing to test from next-20240215.

Please send this patch to linux-next.git as soon as possible (or can someone experiencing
this bug try booting linux-next.git with this patch applied, so that we can check whether
syzbot can resume testing linux-next.git), and then send to linux.git together (so that
various trees which depend on linux.git won't start failing to boot).
Oscar Salvador March 14, 2024, 5:47 a.m. UTC | #3
On Thu, Mar 14, 2024 at 12:01:24PM +0900, Tetsuo Handa wrote:
> Maybe culprit for a page owner refcount bug reported at
> https://syzkaller.appspot.com/bug?id=8e4e66dfe299a2a00204ad220c641daaf1486a00 , for
> that commit went to next-20240214 and syzbot started failing to test since next-20240215 ?
> 
> Please send this patch to linux-next.git as soon as possible (or can someone experiencing
> this bug try booting linux-next.git with this patch applied, so that we can check whether
> syzbot can resume testing linux-next.git), and then send to linux.git together (so that
> various trees which depend on linux.git won't start failing to boot).

No, that is something else that I already started fixing a few days ago.
I think I will have the fix ready today.
Oscar Salvador March 14, 2024, 7:01 a.m. UTC | #4
On Thu, Mar 14, 2024 at 06:47:43AM +0100, Oscar Salvador wrote:
> On Thu, Mar 14, 2024 at 12:01:24PM +0900, Tetsuo Handa wrote:
> > Maybe culprit for a page owner refcount bug reported at
> > https://syzkaller.appspot.com/bug?id=8e4e66dfe299a2a00204ad220c641daaf1486a00 , for
> > that commit went to next-20240214 and syzbot started failing to test since next-20240215 ?
> > 
> > Please send this patch to linux-next.git as soon as possible (or can someone experiencing
> > this bug try booting linux-next.git with this patch applied, so that we can check whether
> > syzbot can resume testing linux-next.git), and then send to linux.git together (so that
> > various trees which depend on linux.git won't start failing to boot).
> 
> No, that is something else that I already started fixing a few days ago.
> I think I will have the fix ready today.

I already have the fix. I will do some more testing and then I will send
it out.

Thanks
Vlastimil Babka March 15, 2024, 1:07 p.m. UTC | #5
On 3/14/24 00:42, Oscar Salvador wrote:
> Prior to 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count")
> the only place where page_owner could potentially go into recursion due to
> its need of allocating more memory was in save_stack(), which ends up calling
> into stackdepot code with the possibility of allocating memory.
> 
> We made sure to guard against that by signaling that the current task was
> already in page_owner code, so in case a recursion attempt was made, we
> could catch that and return dummy_handle.
> 
> After above commit, a new place in page_owner code was introduced where we
> could allocate memory, meaning we could go into recursion would we take that
> path.
> 
> Make sure to signal that we are in page_owner in that codepath as well.
> Move the guard code into two helpers {un}set_current_in_page_owner()
> and use them prior to calling in the two functions that might allocate
> memory.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count")
> ---
>  mm/page_owner.c | 30 +++++++++++++++++++++---------
>  1 file changed, 21 insertions(+), 9 deletions(-)
> 
> @@ -292,7 +302,9 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
>  		return;
>  	__set_page_owner_handle(page_ext, handle, order, gfp_mask);
>  	page_ext_put(page_ext);
> +	set_current_in_page_owner();
>  	inc_stack_record_count(handle, gfp_mask);
> +	unset_current_in_page_owner();

This is because of the kmalloc() in add_stack_record_to_list() right? Why
not wrap just that then?

>  }
>  
>  void __set_page_owner_migrate_reason(struct page *page, int reason)
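
The narrower scope suggested in this reply can be sketched in userspace as follows. This is an illustration under stated assumptions, not the code from the thread: `model_kmalloc()`, `model_add_stack_record_to_list()`, `model_inc_stack_record_count()`, and the list node type are hypothetical stand-ins, and a plain integer models `current->in_page_owner`. The guard is set only around the allocating call inside the helper, so the caller needs no guard of its own.

```c
/* Userspace model of guarding only the allocating call; all names are hypothetical. */
#include <assert.h>
#include <stdlib.h>

struct model_stack_node {
	int handle;
	struct model_stack_node *next;
};

static unsigned int model_in_page_owner;	/* stand-in for current->in_page_owner */
static unsigned int guard_seen_during_alloc;	/* guard state observed by the allocator */
static struct model_stack_node *model_stack_list;

/* Stand-in for kmalloc(); records whether the guard was set when called. */
static void *model_kmalloc(size_t size)
{
	guard_seen_during_alloc = model_in_page_owner;
	return malloc(size);
}

static int model_add_stack_record_to_list(int handle)
{
	struct model_stack_node *node;

	/* The guard must not already be held in this simple model. */
	assert(!model_in_page_owner);

	model_in_page_owner = 1;	/* guard only the allocation itself */
	node = model_kmalloc(sizeof(*node));
	model_in_page_owner = 0;
	if (!node)
		return -1;

	node->handle = handle;
	node->next = model_stack_list;
	model_stack_list = node;
	return 0;
}

static void model_inc_stack_record_count(int handle)
{
	/* No guard here: only the helper above allocates. */
	(void)model_add_stack_record_to_list(handle);
}
```

The trade-off in this sketch is locality: the guard sits next to the one call that can allocate, at the cost of the helper having to know about the recursion flag.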
Tetsuo Handa March 19, 2024, 10:51 a.m. UTC | #6
On 2024/03/14 16:01, Oscar Salvador wrote:
> On Thu, Mar 14, 2024 at 06:47:43AM +0100, Oscar Salvador wrote:
>> On Thu, Mar 14, 2024 at 12:01:24PM +0900, Tetsuo Handa wrote:
>>> Maybe culprit for a page owner refcount bug reported at
>>> https://syzkaller.appspot.com/bug?id=8e4e66dfe299a2a00204ad220c641daaf1486a00 , for
>>> that commit went to next-20240214 and syzbot started failing to test since next-20240215 ?
>>>
>>> Please send this patch to linux-next.git as soon as possible (or can someone experiencing
>>> this bug try booting linux-next.git with this patch applied, so that we can check whether
>>> syzbot can resume testing linux-next.git), and then send to linux.git together (so that
>>> various trees which depend on linux.git won't start failing to boot).
>>
>> No, that is something else that I already started fixing a few days ago.
>> I think I will have the fix ready today.
> 
> I already have the fix. I will do some more testing and then I will send
> it out.

OK. Please test your patch using
https://syzkaller.appspot.com/bug?extid=98c1a1753a0731df2dd4 .

Patch

diff --git a/mm/page_owner.c b/mm/page_owner.c
index e96dd9092658..60663d657f7a 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -54,6 +54,22 @@ static depot_stack_handle_t early_handle;
 
 static void init_early_allocated_pages(void);
 
+static inline void set_current_in_page_owner(void)
+{
+	/*
+	 * Avoid recursion.
+	 *
+	 * We might need to allocate more memory from page_owner code, so make
+	 * sure to signal it in order to avoid recursion.
+	 */
+	current->in_page_owner = 1;
+}
+
+static inline void unset_current_in_page_owner(void)
+{
+	current->in_page_owner = 0;
+}
+
 static int __init early_page_owner_param(char *buf)
 {
 	int ret = kstrtobool(buf, &page_owner_enabled);
@@ -133,23 +149,16 @@ static noinline depot_stack_handle_t save_stack(gfp_t flags)
 	depot_stack_handle_t handle;
 	unsigned int nr_entries;
 
-	/*
-	 * Avoid recursion.
-	 *
-	 * Sometimes page metadata allocation tracking requires more
-	 * memory to be allocated:
-	 * - when new stack trace is saved to stack depot
-	 */
 	if (current->in_page_owner)
 		return dummy_handle;
-	current->in_page_owner = 1;
 
+	set_current_in_page_owner();
 	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
 	handle = stack_depot_save(entries, nr_entries, flags);
 	if (!handle)
 		handle = failure_handle;
+	unset_current_in_page_owner();
 
-	current->in_page_owner = 0;
 	return handle;
 }
 
@@ -232,6 +241,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
 	alloc_handle = page_owner->handle;
 
 	handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
+
 	for (i = 0; i < (1 << order); i++) {
 		__clear_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
 		page_owner->free_handle = handle;
@@ -292,7 +302,9 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
 		return;
 	__set_page_owner_handle(page_ext, handle, order, gfp_mask);
 	page_ext_put(page_ext);
+	set_current_in_page_owner();
 	inc_stack_record_count(handle, gfp_mask);
+	unset_current_in_page_owner();
 }
 
 void __set_page_owner_migrate_reason(struct page *page, int reason)