Message ID | 20241101130845.19100-1-42.hyeyoo@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v1] mm/slab: Allow cache creation to proceed even if sysfs registration fails | expand |
On Fri, Nov 1, 2024 at 10:08 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: > > When kobject_init_and_add() fails during cache creation, > kobj->name can be leaked because SLUB does not call kobject_put(), > which should be invoked per the kobject API documentation. > This has a bit of historical context, though; SLUB does not call > kobject_put() to avoid double-free for struct kmem_cache because > 1) simply calling it would free all resources related to the cache, and > 2) struct kmem_cache descriptor is always freed by cache_cache()'s > error handling path, causing struct kmem_cache to be freed twice. > > This issue can be reproduced by creating new slab caches while applying > failslab for kernfs_node_cache. This makes kobject_add_varg() succeed, > but causes kobject_add_internal() to fail in kobject_init_and_add() > during cache creation. > > Historically, this issue has attracted developers' attention several times. > Each time a fix addressed either the leak or the double-free, > it caused the other issue. Let's summarize a bit of history here: > > The leak has existed since the early days of SLUB. > > Commit 54b6a731025f ("slub: fix leak of 'name' in sysfs_slab_add") > introduced a double-free bug while fixing the leak. > > Commit 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate > sysfs filename") re-introduced the leak while fixing the double-free > error. > > Commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") > fixed the memory leak, but it was later reverted by commit 757fed1d0898 > ("Revert "mm/slub: fix a memory leak in sysfs_slab_add()"") to avoid > the double-free error. > > This is where we are now: we've chosen a memory leak over a double-free. > > To resolve this memory leak, skip creating sysfs files if it fails > and continue with cache creation regardless (as suggested by Christoph). > This resolves the memory leak because both the cache and the kobject > remain alive on kobject_init_and_add() failure. > > If SLUB tries to create an alias for a cache without sysfs files, > its symbolic link will not be generated. > > Since a slab cache might not have associated sysfs files, call kobject_del() > only if such files exist. > > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> [+Cc Jinjie, Liu] > --- > > RFC -> v1: Make sysfs optional instead of destroying the cache on sysfs > errors. (Suggested by Christoph) RFC version for reference: https://lore.kernel.org/linux-mm/20241021091413.154775-1-42.hyeyoo@gmail.com/ > mm/slub.c | 21 +++++++++++++++++---- > 1 file changed, 17 insertions(+), 4 deletions(-) > > diff --git a/mm/slub.c b/mm/slub.c > index 151a987dc3a0..b4b211468f77 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -6116,7 +6116,8 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align, > s = find_mergeable(size, align, flags, name, ctor); > if (s) { > if (sysfs_slab_alias(s, name)) > - return NULL; > + pr_err("SLUB: Unable to add slab alias %s to sysfs\n", > + name); > > s->refcount++; > > @@ -6204,9 +6205,15 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name, > goto out; > } > > + /* > + * Failing to create sysfs files is not critical to SLUB functionality. > + * If it fails, proceed with cache creation without these files. > + */ > err = sysfs_slab_add(s); > - if (err) > - goto out; > + if (err) { > + err = 0; > + pr_err("SLUB: Unable to add slab %s to sysfs\n", s->name); > + } > > if (s->flags & SLAB_STORE_USER) > debugfs_slab_add(s); > @@ -7276,7 +7283,8 @@ static int sysfs_slab_add(struct kmem_cache *s) > > void sysfs_slab_unlink(struct kmem_cache *s) > { > - kobject_del(&s->kobj); > + if (s->kobj.state_in_sysfs) > + kobject_del(&s->kobj); > } > > void sysfs_slab_release(struct kmem_cache *s) > @@ -7305,6 +7313,11 @@ static int sysfs_slab_alias(struct kmem_cache *s, const char *name) > * If we have a leftover link then remove it. > */ > sysfs_remove_link(&slab_kset->kobj, name); > + /* > + * The original cache may have failed to generate sysfs file. > + * In that case, sysfs_create_link() returns -ENOENT and > + * symbolic link creation is skipped. > + */ > return sysfs_create_link(&slab_kset->kobj, &s->kobj, name); > } > > -- > 2.45.0 >
On 11/1/24 14:15, Hyeonggon Yoo wrote: > On Fri, Nov 1, 2024 at 10:08 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: >> >> When kobject_init_and_add() fails during cache creation, >> kobj->name can be leaked because SLUB does not call kobject_put(), >> which should be invoked per the kobject API documentation. >> This has a bit of historical context, though; SLUB does not call >> kobject_put() to avoid double-free for struct kmem_cache because >> 1) simply calling it would free all resources related to the cache, and >> 2) struct kmem_cache descriptor is always freed by cache_cache()'s >> error handling path, causing struct kmem_cache to be freed twice. >> >> This issue can be reproduced by creating new slab caches while applying >> failslab for kernfs_node_cache. This makes kobject_add_varg() succeed, >> but causes kobject_add_internal() to fail in kobject_init_and_add() >> during cache creation. >> >> Historically, this issue has attracted developers' attention several times. >> Each time a fix addressed either the leak or the double-free, >> it caused the other issue. Let's summarize a bit of history here: >> >> The leak has existed since the early days of SLUB. >> >> Commit 54b6a731025f ("slub: fix leak of 'name' in sysfs_slab_add") >> introduced a double-free bug while fixing the leak. >> >> Commit 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate >> sysfs filename") re-introduced the leak while fixing the double-free >> error. >> >> Commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") >> fixed the memory leak, but it was later reverted by commit 757fed1d0898 >> ("Revert "mm/slub: fix a memory leak in sysfs_slab_add()"") to avoid >> the double-free error. >> >> This is where we are now: we've chosen a memory leak over a double-free. >> >> To resolve this memory leak, skip creating sysfs files if it fails >> and continue with cache creation regardless (as suggested by Christoph). >> This resolves the memory leak because both the cache and the kobject >> remain alive on kobject_init_and_add() failure. >> >> If SLUB tries to create an alias for a cache without sysfs files, >> its symbolic link will not be generated. >> >> Since a slab cache might not have associated sysfs files, call kobject_del() >> only if such files exist. >> >> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Thanks, added to slab/for-next > > [+Cc Jinjie, Liu] > >> --- >> >> RFC -> v1: Make sysfs optional instead of destroying the cache on sysfs >> errors. (Suggested by Christoph) > > RFC version for reference: > https://lore.kernel.org/linux-mm/20241021091413.154775-1-42.hyeyoo@gmail.com/ > >> mm/slub.c | 21 +++++++++++++++++---- >> 1 file changed, 17 insertions(+), 4 deletions(-) >> >> diff --git a/mm/slub.c b/mm/slub.c >> index 151a987dc3a0..b4b211468f77 100644 >> --- a/mm/slub.c >> +++ b/mm/slub.c >> @@ -6116,7 +6116,8 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align, >> s = find_mergeable(size, align, flags, name, ctor); >> if (s) { >> if (sysfs_slab_alias(s, name)) >> - return NULL; >> + pr_err("SLUB: Unable to add slab alias %s to sysfs\n", I've changed "slab" to "cache". >> + name); >> >> s->refcount++; >> >> @@ -6204,9 +6205,15 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name, >> goto out; >> } >> >> + /* >> + * Failing to create sysfs files is not critical to SLUB functionality. >> + * If it fails, proceed with cache creation without these files. >> + */ >> err = sysfs_slab_add(s); >> - if (err) >> - goto out; >> + if (err) { >> + err = 0; >> + pr_err("SLUB: Unable to add slab %s to sysfs\n", s->name); Also here, and simplified to "if (sysfs_slab_add(s)) ... " to avoid dealing with err. >> + } >> >> if (s->flags & SLAB_STORE_USER) >> debugfs_slab_add(s); >> @@ -7276,7 +7283,8 @@ static int sysfs_slab_add(struct kmem_cache *s) >> >> void sysfs_slab_unlink(struct kmem_cache *s) >> { >> - kobject_del(&s->kobj); >> + if (s->kobj.state_in_sysfs) >> + kobject_del(&s->kobj); >> } >> >> void sysfs_slab_release(struct kmem_cache *s) >> @@ -7305,6 +7313,11 @@ static int sysfs_slab_alias(struct kmem_cache *s, const char *name) >> * If we have a leftover link then remove it. >> */ >> sysfs_remove_link(&slab_kset->kobj, name); >> + /* >> + * The original cache may have failed to generate sysfs file. >> + * In that case, sysfs_create_link() returns -ENOENT and >> + * symbolic link creation is skipped. >> + */ >> return sysfs_create_link(&slab_kset->kobj, &s->kobj, name); >> } >> >> -- >> 2.45.0 >>
On Fri, Nov 1, 2024 at 10:58 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 11/1/24 14:15, Hyeonggon Yoo wrote: > > On Fri, Nov 1, 2024 at 10:08 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: > >> > >> When kobject_init_and_add() fails during cache creation, > >> kobj->name can be leaked because SLUB does not call kobject_put(), > >> which should be invoked per the kobject API documentation. > >> This has a bit of historical context, though; SLUB does not call > >> kobject_put() to avoid double-free for struct kmem_cache because > >> 1) simply calling it would free all resources related to the cache, and > >> 2) struct kmem_cache descriptor is always freed by cache_cache()'s > >> error handling path, causing struct kmem_cache to be freed twice. > >> > >> This issue can be reproduced by creating new slab caches while applying > >> failslab for kernfs_node_cache. This makes kobject_add_varg() succeed, > >> but causes kobject_add_internal() to fail in kobject_init_and_add() > >> during cache creation. > >> > >> Historically, this issue has attracted developers' attention several times. > >> Each time a fix addressed either the leak or the double-free, > >> it caused the other issue. Let's summarize a bit of history here: > >> > >> The leak has existed since the early days of SLUB. > >> > >> Commit 54b6a731025f ("slub: fix leak of 'name' in sysfs_slab_add") > >> introduced a double-free bug while fixing the leak. > >> > >> Commit 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate > >> sysfs filename") re-introduced the leak while fixing the double-free > >> error. > >> > >> Commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") > >> fixed the memory leak, but it was later reverted by commit 757fed1d0898 > >> ("Revert "mm/slub: fix a memory leak in sysfs_slab_add()"") to avoid > >> the double-free error. > >> > >> This is where we are now: we've chosen a memory leak over a double-free. > >> > >> To resolve this memory leak, skip creating sysfs files if it fails > >> and continue with cache creation regardless (as suggested by Christoph). > >> This resolves the memory leak because both the cache and the kobject > >> remain alive on kobject_init_and_add() failure. > >> > >> If SLUB tries to create an alias for a cache without sysfs files, > >> its symbolic link will not be generated. > >> > >> Since a slab cache might not have associated sysfs files, call kobject_del() > >> only if such files exist. > >> > >> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> > > Thanks, added to slab/for-next > > > > > [+Cc Jinjie, Liu] > > > >> --- > >> > >> RFC -> v1: Make sysfs optional instead of destroying the cache on sysfs > >> errors. (Suggested by Christoph) > > > > RFC version for reference: > > https://lore.kernel.org/linux-mm/20241021091413.154775-1-42.hyeyoo@gmail.com/ > > > >> mm/slub.c | 21 +++++++++++++++++---- > >> 1 file changed, 17 insertions(+), 4 deletions(-) > >> > >> diff --git a/mm/slub.c b/mm/slub.c > >> index 151a987dc3a0..b4b211468f77 100644 > >> --- a/mm/slub.c > >> +++ b/mm/slub.c > >> @@ -6116,7 +6116,8 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align, > >> s = find_mergeable(size, align, flags, name, ctor); > >> if (s) { > >> if (sysfs_slab_alias(s, name)) > >> - return NULL; > >> + pr_err("SLUB: Unable to add slab alias %s to sysfs\n", > > I've changed "slab" to "cache". > > >> + name); > >> > >> s->refcount++; > >> > >> @@ -6204,9 +6205,15 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name, > >> goto out; > >> } > >> > >> + /* > >> + * Failing to create sysfs files is not critical to SLUB functionality. > >> + * If it fails, proceed with cache creation without these files. > >> + */ > >> err = sysfs_slab_add(s); > >> - if (err) > >> - goto out; > >> + if (err) { > >> + err = 0; > >> + pr_err("SLUB: Unable to add slab %s to sysfs\n", s->name); > > Also here, and simplified to "if (sysfs_slab_add(s)) ... " to avoid dealing > with err. Oh no. err is initialized to -EINVAL, so that will not work as intended. It is causing the following list corruption. [ 0.607833] __kmem_cache_create_args(fscrypt_inode_info) failed with error -22 [ 0.608518] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc2+ #63 [ 0.609181] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014 [ 0.610000] Call Trace: [ 0.610233] <TASK> [ 0.610433] dump_stack_lvl+0x64/0x80 [ 0.610806] __kmem_cache_create_args+0x1eb/0x280 [ 0.611253] ? __pfx_fscrypt_init+0x10/0x10 [ 0.611647] fscrypt_init+0x88/0xf0 [ 0.611980] ? __pfx_fscrypt_init+0x10/0x10 [ 0.612373] do_one_initcall+0x5b/0x320 [ 0.612736] kernel_init_freeable+0x351/0x510 [ 0.613150] ? __pfx_kernel_init+0x10/0x10 [ 0.613536] kernel_init+0x1a/0x1d0 [ 0.613865] ret_from_fork+0x34/0x50 [ 0.614207] ? __pfx_kernel_init+0x10/0x10 [ 0.614591] ret_from_fork_asm+0x1a/0x30 [ 0.614968] </TASK> [ 0.615203] list_add corruption. prev->next should be next (ffff986bc2bb9aa0), but was ffff986bc2b6. [ 0.616308] ------------[ cut here ]------------ [ 0.616746] kernel BUG at lib/list_debug.c:32! [ 0.617173] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 0.617709] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc2+ #63 [ 0.618372] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014 [ 0.619192] RIP: 0010:__list_add_valid_or_report+0x78/0xa0 [ 0.619714] Code: 8b ff 0f 0b 48 89 c1 48 c7 c7 50 ca c0 9a e8 2f f4 8b ff 0f 0b 48 89 d1 48 89 c6 b [ 0.621473] RSP: 0018:ffffa47380013c98 EFLAGS: 00010246 [ 0.621969] RAX: 0000000000000075 RBX: ffff986bc2b63238 RCX: ffffffff9b5646a8 [ 0.622638] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 [ 0.623313] RBP: ffff986bc2bb9ab8 R08: 0000000000000000 R09: 205d333032353136 [ 0.623996] R10: 74707572726f6320 R11: 6464615f7473696c R12: ffff986bc2bb9aa0 [ 0.624673] R13: ffff986bc2b63240 R14: ffff986bc2b63240 R15: ffff986bc2b631c0 [ 0.625355] FS: 0000000000000000(0000) GS:ffff986bdf400000(0000) knlGS:0000000000000000 [ 0.626111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.626653] CR2: ffff986bd3801000 CR3: 000000001242a000 CR4: 00000000000006f0 [ 0.627325] Call Trace: [ 0.627559] <TASK> [ 0.627760] ? die+0x36/0x90 [ 0.628039] ? do_trap+0xdd/0x100 [ 0.628354] ? __list_add_valid_or_report+0x78/0xa0 [ 0.628814] ? do_error_trap+0x6a/0x90 [ 0.629172] ? __list_add_valid_or_report+0x78/0xa0 [ 0.629633] ? exc_invalid_op+0x50/0x70 [ 0.630001] ? __list_add_valid_or_report+0x78/0xa0 [ 0.630463] ? asm_exc_invalid_op+0x1a/0x20 [ 0.630861] ? __list_add_valid_or_report+0x78/0xa0 [ 0.631324] ? __list_add_valid_or_report+0x78/0xa0 [ 0.631784] kobject_add_internal+0x78/0x2a0 [ 0.632192] kobject_init_and_add+0x8c/0xd0 [ 0.632589] ? kernfs_find_ns+0x35/0xc0 [ 0.632957] sysfs_slab_add+0x193/0x1e0 [ 0.633318] do_kmem_cache_create+0x455/0x630 [ 0.633727] __kmem_cache_create_args+0x157/0x280 [ 0.634176] ? __pfx_fsverity_init+0x10/0x10 [ 0.634586] fsverity_init_info_cache+0x66/0x90 [ 0.635022] fsverity_init+0x13/0x40 [ 0.635365] do_one_initcall+0x5b/0x320 [ 0.635734] kernel_init_freeable+0x351/0x510 [ 0.636154] ? __pfx_kernel_init+0x10/0x10 [ 0.636547] kernel_init+0x1a/0x1d0 [ 0.636881] ret_from_fork+0x34/0x50 [ 0.637227] ? __pfx_kernel_init+0x10/0x10 [ 0.637619] ret_from_fork_asm+0x1a/0x30 [ 0.637998] </TASK> [ 0.638211] Modules linked in: [ 0.638512] ---[ end trace 0000000000000000 ]--- [ 0.638962] RIP: 0010:__list_add_valid_or_report+0x78/0xa0 [ 0.639483] Code: 8b ff 0f 0b 48 89 c1 48 c7 c7 50 ca c0 9a e8 2f f4 8b ff 0f 0b 48 89 d1 48 89 c6 b [ 0.641253] RSP: 0018:ffffa47380013c98 EFLAGS: 00010246 [ 0.641753] RAX: 0000000000000075 RBX: ffff986bc2b63238 RCX: ffffffff9b5646a8 [ 0.642431] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 [ 0.643118] RBP: ffff986bc2bb9ab8 R08: 0000000000000000 R09: 205d333032353136 [ 0.643795] R10: 74707572726f6320 R11: 6464615f7473696c R12: ffff986bc2bb9aa0 [ 0.644471] R13: ffff986bc2b63240 R14: ffff986bc2b63240 R15: ffff986bc2b631c0 [ 0.645152] FS: 0000000000000000(0000) GS:ffff986bdf400000(0000) knlGS:0000000000000000 [ 0.645923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.646470] CR2: ffff986bd3801000 CR3: 000000001242a000 CR4: 00000000000006f0 [ 0.647156] note: swapper/0[1] exited with preempt_count 1 [ 0.647686] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 0.648446] Kernel Offset: 0x18000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000) [ 0.649466] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- > >> + } > >> > >> if (s->flags & SLAB_STORE_USER) > >> debugfs_slab_add(s); > >> @@ -7276,7 +7283,8 @@ static int sysfs_slab_add(struct kmem_cache *s) > >> > >> void sysfs_slab_unlink(struct kmem_cache *s) > >> { > >> - kobject_del(&s->kobj); > >> + if (s->kobj.state_in_sysfs) > >> + kobject_del(&s->kobj); > >> } > >> > >> void sysfs_slab_release(struct kmem_cache *s) > >> @@ -7305,6 +7313,11 @@ static int sysfs_slab_alias(struct kmem_cache *s, const char *name) > >> * If we have a leftover link then remove it. > >> */ > >> sysfs_remove_link(&slab_kset->kobj, name); > >> + /* > >> + * The original cache may have failed to generate sysfs file. > >> + * In that case, sysfs_create_link() returns -ENOENT and > >> + * symbolic link creation is skipped. > >> + */ > >> return sysfs_create_link(&slab_kset->kobj, &s->kobj, name); > >> } > >> > >> -- > >> 2.45.0 > >> >
On 11/2/24 8:18 AM, Hyeonggon Yoo wrote: >> >> Also here, and simplified to "if (sysfs_slab_add(s)) ... " to avoid dealing >> with err. > > Oh no. err is initialized to -EINVAL, so that will not work as intended. > It is causing the following list corruption. Ooops, right, thanks a lot. Wrongly assumed that a test boot in virtme-ng would catch silly mistakes like that. Looks like all caches were created with SLAB_STATE < FULL. Fixed by setting err = 0 before trying sysfs add. > [ 0.607833] __kmem_cache_create_args(fscrypt_inode_info) failed > with error -22 > [ 0.608518] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc2+ #63 > [ 0.609181] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.16.2-1.fc38 04/01/2014 > [ 0.610000] Call Trace: > [ 0.610233] <TASK> > [ 0.610433] dump_stack_lvl+0x64/0x80 > [ 0.610806] __kmem_cache_create_args+0x1eb/0x280 > [ 0.611253] ? __pfx_fscrypt_init+0x10/0x10 > [ 0.611647] fscrypt_init+0x88/0xf0 > [ 0.611980] ? __pfx_fscrypt_init+0x10/0x10 > [ 0.612373] do_one_initcall+0x5b/0x320 > [ 0.612736] kernel_init_freeable+0x351/0x510 > [ 0.613150] ? __pfx_kernel_init+0x10/0x10 > [ 0.613536] kernel_init+0x1a/0x1d0 > [ 0.613865] ret_from_fork+0x34/0x50 > [ 0.614207] ? __pfx_kernel_init+0x10/0x10 > [ 0.614591] ret_from_fork_asm+0x1a/0x30 > [ 0.614968] </TASK> > [ 0.615203] list_add corruption. prev->next should be next > (ffff986bc2bb9aa0), but was ffff986bc2b6. > [ 0.616308] ------------[ cut here ]------------ > [ 0.616746] kernel BUG at lib/list_debug.c:32! > [ 0.617173] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > [ 0.617709] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc2+ #63 > [ 0.618372] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.16.2-1.fc38 04/01/2014 > [ 0.619192] RIP: 0010:__list_add_valid_or_report+0x78/0xa0 > [ 0.619714] Code: 8b ff 0f 0b 48 89 c1 48 c7 c7 50 ca c0 9a e8 2f > f4 8b ff 0f 0b 48 89 d1 48 89 c6 b > [ 0.621473] RSP: 0018:ffffa47380013c98 EFLAGS: 00010246 > [ 0.621969] RAX: 0000000000000075 RBX: ffff986bc2b63238 RCX: ffffffff9b5646a8 > [ 0.622638] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 > [ 0.623313] RBP: ffff986bc2bb9ab8 R08: 0000000000000000 R09: 205d333032353136 > [ 0.623996] R10: 74707572726f6320 R11: 6464615f7473696c R12: ffff986bc2bb9aa0 > [ 0.624673] R13: ffff986bc2b63240 R14: ffff986bc2b63240 R15: ffff986bc2b631c0 > [ 0.625355] FS: 0000000000000000(0000) GS:ffff986bdf400000(0000) > knlGS:0000000000000000 > [ 0.626111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 0.626653] CR2: ffff986bd3801000 CR3: 000000001242a000 CR4: 00000000000006f0 > [ 0.627325] Call Trace: > [ 0.627559] <TASK> > [ 0.627760] ? die+0x36/0x90 > [ 0.628039] ? do_trap+0xdd/0x100 > [ 0.628354] ? __list_add_valid_or_report+0x78/0xa0 > [ 0.628814] ? do_error_trap+0x6a/0x90 > [ 0.629172] ? __list_add_valid_or_report+0x78/0xa0 > [ 0.629633] ? exc_invalid_op+0x50/0x70 > [ 0.630001] ? __list_add_valid_or_report+0x78/0xa0 > [ 0.630463] ? asm_exc_invalid_op+0x1a/0x20 > [ 0.630861] ? __list_add_valid_or_report+0x78/0xa0 > [ 0.631324] ? __list_add_valid_or_report+0x78/0xa0 > [ 0.631784] kobject_add_internal+0x78/0x2a0 > [ 0.632192] kobject_init_and_add+0x8c/0xd0 > [ 0.632589] ? kernfs_find_ns+0x35/0xc0 > [ 0.632957] sysfs_slab_add+0x193/0x1e0 > [ 0.633318] do_kmem_cache_create+0x455/0x630 > [ 0.633727] __kmem_cache_create_args+0x157/0x280 > [ 0.634176] ? __pfx_fsverity_init+0x10/0x10 > [ 0.634586] fsverity_init_info_cache+0x66/0x90 > [ 0.635022] fsverity_init+0x13/0x40 > [ 0.635365] do_one_initcall+0x5b/0x320 > [ 0.635734] kernel_init_freeable+0x351/0x510 > [ 0.636154] ? __pfx_kernel_init+0x10/0x10 > [ 0.636547] kernel_init+0x1a/0x1d0 > [ 0.636881] ret_from_fork+0x34/0x50 > [ 0.637227] ? __pfx_kernel_init+0x10/0x10 > [ 0.637619] ret_from_fork_asm+0x1a/0x30 > [ 0.637998] </TASK> > [ 0.638211] Modules linked in: > [ 0.638512] ---[ end trace 0000000000000000 ]--- > [ 0.638962] RIP: 0010:__list_add_valid_or_report+0x78/0xa0 > [ 0.639483] Code: 8b ff 0f 0b 48 89 c1 48 c7 c7 50 ca c0 9a e8 2f > f4 8b ff 0f 0b 48 89 d1 48 89 c6 b > [ 0.641253] RSP: 0018:ffffa47380013c98 EFLAGS: 00010246 > [ 0.641753] RAX: 0000000000000075 RBX: ffff986bc2b63238 RCX: ffffffff9b5646a8 > [ 0.642431] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 > [ 0.643118] RBP: ffff986bc2bb9ab8 R08: 0000000000000000 R09: 205d333032353136 > [ 0.643795] R10: 74707572726f6320 R11: 6464615f7473696c R12: ffff986bc2bb9aa0 > [ 0.644471] R13: ffff986bc2b63240 R14: ffff986bc2b63240 R15: ffff986bc2b631c0 > [ 0.645152] FS: 0000000000000000(0000) GS:ffff986bdf400000(0000) > knlGS:0000000000000000 > [ 0.645923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 0.646470] CR2: ffff986bd3801000 CR3: 000000001242a000 CR4: 00000000000006f0 > [ 0.647156] note: swapper/0[1] exited with preempt_count 1 > [ 0.647686] Kernel panic - not syncing: Attempted to kill init! > exitcode=0x0000000b > [ 0.648446] Kernel Offset: 0x18000000 from 0xffffffff81000000 > (relocation range: 0xffffffff80000000) > [ 0.649466] ---[ end Kernel panic - not syncing: Attempted to kill > init! exitcode=0x0000000b ]--- > > >>>> + } >>>> >>>> if (s->flags & SLAB_STORE_USER) >>>> debugfs_slab_add(s); >>>> @@ -7276,7 +7283,8 @@ static int sysfs_slab_add(struct kmem_cache *s) >>>> >>>> void sysfs_slab_unlink(struct kmem_cache *s) >>>> { >>>> - kobject_del(&s->kobj); >>>> + if (s->kobj.state_in_sysfs) >>>> + kobject_del(&s->kobj); >>>> } >>>> >>>> void sysfs_slab_release(struct kmem_cache *s) >>>> @@ -7305,6 +7313,11 @@ static int sysfs_slab_alias(struct kmem_cache *s, const char *name) >>>> * If we have a leftover link then remove it. >>>> */ >>>> sysfs_remove_link(&slab_kset->kobj, name); >>>> + /* >>>> + * The original cache may have failed to generate sysfs file. >>>> + * In that case, sysfs_create_link() returns -ENOENT and >>>> + * symbolic link creation is skipped. >>>> + */ >>>> return sysfs_create_link(&slab_kset->kobj, &s->kobj, name); >>>> } >>>> >>>> -- >>>> 2.45.0 >>>> >>
On Sat, Nov 2, 2024 at 5:50 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 11/2/24 8:18 AM, Hyeonggon Yoo wrote: > >> > >> Also here, and simplified to "if (sysfs_slab_add(s)) ... " to avoid dealing > >> with err. > > > > Oh no. err is initialized to -EINVAL, so that will not work as intended. > > It is causing the following list corruption. > > Ooops, right, thanks a lot. Wrongly assumed that a test boot in > virtme-ng would catch silly mistakes like that. Looks like all caches > were created with SLAB_STATE < FULL. > > Fixed by setting err = 0 before trying sysfs add. Thanks! Hmm... by the way, why doesn't SLUB update 'err' in the event of an error in init_cache_random_seq(), init_kmem_cache_nodes(), or alloc_kmem_cache_cpus()? I may be missing something, but it doesn't seem to handle these errors properly to me... > > > [ 0.607833] __kmem_cache_create_args(fscrypt_inode_info) failed > > with error -22 > > [ 0.608518] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc2+ #63 > > [ 0.609181] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > BIOS 1.16.2-1.fc38 04/01/2014 > > [ 0.610000] Call Trace: > > [ 0.610233] <TASK> > > [ 0.610433] dump_stack_lvl+0x64/0x80 > > [ 0.610806] __kmem_cache_create_args+0x1eb/0x280 > > [ 0.611253] ? __pfx_fscrypt_init+0x10/0x10 > > [ 0.611647] fscrypt_init+0x88/0xf0 > > [ 0.611980] ? __pfx_fscrypt_init+0x10/0x10 > > [ 0.612373] do_one_initcall+0x5b/0x320 > > [ 0.612736] kernel_init_freeable+0x351/0x510 > > [ 0.613150] ? __pfx_kernel_init+0x10/0x10 > > [ 0.613536] kernel_init+0x1a/0x1d0 > > [ 0.613865] ret_from_fork+0x34/0x50 > > [ 0.614207] ? __pfx_kernel_init+0x10/0x10 > > [ 0.614591] ret_from_fork_asm+0x1a/0x30 > > [ 0.614968] </TASK> > > [ 0.615203] list_add corruption. prev->next should be next > > (ffff986bc2bb9aa0), but was ffff986bc2b6. > > [ 0.616308] ------------[ cut here ]------------ > > [ 0.616746] kernel BUG at lib/list_debug.c:32! > > [ 0.617173] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > > [ 0.617709] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc2+ #63 > > [ 0.618372] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > BIOS 1.16.2-1.fc38 04/01/2014 > > [ 0.619192] RIP: 0010:__list_add_valid_or_report+0x78/0xa0 > > [ 0.619714] Code: 8b ff 0f 0b 48 89 c1 48 c7 c7 50 ca c0 9a e8 2f > > f4 8b ff 0f 0b 48 89 d1 48 89 c6 b > > [ 0.621473] RSP: 0018:ffffa47380013c98 EFLAGS: 00010246 > > [ 0.621969] RAX: 0000000000000075 RBX: ffff986bc2b63238 RCX: ffffffff9b5646a8 > > [ 0.622638] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 > > [ 0.623313] RBP: ffff986bc2bb9ab8 R08: 0000000000000000 R09: 205d333032353136 > > [ 0.623996] R10: 74707572726f6320 R11: 6464615f7473696c R12: ffff986bc2bb9aa0 > > [ 0.624673] R13: ffff986bc2b63240 R14: ffff986bc2b63240 R15: ffff986bc2b631c0 > > [ 0.625355] FS: 0000000000000000(0000) GS:ffff986bdf400000(0000) > > knlGS:0000000000000000 > > [ 0.626111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 0.626653] CR2: ffff986bd3801000 CR3: 000000001242a000 CR4: 00000000000006f0 > > [ 0.627325] Call Trace: > > [ 0.627559] <TASK> > > [ 0.627760] ? die+0x36/0x90 > > [ 0.628039] ? do_trap+0xdd/0x100 > > [ 0.628354] ? __list_add_valid_or_report+0x78/0xa0 > > [ 0.628814] ? do_error_trap+0x6a/0x90 > > [ 0.629172] ? __list_add_valid_or_report+0x78/0xa0 > > [ 0.629633] ? exc_invalid_op+0x50/0x70 > > [ 0.630001] ? __list_add_valid_or_report+0x78/0xa0 > > [ 0.630463] ? asm_exc_invalid_op+0x1a/0x20 > > [ 0.630861] ? __list_add_valid_or_report+0x78/0xa0 > > [ 0.631324] ? __list_add_valid_or_report+0x78/0xa0 > > [ 0.631784] kobject_add_internal+0x78/0x2a0 > > [ 0.632192] kobject_init_and_add+0x8c/0xd0 > > [ 0.632589] ? kernfs_find_ns+0x35/0xc0 > > [ 0.632957] sysfs_slab_add+0x193/0x1e0 > > [ 0.633318] do_kmem_cache_create+0x455/0x630 > > [ 0.633727] __kmem_cache_create_args+0x157/0x280 > > [ 0.634176] ? __pfx_fsverity_init+0x10/0x10 > > [ 0.634586] fsverity_init_info_cache+0x66/0x90 > > [ 0.635022] fsverity_init+0x13/0x40 > > [ 0.635365] do_one_initcall+0x5b/0x320 > > [ 0.635734] kernel_init_freeable+0x351/0x510 > > [ 0.636154] ? __pfx_kernel_init+0x10/0x10 > > [ 0.636547] kernel_init+0x1a/0x1d0 > > [ 0.636881] ret_from_fork+0x34/0x50 > > [ 0.637227] ? __pfx_kernel_init+0x10/0x10 > > [ 0.637619] ret_from_fork_asm+0x1a/0x30 > > [ 0.637998] </TASK> > > [ 0.638211] Modules linked in: > > [ 0.638512] ---[ end trace 0000000000000000 ]--- > > [ 0.638962] RIP: 0010:__list_add_valid_or_report+0x78/0xa0 > > [ 0.639483] Code: 8b ff 0f 0b 48 89 c1 48 c7 c7 50 ca c0 9a e8 2f > > f4 8b ff 0f 0b 48 89 d1 48 89 c6 b > > [ 0.641253] RSP: 0018:ffffa47380013c98 EFLAGS: 00010246 > > [ 0.641753] RAX: 0000000000000075 RBX: ffff986bc2b63238 RCX: ffffffff9b5646a8 > > [ 0.642431] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 > > [ 0.643118] RBP: ffff986bc2bb9ab8 R08: 0000000000000000 R09: 205d333032353136 > > [ 0.643795] R10: 74707572726f6320 R11: 6464615f7473696c R12: ffff986bc2bb9aa0 > > [ 0.644471] R13: ffff986bc2b63240 R14: ffff986bc2b63240 R15: ffff986bc2b631c0 > > [ 0.645152] FS: 0000000000000000(0000) GS:ffff986bdf400000(0000) > > knlGS:0000000000000000 > > [ 0.645923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 0.646470] CR2: ffff986bd3801000 CR3: 000000001242a000 CR4: 00000000000006f0 > > [ 0.647156] note: swapper/0[1] exited with preempt_count 1 > > [ 0.647686] Kernel panic - not syncing: Attempted to kill init! > > exitcode=0x0000000b > > [ 0.648446] Kernel Offset: 0x18000000 from 0xffffffff81000000 > > (relocation range: 0xffffffff80000000) > > [ 0.649466] ---[ end Kernel panic - not syncing: Attempted to kill > > init! exitcode=0x0000000b ]--- > > > > > >>>> + } > >>>> > >>>> if (s->flags & SLAB_STORE_USER) > >>>> debugfs_slab_add(s); > >>>> @@ -7276,7 +7283,8 @@ static int sysfs_slab_add(struct kmem_cache *s) > >>>> > >>>> void sysfs_slab_unlink(struct kmem_cache *s) > >>>> { > >>>> - kobject_del(&s->kobj); > >>>> + if (s->kobj.state_in_sysfs) > >>>> + kobject_del(&s->kobj); > >>>> } > >>>> > >>>> void sysfs_slab_release(struct kmem_cache *s) > >>>> @@ -7305,6 +7313,11 @@ static int sysfs_slab_alias(struct kmem_cache *s, const char *name) > >>>> * If we have a leftover link then remove it. > >>>> */ > >>>> sysfs_remove_link(&slab_kset->kobj, name); > >>>> + /* > >>>> + * The original cache may have failed to generate sysfs file. > >>>> + * In that case, sysfs_create_link() returns -ENOENT and > >>>> + * symbolic link creation is skipped. > >>>> + */ > >>>> return sysfs_create_link(&slab_kset->kobj, &s->kobj, name); > >>>> } > >>>> > >>>> -- > >>>> 2.45.0 > >>>> > >>
On Sat, Nov 2, 2024 at 7:07 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: > > On Sat, Nov 2, 2024 at 5:50 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > > On 11/2/24 8:18 AM, Hyeonggon Yoo wrote: > > >> > > >> Also here, and simplified to "if (sysfs_slab_add(s)) ... " to avoid dealing > > >> with err. > > > > > > Oh no. err is initialized to -EINVAL, so that will not work as intended. > > > It is causing the following list corruption. > > > > Ooops, right, thanks a lot. Wrongly assumed that a test boot in > > virtme-ng would catch silly mistakes like that. Looks like all caches > > were created with SLAB_STATE < FULL. > > > > Fixed by setting err = 0 before trying sysfs add. > > Thanks! > > Hmm... by the way, why doesn't SLUB update 'err' in the event of an error in > init_cache_random_seq(), init_kmem_cache_nodes(), or alloc_kmem_cache_cpus()? > I may be missing something, but it doesn't seem to handle these errors > properly to me... Oh, it seems like a recent change fc0eac57d08c ("slab: pull kmem_cache_open() into do_kmem_cache_create()") incorrectly pulled kmem_cache_open()? Cc-ing Christian Brauner.
On Sat, Nov 2, 2024 at 7:16 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: > > On Sat, Nov 2, 2024 at 7:07 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: > > > > On Sat, Nov 2, 2024 at 5:50 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > > > > On 11/2/24 8:18 AM, Hyeonggon Yoo wrote: > > > >> > > > >> Also here, and simplified to "if (sysfs_slab_add(s)) ... " to avoid dealing > > > >> with err. > > > > > > > > Oh no. err is initialized to -EINVAL, so that will not work as intended. > > > > It is causing the following list corruption. > > > > > > Ooops, right, thanks a lot. Wrongly assumed that a test boot in > > > virtme-ng would catch silly mistakes like that. Looks like all caches > > > were created with SLAB_STATE < FULL. > > > > > > Fixed by setting err = 0 before trying sysfs add. > > > > Thanks! > > > > Hmm... by the way, why doesn't SLUB update 'err' in the event of an error in > > init_cache_random_seq(), init_kmem_cache_nodes(), or alloc_kmem_cache_cpus()? > > I may be missing something, but it doesn't seem to handle these errors > > properly to me... > > Oh, it seems like a recent change fc0eac57d08c ("slab: pull kmem_cache_open() > into do_kmem_cache_create()") incorrectly pulled kmem_cache_open()? > > Cc-ing Christian Brauner. Apologies for the oversight; it slipped my mind. It was actually correct after all. :(
On 11/2/24 15:51, Hyeonggon Yoo wrote: > On Sat, Nov 2, 2024 at 7:16 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: >> >> On Sat, Nov 2, 2024 at 7:07 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: >> > >> > On Sat, Nov 2, 2024 at 5:50 PM Vlastimil Babka <vbabka@suse.cz> wrote: >> > > >> > > On 11/2/24 8:18 AM, Hyeonggon Yoo wrote: >> > > >> >> > > >> Also here, and simplified to "if (sysfs_slab_add(s)) ... " to avoid dealing >> > > >> with err. >> > > > >> > > > Oh no. err is initialized to -EINVAL, so that will not work as intended. >> > > > It is causing the following list corruption. >> > > >> > > Ooops, right, thanks a lot. Wrongly assumed that a test boot in >> > > virtme-ng would catch silly mistakes like that. Looks like all caches >> > > were created with SLAB_STATE < FULL. >> > > >> > > Fixed by setting err = 0 before trying sysfs add. >> > >> > Thanks! >> > >> > Hmm... by the way, why doesn't SLUB update 'err' in the event of an error in >> > init_cache_random_seq(), init_kmem_cache_nodes(), or alloc_kmem_cache_cpus()? >> > I may be missing something, but it doesn't seem to handle these errors >> > properly to me... >> >> Oh, it seems like a recent change fc0eac57d08c ("slab: pull kmem_cache_open() >> into do_kmem_cache_create()") incorrectly pulled kmem_cache_open()? >> >> Cc-ing Christian Brauner. > > Apologies for the oversight; it slipped my mind. It was actually > correct after all. :( Yeah AFAICS that commit didn't change it. But yeah it would make sense to clean up those init/alloc functions so they all return 0 on success and error code on failure, and wire that up so allocation failures are -ENOMEM and not -EINVAL. Probably only some fault injection environments are going to notice, though.
diff --git a/mm/slub.c b/mm/slub.c index 151a987dc3a0..b4b211468f77 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -6116,7 +6116,8 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align, s = find_mergeable(size, align, flags, name, ctor); if (s) { if (sysfs_slab_alias(s, name)) - return NULL; + pr_err("SLUB: Unable to add slab alias %s to sysfs\n", + name); s->refcount++; @@ -6204,9 +6205,15 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name, goto out; } + /* + * Failing to create sysfs files is not critical to SLUB functionality. + * If it fails, proceed with cache creation without these files. + */ err = sysfs_slab_add(s); - if (err) - goto out; + if (err) { + err = 0; + pr_err("SLUB: Unable to add slab %s to sysfs\n", s->name); + } if (s->flags & SLAB_STORE_USER) debugfs_slab_add(s); @@ -7276,7 +7283,8 @@ static int sysfs_slab_add(struct kmem_cache *s) void sysfs_slab_unlink(struct kmem_cache *s) { - kobject_del(&s->kobj); + if (s->kobj.state_in_sysfs) + kobject_del(&s->kobj); } void sysfs_slab_release(struct kmem_cache *s) @@ -7305,6 +7313,11 @@ static int sysfs_slab_alias(struct kmem_cache *s, const char *name) * If we have a leftover link then remove it. */ sysfs_remove_link(&slab_kset->kobj, name); + /* + * The original cache may have failed to generate sysfs file. + * In that case, sysfs_create_link() returns -ENOENT and + * symbolic link creation is skipped. + */ return sysfs_create_link(&slab_kset->kobj, &s->kobj, name); }
When kobject_init_and_add() fails during cache creation, kobj->name can be leaked because SLUB does not call kobject_put(), which should be invoked per the kobject API documentation. This has a bit of historical context, though; SLUB does not call kobject_put() to avoid double-free for struct kmem_cache because 1) simply calling it would free all resources related to the cache, and 2) struct kmem_cache descriptor is always freed by cache_cache()'s error handling path, causing struct kmem_cache to be freed twice. This issue can be reproduced by creating new slab caches while applying failslab for kernfs_node_cache. This makes kobject_add_varg() succeed, but causes kobject_add_internal() to fail in kobject_init_and_add() during cache creation. Historically, this issue has attracted developers' attention several times. Each time a fix addressed either the leak or the double-free, it caused the other issue. Let's summarize a bit of history here: The leak has existed since the early days of SLUB. Commit 54b6a731025f ("slub: fix leak of 'name' in sysfs_slab_add") introduced a double-free bug while fixing the leak. Commit 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate sysfs filename") re-introduced the leak while fixing the double-free error. Commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") fixed the memory leak, but it was later reverted by commit 757fed1d0898 ("Revert "mm/slub: fix a memory leak in sysfs_slab_add()"") to avoid the double-free error. This is where we are now: we've chosen a memory leak over a double-free. To resolve this memory leak, skip creating sysfs files if it fails and continue with cache creation regardless (as suggested by Christoph). This resolves the memory leak because both the cache and the kobject remain alive on kobject_init_and_add() failure. If SLUB tries to create an alias for a cache without sysfs files, its symbolic link will not be generated. Since a slab cache might not have associated sysfs files, call kobject_del() only if such files exist. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> --- RFC -> v1: Make sysfs optional instead of destroying the cache on sysfs errors. (Suggested by Christoph) mm/slub.c | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-)