Message ID | 20240807-b4-slab-kfree_rcu-destroy-v2-7-ea79102f428c@suse.cz (mailing list archive) |
---|---|
State | Accepted |
Commit | 4e1c44b3db79ba910adec32e2e1b920a0e34890a |
Headers | show |
Series | mm, slub: handle pending kfree_rcu() in kmem_cache_destroy() | expand |
On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote: > Add a test that will create cache, allocate one object, kfree_rcu() it > and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() > in kmem_cache_destroy() works correctly, there should be no warnings in > dmesg and the test should pass. > > Additionally add a test_leak_destroy() test that leaks an object on > purpose and verifies that kmem_cache_destroy() catches it. > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > --- > lib/slub_kunit.c | 31 +++++++++++++++++++++++++++++++ > 1 file changed, 31 insertions(+) > > diff --git a/lib/slub_kunit.c b/lib/slub_kunit.c > index e6667a28c014..6e3a1e5a7142 100644 > --- a/lib/slub_kunit.c > +++ b/lib/slub_kunit.c > @@ -5,6 +5,7 @@ > #include <linux/slab.h> > #include <linux/module.h> > #include <linux/kernel.h> > +#include <linux/rcupdate.h> > #include "../mm/slab.h" > > static struct kunit_resource resource; > @@ -157,6 +158,34 @@ static void test_kmalloc_redzone_access(struct kunit *test) > kmem_cache_destroy(s); > } > > +struct test_kfree_rcu_struct { > + struct rcu_head rcu; > +}; > + > +static void test_kfree_rcu(struct kunit *test) > +{ > + struct kmem_cache *s = test_kmem_cache_create("TestSlub_kfree_rcu", > + sizeof(struct test_kfree_rcu_struct), > + SLAB_NO_MERGE); > + struct test_kfree_rcu_struct *p = kmem_cache_alloc(s, GFP_KERNEL); > + > + kfree_rcu(p, rcu); > + kmem_cache_destroy(s); > + > + KUNIT_EXPECT_EQ(test, 0, slab_errors); > +} > + > Thank you for this test case! I used this series to test _more_ the barrier and came to conclusion that it is not enough, i.e. i had to extend it to something like below: <snip> + snprintf(name, sizeof(name), "test-slub-%d", current->pid); + + for (i = 0; i < test_loop_count; i++) { + s = test_kmem_cache_create(name, sizeof(struct test_kfree_rcu_struct), + SLAB_NO_MERGE); + + if (!s) + BUG(); + + get_random_bytes(&nr_to_alloc, sizeof(nr_to_alloc)); + nr_to_alloc = nr_to_alloc % 1000000; + INIT_LIST_HEAD(&local_head); + + for (j = 0; j < nr_to_alloc; j++) { + p = kmem_cache_alloc(s, GFP_KERNEL); + + if (p) + list_add(&p->list, &local_head); + } + + list_for_each_entry_safe(p, n, &local_head, list) + kfree_rcu(p, rcu); + + kmem_cache_destroy(s); + } <snip> by using this(~11 parallel jobs) i could trigger a warning that a freed cache still has some objects and i have already figured out why. I will send a v2 of barrier implementation with a fix. Thanks! -- Uladzislau Rezki
On Wed, Aug 7, 2024 at 7:31 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > Add a test that will create cache, allocate one object, kfree_rcu() it > and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() > in kmem_cache_destroy() works correctly, there should be no warnings in > dmesg and the test should pass. > > Additionally add a test_leak_destroy() test that leaks an object on > purpose and verifies that kmem_cache_destroy() catches it. > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > --- > lib/slub_kunit.c | 31 +++++++++++++++++++++++++++++++ > 1 file changed, 31 insertions(+) > Hi Vlastimil, I think we might need to suppress the WARN() due to the active objects in kmem_cache_destroy() when it's called from slub_kunit. With this change, the warning below will be printed every time slub_kunit is loaded, which made me wonder if there's a bug (for a while). Actually, SLUB calls pr_err() is called by __kmem_cache_shutdown() if there are any active objects during destruction, and the error message is suppressed by slub_kunit. However, kmem_cache_destroy() still calls WARN() regardless if there is any error during shutdown. [ 147.546531] Object 0x00000000c09342ca @offset=640 [ 147.546542] ------------[ cut here ]------------ [ 147.546544] kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x74/0x108 [slub_kunit] [ 147.546579] WARNING: CPU: 5 PID: 39703 at mm/slab_common.c:507 kmem_cache_destroy+0x174/0x188 [ 147.546587] Modules linked in: slub_kunit uinput snd_seq_dummy snd_hrtimer rfkill nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct sunrpc nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr binfmt_misc vfat fat snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec uvcvideo snd_hda_core uvc snd_hwdep videobuf2_vmalloc snd_seq videobuf2_memops snd_seq_device videobuf2_v4l2 snd_pcm videobuf2_common snd_timer snd videodev mc soundcore virtio_balloon acpi_tad joydev loop zram virtio_gpu ahci_platform libahci_platform virtio_dma_buf crct10dif_ce polyval_ce polyval_generic ghash_ce sha3_ce virtio_net sha512_ce net_failover sha512_arm64 failover virtio_mmio ip6_tables ip_tables fuse [ 147.546646] CPU: 5 UID: 0 PID: 39703 Comm: kunit_try_catch Tainted: G N 6.11.0-rc7-next-20240912 #2 [ 147.546649] Tainted: [N]=TEST [ 147.546650] Hardware name: Parallels International GmbH. Parallels ARM Virtual Machine/Parallels ARM Virtual Platform, BIOS 20.0.0 (55653) Thu, 05 Sep 202 [ 147.546652] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 147.546655] pc : kmem_cache_destroy+0x174/0x188 [ 147.546657] lr : kmem_cache_destroy+0x174/0x188 [ 147.546659] sp : ffff80008aba3d60 [ 147.546660] x29: ffff80008aba3d60 x28: 0000000000000000 x27: 0000000000000000 [ 147.546662] x26: 0000000000000000 x25: 0000000000000000 x24: ffff800094a2b438 [ 147.546665] x23: ffff80008089b750 x22: 0000000000000001 x21: f9cc80007c1782f4 [ 147.546666] x20: ffff800082f9d088 x19: ffff0000c2308b00 x18: 00000000fffffffd [ 147.546668] x17: 0000000046d4ed9c x16: 00000000ae1ad4db x15: ffff80008aba3430 [ 147.546670] x14: 0000000000000001 x13: ffff80008aba3657 x12: ffff800082f0f060 [ 147.546679] x11: 0000000000000001 x10: 0000000000000001 x9 : ffff8000801652c8 [ 147.546682] x8 : c0000000ffffdfff x7 : ffff800082e5ee68 x6 : 00000000000affa8 [ 147.546684] x5 : ffff00031fc70448 x4 : 0000000000000000 x3 : ffff80029d6b7000 [ 147.546686] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00011f1c8000 [ 147.546688] Call trace: [ 147.546689] kmem_cache_destroy+0x174/0x188 [ 147.546692] test_leak_destroy+0x74/0x108 [slub_kunit] [ 147.546693] kunit_try_run_case+0x74/0x170 [ 147.546697] kunit_generic_run_threadfn_adapter+0x30/0x60 [ 147.546699] kthread+0xf4/0x108 [ 147.546705] ret_from_fork+0x10/0x20 [ 147.546710] ---[ end trace 0000000000000000 ]---
On 9/14/24 15:22, Hyeonggon Yoo wrote: > On Wed, Aug 7, 2024 at 7:31 PM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> Add a test that will create cache, allocate one object, kfree_rcu() it >> and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() >> in kmem_cache_destroy() works correctly, there should be no warnings in >> dmesg and the test should pass. >> >> Additionally add a test_leak_destroy() test that leaks an object on >> purpose and verifies that kmem_cache_destroy() catches it. >> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> >> --- >> lib/slub_kunit.c | 31 +++++++++++++++++++++++++++++++ >> 1 file changed, 31 insertions(+) >> > > Hi Vlastimil, > > I think we might need to suppress the WARN() due to the active objects > in kmem_cache_destroy() > when it's called from slub_kunit. With this change, the warning below > will be printed every time > slub_kunit is loaded, which made me wonder if there's a bug (for a while). > > Actually, SLUB calls pr_err() is called by __kmem_cache_shutdown() if > there are any active objects > during destruction, and the error message is suppressed by slub_kunit. > However, kmem_cache_destroy() > still calls WARN() regardless if there is any error during shutdown. Yeah, there was a LKP report about it already and I wanted to handle this but forgot. It's not wrong to produce warnings during the tests, for example the KASAN tests generate tons of them. But it's true that we suppress them for slub and should continue so for consistency and not having to teach lkp what can be ignored. But I think it's fine if we add the suppressing during the rc stabilization phase so will send the PR for merge window as it is, too late now. Want to take a stab at the patch? :) Vlastimil > [ 147.546531] Object 0x00000000c09342ca @offset=640 > [ 147.546542] ------------[ cut here ]------------ > [ 147.546544] kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still > has objects when called from test_leak_destroy+0x74/0x108 [slub_kunit] > [ 147.546579] WARNING: CPU: 5 PID: 39703 at mm/slab_common.c:507 > kmem_cache_destroy+0x174/0x188 > [ 147.546587] Modules linked in: slub_kunit uinput snd_seq_dummy > snd_hrtimer rfkill nf_conntrack_netbios_ns nf_conntrack_broadcast > nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet > nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct sunrpc nft_chain_nat > nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables > nfnetlink qrtr binfmt_misc vfat fat snd_hda_codec_generic > snd_hda_intel snd_intel_dspcfg snd_hda_codec uvcvideo snd_hda_core uvc > snd_hwdep videobuf2_vmalloc snd_seq videobuf2_memops snd_seq_device > videobuf2_v4l2 snd_pcm videobuf2_common snd_timer snd videodev mc > soundcore virtio_balloon acpi_tad joydev loop zram virtio_gpu > ahci_platform libahci_platform virtio_dma_buf crct10dif_ce polyval_ce > polyval_generic ghash_ce sha3_ce virtio_net sha512_ce net_failover > sha512_arm64 failover virtio_mmio ip6_tables ip_tables fuse > [ 147.546646] CPU: 5 UID: 0 PID: 39703 Comm: kunit_try_catch Tainted: > G N 6.11.0-rc7-next-20240912 #2 > [ 147.546649] Tainted: [N]=TEST > [ 147.546650] Hardware name: Parallels International GmbH. Parallels > ARM Virtual Machine/Parallels ARM Virtual Platform, BIOS 20.0.0 > (55653) Thu, 05 Sep 202 > [ 147.546652] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) > [ 147.546655] pc : kmem_cache_destroy+0x174/0x188 > [ 147.546657] lr : kmem_cache_destroy+0x174/0x188 > [ 147.546659] sp : ffff80008aba3d60 > [ 147.546660] x29: ffff80008aba3d60 x28: 0000000000000000 x27: 0000000000000000 > [ 147.546662] x26: 0000000000000000 x25: 0000000000000000 x24: ffff800094a2b438 > [ 147.546665] x23: ffff80008089b750 x22: 0000000000000001 x21: f9cc80007c1782f4 > [ 147.546666] x20: ffff800082f9d088 x19: ffff0000c2308b00 x18: 00000000fffffffd > [ 147.546668] x17: 0000000046d4ed9c x16: 00000000ae1ad4db x15: ffff80008aba3430 > [ 147.546670] x14: 0000000000000001 x13: ffff80008aba3657 x12: ffff800082f0f060 > [ 147.546679] x11: 0000000000000001 x10: 0000000000000001 x9 : ffff8000801652c8 > [ 147.546682] x8 : c0000000ffffdfff x7 : ffff800082e5ee68 x6 : 00000000000affa8 > [ 147.546684] x5 : ffff00031fc70448 x4 : 0000000000000000 x3 : ffff80029d6b7000 > [ 147.546686] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00011f1c8000 > [ 147.546688] Call trace: > [ 147.546689] kmem_cache_destroy+0x174/0x188 > [ 147.546692] test_leak_destroy+0x74/0x108 [slub_kunit] > [ 147.546693] kunit_try_run_case+0x74/0x170 > [ 147.546697] kunit_generic_run_threadfn_adapter+0x30/0x60 > [ 147.546699] kthread+0xf4/0x108 > [ 147.546705] ret_from_fork+0x10/0x20 > [ 147.546710] ---[ end trace 0000000000000000 ]---
Hi, On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote: > Add a test that will create cache, allocate one object, kfree_rcu() it > and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() > in kmem_cache_destroy() works correctly, there should be no warnings in > dmesg and the test should pass. > > Additionally add a test_leak_destroy() test that leaks an object on > purpose and verifies that kmem_cache_destroy() catches it. > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> This test case, when run, triggers a warning traceback. kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x70/0x11c WARNING: CPU: 0 PID: 715 at mm/slab_common.c:511 kmem_cache_destroy+0x1dc/0x1e4 That is, however, not the worst of it. It also causes boot stalls on several platforms and architectures (various arm platforms, arm64, loongarch, various ppc, and various x86_64). Reverting it fixes the problem. Bisect results are attached for reference. Guenter --- # bad: [baeb9a7d8b60b021d907127509c44507539c15e5] Merge tag 'sched-rt-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip # good: [2f27fce67173bbb05d5a0ee03dae5c021202c912] Merge tag 'sound-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound git bisect start 'HEAD' '2f27fce67173' # good: [ae2c6d8b3b88c176dff92028941a4023f1b4cb91] Merge tag 'drm-xe-next-fixes-2024-09-12' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next git bisect good ae2c6d8b3b88c176dff92028941a4023f1b4cb91 # bad: [c8d8a35d094626808cd07ed0758e14c7e4cf61ac] Merge tag 'livepatching-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching git bisect bad c8d8a35d094626808cd07ed0758e14c7e4cf61ac # bad: [cc52dc2fe39ff5dee9916ac2d9381ec3cbf650c0] Merge tag 'pwm/for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux git bisect bad cc52dc2fe39ff5dee9916ac2d9381ec3cbf650c0 # bad: [bdf56c7580d267a123cc71ca0f2459c797b76fde] Merge tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab git bisect bad bdf56c7580d267a123cc71ca0f2459c797b76fde # good: [355debb83bf79853cde43579f88eed16adb1da29] Merge branches 'context_tracking.15.08.24a', 'csd.lock.15.08.24a', 'nocb.09.09.24a', 'rcutorture.14.08.24a', 'rcustall.09.09.24a', 'srcu.12.08.24a', 'rcu.tasks.14.08.24a', 'rcu_scaling_tests.15.08.24a', 'fixes.12.08.24a' and 'misc.11.08.24a' into next.09.09.24a git bisect good 355debb83bf79853cde43579f88eed16adb1da29 # good: [067610ebaaec53809794807842a2fcf5f1f5b9eb] Merge tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux git bisect good 067610ebaaec53809794807842a2fcf5f1f5b9eb # good: [4b7ff9ab98af11a477d50f08382bcc4c2f899926] mm, slab: restore kerneldoc for kmem_cache_create() git bisect good 4b7ff9ab98af11a477d50f08382bcc4c2f899926 # bad: [a715e94dbda4ece41aac49b7b7ff8ddb55a7fe08] Merge branch 'slab/for-6.12/rcu_barriers' into slab/for-next git bisect bad a715e94dbda4ece41aac49b7b7ff8ddb55a7fe08 # bad: [b3c34245756adada8a50bdaedbb3965b071c7b0a] kasan: catch invalid free before SLUB reinitializes the object git bisect bad b3c34245756adada8a50bdaedbb3965b071c7b0a # good: [2eb14c1c2717396f2fb1e4a4c5a1ec87cdd174f6] mm, slab: reintroduce rcu_barrier() into kmem_cache_destroy() git bisect good 2eb14c1c2717396f2fb1e4a4c5a1ec87cdd174f6 # good: [6c6c47b063b593785202be158e61fe5c827d6677] mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy() git bisect good 6c6c47b063b593785202be158e61fe5c827d6677 # bad: [4e1c44b3db79ba910adec32e2e1b920a0e34890a] kunit, slub: add test_kfree_rcu() and test_leak_destroy() git bisect bad 4e1c44b3db79ba910adec32e2e1b920a0e34890a # first bad commit: [4e1c44b3db79ba910adec32e2e1b920a0e34890a] kunit, slub: add test_kfree_rcu() and test_leak_destroy()
+CC kunit folks On 9/20/24 15:35, Guenter Roeck wrote: > Hi, Hi, > On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote: >> Add a test that will create cache, allocate one object, kfree_rcu() it >> and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() >> in kmem_cache_destroy() works correctly, there should be no warnings in >> dmesg and the test should pass. >> >> Additionally add a test_leak_destroy() test that leaks an object on >> purpose and verifies that kmem_cache_destroy() catches it. >> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > This test case, when run, triggers a warning traceback. > > kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x70/0x11c > WARNING: CPU: 0 PID: 715 at mm/slab_common.c:511 kmem_cache_destroy+0x1dc/0x1e4 Yes that should be suppressed like the other slub_kunit tests do. I have assumed it's not that urgent because for example the KASAN kunit tests all produce tons of warnings and thus assumed it's in some way acceptable for kunit tests to do. > That is, however, not the worst of it. It also causes boot stalls on > several platforms and architectures (various arm platforms, arm64, > loongarch, various ppc, and various x86_64). Reverting it fixes the > problem. Bisect results are attached for reference. OK, this part is unexpected. I assume you have the test built-in and not a module, otherwise it can't affect boot? And by stall you mean a delay or a complete lockup? I've tried to reproduce that with virtme, but it seemed fine, maybe it's .config specific? I do wonder about the placement of the call of kunit_run_all_tests() from kernel_init_freeable() as that's before a bunch of initialization finishes. For example, system_state = SYSTEM_RUNNING; and rcu_end_inkernel_boot() only happens later in kernel_init(). I wouldn't be surprised if that means calling kfree_rcu() or rcu_barrier() or kvfree_rcu_barrier() as part of the slub tests is too early. Does the diff below fix the problem? Any advice from kunit folks? I could perhaps possibly make the slab test module-only instead of tristate or do some ifdef builtin on the problematic tests, but maybe it wouldn't be necessary with kunit_run_all_tests() happening a bit later. ----8<---- diff --git a/init/main.c b/init/main.c index c4778edae797..7890ebb00e84 100644 --- a/init/main.c +++ b/init/main.c @@ -1489,6 +1489,8 @@ static int __ref kernel_init(void *unused) rcu_end_inkernel_boot(); + kunit_run_all_tests(); + do_sysctl_args(); if (ramdisk_execute_command) { @@ -1579,8 +1581,6 @@ static noinline void __init kernel_init_freeable(void) do_basic_setup(); - kunit_run_all_tests(); - wait_for_initramfs(); console_on_rootfs();
On 9/21/24 13:40, Vlastimil Babka wrote: > +CC kunit folks > > On 9/20/24 15:35, Guenter Roeck wrote: >> Hi, > > Hi, > >> On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote: >>> Add a test that will create cache, allocate one object, kfree_rcu() it >>> and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() >>> in kmem_cache_destroy() works correctly, there should be no warnings in >>> dmesg and the test should pass. >>> >>> Additionally add a test_leak_destroy() test that leaks an object on >>> purpose and verifies that kmem_cache_destroy() catches it. >>> >>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> >> >> This test case, when run, triggers a warning traceback. >> >> kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x70/0x11c >> WARNING: CPU: 0 PID: 715 at mm/slab_common.c:511 kmem_cache_destroy+0x1dc/0x1e4 > > Yes that should be suppressed like the other slub_kunit tests do. I have > assumed it's not that urgent because for example the KASAN kunit tests all > produce tons of warnings and thus assumed it's in some way acceptable for > kunit tests to do. > I have all tests which generate warning backtraces disabled. Trying to identify which warnings are noise and which warnings are on purpose doesn't scale, so it is all or nothing for me. I tried earlier to introduce a patch series which would enable selective backtrace suppression, but that died the death of architecture maintainers not caring and people demanding it to be perfect (meaning it only addressed WARNING: backtraces and not BUG: backtraces, and apparently that wasn't good enough). If the backtrace is intentional (and I think you are saying that it is), I'll simply disable the test. That may be a bit counter-productive, but there is really no alternative for me. >> That is, however, not the worst of it. It also causes boot stalls on >> several platforms and architectures (various arm platforms, arm64, >> loongarch, various ppc, and various x86_64). Reverting it fixes the >> problem. Bisect results are attached for reference. > > OK, this part is unexpected. I assume you have the test built-in and not a > module, otherwise it can't affect boot? And by stall you mean a delay or a Yes. > complete lockup? I've tried to reproduce that with virtme, but it seemed > fine, maybe it's .config specific? It is a complete lockup. > > I do wonder about the placement of the call of kunit_run_all_tests() from > kernel_init_freeable() as that's before a bunch of initialization finishes. > > For example, system_state = SYSTEM_RUNNING; and rcu_end_inkernel_boot() only > happens later in kernel_init(). I wouldn't be surprised if that means > calling kfree_rcu() or rcu_barrier() or kvfree_rcu_barrier() as part of the > slub tests is too early. > > Does the diff below fix the problem? Any advice from kunit folks? I could > perhaps possibly make the slab test module-only instead of tristate or do > some ifdef builtin on the problematic tests, but maybe it wouldn't be > necessary with kunit_run_all_tests() happening a bit later. > It does, at least based on my limited testing. However, given that the backtrace is apparently intentional, it doesn't really matter - I'll disable the test instead. Thanks, Guenter
On 9/21/24 23:08, Guenter Roeck wrote: > On 9/21/24 13:40, Vlastimil Babka wrote: >> +CC kunit folks >> >> On 9/20/24 15:35, Guenter Roeck wrote: >>> Hi, >> >> Hi, >> >>> On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote: >>>> Add a test that will create cache, allocate one object, kfree_rcu() it >>>> and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() >>>> in kmem_cache_destroy() works correctly, there should be no warnings in >>>> dmesg and the test should pass. >>>> >>>> Additionally add a test_leak_destroy() test that leaks an object on >>>> purpose and verifies that kmem_cache_destroy() catches it. >>>> >>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> >>> >>> This test case, when run, triggers a warning traceback. >>> >>> kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x70/0x11c >>> WARNING: CPU: 0 PID: 715 at mm/slab_common.c:511 kmem_cache_destroy+0x1dc/0x1e4 >> >> Yes that should be suppressed like the other slub_kunit tests do. I have >> assumed it's not that urgent because for example the KASAN kunit tests all >> produce tons of warnings and thus assumed it's in some way acceptable for >> kunit tests to do. >> > > I have all tests which generate warning backtraces disabled. Trying to identify > which warnings are noise and which warnings are on purpose doesn't scale, > so it is all or nothing for me. I tried earlier to introduce a patch series > which would enable selective backtrace suppression, but that died the death > of architecture maintainers not caring and people demanding it to be perfect > (meaning it only addressed WARNING: backtraces and not BUG: backtraces, > and apparently that wasn't good enough). Ah, didn't know, too bad. > If the backtrace is intentional (and I think you are saying that it is), > I'll simply disable the test. That may be a bit counter-productive, but > there is really no alternative for me. It's intentional in the sense that the test intentionally triggers a condition that normally produces a warning. Many if the slub kunit test do that, but are able to suppress printing the warning when it happens in the kunit context. I forgot to do that for the new test initially as the warning there happens from a different path that those that already have the kunit suppression, but we'll implement that suppression there too ASAP. >>> That is, however, not the worst of it. It also causes boot stalls on >>> several platforms and architectures (various arm platforms, arm64, >>> loongarch, various ppc, and various x86_64). Reverting it fixes the >>> problem. Bisect results are attached for reference. >> >> OK, this part is unexpected. I assume you have the test built-in and not a >> module, otherwise it can't affect boot? And by stall you mean a delay or a > > Yes. > >> complete lockup? I've tried to reproduce that with virtme, but it seemed >> fine, maybe it's .config specific? > > It is a complete lockup. > >> >> I do wonder about the placement of the call of kunit_run_all_tests() from >> kernel_init_freeable() as that's before a bunch of initialization finishes. >> >> For example, system_state = SYSTEM_RUNNING; and rcu_end_inkernel_boot() only >> happens later in kernel_init(). I wouldn't be surprised if that means >> calling kfree_rcu() or rcu_barrier() or kvfree_rcu_barrier() as part of the >> slub tests is too early. >> >> Does the diff below fix the problem? Any advice from kunit folks? I could >> perhaps possibly make the slab test module-only instead of tristate or do >> some ifdef builtin on the problematic tests, but maybe it wouldn't be >> necessary with kunit_run_all_tests() happening a bit later. >> > > It does, at least based on my limited testing. However, given that the > backtrace is apparently intentional, it doesn't really matter - I'll disable > the test instead. Right, thanks. I will notify you when the suppression is done so the test can be re-enabled. But as the lockup should not be caused by the warning itself, we will still have to address it, as others configuring kunit tests built-in might see the lockup, or you would again see it as well, once the warning is addressed and test re-enabled. Your testing suggests moving the kunit_run_all_tests() call might be the right fix so let's see what kunit folks think. I would hope that no other kunit test would become broken by being executed later in boot, because when compiled as modules, it already happens even much later already. Thanks! Vlastimil > Thanks, > Guenter >
On Sun, Sep 22, 2024 at 6:25 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 9/21/24 23:08, Guenter Roeck wrote: > > On 9/21/24 13:40, Vlastimil Babka wrote: > >> +CC kunit folks > >> > >> On 9/20/24 15:35, Guenter Roeck wrote: > >>> Hi, > >> > >> Hi, > >> > >>> On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote: > >>>> Add a test that will create cache, allocate one object, kfree_rcu() it > >>>> and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() > >>>> in kmem_cache_destroy() works correctly, there should be no warnings in > >>>> dmesg and the test should pass. > >>>> > >>>> Additionally add a test_leak_destroy() test that leaks an object on > >>>> purpose and verifies that kmem_cache_destroy() catches it. > >>>> > >>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > >>> > >>> This test case, when run, triggers a warning traceback. > >>> > >>> kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x70/0x11c > >>> WARNING: CPU: 0 PID: 715 at mm/slab_common.c:511 kmem_cache_destroy+0x1dc/0x1e4 > >> > >> Yes that should be suppressed like the other slub_kunit tests do. I have > >> assumed it's not that urgent because for example the KASAN kunit tests all > >> produce tons of warnings and thus assumed it's in some way acceptable for > >> kunit tests to do. > >> > > > > I have all tests which generate warning backtraces disabled. Trying to identify > > which warnings are noise and which warnings are on purpose doesn't scale, > > so it is all or nothing for me. I tried earlier to introduce a patch series > > which would enable selective backtrace suppression, but that died the death > > of architecture maintainers not caring and people demanding it to be perfect > > (meaning it only addressed WARNING: backtraces and not BUG: backtraces, > > and apparently that wasn't good enough). > > Ah, didn't know, too bad. > > > If the backtrace is intentional (and I think you are saying that it is), > > I'll simply disable the test. That may be a bit counter-productive, but > > there is really no alternative for me. > > It's intentional in the sense that the test intentionally triggers a > condition that normally produces a warning. Many if the slub kunit test do > that, but are able to suppress printing the warning when it happens in the > kunit context. I forgot to do that for the new test initially as the warning > there happens from a different path that those that already have the kunit > suppression, but we'll implement that suppression there too ASAP. We might also need to address the concern of the commit 7302e91f39a ("mm/slab_common: use WARN() if cache still has objects on destroy"), the concern that some users prefer WARN() over pr_err() to catch errors on testing systems which relies on WARN() format, and to respect panic_on_warn. So we might need to call WARN() instead of pr_err() if there are errors in slub error handling code in general, except when running kunit tests? Thanks, Hyeonggon
On 9/21/24 23:16, Hyeonggon Yoo wrote: > On Sun, Sep 22, 2024 at 6:25 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> On 9/21/24 23:08, Guenter Roeck wrote: >>> On 9/21/24 13:40, Vlastimil Babka wrote: >>>> +CC kunit folks >>>> >>>> On 9/20/24 15:35, Guenter Roeck wrote: >>>>> Hi, >>>> >>>> Hi, >>>> >>>>> On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote: >>>>>> Add a test that will create cache, allocate one object, kfree_rcu() it >>>>>> and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() >>>>>> in kmem_cache_destroy() works correctly, there should be no warnings in >>>>>> dmesg and the test should pass. >>>>>> >>>>>> Additionally add a test_leak_destroy() test that leaks an object on >>>>>> purpose and verifies that kmem_cache_destroy() catches it. >>>>>> >>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> >>>>> >>>>> This test case, when run, triggers a warning traceback. >>>>> >>>>> kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x70/0x11c >>>>> WARNING: CPU: 0 PID: 715 at mm/slab_common.c:511 kmem_cache_destroy+0x1dc/0x1e4 >>>> >>>> Yes that should be suppressed like the other slub_kunit tests do. I have >>>> assumed it's not that urgent because for example the KASAN kunit tests all >>>> produce tons of warnings and thus assumed it's in some way acceptable for >>>> kunit tests to do. >>>> >>> >>> I have all tests which generate warning backtraces disabled. Trying to identify >>> which warnings are noise and which warnings are on purpose doesn't scale, >>> so it is all or nothing for me. I tried earlier to introduce a patch series >>> which would enable selective backtrace suppression, but that died the death >>> of architecture maintainers not caring and people demanding it to be perfect >>> (meaning it only addressed WARNING: backtraces and not BUG: backtraces, >>> and apparently that wasn't good enough). >> >> Ah, didn't know, too bad. >> >>> If the backtrace is intentional (and I think you are saying that it is), >>> I'll simply disable the test. That may be a bit counter-productive, but >>> there is really no alternative for me. >> >> It's intentional in the sense that the test intentionally triggers a >> condition that normally produces a warning. Many if the slub kunit test do >> that, but are able to suppress printing the warning when it happens in the >> kunit context. I forgot to do that for the new test initially as the warning >> there happens from a different path that those that already have the kunit >> suppression, but we'll implement that suppression there too ASAP. > > We might also need to address the concern of the commit > 7302e91f39a ("mm/slab_common: use WARN() if cache still has objects on > destroy"), > the concern that some users prefer WARN() over pr_err() to catch > errors on testing systems > which relies on WARN() format, and to respect panic_on_warn. > > So we might need to call WARN() instead of pr_err() if there are errors in > slub error handling code in general, except when running kunit tests? > If people _want_ to see WARNING backtraces generated on purpose, so be it. For me it means that _real_ WARNING backtraces disappear in the noise. Manually maintaining a list of expected warning backtraces is too maintenance expensive for me, so I simply disable all kunit tests which generate backtraces on purpose. That is just me, though. Other testbeds may have more resources available and may be perfectly happy with the associated maintenance cost. In this specific case, I now have disabled slub kunit tests, and, as mentioned before, from my perspective there is no need to change the code just to accommodate my needs. I'll do the same with all other new unit tests which generate backtraces in the future, without bothering anyone. Sorry for the noise. Thanks, Guenter
On Sun, Sep 22, 2024 at 11:13 PM Guenter Roeck <linux@roeck-us.net> wrote: > > On 9/21/24 23:16, Hyeonggon Yoo wrote: > > On Sun, Sep 22, 2024 at 6:25 AM Vlastimil Babka <vbabka@suse.cz> wrote: > >> > >> On 9/21/24 23:08, Guenter Roeck wrote: > >>> On 9/21/24 13:40, Vlastimil Babka wrote: > >>>> +CC kunit folks > >>>> > >>>> On 9/20/24 15:35, Guenter Roeck wrote: > >>>>> Hi, > >>>> > >>>> Hi, > >>>> > >>>>> On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote: > >>>>>> Add a test that will create cache, allocate one object, kfree_rcu() it > >>>>>> and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() > >>>>>> in kmem_cache_destroy() works correctly, there should be no warnings in > >>>>>> dmesg and the test should pass. > >>>>>> > >>>>>> Additionally add a test_leak_destroy() test that leaks an object on > >>>>>> purpose and verifies that kmem_cache_destroy() catches it. > >>>>>> > >>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > >>>>> > >>>>> This test case, when run, triggers a warning traceback. > >>>>> > >>>>> kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x70/0x11c > >>>>> WARNING: CPU: 0 PID: 715 at mm/slab_common.c:511 kmem_cache_destroy+0x1dc/0x1e4 > >>>> > >>>> Yes that should be suppressed like the other slub_kunit tests do. I have > >>>> assumed it's not that urgent because for example the KASAN kunit tests all > >>>> produce tons of warnings and thus assumed it's in some way acceptable for > >>>> kunit tests to do. > >>>> > >>> > >>> I have all tests which generate warning backtraces disabled. Trying to identify > >>> which warnings are noise and which warnings are on purpose doesn't scale, > >>> so it is all or nothing for me. I tried earlier to introduce a patch series > >>> which would enable selective backtrace suppression, but that died the death > >>> of architecture maintainers not caring and people demanding it to be perfect > >>> (meaning it only addressed WARNING: backtraces and not BUG: backtraces, > >>> and apparently that wasn't good enough). > >> > >> Ah, didn't know, too bad. > >> > >>> If the backtrace is intentional (and I think you are saying that it is), > >>> I'll simply disable the test. That may be a bit counter-productive, but > >>> there is really no alternative for me. > >> > >> It's intentional in the sense that the test intentionally triggers a > >> condition that normally produces a warning. Many if the slub kunit test do > >> that, but are able to suppress printing the warning when it happens in the > >> kunit context. I forgot to do that for the new test initially as the warning > >> there happens from a different path that those that already have the kunit > >> suppression, but we'll implement that suppression there too ASAP. > > > > We might also need to address the concern of the commit > > 7302e91f39a ("mm/slab_common: use WARN() if cache still has objects on > > destroy"), > > the concern that some users prefer WARN() over pr_err() to catch > > errors on testing systems > > which relies on WARN() format, and to respect panic_on_warn. > > > > So we might need to call WARN() instead of pr_err() if there are errors in > > slub error handling code in general, except when running kunit tests? > > > > If people _want_ to see WARNING backtraces generated on purpose, so be it. > For me it means that _real_ WARNING backtraces disappear in the noise. > Manually maintaining a list of expected warning backtraces is too maintenance > expensive for me, so I simply disable all kunit tests which generate > backtraces on purpose. That is just me, though. Other testbeds may have > more resources available and may be perfectly happy with the associated > maintenance cost. > > In this specific case, I now have disabled slub kunit tests, and, as > mentioned before, from my perspective there is no need to change the > code just to accommodate my needs. I'll do the same with all other new > unit tests which generate backtraces in the future, without bothering > anyone. > > Sorry for the noise. I don't think this was a noise :) IMO some people want to see WARNING during testing to catch errors, but not for the slub_kunit test case. I think a proper approach here would be suppressing warnings while running slub_kunit test cases, but print WARNING when it is not running slub_kunit test cases. That would require some work changing the slub error reporting logic to print WARNING on certain errors. Any opinions, Vlastimil? Thanks, Hyeonggon
On 9/25/24 14:56, Hyeonggon Yoo wrote: > On Sun, Sep 22, 2024 at 11:13 PM Guenter Roeck <linux@roeck-us.net> wrote: >> >> On 9/21/24 23:16, Hyeonggon Yoo wrote: >> > On Sun, Sep 22, 2024 at 6:25 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> >> >> On 9/21/24 23:08, Guenter Roeck wrote: >> >>> On 9/21/24 13:40, Vlastimil Babka wrote: >> >>>> +CC kunit folks >> >>>> >> >>>> On 9/20/24 15:35, Guenter Roeck wrote: >> >>>>> Hi, >> >>>> >> >>>> Hi, >> >>>> >> >>>>> On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote: >> >>>>>> Add a test that will create cache, allocate one object, kfree_rcu() it >> >>>>>> and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() >> >>>>>> in kmem_cache_destroy() works correctly, there should be no warnings in >> >>>>>> dmesg and the test should pass. >> >>>>>> >> >>>>>> Additionally add a test_leak_destroy() test that leaks an object on >> >>>>>> purpose and verifies that kmem_cache_destroy() catches it. >> >>>>>> >> >>>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> >> >>>>> >> >>>>> This test case, when run, triggers a warning traceback. >> >>>>> >> >>>>> kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x70/0x11c >> >>>>> WARNING: CPU: 0 PID: 715 at mm/slab_common.c:511 kmem_cache_destroy+0x1dc/0x1e4 >> >>>> >> >>>> Yes that should be suppressed like the other slub_kunit tests do. I have >> >>>> assumed it's not that urgent because for example the KASAN kunit tests all >> >>>> produce tons of warnings and thus assumed it's in some way acceptable for >> >>>> kunit tests to do. >> >>>> >> >>> >> >>> I have all tests which generate warning backtraces disabled. Trying to identify >> >>> which warnings are noise and which warnings are on purpose doesn't scale, >> >>> so it is all or nothing for me. I tried earlier to introduce a patch series >> >>> which would enable selective backtrace suppression, but that died the death >> >>> of architecture maintainers not caring and people demanding it to be perfect >> >>> (meaning it only addressed WARNING: backtraces and not BUG: backtraces, >> >>> and apparently that wasn't good enough). >> >> >> >> Ah, didn't know, too bad. >> >> >> >>> If the backtrace is intentional (and I think you are saying that it is), >> >>> I'll simply disable the test. That may be a bit counter-productive, but >> >>> there is really no alternative for me. >> >> >> >> It's intentional in the sense that the test intentionally triggers a >> >> condition that normally produces a warning. Many if the slub kunit test do >> >> that, but are able to suppress printing the warning when it happens in the >> >> kunit context. I forgot to do that for the new test initially as the warning >> >> there happens from a different path that those that already have the kunit >> >> suppression, but we'll implement that suppression there too ASAP. >> > >> > We might also need to address the concern of the commit >> > 7302e91f39a ("mm/slab_common: use WARN() if cache still has objects on >> > destroy"), >> > the concern that some users prefer WARN() over pr_err() to catch >> > errors on testing systems >> > which relies on WARN() format, and to respect panic_on_warn. >> > >> > So we might need to call WARN() instead of pr_err() if there are errors in >> > slub error handling code in general, except when running kunit tests? >> > >> >> If people _want_ to see WARNING backtraces generated on purpose, so be it. >> For me it means that _real_ WARNING backtraces disappear in the noise. >> Manually maintaining a list of expected warning backtraces is too maintenance >> expensive for me, so I simply disable all kunit tests which generate >> backtraces on purpose. That is just me, though. Other testbeds may have >> more resources available and may be perfectly happy with the associated >> maintenance cost. >> >> In this specific case, I now have disabled slub kunit tests, and, as >> mentioned before, from my perspective there is no need to change the >> code just to accommodate my needs. I'll do the same with all other new >> unit tests which generate backtraces in the future, without bothering >> anyone. >> >> Sorry for the noise. > > I don't think this was a noise :) IMO some people want to see WARNING > during testing to catch errors, > but not for the slub_kunit test case. I think a proper approach here > would be suppressing > warnings while running slub_kunit test cases, but print WARNING when > it is not running slub_kunit test cases. > > That would require some work changing the slub error reporting logic > to print WARNING on certain errors. > Any opinions, Vlastimil? Yes, we should suppress the existing warning on kmem_cache_destroy() in kunit test context, and separately we can change pr_err() to WARN() as long as they are still suppressed in kunit test context. > Thanks, > Hyeonggon
On 9/26/24 14:54, Vlastimil Babka wrote: > On 9/25/24 14:56, Hyeonggon Yoo wrote: >> >> I don't think this was a noise :) IMO some people want to see WARNING >> during testing to catch errors, >> but not for the slub_kunit test case. I think a proper approach here >> would be suppressing >> warnings while running slub_kunit test cases, but print WARNING when >> it is not running slub_kunit test cases. >> >> That would require some work changing the slub error reporting logic >> to print WARNING on certain errors. >> Any opinions, Vlastimil? > > Yes, we should suppress the existing warning on kmem_cache_destroy() in Done here: https://lore.kernel.org/all/20240930-b4-slub-kunit-fix-v1-0-32ca9dbbbc11@suse.cz/ > kunit test context, and separately we can change pr_err() to WARN() as long > as they are still suppressed in kunit test context. Can be done later separately as it would be out of the scope for a rcX fix. >> Thanks, >> Hyeonggon >
diff --git a/lib/slub_kunit.c b/lib/slub_kunit.c index e6667a28c014..6e3a1e5a7142 100644 --- a/lib/slub_kunit.c +++ b/lib/slub_kunit.c @@ -5,6 +5,7 @@ #include <linux/slab.h> #include <linux/module.h> #include <linux/kernel.h> +#include <linux/rcupdate.h> #include "../mm/slab.h" static struct kunit_resource resource; @@ -157,6 +158,34 @@ static void test_kmalloc_redzone_access(struct kunit *test) kmem_cache_destroy(s); } +struct test_kfree_rcu_struct { + struct rcu_head rcu; +}; + +static void test_kfree_rcu(struct kunit *test) +{ + struct kmem_cache *s = test_kmem_cache_create("TestSlub_kfree_rcu", + sizeof(struct test_kfree_rcu_struct), + SLAB_NO_MERGE); + struct test_kfree_rcu_struct *p = kmem_cache_alloc(s, GFP_KERNEL); + + kfree_rcu(p, rcu); + kmem_cache_destroy(s); + + KUNIT_EXPECT_EQ(test, 0, slab_errors); +} + +static void test_leak_destroy(struct kunit *test) +{ + struct kmem_cache *s = test_kmem_cache_create("TestSlub_kfree_rcu", + 64, SLAB_NO_MERGE); + kmem_cache_alloc(s, GFP_KERNEL); + + kmem_cache_destroy(s); + + KUNIT_EXPECT_EQ(test, 1, slab_errors); +} + static int test_init(struct kunit *test) { slab_errors = 0; @@ -177,6 +206,8 @@ static struct kunit_case test_cases[] = { KUNIT_CASE(test_clobber_redzone_free), KUNIT_CASE(test_kmalloc_redzone_access), + KUNIT_CASE(test_kfree_rcu), + KUNIT_CASE(test_leak_destroy), {} };
Add a test that will create cache, allocate one object, kfree_rcu() it and attempt to destroy it. As long as the usage of kvfree_rcu_barrier() in kmem_cache_destroy() works correctly, there should be no warnings in dmesg and the test should pass. Additionally add a test_leak_destroy() test that leaks an object on purpose and verifies that kmem_cache_destroy() catches it. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- lib/slub_kunit.c | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+)