Message ID | 20230119211455.498968-2-echanude@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs/namespace: defer free_mount from namespace_unlock | expand |
On Thu, Jan 19, 2023 at 04:14:55PM -0500, Eric Chanudet wrote: > From: Alexander Larsson <alexl@redhat.com> > > Use call_rcu to defer releasing the umount'ed or detached filesystem > when calling namepsace_unlock(). > > Calling synchronize_rcu_expedited() has a significant cost on RT kernel > that default to rcupdate.rcu_normal_after_boot=1. > > For example, on a 6.2-rt1 kernel: > perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt > 0.07464 +- 0.00396 seconds time elapsed ( +- 5.31% ) > > With this change applied: > perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt > 0.00162604 +- 0.00000637 seconds time elapsed ( +- 0.39% ) > > Waiting for the grace period before completing the syscall does not seem > mandatory. The struct mount umount'ed are queued up for release in a > separate list and no longer accessible to following syscalls. Again, NAK. If a filesystem is expected to be shut down by umount(2), userland expects it to have been already shut down by the time the syscall returns. It's not just visibility in namespace; it's "can I pull the disk out?". Or "can the shutdown get to taking the network down?", for that matter.
On Thu, Jan 19, 2023 at 11:09 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Thu, Jan 19, 2023 at 04:14:55PM -0500, Eric Chanudet wrote: > > From: Alexander Larsson <alexl@redhat.com> > > > > Use call_rcu to defer releasing the umount'ed or detached filesystem > > when calling namepsace_unlock(). > > > > Calling synchronize_rcu_expedited() has a significant cost on RT kernel > > that default to rcupdate.rcu_normal_after_boot=1. > > > > For example, on a 6.2-rt1 kernel: > > perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt > > 0.07464 +- 0.00396 seconds time elapsed ( +- 5.31% ) > > > > With this change applied: > > perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt > > 0.00162604 +- 0.00000637 seconds time elapsed ( +- 0.39% ) > > > > Waiting for the grace period before completing the syscall does not seem > > mandatory. The struct mount umount'ed are queued up for release in a > > separate list and no longer accessible to following syscalls. > > Again, NAK. If a filesystem is expected to be shut down by umount(2), > userland expects it to have been already shut down by the time the > syscall returns. > > It's not just visibility in namespace; it's "can I pull the disk out?". > Or "can the shutdown get to taking the network down?", for that matter. In the usecase we're worrying about, all the unmounts are lazy (i.e. MNT_DETACH). What about delaying the destroy in that case? That seems in line with the expected behaviour of lazy shutdown. I.e. you can't rely on it to be settled anyway.
Greeting, FYI, we noticed WARNING:inconsistent_lock_state due to commit (built with gcc-11): commit: fcd28ffda89717f5d68fff9d3260b57f02d93eda ("[RFC PATCH RESEND 1/1] fs/namespace: defer free_mount from namespace_unlock") url: https://github.com/intel-lab-lkp/linux/commits/Eric-Chanudet/fs-namespace-defer-free_mount-from-namespace_unlock/20230120-052658 base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git d368967cb1039b5c4cccb62b5a4b9468c50cd143 patch link: https://lore.kernel.org/all/20230119211455.498968-2-echanude@redhat.com/ patch subject: [RFC PATCH RESEND 1/1] fs/namespace: defer free_mount from namespace_unlock in testcase: boot on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): [ 13.499613][ C0] WARNING: inconsistent lock state [ 13.500185][ C0] 6.2.0-rc4-00078-gfcd28ffda897 #1 Not tainted [ 13.500852][ C0] -------------------------------- [ 13.501426][ C0] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 13.502160][ C0] systemd/1 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 13.502810][ C0] ffffffff84414790 (mount_lock){+.?.}-{2:2}, at: mntput_no_expire (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1337) [ 13.503774][ C0] {SOFTIRQ-ON-W} state was registered at: [ 13.504400][ C0] __lock_acquire (kernel/locking/lockdep.c:5009) [ 13.504961][ C0] lock_acquire (kernel/locking/lockdep.c:466 kernel/locking/lockdep.c:5670 kernel/locking/lockdep.c:5633) [ 13.505496][ C0] _raw_spin_lock (include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154) [ 13.506033][ C0] vfs_create_mount (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1145) [ 13.506594][ C0] vfs_kern_mount (fs/namespace.c:1201) [ 13.507214][ C0] kern_mount (fs/namespace.c:4590) [ 13.507716][ C0] shmem_init (mm/shmem.c:4060) [ 13.508225][ C0] mnt_init (fs/namespace.c:4574) [ 13.508723][ C0] vfs_caches_init (fs/dcache.c:3354) [ 13.509273][ C0] start_kernel (init/main.c:1131) [ 13.509812][ C0] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:358) [ 13.510468][ C0] irq event stamp: 1059406 [ 13.510999][ C0] hardirqs last enabled at (1059406): kasan_quarantine_put (arch/x86/include/asm/irqflags.h:45 (discriminator 1) arch/x86/include/asm/irqflags.h:80 (discriminator 1) arch/x86/include/asm/irqflags.h:138 (discriminator 1) mm/kasan/quarantine.c:242 (discriminator 1)) [ 13.512073][ C0] hardirqs last disabled at (1059405): kasan_quarantine_put (mm/kasan/quarantine.c:217 (discriminator 1)) [ 13.513123][ C0] softirqs last enabled at (1058838): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:415 kernel/softirq.c:600) [ 13.514123][ C0] softirqs last disabled at (1059387): __irq_exit_rcu (kernel/softirq.c:445 kernel/softirq.c:650) [ 13.515161][ C0] [ 13.515161][ C0] other info that might help us debug this: [ 13.516033][ C0] Possible unsafe locking scenario: [ 13.516033][ C0] [ 13.516857][ C0] CPU0 [ 13.517282][ C0] ---- [ 13.517700][ C0] lock(mount_lock); [ 13.518184][ C0] <Interrupt> [ 13.518612][ C0] lock(mount_lock); [ 13.519118][ C0] [ 13.519118][ C0] *** DEADLOCK *** [ 13.519118][ C0] [ 13.520048][ C0] 3 locks held by systemd/1: [ 13.520581][ C0] #0: ffffffff84925200 (rcu_read_lock){....}-{1:2}, at: __is_insn_slot_addr (kernel/kprobes.c:297) [ 13.521514][ C0] #1: ffffffff849250e0 (rcu_callback){....}-{0:0}, at: rcu_do_batch (include/linux/rcupdate.h:325 kernel/rcu/tree.c:2241) [ 13.522372][ C0] #2: ffffffff84925200 (rcu_read_lock){....}-{1:2}, at: mntput_no_expire (fs/namespace.c:1318) [ 13.523384][ C0] [ 13.523384][ C0] stack backtrace: [ 13.524062][ C0] CPU: 0 PID: 1 Comm: systemd Not tainted 6.2.0-rc4-00078-gfcd28ffda897 #1 [ 13.524985][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014 [ 13.526078][ C0] Call Trace: [ 13.526492][ C0] <IRQ> [ 13.526882][ C0] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) [ 13.527408][ C0] mark_lock_irq.cold (kernel/locking/lockdep.c:4206 kernel/locking/lockdep.c:3975 kernel/locking/lockdep.c:4178) [ 13.527961][ C0] ? lock_chain_count (kernel/locking/lockdep.c:4169) [ 13.528512][ C0] ? filter_irq_stacks (kernel/stacktrace.c:114) [ 13.529078][ C0] ? save_trace (kernel/locking/lockdep.c:585) [ 13.529597][ C0] ? stack_trace_save (kernel/stacktrace.c:123) [ 13.530151][ C0] mark_lock+0x4c9/0xac0 [ 13.530704][ C0] ? mark_lock_irq (kernel/locking/lockdep.c:4593) [ 13.531246][ C0] ? kasan_save_stack (mm/kasan/common.c:47) [ 13.531804][ C0] ? kasan_save_stack (mm/kasan/common.c:46) [ 13.532359][ C0] ? kasan_set_track (mm/kasan/common.c:52) [ 13.532901][ C0] ? kasan_save_free_info (mm/kasan/generic.c:520) [ 13.533485][ C0] mark_usage (kernel/locking/lockdep.c:4529) [ 13.533981][ C0] __lock_acquire (kernel/locking/lockdep.c:5009) [ 13.534531][ C0] lock_acquire (kernel/locking/lockdep.c:466 kernel/locking/lockdep.c:5670 kernel/locking/lockdep.c:5633) [ 13.535060][ C0] ? mntput_no_expire (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1337) [ 13.535631][ C0] ? rcu_read_unlock (include/linux/rcupdate.h:793 (discriminator 5)) [ 13.536176][ C0] ? rcu_read_unlock (include/linux/rcupdate.h:793 (discriminator 5)) [ 13.536718][ C0] ? lock_downgrade (kernel/locking/lockdep.c:5320) [ 13.537261][ C0] _raw_spin_lock (include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154) [ 13.537780][ C0] ? mntput_no_expire (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1337) [ 13.538339][ C0] mntput_no_expire (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1337) [ 13.538901][ C0] ? rcu_read_unlock (include/linux/rcupdate.h:793 (discriminator 5)) [ 13.539451][ C0] ? lock_downgrade (kernel/locking/lockdep.c:5320) [ 13.539999][ C0] ? slab_free_freelist_hook (mm/slub.c:1807) [ 13.540609][ C0] ? mnt_get_count (fs/namespace.c:1318) [ 13.541158][ C0] ? lock_is_held_type (kernel/locking/lockdep.c:5409 kernel/locking/lockdep.c:5711) [ 13.541731][ C0] delayed_mount_release (fs/namespace.c:1607) [ 13.542305][ C0] rcu_do_batch (include/linux/rcupdate.h:330 kernel/rcu/tree.c:2248) [ 13.543383][ C0] ? rcu_implicit_dynticks_qs (kernel/rcu/tree.c:2185) [ 13.544036][ C0] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) [ 13.544610][ C0] ? _raw_spin_unlock_irqrestore (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) [ 13.545254][ C0] ? rcu_report_qs_rdp (kernel/rcu/tree.c:2050) [ 13.545826][ C0] rcu_core (kernel/rcu/tree.c:2508) [ 13.546315][ C0] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:572) [ 13.547696][ C0] __irq_exit_rcu (kernel/softirq.c:445 kernel/softirq.c:650) [ 13.548241][ C0] irq_exit_rcu (kernel/softirq.c:664) [ 13.548741][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1107 (discriminator 14)) [ 13.549370][ C0] </IRQ> [ 13.549757][ C0] <TASK> [ 13.550132][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649) [ 13.550793][ C0] RIP: 0010:lock_acquire (kernel/locking/lockdep.c:5636) [ 13.551387][ C0] Code: ff ff 48 83 c4 20 65 0f c1 05 af 26 d0 7e 83 f8 01 0f 85 ae 02 00 00 48 83 7c 24 08 00 74 01 fb 48 b8 00 00 00 00 00 fc ff df <48> 01 c3 48 c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 48 8b 84 24 All code ======== 0: ff (bad) 1: ff 48 83 decl -0x7d(%rax) 4: c4 (bad) 5: 20 65 0f and %ah,0xf(%rbp) 8: c1 05 af 26 d0 7e 83 roll $0x83,0x7ed026af(%rip) # 0x7ed026be f: f8 clc 10: 01 0f add %ecx,(%rdi) 12: 85 ae 02 00 00 48 test %ebp,0x48000002(%rsi) 18: 83 7c 24 08 00 cmpl $0x0,0x8(%rsp) 1d: 74 01 je 0x20 1f: fb sti 20: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 27: fc ff df 2a:* 48 01 c3 add %rax,%rbx <-- trapping instruction 2d: 48 c7 03 00 00 00 00 movq $0x0,(%rbx) 34: 48 c7 43 08 00 00 00 movq $0x0,0x8(%rbx) 3b: 00 3c: 48 rex.W 3d: 8b .byte 0x8b 3e: 84 .byte 0x84 3f: 24 .byte 0x24 Code starting with the faulting instruction =========================================== 0: 48 01 c3 add %rax,%rbx 3: 48 c7 03 00 00 00 00 movq $0x0,(%rbx) a: 48 c7 43 08 00 00 00 movq $0x0,0x8(%rbx) 11: 00 12: 48 rex.W 13: 8b .byte 0x8b 14: 84 .byte 0x84 15: 24 .byte 0x24 If you fix the issue, kindly add following tag | Reported-by: kernel test robot <yujie.liu@intel.com> | Link: https://lore.kernel.org/oe-lkp/202301301039.8bdb3db5-yujie.liu@intel.com To reproduce: # build kernel cd linux cp config-6.2.0-rc4-00078-gfcd28ffda897 .config make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install cd <mod-install-dir> find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email # if come across any failure that blocks the test, # please remove ~/.lkp and /lkp dir to run from a clean state.
diff --git a/fs/namespace.c b/fs/namespace.c index ab467ee58341..11d219a6e83c 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -44,6 +44,11 @@ static unsigned int m_hash_shift __read_mostly; static unsigned int mp_hash_mask __read_mostly; static unsigned int mp_hash_shift __read_mostly; +struct mount_delayed_release { + struct rcu_head rcu; + struct hlist_head release_list; +}; + static __initdata unsigned long mhash_entries; static int __init set_mhash_entries(char *str) { @@ -1582,11 +1587,31 @@ int may_umount(struct vfsmount *mnt) EXPORT_SYMBOL(may_umount); -static void namespace_unlock(void) +static void free_mounts(struct hlist_head *mount_list) { - struct hlist_head head; struct hlist_node *p; struct mount *m; + + hlist_for_each_entry_safe(m, p, mount_list, mnt_umount) { + hlist_del(&m->mnt_umount); + mntput(&m->mnt); + } +} + +static void delayed_mount_release(struct rcu_head *head) +{ + struct mount_delayed_release *drelease = + container_of(head, struct mount_delayed_release, rcu); + + free_mounts(&drelease->release_list); + kfree(drelease); +} + +static void namespace_unlock(void) +{ + struct hlist_head head; + struct mount_delayed_release *drelease; + LIST_HEAD(list); hlist_move_list(&unmounted, &head); @@ -1599,12 +1624,15 @@ static void namespace_unlock(void) if (likely(hlist_empty(&head))) return; - synchronize_rcu_expedited(); - - hlist_for_each_entry_safe(m, p, &head, mnt_umount) { - hlist_del(&m->mnt_umount); - mntput(&m->mnt); + drelease = kmalloc(sizeof(*drelease), GFP_KERNEL); + if (unlikely(!drelease)) { + synchronize_rcu_expedited(); + free_mounts(&head); + return; } + + hlist_move_list(&head, &drelease->release_list); + call_rcu(&drelease->rcu, delayed_mount_release); } static inline void namespace_lock(void)