bpf, sockmap: fix deadlock in rcu_report_exp_cpu_mult

Message ID	tencent_F436364A347489774B677A3D13367E968E09@qq.com (mailing list archive)
State	Changes Requested
Delegated to:	BPF
Headers	show Received: from out203-205-251-82.mail.qq.com (out203-205-251-82.mail.qq.com [203.205.251.82]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7194D385 for <bpf@vger.kernel.org>; Sat, 23 Mar 2024 05:47:44 +0000 (UTC) Message-ID: <tencent_F436364A347489774B677A3D13367E968E09@qq.com> From: Edward Adam Davis <eadavis@qq.com> To: syzbot+c4f4d25859c2e5859988@syzkaller.appspotmail.com Cc: 42.hyeyoo@gmail.com, andrii@kernel.org, ast@kernel.org, bpf@vger.kernel.org, daniel@iogearbox.net, davem@davemloft.net, edumazet@google.com, jakub@cloudflare.com, john.fastabend@gmail.com, kafai@fb.com, kpsingh@kernel.org, kuba@kernel.org, linux-kernel@vger.kernel.org, namhyung@kernel.org, netdev@vger.kernel.org, pabeni@redhat.com, peterz@infradead.org, songliubraving@fb.com, syzkaller-bugs@googlegroups.com, yhs@fb.com Subject: [PATCH] bpf, sockmap: fix deadlock in rcu_report_exp_cpu_mult Date: Sat, 23 Mar 2024 13:42:30 +0800 In-Reply-To: <000000000000dc9aca0613ec855c@google.com> References: <000000000000dc9aca0613ec855c@google.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	bpf, sockmap: fix deadlock in rcu_report_exp_cpu_mult \| expand bpf, sockmap: fix deadlock in rcu_report_exp_cpu_mult

Context	Check	Description
netdev/series_format	warning	Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection	success	Guessed tree name to be net-next, async
netdev/ynl	success	Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 946 this patch: 946
netdev/build_tools	success	No tools touched, skip
netdev/cc_maintainers	success	CCed 7 of 7 maintainers
netdev/build_clang	success	Errors and warnings before: 956 this patch: 956
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 957 this patch: 957
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 41 lines checked
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
bpf/vmtest-bpf-next-PR	success	PR summary
netdev/contest	success	net-next-2024-03-26--09-00 (tests: 946)
bpf/vmtest-bpf-next-VM_Test-5	success	Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-1	success	Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-0	success	Logs for Lint
bpf/vmtest-bpf-next-VM_Test-2	success	Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-3	success	Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-9	success	Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11	success	Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-4	success	Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10	success	Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-12	success	Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-28	success	Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-19	success	Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18	success	Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-17	success	Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-16	success	Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-34	success	Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-20	success	Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-35	success	Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-42	success	Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-8	success	Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6	success	Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7	success	Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13	success	Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-36	success	Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-23	success	Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-39	success	Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-33	success	Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-21	success	Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26	success	Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22	success	Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24	success	Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-37	success	Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38	success	Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-27	success	Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25	success	Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-31	success	Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32	success	Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-30	success	Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-41	success	Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40	success	Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-29	success	Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-14	success	Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15	success	Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc

Edward Adam Davis March 23, 2024, 5:42 a.m. UTC

[Syzbot reported]
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.8.0-syzkaller-05221-gea80e3ed09ab #0 Not tainted
-----------------------------------------------------
rcu_exp_gp_kthr/18 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
ffff88802b5ab020 (&htab->buckets[i].lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
ffff88802b5ab020 (&htab->buckets[i].lock){+...}-{2:2}, at: sock_hash_delete_elem+0xb0/0x300 net/core/sock_map.c:939

and this task is already holding:
ffffffff8e136558 (rcu_node_0){-.-.}-{2:2}, at: sync_rcu_exp_done_unlocked+0xe/0x140 kernel/rcu/tree_exp.h:169
which would create a new lock dependency:
 (rcu_node_0){-.-.}-{2:2} -> (&htab->buckets[i].lock){+...}-{2:2}

but this new dependency connects a HARDIRQ-irq-safe lock:
 (rcu_node_0){-.-.}-{2:2}

... which became HARDIRQ-irq-safe at:
  lock_acquire+0x1e4/0x530 kernel/locking/lockdep.c:5754
  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
  _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
  rcu_report_exp_cpu_mult+0x27/0x2f0 kernel/rcu/tree_exp.h:238
  csd_do_func kernel/smp.c:133 [inline]
  __flush_smp_call_function_queue+0xb2e/0x15b0 kernel/smp.c:542
  __sysvec_call_function_single+0xa8/0x3e0 arch/x86/kernel/smp.c:271
  instr_sysvec_call_function_single arch/x86/kernel/smp.c:266 [inline]
  sysvec_call_function_single+0x9e/0xc0 arch/x86/kernel/smp.c:266
  asm_sysvec_call_function_single+0x1a/0x20 arch/x86/include/asm/idtentry.h:709
  __sanitizer_cov_trace_switch+0x90/0x120
  update_event_printk kernel/trace/trace_events.c:2750 [inline]
  trace_event_eval_update+0x311/0xf90 kernel/trace/trace_events.c:2922
  process_one_work kernel/workqueue.c:3254 [inline]
  process_scheduled_works+0xa00/0x1770 kernel/workqueue.c:3335
  worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
  kthread+0x2f0/0x390 kernel/kthread.c:388
  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243

to a HARDIRQ-irq-unsafe lock:
 (&htab->buckets[i].lock){+...}-{2:2}

... which became HARDIRQ-irq-unsafe at:
...
  lock_acquire+0x1e4/0x530 kernel/locking/lockdep.c:5754
  __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
  _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
  spin_lock_bh include/linux/spinlock.h:356 [inline]
  sock_hash_delete_elem+0xb0/0x300 net/core/sock_map.c:939
  0xffffffffa0001b0e
  bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline]
  __bpf_prog_run include/linux/filter.h:657 [inline]
  bpf_prog_run include/linux/filter.h:664 [inline]
  __bpf_trace_run kernel/trace/bpf_trace.c:2381 [inline]
  bpf_trace_run2+0x204/0x420 kernel/trace/bpf_trace.c:2420
  trace_contention_end+0xd7/0x100 include/trace/events/lock.h:122
  __mutex_lock_common kernel/locking/mutex.c:617 [inline]
  __mutex_lock+0x2e5/0xd70 kernel/locking/mutex.c:752
  futex_cleanup_begin kernel/futex/core.c:1091 [inline]
  futex_exit_release+0x34/0x1f0 kernel/futex/core.c:1143
  exit_mm_release+0x1a/0x30 kernel/fork.c:1652
  exit_mm+0xb0/0x310 kernel/exit.c:542
  do_exit+0x99e/0x27e0 kernel/exit.c:865
  do_group_exit+0x207/0x2c0 kernel/exit.c:1027
  __do_sys_exit_group kernel/exit.c:1038 [inline]
  __se_sys_exit_group kernel/exit.c:1036 [inline]
  __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036
  do_syscall_64+0xfb/0x240
  entry_SYSCALL_64_after_hwframe+0x6d/0x75

other info that might help us debug this:

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&htab->buckets[i].lock);
                               local_irq_disable();
                               lock(rcu_node_0);
                               lock(&htab->buckets[i].lock);
  <Interrupt>
    lock(rcu_node_0);

 *** DEADLOCK ***
[Fix]
Ensure that the context interrupt state is the same before and after using the
bucket->lock.

Reported-and-tested-by: syzbot+c4f4d25859c2e5859988@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
---
 net/core/sock_map.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Alexei Starovoitov March 23, 2024, 7:08 a.m. UTC | #1

John,
please review.
It seems this bug was causing multiple syzbot reports.

On Fri, Mar 22, 2024 at 10:42 PM Edward Adam Davis <eadavis@qq.com> wrote:
>
> [Syzbot reported]
> WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> 6.8.0-syzkaller-05221-gea80e3ed09ab #0 Not tainted
> -----------------------------------------------------
> rcu_exp_gp_kthr/18 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
> ffff88802b5ab020 (&htab->buckets[i].lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
> ffff88802b5ab020 (&htab->buckets[i].lock){+...}-{2:2}, at: sock_hash_delete_elem+0xb0/0x300 net/core/sock_map.c:939
>
> and this task is already holding:
> ffffffff8e136558 (rcu_node_0){-.-.}-{2:2}, at: sync_rcu_exp_done_unlocked+0xe/0x140 kernel/rcu/tree_exp.h:169
> which would create a new lock dependency:
>  (rcu_node_0){-.-.}-{2:2} -> (&htab->buckets[i].lock){+...}-{2:2}
>
> but this new dependency connects a HARDIRQ-irq-safe lock:
>  (rcu_node_0){-.-.}-{2:2}
>
> ... which became HARDIRQ-irq-safe at:
>   lock_acquire+0x1e4/0x530 kernel/locking/lockdep.c:5754
>   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>   _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
>   rcu_report_exp_cpu_mult+0x27/0x2f0 kernel/rcu/tree_exp.h:238
>   csd_do_func kernel/smp.c:133 [inline]
>   __flush_smp_call_function_queue+0xb2e/0x15b0 kernel/smp.c:542
>   __sysvec_call_function_single+0xa8/0x3e0 arch/x86/kernel/smp.c:271
>   instr_sysvec_call_function_single arch/x86/kernel/smp.c:266 [inline]
>   sysvec_call_function_single+0x9e/0xc0 arch/x86/kernel/smp.c:266
>   asm_sysvec_call_function_single+0x1a/0x20 arch/x86/include/asm/idtentry.h:709
>   __sanitizer_cov_trace_switch+0x90/0x120
>   update_event_printk kernel/trace/trace_events.c:2750 [inline]
>   trace_event_eval_update+0x311/0xf90 kernel/trace/trace_events.c:2922
>   process_one_work kernel/workqueue.c:3254 [inline]
>   process_scheduled_works+0xa00/0x1770 kernel/workqueue.c:3335
>   worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
>   kthread+0x2f0/0x390 kernel/kthread.c:388
>   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
>
> to a HARDIRQ-irq-unsafe lock:
>  (&htab->buckets[i].lock){+...}-{2:2}
>
> ... which became HARDIRQ-irq-unsafe at:
> ...
>   lock_acquire+0x1e4/0x530 kernel/locking/lockdep.c:5754
>   __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
>   _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
>   spin_lock_bh include/linux/spinlock.h:356 [inline]
>   sock_hash_delete_elem+0xb0/0x300 net/core/sock_map.c:939
>   0xffffffffa0001b0e
>   bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline]
>   __bpf_prog_run include/linux/filter.h:657 [inline]
>   bpf_prog_run include/linux/filter.h:664 [inline]
>   __bpf_trace_run kernel/trace/bpf_trace.c:2381 [inline]
>   bpf_trace_run2+0x204/0x420 kernel/trace/bpf_trace.c:2420
>   trace_contention_end+0xd7/0x100 include/trace/events/lock.h:122
>   __mutex_lock_common kernel/locking/mutex.c:617 [inline]
>   __mutex_lock+0x2e5/0xd70 kernel/locking/mutex.c:752
>   futex_cleanup_begin kernel/futex/core.c:1091 [inline]
>   futex_exit_release+0x34/0x1f0 kernel/futex/core.c:1143
>   exit_mm_release+0x1a/0x30 kernel/fork.c:1652
>   exit_mm+0xb0/0x310 kernel/exit.c:542
>   do_exit+0x99e/0x27e0 kernel/exit.c:865
>   do_group_exit+0x207/0x2c0 kernel/exit.c:1027
>   __do_sys_exit_group kernel/exit.c:1038 [inline]
>   __se_sys_exit_group kernel/exit.c:1036 [inline]
>   __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036
>   do_syscall_64+0xfb/0x240
>   entry_SYSCALL_64_after_hwframe+0x6d/0x75
>
> other info that might help us debug this:
>
>  Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&htab->buckets[i].lock);
>                                local_irq_disable();
>                                lock(rcu_node_0);
>                                lock(&htab->buckets[i].lock);
>   <Interrupt>
>     lock(rcu_node_0);
>
>  *** DEADLOCK ***
> [Fix]
> Ensure that the context interrupt state is the same before and after using the
> bucket->lock.
>
> Reported-and-tested-by: syzbot+c4f4d25859c2e5859988@syzkaller.appspotmail.com
> Signed-off-by: Edward Adam Davis <eadavis@qq.com>
> ---
>  net/core/sock_map.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 27d733c0f65e..ae8f81b26e16 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -932,11 +932,12 @@ static long sock_hash_delete_elem(struct bpf_map *map, void *key)
>         struct bpf_shtab_bucket *bucket;
>         struct bpf_shtab_elem *elem;
>         int ret = -ENOENT;
> +       unsigned long flags;
>
>         hash = sock_hash_bucket_hash(key, key_size);
>         bucket = sock_hash_select_bucket(htab, hash);
>
> -       spin_lock_bh(&bucket->lock);
> +       spin_lock_irqsave(&bucket->lock, flags);
>         elem = sock_hash_lookup_elem_raw(&bucket->head, hash, key, key_size);
>         if (elem) {
>                 hlist_del_rcu(&elem->node);
> @@ -944,7 +945,7 @@ static long sock_hash_delete_elem(struct bpf_map *map, void *key)
>                 sock_hash_free_elem(htab, elem);
>                 ret = 0;
>         }
> -       spin_unlock_bh(&bucket->lock);
> +       spin_unlock_irqrestore(&bucket->lock, flags);
>         return ret;
>  }
>
> @@ -1136,6 +1137,7 @@ static void sock_hash_free(struct bpf_map *map)
>         struct bpf_shtab_elem *elem;
>         struct hlist_node *node;
>         int i;
> +       unsigned long flags;
>
>         /* After the sync no updates or deletes will be in-flight so it
>          * is safe to walk map and remove entries without risking a race
> @@ -1151,11 +1153,11 @@ static void sock_hash_free(struct bpf_map *map)
>                  * exists, psock exists and holds a ref to socket. That
>                  * lets us to grab a socket ref too.
>                  */
> -               spin_lock_bh(&bucket->lock);
> +               spin_lock_irqsave(&bucket->lock, flags);
>                 hlist_for_each_entry(elem, &bucket->head, node)
>                         sock_hold(elem->sk);
>                 hlist_move_list(&bucket->head, &unlink_list);
> -               spin_unlock_bh(&bucket->lock);
> +               spin_unlock_irqrestore(&bucket->lock, flags);
>
>                 /* Process removed entries out of atomic context to
>                  * block for socket lock before deleting the psock's
> --
> 2.43.0
>

Jakub Sitnicki March 25, 2024, 12:23 p.m. UTC | #2

On Sat, Mar 23, 2024 at 12:08 AM -07, Alexei Starovoitov wrote:
> John,
> please review.
> It seems this bug was causing multiple syzbot reports.

Any chance we could disallow mutating sockhash from interrupt context?

If that is not an option, then this looks like a good start of a fix.
But we also need to cover sock_map_unref->sock_sock_map_del_link called
from sock_hash_delete_elem. It also grabs a spin lock.

Also, sockhash is not the only affected map type. I see we're grabbing a
spin lock in ->map_delete_elem without disabling interrupts as well in:

- sock_map_delete_elem
- reuseport_array_delete_elem
- xsk_map_delete_elem

> On Fri, Mar 22, 2024 at 10:42 PM Edward Adam Davis <eadavis@qq.com> wrote:
>>
>> [Syzbot reported]
>> WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
>> 6.8.0-syzkaller-05221-gea80e3ed09ab #0 Not tainted
>> -----------------------------------------------------
>> rcu_exp_gp_kthr/18 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
>> ffff88802b5ab020 (&htab->buckets[i].lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
>> ffff88802b5ab020 (&htab->buckets[i].lock){+...}-{2:2}, at: sock_hash_delete_elem+0xb0/0x300 net/core/sock_map.c:939
>>
>> and this task is already holding:
>> ffffffff8e136558 (rcu_node_0){-.-.}-{2:2}, at: sync_rcu_exp_done_unlocked+0xe/0x140 kernel/rcu/tree_exp.h:169
>> which would create a new lock dependency:
>>  (rcu_node_0){-.-.}-{2:2} -> (&htab->buckets[i].lock){+...}-{2:2}
>>
>> but this new dependency connects a HARDIRQ-irq-safe lock:
>>  (rcu_node_0){-.-.}-{2:2}
>>
>> ... which became HARDIRQ-irq-safe at:
>>   lock_acquire+0x1e4/0x530 kernel/locking/lockdep.c:5754
>>   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>   _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
>>   rcu_report_exp_cpu_mult+0x27/0x2f0 kernel/rcu/tree_exp.h:238
>>   csd_do_func kernel/smp.c:133 [inline]
>>   __flush_smp_call_function_queue+0xb2e/0x15b0 kernel/smp.c:542
>>   __sysvec_call_function_single+0xa8/0x3e0 arch/x86/kernel/smp.c:271
>>   instr_sysvec_call_function_single arch/x86/kernel/smp.c:266 [inline]
>>   sysvec_call_function_single+0x9e/0xc0 arch/x86/kernel/smp.c:266
>>   asm_sysvec_call_function_single+0x1a/0x20 arch/x86/include/asm/idtentry.h:709
>>   __sanitizer_cov_trace_switch+0x90/0x120
>>   update_event_printk kernel/trace/trace_events.c:2750 [inline]
>>   trace_event_eval_update+0x311/0xf90 kernel/trace/trace_events.c:2922
>>   process_one_work kernel/workqueue.c:3254 [inline]
>>   process_scheduled_works+0xa00/0x1770 kernel/workqueue.c:3335
>>   worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
>>   kthread+0x2f0/0x390 kernel/kthread.c:388
>>   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
>>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
>>
>> to a HARDIRQ-irq-unsafe lock:
>>  (&htab->buckets[i].lock){+...}-{2:2}
>>
>> ... which became HARDIRQ-irq-unsafe at:
>> ...
>>   lock_acquire+0x1e4/0x530 kernel/locking/lockdep.c:5754
>>   __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
>>   _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
>>   spin_lock_bh include/linux/spinlock.h:356 [inline]
>>   sock_hash_delete_elem+0xb0/0x300 net/core/sock_map.c:939
>>   0xffffffffa0001b0e
>>   bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline]
>>   __bpf_prog_run include/linux/filter.h:657 [inline]
>>   bpf_prog_run include/linux/filter.h:664 [inline]
>>   __bpf_trace_run kernel/trace/bpf_trace.c:2381 [inline]
>>   bpf_trace_run2+0x204/0x420 kernel/trace/bpf_trace.c:2420
>>   trace_contention_end+0xd7/0x100 include/trace/events/lock.h:122
>>   __mutex_lock_common kernel/locking/mutex.c:617 [inline]
>>   __mutex_lock+0x2e5/0xd70 kernel/locking/mutex.c:752
>>   futex_cleanup_begin kernel/futex/core.c:1091 [inline]
>>   futex_exit_release+0x34/0x1f0 kernel/futex/core.c:1143
>>   exit_mm_release+0x1a/0x30 kernel/fork.c:1652
>>   exit_mm+0xb0/0x310 kernel/exit.c:542
>>   do_exit+0x99e/0x27e0 kernel/exit.c:865
>>   do_group_exit+0x207/0x2c0 kernel/exit.c:1027
>>   __do_sys_exit_group kernel/exit.c:1038 [inline]
>>   __se_sys_exit_group kernel/exit.c:1036 [inline]
>>   __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036
>>   do_syscall_64+0xfb/0x240
>>   entry_SYSCALL_64_after_hwframe+0x6d/0x75
>>
>> other info that might help us debug this:
>>
>>  Possible interrupt unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(&htab->buckets[i].lock);
>>                                local_irq_disable();
>>                                lock(rcu_node_0);
>>                                lock(&htab->buckets[i].lock);
>>   <Interrupt>
>>     lock(rcu_node_0);
>>
>>  *** DEADLOCK ***
>> [Fix]
>> Ensure that the context interrupt state is the same before and after using the
>> bucket->lock.
>>
>> Reported-and-tested-by: syzbot+c4f4d25859c2e5859988@syzkaller.appspotmail.com
>> Signed-off-by: Edward Adam Davis <eadavis@qq.com>
>> ---
>>  net/core/sock_map.c | 10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> index 27d733c0f65e..ae8f81b26e16 100644
>> --- a/net/core/sock_map.c
>> +++ b/net/core/sock_map.c
>> @@ -932,11 +932,12 @@ static long sock_hash_delete_elem(struct bpf_map *map, void *key)
>>         struct bpf_shtab_bucket *bucket;
>>         struct bpf_shtab_elem *elem;
>>         int ret = -ENOENT;
>> +       unsigned long flags;
>>
>>         hash = sock_hash_bucket_hash(key, key_size);
>>         bucket = sock_hash_select_bucket(htab, hash);
>>
>> -       spin_lock_bh(&bucket->lock);
>> +       spin_lock_irqsave(&bucket->lock, flags);
>>         elem = sock_hash_lookup_elem_raw(&bucket->head, hash, key, key_size);
>>         if (elem) {
>>                 hlist_del_rcu(&elem->node);
>> @@ -944,7 +945,7 @@ static long sock_hash_delete_elem(struct bpf_map *map, void *key)
>>                 sock_hash_free_elem(htab, elem);
>>                 ret = 0;
>>         }
>> -       spin_unlock_bh(&bucket->lock);
>> +       spin_unlock_irqrestore(&bucket->lock, flags);
>>         return ret;
>>  }
>>
>> @@ -1136,6 +1137,7 @@ static void sock_hash_free(struct bpf_map *map)
>>         struct bpf_shtab_elem *elem;
>>         struct hlist_node *node;
>>         int i;
>> +       unsigned long flags;
>>
>>         /* After the sync no updates or deletes will be in-flight so it
>>          * is safe to walk map and remove entries without risking a race
>> @@ -1151,11 +1153,11 @@ static void sock_hash_free(struct bpf_map *map)
>>                  * exists, psock exists and holds a ref to socket. That
>>                  * lets us to grab a socket ref too.
>>                  */
>> -               spin_lock_bh(&bucket->lock);
>> +               spin_lock_irqsave(&bucket->lock, flags);
>>                 hlist_for_each_entry(elem, &bucket->head, node)
>>                         sock_hold(elem->sk);
>>                 hlist_move_list(&bucket->head, &unlink_list);
>> -               spin_unlock_bh(&bucket->lock);
>> +               spin_unlock_irqrestore(&bucket->lock, flags);
>>
>>                 /* Process removed entries out of atomic context to
>>                  * block for socket lock before deleting the psock's
>> --
>> 2.43.0
>>

Jakub Sitnicki March 25, 2024, 1:49 p.m. UTC | #3

On Mon, Mar 25, 2024 at 01:23 PM +01, Jakub Sitnicki wrote:

[...]

> But we also need to cover sock_map_unref->sock_sock_map_del_link called
> from sock_hash_delete_elem. It also grabs a spin lock.

On second look, no need to disable interrupts in
sock_map_unref->sock_sock_map_del_link. Call is enclosed in the critical
section in sock_hash_delete_elem that has been updated.

I have a question, though, why are we patching sock_hash_free? It
doesn't get called unless there are no more existing users of the BPF
map. So nothing can mutate it from interrupt context.

[...]

Jakub Sitnicki March 26, 2024, 10:15 p.m. UTC | #4

On Mon, Mar 25, 2024 at 01:23 PM +01, Jakub Sitnicki wrote:
> On Sat, Mar 23, 2024 at 12:08 AM -07, Alexei Starovoitov wrote:
>> It seems this bug was causing multiple syzbot reports.
> Any chance we could disallow mutating sockhash from interrupt context?

I've been playing with the repro from one of the other reports:

https://lore.kernel.org/all/CABOYnLzaRiZ+M1v7dPaeObnj_=S4JYmWbgrXaYsyBbWh=553vQ@mail.gmail.com/

syzkaller workload is artificial. So, if we can avoid it, I'd rather not
support modifying sockmap/sockhash in contexts where irqs are disabled,
and lock safety rules are stricter than what we abide to today.

Ideally, we allow task and softirq contexts with irqs enabled (so no
tracing progs attached to timer tick, which syzcaller is using as corpus
here). Otherwise, we will have to cover for that in selftests.

I'm thinking about a restriction like:

---8<---

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 27d733c0f65e..3692f7256dd6 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -907,6 +907,7 @@ static void sock_hash_delete_from_link(struct bpf_map *map, struct sock *sk,
 	struct bpf_shtab_elem *elem_probe, *elem = link_raw;
 	struct bpf_shtab_bucket *bucket;
 
+	WARN_ON_ONCE(irqs_disabled());
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	bucket = sock_hash_select_bucket(htab, elem->hash);
 
@@ -933,6 +934,10 @@ static long sock_hash_delete_elem(struct bpf_map *map, void *key)
 	struct bpf_shtab_elem *elem;
 	int ret = -ENOENT;
 
+	/* Can't run. We don't play nice with hardirq-safe locks. */
+	if (irqs_disabled())
+		return -EOPNOTSUPP;
+
 	hash = sock_hash_bucket_hash(key, key_size);
 	bucket = sock_hash_select_bucket(htab, hash);
 
@@ -986,6 +991,7 @@ static int sock_hash_update_common(struct bpf_map *map, void *key,
 	struct sk_psock *psock;
 	int ret;
 
+	WARN_ON_ONCE(irqs_disabled());
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	if (unlikely(flags > BPF_EXIST))
 		return -EINVAL;

John Fastabend March 29, 2024, 5:29 a.m. UTC | #5

Jakub Sitnicki wrote:
> On Mon, Mar 25, 2024 at 01:23 PM +01, Jakub Sitnicki wrote:
> 
> [...]
> 
> > But we also need to cover sock_map_unref->sock_sock_map_del_link called
> > from sock_hash_delete_elem. It also grabs a spin lock.
> 
> On second look, no need to disable interrupts in
> sock_map_unref->sock_sock_map_del_link. Call is enclosed in the critical
> section in sock_hash_delete_elem that has been updated.
> 
> I have a question, though, why are we patching sock_hash_free? It
> doesn't get called unless there are no more existing users of the BPF
> map. So nothing can mutate it from interrupt context.
> 
> [...]

Agree sock_hash_free should be only after all refs are dropped.

Edward, did you want to send a v2 for this? Also if you want fixing the
sockmap case as well would be useful. Also happy to finish up the patches
if you would rather not.

Thanks,
John

Shung-Hsi Yu March 29, 2024, 3:52 p.m. UTC | #6

On Tue, Mar 26, 2024 at 11:15:47PM +0100, Jakub Sitnicki wrote:
> On Mon, Mar 25, 2024 at 01:23 PM +01, Jakub Sitnicki wrote:
> > On Sat, Mar 23, 2024 at 12:08 AM -07, Alexei Starovoitov wrote:
> >> It seems this bug was causing multiple syzbot reports.
> > Any chance we could disallow mutating sockhash from interrupt context?
> 
> I've been playing with the repro from one of the other reports:
> 
> https://lore.kernel.org/all/CABOYnLzaRiZ+M1v7dPaeObnj_=S4JYmWbgrXaYsyBbWh=553vQ@mail.gmail.com/

Possibly also related:
- "A potential deadlock in sockhash map"[1] report awhile back
- commit ed17aa92dc56b ("bpf, sockmap: fix deadlocks in the sockhash and
  sockmap")
- commit 8c5c2a4898e3d ("bpf, sockmap: Revert buggy deadlock fix in the
  sockhash and sockmap")

1: https://lore.kernel.org/all/CABcoxUayum5oOqFMMqAeWuS8+EzojquSOSyDA3J_2omY=2EeAg@mail.gmail.com/

bpf, sockmap: fix deadlock in rcu_report_exp_cpu_mult

Checks

Commit Message

Comments

Patch