diff mbox series

[v1,net] udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().

Message ID 20240708182023.93979-1-kuniyu@amazon.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [v1,net] udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port(). | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 833 this patch: 833
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 835 this patch: 835
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 841 this patch: 841
netdev/checkpatch warning WARNING: line length of 83 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 3 this patch: 3
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-07-08--21-00 (tests: 669)

Commit Message

Kuniyuki Iwashima July 8, 2024, 6:20 p.m. UTC
syzkaller triggered the warning [0] in udp_v4_early_demux().

In udp_v4_early_demux(), we do not touch the refcount of the looked-up
sk and use sock_pfree() as skb->destructor, so we check SOCK_RCU_FREE
to ensure that the sk is safe to access during the RCU grace period.

Currently, SOCK_RCU_FREE is flagged for a bound socket after being put
into the hash table.  Moreover, the SOCK_RCU_FREE check is done too
early in udp_v4_early_demux(), so there could be a small race window:

  CPU1                                 CPU2
  ----                                 ----
  udp_v4_early_demux()                 udp_lib_get_port()
  |                                    |- hlist_add_head_rcu()
  |- sk = __udp4_lib_demux_lookup()    |
  |- DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
                                       `- sock_set_flag(sk, SOCK_RCU_FREE)

In practice, sock_pfree() is called much later, when SOCK_RCU_FREE
is most likely propagated for other CPUs; otherwise, we will see
another warning of sk refcount underflow, but at least I didn't.

Technically, moving sock_set_flag(sk, SOCK_RCU_FREE) before
hlist_add_{head,tail}_rcu() does not guarantee the order, and we
must put smp_mb() between them, or smp_wmb() there and smp_rmb()
in udp_v4_early_demux().

But it's overkill in the real scenario, so I just put smp_mb() only under
CONFIG_DEBUG_NET to silence the splat.  When we see the refcount underflow
warning, we can remove the config guard.

Another option would be to remove DEBUG_NET_WARN_ON_ONCE(), but this could
make future debugging harder without the hints in udp_v4_early_demux() and
udp_lib_get_port().

[0]:
WARNING: CPU: 0 PID: 11198 at net/ipv4/udp.c:2599 udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599
Modules linked in:
CPU: 0 PID: 11198 Comm: syz-executor.1 Not tainted 6.9.0-g93bda33046e7 #13
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599
Code: c5 7a 15 fe bb 01 00 00 00 44 89 e9 31 ff d3 e3 81 e3 bf ef ff ff 89 de e8 2c 74 15 fe 85 db 0f 85 02 06 00 00 e8 9f 7a 15 fe <0f> 0b e8 98 7a 15 fe 49 8d 7e 60 e8 4f 39 2f fe 49 c7 46 60 20 52
RSP: 0018:ffffc9000ce3fa58 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8318c92c
RDX: ffff888036ccde00 RSI: ffffffff8318c2f1 RDI: 0000000000000001
RBP: ffff88805a2dd6e0 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0001ffffffffffff R12: ffff88805a2dd680
R13: 0000000000000007 R14: ffff88800923f900 R15: ffff88805456004e
FS:  00007fc449127640(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc449126e38 CR3: 000000003de4b002 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
PKRU: 55555554
Call Trace:
 <TASK>
 ip_rcv_finish_core.constprop.0+0xbdd/0xd20 net/ipv4/ip_input.c:349
 ip_rcv_finish+0xda/0x150 net/ipv4/ip_input.c:447
 NF_HOOK include/linux/netfilter.h:314 [inline]
 NF_HOOK include/linux/netfilter.h:308 [inline]
 ip_rcv+0x16c/0x180 net/ipv4/ip_input.c:569
 __netif_receive_skb_one_core+0xb3/0xe0 net/core/dev.c:5624
 __netif_receive_skb+0x21/0xd0 net/core/dev.c:5738
 netif_receive_skb_internal net/core/dev.c:5824 [inline]
 netif_receive_skb+0x271/0x300 net/core/dev.c:5884
 tun_rx_batched drivers/net/tun.c:1549 [inline]
 tun_get_user+0x24db/0x2c50 drivers/net/tun.c:2002
 tun_chr_write_iter+0x107/0x1a0 drivers/net/tun.c:2048
 new_sync_write fs/read_write.c:497 [inline]
 vfs_write+0x76f/0x8d0 fs/read_write.c:590
 ksys_write+0xbf/0x190 fs/read_write.c:643
 __do_sys_write fs/read_write.c:655 [inline]
 __se_sys_write fs/read_write.c:652 [inline]
 __x64_sys_write+0x41/0x50 fs/read_write.c:652
 x64_sys_call+0xe66/0x1990 arch/x86/include/generated/asm/syscalls_64.h:2
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x4b/0x110 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fc44a68bc1f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 e9 cf f5 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 3c d0 f5 ff 48
RSP: 002b:00007fc449126c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007fc44a68bc1f
RDX: 0000000000000032 RSI: 00000000200000c0 RDI: 00000000000000c8
RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000032 R11: 0000000000000293 R12: 0000000000000000
R13: 000000000000000b R14: 00007fc44a5ec530 R15: 0000000000000000
 </TASK>

Fixes: 08842c43d016 ("udp: no longer touch sk->sk_refcnt in early demux")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/udp.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Eric Dumazet July 8, 2024, 6:38 p.m. UTC | #1
On Mon, Jul 8, 2024 at 11:20 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> syzkaller triggered the warning [0] in udp_v4_early_demux().
>
> In udp_v4_early_demux(), we do not touch the refcount of the looked-up
> sk and use sock_pfree() as skb->destructor, so we check SOCK_RCU_FREE
> to ensure that the sk is safe to access during the RCU grace period.
>
> Currently, SOCK_RCU_FREE is flagged for a bound socket after being put
> into the hash table.  Moreover, the SOCK_RCU_FREE check is done too
> early in udp_v4_early_demux(), so there could be a small race window:
>
>   CPU1                                 CPU2
>   ----                                 ----
>   udp_v4_early_demux()                 udp_lib_get_port()
>   |                                    |- hlist_add_head_rcu()
>   |- sk = __udp4_lib_demux_lookup()    |
>   |- DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
>                                        `- sock_set_flag(sk, SOCK_RCU_FREE)
>
> In practice, sock_pfree() is called much later, when SOCK_RCU_FREE
> is most likely propagated for other CPUs; otherwise, we will see
> another warning of sk refcount underflow, but at least I didn't.
>
> Technically, moving sock_set_flag(sk, SOCK_RCU_FREE) before
> hlist_add_{head,tail}_rcu() does not guarantee the order, and we
> must put smp_mb() between them, or smp_wmb() there and smp_rmb()
> in udp_v4_early_demux().
>
> But it's overkill in the real scenario, so I just put smp_mb() only under
> CONFIG_DEBUG_NET to silence the splat.  When we see the refcount underflow
> warning, we can remove the config guard.
>
> Another option would be to remove DEBUG_NET_WARN_ON_ONCE(), but this could
> make future debugging harder without the hints in udp_v4_early_demux() and
> udp_lib_get_port().
>
> [0]:
>
> Fixes: 08842c43d016 ("udp: no longer touch sk->sk_refcnt in early demux")
> Reported-by: syzkaller <syzkaller@googlegroups.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/ipv4/udp.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 189c9113fe9a..1a05cc3d2b4f 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -326,6 +326,12 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
>                         goto fail_unlock;
>                 }
>
> +               sock_set_flag(sk, SOCK_RCU_FREE);

Nice catch.

> +
> +               if (IS_ENABLED(CONFIG_DEBUG_NET))
> +                       /* for DEBUG_NET_WARN_ON_ONCE() in udp_v4_early_demux(). */
> +                       smp_mb();
> +

I do not think this smp_mb() is needed. If this was, many other RCU
operations would need it,

RCU rules mandate that all memory writes must be committed before the
object can be seen
by other cpus in the hash table.

This includes the setting of the SOCK_RCU_FREE flag.

For instance, hlist_add_head_rcu() does a
rcu_assign_pointer(hlist_first_rcu(h), n);


>                 sk_add_node_rcu(sk, &hslot->head);
>                 hslot->count++;
>                 sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
> @@ -342,7 +348,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
>                 hslot2->count++;
>                 spin_unlock(&hslot2->lock);
>         }
> -       sock_set_flag(sk, SOCK_RCU_FREE);
> +
>         error = 0;
>  fail_unlock:
>         spin_unlock_bh(&hslot->lock);
> --
> 2.30.2
>
diff mbox series

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 189c9113fe9a..1a05cc3d2b4f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -326,6 +326,12 @@  int udp_lib_get_port(struct sock *sk, unsigned short snum,
 			goto fail_unlock;
 		}
 
+		sock_set_flag(sk, SOCK_RCU_FREE);
+
+		if (IS_ENABLED(CONFIG_DEBUG_NET))
+			/* for DEBUG_NET_WARN_ON_ONCE() in udp_v4_early_demux(). */
+			smp_mb();
+
 		sk_add_node_rcu(sk, &hslot->head);
 		hslot->count++;
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
@@ -342,7 +348,7 @@  int udp_lib_get_port(struct sock *sk, unsigned short snum,
 		hslot2->count++;
 		spin_unlock(&hslot2->lock);
 	}
-	sock_set_flag(sk, SOCK_RCU_FREE);
+
 	error = 0;
 fail_unlock:
 	spin_unlock_bh(&hslot->lock);