Message ID | 20211220143330.680945-1-eric.dumazet@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 8f905c0e7354ef261360fb7535ea079b1082c105 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] inet: fully convert sk->sk_rx_dst to RCU rules | expand |
Hello: This patch was applied to netdev/net.git (master) by Jakub Kicinski <kuba@kernel.org>: On Mon, 20 Dec 2021 06:33:30 -0800 you wrote: > From: Eric Dumazet <edumazet@google.com> > > syzbot reported various issues around early demux, > one being included in this changelog [1] > > sk->sk_rx_dst is using RCU protection without clearly > documenting it. > > [...] Here is the summary with links: - [net] inet: fully convert sk->sk_rx_dst to RCU rules https://git.kernel.org/netdev/net/c/8f905c0e7354 You are awesome, thank you!
On Mon, Dec 20, 2021 at 06:33:30AM -0800, Eric Dumazet wrote: > From: Eric Dumazet <edumazet@google.com> > > syzbot reported various issues around early demux, > one being included in this changelog [1] > > sk->sk_rx_dst is using RCU protection without clearly > documenting it. > > And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv() > are not following standard RCU rules. > > [a] dst_release(dst); > [b] sk->sk_rx_dst = NULL; > > They look wrong because a delete operation of RCU protected > pointer is supposed to clear the pointer before > the call_rcu()/synchronize_rcu() guarding actual memory freeing. > > In some cases indeed, dst could be freed before [b] is done. > > We could cheat by clearing sk_rx_dst before calling > dst_release(), but this seems the right time to stick > to standard RCU annotations and debugging facilities. > > [1] > BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline] > BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792 > Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204 > > CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:88 [inline] > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 > print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247 > __kasan_report mm/kasan/report.c:433 [inline] > kasan_report.cold+0x83/0xdf mm/kasan/report.c:450 > dst_check include/net/dst.h:470 [inline] > tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792 > ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340 > ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583 > ip_sublist_rcv net/ipv4/ip_input.c:609 [inline] > ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644 > __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline] > __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556 > __netif_receive_skb_list net/core/dev.c:5608 [inline] > netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699 > gro_normal_list net/core/dev.c:5853 [inline] > gro_normal_list net/core/dev.c:5849 [inline] > napi_complete_done+0x1f1/0x880 net/core/dev.c:6590 > virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline] > virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557 > __napi_poll+0xaf/0x440 net/core/dev.c:7023 > napi_poll net/core/dev.c:7090 [inline] > net_rx_action+0x801/0xb40 net/core/dev.c:7177 > __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 > invoke_softirq kernel/softirq.c:432 [inline] > __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637 > irq_exit_rcu+0x5/0x20 kernel/softirq.c:649 > common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240 > asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629 > RIP: 0033:0x7f5e972bfd57 > Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73 > RSP: 002b:00007fff8a413210 EFLAGS: 00000283 > RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45 > RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45 > RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9 > R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0 > R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019 > </TASK> > > Allocated by task 13: > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 > kasan_set_track mm/kasan/common.c:46 [inline] > set_alloc_info mm/kasan/common.c:434 [inline] > __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467 > kasan_slab_alloc include/linux/kasan.h:259 [inline] > slab_post_alloc_hook mm/slab.h:519 [inline] > slab_alloc_node mm/slub.c:3234 [inline] > slab_alloc mm/slub.c:3242 [inline] > kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247 > dst_alloc+0x146/0x1f0 net/core/dst.c:92 > rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613 > ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340 > ip_route_input_rcu net/ipv4/route.c:2470 [inline] > ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415 > ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354 > ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583 > ip_sublist_rcv net/ipv4/ip_input.c:609 [inline] > ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644 > __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline] > __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556 > __netif_receive_skb_list net/core/dev.c:5608 [inline] > netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699 > gro_normal_list net/core/dev.c:5853 [inline] > gro_normal_list net/core/dev.c:5849 [inline] > napi_complete_done+0x1f1/0x880 net/core/dev.c:6590 > virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline] > virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557 > __napi_poll+0xaf/0x440 net/core/dev.c:7023 > napi_poll net/core/dev.c:7090 [inline] > net_rx_action+0x801/0xb40 net/core/dev.c:7177 > __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 > > Freed by task 13: > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 > kasan_set_track+0x21/0x30 mm/kasan/common.c:46 > kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 > ____kasan_slab_free mm/kasan/common.c:366 [inline] > ____kasan_slab_free mm/kasan/common.c:328 [inline] > __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374 > kasan_slab_free include/linux/kasan.h:235 [inline] > slab_free_hook mm/slub.c:1723 [inline] > slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749 > slab_free mm/slub.c:3513 [inline] > kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530 > dst_destroy+0x2d6/0x3f0 net/core/dst.c:127 > rcu_do_batch kernel/rcu/tree.c:2506 [inline] > rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741 > __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 > > Last potentially related work creation: > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 > __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348 > __call_rcu kernel/rcu/tree.c:2985 [inline] > call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065 > dst_release net/core/dst.c:177 [inline] > dst_release+0x79/0xe0 net/core/dst.c:167 > tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712 > sk_backlog_rcv include/net/sock.h:1030 [inline] > __release_sock+0x134/0x3b0 net/core/sock.c:2768 > release_sock+0x54/0x1b0 net/core/sock.c:3300 > tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441 > inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819 > sock_sendmsg_nosec net/socket.c:704 [inline] > sock_sendmsg+0xcf/0x120 net/socket.c:724 > sock_write_iter+0x289/0x3c0 net/socket.c:1057 > call_write_iter include/linux/fs.h:2162 [inline] > new_sync_write+0x429/0x660 fs/read_write.c:503 > vfs_write+0x7cd/0xae0 fs/read_write.c:590 > ksys_write+0x1ee/0x250 fs/read_write.c:643 > do_syscall_x64 arch/x86/entry/common.c:50 [inline] > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 > entry_SYSCALL_64_after_hwframe+0x44/0xae > > The buggy address belongs to the object at ffff88807f1cb700 > which belongs to the cache ip_dst_cache of size 176 > The buggy address is located 58 bytes inside of > 176-byte region [ffff88807f1cb700, ffff88807f1cb7b0) > The buggy address belongs to the page: > page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb > flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff) > raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780 > raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 > page dumped because: kasan: bad access detected > page_owner tracks the page as allocated > page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062 > prep_new_page mm/page_alloc.c:2418 [inline] > get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149 > __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369 > alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191 > alloc_slab_page mm/slub.c:1793 [inline] > allocate_slab mm/slub.c:1930 [inline] > new_slab+0x32d/0x4a0 mm/slub.c:1993 > ___slab_alloc+0x918/0xfe0 mm/slub.c:3022 > __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109 > slab_alloc_node mm/slub.c:3200 [inline] > slab_alloc mm/slub.c:3242 [inline] > kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247 > dst_alloc+0x146/0x1f0 net/core/dst.c:92 > rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613 > __mkroute_output net/ipv4/route.c:2564 [inline] > ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791 > ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619 > __ip_route_output_key include/net/route.h:126 [inline] > ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850 > ip_route_output_key include/net/route.h:142 [inline] > geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809 > geneve_xmit_skb drivers/net/geneve.c:899 [inline] > geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082 > __netdev_start_xmit include/linux/netdevice.h:4994 [inline] > netdev_start_xmit include/linux/netdevice.h:5008 [inline] > xmit_one net/core/dev.c:3590 [inline] > dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606 > __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229 > page last free stack trace: > reset_page_owner include/linux/page_owner.h:24 [inline] > free_pages_prepare mm/page_alloc.c:1338 [inline] > free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389 > free_unref_page_prepare mm/page_alloc.c:3309 [inline] > free_unref_page+0x19/0x690 mm/page_alloc.c:3388 > qlink_free mm/kasan/quarantine.c:146 [inline] > qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165 > kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272 > __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444 > kasan_slab_alloc include/linux/kasan.h:259 [inline] > slab_post_alloc_hook mm/slab.h:519 [inline] > slab_alloc_node mm/slub.c:3234 [inline] > kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270 > __alloc_skb+0x215/0x340 net/core/skbuff.c:414 > alloc_skb include/linux/skbuff.h:1126 [inline] > alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078 > sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575 > mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754 > add_grhead+0x265/0x330 net/ipv6/mcast.c:1857 > add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995 > mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242 > mld_send_initial_cr net/ipv6/mcast.c:1232 [inline] > mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268 > process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298 > worker_thread+0x658/0x11f0 kernel/workqueue.c:2445 > > Memory state around the buggy address: > ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc > >ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ^ > ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc > ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.") > Signed-off-by: Eric Dumazet <edumazet@google.com> > --- > include/net/sock.h | 2 +- > net/ipv4/af_inet.c | 2 +- > net/ipv4/tcp.c | 3 +-- > net/ipv4/tcp_input.c | 2 +- > net/ipv4/tcp_ipv4.c | 11 +++++++---- > net/ipv4/udp.c | 6 +++--- > net/ipv6/tcp_ipv6.c | 11 +++++++---- > net/ipv6/udp.c | 4 ++-- > 8 files changed, 23 insertions(+), 18 deletions(-) > > diff --git a/include/net/sock.h b/include/net/sock.h > index bea21ff70e74d906216f4eaa2d5a712d12551216..d47e9658da28545c1f6afd9db0cf136b3e13d7b6 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -431,7 +431,7 @@ struct sock { > #ifdef CONFIG_XFRM > struct xfrm_policy __rcu *sk_policy[2]; > #endif > - struct dst_entry *sk_rx_dst; > + struct dst_entry __rcu *sk_rx_dst; > int sk_rx_dst_ifindex; > u32 sk_rx_dst_cookie; > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c > index 0189e3cd4a7df2dc2ea7121182ee290e0164df90..6b5956500436187d5e9b801ebb6f8f52ba0db090 100644 > --- a/net/ipv4/af_inet.c > +++ b/net/ipv4/af_inet.c > @@ -154,7 +154,7 @@ void inet_sock_destruct(struct sock *sk) > > kfree(rcu_dereference_protected(inet->inet_opt, 1)); > dst_release(rcu_dereference_protected(sk->sk_dst_cache, 1)); > - dst_release(sk->sk_rx_dst); > + dst_release(rcu_dereference_protected(sk->sk_rx_dst, 1)); > sk_refcnt_debug_dec(sk); > } > EXPORT_SYMBOL(inet_sock_destruct); > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index bbb3d39c69afc2d5a42c6ace8d473657861da61f..2bb28bfd83bf621b5d9f3fb7ce5695195697b43a 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -3012,8 +3012,7 @@ int tcp_disconnect(struct sock *sk, int flags) > icsk->icsk_ack.rcv_mss = TCP_MIN_MSS; > memset(&tp->rx_opt, 0, sizeof(tp->rx_opt)); > __sk_dst_reset(sk); > - dst_release(sk->sk_rx_dst); > - sk->sk_rx_dst = NULL; > + dst_release(xchg((__force struct dst_entry **)&sk->sk_rx_dst, NULL)); > tcp_saved_syn_free(tp); > tp->compressed_ack = 0; > tp->segs_in = 0; > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 246ab7b5e857eb9e802c4805075e89c98cf00636..0ce46849ec3d4595699dd54229919b2d66b70257 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -5787,7 +5787,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb) > trace_tcp_probe(sk, skb); > > tcp_mstamp_refresh(tp); > - if (unlikely(!sk->sk_rx_dst)) > + if (unlikely(!rcu_access_pointer(sk->sk_rx_dst))) > inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb); > /* > * Header prediction. > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index 13d868c43284584ee0c58ddfd411bb52c8b0c830..084df223b5dff8089a615a4a8d392b620fc0a28a 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -1701,7 +1701,10 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) > struct sock *rsk; > > if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */ > - struct dst_entry *dst = sk->sk_rx_dst; > + struct dst_entry *dst; > + > + dst = rcu_dereference_protected(sk->sk_rx_dst, > + lockdep_sock_is_held(sk)); > > sock_rps_save_rxhash(sk, skb); > sk_mark_napi_id(sk, skb); > @@ -1709,8 +1712,8 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) > if (sk->sk_rx_dst_ifindex != skb->skb_iif || > !INDIRECT_CALL_1(dst->ops->check, ipv4_dst_check, > dst, 0)) { > + RCU_INIT_POINTER(sk->sk_rx_dst, NULL); > dst_release(dst); > - sk->sk_rx_dst = NULL; > } > } > tcp_rcv_established(sk, skb); > @@ -1786,7 +1789,7 @@ int tcp_v4_early_demux(struct sk_buff *skb) > skb->sk = sk; > skb->destructor = sock_edemux; > if (sk_fullsock(sk)) { > - struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst); > + struct dst_entry *dst = rcu_dereference(sk->sk_rx_dst); > > if (dst) > dst = dst_check(dst, 0); > @@ -2201,7 +2204,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb) > struct dst_entry *dst = skb_dst(skb); > > if (dst && dst_hold_safe(dst)) { > - sk->sk_rx_dst = dst; > + rcu_assign_pointer(sk->sk_rx_dst, dst); > sk->sk_rx_dst_ifindex = skb->skb_iif; > } > } > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index 23b05e28490b0a0a690a837027f26167e353f8ce..15c6b450b8dba44d5f344a554eea6b991c1ea5f1 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -2250,7 +2250,7 @@ bool udp_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst) > struct dst_entry *old; > > if (dst_hold_safe(dst)) { > - old = xchg(&sk->sk_rx_dst, dst); > + old = xchg((__force struct dst_entry **)&sk->sk_rx_dst, dst); > dst_release(old); > return old != dst; > } > @@ -2440,7 +2440,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable, > struct dst_entry *dst = skb_dst(skb); > int ret; > > - if (unlikely(sk->sk_rx_dst != dst)) > + if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst)) > udp_sk_rx_dst_set(sk, dst); > > ret = udp_unicast_rcv_skb(sk, skb, uh); > @@ -2599,7 +2599,7 @@ int udp_v4_early_demux(struct sk_buff *skb) > > skb->sk = sk; > skb->destructor = sock_efree; > - dst = READ_ONCE(sk->sk_rx_dst); > + dst = rcu_dereference(sk->sk_rx_dst); > > if (dst) > dst = dst_check(dst, 0); > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c > index 551fce49841d7f53a111b0435855634cece2b40a..680e6481b9672040ccb41fd08ab80166575bef50 100644 > --- a/net/ipv6/tcp_ipv6.c > +++ b/net/ipv6/tcp_ipv6.c > @@ -107,7 +107,7 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb) > if (dst && dst_hold_safe(dst)) { > const struct rt6_info *rt = (const struct rt6_info *)dst; > > - sk->sk_rx_dst = dst; > + rcu_assign_pointer(sk->sk_rx_dst, dst); > sk->sk_rx_dst_ifindex = skb->skb_iif; > sk->sk_rx_dst_cookie = rt6_get_cookie(rt); > } > @@ -1505,7 +1505,10 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) > opt_skb = skb_clone(skb, sk_gfp_mask(sk, GFP_ATOMIC)); > > if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */ > - struct dst_entry *dst = sk->sk_rx_dst; > + struct dst_entry *dst; > + > + dst = rcu_dereference_protected(sk->sk_rx_dst, > + lockdep_sock_is_held(sk)); > > sock_rps_save_rxhash(sk, skb); > sk_mark_napi_id(sk, skb); > @@ -1513,8 +1516,8 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) > if (sk->sk_rx_dst_ifindex != skb->skb_iif || > INDIRECT_CALL_1(dst->ops->check, ip6_dst_check, > dst, sk->sk_rx_dst_cookie) == NULL) { > + RCU_INIT_POINTER(sk->sk_rx_dst, NULL); > dst_release(dst); > - sk->sk_rx_dst = NULL; > } > } > > @@ -1874,7 +1877,7 @@ INDIRECT_CALLABLE_SCOPE void tcp_v6_early_demux(struct sk_buff *skb) > skb->sk = sk; > skb->destructor = sock_edemux; > if (sk_fullsock(sk)) { > - struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst); > + struct dst_entry *dst = rcu_dereference(sk->sk_rx_dst); > > if (dst) > dst = dst_check(dst, sk->sk_rx_dst_cookie); > diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c > index e43b31d25fb61c7875f3bb8a93eb74da244d912a..a2caca6ccf114546f9e4ea854ad67208b2f3873e 100644 > --- a/net/ipv6/udp.c > +++ b/net/ipv6/udp.c > @@ -956,7 +956,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable, > struct dst_entry *dst = skb_dst(skb); > int ret; > > - if (unlikely(sk->sk_rx_dst != dst)) > + if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst)) > udp6_sk_rx_dst_set(sk, dst); > > if (!uh->check && !udp_sk(sk)->no_check6_rx) { > @@ -1070,7 +1070,7 @@ INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb) > > skb->sk = sk; > skb->destructor = sock_efree; > - dst = READ_ONCE(sk->sk_rx_dst); > + dst = rcu_dereference(sk->sk_rx_dst); > > if (dst) > dst = dst_check(dst, sk->sk_rx_dst_cookie); > -- > 2.34.1.173.g76aa8bc2d0-goog > > Eric, this patch was picked for v5.15 stable and I wonder whether this needs to be backported to older branches too. The Fixes commit quoted here seems to go back all the way to v3.5. Would you know? -- Carlos Llamas
On Thu, Oct 20, 2022 at 2:05 PM Carlos Llamas <cmllamas@google.com> wrote: > > On Mon, Dec 20, 2021 at 06:33:30AM -0800, Eric Dumazet wrote: > > Eric, this patch was picked for v5.15 stable and I wonder whether this > needs to be backported to older branches too. The Fixes commit quoted > here seems to go back all the way to v3.5. Would you know? I guess nobody cared to address some merge conflicts on older kernel versions. If you want, you could handle the backport, this patch can help some rare race windows on preemptable kernels. > > -- > Carlos Llamas
On Thu, Oct 20, 2022 at 02:53:06PM -0700, Eric Dumazet wrote: > On Thu, Oct 20, 2022 at 2:05 PM Carlos Llamas <cmllamas@google.com> wrote: > > > > On Mon, Dec 20, 2021 at 06:33:30AM -0800, Eric Dumazet wrote: > > > > > Eric, this patch was picked for v5.15 stable and I wonder whether this > > needs to be backported to older branches too. The Fixes commit quoted > > here seems to go back all the way to v3.5. Would you know? > > I guess nobody cared to address some merge conflicts on older kernel versions. > > If you want, you could handle the backport, this patch can help some > rare race windows > on preemptable kernels. Sounds good, I'll have a look at backporting this patch.
diff --git a/include/net/sock.h b/include/net/sock.h index bea21ff70e74d906216f4eaa2d5a712d12551216..d47e9658da28545c1f6afd9db0cf136b3e13d7b6 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -431,7 +431,7 @@ struct sock { #ifdef CONFIG_XFRM struct xfrm_policy __rcu *sk_policy[2]; #endif - struct dst_entry *sk_rx_dst; + struct dst_entry __rcu *sk_rx_dst; int sk_rx_dst_ifindex; u32 sk_rx_dst_cookie; diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 0189e3cd4a7df2dc2ea7121182ee290e0164df90..6b5956500436187d5e9b801ebb6f8f52ba0db090 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -154,7 +154,7 @@ void inet_sock_destruct(struct sock *sk) kfree(rcu_dereference_protected(inet->inet_opt, 1)); dst_release(rcu_dereference_protected(sk->sk_dst_cache, 1)); - dst_release(sk->sk_rx_dst); + dst_release(rcu_dereference_protected(sk->sk_rx_dst, 1)); sk_refcnt_debug_dec(sk); } EXPORT_SYMBOL(inet_sock_destruct); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index bbb3d39c69afc2d5a42c6ace8d473657861da61f..2bb28bfd83bf621b5d9f3fb7ce5695195697b43a 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3012,8 +3012,7 @@ int tcp_disconnect(struct sock *sk, int flags) icsk->icsk_ack.rcv_mss = TCP_MIN_MSS; memset(&tp->rx_opt, 0, sizeof(tp->rx_opt)); __sk_dst_reset(sk); - dst_release(sk->sk_rx_dst); - sk->sk_rx_dst = NULL; + dst_release(xchg((__force struct dst_entry **)&sk->sk_rx_dst, NULL)); tcp_saved_syn_free(tp); tp->compressed_ack = 0; tp->segs_in = 0; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 246ab7b5e857eb9e802c4805075e89c98cf00636..0ce46849ec3d4595699dd54229919b2d66b70257 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5787,7 +5787,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb) trace_tcp_probe(sk, skb); tcp_mstamp_refresh(tp); - if (unlikely(!sk->sk_rx_dst)) + if (unlikely(!rcu_access_pointer(sk->sk_rx_dst))) inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb); /* * Header prediction. diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 13d868c43284584ee0c58ddfd411bb52c8b0c830..084df223b5dff8089a615a4a8d392b620fc0a28a 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1701,7 +1701,10 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) struct sock *rsk; if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */ - struct dst_entry *dst = sk->sk_rx_dst; + struct dst_entry *dst; + + dst = rcu_dereference_protected(sk->sk_rx_dst, + lockdep_sock_is_held(sk)); sock_rps_save_rxhash(sk, skb); sk_mark_napi_id(sk, skb); @@ -1709,8 +1712,8 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) if (sk->sk_rx_dst_ifindex != skb->skb_iif || !INDIRECT_CALL_1(dst->ops->check, ipv4_dst_check, dst, 0)) { + RCU_INIT_POINTER(sk->sk_rx_dst, NULL); dst_release(dst); - sk->sk_rx_dst = NULL; } } tcp_rcv_established(sk, skb); @@ -1786,7 +1789,7 @@ int tcp_v4_early_demux(struct sk_buff *skb) skb->sk = sk; skb->destructor = sock_edemux; if (sk_fullsock(sk)) { - struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst); + struct dst_entry *dst = rcu_dereference(sk->sk_rx_dst); if (dst) dst = dst_check(dst, 0); @@ -2201,7 +2204,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb) struct dst_entry *dst = skb_dst(skb); if (dst && dst_hold_safe(dst)) { - sk->sk_rx_dst = dst; + rcu_assign_pointer(sk->sk_rx_dst, dst); sk->sk_rx_dst_ifindex = skb->skb_iif; } } diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 23b05e28490b0a0a690a837027f26167e353f8ce..15c6b450b8dba44d5f344a554eea6b991c1ea5f1 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2250,7 +2250,7 @@ bool udp_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst) struct dst_entry *old; if (dst_hold_safe(dst)) { - old = xchg(&sk->sk_rx_dst, dst); + old = xchg((__force struct dst_entry **)&sk->sk_rx_dst, dst); dst_release(old); return old != dst; } @@ -2440,7 +2440,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable, struct dst_entry *dst = skb_dst(skb); int ret; - if (unlikely(sk->sk_rx_dst != dst)) + if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst)) udp_sk_rx_dst_set(sk, dst); ret = udp_unicast_rcv_skb(sk, skb, uh); @@ -2599,7 +2599,7 @@ int udp_v4_early_demux(struct sk_buff *skb) skb->sk = sk; skb->destructor = sock_efree; - dst = READ_ONCE(sk->sk_rx_dst); + dst = rcu_dereference(sk->sk_rx_dst); if (dst) dst = dst_check(dst, 0); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 551fce49841d7f53a111b0435855634cece2b40a..680e6481b9672040ccb41fd08ab80166575bef50 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -107,7 +107,7 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb) if (dst && dst_hold_safe(dst)) { const struct rt6_info *rt = (const struct rt6_info *)dst; - sk->sk_rx_dst = dst; + rcu_assign_pointer(sk->sk_rx_dst, dst); sk->sk_rx_dst_ifindex = skb->skb_iif; sk->sk_rx_dst_cookie = rt6_get_cookie(rt); } @@ -1505,7 +1505,10 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) opt_skb = skb_clone(skb, sk_gfp_mask(sk, GFP_ATOMIC)); if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */ - struct dst_entry *dst = sk->sk_rx_dst; + struct dst_entry *dst; + + dst = rcu_dereference_protected(sk->sk_rx_dst, + lockdep_sock_is_held(sk)); sock_rps_save_rxhash(sk, skb); sk_mark_napi_id(sk, skb); @@ -1513,8 +1516,8 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) if (sk->sk_rx_dst_ifindex != skb->skb_iif || INDIRECT_CALL_1(dst->ops->check, ip6_dst_check, dst, sk->sk_rx_dst_cookie) == NULL) { + RCU_INIT_POINTER(sk->sk_rx_dst, NULL); dst_release(dst); - sk->sk_rx_dst = NULL; } } @@ -1874,7 +1877,7 @@ INDIRECT_CALLABLE_SCOPE void tcp_v6_early_demux(struct sk_buff *skb) skb->sk = sk; skb->destructor = sock_edemux; if (sk_fullsock(sk)) { - struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst); + struct dst_entry *dst = rcu_dereference(sk->sk_rx_dst); if (dst) dst = dst_check(dst, sk->sk_rx_dst_cookie); diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index e43b31d25fb61c7875f3bb8a93eb74da244d912a..a2caca6ccf114546f9e4ea854ad67208b2f3873e 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -956,7 +956,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable, struct dst_entry *dst = skb_dst(skb); int ret; - if (unlikely(sk->sk_rx_dst != dst)) + if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst)) udp6_sk_rx_dst_set(sk, dst); if (!uh->check && !udp_sk(sk)->no_check6_rx) { @@ -1070,7 +1070,7 @@ INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb) skb->sk = sk; skb->destructor = sock_efree; - dst = READ_ONCE(sk->sk_rx_dst); + dst = rcu_dereference(sk->sk_rx_dst); if (dst) dst = dst_check(dst, sk->sk_rx_dst_cookie);