Message ID | 20230403194959.48928-2-kuniyu@amazon.com (mailing list archive)
---|---
State | Accepted
Commit | 0a78cf7264d29abeca098eae0b188a10aabc8a32
Delegated to | Netdev Maintainers
Series | raw/ping: Fix locking in /proc/net/{raw,icmp}.
On Mon, Apr 3, 2023 at 9:51 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Dae R. Jeong reported a NULL deref in raw_get_next() [0].
>
> It seems that the repro was running these sequences in parallel so
> that one thread was iterating on a socket that was being freed in
> another netns.
>
>   unshare(0x40060200)
>   r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00')
>   socket$inet_icmp_raw(0x2, 0x3, 0x1)
>   pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f)
>
> After commit 0daf07e52709 ("raw: convert raw sockets to RCU"), we
> use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW
> sockets. However, we should use spinlock for slow paths to avoid
> the NULL deref.
>
> Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object
> is not reused during iteration in the grace period. In fact, the
> lockless readers do not check the nulls marker with get_nulls_value().
> So, SOCK_RAW should use hlist instead of hlist_nulls.
>
> Instead of adding an unnecessary barrier by sk_nulls_for_each_rcu(),
> let's convert hlist_nulls to hlist and use sk_for_each_rcu() for
> fast paths and sk_for_each() and spinlock for /proc/net/raw.
>
> Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU")
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Reported-by: Dae R. Jeong <threeearcat@gmail.com>
> Link: https://lore.kernel.org/netdev/ZCA2mGV_cmq7lIfV@dragonet/
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/raw.h   |  4 ++--
>  net/ipv4/raw.c      | 36 +++++++++++++++++++-----------------
>  net/ipv4/raw_diag.c | 10 ++++------
>  net/ipv6/raw.c      | 10 ++++------
>  4 files changed, 29 insertions(+), 31 deletions(-)

SGTM, thanks a lot.

Reviewed-by: Eric Dumazet <edumazet@google.com>
On Tue, Apr 4, 2023 at 4:02 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Dae R. Jeong reported a NULL deref in raw_get_next() [0].
>
> It seems that the repro was running these sequences in parallel so
> that one thread was iterating on a socket that was being freed in
> another netns.
>
>   unshare(0x40060200)
>   r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00')
>   socket$inet_icmp_raw(0x2, 0x3, 0x1)
>   pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f)
>
> After commit 0daf07e52709 ("raw: convert raw sockets to RCU"), we
> use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW
> sockets. However, we should use spinlock for slow paths to avoid
> the NULL deref.
>
> Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object
> is not reused during iteration in the grace period. In fact, the
> lockless readers do not check the nulls marker with get_nulls_value().
> So, SOCK_RAW should use hlist instead of hlist_nulls.
>
> Instead of adding an unnecessary barrier by sk_nulls_for_each_rcu(),
> let's convert hlist_nulls to hlist and use sk_for_each_rcu() for
> fast paths and sk_for_each() and spinlock for /proc/net/raw.
>
> [...]
>
> @@ -989,9 +986,12 @@ static struct sock *raw_get_idx(struct seq_file *seq, loff_t pos)
>  }
> [...]
>  void *raw_seq_start(struct seq_file *seq, loff_t *pos)
> -	__acquires(RCU)
> +	__acquires(&h->lock)
>  {
> -	rcu_read_lock();
> +	struct raw_hashinfo *h = pde_data(file_inode(seq->file));
> +
> +	spin_lock(&h->lock);

I would like to ask two questions that confuse me:

1) Why would we use spin_lock to protect the socket in a raw hashtable
for the reader's safety under the rcu protection? Normally, if we use
RCU protection, we only make sure that we destroy the socket by calling
call_rcu(), which prevents the READER of the socket from getting a NULL
pointer.

2) Could using the spin lock while we're cat'ing the /proc/net/raw file
block/postpone the action of hashing a socket somewhere else?

Thanks,
Jason

> +
> 	return *pos ? raw_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
>  }
>  EXPORT_SYMBOL_GPL(raw_seq_start);
>
> [...]
On Tue, Apr 4, 2023 at 4:46 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> I would like to ask two questions that confuse me:
>
> 1) Why would we use spin_lock to protect the socket in a raw hashtable
> for the reader's safety under the rcu protection? Normally, if we use
> RCU protection, we only make sure that we destroy the socket by calling
> call_rcu(), which prevents the READER of the socket from getting a NULL
> pointer.

Yes, but then we cannot sleep or yield the CPU.

> 2) Could using the spin lock while we're cat'ing the /proc/net/raw file
> block/postpone the action of hashing a socket somewhere else?

/proc/net/raw file access is rare, and limited in duration (at most
one page filled per system call).

Use of RCU was only intended in my original patch to solve deadlock issues
under packet floods, like DOS attacks.

Really, using RCU in the data/fast path makes sense (and we did that already).
For control paths (or /proc/... files), not so much.
On Tue, Apr 4, 2023 at 12:07 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Apr 4, 2023 at 4:46 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > I would like to ask two questions that confuse me:
> >
> > 1) Why would we use spin_lock to protect the socket in a raw hashtable
> > for the reader's safety under the rcu protection? Normally, if we use
> > RCU protection, we only make sure that we destroy the socket by calling
> > call_rcu(), which prevents the READER of the socket from getting a NULL
> > pointer.
>
> Yes, but then we cannot sleep or yield the CPU.

Indeed. But we also cannot sleep/yield while holding the spin lock, and
I checked the caller in fs/seq_file.c: there is no chance to sleep/yield
between ->start and ->stop anyway.

So I wonder why we couldn't use RCU directly, like the earlier patch [1]
you proposed, instead of deliberately switching to the spin lock? Taking
the spin lock on the whole hashinfo to protect the reader side is heavy,
and RCU outperforms a spin lock in this case, I think.

[1]: commit 0daf07e52709 ("raw: convert raw sockets to RCU")

Thanks,
Jason

> > 2) Could using the spin lock while we're cat'ing the /proc/net/raw file
> > block/postpone the action of hashing a socket somewhere else?
>
> /proc/net/raw file access is rare, and limited in duration (at most
> one page filled per system call).
>
> Use of RCU was only intended in my original patch to solve deadlock issues
> under packet floods, like DOS attacks.
>
> Really, using RCU in the data/fast path makes sense (and we did that already).
> For control paths (or /proc/... files), not so much.
On Tue, Apr 4, 2023 at 8:56 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Tue, Apr 4, 2023 at 12:07 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > [...]
> >
> > Yes, but then we cannot sleep or yield the CPU.
>
> Indeed. But we also cannot sleep/yield while holding the spin lock, and
> I checked the caller in fs/seq_file.c: there is no chance to sleep/yield
> between ->start and ->stop anyway.

You missed my point.

The spinlock can trivially be replaced by a mutex, now that the fast path
has been RCU converted.

This would allow raw_get_idx()/raw_get_first() to use cond_resched(),
if some hosts try to use 10,000 raw sockets :/

Is it a real problem to solve right now? I do not think so.

> So I wonder why we couldn't use RCU directly, like the earlier patch [1]
> you proposed, instead of deliberately switching to the spin lock? Taking
> the spin lock on the whole hashinfo to protect the reader side is heavy,
> and RCU outperforms a spin lock in this case, I think.

A spinlock is just fine here; most hosts have less than 10 raw sockets,
because raw sockets make things _much_ slower.

RCU 'just because' does not make sense; it would suggest that RAW sockets
scale, while they do not.
On Tue, Apr 4, 2023 at 3:23 PM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
>
> You missed my point.
>
> The spinlock can trivially be replaced by a mutex, now that the fast path
> has been RCU converted.
>
> This would allow raw_get_idx()/raw_get_first() to use cond_resched(),
> if some hosts try to use 10,000 raw sockets :/

Thanks for the clarification. I agree. The patch itself is good as it
stands :)

> Is it a real problem to solve right now? I do not think so.
>
> > So I wonder why we couldn't use RCU directly, like the earlier patch [1]
> > you proposed, instead of deliberately switching to the spin lock? Taking
> > the spin lock on the whole hashinfo to protect the reader side is heavy,
> > and RCU outperforms a spin lock in this case, I think.
>
> A spinlock is just fine here; most hosts have less than 10 raw sockets,
> because raw sockets make things _much_ slower.

Sure.

Thanks,
Jason

> RCU 'just because' does not make sense; it would suggest that RAW sockets
> scale, while they do not.
diff --git a/include/net/raw.h b/include/net/raw.h
index 2c004c20ed99..3af5289fdead 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -37,7 +37,7 @@ int raw_rcv(struct sock *, struct sk_buff *);
 struct raw_hashinfo {
 	spinlock_t lock;
 
-	struct hlist_nulls_head ht[RAW_HTABLE_SIZE] ____cacheline_aligned;
+	struct hlist_head ht[RAW_HTABLE_SIZE] ____cacheline_aligned;
 };
 
 static inline u32 raw_hashfunc(const struct net *net, u32 proto)
@@ -51,7 +51,7 @@ static inline void raw_hashinfo_init(struct raw_hashinfo *hashinfo)
 
 	spin_lock_init(&hashinfo->lock);
 	for (i = 0; i < RAW_HTABLE_SIZE; i++)
-		INIT_HLIST_NULLS_HEAD(&hashinfo->ht[i], i);
+		INIT_HLIST_HEAD(&hashinfo->ht[i]);
 }
 
 #ifdef CONFIG_PROC_FS
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 94df935ee0c5..8088a5011e7d 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -91,12 +91,12 @@ EXPORT_SYMBOL_GPL(raw_v4_hashinfo);
 int raw_hash_sk(struct sock *sk)
 {
 	struct raw_hashinfo *h = sk->sk_prot->h.raw_hash;
-	struct hlist_nulls_head *hlist;
+	struct hlist_head *hlist;
 
 	hlist = &h->ht[raw_hashfunc(sock_net(sk), inet_sk(sk)->inet_num)];
 
 	spin_lock(&h->lock);
-	__sk_nulls_add_node_rcu(sk, hlist);
+	sk_add_node_rcu(sk, hlist);
 	sock_set_flag(sk, SOCK_RCU_FREE);
 	spin_unlock(&h->lock);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
@@ -110,7 +110,7 @@ void raw_unhash_sk(struct sock *sk)
 	struct raw_hashinfo *h = sk->sk_prot->h.raw_hash;
 
 	spin_lock(&h->lock);
-	if (__sk_nulls_del_node_init_rcu(sk))
+	if (sk_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 	spin_unlock(&h->lock);
 }
@@ -163,16 +163,15 @@ static int icmp_filter(const struct sock *sk, const struct sk_buff *skb)
 static int raw_v4_input(struct net *net, struct sk_buff *skb,
 			const struct iphdr *iph, int hash)
 {
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
 	int sdif = inet_sdif(skb);
+	struct hlist_head *hlist;
 	int dif = inet_iif(skb);
 	int delivered = 0;
 	struct sock *sk;
 
 	hlist = &raw_v4_hashinfo.ht[hash];
 	rcu_read_lock();
-	sk_nulls_for_each(sk, hnode, hlist) {
+	sk_for_each_rcu(sk, hlist) {
 		if (!raw_v4_match(net, sk, iph->protocol,
 				  iph->saddr, iph->daddr, dif, sdif))
 			continue;
@@ -264,10 +263,9 @@ static void raw_err(struct sock *sk, struct sk_buff *skb, u32 info)
 void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 {
 	struct net *net = dev_net(skb->dev);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
 	int dif = skb->dev->ifindex;
 	int sdif = inet_sdif(skb);
+	struct hlist_head *hlist;
 	const struct iphdr *iph;
 	struct sock *sk;
 	int hash;
@@ -276,7 +274,7 @@ void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 	hlist = &raw_v4_hashinfo.ht[hash];
 
 	rcu_read_lock();
-	sk_nulls_for_each(sk, hnode, hlist) {
+	sk_for_each_rcu(sk, hlist) {
 		iph = (const struct iphdr *)skb->data;
 		if (!raw_v4_match(net, sk, iph->protocol,
 				  iph->daddr, iph->saddr, dif, sdif))
@@ -950,14 +948,13 @@ static struct sock *raw_get_first(struct seq_file *seq, int bucket)
 {
 	struct raw_hashinfo *h = pde_data(file_inode(seq->file));
 	struct raw_iter_state *state = raw_seq_private(seq);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
+	struct hlist_head *hlist;
 	struct sock *sk;
 
 	for (state->bucket = bucket; state->bucket < RAW_HTABLE_SIZE;
 	     ++state->bucket) {
 		hlist = &h->ht[state->bucket];
-		sk_nulls_for_each(sk, hnode, hlist) {
+		sk_for_each(sk, hlist) {
 			if (sock_net(sk) == seq_file_net(seq))
 				return sk;
 		}
@@ -970,7 +967,7 @@ static struct sock *raw_get_next(struct seq_file *seq, struct sock *sk)
 	struct raw_iter_state *state = raw_seq_private(seq);
 
 	do {
-		sk = sk_nulls_next(sk);
+		sk = sk_next(sk);
 	} while (sk && sock_net(sk) != seq_file_net(seq));
 
 	if (!sk)
@@ -989,9 +986,12 @@ static struct sock *raw_get_idx(struct seq_file *seq, loff_t pos)
 }
 
 void *raw_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
+	__acquires(&h->lock)
 {
-	rcu_read_lock();
+	struct raw_hashinfo *h = pde_data(file_inode(seq->file));
+
+	spin_lock(&h->lock);
+
 	return *pos ? raw_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
 }
 EXPORT_SYMBOL_GPL(raw_seq_start);
@@ -1010,9 +1010,11 @@ void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 EXPORT_SYMBOL_GPL(raw_seq_next);
 
 void raw_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
+	__releases(&h->lock)
 {
-	rcu_read_unlock();
+	struct raw_hashinfo *h = pde_data(file_inode(seq->file));
+
+	spin_unlock(&h->lock);
 }
 EXPORT_SYMBOL_GPL(raw_seq_stop);
 
diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index 999321834b94..da3591a66a16 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -57,8 +57,7 @@ static bool raw_lookup(struct net *net, struct sock *sk,
 static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2 *r)
 {
 	struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
+	struct hlist_head *hlist;
 	struct sock *sk;
 	int slot;
 
@@ -68,7 +67,7 @@ static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2
 	rcu_read_lock();
 	for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) {
 		hlist = &hashinfo->ht[slot];
-		sk_nulls_for_each(sk, hnode, hlist) {
+		sk_for_each_rcu(sk, hlist) {
 			if (raw_lookup(net, sk, r)) {
 				/*
 				 * Grab it and keep until we fill
@@ -142,9 +141,8 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
 	struct net *net = sock_net(skb->sk);
 	struct inet_diag_dump_data *cb_data;
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
 	int num, s_num, slot, s_slot;
+	struct hlist_head *hlist;
 	struct sock *sk = NULL;
 	struct nlattr *bc;
 
@@ -161,7 +159,7 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			num = 0;
 
 			hlist = &hashinfo->ht[slot];
-			sk_nulls_for_each(sk, hnode, hlist) {
+			sk_for_each_rcu(sk, hlist) {
 				struct inet_sock *inet = inet_sk(sk);
 
 				if (!net_eq(sock_net(sk), net))
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index bac9ba747bde..a327aa481df4 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -141,10 +141,9 @@ EXPORT_SYMBOL(rawv6_mh_filter_unregister);
 static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 {
 	struct net *net = dev_net(skb->dev);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
 	const struct in6_addr *saddr;
 	const struct in6_addr *daddr;
+	struct hlist_head *hlist;
 	struct sock *sk;
 	bool delivered = false;
 	__u8 hash;
@@ -155,7 +154,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 	hash = raw_hashfunc(net, nexthdr);
 	hlist = &raw_v6_hashinfo.ht[hash];
 	rcu_read_lock();
-	sk_nulls_for_each(sk, hnode, hlist) {
+	sk_for_each_rcu(sk, hlist) {
 		int filtered;
 
 		if (!raw_v6_match(net, sk, nexthdr, daddr, saddr,
@@ -333,15 +332,14 @@ void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
 		u8 type, u8 code, int inner_offset, __be32 info)
 {
 	struct net *net = dev_net(skb->dev);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
+	struct hlist_head *hlist;
 	struct sock *sk;
 	int hash;
 
 	hash = raw_hashfunc(net, nexthdr);
 	hlist = &raw_v6_hashinfo.ht[hash];
 	rcu_read_lock();
-	sk_nulls_for_each(sk, hnode, hlist) {
+	sk_for_each_rcu(sk, hlist) {
 		/* Note: ipv6_hdr(skb) != skb->data */
 		const struct ipv6hdr *ip6h = (const struct ipv6hdr *)skb->data;
Dae R. Jeong reported a NULL deref in raw_get_next() [0].

It seems that the repro was running these sequences in parallel so
that one thread was iterating on a socket that was being freed in
another netns.

  unshare(0x40060200)
  r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00')
  socket$inet_icmp_raw(0x2, 0x3, 0x1)
  pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f)

After commit 0daf07e52709 ("raw: convert raw sockets to RCU"), we
use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW
sockets. However, we should use spinlock for slow paths to avoid
the NULL deref.

Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object
is not reused during iteration in the grace period. In fact, the
lockless readers do not check the nulls marker with get_nulls_value().
So, SOCK_RAW should use hlist instead of hlist_nulls.

Instead of adding an unnecessary barrier by sk_nulls_for_each_rcu(),
let's convert hlist_nulls to hlist and use sk_for_each_rcu() for
fast paths and sk_for_each() and spinlock for /proc/net/raw.
[0]:
general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
CPU: 2 PID: 20952 Comm: syz-executor.0 Not tainted 6.2.0-g048ec869bafd-dirty #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline]
RIP: 0010:sock_net include/net/sock.h:649 [inline]
RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline]
RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline]
RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995
Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef
RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206
RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000
RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338
RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9
R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78
R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030
FS:  00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055bb9614b35f CR3: 000000003c672000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 seq_read_iter+0x4c6/0x10f0 fs/seq_file.c:225
 seq_read+0x224/0x320 fs/seq_file.c:162
 pde_read fs/proc/inode.c:316 [inline]
 proc_reg_read+0x23f/0x330 fs/proc/inode.c:328
 vfs_read+0x31e/0xd30 fs/read_write.c:468
 ksys_pread64 fs/read_write.c:665 [inline]
 __do_sys_pread64 fs/read_write.c:675 [inline]
 __se_sys_pread64 fs/read_write.c:672 [inline]
 __x64_sys_pread64+0x1e9/0x280 fs/read_write.c:672
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x4e/0xa0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x478d29
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f843ae8dbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 0000000000791408 RCX: 0000000000478d29
RDX: 000000000000000a RSI: 0000000020000000 RDI: 0000000000000003
RBP: 00000000f477909a R08: 0000000000000000 R09: 0000000000000000
R10: 000010000000007f R11: 0000000000000246 R12: 0000000000791740
R13: 0000000000791414 R14: 0000000000791408 R15: 00007ffc2eb48a50
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline]
RIP: 0010:sock_net include/net/sock.h:649 [inline]
RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline]
RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline]
RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995
Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef
RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206
RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000
RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338
RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9
R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78
R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030
FS:  00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f92ff166000 CR3: 000000003c672000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Dae R. Jeong <threeearcat@gmail.com>
Link: https://lore.kernel.org/netdev/ZCA2mGV_cmq7lIfV@dragonet/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/raw.h   |  4 ++--
 net/ipv4/raw.c      | 36 +++++++++++++++++++-----------------
 net/ipv4/raw_diag.c | 10 ++++------
 net/ipv6/raw.c      | 10 ++++------
 4 files changed, 29 insertions(+), 31 deletions(-)