Message ID | 20220619003919.394622-1-i.maximets@ovn.org (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net] net: ensure all external references are released in deferred skbuffs |
On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> Open vSwitch system test suite is broken due to inability to
> load/unload netfilter modules. kworker thread is getting trapped
> in the infinite loop while running a net cleanup inside the
> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> holding nfct references and not being freed by their CPU cores.
>
> In general, the idea that we will have an rx interrupt on every
> CPU core at some point in a near future doesn't seem correct.
> Devices are getting created and destroyed, interrupts are getting
> re-scheduled, CPUs are going online and offline dynamically.
> Any of these events may leave packets stuck in defer list for a
> long time. It might be OK, if they are just a piece of memory,
> but we can't afford them holding references to any other resources.
>
> In case of OVS, nfct reference keeps the kernel thread in busy loop
> while holding a 'pernet_ops_rwsem' semaphore. That blocks the
> later modprobe request from user space:
>
>   # ps
>   299 root R 99.3 200:25.89 kworker/u96:4+
>
>   # journalctl
>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>   Not tainted 5.19.0-rc2 #8
>   task:modprobe state:D
>   Call Trace:
>    <TASK>
>    __schedule+0x8aa/0x21d0
>    schedule+0xcc/0x200
>    rwsem_down_write_slowpath+0x8e4/0x1580
>    down_write+0xfc/0x140
>    register_pernet_subsys+0x15/0x40
>    nf_nat_init+0xb6/0x1000 [nf_nat]
>    do_one_initcall+0xbb/0x410
>    do_init_module+0x1b4/0x640
>    load_module+0x4c1b/0x58d0
>    __do_sys_init_module+0x1d7/0x220
>    do_syscall_64+0x3a/0x80
>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> At this point OVS testsuite is unresponsive and never recover,
> because these skbuffs are never freed.
>
> Solution is to make sure no external references attached to skb
> before pushing it to the defer list. Using skb_release_head_state()
> for that purpose. The function modified to be re-enterable, as it
> will be called again during the defer list flush.
>
> Another approach that can fix the OVS use-case, is to kick all
> cores while waiting for references to be released during the net
> cleanup. But that sounds more like a workaround for a current
> issue rather than a proper solution and will not cover possible
> issues in other parts of the code.
>
> Additionally checking for skb_zcopy() while deferring. This might
> not be necessary, as I'm not sure if we can actually have zero copy
> packets on this path, but seems worth having for completeness as we
> should never defer such packets regardless.
>
> CC: Eric Dumazet <edumazet@google.com>
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  net/core/skbuff.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)

I do not think this patch is doing the right thing.

Packets sitting in TCP receive queues should not hold state that is
not relevant for TCP recvmsg().

This consumes extra memory for no good reason, and defer expensive
atomic operations.

We for instance release skb dst before skb is queued, we should do the
same for conntrack state.

This would increase performance anyway, as we free ct state while cpu
caches are hot.
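For reference, nf_reset_ct(), which comes up throughout this thread, is the
helper that puts the skb's conntrack reference and clears the field. The body
below is paraphrased from memory rather than copied from the tree, but it is
roughly the following, which is why dropping the ct state before queueing is
cheap, and why a deferred skb that never gets freed keeps the module pinned:

static inline void nf_reset_ct(struct sk_buff *skb)
{
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
	/* Release the reference that pins the nf_conntrack module. */
	nf_conntrack_put(skb_nfct(skb));
	skb->_nfct = 0;
#endif
}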
On Wed, Jun 22, 2022 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote: > > On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > Open vSwitch system test suite is broken due to inability to > > load/unload netfilter modules. kworker thread is getting trapped > > in the infinite loop while running a net cleanup inside the > > nf_conntrack_cleanup_net_list, because deferred skbuffs are still > > holding nfct references and not being freed by their CPU cores. > > > > In general, the idea that we will have an rx interrupt on every > > CPU core at some point in a near future doesn't seem correct. > > Devices are getting created and destroyed, interrupts are getting > > re-scheduled, CPUs are going online and offline dynamically. > > Any of these events may leave packets stuck in defer list for a > > long time. It might be OK, if they are just a piece of memory, > > but we can't afford them holding references to any other resources. > > > > In case of OVS, nfct reference keeps the kernel thread in busy loop > > while holding a 'pernet_ops_rwsem' semaphore. That blocks the > > later modprobe request from user space: > > > > # ps > > 299 root R 99.3 200:25.89 kworker/u96:4+ > > > > # journalctl > > INFO: task modprobe:11787 blocked for more than 1228 seconds. > > Not tainted 5.19.0-rc2 #8 > > task:modprobe state:D > > Call Trace: > > <TASK> > > __schedule+0x8aa/0x21d0 > > schedule+0xcc/0x200 > > rwsem_down_write_slowpath+0x8e4/0x1580 > > down_write+0xfc/0x140 > > register_pernet_subsys+0x15/0x40 > > nf_nat_init+0xb6/0x1000 [nf_nat] > > do_one_initcall+0xbb/0x410 > > do_init_module+0x1b4/0x640 > > load_module+0x4c1b/0x58d0 > > __do_sys_init_module+0x1d7/0x220 > > do_syscall_64+0x3a/0x80 > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > > > At this point OVS testsuite is unresponsive and never recover, > > because these skbuffs are never freed. > > > > Solution is to make sure no external references attached to skb > > before pushing it to the defer list. Using skb_release_head_state() > > for that purpose. The function modified to be re-enterable, as it > > will be called again during the defer list flush. > > > > Another approach that can fix the OVS use-case, is to kick all > > cores while waiting for references to be released during the net > > cleanup. But that sounds more like a workaround for a current > > issue rather than a proper solution and will not cover possible > > issues in other parts of the code. > > > > Additionally checking for skb_zcopy() while deferring. This might > > not be necessary, as I'm not sure if we can actually have zero copy > > packets on this path, but seems worth having for completeness as we > > should never defer such packets regardless. > > > > CC: Eric Dumazet <edumazet@google.com> > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") > > Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > > --- > > net/core/skbuff.c | 16 +++++++++++----- > > 1 file changed, 11 insertions(+), 5 deletions(-) > > I do not think this patch is doing the right thing. > > Packets sitting in TCP receive queues should not hold state that is > not relevant for TCP recvmsg(). > > This consumes extra memory for no good reason, and defer expensive > atomic operations. > > We for instance release skb dst before skb is queued, we should do the > same for conntrack state. > > This would increase performance anyway, as we free ct state while cpu > caches are hot. I am thinking of the following instead. 
A new helper can be added (and later be used in net/packet/af_packet.c
and probably elsewhere)

diff --git a/include/net/dst.h b/include/net/dst.h
index 6aa252c3fc55ccaee58faebf265510469e91d780..7c3316d9d6e73daea17223a5261f6a5c4f68eae3 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -276,6 +276,15 @@ static inline void skb_dst_drop(struct sk_buff *skb)
 	}
 }
 
+/* Before queueing skb in a receive queue, get rid of
+ * potentially expensive components.
+ */
+static inline void skb_cleanup(struct sk_buff *skb)
+{
+	skb_dst_drop(skb);
+	nf_reset_ct(skb);
+}
+
 static inline void __skb_dst_copy(struct sk_buff *nskb, unsigned long refdst)
 {
 	nskb->slow_gro |= !!refdst;
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index fdbcf2a6d08ef4a5164247b5a5b4b222289b191a..913c98e446d56ee067b54b2c704ac1195ef1a81e 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -177,7 +177,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 	if (!skb)
 		return;
 
-	skb_dst_drop(skb);
+	skb_cleanup(skb);
 	/* segs_in has been initialized to 1 in tcp_create_openreq_child().
 	 * Hence, reset segs_in to 0 before calling tcp_segs_in()
 	 * to avoid double counting. Also, tcp_segs_in() expects
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2e2a9ece9af27372e6b653d685a89a2c71ba05d1..987981a16ee34e0601e7e722abef1bb098c307c5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5005,7 +5005,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		__kfree_skb(skb);
 		return;
 	}
-	skb_dst_drop(skb);
+	skb_cleanup(skb);
 	__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
 	reason = SKB_DROP_REASON_NOT_SPECIFIED;
@@ -5931,7 +5931,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
 
 			/* Bulk data transfer: receiver */
-			skb_dst_drop(skb);
+			skb_cleanup(skb);
 			__skb_pull(skb, tcp_header_len);
 			eaten = tcp_queue_rcv(sk, skb, &fragstolen);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fe8f23b95d32ca4a35d05166d471327bc608fa91..d9acd906f28267ff07450d78d079e4e8eab74957 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1765,7 +1765,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	 */
 	skb_condense(skb);
 
-	skb_dst_drop(skb);
+	skb_cleanup(skb);
 
 	if (unlikely(tcp_checksum_complete(skb))) {
 		bh_unlock_sock(sk);
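As a usage sketch only (this is not part of the posted diff, and the call site
is quoted from memory): packet_rcv() in net/packet/af_packet.c already drops
the dst and the conntrack reference back to back before queueing to
sk_receive_queue, so the "later be used in net/packet/af_packet.c" remark
would presumably amount to folding those two calls into the new helper:

 	skb->dev = NULL;
-	skb_dst_drop(skb);
-	/* drop conntrack reference */
-	nf_reset_ct(skb);
+	skb_cleanup(skb);
 	/* ... skb is then queued to sk->sk_receive_queue ... */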
Eric Dumazet <edumazet@google.com> wrote: > On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > Open vSwitch system test suite is broken due to inability to > > load/unload netfilter modules. kworker thread is getting trapped > > in the infinite loop while running a net cleanup inside the > > nf_conntrack_cleanup_net_list, because deferred skbuffs are still > > holding nfct references and not being freed by their CPU cores. > > > > In general, the idea that we will have an rx interrupt on every > > CPU core at some point in a near future doesn't seem correct. > > Devices are getting created and destroyed, interrupts are getting > > re-scheduled, CPUs are going online and offline dynamically. > > Any of these events may leave packets stuck in defer list for a > > long time. It might be OK, if they are just a piece of memory, > > but we can't afford them holding references to any other resources. > > > > In case of OVS, nfct reference keeps the kernel thread in busy loop > > while holding a 'pernet_ops_rwsem' semaphore. That blocks the > > later modprobe request from user space: > > > > # ps > > 299 root R 99.3 200:25.89 kworker/u96:4+ > > > > # journalctl > > INFO: task modprobe:11787 blocked for more than 1228 seconds. > > Not tainted 5.19.0-rc2 #8 > > task:modprobe state:D > > Call Trace: > > <TASK> > > __schedule+0x8aa/0x21d0 > > schedule+0xcc/0x200 > > rwsem_down_write_slowpath+0x8e4/0x1580 > > down_write+0xfc/0x140 > > register_pernet_subsys+0x15/0x40 > > nf_nat_init+0xb6/0x1000 [nf_nat] > > do_one_initcall+0xbb/0x410 > > do_init_module+0x1b4/0x640 > > load_module+0x4c1b/0x58d0 > > __do_sys_init_module+0x1d7/0x220 > > do_syscall_64+0x3a/0x80 > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > > > At this point OVS testsuite is unresponsive and never recover, > > because these skbuffs are never freed. > > > > Solution is to make sure no external references attached to skb > > before pushing it to the defer list. Using skb_release_head_state() > > for that purpose. The function modified to be re-enterable, as it > > will be called again during the defer list flush. > > > > Another approach that can fix the OVS use-case, is to kick all > > cores while waiting for references to be released during the net > > cleanup. But that sounds more like a workaround for a current > > issue rather than a proper solution and will not cover possible > > issues in other parts of the code. > > > > Additionally checking for skb_zcopy() while deferring. This might > > not be necessary, as I'm not sure if we can actually have zero copy > > packets on this path, but seems worth having for completeness as we > > should never defer such packets regardless. > > > > CC: Eric Dumazet <edumazet@google.com> > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") > > Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > > --- > > net/core/skbuff.c | 16 +++++++++++----- > > 1 file changed, 11 insertions(+), 5 deletions(-) > > I do not think this patch is doing the right thing. > > Packets sitting in TCP receive queues should not hold state that is > not relevant for TCP recvmsg(). Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would not be possible to remove nf_conntrack module in practice. I wonder where the deferred skbs are coming from, any and all queued skbs need the conntrack state dropped. I don't mind a new helper that does a combined dst+ct release though.
On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote: > > Eric Dumazet <edumazet@google.com> wrote: > > On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > Open vSwitch system test suite is broken due to inability to > > > load/unload netfilter modules. kworker thread is getting trapped > > > in the infinite loop while running a net cleanup inside the > > > nf_conntrack_cleanup_net_list, because deferred skbuffs are still > > > holding nfct references and not being freed by their CPU cores. > > > > > > In general, the idea that we will have an rx interrupt on every > > > CPU core at some point in a near future doesn't seem correct. > > > Devices are getting created and destroyed, interrupts are getting > > > re-scheduled, CPUs are going online and offline dynamically. > > > Any of these events may leave packets stuck in defer list for a > > > long time. It might be OK, if they are just a piece of memory, > > > but we can't afford them holding references to any other resources. > > > > > > In case of OVS, nfct reference keeps the kernel thread in busy loop > > > while holding a 'pernet_ops_rwsem' semaphore. That blocks the > > > later modprobe request from user space: > > > > > > # ps > > > 299 root R 99.3 200:25.89 kworker/u96:4+ > > > > > > # journalctl > > > INFO: task modprobe:11787 blocked for more than 1228 seconds. > > > Not tainted 5.19.0-rc2 #8 > > > task:modprobe state:D > > > Call Trace: > > > <TASK> > > > __schedule+0x8aa/0x21d0 > > > schedule+0xcc/0x200 > > > rwsem_down_write_slowpath+0x8e4/0x1580 > > > down_write+0xfc/0x140 > > > register_pernet_subsys+0x15/0x40 > > > nf_nat_init+0xb6/0x1000 [nf_nat] > > > do_one_initcall+0xbb/0x410 > > > do_init_module+0x1b4/0x640 > > > load_module+0x4c1b/0x58d0 > > > __do_sys_init_module+0x1d7/0x220 > > > do_syscall_64+0x3a/0x80 > > > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > > > > > At this point OVS testsuite is unresponsive and never recover, > > > because these skbuffs are never freed. > > > > > > Solution is to make sure no external references attached to skb > > > before pushing it to the defer list. Using skb_release_head_state() > > > for that purpose. The function modified to be re-enterable, as it > > > will be called again during the defer list flush. > > > > > > Another approach that can fix the OVS use-case, is to kick all > > > cores while waiting for references to be released during the net > > > cleanup. But that sounds more like a workaround for a current > > > issue rather than a proper solution and will not cover possible > > > issues in other parts of the code. > > > > > > Additionally checking for skb_zcopy() while deferring. This might > > > not be necessary, as I'm not sure if we can actually have zero copy > > > packets on this path, but seems worth having for completeness as we > > > should never defer such packets regardless. > > > > > > CC: Eric Dumazet <edumazet@google.com> > > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") > > > Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > > > --- > > > net/core/skbuff.c | 16 +++++++++++----- > > > 1 file changed, 11 insertions(+), 5 deletions(-) > > > > I do not think this patch is doing the right thing. > > > > Packets sitting in TCP receive queues should not hold state that is > > not relevant for TCP recvmsg(). > > Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would > not be possible to remove nf_conntrack module in practice. 
Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?

Maybe 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
only widened the problem.

>
> I wonder where the deferred skbs are coming from, any and all
> queued skbs need the conntrack state dropped.
>
> I don't mind a new helper that does a combined dst+ct release though.
On 6/22/22 12:36, Eric Dumazet wrote: > On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote: >> >> Eric Dumazet <edumazet@google.com> wrote: >>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>> >>>> Open vSwitch system test suite is broken due to inability to >>>> load/unload netfilter modules. kworker thread is getting trapped >>>> in the infinite loop while running a net cleanup inside the >>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still >>>> holding nfct references and not being freed by their CPU cores. >>>> >>>> In general, the idea that we will have an rx interrupt on every >>>> CPU core at some point in a near future doesn't seem correct. >>>> Devices are getting created and destroyed, interrupts are getting >>>> re-scheduled, CPUs are going online and offline dynamically. >>>> Any of these events may leave packets stuck in defer list for a >>>> long time. It might be OK, if they are just a piece of memory, >>>> but we can't afford them holding references to any other resources. >>>> >>>> In case of OVS, nfct reference keeps the kernel thread in busy loop >>>> while holding a 'pernet_ops_rwsem' semaphore. That blocks the >>>> later modprobe request from user space: >>>> >>>> # ps >>>> 299 root R 99.3 200:25.89 kworker/u96:4+ >>>> >>>> # journalctl >>>> INFO: task modprobe:11787 blocked for more than 1228 seconds. >>>> Not tainted 5.19.0-rc2 #8 >>>> task:modprobe state:D >>>> Call Trace: >>>> <TASK> >>>> __schedule+0x8aa/0x21d0 >>>> schedule+0xcc/0x200 >>>> rwsem_down_write_slowpath+0x8e4/0x1580 >>>> down_write+0xfc/0x140 >>>> register_pernet_subsys+0x15/0x40 >>>> nf_nat_init+0xb6/0x1000 [nf_nat] >>>> do_one_initcall+0xbb/0x410 >>>> do_init_module+0x1b4/0x640 >>>> load_module+0x4c1b/0x58d0 >>>> __do_sys_init_module+0x1d7/0x220 >>>> do_syscall_64+0x3a/0x80 >>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0 >>>> >>>> At this point OVS testsuite is unresponsive and never recover, >>>> because these skbuffs are never freed. >>>> >>>> Solution is to make sure no external references attached to skb >>>> before pushing it to the defer list. Using skb_release_head_state() >>>> for that purpose. The function modified to be re-enterable, as it >>>> will be called again during the defer list flush. >>>> >>>> Another approach that can fix the OVS use-case, is to kick all >>>> cores while waiting for references to be released during the net >>>> cleanup. But that sounds more like a workaround for a current >>>> issue rather than a proper solution and will not cover possible >>>> issues in other parts of the code. >>>> >>>> Additionally checking for skb_zcopy() while deferring. This might >>>> not be necessary, as I'm not sure if we can actually have zero copy >>>> packets on this path, but seems worth having for completeness as we >>>> should never defer such packets regardless. >>>> >>>> CC: Eric Dumazet <edumazet@google.com> >>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") >>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>> --- >>>> net/core/skbuff.c | 16 +++++++++++----- >>>> 1 file changed, 11 insertions(+), 5 deletions(-) >>> >>> I do not think this patch is doing the right thing. >>> >>> Packets sitting in TCP receive queues should not hold state that is >>> not relevant for TCP recvmsg(). >> >> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would >> not be possible to remove nf_conntrack module in practice. 
>
> Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?

Yeah, that is kind of the main problem I have with the current
code.  It's very hard to find all the cases where skb has to be
cleaned up and almost impossible for someone who doesn't know
every aspect of every network subsystem in the kernel.  That's
why I went with the more or less bulletproof approach of cleaning
up while actually deferring.  I can try and test the code you
proposed in the other reply, but at least, I think, we need a
bunch of debug warnings in the skb_attempt_defer_free() to catch
possible issues.

Also, what about cleaning up extensions?  IIUC, at least one
of them can hold external references.  SKB_EXT_SEC_PATH holds
xfrm_state.  We should, probably, free them as well?

And what about zerocopy skb?  I think, we should still not
allow them to be deferred as they seem to hold HW resources.

>
> Maybe 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> only widened the problem.
>
>>
>> I wonder where the deferred skbs are coming from, any and all
>> queued skbs need the conntrack state dropped.
>>
>> I don't mind a new helper that does a combined dst+ct release though.
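To make the "bulletproof" check being discussed concrete, here is a minimal
sketch, with a made-up helper name, of what refusing to defer an skb that
still carries external references could look like; it only uses existing
accessors and is not the posted patch:

/* Hypothetical helper: true only if the skb carries none of the external
 * references discussed in this thread (conntrack entry, dst cache entry,
 * xfrm sec_path from SKB_EXT_SEC_PATH, zerocopy state). */
static bool skb_safe_to_defer(struct sk_buff *skb)
{
	return !skb_nfct(skb) && !skb_dst(skb) &&
	       !skb_sec_path(skb) && !skb_zcopy(skb);
}

An skb failing such a check would simply be freed immediately instead of
being parked on the per-cpu defer list.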
On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 6/22/22 12:36, Eric Dumazet wrote: > > On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote: > >> > >> Eric Dumazet <edumazet@google.com> wrote: > >>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>> > >>>> Open vSwitch system test suite is broken due to inability to > >>>> load/unload netfilter modules. kworker thread is getting trapped > >>>> in the infinite loop while running a net cleanup inside the > >>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still > >>>> holding nfct references and not being freed by their CPU cores. > >>>> > >>>> In general, the idea that we will have an rx interrupt on every > >>>> CPU core at some point in a near future doesn't seem correct. > >>>> Devices are getting created and destroyed, interrupts are getting > >>>> re-scheduled, CPUs are going online and offline dynamically. > >>>> Any of these events may leave packets stuck in defer list for a > >>>> long time. It might be OK, if they are just a piece of memory, > >>>> but we can't afford them holding references to any other resources. > >>>> > >>>> In case of OVS, nfct reference keeps the kernel thread in busy loop > >>>> while holding a 'pernet_ops_rwsem' semaphore. That blocks the > >>>> later modprobe request from user space: > >>>> > >>>> # ps > >>>> 299 root R 99.3 200:25.89 kworker/u96:4+ > >>>> > >>>> # journalctl > >>>> INFO: task modprobe:11787 blocked for more than 1228 seconds. > >>>> Not tainted 5.19.0-rc2 #8 > >>>> task:modprobe state:D > >>>> Call Trace: > >>>> <TASK> > >>>> __schedule+0x8aa/0x21d0 > >>>> schedule+0xcc/0x200 > >>>> rwsem_down_write_slowpath+0x8e4/0x1580 > >>>> down_write+0xfc/0x140 > >>>> register_pernet_subsys+0x15/0x40 > >>>> nf_nat_init+0xb6/0x1000 [nf_nat] > >>>> do_one_initcall+0xbb/0x410 > >>>> do_init_module+0x1b4/0x640 > >>>> load_module+0x4c1b/0x58d0 > >>>> __do_sys_init_module+0x1d7/0x220 > >>>> do_syscall_64+0x3a/0x80 > >>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0 > >>>> > >>>> At this point OVS testsuite is unresponsive and never recover, > >>>> because these skbuffs are never freed. > >>>> > >>>> Solution is to make sure no external references attached to skb > >>>> before pushing it to the defer list. Using skb_release_head_state() > >>>> for that purpose. The function modified to be re-enterable, as it > >>>> will be called again during the defer list flush. > >>>> > >>>> Another approach that can fix the OVS use-case, is to kick all > >>>> cores while waiting for references to be released during the net > >>>> cleanup. But that sounds more like a workaround for a current > >>>> issue rather than a proper solution and will not cover possible > >>>> issues in other parts of the code. > >>>> > >>>> Additionally checking for skb_zcopy() while deferring. This might > >>>> not be necessary, as I'm not sure if we can actually have zero copy > >>>> packets on this path, but seems worth having for completeness as we > >>>> should never defer such packets regardless. > >>>> > >>>> CC: Eric Dumazet <edumazet@google.com> > >>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") > >>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > >>>> --- > >>>> net/core/skbuff.c | 16 +++++++++++----- > >>>> 1 file changed, 11 insertions(+), 5 deletions(-) > >>> > >>> I do not think this patch is doing the right thing. 
> >>> > >>> Packets sitting in TCP receive queues should not hold state that is > >>> not relevant for TCP recvmsg(). > >> > >> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would > >> not be possible to remove nf_conntrack module in practice. > > > > Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ? > > Yeah, that is kind of the main problem I have with the current > code. It's very hard to find all the cases where skb has to be > cleaned up and almost impossible for someone who doesn't know > every aspect of every network subsystem in the kernel. That's > why I went with the more or less bulletproof approach of cleaning > up while actually deferring. I can try and test the code you > proposed in the other reply, but at least, I think, we need a > bunch of debug warnings in the skb_attempt_defer_free() to catch > possible issues. Debug warnings are expensive if they need to bring new cache lines. So far skb_attempt_defer_free() is only used by TCP in well known conditions. > > Also, what about cleaning up extensions? IIUC, at least one > of them can hold external references. SKB_EXT_SEC_PATH holds > xfrm_state. We should, probably, free them as well? I do not know about this, I would ask XFRM maintainers > > And what about zerocopy skb? I think, we should still not > allow them to be deferred as they seem to hold HW resources. The point of skb_attempt_defer_free() i is to make the freeing happen at producer much instead of consumer. I do not think there is anything special in this regard with zero copy. I would leave the current code as is. A simpler patch might be to move the existing nf_reset_ct() earlier, can you test this ? I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu() diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb) struct sock *sk; int ret; + nf_reset_ct(skb); + drop_reason = SKB_DROP_REASON_NOT_SPECIFIED; if (skb->pkt_type != PACKET_HOST) goto discard_it; @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb) if (drop_reason) goto discard_and_relse; - nf_reset_ct(skb); - if (tcp_filter(sk, skb)) { drop_reason = SKB_DROP_REASON_SOCKET_FILTER; goto discard_and_relse;
On 6/22/22 13:43, Eric Dumazet wrote: > On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets <i.maximets@ovn.org> wrote: >> >> On 6/22/22 12:36, Eric Dumazet wrote: >>> On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote: >>>> >>>> Eric Dumazet <edumazet@google.com> wrote: >>>>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>> >>>>>> Open vSwitch system test suite is broken due to inability to >>>>>> load/unload netfilter modules. kworker thread is getting trapped >>>>>> in the infinite loop while running a net cleanup inside the >>>>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still >>>>>> holding nfct references and not being freed by their CPU cores. >>>>>> >>>>>> In general, the idea that we will have an rx interrupt on every >>>>>> CPU core at some point in a near future doesn't seem correct. >>>>>> Devices are getting created and destroyed, interrupts are getting >>>>>> re-scheduled, CPUs are going online and offline dynamically. >>>>>> Any of these events may leave packets stuck in defer list for a >>>>>> long time. It might be OK, if they are just a piece of memory, >>>>>> but we can't afford them holding references to any other resources. >>>>>> >>>>>> In case of OVS, nfct reference keeps the kernel thread in busy loop >>>>>> while holding a 'pernet_ops_rwsem' semaphore. That blocks the >>>>>> later modprobe request from user space: >>>>>> >>>>>> # ps >>>>>> 299 root R 99.3 200:25.89 kworker/u96:4+ >>>>>> >>>>>> # journalctl >>>>>> INFO: task modprobe:11787 blocked for more than 1228 seconds. >>>>>> Not tainted 5.19.0-rc2 #8 >>>>>> task:modprobe state:D >>>>>> Call Trace: >>>>>> <TASK> >>>>>> __schedule+0x8aa/0x21d0 >>>>>> schedule+0xcc/0x200 >>>>>> rwsem_down_write_slowpath+0x8e4/0x1580 >>>>>> down_write+0xfc/0x140 >>>>>> register_pernet_subsys+0x15/0x40 >>>>>> nf_nat_init+0xb6/0x1000 [nf_nat] >>>>>> do_one_initcall+0xbb/0x410 >>>>>> do_init_module+0x1b4/0x640 >>>>>> load_module+0x4c1b/0x58d0 >>>>>> __do_sys_init_module+0x1d7/0x220 >>>>>> do_syscall_64+0x3a/0x80 >>>>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0 >>>>>> >>>>>> At this point OVS testsuite is unresponsive and never recover, >>>>>> because these skbuffs are never freed. >>>>>> >>>>>> Solution is to make sure no external references attached to skb >>>>>> before pushing it to the defer list. Using skb_release_head_state() >>>>>> for that purpose. The function modified to be re-enterable, as it >>>>>> will be called again during the defer list flush. >>>>>> >>>>>> Another approach that can fix the OVS use-case, is to kick all >>>>>> cores while waiting for references to be released during the net >>>>>> cleanup. But that sounds more like a workaround for a current >>>>>> issue rather than a proper solution and will not cover possible >>>>>> issues in other parts of the code. >>>>>> >>>>>> Additionally checking for skb_zcopy() while deferring. This might >>>>>> not be necessary, as I'm not sure if we can actually have zero copy >>>>>> packets on this path, but seems worth having for completeness as we >>>>>> should never defer such packets regardless. >>>>>> >>>>>> CC: Eric Dumazet <edumazet@google.com> >>>>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") >>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>> --- >>>>>> net/core/skbuff.c | 16 +++++++++++----- >>>>>> 1 file changed, 11 insertions(+), 5 deletions(-) >>>>> >>>>> I do not think this patch is doing the right thing. 
>>>>> >>>>> Packets sitting in TCP receive queues should not hold state that is >>>>> not relevant for TCP recvmsg(). >>>> >>>> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would >>>> not be possible to remove nf_conntrack module in practice. >>> >>> Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ? >> >> Yeah, that is kind of the main problem I have with the current >> code. It's very hard to find all the cases where skb has to be >> cleaned up and almost impossible for someone who doesn't know >> every aspect of every network subsystem in the kernel. That's >> why I went with the more or less bulletproof approach of cleaning >> up while actually deferring. I can try and test the code you >> proposed in the other reply, but at least, I think, we need a >> bunch of debug warnings in the skb_attempt_defer_free() to catch >> possible issues. > > Debug warnings are expensive if they need to bring new cache lines. > > So far skb_attempt_defer_free() is only used by TCP in well known conditions. That's true for skb_attempt_defer_free() itself, but it's hard to tell the same for all the parts of the code that are enqueuing these skbuffs. Even if all the places are well known, it looks to me highly error prone. Having a couple of DEBUG_NET_WARN_ON_ONCE should not cause any performance issues, IIUC, unless debug is enabled, right? e.g. DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb)); DEBUG_NET_WARN_ON_ONCE(skb_dst(skb)); > > >> >> Also, what about cleaning up extensions? IIUC, at least one >> of them can hold external references. SKB_EXT_SEC_PATH holds >> xfrm_state. We should, probably, free them as well? > > I do not know about this, I would ask XFRM maintainers @Steffen, @Herbert, what do you think? Is it a problem if the skb holding SKB_EXT_SEC_PATH extension is held not freed in a defer list indefinitely (for a very long time)? Or is it even possible for an skb with SKB_EXT_SEC_PATH extension present to be enqueued to a TCP receive queue without releasing all the references? This seems like a long existing problem though, so can be fixed separately, if it is a problem. > >> >> And what about zerocopy skb? I think, we should still not >> allow them to be deferred as they seem to hold HW resources. > > The point of skb_attempt_defer_free() i is to make the freeing happen > at producer > much instead of consumer. > > I do not think there is anything special in this regard with zero > copy. I would leave the current code as is. The problem is that these skbuffs are never actually freed, so if they do hold HW resources, these resources will never be released (never == for a very long time, hours, days). Not a blocker from my point of view, just trying to highlight a possible issue. > > A simpler patch might be to move the existing nf_reset_ct() earlier, > can you test this ? I tested the patch below and it seems to fix the issue seen with OVS testsuite. Though it's not obvious for me why this happens. Can you explain a bit more? 
> > I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu() > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db > 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb) > struct sock *sk; > int ret; > > + nf_reset_ct(skb); > + > drop_reason = SKB_DROP_REASON_NOT_SPECIFIED; > if (skb->pkt_type != PACKET_HOST) > goto discard_it; > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb) > if (drop_reason) > goto discard_and_relse; > > - nf_reset_ct(skb); > - > if (tcp_filter(sk, skb)) { > drop_reason = SKB_DROP_REASON_SOCKET_FILTER; > goto discard_and_relse;
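On the cost of those warnings: DEBUG_NET_WARN_ON_ONCE() compiles away
entirely unless CONFIG_DEBUG_NET is set. The definition below is paraphrased
and not verified against the exact 5.19 tree, but it is approximately:

#if defined(CONFIG_DEBUG_NET)
#define DEBUG_NET_WARN_ON_ONCE(cond) ((void)WARN_ON_ONCE(cond))
#else
/* Type-checks 'cond' at compile time, generates no code. */
#define DEBUG_NET_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
#endif

So on builds without CONFIG_DEBUG_NET the proposed checks would cost nothing
at runtime; the cache-line concern only applies when the debug config is on.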
On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 6/22/22 13:43, Eric Dumazet wrote: > > On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets <i.maximets@ovn.org> wrote: > >> > >> On 6/22/22 12:36, Eric Dumazet wrote: > >>> On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote: > >>>> > >>>> Eric Dumazet <edumazet@google.com> wrote: > >>>>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>>> > >>>>>> Open vSwitch system test suite is broken due to inability to > >>>>>> load/unload netfilter modules. kworker thread is getting trapped > >>>>>> in the infinite loop while running a net cleanup inside the > >>>>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still > >>>>>> holding nfct references and not being freed by their CPU cores. > >>>>>> > >>>>>> In general, the idea that we will have an rx interrupt on every > >>>>>> CPU core at some point in a near future doesn't seem correct. > >>>>>> Devices are getting created and destroyed, interrupts are getting > >>>>>> re-scheduled, CPUs are going online and offline dynamically. > >>>>>> Any of these events may leave packets stuck in defer list for a > >>>>>> long time. It might be OK, if they are just a piece of memory, > >>>>>> but we can't afford them holding references to any other resources. > >>>>>> > >>>>>> In case of OVS, nfct reference keeps the kernel thread in busy loop > >>>>>> while holding a 'pernet_ops_rwsem' semaphore. That blocks the > >>>>>> later modprobe request from user space: > >>>>>> > >>>>>> # ps > >>>>>> 299 root R 99.3 200:25.89 kworker/u96:4+ > >>>>>> > >>>>>> # journalctl > >>>>>> INFO: task modprobe:11787 blocked for more than 1228 seconds. > >>>>>> Not tainted 5.19.0-rc2 #8 > >>>>>> task:modprobe state:D > >>>>>> Call Trace: > >>>>>> <TASK> > >>>>>> __schedule+0x8aa/0x21d0 > >>>>>> schedule+0xcc/0x200 > >>>>>> rwsem_down_write_slowpath+0x8e4/0x1580 > >>>>>> down_write+0xfc/0x140 > >>>>>> register_pernet_subsys+0x15/0x40 > >>>>>> nf_nat_init+0xb6/0x1000 [nf_nat] > >>>>>> do_one_initcall+0xbb/0x410 > >>>>>> do_init_module+0x1b4/0x640 > >>>>>> load_module+0x4c1b/0x58d0 > >>>>>> __do_sys_init_module+0x1d7/0x220 > >>>>>> do_syscall_64+0x3a/0x80 > >>>>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0 > >>>>>> > >>>>>> At this point OVS testsuite is unresponsive and never recover, > >>>>>> because these skbuffs are never freed. > >>>>>> > >>>>>> Solution is to make sure no external references attached to skb > >>>>>> before pushing it to the defer list. Using skb_release_head_state() > >>>>>> for that purpose. The function modified to be re-enterable, as it > >>>>>> will be called again during the defer list flush. > >>>>>> > >>>>>> Another approach that can fix the OVS use-case, is to kick all > >>>>>> cores while waiting for references to be released during the net > >>>>>> cleanup. But that sounds more like a workaround for a current > >>>>>> issue rather than a proper solution and will not cover possible > >>>>>> issues in other parts of the code. > >>>>>> > >>>>>> Additionally checking for skb_zcopy() while deferring. This might > >>>>>> not be necessary, as I'm not sure if we can actually have zero copy > >>>>>> packets on this path, but seems worth having for completeness as we > >>>>>> should never defer such packets regardless. 
> >>>>>> > >>>>>> CC: Eric Dumazet <edumazet@google.com> > >>>>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") > >>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > >>>>>> --- > >>>>>> net/core/skbuff.c | 16 +++++++++++----- > >>>>>> 1 file changed, 11 insertions(+), 5 deletions(-) > >>>>> > >>>>> I do not think this patch is doing the right thing. > >>>>> > >>>>> Packets sitting in TCP receive queues should not hold state that is > >>>>> not relevant for TCP recvmsg(). > >>>> > >>>> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would > >>>> not be possible to remove nf_conntrack module in practice. > >>> > >>> Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ? > >> > >> Yeah, that is kind of the main problem I have with the current > >> code. It's very hard to find all the cases where skb has to be > >> cleaned up and almost impossible for someone who doesn't know > >> every aspect of every network subsystem in the kernel. That's > >> why I went with the more or less bulletproof approach of cleaning > >> up while actually deferring. I can try and test the code you > >> proposed in the other reply, but at least, I think, we need a > >> bunch of debug warnings in the skb_attempt_defer_free() to catch > >> possible issues. > > > > Debug warnings are expensive if they need to bring new cache lines. > > > > So far skb_attempt_defer_free() is only used by TCP in well known conditions. > > That's true for skb_attempt_defer_free() itself, but it's > hard to tell the same for all the parts of the code that > are enqueuing these skbuffs. Even if all the places are > well known, it looks to me highly error prone. Having > a couple of DEBUG_NET_WARN_ON_ONCE should not cause any > performance issues, IIUC, unless debug is enabled, right? > > e.g. > > DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb)); > DEBUG_NET_WARN_ON_ONCE(skb_dst(skb)); Sure, but we need to put them at the right place. Doing these checks in skb_attempt_defer_free() is too late. > > > > > > >> > >> Also, what about cleaning up extensions? IIUC, at least one > >> of them can hold external references. SKB_EXT_SEC_PATH holds > >> xfrm_state. We should, probably, free them as well? > > > > I do not know about this, I would ask XFRM maintainers > > @Steffen, @Herbert, what do you think? Is it a problem if the > skb holding SKB_EXT_SEC_PATH extension is held not freed in > a defer list indefinitely (for a very long time)? Or is it > even possible for an skb with SKB_EXT_SEC_PATH extension present > to be enqueued to a TCP receive queue without releasing all > the references? > > This seems like a long existing problem though, so can be fixed > separately, if it is a problem. > > > > >> > >> And what about zerocopy skb? I think, we should still not > >> allow them to be deferred as they seem to hold HW resources. > > > > The point of skb_attempt_defer_free() i is to make the freeing happen > > at producer > > much instead of consumer. > > > > I do not think there is anything special in this regard with zero > > copy. I would leave the current code as is. > > The problem is that these skbuffs are never actually freed, > so if they do hold HW resources, these resources will never be > released (never == for a very long time, hours, days). This is fine. Number of such skbs per cpu is limited. Or if some HW feature needs buffers to be released _immediately_, we have to implement a method for doing so. 
Please point why buffers sitting in TCP receive queues are not causing issues ? Do we need a general method "Please scan all TCP sockets and release skb holding HW resources immediately" > > Not a blocker from my point of view, just trying to highlight > a possible issue. The tradeoff is known really, as for any cache. > > > > > A simpler patch might be to move the existing nf_reset_ct() earlier, > > can you test this ? > > I tested the patch below and it seems to fix the issue seen > with OVS testsuite. Though it's not obvious for me why this > happens. Can you explain a bit more? I think the bug predates my recent change. skb_attempt_defer_free() is increasing the probability of hitting the bug. > > > > > I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu() > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > > index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db > > 100644 > > --- a/net/ipv4/tcp_ipv4.c > > +++ b/net/ipv4/tcp_ipv4.c > > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb) > > struct sock *sk; > > int ret; > > > > + nf_reset_ct(skb); > > + > > drop_reason = SKB_DROP_REASON_NOT_SPECIFIED; > > if (skb->pkt_type != PACKET_HOST) > > goto discard_it; > > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb) > > if (drop_reason) > > goto discard_and_relse; > > > > - nf_reset_ct(skb); > > - > > if (tcp_filter(sk, skb)) { > > drop_reason = SKB_DROP_REASON_SOCKET_FILTER; > > goto discard_and_relse; >
On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 6/22/22 13:43, Eric Dumazet wrote: > > I tested the patch below and it seems to fix the issue seen > with OVS testsuite. Though it's not obvious for me why this > happens. Can you explain a bit more? Anyway, I am not sure we can call nf_reset_ct(skb) that early. git log seems to say that xfrm check needs to be done before nf_reset_ct(skb), I have no idea why. I suspect some incoming packets are not going through xfrm4_policy_check() and end up being stored in a TCP receive queue. Maybe something is missing before calling tcp_child_process() > > > > > I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu() > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > > index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db > > 100644 > > --- a/net/ipv4/tcp_ipv4.c > > +++ b/net/ipv4/tcp_ipv4.c > > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb) > > struct sock *sk; > > int ret; > > > > + nf_reset_ct(skb); > > + > > drop_reason = SKB_DROP_REASON_NOT_SPECIFIED; > > if (skb->pkt_type != PACKET_HOST) > > goto discard_it; > > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb) > > if (drop_reason) > > goto discard_and_relse; > > > > - nf_reset_ct(skb); > > - > > if (tcp_filter(sk, skb)) { > > drop_reason = SKB_DROP_REASON_SOCKET_FILTER; > > goto discard_and_relse; >
On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > On 6/22/22 13:43, Eric Dumazet wrote: > > > > > I tested the patch below and it seems to fix the issue seen > > with OVS testsuite. Though it's not obvious for me why this > > happens. Can you explain a bit more? > > Anyway, I am not sure we can call nf_reset_ct(skb) that early. > > git log seems to say that xfrm check needs to be done before > nf_reset_ct(skb), I have no idea why. Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called after nf_reset_ct(skb) Steffen, do you have some comments ? Some context: commit b59c270104f03960069596722fea70340579244d Author: Patrick McHardy <kaber@trash.net> Date: Fri Jan 6 23:06:10 2006 -0800 [NETFILTER]: Keep conntrack reference until IPsec policy checks are done Keep the conntrack reference until policy checks have been performed for IPsec NAT support. The reference needs to be dropped before a packet is queued to avoid having the conntrack module unloadable. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net> > > I suspect some incoming packets are not going through > xfrm4_policy_check() and end up being stored in a TCP receive queue. > > Maybe something is missing before calling tcp_child_process() > > > > > > > > > > I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu() > > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > > > index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db > > > 100644 > > > --- a/net/ipv4/tcp_ipv4.c > > > +++ b/net/ipv4/tcp_ipv4.c > > > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb) > > > struct sock *sk; > > > int ret; > > > > > > + nf_reset_ct(skb); > > > + > > > drop_reason = SKB_DROP_REASON_NOT_SPECIFIED; > > > if (skb->pkt_type != PACKET_HOST) > > > goto discard_it; > > > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb) > > > if (drop_reason) > > > goto discard_and_relse; > > > > > > - nf_reset_ct(skb); > > > - > > > if (tcp_filter(sk, skb)) { > > > drop_reason = SKB_DROP_REASON_SOCKET_FILTER; > > > goto discard_and_relse; > >
On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > On 6/22/22 13:43, Eric Dumazet wrote: > > > > > > > > I tested the patch below and it seems to fix the issue seen > > > with OVS testsuite. Though it's not obvious for me why this > > > happens. Can you explain a bit more? > > > > Anyway, I am not sure we can call nf_reset_ct(skb) that early. > > > > git log seems to say that xfrm check needs to be done before > > nf_reset_ct(skb), I have no idea why. > > Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called > after nf_reset_ct(skb) > > Steffen, do you have some comments ? > > Some context: > commit b59c270104f03960069596722fea70340579244d > Author: Patrick McHardy <kaber@trash.net> > Date: Fri Jan 6 23:06:10 2006 -0800 > > [NETFILTER]: Keep conntrack reference until IPsec policy checks are done > > Keep the conntrack reference until policy checks have been performed for > IPsec NAT support. The reference needs to be dropped before a packet is > queued to avoid having the conntrack module unloadable. > > Signed-off-by: Patrick McHardy <kaber@trash.net> > Signed-off-by: David S. Miller <davem@davemloft.net> > Oh well... __xfrm_policy_check() has : nf_nat_decode_session(skb, &fl, family); This answers my questions. This means we are probably missing at least one XFRM check in TCP stack in some cases. (Only after adding this XFRM check we can call nf_reset_ct(skb)) > > > > > I suspect some incoming packets are not going through > > xfrm4_policy_check() and end up being stored in a TCP receive queue. > > > > Maybe something is missing before calling tcp_child_process() > > > > > > > > > > > > > > > I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu() > > > > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > > > > index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db > > > > 100644 > > > > --- a/net/ipv4/tcp_ipv4.c > > > > +++ b/net/ipv4/tcp_ipv4.c > > > > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb) > > > > struct sock *sk; > > > > int ret; > > > > > > > > + nf_reset_ct(skb); > > > > + > > > > drop_reason = SKB_DROP_REASON_NOT_SPECIFIED; > > > > if (skb->pkt_type != PACKET_HOST) > > > > goto discard_it; > > > > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb) > > > > if (drop_reason) > > > > goto discard_and_relse; > > > > > > > > - nf_reset_ct(skb); > > > > - > > > > if (tcp_filter(sk, skb)) { > > > > drop_reason = SKB_DROP_REASON_SOCKET_FILTER; > > > > goto discard_and_relse; > > >
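To spell out the ordering constraint that this commit (and the
nf_nat_decode_session() call inside __xfrm_policy_check()) imposes, the
receive path has to look roughly like the sketch below; the calls are the
existing ones from tcp_v4_rcv(), only the surrounding code is elided:

	/* The XFRM policy check may consult the conntrack entry (via
	 * nf_nat_decode_session()), so the ct reference can only be
	 * dropped after the check has run, not before. */
	if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
		goto discard_and_relse;

	nf_reset_ct(skb);

Any path that queues the skb without having gone through such a check and
reset, such as the tcp_child_process() path mentioned above, is exactly where
a conntrack reference can leak into a receive or defer queue.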
On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > On 6/22/22 13:43, Eric Dumazet wrote: > > > > > > > > > > > I tested the patch below and it seems to fix the issue seen > > > > with OVS testsuite. Though it's not obvious for me why this > > > > happens. Can you explain a bit more? > > > > > > Anyway, I am not sure we can call nf_reset_ct(skb) that early. > > > > > > git log seems to say that xfrm check needs to be done before > > > nf_reset_ct(skb), I have no idea why. > > > > Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called > > after nf_reset_ct(skb) > > > > Steffen, do you have some comments ? > > > > Some context: > > commit b59c270104f03960069596722fea70340579244d > > Author: Patrick McHardy <kaber@trash.net> > > Date: Fri Jan 6 23:06:10 2006 -0800 > > > > [NETFILTER]: Keep conntrack reference until IPsec policy checks are done > > > > Keep the conntrack reference until policy checks have been performed for > > IPsec NAT support. The reference needs to be dropped before a packet is > > queued to avoid having the conntrack module unloadable. > > > > Signed-off-by: Patrick McHardy <kaber@trash.net> > > Signed-off-by: David S. Miller <davem@davemloft.net> > > > > Oh well... __xfrm_policy_check() has : > > nf_nat_decode_session(skb, &fl, family); > > This answers my questions. > > This means we are probably missing at least one XFRM check in TCP > stack in some cases. > (Only after adding this XFRM check we can call nf_reset_ct(skb)) > Maybe this will help ? diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index fe8f23b95d32ca4a35d05166d471327bc608fa91..49c1348e40b6c7b6a98b54d716f29c948e00ba33 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2019,12 +2019,19 @@ int tcp_v4_rcv(struct sk_buff *skb) if (nsk == sk) { reqsk_put(req); tcp_v4_restore_cb(skb); - } else if (tcp_child_process(sk, nsk, skb)) { - tcp_v4_send_reset(nsk, skb); - goto discard_and_relse; } else { - sock_put(sk); - return 0; + if (!xfrm4_policy_check(nsk, XFRM_POLICY_IN, skb)) { + drop_reason = SKB_DROP_REASON_XFRM_POLICY; + goto discard_and_relse; + } + nf_reset_ct(skb); + if (tcp_child_process(sk, nsk, skb)) { + tcp_v4_send_reset(nsk, skb); + goto discard_and_relse; + } else { + sock_put(sk); + return 0; + } } }
On 6/22/22 16:35, Eric Dumazet wrote: > On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: >> >> On 6/22/22 13:43, Eric Dumazet wrote: >>> On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>> >>>> On 6/22/22 12:36, Eric Dumazet wrote: >>>>> On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote: >>>>>> >>>>>> Eric Dumazet <edumazet@google.com> wrote: >>>>>>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>> >>>>>>>> Open vSwitch system test suite is broken due to inability to >>>>>>>> load/unload netfilter modules. kworker thread is getting trapped >>>>>>>> in the infinite loop while running a net cleanup inside the >>>>>>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still >>>>>>>> holding nfct references and not being freed by their CPU cores. >>>>>>>> >>>>>>>> In general, the idea that we will have an rx interrupt on every >>>>>>>> CPU core at some point in a near future doesn't seem correct. >>>>>>>> Devices are getting created and destroyed, interrupts are getting >>>>>>>> re-scheduled, CPUs are going online and offline dynamically. >>>>>>>> Any of these events may leave packets stuck in defer list for a >>>>>>>> long time. It might be OK, if they are just a piece of memory, >>>>>>>> but we can't afford them holding references to any other resources. >>>>>>>> >>>>>>>> In case of OVS, nfct reference keeps the kernel thread in busy loop >>>>>>>> while holding a 'pernet_ops_rwsem' semaphore. That blocks the >>>>>>>> later modprobe request from user space: >>>>>>>> >>>>>>>> # ps >>>>>>>> 299 root R 99.3 200:25.89 kworker/u96:4+ >>>>>>>> >>>>>>>> # journalctl >>>>>>>> INFO: task modprobe:11787 blocked for more than 1228 seconds. >>>>>>>> Not tainted 5.19.0-rc2 #8 >>>>>>>> task:modprobe state:D >>>>>>>> Call Trace: >>>>>>>> <TASK> >>>>>>>> __schedule+0x8aa/0x21d0 >>>>>>>> schedule+0xcc/0x200 >>>>>>>> rwsem_down_write_slowpath+0x8e4/0x1580 >>>>>>>> down_write+0xfc/0x140 >>>>>>>> register_pernet_subsys+0x15/0x40 >>>>>>>> nf_nat_init+0xb6/0x1000 [nf_nat] >>>>>>>> do_one_initcall+0xbb/0x410 >>>>>>>> do_init_module+0x1b4/0x640 >>>>>>>> load_module+0x4c1b/0x58d0 >>>>>>>> __do_sys_init_module+0x1d7/0x220 >>>>>>>> do_syscall_64+0x3a/0x80 >>>>>>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0 >>>>>>>> >>>>>>>> At this point OVS testsuite is unresponsive and never recover, >>>>>>>> because these skbuffs are never freed. >>>>>>>> >>>>>>>> Solution is to make sure no external references attached to skb >>>>>>>> before pushing it to the defer list. Using skb_release_head_state() >>>>>>>> for that purpose. The function modified to be re-enterable, as it >>>>>>>> will be called again during the defer list flush. >>>>>>>> >>>>>>>> Another approach that can fix the OVS use-case, is to kick all >>>>>>>> cores while waiting for references to be released during the net >>>>>>>> cleanup. But that sounds more like a workaround for a current >>>>>>>> issue rather than a proper solution and will not cover possible >>>>>>>> issues in other parts of the code. >>>>>>>> >>>>>>>> Additionally checking for skb_zcopy() while deferring. This might >>>>>>>> not be necessary, as I'm not sure if we can actually have zero copy >>>>>>>> packets on this path, but seems worth having for completeness as we >>>>>>>> should never defer such packets regardless. 
>>>>>>>> >>>>>>>> CC: Eric Dumazet <edumazet@google.com> >>>>>>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") >>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>>> --- >>>>>>>> net/core/skbuff.c | 16 +++++++++++----- >>>>>>>> 1 file changed, 11 insertions(+), 5 deletions(-) >>>>>>> >>>>>>> I do not think this patch is doing the right thing. >>>>>>> >>>>>>> Packets sitting in TCP receive queues should not hold state that is >>>>>>> not relevant for TCP recvmsg(). >>>>>> >>>>>> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would >>>>>> not be possible to remove nf_conntrack module in practice. >>>>> >>>>> Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ? >>>> >>>> Yeah, that is kind of the main problem I have with the current >>>> code. It's very hard to find all the cases where skb has to be >>>> cleaned up and almost impossible for someone who doesn't know >>>> every aspect of every network subsystem in the kernel. That's >>>> why I went with the more or less bulletproof approach of cleaning >>>> up while actually deferring. I can try and test the code you >>>> proposed in the other reply, but at least, I think, we need a >>>> bunch of debug warnings in the skb_attempt_defer_free() to catch >>>> possible issues. >>> >>> Debug warnings are expensive if they need to bring new cache lines. >>> >>> So far skb_attempt_defer_free() is only used by TCP in well known conditions. >> >> That's true for skb_attempt_defer_free() itself, but it's >> hard to tell the same for all the parts of the code that >> are enqueuing these skbuffs. Even if all the places are >> well known, it looks to me highly error prone. Having >> a couple of DEBUG_NET_WARN_ON_ONCE should not cause any >> performance issues, IIUC, unless debug is enabled, right? >> >> e.g. >> >> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb)); >> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb)); > > Sure, but we need to put them at the right place. > > Doing these checks in skb_attempt_defer_free() is too late. Yes, I agree that it doesn't give any useful information for debugging, because we need to know who put those packets in the queue, not who took them out. But 'not too late' places should already have calls to reset these fields in a close proximity to where the warnings should be, so it's hard to tell if that will make a lot of sense. What I had in mind is just a red flag that says that something went wrong, so it can be raised while testing new changes to the kernel and running with debug enabled. That can save some debugging and bisecting time in the future if some unexpected nfct or dst reference will sneak into the queue. But I will not insist on adding these, if you think they are not valuable. > >> >>> >>> >>>> >>>> Also, what about cleaning up extensions? IIUC, at least one >>>> of them can hold external references. SKB_EXT_SEC_PATH holds >>>> xfrm_state. We should, probably, free them as well? >>> >>> I do not know about this, I would ask XFRM maintainers >> >> @Steffen, @Herbert, what do you think? Is it a problem if the >> skb holding SKB_EXT_SEC_PATH extension is held not freed in >> a defer list indefinitely (for a very long time)? Or is it >> even possible for an skb with SKB_EXT_SEC_PATH extension present >> to be enqueued to a TCP receive queue without releasing all >> the references? To answer my own questions: - Yes, it's possible. 
- No, doesn't seem to be a big problem, IIUC, as xfrm_state seem to be self-contained and keeping it around for a long time doesn't seem to be harmful. Some input from xfrm maintainers would still be great though. >> >> This seems like a long existing problem though, so can be fixed >> separately, if it is a problem. >> >>> >>>> >>>> And what about zerocopy skb? I think, we should still not >>>> allow them to be deferred as they seem to hold HW resources. >>> >>> The point of skb_attempt_defer_free() i is to make the freeing happen >>> at producer >>> much instead of consumer. >>> >>> I do not think there is anything special in this regard with zero >>> copy. I would leave the current code as is. >> >> The problem is that these skbuffs are never actually freed, >> so if they do hold HW resources, these resources will never be >> released (never == for a very long time, hours, days). > > This is fine. Number of such skbs per cpu is limited. > > Or if some HW feature needs buffers to be released _immediately_, we > have to implement > a method for doing so. > > Please point why buffers sitting in TCP receive queues are not causing issues ? > Do we need a general method "Please scan all TCP sockets and release > skb holding HW resources immediately" That's a good point. I agree that it's probably not a big deal in this case. > >> >> Not a blocker from my point of view, just trying to highlight >> a possible issue. > > The tradeoff is known really, as for any cache. > >> >>> >>> A simpler patch might be to move the existing nf_reset_ct() earlier, >>> can you test this ? >> >> I tested the patch below and it seems to fix the issue seen >> with OVS testsuite. Though it's not obvious for me why this >> happens. Can you explain a bit more? > > I think the bug predates my recent change. > > skb_attempt_defer_free() is increasing the probability of hitting the bug. > >> >>> >>> I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu() >>> >>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c >>> index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db >>> 100644 >>> --- a/net/ipv4/tcp_ipv4.c >>> +++ b/net/ipv4/tcp_ipv4.c >>> @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb) >>> struct sock *sk; >>> int ret; >>> >>> + nf_reset_ct(skb); >>> + >>> drop_reason = SKB_DROP_REASON_NOT_SPECIFIED; >>> if (skb->pkt_type != PACKET_HOST) >>> goto discard_it; >>> @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb) >>> if (drop_reason) >>> goto discard_and_relse; >>> >>> - nf_reset_ct(skb); >>> - >>> if (tcp_filter(sk, skb)) { >>> drop_reason = SKB_DROP_REASON_SOCKET_FILTER; >>> goto discard_and_relse; >>
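As a purely illustrative aside: a minimal sketch of the kind of debug-only assertions discussed above. The helper name and the call site (somewhere on the TCP receive-queue enqueue path rather than in skb_attempt_defer_free() itself) are assumptions for illustration, not something proposed verbatim in this thread. DEBUG_NET_WARN_ON_ONCE() compiles away unless CONFIG_DEBUG_NET is enabled, so such checks would cost nothing in production builds.

/* Hypothetical helper; name and placement are illustrative only.
 * The idea is to flag, in debug builds, any skb that still carries
 * external references at a point where it may sit on a receive
 * queue or per-cpu defer list for a long time.
 */
static inline void skb_debug_check_released_refs(const struct sk_buff *skb)
{
	/* A leftover conntrack entry would pin nf_conntrack until the
	 * deferred skb is finally freed.
	 */
	DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));

	/* A leftover dst keeps routes (and potentially devices) alive. */
	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
}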
On 6/22/22 19:03, Eric Dumazet wrote: > On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote: >> >> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote: >>> >>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote: >>>> >>>> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>> >>>>> On 6/22/22 13:43, Eric Dumazet wrote: >>>> >>>>> >>>>> I tested the patch below and it seems to fix the issue seen >>>>> with OVS testsuite. Though it's not obvious for me why this >>>>> happens. Can you explain a bit more? >>>> >>>> Anyway, I am not sure we can call nf_reset_ct(skb) that early. >>>> >>>> git log seems to say that xfrm check needs to be done before >>>> nf_reset_ct(skb), I have no idea why. >>> >>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called >>> after nf_reset_ct(skb) >>> >>> Steffen, do you have some comments ? >>> >>> Some context: >>> commit b59c270104f03960069596722fea70340579244d >>> Author: Patrick McHardy <kaber@trash.net> >>> Date: Fri Jan 6 23:06:10 2006 -0800 >>> >>> [NETFILTER]: Keep conntrack reference until IPsec policy checks are done >>> >>> Keep the conntrack reference until policy checks have been performed for >>> IPsec NAT support. The reference needs to be dropped before a packet is >>> queued to avoid having the conntrack module unloadable. >>> >>> Signed-off-by: Patrick McHardy <kaber@trash.net> >>> Signed-off-by: David S. Miller <davem@davemloft.net> >>> >> >> Oh well... __xfrm_policy_check() has : >> >> nf_nat_decode_session(skb, &fl, family); >> >> This answers my questions. >> >> This means we are probably missing at least one XFRM check in TCP >> stack in some cases. >> (Only after adding this XFRM check we can call nf_reset_ct(skb)) >> > > Maybe this will help ? I tested this patch and it seems to fix the OVS problem. I did not test the xfrm part of it. Will you post an official patch? > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index fe8f23b95d32ca4a35d05166d471327bc608fa91..49c1348e40b6c7b6a98b54d716f29c948e00ba33 > 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -2019,12 +2019,19 @@ int tcp_v4_rcv(struct sk_buff *skb) > if (nsk == sk) { > reqsk_put(req); > tcp_v4_restore_cb(skb); > - } else if (tcp_child_process(sk, nsk, skb)) { > - tcp_v4_send_reset(nsk, skb); > - goto discard_and_relse; > } else { > - sock_put(sk); > - return 0; > + if (!xfrm4_policy_check(nsk, XFRM_POLICY_IN, skb)) { > + drop_reason = SKB_DROP_REASON_XFRM_POLICY; > + goto discard_and_relse; > + } > + nf_reset_ct(skb); > + if (tcp_child_process(sk, nsk, skb)) { > + tcp_v4_send_reset(nsk, skb); > + goto discard_and_relse; > + } else { > + sock_put(sk); > + return 0; > + } > } > }
On Wed, Jun 22, 2022 at 8:19 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 6/22/22 19:03, Eric Dumazet wrote: > > On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote: > >> > >> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote: > >>> > >>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote: > >>>> > >>>> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>> > >>>>> On 6/22/22 13:43, Eric Dumazet wrote: > >>>> > >>>>> > >>>>> I tested the patch below and it seems to fix the issue seen > >>>>> with OVS testsuite. Though it's not obvious for me why this > >>>>> happens. Can you explain a bit more? > >>>> > >>>> Anyway, I am not sure we can call nf_reset_ct(skb) that early. > >>>> > >>>> git log seems to say that xfrm check needs to be done before > >>>> nf_reset_ct(skb), I have no idea why. > >>> > >>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called > >>> after nf_reset_ct(skb) > >>> > >>> Steffen, do you have some comments ? > >>> > >>> Some context: > >>> commit b59c270104f03960069596722fea70340579244d > >>> Author: Patrick McHardy <kaber@trash.net> > >>> Date: Fri Jan 6 23:06:10 2006 -0800 > >>> > >>> [NETFILTER]: Keep conntrack reference until IPsec policy checks are done > >>> > >>> Keep the conntrack reference until policy checks have been performed for > >>> IPsec NAT support. The reference needs to be dropped before a packet is > >>> queued to avoid having the conntrack module unloadable. > >>> > >>> Signed-off-by: Patrick McHardy <kaber@trash.net> > >>> Signed-off-by: David S. Miller <davem@davemloft.net> > >>> > >> > >> Oh well... __xfrm_policy_check() has : > >> > >> nf_nat_decode_session(skb, &fl, family); > >> > >> This answers my questions. > >> > >> This means we are probably missing at least one XFRM check in TCP > >> stack in some cases. > >> (Only after adding this XFRM check we can call nf_reset_ct(skb)) > >> > > > > Maybe this will help ? > > I tested this patch and it seems to fix the OVS problem. > I did not test the xfrm part of it. > > Will you post an official patch? Yes I will. I need to double check we do not leak either the req, or the child. Maybe the XFRM check should be done even earlier, on the listening socket ? Or if we assume the SYNACK packet has been sent after the XFRM test has been applied to the SYN, maybe we could just call nf_reset_ct(skb) to lower risk of regressions. With the last patch, it would be strange that we accept the 3WHS and establish a socket, but drop the payload in the 3rd packet... > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > > index fe8f23b95d32ca4a35d05166d471327bc608fa91..49c1348e40b6c7b6a98b54d716f29c948e00ba33 > > 100644 > > --- a/net/ipv4/tcp_ipv4.c > > +++ b/net/ipv4/tcp_ipv4.c > > @@ -2019,12 +2019,19 @@ int tcp_v4_rcv(struct sk_buff *skb) > > if (nsk == sk) { > > reqsk_put(req); > > tcp_v4_restore_cb(skb); > > - } else if (tcp_child_process(sk, nsk, skb)) { > > - tcp_v4_send_reset(nsk, skb); > > - goto discard_and_relse; > > } else { > > - sock_put(sk); > > - return 0; > > + if (!xfrm4_policy_check(nsk, XFRM_POLICY_IN, skb)) { > > + drop_reason = SKB_DROP_REASON_XFRM_POLICY; > > + goto discard_and_relse; > > + } > > + nf_reset_ct(skb); > > + if (tcp_child_process(sk, nsk, skb)) { > > + tcp_v4_send_reset(nsk, skb); > > + goto discard_and_relse; > > + } else { > > + sock_put(sk); > > + return 0; > > + } > > } > > } >
On Wed, Jun 22, 2022 at 9:04 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Jun 22, 2022 at 8:19 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > On 6/22/22 19:03, Eric Dumazet wrote: > > > On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote: > > >> > > >> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote: > > >>> > > >>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote: > > >>>> > > >>>> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > >>>>> > > >>>>> On 6/22/22 13:43, Eric Dumazet wrote: > > >>>> > > >>>>> > > >>>>> I tested the patch below and it seems to fix the issue seen > > >>>>> with OVS testsuite. Though it's not obvious for me why this > > >>>>> happens. Can you explain a bit more? > > >>>> > > >>>> Anyway, I am not sure we can call nf_reset_ct(skb) that early. > > >>>> > > >>>> git log seems to say that xfrm check needs to be done before > > >>>> nf_reset_ct(skb), I have no idea why. > > >>> > > >>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called > > >>> after nf_reset_ct(skb) > > >>> > > >>> Steffen, do you have some comments ? > > >>> > > >>> Some context: > > >>> commit b59c270104f03960069596722fea70340579244d > > >>> Author: Patrick McHardy <kaber@trash.net> > > >>> Date: Fri Jan 6 23:06:10 2006 -0800 > > >>> > > >>> [NETFILTER]: Keep conntrack reference until IPsec policy checks are done > > >>> > > >>> Keep the conntrack reference until policy checks have been performed for > > >>> IPsec NAT support. The reference needs to be dropped before a packet is > > >>> queued to avoid having the conntrack module unloadable. > > >>> > > >>> Signed-off-by: Patrick McHardy <kaber@trash.net> > > >>> Signed-off-by: David S. Miller <davem@davemloft.net> > > >>> > > >> > > >> Oh well... __xfrm_policy_check() has : > > >> > > >> nf_nat_decode_session(skb, &fl, family); > > >> > > >> This answers my questions. > > >> > > >> This means we are probably missing at least one XFRM check in TCP > > >> stack in some cases. > > >> (Only after adding this XFRM check we can call nf_reset_ct(skb)) > > >> > > > > > > Maybe this will help ? > > > > I tested this patch and it seems to fix the OVS problem. > > I did not test the xfrm part of it. > > > > Will you post an official patch? > > Yes I will. I need to double check we do not leak either the req, or the child. > > Maybe the XFRM check should be done even earlier, on the listening socket ? > > Or if we assume the SYNACK packet has been sent after the XFRM test > has been applied to the SYN, > maybe we could just call nf_reset_ct(skb) to lower risk of regressions. > > With the last patch, it would be strange that we accept the 3WHS and > establish a socket, > but drop the payload in the 3rd packet... Ilya, can you test the following patch ? I think it makes more sense to let XFRM reject the packet earlier, and not complete the 3WHS, if for some reason this happens. Thanks ! 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fe8f23b95d32ca4a35d05166d471327bc608fa91..da5a3c44c4fb70f1d3ecc596e694a86267f1c44a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1964,7 +1964,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
 		struct sock *nsk;
 
 		sk = req->rsk_listener;
-		drop_reason = tcp_inbound_md5_hash(sk, skb,
+		if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
+			drop_reason = SKB_DROP_REASON_XFRM_POLICY;
+		else
+			drop_reason = tcp_inbound_md5_hash(sk, skb,
 						   &iph->saddr, &iph->daddr,
 						   AF_INET, dif, sdif);
 		if (unlikely(drop_reason)) {
@@ -2016,6 +2019,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 			}
 			goto discard_and_relse;
 		}
+		nf_reset_ct(skb);
 		if (nsk == sk) {
 			reqsk_put(req);
 			tcp_v4_restore_cb(skb);
On 6/22/22 21:27, Eric Dumazet wrote: > On Wed, Jun 22, 2022 at 9:04 PM Eric Dumazet <edumazet@google.com> wrote: >> >> On Wed, Jun 22, 2022 at 8:19 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>> >>> On 6/22/22 19:03, Eric Dumazet wrote: >>>> On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote: >>>>> >>>>> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote: >>>>>> >>>>>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote: >>>>>>> >>>>>>> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>> >>>>>>>> On 6/22/22 13:43, Eric Dumazet wrote: >>>>>>> >>>>>>>> >>>>>>>> I tested the patch below and it seems to fix the issue seen >>>>>>>> with OVS testsuite. Though it's not obvious for me why this >>>>>>>> happens. Can you explain a bit more? >>>>>>> >>>>>>> Anyway, I am not sure we can call nf_reset_ct(skb) that early. >>>>>>> >>>>>>> git log seems to say that xfrm check needs to be done before >>>>>>> nf_reset_ct(skb), I have no idea why. >>>>>> >>>>>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called >>>>>> after nf_reset_ct(skb) >>>>>> >>>>>> Steffen, do you have some comments ? >>>>>> >>>>>> Some context: >>>>>> commit b59c270104f03960069596722fea70340579244d >>>>>> Author: Patrick McHardy <kaber@trash.net> >>>>>> Date: Fri Jan 6 23:06:10 2006 -0800 >>>>>> >>>>>> [NETFILTER]: Keep conntrack reference until IPsec policy checks are done >>>>>> >>>>>> Keep the conntrack reference until policy checks have been performed for >>>>>> IPsec NAT support. The reference needs to be dropped before a packet is >>>>>> queued to avoid having the conntrack module unloadable. >>>>>> >>>>>> Signed-off-by: Patrick McHardy <kaber@trash.net> >>>>>> Signed-off-by: David S. Miller <davem@davemloft.net> >>>>>> >>>>> >>>>> Oh well... __xfrm_policy_check() has : >>>>> >>>>> nf_nat_decode_session(skb, &fl, family); >>>>> >>>>> This answers my questions. >>>>> >>>>> This means we are probably missing at least one XFRM check in TCP >>>>> stack in some cases. >>>>> (Only after adding this XFRM check we can call nf_reset_ct(skb)) >>>>> >>>> >>>> Maybe this will help ? >>> >>> I tested this patch and it seems to fix the OVS problem. >>> I did not test the xfrm part of it. >>> >>> Will you post an official patch? >> >> Yes I will. I need to double check we do not leak either the req, or the child. >> >> Maybe the XFRM check should be done even earlier, on the listening socket ? >> >> Or if we assume the SYNACK packet has been sent after the XFRM test >> has been applied to the SYN, >> maybe we could just call nf_reset_ct(skb) to lower risk of regressions. >> >> With the last patch, it would be strange that we accept the 3WHS and >> establish a socket, >> but drop the payload in the 3rd packet... > > Ilya, can you test the following patch ? Tested with OVS and it works fine, the issue doesn't appear. Still didn't test the xfrm part, as I'm not sure how. > I think it makes more sense to let XFRM reject the packet earlier, and > not complete the 3WHS, > if for some reason this happens. OK. However, now the patch looks more like two separate fixes. > > Thanks ! 
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fe8f23b95d32ca4a35d05166d471327bc608fa91..da5a3c44c4fb70f1d3ecc596e694a86267f1c44a
> 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1964,7 +1964,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
>                 struct sock *nsk;
>
>                 sk = req->rsk_listener;
> -               drop_reason = tcp_inbound_md5_hash(sk, skb,
> +               if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
> +                       drop_reason = SKB_DROP_REASON_XFRM_POLICY;
> +               else
> +                       drop_reason = tcp_inbound_md5_hash(sk, skb,
>                                                    &iph->saddr, &iph->daddr,
>                                                    AF_INET, dif, sdif);
>                 if (unlikely(drop_reason)) {
> @@ -2016,6 +2019,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
>                         }
>                         goto discard_and_relse;
>                 }
> +               nf_reset_ct(skb);
>                 if (nsk == sk) {
>                         reqsk_put(req);
>                         tcp_v4_restore_cb(skb);
On 6/19/22 02:39, Ilya Maximets wrote: > Open vSwitch system test suite is broken due to inability to > load/unload netfilter modules. kworker thread is getting trapped > in the infinite loop while running a net cleanup inside the > nf_conntrack_cleanup_net_list, because deferred skbuffs are still > holding nfct references and not being freed by their CPU cores. > > In general, the idea that we will have an rx interrupt on every > CPU core at some point in a near future doesn't seem correct. > Devices are getting created and destroyed, interrupts are getting > re-scheduled, CPUs are going online and offline dynamically. > Any of these events may leave packets stuck in defer list for a > long time. It might be OK, if they are just a piece of memory, > but we can't afford them holding references to any other resources. > > In case of OVS, nfct reference keeps the kernel thread in busy loop > while holding a 'pernet_ops_rwsem' semaphore. That blocks the > later modprobe request from user space: > > # ps > 299 root R 99.3 200:25.89 kworker/u96:4+ > > # journalctl > INFO: task modprobe:11787 blocked for more than 1228 seconds. > Not tainted 5.19.0-rc2 #8 > task:modprobe state:D > Call Trace: > <TASK> > __schedule+0x8aa/0x21d0 > schedule+0xcc/0x200 > rwsem_down_write_slowpath+0x8e4/0x1580 > down_write+0xfc/0x140 > register_pernet_subsys+0x15/0x40 > nf_nat_init+0xb6/0x1000 [nf_nat] > do_one_initcall+0xbb/0x410 > do_init_module+0x1b4/0x640 > load_module+0x4c1b/0x58d0 > __do_sys_init_module+0x1d7/0x220 > do_syscall_64+0x3a/0x80 > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > At this point OVS testsuite is unresponsive and never recover, > because these skbuffs are never freed. > > Solution is to make sure no external references attached to skb > before pushing it to the defer list. Using skb_release_head_state() > for that purpose. The function modified to be re-enterable, as it > will be called again during the defer list flush. > > Another approach that can fix the OVS use-case, is to kick all > cores while waiting for references to be released during the net > cleanup. But that sounds more like a workaround for a current > issue rather than a proper solution and will not cover possible > issues in other parts of the code. > > Additionally checking for skb_zcopy() while deferring. This might > not be necessary, as I'm not sure if we can actually have zero copy > packets on this path, but seems worth having for completeness as we > should never defer such packets regardless. > > CC: Eric Dumazet <edumazet@google.com> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") > Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > --- Happy Friday! :) Resurrecting this thread because I managed to reproduce the issue again on a latest 6.6.0-rc1. (It doesn't mean we need to accept this particular patch, I just think that it's an appropriate discussion thread.) It's a bit different testsuite this time. Last year I had an issue with OVS testsuite, today I have an issue with OVN testsuite. Their structure is very similar, but OVN tests are a fair bit more complex. The story is the same: Each test loads a pack of kernel modules including OVS and nf_conntrack, sends some traffic, verifies OVN functionality, stops OVN/OVS and unloads all the previously loaded modules to clean up the space for the next tests. 
Kernel hangs in the same way as before waiting for nf_conntrack module to be unloaded: 13 root R 100.0 933:17.98 kworker/u80:1+netns CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8 ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e 2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e 2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00 RSP: 0018:ffffb23040073d98 EFLAGS: 00000202 RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38 RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920 RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38 R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080 R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0 FS: 0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <NMI> ? nmi_cpu_backtrace+0x82/0xf0 ? nmi_cpu_backtrace_handler+0xd/0x20 ? nmi_handle+0x5e/0x150 ? default_do_nmi+0x40/0x100 ? exc_nmi+0x112/0x190 ? end_repeat_nmi+0x16/0x67 ? __pfx_kill_all+0x10/0x10 [nf_conntrack] ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] </NMI> <TASK> nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack] cleanup_net+0x213/0x3b0 process_one_work+0x177/0x340 worker_thread+0x27e/0x390 ? __pfx_worker_thread+0x10/0x10 kthread+0xe2/0x110 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> Since this testsuite is different from what I had issues with before, I don't really know when the issue first appeared. It may have been there even last year. I can reproduce the issue with ubuntu 6.2.0-32-generic kernel, but I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9. It doesn't necessarily mean that RHEL9 doesn't have the issue though, the testsuite is not 100% reliable. I'll try to dig deeper and bisect the problem on the upstream kernel. For now it seems like the issue is likely in IPv6 code, because the tests that trigger it are mostly IPv6-related. Best regards, Ilya Maximets.
On 9/22/23 15:26, Ilya Maximets wrote: > On 6/19/22 02:39, Ilya Maximets wrote: >> Open vSwitch system test suite is broken due to inability to >> load/unload netfilter modules. kworker thread is getting trapped >> in the infinite loop while running a net cleanup inside the >> nf_conntrack_cleanup_net_list, because deferred skbuffs are still >> holding nfct references and not being freed by their CPU cores. >> >> In general, the idea that we will have an rx interrupt on every >> CPU core at some point in a near future doesn't seem correct. >> Devices are getting created and destroyed, interrupts are getting >> re-scheduled, CPUs are going online and offline dynamically. >> Any of these events may leave packets stuck in defer list for a >> long time. It might be OK, if they are just a piece of memory, >> but we can't afford them holding references to any other resources. >> >> In case of OVS, nfct reference keeps the kernel thread in busy loop >> while holding a 'pernet_ops_rwsem' semaphore. That blocks the >> later modprobe request from user space: >> >> # ps >> 299 root R 99.3 200:25.89 kworker/u96:4+ >> >> # journalctl >> INFO: task modprobe:11787 blocked for more than 1228 seconds. >> Not tainted 5.19.0-rc2 #8 >> task:modprobe state:D >> Call Trace: >> <TASK> >> __schedule+0x8aa/0x21d0 >> schedule+0xcc/0x200 >> rwsem_down_write_slowpath+0x8e4/0x1580 >> down_write+0xfc/0x140 >> register_pernet_subsys+0x15/0x40 >> nf_nat_init+0xb6/0x1000 [nf_nat] >> do_one_initcall+0xbb/0x410 >> do_init_module+0x1b4/0x640 >> load_module+0x4c1b/0x58d0 >> __do_sys_init_module+0x1d7/0x220 >> do_syscall_64+0x3a/0x80 >> entry_SYSCALL_64_after_hwframe+0x46/0xb0 >> >> At this point OVS testsuite is unresponsive and never recover, >> because these skbuffs are never freed. >> >> Solution is to make sure no external references attached to skb >> before pushing it to the defer list. Using skb_release_head_state() >> for that purpose. The function modified to be re-enterable, as it >> will be called again during the defer list flush. >> >> Another approach that can fix the OVS use-case, is to kick all >> cores while waiting for references to be released during the net >> cleanup. But that sounds more like a workaround for a current >> issue rather than a proper solution and will not cover possible >> issues in other parts of the code. >> >> Additionally checking for skb_zcopy() while deferring. This might >> not be necessary, as I'm not sure if we can actually have zero copy >> packets on this path, but seems worth having for completeness as we >> should never defer such packets regardless. >> >> CC: Eric Dumazet <edumazet@google.com> >> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >> --- > > Happy Friday! :) > > Resurrecting this thread because I managed to reproduce the issue again > on a latest 6.6.0-rc1. > > (It doesn't mean we need to accept this particular patch, I just think > that it's an appropriate discussion thread.) > > It's a bit different testsuite this time. Last year I had an issue with > OVS testsuite, today I have an issue with OVN testsuite. Their structure > is very similar, but OVN tests are a fair bit more complex. > > The story is the same: Each test loads a pack of kernel modules including > OVS and nf_conntrack, sends some traffic, verifies OVN functionality, > stops OVN/OVS and unloads all the previously loaded modules to clean up > the space for the next tests. 
> > Kernel hangs in the same way as before waiting for nf_conntrack module > to be unloaded: > > 13 root R 100.0 933:17.98 kworker/u80:1+netns > > CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014 > Workqueue: netns cleanup_net > RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] > Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8 > ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e > 2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e > 2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00 > RSP: 0018:ffffb23040073d98 EFLAGS: 00000202 > RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38 > RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920 > RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38 > R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080 > R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0 > FS: 0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <NMI> > ? nmi_cpu_backtrace+0x82/0xf0 > ? nmi_cpu_backtrace_handler+0xd/0x20 > ? nmi_handle+0x5e/0x150 > ? default_do_nmi+0x40/0x100 > ? exc_nmi+0x112/0x190 > ? end_repeat_nmi+0x16/0x67 > ? __pfx_kill_all+0x10/0x10 [nf_conntrack] > ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] > ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] > ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] > </NMI> > <TASK> > nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack] > cleanup_net+0x213/0x3b0 > process_one_work+0x177/0x340 > worker_thread+0x27e/0x390 > ? __pfx_worker_thread+0x10/0x10 > kthread+0xe2/0x110 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x30/0x50 > ? __pfx_kthread+0x10/0x10 > ret_from_fork_asm+0x1b/0x30 > </TASK> > > Since this testsuite is different from what I had issues with before, > I don't really know when the issue first appeared. It may have been > there even last year. > > I can reproduce the issue with ubuntu 6.2.0-32-generic kernel, but > I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9. > It doesn't necessarily mean that RHEL9 doesn't have the issue though, > the testsuite is not 100% reliable. > > I'll try to dig deeper and bisect the problem on the upstream kernel. > > For now it seems like the issue is likely in IPv6 code, because the > tests that trigger it are mostly IPv6-related. So, I bisected this down to an unsurprising commit: commit b0e214d212030fe497d4d150bb3474e50ad5d093 Author: Madhu Koriginja <madhu.koriginja@nxp.com> Date: Tue Mar 21 21:28:44 2023 +0530 netfilter: keep conntrack reference until IPsecv6 policy checks are done Keep the conntrack reference until policy checks have been performed for IPsec V6 NAT support, just like ipv4. The reference needs to be dropped before a packet is queued to avoid having the conntrack module unloadable. Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support") Signed-off-by: Madhu Koriginja <madhu.koriginja@nxp.com> Signed-off-by: Florian Westphal <fw@strlen.de> This commit repeated the pattern from the old ipv4 commit b59c270104f0 ("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done") and hence introduced the exact same issue, but for IPv6. 
They both failed to deliver on the "avoid having the conntrack module
unloadable" part.

IIUC, the fix should be exactly the same as Eric did for ipv4 in commit
6f0012e35160 ("tcp: add a missing nf_reset_ct() in 3WHS handling").

I can try that and send a fix.

Would still really like to have some preventive mechanism for this kind of
issues though.  Any ideas on that?

Best regards, Ilya Maximets.
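For context, the IPv6 counterpart would presumably mirror the ipv4 change from commit 6f0012e35160: run the XFRM policy check on the listener while the conntrack entry is still attached (since __xfrm_policy_check() may need it via nf_nat_decode_session()), and only then drop the ct reference, before the skb can be queued or deferred. A rough, untested sketch against tcp_v6_rcv() follows; the hunk locations are approximations, and this is not the patch that was actually posted.

--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ (TCP_NEW_SYN_RECV handling in tcp_v6_rcv(); placement approximate) @@
 		sk = req->rsk_listener;
-		drop_reason = tcp_inbound_md5_hash(sk, skb,
+		if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
+			drop_reason = SKB_DROP_REASON_XFRM_POLICY;
+		else
+			drop_reason = tcp_inbound_md5_hash(sk, skb,
 						   &hdr->saddr, &hdr->daddr,
 						   AF_INET6, dif, sdif);
@@ (later in the same block, once the request socket has been processed) @@
+		nf_reset_ct(skb);
 		if (nsk == sk) {
 			reqsk_put(req);
 			tcp_v6_restore_cb(skb);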
On 9/22/23 21:07, Ilya Maximets wrote: > On 9/22/23 15:26, Ilya Maximets wrote: >> On 6/19/22 02:39, Ilya Maximets wrote: >>> Open vSwitch system test suite is broken due to inability to >>> load/unload netfilter modules. kworker thread is getting trapped >>> in the infinite loop while running a net cleanup inside the >>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still >>> holding nfct references and not being freed by their CPU cores. >>> >>> In general, the idea that we will have an rx interrupt on every >>> CPU core at some point in a near future doesn't seem correct. >>> Devices are getting created and destroyed, interrupts are getting >>> re-scheduled, CPUs are going online and offline dynamically. >>> Any of these events may leave packets stuck in defer list for a >>> long time. It might be OK, if they are just a piece of memory, >>> but we can't afford them holding references to any other resources. >>> >>> In case of OVS, nfct reference keeps the kernel thread in busy loop >>> while holding a 'pernet_ops_rwsem' semaphore. That blocks the >>> later modprobe request from user space: >>> >>> # ps >>> 299 root R 99.3 200:25.89 kworker/u96:4+ >>> >>> # journalctl >>> INFO: task modprobe:11787 blocked for more than 1228 seconds. >>> Not tainted 5.19.0-rc2 #8 >>> task:modprobe state:D >>> Call Trace: >>> <TASK> >>> __schedule+0x8aa/0x21d0 >>> schedule+0xcc/0x200 >>> rwsem_down_write_slowpath+0x8e4/0x1580 >>> down_write+0xfc/0x140 >>> register_pernet_subsys+0x15/0x40 >>> nf_nat_init+0xb6/0x1000 [nf_nat] >>> do_one_initcall+0xbb/0x410 >>> do_init_module+0x1b4/0x640 >>> load_module+0x4c1b/0x58d0 >>> __do_sys_init_module+0x1d7/0x220 >>> do_syscall_64+0x3a/0x80 >>> entry_SYSCALL_64_after_hwframe+0x46/0xb0 >>> >>> At this point OVS testsuite is unresponsive and never recover, >>> because these skbuffs are never freed. >>> >>> Solution is to make sure no external references attached to skb >>> before pushing it to the defer list. Using skb_release_head_state() >>> for that purpose. The function modified to be re-enterable, as it >>> will be called again during the defer list flush. >>> >>> Another approach that can fix the OVS use-case, is to kick all >>> cores while waiting for references to be released during the net >>> cleanup. But that sounds more like a workaround for a current >>> issue rather than a proper solution and will not cover possible >>> issues in other parts of the code. >>> >>> Additionally checking for skb_zcopy() while deferring. This might >>> not be necessary, as I'm not sure if we can actually have zero copy >>> packets on this path, but seems worth having for completeness as we >>> should never defer such packets regardless. >>> >>> CC: Eric Dumazet <edumazet@google.com> >>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") >>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>> --- >> >> Happy Friday! :) >> >> Resurrecting this thread because I managed to reproduce the issue again >> on a latest 6.6.0-rc1. >> >> (It doesn't mean we need to accept this particular patch, I just think >> that it's an appropriate discussion thread.) >> >> It's a bit different testsuite this time. Last year I had an issue with >> OVS testsuite, today I have an issue with OVN testsuite. Their structure >> is very similar, but OVN tests are a fair bit more complex. 
>> >> The story is the same: Each test loads a pack of kernel modules including >> OVS and nf_conntrack, sends some traffic, verifies OVN functionality, >> stops OVN/OVS and unloads all the previously loaded modules to clean up >> the space for the next tests. >> >> Kernel hangs in the same way as before waiting for nf_conntrack module >> to be unloaded: >> >> 13 root R 100.0 933:17.98 kworker/u80:1+netns >> >> CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7 >> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014 >> Workqueue: netns cleanup_net >> RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] >> Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8 >> ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e >> 2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e >> 2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00 >> RSP: 0018:ffffb23040073d98 EFLAGS: 00000202 >> RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38 >> RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920 >> RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38 >> R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080 >> R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0 >> FS: 0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Call Trace: >> <NMI> >> ? nmi_cpu_backtrace+0x82/0xf0 >> ? nmi_cpu_backtrace_handler+0xd/0x20 >> ? nmi_handle+0x5e/0x150 >> ? default_do_nmi+0x40/0x100 >> ? exc_nmi+0x112/0x190 >> ? end_repeat_nmi+0x16/0x67 >> ? __pfx_kill_all+0x10/0x10 [nf_conntrack] >> ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] >> ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] >> ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] >> </NMI> >> <TASK> >> nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack] >> cleanup_net+0x213/0x3b0 >> process_one_work+0x177/0x340 >> worker_thread+0x27e/0x390 >> ? __pfx_worker_thread+0x10/0x10 >> kthread+0xe2/0x110 >> ? __pfx_kthread+0x10/0x10 >> ret_from_fork+0x30/0x50 >> ? __pfx_kthread+0x10/0x10 >> ret_from_fork_asm+0x1b/0x30 >> </TASK> >> >> Since this testsuite is different from what I had issues with before, >> I don't really know when the issue first appeared. It may have been >> there even last year. >> >> I can reproduce the issue with ubuntu 6.2.0-32-generic kernel, but >> I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9. >> It doesn't necessarily mean that RHEL9 doesn't have the issue though, >> the testsuite is not 100% reliable. >> >> I'll try to dig deeper and bisect the problem on the upstream kernel. >> >> For now it seems like the issue is likely in IPv6 code, because the >> tests that trigger it are mostly IPv6-related. > > So, I bisected this down to an unsurprising commit: > > commit b0e214d212030fe497d4d150bb3474e50ad5d093 > Author: Madhu Koriginja <madhu.koriginja@nxp.com> > Date: Tue Mar 21 21:28:44 2023 +0530 > > netfilter: keep conntrack reference until IPsecv6 policy checks are done > > Keep the conntrack reference until policy checks have been performed for > IPsec V6 NAT support, just like ipv4. > > The reference needs to be dropped before a packet is > queued to avoid having the conntrack module unloadable. 
>
> Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support")
> Signed-off-by: Madhu Koriginja <madhu.koriginja@nxp.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
>
> This commit repeated the pattern from the old ipv4 commit b59c270104f0
> ("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done")
> and hence introduced the exact same issue, but for IPv6.  They both failed
> to deliver on the "avoid having the conntrack module unloadable" part.
>
> IIUC, the fix should be exactly the same as Eric did for ipv4 in commit
> 6f0012e35160 ("tcp: add a missing nf_reset_ct() in 3WHS handling").
>
> I can try that and send a fix.

Posted a fix here:
  https://lore.kernel.org/netdev/20230922210530.2045146-1-i.maximets@ovn.org/

It fixes the problem I have with the OVN testsuite.

>
> Would still really like to have some preventive mechanism for this kind of
> issues though.  Any ideas on that?
>
> Best regards, Ilya Maximets.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b3559cb1d82..5660250e795f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -726,11 +726,9 @@ void skb_release_head_state(struct sk_buff *skb)
 	skb_dst_drop(skb);
 	if (skb->destructor) {
 		WARN_ON(in_hardirq());
-		skb->destructor(skb);
+		skb_orphan(skb);
 	}
-#if IS_ENABLED(CONFIG_NF_CONNTRACK)
-	nf_conntrack_put(skb_nfct(skb));
-#endif
+	nf_reset_ct(skb);
 	skb_ext_put(skb);
 }
 
@@ -6502,7 +6500,8 @@ void skb_attempt_defer_free(struct sk_buff *skb)
 
 	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
 	    !cpu_online(cpu) ||
-	    cpu == raw_smp_processor_id()) {
+	    cpu == raw_smp_processor_id() ||
+	    skb_zcopy(skb)) {
 nodefer:	__kfree_skb(skb);
 		return;
 	}
@@ -6512,6 +6511,13 @@ nodefer:	__kfree_skb(skb);
 	if (READ_ONCE(sd->defer_count) >= defer_max)
 		goto nodefer;
 
+	/* skb can contain all kinds of external references that
+	 * will prevent module unloading or destruction of other
+	 * resources.  Need to release them now, since skb can
+	 * stay on a defer list indefinitely.
+	 */
+	skb_release_head_state(skb);
+
 	spin_lock_irqsave(&sd->defer_lock, flags);
 	/* Send an IPI every time queue reaches half capacity. */
 	kick = sd->defer_count == (defer_max >> 1);
Open vSwitch system test suite is broken due to inability to
load/unload netfilter modules.  kworker thread is getting trapped
in the infinite loop while running a net cleanup inside the
nf_conntrack_cleanup_net_list, because deferred skbuffs are still
holding nfct references and not being freed by their CPU cores.

In general, the idea that we will have an rx interrupt on every
CPU core at some point in a near future doesn't seem correct.
Devices are getting created and destroyed, interrupts are getting
re-scheduled, CPUs are going online and offline dynamically.
Any of these events may leave packets stuck in defer list for a
long time.  It might be OK, if they are just a piece of memory,
but we can't afford them holding references to any other resources.

In case of OVS, nfct reference keeps the kernel thread in busy loop
while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
later modprobe request from user space:

  # ps
  299 root      R   99.3  200:25.89  kworker/u96:4+

  # journalctl
  INFO: task modprobe:11787 blocked for more than 1228 seconds.
        Not tainted 5.19.0-rc2 #8
  task:modprobe        state:D
  Call Trace:
   <TASK>
   __schedule+0x8aa/0x21d0
   schedule+0xcc/0x200
   rwsem_down_write_slowpath+0x8e4/0x1580
   down_write+0xfc/0x140
   register_pernet_subsys+0x15/0x40
   nf_nat_init+0xb6/0x1000 [nf_nat]
   do_one_initcall+0xbb/0x410
   do_init_module+0x1b4/0x640
   load_module+0x4c1b/0x58d0
   __do_sys_init_module+0x1d7/0x220
   do_syscall_64+0x3a/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

At this point OVS testsuite is unresponsive and never recover,
because these skbuffs are never freed.

Solution is to make sure no external references attached to skb
before pushing it to the defer list.  Using skb_release_head_state()
for that purpose.  The function modified to be re-enterable, as it
will be called again during the defer list flush.

Another approach that can fix the OVS use-case, is to kick all
cores while waiting for references to be released during the net
cleanup.  But that sounds more like a workaround for a current
issue rather than a proper solution and will not cover possible
issues in other parts of the code.

Additionally checking for skb_zcopy() while deferring.  This might
not be necessary, as I'm not sure if we can actually have zero copy
packets on this path, but seems worth having for completeness as we
should never defer such packets regardless.

CC: Eric Dumazet <edumazet@google.com>
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 net/core/skbuff.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)
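A side note on the "re-enterable" wording above: the two operations the patch switches to, skb_orphan() and nf_reset_ct(), both clear the fields they release, so running them a second time during the defer-list flush is a no-op. Their definitions at the time were approximately the following (from include/linux/skbuff.h, reproduced here only for context):

/* Runs the destructor (e.g. sock_rfree()) and then forgets it, so a
 * later call will not invoke it a second time.
 */
static inline void skb_orphan(struct sk_buff *skb)
{
	if (skb->destructor) {
		skb->destructor(skb);
		skb->destructor = NULL;
		skb->sk		= NULL;
	} else {
		BUG_ON(skb->sk);
	}
}

/* Drops the conntrack reference and clears the pointer, making the
 * operation safe to repeat.
 */
static inline void nf_reset_ct(struct sk_buff *skb)
{
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
	nf_conntrack_put(skb_nfct(skb));
	skb->_nfct = 0;
#endif
}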