diff mbox series

[net] net: ensure all external references are released in deferred skbuffs

Message ID 20220619003919.394622-1-i.maximets@ovn.org (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] net: ensure all external references are released in deferred skbuffs | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 6 this patch: 6
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 35 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Ilya Maximets June 19, 2022, 12:39 a.m. UTC
Open vSwitch system test suite is broken due to inability to
load/unload netfilter modules.  kworker thread is getting trapped
in the infinite loop while running a net cleanup inside the
nf_conntrack_cleanup_net_list, because deferred skbuffs are still
holding nfct references and not being freed by their CPU cores.

In general, the idea that we will have an rx interrupt on every
CPU core at some point in a near future doesn't seem correct.
Devices are getting created and destroyed, interrupts are getting
re-scheduled, CPUs are going online and offline dynamically.
Any of these events may leave packets stuck in defer list for a
long time.  It might be OK, if they are just a piece of memory,
but we can't afford them holding references to any other resources.

In case of OVS, nfct reference keeps the kernel thread in busy loop
while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
later modprobe request from user space:

  # ps
   299 root  R  99.3  200:25.89 kworker/u96:4+

  # journalctl
  INFO: task modprobe:11787 blocked for more than 1228 seconds.
        Not tainted 5.19.0-rc2 #8
  task:modprobe     state:D
  Call Trace:
   <TASK>
   __schedule+0x8aa/0x21d0
   schedule+0xcc/0x200
   rwsem_down_write_slowpath+0x8e4/0x1580
   down_write+0xfc/0x140
   register_pernet_subsys+0x15/0x40
   nf_nat_init+0xb6/0x1000 [nf_nat]
   do_one_initcall+0xbb/0x410
   do_init_module+0x1b4/0x640
   load_module+0x4c1b/0x58d0
   __do_sys_init_module+0x1d7/0x220
   do_syscall_64+0x3a/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

At this point OVS testsuite is unresponsive and never recover,
because these skbuffs are never freed.

Solution is to make sure no external references attached to skb
before pushing it to the defer list.  Using skb_release_head_state()
for that purpose.  The function modified to be re-enterable, as it
will be called again during the defer list flush.

Another approach that can fix the OVS use-case, is to kick all
cores while waiting for references to be released during the net
cleanup.  But that sounds more like a workaround for a current
issue rather than a proper solution and will not cover possible
issues in other parts of the code.

Additionally checking for skb_zcopy() while deferring.  This might
not be necessary, as I'm not sure if we can actually have zero copy
packets on this path, but seems worth having for completeness as we
should never defer such packets regardless.

CC: Eric Dumazet <edumazet@google.com>
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 net/core/skbuff.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

Comments

Eric Dumazet June 22, 2022, 10:02 a.m. UTC | #1
On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> Open vSwitch system test suite is broken due to inability to
> load/unload netfilter modules.  kworker thread is getting trapped
> in the infinite loop while running a net cleanup inside the
> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> holding nfct references and not being freed by their CPU cores.
>
> In general, the idea that we will have an rx interrupt on every
> CPU core at some point in a near future doesn't seem correct.
> Devices are getting created and destroyed, interrupts are getting
> re-scheduled, CPUs are going online and offline dynamically.
> Any of these events may leave packets stuck in defer list for a
> long time.  It might be OK, if they are just a piece of memory,
> but we can't afford them holding references to any other resources.
>
> In case of OVS, nfct reference keeps the kernel thread in busy loop
> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> later modprobe request from user space:
>
>   # ps
>    299 root  R  99.3  200:25.89 kworker/u96:4+
>
>   # journalctl
>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>         Not tainted 5.19.0-rc2 #8
>   task:modprobe     state:D
>   Call Trace:
>    <TASK>
>    __schedule+0x8aa/0x21d0
>    schedule+0xcc/0x200
>    rwsem_down_write_slowpath+0x8e4/0x1580
>    down_write+0xfc/0x140
>    register_pernet_subsys+0x15/0x40
>    nf_nat_init+0xb6/0x1000 [nf_nat]
>    do_one_initcall+0xbb/0x410
>    do_init_module+0x1b4/0x640
>    load_module+0x4c1b/0x58d0
>    __do_sys_init_module+0x1d7/0x220
>    do_syscall_64+0x3a/0x80
>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> At this point OVS testsuite is unresponsive and never recover,
> because these skbuffs are never freed.
>
> Solution is to make sure no external references attached to skb
> before pushing it to the defer list.  Using skb_release_head_state()
> for that purpose.  The function modified to be re-enterable, as it
> will be called again during the defer list flush.
>
> Another approach that can fix the OVS use-case, is to kick all
> cores while waiting for references to be released during the net
> cleanup.  But that sounds more like a workaround for a current
> issue rather than a proper solution and will not cover possible
> issues in other parts of the code.
>
> Additionally checking for skb_zcopy() while deferring.  This might
> not be necessary, as I'm not sure if we can actually have zero copy
> packets on this path, but seems worth having for completeness as we
> should never defer such packets regardless.
>
> CC: Eric Dumazet <edumazet@google.com>
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  net/core/skbuff.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)

I do not think this patch is doing the right thing.

Packets sitting in TCP receive queues should not hold state that is
not relevant for TCP recvmsg().

This consumes extra memory for no good reason, and defer expensive
atomic operations.

We for instance release skb dst before skb is queued, we should do the
same for conntrack state.

This would increase performance anyway, as we free ct state while cpu
caches are hot.
Eric Dumazet June 22, 2022, 10:15 a.m. UTC | #2
On Wed, Jun 22, 2022 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > Open vSwitch system test suite is broken due to inability to
> > load/unload netfilter modules.  kworker thread is getting trapped
> > in the infinite loop while running a net cleanup inside the
> > nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> > holding nfct references and not being freed by their CPU cores.
> >
> > In general, the idea that we will have an rx interrupt on every
> > CPU core at some point in a near future doesn't seem correct.
> > Devices are getting created and destroyed, interrupts are getting
> > re-scheduled, CPUs are going online and offline dynamically.
> > Any of these events may leave packets stuck in defer list for a
> > long time.  It might be OK, if they are just a piece of memory,
> > but we can't afford them holding references to any other resources.
> >
> > In case of OVS, nfct reference keeps the kernel thread in busy loop
> > while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> > later modprobe request from user space:
> >
> >   # ps
> >    299 root  R  99.3  200:25.89 kworker/u96:4+
> >
> >   # journalctl
> >   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> >         Not tainted 5.19.0-rc2 #8
> >   task:modprobe     state:D
> >   Call Trace:
> >    <TASK>
> >    __schedule+0x8aa/0x21d0
> >    schedule+0xcc/0x200
> >    rwsem_down_write_slowpath+0x8e4/0x1580
> >    down_write+0xfc/0x140
> >    register_pernet_subsys+0x15/0x40
> >    nf_nat_init+0xb6/0x1000 [nf_nat]
> >    do_one_initcall+0xbb/0x410
> >    do_init_module+0x1b4/0x640
> >    load_module+0x4c1b/0x58d0
> >    __do_sys_init_module+0x1d7/0x220
> >    do_syscall_64+0x3a/0x80
> >    entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >
> > At this point OVS testsuite is unresponsive and never recover,
> > because these skbuffs are never freed.
> >
> > Solution is to make sure no external references attached to skb
> > before pushing it to the defer list.  Using skb_release_head_state()
> > for that purpose.  The function modified to be re-enterable, as it
> > will be called again during the defer list flush.
> >
> > Another approach that can fix the OVS use-case, is to kick all
> > cores while waiting for references to be released during the net
> > cleanup.  But that sounds more like a workaround for a current
> > issue rather than a proper solution and will not cover possible
> > issues in other parts of the code.
> >
> > Additionally checking for skb_zcopy() while deferring.  This might
> > not be necessary, as I'm not sure if we can actually have zero copy
> > packets on this path, but seems worth having for completeness as we
> > should never defer such packets regardless.
> >
> > CC: Eric Dumazet <edumazet@google.com>
> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> > ---
> >  net/core/skbuff.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
>
> I do not think this patch is doing the right thing.
>
> Packets sitting in TCP receive queues should not hold state that is
> not relevant for TCP recvmsg().
>
> This consumes extra memory for no good reason, and defer expensive
> atomic operations.
>
> We for instance release skb dst before skb is queued, we should do the
> same for conntrack state.
>
> This would increase performance anyway, as we free ct state while cpu
> caches are hot.

I am thinking of the following instead.

A new helper can be added (and later be used in net/packet/af_packet.c
and probably elsewhere)

diff --git a/include/net/dst.h b/include/net/dst.h
index 6aa252c3fc55ccaee58faebf265510469e91d780..7c3316d9d6e73daea17223a5261f6a5c4f68eae3
100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -276,6 +276,15 @@ static inline void skb_dst_drop(struct sk_buff *skb)
        }
 }

+/* Before queueing skb in a receive queue, get rid of
+ * potentially expensive components.
+ */
+static inline void skb_cleanup(struct sk_buff *skb)
+{
+       skb_dst_drop(skb);
+       nf_reset_ct(skb);
+}
+
 static inline void __skb_dst_copy(struct sk_buff *nskb, unsigned long refdst)
 {
        nskb->slow_gro |= !!refdst;
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index fdbcf2a6d08ef4a5164247b5a5b4b222289b191a..913c98e446d56ee067b54b2c704ac1195ef1a81e
100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -177,7 +177,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct
sk_buff *skb)
        if (!skb)
                return;

-       skb_dst_drop(skb);
+       skb_cleanup(skb);
        /* segs_in has been initialized to 1 in tcp_create_openreq_child().
         * Hence, reset segs_in to 0 before calling tcp_segs_in()
         * to avoid double counting.  Also, tcp_segs_in() expects
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2e2a9ece9af27372e6b653d685a89a2c71ba05d1..987981a16ee34e0601e7e722abef1bb098c307c5
100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5005,7 +5005,7 @@ static void tcp_data_queue(struct sock *sk,
struct sk_buff *skb)
                __kfree_skb(skb);
                return;
        }
-       skb_dst_drop(skb);
+       skb_cleanup(skb);
        __skb_pull(skb, tcp_hdr(skb)->doff * 4);

        reason = SKB_DROP_REASON_NOT_SPECIFIED;
@@ -5931,7 +5931,7 @@ void tcp_rcv_established(struct sock *sk, struct
sk_buff *skb)
                        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);

                        /* Bulk data transfer: receiver */
-                       skb_dst_drop(skb);
+                       skb_cleanup(skb);
                        __skb_pull(skb, tcp_header_len);
                        eaten = tcp_queue_rcv(sk, skb, &fragstolen);

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fe8f23b95d32ca4a35d05166d471327bc608fa91..d9acd906f28267ff07450d78d079e4e8eab74957
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1765,7 +1765,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
         */
        skb_condense(skb);

-       skb_dst_drop(skb);
+       skb_cleanup(skb);

        if (unlikely(tcp_checksum_complete(skb))) {
                bh_unlock_sock(sk);
Florian Westphal June 22, 2022, 10:28 a.m. UTC | #3
Eric Dumazet <edumazet@google.com> wrote:
> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > Open vSwitch system test suite is broken due to inability to
> > load/unload netfilter modules.  kworker thread is getting trapped
> > in the infinite loop while running a net cleanup inside the
> > nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> > holding nfct references and not being freed by their CPU cores.
> >
> > In general, the idea that we will have an rx interrupt on every
> > CPU core at some point in a near future doesn't seem correct.
> > Devices are getting created and destroyed, interrupts are getting
> > re-scheduled, CPUs are going online and offline dynamically.
> > Any of these events may leave packets stuck in defer list for a
> > long time.  It might be OK, if they are just a piece of memory,
> > but we can't afford them holding references to any other resources.
> >
> > In case of OVS, nfct reference keeps the kernel thread in busy loop
> > while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> > later modprobe request from user space:
> >
> >   # ps
> >    299 root  R  99.3  200:25.89 kworker/u96:4+
> >
> >   # journalctl
> >   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> >         Not tainted 5.19.0-rc2 #8
> >   task:modprobe     state:D
> >   Call Trace:
> >    <TASK>
> >    __schedule+0x8aa/0x21d0
> >    schedule+0xcc/0x200
> >    rwsem_down_write_slowpath+0x8e4/0x1580
> >    down_write+0xfc/0x140
> >    register_pernet_subsys+0x15/0x40
> >    nf_nat_init+0xb6/0x1000 [nf_nat]
> >    do_one_initcall+0xbb/0x410
> >    do_init_module+0x1b4/0x640
> >    load_module+0x4c1b/0x58d0
> >    __do_sys_init_module+0x1d7/0x220
> >    do_syscall_64+0x3a/0x80
> >    entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >
> > At this point OVS testsuite is unresponsive and never recover,
> > because these skbuffs are never freed.
> >
> > Solution is to make sure no external references attached to skb
> > before pushing it to the defer list.  Using skb_release_head_state()
> > for that purpose.  The function modified to be re-enterable, as it
> > will be called again during the defer list flush.
> >
> > Another approach that can fix the OVS use-case, is to kick all
> > cores while waiting for references to be released during the net
> > cleanup.  But that sounds more like a workaround for a current
> > issue rather than a proper solution and will not cover possible
> > issues in other parts of the code.
> >
> > Additionally checking for skb_zcopy() while deferring.  This might
> > not be necessary, as I'm not sure if we can actually have zero copy
> > packets on this path, but seems worth having for completeness as we
> > should never defer such packets regardless.
> >
> > CC: Eric Dumazet <edumazet@google.com>
> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> > ---
> >  net/core/skbuff.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> I do not think this patch is doing the right thing.
> 
> Packets sitting in TCP receive queues should not hold state that is
> not relevant for TCP recvmsg().

Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
not be possible to remove nf_conntrack module in practice.

I wonder where the deferred skbs are coming from, any and all
queued skbs need the conntrack state dropped.

I don't mind a new helper that does a combined dst+ct release though.
Eric Dumazet June 22, 2022, 10:36 a.m. UTC | #4
On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote:
>
> Eric Dumazet <edumazet@google.com> wrote:
> > On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >
> > > Open vSwitch system test suite is broken due to inability to
> > > load/unload netfilter modules.  kworker thread is getting trapped
> > > in the infinite loop while running a net cleanup inside the
> > > nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> > > holding nfct references and not being freed by their CPU cores.
> > >
> > > In general, the idea that we will have an rx interrupt on every
> > > CPU core at some point in a near future doesn't seem correct.
> > > Devices are getting created and destroyed, interrupts are getting
> > > re-scheduled, CPUs are going online and offline dynamically.
> > > Any of these events may leave packets stuck in defer list for a
> > > long time.  It might be OK, if they are just a piece of memory,
> > > but we can't afford them holding references to any other resources.
> > >
> > > In case of OVS, nfct reference keeps the kernel thread in busy loop
> > > while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> > > later modprobe request from user space:
> > >
> > >   # ps
> > >    299 root  R  99.3  200:25.89 kworker/u96:4+
> > >
> > >   # journalctl
> > >   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> > >         Not tainted 5.19.0-rc2 #8
> > >   task:modprobe     state:D
> > >   Call Trace:
> > >    <TASK>
> > >    __schedule+0x8aa/0x21d0
> > >    schedule+0xcc/0x200
> > >    rwsem_down_write_slowpath+0x8e4/0x1580
> > >    down_write+0xfc/0x140
> > >    register_pernet_subsys+0x15/0x40
> > >    nf_nat_init+0xb6/0x1000 [nf_nat]
> > >    do_one_initcall+0xbb/0x410
> > >    do_init_module+0x1b4/0x640
> > >    load_module+0x4c1b/0x58d0
> > >    __do_sys_init_module+0x1d7/0x220
> > >    do_syscall_64+0x3a/0x80
> > >    entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > >
> > > At this point OVS testsuite is unresponsive and never recover,
> > > because these skbuffs are never freed.
> > >
> > > Solution is to make sure no external references attached to skb
> > > before pushing it to the defer list.  Using skb_release_head_state()
> > > for that purpose.  The function modified to be re-enterable, as it
> > > will be called again during the defer list flush.
> > >
> > > Another approach that can fix the OVS use-case, is to kick all
> > > cores while waiting for references to be released during the net
> > > cleanup.  But that sounds more like a workaround for a current
> > > issue rather than a proper solution and will not cover possible
> > > issues in other parts of the code.
> > >
> > > Additionally checking for skb_zcopy() while deferring.  This might
> > > not be necessary, as I'm not sure if we can actually have zero copy
> > > packets on this path, but seems worth having for completeness as we
> > > should never defer such packets regardless.
> > >
> > > CC: Eric Dumazet <edumazet@google.com>
> > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > > Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> > > ---
> > >  net/core/skbuff.c | 16 +++++++++++-----
> > >  1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > I do not think this patch is doing the right thing.
> >
> > Packets sitting in TCP receive queues should not hold state that is
> > not relevant for TCP recvmsg().
>
> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
> not be possible to remove nf_conntrack module in practice.

Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?

Maybe 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
only widened the problem.

>
> I wonder where the deferred skbs are coming from, any and all
> queued skbs need the conntrack state dropped.
>
> I don't mind a new helper that does a combined dst+ct release though.
Ilya Maximets June 22, 2022, 11:32 a.m. UTC | #5
On 6/22/22 12:36, Eric Dumazet wrote:
> On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote:
>>
>> Eric Dumazet <edumazet@google.com> wrote:
>>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> Open vSwitch system test suite is broken due to inability to
>>>> load/unload netfilter modules.  kworker thread is getting trapped
>>>> in the infinite loop while running a net cleanup inside the
>>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
>>>> holding nfct references and not being freed by their CPU cores.
>>>>
>>>> In general, the idea that we will have an rx interrupt on every
>>>> CPU core at some point in a near future doesn't seem correct.
>>>> Devices are getting created and destroyed, interrupts are getting
>>>> re-scheduled, CPUs are going online and offline dynamically.
>>>> Any of these events may leave packets stuck in defer list for a
>>>> long time.  It might be OK, if they are just a piece of memory,
>>>> but we can't afford them holding references to any other resources.
>>>>
>>>> In case of OVS, nfct reference keeps the kernel thread in busy loop
>>>> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
>>>> later modprobe request from user space:
>>>>
>>>>   # ps
>>>>    299 root  R  99.3  200:25.89 kworker/u96:4+
>>>>
>>>>   # journalctl
>>>>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>>>>         Not tainted 5.19.0-rc2 #8
>>>>   task:modprobe     state:D
>>>>   Call Trace:
>>>>    <TASK>
>>>>    __schedule+0x8aa/0x21d0
>>>>    schedule+0xcc/0x200
>>>>    rwsem_down_write_slowpath+0x8e4/0x1580
>>>>    down_write+0xfc/0x140
>>>>    register_pernet_subsys+0x15/0x40
>>>>    nf_nat_init+0xb6/0x1000 [nf_nat]
>>>>    do_one_initcall+0xbb/0x410
>>>>    do_init_module+0x1b4/0x640
>>>>    load_module+0x4c1b/0x58d0
>>>>    __do_sys_init_module+0x1d7/0x220
>>>>    do_syscall_64+0x3a/0x80
>>>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>>
>>>> At this point OVS testsuite is unresponsive and never recover,
>>>> because these skbuffs are never freed.
>>>>
>>>> Solution is to make sure no external references attached to skb
>>>> before pushing it to the defer list.  Using skb_release_head_state()
>>>> for that purpose.  The function modified to be re-enterable, as it
>>>> will be called again during the defer list flush.
>>>>
>>>> Another approach that can fix the OVS use-case, is to kick all
>>>> cores while waiting for references to be released during the net
>>>> cleanup.  But that sounds more like a workaround for a current
>>>> issue rather than a proper solution and will not cover possible
>>>> issues in other parts of the code.
>>>>
>>>> Additionally checking for skb_zcopy() while deferring.  This might
>>>> not be necessary, as I'm not sure if we can actually have zero copy
>>>> packets on this path, but seems worth having for completeness as we
>>>> should never defer such packets regardless.
>>>>
>>>> CC: Eric Dumazet <edumazet@google.com>
>>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>> ---
>>>>  net/core/skbuff.c | 16 +++++++++++-----
>>>>  1 file changed, 11 insertions(+), 5 deletions(-)
>>>
>>> I do not think this patch is doing the right thing.
>>>
>>> Packets sitting in TCP receive queues should not hold state that is
>>> not relevant for TCP recvmsg().
>>
>> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
>> not be possible to remove nf_conntrack module in practice.
> 
> Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?

Yeah, that is kind of the main problem I have with the current
code.  It's very hard to find all the cases where skb has to be
cleaned up and almost impossible for someone who doesn't know
every aspect of every network subsystem in the kernel.  That's
why I went with the more or less bulletproof approach of cleaning
up while actually deferring.  I can try and test the code you
proposed in the other reply, but at least, I think, we need a
bunch of debug warnings in the skb_attempt_defer_free() to catch
possible issues.

Also, what about cleaning up extensions?  IIUC, at least one
of them can hold external references.  SKB_EXT_SEC_PATH holds
xfrm_state.  We should, probably, free them as well?

And what about zerocopy skb?  I think, we should still not
allow them to be deferred as they seem to hold HW resources.

> 
> Maybe 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> only widened the problem.
> 
>>
>> I wonder where the deferred skbs are coming from, any and all
>> queued skbs need the conntrack state dropped.
>>
>> I don't mind a new helper that does a combined dst+ct release though.
Eric Dumazet June 22, 2022, 11:43 a.m. UTC | #6
On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/22/22 12:36, Eric Dumazet wrote:
> > On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote:
> >>
> >> Eric Dumazet <edumazet@google.com> wrote:
> >>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>
> >>>> Open vSwitch system test suite is broken due to inability to
> >>>> load/unload netfilter modules.  kworker thread is getting trapped
> >>>> in the infinite loop while running a net cleanup inside the
> >>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> >>>> holding nfct references and not being freed by their CPU cores.
> >>>>
> >>>> In general, the idea that we will have an rx interrupt on every
> >>>> CPU core at some point in a near future doesn't seem correct.
> >>>> Devices are getting created and destroyed, interrupts are getting
> >>>> re-scheduled, CPUs are going online and offline dynamically.
> >>>> Any of these events may leave packets stuck in defer list for a
> >>>> long time.  It might be OK, if they are just a piece of memory,
> >>>> but we can't afford them holding references to any other resources.
> >>>>
> >>>> In case of OVS, nfct reference keeps the kernel thread in busy loop
> >>>> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> >>>> later modprobe request from user space:
> >>>>
> >>>>   # ps
> >>>>    299 root  R  99.3  200:25.89 kworker/u96:4+
> >>>>
> >>>>   # journalctl
> >>>>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> >>>>         Not tainted 5.19.0-rc2 #8
> >>>>   task:modprobe     state:D
> >>>>   Call Trace:
> >>>>    <TASK>
> >>>>    __schedule+0x8aa/0x21d0
> >>>>    schedule+0xcc/0x200
> >>>>    rwsem_down_write_slowpath+0x8e4/0x1580
> >>>>    down_write+0xfc/0x140
> >>>>    register_pernet_subsys+0x15/0x40
> >>>>    nf_nat_init+0xb6/0x1000 [nf_nat]
> >>>>    do_one_initcall+0xbb/0x410
> >>>>    do_init_module+0x1b4/0x640
> >>>>    load_module+0x4c1b/0x58d0
> >>>>    __do_sys_init_module+0x1d7/0x220
> >>>>    do_syscall_64+0x3a/0x80
> >>>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >>>>
> >>>> At this point OVS testsuite is unresponsive and never recover,
> >>>> because these skbuffs are never freed.
> >>>>
> >>>> Solution is to make sure no external references attached to skb
> >>>> before pushing it to the defer list.  Using skb_release_head_state()
> >>>> for that purpose.  The function modified to be re-enterable, as it
> >>>> will be called again during the defer list flush.
> >>>>
> >>>> Another approach that can fix the OVS use-case, is to kick all
> >>>> cores while waiting for references to be released during the net
> >>>> cleanup.  But that sounds more like a workaround for a current
> >>>> issue rather than a proper solution and will not cover possible
> >>>> issues in other parts of the code.
> >>>>
> >>>> Additionally checking for skb_zcopy() while deferring.  This might
> >>>> not be necessary, as I'm not sure if we can actually have zero copy
> >>>> packets on this path, but seems worth having for completeness as we
> >>>> should never defer such packets regardless.
> >>>>
> >>>> CC: Eric Dumazet <edumazet@google.com>
> >>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> >>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> >>>> ---
> >>>>  net/core/skbuff.c | 16 +++++++++++-----
> >>>>  1 file changed, 11 insertions(+), 5 deletions(-)
> >>>
> >>> I do not think this patch is doing the right thing.
> >>>
> >>> Packets sitting in TCP receive queues should not hold state that is
> >>> not relevant for TCP recvmsg().
> >>
> >> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
> >> not be possible to remove nf_conntrack module in practice.
> >
> > Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?
>
> Yeah, that is kind of the main problem I have with the current
> code.  It's very hard to find all the cases where skb has to be
> cleaned up and almost impossible for someone who doesn't know
> every aspect of every network subsystem in the kernel.  That's
> why I went with the more or less bulletproof approach of cleaning
> up while actually deferring.  I can try and test the code you
> proposed in the other reply, but at least, I think, we need a
> bunch of debug warnings in the skb_attempt_defer_free() to catch
> possible issues.

Debug warnings are expensive if they need to bring new cache lines.

So far skb_attempt_defer_free() is only used by TCP in well known conditions.


>
> Also, what about cleaning up extensions?  IIUC, at least one
> of them can hold external references.  SKB_EXT_SEC_PATH holds
> xfrm_state.  We should, probably, free them as well?

I do not know about this, I would ask XFRM maintainers

>
> And what about zerocopy skb?  I think, we should still not
> allow them to be deferred as they seem to hold HW resources.

The point of skb_attempt_defer_free() i is to make the freeing happen
at producer
 much instead of consumer.

I do not think there is anything special in this regard with zero
copy. I would leave the current code as is.

A simpler patch might be to move the existing nf_reset_ct() earlier,
can you test this ?

I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu()

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
        struct sock *sk;
        int ret;

+       nf_reset_ct(skb);
+
        drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
        if (skb->pkt_type != PACKET_HOST)
                goto discard_it;
@@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
        if (drop_reason)
                goto discard_and_relse;

-       nf_reset_ct(skb);
-
        if (tcp_filter(sk, skb)) {
                drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
                goto discard_and_relse;
Ilya Maximets June 22, 2022, 2:26 p.m. UTC | #7
On 6/22/22 13:43, Eric Dumazet wrote:
> On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 6/22/22 12:36, Eric Dumazet wrote:
>>> On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote:
>>>>
>>>> Eric Dumazet <edumazet@google.com> wrote:
>>>>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>
>>>>>> Open vSwitch system test suite is broken due to inability to
>>>>>> load/unload netfilter modules.  kworker thread is getting trapped
>>>>>> in the infinite loop while running a net cleanup inside the
>>>>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
>>>>>> holding nfct references and not being freed by their CPU cores.
>>>>>>
>>>>>> In general, the idea that we will have an rx interrupt on every
>>>>>> CPU core at some point in a near future doesn't seem correct.
>>>>>> Devices are getting created and destroyed, interrupts are getting
>>>>>> re-scheduled, CPUs are going online and offline dynamically.
>>>>>> Any of these events may leave packets stuck in defer list for a
>>>>>> long time.  It might be OK, if they are just a piece of memory,
>>>>>> but we can't afford them holding references to any other resources.
>>>>>>
>>>>>> In case of OVS, nfct reference keeps the kernel thread in busy loop
>>>>>> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
>>>>>> later modprobe request from user space:
>>>>>>
>>>>>>   # ps
>>>>>>    299 root  R  99.3  200:25.89 kworker/u96:4+
>>>>>>
>>>>>>   # journalctl
>>>>>>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>>>>>>         Not tainted 5.19.0-rc2 #8
>>>>>>   task:modprobe     state:D
>>>>>>   Call Trace:
>>>>>>    <TASK>
>>>>>>    __schedule+0x8aa/0x21d0
>>>>>>    schedule+0xcc/0x200
>>>>>>    rwsem_down_write_slowpath+0x8e4/0x1580
>>>>>>    down_write+0xfc/0x140
>>>>>>    register_pernet_subsys+0x15/0x40
>>>>>>    nf_nat_init+0xb6/0x1000 [nf_nat]
>>>>>>    do_one_initcall+0xbb/0x410
>>>>>>    do_init_module+0x1b4/0x640
>>>>>>    load_module+0x4c1b/0x58d0
>>>>>>    __do_sys_init_module+0x1d7/0x220
>>>>>>    do_syscall_64+0x3a/0x80
>>>>>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>>>>
>>>>>> At this point OVS testsuite is unresponsive and never recover,
>>>>>> because these skbuffs are never freed.
>>>>>>
>>>>>> Solution is to make sure no external references attached to skb
>>>>>> before pushing it to the defer list.  Using skb_release_head_state()
>>>>>> for that purpose.  The function modified to be re-enterable, as it
>>>>>> will be called again during the defer list flush.
>>>>>>
>>>>>> Another approach that can fix the OVS use-case, is to kick all
>>>>>> cores while waiting for references to be released during the net
>>>>>> cleanup.  But that sounds more like a workaround for a current
>>>>>> issue rather than a proper solution and will not cover possible
>>>>>> issues in other parts of the code.
>>>>>>
>>>>>> Additionally checking for skb_zcopy() while deferring.  This might
>>>>>> not be necessary, as I'm not sure if we can actually have zero copy
>>>>>> packets on this path, but seems worth having for completeness as we
>>>>>> should never defer such packets regardless.
>>>>>>
>>>>>> CC: Eric Dumazet <edumazet@google.com>
>>>>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>>>> ---
>>>>>>  net/core/skbuff.c | 16 +++++++++++-----
>>>>>>  1 file changed, 11 insertions(+), 5 deletions(-)
>>>>>
>>>>> I do not think this patch is doing the right thing.
>>>>>
>>>>> Packets sitting in TCP receive queues should not hold state that is
>>>>> not relevant for TCP recvmsg().
>>>>
>>>> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
>>>> not be possible to remove nf_conntrack module in practice.
>>>
>>> Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?
>>
>> Yeah, that is kind of the main problem I have with the current
>> code.  It's very hard to find all the cases where skb has to be
>> cleaned up and almost impossible for someone who doesn't know
>> every aspect of every network subsystem in the kernel.  That's
>> why I went with the more or less bulletproof approach of cleaning
>> up while actually deferring.  I can try and test the code you
>> proposed in the other reply, but at least, I think, we need a
>> bunch of debug warnings in the skb_attempt_defer_free() to catch
>> possible issues.
> 
> Debug warnings are expensive if they need to bring new cache lines.
> 
> So far skb_attempt_defer_free() is only used by TCP in well known conditions.

That's true for skb_attempt_defer_free() itself, but it's
hard to tell the same for all the parts of the code that
are enqueuing these skbuffs.  Even if all the places are
well known, it looks to me highly error prone.  Having
a couple of DEBUG_NET_WARN_ON_ONCE should not cause any
performance issues, IIUC, unless debug is enabled, right?

e.g.

	DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));

> 
> 
>>
>> Also, what about cleaning up extensions?  IIUC, at least one
>> of them can hold external references.  SKB_EXT_SEC_PATH holds
>> xfrm_state.  We should, probably, free them as well?
> 
> I do not know about this, I would ask XFRM maintainers

@Steffen, @Herbert, what do you think?  Is it a problem if the
skb holding SKB_EXT_SEC_PATH extension is held not freed in
a defer list indefinitely (for a very long time)?  Or is it
even possible for an skb with SKB_EXT_SEC_PATH extension present
to be enqueued to a TCP receive queue without releasing all
the references?

This seems like a long existing problem though, so can be fixed
separately, if it is a problem.

> 
>>
>> And what about zerocopy skb?  I think, we should still not
>> allow them to be deferred as they seem to hold HW resources.
> 
> The point of skb_attempt_defer_free() i is to make the freeing happen
> at producer
>  much instead of consumer.
> 
> I do not think there is anything special in this regard with zero
> copy. I would leave the current code as is.

The problem is that these skbuffs are never actually freed,
so if they do hold HW resources, these resources will never be
released (never == for a very long time, hours, days).

Not a blocker from my point of view, just trying to highlight
a possible issue.

> 
> A simpler patch might be to move the existing nf_reset_ct() earlier,
> can you test this ?

I tested the patch below and it seems to fix the issue seen
with OVS testsuite.  Though it's not obvious for me why this
happens.  Can you explain a bit more?

> 
> I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu()
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
> 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
>         struct sock *sk;
>         int ret;
> 
> +       nf_reset_ct(skb);
> +
>         drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
>         if (skb->pkt_type != PACKET_HOST)
>                 goto discard_it;
> @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
>         if (drop_reason)
>                 goto discard_and_relse;
> 
> -       nf_reset_ct(skb);
> -
>         if (tcp_filter(sk, skb)) {
>                 drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
>                 goto discard_and_relse;
Eric Dumazet June 22, 2022, 2:35 p.m. UTC | #8
On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/22/22 13:43, Eric Dumazet wrote:
> > On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 6/22/22 12:36, Eric Dumazet wrote:
> >>> On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote:
> >>>>
> >>>> Eric Dumazet <edumazet@google.com> wrote:
> >>>>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>
> >>>>>> Open vSwitch system test suite is broken due to inability to
> >>>>>> load/unload netfilter modules.  kworker thread is getting trapped
> >>>>>> in the infinite loop while running a net cleanup inside the
> >>>>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> >>>>>> holding nfct references and not being freed by their CPU cores.
> >>>>>>
> >>>>>> In general, the idea that we will have an rx interrupt on every
> >>>>>> CPU core at some point in a near future doesn't seem correct.
> >>>>>> Devices are getting created and destroyed, interrupts are getting
> >>>>>> re-scheduled, CPUs are going online and offline dynamically.
> >>>>>> Any of these events may leave packets stuck in defer list for a
> >>>>>> long time.  It might be OK, if they are just a piece of memory,
> >>>>>> but we can't afford them holding references to any other resources.
> >>>>>>
> >>>>>> In case of OVS, nfct reference keeps the kernel thread in busy loop
> >>>>>> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> >>>>>> later modprobe request from user space:
> >>>>>>
> >>>>>>   # ps
> >>>>>>    299 root  R  99.3  200:25.89 kworker/u96:4+
> >>>>>>
> >>>>>>   # journalctl
> >>>>>>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> >>>>>>         Not tainted 5.19.0-rc2 #8
> >>>>>>   task:modprobe     state:D
> >>>>>>   Call Trace:
> >>>>>>    <TASK>
> >>>>>>    __schedule+0x8aa/0x21d0
> >>>>>>    schedule+0xcc/0x200
> >>>>>>    rwsem_down_write_slowpath+0x8e4/0x1580
> >>>>>>    down_write+0xfc/0x140
> >>>>>>    register_pernet_subsys+0x15/0x40
> >>>>>>    nf_nat_init+0xb6/0x1000 [nf_nat]
> >>>>>>    do_one_initcall+0xbb/0x410
> >>>>>>    do_init_module+0x1b4/0x640
> >>>>>>    load_module+0x4c1b/0x58d0
> >>>>>>    __do_sys_init_module+0x1d7/0x220
> >>>>>>    do_syscall_64+0x3a/0x80
> >>>>>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >>>>>>
> >>>>>> At this point OVS testsuite is unresponsive and never recover,
> >>>>>> because these skbuffs are never freed.
> >>>>>>
> >>>>>> Solution is to make sure no external references attached to skb
> >>>>>> before pushing it to the defer list.  Using skb_release_head_state()
> >>>>>> for that purpose.  The function modified to be re-enterable, as it
> >>>>>> will be called again during the defer list flush.
> >>>>>>
> >>>>>> Another approach that can fix the OVS use-case, is to kick all
> >>>>>> cores while waiting for references to be released during the net
> >>>>>> cleanup.  But that sounds more like a workaround for a current
> >>>>>> issue rather than a proper solution and will not cover possible
> >>>>>> issues in other parts of the code.
> >>>>>>
> >>>>>> Additionally checking for skb_zcopy() while deferring.  This might
> >>>>>> not be necessary, as I'm not sure if we can actually have zero copy
> >>>>>> packets on this path, but seems worth having for completeness as we
> >>>>>> should never defer such packets regardless.
> >>>>>>
> >>>>>> CC: Eric Dumazet <edumazet@google.com>
> >>>>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> >>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> >>>>>> ---
> >>>>>>  net/core/skbuff.c | 16 +++++++++++-----
> >>>>>>  1 file changed, 11 insertions(+), 5 deletions(-)
> >>>>>
> >>>>> I do not think this patch is doing the right thing.
> >>>>>
> >>>>> Packets sitting in TCP receive queues should not hold state that is
> >>>>> not relevant for TCP recvmsg().
> >>>>
> >>>> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
> >>>> not be possible to remove nf_conntrack module in practice.
> >>>
> >>> Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?
> >>
> >> Yeah, that is kind of the main problem I have with the current
> >> code.  It's very hard to find all the cases where skb has to be
> >> cleaned up and almost impossible for someone who doesn't know
> >> every aspect of every network subsystem in the kernel.  That's
> >> why I went with the more or less bulletproof approach of cleaning
> >> up while actually deferring.  I can try and test the code you
> >> proposed in the other reply, but at least, I think, we need a
> >> bunch of debug warnings in the skb_attempt_defer_free() to catch
> >> possible issues.
> >
> > Debug warnings are expensive if they need to bring new cache lines.
> >
> > So far skb_attempt_defer_free() is only used by TCP in well known conditions.
>
> That's true for skb_attempt_defer_free() itself, but it's
> hard to tell the same for all the parts of the code that
> are enqueuing these skbuffs.  Even if all the places are
> well known, it looks to me highly error prone.  Having
> a couple of DEBUG_NET_WARN_ON_ONCE should not cause any
> performance issues, IIUC, unless debug is enabled, right?
>
> e.g.
>
>         DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
>         DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));

Sure, but we need to put them at the right place.

Doing these checks in skb_attempt_defer_free() is too late.

>
> >
> >
> >>
> >> Also, what about cleaning up extensions?  IIUC, at least one
> >> of them can hold external references.  SKB_EXT_SEC_PATH holds
> >> xfrm_state.  We should, probably, free them as well?
> >
> > I do not know about this, I would ask XFRM maintainers
>
> @Steffen, @Herbert, what do you think?  Is it a problem if the
> skb holding SKB_EXT_SEC_PATH extension is held not freed in
> a defer list indefinitely (for a very long time)?  Or is it
> even possible for an skb with SKB_EXT_SEC_PATH extension present
> to be enqueued to a TCP receive queue without releasing all
> the references?
>
> This seems like a long existing problem though, so can be fixed
> separately, if it is a problem.
>
> >
> >>
> >> And what about zerocopy skb?  I think, we should still not
> >> allow them to be deferred as they seem to hold HW resources.
> >
> > The point of skb_attempt_defer_free() i is to make the freeing happen
> > at producer
> >  much instead of consumer.
> >
> > I do not think there is anything special in this regard with zero
> > copy. I would leave the current code as is.
>
> The problem is that these skbuffs are never actually freed,
> so if they do hold HW resources, these resources will never be
> released (never == for a very long time, hours, days).

This is fine. Number of such skbs per cpu is limited.

Or if some HW feature needs buffers to be released _immediately_, we
have to implement
a method for doing so.

Please point why buffers sitting in TCP receive queues are not causing issues ?
Do we need a general method "Please scan all TCP sockets and release
skb holding HW resources immediately"

>
> Not a blocker from my point of view, just trying to highlight
> a possible issue.

The tradeoff is known really, as for any cache.

>
> >
> > A simpler patch might be to move the existing nf_reset_ct() earlier,
> > can you test this ?
>
> I tested the patch below and it seems to fix the issue seen
> with OVS testsuite.  Though it's not obvious for me why this
> happens.  Can you explain a bit more?

I think the bug predates my recent change.

skb_attempt_defer_free() is increasing the probability of hitting the bug.

>
> >
> > I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu()
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
> > 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >         struct sock *sk;
> >         int ret;
> >
> > +       nf_reset_ct(skb);
> > +
> >         drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> >         if (skb->pkt_type != PACKET_HOST)
> >                 goto discard_it;
> > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >         if (drop_reason)
> >                 goto discard_and_relse;
> >
> > -       nf_reset_ct(skb);
> > -
> >         if (tcp_filter(sk, skb)) {
> >                 drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
> >                 goto discard_and_relse;
>
Eric Dumazet June 22, 2022, 4:29 p.m. UTC | #9
On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/22/22 13:43, Eric Dumazet wrote:

>
> I tested the patch below and it seems to fix the issue seen
> with OVS testsuite.  Though it's not obvious for me why this
> happens.  Can you explain a bit more?

Anyway, I am not sure we can call nf_reset_ct(skb) that early.

git log seems to say that xfrm check needs to be done before
nf_reset_ct(skb), I have no idea why.

I suspect some incoming packets are not going through
xfrm4_policy_check() and end up being stored in a TCP receive queue.

Maybe something is missing before calling tcp_child_process()


>
> >
> > I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu()
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
> > 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >         struct sock *sk;
> >         int ret;
> >
> > +       nf_reset_ct(skb);
> > +
> >         drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> >         if (skb->pkt_type != PACKET_HOST)
> >                 goto discard_it;
> > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >         if (drop_reason)
> >                 goto discard_and_relse;
> >
> > -       nf_reset_ct(skb);
> > -
> >         if (tcp_filter(sk, skb)) {
> >                 drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
> >                 goto discard_and_relse;
>
Eric Dumazet June 22, 2022, 4:39 p.m. UTC | #10
On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 6/22/22 13:43, Eric Dumazet wrote:
>
> >
> > I tested the patch below and it seems to fix the issue seen
> > with OVS testsuite.  Though it's not obvious for me why this
> > happens.  Can you explain a bit more?
>
> Anyway, I am not sure we can call nf_reset_ct(skb) that early.
>
> git log seems to say that xfrm check needs to be done before
> nf_reset_ct(skb), I have no idea why.

Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
after nf_reset_ct(skb)

Steffen, do you have some comments ?

Some context:
commit b59c270104f03960069596722fea70340579244d
Author: Patrick McHardy <kaber@trash.net>
Date:   Fri Jan 6 23:06:10 2006 -0800

    [NETFILTER]: Keep conntrack reference until IPsec policy checks are done

    Keep the conntrack reference until policy checks have been performed for
    IPsec NAT support. The reference needs to be dropped before a packet is
    queued to avoid having the conntrack module unloadable.

    Signed-off-by: Patrick McHardy <kaber@trash.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>


>
> I suspect some incoming packets are not going through
> xfrm4_policy_check() and end up being stored in a TCP receive queue.
>
> Maybe something is missing before calling tcp_child_process()
>
>
> >
> > >
> > > I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu()
> > >
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
> > > 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > >         struct sock *sk;
> > >         int ret;
> > >
> > > +       nf_reset_ct(skb);
> > > +
> > >         drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> > >         if (skb->pkt_type != PACKET_HOST)
> > >                 goto discard_it;
> > > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > >         if (drop_reason)
> > >                 goto discard_and_relse;
> > >
> > > -       nf_reset_ct(skb);
> > > -
> > >         if (tcp_filter(sk, skb)) {
> > >                 drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
> > >                 goto discard_and_relse;
> >
Eric Dumazet June 22, 2022, 4:47 p.m. UTC | #11
On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >
> > > On 6/22/22 13:43, Eric Dumazet wrote:
> >
> > >
> > > I tested the patch below and it seems to fix the issue seen
> > > with OVS testsuite.  Though it's not obvious for me why this
> > > happens.  Can you explain a bit more?
> >
> > Anyway, I am not sure we can call nf_reset_ct(skb) that early.
> >
> > git log seems to say that xfrm check needs to be done before
> > nf_reset_ct(skb), I have no idea why.
>
> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
> after nf_reset_ct(skb)
>
> Steffen, do you have some comments ?
>
> Some context:
> commit b59c270104f03960069596722fea70340579244d
> Author: Patrick McHardy <kaber@trash.net>
> Date:   Fri Jan 6 23:06:10 2006 -0800
>
>     [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
>
>     Keep the conntrack reference until policy checks have been performed for
>     IPsec NAT support. The reference needs to be dropped before a packet is
>     queued to avoid having the conntrack module unloadable.
>
>     Signed-off-by: Patrick McHardy <kaber@trash.net>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>

Oh well... __xfrm_policy_check() has :

nf_nat_decode_session(skb, &fl, family);

This  answers my questions.

This means we are probably missing at least one XFRM check in TCP
stack in some cases.
(Only after adding this XFRM check we can call nf_reset_ct(skb))

>
> >
> > I suspect some incoming packets are not going through
> > xfrm4_policy_check() and end up being stored in a TCP receive queue.
> >
> > Maybe something is missing before calling tcp_child_process()
> >
> >
> > >
> > > >
> > > > I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu()
> > > >
> > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
> > > > 100644
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > >         struct sock *sk;
> > > >         int ret;
> > > >
> > > > +       nf_reset_ct(skb);
> > > > +
> > > >         drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> > > >         if (skb->pkt_type != PACKET_HOST)
> > > >                 goto discard_it;
> > > > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > >         if (drop_reason)
> > > >                 goto discard_and_relse;
> > > >
> > > > -       nf_reset_ct(skb);
> > > > -
> > > >         if (tcp_filter(sk, skb)) {
> > > >                 drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
> > > >                 goto discard_and_relse;
> > >
Eric Dumazet June 22, 2022, 5:03 p.m. UTC | #12
On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > >
> > > > On 6/22/22 13:43, Eric Dumazet wrote:
> > >
> > > >
> > > > I tested the patch below and it seems to fix the issue seen
> > > > with OVS testsuite.  Though it's not obvious for me why this
> > > > happens.  Can you explain a bit more?
> > >
> > > Anyway, I am not sure we can call nf_reset_ct(skb) that early.
> > >
> > > git log seems to say that xfrm check needs to be done before
> > > nf_reset_ct(skb), I have no idea why.
> >
> > Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
> > after nf_reset_ct(skb)
> >
> > Steffen, do you have some comments ?
> >
> > Some context:
> > commit b59c270104f03960069596722fea70340579244d
> > Author: Patrick McHardy <kaber@trash.net>
> > Date:   Fri Jan 6 23:06:10 2006 -0800
> >
> >     [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
> >
> >     Keep the conntrack reference until policy checks have been performed for
> >     IPsec NAT support. The reference needs to be dropped before a packet is
> >     queued to avoid having the conntrack module unloadable.
> >
> >     Signed-off-by: Patrick McHardy <kaber@trash.net>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> >
>
> Oh well... __xfrm_policy_check() has :
>
> nf_nat_decode_session(skb, &fl, family);
>
> This  answers my questions.
>
> This means we are probably missing at least one XFRM check in TCP
> stack in some cases.
> (Only after adding this XFRM check we can call nf_reset_ct(skb))
>

Maybe this will help ?

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fe8f23b95d32ca4a35d05166d471327bc608fa91..49c1348e40b6c7b6a98b54d716f29c948e00ba33
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2019,12 +2019,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
                if (nsk == sk) {
                        reqsk_put(req);
                        tcp_v4_restore_cb(skb);
-               } else if (tcp_child_process(sk, nsk, skb)) {
-                       tcp_v4_send_reset(nsk, skb);
-                       goto discard_and_relse;
                } else {
-                       sock_put(sk);
-                       return 0;
+                       if (!xfrm4_policy_check(nsk, XFRM_POLICY_IN, skb)) {
+                               drop_reason = SKB_DROP_REASON_XFRM_POLICY;
+                               goto discard_and_relse;
+                       }
+                       nf_reset_ct(skb);
+                       if (tcp_child_process(sk, nsk, skb)) {
+                               tcp_v4_send_reset(nsk, skb);
+                               goto discard_and_relse;
+                       } else {
+                               sock_put(sk);
+                               return 0;
+                       }
                }
        }
Ilya Maximets June 22, 2022, 6:09 p.m. UTC | #13
On 6/22/22 16:35, Eric Dumazet wrote:
> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 6/22/22 13:43, Eric Dumazet wrote:
>>> On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> On 6/22/22 12:36, Eric Dumazet wrote:
>>>>> On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@strlen.de> wrote:
>>>>>>
>>>>>> Eric Dumazet <edumazet@google.com> wrote:
>>>>>>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>
>>>>>>>> Open vSwitch system test suite is broken due to inability to
>>>>>>>> load/unload netfilter modules.  kworker thread is getting trapped
>>>>>>>> in the infinite loop while running a net cleanup inside the
>>>>>>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
>>>>>>>> holding nfct references and not being freed by their CPU cores.
>>>>>>>>
>>>>>>>> In general, the idea that we will have an rx interrupt on every
>>>>>>>> CPU core at some point in a near future doesn't seem correct.
>>>>>>>> Devices are getting created and destroyed, interrupts are getting
>>>>>>>> re-scheduled, CPUs are going online and offline dynamically.
>>>>>>>> Any of these events may leave packets stuck in defer list for a
>>>>>>>> long time.  It might be OK, if they are just a piece of memory,
>>>>>>>> but we can't afford them holding references to any other resources.
>>>>>>>>
>>>>>>>> In case of OVS, nfct reference keeps the kernel thread in busy loop
>>>>>>>> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
>>>>>>>> later modprobe request from user space:
>>>>>>>>
>>>>>>>>   # ps
>>>>>>>>    299 root  R  99.3  200:25.89 kworker/u96:4+
>>>>>>>>
>>>>>>>>   # journalctl
>>>>>>>>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>>>>>>>>         Not tainted 5.19.0-rc2 #8
>>>>>>>>   task:modprobe     state:D
>>>>>>>>   Call Trace:
>>>>>>>>    <TASK>
>>>>>>>>    __schedule+0x8aa/0x21d0
>>>>>>>>    schedule+0xcc/0x200
>>>>>>>>    rwsem_down_write_slowpath+0x8e4/0x1580
>>>>>>>>    down_write+0xfc/0x140
>>>>>>>>    register_pernet_subsys+0x15/0x40
>>>>>>>>    nf_nat_init+0xb6/0x1000 [nf_nat]
>>>>>>>>    do_one_initcall+0xbb/0x410
>>>>>>>>    do_init_module+0x1b4/0x640
>>>>>>>>    load_module+0x4c1b/0x58d0
>>>>>>>>    __do_sys_init_module+0x1d7/0x220
>>>>>>>>    do_syscall_64+0x3a/0x80
>>>>>>>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>>>>>>
>>>>>>>> At this point OVS testsuite is unresponsive and never recover,
>>>>>>>> because these skbuffs are never freed.
>>>>>>>>
>>>>>>>> Solution is to make sure no external references attached to skb
>>>>>>>> before pushing it to the defer list.  Using skb_release_head_state()
>>>>>>>> for that purpose.  The function modified to be re-enterable, as it
>>>>>>>> will be called again during the defer list flush.
>>>>>>>>
>>>>>>>> Another approach that can fix the OVS use-case, is to kick all
>>>>>>>> cores while waiting for references to be released during the net
>>>>>>>> cleanup.  But that sounds more like a workaround for a current
>>>>>>>> issue rather than a proper solution and will not cover possible
>>>>>>>> issues in other parts of the code.
>>>>>>>>
>>>>>>>> Additionally checking for skb_zcopy() while deferring.  This might
>>>>>>>> not be necessary, as I'm not sure if we can actually have zero copy
>>>>>>>> packets on this path, but seems worth having for completeness as we
>>>>>>>> should never defer such packets regardless.
>>>>>>>>
>>>>>>>> CC: Eric Dumazet <edumazet@google.com>
>>>>>>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>>>>>> ---
>>>>>>>>  net/core/skbuff.c | 16 +++++++++++-----
>>>>>>>>  1 file changed, 11 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> I do not think this patch is doing the right thing.
>>>>>>>
>>>>>>> Packets sitting in TCP receive queues should not hold state that is
>>>>>>> not relevant for TCP recvmsg().
>>>>>>
>>>>>> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
>>>>>> not be possible to remove nf_conntrack module in practice.
>>>>>
>>>>> Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?
>>>>
>>>> Yeah, that is kind of the main problem I have with the current
>>>> code.  It's very hard to find all the cases where skb has to be
>>>> cleaned up and almost impossible for someone who doesn't know
>>>> every aspect of every network subsystem in the kernel.  That's
>>>> why I went with the more or less bulletproof approach of cleaning
>>>> up while actually deferring.  I can try and test the code you
>>>> proposed in the other reply, but at least, I think, we need a
>>>> bunch of debug warnings in the skb_attempt_defer_free() to catch
>>>> possible issues.
>>>
>>> Debug warnings are expensive if they need to bring new cache lines.
>>>
>>> So far skb_attempt_defer_free() is only used by TCP in well known conditions.
>>
>> That's true for skb_attempt_defer_free() itself, but it's
>> hard to tell the same for all the parts of the code that
>> are enqueuing these skbuffs.  Even if all the places are
>> well known, it looks to me highly error prone.  Having
>> a couple of DEBUG_NET_WARN_ON_ONCE should not cause any
>> performance issues, IIUC, unless debug is enabled, right?
>>
>> e.g.
>>
>>         DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
>>         DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
> 
> Sure, but we need to put them at the right place.
> 
> Doing these checks in skb_attempt_defer_free() is too late.

Yes, I agree that it doesn't give any useful information for
debugging, because we need to know who put those packets in
the queue, not who took them out.  But 'not too late' places
should already have calls to reset these fields in a close
proximity to where the warnings should be, so it's hard to
tell if that will make a lot of sense.  What I had in mind is
just a red flag that says that something went wrong, so it
can be raised while testing new changes to the kernel and
running with debug enabled.  That can save some debugging and
bisecting time in the future if some unexpected nfct or dst
reference will sneak into the queue.
But I will not insist on adding these, if you think they are
not valuable.

> 
>>
>>>
>>>
>>>>
>>>> Also, what about cleaning up extensions?  IIUC, at least one
>>>> of them can hold external references.  SKB_EXT_SEC_PATH holds
>>>> xfrm_state.  We should, probably, free them as well?
>>>
>>> I do not know about this, I would ask XFRM maintainers
>>
>> @Steffen, @Herbert, what do you think?  Is it a problem if the
>> skb holding SKB_EXT_SEC_PATH extension is held not freed in
>> a defer list indefinitely (for a very long time)?  Or is it
>> even possible for an skb with SKB_EXT_SEC_PATH extension present
>> to be enqueued to a TCP receive queue without releasing all
>> the references?

To answer my own questions:
 - Yes, it's possible.
 - No, doesn't seem to be a big problem, IIUC, as xfrm_state
   seem to be self-contained and keeping it around for a long
   time doesn't seem to be harmful.

Some input from xfrm maintainers would still be great though.

>>
>> This seems like a long existing problem though, so can be fixed
>> separately, if it is a problem.
>>
>>>
>>>>
>>>> And what about zerocopy skb?  I think, we should still not
>>>> allow them to be deferred as they seem to hold HW resources.
>>>
>>> The point of skb_attempt_defer_free() i is to make the freeing happen
>>> at producer
>>>  much instead of consumer.
>>>
>>> I do not think there is anything special in this regard with zero
>>> copy. I would leave the current code as is.
>>
>> The problem is that these skbuffs are never actually freed,
>> so if they do hold HW resources, these resources will never be
>> released (never == for a very long time, hours, days).
> 
> This is fine. Number of such skbs per cpu is limited.
> 
> Or if some HW feature needs buffers to be released _immediately_, we
> have to implement
> a method for doing so.
> 
> Please point why buffers sitting in TCP receive queues are not causing issues ?
> Do we need a general method "Please scan all TCP sockets and release
> skb holding HW resources immediately"

That's a good point.  I agree that it's probably not a big deal
in this case.

> 
>>
>> Not a blocker from my point of view, just trying to highlight
>> a possible issue.
> 
> The tradeoff is known really, as for any cache.
> 
>>
>>>
>>> A simpler patch might be to move the existing nf_reset_ct() earlier,
>>> can you test this ?
>>
>> I tested the patch below and it seems to fix the issue seen
>> with OVS testsuite.  Though it's not obvious for me why this
>> happens.  Can you explain a bit more?
> 
> I think the bug predates my recent change.
> 
> skb_attempt_defer_free() is increasing the probability of hitting the bug.
> 
>>
>>>
>>> I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu()
>>>
>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>> index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
>>> 100644
>>> --- a/net/ipv4/tcp_ipv4.c
>>> +++ b/net/ipv4/tcp_ipv4.c
>>> @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
>>>         struct sock *sk;
>>>         int ret;
>>>
>>> +       nf_reset_ct(skb);
>>> +
>>>         drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
>>>         if (skb->pkt_type != PACKET_HOST)
>>>                 goto discard_it;
>>> @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
>>>         if (drop_reason)
>>>                 goto discard_and_relse;
>>>
>>> -       nf_reset_ct(skb);
>>> -
>>>         if (tcp_filter(sk, skb)) {
>>>                 drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
>>>                 goto discard_and_relse;
>>
Ilya Maximets June 22, 2022, 6:19 p.m. UTC | #14
On 6/22/22 19:03, Eric Dumazet wrote:
> On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote:
>>>
>>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote:
>>>>
>>>> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>
>>>>> On 6/22/22 13:43, Eric Dumazet wrote:
>>>>
>>>>>
>>>>> I tested the patch below and it seems to fix the issue seen
>>>>> with OVS testsuite.  Though it's not obvious for me why this
>>>>> happens.  Can you explain a bit more?
>>>>
>>>> Anyway, I am not sure we can call nf_reset_ct(skb) that early.
>>>>
>>>> git log seems to say that xfrm check needs to be done before
>>>> nf_reset_ct(skb), I have no idea why.
>>>
>>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
>>> after nf_reset_ct(skb)
>>>
>>> Steffen, do you have some comments ?
>>>
>>> Some context:
>>> commit b59c270104f03960069596722fea70340579244d
>>> Author: Patrick McHardy <kaber@trash.net>
>>> Date:   Fri Jan 6 23:06:10 2006 -0800
>>>
>>>     [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
>>>
>>>     Keep the conntrack reference until policy checks have been performed for
>>>     IPsec NAT support. The reference needs to be dropped before a packet is
>>>     queued to avoid having the conntrack module unloadable.
>>>
>>>     Signed-off-by: Patrick McHardy <kaber@trash.net>
>>>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>>
>>
>> Oh well... __xfrm_policy_check() has :
>>
>> nf_nat_decode_session(skb, &fl, family);
>>
>> This  answers my questions.
>>
>> This means we are probably missing at least one XFRM check in TCP
>> stack in some cases.
>> (Only after adding this XFRM check we can call nf_reset_ct(skb))
>>
> 
> Maybe this will help ?

I tested this patch and it seems to fix the OVS problem.
I did not test the xfrm part of it.

Will you post an official patch?

> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fe8f23b95d32ca4a35d05166d471327bc608fa91..49c1348e40b6c7b6a98b54d716f29c948e00ba33
> 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2019,12 +2019,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
>                 if (nsk == sk) {
>                         reqsk_put(req);
>                         tcp_v4_restore_cb(skb);
> -               } else if (tcp_child_process(sk, nsk, skb)) {
> -                       tcp_v4_send_reset(nsk, skb);
> -                       goto discard_and_relse;
>                 } else {
> -                       sock_put(sk);
> -                       return 0;
> +                       if (!xfrm4_policy_check(nsk, XFRM_POLICY_IN, skb)) {
> +                               drop_reason = SKB_DROP_REASON_XFRM_POLICY;
> +                               goto discard_and_relse;
> +                       }
> +                       nf_reset_ct(skb);
> +                       if (tcp_child_process(sk, nsk, skb)) {
> +                               tcp_v4_send_reset(nsk, skb);
> +                               goto discard_and_relse;
> +                       } else {
> +                               sock_put(sk);
> +                               return 0;
> +                       }
>                 }
>         }
Eric Dumazet June 22, 2022, 7:04 p.m. UTC | #15
On Wed, Jun 22, 2022 at 8:19 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/22/22 19:03, Eric Dumazet wrote:
> > On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote:
> >>
> >> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote:
> >>>
> >>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote:
> >>>>
> >>>> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>
> >>>>> On 6/22/22 13:43, Eric Dumazet wrote:
> >>>>
> >>>>>
> >>>>> I tested the patch below and it seems to fix the issue seen
> >>>>> with OVS testsuite.  Though it's not obvious for me why this
> >>>>> happens.  Can you explain a bit more?
> >>>>
> >>>> Anyway, I am not sure we can call nf_reset_ct(skb) that early.
> >>>>
> >>>> git log seems to say that xfrm check needs to be done before
> >>>> nf_reset_ct(skb), I have no idea why.
> >>>
> >>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
> >>> after nf_reset_ct(skb)
> >>>
> >>> Steffen, do you have some comments ?
> >>>
> >>> Some context:
> >>> commit b59c270104f03960069596722fea70340579244d
> >>> Author: Patrick McHardy <kaber@trash.net>
> >>> Date:   Fri Jan 6 23:06:10 2006 -0800
> >>>
> >>>     [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
> >>>
> >>>     Keep the conntrack reference until policy checks have been performed for
> >>>     IPsec NAT support. The reference needs to be dropped before a packet is
> >>>     queued to avoid having the conntrack module unloadable.
> >>>
> >>>     Signed-off-by: Patrick McHardy <kaber@trash.net>
> >>>     Signed-off-by: David S. Miller <davem@davemloft.net>
> >>>
> >>
> >> Oh well... __xfrm_policy_check() has :
> >>
> >> nf_nat_decode_session(skb, &fl, family);
> >>
> >> This  answers my questions.
> >>
> >> This means we are probably missing at least one XFRM check in TCP
> >> stack in some cases.
> >> (Only after adding this XFRM check we can call nf_reset_ct(skb))
> >>
> >
> > Maybe this will help ?
>
> I tested this patch and it seems to fix the OVS problem.
> I did not test the xfrm part of it.
>
> Will you post an official patch?

Yes I will. I need to double check we do not leak either the req, or the child.

Maybe the XFRM check should be done even earlier, on the listening socket ?

Or if we assume the SYNACK packet has been sent after the XFRM test
has been applied to the SYN,
maybe we could just call nf_reset_ct(skb) to lower risk of regressions.

With the last patch, it would be strange that we accept the 3WHS and
establish a socket,
but drop the payload in the 3rd packet...

>
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index fe8f23b95d32ca4a35d05166d471327bc608fa91..49c1348e40b6c7b6a98b54d716f29c948e00ba33
> > 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -2019,12 +2019,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >                 if (nsk == sk) {
> >                         reqsk_put(req);
> >                         tcp_v4_restore_cb(skb);
> > -               } else if (tcp_child_process(sk, nsk, skb)) {
> > -                       tcp_v4_send_reset(nsk, skb);
> > -                       goto discard_and_relse;
> >                 } else {
> > -                       sock_put(sk);
> > -                       return 0;
> > +                       if (!xfrm4_policy_check(nsk, XFRM_POLICY_IN, skb)) {
> > +                               drop_reason = SKB_DROP_REASON_XFRM_POLICY;
> > +                               goto discard_and_relse;
> > +                       }
> > +                       nf_reset_ct(skb);
> > +                       if (tcp_child_process(sk, nsk, skb)) {
> > +                               tcp_v4_send_reset(nsk, skb);
> > +                               goto discard_and_relse;
> > +                       } else {
> > +                               sock_put(sk);
> > +                               return 0;
> > +                       }
> >                 }
> >         }
>
Eric Dumazet June 22, 2022, 7:27 p.m. UTC | #16
On Wed, Jun 22, 2022 at 9:04 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jun 22, 2022 at 8:19 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 6/22/22 19:03, Eric Dumazet wrote:
> > > On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > >>
> > >> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote:
> > >>>
> > >>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote:
> > >>>>
> > >>>> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>>>>
> > >>>>> On 6/22/22 13:43, Eric Dumazet wrote:
> > >>>>
> > >>>>>
> > >>>>> I tested the patch below and it seems to fix the issue seen
> > >>>>> with OVS testsuite.  Though it's not obvious for me why this
> > >>>>> happens.  Can you explain a bit more?
> > >>>>
> > >>>> Anyway, I am not sure we can call nf_reset_ct(skb) that early.
> > >>>>
> > >>>> git log seems to say that xfrm check needs to be done before
> > >>>> nf_reset_ct(skb), I have no idea why.
> > >>>
> > >>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
> > >>> after nf_reset_ct(skb)
> > >>>
> > >>> Steffen, do you have some comments ?
> > >>>
> > >>> Some context:
> > >>> commit b59c270104f03960069596722fea70340579244d
> > >>> Author: Patrick McHardy <kaber@trash.net>
> > >>> Date:   Fri Jan 6 23:06:10 2006 -0800
> > >>>
> > >>>     [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
> > >>>
> > >>>     Keep the conntrack reference until policy checks have been performed for
> > >>>     IPsec NAT support. The reference needs to be dropped before a packet is
> > >>>     queued to avoid having the conntrack module unloadable.
> > >>>
> > >>>     Signed-off-by: Patrick McHardy <kaber@trash.net>
> > >>>     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >>>
> > >>
> > >> Oh well... __xfrm_policy_check() has :
> > >>
> > >> nf_nat_decode_session(skb, &fl, family);
> > >>
> > >> This  answers my questions.
> > >>
> > >> This means we are probably missing at least one XFRM check in TCP
> > >> stack in some cases.
> > >> (Only after adding this XFRM check we can call nf_reset_ct(skb))
> > >>
> > >
> > > Maybe this will help ?
> >
> > I tested this patch and it seems to fix the OVS problem.
> > I did not test the xfrm part of it.
> >
> > Will you post an official patch?
>
> Yes I will. I need to double check we do not leak either the req, or the child.
>
> Maybe the XFRM check should be done even earlier, on the listening socket ?
>
> Or if we assume the SYNACK packet has been sent after the XFRM test
> has been applied to the SYN,
> maybe we could just call nf_reset_ct(skb) to lower risk of regressions.
>
> With the last patch, it would be strange that we accept the 3WHS and
> establish a socket,
> but drop the payload in the 3rd packet...

Ilya, can you test the following patch ?
I think it makes more sense to let XFRM reject the packet earlier, and
not complete the 3WHS,
if for some reason this happens.

Thanks !

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fe8f23b95d32ca4a35d05166d471327bc608fa91..da5a3c44c4fb70f1d3ecc596e694a86267f1c44a
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1964,7 +1964,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
                struct sock *nsk;

                sk = req->rsk_listener;
-               drop_reason = tcp_inbound_md5_hash(sk, skb,
+               if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
+                       drop_reason = SKB_DROP_REASON_XFRM_POLICY;
+               else
+                       drop_reason = tcp_inbound_md5_hash(sk, skb,
                                                   &iph->saddr, &iph->daddr,
                                                   AF_INET, dif, sdif);
                if (unlikely(drop_reason)) {
@@ -2016,6 +2019,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
                        }
                        goto discard_and_relse;
                }
+               nf_reset_ct(skb);
                if (nsk == sk) {
                        reqsk_put(req);
                        tcp_v4_restore_cb(skb);
Ilya Maximets June 22, 2022, 9:12 p.m. UTC | #17
On 6/22/22 21:27, Eric Dumazet wrote:
> On Wed, Jun 22, 2022 at 9:04 PM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Wed, Jun 22, 2022 at 8:19 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>
>>> On 6/22/22 19:03, Eric Dumazet wrote:
>>>> On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote:
>>>>>
>>>>> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet <edumazet@google.com> wrote:
>>>>>>
>>>>>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet <edumazet@google.com> wrote:
>>>>>>>
>>>>>>> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>
>>>>>>>> On 6/22/22 13:43, Eric Dumazet wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I tested the patch below and it seems to fix the issue seen
>>>>>>>> with OVS testsuite.  Though it's not obvious for me why this
>>>>>>>> happens.  Can you explain a bit more?
>>>>>>>
>>>>>>> Anyway, I am not sure we can call nf_reset_ct(skb) that early.
>>>>>>>
>>>>>>> git log seems to say that xfrm check needs to be done before
>>>>>>> nf_reset_ct(skb), I have no idea why.
>>>>>>
>>>>>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
>>>>>> after nf_reset_ct(skb)
>>>>>>
>>>>>> Steffen, do you have some comments ?
>>>>>>
>>>>>> Some context:
>>>>>> commit b59c270104f03960069596722fea70340579244d
>>>>>> Author: Patrick McHardy <kaber@trash.net>
>>>>>> Date:   Fri Jan 6 23:06:10 2006 -0800
>>>>>>
>>>>>>     [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
>>>>>>
>>>>>>     Keep the conntrack reference until policy checks have been performed for
>>>>>>     IPsec NAT support. The reference needs to be dropped before a packet is
>>>>>>     queued to avoid having the conntrack module unloadable.
>>>>>>
>>>>>>     Signed-off-by: Patrick McHardy <kaber@trash.net>
>>>>>>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>>>>>
>>>>>
>>>>> Oh well... __xfrm_policy_check() has :
>>>>>
>>>>> nf_nat_decode_session(skb, &fl, family);
>>>>>
>>>>> This  answers my questions.
>>>>>
>>>>> This means we are probably missing at least one XFRM check in TCP
>>>>> stack in some cases.
>>>>> (Only after adding this XFRM check we can call nf_reset_ct(skb))
>>>>>
>>>>
>>>> Maybe this will help ?
>>>
>>> I tested this patch and it seems to fix the OVS problem.
>>> I did not test the xfrm part of it.
>>>
>>> Will you post an official patch?
>>
>> Yes I will. I need to double check we do not leak either the req, or the child.
>>
>> Maybe the XFRM check should be done even earlier, on the listening socket ?
>>
>> Or if we assume the SYNACK packet has been sent after the XFRM test
>> has been applied to the SYN,
>> maybe we could just call nf_reset_ct(skb) to lower risk of regressions.
>>
>> With the last patch, it would be strange that we accept the 3WHS and
>> establish a socket,
>> but drop the payload in the 3rd packet...
> 
> Ilya, can you test the following patch ?

Tested with OVS and it works fine, the issue doesn't appear.
Still didn't test the xfrm part, as I'm not sure how.

> I think it makes more sense to let XFRM reject the packet earlier, and
> not complete the 3WHS,
> if for some reason this happens.

OK.  However, now the patch looks more like two separate fixes.

> 
> Thanks !
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fe8f23b95d32ca4a35d05166d471327bc608fa91..da5a3c44c4fb70f1d3ecc596e694a86267f1c44a
> 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1964,7 +1964,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
>                 struct sock *nsk;
> 
>                 sk = req->rsk_listener;
> -               drop_reason = tcp_inbound_md5_hash(sk, skb,
> +               if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
> +                       drop_reason = SKB_DROP_REASON_XFRM_POLICY;
> +               else
> +                       drop_reason = tcp_inbound_md5_hash(sk, skb,
>                                                    &iph->saddr, &iph->daddr,
>                                                    AF_INET, dif, sdif);
>                 if (unlikely(drop_reason)) {
> @@ -2016,6 +2019,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
>                         }
>                         goto discard_and_relse;
>                 }
> +               nf_reset_ct(skb);
>                 if (nsk == sk) {
>                         reqsk_put(req);
>                         tcp_v4_restore_cb(skb);
Ilya Maximets Sept. 22, 2023, 1:26 p.m. UTC | #18
On 6/19/22 02:39, Ilya Maximets wrote:
> Open vSwitch system test suite is broken due to inability to
> load/unload netfilter modules.  kworker thread is getting trapped
> in the infinite loop while running a net cleanup inside the
> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> holding nfct references and not being freed by their CPU cores.
> 
> In general, the idea that we will have an rx interrupt on every
> CPU core at some point in a near future doesn't seem correct.
> Devices are getting created and destroyed, interrupts are getting
> re-scheduled, CPUs are going online and offline dynamically.
> Any of these events may leave packets stuck in defer list for a
> long time.  It might be OK, if they are just a piece of memory,
> but we can't afford them holding references to any other resources.
> 
> In case of OVS, nfct reference keeps the kernel thread in busy loop
> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> later modprobe request from user space:
> 
>   # ps
>    299 root  R  99.3  200:25.89 kworker/u96:4+
> 
>   # journalctl
>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>         Not tainted 5.19.0-rc2 #8
>   task:modprobe     state:D
>   Call Trace:
>    <TASK>
>    __schedule+0x8aa/0x21d0
>    schedule+0xcc/0x200
>    rwsem_down_write_slowpath+0x8e4/0x1580
>    down_write+0xfc/0x140
>    register_pernet_subsys+0x15/0x40
>    nf_nat_init+0xb6/0x1000 [nf_nat]
>    do_one_initcall+0xbb/0x410
>    do_init_module+0x1b4/0x640
>    load_module+0x4c1b/0x58d0
>    __do_sys_init_module+0x1d7/0x220
>    do_syscall_64+0x3a/0x80
>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> At this point OVS testsuite is unresponsive and never recover,
> because these skbuffs are never freed.
> 
> Solution is to make sure no external references attached to skb
> before pushing it to the defer list.  Using skb_release_head_state()
> for that purpose.  The function modified to be re-enterable, as it
> will be called again during the defer list flush.
> 
> Another approach that can fix the OVS use-case, is to kick all
> cores while waiting for references to be released during the net
> cleanup.  But that sounds more like a workaround for a current
> issue rather than a proper solution and will not cover possible
> issues in other parts of the code.
> 
> Additionally checking for skb_zcopy() while deferring.  This might
> not be necessary, as I'm not sure if we can actually have zero copy
> packets on this path, but seems worth having for completeness as we
> should never defer such packets regardless.
> 
> CC: Eric Dumazet <edumazet@google.com>
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---

Happy Friday! :)

Resurrecting this thread because I managed to reproduce the issue again
on a latest 6.6.0-rc1.

(It doesn't mean we need to accept this particular patch, I just think
that it's an appropriate discussion thread.)

It's a bit different testsuite this time.  Last year I had an issue with
OVS testsuite, today I have an issue with OVN testsuite.  Their structure
is very similar, but OVN tests are a fair bit more complex.

The story is the same:  Each test loads a pack of kernel modules including
OVS and nf_conntrack, sends some traffic, verifies OVN functionality,
stops OVN/OVS and unloads all the previously loaded modules to clean up
the space for the next tests.

Kernel hangs in the same way as before waiting for nf_conntrack module
to be unloaded:

  13 root R  100.0  933:17.98 kworker/u80:1+netns

  CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014
  Workqueue: netns cleanup_net
  RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
  Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8
        ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e
        2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e
        2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00
  RSP: 0018:ffffb23040073d98 EFLAGS: 00000202
  RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38
  RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920
  RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38
  R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080
  R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0
  FS:  0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <NMI>
   ? nmi_cpu_backtrace+0x82/0xf0
   ? nmi_cpu_backtrace_handler+0xd/0x20
   ? nmi_handle+0x5e/0x150
   ? default_do_nmi+0x40/0x100
   ? exc_nmi+0x112/0x190
   ? end_repeat_nmi+0x16/0x67
   ? __pfx_kill_all+0x10/0x10 [nf_conntrack]
   ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
   ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
   ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
   </NMI>
   <TASK>
   nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack]
   cleanup_net+0x213/0x3b0
   process_one_work+0x177/0x340
   worker_thread+0x27e/0x390
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xe2/0x110
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1b/0x30
   </TASK>

Since this testsuite is different from what I had issues with before,
I don't really know when the issue first appeared.  It may have been
there even last year.

I can reproduce the issue with ubuntu 6.2.0-32-generic kernel, but
I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9.
It doesn't necessarily mean that RHEL9 doesn't have the issue though,
the testsuite is not 100% reliable.

I'll try to dig deeper and bisect the problem on the upstream kernel.

For now it seems like the issue is likely in IPv6 code, because the
tests that trigger it are mostly IPv6-related.

Best regards, Ilya Maximets.
Ilya Maximets Sept. 22, 2023, 7:07 p.m. UTC | #19
On 9/22/23 15:26, Ilya Maximets wrote:
> On 6/19/22 02:39, Ilya Maximets wrote:
>> Open vSwitch system test suite is broken due to inability to
>> load/unload netfilter modules.  kworker thread is getting trapped
>> in the infinite loop while running a net cleanup inside the
>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
>> holding nfct references and not being freed by their CPU cores.
>>
>> In general, the idea that we will have an rx interrupt on every
>> CPU core at some point in a near future doesn't seem correct.
>> Devices are getting created and destroyed, interrupts are getting
>> re-scheduled, CPUs are going online and offline dynamically.
>> Any of these events may leave packets stuck in defer list for a
>> long time.  It might be OK, if they are just a piece of memory,
>> but we can't afford them holding references to any other resources.
>>
>> In case of OVS, nfct reference keeps the kernel thread in busy loop
>> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
>> later modprobe request from user space:
>>
>>   # ps
>>    299 root  R  99.3  200:25.89 kworker/u96:4+
>>
>>   # journalctl
>>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>>         Not tainted 5.19.0-rc2 #8
>>   task:modprobe     state:D
>>   Call Trace:
>>    <TASK>
>>    __schedule+0x8aa/0x21d0
>>    schedule+0xcc/0x200
>>    rwsem_down_write_slowpath+0x8e4/0x1580
>>    down_write+0xfc/0x140
>>    register_pernet_subsys+0x15/0x40
>>    nf_nat_init+0xb6/0x1000 [nf_nat]
>>    do_one_initcall+0xbb/0x410
>>    do_init_module+0x1b4/0x640
>>    load_module+0x4c1b/0x58d0
>>    __do_sys_init_module+0x1d7/0x220
>>    do_syscall_64+0x3a/0x80
>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>
>> At this point OVS testsuite is unresponsive and never recover,
>> because these skbuffs are never freed.
>>
>> Solution is to make sure no external references attached to skb
>> before pushing it to the defer list.  Using skb_release_head_state()
>> for that purpose.  The function modified to be re-enterable, as it
>> will be called again during the defer list flush.
>>
>> Another approach that can fix the OVS use-case, is to kick all
>> cores while waiting for references to be released during the net
>> cleanup.  But that sounds more like a workaround for a current
>> issue rather than a proper solution and will not cover possible
>> issues in other parts of the code.
>>
>> Additionally checking for skb_zcopy() while deferring.  This might
>> not be necessary, as I'm not sure if we can actually have zero copy
>> packets on this path, but seems worth having for completeness as we
>> should never defer such packets regardless.
>>
>> CC: Eric Dumazet <edumazet@google.com>
>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> ---
> 
> Happy Friday! :)
> 
> Resurrecting this thread because I managed to reproduce the issue again
> on a latest 6.6.0-rc1.
> 
> (It doesn't mean we need to accept this particular patch, I just think
> that it's an appropriate discussion thread.)
> 
> It's a bit different testsuite this time.  Last year I had an issue with
> OVS testsuite, today I have an issue with OVN testsuite.  Their structure
> is very similar, but OVN tests are a fair bit more complex.
> 
> The story is the same:  Each test loads a pack of kernel modules including
> OVS and nf_conntrack, sends some traffic, verifies OVN functionality,
> stops OVN/OVS and unloads all the previously loaded modules to clean up
> the space for the next tests.
> 
> Kernel hangs in the same way as before waiting for nf_conntrack module
> to be unloaded:
> 
>   13 root R  100.0  933:17.98 kworker/u80:1+netns
> 
>   CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014
>   Workqueue: netns cleanup_net
>   RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>   Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8
>         ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e
>         2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e
>         2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00
>   RSP: 0018:ffffb23040073d98 EFLAGS: 00000202
>   RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38
>   RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920
>   RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38
>   R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080
>   R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0
>   FS:  0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0
>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   Call Trace:
>    <NMI>
>    ? nmi_cpu_backtrace+0x82/0xf0
>    ? nmi_cpu_backtrace_handler+0xd/0x20
>    ? nmi_handle+0x5e/0x150
>    ? default_do_nmi+0x40/0x100
>    ? exc_nmi+0x112/0x190
>    ? end_repeat_nmi+0x16/0x67
>    ? __pfx_kill_all+0x10/0x10 [nf_conntrack]
>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>    </NMI>
>    <TASK>
>    nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack]
>    cleanup_net+0x213/0x3b0
>    process_one_work+0x177/0x340
>    worker_thread+0x27e/0x390
>    ? __pfx_worker_thread+0x10/0x10
>    kthread+0xe2/0x110
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork+0x30/0x50
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork_asm+0x1b/0x30
>    </TASK>
> 
> Since this testsuite is different from what I had issues with before,
> I don't really know when the issue first appeared.  It may have been
> there even last year.
> 
> I can reproduce the issue with ubuntu 6.2.0-32-generic kernel, but
> I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9.
> It doesn't necessarily mean that RHEL9 doesn't have the issue though,
> the testsuite is not 100% reliable.
> 
> I'll try to dig deeper and bisect the problem on the upstream kernel.
> 
> For now it seems like the issue is likely in IPv6 code, because the
> tests that trigger it are mostly IPv6-related.

So, I bisected this down to an unsurprising commit:

commit b0e214d212030fe497d4d150bb3474e50ad5d093
Author: Madhu Koriginja <madhu.koriginja@nxp.com>
Date:   Tue Mar 21 21:28:44 2023 +0530

    netfilter: keep conntrack reference until IPsecv6 policy checks are done
    
    Keep the conntrack reference until policy checks have been performed for
    IPsec V6 NAT support, just like ipv4.
    
    The reference needs to be dropped before a packet is
    queued to avoid having the conntrack module unloadable.
    
    Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support")
    Signed-off-by: Madhu Koriginja <madhu.koriginja@nxp.com>
    Signed-off-by: Florian Westphal <fw@strlen.de>

This commit repeated the pattern from the old ipv4 commit b59c270104f0
("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done")
and hence introduced the exact same issue, but for IPv6.  They both failed
to deliver on the "avoid having the conntrack module unloadable" part.

IIUC, the fix should be exactly the same as Eric did for ipv4 in commit
6f0012e35160 ("tcp: add a missing nf_reset_ct() in 3WHS handling").

I can try that and send a fix.

Would still really like to have some preventive mechanism for this kind of
issues though.  Any ideas on that?

Best regards, Ilya Maximets.
Ilya Maximets Sept. 22, 2023, 9:09 p.m. UTC | #20
On 9/22/23 21:07, Ilya Maximets wrote:
> On 9/22/23 15:26, Ilya Maximets wrote:
>> On 6/19/22 02:39, Ilya Maximets wrote:
>>> Open vSwitch system test suite is broken due to inability to
>>> load/unload netfilter modules.  kworker thread is getting trapped
>>> in the infinite loop while running a net cleanup inside the
>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
>>> holding nfct references and not being freed by their CPU cores.
>>>
>>> In general, the idea that we will have an rx interrupt on every
>>> CPU core at some point in a near future doesn't seem correct.
>>> Devices are getting created and destroyed, interrupts are getting
>>> re-scheduled, CPUs are going online and offline dynamically.
>>> Any of these events may leave packets stuck in defer list for a
>>> long time.  It might be OK, if they are just a piece of memory,
>>> but we can't afford them holding references to any other resources.
>>>
>>> In case of OVS, nfct reference keeps the kernel thread in busy loop
>>> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
>>> later modprobe request from user space:
>>>
>>>   # ps
>>>    299 root  R  99.3  200:25.89 kworker/u96:4+
>>>
>>>   # journalctl
>>>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>>>         Not tainted 5.19.0-rc2 #8
>>>   task:modprobe     state:D
>>>   Call Trace:
>>>    <TASK>
>>>    __schedule+0x8aa/0x21d0
>>>    schedule+0xcc/0x200
>>>    rwsem_down_write_slowpath+0x8e4/0x1580
>>>    down_write+0xfc/0x140
>>>    register_pernet_subsys+0x15/0x40
>>>    nf_nat_init+0xb6/0x1000 [nf_nat]
>>>    do_one_initcall+0xbb/0x410
>>>    do_init_module+0x1b4/0x640
>>>    load_module+0x4c1b/0x58d0
>>>    __do_sys_init_module+0x1d7/0x220
>>>    do_syscall_64+0x3a/0x80
>>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>
>>> At this point OVS testsuite is unresponsive and never recover,
>>> because these skbuffs are never freed.
>>>
>>> Solution is to make sure no external references attached to skb
>>> before pushing it to the defer list.  Using skb_release_head_state()
>>> for that purpose.  The function modified to be re-enterable, as it
>>> will be called again during the defer list flush.
>>>
>>> Another approach that can fix the OVS use-case, is to kick all
>>> cores while waiting for references to be released during the net
>>> cleanup.  But that sounds more like a workaround for a current
>>> issue rather than a proper solution and will not cover possible
>>> issues in other parts of the code.
>>>
>>> Additionally checking for skb_zcopy() while deferring.  This might
>>> not be necessary, as I'm not sure if we can actually have zero copy
>>> packets on this path, but seems worth having for completeness as we
>>> should never defer such packets regardless.
>>>
>>> CC: Eric Dumazet <edumazet@google.com>
>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>> ---
>>
>> Happy Friday! :)
>>
>> Resurrecting this thread because I managed to reproduce the issue again
>> on a latest 6.6.0-rc1.
>>
>> (It doesn't mean we need to accept this particular patch, I just think
>> that it's an appropriate discussion thread.)
>>
>> It's a bit different testsuite this time.  Last year I had an issue with
>> OVS testsuite, today I have an issue with OVN testsuite.  Their structure
>> is very similar, but OVN tests are a fair bit more complex.
>>
>> The story is the same:  Each test loads a pack of kernel modules including
>> OVS and nf_conntrack, sends some traffic, verifies OVN functionality,
>> stops OVN/OVS and unloads all the previously loaded modules to clean up
>> the space for the next tests.
>>
>> Kernel hangs in the same way as before waiting for nf_conntrack module
>> to be unloaded:
>>
>>   13 root R  100.0  933:17.98 kworker/u80:1+netns
>>
>>   CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7
>>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014
>>   Workqueue: netns cleanup_net
>>   RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>>   Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8
>>         ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e
>>         2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e
>>         2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00
>>   RSP: 0018:ffffb23040073d98 EFLAGS: 00000202
>>   RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38
>>   RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920
>>   RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38
>>   R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080
>>   R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0
>>   FS:  0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000
>>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0
>>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>   Call Trace:
>>    <NMI>
>>    ? nmi_cpu_backtrace+0x82/0xf0
>>    ? nmi_cpu_backtrace_handler+0xd/0x20
>>    ? nmi_handle+0x5e/0x150
>>    ? default_do_nmi+0x40/0x100
>>    ? exc_nmi+0x112/0x190
>>    ? end_repeat_nmi+0x16/0x67
>>    ? __pfx_kill_all+0x10/0x10 [nf_conntrack]
>>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>>    </NMI>
>>    <TASK>
>>    nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack]
>>    cleanup_net+0x213/0x3b0
>>    process_one_work+0x177/0x340
>>    worker_thread+0x27e/0x390
>>    ? __pfx_worker_thread+0x10/0x10
>>    kthread+0xe2/0x110
>>    ? __pfx_kthread+0x10/0x10
>>    ret_from_fork+0x30/0x50
>>    ? __pfx_kthread+0x10/0x10
>>    ret_from_fork_asm+0x1b/0x30
>>    </TASK>
>>
>> Since this testsuite is different from what I had issues with before,
>> I don't really know when the issue first appeared.  It may have been
>> there even last year.
>>
>> I can reproduce the issue with ubuntu 6.2.0-32-generic kernel, but
>> I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9.
>> It doesn't necessarily mean that RHEL9 doesn't have the issue though,
>> the testsuite is not 100% reliable.
>>
>> I'll try to dig deeper and bisect the problem on the upstream kernel.
>>
>> For now it seems like the issue is likely in IPv6 code, because the
>> tests that trigger it are mostly IPv6-related.
> 
> So, I bisected this down to an unsurprising commit:
> 
> commit b0e214d212030fe497d4d150bb3474e50ad5d093
> Author: Madhu Koriginja <madhu.koriginja@nxp.com>
> Date:   Tue Mar 21 21:28:44 2023 +0530
> 
>     netfilter: keep conntrack reference until IPsecv6 policy checks are done
>     
>     Keep the conntrack reference until policy checks have been performed for
>     IPsec V6 NAT support, just like ipv4.
>     
>     The reference needs to be dropped before a packet is
>     queued to avoid having the conntrack module unloadable.
>     
>     Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support")
>     Signed-off-by: Madhu Koriginja <madhu.koriginja@nxp.com>
>     Signed-off-by: Florian Westphal <fw@strlen.de>
> 
> This commit repeated the pattern from the old ipv4 commit b59c270104f0
> ("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done")
> and hence introduced the exact same issue, but for IPv6.  They both failed
> to deliver on the "avoid having the conntrack module unloadable" part.
> 
> IIUC, the fix should be exactly the same as Eric did for ipv4 in commit
> 6f0012e35160 ("tcp: add a missing nf_reset_ct() in 3WHS handling").
> 
> I can try that and send a fix.

Posted a fix here:
  https://lore.kernel.org/netdev/20230922210530.2045146-1-i.maximets@ovn.org/

It fixes the problem I have with OVN testsuite.

> 
> Would still really like to have some preventive mechanism for this kind of
> issues though.  Any ideas on that?
> 
> Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b3559cb1d82..5660250e795f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -726,11 +726,9 @@  void skb_release_head_state(struct sk_buff *skb)
 	skb_dst_drop(skb);
 	if (skb->destructor) {
 		WARN_ON(in_hardirq());
-		skb->destructor(skb);
+		skb_orphan(skb);
 	}
-#if IS_ENABLED(CONFIG_NF_CONNTRACK)
-	nf_conntrack_put(skb_nfct(skb));
-#endif
+	nf_reset_ct(skb);
 	skb_ext_put(skb);
 }
 
@@ -6502,7 +6500,8 @@  void skb_attempt_defer_free(struct sk_buff *skb)
 
 	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
 	    !cpu_online(cpu) ||
-	    cpu == raw_smp_processor_id()) {
+	    cpu == raw_smp_processor_id() ||
+	    skb_zcopy(skb)) {
 nodefer:	__kfree_skb(skb);
 		return;
 	}
@@ -6512,6 +6511,13 @@  nodefer:	__kfree_skb(skb);
 	if (READ_ONCE(sd->defer_count) >= defer_max)
 		goto nodefer;
 
+	/* skb can contain all kinds of external references that
+	 * will prevent module unloading or destruction of other
+	 * resources.  Need to release them now, since skb can
+	 * stay on a defer list indefinitely.
+	 */
+	skb_release_head_state(skb);
+
 	spin_lock_irqsave(&sd->defer_lock, flags);
 	/* Send an IPI every time queue reaches half capacity. */
 	kick = sd->defer_count == (defer_max >> 1);