diff mbox series

[net] net: drop secpath extension before skb deferral free

Message ID 20240513100246.85173-1-jianbol@nvidia.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] net: drop secpath extension before skb deferral free | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 927 this patch: 927
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 2 blamed authors not CCed: pabeni@redhat.com kuba@kernel.org; 2 maintainers not CCed: pabeni@redhat.com kuba@kernel.org
netdev/build_clang success Errors and warnings before: 936 this patch: 936
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 938 this patch: 938
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 10 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 54 this patch: 54
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-05-13--12-00 (tests: 1018)

Commit Message

Jianbo Liu May 13, 2024, 10:02 a.m. UTC
In commit 68822bdf76f1 ("net: generalize skb freeing deferral to
per-cpu lists"), skb can be queued on remote cpu list for deferral
free.

The remote cpu is kicked if the queue reaches half capacity. As
mentioned in the patch, this seems very unlikely to trigger
NET_RX_SOFTIRQ on the remote CPU in this way. But that seems not true,
we actually saw something that indicates this: skb is not freed
immediately, or even kept for a long time. And the possibility is
increased if there are more cpu cores.

As skb is not freed, its extension is not freed as well. An error
occurred while unloading the driver after running TCP traffic with
IPsec, where both crypto and packet were offloaded. However, in the
case of crypto offload, this failure was rare and significantly more
challenging to replicate.

 unregister_netdevice: waiting for eth2 to become free. Usage count = 2
 ref_tracker: eth%d@000000007421424b has 1/1 users at
      xfrm_dev_state_add+0xe5/0x4d0
      xfrm_add_sa+0xc5c/0x11e0
      xfrm_user_rcv_msg+0xfa/0x240
      netlink_rcv_skb+0x54/0x100
      xfrm_netlink_rcv+0x31/0x40
      netlink_unicast+0x1fc/0x2c0
      netlink_sendmsg+0x232/0x4a0
      __sock_sendmsg+0x38/0x60
      ____sys_sendmsg+0x1e3/0x200
      ___sys_sendmsg+0x80/0xc0
      __sys_sendmsg+0x51/0x90
      do_syscall_64+0x40/0xe0
      entry_SYSCALL_64_after_hwframe+0x46/0x4e

The ref_tracker shows the netdev is hold when the offloading xfrm
state is first added to hardware. When receiving packet, the secpath
extension, which saves xfrm state, is added to skb by ipsec offload,
and the xfrm state is hence hold by the received skb. It can't be
flushed till skb is dequeued from the defer list, then skb and its
extension are really freed. Also, the netdev can't be unregistered
because it still referred by xfrm state.

To fix this issue, drop this extension before skb is queued to the
defer list, so xfrm state destruction is not blocked.

Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
 net/core/skbuff.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Eric Dumazet May 13, 2024, 10:29 a.m. UTC | #1
On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> wrote:
>
> In commit 68822bdf76f1 ("net: generalize skb freeing deferral to
> per-cpu lists"), skb can be queued on remote cpu list for deferral
> free.
>
> The remote cpu is kicked if the queue reaches half capacity. As
> mentioned in the patch, this seems very unlikely to trigger
> NET_RX_SOFTIRQ on the remote CPU in this way. But that seems not true,
> we actually saw something that indicates this: skb is not freed
> immediately, or even kept for a long time. And the possibility is
> increased if there are more cpu cores.
>
> As skb is not freed, its extension is not freed as well. An error
> occurred while unloading the driver after running TCP traffic with
> IPsec, where both crypto and packet were offloaded. However, in the
> case of crypto offload, this failure was rare and significantly more
> challenging to replicate.
>
>  unregister_netdevice: waiting for eth2 to become free. Usage count = 2
>  ref_tracker: eth%d@000000007421424b has 1/1 users at
>       xfrm_dev_state_add+0xe5/0x4d0
>       xfrm_add_sa+0xc5c/0x11e0
>       xfrm_user_rcv_msg+0xfa/0x240
>       netlink_rcv_skb+0x54/0x100
>       xfrm_netlink_rcv+0x31/0x40
>       netlink_unicast+0x1fc/0x2c0
>       netlink_sendmsg+0x232/0x4a0
>       __sock_sendmsg+0x38/0x60
>       ____sys_sendmsg+0x1e3/0x200
>       ___sys_sendmsg+0x80/0xc0
>       __sys_sendmsg+0x51/0x90
>       do_syscall_64+0x40/0xe0
>       entry_SYSCALL_64_after_hwframe+0x46/0x4e
>
> The ref_tracker shows the netdev is hold when the offloading xfrm
> state is first added to hardware. When receiving packet, the secpath
> extension, which saves xfrm state, is added to skb by ipsec offload,
> and the xfrm state is hence hold by the received skb. It can't be
> flushed till skb is dequeued from the defer list, then skb and its
> extension are really freed. Also, the netdev can't be unregistered
> because it still referred by xfrm state.
>
> To fix this issue, drop this extension before skb is queued to the
> defer list, so xfrm state destruction is not blocked.
>
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> ---


This attribution and patch seem wrong. Also you should CC XFRM maintainers.

Before being freed from tcp_recvmsg() path, packets can sit in TCP
receive queues for arbitrary amounts of time.

secpath_reset() should be called much earlier than in the code you
tried to change.

If XFRM state can stick to packets stored in protocol queues, we have
a much bigger problem.

I suspect all callers of nf_reset_ct() need a fix of some kind.
Jianbo Liu May 14, 2024, 7:37 a.m. UTC | #2
On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote:
> On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> wrote:
> > 
> > In commit 68822bdf76f1 ("net: generalize skb freeing deferral to
> > per-cpu lists"), skb can be queued on remote cpu list for deferral
> > free.
> > 
> > The remote cpu is kicked if the queue reaches half capacity. As
> > mentioned in the patch, this seems very unlikely to trigger
> > NET_RX_SOFTIRQ on the remote CPU in this way. But that seems not
> > true,
> > we actually saw something that indicates this: skb is not freed
> > immediately, or even kept for a long time. And the possibility is
> > increased if there are more cpu cores.
> > 
> > As skb is not freed, its extension is not freed as well. An error
> > occurred while unloading the driver after running TCP traffic with
> > IPsec, where both crypto and packet were offloaded. However, in the
> > case of crypto offload, this failure was rare and significantly more
> > challenging to replicate.
> > 
> >  unregister_netdevice: waiting for eth2 to become free. Usage count =
> > 2
> >  ref_tracker: eth%d@000000007421424b has 1/1 users at
> >       xfrm_dev_state_add+0xe5/0x4d0
> >       xfrm_add_sa+0xc5c/0x11e0
> >       xfrm_user_rcv_msg+0xfa/0x240
> >       netlink_rcv_skb+0x54/0x100
> >       xfrm_netlink_rcv+0x31/0x40
> >       netlink_unicast+0x1fc/0x2c0
> >       netlink_sendmsg+0x232/0x4a0
> >       __sock_sendmsg+0x38/0x60
> >       ____sys_sendmsg+0x1e3/0x200
> >       ___sys_sendmsg+0x80/0xc0
> >       __sys_sendmsg+0x51/0x90
> >       do_syscall_64+0x40/0xe0
> >       entry_SYSCALL_64_after_hwframe+0x46/0x4e
> > 
> > The ref_tracker shows the netdev is hold when the offloading xfrm
> > state is first added to hardware. When receiving packet, the secpath
> > extension, which saves xfrm state, is added to skb by ipsec offload,
> > and the xfrm state is hence hold by the received skb. It can't be
> > flushed till skb is dequeued from the defer list, then skb and its
> > extension are really freed. Also, the netdev can't be unregistered
> > because it still referred by xfrm state.
> > 
> > To fix this issue, drop this extension before skb is queued to the
> > defer list, so xfrm state destruction is not blocked.
> > 
> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu
> > lists")
> > Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
> > Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> 
> 
> This attribution and patch seem wrong. Also you should CC XFRM
> maintainers.
> 
> Before being freed from tcp_recvmsg() path, packets can sit in TCP
> receive queues for arbitrary amounts of time.
> 
> secpath_reset() should be called much earlier than in the code you
> tried to change.

Yes, this also fixed the issue if I moved secpatch_reset() before
tcp_v4_do_rcv().
 
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	tcp_v4_fill_cb(skb, iph, th);
 
 	skb->dev = NULL;
+	secpath_reset(skb);
 
 	if (sk->sk_state == TCP_LISTEN) {
 		ret = tcp_v4_do_rcv(sk, skb);

Do you want me to send v2, or push a new one if you agree with this
change?

Thanks!
Jianbo

> 
> If XFRM state can stick to packets stored in protocol queues, we have
> a much bigger problem.
> 
> I suspect all callers of nf_reset_ct() need a fix of some kind.
Eric Dumazet May 14, 2024, 8:51 a.m. UTC | #3
On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com> wrote:
>
> On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote:
> > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> wrote:
> > >
> > > In commit 68822bdf76f1 ("net: generalize skb freeing deferral to
> > > per-cpu lists"), skb can be queued on remote cpu list for deferral
> > > free.
> > >
> > > The remote cpu is kicked if the queue reaches half capacity. As
> > > mentioned in the patch, this seems very unlikely to trigger
> > > NET_RX_SOFTIRQ on the remote CPU in this way. But that seems not
> > > true,
> > > we actually saw something that indicates this: skb is not freed
> > > immediately, or even kept for a long time. And the possibility is
> > > increased if there are more cpu cores.
> > >
> > > As skb is not freed, its extension is not freed as well. An error
> > > occurred while unloading the driver after running TCP traffic with
> > > IPsec, where both crypto and packet were offloaded. However, in the
> > > case of crypto offload, this failure was rare and significantly more
> > > challenging to replicate.
> > >
> > >  unregister_netdevice: waiting for eth2 to become free. Usage count =
> > > 2
> > >  ref_tracker: eth%d@000000007421424b has 1/1 users at
> > >       xfrm_dev_state_add+0xe5/0x4d0
> > >       xfrm_add_sa+0xc5c/0x11e0
> > >       xfrm_user_rcv_msg+0xfa/0x240
> > >       netlink_rcv_skb+0x54/0x100
> > >       xfrm_netlink_rcv+0x31/0x40
> > >       netlink_unicast+0x1fc/0x2c0
> > >       netlink_sendmsg+0x232/0x4a0
> > >       __sock_sendmsg+0x38/0x60
> > >       ____sys_sendmsg+0x1e3/0x200
> > >       ___sys_sendmsg+0x80/0xc0
> > >       __sys_sendmsg+0x51/0x90
> > >       do_syscall_64+0x40/0xe0
> > >       entry_SYSCALL_64_after_hwframe+0x46/0x4e
> > >
> > > The ref_tracker shows the netdev is hold when the offloading xfrm
> > > state is first added to hardware. When receiving packet, the secpath
> > > extension, which saves xfrm state, is added to skb by ipsec offload,
> > > and the xfrm state is hence hold by the received skb. It can't be
> > > flushed till skb is dequeued from the defer list, then skb and its
> > > extension are really freed. Also, the netdev can't be unregistered
> > > because it still referred by xfrm state.
> > >
> > > To fix this issue, drop this extension before skb is queued to the
> > > defer list, so xfrm state destruction is not blocked.
> > >
> > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu
> > > lists")
> > > Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
> > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> >
> >
> > This attribution and patch seem wrong. Also you should CC XFRM
> > maintainers.
> >
> > Before being freed from tcp_recvmsg() path, packets can sit in TCP
> > receive queues for arbitrary amounts of time.
> >
> > secpath_reset() should be called much earlier than in the code you
> > tried to change.
>
> Yes, this also fixed the issue if I moved secpatch_reset() before
> tcp_v4_do_rcv().
>
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
>         tcp_v4_fill_cb(skb, iph, th);
>
>         skb->dev = NULL;
> +       secpath_reset(skb);
>
>         if (sk->sk_state == TCP_LISTEN) {
>                 ret = tcp_v4_do_rcv(sk, skb);
>
> Do you want me to send v2, or push a new one if you agree with this
> change?

That would only care about TCP and IPv4.

I think we need a full fix, not a partial work around to an immediate problem.

Can we have some feedback from Steffen, I  wonder if we missed
something really obvious.

It is hard to believe this has been broken for such  a long time.

I think the issue came with

commit d77e38e612a017480157fe6d2c1422f42cb5b7e3
Author: Steffen Klassert <steffen.klassert@secunet.com>
Date:   Fri Apr 14 10:06:10 2017 +0200

    xfrm: Add an IPsec hardware offloading API

    This patch adds all the bits that are needed to do
    IPsec hardware offload for IPsec states and ESP packets.
    We add xfrmdev_ops to the net_device. xfrmdev_ops has
    function pointers that are needed to manage the xfrm
    states in the hardware and to do a per packet
    offloading decision.

    Joint work with:
    Ilan Tayari <ilant@mellanox.com>
    Guy Shapiro <guysh@mellanox.com>
    Yossi Kuperman <yossiku@mellanox.com>

    Signed-off-by: Guy Shapiro <guysh@mellanox.com>
    Signed-off-by: Ilan Tayari <ilant@mellanox.com>
    Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
    Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

We should probably handle NETDEV_DOWN/NETDEV_UNREGISTER better,
instead of adding  secpath_reset(skb) there and there.
Jianbo Liu May 15, 2024, 3:10 a.m. UTC | #4
On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote:
> On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com>
> wrote:
> > 
> > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote:
> > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com>
> > > wrote:
> > > > 
> > > > 
...
> > > 
> > > 
> > > This attribution and patch seem wrong. Also you should CC XFRM
> > > maintainers.
> > > 
> > > Before being freed from tcp_recvmsg() path, packets can sit in
> > > TCP
> > > receive queues for arbitrary amounts of time.
> > > 
> > > secpath_reset() should be called much earlier than in the code
> > > you
> > > tried to change.
> > 
> > Yes, this also fixed the issue if I moved secpatch_reset() before
> > tcp_v4_do_rcv().
> > 
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >         tcp_v4_fill_cb(skb, iph, th);
> > 
> >         skb->dev = NULL;
> > +       secpath_reset(skb);
> > 
> >         if (sk->sk_state == TCP_LISTEN) {
> >                 ret = tcp_v4_do_rcv(sk, skb);
> > 
> > Do you want me to send v2, or push a new one if you agree with this
> > change?
> 
> That would only care about TCP and IPv4.
> 
> I think we need a full fix, not a partial work around to an immediate
> problem.
> 

Do you want to fix the issue if skb with secpath extension is hold in
protocol queues? But, is that possible? 

> Can we have some feedback from Steffen, I  wonder if we missed
> something really obvious.
> 
> It is hard to believe this has been broken for such  a long time.
> 
> I think the issue came with
> 
> commit d77e38e612a017480157fe6d2c1422f42cb5b7e3
> Author: Steffen Klassert <steffen.klassert@secunet.com>
> Date:   Fri Apr 14 10:06:10 2017 +0200
> 
>     xfrm: Add an IPsec hardware offloading API
> 
>     This patch adds all the bits that are needed to do
>     IPsec hardware offload for IPsec states and ESP packets.
>     We add xfrmdev_ops to the net_device. xfrmdev_ops has
>     function pointers that are needed to manage the xfrm
>     states in the hardware and to do a per packet
>     offloading decision.
> 
>     Joint work with:
>     Ilan Tayari <ilant@mellanox.com>
>     Guy Shapiro <guysh@mellanox.com>
>     Yossi Kuperman <yossiku@mellanox.com>
> 
>     Signed-off-by: Guy Shapiro <guysh@mellanox.com>
>     Signed-off-by: Ilan Tayari <ilant@mellanox.com>
>     Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
>     Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> 
> We should probably handle NETDEV_DOWN/NETDEV_UNREGISTER better,
> 

And I think we also need to replace xfrm_state_put() with
xfrm_state_put_sync() in xfrm_dev_state_flush().

> instead of adding  secpath_reset(skb) there and there.
Jianbo Liu May 20, 2024, 10:06 a.m. UTC | #5
On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote:
> On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com>
> wrote:
> > 
> > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote:
> > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com>
> > > wrote:
> > > 
> > > 
...
> > > This attribution and patch seem wrong. Also you should CC XFRM
> > > maintainers.
> > > 
> > > Before being freed from tcp_recvmsg() path, packets can sit in
> > > TCP
> > > receive queues for arbitrary amounts of time.
> > > 
> > > secpath_reset() should be called much earlier than in the code
> > > you
> > > tried to change.
> > 
> > Yes, this also fixed the issue if I moved secpatch_reset() before
> > tcp_v4_do_rcv().
> > 
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >         tcp_v4_fill_cb(skb, iph, th);
> > 
> >         skb->dev = NULL;
> > +       secpath_reset(skb);
> > 
> >         if (sk->sk_state == TCP_LISTEN) {
> >                 ret = tcp_v4_do_rcv(sk, skb);
> > 
> > Do you want me to send v2, or push a new one if you agree with this
> > change?
> 
> That would only care about TCP and IPv4.
> 
> I think we need a full fix, not a partial work around to an immediate
> problem.
> 
> Can we have some feedback from Steffen, I  wonder if we missed
> something really obvious.
> 
> It is hard to believe this has been broken for such  a long time.

Could you please give me some suggestions?
Should I add new function to reset both ct and secpath, and replace
nf_reset_ct() where necessary on receive flow?

> 
> I think the issue came with
> 
> commit d77e38e612a017480157fe6d2c1422f42cb5b7e3
> Author: Steffen Klassert <steffen.klassert@secunet.com>
> Date:   Fri Apr 14 10:06:10 2017 +0200
> 
>     xfrm: Add an IPsec hardware offloading API
> 
>     This patch adds all the bits that are needed to do
>     IPsec hardware offload for IPsec states and ESP packets.
>     We add xfrmdev_ops to the net_device. xfrmdev_ops has
>     function pointers that are needed to manage the xfrm
>     states in the hardware and to do a per packet
>     offloading decision.
> 
>     Joint work with:
>     Ilan Tayari <ilant@mellanox.com>
>     Guy Shapiro <guysh@mellanox.com>
>     Yossi Kuperman <yossiku@mellanox.com>
> 
>     Signed-off-by: Guy Shapiro <guysh@mellanox.com>
>     Signed-off-by: Ilan Tayari <ilant@mellanox.com>
>     Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
>     Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> 
> We should probably handle NETDEV_DOWN/NETDEV_UNREGISTER better,
> instead of adding  secpath_reset(skb) there and there.
Steffen Klassert May 21, 2024, 10:15 a.m. UTC | #6
On Mon, May 20, 2024 at 10:06:24AM +0000, Jianbo Liu wrote:
> On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote:
> > 
> > I think we need a full fix, not a partial work around to an immediate
> > problem.
> > 
> > Can we have some feedback from Steffen, I  wonder if we missed
> > something really obvious.
> > 
> > It is hard to believe this has been broken for such  a long time.
> 
> Could you please give me some suggestions?

Sorry for the delay, I've just returned from vacation.

I'll have a look at it.
Steffen Klassert May 22, 2024, 9:34 a.m. UTC | #7
On Mon, May 20, 2024 at 10:06:24AM +0000, Jianbo Liu wrote:
> On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote:
> > On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com>
> > wrote:
> > > 
> > > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote:
> > > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com>
> > > > wrote:
> > > > 
> > > > 
> ...
> > > > This attribution and patch seem wrong. Also you should CC XFRM
> > > > maintainers.
> > > > 
> > > > Before being freed from tcp_recvmsg() path, packets can sit in
> > > > TCP
> > > > receive queues for arbitrary amounts of time.
> > > > 
> > > > secpath_reset() should be called much earlier than in the code
> > > > you
> > > > tried to change.
> > > 
> > > Yes, this also fixed the issue if I moved secpatch_reset() before
> > > tcp_v4_do_rcv().
> > > 
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > >         tcp_v4_fill_cb(skb, iph, th);
> > > 
> > >         skb->dev = NULL;
> > > +       secpath_reset(skb);
> > > 
> > >         if (sk->sk_state == TCP_LISTEN) {
> > >                 ret = tcp_v4_do_rcv(sk, skb);
> > > 
> > > Do you want me to send v2, or push a new one if you agree with this
> > > change?
> > 
> > That would only care about TCP and IPv4.
> > 
> > I think we need a full fix, not a partial work around to an immediate
> > problem.
> > 
> > Can we have some feedback from Steffen, I  wonder if we missed
> > something really obvious.
> > 
> > It is hard to believe this has been broken for such  a long time.
> 
> Could you please give me some suggestions?
> Should I add new function to reset both ct and secpath, and replace
> nf_reset_ct() where necessary on receive flow?

Maybe we should directly remove the device from the xfrm_state
when the decice goes down, this should catch all the cases.

I think about something like this (untested) patch:

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 0c306473a79d..ba402275ab57 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct net_device *dev, bool task_vali
 				xfrm_state_hold(x);
 				spin_unlock_bh(&net->xfrm.xfrm_state_lock);
 
-				err = xfrm_state_delete(x);
+				spin_lock_bh(&x->lock);
+				err = __xfrm_state_delete(x);
+				xfrm_dev_state_free(x);
+				spin_unlock_bh(&x->lock);
+
 				xfrm_audit_state_delete(x, err ? 0 : 1,
 							task_valid);
 				xfrm_state_put(x);

The secpath is still attached to all skbs, but the hang on device
unregister should go away.
Eric Dumazet May 22, 2024, 11:06 a.m. UTC | #8
On Wed, May 22, 2024 at 11:34 AM Steffen Klassert
<steffen.klassert@secunet.com> wrote:
>
> On Mon, May 20, 2024 at 10:06:24AM +0000, Jianbo Liu wrote:
> > On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote:
> > > On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com>
> > > wrote:
> > > >
> > > > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote:
> > > > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com>
> > > > > wrote:
> > > > >
> > > > >
> > ...
> > > > > This attribution and patch seem wrong. Also you should CC XFRM
> > > > > maintainers.
> > > > >
> > > > > Before being freed from tcp_recvmsg() path, packets can sit in
> > > > > TCP
> > > > > receive queues for arbitrary amounts of time.
> > > > >
> > > > > secpath_reset() should be called much earlier than in the code
> > > > > you
> > > > > tried to change.
> > > >
> > > > Yes, this also fixed the issue if I moved secpatch_reset() before
> > > > tcp_v4_do_rcv().
> > > >
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > >         tcp_v4_fill_cb(skb, iph, th);
> > > >
> > > >         skb->dev = NULL;
> > > > +       secpath_reset(skb);
> > > >
> > > >         if (sk->sk_state == TCP_LISTEN) {
> > > >                 ret = tcp_v4_do_rcv(sk, skb);
> > > >
> > > > Do you want me to send v2, or push a new one if you agree with this
> > > > change?
> > >
> > > That would only care about TCP and IPv4.
> > >
> > > I think we need a full fix, not a partial work around to an immediate
> > > problem.
> > >
> > > Can we have some feedback from Steffen, I  wonder if we missed
> > > something really obvious.
> > >
> > > It is hard to believe this has been broken for such  a long time.
> >
> > Could you please give me some suggestions?
> > Should I add new function to reset both ct and secpath, and replace
> > nf_reset_ct() where necessary on receive flow?
>
> Maybe we should directly remove the device from the xfrm_state
> when the decice goes down, this should catch all the cases.
>
> I think about something like this (untested) patch:
>
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index 0c306473a79d..ba402275ab57 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct net_device *dev, bool task_vali
>                                 xfrm_state_hold(x);
>                                 spin_unlock_bh(&net->xfrm.xfrm_state_lock);
>
> -                               err = xfrm_state_delete(x);
> +                               spin_lock_bh(&x->lock);
> +                               err = __xfrm_state_delete(x);
> +                               xfrm_dev_state_free(x);
> +                               spin_unlock_bh(&x->lock);
> +
>                                 xfrm_audit_state_delete(x, err ? 0 : 1,
>                                                         task_valid);
>                                 xfrm_state_put(x);
>
> The secpath is still attached to all skbs, but the hang on device
> unregister should go away.

Seems fine, but I wonder if we need some READ_ONCE(xso->dev) to avoid
races then ?

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 77ebf5bcf0b901b7b70875b3ccb5cead14ab1602..55eb5898e9a4263048b70d5a0d5dc274ab784c0f
100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1950,9 +1950,10 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb,
struct xfrm_state *x);
 static inline void xfrm_dev_state_advance_esn(struct xfrm_state *x)
 {
        struct xfrm_dev_offload *xso = &x->xso;
+       struct net_device *dev = READ_ONCE(xso->dev);

-       if (xso->dev && xso->dev->xfrmdev_ops->xdo_dev_state_advance_esn)
-               xso->dev->xfrmdev_ops->xdo_dev_state_advance_esn(x);
+       if (dev && dev->xfrmdev_ops->xdo_dev_state_advance_esn)
+               dev->xfrmdev_ops->xdo_dev_state_advance_esn(x);
 }

 static inline bool xfrm_dst_offload_ok(struct dst_entry *dst)
@@ -1976,20 +1977,21 @@ static inline bool xfrm_dst_offload_ok(struct
dst_entry *dst)
 static inline void xfrm_dev_state_delete(struct xfrm_state *x)
 {
        struct xfrm_dev_offload *xso = &x->xso;
+       struct net_device *dev = READ_ONCE(xso->dev);

-       if (xso->dev)
-               xso->dev->xfrmdev_ops->xdo_dev_state_delete(x);
+       if (dev)
+               dev->xfrmdev_ops->xdo_dev_state_delete(x);
 }

 static inline void xfrm_dev_state_free(struct xfrm_state *x)
 {
        struct xfrm_dev_offload *xso = &x->xso;
-       struct net_device *dev = xso->dev;
+       struct net_device *dev = READ_ONCE(xso->dev);

        if (dev && dev->xfrmdev_ops) {
                if (dev->xfrmdev_ops->xdo_dev_state_free)
                        dev->xfrmdev_ops->xdo_dev_state_free(x);
-               xso->dev = NULL;
+               WRITE_ONCE(xso->dev, NULL);
                xso->type = XFRM_DEV_OFFLOAD_UNSPECIFIED;
                netdev_put(dev, &xso->dev_tracker);
        }
Jianbo Liu May 23, 2024, 2:22 a.m. UTC | #9
On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote:
> On Mon, May 20, 2024 at 10:06:24AM +0000, Jianbo Liu wrote:
> > On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote:
> > > On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com>
> > > wrote:
> > > > 
> > > > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote:
> > > > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <
> > > > > jianbol@nvidia.com>
> > > > > wrote:
> > > > > 
> > > > > 
> > ...
> > > > > This attribution and patch seem wrong. Also you should CC
> > > > > XFRM
> > > > > maintainers.
> > > > > 
> > > > > Before being freed from tcp_recvmsg() path, packets can sit
> > > > > in
> > > > > TCP
> > > > > receive queues for arbitrary amounts of time.
> > > > > 
> > > > > secpath_reset() should be called much earlier than in the
> > > > > code
> > > > > you
> > > > > tried to change.
> > > > 
> > > > Yes, this also fixed the issue if I moved secpatch_reset()
> > > > before
> > > > tcp_v4_do_rcv().
> > > > 
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > >         tcp_v4_fill_cb(skb, iph, th);
> > > > 
> > > >         skb->dev = NULL;
> > > > +       secpath_reset(skb);
> > > > 
> > > >         if (sk->sk_state == TCP_LISTEN) {
> > > >                 ret = tcp_v4_do_rcv(sk, skb);
> > > > 
> > > > Do you want me to send v2, or push a new one if you agree with
> > > > this
> > > > change?
> > > 
> > > That would only care about TCP and IPv4.
> > > 
> > > I think we need a full fix, not a partial work around to an
> > > immediate
> > > problem.
> > > 
> > > Can we have some feedback from Steffen, I  wonder if we missed
> > > something really obvious.
> > > 
> > > It is hard to believe this has been broken for such  a long time.
> > 
> > Could you please give me some suggestions?
> > Should I add new function to reset both ct and secpath, and replace
> > nf_reset_ct() where necessary on receive flow?
> 
> Maybe we should directly remove the device from the xfrm_state
> when the decice goes down, this should catch all the cases.
> 
> I think about something like this (untested) patch:
> 
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index 0c306473a79d..ba402275ab57 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct
> net_device *dev, bool task_vali
>                                 xfrm_state_hold(x);
>                                 spin_unlock_bh(&net-
> >xfrm.xfrm_state_lock);
>  
> -                               err = xfrm_state_delete(x);
> +                               spin_lock_bh(&x->lock);
> +                               err = __xfrm_state_delete(x);
> +                               xfrm_dev_state_free(x);
> +                               spin_unlock_bh(&x->lock);
> +
>                                 xfrm_audit_state_delete(x, err ? 0 :
> 1,
>                                                         task_valid);
>                                 xfrm_state_put(x);
> 
> The secpath is still attached to all skbs, but the hang on device
> unregister should go away.

It didn't fix the issue. I run these commands before unregister netdev:
ip x s delall
ip x p delall
ip x s f
ip x p f
Steffen Klassert May 23, 2024, 6:44 a.m. UTC | #10
On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote:
> On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote:
> > 
> > Maybe we should directly remove the device from the xfrm_state
> > when the decice goes down, this should catch all the cases.
> > 
> > I think about something like this (untested) patch:
> > 
> > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > index 0c306473a79d..ba402275ab57 100644
> > --- a/net/xfrm/xfrm_state.c
> > +++ b/net/xfrm/xfrm_state.c
> > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct
> > net_device *dev, bool task_vali
> >                                 xfrm_state_hold(x);
> >                                 spin_unlock_bh(&net-
> > >xfrm.xfrm_state_lock);
> >  
> > -                               err = xfrm_state_delete(x);
> > +                               spin_lock_bh(&x->lock);
> > +                               err = __xfrm_state_delete(x);
> > +                               xfrm_dev_state_free(x);
> > +                               spin_unlock_bh(&x->lock);
> > +
> >                                 xfrm_audit_state_delete(x, err ? 0 :
> > 1,
> >                                                         task_valid);
> >                                 xfrm_state_put(x);
> > 
> > The secpath is still attached to all skbs, but the hang on device
> > unregister should go away.
> 
> It didn't fix the issue.

Do you have a backtrace of the ref_tracker?

Is that with packet offload?

Looks like we need to remove the device from the xfrm_policy
too if packet offload is used.
Jianbo Liu May 23, 2024, 6:57 a.m. UTC | #11
On Thu, 2024-05-23 at 08:44 +0200, Steffen Klassert wrote:
> On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote:
> > On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote:
> > > 
> > > Maybe we should directly remove the device from the xfrm_state
> > > when the decice goes down, this should catch all the cases.
> > > 
> > > I think about something like this (untested) patch:
> > > 
> > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > > index 0c306473a79d..ba402275ab57 100644
> > > --- a/net/xfrm/xfrm_state.c
> > > +++ b/net/xfrm/xfrm_state.c
> > > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net,
> > > struct
> > > net_device *dev, bool task_vali
> > >                                 xfrm_state_hold(x);
> > >                                 spin_unlock_bh(&net-
> > > > xfrm.xfrm_state_lock);
> > >  
> > > -                               err = xfrm_state_delete(x);
> > > +                               spin_lock_bh(&x->lock);
> > > +                               err = __xfrm_state_delete(x);
> > > +                               xfrm_dev_state_free(x);
> > > +                               spin_unlock_bh(&x->lock);
> > > +
> > >                                 xfrm_audit_state_delete(x, err ? 0
> > > :
> > > 1,
> > >                                                         task_valid)
> > > ;
> > >                                 xfrm_state_put(x);
> > > 
> > > The secpath is still attached to all skbs, but the hang on device
> > > unregister should go away.
> > 
> > It didn't fix the issue.
> 
> Do you have a backtrace of the ref_tracker?
> 
> Is that with packet offload?
> 

Yes. And it's the same trace I posted before.

 ref_tracker: eth%d@000000007421424b has 1/1 users at
      xfrm_dev_state_add+0xe5/0x4d0
      xfrm_add_sa+0xc5c/0x11e0
      xfrm_user_rcv_msg+0xfa/0x240
      netlink_rcv_skb+0x54/0x100
      xfrm_netlink_rcv+0x31/0x40
      netlink_unicast+0x1fc/0x2c0
      netlink_sendmsg+0x232/0x4a0
      __sock_sendmsg+0x38/0x60
      ____sys_sendmsg+0x1e3/0x200
      ___sys_sendmsg+0x80/0xc0
      __sys_sendmsg+0x51/0x90
      do_syscall_64+0x40/0xe0
      entry_SYSCALL_64_after_hwframe+0x46/0x4e


> Looks like we need to remove the device from the xfrm_policy
> too if packet offload is used.
Steffen Klassert May 23, 2024, 10 a.m. UTC | #12
On Thu, May 23, 2024 at 06:57:25AM +0000, Jianbo Liu wrote:
> On Thu, 2024-05-23 at 08:44 +0200, Steffen Klassert wrote:
> > On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote:
> > > On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote:
> > > > 
> > > > Maybe we should directly remove the device from the xfrm_state
> > > > when the decice goes down, this should catch all the cases.
> > > > 
> > > > I think about something like this (untested) patch:
> > > > 
> > > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > > > index 0c306473a79d..ba402275ab57 100644
> > > > --- a/net/xfrm/xfrm_state.c
> > > > +++ b/net/xfrm/xfrm_state.c
> > > > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net,
> > > > struct
> > > > net_device *dev, bool task_vali
> > > >                                 xfrm_state_hold(x);
> > > >                                 spin_unlock_bh(&net-
> > > > > xfrm.xfrm_state_lock);
> > > >  
> > > > -                               err = xfrm_state_delete(x);
> > > > +                               spin_lock_bh(&x->lock);
> > > > +                               err = __xfrm_state_delete(x);
> > > > +                               xfrm_dev_state_free(x);
> > > > +                               spin_unlock_bh(&x->lock);
> > > > +
> > > >                                 xfrm_audit_state_delete(x, err ? 0
> > > > :
> > > > 1,
> > > >                                                         task_valid)
> > > > ;
> > > >                                 xfrm_state_put(x);
> > > > 
> > > > The secpath is still attached to all skbs, but the hang on device
> > > > unregister should go away.
> > > 
> > > It didn't fix the issue.
> > 
> > Do you have a backtrace of the ref_tracker?
> > 
> > Is that with packet offload?
> > 
> 
> Yes. And it's the same trace I posted before.
> 
>  ref_tracker: eth%d@000000007421424b has 1/1 users at
>       xfrm_dev_state_add+0xe5/0x4d0

Hm, interesting.

Can you check if xfrm_dev_state_free() is triggered in that codepath
and if it actually removes the device from the states?
Jianbo Liu May 23, 2024, 3:26 p.m. UTC | #13
On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote:
> On Thu, May 23, 2024 at 06:57:25AM +0000, Jianbo Liu wrote:
> > On Thu, 2024-05-23 at 08:44 +0200, Steffen Klassert wrote:
> > > On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote:
> > > > On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote:
> > > > > 
> > > > > Maybe we should directly remove the device from the
> > > > > xfrm_state
> > > > > when the decice goes down, this should catch all the cases.
> > > > > 
> > > > > I think about something like this (untested) patch:
> > > > > 
> > > > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > > > > index 0c306473a79d..ba402275ab57 100644
> > > > > --- a/net/xfrm/xfrm_state.c
> > > > > +++ b/net/xfrm/xfrm_state.c
> > > > > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net
> > > > > *net,
> > > > > struct
> > > > > net_device *dev, bool task_vali
> > > > >                                 xfrm_state_hold(x);
> > > > >                                 spin_unlock_bh(&net-
> > > > > > xfrm.xfrm_state_lock);
> > > > >  
> > > > > -                               err = xfrm_state_delete(x);
> > > > > +                               spin_lock_bh(&x->lock);
> > > > > +                               err = __xfrm_state_delete(x);
> > > > > +                               xfrm_dev_state_free(x);
> > > > > +                               spin_unlock_bh(&x->lock);
> > > > > +
> > > > >                                 xfrm_audit_state_delete(x,
> > > > > err ? 0
> > > > > :
> > > > > 1,
> > > > >                                                         task_
> > > > > valid)
> > > > > ;
> > > > >                                 xfrm_state_put(x);
> > > > > 
> > > > > The secpath is still attached to all skbs, but the hang on
> > > > > device
> > > > > unregister should go away.
> > > > 
> > > > It didn't fix the issue.
> > > 
> > > Do you have a backtrace of the ref_tracker?
> > > 
> > > Is that with packet offload?
> > > 
> > 
> > Yes. And it's the same trace I posted before.
> > 
> >  ref_tracker: eth%d@000000007421424b has 1/1 users at
> >       xfrm_dev_state_add+0xe5/0x4d0
> 
> Hm, interesting.
> 
> Can you check if xfrm_dev_state_free() is triggered in that codepath
> and if it actually removes the device from the states?
> 

xfrm_dev_state_free is not triggered. I think it's because I did "ip x
s delall" before unregister netdev.

Besides, as it's possible to sleep in dev's xdo_dev_state_free, there
is a "scheduling while atomic" issue to call xfrm_dev_state_free in
spin_lock.
Leon Romanovsky May 26, 2024, 10:57 a.m. UTC | #14
On Thu, May 23, 2024 at 08:44:56AM +0200, Steffen Klassert wrote:
> On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote:
> > On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote:
> > > 
> > > Maybe we should directly remove the device from the xfrm_state
> > > when the decice goes down, this should catch all the cases.
> > > 
> > > I think about something like this (untested) patch:
> > > 
> > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > > index 0c306473a79d..ba402275ab57 100644
> > > --- a/net/xfrm/xfrm_state.c
> > > +++ b/net/xfrm/xfrm_state.c
> > > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct
> > > net_device *dev, bool task_vali
> > >                                 xfrm_state_hold(x);
> > >                                 spin_unlock_bh(&net-
> > > >xfrm.xfrm_state_lock);
> > >  
> > > -                               err = xfrm_state_delete(x);
> > > +                               spin_lock_bh(&x->lock);
> > > +                               err = __xfrm_state_delete(x);
> > > +                               xfrm_dev_state_free(x);
> > > +                               spin_unlock_bh(&x->lock);
> > > +
> > >                                 xfrm_audit_state_delete(x, err ? 0 :
> > > 1,
> > >                                                         task_valid);
> > >                                 xfrm_state_put(x);
> > > 
> > > The secpath is still attached to all skbs, but the hang on device
> > > unregister should go away.
> > 
> > It didn't fix the issue.
> 
> Do you have a backtrace of the ref_tracker?
> 
> Is that with packet offload?

We saw same failure with crypto and packet offloads.

> 
> Looks like we need to remove the device from the xfrm_policy
> too if packet offload is used.

When we debugged it, we found that this line [1] was responsible for
elevated reference count. XFRM policy cleaned properly.

Thanks

[1] https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c#L332

>
Steffen Klassert May 27, 2024, 7:40 a.m. UTC | #15
On Thu, May 23, 2024 at 03:26:22PM +0000, Jianbo Liu wrote:
> On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote:
> > 
> > Hm, interesting.
> > 
> > Can you check if xfrm_dev_state_free() is triggered in that codepath
> > and if it actually removes the device from the states?
> > 
> 
> xfrm_dev_state_free is not triggered. I think it's because I did "ip x
> s delall" before unregister netdev.

Yes, likely. So we can't defer the device removal to the state free
functions, we always need to do that on state delete.

> Besides, as it's possible to sleep in dev's xdo_dev_state_free, there
> is a "scheduling while atomic" issue to call xfrm_dev_state_free in
> spin_lock.

Thanks for the hint!
Steffen Klassert May 28, 2024, 8:44 a.m. UTC | #16
On Mon, May 27, 2024 at 09:40:23AM +0200, Steffen Klassert wrote:
> On Thu, May 23, 2024 at 03:26:22PM +0000, Jianbo Liu wrote:
> > On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote:
> > > 
> > > Hm, interesting.
> > > 
> > > Can you check if xfrm_dev_state_free() is triggered in that codepath
> > > and if it actually removes the device from the states?
> > > 
> > 
> > xfrm_dev_state_free is not triggered. I think it's because I did "ip x
> > s delall" before unregister netdev.
> 
> Yes, likely. So we can't defer the device removal to the state free
> functions, we always need to do that on state delete.

The only (not too complicated) solution I see so far is to
free the device early, along with the state delete function:

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 649bb739df0d..bfc71d2daa6a 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -721,6 +721,7 @@ int __xfrm_state_delete(struct xfrm_state *x)
 			sock_put(rcu_dereference_raw(x->encap_sk));
 
 		xfrm_dev_state_delete(x);
+		xfrm_dev_state_free(x);
 
 		/* All xfrm_state objects are created by xfrm_state_alloc.
 		 * The xfrm_state_alloc call gives a reference, and that
Jianbo Liu May 28, 2024, 9:02 a.m. UTC | #17
On Tue, 2024-05-28 at 10:44 +0200, Steffen Klassert wrote:
> On Mon, May 27, 2024 at 09:40:23AM +0200, Steffen Klassert wrote:
> > On Thu, May 23, 2024 at 03:26:22PM +0000, Jianbo Liu wrote:
> > > On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote:
> > > > 
> > > > Hm, interesting.
> > > > 
> > > > Can you check if xfrm_dev_state_free() is triggered in that
> > > > codepath
> > > > and if it actually removes the device from the states?
> > > > 
> > > 
> > > xfrm_dev_state_free is not triggered. I think it's because I did
> > > "ip x
> > > s delall" before unregister netdev.
> > 
> > Yes, likely. So we can't defer the device removal to the state free
> > functions, we always need to do that on state delete.
> 
> The only (not too complicated) solution I see so far is to
> free the device early, along with the state delete function:
> 
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index 649bb739df0d..bfc71d2daa6a 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -721,6 +721,7 @@ int __xfrm_state_delete(struct xfrm_state *x)
>                         sock_put(rcu_dereference_raw(x->encap_sk));
>  
>                 xfrm_dev_state_delete(x);
> +               xfrm_dev_state_free(x);
>  

Still hit "scheduling while atomic" issue because __xfrm_state_delete
is called in state's spin lock. 

>                 /* All xfrm_state objects are created by
> xfrm_state_alloc.
>                  * The xfrm_state_alloc call gives a reference, and
> that
Steffen Klassert May 28, 2024, 9:26 a.m. UTC | #18
On Tue, May 28, 2024 at 09:02:49AM +0000, Jianbo Liu wrote:
> On Tue, 2024-05-28 at 10:44 +0200, Steffen Klassert wrote:
> > On Mon, May 27, 2024 at 09:40:23AM +0200, Steffen Klassert wrote:
> > > On Thu, May 23, 2024 at 03:26:22PM +0000, Jianbo Liu wrote:
> > > > On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote:
> > > > > 
> > > > > Hm, interesting.
> > > > > 
> > > > > Can you check if xfrm_dev_state_free() is triggered in that
> > > > > codepath
> > > > > and if it actually removes the device from the states?
> > > > > 
> > > > 
> > > > xfrm_dev_state_free is not triggered. I think it's because I did
> > > > "ip x
> > > > s delall" before unregister netdev.
> > > 
> > > Yes, likely. So we can't defer the device removal to the state free
> > > functions, we always need to do that on state delete.
> > 
> > The only (not too complicated) solution I see so far is to
> > free the device early, along with the state delete function:
> > 
> > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > index 649bb739df0d..bfc71d2daa6a 100644
> > --- a/net/xfrm/xfrm_state.c
> > +++ b/net/xfrm/xfrm_state.c
> > @@ -721,6 +721,7 @@ int __xfrm_state_delete(struct xfrm_state *x)
> >                         sock_put(rcu_dereference_raw(x->encap_sk));
> >  
> >                 xfrm_dev_state_delete(x);
> > +               xfrm_dev_state_free(x);
> >  
> 
> Still hit "scheduling while atomic" issue because __xfrm_state_delete
> is called in state's spin lock. 

Grmpf. Yes, apparently.

Unfortunately I don't have a NIC to test installed currently.
diff mbox series

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b99127712e67..d7f5024f3c08 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7025,6 +7025,10 @@  nodefer:	__kfree_skb(skb);
 	if (READ_ONCE(sd->defer_count) >= defer_max)
 		goto nodefer;
 
+#ifdef CONFIG_XFRM
+	skb_ext_del(skb, SKB_EXT_SEC_PATH);
+#endif
+
 	spin_lock_bh(&sd->defer_lock);
 	/* Send an IPI every time queue reaches half capacity. */
 	kick = sd->defer_count == (defer_max >> 1);