diff mbox series

[net] net: devmem: fix kernel panic when socket close after module unload

Message ID 20250415092417.1437488-1-ap420073@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] net: devmem: fix kernel panic when socket close after module unload | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 1 blamed authors not CCed: willemb@google.com; 1 maintainers not CCed: willemb@google.com
netdev/build_clang success Errors and warnings before: 2 this patch: 2
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 11 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-04-15--21-00 (tests: 914)

Commit Message

Taehee Yoo April 15, 2025, 9:24 a.m. UTC
Kernel panic occurs when a devmem TCP socket is closed after NIC module
is unloaded.

This is Devmem TCP unregistration scenarios. number is an order.
(a)socket close    (b)pp destroy    (c)uninstall    result
1                  2                3               OK
1                  3                2               (d)Impossible
2		   1                3               OK
3                  1                2               (e)Kernel panic
2                  3                1               (d)Impossible
3                  2                1               (d)Impossible

(a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is
    closed.
(b) page_pool_destroy() is called when the interface is down.
(c) mp_ops->uninstall() is called when an interface is unregistered.
(d) There is no scenario in mp_ops->uninstall() is called before
    page_pool_destroy().
    Because unregister_netdevice_many_notify() closes interfaces first
    and then calls mp_ops->uninstall().
(e) netdev_nl_sock_priv_destroy() accesses struct net_device.
    But if the interface module has already been removed, net_device
    pointer is invalid, so it causes kernel panic.

In summary, there are only 3 possible scenarios.
 A. sk close -> pp destroy -> uninstall.
 B. pp destroy -> sk close -> uninstall.
 C. pp destroy -> uninstall -> sk close.

Case C is a kernel panic scenario.

In order to fix this problem, it makes netdev_nl_sock_priv_destroy() do
nothing if a module is already removed.
The mp_ops->uninstall() handles these instead.

The netdev_nl_sock_priv_destroy() iterates binding->list and releases
them all with net_devmem_unbind_dmabuf().
The net_devmem_unbind_dmabuf() has the below steps.
1. Delete binding from a list.
2. Call _net_mp_close_rxq() for all rxq's bound to a binding.
3. Call net_devmem_dmabuf_binding_put() to release resources.

The mp_ops->uninstall() doesn't need to call _net_mp_close_rxq() because
resources are already released properly when an interface is being down.

From now on netdev_nl_sock_priv_destroy() will do nothing if a module
has been removed because all bindings are removed from a list by
mp_ops->uninstall().

netdev_nl_sock_priv_destroy() internally sets mp_ops to NULL.
So mp_ops->uninstall has not been called if devmem TCP socket was
already closed.

Tests:
Scenario A:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
	-v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    ip link set $interface down
    kill $pid
    modprobe -rv $module

Scenario B:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
	-v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    ip link set $interface down
    kill $pid
    modprobe -rv $module

Scenario C:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
	-v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    modprobe -rv $module
    sleep 5
    kill $pid

Splat looks like:
Oops: general protection fault, probably for non-canonical address 0xdffffc001fffa9f7: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
KASAN: probably user-memory-access in range [0x00000000fffd4fb8-0x00000000fffd4fbf]
CPU: 0 UID: 0 PID: 2041 Comm: ncdevmem Tainted: G    B   W           6.15.0-rc1+ #2 PREEMPT(undef)  0947ec89efa0fd68838b78e36aa1617e97ff5d7f
Tainted: [B]=BAD_PAGE, [W]=WARN
RIP: 0010:__mutex_lock (./include/linux/sched.h:2244 kernel/locking/mutex.c:400 kernel/locking/mutex.c:443 kernel/locking/mutex.c:605 kernel/locking/mutex.c:746)
Code: ea 03 80 3c 02 00 0f 85 4f 13 00 00 49 8b 1e 48 83 e3 f8 74 6a 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 34 48 89 fa 48 c1 ea 03 <0f> b6 f
RSP: 0018:ffff88826f7ef730 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: 00000000fffd4f88 RCX: ffffffffaa9bc811
RDX: 000000001fffa9f7 RSI: 0000000000000008 RDI: 00000000fffd4fbc
RBP: ffff88826f7ef8b0 R08: 0000000000000000 R09: ffffed103e6aa1a4
R10: 0000000000000007 R11: ffff88826f7ef442 R12: fffffbfff669f65e
R13: ffff88812a830040 R14: ffff8881f3550d20 R15: 00000000fffd4f88
FS:  0000000000000000(0000) GS:ffff888866c05000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000563bed0cb288 CR3: 00000001a7c98000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<TASK>
 ...
 netdev_nl_sock_priv_destroy (net/core/netdev-genl.c:953 (discriminator 3))
 genl_release (net/netlink/genetlink.c:653 net/netlink/genetlink.c:694 net/netlink/genetlink.c:705)
 ...
 netlink_release (net/netlink/af_netlink.c:737)
 ...
 __sock_release (net/socket.c:647)
 sock_close (net/socket.c:1393)

Fixes: 170aafe35cb9 ("netdev: support binding dma-buf to netdevice")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---

In order to test this patch, driver side implementation of devmem TCP[1]
is needed to be applied.

[1] https://lore.kernel.org/netdev/20250415052458.1260575-1-ap420073@gmail.com/T/#u

 net/core/devmem.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Stanislav Fomichev April 15, 2025, 5:33 p.m. UTC | #1
On 04/15, Taehee Yoo wrote:
> Kernel panic occurs when a devmem TCP socket is closed after NIC module
> is unloaded.
> 
> This is Devmem TCP unregistration scenarios. number is an order.
> (a)socket close    (b)pp destroy    (c)uninstall    result
> 1                  2                3               OK
> 1                  3                2               (d)Impossible
> 2		   1                3               OK
> 3                  1                2               (e)Kernel panic
> 2                  3                1               (d)Impossible
> 3                  2                1               (d)Impossible
> 
> (a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is
>     closed.
> (b) page_pool_destroy() is called when the interface is down.
> (c) mp_ops->uninstall() is called when an interface is unregistered.
> (d) There is no scenario in mp_ops->uninstall() is called before
>     page_pool_destroy().
>     Because unregister_netdevice_many_notify() closes interfaces first
>     and then calls mp_ops->uninstall().
> (e) netdev_nl_sock_priv_destroy() accesses struct net_device.
>     But if the interface module has already been removed, net_device
>     pointer is invalid, so it causes kernel panic.
> 
> In summary, there are only 3 possible scenarios.
>  A. sk close -> pp destroy -> uninstall.
>  B. pp destroy -> sk close -> uninstall.
>  C. pp destroy -> uninstall -> sk close.
> 
> Case C is a kernel panic scenario.
> 
> In order to fix this problem, it makes netdev_nl_sock_priv_destroy() do
> nothing if a module is already removed.
> The mp_ops->uninstall() handles these instead.
> 
> The netdev_nl_sock_priv_destroy() iterates binding->list and releases
> them all with net_devmem_unbind_dmabuf().
> The net_devmem_unbind_dmabuf() has the below steps.
> 1. Delete binding from a list.

[..]

> 2. Call _net_mp_close_rxq() for all rxq's bound to a binding.

This should not happen in the (C) case, right? mp_dmabuf_devmem_uninstall
removes the rxq from the xa. Can you explain more on why specifically
the crash occurs? Are we missing some check in netdev_nl_sock_priv_destroy
that makes sure the device is still alive?
Mina Almasry April 15, 2025, 6:22 p.m. UTC | #2
On Tue, Apr 15, 2025 at 2:24 AM Taehee Yoo <ap420073@gmail.com> wrote:
>
> Kernel panic occurs when a devmem TCP socket is closed after NIC module
> is unloaded.
>
> This is Devmem TCP unregistration scenarios. number is an order.
> (a)socket close    (b)pp destroy    (c)uninstall    result
> 1                  2                3               OK
> 1                  3                2               (d)Impossible
> 2                  1                3               OK
> 3                  1                2               (e)Kernel panic
> 2                  3                1               (d)Impossible
> 3                  2                1               (d)Impossible
>
> (a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is
>     closed.
> (b) page_pool_destroy() is called when the interface is down.
> (c) mp_ops->uninstall() is called when an interface is unregistered.
> (d) There is no scenario in mp_ops->uninstall() is called before
>     page_pool_destroy().
>     Because unregister_netdevice_many_notify() closes interfaces first
>     and then calls mp_ops->uninstall().
> (e) netdev_nl_sock_priv_destroy() accesses struct net_device.
>     But if the interface module has already been removed, net_device
>     pointer is invalid, so it causes kernel panic.
>
> In summary, there are only 3 possible scenarios.
>  A. sk close -> pp destroy -> uninstall.
>  B. pp destroy -> sk close -> uninstall.
>  C. pp destroy -> uninstall -> sk close.
>
> Case C is a kernel panic scenario.
>
> In order to fix this problem, it makes netdev_nl_sock_priv_destroy() do
> nothing if a module is already removed.
> The mp_ops->uninstall() handles these instead.
>
> The netdev_nl_sock_priv_destroy() iterates binding->list and releases
> them all with net_devmem_unbind_dmabuf().
> The net_devmem_unbind_dmabuf() has the below steps.
> 1. Delete binding from a list.
> 2. Call _net_mp_close_rxq() for all rxq's bound to a binding.
> 3. Call net_devmem_dmabuf_binding_put() to release resources.
>
> The mp_ops->uninstall() doesn't need to call _net_mp_close_rxq() because
> resources are already released properly when an interface is being down.
>
> From now on netdev_nl_sock_priv_destroy() will do nothing if a module
> has been removed because all bindings are removed from a list by
> mp_ops->uninstall().
>
> netdev_nl_sock_priv_destroy() internally sets mp_ops to NULL.
> So mp_ops->uninstall has not been called if devmem TCP socket was
> already closed.
>
> Tests:
> Scenario A:
>     ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
>         -v 7 -t 1 -q 1 &
>     pid=$!
>     sleep 10
>     ip link set $interface down
>     kill $pid
>     modprobe -rv $module
>
> Scenario B:
>     ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
>         -v 7 -t 1 -q 1 &
>     pid=$!
>     sleep 10
>     ip link set $interface down
>     kill $pid
>     modprobe -rv $module
>

Scenario A & B are exactly the same steps?

> Scenario C:
>     ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
>         -v 7 -t 1 -q 1 &
>     pid=$!
>     sleep 10
>     modprobe -rv $module
>     sleep 5
>     kill $pid
>
> Splat looks like:
> Oops: general protection fault, probably for non-canonical address 0xdffffc001fffa9f7: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
> KASAN: probably user-memory-access in range [0x00000000fffd4fb8-0x00000000fffd4fbf]
> CPU: 0 UID: 0 PID: 2041 Comm: ncdevmem Tainted: G    B   W           6.15.0-rc1+ #2 PREEMPT(undef)  0947ec89efa0fd68838b78e36aa1617e97ff5d7f
> Tainted: [B]=BAD_PAGE, [W]=WARN
> RIP: 0010:__mutex_lock (./include/linux/sched.h:2244 kernel/locking/mutex.c:400 kernel/locking/mutex.c:443 kernel/locking/mutex.c:605 kernel/locking/mutex.c:746)
> Code: ea 03 80 3c 02 00 0f 85 4f 13 00 00 49 8b 1e 48 83 e3 f8 74 6a 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 34 48 89 fa 48 c1 ea 03 <0f> b6 f
> RSP: 0018:ffff88826f7ef730 EFLAGS: 00010203
> RAX: dffffc0000000000 RBX: 00000000fffd4f88 RCX: ffffffffaa9bc811
> RDX: 000000001fffa9f7 RSI: 0000000000000008 RDI: 00000000fffd4fbc
> RBP: ffff88826f7ef8b0 R08: 0000000000000000 R09: ffffed103e6aa1a4
> R10: 0000000000000007 R11: ffff88826f7ef442 R12: fffffbfff669f65e
> R13: ffff88812a830040 R14: ffff8881f3550d20 R15: 00000000fffd4f88
> FS:  0000000000000000(0000) GS:ffff888866c05000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000563bed0cb288 CR3: 00000001a7c98000 CR4: 00000000007506f0
> PKRU: 55555554
> Call Trace:
> <TASK>
>  ...
>  netdev_nl_sock_priv_destroy (net/core/netdev-genl.c:953 (discriminator 3))

Line 953 is:

netdev_lock(dev);

Which was introduced by:

commit 42f342387841 ("net: fix use-after-free in the
netdev_nl_sock_priv_destroy()") and rolling back a few fixes, it's
really introduced by commit 1d22d3060b9b ("net: drop rtnl_lock for
queue_mgmt operations").

My first question, does this issue still reproduce if you remove the
per netdev locking and go back to relying on rtnl_locking? Or do we
crash somewhere else in net_devmem_unbind_dmabuf? If so, where?
Looking through the rest of the unbinding code, it's not clear to me
any of it actually uses dev, so it may just be the locking...

>  genl_release (net/netlink/genetlink.c:653 net/netlink/genetlink.c:694 net/netlink/genetlink.c:705)
>  ...
>  netlink_release (net/netlink/af_netlink.c:737)
>  ...
>  __sock_release (net/socket.c:647)
>  sock_close (net/socket.c:1393)
>
> Fixes: 170aafe35cb9 ("netdev: support binding dma-buf to netdevice")
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>
> In order to test this patch, driver side implementation of devmem TCP[1]
> is needed to be applied.
>
> [1] https://lore.kernel.org/netdev/20250415052458.1260575-1-ap420073@gmail.com/T/#u
>
>  net/core/devmem.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 6e27a47d0493..8948796b0af5 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -379,6 +379,11 @@ static void mp_dmabuf_devmem_uninstall(void *mp_priv,
>         xa_for_each(&binding->bound_rxqs, xa_idx, bound_rxq) {
>                 if (bound_rxq == rxq) {
>                         xa_erase(&binding->bound_rxqs, xa_idx);
> +
> +                       if (xa_empty(&binding->bound_rxqs)) {
> +                               list_del(&binding->list);
> +                               net_devmem_dmabuf_binding_put(binding);

On the surface, this fix looks completely unreviewable to be honest.
refcounting must be balanced. i.e. every put has a corresponding get,
otherwise there is a double free. I'm not sure which get you're
dropping here. I think this will cause a double put when the netlink
socket is closed?

--
Thanks,
Mina
Stanislav Fomichev April 15, 2025, 6:59 p.m. UTC | #3
On 04/15, Mina Almasry wrote:
> On Tue, Apr 15, 2025 at 2:24 AM Taehee Yoo <ap420073@gmail.com> wrote:
> >
> > Kernel panic occurs when a devmem TCP socket is closed after NIC module
> > is unloaded.
> >
> > This is Devmem TCP unregistration scenarios. number is an order.
> > (a)socket close    (b)pp destroy    (c)uninstall    result
> > 1                  2                3               OK
> > 1                  3                2               (d)Impossible
> > 2                  1                3               OK
> > 3                  1                2               (e)Kernel panic
> > 2                  3                1               (d)Impossible
> > 3                  2                1               (d)Impossible
> >
> > (a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is
> >     closed.
> > (b) page_pool_destroy() is called when the interface is down.
> > (c) mp_ops->uninstall() is called when an interface is unregistered.
> > (d) There is no scenario in mp_ops->uninstall() is called before
> >     page_pool_destroy().
> >     Because unregister_netdevice_many_notify() closes interfaces first
> >     and then calls mp_ops->uninstall().
> > (e) netdev_nl_sock_priv_destroy() accesses struct net_device.
> >     But if the interface module has already been removed, net_device
> >     pointer is invalid, so it causes kernel panic.
> >
> > In summary, there are only 3 possible scenarios.
> >  A. sk close -> pp destroy -> uninstall.
> >  B. pp destroy -> sk close -> uninstall.
> >  C. pp destroy -> uninstall -> sk close.
> >
> > Case C is a kernel panic scenario.
> >
> > In order to fix this problem, it makes netdev_nl_sock_priv_destroy() do
> > nothing if a module is already removed.
> > The mp_ops->uninstall() handles these instead.
> >
> > The netdev_nl_sock_priv_destroy() iterates binding->list and releases
> > them all with net_devmem_unbind_dmabuf().
> > The net_devmem_unbind_dmabuf() has the below steps.
> > 1. Delete binding from a list.
> > 2. Call _net_mp_close_rxq() for all rxq's bound to a binding.
> > 3. Call net_devmem_dmabuf_binding_put() to release resources.
> >
> > The mp_ops->uninstall() doesn't need to call _net_mp_close_rxq() because
> > resources are already released properly when an interface is being down.
> >
> > From now on netdev_nl_sock_priv_destroy() will do nothing if a module
> > has been removed because all bindings are removed from a list by
> > mp_ops->uninstall().
> >
> > netdev_nl_sock_priv_destroy() internally sets mp_ops to NULL.
> > So mp_ops->uninstall has not been called if devmem TCP socket was
> > already closed.
> >
> > Tests:
> > Scenario A:
> >     ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
> >         -v 7 -t 1 -q 1 &
> >     pid=$!
> >     sleep 10
> >     ip link set $interface down
> >     kill $pid
> >     modprobe -rv $module
> >
> > Scenario B:
> >     ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
> >         -v 7 -t 1 -q 1 &
> >     pid=$!
> >     sleep 10
> >     ip link set $interface down
> >     kill $pid
> >     modprobe -rv $module
> >
> 
> Scenario A & B are exactly the same steps?
> 
> > Scenario C:
> >     ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
> >         -v 7 -t 1 -q 1 &
> >     pid=$!
> >     sleep 10
> >     modprobe -rv $module
> >     sleep 5
> >     kill $pid
> >
> > Splat looks like:
> > Oops: general protection fault, probably for non-canonical address 0xdffffc001fffa9f7: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
> > KASAN: probably user-memory-access in range [0x00000000fffd4fb8-0x00000000fffd4fbf]
> > CPU: 0 UID: 0 PID: 2041 Comm: ncdevmem Tainted: G    B   W           6.15.0-rc1+ #2 PREEMPT(undef)  0947ec89efa0fd68838b78e36aa1617e97ff5d7f
> > Tainted: [B]=BAD_PAGE, [W]=WARN
> > RIP: 0010:__mutex_lock (./include/linux/sched.h:2244 kernel/locking/mutex.c:400 kernel/locking/mutex.c:443 kernel/locking/mutex.c:605 kernel/locking/mutex.c:746)
> > Code: ea 03 80 3c 02 00 0f 85 4f 13 00 00 49 8b 1e 48 83 e3 f8 74 6a 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 34 48 89 fa 48 c1 ea 03 <0f> b6 f
> > RSP: 0018:ffff88826f7ef730 EFLAGS: 00010203
> > RAX: dffffc0000000000 RBX: 00000000fffd4f88 RCX: ffffffffaa9bc811
> > RDX: 000000001fffa9f7 RSI: 0000000000000008 RDI: 00000000fffd4fbc
> > RBP: ffff88826f7ef8b0 R08: 0000000000000000 R09: ffffed103e6aa1a4
> > R10: 0000000000000007 R11: ffff88826f7ef442 R12: fffffbfff669f65e
> > R13: ffff88812a830040 R14: ffff8881f3550d20 R15: 00000000fffd4f88
> > FS:  0000000000000000(0000) GS:ffff888866c05000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000563bed0cb288 CR3: 00000001a7c98000 CR4: 00000000007506f0
> > PKRU: 55555554
> > Call Trace:
> > <TASK>
> >  ...
> >  netdev_nl_sock_priv_destroy (net/core/netdev-genl.c:953 (discriminator 3))
> 
> Line 953 is:
> 
> netdev_lock(dev);
> 
> Which was introduced by:
> 
> commit 42f342387841 ("net: fix use-after-free in the
> netdev_nl_sock_priv_destroy()") and rolling back a few fixes, it's
> really introduced by commit 1d22d3060b9b ("net: drop rtnl_lock for
> queue_mgmt operations").
> 
> My first question, does this issue still reproduce if you remove the
> per netdev locking and go back to relying on rtnl_locking? Or do we
> crash somewhere else in net_devmem_unbind_dmabuf? If so, where?
> Looking through the rest of the unbinding code, it's not clear to me
> any of it actually uses dev, so it may just be the locking...
 
A proper fix, most likely, will involve resetting binding->dev to NULL
when the device is going away. Replacing rtnl with dev lock exposes the
fact that we can't assume that the binding->dev is still valid by
the time we do unbind.
Jakub Kicinski April 16, 2025, 2:59 a.m. UTC | #4
On Tue, 15 Apr 2025 11:59:40 -0700 Stanislav Fomichev wrote:
> > commit 42f342387841 ("net: fix use-after-free in the
> > netdev_nl_sock_priv_destroy()") and rolling back a few fixes, it's
> > really introduced by commit 1d22d3060b9b ("net: drop rtnl_lock for
> > queue_mgmt operations").
> > 
> > My first question, does this issue still reproduce if you remove the
> > per netdev locking and go back to relying on rtnl_locking? Or do we
> > crash somewhere else in net_devmem_unbind_dmabuf? If so, where?
> > Looking through the rest of the unbinding code, it's not clear to me
> > any of it actually uses dev, so it may just be the locking...  
>  
> A proper fix, most likely, will involve resetting binding->dev to NULL
> when the device is going away.

Right, tho a bit of work and tricky handling will be necessary to get
that right. We're not holding a ref on binding->dev.

I think we need to invert the socket mutex vs instance lock ordering.
Make the priv mutex protect the binding->list and binding->dev.
For that to work the binding needs to also store a pointer to its
owning socket?

Then in both uninstall paths (from socket and from netdev unreg) we can
take the socket mutex, delete from list, clear the ->dev pointer,
unlock, release the ref on the binding.

The socket close path would probably need to lock the socket, look at 
the first entry, if entry has ->dev call netdev_hold(), release the
socket, lock the netdev, lock the socket again, look at the ->dev, if
NULL we raced - done. If not NULL release the socket, call unbind.
netdev_put(). Restart this paragraph.

I can't think of an easier way.

> Replacing rtnl with dev lock exposes the fact that we can't assume
> that the binding->dev is still valid by the time we do unbind.

Note that binding->dev is never accessed by net_devmem_unbind_dmabuf().
So if the device was unregistered and its queues flushed, the only thing
we touch the netdev pointer for is the instance lock :(
Stanislav Fomichev April 16, 2025, 2:40 p.m. UTC | #5
On 04/15, Jakub Kicinski wrote:
> On Tue, 15 Apr 2025 11:59:40 -0700 Stanislav Fomichev wrote:
> > > commit 42f342387841 ("net: fix use-after-free in the
> > > netdev_nl_sock_priv_destroy()") and rolling back a few fixes, it's
> > > really introduced by commit 1d22d3060b9b ("net: drop rtnl_lock for
> > > queue_mgmt operations").
> > > 
> > > My first question, does this issue still reproduce if you remove the
> > > per netdev locking and go back to relying on rtnl_locking? Or do we
> > > crash somewhere else in net_devmem_unbind_dmabuf? If so, where?
> > > Looking through the rest of the unbinding code, it's not clear to me
> > > any of it actually uses dev, so it may just be the locking...  
> >  
> > A proper fix, most likely, will involve resetting binding->dev to NULL
> > when the device is going away.
> 
> Right, tho a bit of work and tricky handling will be necessary to get
> that right. We're not holding a ref on binding->dev.
> 
> I think we need to invert the socket mutex vs instance lock ordering.
> Make the priv mutex protect the binding->list and binding->dev.
> For that to work the binding needs to also store a pointer to its
> owning socket?
> 
> Then in both uninstall paths (from socket and from netdev unreg) we can
> take the socket mutex, delete from list, clear the ->dev pointer,
> unlock, release the ref on the binding.
> 
> The socket close path would probably need to lock the socket, look at 
> the first entry, if entry has ->dev call netdev_hold(), release the
> socket, lock the netdev, lock the socket again, look at the ->dev, if
> NULL we raced - done. If not NULL release the socket, call unbind.
> netdev_put(). Restart this paragraph.
> 
> I can't think of an easier way.

An alternative might be to have a new extra lock to just protect
the binding->bound_rxq? And we can move the netdev_lock/unlock inside
the xa_for_each loop in net_devmem_unbind_dmabuf. This will make sure
we don't touch the outdated 'dev'. But I think you're right, the same
lock ordering issue is gonna happen in this case as well.

> > Replacing rtnl with dev lock exposes the fact that we can't assume
> > that the binding->dev is still valid by the time we do unbind.
> 
> Note that binding->dev is never accessed by net_devmem_unbind_dmabuf().
> So if the device was unregistered and its queues flushed, the only thing
> we touch the netdev pointer for is the instance lock :(

I was assuming that bound_rxq is also protected by the instance lock.
But as you were saying earlier, xa has its own lock, so I might
be wrong with that assumption..
Taehee Yoo April 16, 2025, 3:01 p.m. UTC | #6
On Wed, Apr 16, 2025 at 11:59 AM Jakub Kicinski <kuba@kernel.org> wrote:
>

Hi Mina, Stanislav and Jakub,
Thank you so much for the reviews!

> On Tue, 15 Apr 2025 11:59:40 -0700 Stanislav Fomichev wrote:
> > > commit 42f342387841 ("net: fix use-after-free in the
> > > netdev_nl_sock_priv_destroy()") and rolling back a few fixes, it's
> > > really introduced by commit 1d22d3060b9b ("net: drop rtnl_lock for
> > > queue_mgmt operations").

Yes, you're right.
The real fix would be commit 1d22d3060b9b
("net: drop rtnl_lock for queue_mgmt operations").

> > >
> > > My first question, does this issue still reproduce if you remove the
> > > per netdev locking and go back to relying on rtnl_locking? Or do we
> > > crash somewhere else in net_devmem_unbind_dmabuf? If so, where?
> > > Looking through the rest of the unbinding code, it's not clear to me
> > > any of it actually uses dev, so it may just be the locking...

Accessing binding->dev causes a crash after unreg.
I thought binding->dev is needed in the net_devmem_unbind_dmabuf(),
but it's not.
binding->dev is used only for locking, as you mentioned.

> >
> > A proper fix, most likely, will involve resetting binding->dev to NULL
> > when the device is going away.
>
> Right, tho a bit of work and tricky handling will be necessary to get
> that right. We're not holding a ref on binding->dev.
>
> I think we need to invert the socket mutex vs instance lock ordering.
> Make the priv mutex protect the binding->list and binding->dev.
> For that to work the binding needs to also store a pointer to its
> owning socket?
>
> Then in both uninstall paths (from socket and from netdev unreg) we can
> take the socket mutex, delete from list, clear the ->dev pointer,
> unlock, release the ref on the binding.
>
> The socket close path would probably need to lock the socket, look at
> the first entry, if entry has ->dev call netdev_hold(), release the
> socket, lock the netdev, lock the socket again, look at the ->dev, if
> NULL we raced - done. If not NULL release the socket, call unbind.
> netdev_put(). Restart this paragraph.
>
> I can't think of an easier way.

Thank you so much for a detailed guide :)
I tried what you suggested, then I tested cases A, B, and C.
I can't see any splats from lockdep, kasan, etc.
Also, I checked that bindings are released well by checking
/sys/kernel/debug/dma_buf/bufinfo.
I think this approach works well.
However, I tested this simply. So I'm not sure yet about race condition.
I need more tests targeting race condition.

I modified the locking order in the netdev_nl_bind_rx_doit().
And modified netdev_nl_sock_priv_destroy() code looks like:

void netdev_nl_sock_priv_destroy(struct netdev_nl_sock *priv)
{
        struct net_devmem_dmabuf_binding *binding;
        struct net_devmem_dmabuf_binding *temp;
        struct net_device *dev;

        mutex_lock(&priv->lock);
        list_for_each_entry_safe(binding, temp, &priv->bindings, list) {
                dev = binding->dev;
                if (dev) {
                        netdev_hold(dev, &priv->dev_tracker, GFP_KERNEL);
                        mutex_unlock(&priv->lock);
                        netdev_lock(dev);
                        mutex_lock(&priv->lock);
                        if (binding->dev)
                                net_devmem_unbind_dmabuf(binding);
                        mutex_unlock(&priv->lock);
                        netdev_unlock(dev);
                        netdev_put(dev, &priv->dev_tracker);
                        mutex_lock(&priv->lock);
                }
        }
        mutex_unlock(&priv->lock);
}

Also modified the uninstall code looks like:

static void mp_dmabuf_devmem_uninstall(void *mp_priv,
                                       struct netdev_rx_queue *rxq)
{
        struct net_devmem_dmabuf_binding *binding = mp_priv;
        struct netdev_rx_queue *bound_rxq;
        unsigned long xa_idx;

        mutex_lock(&binding->priv->lock);
        xa_for_each(&binding->bound_rxqs, xa_idx, bound_rxq) {
                if (bound_rxq == rxq) {
                        xa_erase(&binding->bound_rxqs, xa_idx);
                        if (xa_empty(&binding->bound_rxqs)) {
                                list_del(&binding->list);
                                binding->dev = NULL;
                                net_devmem_dmabuf_binding_put(binding);
                        }
                        break;
                }
        }
        mutex_unlock(&binding->priv->lock);
}

I think the uninstall code looks good to me, but
netdev_nl_sock_priv_destroy() is longer than I expected.
If this is okay with you, I would like to stabilize it with more tests.

>
> > Replacing rtnl with dev lock exposes the fact that we can't assume
> > that the binding->dev is still valid by the time we do unbind.
>
> Note that binding->dev is never accessed by net_devmem_unbind_dmabuf().
> So if the device was unregistered and its queues flushed, the only thing
> we touch the netdev pointer for is the instance lock :(
>
Mina Almasry April 16, 2025, 3:47 p.m. UTC | #7
On Tue, Apr 15, 2025 at 7:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 15 Apr 2025 11:59:40 -0700 Stanislav Fomichev wrote:
> > > commit 42f342387841 ("net: fix use-after-free in the
> > > netdev_nl_sock_priv_destroy()") and rolling back a few fixes, it's
> > > really introduced by commit 1d22d3060b9b ("net: drop rtnl_lock for
> > > queue_mgmt operations").
> > >
> > > My first question, does this issue still reproduce if you remove the
> > > per netdev locking and go back to relying on rtnl_locking? Or do we
> > > crash somewhere else in net_devmem_unbind_dmabuf? If so, where?
> > > Looking through the rest of the unbinding code, it's not clear to me
> > > any of it actually uses dev, so it may just be the locking...
> >
> > A proper fix, most likely, will involve resetting binding->dev to NULL
> > when the device is going away.
>
> Right, tho a bit of work and tricky handling will be necessary to get
> that right. We're not holding a ref on binding->dev.
>
> I think we need to invert the socket mutex vs instance lock ordering.
> Make the priv mutex protect the binding->list and binding->dev.
> For that to work the binding needs to also store a pointer to its
> owning socket?
>
> Then in both uninstall paths (from socket and from netdev unreg) we can
> take the socket mutex, delete from list, clear the ->dev pointer,
> unlock, release the ref on the binding.
>

I don't like that the ref obtained by the socket can be released by
both the socket and the netdev unreg :( It creates a weird mental
model where the ref owned by the socket can actually be dropped by the
netdev unreg path and then the socket close needs to detect that
something else dropped its ref. It also creates a weird scenario where
the device got unregistered and reregistered (I assume that's
possible? Or no?) and the socket is alive and the device is registered
but actually the binding is not active.

> The socket close path would probably need to lock the socket, look at
> the first entry, if entry has ->dev call netdev_hold(), release the
> socket, lock the netdev, lock the socket again, look at the ->dev, if
> NULL we raced - done. If not NULL release the socket, call unbind.
> netdev_put(). Restart this paragraph.
>
> I can't think of an easier way.
>

How about, roughly:

- the binding holds a ref on dev, making sure that the dev is alive
until the last ref on the binding is dropped and the binding is freed.
- The ref owned by the socket is only ever dropped by the socket close.
- When we netdev_lock(binding->dev) to later do a
net_devmem_dmabuf_unbind, we must first grab another ref on the
binding->dev, so that it doesn't get freed if the unbind drops the
last ref.

I think that would work too?

Can you remind me why we do a dev_memory_provider_uninstall on a
device unregister? If the device gets unregistered then re-registered
(again, I'm kinda assuming that is possible, I'm not sure) I expect it
to still be memory provider bound, because the netlink socket is still
alive and the userspace is still expecting a live binding. Maybe
delete the dev_memory_provider_uninstall code I added on unregister,
and sorry I put it there...? Or is there some reason I'm forgetting
that we have to uninstall the memory provider on unregister?
Jakub Kicinski April 17, 2025, 12:15 a.m. UTC | #8
On Wed, 16 Apr 2025 07:40:07 -0700 Stanislav Fomichev wrote:
> > The socket close path would probably need to lock the socket, look at 
> > the first entry, if entry has ->dev call netdev_hold(), release the
> > socket, lock the netdev, lock the socket again, look at the ->dev, if
> > NULL we raced - done. If not NULL release the socket, call unbind.
> > netdev_put(). Restart this paragraph.
> > 
> > I can't think of an easier way.  
> 
> An alternative might be to have a new extra lock to just protect
> the binding->bound_rxq? And we can move the netdev_lock/unlock inside
> the xa_for_each loop in net_devmem_unbind_dmabuf. This will make sure
> we don't touch the outdated 'dev'. But I think you're right, the same
> lock ordering issue is gonna happen in this case as well.

Yea, I was wondering about that but unless we're holding something that
prevents the netdev from going away - a lock or a ref - we'll just
circle back to the same race.
Jakub Kicinski April 17, 2025, 12:27 a.m. UTC | #9
On Wed, 16 Apr 2025 08:47:14 -0700 Mina Almasry wrote:
> > Right, tho a bit of work and tricky handling will be necessary to get
> > that right. We're not holding a ref on binding->dev.
> >
> > I think we need to invert the socket mutex vs instance lock ordering.
> > Make the priv mutex protect the binding->list and binding->dev.
> > For that to work the binding needs to also store a pointer to its
> > owning socket?
> >
> > Then in both uninstall paths (from socket and from netdev unreg) we can
> > take the socket mutex, delete from list, clear the ->dev pointer,
> > unlock, release the ref on the binding.
> 
> I don't like that the ref obtained by the socket can be released by
> both the socket and the netdev unreg :( It creates a weird mental
> model where the ref owned by the socket can actually be dropped by the
> netdev unreg path and then the socket close needs to detect that
> something else dropped its ref. It also creates a weird scenario where
> the device got unregistered and reregistered (I assume that's
> possible? Or no?) and the socket is alive and the device is registered
> but actually the binding is not active.

I agree. But it's be best I could come up with (and what we ended up
with in io-uring)...

> > The socket close path would probably need to lock the socket, look at
> > the first entry, if entry has ->dev call netdev_hold(), release the
> > socket, lock the netdev, lock the socket again, look at the ->dev, if
> > NULL we raced - done. If not NULL release the socket, call unbind.
> > netdev_put(). Restart this paragraph.
> >
> > I can't think of an easier way.
> >  
> 
> How about, roughly:
> 
> - the binding holds a ref on dev, making sure that the dev is alive
> until the last ref on the binding is dropped and the binding is freed.
> - The ref owned by the socket is only ever dropped by the socket close.
> - When we netdev_lock(binding->dev) to later do a
> net_devmem_dmabuf_unbind, we must first grab another ref on the
> binding->dev, so that it doesn't get freed if the unbind drops the
> last ref.

Right now you can't hold a reference on a netdevice forever.
You have to register a notifier and when NETDEV_UNREGISTER triggers
you must give up the reference you took. Also, fun note, it is illegal
to take an "additional reference". You must re-lookup the device or
otherwise safely ensure device is not getting torn down.

See netdev_wait_allrefs_any(), that blocks whoever called unregister
until all refs are reclaimed.

> I think that would work too?
> 
> Can you remind me why we do a dev_memory_provider_uninstall on a
> device unregister? If the device gets unregistered then re-registered
> (again, I'm kinda assuming that is possible, I'm not sure) 

It's not legal right now. I think there's a BUG_ON() somewhere.

> I expect it
> to still be memory provider bound, because the netlink socket is still
> alive and the userspace is still expecting a live binding. Maybe
> delete the dev_memory_provider_uninstall code I added on unregister,
> and sorry I put it there...? Or is there some reason I'm forgetting
> that we have to uninstall the memory provider on unregister?

IIRC bound_rxqs will point to freed memory once the netdev is gone.
If we had a ref on the netdev then yeah we could possibly potentially
keep the queues around. But holding refs on a netdev is.. a topic for
another time. I'm trying to limit amount of code we'd need to revert
if the instance locking turns out to be fundamentally broken :S
Jakub Kicinski April 17, 2025, 12:35 a.m. UTC | #10
On Thu, 17 Apr 2025 00:01:57 +0900 Taehee Yoo wrote:
> Thank you so much for a detailed guide :)
> I tried what you suggested, then I tested cases A, B, and C.
> I can't see any splats from lockdep, kasan, etc.
> Also, I checked that bindings are released well by checking
> /sys/kernel/debug/dma_buf/bufinfo.
> I think this approach works well.
> However, I tested this simply. So I'm not sure yet about race condition.
> I need more tests targeting race condition.
> 
> I modified the locking order in the netdev_nl_bind_rx_doit().
> And modified netdev_nl_sock_priv_destroy() code looks like:
> 
> void netdev_nl_sock_priv_destroy(struct netdev_nl_sock *priv)
> {
>         struct net_devmem_dmabuf_binding *binding;
>         struct net_devmem_dmabuf_binding *temp;
>         struct net_device *dev;
> 
>         mutex_lock(&priv->lock);
>         list_for_each_entry_safe(binding, temp, &priv->bindings, list) {

Not sure you can "for each entry safe here. Since you drop the lock in
the loop what this helper saves as the "temp" / next struct may be 
freed by the time we get to it. I think we need:

	mutex_lock()
	while (!list_empty())
		binding = list_first..

>                 dev = binding->dev;
>                 if (dev) {

nit: flip the condition to avoid the indent

but I think the condition is too early, we should protect the pointer
itself with the same lock as the list. So if the entry is on the list
dev must not be NULL.

>                         netdev_hold(dev, &priv->dev_tracker, GFP_KERNEL);

I think you can declare the tracker on the stack, FWIW

>                         mutex_unlock(&priv->lock);
>                         netdev_lock(dev);
>                         mutex_lock(&priv->lock);
>                         if (binding->dev)
>                                 net_devmem_unbind_dmabuf(binding);

Mina suggests that we should only release the ref from the socket side.
I guess that'd be good, it will prevent the binding itself from going
away. Either way you need to make sure you hold a ref on the binding.
Either by letting mp_dmabuf_devmem_uninstall() be as is, or taking
a new ref before you release the socket lock here.

>                         mutex_unlock(&priv->lock);
>                         netdev_unlock(dev);
>                         netdev_put(dev, &priv->dev_tracker);
>                         mutex_lock(&priv->lock);
>                 }
>         }
>         mutex_unlock(&priv->lock);
> }
Taehee Yoo April 17, 2025, 6:57 a.m. UTC | #11
On Thu, Apr 17, 2025 at 9:35 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 17 Apr 2025 00:01:57 +0900 Taehee Yoo wrote:
> > Thank you so much for a detailed guide :)
> > I tried what you suggested, then I tested cases A, B, and C.
> > I can't see any splats from lockdep, kasan, etc.
> > Also, I checked that bindings are released well by checking
> > /sys/kernel/debug/dma_buf/bufinfo.
> > I think this approach works well.
> > However, I tested this simply. So I'm not sure yet about race condition.
> > I need more tests targeting race condition.
> >
> > I modified the locking order in the netdev_nl_bind_rx_doit().
> > And modified netdev_nl_sock_priv_destroy() code looks like:
> >
> > void netdev_nl_sock_priv_destroy(struct netdev_nl_sock *priv)
> > {
> >         struct net_devmem_dmabuf_binding *binding;
> >         struct net_devmem_dmabuf_binding *temp;
> >         struct net_device *dev;
> >
> >         mutex_lock(&priv->lock);
> >         list_for_each_entry_safe(binding, temp, &priv->bindings, list) {
>
> Not sure you can "for each entry safe here. Since you drop the lock in
> the loop what this helper saves as the "temp" / next struct may be
> freed by the time we get to it. I think we need:
>
>         mutex_lock()
>         while (!list_empty())
>                 binding = list_first..
>
> >                 dev = binding->dev;
> >                 if (dev) {
>

Thanks. I will try to use that you suggested.

> nit: flip the condition to avoid the indent
>
> but I think the condition is too early, we should protect the pointer
> itself with the same lock as the list. So if the entry is on the list
> dev must not be NULL.

Yes, I think it would be okay to remove this condition.

>
> >                         netdev_hold(dev, &priv->dev_tracker, GFP_KERNEL);
>
> I think you can declare the tracker on the stack, FWIW

Okay, I will use stack instead.

>
> >                         mutex_unlock(&priv->lock);
> >                         netdev_lock(dev);
> >                         mutex_lock(&priv->lock);
> >                         if (binding->dev)
> >                                 net_devmem_unbind_dmabuf(binding);
>
> Mina suggests that we should only release the ref from the socket side.
> I guess that'd be good, it will prevent the binding itself from going
> away. Either way you need to make sure you hold a ref on the binding.
> Either by letting mp_dmabuf_devmem_uninstall() be as is, or taking
> a new ref before you release the socket lock here.

Thanks Mina for the suggestion!
What I would like to do is like that
If binding->dev is NULL, it skips locking, but it still keeps calling
net_devmem_unbind_dmabuf().
Calling net_devmem_unbind_dmabuf() is safe even if after module unload,
because binding->bound_rxq is deleted by the uninstall path.
If bound_rxq is empty, binding->dev will not be accessed.
The only uninstall side code change is to set binding->dev to NULL and
add priv->lock.
This approach was already suggested by Stanislav earlier in this thread.

This is based on an inverse locking order from
priv lock -> instance lock to instance lock -> priv lock.
Mina, Stanislav, and Jakub, can you confirm this?

>
> >                         mutex_unlock(&priv->lock);
> >                         netdev_unlock(dev);
> >                         netdev_put(dev, &priv->dev_tracker);
> >                         mutex_lock(&priv->lock);
> >                 }
> >         }
> >         mutex_unlock(&priv->lock);
> > }
Jakub Kicinski April 17, 2025, 2:09 p.m. UTC | #12
On Thu, 17 Apr 2025 15:57:47 +0900 Taehee Yoo wrote:
> Thanks Mina for the suggestion!
> What I would like to do is like that
> If binding->dev is NULL, it skips locking, but it still keeps calling
> net_devmem_unbind_dmabuf().

note that the current code in net_devmem_unbind_dmabuf() is also not
safe against double removal from the socket list.

> Calling net_devmem_unbind_dmabuf() is safe even if after module unload,
> because binding->bound_rxq is deleted by the uninstall path.
> If bound_rxq is empty, binding->dev will not be accessed.
> The only uninstall side code change is to set binding->dev to NULL and
> add priv->lock.
> This approach was already suggested by Stanislav earlier in this thread.

> Mina, Stanislav, and Jakub, can you confirm this?

Maybe just send the code, even if it's not perfect. It's a bit hard 
to track all the suggested changes :)
Mina Almasry April 17, 2025, 9:07 p.m. UTC | #13
On Wed, Apr 16, 2025 at 5:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 16 Apr 2025 08:47:14 -0700 Mina Almasry wrote:
> > > Right, tho a bit of work and tricky handling will be necessary to get
> > > that right. We're not holding a ref on binding->dev.
> > >
> > > I think we need to invert the socket mutex vs instance lock ordering.
> > > Make the priv mutex protect the binding->list and binding->dev.
> > > For that to work the binding needs to also store a pointer to its
> > > owning socket?
> > >
> > > Then in both uninstall paths (from socket and from netdev unreg) we can
> > > take the socket mutex, delete from list, clear the ->dev pointer,
> > > unlock, release the ref on the binding.
> >
> > I don't like that the ref obtained by the socket can be released by
> > both the socket and the netdev unreg :( It creates a weird mental
> > model where the ref owned by the socket can actually be dropped by the
> > netdev unreg path and then the socket close needs to detect that
> > something else dropped its ref. It also creates a weird scenario where
> > the device got unregistered and reregistered (I assume that's
> > possible? Or no?) and the socket is alive and the device is registered
> > but actually the binding is not active.
>
> I agree. But it's be best I could come up with (and what we ended up
> with in io-uring)...
>
> > > The socket close path would probably need to lock the socket, look at
> > > the first entry, if entry has ->dev call netdev_hold(), release the
> > > socket, lock the netdev, lock the socket again, look at the ->dev, if
> > > NULL we raced - done. If not NULL release the socket, call unbind.
> > > netdev_put(). Restart this paragraph.
> > >
> > > I can't think of an easier way.
> > >
> >
> > How about, roughly:
> >
> > - the binding holds a ref on dev, making sure that the dev is alive
> > until the last ref on the binding is dropped and the binding is freed.
> > - The ref owned by the socket is only ever dropped by the socket close.
> > - When we netdev_lock(binding->dev) to later do a
> > net_devmem_dmabuf_unbind, we must first grab another ref on the
> > binding->dev, so that it doesn't get freed if the unbind drops the
> > last ref.
>
> Right now you can't hold a reference on a netdevice forever.
> You have to register a notifier and when NETDEV_UNREGISTER triggers
> you must give up the reference you took. Also, fun note, it is illegal
> to take an "additional reference". You must re-lookup the device or
> otherwise safely ensure device is not getting torn down.
>
> See netdev_wait_allrefs_any(), that blocks whoever called unregister
> until all refs are reclaimed.
>
> > I think that would work too?
> >
> > Can you remind me why we do a dev_memory_provider_uninstall on a
> > device unregister? If the device gets unregistered then re-registered
> > (again, I'm kinda assuming that is possible, I'm not sure)
>
> It's not legal right now. I think there's a BUG_ON() somewhere.
>

Thanks, if re-registering is not possible, that makes this a lot simpler.

> > I expect it
> > to still be memory provider bound, because the netlink socket is still
> > alive and the userspace is still expecting a live binding. Maybe
> > delete the dev_memory_provider_uninstall code I added on unregister,
> > and sorry I put it there...? Or is there some reason I'm forgetting
> > that we have to uninstall the memory provider on unregister?
>
> IIRC bound_rxqs will point to freed memory once the netdev is gone.
> If we had a ref on the netdev then yeah we could possibly potentially
> keep the queues around. But holding refs on a netdev is.. a topic for
> another time. I'm trying to limit amount of code we'd need to revert
> if the instance locking turns out to be fundamentally broken :S

OK, no need to hold a reference, please ignore that suggestion. Thanks
for the detailed explanation.

There are a lot of suggestions flying around at the moment as you
noted so I'll refrain from adding more and I'll just review the next
version. Agree with what Jakub says below, please do send Taehee even
if it's not perfect, this may need some iteration.
Taehee Yoo April 18, 2025, 10:46 a.m. UTC | #14
On Thu, Apr 17, 2025 at 11:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 17 Apr 2025 15:57:47 +0900 Taehee Yoo wrote:
> > Thanks Mina for the suggestion!
> > What I would like to do is like that
> > If binding->dev is NULL, it skips locking, but it still keeps calling
> > net_devmem_unbind_dmabuf().
>
> note that the current code in net_devmem_unbind_dmabuf() is also not
> safe against double removal from the socket list.

Okay, I will look into this too.

>
> > Calling net_devmem_unbind_dmabuf() is safe even if after module unload,
> > because binding->bound_rxq is deleted by the uninstall path.
> > If bound_rxq is empty, binding->dev will not be accessed.
> > The only uninstall side code change is to set binding->dev to NULL and
> > add priv->lock.
> > This approach was already suggested by Stanislav earlier in this thread.
>
> > Mina, Stanislav, and Jakub, can you confirm this?
>
> Maybe just send the code, even if it's not perfect. It's a bit hard
> to track all the suggested changes :)

Thanks! I will send a patch soon.
Taehee Yoo April 18, 2025, 10:52 a.m. UTC | #15
On Fri, Apr 18, 2025 at 6:07 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Wed, Apr 16, 2025 at 5:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Wed, 16 Apr 2025 08:47:14 -0700 Mina Almasry wrote:
> > > > Right, tho a bit of work and tricky handling will be necessary to get
> > > > that right. We're not holding a ref on binding->dev.
> > > >
> > > > I think we need to invert the socket mutex vs instance lock ordering.
> > > > Make the priv mutex protect the binding->list and binding->dev.
> > > > For that to work the binding needs to also store a pointer to its
> > > > owning socket?
> > > >
> > > > Then in both uninstall paths (from socket and from netdev unreg) we can
> > > > take the socket mutex, delete from list, clear the ->dev pointer,
> > > > unlock, release the ref on the binding.
> > >
> > > I don't like that the ref obtained by the socket can be released by
> > > both the socket and the netdev unreg :( It creates a weird mental
> > > model where the ref owned by the socket can actually be dropped by the
> > > netdev unreg path and then the socket close needs to detect that
> > > something else dropped its ref. It also creates a weird scenario where
> > > the device got unregistered and reregistered (I assume that's
> > > possible? Or no?) and the socket is alive and the device is registered
> > > but actually the binding is not active.
> >
> > I agree. But it's be best I could come up with (and what we ended up
> > with in io-uring)...
> >
> > > > The socket close path would probably need to lock the socket, look at
> > > > the first entry, if entry has ->dev call netdev_hold(), release the
> > > > socket, lock the netdev, lock the socket again, look at the ->dev, if
> > > > NULL we raced - done. If not NULL release the socket, call unbind.
> > > > netdev_put(). Restart this paragraph.
> > > >
> > > > I can't think of an easier way.
> > > >
> > >
> > > How about, roughly:
> > >
> > > - the binding holds a ref on dev, making sure that the dev is alive
> > > until the last ref on the binding is dropped and the binding is freed.
> > > - The ref owned by the socket is only ever dropped by the socket close.
> > > - When we netdev_lock(binding->dev) to later do a
> > > net_devmem_dmabuf_unbind, we must first grab another ref on the
> > > binding->dev, so that it doesn't get freed if the unbind drops the
> > > last ref.
> >
> > Right now you can't hold a reference on a netdevice forever.
> > You have to register a notifier and when NETDEV_UNREGISTER triggers
> > you must give up the reference you took. Also, fun note, it is illegal
> > to take an "additional reference". You must re-lookup the device or
> > otherwise safely ensure device is not getting torn down.
> >
> > See netdev_wait_allrefs_any(), that blocks whoever called unregister
> > until all refs are reclaimed.
> >
> > > I think that would work too?
> > >
> > > Can you remind me why we do a dev_memory_provider_uninstall on a
> > > device unregister? If the device gets unregistered then re-registered
> > > (again, I'm kinda assuming that is possible, I'm not sure)
> >
> > It's not legal right now. I think there's a BUG_ON() somewhere.
> >
>
> Thanks, if re-registering is not possible, that makes this a lot simpler.
>
> > > I expect it
> > > to still be memory provider bound, because the netlink socket is still
> > > alive and the userspace is still expecting a live binding. Maybe
> > > delete the dev_memory_provider_uninstall code I added on unregister,
> > > and sorry I put it there...? Or is there some reason I'm forgetting
> > > that we have to uninstall the memory provider on unregister?
> >
> > IIRC bound_rxqs will point to freed memory once the netdev is gone.
> > If we had a ref on the netdev then yeah we could possibly potentially
> > keep the queues around. But holding refs on a netdev is.. a topic for
> > another time. I'm trying to limit amount of code we'd need to revert
> > if the instance locking turns out to be fundamentally broken :S
>
> OK, no need to hold a reference, please ignore that suggestion. Thanks
> for the detailed explanation.
>
> There are a lot of suggestions flying around at the moment as you
> noted so I'll refrain from adding more and I'll just review the next
> version. Agree with what Jakub says below, please do send Taehee even
> if it's not perfect, this may need some iteration.

Thanks for the review!
I’ll send the patch over shortly.

Thanks!
Taehee Yoo

>
> --
> Thanks,
> Mina
diff mbox series

Patch

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 6e27a47d0493..8948796b0af5 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -379,6 +379,11 @@  static void mp_dmabuf_devmem_uninstall(void *mp_priv,
 	xa_for_each(&binding->bound_rxqs, xa_idx, bound_rxq) {
 		if (bound_rxq == rxq) {
 			xa_erase(&binding->bound_rxqs, xa_idx);
+
+			if (xa_empty(&binding->bound_rxqs)) {
+				list_del(&binding->list);
+				net_devmem_dmabuf_binding_put(binding);
+			}
 			break;
 		}
 	}