diff mbox series

[RFC,net-next,1/2] devlink: Hold a reference on parent device

Message ID 20230619125015.1541143-2-idosch@nvidia.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series devlink: Acquire device lock during reload | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 10 this patch: 10
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 10 this patch: 10
netdev/checkpatch warning WARNING: Commit log lines starting with '#' are dropped by git as comments
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Ido Schimmel June 19, 2023, 12:50 p.m. UTC
Each devlink instance is associated with a parent device and a pointer
to this device is stored in the devlink structure, but devlink does not
hold a reference on this device.

This is going to be a problem in the next patch where - among other
things - devlink will acquire the device lock during netns dismantle,
before the reload operation. Since netns dismantle is performed
asynchronously and since a reference is not held on the parent device,
it will be possible to hit a use-after-free.

Prepare for the upcoming change by holding a reference on the parent
device.

Unfortunately, with this patch and this reproducer [1], the following
crash can be observed [2]. The last reference is released from the
device asynchronously - after an RCU grace period - when the netdevsim
module is no longer present. This causes device_release() to invoke a
release callback that is no longer present: nsim_bus_dev_release().

It's not clear to me if I'm doing something wrong in devlink (I don't
think so), if it's a bug in netdevsim or alternatively a bug in core
driver code that allows the bus module to go away before all the devices
that were connected to it are released.

The problem can be solved by devlink holding a reference on the backing
module (i.e., dev->driver->owner) or by each netdevsim device holding a
reference on the netdevsim module. However, this will prevent the
removal of the module when devices are present, something that is
possible today.

[1]
#!/bin/bash

for i in $(seq 1 1000); do
        echo "$i"
        insmod drivers/net/netdevsim/netdevsim.ko
        echo "10 0" > /sys/bus/netdevsim/new_device
        rmmod netdevsim
done

[2]
BUG: unable to handle page fault for address: ffffffffc0490910
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0
Oops: 0010 [#1] PREEMPT SMP
CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
Workqueue: events devlink_release
RIP: 0010:0xffffffffc0490910
Code: Unable to access opcode bytes at 0xffffffffc04908e6.
RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282
RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608
RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800
RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134
R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80
R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05
FS:  0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0
Call Trace:
 <TASK>
 ? __die+0x23/0x70
 ? page_fault_oops+0x181/0x470
 ? exc_page_fault+0xa6/0x140
 ? asm_exc_page_fault+0x26/0x30
 ? device_release+0x23/0x90
 ? device_release+0x34/0x90
 ? kobject_put+0x7d/0x1b0
 ? devlink_release+0x16/0x30
 ? process_one_work+0x1e0/0x3d0
 ? worker_thread+0x4e/0x3b0
 ? rescuer_thread+0x3a0/0x3a0
 ? kthread+0xe5/0x120
 ? kthread_complete_and_exit+0x20/0x20
 ? ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: [last unloaded: netdevsim]

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 net/devlink/core.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Jiri Pirko June 20, 2023, 6:23 a.m. UTC | #1
Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
>Each devlink instance is associated with a parent device and a pointer
>to this device is stored in the devlink structure, but devlink does not
>hold a reference on this device.
>
>This is going to be a problem in the next patch where - among other
>things - devlink will acquire the device lock during netns dismantle,
>before the reload operation. Since netns dismantle is performed
>asynchronously and since a reference is not held on the parent device,
>it will be possible to hit a use-after-free.
>
>Prepare for the upcoming change by holding a reference on the parent
>device.
>
>Unfortunately, with this patch and this reproducer [1], the following
>crash can be observed [2]. The last reference is released from the
>device asynchronously - after an RCU grace period - when the netdevsim
>module is no longer present. This causes device_release() to invoke a
>release callback that is no longer present: nsim_bus_dev_release().
>
>It's not clear to me if I'm doing something wrong in devlink (I don't
>think so), if it's a bug in netdevsim or alternatively a bug in core
>driver code that allows the bus module to go away before all the devices
>that were connected to it are released.
>
>The problem can be solved by devlink holding a reference on the backing
>module (i.e., dev->driver->owner) or by each netdevsim device holding a
>reference on the netdevsim module. However, this will prevent the
>removal of the module when devices are present, something that is
>possible today.
>
>[1]
>#!/bin/bash
>
>for i in $(seq 1 1000); do
>        echo "$i"
>        insmod drivers/net/netdevsim/netdevsim.ko
>        echo "10 0" > /sys/bus/netdevsim/new_device
>        rmmod netdevsim
>done
>
>[2]
>BUG: unable to handle page fault for address: ffffffffc0490910
>#PF: supervisor instruction fetch in kernel mode
>#PF: error_code(0x0010) - not-present page
>PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0
>Oops: 0010 [#1] PREEMPT SMP
>CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299
>Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
>Workqueue: events devlink_release
>RIP: 0010:0xffffffffc0490910
>Code: Unable to access opcode bytes at 0xffffffffc04908e6.
>RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282
>RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608
>RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800
>RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134
>R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80
>R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05
>FS:  0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000
>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0
>Call Trace:
> <TASK>
> ? __die+0x23/0x70
> ? page_fault_oops+0x181/0x470
> ? exc_page_fault+0xa6/0x140
> ? asm_exc_page_fault+0x26/0x30
> ? device_release+0x23/0x90
> ? device_release+0x34/0x90
> ? kobject_put+0x7d/0x1b0
> ? devlink_release+0x16/0x30
> ? process_one_work+0x1e0/0x3d0
> ? worker_thread+0x4e/0x3b0
> ? rescuer_thread+0x3a0/0x3a0
> ? kthread+0xe5/0x120
> ? kthread_complete_and_exit+0x20/0x20
> ? ret_from_fork+0x1f/0x30
> </TASK>
>Modules linked in: [last unloaded: netdevsim]
>
>Signed-off-by: Ido Schimmel <idosch@nvidia.com>

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Ido Schimmel June 20, 2023, 7:05 a.m. UTC | #2
On Tue, Jun 20, 2023 at 08:23:26AM +0200, Jiri Pirko wrote:
> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
> >Each devlink instance is associated with a parent device and a pointer
> >to this device is stored in the devlink structure, but devlink does not
> >hold a reference on this device.
> >
> >This is going to be a problem in the next patch where - among other
> >things - devlink will acquire the device lock during netns dismantle,
> >before the reload operation. Since netns dismantle is performed
> >asynchronously and since a reference is not held on the parent device,
> >it will be possible to hit a use-after-free.
> >
> >Prepare for the upcoming change by holding a reference on the parent
> >device.
> >
> >Unfortunately, with this patch and this reproducer [1], the following
> >crash can be observed [2]. The last reference is released from the
> >device asynchronously - after an RCU grace period - when the netdevsim
> >module is no longer present. This causes device_release() to invoke a
> >release callback that is no longer present: nsim_bus_dev_release().
> >
> >It's not clear to me if I'm doing something wrong in devlink (I don't
> >think so), if it's a bug in netdevsim or alternatively a bug in core
> >driver code that allows the bus module to go away before all the devices
> >that were connected to it are released.
> >
> >The problem can be solved by devlink holding a reference on the backing
> >module (i.e., dev->driver->owner) or by each netdevsim device holding a
> >reference on the netdevsim module. However, this will prevent the
> >removal of the module when devices are present, something that is
> >possible today.
> >
> >[1]
> >#!/bin/bash
> >
> >for i in $(seq 1 1000); do
> >        echo "$i"
> >        insmod drivers/net/netdevsim/netdevsim.ko
> >        echo "10 0" > /sys/bus/netdevsim/new_device
> >        rmmod netdevsim
> >done
> >
> >[2]
> >BUG: unable to handle page fault for address: ffffffffc0490910
> >#PF: supervisor instruction fetch in kernel mode
> >#PF: error_code(0x0010) - not-present page
> >PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0
> >Oops: 0010 [#1] PREEMPT SMP
> >CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299
> >Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
> >Workqueue: events devlink_release
> >RIP: 0010:0xffffffffc0490910
> >Code: Unable to access opcode bytes at 0xffffffffc04908e6.
> >RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282
> >RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608
> >RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800
> >RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134
> >R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80
> >R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05
> >FS:  0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000
> >CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0
> >Call Trace:
> > <TASK>
> > ? __die+0x23/0x70
> > ? page_fault_oops+0x181/0x470
> > ? exc_page_fault+0xa6/0x140
> > ? asm_exc_page_fault+0x26/0x30
> > ? device_release+0x23/0x90
> > ? device_release+0x34/0x90
> > ? kobject_put+0x7d/0x1b0
> > ? devlink_release+0x16/0x30
> > ? process_one_work+0x1e0/0x3d0
> > ? worker_thread+0x4e/0x3b0
> > ? rescuer_thread+0x3a0/0x3a0
> > ? kthread+0xe5/0x120
> > ? kthread_complete_and_exit+0x20/0x20
> > ? ret_from_fork+0x1f/0x30
> > </TASK>
> >Modules linked in: [last unloaded: netdevsim]
> >
> >Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> 
> Reviewed-by: Jiri Pirko <jiri@nvidia.com>

Thanks, but I was hoping to get feedback on how to solve the problem
mentioned in the commit message :p
Jakub Kicinski June 20, 2023, 5:43 p.m. UTC | #3
On Tue, 20 Jun 2023 10:05:23 +0300 Ido Schimmel wrote:
> On Tue, Jun 20, 2023 at 08:23:26AM +0200, Jiri Pirko wrote:
> > Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:  
>  [...]  
> > 
> > Reviewed-by: Jiri Pirko <jiri@nvidia.com>  
> 
> Thanks, but I was hoping to get feedback on how to solve the problem
> mentioned in the commit message :p

Do we need to hold the reference on the device until release?
I think you can release it in devlink_free().
The only valid fields for an unregistered devlink instance are:
 - lock
 - refcount
 - index

And obviously unregistered devices can't be reloaded.
Ido Schimmel June 21, 2023, 6:31 a.m. UTC | #4
On Tue, Jun 20, 2023 at 10:43:51AM -0700, Jakub Kicinski wrote:
> Do we need to hold the reference on the device until release?
> I think you can release it in devlink_free().
> The only valid fields for an unregistered devlink instance are:
>  - lock
>  - refcount
>  - index
> 
> And obviously unregistered devices can't be reloaded.

Thanks for taking a look.

Moving the release to devlink_free() [1] was the first thing I tried and
it indeed solves the problem I mentioned earlier, but creates a new one.
After devlink_free() returns the devlink instance can still be accessed
by user space in devlink_get_from_attrs_lock(). If I reload in a loop
while concurrently removing and adding the device [2], we can hit a UAF
when trying to acquire the device lock [3].

[1]
diff --git a/net/devlink/core.c b/net/devlink/core.c
index a4b6d548e50c..b60a8463d6e0 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -92,7 +92,6 @@ static void devlink_release(struct work_struct *work)
 
        mutex_destroy(&devlink->lock);
        lockdep_unregister_key(&devlink->lock_key);
-       put_device(devlink->dev);
        kfree(devlink);
 }
 
@@ -264,6 +263,8 @@ void devlink_free(struct devlink *devlink)
 
        xa_erase(&devlinks, devlink->index);
 
+       put_device(devlink->dev);
+
        devlink_put(devlink);
 }
 EXPORT_SYMBOL_GPL(devlink_free);

[2]
while true; do devlink dev reload netdevsim/netdevsim10 &> /dev/null; done &
while true; do echo "10 0" > /sys/bus/netdevsim/new_device; echo 10 > /sys/bus/netdevsim/del_device; done

[3]
[   96.081096] ==================================================================
[   96.082158] BUG: KASAN: slab-use-after-free in __mutex_lock+0x18a7/0x1ac0
[   96.083107] Read of size 8 at addr ffff888109caa8d8 by task devlink/429

[   96.084266] CPU: 0 PID: 429 Comm: devlink Not tainted 6.4.0-rc6-custom-gb01b0912311c-dirty #303
[   96.085443] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
[   96.086632] Call Trace:
[   96.086990]  <TASK>
[   96.087314]  dump_stack_lvl+0x91/0xf0
[   96.087852]  print_report+0xcf/0x660
[   96.088384]  ? __virt_addr_valid+0x86/0x360
[   96.088980]  ? __mutex_lock+0x18a7/0x1ac0
[   96.089558]  ? __mutex_lock+0x18a7/0x1ac0
[   96.090132]  kasan_report+0xd6/0x110
[   96.090667]  ? __mutex_lock+0x18a7/0x1ac0
[   96.091245]  __mutex_lock+0x18a7/0x1ac0
[   96.091898]  ? devlink_get_from_attrs_lock+0x2bc/0x460
[   96.092645]  ? mutex_lock_io_nested+0x18b0/0x18b0
[   96.093323]  ? reacquire_held_locks+0x4e0/0x4e0
[   96.093972]  ? devlink_try_get+0x158/0x1e0
[   96.094586]  ? devlink_get_from_attrs_lock+0x460/0x460
[   96.095325]  ? devlink_get_from_attrs_lock+0x2bc/0x460
[   96.096053]  devlink_get_from_attrs_lock+0x2bc/0x460
[   96.096767]  ? devlink_nl_post_doit+0xf0/0xf0
[   96.097409]  ? __nla_parse+0x40/0x50
[   96.097943]  ? devlink_get_from_attrs_lock+0x460/0x460
[   96.098671]  devlink_nl_pre_doit+0xb3/0x480
[   96.099255]  genl_family_rcv_msg_doit.isra.0+0x1b8/0x2e0
[   96.099966]  ? genl_start+0x670/0x670
[   96.100487]  ? ns_capable+0xda/0x110
[   96.100991]  genl_rcv_msg+0x558/0x7f0
[   96.101516]  ? genl_family_rcv_msg_doit.isra.0+0x2e0/0x2e0
[   96.102260]  ? devlink_get_from_attrs_lock+0x460/0x460
[   96.102925]  ? devlink_reload+0x540/0x540
[   96.103429]  ? devlink_pernet_pre_exit+0x340/0x340
[   96.104017]  netlink_rcv_skb+0x170/0x440
[   96.104508]  ? genl_family_rcv_msg_doit.isra.0+0x2e0/0x2e0
[   96.105166]  ? netlink_ack+0x1380/0x1380
[   96.105662]  ? lock_contended+0xc70/0xc70
[   96.106172]  ? rwsem_down_read_slowpath+0xda0/0xda0
[   96.106767]  ? netlink_deliver_tap+0x1b6/0xd90
[   96.107317]  ? is_vmalloc_addr+0x8b/0xb0
[   96.107804]  genl_rcv+0x2d/0x40
[   96.108213]  netlink_unicast+0x53f/0x810
[   96.108698]  ? netlink_attachskb+0x870/0x870
[   96.109230]  ? lock_release+0x3ac/0xbb0
[   96.109714]  netlink_sendmsg+0x95c/0xe80
[   96.110206]  ? netlink_unicast+0x810/0x810
[   96.110711]  ? __might_fault+0x15b/0x190
[   96.111194]  ? _copy_from_user+0x9f/0xd0
[   96.111691]  __sys_sendto+0x2aa/0x420
[   96.112148]  ? __ia32_sys_getpeername+0xb0/0xb0
[   96.112706]  ? reacquire_held_locks+0x4e0/0x4e0
[   96.113275]  ? rcu_is_watching+0x12/0xb0
[   96.113769]  ? blkcg_exit_disk+0x50/0x50
[   96.114260]  __x64_sys_sendto+0xe5/0x1c0
[   96.114742]  ? lockdep_hardirqs_on+0x7d/0x100
[   96.115285]  ? syscall_enter_from_user_mode+0x20/0x50
[   96.115899]  do_syscall_64+0x38/0x80
[   96.116352]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   96.116964] RIP: 0033:0x7fa073f72fa7
[   96.117417] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 80 3d 5d d6 0c 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 c3 55 48 83 ec 30 44 89 4c 24 2c 4c 89 44
[   96.119548] RSP: 002b:00007ffca2723938 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[   96.120449] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fa073f72fa7
[   96.121287] RDX: 0000000000000034 RSI: 000055883950a460 RDI: 0000000000000003
[   96.122120] RBP: 0000558837836e60 R08: 00007fa074050200 R09: 000000000000000c
[   96.122951] R10: 0000000000000000 R11: 0000000000000202 R12: 000055883950a2a0
[   96.123789] R13: 000055883950a460 R14: 000055883784eeab R15: 000055883950a2a0
[   96.124638]  </TASK>

[   96.125120] Allocated by task 209:
[   96.125543]  kasan_save_stack+0x33/0x50
[   96.125567]  kasan_set_track+0x25/0x30
[   96.125587]  __kasan_kmalloc+0x87/0x90
[   96.125606]  new_device_store+0x22d/0x6a0
[   96.125625]  bus_attr_store+0x7b/0xa0
[   96.125641]  sysfs_kf_write+0x11c/0x170
[   96.125659]  kernfs_fop_write_iter+0x3f7/0x600
[   96.125676]  vfs_write+0x680/0xe90
[   96.125691]  ksys_write+0x143/0x270
[   96.125707]  do_syscall_64+0x38/0x80
[   96.125722]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

[   96.125947] Freed by task 209:
[   96.126332]  kasan_save_stack+0x33/0x50
[   96.126355]  kasan_set_track+0x25/0x30
[   96.126374]  kasan_save_free_info+0x2e/0x40
[   96.126389]  __kasan_slab_free+0x10a/0x180
[   96.126409]  __kmem_cache_free+0x8a/0x1a0
[   96.126429]  device_release+0xa6/0x240
[   96.126449]  kobject_put+0x1f7/0x5b0
[   96.126465]  device_unregister+0x34/0xc0
[   96.126478]  del_device_store+0x308/0x430
[   96.126496]  bus_attr_store+0x7b/0xa0
[   96.126511]  sysfs_kf_write+0x11c/0x170
[   96.126528]  kernfs_fop_write_iter+0x3f7/0x600
[   96.126545]  vfs_write+0x680/0xe90
[   96.126560]  ksys_write+0x143/0x270
[   96.126575]  do_syscall_64+0x38/0x80
[   96.126590]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

[   96.126815] The buggy address belongs to the object at ffff888109caa800
                which belongs to the cache kmalloc-1k of size 1024
[   96.128252] The buggy address is located 216 bytes inside of
                freed 1024-byte region [ffff888109caa800, ffff888109caac00)

[   96.129875] The buggy address belongs to the physical page:
[   96.130540] page:00000000c7912b23 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109ca8
[   96.130562] head:00000000c7912b23 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[   96.130575] flags: 0x200000000010200(slab|head|node=0|zone=2)
[   96.130591] page_type: 0xffffffff()
[   96.130607] raw: 0200000000010200 ffff888100041dc0 dead000000000100 dead000000000122
[   96.130622] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[   96.130632] page dumped because: kasan: bad access detected

[   96.130844] Memory state around the buggy address:
[   96.131424]  ffff888109caa780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   96.132266]  ffff888109caa800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   96.133101] >ffff888109caa880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   96.133944]                                                     ^
[   96.134666]  ffff888109caa900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   96.135510]  ffff888109caa980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   96.136351] ==================================================================
Jiri Pirko June 21, 2023, 11:48 a.m. UTC | #5
Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
>Each devlink instance is associated with a parent device and a pointer
>to this device is stored in the devlink structure, but devlink does not
>hold a reference on this device.
>
>This is going to be a problem in the next patch where - among other
>things - devlink will acquire the device lock during netns dismantle,
>before the reload operation. Since netns dismantle is performed
>asynchronously and since a reference is not held on the parent device,
>it will be possible to hit a use-after-free.
>
>Prepare for the upcoming change by holding a reference on the parent
>device.
>
>Unfortunately, with this patch and this reproducer [1], the following
>crash can be observed [2]. The last reference is released from the
>device asynchronously - after an RCU grace period - when the netdevsim
>module is no longer present. This causes device_release() to invoke a
>release callback that is no longer present: nsim_bus_dev_release().
>
>It's not clear to me if I'm doing something wrong in devlink (I don't
>think so), if it's a bug in netdevsim or alternatively a bug in core
>driver code that allows the bus module to go away before all the devices
>that were connected to it are released.
>
>The problem can be solved by devlink holding a reference on the backing
>module (i.e., dev->driver->owner) or by each netdevsim device holding a
>reference on the netdevsim module. However, this will prevent the
>removal of the module when devices are present, something that is
>possible today.
>
>[1]
>#!/bin/bash
>
>for i in $(seq 1 1000); do
>        echo "$i"
>        insmod drivers/net/netdevsim/netdevsim.ko
>        echo "10 0" > /sys/bus/netdevsim/new_device
>        rmmod netdevsim
>done
>
>[2]
>BUG: unable to handle page fault for address: ffffffffc0490910
>#PF: supervisor instruction fetch in kernel mode
>#PF: error_code(0x0010) - not-present page
>PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0
>Oops: 0010 [#1] PREEMPT SMP
>CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299
>Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
>Workqueue: events devlink_release
>RIP: 0010:0xffffffffc0490910
>Code: Unable to access opcode bytes at 0xffffffffc04908e6.
>RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282
>RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608
>RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800
>RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134
>R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80
>R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05
>FS:  0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000
>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0
>Call Trace:
> <TASK>
> ? __die+0x23/0x70
> ? page_fault_oops+0x181/0x470
> ? exc_page_fault+0xa6/0x140
> ? asm_exc_page_fault+0x26/0x30
> ? device_release+0x23/0x90
> ? device_release+0x34/0x90
> ? kobject_put+0x7d/0x1b0
> ? devlink_release+0x16/0x30
> ? process_one_work+0x1e0/0x3d0
> ? worker_thread+0x4e/0x3b0
> ? rescuer_thread+0x3a0/0x3a0
> ? kthread+0xe5/0x120
> ? kthread_complete_and_exit+0x20/0x20
> ? ret_from_fork+0x1f/0x30
> </TASK>
>Modules linked in: [last unloaded: netdevsim]
>
>Signed-off-by: Ido Schimmel <idosch@nvidia.com>
>---
> net/devlink/core.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/net/devlink/core.c b/net/devlink/core.c
>index c23ebabadc52..b191112f57af 100644
>--- a/net/devlink/core.c
>+++ b/net/devlink/core.c
>@@ -4,6 +4,7 @@
>  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
>  */
> 
>+#include <linux/device.h>
> #include <net/genetlink.h>
> 
> #include "devl_internal.h"
>@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
> 
> 	mutex_destroy(&devlink->lock);
> 	lockdep_unregister_key(&devlink->lock_key);
>+	put_device(devlink->dev);

In this case I think you have to make sure this is called before
devlink_free() ends. After the caller of devlink_free() returns (most
probably .remove callback), nothing stops module from being removed.

I don't see other way. Utilize complete() here and wait_for_completion()
at the end of devlink_free().

If the completion in devlink_put() area rings a bell for you, let me save
you the trouble looking it up:
9053637e0da7 ("devlink: remove the registration guarantee of references")
This commit removed that. But it is a different usage.



> 	kfree(devlink);
> }
> 
>@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
> 	if (ret < 0)
> 		goto err_xa_alloc;
> 
>+	get_device(dev);
> 	devlink->dev = dev;
> 	devlink->ops = ops;
> 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
>-- 
>2.40.1
>
Ido Schimmel June 21, 2023, 3:35 p.m. UTC | #6
On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote:
> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
> > 
> > 	mutex_destroy(&devlink->lock);
> > 	lockdep_unregister_key(&devlink->lock_key);
> >+	put_device(devlink->dev);
> 
> In this case I think you have to make sure this is called before
> devlink_free() ends. After the caller of devlink_free() returns (most
> probably .remove callback), nothing stops module from being removed.
> 
> I don't see other way. Utilize complete() here and wait_for_completion()
> at the end of devlink_free().

I might be missing something, but how can I do something like
wait_for_completion(&devlink->comp) at the end of devlink_free()? After
I call devlink_put() the devlink instance can be freed and the
wait_for_completion() call will result in a UAF.

> 
> If the completion in devlink_put() area rings a bell for you, let me save
> you the trouble looking it up:
> 9053637e0da7 ("devlink: remove the registration guarantee of references")
> This commit removed that. But it is a different usage.
> 
> 
> 
> > 	kfree(devlink);
> > }
> > 
> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
> > 	if (ret < 0)
> > 		goto err_xa_alloc;
> > 
> >+	get_device(dev);
> > 	devlink->dev = dev;
> > 	devlink->ops = ops;
> > 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
> >-- 
> >2.40.1
> >
Jakub Kicinski June 21, 2023, 7:03 p.m. UTC | #7
On Wed, 21 Jun 2023 09:31:43 +0300 Ido Schimmel wrote:
> Thanks for taking a look.
> 
> Moving the release to devlink_free() [1] was the first thing I tried and
> it indeed solves the problem I mentioned earlier, but creates a new one.
> After devlink_free() returns the devlink instance can still be accessed
> by user space in devlink_get_from_attrs_lock(). If I reload in a loop
> while concurrently removing and adding the device [2], we can hit a UAF
> when trying to acquire the device lock [3].

Ugh, I didn't look at the second patch, it's taking the device lock
before validating that the devlink instance is registered. 
So we need to extend the list of fields which must always be valid :(

Let's try to fix it at the netdevsim level then? AFAIU we only need the
bus to remain loaded for nsim_bus_dev_release to exist? What if we split
netdevsim into two modules, put the bus stuff in a new module called
netdevsim_bus, and leave the rest (driver) in just netdevsim. That way
we can take a ref on netdevsim_bus until all devices are gone, and still
load / unload netdevsim. With unload resulting in all devices getting
auto-deleted.

I haven't looked in detail so maybe you'll immediately tell me it won't
work, but I'm guessing this is how "real" buses work avoid the problem?
Ido Schimmel June 22, 2023, 6:03 a.m. UTC | #8
On Wed, Jun 21, 2023 at 12:03:57PM -0700, Jakub Kicinski wrote:
> Let's try to fix it at the netdevsim level then?

I think it's a good direction. I couldn't find a way to fix it in
devlink and the problem does seem specific to netdevsim.

> AFAIU we only need the bus to remain loaded for nsim_bus_dev_release
> to exist?

That's my understanding as well.

> What if we split netdevsim into two modules, put the bus stuff in a
> new module called netdevsim_bus, and leave the rest (driver) in just
> netdevsim. That way we can take a ref on netdevsim_bus until all
> devices are gone, and still load / unload netdevsim. With unload
> resulting in all devices getting auto-deleted.
> 
> I haven't looked in detail so maybe you'll immediately tell me it won't
> work, but I'm guessing this is how "real" buses work avoid the problem?

At least with PCI (which I believe is the bus backing most devlink
users), the release callback is builtin - not part of a module - so this
problem can't happen there.

Anyway, I will explore this direction.

Thanks!
Jiri Pirko June 22, 2023, 6:29 a.m. UTC | #9
Wed, Jun 21, 2023 at 05:35:15PM CEST, idosch@nvidia.com wrote:
>On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote:
>> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
>> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
>> > 
>> > 	mutex_destroy(&devlink->lock);
>> > 	lockdep_unregister_key(&devlink->lock_key);
>> >+	put_device(devlink->dev);
>> 
>> In this case I think you have to make sure this is called before
>> devlink_free() ends. After the caller of devlink_free() returns (most
>> probably .remove callback), nothing stops module from being removed.
>> 
>> I don't see other way. Utilize complete() here and wait_for_completion()
>> at the end of devlink_free().
>
>I might be missing something, but how can I do something like
>wait_for_completion(&devlink->comp) at the end of devlink_free()? After
>I call devlink_put() the devlink instance can be freed and the
>wait_for_completion() call will result in a UAF.

You have to move the free() to devlink_free()
Basically, all the things done in devlink_put that are symmetrical to
the initialization done in devlink_alloc() should be moved there.


>
>> 
>> If the completion in devlink_put() area rings a bell for you, let me save
>> you the trouble looking it up:
>> 9053637e0da7 ("devlink: remove the registration guarantee of references")
>> This commit removed that. But it is a different usage.
>> 
>> 
>> 
>> > 	kfree(devlink);
>> > }
>> > 
>> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
>> > 	if (ret < 0)
>> > 		goto err_xa_alloc;
>> > 
>> >+	get_device(dev);
>> > 	devlink->dev = dev;
>> > 	devlink->ops = ops;
>> > 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
>> >-- 
>> >2.40.1
>> >
Ido Schimmel June 25, 2023, 11:55 a.m. UTC | #10
On Thu, Jun 22, 2023 at 08:29:31AM +0200, Jiri Pirko wrote:
> Wed, Jun 21, 2023 at 05:35:15PM CEST, idosch@nvidia.com wrote:
> >On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote:
> >> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
> >> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
> >> > 
> >> > 	mutex_destroy(&devlink->lock);
> >> > 	lockdep_unregister_key(&devlink->lock_key);
> >> >+	put_device(devlink->dev);
> >> 
> >> In this case I think you have to make sure this is called before
> >> devlink_free() ends. After the caller of devlink_free() returns (most
> >> probably .remove callback), nothing stops module from being removed.
> >> 
> >> I don't see other way. Utilize complete() here and wait_for_completion()
> >> at the end of devlink_free().
> >
> >I might be missing something, but how can I do something like
> >wait_for_completion(&devlink->comp) at the end of devlink_free()? After
> >I call devlink_put() the devlink instance can be freed and the
> >wait_for_completion() call will result in a UAF.
> 
> You have to move the free() to devlink_free()
> Basically, all the things done in devlink_put that are symmetrical to
> the initialization done in devlink_alloc() should be moved there.

But it's a bit weird to dereference 'devlink' (to wait for the
completion) after calling 'devlink_put(devlink)'. Given that this
problem seems to be specific to netdevsim, don't you think it's better
to fix it in netdevsim rather than working around it in devlink?

> 
> 
> >
> >> 
> >> If the completion in devlink_put() area rings a bell for you, let me save
> >> you the trouble looking it up:
> >> 9053637e0da7 ("devlink: remove the registration guarantee of references")
> >> This commit removed that. But it is a different usage.
> >> 
> >> 
> >> 
> >> > 	kfree(devlink);
> >> > }
> >> > 
> >> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
> >> > 	if (ret < 0)
> >> > 		goto err_xa_alloc;
> >> > 
> >> >+	get_device(dev);
> >> > 	devlink->dev = dev;
> >> > 	devlink->ops = ops;
> >> > 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
> >> >-- 
> >> >2.40.1
> >> >
Jiri Pirko June 27, 2023, 10:13 a.m. UTC | #11
Sun, Jun 25, 2023 at 01:55:36PM CEST, idosch@nvidia.com wrote:
>On Thu, Jun 22, 2023 at 08:29:31AM +0200, Jiri Pirko wrote:
>> Wed, Jun 21, 2023 at 05:35:15PM CEST, idosch@nvidia.com wrote:
>> >On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote:
>> >> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
>> >> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
>> >> > 
>> >> > 	mutex_destroy(&devlink->lock);
>> >> > 	lockdep_unregister_key(&devlink->lock_key);
>> >> >+	put_device(devlink->dev);
>> >> 
>> >> In this case I think you have to make sure this is called before
>> >> devlink_free() ends. After the caller of devlink_free() returns (most
>> >> probably .remove callback), nothing stops module from being removed.
>> >> 
>> >> I don't see other way. Utilize complete() here and wait_for_completion()
>> >> at the end of devlink_free().
>> >
>> >I might be missing something, but how can I do something like
>> >wait_for_completion(&devlink->comp) at the end of devlink_free()? After
>> >I call devlink_put() the devlink instance can be freed and the
>> >wait_for_completion() call will result in a UAF.
>> 
>> You have to move the free() to devlink_free()
>> Basically, all the things done in devlink_put that are symmetrical to
>> the initialization done in devlink_alloc() should be moved there.
>
>But it's a bit weird to dereference 'devlink' (to wait for the
>completion) after calling 'devlink_put(devlink)'. Given that this

Well, I don't see any problem in that.

>problem seems to be specific to netdevsim, don't you think it's better
>to fix it in netdevsim rather than working around it in devlink?

Up to you, both work.


>
>> 
>> 
>> >
>> >> 
>> >> If the completion in devlink_put() area rings a bell for you, let me save
>> >> you the trouble looking it up:
>> >> 9053637e0da7 ("devlink: remove the registration guarantee of references")
>> >> This commit removed that. But it is a different usage.
>> >> 
>> >> 
>> >> 
>> >> > 	kfree(devlink);
>> >> > }
>> >> > 
>> >> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
>> >> > 	if (ret < 0)
>> >> > 		goto err_xa_alloc;
>> >> > 
>> >> >+	get_device(dev);
>> >> > 	devlink->dev = dev;
>> >> > 	devlink->ops = ops;
>> >> > 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
>> >> >-- 
>> >> >2.40.1
>> >> >
diff mbox series

Patch

diff --git a/net/devlink/core.c b/net/devlink/core.c
index c23ebabadc52..b191112f57af 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -4,6 +4,7 @@ 
  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
  */
 
+#include <linux/device.h>
 #include <net/genetlink.h>
 
 #include "devl_internal.h"
@@ -91,6 +92,7 @@  static void devlink_release(struct work_struct *work)
 
 	mutex_destroy(&devlink->lock);
 	lockdep_unregister_key(&devlink->lock_key);
+	put_device(devlink->dev);
 	kfree(devlink);
 }
 
@@ -204,6 +206,7 @@  struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
 	if (ret < 0)
 		goto err_xa_alloc;
 
+	get_device(dev);
 	devlink->dev = dev;
 	devlink->ops = ops;
 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);