Message ID | 20230619125015.1541143-2-idosch@nvidia.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | devlink: Acquire device lock during reload | expand |
Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote: >Each devlink instance is associated with a parent device and a pointer >to this device is stored in the devlink structure, but devlink does not >hold a reference on this device. > >This is going to be a problem in the next patch where - among other >things - devlink will acquire the device lock during netns dismantle, >before the reload operation. Since netns dismantle is performed >asynchronously and since a reference is not held on the parent device, >it will be possible to hit a use-after-free. > >Prepare for the upcoming change by holding a reference on the parent >device. > >Unfortunately, with this patch and this reproducer [1], the following >crash can be observed [2]. The last reference is released from the >device asynchronously - after an RCU grace period - when the netdevsim >module is no longer present. This causes device_release() to invoke a >release callback that is no longer present: nsim_bus_dev_release(). > >It's not clear to me if I'm doing something wrong in devlink (I don't >think so), if it's a bug in netdevsim or alternatively a bug in core >driver code that allows the bus module to go away before all the devices >that were connected to it are released. > >The problem can be solved by devlink holding a reference on the backing >module (i.e., dev->driver->owner) or by each netdevsim device holding a >reference on the netdevsim module. However, this will prevent the >removal of the module when devices are present, something that is >possible today. > >[1] >#!/bin/bash > >for i in $(seq 1 1000); do > echo "$i" > insmod drivers/net/netdevsim/netdevsim.ko > echo "10 0" > /sys/bus/netdevsim/new_device > rmmod netdevsim >done > >[2] >BUG: unable to handle page fault for address: ffffffffc0490910 >#PF: supervisor instruction fetch in kernel mode >#PF: error_code(0x0010) - not-present page >PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0 >Oops: 0010 [#1] PREEMPT SMP >CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299 >Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014 >Workqueue: events devlink_release >RIP: 0010:0xffffffffc0490910 >Code: Unable to access opcode bytes at 0xffffffffc04908e6. >RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282 >RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608 >RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800 >RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134 >R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80 >R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05 >FS: 0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000 >CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0 >Call Trace: > <TASK> > ? __die+0x23/0x70 > ? page_fault_oops+0x181/0x470 > ? exc_page_fault+0xa6/0x140 > ? asm_exc_page_fault+0x26/0x30 > ? device_release+0x23/0x90 > ? device_release+0x34/0x90 > ? kobject_put+0x7d/0x1b0 > ? devlink_release+0x16/0x30 > ? process_one_work+0x1e0/0x3d0 > ? worker_thread+0x4e/0x3b0 > ? rescuer_thread+0x3a0/0x3a0 > ? kthread+0xe5/0x120 > ? kthread_complete_and_exit+0x20/0x20 > ? ret_from_fork+0x1f/0x30 > </TASK> >Modules linked in: [last unloaded: netdevsim] > >Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
On Tue, Jun 20, 2023 at 08:23:26AM +0200, Jiri Pirko wrote: > Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote: > >Each devlink instance is associated with a parent device and a pointer > >to this device is stored in the devlink structure, but devlink does not > >hold a reference on this device. > > > >This is going to be a problem in the next patch where - among other > >things - devlink will acquire the device lock during netns dismantle, > >before the reload operation. Since netns dismantle is performed > >asynchronously and since a reference is not held on the parent device, > >it will be possible to hit a use-after-free. > > > >Prepare for the upcoming change by holding a reference on the parent > >device. > > > >Unfortunately, with this patch and this reproducer [1], the following > >crash can be observed [2]. The last reference is released from the > >device asynchronously - after an RCU grace period - when the netdevsim > >module is no longer present. This causes device_release() to invoke a > >release callback that is no longer present: nsim_bus_dev_release(). > > > >It's not clear to me if I'm doing something wrong in devlink (I don't > >think so), if it's a bug in netdevsim or alternatively a bug in core > >driver code that allows the bus module to go away before all the devices > >that were connected to it are released. > > > >The problem can be solved by devlink holding a reference on the backing > >module (i.e., dev->driver->owner) or by each netdevsim device holding a > >reference on the netdevsim module. However, this will prevent the > >removal of the module when devices are present, something that is > >possible today. > > > >[1] > >#!/bin/bash > > > >for i in $(seq 1 1000); do > > echo "$i" > > insmod drivers/net/netdevsim/netdevsim.ko > > echo "10 0" > /sys/bus/netdevsim/new_device > > rmmod netdevsim > >done > > > >[2] > >BUG: unable to handle page fault for address: ffffffffc0490910 > >#PF: supervisor instruction fetch in kernel mode > >#PF: error_code(0x0010) - not-present page > >PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0 > >Oops: 0010 [#1] PREEMPT SMP > >CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299 > >Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014 > >Workqueue: events devlink_release > >RIP: 0010:0xffffffffc0490910 > >Code: Unable to access opcode bytes at 0xffffffffc04908e6. > >RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282 > >RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608 > >RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800 > >RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134 > >R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80 > >R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05 > >FS: 0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000 > >CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0 > >Call Trace: > > <TASK> > > ? __die+0x23/0x70 > > ? page_fault_oops+0x181/0x470 > > ? exc_page_fault+0xa6/0x140 > > ? asm_exc_page_fault+0x26/0x30 > > ? device_release+0x23/0x90 > > ? device_release+0x34/0x90 > > ? kobject_put+0x7d/0x1b0 > > ? devlink_release+0x16/0x30 > > ? process_one_work+0x1e0/0x3d0 > > ? worker_thread+0x4e/0x3b0 > > ? rescuer_thread+0x3a0/0x3a0 > > ? kthread+0xe5/0x120 > > ? kthread_complete_and_exit+0x20/0x20 > > ? ret_from_fork+0x1f/0x30 > > </TASK> > >Modules linked in: [last unloaded: netdevsim] > > > >Signed-off-by: Ido Schimmel <idosch@nvidia.com> > > Reviewed-by: Jiri Pirko <jiri@nvidia.com> Thanks, but I was hoping to get feedback on how to solve the problem mentioned in the commit message :p
On Tue, 20 Jun 2023 10:05:23 +0300 Ido Schimmel wrote: > On Tue, Jun 20, 2023 at 08:23:26AM +0200, Jiri Pirko wrote: > > Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote: > [...] > > > > Reviewed-by: Jiri Pirko <jiri@nvidia.com> > > Thanks, but I was hoping to get feedback on how to solve the problem > mentioned in the commit message :p Do we need to hold the reference on the device until release? I think you can release it in devlink_free(). The only valid fields for an unregistered devlink instance are: - lock - refcount - index And obviously unregistered devices can't be reloaded.
On Tue, Jun 20, 2023 at 10:43:51AM -0700, Jakub Kicinski wrote: > Do we need to hold the reference on the device until release? > I think you can release it in devlink_free(). > The only valid fields for an unregistered devlink instance are: > - lock > - refcount > - index > > And obviously unregistered devices can't be reloaded. Thanks for taking a look. Moving the release to devlink_free() [1] was the first thing I tried and it indeed solves the problem I mentioned earlier, but creates a new one. After devlink_free() returns the devlink instance can still be accessed by user space in devlink_get_from_attrs_lock(). If I reload in a loop while concurrently removing and adding the device [2], we can hit a UAF when trying to acquire the device lock [3]. [1] diff --git a/net/devlink/core.c b/net/devlink/core.c index a4b6d548e50c..b60a8463d6e0 100644 --- a/net/devlink/core.c +++ b/net/devlink/core.c @@ -92,7 +92,6 @@ static void devlink_release(struct work_struct *work) mutex_destroy(&devlink->lock); lockdep_unregister_key(&devlink->lock_key); - put_device(devlink->dev); kfree(devlink); } @@ -264,6 +263,8 @@ void devlink_free(struct devlink *devlink) xa_erase(&devlinks, devlink->index); + put_device(devlink->dev); + devlink_put(devlink); } EXPORT_SYMBOL_GPL(devlink_free); [2] while true; do devlink dev reload netdevsim/netdevsim10 &> /dev/null; done & while true; do echo "10 0" > /sys/bus/netdevsim/new_device; echo 10 > /sys/bus/netdevsim/del_device; done [3] [ 96.081096] ================================================================== [ 96.082158] BUG: KASAN: slab-use-after-free in __mutex_lock+0x18a7/0x1ac0 [ 96.083107] Read of size 8 at addr ffff888109caa8d8 by task devlink/429 [ 96.084266] CPU: 0 PID: 429 Comm: devlink Not tainted 6.4.0-rc6-custom-gb01b0912311c-dirty #303 [ 96.085443] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014 [ 96.086632] Call Trace: [ 96.086990] <TASK> [ 96.087314] dump_stack_lvl+0x91/0xf0 [ 96.087852] print_report+0xcf/0x660 [ 96.088384] ? __virt_addr_valid+0x86/0x360 [ 96.088980] ? __mutex_lock+0x18a7/0x1ac0 [ 96.089558] ? __mutex_lock+0x18a7/0x1ac0 [ 96.090132] kasan_report+0xd6/0x110 [ 96.090667] ? __mutex_lock+0x18a7/0x1ac0 [ 96.091245] __mutex_lock+0x18a7/0x1ac0 [ 96.091898] ? devlink_get_from_attrs_lock+0x2bc/0x460 [ 96.092645] ? mutex_lock_io_nested+0x18b0/0x18b0 [ 96.093323] ? reacquire_held_locks+0x4e0/0x4e0 [ 96.093972] ? devlink_try_get+0x158/0x1e0 [ 96.094586] ? devlink_get_from_attrs_lock+0x460/0x460 [ 96.095325] ? devlink_get_from_attrs_lock+0x2bc/0x460 [ 96.096053] devlink_get_from_attrs_lock+0x2bc/0x460 [ 96.096767] ? devlink_nl_post_doit+0xf0/0xf0 [ 96.097409] ? __nla_parse+0x40/0x50 [ 96.097943] ? devlink_get_from_attrs_lock+0x460/0x460 [ 96.098671] devlink_nl_pre_doit+0xb3/0x480 [ 96.099255] genl_family_rcv_msg_doit.isra.0+0x1b8/0x2e0 [ 96.099966] ? genl_start+0x670/0x670 [ 96.100487] ? ns_capable+0xda/0x110 [ 96.100991] genl_rcv_msg+0x558/0x7f0 [ 96.101516] ? genl_family_rcv_msg_doit.isra.0+0x2e0/0x2e0 [ 96.102260] ? devlink_get_from_attrs_lock+0x460/0x460 [ 96.102925] ? devlink_reload+0x540/0x540 [ 96.103429] ? devlink_pernet_pre_exit+0x340/0x340 [ 96.104017] netlink_rcv_skb+0x170/0x440 [ 96.104508] ? genl_family_rcv_msg_doit.isra.0+0x2e0/0x2e0 [ 96.105166] ? netlink_ack+0x1380/0x1380 [ 96.105662] ? lock_contended+0xc70/0xc70 [ 96.106172] ? rwsem_down_read_slowpath+0xda0/0xda0 [ 96.106767] ? netlink_deliver_tap+0x1b6/0xd90 [ 96.107317] ? is_vmalloc_addr+0x8b/0xb0 [ 96.107804] genl_rcv+0x2d/0x40 [ 96.108213] netlink_unicast+0x53f/0x810 [ 96.108698] ? netlink_attachskb+0x870/0x870 [ 96.109230] ? lock_release+0x3ac/0xbb0 [ 96.109714] netlink_sendmsg+0x95c/0xe80 [ 96.110206] ? netlink_unicast+0x810/0x810 [ 96.110711] ? __might_fault+0x15b/0x190 [ 96.111194] ? _copy_from_user+0x9f/0xd0 [ 96.111691] __sys_sendto+0x2aa/0x420 [ 96.112148] ? __ia32_sys_getpeername+0xb0/0xb0 [ 96.112706] ? reacquire_held_locks+0x4e0/0x4e0 [ 96.113275] ? rcu_is_watching+0x12/0xb0 [ 96.113769] ? blkcg_exit_disk+0x50/0x50 [ 96.114260] __x64_sys_sendto+0xe5/0x1c0 [ 96.114742] ? lockdep_hardirqs_on+0x7d/0x100 [ 96.115285] ? syscall_enter_from_user_mode+0x20/0x50 [ 96.115899] do_syscall_64+0x38/0x80 [ 96.116352] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 96.116964] RIP: 0033:0x7fa073f72fa7 [ 96.117417] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 80 3d 5d d6 0c 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 c3 55 48 83 ec 30 44 89 4c 24 2c 4c 89 44 [ 96.119548] RSP: 002b:00007ffca2723938 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 96.120449] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fa073f72fa7 [ 96.121287] RDX: 0000000000000034 RSI: 000055883950a460 RDI: 0000000000000003 [ 96.122120] RBP: 0000558837836e60 R08: 00007fa074050200 R09: 000000000000000c [ 96.122951] R10: 0000000000000000 R11: 0000000000000202 R12: 000055883950a2a0 [ 96.123789] R13: 000055883950a460 R14: 000055883784eeab R15: 000055883950a2a0 [ 96.124638] </TASK> [ 96.125120] Allocated by task 209: [ 96.125543] kasan_save_stack+0x33/0x50 [ 96.125567] kasan_set_track+0x25/0x30 [ 96.125587] __kasan_kmalloc+0x87/0x90 [ 96.125606] new_device_store+0x22d/0x6a0 [ 96.125625] bus_attr_store+0x7b/0xa0 [ 96.125641] sysfs_kf_write+0x11c/0x170 [ 96.125659] kernfs_fop_write_iter+0x3f7/0x600 [ 96.125676] vfs_write+0x680/0xe90 [ 96.125691] ksys_write+0x143/0x270 [ 96.125707] do_syscall_64+0x38/0x80 [ 96.125722] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 96.125947] Freed by task 209: [ 96.126332] kasan_save_stack+0x33/0x50 [ 96.126355] kasan_set_track+0x25/0x30 [ 96.126374] kasan_save_free_info+0x2e/0x40 [ 96.126389] __kasan_slab_free+0x10a/0x180 [ 96.126409] __kmem_cache_free+0x8a/0x1a0 [ 96.126429] device_release+0xa6/0x240 [ 96.126449] kobject_put+0x1f7/0x5b0 [ 96.126465] device_unregister+0x34/0xc0 [ 96.126478] del_device_store+0x308/0x430 [ 96.126496] bus_attr_store+0x7b/0xa0 [ 96.126511] sysfs_kf_write+0x11c/0x170 [ 96.126528] kernfs_fop_write_iter+0x3f7/0x600 [ 96.126545] vfs_write+0x680/0xe90 [ 96.126560] ksys_write+0x143/0x270 [ 96.126575] do_syscall_64+0x38/0x80 [ 96.126590] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 96.126815] The buggy address belongs to the object at ffff888109caa800 which belongs to the cache kmalloc-1k of size 1024 [ 96.128252] The buggy address is located 216 bytes inside of freed 1024-byte region [ffff888109caa800, ffff888109caac00) [ 96.129875] The buggy address belongs to the physical page: [ 96.130540] page:00000000c7912b23 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109ca8 [ 96.130562] head:00000000c7912b23 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 96.130575] flags: 0x200000000010200(slab|head|node=0|zone=2) [ 96.130591] page_type: 0xffffffff() [ 96.130607] raw: 0200000000010200 ffff888100041dc0 dead000000000100 dead000000000122 [ 96.130622] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 [ 96.130632] page dumped because: kasan: bad access detected [ 96.130844] Memory state around the buggy address: [ 96.131424] ffff888109caa780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 96.132266] ffff888109caa800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 96.133101] >ffff888109caa880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 96.133944] ^ [ 96.134666] ffff888109caa900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 96.135510] ffff888109caa980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 96.136351] ==================================================================
Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote: >Each devlink instance is associated with a parent device and a pointer >to this device is stored in the devlink structure, but devlink does not >hold a reference on this device. > >This is going to be a problem in the next patch where - among other >things - devlink will acquire the device lock during netns dismantle, >before the reload operation. Since netns dismantle is performed >asynchronously and since a reference is not held on the parent device, >it will be possible to hit a use-after-free. > >Prepare for the upcoming change by holding a reference on the parent >device. > >Unfortunately, with this patch and this reproducer [1], the following >crash can be observed [2]. The last reference is released from the >device asynchronously - after an RCU grace period - when the netdevsim >module is no longer present. This causes device_release() to invoke a >release callback that is no longer present: nsim_bus_dev_release(). > >It's not clear to me if I'm doing something wrong in devlink (I don't >think so), if it's a bug in netdevsim or alternatively a bug in core >driver code that allows the bus module to go away before all the devices >that were connected to it are released. > >The problem can be solved by devlink holding a reference on the backing >module (i.e., dev->driver->owner) or by each netdevsim device holding a >reference on the netdevsim module. However, this will prevent the >removal of the module when devices are present, something that is >possible today. > >[1] >#!/bin/bash > >for i in $(seq 1 1000); do > echo "$i" > insmod drivers/net/netdevsim/netdevsim.ko > echo "10 0" > /sys/bus/netdevsim/new_device > rmmod netdevsim >done > >[2] >BUG: unable to handle page fault for address: ffffffffc0490910 >#PF: supervisor instruction fetch in kernel mode >#PF: error_code(0x0010) - not-present page >PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0 >Oops: 0010 [#1] PREEMPT SMP >CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299 >Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014 >Workqueue: events devlink_release >RIP: 0010:0xffffffffc0490910 >Code: Unable to access opcode bytes at 0xffffffffc04908e6. >RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282 >RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608 >RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800 >RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134 >R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80 >R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05 >FS: 0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000 >CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0 >Call Trace: > <TASK> > ? __die+0x23/0x70 > ? page_fault_oops+0x181/0x470 > ? exc_page_fault+0xa6/0x140 > ? asm_exc_page_fault+0x26/0x30 > ? device_release+0x23/0x90 > ? device_release+0x34/0x90 > ? kobject_put+0x7d/0x1b0 > ? devlink_release+0x16/0x30 > ? process_one_work+0x1e0/0x3d0 > ? worker_thread+0x4e/0x3b0 > ? rescuer_thread+0x3a0/0x3a0 > ? kthread+0xe5/0x120 > ? kthread_complete_and_exit+0x20/0x20 > ? ret_from_fork+0x1f/0x30 > </TASK> >Modules linked in: [last unloaded: netdevsim] > >Signed-off-by: Ido Schimmel <idosch@nvidia.com> >--- > net/devlink/core.c | 3 +++ > 1 file changed, 3 insertions(+) > >diff --git a/net/devlink/core.c b/net/devlink/core.c >index c23ebabadc52..b191112f57af 100644 >--- a/net/devlink/core.c >+++ b/net/devlink/core.c >@@ -4,6 +4,7 @@ > * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com> > */ > >+#include <linux/device.h> > #include <net/genetlink.h> > > #include "devl_internal.h" >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work) > > mutex_destroy(&devlink->lock); > lockdep_unregister_key(&devlink->lock_key); >+ put_device(devlink->dev); In this case I think you have to make sure this is called before devlink_free() ends. After the caller of devlink_free() returns (most probably .remove callback), nothing stops module from being removed. I don't see other way. Utilize complete() here and wait_for_completion() at the end of devlink_free(). If the completion in devlink_put() area rings a bell for you, let me save you the trouble looking it up: 9053637e0da7 ("devlink: remove the registration guarantee of references") This commit removed that. But it is a different usage. > kfree(devlink); > } > >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops, > if (ret < 0) > goto err_xa_alloc; > >+ get_device(dev); > devlink->dev = dev; > devlink->ops = ops; > xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC); >-- >2.40.1 >
On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote: > Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote: > >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work) > > > > mutex_destroy(&devlink->lock); > > lockdep_unregister_key(&devlink->lock_key); > >+ put_device(devlink->dev); > > In this case I think you have to make sure this is called before > devlink_free() ends. After the caller of devlink_free() returns (most > probably .remove callback), nothing stops module from being removed. > > I don't see other way. Utilize complete() here and wait_for_completion() > at the end of devlink_free(). I might be missing something, but how can I do something like wait_for_completion(&devlink->comp) at the end of devlink_free()? After I call devlink_put() the devlink instance can be freed and the wait_for_completion() call will result in a UAF. > > If the completion in devlink_put() area rings a bell for you, let me save > you the trouble looking it up: > 9053637e0da7 ("devlink: remove the registration guarantee of references") > This commit removed that. But it is a different usage. > > > > > kfree(devlink); > > } > > > >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops, > > if (ret < 0) > > goto err_xa_alloc; > > > >+ get_device(dev); > > devlink->dev = dev; > > devlink->ops = ops; > > xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC); > >-- > >2.40.1 > >
On Wed, 21 Jun 2023 09:31:43 +0300 Ido Schimmel wrote: > Thanks for taking a look. > > Moving the release to devlink_free() [1] was the first thing I tried and > it indeed solves the problem I mentioned earlier, but creates a new one. > After devlink_free() returns the devlink instance can still be accessed > by user space in devlink_get_from_attrs_lock(). If I reload in a loop > while concurrently removing and adding the device [2], we can hit a UAF > when trying to acquire the device lock [3]. Ugh, I didn't look at the second patch, it's taking the device lock before validating that the devlink instance is registered. So we need to extend the list of fields which must always be valid :( Let's try to fix it at the netdevsim level then? AFAIU we only need the bus to remain loaded for nsim_bus_dev_release to exist? What if we split netdevsim into two modules, put the bus stuff in a new module called netdevsim_bus, and leave the rest (driver) in just netdevsim. That way we can take a ref on netdevsim_bus until all devices are gone, and still load / unload netdevsim. With unload resulting in all devices getting auto-deleted. I haven't looked in detail so maybe you'll immediately tell me it won't work, but I'm guessing this is how "real" buses work avoid the problem?
On Wed, Jun 21, 2023 at 12:03:57PM -0700, Jakub Kicinski wrote: > Let's try to fix it at the netdevsim level then? I think it's a good direction. I couldn't find a way to fix it in devlink and the problem does seem specific to netdevsim. > AFAIU we only need the bus to remain loaded for nsim_bus_dev_release > to exist? That's my understanding as well. > What if we split netdevsim into two modules, put the bus stuff in a > new module called netdevsim_bus, and leave the rest (driver) in just > netdevsim. That way we can take a ref on netdevsim_bus until all > devices are gone, and still load / unload netdevsim. With unload > resulting in all devices getting auto-deleted. > > I haven't looked in detail so maybe you'll immediately tell me it won't > work, but I'm guessing this is how "real" buses work avoid the problem? At least with PCI (which I believe is the bus backing most devlink users), the release callback is builtin - not part of a module - so this problem can't happen there. Anyway, I will explore this direction. Thanks!
Wed, Jun 21, 2023 at 05:35:15PM CEST, idosch@nvidia.com wrote: >On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote: >> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote: >> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work) >> > >> > mutex_destroy(&devlink->lock); >> > lockdep_unregister_key(&devlink->lock_key); >> >+ put_device(devlink->dev); >> >> In this case I think you have to make sure this is called before >> devlink_free() ends. After the caller of devlink_free() returns (most >> probably .remove callback), nothing stops module from being removed. >> >> I don't see other way. Utilize complete() here and wait_for_completion() >> at the end of devlink_free(). > >I might be missing something, but how can I do something like >wait_for_completion(&devlink->comp) at the end of devlink_free()? After >I call devlink_put() the devlink instance can be freed and the >wait_for_completion() call will result in a UAF. You have to move the free() to devlink_free() Basically, all the things done in devlink_put that are symmetrical to the initialization done in devlink_alloc() should be moved there. > >> >> If the completion in devlink_put() area rings a bell for you, let me save >> you the trouble looking it up: >> 9053637e0da7 ("devlink: remove the registration guarantee of references") >> This commit removed that. But it is a different usage. >> >> >> >> > kfree(devlink); >> > } >> > >> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops, >> > if (ret < 0) >> > goto err_xa_alloc; >> > >> >+ get_device(dev); >> > devlink->dev = dev; >> > devlink->ops = ops; >> > xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC); >> >-- >> >2.40.1 >> >
On Thu, Jun 22, 2023 at 08:29:31AM +0200, Jiri Pirko wrote: > Wed, Jun 21, 2023 at 05:35:15PM CEST, idosch@nvidia.com wrote: > >On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote: > >> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote: > >> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work) > >> > > >> > mutex_destroy(&devlink->lock); > >> > lockdep_unregister_key(&devlink->lock_key); > >> >+ put_device(devlink->dev); > >> > >> In this case I think you have to make sure this is called before > >> devlink_free() ends. After the caller of devlink_free() returns (most > >> probably .remove callback), nothing stops module from being removed. > >> > >> I don't see other way. Utilize complete() here and wait_for_completion() > >> at the end of devlink_free(). > > > >I might be missing something, but how can I do something like > >wait_for_completion(&devlink->comp) at the end of devlink_free()? After > >I call devlink_put() the devlink instance can be freed and the > >wait_for_completion() call will result in a UAF. > > You have to move the free() to devlink_free() > Basically, all the things done in devlink_put that are symmetrical to > the initialization done in devlink_alloc() should be moved there. But it's a bit weird to dereference 'devlink' (to wait for the completion) after calling 'devlink_put(devlink)'. Given that this problem seems to be specific to netdevsim, don't you think it's better to fix it in netdevsim rather than working around it in devlink? > > > > > >> > >> If the completion in devlink_put() area rings a bell for you, let me save > >> you the trouble looking it up: > >> 9053637e0da7 ("devlink: remove the registration guarantee of references") > >> This commit removed that. But it is a different usage. > >> > >> > >> > >> > kfree(devlink); > >> > } > >> > > >> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops, > >> > if (ret < 0) > >> > goto err_xa_alloc; > >> > > >> >+ get_device(dev); > >> > devlink->dev = dev; > >> > devlink->ops = ops; > >> > xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC); > >> >-- > >> >2.40.1 > >> >
Sun, Jun 25, 2023 at 01:55:36PM CEST, idosch@nvidia.com wrote: >On Thu, Jun 22, 2023 at 08:29:31AM +0200, Jiri Pirko wrote: >> Wed, Jun 21, 2023 at 05:35:15PM CEST, idosch@nvidia.com wrote: >> >On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote: >> >> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote: >> >> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work) >> >> > >> >> > mutex_destroy(&devlink->lock); >> >> > lockdep_unregister_key(&devlink->lock_key); >> >> >+ put_device(devlink->dev); >> >> >> >> In this case I think you have to make sure this is called before >> >> devlink_free() ends. After the caller of devlink_free() returns (most >> >> probably .remove callback), nothing stops module from being removed. >> >> >> >> I don't see other way. Utilize complete() here and wait_for_completion() >> >> at the end of devlink_free(). >> > >> >I might be missing something, but how can I do something like >> >wait_for_completion(&devlink->comp) at the end of devlink_free()? After >> >I call devlink_put() the devlink instance can be freed and the >> >wait_for_completion() call will result in a UAF. >> >> You have to move the free() to devlink_free() >> Basically, all the things done in devlink_put that are symmetrical to >> the initialization done in devlink_alloc() should be moved there. > >But it's a bit weird to dereference 'devlink' (to wait for the >completion) after calling 'devlink_put(devlink)'. Given that this Well, I don't see any problem in that. >problem seems to be specific to netdevsim, don't you think it's better >to fix it in netdevsim rather than working around it in devlink? Up to you, both work. > >> >> >> > >> >> >> >> If the completion in devlink_put() area rings a bell for you, let me save >> >> you the trouble looking it up: >> >> 9053637e0da7 ("devlink: remove the registration guarantee of references") >> >> This commit removed that. But it is a different usage. >> >> >> >> >> >> >> >> > kfree(devlink); >> >> > } >> >> > >> >> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops, >> >> > if (ret < 0) >> >> > goto err_xa_alloc; >> >> > >> >> >+ get_device(dev); >> >> > devlink->dev = dev; >> >> > devlink->ops = ops; >> >> > xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC); >> >> >-- >> >> >2.40.1 >> >> >
diff --git a/net/devlink/core.c b/net/devlink/core.c index c23ebabadc52..b191112f57af 100644 --- a/net/devlink/core.c +++ b/net/devlink/core.c @@ -4,6 +4,7 @@ * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com> */ +#include <linux/device.h> #include <net/genetlink.h> #include "devl_internal.h" @@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work) mutex_destroy(&devlink->lock); lockdep_unregister_key(&devlink->lock_key); + put_device(devlink->dev); kfree(devlink); } @@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops, if (ret < 0) goto err_xa_alloc; + get_device(dev); devlink->dev = dev; devlink->ops = ops; xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
Each devlink instance is associated with a parent device and a pointer to this device is stored in the devlink structure, but devlink does not hold a reference on this device. This is going to be a problem in the next patch where - among other things - devlink will acquire the device lock during netns dismantle, before the reload operation. Since netns dismantle is performed asynchronously and since a reference is not held on the parent device, it will be possible to hit a use-after-free. Prepare for the upcoming change by holding a reference on the parent device. Unfortunately, with this patch and this reproducer [1], the following crash can be observed [2]. The last reference is released from the device asynchronously - after an RCU grace period - when the netdevsim module is no longer present. This causes device_release() to invoke a release callback that is no longer present: nsim_bus_dev_release(). It's not clear to me if I'm doing something wrong in devlink (I don't think so), if it's a bug in netdevsim or alternatively a bug in core driver code that allows the bus module to go away before all the devices that were connected to it are released. The problem can be solved by devlink holding a reference on the backing module (i.e., dev->driver->owner) or by each netdevsim device holding a reference on the netdevsim module. However, this will prevent the removal of the module when devices are present, something that is possible today. [1] #!/bin/bash for i in $(seq 1 1000); do echo "$i" insmod drivers/net/netdevsim/netdevsim.ko echo "10 0" > /sys/bus/netdevsim/new_device rmmod netdevsim done [2] BUG: unable to handle page fault for address: ffffffffc0490910 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0 Oops: 0010 [#1] PREEMPT SMP CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014 Workqueue: events devlink_release RIP: 0010:0xffffffffc0490910 Code: Unable to access opcode bytes at 0xffffffffc04908e6. RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282 RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608 RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800 RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134 R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80 R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05 FS: 0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0 Call Trace: <TASK> ? __die+0x23/0x70 ? page_fault_oops+0x181/0x470 ? exc_page_fault+0xa6/0x140 ? asm_exc_page_fault+0x26/0x30 ? device_release+0x23/0x90 ? device_release+0x34/0x90 ? kobject_put+0x7d/0x1b0 ? devlink_release+0x16/0x30 ? process_one_work+0x1e0/0x3d0 ? worker_thread+0x4e/0x3b0 ? rescuer_thread+0x3a0/0x3a0 ? kthread+0xe5/0x120 ? kthread_complete_and_exit+0x20/0x20 ? ret_from_fork+0x1f/0x30 </TASK> Modules linked in: [last unloaded: netdevsim] Signed-off-by: Ido Schimmel <idosch@nvidia.com> --- net/devlink/core.c | 3 +++ 1 file changed, 3 insertions(+)