Message ID | 2fdc7b4e11c3283cd65c7cf77c81bd6687a32c20.1629844159.git.dcaratti@redhat.com (mailing list archive) |
---|---|
State | Accepted |
Commit | cd9b50adc6bb9ad3f7d244590a389522215865c4 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [PATH,net] net/sched: ets: fix crash when flipping from 'strict' to 'quantum' | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | success | CCed 7 of 7 maintainers |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 0 this patch: 0 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 13 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 0 this patch: 0 |
netdev/header_inline | success | Link |
Hello: This patch was applied to netdev/net.git (refs/heads/master): On Wed, 25 Aug 2021 00:33:48 +0200 you wrote: > While running kselftests, Hangbin observed that sch_ets.sh often crashes, > and splats like the following one are seen in the output of 'dmesg': > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 159f12067 P4D 159f12067 PUD 159f13067 PMD 0 > Oops: 0000 [#1] SMP NOPTI > CPU: 2 PID: 921 Comm: tc Not tainted 5.14.0-rc6+ #458 > Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 > RIP: 0010:__list_del_entry_valid+0x2d/0x50 > Code: 48 8b 57 08 48 b9 00 01 00 00 00 00 ad de 48 39 c8 0f 84 ac 6e 5b 00 48 b9 22 01 00 00 00 00 ad de 48 39 ca 0f 84 cf 6e 5b 00 <48> 8b 32 48 39 fe 0f 85 af 6e 5b 00 48 8b 50 08 48 39 f2 0f 85 94 > RSP: 0018:ffffb2da005c3890 EFLAGS: 00010217 > RAX: 0000000000000000 RBX: ffff9073ba23f800 RCX: dead000000000122 > RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff9073ba23fbc8 > RBP: ffff9073ba23f890 R08: 0000000000000001 R09: 0000000000000001 > R10: 0000000000000001 R11: 0000000000000001 R12: dead000000000100 > R13: ffff9073ba23fb00 R14: 0000000000000002 R15: 0000000000000002 > FS: 00007f93e5564e40(0000) GS:ffff9073bba00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000000 CR3: 000000014ad34000 CR4: 0000000000350ee0 > Call Trace: > ets_qdisc_reset+0x6e/0x100 [sch_ets] > qdisc_reset+0x49/0x1d0 > tbf_reset+0x15/0x60 [sch_tbf] > qdisc_reset+0x49/0x1d0 > dev_reset_queue.constprop.42+0x2f/0x90 > dev_deactivate_many+0x1d3/0x3d0 > dev_deactivate+0x56/0x90 > qdisc_graft+0x47e/0x5a0 > tc_get_qdisc+0x1db/0x3e0 > rtnetlink_rcv_msg+0x164/0x4c0 > netlink_rcv_skb+0x50/0x100 > netlink_unicast+0x1a5/0x280 > netlink_sendmsg+0x242/0x480 > sock_sendmsg+0x5b/0x60 > ____sys_sendmsg+0x1f2/0x260 > ___sys_sendmsg+0x7c/0xc0 > __sys_sendmsg+0x57/0xa0 > do_syscall_64+0x3a/0x80 > entry_SYSCALL_64_after_hwframe+0x44/0xae > RIP: 0033:0x7f93e44b8338 > Code: 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 43 2c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55 > RSP: 002b:00007ffc0db737a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e > RAX: ffffffffffffffda RBX: 0000000061255c06 RCX: 00007f93e44b8338 > RDX: 0000000000000000 RSI: 00007ffc0db73810 RDI: 0000000000000003 > RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 > R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000001 > R13: 0000000000687880 R14: 0000000000000000 R15: 0000000000000000 > Modules linked in: sch_ets sch_tbf dummy rfkill iTCO_wdt iTCO_vendor_support intel_rapl_msr intel_rapl_common joydev i2c_i801 pcspkr i2c_smbus lpc_ich virtio_balloon ip_tables xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci ghash_clmulni_intel libata serio_raw virtio_blk virtio_console virtio_net net_failover failover sunrpc dm_mirror dm_region_hash dm_log dm_mod > CR2: 0000000000000000 > > [...] Here is the summary with links: - [PATH,net] net/sched: ets: fix crash when flipping from 'strict' to 'quantum' https://git.kernel.org/netdev/net/c/cd9b50adc6bb You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
On Tue, Aug 24, 2021 at 3:34 PM Davide Caratti <dcaratti@redhat.com> wrote: > When the change() function decreases the value of 'nstrict', we must take > into account that packets might be already enqueued on a class that flips > from 'strict' to 'quantum': otherwise that class will not be added to the > bandwidth-sharing list. Then, a call to ets_qdisc_reset() will attempt to > do list_del(&alist) with 'alist' filled with zero, hence the NULL pointer > dereference. I am confused about how we end up having NULL in list head. From your changelog, you imply it happens when we change an existing Qdisc, but I don't see any place that could set this list head to NULL. list_del() clearly doesn't set NULL. But if it is a new Qdisc, Qdisc is allocated with zero's hence having NULL as list head. However, in this case, q->nstrict should be 0 before the loop, so I don't think your code helps at all as the for loop can't even be entered? Thanks.
hello Cong, thanks a lot for looking at this! On Mon, Aug 30, 2021 at 05:43:09PM -0700, Cong Wang wrote: > On Tue, Aug 24, 2021 at 3:34 PM Davide Caratti <dcaratti@redhat.com> wrote: > > When the change() function decreases the value of 'nstrict', we must take > > into account that packets might be already enqueued on a class that flips > > from 'strict' to 'quantum': otherwise that class will not be added to the > > bandwidth-sharing list. Then, a call to ets_qdisc_reset() will attempt to > > do list_del(&alist) with 'alist' filled with zero, hence the NULL pointer > > dereference. > > I am confused about how we end up having NULL in list head. > > From your changelog, you imply it happens when we change an existing > Qdisc, but I don't see any place that could set this list head to NULL. > list_del() clearly doesn't set NULL. right, the problem happens when we try to *decrease* the value of 'nstrict' while traffic is being loaded. ETS classes from 0 to nstrict - 1 don't initialize this list at all, nor they use it. The initialization happens when the first packet is enqueued to one of the bandwidth-sharing classes (from nstrict to nbands - 1), see [1]). However, if we modify the value of 'nstrict' while q->classes[i].qdisc->q.len is greater than zero, the list initialization is probably not going to happen, so the struct q->classes[i].alist remains filled with zero even since the first allocation, like you mentioned below: > But if it is a new Qdisc, Qdisc is allocated with zero's hence having NULL > as list head. However, in this case, q->nstrict should be 0 before the loop, ^^ this. But you are wrong, q->nstrict can be any value from 0 to 16 (inclusive) before the loop. As the value of 'nstrict' is reduced (e.g. from 4 to 0), the code can reach the loop in ets_qdisc_change() with the following 4 conditions simultaneously true: 1) nstrict = 0 2) q->nstrict = 4 3) q->classes[2].qdisc->q.qlen greater than zero 4) q->classes[2].alist filled with 0 then, the value of q->nstrict is updated to 0. After that, ets_qdisc_reset() can be called on unpatched Linux with the following 3 conditions simultaneously true: a) q->nstrict = 0 b) q->classes[2].qdisc->q.qlen greater than zero c) q->classes[2].alist filled with 0 that leads to the reported crash. > so I don't think your code helps at all as the for loop can't even be entered? The first code I tried was just doing INIT_LIST_HEAD(&q->classes[i].alist), with i ranging from nstrict to q->nstrict - 1: it fixed the crash in ets_qdisc_reset(). But then I observed that classes changing from strict to quantum were not sharing any bandwidth at all, and they had a non-empty backlog even after I stopped all the traffic: so, I added the if (q->classes[i].qdisc->q.qlen) { list_add_tail(&q->classes[i].alist, &q->active); q->classes[i].deficit = quanta[i]; } part, that updates the DRR list with non-empty classes that are changing from strict to DRR. After that, I verified that all classes were depleted correctly. On a second thought, this INIT_LIST_HEAD(&q->classes[i].alist) line is no more useful. If q->classes[i].qdisc->q.qlen is 0, either it remains 0 until the call to ets_qdisc_reset(), or some packet is enqueued after q->nstrict is updated, and the enqueueing of a 'first' packet [1] will fix the value of q->classes[i].alist. Finally, if q->classes[i].qdisc->q.qlen is bigger than zero, the list_add_tail() part of my patch covers the missing update of the DRR list. In all these cases, the NULL dereference in ets_qdisc_reset() is prevented. So, I can probably send a patch (for net-next, when it reopens) that removes this INIT_LIST_HEAD() line; anyway, its presence is harmless IMO. WDYT? > > Thanks. BTW, please find below a reproducer that's more efficient than kselftests in seeing the problem: 1 #!/bin/bash 2 ip link del dev ddd0 >/dev/null 2>&1 3 ip link add dev ddd0 type dummy 4 ip link set dev ddd0 up 5 tc qdisc add dev ddd0 handle 1: root tbf rate 100kbit latency 1000ms burst 1Mbit 6 tc qdisc add dev ddd0 handle 10: parent 1: ets bands 4 strict 4 priomap 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 7 tc qdisc show dev ddd0 8 mausezahn ddd0 -A 10.10.10.1 -B 10.10.10.2 -c 0 -a own -b 00:c1:a0:c1:a0:00 -t udp & 9 mpid=$! 10 sleep 5 11 tc qdisc change dev ddd0 handle 10: ets bands 8 quanta 2500 2500 2500 2500 2500 2500 2500 2500 priomap 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 12 tc qdisc del dev ddd0 root 13 sleep 1 14 tc qdisc show dev ddd0 15 kill $mpid 16 rmmod sch_ets Line 11 changes the ets qdisc from 4 prio to 8 DRR, so the value of q->nstrict changes from 4 to 0. The crash in ets_qdisc_reset() happens with band number #2 (that's the only class being loaded with traffic during the test). thanks!
On Tue, Aug 31, 2021 at 2:54 AM Davide Caratti <dcaratti@redhat.com> wrote: > > hello Cong, thanks a lot for looking at this! > > On Mon, Aug 30, 2021 at 05:43:09PM -0700, Cong Wang wrote: > > On Tue, Aug 24, 2021 at 3:34 PM Davide Caratti <dcaratti@redhat.com> wrote: > > > When the change() function decreases the value of 'nstrict', we must take > > > into account that packets might be already enqueued on a class that flips > > > from 'strict' to 'quantum': otherwise that class will not be added to the > > > bandwidth-sharing list. Then, a call to ets_qdisc_reset() will attempt to > > > do list_del(&alist) with 'alist' filled with zero, hence the NULL pointer > > > dereference. > > > > I am confused about how we end up having NULL in list head. > > > > From your changelog, you imply it happens when we change an existing > > Qdisc, but I don't see any place that could set this list head to NULL. > > list_del() clearly doesn't set NULL. > > right, the problem happens when we try to *decrease* the value of 'nstrict' > while traffic is being loaded. > > ETS classes from 0 to nstrict - 1 don't initialize this list at all, nor they > use it. The initialization happens when the first packet is enqueued to one of > the bandwidth-sharing classes (from nstrict to nbands - 1), see [1]). > > However, if we modify the value of 'nstrict' while q->classes[i].qdisc->q.len is > greater than zero, the list initialization is probably not going to happen, so > the struct > > q->classes[i].alist > > remains filled with zero even since the first allocation, like you mentioned below: > > > But if it is a new Qdisc, Qdisc is allocated with zero's hence having NULL > > as list head. However, in this case, q->nstrict should be 0 before the loop, > > ^^ this. But you are wrong, q->nstrict can be any value from 0 to 16 (inclusive) > before the loop. As the value of 'nstrict' is reduced (e.g. from 4 to 0), the code > can reach the loop in ets_qdisc_change() with the following 4 conditions > simultaneously true: > > 1) nstrict = 0 > 2) q->nstrict = 4 > 3) q->classes[2].qdisc->q.qlen greater than zero > 4) q->classes[2].alist filled with 0 > > then, the value of q->nstrict is updated to 0. After that, ets_qdisc_reset() can be > called on unpatched Linux with the following 3 conditions simultaneously true: > > a) q->nstrict = 0 > b) q->classes[2].qdisc->q.qlen greater than zero > c) q->classes[2].alist filled with 0 > > that leads to the reported crash. > > > so I don't think your code helps at all as the for loop can't even be entered? > > The first code I tried was just doing INIT_LIST_HEAD(&q->classes[i].alist), with > i ranging from nstrict to q->nstrict - 1: it fixed the crash in ets_qdisc_reset(). > > But then I observed that classes changing from strict to quantum were not sharing > any bandwidth at all, and they had a non-empty backlog even after I stopped all > the traffic: so, I added the > > if (q->classes[i].qdisc->q.qlen) { > list_add_tail(&q->classes[i].alist, &q->active); > q->classes[i].deficit = quanta[i]; > } > > part, that updates the DRR list with non-empty classes that are changing from > strict to DRR. After that, I verified that all classes were depleted correctly. > > On a second thought, this INIT_LIST_HEAD(&q->classes[i].alist) line is no more > useful. If q->classes[i].qdisc->q.qlen is 0, either it remains 0 until the call > to ets_qdisc_reset(), or some packet is enqueued after q->nstrict is updated, > and the enqueueing of a 'first' packet [1] will fix the value of > q->classes[i].alist. > Finally, if q->classes[i].qdisc->q.qlen is bigger than zero, the list_add_tail() > part of my patch covers the missing update of the DRR list. In all these cases, > the NULL dereference in ets_qdisc_reset() is prevented. > > So, I can probably send a patch (for net-next, when it reopens) that removes > this INIT_LIST_HEAD() line; anyway, its presence is harmless IMO. WDYT? Actually I am thinking about the opposite, that is, always initializing the list head. ;) Unless we always use list_add*() before list_del(), initializing it unconditionally is a correct fix. I do not doubt your patch works, however I worry that there might be some other call paths which could call list_del() with the NULL. Thanks.
On Tue, Aug 31, 2021 at 11:16:44AM -0700, Cong Wang wrote: > On Tue, Aug 31, 2021 at 2:54 AM Davide Caratti <dcaratti@redhat.com> wrote: > > > > hello Cong, thanks a lot for looking at this! > > > > On Mon, Aug 30, 2021 at 05:43:09PM -0700, Cong Wang wrote: > > > On Tue, Aug 24, 2021 at 3:34 PM Davide Caratti <dcaratti@redhat.com> wrote: [...] > > > > Then, a call to ets_qdisc_reset() will attempt to > > > > do list_del(&alist) with 'alist' filled with zero, hence the NULL pointer > > > > dereference. > > > > > > I am confused about how we end up having NULL in list head. [...] > > So, I can probably send a patch (for net-next, when it reopens) that removes > > this INIT_LIST_HEAD() line; anyway, its presence is harmless IMO. WDYT? > > Actually I am thinking about the opposite, that is, always initializing the > list head. ;) Unless we always use list_add*() before list_del(), initializing > it unconditionally is a correct fix. uh, maybe I get the point. we can do a INIT_LIST_HEAD(&cl[i].alist) in ets_qdisc_init() with 'i' ranging from 0 to TCQ_ETS_MAX_BANDS - 1, then the memset() in [1] needs to be replaced with something that clears all members of struct ets_class except 'alist'. At this point, the INIT_LIST_HEAD() line in ets_qdisc_change() can be removed. I can re-run the kseltest and eventually send a patch for that (targeting net-next, no need to rush), is that ok for you? thanks,
On Wed, Sep 1, 2021 at 1:54 PM Davide Caratti <dcaratti@redhat.com> wrote: > uh, maybe I get the point. > > we can do a INIT_LIST_HEAD(&cl[i].alist) in ets_qdisc_init() with 'i' ranging > from 0 to TCQ_ETS_MAX_BANDS - 1, then the memset() in [1] needs to be > replaced with something that clears all members of struct ets_class except > 'alist'. At this point, the INIT_LIST_HEAD() line in ets_qdisc_change() can be > removed. > > I can re-run the kseltest and eventually send a patch for that (targeting > net-next, no need to rush), is that ok for you? Sure, whatever works for you. Thanks.
diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c index c1e84d1eeaba..c76701ac35ab 100644 --- a/net/sched/sch_ets.c +++ b/net/sched/sch_ets.c @@ -660,6 +660,13 @@ static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt, sch_tree_lock(sch); q->nbands = nbands; + for (i = nstrict; i < q->nstrict; i++) { + INIT_LIST_HEAD(&q->classes[i].alist); + if (q->classes[i].qdisc->q.qlen) { + list_add_tail(&q->classes[i].alist, &q->active); + q->classes[i].deficit = quanta[i]; + } + } q->nstrict = nstrict; memcpy(q->prio2band, priomap, sizeof(priomap));
While running kselftests, Hangbin observed that sch_ets.sh often crashes, and splats like the following one are seen in the output of 'dmesg': BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 159f12067 P4D 159f12067 PUD 159f13067 PMD 0 Oops: 0000 [#1] SMP NOPTI CPU: 2 PID: 921 Comm: tc Not tainted 5.14.0-rc6+ #458 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: 0010:__list_del_entry_valid+0x2d/0x50 Code: 48 8b 57 08 48 b9 00 01 00 00 00 00 ad de 48 39 c8 0f 84 ac 6e 5b 00 48 b9 22 01 00 00 00 00 ad de 48 39 ca 0f 84 cf 6e 5b 00 <48> 8b 32 48 39 fe 0f 85 af 6e 5b 00 48 8b 50 08 48 39 f2 0f 85 94 RSP: 0018:ffffb2da005c3890 EFLAGS: 00010217 RAX: 0000000000000000 RBX: ffff9073ba23f800 RCX: dead000000000122 RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff9073ba23fbc8 RBP: ffff9073ba23f890 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000001 R12: dead000000000100 R13: ffff9073ba23fb00 R14: 0000000000000002 R15: 0000000000000002 FS: 00007f93e5564e40(0000) GS:ffff9073bba00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000014ad34000 CR4: 0000000000350ee0 Call Trace: ets_qdisc_reset+0x6e/0x100 [sch_ets] qdisc_reset+0x49/0x1d0 tbf_reset+0x15/0x60 [sch_tbf] qdisc_reset+0x49/0x1d0 dev_reset_queue.constprop.42+0x2f/0x90 dev_deactivate_many+0x1d3/0x3d0 dev_deactivate+0x56/0x90 qdisc_graft+0x47e/0x5a0 tc_get_qdisc+0x1db/0x3e0 rtnetlink_rcv_msg+0x164/0x4c0 netlink_rcv_skb+0x50/0x100 netlink_unicast+0x1a5/0x280 netlink_sendmsg+0x242/0x480 sock_sendmsg+0x5b/0x60 ____sys_sendmsg+0x1f2/0x260 ___sys_sendmsg+0x7c/0xc0 __sys_sendmsg+0x57/0xa0 do_syscall_64+0x3a/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f93e44b8338 Code: 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 43 2c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55 RSP: 002b:00007ffc0db737a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000061255c06 RCX: 00007f93e44b8338 RDX: 0000000000000000 RSI: 00007ffc0db73810 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000001 R13: 0000000000687880 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: sch_ets sch_tbf dummy rfkill iTCO_wdt iTCO_vendor_support intel_rapl_msr intel_rapl_common joydev i2c_i801 pcspkr i2c_smbus lpc_ich virtio_balloon ip_tables xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci ghash_clmulni_intel libata serio_raw virtio_blk virtio_console virtio_net net_failover failover sunrpc dm_mirror dm_region_hash dm_log dm_mod CR2: 0000000000000000 When the change() function decreases the value of 'nstrict', we must take into account that packets might be already enqueued on a class that flips from 'strict' to 'quantum': otherwise that class will not be added to the bandwidth-sharing list. Then, a call to ets_qdisc_reset() will attempt to do list_del(&alist) with 'alist' filled with zero, hence the NULL pointer dereference. For classes flipping from 'strict' to 'quantum', initialize an empty list and eventually add it to the bandwidth-sharing list, if there are packets already enqueued. In this way, the kernel will: a) prevent crashing as described above. b) avoid retaining the backlog packets (for an arbitrarily long time) in case no packet is enqueued after a change from 'strict' to 'quantum'. Reported-by: Hangbin Liu <liuhangbin@gmail.com> Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc") Signed-off-by: Davide Caratti <dcaratti@redhat.com> --- net/sched/sch_ets.c | 7 +++++++ 1 file changed, 7 insertions(+)