
[net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload

Message ID 20230104174744.22280-1-rrameshbabu@nvidia.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers fail 2 blamed authors not CCed: maximmi@mellanox.com kuba@kernel.org; 4 maintainers not CCed: davem@davemloft.net pabeni@redhat.com maximmi@mellanox.com kuba@kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 19 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Rahul Rameshbabu Jan. 4, 2023, 5:47 p.m. UTC
When the netdev qdisc is updated correctly with the new qdisc before
destroying the old qdisc, the netdev should not be activated until cleanup
is completed. When htb_destroy_class_offload calls htb_graft_helper, the
netdev may be activated before cleanup is completed, and the new netdev
qdisc may be used prematurely by queues before cleanup is done. Call
dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
prevent premature netdev activation.

Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
---
 net/sched/sch_htb.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Comments

Maxim Mikityanskiy Jan. 4, 2023, 8:12 p.m. UTC | #1
On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>
> When the netdev qdisc is updated correctly with the new qdisc before
> destroying the old qdisc, the netdev should not be activated till cleanup
> is completed. When htb_destroy_class_offload called htb_graft_helper, the
> netdev may be activated before cleanup is completed.

Oh, so that's what was happening! Now I get the full picture:

1. The user does RTM_DELQDISC.
2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
NULL, but keeps dev_queue->qdisc_sleeping.
3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
where new is NULL, for each queue.
4. Then we get into htb_destroy_class_offload, and it's important
whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
(after Eric's patch).
5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
activates the netdev, attach_default_qdiscs will be called, and
dev_queue->qdisc will no longer be NULL for the rest of the queues,
hence the WARN_ON triggering.

Nice catch indeed, premature activation of the netdev wasn't intended.

> The new netdev qdisc
> may be used prematurely by queues before cleanup is done. Call
> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
> prevent premature netdev activation.
>
> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
> ---
>  net/sched/sch_htb.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index 2238edece1a4..f62334ef016a 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
>
>         WARN_ON(!q);
>         dev_queue = htb_offload_get_queue(cl);
> -       old = htb_graft_helper(dev_queue, NULL);
> -       if (destroying)
> +       if (destroying) {
> +               old = dev_graft_qdisc(dev_queue, NULL);
>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
>                  * all queues.
>                  */
>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));

Now regarding this WARN_ON, I have concerns about its correctness.

Can the user replace the root qdisc from HTB to something else with a
single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
called with new != NULL? If that is possible, then:

1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
support the attach callback) or the old one left from HTB (old == q,
if the new qdisc supports the attach callback). WARN_ON should
trigger.

2. We shouldn't even call dev_graft_qdisc in this case (if destroying
is true). Likewise, we shouldn't try to revert it on errors or call
qdisc_put on it.

Could you please try to reproduce this scenario of triggering WARN_ON?
I remember testing it, and something actually prevented me from doing
a replacement, but maybe I just missed something back then.

> -       else
> +       } else {
> +               old = htb_graft_helper(dev_queue, NULL);
>                 WARN_ON(old != q);
> +       }
>
>         if (cl->parent) {
>                 _bstats_update(&cl->parent->bstats_bias,
> --
> 2.36.2
>
Rahul Rameshbabu Jan. 5, 2023, 6:03 a.m. UTC | #2
Maxim Mikityanskiy <maxtram95@gmail.com> writes:

> On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>>
>> When the netdev qdisc is updated correctly with the new qdisc before
>> destroying the old qdisc, the netdev should not be activated till cleanup
>> is completed. When htb_destroy_class_offload called htb_graft_helper, the
>> netdev may be activated before cleanup is completed.
>
> Oh, so that's what was happening! Now I get the full picture:
>
> 1. The user does RTM_DELQDISC.
> 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
> NULL, but keeps dev_queue->qdisc_sleeping.
> 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
> where new is NULL, for each queue.
> 4. Then we get into htb_destroy_class_offload, and it's important
> whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
> (after Eric's patch).
> 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
> activates the netdev, attach_default_qdiscs will be called, and
> dev_queue->qdisc will no longer be NULL for the rest of the queues,
> hence the WARN_ON triggering.
>
> Nice catch indeed, premature activation of the netdev wasn't intended.
>
>> The new netdev qdisc
>> may be used prematurely by queues before cleanup is done. Call
>> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
>> prevent premature netdev activation.
>>
>> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
>> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
>> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
>> ---
>>  net/sched/sch_htb.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
>> index 2238edece1a4..f62334ef016a 100644
>> --- a/net/sched/sch_htb.c
>> +++ b/net/sched/sch_htb.c
>> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
>>
>>         WARN_ON(!q);
>>         dev_queue = htb_offload_get_queue(cl);
>> -       old = htb_graft_helper(dev_queue, NULL);
>> -       if (destroying)
>> +       if (destroying) {
>> +               old = dev_graft_qdisc(dev_queue, NULL);
>>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
>>                  * all queues.
>>                  */
>>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));
>
> Now regarding this WARN_ON, I have concerns about its correctness.
>
> Can the user replace the root qdisc from HTB to something else with a
> single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
> do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
> called with new != NULL? If that is possible, then:
>
> 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
> support the attach callback) or the old one left from HTB (old == q,
> if the new qdisc supports the attach callback). WARN_ON should
> trigger.
>
> 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
> is true). Likewise, we shouldn't try to revert it on errors or call
> qdisc_put on it.
>
> Could you please try to reproduce this scenario of triggering WARN_ON?
> I remember testing it, and something actually prevented me from doing
> a replacement, but maybe I just missed something back then.
>

Reproduction steps

  ip link set dev eth2 up
  ip link set dev eth2 up
  ip addr add 194.237.173.123/16 dev eth2
  tc qdisc add dev eth2 clsact
  tc qdisc add dev eth2 root handle 1: htb default 1 offload
  tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
  tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit 
  tc qdisc replace dev eth2 root pfifo

The warning is indeed triggered because the new root is pfifo rather
than noop_qdisc. I agree with both points you brought up in the patch.
When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
was worried about the case when new was defined as a new qdisc rather
than defaulting to noop_qdisc but assumed there were some guarantees for
htb.

I believe the correct fix for a robust implementation of
htb_destroy_class_offload would be to not depend on functions that
retrieve the top level qdisc. However, I see a number of functions, not
just offload related ones, in the htb implementation that seem to depend
on the assumption that the old qdisc can safely be accessed with helpers
such as htb_graft_helper. One such example is htb_change_class. The
trivial solution I see is to change qdisc_graft to first do a
rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
know your thoughts on this.
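Concretely, the idea would be something like this (untested sketch against
qdisc_graft in net/sched/sch_api.c; notify_and_destroy and dev->qdisc are
the real names, but the ordering is only a proposal):

```c
/* Untested sketch of the proposed qdisc_graft ordering, not a real patch. */
rcu_assign_pointer(dev->qdisc, &noop_qdisc);    /* 1. park on noop first   */
notify_and_destroy(net, skb, n, classid,        /* 2. destroy the old root */
                   old, new);                   /*    while noop is seen   */
if (new)
        rcu_assign_pointer(dev->qdisc, new);    /* 3. publish replacement  */
```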

  [  384.474535] ------------[ cut here ]------------
  [  384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
  [  384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
  [  384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
  [  384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [  384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
  [  384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
  [  384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
  [  384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
  [  384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
  [  384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
  [  384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
  [  384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
  [  384.500044] FS:  00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
  [  384.501390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
  [  384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [  384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [  384.505699] Call Trace:
  [  384.506231]  <TASK>
  [  384.506691]  ? tcf_block_put+0x74/0xa0
  [  384.507365]  htb_destroy+0x142/0x3c0 [sch_htb]
  [  384.508131]  ? hrtimer_cancel+0x11/0x40
  [  384.508832]  ? rtnl_is_locked+0x11/0x20
  [  384.509522]  ? htb_reset+0xe3/0x1a0 [sch_htb]
  [  384.510293]  qdisc_destroy+0x3b/0xd0
  [  384.510943]  qdisc_graft+0x40b/0x590
  [  384.511600]  tc_modify_qdisc+0x577/0x870
  [  384.512309]  rtnetlink_rcv_msg+0x2a2/0x390
  [  384.513031]  ? rtnl_calcit.isra.0+0x120/0x120
  [  384.513806]  netlink_rcv_skb+0x54/0x100
  [  384.514495]  netlink_unicast+0x1f6/0x2c0
  [  384.515190]  netlink_sendmsg+0x237/0x490
  [  384.515890]  sock_sendmsg+0x33/0x40
  [  384.516556]  ____sys_sendmsg+0x1d1/0x1f0
  [  384.517265]  ___sys_sendmsg+0x72/0xb0
  [  384.517942]  ? ___sys_recvmsg+0x7c/0xb0
  [  384.518631]  __sys_sendmsg+0x51/0x90
  [  384.519289]  do_syscall_64+0x3d/0x90
  [  384.519943]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
  [  384.520794] RIP: 0033:0x7f7a2eaccc17
  [  384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
  [  384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
  [  384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
  [  384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
  [  384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
  [  384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
  [  384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
  [  384.531221]  </TASK>
  [  384.531689] ---[ end trace 0000000000000000 ]---

>> -       else
>> +       } else {
>> +               old = htb_graft_helper(dev_queue, NULL);
>>                 WARN_ON(old != q);
>> +       }
>>
>>         if (cl->parent) {
>>                 _bstats_update(&cl->parent->bstats_bias,
>> --
>> 2.36.2
>>
Maxim Mikityanskiy Jan. 6, 2023, 12:03 p.m. UTC | #3
On Thu, 5 Jan 2023 at 08:03, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>
> Maxim Mikityanskiy <maxtram95@gmail.com> writes:
>
> > On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
> >>
> >> When the netdev qdisc is updated correctly with the new qdisc before
> >> destroying the old qdisc, the netdev should not be activated till cleanup
> >> is completed. When htb_destroy_class_offload called htb_graft_helper, the
> >> netdev may be activated before cleanup is completed.
> >
> > Oh, so that's what was happening! Now I get the full picture:
> >
> > 1. The user does RTM_DELQDISC.
> > 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
> > NULL, but keeps dev_queue->qdisc_sleeping.
> > 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
> > where new is NULL, for each queue.
> > 4. Then we get into htb_destroy_class_offload, and it's important
> > whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
> > (after Eric's patch).
> > 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
> > activates the netdev, attach_default_qdiscs will be called, and
> > dev_queue->qdisc will no longer be NULL for the rest of the queues,
> > hence the WARN_ON triggering.
> >
> > Nice catch indeed, premature activation of the netdev wasn't intended.
> >
> >> The new netdev qdisc
> >> may be used prematurely by queues before cleanup is done. Call
> >> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
> >> prevent premature netdev activation.
> >>
> >> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> >> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> >> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
> >> Cc: Eric Dumazet <edumazet@google.com>
> >> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
> >> ---
> >>  net/sched/sch_htb.c | 8 +++++---
> >>  1 file changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> >> index 2238edece1a4..f62334ef016a 100644
> >> --- a/net/sched/sch_htb.c
> >> +++ b/net/sched/sch_htb.c
> >> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
> >>
> >>         WARN_ON(!q);
> >>         dev_queue = htb_offload_get_queue(cl);
> >> -       old = htb_graft_helper(dev_queue, NULL);
> >> -       if (destroying)
> >> +       if (destroying) {
> >> +               old = dev_graft_qdisc(dev_queue, NULL);
> >>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
> >>                  * all queues.
> >>                  */
> >>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));
> >
> > Now regarding this WARN_ON, I have concerns about its correctness.
> >
> > Can the user replace the root qdisc from HTB to something else with a
> > single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
> > do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
> > called with new != NULL? If that is possible, then:
> >
> > 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
> > support the attach callback) or the old one left from HTB (old == q,
> > if the new qdisc supports the attach callback). WARN_ON should
> > trigger.
> >
> > 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
> > is true). Likewise, we shouldn't try to revert it on errors or call
> > qdisc_put on it.
> >
> > Could you please try to reproduce this scenario of triggering WARN_ON?
> > I remember testing it, and something actually prevented me from doing
> > a replacement, but maybe I just missed something back then.
> >
>
> Reproduction steps
>
>   ip link set dev eth2 up
>   ip link set dev eth2 up
>   ip addr add 194.237.173.123/16 dev eth2
>   tc qdisc add dev eth2 clsact
>   tc qdisc add dev eth2 root handle 1: htb default 1 offload
>   tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
>   tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit
>   tc qdisc replace dev eth2 root pfifo
>
> The warning is indeed triggered because the new root is pfifo rather
> than noop_qdisc. I agree with both points you brought up in the patch.
> When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
> was worried about the case when new was defined as a new qdisc rather
> than defaulting to noop_qdisc but assumed there were some guarantees for
> htb.

OK, so that means my WARN_ON relies on false guarantees.

> However, I see a number of functions, not
> just offload related ones, in the htb implementation that seem to depend
> on the assumption that the old qdisc can safely be accessed with helpers
> such as htb_graft_helper. One such example is htb_change_class.

htb_graft_helper is only used by the offload, you can see that all
usages are guarded by if (q->offload).

> The
> trivial solution I see is to change qdisc_graft to first do a
> rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
> afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
> know your thoughts on this.

I don't think it's a good idea to introduce such a change to generic
code just to fix HTB offload. It would basically remove the ability to
replace the qdisc atomically, and there will be periods of time when
all packets are dropped.

> I believe the correct fix for a robust implementation of
> htb_destroy_class_offload would be to not depend on functions that
> retrieve the top level qdisc.

This is actually the case. The source of truth is internal data
structures of HTB, specifically cl->leaf.q points to the queue qdisc,
and cl->leaf.offload_queue points to the queue itself (the latter is
very useful when the qdisc is noop_qdisc, and we can't retrieve
dev_queue from the qdisc itself). Whenever we take a qdisc from the
netdev_queue itself, it should be only to check consistency with the
source of truth (please tell me if you spotted some places where it's
used for something else).
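For reference, the state in question looks roughly like this (abridged from
struct htb_class in net/sched/sch_htb.c; layout simplified, see the actual
source for the full definition):

```c
/* Abridged sketch of struct htb_class in net/sched/sch_htb.c */
struct htb_class {
	/* ... */
	union {
		struct htb_class_leaf {
			int	deficit[TC_HTB_MAXDEPTH];
			struct Qdisc		*q;	/* source of truth */
			struct netdev_queue	*offload_queue;	/* usable even
								 * when *q is
								 * noop_qdisc */
		} leaf;
		struct htb_class_inner inner;
	};
	/* ... */
};
```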

What I suggest to do to fix this particular bug:

1. Remove the WARN_ON check for noop_qdisc on destroy, it's simply
invalid. Rephrase the comment to explain why you WARN_ON(old != q)
only when !destroying.

2. Don't graft anything on destroy, the kernel has already replaced
the qdisc with the right one (noop_qdisc on delete, the new qdisc on
replace).

3. qdisc_put down below shouldn't be called when destroying, because
we didn't graft anything ourselves. We should still call
qdisc_put(old) when err == 0 to drop the ref (we hold one ref for
cl->leaf.q and the other ref for dev_queue->qdisc, which is normally
the same qdisc). And we should still graft old back on errors, but not
when destroying (we ignore all errors on destroy and just go on, we
can't fail).
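In other words, something along these lines (rough sketch only; the details
are for the actual patch to work out):

```c
/* Rough sketch of points 1-3 above, not a real patch. */
old = NULL;
if (!destroying) {
	old = htb_graft_helper(dev_queue, NULL);
	WARN_ON(old != q);	/* cl->leaf.q is the source of truth */
}
/* On destroy, graft nothing: the kernel already installed the right
 * qdisc (noop_qdisc on delete, the new qdisc on replace). */

err = htb_offload(dev, &offload_opt);

if (!destroying) {
	if (err)
		htb_graft_helper(dev_queue, old);	/* revert our graft */
	else
		qdisc_put(old);	/* drop the dev_queue ref; old == q here */
}
/* On destroy, ignore errors and don't put what we didn't graft. */
```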

>
>   [  384.474535] ------------[ cut here ]------------
>   [  384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
>   [  384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
>   [  384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
>   [  384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>   [  384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
>   [  384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
>   [  384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
>   [  384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
>   [  384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
>   [  384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
>   [  384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
>   [  384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
>   [  384.500044] FS:  00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
>   [  384.501390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [  384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
>   [  384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   [  384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   [  384.505699] Call Trace:
>   [  384.506231]  <TASK>
>   [  384.506691]  ? tcf_block_put+0x74/0xa0
>   [  384.507365]  htb_destroy+0x142/0x3c0 [sch_htb]
>   [  384.508131]  ? hrtimer_cancel+0x11/0x40
>   [  384.508832]  ? rtnl_is_locked+0x11/0x20
>   [  384.509522]  ? htb_reset+0xe3/0x1a0 [sch_htb]
>   [  384.510293]  qdisc_destroy+0x3b/0xd0
>   [  384.510943]  qdisc_graft+0x40b/0x590
>   [  384.511600]  tc_modify_qdisc+0x577/0x870
>   [  384.512309]  rtnetlink_rcv_msg+0x2a2/0x390
>   [  384.513031]  ? rtnl_calcit.isra.0+0x120/0x120
>   [  384.513806]  netlink_rcv_skb+0x54/0x100
>   [  384.514495]  netlink_unicast+0x1f6/0x2c0
>   [  384.515190]  netlink_sendmsg+0x237/0x490
>   [  384.515890]  sock_sendmsg+0x33/0x40
>   [  384.516556]  ____sys_sendmsg+0x1d1/0x1f0
>   [  384.517265]  ___sys_sendmsg+0x72/0xb0
>   [  384.517942]  ? ___sys_recvmsg+0x7c/0xb0
>   [  384.518631]  __sys_sendmsg+0x51/0x90
>   [  384.519289]  do_syscall_64+0x3d/0x90
>   [  384.519943]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
>   [  384.520794] RIP: 0033:0x7f7a2eaccc17
>   [  384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
>   [  384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
>   [  384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
>   [  384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
>   [  384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
>   [  384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
>   [  384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
>   [  384.531221]  </TASK>
>   [  384.531689] ---[ end trace 0000000000000000 ]---
>
> >> -       else
> >> +       } else {
> >> +               old = htb_graft_helper(dev_queue, NULL);
> >>                 WARN_ON(old != q);
> >> +       }
> >>
> >>         if (cl->parent) {
> >>                 _bstats_update(&cl->parent->bstats_bias,
> >> --
> >> 2.36.2
> >>
Rahul Rameshbabu Jan. 6, 2023, 11:36 p.m. UTC | #4
Hi Maxim,

I am working on an updated patch based on your comments. I have further
comments below.

-- Rahul Rameshbabu

Maxim Mikityanskiy <maxtram95@gmail.com> writes:

> On Thu, 5 Jan 2023 at 08:03, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>>
>> Maxim Mikityanskiy <maxtram95@gmail.com> writes:
>>
>> > On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>> >>
>> >> When the netdev qdisc is updated correctly with the new qdisc before
>> >> destroying the old qdisc, the netdev should not be activated till cleanup
>> >> is completed. When htb_destroy_class_offload called htb_graft_helper, the
>> >> netdev may be activated before cleanup is completed.
>> >
>> > Oh, so that's what was happening! Now I get the full picture:
>> >
>> > 1. The user does RTM_DELQDISC.
>> > 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
>> > NULL, but keeps dev_queue->qdisc_sleeping.
>> > 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
>> > where new is NULL, for each queue.
>> > 4. Then we get into htb_destroy_class_offload, and it's important
>> > whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
>> > (after Eric's patch).
>> > 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
>> > activates the netdev, attach_default_qdiscs will be called, and
>> > dev_queue->qdisc will no longer be NULL for the rest of the queues,
>> > hence the WARN_ON triggering.
>> >
>> > Nice catch indeed, premature activation of the netdev wasn't intended.
>> >
>> >> The new netdev qdisc
>> >> may be used prematurely by queues before cleanup is done. Call
>> >> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
>> >> prevent premature netdev activation.
>> >>
>> >> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
>> >> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
>> >> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
>> >> Cc: Eric Dumazet <edumazet@google.com>
>> >> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
>> >> ---
>> >>  net/sched/sch_htb.c | 8 +++++---
>> >>  1 file changed, 5 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
>> >> index 2238edece1a4..f62334ef016a 100644
>> >> --- a/net/sched/sch_htb.c
>> >> +++ b/net/sched/sch_htb.c
>> >> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
>> >>
>> >>         WARN_ON(!q);
>> >>         dev_queue = htb_offload_get_queue(cl);
>> >> -       old = htb_graft_helper(dev_queue, NULL);
>> >> -       if (destroying)
>> >> +       if (destroying) {
>> >> +               old = dev_graft_qdisc(dev_queue, NULL);
>> >>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
>> >>                  * all queues.
>> >>                  */
>> >>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));
>> >
>> > Now regarding this WARN_ON, I have concerns about its correctness.
>> >
>> > Can the user replace the root qdisc from HTB to something else with a
>> > single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
>> > do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
>> > called with new != NULL? If that is possible, then:
>> >
>> > 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
>> > support the attach callback) or the old one left from HTB (old == q,
>> > if the new qdisc supports the attach callback). WARN_ON should
>> > trigger.
>> >
>> > 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
>> > is true). Likewise, we shouldn't try to revert it on errors or call
>> > qdisc_put on it.
>> >
>> > Could you please try to reproduce this scenario of triggering WARN_ON?
>> > I remember testing it, and something actually prevented me from doing
>> > a replacement, but maybe I just missed something back then.
>> >
>>
>> Reproduction steps
>>
>>   ip link set dev eth2 up
>>   ip link set dev eth2 up
>>   ip addr add 194.237.173.123/16 dev eth2
>>   tc qdisc add dev eth2 clsact
>>   tc qdisc add dev eth2 root handle 1: htb default 1 offload
>>   tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
>>   tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit
>>   tc qdisc replace dev eth2 root pfifo
>>
>> The warning is indeed triggered because the new root is pfifo rather
>> than noop_qdisc. I agree with both points you brought up in the patch.
>> When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
>> was worried about the case when new was defined as a new qdisc rather
>> than defaulting to noop_qdisc but assumed there were some guarantees for
>> htb.
>
> OK, so that means my WARN_ON relies on false guarantees.
>
>> However, I see a number of functions, not
>> just offload related ones, in the htb implementation that seem to depend
>> on the assumption that the old qdisc can safely be accessed with helpers
>> such as htb_graft_helper. One such example is htb_change_class.
>
> htb_graft_helper is only used by the offload, you can see that all
> usages are guarded by if (q->offload).
>
>> The
>> trivial solution I see is to change qdisc_graft to first do a
>> rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
>> afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
>> know your thoughts on this.
>
> I don't think it's a good idea to introduce such a change to generic
> code just to fix HTB offload. It would basically remove the ability to
> replace the qdisc atomically, and there will be periods of time when
> all packets are dropped.
>
>> I believe the correct fix for a robust implementation of
>> htb_destroy_class_offload would be to not depend on functions that
>> retrieve the top level qdisc.
>
> This is actually the case. The source of truth is internal data
> structures of HTB, specifically cl->leaf.q points to the queue qdisc,
> and cl->leaf.offload_queue points to the queue itself (the latter is
> very useful when the qdisc is noop_qdisc, and we can't retrieve
> dev_queue from the qdisc itself). Whenever we take a qdisc from the
> netdev_queue itself, it should be only to check consistency with the
> source of truth (please tell me if you spotted some places where it's
> used for something else).

Yeah, I did find one place where the qdisc taken from the netdev_queue
is used that way: in htb_change_class, in what appears to be an offload
context (I misread this in my earlier comment). It's just the call to
_bstats_update. Should it read from parent->leaf.q instead (since that
is the source of truth for the htb structure in this context) in case
the WARN_ON is ever hit (which shouldn't happen normally; the WARN_ON
is there to enforce that guarantee)?

  dev_queue = htb_offload_get_queue(parent);
  old_q = htb_graft_helper(dev_queue, NULL);
  WARN_ON(old_q != parent->leaf.q);
  offload_opt = (struct tc_htb_qopt_offload) {
    .command = TC_HTB_LEAF_TO_INNER,
    .classid = cl->common.classid,
    .parent_classid =
      TC_H_MIN(parent->common.classid),
    .rate = max_t(u64, hopt->rate.rate, rate64),
    .ceil = max_t(u64, hopt->ceil.rate, ceil64),
    .extack = extack,
  };
  err = htb_offload(dev, &offload_opt);
  if (err) {
    pr_err("htb: TC_HTB_LEAF_TO_INNER failed with err = %d\n",
            err);
    htb_graft_helper(dev_queue, old_q);
    goto err_kill_estimator;
  }
  _bstats_update(&parent->bstats_bias,
            u64_stats_read(&old_q->bstats.bytes),
            u64_stats_read(&old_q->bstats.packets));

>
> What I suggest to do to fix this particular bug:
>
> 1. Remove the WARN_ON check for noop_qdisc on destroy, it's simply
> invalid. Rephrase the comment to explain why you WARN_ON(old != q)
> only when !destroying.
>
> 2. Don't graft anything on destroy, the kernel has already replaced
> the qdisc with the right one (noop_qdisc on delete, the new qdisc on
> replace).
>
> 3. qdisc_put down below shouldn't be called when destroying, because
> we didn't graft anything ourselves. We should still call
> qdisc_put(old) when err == 0 to drop the ref (we hold one ref for
> cl->leaf.q and the other ref for dev_queue->qdisc, which is normally
> the same qdisc). And we should still graft old back on errors, but not
> when destroying (we ignore all errors on destroy and just go on, we
> can't fail).

Ack.

>
>>
>>   [  384.474535] ------------[ cut here ]------------
>>   [  384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
>>   [ 384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack
>> xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
>> br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi
>> rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
>>   [  384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
>>   [  384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>>   [  384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
>>   [  384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
>>   [  384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
>>   [  384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
>>   [  384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
>>   [  384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
>>   [  384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
>>   [  384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
>>   [  384.500044] FS:  00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
>>   [  384.501390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [  384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
>>   [  384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>   [  384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>   [  384.505699] Call Trace:
>>   [  384.506231]  <TASK>
>>   [  384.506691]  ? tcf_block_put+0x74/0xa0
>>   [  384.507365]  htb_destroy+0x142/0x3c0 [sch_htb]
>>   [  384.508131]  ? hrtimer_cancel+0x11/0x40
>>   [  384.508832]  ? rtnl_is_locked+0x11/0x20
>>   [  384.509522]  ? htb_reset+0xe3/0x1a0 [sch_htb]
>>   [  384.510293]  qdisc_destroy+0x3b/0xd0
>>   [  384.510943]  qdisc_graft+0x40b/0x590
>>   [  384.511600]  tc_modify_qdisc+0x577/0x870
>>   [  384.512309]  rtnetlink_rcv_msg+0x2a2/0x390
>>   [  384.513031]  ? rtnl_calcit.isra.0+0x120/0x120
>>   [  384.513806]  netlink_rcv_skb+0x54/0x100
>>   [  384.514495]  netlink_unicast+0x1f6/0x2c0
>>   [  384.515190]  netlink_sendmsg+0x237/0x490
>>   [  384.515890]  sock_sendmsg+0x33/0x40
>>   [  384.516556]  ____sys_sendmsg+0x1d1/0x1f0
>>   [  384.517265]  ___sys_sendmsg+0x72/0xb0
>>   [  384.517942]  ? ___sys_recvmsg+0x7c/0xb0
>>   [  384.518631]  __sys_sendmsg+0x51/0x90
>>   [  384.519289]  do_syscall_64+0x3d/0x90
>>   [  384.519943]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>   [  384.520794] RIP: 0033:0x7f7a2eaccc17
>>   [  384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
>>   [  384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
>>   [  384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
>>   [  384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
>>   [  384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
>>   [  384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
>>   [  384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
>>   [  384.531221]  </TASK>
>>   [  384.531689] ---[ end trace 0000000000000000 ]---
>>
>> >> -       else
>> >> +       } else {
>> >> +               old = htb_graft_helper(dev_queue, NULL);
>> >>                 WARN_ON(old != q);
>> >> +       }
>> >>
>> >>         if (cl->parent) {
>> >>                 _bstats_update(&cl->parent->bstats_bias,
>> >> --
>> >> 2.36.2
>> >>
Maxim Mikityanskiy Jan. 10, 2023, 5:41 p.m. UTC | #5
On Fri, Jan 06, 2023 at 03:36:37PM -0800, Rahul Rameshbabu wrote:
> Hi Maxim,
> 
> I am working on an updated patch based on your comments. I have further
> comments below.
> 
> -- Rahul Rameshbabu
> 
> Maxim Mikityanskiy <maxtram95@gmail.com> writes:
> 
> > On Thu, 5 Jan 2023 at 08:03, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
> >>
> >> Maxim Mikityanskiy <maxtram95@gmail.com> writes:
> >>
> >> > On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
> >> >>
> >> >> When the netdev qdisc is updated correctly with the new qdisc before
> >> >> destroying the old qdisc, the netdev should not be activated till cleanup
> >> >> is completed. When htb_destroy_class_offload called htb_graft_helper, the
> >> >> netdev may be activated before cleanup is completed.
> >> >
> >> > Oh, so that's what was happening! Now I get the full picture:
> >> >
> >> > 1. The user does RTM_DELQDISC.
> >> > 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
> >> > NULL, but keeps dev_queue->qdisc_sleeping.
> >> > 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
> >> > where new is NULL, for each queue.
> >> > 4. Then we get into htb_destroy_class_offload, and it's important
> >> > whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
> >> > (after Eric's patch).
> >> > 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
> >> > activates the netdev, attach_default_qdiscs will be called, and
> >> > dev_queue->qdisc will no longer be NULL for the rest of the queues,
> >> > hence the WARN_ON triggering.
> >> >
> >> > Nice catch indeed, premature activation of the netdev wasn't intended.
> >> >
> >> >> The new netdev qdisc
> >> >> may be used prematurely by queues before cleanup is done. Call
> >> >> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
> >> >> prevent premature netdev activation.
> >> >>
> >> >> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> >> >> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> >> >> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
> >> >> Cc: Eric Dumazet <edumazet@google.com>
> >> >> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
> >> >> ---
> >> >>  net/sched/sch_htb.c | 8 +++++---
> >> >>  1 file changed, 5 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> >> >> index 2238edece1a4..f62334ef016a 100644
> >> >> --- a/net/sched/sch_htb.c
> >> >> +++ b/net/sched/sch_htb.c
> >> >> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
> >> >>
> >> >>         WARN_ON(!q);
> >> >>         dev_queue = htb_offload_get_queue(cl);
> >> >> -       old = htb_graft_helper(dev_queue, NULL);
> >> >> -       if (destroying)
> >> >> +       if (destroying) {
> >> >> +               old = dev_graft_qdisc(dev_queue, NULL);
> >> >>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
> >> >>                  * all queues.
> >> >>                  */
> >> >>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));
> >> >
> >> > Now regarding this WARN_ON, I have concerns about its correctness.
> >> >
> >> > Can the user replace the root qdisc from HTB to something else with a
> >> > single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
> >> > do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
> >> > called with new != NULL? If that is possible, then:
> >> >
> >> > 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
> >> > support the attach callback) or the old one left from HTB (old == q,
> >> > if the new qdisc supports the attach callback). WARN_ON should
> >> > trigger.
> >> >
> >> > 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
> >> > is true). Likewise, we shouldn't try to revert it on errors or call
> >> > qdisc_put on it.
> >> >
> >> > Could you please try to reproduce this scenario of triggering WARN_ON?
> >> > I remember testing it, and something actually prevented me from doing
> >> > a replacement, but maybe I just missed something back then.
> >> >
> >>
> >> Reproduction steps
> >>
> >>   ip link set dev eth2 up
> >>   ip link set dev eth2 up
> >>   ip addr add 194.237.173.123/16 dev eth2
> >>   tc qdisc add dev eth2 clsact
> >>   tc qdisc add dev eth2 root handle 1: htb default 1 offload
> >>   tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
> >>   tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit
> >>   tc qdisc replace dev eth2 root pfifo
> >>
> >> The warning is indeed triggered because the new root is pfifo rather
> >> than noop_qdisc. I agree with both points you brought up in the patch.
> >> When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
> >> was worried about the case when new was defined as a new qdisc rather
> >> than defaulting to noop_qdisc but assumed there were some guarantees for
> >> htb.
> >
> > OK, so that means my WARN_ON relies on false guarantees.
> >
> >> However, I see a number of functions, not
> >> just offload related ones, in the htb implementation that seem to depend
> >> on the assumption that the old qdisc can safely be accessed with helpers
> >> such as htb_graft_helper. One such example is htb_change_class.
> >
> > htb_graft_helper is only used by the offload, you can see that all
> > usages are guarded by if (q->offload).
> >
> >> The
> >> trivial solution I see is to change qdisc_graft to first do a
> >> rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
> >> afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
> >> know your thoughts on this.
> >
> > I don't think it's a good idea to introduce such a change to generic
> > code just to fix HTB offload. It would basically remove the ability to
> > replace the qdisc atomically, and there will be periods of time when
> > all packets are dropped.
> >
> >> I believe the correct fix for a robust implementation of
> >> htb_destroy_class_offload would be to not depend on functions that
> >> retrieve the top level qdisc.
> >
> > This is actually the case. The source of truth is internal data
> > structures of HTB, specifically cl->leaf.q points to the queue qdisc,
> > and cl->leaf.offload_queue points to the queue itself (the latter is
> > very useful when the qdisc is noop_qdisc, and we can't retrieve
> > dev_queue from the qdisc itself). Whenever we take a qdisc from the
> > netdev_queue itself, it should be only to check consistency with the
> > source of truth (please tell me if you spotted some places where it's
> > used for something else).
> 
> Yeah, I did find one place where the qdisc taken from the netdev_queue
> is used that way: in htb_change_class, in what appears to be an offload
> context (I misread this in my earlier comment). It's just the call to
> _bstats_update. Should it read from parent->leaf.q instead (since that
> is the source of truth for the htb structure in this context) in case
> the WARN_ON is ever hit (which shouldn't happen normally; the WARN_ON
> is there to enforce that guarantee)?

I'm inclined to keep it as is. My argument above is more about the
integrity of the data structures while they are being modified by the
control path. When we modify the tree, the internal data structure is
the source of truth, but we still need the same qdiscs to be assigned to
dev_queues for the datapath to use them, so we derive/copy
dev_queue->qdisc from the internal data structure.
(dev_queue->qdisc_sleeping only serves the purpose of updating
dev_queue->qdisc.)

Stats, on the other hand, are a datapath entity. They are produced in
the datapath, which only has access to dev_queue->qdisc. After detaching
the qdisc from the datapath (with htb_graft_helper), we read out the
accumulated stats and store them. To me, the current code makes sense.
Moreover, it's robust in the sense that when we read the stats, the
qdisc is no longer attached to the dev_queue, so the stats stand still
at that point.

That's how I see it. TL;DR: we don't use it here for the structure of
the tree; we just read back data that the datapath produced, from where
the datapath wrote it.

>   dev_queue = htb_offload_get_queue(parent);
>   old_q = htb_graft_helper(dev_queue, NULL);
>   WARN_ON(old_q != parent->leaf.q);
>   offload_opt = (struct tc_htb_qopt_offload) {
>     .command = TC_HTB_LEAF_TO_INNER,
>     .classid = cl->common.classid,
>     .parent_classid =
>       TC_H_MIN(parent->common.classid),
>     .rate = max_t(u64, hopt->rate.rate, rate64),
>     .ceil = max_t(u64, hopt->ceil.rate, ceil64),
>     .extack = extack,
>   };
>   err = htb_offload(dev, &offload_opt);
>   if (err) {
>     pr_err("htb: TC_HTB_LEAF_TO_INNER failed with err = %d\n",
>             err);
>     htb_graft_helper(dev_queue, old_q);
>     goto err_kill_estimator;
>   }
>   _bstats_update(&parent->bstats_bias,
>             u64_stats_read(&old_q->bstats.bytes),
>             u64_stats_read(&old_q->bstats.packets));
> 
> >
> > What I suggest to do to fix this particular bug:
> >
> > 1. Remove the WARN_ON check for noop_qdisc on destroy, it's simply
> > invalid. Rephrase the comment to explain why you WARN_ON(old != q)
> > only when !destroying.
> >
> > 2. Don't graft anything on destroy, the kernel has already replaced
> > the qdisc with the right one (noop_qdisc on delete, the new qdisc on
> > replace).
> >
> > 3. qdisc_put down below shouldn't be called when destroying, because
> > we didn't graft anything ourselves. We should still call
> > qdisc_put(old) when err == 0 to drop the ref (we hold one ref for
> > cl->leaf.q and the other ref for dev_queue->qdisc, which is normally
> > the same qdisc). And we should still graft old back on errors, but not
> > when destroying (we ignore all errors on destroy and just go on, we
> > can't fail).
> 
> Ack.
> 
> >
> >>
> >>   [  384.474535] ------------[ cut here ]------------
> >>   [  384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
> >>   [ 384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack
> >> xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
> >> br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi
> >> rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
> >>   [  384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
> >>   [  384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> >>   [  384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
> >>   [  384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
> >>   [  384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
> >>   [  384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
> >>   [  384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
> >>   [  384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
> >>   [  384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
> >>   [  384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
> >>   [  384.500044] FS:  00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
> >>   [  384.501390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>   [  384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
> >>   [  384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>   [  384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>   [  384.505699] Call Trace:
> >>   [  384.506231]  <TASK>
> >>   [  384.506691]  ? tcf_block_put+0x74/0xa0
> >>   [  384.507365]  htb_destroy+0x142/0x3c0 [sch_htb]
> >>   [  384.508131]  ? hrtimer_cancel+0x11/0x40
> >>   [  384.508832]  ? rtnl_is_locked+0x11/0x20
> >>   [  384.509522]  ? htb_reset+0xe3/0x1a0 [sch_htb]
> >>   [  384.510293]  qdisc_destroy+0x3b/0xd0
> >>   [  384.510943]  qdisc_graft+0x40b/0x590
> >>   [  384.511600]  tc_modify_qdisc+0x577/0x870
> >>   [  384.512309]  rtnetlink_rcv_msg+0x2a2/0x390
> >>   [  384.513031]  ? rtnl_calcit.isra.0+0x120/0x120
> >>   [  384.513806]  netlink_rcv_skb+0x54/0x100
> >>   [  384.514495]  netlink_unicast+0x1f6/0x2c0
> >>   [  384.515190]  netlink_sendmsg+0x237/0x490
> >>   [  384.515890]  sock_sendmsg+0x33/0x40
> >>   [  384.516556]  ____sys_sendmsg+0x1d1/0x1f0
> >>   [  384.517265]  ___sys_sendmsg+0x72/0xb0
> >>   [  384.517942]  ? ___sys_recvmsg+0x7c/0xb0
> >>   [  384.518631]  __sys_sendmsg+0x51/0x90
> >>   [  384.519289]  do_syscall_64+0x3d/0x90
> >>   [  384.519943]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >>   [  384.520794] RIP: 0033:0x7f7a2eaccc17
> >>   [  384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
> >>   [  384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> >>   [  384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
> >>   [  384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
> >>   [  384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
> >>   [  384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
> >>   [  384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
> >>   [  384.531221]  </TASK>
> >>   [  384.531689] ---[ end trace 0000000000000000 ]---
> >>
> >> >> -       else
> >> >> +       } else {
> >> >> +               old = htb_graft_helper(dev_queue, NULL);
> >> >>                 WARN_ON(old != q);
> >> >> +       }
> >> >>
> >> >>         if (cl->parent) {
> >> >>                 _bstats_update(&cl->parent->bstats_bias,
> >> >> --
> >> >> 2.36.2
> >> >>

Patch

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 2238edece1a4..f62334ef016a 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -1557,14 +1557,16 @@  static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
 
 	WARN_ON(!q);
 	dev_queue = htb_offload_get_queue(cl);
-	old = htb_graft_helper(dev_queue, NULL);
-	if (destroying)
+	if (destroying) {
+		old = dev_graft_qdisc(dev_queue, NULL);
 		/* Before HTB is destroyed, the kernel grafts noop_qdisc to
 		 * all queues.
 		 */
 		WARN_ON(!(old->flags & TCQ_F_BUILTIN));
-	else
+	} else {
+		old = htb_graft_helper(dev_queue, NULL);
 		WARN_ON(old != q);
+	}
 
 	if (cl->parent) {
 		_bstats_update(&cl->parent->bstats_bias,