Message ID | 429357af094297abbc45f47b8e606f11206df049.1684887977.git.peilin.ye@bytedance.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | net/sched: Fixes for sch_ingress and sch_clsact |
On 23/05/2023 22:20, Peilin Ye wrote: > From: Peilin Ye <peilin.ye@bytedance.com> > > mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in > ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs > access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for > clsact Qdiscs and miniq_egress. > > Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER > requests (thanks Hillf Danton for the hint), when replacing ingress or > clsact Qdiscs, for example, the old Qdisc ("@old") could access the same > miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"), > causing race conditions [1] including a use-after-free bug in > mini_qdisc_pair_swap() reported by syzbot: > > BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 > Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901 > ... > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:88 [inline] > dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 > print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319 > print_report mm/kasan/report.c:430 [inline] > kasan_report+0x11c/0x130 mm/kasan/report.c:536 > mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 > tcf_chain_head_change_item net/sched/cls_api.c:495 [inline] > tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509 > tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline] > tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline] > tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266 > ... > > @old and @new should not affect each other. In other words, @old should > never modify miniq_{in,e}gress after @new, and @new should not update > @old's RCU state. Fixing without changing sch_api.c turned out to be > difficult (please refer to Closes: for discussions). Instead, make sure > @new's first call always happen after @old's last call, in > qdisc_destroy(), has finished: > > In qdisc_graft(), return -EAGAIN and tell the caller to replay > (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter > requests, and call qdisc_destroy() for @old before grafting @new. > > Introduce qdisc_refcount_dec_if_one() as the counterpart of > qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests. Introduce > a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check, > just like qdisc_put() etc. > > Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact > Qdiscs". > > [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under > TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only > create under TC_H_INGRESS") on eth0 that has 8 transmission queues: > > Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then > adds a flower filter X to A. > > Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and > b2) to replace A, then adds a flower filter Y to B. > > Thread 1 A's refcnt Thread 2 > RTM_NEWQDISC (A, RTNL-locked) > qdisc_create(A) 1 > qdisc_graft(A) 9 > > RTM_NEWTFILTER (X, RTNL-unlocked) > __tcf_qdisc_find(A) 10 > tcf_chain0_head_change(A) > mini_qdisc_pair_swap(A) (1st) > | > | RTM_NEWQDISC (B, RTNL-locked) > RCU sync 2 qdisc_graft(B) > | 1 notify_and_destroy(A) > | > tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-unlocked) > qdisc_destroy(A) tcf_chain0_head_change(B) > tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd) > mini_qdisc_pair_swap(A) (3rd) | > ... ... 
> > Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its > mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during > ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets > on eth0 will not find filter Y in sch_handle_ingress(). > > This is only one of the possible consequences of concurrently accessing > miniq_{in,e}gress pointers. The point is clear though: again, A should > never modify those per-net_device pointers after B, and B should not > update A's RCU state. > > Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops") > Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops") > Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/ > Cc: Hillf Danton <hdanton@sina.com> > Cc: Vlad Buslov <vladbu@mellanox.com> > Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Tested-by: Pedro Tammela <pctammela@mojatatu.com> > --- > change in v5: > - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying, > just like @flags in tc_new_tfilter() > > change in v3, v4: > - add in-body From: tag > > changes in v2: > - replay the request if the current Qdisc has any ongoing RTNL-unlocked > filter requests (Vlad) > - minor changes in code comments and commit log > > include/net/sch_generic.h | 8 ++++++++ > net/sched/sch_api.c | 40 ++++++++++++++++++++++++++++++--------- > net/sched/sch_generic.c | 14 +++++++++++--- > 3 files changed, 50 insertions(+), 12 deletions(-) > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index fab5ba3e61b7..3e9cc43cbc90 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc) > refcount_inc(&qdisc->refcnt); > } > > +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc) > +{ > + if (qdisc->flags & TCQ_F_BUILTIN) > + return true; > + return refcount_dec_if_one(&qdisc->refcnt); > +} > + > /* Intended to be used by unlocked users, when concurrent qdisc release is > * possible. > */ > @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head); > struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue, > struct Qdisc *qdisc); > void qdisc_reset(struct Qdisc *qdisc); > +void qdisc_destroy(struct Qdisc *qdisc); > void qdisc_put(struct Qdisc *qdisc); > void qdisc_put_unlocked(struct Qdisc *qdisc); > void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len); > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > index f72a581666a2..286b7c58f5b9 100644 > --- a/net/sched/sch_api.c > +++ b/net/sched/sch_api.c > @@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > if ((q && q->flags & TCQ_F_INGRESS) || > (new && new->flags & TCQ_F_INGRESS)) { > ingress = 1; > - if (!dev_ingress_queue(dev)) { > + dev_queue = dev_ingress_queue(dev); > + if (!dev_queue) { > NL_SET_ERR_MSG(extack, "Device does not have an ingress queue"); > return -ENOENT; > } > + > + /* Replay if the current ingress (or clsact) Qdisc has ongoing > + * RTNL-unlocked filter request(s). This is the counterpart of that > + * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). 
> + */ > + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) > + return -EAGAIN; > } > > if (dev->flags & IFF_UP) > @@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > qdisc_put(old); > } > } else { > - dev_queue = dev_ingress_queue(dev); > - old = dev_graft_qdisc(dev_queue, new); > + old = dev_graft_qdisc(dev_queue, NULL); > + > + /* {ingress,clsact}_destroy() @old before grafting @new to avoid > + * unprotected concurrent accesses to net_device::miniq_{in,e}gress > + * pointer(s) in mini_qdisc_pair_swap(). > + */ > + qdisc_notify(net, skb, n, classid, old, new, extack); > + qdisc_destroy(old); > + > + dev_graft_qdisc(dev_queue, new); > } > > skip: > @@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > > if (new && new->ops->attach) > new->ops->attach(new); > - } else { > - notify_and_destroy(net, skb, n, classid, old, new, extack); > } > > if (dev->flags & IFF_UP) > @@ -1450,19 +1464,22 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, > struct netlink_ext_ack *extack) > { > struct net *net = sock_net(skb->sk); > - struct tcmsg *tcm = nlmsg_data(n); > struct nlattr *tca[TCA_MAX + 1]; > struct net_device *dev; > + struct Qdisc *q, *p; > + struct tcmsg *tcm; > u32 clid; > - struct Qdisc *q = NULL; > - struct Qdisc *p = NULL; > int err; > > +replay: > err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX, > rtm_tca_policy, extack); > if (err < 0) > return err; > > + tcm = nlmsg_data(n); > + q = p = NULL; > + > dev = __dev_get_by_index(net, tcm->tcm_ifindex); > if (!dev) > return -ENODEV; > @@ -1515,8 +1532,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, > return -ENOENT; > } > err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack); > - if (err != 0) > + if (err != 0) { > + if (err == -EAGAIN) > + goto replay; > return err; > + } > } else { > qdisc_notify(net, skb, n, clid, NULL, q, NULL); > } > @@ -1704,6 +1724,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n, > if (err) { > if (q) > qdisc_put(q); > + if (err == -EAGAIN) > + goto replay; > return err; > } > > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index 37e41f972f69..e14ed47f961c 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head) > qdisc_free(q); > } > > -static void qdisc_destroy(struct Qdisc *qdisc) > +static void __qdisc_destroy(struct Qdisc *qdisc) > { > const struct Qdisc_ops *ops = qdisc->ops; > > @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc) > call_rcu(&qdisc->rcu, qdisc_free_cb); > } > > +void qdisc_destroy(struct Qdisc *qdisc) > +{ > + if (qdisc->flags & TCQ_F_BUILTIN) > + return; > + > + __qdisc_destroy(qdisc); > +} > + > void qdisc_put(struct Qdisc *qdisc) > { > if (!qdisc) > @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc) > !refcount_dec_and_test(&qdisc->refcnt)) > return; > > - qdisc_destroy(qdisc); > + __qdisc_destroy(qdisc); > } > EXPORT_SYMBOL(qdisc_put); > > @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc) > !refcount_dec_and_rtnl_lock(&qdisc->refcnt)) > return; > > - qdisc_destroy(qdisc); > + __qdisc_destroy(qdisc); > rtnl_unlock(); > } > EXPORT_SYMBOL(qdisc_put_unlocked);
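To make the ordering bug described in [1] concrete, here is a minimal, single-threaded userspace sketch of the swap-after-swap scenario. This is only an illustration, not the kernel's mini_Qdisc code: the struct layouts, pair_swap() and dev_miniq_ingress are simplified stand-ins for mini_qdisc_pair_swap() and net_device::miniq_ingress.

```c
#include <stdio.h>

struct mini_qdisc { const char *filters; };
struct mini_qdisc_pair { struct mini_qdisc **p_miniq; struct mini_qdisc *miniq; };

static struct mini_qdisc *dev_miniq_ingress;   /* plays net_device::miniq_ingress */

/* roughly what mini_qdisc_pair_swap() does: publish our miniq, or NULL it out */
static void pair_swap(struct mini_qdisc_pair *p, int has_filters)
{
	*p->p_miniq = has_filters ? p->miniq : NULL;
}

int main(void)
{
	struct mini_qdisc a1 = { "filter X" }, b1 = { "filter Y" };
	struct mini_qdisc_pair A = { &dev_miniq_ingress, &a1 };   /* old ingress Qdisc */
	struct mini_qdisc_pair B = { &dev_miniq_ingress, &b1 };   /* new ingress Qdisc */

	pair_swap(&A, 1);   /* (1st) old Qdisc publishes filter X */
	pair_swap(&B, 1);   /* (2nd) new Qdisc publishes filter Y */
	pair_swap(&A, 0);   /* (3rd) old Qdisc's ingress_destroy() runs last... */

	/* ...so the data path now sees no ingress filters at all */
	printf("miniq_ingress -> %s\n",
	       dev_miniq_ingress ? dev_miniq_ingress->filters : "NULL");
	return 0;
}
```

Compiled and run, this prints "miniq_ingress -> NULL": the late swap from the old Qdisc's teardown wipes out the new Qdisc's filter, which is the sch_handle_ingress() symptom the commit message describes.
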
Pedro,

When you have a moment - could you run tc monitor in parallel to the
reproducer and double check it generates the correct events...

cheers,
jamal

On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
>
> On 23/05/2023 22:20, Peilin Ye wrote:
> > From: Peilin Ye <peilin.ye@bytedance.com>
> > [...]
> > Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
>
> Tested-by: Pedro Tammela <pctammela@mojatatu.com>

On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> When you have a moment - could you run tc monitor in parallel to the
> reproducer and double check it generates the correct events...

FTR, I'll wait a bit to apply this series, to allow for the above
tests. Unless someone screams very loudly very soon, it's not going
to enter today's PR. Since the addressed issue is an ancient one, it
should not be a problem, I hope.

Cheers,

Paolo

On Thu, May 25, 2023 at 5:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> > When you have a moment - could you run tc monitor in parallel to the
> > reproducer and double check it generates the correct events...
>
> FTR, I'll wait a bit to apply this series, to allow for the above
> tests. Unless someone screams very loudly very soon, it's not going
> to enter today's PR. Since the addressed issue is an ancient one, it
> should not be a problem, I hope.

Sorry, I do not mean to hold this back - I think we should have applied V1
and then worried about making things better. We can worry about events
after. So Acking this.

cheers,
jamal

> Cheers,
>
> Paolo
>

On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
>
> On 23/05/2023 22:20, Peilin Ye wrote:
> > From: Peilin Ye <peilin.ye@bytedance.com>
> > [...]
> > Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
>
> Tested-by: Pedro Tammela <pctammela@mojatatu.com>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

On Fri, May 26, 2023 at 8:20 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
> >
> > On 23/05/2023 22:20, Peilin Ye wrote:
> > > From: Peilin Ye <peilin.ye@bytedance.com>
>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

I apologize, I am going to take this back; let's please hold the series
for now. While chasing the effect on events, Pedro and I spent a few
_hours_ on this - and regardless of the events, there are still
challenges with the concurrency issue. The current reproducer
unfortunately can't cause damage after patch 2, so patch 6 was not
really being tested. We hacked the repro to hit the codepath patch 6
fixes. We are not sure what the root cause is - but it is certainly due
to the series.

Peilin, Pedro will post the new repro.

cheers,
jamal

On 26/05/2023 16:47, Jamal Hadi Salim wrote:
> [...] Peilin, Pedro will post the new repro.
Hi!
We tweaked the reproducer to:
---
r0 = socket$netlink(0x10, 0x3, 0x0)
r1 = socket(0x10, 0x803, 0x0)
sendmsg$nl_route_sched(r1, &(0x7f0000000300)={0x0, 0x0,
&(0x7f0000000240)={&(0x7f0000000380)=ANY=[], 0x24}}, 0x0)
getsockname$packet(r1, &(0x7f0000000200)={0x11, 0x0, <r2=>0x0, 0x1, 0x0,
0x6, @broadcast}, &(0x7f0000000440)=0x14)
sendmsg$nl_route(r0, &(0x7f0000000040)={0x0, 0x0,
&(0x7f0000000000)={&(0x7f0000000080)=ANY=[@ANYBLOB="480000001000050700"/20,
@ANYRES32=r2, @ANYBLOB="0000000000000000280012000900010076657468"],
0x48}}, 0x0)
sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f00000002c0)={0x0, 0x0,
&(0x7f0000000280)={&(0x7f0000000540)=@newqdisc={0x30, 0x24, 0xf0b, 0x0,
0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}},
[@qdisc_kind_options=@q_ingress={0xc}]}, 0x30}}, 0x0)
sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f0000000340)={0x0, 0x0,
&(0x7f00000000c0)={&(0x7f0000000580)=@newtfilter={0x3c, 0x2c, 0xd27,
0x0, 0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}, {0xc}},
[@filter_kind_options=@f_flower={{0xb}, {0xc, 0x2,
[@TCA_FLOWER_CLASSID={0x8}]}}]}, 0x3c}}, 0x0)
r4 = socket$netlink(0x10, 0x3, 0x0)
sendmmsg(r4, &(0x7f00000002c0), 0x40000000000009f, 0x0)
r5 = socket$netlink(0x10, 0x3, 0x0)
sendmmsg(r5, &(0x7f00000002c0), 0x40000000000009f, 0x0)
---
We then generate the C program with:
syz-prog2c -sandbox none -enable net_dev -threaded -repeat 0 -prog
peilin.syz > repro.c
Now here comes a very important detail. The above will create a new net
namespace to shoot the netlink messages. We are only able to reproduce
the deadlock with your patches if we comment the creation of the new
namespace out:
---
diff --git a/repro.c b/repro.c
index ee8eb0726..5cdbfb289 100644
--- a/repro.c
+++ b/repro.c
@@ -1121,9 +1121,8 @@ static int do_sandbox_none(void)
sandbox_common();
drop_caps();
initialize_netdevices_init();
- if (unshare(CLONE_NEWNET)) {
- }
+ // Doesn't seem to deadlock in a new netns
+ // if (unshare(CLONE_NEWNET)) {
+ // }
write_file("/proc/sys/net/ipv4/ping_group_range", "0 65535");
initialize_netdevices();
setup_binderfs();
---
The reason we did this was to check on the event with 'tc mon'.
The splat is quite big, see attached. It has all the indications of a
deadlock in the rtnl_lock.
Thanks,
Pedro
[ 291.882988][ T64] INFO: task kworker/1:2:4513 blocked for more than 143 seconds.
[ 291.885716][ T64] Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[ 291.887658][ T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 291.890484][ T64] task:kworker/1:2 state:D stack:28272 pid:4513 ppid:2 flags:0x00004000
[ 291.893628][ T64] Workqueue: events_power_efficient reg_check_chans_work
[ 291.896167][ T64] Call Trace:
[ 291.897406][ T64] <TASK>
[ 291.898476][ T64] __schedule+0xecb/0x59d0
[ 291.900101][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 291.902218][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 291.904563][ T64] ? io_schedule_timeout+0x150/0x150
[ 291.906437][ T64] ? reacquire_held_locks+0x4b0/0x4b0
[ 291.908241][ T64] ? _raw_spin_unlock_irq+0x23/0x50
[ 291.909564][ T64] ? lockdep_hardirqs_on+0x7d/0x100
[ 291.910912][ T64] schedule+0xe7/0x1b0
[ 291.912060][ T64] schedule_preempt_disabled+0x13/0x20
[ 291.913608][ T64] __mutex_lock+0x907/0x12d0
[ 291.914850][ T64] ? reg_check_chans_work+0x7d/0xfe0
[ 291.916249][ T64] ? mutex_lock_io_nested+0x1120/0x1120
[ 291.917689][ T64] ? reg_check_chans_work+0x7d/0xfe0
[ 291.919051][ T64] ? rtnl_lock+0x9/0x20
[ 291.920164][ T64] reg_check_chans_work+0x7d/0xfe0
[ 291.921451][ T64] ? lock_sync+0x190/0x190
[ 291.922567][ T64] ? freq_reg_info+0x1d0/0x1d0
[ 291.923802][ T64] ? spin_bug+0x1d0/0x1d0
[ 291.924910][ T64] process_one_work+0x9f9/0x15f0
[ 291.926155][ T64] ? pwq_dec_nr_in_flight+0x2a0/0x2a0
[ 291.927544][ T64] ? spin_bug+0x1d0/0x1d0
[ 291.928706][ T64] worker_thread+0x687/0x1110
[ 291.929867][ T64] ? process_one_work+0x15f0/0x15f0
[ 291.930968][ T64] kthread+0x334/0x430
[ 291.931869][ T64] ? kthread_complete_and_exit+0x40/0x40
[ 291.933092][ T64] ret_from_fork+0x1f/0x30
[ 291.934037][ T64] </TASK>
[ 291.934675][ T64] INFO: task kworker/3:2:6525 blocked for more than 143 seconds.
[ 291.936271][ T64] Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[ 291.937672][ T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 291.939514][ T64] task:kworker/3:2 state:D stack:28272 pid:6525 ppid:2 flags:0x00004000
[ 291.941718][ T64] Workqueue: events linkwatch_event
[ 291.943042][ T64] Call Trace:
[ 291.943834][ T64] <TASK>
[ 291.944526][ T64] __schedule+0xecb/0x59d0
[ 291.945582][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 291.946952][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 291.948389][ T64] ? io_schedule_timeout+0x150/0x150
[ 291.949615][ T64] ? reacquire_held_locks+0x4b0/0x4b0
[ 291.950847][ T64] ? _raw_spin_unlock_irq+0x23/0x50
[ 291.952077][ T64] ? lockdep_hardirqs_on+0x7d/0x100
[ 291.953344][ T64] schedule+0xe7/0x1b0
[ 291.954286][ T64] schedule_preempt_disabled+0x13/0x20
[ 291.955512][ T64] __mutex_lock+0x907/0x12d0
[ 291.956603][ T64] ? linkwatch_event+0xf/0x70
[ 291.957680][ T64] ? mutex_lock_io_nested+0x1120/0x1120
[ 291.958953][ T64] ? spin_bug+0x1d0/0x1d0
[ 291.959996][ T64] ? linkwatch_event+0xf/0x70
[ 291.960738][ T64] ? rtnl_lock+0x9/0x20
[ 291.961249][ T64] linkwatch_event+0xf/0x70
[ 291.961770][ T64] process_one_work+0x9f9/0x15f0
[ 291.962306][ T64] ? pwq_dec_nr_in_flight+0x2a0/0x2a0
[ 291.962928][ T64] ? spin_bug+0x1d0/0x1d0
[ 291.963409][ T64] worker_thread+0x687/0x1110
[ 291.963982][ T64] ? __kthread_parkme+0x152/0x220
[ 291.964567][ T64] ? process_one_work+0x15f0/0x15f0
[ 291.965163][ T64] kthread+0x334/0x430
[ 291.965641][ T64] ? kthread_complete_and_exit+0x40/0x40
[ 291.966303][ T64] ret_from_fork+0x1f/0x30
[ 291.966836][ T64] </TASK>
[ 291.967189][ T64] INFO: task systemd-udevd:6764 blocked for more than 143 seconds.
[ 291.968062][ T64] Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[ 291.968813][ T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 291.969691][ T64] task:systemd-udevd state:D stack:25968 pid:6764 ppid:4544 flags:0x00000006
[ 291.970609][ T64] Call Trace:
[ 291.970939][ T64] <TASK>
[ 291.971239][ T64] __schedule+0xecb/0x59d0
[ 291.971718][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 291.972341][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 291.972965][ T64] ? io_schedule_timeout+0x150/0x150
[ 291.973572][ T64] ? __mutex_lock+0x902/0x12d0
[ 291.974120][ T64] schedule+0xe7/0x1b0
[ 291.974579][ T64] schedule_preempt_disabled+0x13/0x20
[ 291.975174][ T64] __mutex_lock+0x907/0x12d0
[ 291.975702][ T64] ? rtnetlink_rcv_msg+0x3e2/0xd30
[ 291.976299][ T64] ? mutex_lock_io_nested+0x1120/0x1120
[ 291.976938][ T64] ? rtnetlink_rcv_msg+0x3e2/0xd30
[ 291.977506][ T64] rtnetlink_rcv_msg+0x3e2/0xd30
[ 291.978042][ T64] ? rtnl_getlink+0xb10/0xb10
[ 291.978573][ T64] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 291.979295][ T64] ? __module_address+0x55/0x3b0
[ 291.979937][ T64] ? netdev_core_pick_tx+0x390/0x390
[ 291.980580][ T64] netlink_rcv_skb+0x166/0x440
[ 291.981164][ T64] ? rtnl_getlink+0xb10/0xb10
[ 291.981726][ T64] ? netlink_ack+0x1370/0x1370
[ 291.982310][ T64] ? kasan_set_track+0x25/0x30
[ 291.982973][ T64] ? netlink_deliver_tap+0x1b1/0xd00
[ 291.983579][ T64] netlink_unicast+0x530/0x800
[ 291.984132][ T64] ? netlink_attachskb+0x880/0x880
[ 291.984682][ T64] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 291.985330][ T64] ? __phys_addr_symbol+0x30/0x70
[ 291.985904][ T64] ? __check_object_size+0x323/0x740
[ 291.986488][ T64] netlink_sendmsg+0x90b/0xe10
[ 291.987023][ T64] ? netlink_unicast+0x800/0x800
[ 291.987584][ T64] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 291.988189][ T64] ? netlink_unicast+0x800/0x800
[ 291.988737][ T64] sock_sendmsg+0xd9/0x180
[ 291.989228][ T64] __sys_sendto+0x201/0x2f0
[ 291.989714][ T64] ? __ia32_sys_getpeername+0xb0/0xb0
[ 291.990296][ T64] ? get_random_bytes+0x20/0x20
[ 291.990833][ T64] ? rcu_is_watching+0x12/0xb0
[ 291.991368][ T64] ? __rseq_handle_notify_resume+0x5c7/0xfe0
[ 291.992062][ T64] __x64_sys_sendto+0xe0/0x1b0
[ 291.992600][ T64] ? syscall_enter_from_user_mode+0x26/0x80
[ 291.993324][ T64] do_syscall_64+0x38/0xb0
[ 291.993816][ T64] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 291.994436][ T64] RIP: 0033:0x7f1de823fcf7
[ 291.994902][ T64] RSP: 002b:00007ffd8c5d3208 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[ 291.995834][ T64] RAX: ffffffffffffffda RBX: 000055d9974d14e0 RCX: 00007f1de823fcf7
[ 291.996711][ T64] RDX: 0000000000000020 RSI: 000055d997332930 RDI: 0000000000000005
[ 291.997593][ T64] RBP: 00007ffd8c5d33b8 R08: 00007ffd8c5d3280 R09: 0000000000000080
[ 291.998477][ T64] R10: 0000000000000000 R11: 0000000000000202 R12: 00000000043abe07
[ 291.999360][ T64] R13: 0000000000000001 R14: 000055d9974845a0 R15: 0000000000000000
[ 292.000254][ T64] </TASK>
[ 292.000609][ T64] INFO: task systemd-udevd:6773 blocked for more than 143 seconds.
[ 292.001482][ T64] Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[ 292.002233][ T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 292.003237][ T64] task:systemd-udevd state:D stack:25968 pid:6773 ppid:4544 flags:0x00004006
[ 292.004292][ T64] Call Trace:
[ 292.004743][ T64] <TASK>
[ 292.005128][ T64] __schedule+0xecb/0x59d0
[ 292.005637][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 292.006304][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 292.007005][ T64] ? io_schedule_timeout+0x150/0x150
[ 292.007601][ T64] ? __mutex_lock+0x902/0x12d0
[ 292.008174][ T64] schedule+0xe7/0x1b0
[ 292.008648][ T64] schedule_preempt_disabled+0x13/0x20
[ 292.009257][ T64] __mutex_lock+0x907/0x12d0
[ 292.009793][ T64] ? dev_ethtool+0x21d/0x53e0
[ 292.010374][ T64] ? mutex_lock_io_nested+0x1120/0x1120
[ 292.011105][ T64] ? kasan_unpoison+0x44/0x70
[ 292.011987][ T64] ? __kmem_cache_alloc_node+0x19e/0x330
[ 292.012994][ T64] ? dev_ethtool+0x170/0x53e0
[ 292.013757][ T64] ? dev_ethtool+0x21d/0x53e0
[ 292.014470][ T64] ? rtnl_lock+0x9/0x20
[ 292.015032][ T64] dev_ethtool+0x21d/0x53e0
[ 292.015633][ T64] ? filter_irq_stacks+0x90/0x90
[ 292.016269][ T64] ? print_usage_bug.part.0+0x670/0x670
[ 292.016941][ T64] ? ethtool_get_module_info_call+0x1c0/0x1c0
[ 292.017674][ T64] ? kasan_set_track+0x25/0x30
[ 292.018228][ T64] ? ____kasan_slab_free+0x15e/0x1b0
[ 292.018794][ T64] ? slab_free_freelist_hook+0x10b/0x1e0
[ 292.019287][ T64] ? __kmem_cache_free+0xaf/0x2e0
[ 292.019729][ T64] ? tomoyo_path_number_perm+0x43a/0x530
[ 292.020250][ T64] ? security_file_ioctl+0x72/0xb0
[ 292.020700][ T64] ? __x64_sys_ioctl+0xbb/0x210
[ 292.021128][ T64] ? do_syscall_64+0x38/0xb0
[ 292.021538][ T64] ? mark_lock+0x105/0x1960
[ 292.021950][ T64] ? __lock_acquire+0xc42/0x5cf0
[ 292.022399][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 292.022990][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 292.023514][ T64] ? lockdep_hardirqs_on+0x7d/0x100
[ 292.023995][ T64] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 292.024513][ T64] ? find_held_lock+0x2d/0x110
[ 292.024940][ T64] ? dev_load+0x7d/0x240
[ 292.025322][ T64] ? reacquire_held_locks+0x4b0/0x4b0
[ 292.025801][ T64] ? full_name_hash+0xbc/0x110
[ 292.026220][ T64] dev_ioctl+0x29e/0x1090
[ 292.026601][ T64] sock_do_ioctl+0x1be/0x250
[ 292.027006][ T64] ? get_user_ifreq+0x250/0x250
[ 292.027433][ T64] ? do_vfs_ioctl+0x34a/0x17f0
[ 292.027870][ T64] ? vfs_fileattr_set+0xbf0/0xbf0
[ 292.028318][ T64] sock_ioctl+0x205/0x6a0
[ 292.028697][ T64] ? br_ioctl_call+0xb0/0xb0
[ 292.029145][ T64] ? reacquire_held_locks+0x4b0/0x4b0
[ 292.029827][ T64] ? handle_mm_fault+0x32c/0x9e0
[ 292.030501][ T64] ? bpf_lsm_file_ioctl+0x9/0x10
[ 292.031054][ T64] ? br_ioctl_call+0xb0/0xb0
[ 292.031527][ T64] __x64_sys_ioctl+0x189/0x210
[ 292.032091][ T64] do_syscall_64+0x38/0xb0
[ 292.032609][ T64] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 292.033390][ T64] RIP: 0033:0x7f1de82319ef
[ 292.033908][ T64] RSP: 002b:00007ffd8c5d32f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 292.034751][ T64] RAX: ffffffffffffffda RBX: 000055d997476d20 RCX: 00007f1de82319ef
[ 292.035448][ T64] RDX: 00007ffd8c5d3430 RSI: 0000000000008946 RDI: 000000000000000b
[ 292.036179][ T64] RBP: 000055d997488a50 R08: 0000000000000000 R09: 0000000000000006
[ 292.036884][ T64] R10: 00007ffd8c5d3400 R11: 0000000000000246 R12: 000055d99732a230
[ 292.037581][ T64] R13: 00007ffd8c5d3430 R14: 0000000000000000 R15: 000055d99732a238
[ 292.038268][ T64] </TASK>
[ 292.038557][ T64] INFO: task repro:7652 blocked for more than 143 seconds.
[ 292.039202][ T64] Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[ 292.039776][ T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 292.040509][ T64] task:repro state:D stack:25984 pid:7652 ppid:6753 flags:0x00004006
[ 292.041289][ T64] Call Trace:
[ 292.041575][ T64] <TASK>
[ 292.041848][ T64] __schedule+0xecb/0x59d0
[ 292.042297][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 292.043017][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 292.043711][ T64] ? io_schedule_timeout+0x150/0x150
[ 292.044360][ T64] ? __mutex_lock+0x902/0x12d0
[ 292.044845][ T64] schedule+0xe7/0x1b0
[ 292.045300][ T64] schedule_preempt_disabled+0x13/0x20
[ 292.045872][ T64] __mutex_lock+0x907/0x12d0
[ 292.046396][ T64] ? fl_change+0x2102/0x51e0
[ 292.046833][ T64] ? tc_setup_cb_reoffload+0x100/0x100
[ 292.047353][ T64] ? mutex_lock_io_nested+0x1120/0x1120
[ 292.047915][ T64] ? _raw_spin_unlock+0x28/0x40
[ 292.048368][ T64] ? tcf_exts_init_ex+0x1bc/0x610
[ 292.048828][ T64] ? fl_change+0x2102/0x51e0
[ 292.049328][ T64] ? rtnl_lock+0x9/0x20
[ 292.049768][ T64] fl_change+0x2102/0x51e0
[ 292.050232][ T64] ? find_held_lock+0x2d/0x110
[ 292.050681][ T64] ? __rhashtable_insert_fast.constprop.0.isra.0+0x14a0/0x14a0
[ 292.051416][ T64] ? fl_get+0x21c/0x3c0
[ 292.051791][ T64] ? fl_put+0x20/0x20
[ 292.052170][ T64] ? mini_qdisc_pair_swap+0x128/0x1f0
[ 292.052655][ T64] ? __rhashtable_insert_fast.constprop.0.isra.0+0x14a0/0x14a0
[ 292.053395][ T64] tc_new_tfilter+0x992/0x22b0
[ 292.053858][ T64] ? tc_del_tfilter+0x1570/0x1570
[ 292.054419][ T64] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 292.055066][ T64] ? kasan_quarantine_put+0x102/0x230
[ 292.055660][ T64] ? rtnetlink_rcv_msg+0x94a/0xd30
[ 292.056224][ T64] ? reacquire_held_locks+0x4b0/0x4b0
[ 292.056799][ T64] ? bpf_lsm_capable+0x9/0x10
[ 292.057311][ T64] ? tc_del_tfilter+0x1570/0x1570
[ 292.057843][ T64] rtnetlink_rcv_msg+0x98a/0xd30
[ 292.058381][ T64] ? rtnl_getlink+0xb10/0xb10
[ 292.058833][ T64] ? __x64_sys_sendmmsg+0x9c/0x100
[ 292.059305][ T64] ? do_syscall_64+0x38/0xb0
[ 292.059752][ T64] ? netdev_core_pick_tx+0x390/0x390
[ 292.060297][ T64] netlink_rcv_skb+0x166/0x440
[ 292.060733][ T64] ? rtnl_getlink+0xb10/0xb10
[ 292.061146][ T64] ? netlink_ack+0x1370/0x1370
[ 292.061566][ T64] ? kasan_set_track+0x25/0x30
[ 292.062003][ T64] ? netlink_deliver_tap+0x1b1/0xd00
[ 292.062468][ T64] netlink_unicast+0x530/0x800
[ 292.062955][ T64] ? netlink_attachskb+0x880/0x880
[ 292.063416][ T64] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 292.063958][ T64] ? __phys_addr_symbol+0x30/0x70
[ 292.064468][ T64] ? __check_object_size+0x323/0x740
[ 292.065009][ T64] netlink_sendmsg+0x90b/0xe10
[ 292.065517][ T64] ? netlink_unicast+0x800/0x800
[ 292.066010][ T64] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 292.066467][ T64] ? netlink_unicast+0x800/0x800
[ 292.066933][ T64] sock_sendmsg+0xd9/0x180
[ 292.067358][ T64] ____sys_sendmsg+0x264/0x910
[ 292.067794][ T64] ? kernel_sendmsg+0x50/0x50
[ 292.068252][ T64] ? __copy_msghdr+0x460/0x460
[ 292.068690][ T64] ___sys_sendmsg+0x11d/0x1b0
[ 292.069111][ T64] ? do_recvmmsg+0x700/0x700
[ 292.069551][ T64] ? find_held_lock+0x2d/0x110
[ 292.069972][ T64] ? __might_fault+0xe5/0x190
[ 292.070397][ T64] ? reacquire_held_locks+0x4b0/0x4b0
[ 292.070862][ T64] __sys_sendmmsg+0x18e/0x430
[ 292.071272][ T64] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 292.071705][ T64] ? reacquire_held_locks+0x4b0/0x4b0
[ 292.072204][ T64] ? rcu_is_watching+0x12/0xb0
[ 292.072620][ T64] ? xfd_validate_state+0x5d/0x180
[ 292.073099][ T64] ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 292.073611][ T64] ? unlock_page_memcg+0x2d0/0x2d0
[ 292.074062][ T64] ? do_futex+0x350/0x350
[ 292.074461][ T64] __x64_sys_sendmmsg+0x9c/0x100
[ 292.074893][ T64] ? syscall_enter_from_user_mode+0x26/0x80
[ 292.075494][ T64] do_syscall_64+0x38/0xb0
[ 292.075960][ T64] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 292.076473][ T64] RIP: 0033:0x7f3f4319289d
[ 292.076850][ T64] RSP: 002b:00007f3f43056e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[ 292.077557][ T64] RAX: ffffffffffffffda RBX: 0000558de6727248 RCX: 00007f3f4319289d
[ 292.078234][ T64] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000006
[ 292.078924][ T64] RBP: 0000000000000006 R08: 0000000000000000 R09: 0000000000000000
[ 292.079586][ T64] R10: 0000000000000000 R11: 0000000000000246 R12: 0000558de672724c
[ 292.080259][ T64] R13: 0000558de6723030 R14: 040000000000009f R15: 0000558de6727240
[ 292.080943][ T64] </TASK>
[ 292.081221][ T64]
[ 292.081221][ T64] Showing all locks held in the system:
[ 292.081868][ T64] 1 lock held by rcu_tasks_kthre/14:
[ 292.082316][ T64] #0: ffffffff8c798430 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2c/0xe20
[ 292.083266][ T64] 1 lock held by rcu_tasks_trace/15:
[ 292.083721][ T64] #0: ffffffff8c798130 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2c/0xe20
[ 292.084702][ T64] 5 locks held by ksoftirqd/4/38:
[ 292.085152][ T64] 3 locks held by kworker/4:0/39:
[ 292.085598][ T64] #0: ffff88811b78b938 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[ 292.086704][ T64] #1: ffffc9000139fda0 ((work_completion)(&(&net->ipv6.addr_chk_work)->work)){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[ 292.088126][ T64] #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: addrconf_verify_work+0x12/0x30
[ 292.089159][ T64] 1 lock held by khungtaskd/64:
[ 292.089656][ T64] #0: ffffffff8c799040 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x55/0x340
[ 292.090713][ T64] 3 locks held by kworker/1:2/4513:
[ 292.091175][ T64] #0: ffff888100079d38 ((wq_completion)events_power_efficient){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[ 292.092274][ T64] #1: ffffc90016fcfda0 ((reg_check_chans).work){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[ 292.093319][ T64] #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: reg_check_chans_work+0x7d/0xfe0
[ 292.094269][ T64] 3 locks held by kworker/3:2/6525:
[ 292.094801][ T64] #0: ffff888100078d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[ 292.095834][ T64] #1: ffffc9000648fda0 ((linkwatch_work).work){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[ 292.096754][ T64] #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: linkwatch_event+0xf/0x70
[ 292.097662][ T64] 1 lock held by systemd-udevd/6764:
[ 292.098148][ T64] #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x3e2/0xd30
[ 292.098984][ T64] 1 lock held by systemd-udevd/6773:
[ 292.099462][ T64] #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: dev_ethtool+0x21d/0x53e0
[ 292.100300][ T64] 1 lock held by repro/7652:
[ 292.100700][ T64] #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: fl_change+0x2102/0x51e0
[ 292.101494][ T64] 2 locks held by repro/7653:
[ 292.101905][ T64]
[ 292.102135][ T64] =============================================
[ 292.102135][ T64]
[ 292.102905][ T64] NMI backtrace for cpu 5
[ 292.103277][ T64] CPU: 5 PID: 64 Comm: khungtaskd Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[ 292.104054][ T64] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 292.104931][ T64] Call Trace:
[ 292.105222][ T64] <TASK>
[ 292.105481][ T64] dump_stack_lvl+0xd9/0x1b0
[ 292.105889][ T64] nmi_cpu_backtrace+0x277/0x380
[ 292.106318][ T64] ? lapic_can_unplug_cpu+0xa0/0xa0
[ 292.106770][ T64] nmi_trigger_cpumask_backtrace+0x2ac/0x310
[ 292.107290][ T64] watchdog+0xed1/0x1150
[ 292.107715][ T64] ? proc_dohung_task_timeout_secs+0x90/0x90
[ 292.108330][ T64] kthread+0x334/0x430
[ 292.108736][ T64] ? kthread_complete_and_exit+0x40/0x40
[ 292.109285][ T64] ret_from_fork+0x1f/0x30
[ 292.109720][ T64] </TASK>
[ 292.109991][ T64] Sending NMI from CPU 5 to CPUs 0-4,6-7:
[ 292.110569][ C3] NMI backtrace for cpu 3
[ 292.110577][ C3] CPU: 3 PID: 7653 Comm: repro Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[ 292.110586][ C3] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 292.110591][ C3] RIP: 0010:unwind_get_return_address+0x6e/0xa0
[ 292.110604][ C3] Code: ea 03 80 3c 02 00 75 32 48 8b 7b 48 e8 3b 46 1c 00 85 c0 74 d3 48 b8 00 00 00 00 00 fc ff df 48 89 ea 48 c1 ea 03 80 3c 02 00 <75> 18 48 8b 43 48 5b 5d c3 e8 14 c5 9e 00 eb a8 48 89 ef e8 2a c5
[ 292.110611][ C3] RSP: 0018:ffffc900095d6ed0 EFLAGS: 00000246
[ 292.110618][ C3] RAX: dffffc0000000000 RBX: ffffc900095d6ee8 RCX: 0000000000000000
[ 292.110623][ C3] RDX: 1ffff920012bade6 RSI: ffffc900095d72a0 RDI: ffffffff88370cd5
[ 292.110628][ C3] RBP: ffffc900095d6f30 R08: ffffc900095d6f1c R09: ffffffff8f831a34
[ 292.110633][ C3] R10: ffffc900095d6ee8 R11: 0000000000072d0d R12: ffffc900095d6fa0
[ 292.110637][ C3] R13: 0000000000000000 R14: ffff888173271dc0 R15: ffff888171b140a0
[ 292.110645][ C3] FS: 00007f3f430376c0(0000) GS:ffff888437c80000(0000) knlGS:0000000000000000
[ 292.110653][ C3] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 292.110658][ C3] CR2: 0000558de83422d8 CR3: 0000000116ff7000 CR4: 0000000000350ee0
[ 292.110664][ C3] Call Trace:
[ 292.110667][ C3] <TASK>
[ 292.110670][ C3] ? write_profile+0x440/0x440
[ 292.110684][ C3] arch_stack_walk+0x97/0xf0
[ 292.110697][ C3] ? ingress_init+0x165/0x220
[ 292.110706][ C3] stack_trace_save+0x96/0xd0
[ 292.110716][ C3] ? filter_irq_stacks+0x90/0x90
[ 292.110727][ C3] kasan_save_stack+0x20/0x40
[ 292.110741][ C3] ? kasan_save_stack+0x20/0x40
[ 292.110753][ C3] ? kasan_set_track+0x25/0x30
[ 292.110766][ C3] ? __kasan_kmalloc+0xa2/0xb0
[ 292.110776][ C3] ? tcf_block_get_ext+0x14f/0x1900
[ 292.110785][ C3] ? ingress_init+0x165/0x220
[ 292.110792][ C3] ? qdisc_create+0x4f7/0x1090
[ 292.110799][ C3] ? tc_modify_qdisc+0xd38/0x1b70
[ 292.110806][ C3] ? rtnetlink_rcv_msg+0x439/0xd30
[ 292.110813][ C3] ? netlink_rcv_skb+0x166/0x440
[ 292.110824][ C3] ? netlink_unicast+0x530/0x800
[ 292.110835][ C3] ? netlink_sendmsg+0x90b/0xe10
[ 292.110847][ C3] ? find_held_lock+0x2d/0x110
[ 292.110857][ C3] ? __kmem_cache_alloc_node+0x49/0x330
[ 292.110867][ C3] ? reacquire_held_locks+0x4b0/0x4b0
[ 292.110878][ C3] ? kasan_unpoison+0x44/0x70
[ 292.110885][ C3] ? __kasan_slab_alloc+0x4e/0x90
[ 292.110896][ C3] ? __kmem_cache_alloc_node+0x19e/0x330
[ 292.110906][ C3] ? tcf_block_get_ext+0x14f/0x1900
[ 292.110915][ C3] kasan_set_track+0x25/0x30
[ 292.110925][ C3] __kasan_kmalloc+0xa2/0xb0
[ 292.110935][ C3] tcf_block_get_ext+0x14f/0x1900
[ 292.110945][ C3] ingress_init+0x165/0x220
[ 292.110952][ C3] ? clsact_init+0x330/0x330
[ 292.110959][ C3] qdisc_create+0x4f7/0x1090
[ 292.110967][ C3] ? tc_get_qdisc+0xb80/0xb80
[ 292.110975][ C3] tc_modify_qdisc+0xd38/0x1b70
[ 292.110984][ C3] ? qdisc_create+0x1090/0x1090
[ 292.110994][ C3] ? bpf_lsm_capable+0x9/0x10
[ 292.111004][ C3] ? qdisc_create+0x1090/0x1090
[ 292.111012][ C3] rtnetlink_rcv_msg+0x439/0xd30
[ 292.111019][ C3] ? rtnl_getlink+0xb10/0xb10
[ 292.111025][ C3] ? __x64_sys_sendmmsg+0x9c/0x100
[ 292.111034][ C3] ? do_syscall_64+0x38/0xb0
[ 292.111043][ C3] ? netdev_core_pick_tx+0x390/0x390
[ 292.111056][ C3] netlink_rcv_skb+0x166/0x440
[ 292.111067][ C3] ? rtnl_getlink+0xb10/0xb10
[ 292.111074][ C3] ? netlink_ack+0x1370/0x1370
[ 292.111085][ C3] ? kasan_set_track+0x25/0x30
[ 292.111097][ C3] ? netlink_deliver_tap+0x1b1/0xd00
[ 292.111109][ C3] netlink_unicast+0x530/0x800
[ 292.111121][ C3] ? netlink_attachskb+0x880/0x880
[ 292.111132][ C3] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 292.111142][ C3] ? __phys_addr_symbol+0x30/0x70
[ 292.111153][ C3] ? __check_object_size+0x323/0x740
[ 292.111161][ C3] netlink_sendmsg+0x90b/0xe10
[ 292.111173][ C3] ? netlink_unicast+0x800/0x800
[ 292.111185][ C3] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 292.111193][ C3] ? netlink_unicast+0x800/0x800
[ 292.111204][ C3] sock_sendmsg+0xd9/0x180
[ 292.111216][ C3] ____sys_sendmsg+0x264/0x910
[ 292.111227][ C3] ? kernel_sendmsg+0x50/0x50
[ 292.111238][ C3] ? __copy_msghdr+0x460/0x460
[ 292.111246][ C3] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 292.111258][ C3] ___sys_sendmsg+0x11d/0x1b0
[ 292.111266][ C3] ? do_recvmmsg+0x700/0x700
[ 292.111274][ C3] ? __fget_files+0x260/0x420
[ 292.111284][ C3] ? reacquire_held_locks+0x4b0/0x4b0
[ 292.111296][ C3] ? __fget_files+0x282/0x420
[ 292.111307][ C3] ? __fget_light+0xe6/0x270
[ 292.111318][ C3] __sys_sendmmsg+0x18e/0x430
[ 292.111327][ C3] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 292.111334][ C3] ? reacquire_held_locks+0x4b0/0x4b0
[ 292.111346][ C3] ? rcu_is_watching+0x12/0xb0
[ 292.111358][ C3] ? __rseq_handle_notify_resume+0x5c7/0xfe0
[ 292.111368][ C3] ? __do_sys_rseq+0x750/0x750
[ 292.111376][ C3] ? unlock_page_memcg+0x2d0/0x2d0
[ 292.111387][ C3] ? do_futex+0x350/0x350
[ 292.111395][ C3] __x64_sys_sendmmsg+0x9c/0x100
[ 292.111403][ C3] ? syscall_enter_from_user_mode+0x26/0x80
[ 292.111414][ C3] do_syscall_64+0x38/0xb0
[ 292.111421][ C3] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 292.111433][ C3] RIP: 0033:0x7f3f4319289d
[ 292.111440][ C3] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 89 01 48
[ 292.111447][ C3] RSP: 002b:00007f3f43035e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[ 292.111454][ C3] RAX: ffffffffffffffda RBX: 0000558de6727258 RCX: 00007f3f4319289d
[ 292.111459][ C3] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
[ 292.111463][ C3] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 292.111467][ C3] R10: 0000000000000000 R11: 0000000000000246 R12: 0000558de672725c
[ 292.111472][ C3] R13: 0000558de6723030 R14: 040000000000009f R15: 0000558de6727250
[ 292.111479][ C3] </TASK>
[ 292.111481][ C4] NMI backtrace for cpu 4 skipped: idling at default_idle+0xf/0x20
[ 292.092844][ C7] NMI backtrace for cpu 7 skipped: idling at default_idle+0xf/0x20
[ 292.092832][ C2] NMI backtrace for cpu 2 skipped: idling at default_idle+0xf/0x20
[ 292.092832][ C2] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.014 msecs
[ 292.092820][ C1] NMI backtrace for cpu 1 skipped: idling at default_idle+0xf/0x20
[ 292.092820][ C1] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.033 msecs
[ 292.102824][ C6] NMI backtrace for cpu 6 skipped: idling at default_idle+0xf/0x20
[ 292.102824][ C6] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.054 msecs
[ 292.102812][ C0] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20
[ 292.102812][ C0] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.092 msecs
[ 292.112559][ T64] Kernel panic - not syncing: hung_task: blocked tasks
[ 292.112565][ T64] CPU: 5 PID: 64 Comm: khungtaskd Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[ 292.112575][ T64] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 292.112582][ T64] Call Trace:
[ 292.112586][ T64] <TASK>
[ 292.112590][ T64] dump_stack_lvl+0xd9/0x1b0
[ 292.112603][ T64] panic+0x689/0x730
[ 292.112615][ T64] ? panic_smp_self_stop+0xa0/0xa0
[ 292.112631][ T64] ? __irq_work_queue_local+0x132/0x3f0
[ 292.112641][ T64] ? irq_work_queue+0x2a/0x70
[ 292.112649][ T64] ? __wake_up_klogd.part.0+0x99/0xf0
[ 292.112663][ T64] ? watchdog+0xc89/0x1150
[ 292.112678][ T64] watchdog+0xc9a/0x1150
[ 292.112698][ T64] ? proc_dohung_task_timeout_secs+0x90/0x90
[ 292.112713][ T64] kthread+0x334/0x430
[ 292.112723][ T64] ? kthread_complete_and_exit+0x40/0x40
[ 292.112734][ T64] ret_from_fork+0x1f/0x30
[ 292.112750][ T64] </TASK>
[ 292.112866][ T64] Kernel Offset: disabled
[ 292.112866][ T64] Rebooting in 86400 seconds..
On Fri, May 26, 2023 at 05:21:34PM -0300, Pedro Tammela wrote:
> On 26/05/2023 16:47, Jamal Hadi Salim wrote:
> > [...] Peilin, Pedro will post the new repro.
>
> We tweaked the reproducer to:
> [...]
>
> The splat is quite big, see attached. It has all the indications of a
> deadlock in the rtnl_lock.

Thanks a lot, I'll get right on it.

Peilin Ye

On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> Thanks a lot, I'll get right on it.
Any insights? Is it just a live-lock inherent to the retry scheme
or we actually forget to release the lock/refcnt?
Hi Jakub and all, On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote: > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote: > > Thanks a lot, I'll get right on it. > > Any insights? Is it just a live-lock inherent to the retry scheme > or we actually forget to release the lock/refcnt? I think it's just a thread holding the RTNL mutex for too long (replaying too many times). We could replay for arbitrary times in tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter requests for the old Qdisc. I tested the new reproducer Pedro posted, on: 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger any issues (in about 30 minutes). 3. All 6 v5 patches, plus this diff: diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 286b7c58f5b9..988718ba5abe 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, * RTNL-unlocked filter request(s). This is the counterpart of that * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). */ - if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) { + rtnl_unlock(); + rtnl_lock(); return -EAGAIN; + } } if (dev->flags & IFF_UP) Did not trigger any issues (in about 30 mintues) either. What would you suggest? Thanks, Peilin Ye
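For context, "replay" here refers to the retry pattern built around the replay label in tc_modify_qdisc(); with the proposed -EAGAIN return from qdisc_graft() it behaves roughly like the condensed sketch below (not the literal sch_api.c code; parsing, lookup and cleanup details are omitted):

/* Condensed sketch of tc_modify_qdisc(), not the literal code */
replay:
	/* re-parse the netlink request; look up dev, parent and the qdisc again */

	err = qdisc_graft(dev, p, skb, n, clid, q, NULL, extack);
	if (err == -EAGAIN)
		goto replay;	/* RTNL stays held across every retry */

Because the retry loop never drops RTNL, a steady stream of RTNL-unlocked filter requests on the old qdisc can keep the refcount above one and keep this request replaying.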
On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote: > > Hi Jakub and all, > > On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote: > > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote: > > > Thanks a lot, I'll get right on it. > > > > Any insights? Is it just a live-lock inherent to the retry scheme > > or we actually forget to release the lock/refcnt? > > I think it's just a thread holding the RTNL mutex for too long (replaying > too many times). We could replay for arbitrary times in > tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter > requests for the old Qdisc. > > I tested the new reproducer Pedro posted, on: > > 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported > > 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger > any issues (in about 30 minutes). > > 3. All 6 v5 patches, plus this diff: > > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > index 286b7c58f5b9..988718ba5abe 100644 > --- a/net/sched/sch_api.c > +++ b/net/sched/sch_api.c > @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > * RTNL-unlocked filter request(s). This is the counterpart of that > * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). > */ > - if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) > + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) { > + rtnl_unlock(); > + rtnl_lock(); > return -EAGAIN; > + } > } > > if (dev->flags & IFF_UP) > > Did not trigger any issues (in about 30 mintues) either. > > What would you suggest? I am more worried it is a wackamole situation. We fixed the first reproducer with essentially patches 1-4 but we opened a new one which the second reproducer catches. One thing the current reproducer does is create a lot rtnl contention in the beggining by creating all those devices and then after it is just creating/deleting qdisc and doing update with flower where such contention is reduced. i.e it may just take longer for the mole to pop up. Why dont we push the V1 patch in and then worry about getting clever with EAGAIN after? Can you test the V1 version with the repro Pedro posted? It shouldnt have these issues. Also it would be interesting to see how performance of the parallel updates to flower is affected. cheers, jamal > Thanks, > Peilin Ye >
On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote: > On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote: >> >> Hi Jakub and all, >> >> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote: >> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote: >> > > Thanks a lot, I'll get right on it. >> > >> > Any insights? Is it just a live-lock inherent to the retry scheme >> > or we actually forget to release the lock/refcnt? >> >> I think it's just a thread holding the RTNL mutex for too long (replaying >> too many times). We could replay for arbitrary times in >> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter >> requests for the old Qdisc. After looking very carefully at the code I think I know what the issue might be: Task 1 graft Qdisc Task 2 new filter + + | | v v rtnl_lock() take q->refcnt + + | | v v Spin while q->refcnt!=1 Block on rtnl_lock() indefinitely due to -EAGAIN This will cause a real deadlock with the proposed patch. I'll try to come up with a better approach. Sorry for not seeing it earlier. >> >> I tested the new reproducer Pedro posted, on: >> >> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported >> >> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger >> any issues (in about 30 minutes). >> >> 3. All 6 v5 patches, plus this diff: >> >> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c >> index 286b7c58f5b9..988718ba5abe 100644 >> --- a/net/sched/sch_api.c >> +++ b/net/sched/sch_api.c >> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, >> * RTNL-unlocked filter request(s). This is the counterpart of that >> * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). >> */ >> - if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) >> + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) { >> + rtnl_unlock(); >> + rtnl_lock(); >> return -EAGAIN; >> + } >> } >> >> if (dev->flags & IFF_UP) >> >> Did not trigger any issues (in about 30 mintues) either. >> >> What would you suggest? > > > I am more worried it is a wackamole situation. We fixed the first > reproducer with essentially patches 1-4 but we opened a new one which > the second reproducer catches. One thing the current reproducer does > is create a lot rtnl contention in the beggining by creating all those > devices and then after it is just creating/deleting qdisc and doing > update with flower where such contention is reduced. i.e it may just > take longer for the mole to pop up. > > Why dont we push the V1 patch in and then worry about getting clever > with EAGAIN after? Can you test the V1 version with the repro Pedro > posted? It shouldnt have these issues. Also it would be interesting to > see how performance of the parallel updates to flower is affected. This or at least push first 4 patches of this series. They target other older commits and fix straightforward issues with the API.
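In short, the cycle described above is: task 1 holds RTNL and keeps replaying until the qdisc refcount drops to one, while task 2 holds a qdisc reference and waits for RTNL. A condensed C-style sketch (the function names are real, but the loop is shorthand for the -EAGAIN replay, and task 2's rtnl_lock() stands for whichever RTNL-locked step the filter path hits, such as loading a classifier module or offloading to a non-unlocked driver):

/* Task 1: RTM_NEWQDISC/RTM_DELQDISC grafting over the ingress qdisc */
rtnl_lock();
while (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
	;	/* -EAGAIN replay: never succeeds while task 2 holds a
		 * reference, and RTNL is never released in between */

/* Task 2: RTM_NEWTFILTER on the same qdisc, RTNL-unlocked entry */
qdisc_refcount_inc_nz(q);	/* q->refcnt > 1 from here on */
rtnl_lock();			/* blocks behind task 1 forever */

Neither side can progress: the reference task 2 holds never drops, and the RTNL mutex task 1 holds is never released.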
On Mon 29 May 2023 at 14:50, Vlad Buslov <vladbu@nvidia.com> wrote: > On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote: >> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote: >>> >>> Hi Jakub and all, >>> >>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote: >>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote: >>> > > Thanks a lot, I'll get right on it. >>> > >>> > Any insights? Is it just a live-lock inherent to the retry scheme >>> > or we actually forget to release the lock/refcnt? >>> >>> I think it's just a thread holding the RTNL mutex for too long (replaying >>> too many times). We could replay for arbitrary times in >>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter >>> requests for the old Qdisc. > > After looking very carefully at the code I think I know what the issue > might be: > > Task 1 graft Qdisc Task 2 new filter > + + > | | > v v > rtnl_lock() take q->refcnt > + + > | | > v v > Spin while q->refcnt!=1 Block on rtnl_lock() indefinitely due to -EAGAIN > > This will cause a real deadlock with the proposed patch. I'll try to > come up with a better approach. Sorry for not seeing it earlier. > Followup: I considered two approaches for preventing the dealock: - Refactor cls_api to always obtain the lock before taking a reference to Qdisc. I started implementing PoC moving the rtnl_lock() call in tc_new_tfilter() before __tcf_qdisc_find() and decided it is not feasible because cls_api will still try to obtain rtnl_lock when offloading a filter to a device with non-unlocked driver or after releasing the lock when loading a classifier module. - Account for such cls_api behavior in sch_api by dropping and re-tacking the lock before replaying. This actually seems to be quite straightforward since 'replay' functionality that we are reusing for this is designed for similar behavior - it releases rtnl lock before loading a sch module, takes the lock again and safely replays the function by re-obtaining all the necessary data. If livelock with concurrent filters insertion is an issue, then it can be remedied by setting a new Qdisc->flags bit "DELETED-REJECT-NEW-FILTERS" and checking for it together with QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter insertion coming after the flag is set to synchronize on rtnl lock. Thoughts? >>> >>> I tested the new reproducer Pedro posted, on: >>> >>> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported >>> >>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger >>> any issues (in about 30 minutes). >>> >>> 3. All 6 v5 patches, plus this diff: >>> >>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c >>> index 286b7c58f5b9..988718ba5abe 100644 >>> --- a/net/sched/sch_api.c >>> +++ b/net/sched/sch_api.c >>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, >>> * RTNL-unlocked filter request(s). This is the counterpart of that >>> * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). >>> */ >>> - if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) >>> + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) { >>> + rtnl_unlock(); >>> + rtnl_lock(); >>> return -EAGAIN; >>> + } >>> } >>> >>> if (dev->flags & IFF_UP) >>> >>> Did not trigger any issues (in about 30 mintues) either. >>> >>> What would you suggest? >> >> >> I am more worried it is a wackamole situation. 
We fixed the first >> reproducer with essentially patches 1-4 but we opened a new one which >> the second reproducer catches. One thing the current reproducer does >> is create a lot rtnl contention in the beggining by creating all those >> devices and then after it is just creating/deleting qdisc and doing >> update with flower where such contention is reduced. i.e it may just >> take longer for the mole to pop up. >> >> Why dont we push the V1 patch in and then worry about getting clever >> with EAGAIN after? Can you test the V1 version with the repro Pedro >> posted? It shouldnt have these issues. Also it would be interesting to >> see how performance of the parallel updates to flower is affected. > > This or at least push first 4 patches of this series. They target other > older commits and fix straightforward issues with the API.
On Mon, May 29, 2023 at 8:06 AM Vlad Buslov <vladbu@nvidia.com> wrote: > > On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote: > >> > >> Hi Jakub and all, > >> > >> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote: > >> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote: > >> > > Thanks a lot, I'll get right on it. > >> > > >> > Any insights? Is it just a live-lock inherent to the retry scheme > >> > or we actually forget to release the lock/refcnt? > >> > >> I think it's just a thread holding the RTNL mutex for too long (replaying > >> too many times). We could replay for arbitrary times in > >> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter > >> requests for the old Qdisc. > > After looking very carefully at the code I think I know what the issue > might be: > > Task 1 graft Qdisc Task 2 new filter > + + > | | > v v > rtnl_lock() take q->refcnt > + + > | | > v v > Spin while q->refcnt!=1 Block on rtnl_lock() indefinitely due to -EAGAIN > > This will cause a real deadlock with the proposed patch. I'll try to > come up with a better approach. Sorry for not seeing it earlier. > > >> > >> I tested the new reproducer Pedro posted, on: > >> > >> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported > >> > >> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger > >> any issues (in about 30 minutes). > >> > >> 3. All 6 v5 patches, plus this diff: > >> > >> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > >> index 286b7c58f5b9..988718ba5abe 100644 > >> --- a/net/sched/sch_api.c > >> +++ b/net/sched/sch_api.c > >> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > >> * RTNL-unlocked filter request(s). This is the counterpart of that > >> * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). > >> */ > >> - if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) > >> + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) { > >> + rtnl_unlock(); > >> + rtnl_lock(); > >> return -EAGAIN; > >> + } > >> } > >> > >> if (dev->flags & IFF_UP) > >> > >> Did not trigger any issues (in about 30 mintues) either. > >> > >> What would you suggest? > > > > > > I am more worried it is a wackamole situation. We fixed the first > > reproducer with essentially patches 1-4 but we opened a new one which > > the second reproducer catches. One thing the current reproducer does > > is create a lot rtnl contention in the beggining by creating all those > > devices and then after it is just creating/deleting qdisc and doing > > update with flower where such contention is reduced. i.e it may just > > take longer for the mole to pop up. > > > > Why dont we push the V1 patch in and then worry about getting clever > > with EAGAIN after? Can you test the V1 version with the repro Pedro > > posted? It shouldnt have these issues. Also it would be interesting to > > see how performance of the parallel updates to flower is affected. > > This or at least push first 4 patches of this series. They target other > older commits and fix straightforward issues with the API. Yes, lets get patch 1-4 in first ... cheers, jamal
On Mon, May 29, 2023 at 09:55:51AM -0400, Jamal Hadi Salim wrote: > On Mon, May 29, 2023 at 8:06 AM Vlad Buslov <vladbu@nvidia.com> wrote: > > On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > Why dont we push the V1 patch in and then worry about getting clever > > > with EAGAIN after? Can you test the V1 version with the repro Pedro > > > posted? It shouldnt have these issues. Also it would be interesting to > > > see how performance of the parallel updates to flower is affected. > > > > This or at least push first 4 patches of this series. They target other > > older commits and fix straightforward issues with the API. > > Yes, lets get patch 1-4 in first ... Sure, I'll send 1-4 as v6 first. Thanks, Peilin Ye
On Mon, 29 May 2023 15:58:50 +0300 Vlad Buslov wrote: > If livelock with concurrent filters insertion is an issue, then it can > be remedied by setting a new Qdisc->flags bit > "DELETED-REJECT-NEW-FILTERS" and checking for it together with > QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter > insertion coming after the flag is set to synchronize on rtnl lock. Sounds very nice, yes.
On Mon, May 29, 2023 at 02:50:26PM +0300, Vlad Buslov wrote: > After looking very carefully at the code I think I know what the issue > might be: > > Task 1 graft Qdisc Task 2 new filter > + + > | | > v v > rtnl_lock() take q->refcnt > + + > | | > v v > Spin while q->refcnt!=1 Block on rtnl_lock() indefinitely due to -EAGAIN > > This will cause a real deadlock with the proposed patch. I'll try to > come up with a better approach. Sorry for not seeing it earlier. Thanks a lot for pointing this out! The reproducers add flower filters to ingress Qdiscs so I didn't think of rtnl_lock()'ed filter requests... On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote: > - Account for such cls_api behavior in sch_api by dropping and > re-tacking the lock before replaying. This actually seems to be quite > straightforward since 'replay' functionality that we are reusing for > this is designed for similar behavior - it releases rtnl lock before > loading a sch module, takes the lock again and safely replays the > function by re-obtaining all the necessary data. Yes, I've tested this using that reproducer Pedro posted. On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote: > If livelock with concurrent filters insertion is an issue, then it can > be remedied by setting a new Qdisc->flags bit > "DELETED-REJECT-NEW-FILTERS" and checking for it together with > QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter > insertion coming after the flag is set to synchronize on rtnl lock. Thanks for the suggestion! I'll try this approach. Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of the "being-deleted" Qdisc. I'll try forcing "late" requests (that arrive later than Qdisc is flagged as being-deleted) sync on RTNL lock without (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for even longer?). Thanks, Peilin Ye
On Tue 30 May 2023 at 02:11, Peilin Ye <yepeilin.cs@gmail.com> wrote: > On Mon, May 29, 2023 at 02:50:26PM +0300, Vlad Buslov wrote: >> After looking very carefully at the code I think I know what the issue >> might be: >> >> Task 1 graft Qdisc Task 2 new filter >> + + >> | | >> v v >> rtnl_lock() take q->refcnt >> + + >> | | >> v v >> Spin while q->refcnt!=1 Block on rtnl_lock() indefinitely due to -EAGAIN >> >> This will cause a real deadlock with the proposed patch. I'll try to >> come up with a better approach. Sorry for not seeing it earlier. > > Thanks a lot for pointing this out! The reproducers add flower filters to > ingress Qdiscs so I didn't think of rtnl_lock()'ed filter requests... > > On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote: >> - Account for such cls_api behavior in sch_api by dropping and >> re-tacking the lock before replaying. This actually seems to be quite >> straightforward since 'replay' functionality that we are reusing for >> this is designed for similar behavior - it releases rtnl lock before >> loading a sch module, takes the lock again and safely replays the >> function by re-obtaining all the necessary data. > > Yes, I've tested this using that reproducer Pedro posted. > > On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote: >> If livelock with concurrent filters insertion is an issue, then it can >> be remedied by setting a new Qdisc->flags bit >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter >> insertion coming after the flag is set to synchronize on rtnl lock. > > Thanks for the suggestion! I'll try this approach. > > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of > the "being-deleted" Qdisc. I'll try forcing "late" requests (that arrive > later than Qdisc is flagged as being-deleted) sync on RTNL lock without > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for > even longer?). Yeah, I see what you mean. Looking at the code __tcf_qdisc_find() already returns -EINVAL when q->refcnt is zero, so maybe returning -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is set is also fine? Would be much easier to implement as opposed to moving rtnl_lock there.
On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote: > >> If livelock with concurrent filters insertion is an issue, then it can > >> be remedied by setting a new Qdisc->flags bit > >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with > >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter > >> insertion coming after the flag is set to synchronize on rtnl lock. > > > > Thanks for the suggestion! I'll try this approach. > > > > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of > > the "being-deleted" Qdisc. I'll try forcing "late" requests (that arrive > > later than Qdisc is flagged as being-deleted) sync on RTNL lock without > > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for > > even longer?). > > Yeah, I see what you mean. Looking at the code __tcf_qdisc_find() > already returns -EINVAL when q->refcnt is zero, so maybe returning > -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is > set is also fine? Would be much easier to implement as opposed to moving > rtnl_lock there. Ah, I see, sure. Thanks, Peilin Ye
Hi Vlad and all, On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote: > >> If livelock with concurrent filters insertion is an issue, then it can > >> be remedied by setting a new Qdisc->flags bit > >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with > >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter > >> insertion coming after the flag is set to synchronize on rtnl lock. > > > > Thanks for the suggestion! I'll try this approach. > > > > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of > > the "being-deleted" Qdisc. I'll try forcing "late" requests (that arrive > > later than Qdisc is flagged as being-deleted) sync on RTNL lock without > > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for > > even longer?). > > Yeah, I see what you mean. Looking at the code __tcf_qdisc_find() > already returns -EINVAL when q->refcnt is zero, so maybe returning > -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is > set is also fine? Would be much easier to implement as opposed to moving > rtnl_lock there. I implemented [1] this suggestion and tested the livelock issue in QEMU (-m 16G, CONFIG_NR_CPUS=8). I tried deleting the ingress Qdisc (let's call it "request A") while it has a lot of ongoing filter requests, and here's the result: #1 #2 #3 #4 ---------------------------------------------------------- a. refcnt 89 93 230 571 b. replayed 167,568 196,450 336,291 878,027 c. time real 0m2.478s 0m2.746s 0m3.693s 0m9.461s user 0m0.000s 0m0.000s 0m0.000s 0m0.000s sys 0m0.623s 0m0.681s 0m1.119s 0m2.770s a. is the Qdisc refcnt when A calls qdisc_graft() for the first time; b. is the number of times A has been replayed; c. is the time(1) output for A. a. and b. are collected from printk() output. This is better than before, but A could still be replayed for hundreds of thousands of times and hang for a few seconds. Is this okay? If not, is it possible (or should we) to make A really _wait_ on Qdisc refcnt, instead of "busy-replaying"? Thanks, Peilin Ye [1] Diff against v5 patch 6 (printk() calls not included): diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 3e9cc43cbc90..de7b0538b309 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -94,6 +94,7 @@ struct Qdisc { #define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */ #define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */ #define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */ +#define TCQ_F_DESTROYING 0x400 /* destroying, reject filter requests */ u32 limit; const struct Qdisc_ops *ops; struct qdisc_size_table __rcu *stab; @@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc) return !READ_ONCE(qdisc->q.qlen); } +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc) +{ + return qdisc->flags & TCQ_F_DESTROYING; +} + /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with * the qdisc root lock acquired. 
*/ diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 2621550bfddc..3e7f6f286ac0 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q, *parent = (*q)->handle; } else { *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent)); - if (!*q) { + if (!*q || qdisc_is_destroying(*q)) { NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists"); err = -EINVAL; goto errout_rcu; diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 286b7c58f5b9..d6e47546c7fe 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, return -ENOENT; } - /* Replay if the current ingress (or clsact) Qdisc has ongoing - * RTNL-unlocked filter request(s). This is the counterpart of that - * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). + /* If current ingress (clsact) Qdisc has ongoing filter requests, stop + * accepting any more by marking it as "being destroyed", then tell the + * caller to replay by returning -EAGAIN. */ - if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) + q = dev_queue->qdisc_sleeping; + if (!qdisc_refcount_dec_if_one(q)) { + q->flags |= TCQ_F_DESTROYING; + rtnl_unlock(); + schedule(); + rtnl_lock(); return -EAGAIN; + } } if (dev->flags & IFF_UP)
On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote: > Hi Vlad and all, > > On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote: >> >> If livelock with concurrent filters insertion is an issue, then it can >> >> be remedied by setting a new Qdisc->flags bit >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter >> >> insertion coming after the flag is set to synchronize on rtnl lock. >> > >> > Thanks for the suggestion! I'll try this approach. >> > >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of >> > the "being-deleted" Qdisc. I'll try forcing "late" requests (that arrive >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for >> > even longer?). >> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find() >> already returns -EINVAL when q->refcnt is zero, so maybe returning >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is >> set is also fine? Would be much easier to implement as opposed to moving >> rtnl_lock there. > > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m > 16G, CONFIG_NR_CPUS=8). I tried deleting the ingress Qdisc (let's call it > "request A") while it has a lot of ongoing filter requests, and here's the > result: > > #1 #2 #3 #4 > ---------------------------------------------------------- > a. refcnt 89 93 230 571 > b. replayed 167,568 196,450 336,291 878,027 > c. time real 0m2.478s 0m2.746s 0m3.693s 0m9.461s > user 0m0.000s 0m0.000s 0m0.000s 0m0.000s > sys 0m0.623s 0m0.681s 0m1.119s 0m2.770s > > a. is the Qdisc refcnt when A calls qdisc_graft() for the first time; > b. is the number of times A has been replayed; > c. is the time(1) output for A. > > a. and b. are collected from printk() output. This is better than before, > but A could still be replayed for hundreds of thousands of times and hang > for a few seconds. I don't get where does few seconds waiting time come from. I'm probably missing something obvious here, but the waiting time should be the maximum filter op latency of new/get/del filter request that is already in-flight (i.e. already passed qdisc_is_destroying() check) and it should take several orders of magnitude less time. > > Is this okay? If not, is it possible (or should we) to make A really > _wait_ on Qdisc refcnt, instead of "busy-replaying"? 
> > Thanks, > Peilin Ye > > [1] Diff against v5 patch 6 (printk() calls not included): > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index 3e9cc43cbc90..de7b0538b309 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -94,6 +94,7 @@ struct Qdisc { > #define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */ > #define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */ > #define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */ > +#define TCQ_F_DESTROYING 0x400 /* destroying, reject filter requests */ > u32 limit; > const struct Qdisc_ops *ops; > struct qdisc_size_table __rcu *stab; > @@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc) > return !READ_ONCE(qdisc->q.qlen); > } > > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc) > +{ > + return qdisc->flags & TCQ_F_DESTROYING; Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing flags since they are now used in unlocked filter code path? > +} > + > /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with > * the qdisc root lock acquired. > */ > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c > index 2621550bfddc..3e7f6f286ac0 100644 > --- a/net/sched/cls_api.c > +++ b/net/sched/cls_api.c > @@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q, > *parent = (*q)->handle; > } else { > *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent)); > - if (!*q) { > + if (!*q || qdisc_is_destroying(*q)) { > NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists"); > err = -EINVAL; > goto errout_rcu; > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > index 286b7c58f5b9..d6e47546c7fe 100644 > --- a/net/sched/sch_api.c > +++ b/net/sched/sch_api.c > @@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > return -ENOENT; > } > > - /* Replay if the current ingress (or clsact) Qdisc has ongoing > - * RTNL-unlocked filter request(s). This is the counterpart of that > - * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). > + /* If current ingress (clsact) Qdisc has ongoing filter requests, stop > + * accepting any more by marking it as "being destroyed", then tell the > + * caller to replay by returning -EAGAIN. > */ > - if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) > + q = dev_queue->qdisc_sleeping; > + if (!qdisc_refcount_dec_if_one(q)) { > + q->flags |= TCQ_F_DESTROYING; > + rtnl_unlock(); > + schedule(); > + rtnl_lock(); > return -EAGAIN; > + } > } > > if (dev->flags & IFF_UP)
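A minimal sketch of the {READ|WRITE}_ONCE() pairing being asked about, assuming TCQ_F_DESTROYING stays a plain bitmask flag (illustrative only, not a posted patch); this avoids torn or stale accesses on the unlocked read side but does not make the read-modify-write of flags atomic, which is where the discussion goes next:

/* writer, in qdisc_graft() under RTNL */
WRITE_ONCE(q->flags, q->flags | TCQ_F_DESTROYING);

/* RTNL-unlocked reader, e.g. in qdisc_is_destroying() */
return READ_ONCE(qdisc->flags) & TCQ_F_DESTROYING;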
On 01/06/2023 00:57, Peilin Ye wrote: > Hi Vlad and all, > > On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote: >>>> If livelock with concurrent filters insertion is an issue, then it can >>>> be remedied by setting a new Qdisc->flags bit >>>> "DELETED-REJECT-NEW-FILTERS" and checking for it together with >>>> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter >>>> insertion coming after the flag is set to synchronize on rtnl lock. >>> >>> Thanks for the suggestion! I'll try this approach. >>> >>> Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of >>> the "being-deleted" Qdisc. I'll try forcing "late" requests (that arrive >>> later than Qdisc is flagged as being-deleted) sync on RTNL lock without >>> (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for >>> even longer?). >> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find() >> already returns -EINVAL when q->refcnt is zero, so maybe returning >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is >> set is also fine? Would be much easier to implement as opposed to moving >> rtnl_lock there. > > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m > 16G, CONFIG_NR_CPUS=8). I tried deleting the ingress Qdisc (let's call it > "request A") while it has a lot of ongoing filter requests, and here's the > result: > > #1 #2 #3 #4 > ---------------------------------------------------------- > a. refcnt 89 93 230 571 > b. replayed 167,568 196,450 336,291 878,027 > c. time real 0m2.478s 0m2.746s 0m3.693s 0m9.461s > user 0m0.000s 0m0.000s 0m0.000s 0m0.000s > sys 0m0.623s 0m0.681s 0m1.119s 0m2.770s > > a. is the Qdisc refcnt when A calls qdisc_graft() for the first time; > b. is the number of times A has been replayed; > c. is the time(1) output for A. > > a. and b. are collected from printk() output. This is better than before, > but A could still be replayed for hundreds of thousands of times and hang > for a few seconds. > > Is this okay? If not, is it possible (or should we) to make A really > _wait_ on Qdisc refcnt, instead of "busy-replaying"? > > Thanks, > Peilin Ye > > [1] Diff against v5 patch 6 (printk() calls not included): > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index 3e9cc43cbc90..de7b0538b309 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -94,6 +94,7 @@ struct Qdisc { > #define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */ > #define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */ > #define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */ > +#define TCQ_F_DESTROYING 0x400 /* destroying, reject filter requests */ > u32 limit; > const struct Qdisc_ops *ops; > struct qdisc_size_table __rcu *stab; > @@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc) > return !READ_ONCE(qdisc->q.qlen); > } > > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc) > +{ > + return qdisc->flags & TCQ_F_DESTROYING; > +} > + > /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with > * the qdisc root lock acquired. 
> */ > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c > index 2621550bfddc..3e7f6f286ac0 100644 > --- a/net/sched/cls_api.c > +++ b/net/sched/cls_api.c > @@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q, > *parent = (*q)->handle; > } else { > *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent)); > - if (!*q) { > + if (!*q || qdisc_is_destroying(*q)) { > NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists"); > err = -EINVAL; > goto errout_rcu; > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > index 286b7c58f5b9..d6e47546c7fe 100644 > --- a/net/sched/sch_api.c > +++ b/net/sched/sch_api.c > @@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > return -ENOENT; > } > > - /* Replay if the current ingress (or clsact) Qdisc has ongoing > - * RTNL-unlocked filter request(s). This is the counterpart of that > - * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). > + /* If current ingress (clsact) Qdisc has ongoing filter requests, stop > + * accepting any more by marking it as "being destroyed", then tell the > + * caller to replay by returning -EAGAIN. > */ > - if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) > + q = dev_queue->qdisc_sleeping; > + if (!qdisc_refcount_dec_if_one(q)) { > + q->flags |= TCQ_F_DESTROYING; > + rtnl_unlock(); > + schedule(); Was this intended or just a leftover? rtnl_lock() would reschedule if needed as it's a mutex_lock > + rtnl_lock(); > return -EAGAIN; > + } > } > > if (dev->flags & IFF_UP) >
On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote: > On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote: > > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc) > > +{ > > + return qdisc->flags & TCQ_F_DESTROYING; > > Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing > flags since they are now used in unlocked filter code path? Thanks, after taking another look at cls_api.c, I noticed this code in tc_new_tfilter(): err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh, flags, extack); if (err == 0) { tfilter_notify(net, skb, n, tp, block, q, parent, fh, RTM_NEWTFILTER, false, rtnl_held, extack); tfilter_put(tp, fh); /* q pointer is NULL for shared blocks */ if (q) q->flags &= ~TCQ_F_CAN_BYPASS; } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it isn't atomic [1]. We also have this: ->dequeue() htb_dequeue() htb_dequeue_tree() qdisc_warn_nonwc(): void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc) { if (!(qdisc->flags & TCQ_F_WARN_NONWC)) { pr_warn("%s: %s qdisc %X: is non-work-conserving?\n", txt, qdisc->ops->id, qdisc->handle >> 16); qdisc->flags |= TCQ_F_WARN_NONWC; } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ } EXPORT_SYMBOL(qdisc_warn_nonwc); Also non-atomic; isn't it possible for the above 2 underlined statements to race with each other? If true, I think we need to change Qdisc::flags to use atomic bitops, just like what we're doing for Qdisc::state and ::state2. It feels like a separate TODO, however. I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to ::state2, but not sure if it's okay to extend it for our purpose. Thanks, Peilin Ye [1] Compiled to this on my Intel 64: 0x0000000000017788 <+6472>: 83 62 10 fb andl $0xfffffffb,0x10(%rdx)
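To spell out the risk: both underlined statements are plain read-modify-write updates of the same word, so one bit flip can overwrite the other. A hypothetical shape of the atomic-bitops conversion (the *_BIT names, the change of Qdisc::flags to an unsigned long bitmap, and the exact call sites are assumptions for illustration, not a posted patch):

/* today: two plain RMW updates of the same word can race; one of the
 * two bit flips can be lost
 */
q->flags &= ~TCQ_F_CAN_BYPASS;		/* tc_new_tfilter() */
qdisc->flags |= TCQ_F_WARN_NONWC;	/* qdisc_warn_nonwc(), concurrently */

/* after a conversion to atomic bitops (Qdisc::flags as unsigned long,
 * flag values as bit numbers), neither update can clobber the other
 */
clear_bit(TCQ_F_CAN_BYPASS_BIT, &q->flags);	/* tc_new_tfilter() */
set_bit(TCQ_F_WARN_NONWC_BIT, &qdisc->flags);	/* qdisc_warn_nonwc() */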
Hi Pedro, On Thu, Jun 01, 2023 at 10:03:22AM -0300, Pedro Tammela wrote: > On 01/06/2023 00:57, Peilin Ye wrote: > > - /* Replay if the current ingress (or clsact) Qdisc has ongoing > > - * RTNL-unlocked filter request(s). This is the counterpart of that > > - * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). > > + /* If current ingress (clsact) Qdisc has ongoing filter requests, stop > > + * accepting any more by marking it as "being destroyed", then tell the > > + * caller to replay by returning -EAGAIN. > > */ > > - if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) > > + q = dev_queue->qdisc_sleeping; > > + if (!qdisc_refcount_dec_if_one(q)) { > > + q->flags |= TCQ_F_DESTROYING; > > + rtnl_unlock(); > > + schedule(); > > Was this intended or just a leftover? > rtnl_lock() would reschedule if needed as it's a mutex_lock In qdisc_create(): rtnl_unlock(); request_module("sch_%s", name); rtnl_lock(); ops = qdisc_lookup_ops(kind); if (ops != NULL) { /* We will try again qdisc_lookup_ops, * so don't keep a reference. */ module_put(ops->owner); err = -EAGAIN; goto err_out; If qdisc_lookup_ops() returns !NULL, it means we've successfully loaded the module, so we know that the next replay should (normally) succeed. However in the diff I posted, if qdisc_refcount_dec_if_one() returned false, we know that we should step back and wait a bit - especially when there're ongoing RTNL-locked filter requests: we want them to grab the RTNL mutex before us, which was my intention here, to unconditionally give up the CPU. I haven't tested if it really makes a difference though. Thanks, Peilin Ye
On Tue 06 Jun 2023 at 17:57, Peilin Ye <yepeilin.cs@gmail.com> wrote: > On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote: >> On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote: >> > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc) >> > +{ >> > + return qdisc->flags & TCQ_F_DESTROYING; >> >> Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing >> flags since they are now used in unlocked filter code path? > > Thanks, after taking another look at cls_api.c, I noticed this code in > tc_new_tfilter(): > > err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh, > flags, extack); > if (err == 0) { > tfilter_notify(net, skb, n, tp, block, q, parent, fh, > RTM_NEWTFILTER, false, rtnl_held, extack); > tfilter_put(tp, fh); > /* q pointer is NULL for shared blocks */ > if (q) > q->flags &= ~TCQ_F_CAN_BYPASS; > } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it > isn't atomic [1]. Yeah, I see we have already got such behavior in 3f05e6886a59 ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters"). > > We also have this: > > ->dequeue() > htb_dequeue() > htb_dequeue_tree() > qdisc_warn_nonwc(): > > void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc) > { > if (!(qdisc->flags & TCQ_F_WARN_NONWC)) { > pr_warn("%s: %s qdisc %X: is non-work-conserving?\n", > txt, qdisc->ops->id, qdisc->handle >> 16); > qdisc->flags |= TCQ_F_WARN_NONWC; > } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > } > EXPORT_SYMBOL(qdisc_warn_nonwc); > > Also non-atomic; isn't it possible for the above 2 underlined statements to > race with each other? If true, I think we need to change Qdisc::flags to > use atomic bitops, just like what we're doing for Qdisc::state and > ::state2. It feels like a separate TODO, however. It looks like even though 3f05e6886a59 ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api unlock by now we have these in exactly the same list of supported kernels (5.4 LTS and newer). Considering this, the conversion to the atomic bitops can be done as a standalone fix for cited commit and after it will have been accepted and backported the qdisc fix can just assume that qdisc->flags is an atomic bitops field in all target kernels and use it as-is. WDYT? > > I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to > ::state2, but not sure if it's okay to extend it for our purpose. As you described above qdisc->flags is already used to interact with cls api (including changing it dynamically), so I don't see why not.
On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote: > >> >> If livelock with concurrent filters insertion is an issue, then it can > >> >> be remedied by setting a new Qdisc->flags bit > >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with > >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter > >> >> insertion coming after the flag is set to synchronize on rtnl lock. > >> > > >> > Thanks for the suggestion! I'll try this approach. > >> > > >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of > >> > the "being-deleted" Qdisc. I'll try forcing "late" requests (that arrive > >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without > >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for > >> > even longer?). > >> > >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find() > >> already returns -EINVAL when q->refcnt is zero, so maybe returning > >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is > >> set is also fine? Would be much easier to implement as opposed to moving > >> rtnl_lock there. > > > > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m > > 16G, CONFIG_NR_CPUS=8). I tried deleting the ingress Qdisc (let's call it > > "request A") while it has a lot of ongoing filter requests, and here's the > > result: > > > > #1 #2 #3 #4 > > ---------------------------------------------------------- > > a. refcnt 89 93 230 571 > > b. replayed 167,568 196,450 336,291 878,027 > > c. time real 0m2.478s 0m2.746s 0m3.693s 0m9.461s > > user 0m0.000s 0m0.000s 0m0.000s 0m0.000s > > sys 0m0.623s 0m0.681s 0m1.119s 0m2.770s > > > > a. is the Qdisc refcnt when A calls qdisc_graft() for the first time; > > b. is the number of times A has been replayed; > > c. is the time(1) output for A. > > > > a. and b. are collected from printk() output. This is better than before, > > but A could still be replayed for hundreds of thousands of times and hang > > for a few seconds. > > I don't get where does few seconds waiting time come from. I'm probably > missing something obvious here, but the waiting time should be the > maximum filter op latency of new/get/del filter request that is already > in-flight (i.e. already passed qdisc_is_destroying() check) and it > should take several orders of magnitude less time. Yeah I agree, here's what I did: In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic loop: $ echo "1 1 32" > /sys/bus/netdevsim/new_device $ tc qdisc add dev eth1 ingress $ for (( i=1; i<=3000; i++ )) > do > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 & > done When the loop is running, I delete the Qdisc in Terminal 2: $ time tc qdisc delete dev eth1 ingress Which took seconds on average. However, if I specify a unique "prio" when adding filters in that loop, e.g.: $ for (( i=1; i<=3000; i++ )) > do > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 & > done ^^^^^^^ Then deleting the Qdisc in Terminal 2 becomes a lot faster: real 0m0.712s user 0m0.000s sys 0m0.152s In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did yet another test [1], which looks a lot better. When I didn't specify "prio", sometimes that rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns -EEXIST. 
Is it because that concurrent add-filter requests auto-allocated the same "prio" number, so they collided with each other? Do you think this is related to why it's slow? Thanks, Peilin Ye [1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances in -batch mode that keeps adding a unique filter (with "prio" and "handle" specified) then deletes it. Again, when they are running I delete the ingress Qdisc, and here's the result: #1 #2 #3 #4 ---------------------------------------------------------- a. refcnt 64 63 64 64 b. replayed 169 5,630 887 3,442 c. time real 0m0.171s 0m0.147s 0m0.186s 0m0.111s user 0m0.000s 0m0.009s 0m0.001s 0m0.000s sys 0m0.112s 0m0.108s 0m0.115s 0m0.104s
On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote: > > I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to > > ::state2, but not sure if it's okay to extend it for our purpose. > > As you described above qdisc->flags is already used to interact with cls > api (including changing it dynamically), so I don't see why not. Sorry, I don't follow, I meant qdisc->state2: enum qdisc_state2_t { /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly. * Use qdisc_run_begin/end() or qdisc_is_running() instead. */ __QDISC_STATE2_RUNNING, }; NVM, I think using qdisc->flags after making it atomic sounds better. On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote: > > err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh, > > flags, extack); > > if (err == 0) { > > tfilter_notify(net, skb, n, tp, block, q, parent, fh, > > RTM_NEWTFILTER, false, rtnl_held, extack); > > tfilter_put(tp, fh); > > /* q pointer is NULL for shared blocks */ > > if (q) > > q->flags &= ~TCQ_F_CAN_BYPASS; > > } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it > > isn't atomic [1]. > > Yeah, I see we have already got such behavior in 3f05e6886a59 > ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters"). > > > We also have this: > > > > ->dequeue() > > htb_dequeue() > > htb_dequeue_tree() > > qdisc_warn_nonwc(): > > > > void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc) > > { > > if (!(qdisc->flags & TCQ_F_WARN_NONWC)) { > > pr_warn("%s: %s qdisc %X: is non-work-conserving?\n", > > txt, qdisc->ops->id, qdisc->handle >> 16); > > qdisc->flags |= TCQ_F_WARN_NONWC; > > } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > } > > EXPORT_SYMBOL(qdisc_warn_nonwc); > > > > Also non-atomic; isn't it possible for the above 2 underlined statements to > > race with each other? If true, I think we need to change Qdisc::flags to > > use atomic bitops, just like what we're doing for Qdisc::state and > > ::state2. It feels like a separate TODO, however. > > It looks like even though 3f05e6886a59 ("net_sched: unset > TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api > unlock by now we have these in exactly the same list of supported > kernels (5.4 LTS and newer). Considering this, the conversion to the > atomic bitops can be done as a standalone fix for cited commit and after > it will have been accepted and backported the qdisc fix can just assume > that qdisc->flags is an atomic bitops field in all target kernels and > use it as-is. WDYT? Sounds great, how about: 1. I'll post the non-replay version of the fix (after updating the commit message), and we apply that first, as suggested by Jamal 2. Make qdisc->flags atomic 3. Make the fix better by replaying and using the (now atomic) IS-DESTROYING flag with test_and_set_bit() and friends ? Thanks, Peilin Ye
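A rough sketch of how step 3 might look once the flags field is atomic, building on the earlier TCQ_F_DESTROYING diff; the bit name, its placement, and the use of set_bit() (a sibling of test_and_set_bit()) are assumptions about a future patch, not its final form:

/* qdisc_graft(), ingress/clsact branch (sketch) */
q = dev_queue->qdisc_sleeping;
if (!qdisc_refcount_dec_if_one(q)) {
	/* atomically flag the qdisc; RTNL-unlocked filter requests that
	 * arrive after this see the bit in __tcf_qdisc_find() and bail
	 * out, so the replay below is bounded by the requests already
	 * in flight
	 */
	set_bit(TCQ_F_DESTROYING_BIT, &q->flags);
	return -EAGAIN;		/* caller replays under RTNL */
}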
On Wed 07 Jun 2023 at 18:08, Peilin Ye <yepeilin.cs@gmail.com> wrote: > On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote: >> > I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to >> > ::state2, but not sure if it's okay to extend it for our purpose. >> >> As you described above qdisc->flags is already used to interact with cls >> api (including changing it dynamically), so I don't see why not. > > Sorry, I don't follow, I meant qdisc->state2: > > enum qdisc_state2_t { > /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly. > * Use qdisc_run_begin/end() or qdisc_is_running() instead. > */ > __QDISC_STATE2_RUNNING, > }; Sorry, I misunderstood what you were suggesting. Got it now. > > NVM, I think using qdisc->flags after making it atomic sounds better. Agree. > > On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote: >> > err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh, >> > flags, extack); >> > if (err == 0) { >> > tfilter_notify(net, skb, n, tp, block, q, parent, fh, >> > RTM_NEWTFILTER, false, rtnl_held, extack); >> > tfilter_put(tp, fh); >> > /* q pointer is NULL for shared blocks */ >> > if (q) >> > q->flags &= ~TCQ_F_CAN_BYPASS; >> > } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> > >> > TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it >> > isn't atomic [1]. >> >> Yeah, I see we have already got such behavior in 3f05e6886a59 >> ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters"). >> >> > We also have this: >> > >> > ->dequeue() >> > htb_dequeue() >> > htb_dequeue_tree() >> > qdisc_warn_nonwc(): >> > >> > void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc) >> > { >> > if (!(qdisc->flags & TCQ_F_WARN_NONWC)) { >> > pr_warn("%s: %s qdisc %X: is non-work-conserving?\n", >> > txt, qdisc->ops->id, qdisc->handle >> 16); >> > qdisc->flags |= TCQ_F_WARN_NONWC; >> > } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> > } >> > EXPORT_SYMBOL(qdisc_warn_nonwc); >> > >> > Also non-atomic; isn't it possible for the above 2 underlined statements to >> > race with each other? If true, I think we need to change Qdisc::flags to >> > use atomic bitops, just like what we're doing for Qdisc::state and >> > ::state2. It feels like a separate TODO, however. >> >> It looks like even though 3f05e6886a59 ("net_sched: unset >> TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api >> unlock by now we have these in exactly the same list of supported >> kernels (5.4 LTS and newer). Considering this, the conversion to the >> atomic bitops can be done as a standalone fix for cited commit and after >> it will have been accepted and backported the qdisc fix can just assume >> that qdisc->flags is an atomic bitops field in all target kernels and >> use it as-is. WDYT? > > Sounds great, how about: > > 1. I'll post the non-replay version of the fix (after updating the commit > message), and we apply that first, as suggested by Jamal From my side there are no objections to any of the proposed approaches since we have never had any users with legitimate use-case where they need to replace/delete a qdisc concurrently with a filter update, so returning -EBUSY (or -EAGAIN) to the user in such case would work as either temporary or the final fix. However, Jakub had reservations with such approach so don't know where we stand now regarding this. > > 2. Make qdisc->flags atomic > > 3. Make the fix better by replaying and using the (now atomic) > IS-DESTROYING flag with test_and_set_bit() and friends > > ? Again, no objections from my side. 
Ping me if you need help with any of these.
On Wed 07 Jun 2023 at 17:39, Peilin Ye <yepeilin.cs@gmail.com> wrote: > On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote: >> >> >> If livelock with concurrent filters insertion is an issue, then it can >> >> >> be remedied by setting a new Qdisc->flags bit >> >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with >> >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter >> >> >> insertion coming after the flag is set to synchronize on rtnl lock. >> >> > >> >> > Thanks for the suggestion! I'll try this approach. >> >> > >> >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of >> >> > the "being-deleted" Qdisc. I'll try forcing "late" requests (that arrive >> >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without >> >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for >> >> > even longer?). >> >> >> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find() >> >> already returns -EINVAL when q->refcnt is zero, so maybe returning >> >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is >> >> set is also fine? Would be much easier to implement as opposed to moving >> >> rtnl_lock there. >> > >> > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m >> > 16G, CONFIG_NR_CPUS=8). I tried deleting the ingress Qdisc (let's call it >> > "request A") while it has a lot of ongoing filter requests, and here's the >> > result: >> > >> > #1 #2 #3 #4 >> > ---------------------------------------------------------- >> > a. refcnt 89 93 230 571 >> > b. replayed 167,568 196,450 336,291 878,027 >> > c. time real 0m2.478s 0m2.746s 0m3.693s 0m9.461s >> > user 0m0.000s 0m0.000s 0m0.000s 0m0.000s >> > sys 0m0.623s 0m0.681s 0m1.119s 0m2.770s >> > >> > a. is the Qdisc refcnt when A calls qdisc_graft() for the first time; >> > b. is the number of times A has been replayed; >> > c. is the time(1) output for A. >> > >> > a. and b. are collected from printk() output. This is better than before, >> > but A could still be replayed for hundreds of thousands of times and hang >> > for a few seconds. >> >> I don't get where does few seconds waiting time come from. I'm probably >> missing something obvious here, but the waiting time should be the >> maximum filter op latency of new/get/del filter request that is already >> in-flight (i.e. already passed qdisc_is_destroying() check) and it >> should take several orders of magnitude less time. > > Yeah I agree, here's what I did: > > In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic > loop: > > $ echo "1 1 32" > /sys/bus/netdevsim/new_device > $ tc qdisc add dev eth1 ingress > $ for (( i=1; i<=3000; i++ )) > > do > > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 & > > done > > When the loop is running, I delete the Qdisc in Terminal 2: > > $ time tc qdisc delete dev eth1 ingress > > Which took seconds on average. However, if I specify a unique "prio" when > adding filters in that loop, e.g.: > > $ for (( i=1; i<=3000; i++ )) > > do > > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 & > > done ^^^^^^^ > > Then deleting the Qdisc in Terminal 2 becomes a lot faster: > > real 0m0.712s > user 0m0.000s > sys 0m0.152s > > In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did > yet another test [1], which looks a lot better. 
That makes sense, thanks for explaining. > > When I didn't specify "prio", sometimes that > rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns > -EEXIST. Is it because that concurrent add-filter requests auto-allocated > the same "prio" number, so they collided with each other? Do you think > this is related to why it's slow? It is slow because when creating a filter without providing priority you are basically measuring the latency of creating a whole flower classifier instance (multiple memory allocation, initialization of all kinds of idrs, hash tables and locks, updating tp list in chain, etc.), not just a single filter, so significantly higher latency is expected. But my point still stands: with latest version of your fix the maximum time of 'spinning' in sch_api is the maximum concurrent tcf_{new|del|get}_tfilter op latency that has already obtained the qdisc and any concurrent filter API messages coming after qdisc->flags "DELETED-REJECT-NEW-FILTERS" has been set will fail and can't livelock the concurrent qdisc del/replace. > > Thanks, > Peilin Ye > > [1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances > in -batch mode that keeps adding a unique filter (with "prio" and "handle" > specified) then deletes it. Again, when they are running I delete the > ingress Qdisc, and here's the result: > > #1 #2 #3 #4 > ---------------------------------------------------------- > a. refcnt 64 63 64 64 > b. replayed 169 5,630 887 3,442 > c. time real 0m0.171s 0m0.147s 0m0.186s 0m0.111s > user 0m0.000s 0m0.009s 0m0.001s 0m0.000s > sys 0m0.112s 0m0.108s 0m0.115s 0m0.104s

On Thu, Jun 08, 2023 at 12:17:27PM +0300, Vlad Buslov wrote:
> > When I didn't specify "prio", sometimes that
> > rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns
> > -EEXIST. Is it because concurrent add-filter requests auto-allocated
> > the same "prio" number, so they collided with each other? Do you think
> > this is related to why it's slow?
>
> It is slow because when creating a filter without providing a priority you
> are basically measuring the latency of creating a whole flower
> classifier instance (multiple memory allocations, initialization of all
> kinds of idrs, hash tables and locks, updating the tp list in the chain,
> etc.), not just a single filter, so significantly higher latency is expected.

Ah, I see. Thanks for the explanation.

Thanks,
Peilin Ye

On Thu, Jun 08, 2023 at 10:48:12AM +0300, Vlad Buslov wrote:
> >> It looks like even though 3f05e6886a59 ("net_sched: unset
> >> TCQ_F_CAN_BYPASS when adding filters") was introduced after the cls API
> >> unlock, by now we have these in exactly the same list of supported
> >> kernels (5.4 LTS and newer). Considering this, the conversion to
> >> atomic bitops can be done as a standalone fix for the cited commit, and
> >> after it has been accepted and backported, the qdisc fix can just assume
> >> that qdisc->flags is an atomic bitops field in all target kernels and
> >> use it as-is. WDYT?
> >
> > Sounds great, how about:
> >
> > 1. I'll post the non-replay version of the fix (after updating the commit
> >    message), and we apply that first, as suggested by Jamal
>
> From my side there are no objections to any of the proposed approaches,
> since we have never had any users with a legitimate use case where they
> need to replace/delete a qdisc concurrently with a filter update, so
> returning -EBUSY (or -EAGAIN) to the user in such a case would work as
> either the temporary or the final fix.

I see, yeah.

> However, Jakub had reservations about such an approach, so I don't know
> where we stand now regarding this.

Either way, I'd say applying that non-replay version first is better than
leaving the bug unfixed. It's been many days since the root cause of the
issue was figured out. I'll post it, and start making qdisc->flags atomic.

> > 2. Make qdisc->flags atomic
> >
> > 3. Make the fix better by replaying and using the (now atomic)
> >    IS-DESTROYING flag with test_and_set_bit() and friends
> >
> > ?
>
> Again, no objections from my side. Ping me if you need help with any of
> these.

Sure, thanks, Vlad!
Peilin Ye
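
For concreteness, here is a minimal C sketch of what steps 2 and 3 above could
build on, assuming the being-destroyed marker becomes an atomic bit. The bit
name and the standalone state word are placeholders for illustration; this is
not code from the series, where the bit would live inside struct Qdisc (or
qdisc->flags itself would become an atomic bitmap).

#include <linux/bitops.h>

enum {
        QDISC_SKETCH_IS_DESTROYING,     /* placeholder bit number */
};

/* Delete/replace path: only the caller that wins the race to set the bit
 * proceeds with tearing the Qdisc down; test_and_set_bit() is atomic, so
 * two concurrent callers cannot both "win".
 */
static bool qdisc_sketch_mark_destroying(unsigned long *state)
{
        return !test_and_set_bit(QDISC_SKETCH_IS_DESTROYING, state);
}

/* RTNL-unlocked filter path: refuse to start once the bit is set. */
static bool qdisc_sketch_destroying(const unsigned long *state)
{
        return test_bit(QDISC_SKETCH_IS_DESTROYING, state);
}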

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..286b7c58f5b9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 	if ((q && q->flags & TCQ_F_INGRESS) ||
 	    (new && new->flags & TCQ_F_INGRESS)) {
 		ingress = 1;
-		if (!dev_ingress_queue(dev)) {
+		dev_queue = dev_ingress_queue(dev);
+		if (!dev_queue) {
 			NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 			return -ENOENT;
 		}
+
+		/* Replay if the current ingress (or clsact) Qdisc has ongoing
+		 * RTNL-unlocked filter request(s). This is the counterpart of that
+		 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+		 */
+		if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+			return -EAGAIN;
 	}
 
 	if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 			qdisc_put(old);
 		}
 	} else {
-		dev_queue = dev_ingress_queue(dev);
-		old = dev_graft_qdisc(dev_queue, new);
+		old = dev_graft_qdisc(dev_queue, NULL);
+
+		/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+		 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+		 * pointer(s) in mini_qdisc_pair_swap().
+		 */
+		qdisc_notify(net, skb, n, classid, old, new, extack);
+		qdisc_destroy(old);
+
+		dev_graft_qdisc(dev_queue, new);
 	}
 
 skip:
@@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 		if (new && new->ops->attach)
 			new->ops->attach(new);
-	} else {
-		notify_and_destroy(net, skb, n, classid, old, new, extack);
 	}
 
 	if (dev->flags & IFF_UP)
@@ -1450,19 +1464,22 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			struct netlink_ext_ack *extack)
 {
 	struct net *net = sock_net(skb->sk);
-	struct tcmsg *tcm = nlmsg_data(n);
 	struct nlattr *tca[TCA_MAX + 1];
 	struct net_device *dev;
+	struct Qdisc *q, *p;
+	struct tcmsg *tcm;
 	u32 clid;
-	struct Qdisc *q = NULL;
-	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
+	tcm = nlmsg_data(n);
+	q = p = NULL;
+
 	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
 	if (!dev)
 		return -ENODEV;
@@ -1515,8 +1532,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1724,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);
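
To summarize how the two refcount helpers in the patch above cooperate, here
is a condensed illustration distilled from the hunks; it assumes the
sch_generic.h hunk adding qdisc_refcount_dec_if_one() is applied, and the
function names themselves are made up for the example.

#include <net/sch_generic.h>

/* Filter side: __tcf_qdisc_find() only succeeds while at least one reference
 * is still held, via qdisc_refcount_inc_nz() (refcount_inc_not_zero()
 * underneath).
 */
static int sketch_filter_claims_qdisc(struct Qdisc *q)
{
        return qdisc_refcount_inc_nz(q) ? 0 : -EINVAL;
}

/* Graft side: qdisc_graft() only proceeds when it is the sole holder of the
 * refcount, i.e. every RTNL-unlocked filter request that took a reference
 * has dropped it again. Otherwise it returns -EAGAIN, and tc_get_qdisc() /
 * tc_modify_qdisc() replay the whole request.
 */
static int sketch_graft_claims_ingress_qdisc(struct Qdisc *sleeping)
{
        return qdisc_refcount_dec_if_one(sleeping) ? 0 : -EAGAIN;
}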