diff mbox series

[v5,net,6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Message ID 429357af094297abbc45f47b8e606f11206df049.1684887977.git.peilin.ye@bytedance.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series net/sched: Fixes for sch_ingress and sch_clsact | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1391 this patch: 1391
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang success Errors and warnings before: 149 this patch: 149
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1412 this patch: 1412
netdev/checkpatch warning CHECK: multiple assignments should be avoided WARNING: line length of 82 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Peilin Ye May 24, 2023, 1:20 a.m. UTC
From: Peilin Ye <peilin.ye@bytedance.com>

mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap().  Similarly for
clsact Qdiscs and miniq_egress.
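To make the indirection concrete, here is a minimal userspace C model (not kernel code; all names with a _model suffix are illustrative) of how a pair's double pointer aims at the per-device field, so that a swap by either Qdisc writes the same device-visible pointer:

```c
#include <assert.h>
#include <stddef.h>

struct mini_qdisc { int id; };

/* Model of the per-net_device field that the ingress datapath reads. */
struct net_device_model {
	struct mini_qdisc *miniq_ingress;
};

/* Model of mini_Qdisc_pair: two mini Qdiscs plus a pointer TO the
 * device's miniq_ingress pointer (the double pointer in question). */
struct mini_qdisc_pair_model {
	struct mini_qdisc miniq1, miniq2;
	struct mini_qdisc **p_miniq;
};

/* Analogous to ingress_init(): point the pair at the device field. */
static void pair_init(struct mini_qdisc_pair_model *p,
		      struct net_device_model *dev)
{
	p->p_miniq = &dev->miniq_ingress;
}

/* Analogous to mini_qdisc_pair_swap(): publish a mini Qdisc, or NULL
 * on teardown.  Both A and B write through the same device field. */
static void pair_swap(struct mini_qdisc_pair_model *p, int active)
{
	*p->p_miniq = active ? &p->miniq1 : NULL;
}
```

Running two pairs against one device reproduces the clobbering described below: a late teardown swap by the old pair nulls out what the new pair just published.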

Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
requests (thanks Hillf Danton for the hint), when replacing ingress or
clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
causing race conditions [1] including a use-after-free bug in
mini_qdisc_pair_swap() reported by syzbot:

 BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
  print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
  print_report mm/kasan/report.c:430 [inline]
  kasan_report+0x11c/0x130 mm/kasan/report.c:536
  mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
  tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
  tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
  tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
  tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
  tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

@old and @new should not affect each other.  In other words, @old should
never modify miniq_{in,e}gress after @new, and @new should not update
@old's RCU state.  Fixing without changing sch_api.c turned out to be
difficult (please refer to Closes: for discussions).  Instead, make sure that
@new's first mini_qdisc_pair_swap() call always happens after @old's last
call, in qdisc_destroy(), has finished:

In qdisc_graft(), return -EAGAIN and tell the caller to replay
(suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
requests, and call qdisc_destroy() for @old before grafting @new.
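The replay pattern can be sketched as follows in userspace C (hypothetical names; the real code re-parses the netlink message on each pass so that per-attempt state such as @q, @p and @tcm is reinitialized, as noted in the v5 changelog):

```c
#include <assert.h>
#include <errno.h>

static int attempts;

/* Stand-in for qdisc_graft(): pretend RTNL-unlocked filter requests on
 * the old Qdisc are still in flight for the first two attempts. */
static int fake_qdisc_graft(void)
{
	return ++attempts < 3 ? -EAGAIN : 0;
}

/* Stand-in for tc_modify_qdisc()/tc_get_qdisc(): on -EAGAIN, jump back
 * and redo the whole request from scratch. */
static int fake_tc_modify_qdisc(void)
{
	int err;
replay:
	/* reinitialize per-attempt state here, then look up dev, q, p... */
	err = fake_qdisc_graft();
	if (err == -EAGAIN)
		goto replay;
	return err;
}
```

This mirrors the existing replay convention in tc_new_tfilter(), which the patch follows.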

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.
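The semantics of the new helper can be modeled in plain C (a sketch with illustrative names and a made-up flag value; the kernel's refcount_dec_if_one() performs the check-and-decrement atomically):

```c
#include <assert.h>
#include <stdbool.h>

#define TCQ_F_BUILTIN_MODEL 0x1	/* illustrative value, not the kernel's */

struct qdisc_model {
	unsigned int flags;
	int refcnt;
};

/* Sketch of qdisc_refcount_dec_if_one(): succeed only when the caller
 * holds the sole reference.  An in-flight RTNL-unlocked filter request,
 * which took an extra reference via qdisc_refcount_inc_nz(), makes the
 * count > 1 and the helper fail, signalling "replay later". */
static bool qdisc_refcount_dec_if_one_model(struct qdisc_model *q)
{
	if (q->flags & TCQ_F_BUILTIN_MODEL)
		return true;
	if (q->refcnt == 1) {	/* done atomically in the real helper */
		q->refcnt = 0;
		return true;
	}
	return false;
}
```

Once the helper succeeds, the caller is the only owner and may destroy @old before grafting @new.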

Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
Qdiscs".

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:

  Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
  adds a flower filter X to A.

  Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
  b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                         RTM_NEWQDISC (B, RTNL-locked)
         RCU sync                2     qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
   tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
   qdisc_destroy(A)                    tcf_chain0_head_change(B)
   tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
   mini_qdisc_pair_swap(A) (3rd)                |
           ...                                 ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().

This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers.  The point is clear though: again, A should
never modify those per-net_device pointers after B, and B should not
update A's RCU state.

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Cc: Hillf Danton <hdanton@sina.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
change in v5:
  - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying,
    just like @flags in tc_new_tfilter()

change in v3, v4:
  - add in-body From: tag

changes in v2:
  - replay the request if the current Qdisc has any ongoing RTNL-unlocked
    filter requests (Vlad)
  - minor changes in code comments and commit log

 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 40 ++++++++++++++++++++++++++++++---------
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 50 insertions(+), 12 deletions(-)

Comments

Pedro Tammela May 24, 2023, 3:39 p.m. UTC | #1
On 23/05/2023 22:20, Peilin Ye wrote:
> From: Peilin Ye <peilin.ye@bytedance.com>
>
> [snip commit message]

Tested-by: Pedro Tammela <pctammela@mojatatu.com>

> ---
> change in v5:
>    - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying,
>      just like @flags in tc_new_tfilter()
> 
> change in v3, v4:
>    - add in-body From: tag
> 
> changes in v2:
>    - replay the request if the current Qdisc has any ongoing RTNL-unlocked
>      filter requests (Vlad)
>    - minor changes in code comments and commit log
> 
>   include/net/sch_generic.h |  8 ++++++++
>   net/sched/sch_api.c       | 40 ++++++++++++++++++++++++++++++---------
>   net/sched/sch_generic.c   | 14 +++++++++++---
>   3 files changed, 50 insertions(+), 12 deletions(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
>   	refcount_inc(&qdisc->refcnt);
>   }
>   
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return true;
> +	return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
>   /* Intended to be used by unlocked users, when concurrent qdisc release is
>    * possible.
>    */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
>   struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>   			      struct Qdisc *qdisc);
>   void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
>   void qdisc_put(struct Qdisc *qdisc);
>   void qdisc_put_unlocked(struct Qdisc *qdisc);
>   void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..286b7c58f5b9 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   		if ((q && q->flags & TCQ_F_INGRESS) ||
>   		    (new && new->flags & TCQ_F_INGRESS)) {
>   			ingress = 1;
> -			if (!dev_ingress_queue(dev)) {
> +			dev_queue = dev_ingress_queue(dev);
> +			if (!dev_queue) {
>   				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
>   				return -ENOENT;
>   			}
> +
> +			/* Replay if the current ingress (or clsact) Qdisc has ongoing
> +			 * RTNL-unlocked filter request(s).  This is the counterpart of that
> +			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +			 */
> +			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +				return -EAGAIN;
>   		}
>   
>   		if (dev->flags & IFF_UP)
> @@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   				qdisc_put(old);
>   			}
>   		} else {
> -			dev_queue = dev_ingress_queue(dev);
> -			old = dev_graft_qdisc(dev_queue, new);
> +			old = dev_graft_qdisc(dev_queue, NULL);
> +
> +			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
> +			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +			 * pointer(s) in mini_qdisc_pair_swap().
> +			 */
> +			qdisc_notify(net, skb, n, classid, old, new, extack);
> +			qdisc_destroy(old);
> +
> +			dev_graft_qdisc(dev_queue, new);
>   		}
>   
>   skip:
> @@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   
>   			if (new && new->ops->attach)
>   				new->ops->attach(new);
> -		} else {
> -			notify_and_destroy(net, skb, n, classid, old, new, extack);
>   		}
>   
>   		if (dev->flags & IFF_UP)
> @@ -1450,19 +1464,22 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   			struct netlink_ext_ack *extack)
>   {
>   	struct net *net = sock_net(skb->sk);
> -	struct tcmsg *tcm = nlmsg_data(n);
>   	struct nlattr *tca[TCA_MAX + 1];
>   	struct net_device *dev;
> +	struct Qdisc *q, *p;
> +	struct tcmsg *tcm;
>   	u32 clid;
> -	struct Qdisc *q = NULL;
> -	struct Qdisc *p = NULL;
>   	int err;
>   
> +replay:
>   	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
>   				     rtm_tca_policy, extack);
>   	if (err < 0)
>   		return err;
>   
> +	tcm = nlmsg_data(n);
> +	q = p = NULL;
> +
>   	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
>   	if (!dev)
>   		return -ENODEV;
> @@ -1515,8 +1532,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   			return -ENOENT;
>   		}
>   		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
> -		if (err != 0)
> +		if (err != 0) {
> +			if (err == -EAGAIN)
> +				goto replay;
>   			return err;
> +		}
>   	} else {
>   		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
>   	}
> @@ -1704,6 +1724,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   	if (err) {
>   		if (q)
>   			qdisc_put(q);
> +		if (err == -EAGAIN)
> +			goto replay;
>   		return err;
>   	}
>   
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
>   	qdisc_free(q);
>   }
>   
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
>   {
>   	const struct Qdisc_ops  *ops = qdisc->ops;
>   
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>   	call_rcu(&qdisc->rcu, qdisc_free_cb);
>   }
>   
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return;
> +
> +	__qdisc_destroy(qdisc);
> +}
> +
>   void qdisc_put(struct Qdisc *qdisc)
>   {
>   	if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
>   	    !refcount_dec_and_test(&qdisc->refcnt))
>   		return;
>   
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>   }
>   EXPORT_SYMBOL(qdisc_put);
>   
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
>   	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
>   		return;
>   
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>   	rtnl_unlock();
>   }
>   EXPORT_SYMBOL(qdisc_put_unlocked);
Jamal Hadi Salim May 24, 2023, 4:09 p.m. UTC | #2
Pedro,
When you have a moment - could you run tc monitor in parallel to the
reproducer and double check it generates the correct events...

cheers,
jamal

On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
>
> [snip]
Paolo Abeni May 25, 2023, 9:25 a.m. UTC | #3
On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> When you have a moment - could you run tc monitor in parallel to the
> reproducer and double check it generates the correct events...

FTR, I'll wait a bit to apply this series, to allow for the above
tests. Unless someone will scream very loudly very soon, it's not going
to enter today's PR. Since the addressed issue is an ancient one, it
should not be a problem, I hope.

Cheers,

Paolo
Jamal Hadi Salim May 26, 2023, 12:19 p.m. UTC | #4
On Thu, May 25, 2023 at 5:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> > When you have a moment - could you run tc monitor in parallel to the
> > reproducer and double check it generates the correct events...
>
> FTR, I'll wait a bit to apply this series, to allow for the above
> tests. Unless someone will scream very loudly very soon, it's not going
> to enter today's PR. Since the addressed issue is an ancient one, it
> should not a problem, I hope.

Sorry, I do not mean to hold this back - I think we should have applied
v1 and then worried about making things better.  We can worry about
events later.
So Acking this.

cheers,
jamal

> Cheers,
>
> Paolo
>
Jamal Hadi Salim May 26, 2023, 12:20 p.m. UTC | #5
On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
>
> On 23/05/2023 22:20, Peilin Ye wrote:
> > From: Peilin Ye <peilin.ye@bytedance.com>
> >
> > mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> > ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
> > access this per-net_device pointer in mini_qdisc_pair_swap().  Similar for
> > clsact Qdiscs and miniq_egress.
> >
> > Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
> > requests (thanks Hillf Danton for the hint), when replacing ingress or
> > clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
> > miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
> > causing race conditions [1] including a use-after-free bug in
> > mini_qdisc_pair_swap() reported by syzbot:
> >
> >   BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> >   Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> > ...
> >   Call Trace:
> >    <TASK>
> >    __dump_stack lib/dump_stack.c:88 [inline]
> >    dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
> >    print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
> >    print_report mm/kasan/report.c:430 [inline]
> >    kasan_report+0x11c/0x130 mm/kasan/report.c:536
> >    mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> >    tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
> >    tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
> >    tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
> >    tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
> >    tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> > ...
> >
> > @old and @new should not affect each other.  In other words, @old should
> > never modify miniq_{in,e}gress after @new, and @new should not update
> > @old's RCU state.  Fixing without changing sch_api.c turned out to be
> > difficult (please refer to Closes: for discussions).  Instead, make sure
> > @new's first call always happen after @old's last call, in
> > qdisc_destroy(), has finished:
> >
> > In qdisc_graft(), return -EAGAIN and tell the caller to replay
> > (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
> > requests, and call qdisc_destroy() for @old before grafting @new.
> >
> > Introduce qdisc_refcount_dec_if_one() as the counterpart of
> > qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
> > a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> > just like qdisc_put() etc.
> >
> > Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
> > Qdiscs".
> >
> > [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> > TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> > create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
> >
> >    Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> >    adds a flower filter X to A.
> >
> >    Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> >    b2) to replace A, then adds a flower filter Y to B.
> >
> >   Thread 1               A's refcnt   Thread 2
> >    RTM_NEWQDISC (A, RTNL-locked)
> >     qdisc_create(A)               1
> >     qdisc_graft(A)                9
> >
> >    RTM_NEWTFILTER (X, RTNL-unlocked)
> >     __tcf_qdisc_find(A)          10
> >     tcf_chain0_head_change(A)
> >     mini_qdisc_pair_swap(A) (1st)
> >              |
> >              |                         RTM_NEWQDISC (B, RTNL-locked)
> >           RCU sync                2     qdisc_graft(B)
> >              |                    1     notify_and_destroy(A)
> >              |
> >     tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
> >     qdisc_destroy(A)                    tcf_chain0_head_change(B)
> >     tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
> >     mini_qdisc_pair_swap(A) (3rd)                |
> >             ...                                 ...
> >
> > Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> > mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
> > ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> > on eth0 will not find filter Y in sch_handle_ingress().
> >
> > This is only one of the possible consequences of concurrently accessing
> > miniq_{in,e}gress pointers.  The point is clear though: again, A should
> > never modify those per-net_device pointers after B, and B should not
> > update A's RCU state.
> >
> > Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> > Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> > Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> > Cc: Hillf Danton <hdanton@sina.com>
> > Cc: Vlad Buslov <vladbu@mellanox.com>
> > Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
>
> Tested-by: Pedro Tammela <pctammela@mojatatu.com>


Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>


cheers,
jamal
Jamal Hadi Salim May 26, 2023, 7:47 p.m. UTC | #6
On Fri, May 26, 2023 at 8:20 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
> >
> > On 23/05/2023 22:20, Peilin Ye wrote:
> > > From: Peilin Ye <peilin.ye@bytedance.com>

> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

I apologize, I am going to take this back - let's please hold the series for now.

While chasing the effect on events, Pedro and I spent a few _hours_
on this - and regardless of the events, there are still challenges
with the concurrency issue.  The current reproducer unfortunately
can't cause damage after patch 2, so patch 6 was really not being
tested.  We hacked the repro to hit the codepath that patch 6 fixes.
We are not sure what the root cause is - but it is certainly due to
the series.  Peilin, Pedro will post the new repro.

cheers,
jamal
Pedro Tammela May 26, 2023, 8:21 p.m. UTC | #7
On 26/05/2023 16:47, Jamal Hadi Salim wrote:
> [...] Peilin, Pedro will post the new repro.

Hi!

We tweaked the reproducer to:
---
r0 = socket$netlink(0x10, 0x3, 0x0)
r1 = socket(0x10, 0x803, 0x0)
sendmsg$nl_route_sched(r1, &(0x7f0000000300)={0x0, 0x0, 
&(0x7f0000000240)={&(0x7f0000000380)=ANY=[], 0x24}}, 0x0)
getsockname$packet(r1, &(0x7f0000000200)={0x11, 0x0, <r2=>0x0, 0x1, 0x0, 
0x6, @broadcast}, &(0x7f0000000440)=0x14)
sendmsg$nl_route(r0, &(0x7f0000000040)={0x0, 0x0, 
&(0x7f0000000000)={&(0x7f0000000080)=ANY=[@ANYBLOB="480000001000050700"/20, 
@ANYRES32=r2, @ANYBLOB="0000000000000000280012000900010076657468"], 
0x48}}, 0x0)
sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f00000002c0)={0x0, 0x0, 
&(0x7f0000000280)={&(0x7f0000000540)=@newqdisc={0x30, 0x24, 0xf0b, 0x0, 
0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}}, 
[@qdisc_kind_options=@q_ingress={0xc}]}, 0x30}}, 0x0)
sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f0000000340)={0x0, 0x0, 
&(0x7f00000000c0)={&(0x7f0000000580)=@newtfilter={0x3c, 0x2c, 0xd27, 
0x0, 0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}, {0xc}}, 
[@filter_kind_options=@f_flower={{0xb}, {0xc, 0x2, 
[@TCA_FLOWER_CLASSID={0x8}]}}]}, 0x3c}}, 0x0)
r4 = socket$netlink(0x10, 0x3, 0x0)
sendmmsg(r4, &(0x7f00000002c0), 0x40000000000009f, 0x0)
r5 = socket$netlink(0x10, 0x3, 0x0)
sendmmsg(r5, &(0x7f00000002c0), 0x40000000000009f, 0x0)
---

We then generate the C program with:
syz-prog2c -sandbox none -enable net_dev -threaded -repeat 0 -prog 
peilin.syz > repro.c

Now here comes a very important detail.  The above will create a new net
namespace in which to send the netlink messages.  We are only able to
reproduce the deadlock with your patches if we comment out the creation
of the new namespace:
---
diff --git a/repro.c b/repro.c
index ee8eb0726..5cdbfb289 100644
--- a/repro.c
+++ b/repro.c
@@ -1121,9 +1121,8 @@ static int do_sandbox_none(void)
    sandbox_common();
    drop_caps();
    initialize_netdevices_init();
-  if (unshare(CLONE_NEWNET)) {
-  }
+  // Doesn't seem to deadlock in a new netns
+  // if (unshare(CLONE_NEWNET)) {
+  // }
    write_file("/proc/sys/net/ipv4/ping_group_range", "0 65535");
    initialize_netdevices();
    setup_binderfs();

---

The reason we did this was to check the events with 'tc mon'.
The splat is quite big, see attached.  It has all the indications of a
deadlock on the rtnl_lock.

Thanks,
Pedro
[  291.882988][   T64] INFO: task kworker/1:2:4513 blocked for more than 143 seconds.
[  291.885716][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  291.887658][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  291.890484][   T64] task:kworker/1:2     state:D stack:28272 pid:4513  ppid:2      flags:0x00004000
[  291.893628][   T64] Workqueue: events_power_efficient reg_check_chans_work
[  291.896167][   T64] Call Trace:
[  291.897406][   T64]  <TASK>
[  291.898476][   T64]  __schedule+0xecb/0x59d0
[  291.900101][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.902218][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.904563][   T64]  ? io_schedule_timeout+0x150/0x150
[  291.906437][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  291.908241][   T64]  ? _raw_spin_unlock_irq+0x23/0x50
[  291.909564][   T64]  ? lockdep_hardirqs_on+0x7d/0x100
[  291.910912][   T64]  schedule+0xe7/0x1b0
[  291.912060][   T64]  schedule_preempt_disabled+0x13/0x20
[  291.913608][   T64]  __mutex_lock+0x907/0x12d0
[  291.914850][   T64]  ? reg_check_chans_work+0x7d/0xfe0
[  291.916249][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  291.917689][   T64]  ? reg_check_chans_work+0x7d/0xfe0
[  291.919051][   T64]  ? rtnl_lock+0x9/0x20
[  291.920164][   T64]  reg_check_chans_work+0x7d/0xfe0
[  291.921451][   T64]  ? lock_sync+0x190/0x190
[  291.922567][   T64]  ? freq_reg_info+0x1d0/0x1d0
[  291.923802][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.924910][   T64]  process_one_work+0x9f9/0x15f0
[  291.926155][   T64]  ? pwq_dec_nr_in_flight+0x2a0/0x2a0
[  291.927544][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.928706][   T64]  worker_thread+0x687/0x1110
[  291.929867][   T64]  ? process_one_work+0x15f0/0x15f0
[  291.930968][   T64]  kthread+0x334/0x430
[  291.931869][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  291.933092][   T64]  ret_from_fork+0x1f/0x30
[  291.934037][   T64]  </TASK>
[  291.934675][   T64] INFO: task kworker/3:2:6525 blocked for more than 143 seconds.
[  291.936271][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  291.937672][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  291.939514][   T64] task:kworker/3:2     state:D stack:28272 pid:6525  ppid:2      flags:0x00004000
[  291.941718][   T64] Workqueue: events linkwatch_event
[  291.943042][   T64] Call Trace:
[  291.943834][   T64]  <TASK>
[  291.944526][   T64]  __schedule+0xecb/0x59d0
[  291.945582][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.946952][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.948389][   T64]  ? io_schedule_timeout+0x150/0x150
[  291.949615][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  291.950847][   T64]  ? _raw_spin_unlock_irq+0x23/0x50
[  291.952077][   T64]  ? lockdep_hardirqs_on+0x7d/0x100
[  291.953344][   T64]  schedule+0xe7/0x1b0
[  291.954286][   T64]  schedule_preempt_disabled+0x13/0x20
[  291.955512][   T64]  __mutex_lock+0x907/0x12d0
[  291.956603][   T64]  ? linkwatch_event+0xf/0x70
[  291.957680][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  291.958953][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.959996][   T64]  ? linkwatch_event+0xf/0x70
[  291.960738][   T64]  ? rtnl_lock+0x9/0x20
[  291.961249][   T64]  linkwatch_event+0xf/0x70
[  291.961770][   T64]  process_one_work+0x9f9/0x15f0
[  291.962306][   T64]  ? pwq_dec_nr_in_flight+0x2a0/0x2a0
[  291.962928][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.963409][   T64]  worker_thread+0x687/0x1110
[  291.963982][   T64]  ? __kthread_parkme+0x152/0x220
[  291.964567][   T64]  ? process_one_work+0x15f0/0x15f0
[  291.965163][   T64]  kthread+0x334/0x430
[  291.965641][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  291.966303][   T64]  ret_from_fork+0x1f/0x30
[  291.966836][   T64]  </TASK>
[  291.967189][   T64] INFO: task systemd-udevd:6764 blocked for more than 143 seconds.
[  291.968062][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  291.968813][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  291.969691][   T64] task:systemd-udevd   state:D stack:25968 pid:6764  ppid:4544   flags:0x00000006
[  291.970609][   T64] Call Trace:
[  291.970939][   T64]  <TASK>
[  291.971239][   T64]  __schedule+0xecb/0x59d0
[  291.971718][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.972341][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.972965][   T64]  ? io_schedule_timeout+0x150/0x150
[  291.973572][   T64]  ? __mutex_lock+0x902/0x12d0
[  291.974120][   T64]  schedule+0xe7/0x1b0
[  291.974579][   T64]  schedule_preempt_disabled+0x13/0x20
[  291.975174][   T64]  __mutex_lock+0x907/0x12d0
[  291.975702][   T64]  ? rtnetlink_rcv_msg+0x3e2/0xd30
[  291.976299][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  291.976938][   T64]  ? rtnetlink_rcv_msg+0x3e2/0xd30
[  291.977506][   T64]  rtnetlink_rcv_msg+0x3e2/0xd30
[  291.978042][   T64]  ? rtnl_getlink+0xb10/0xb10
[  291.978573][   T64]  ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  291.979295][   T64]  ? __module_address+0x55/0x3b0
[  291.979937][   T64]  ? netdev_core_pick_tx+0x390/0x390
[  291.980580][   T64]  netlink_rcv_skb+0x166/0x440
[  291.981164][   T64]  ? rtnl_getlink+0xb10/0xb10
[  291.981726][   T64]  ? netlink_ack+0x1370/0x1370
[  291.982310][   T64]  ? kasan_set_track+0x25/0x30
[  291.982973][   T64]  ? netlink_deliver_tap+0x1b1/0xd00
[  291.983579][   T64]  netlink_unicast+0x530/0x800
[  291.984132][   T64]  ? netlink_attachskb+0x880/0x880
[  291.984682][   T64]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  291.985330][   T64]  ? __phys_addr_symbol+0x30/0x70
[  291.985904][   T64]  ? __check_object_size+0x323/0x740
[  291.986488][   T64]  netlink_sendmsg+0x90b/0xe10
[  291.987023][   T64]  ? netlink_unicast+0x800/0x800
[  291.987584][   T64]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  291.988189][   T64]  ? netlink_unicast+0x800/0x800
[  291.988737][   T64]  sock_sendmsg+0xd9/0x180
[  291.989228][   T64]  __sys_sendto+0x201/0x2f0
[  291.989714][   T64]  ? __ia32_sys_getpeername+0xb0/0xb0
[  291.990296][   T64]  ? get_random_bytes+0x20/0x20
[  291.990833][   T64]  ? rcu_is_watching+0x12/0xb0
[  291.991368][   T64]  ? __rseq_handle_notify_resume+0x5c7/0xfe0
[  291.992062][   T64]  __x64_sys_sendto+0xe0/0x1b0
[  291.992600][   T64]  ? syscall_enter_from_user_mode+0x26/0x80
[  291.993324][   T64]  do_syscall_64+0x38/0xb0
[  291.993816][   T64]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  291.994436][   T64] RIP: 0033:0x7f1de823fcf7
[  291.994902][   T64] RSP: 002b:00007ffd8c5d3208 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[  291.995834][   T64] RAX: ffffffffffffffda RBX: 000055d9974d14e0 RCX: 00007f1de823fcf7
[  291.996711][   T64] RDX: 0000000000000020 RSI: 000055d997332930 RDI: 0000000000000005
[  291.997593][   T64] RBP: 00007ffd8c5d33b8 R08: 00007ffd8c5d3280 R09: 0000000000000080
[  291.998477][   T64] R10: 0000000000000000 R11: 0000000000000202 R12: 00000000043abe07
[  291.999360][   T64] R13: 0000000000000001 R14: 000055d9974845a0 R15: 0000000000000000
[  292.000254][   T64]  </TASK>
[  292.000609][   T64] INFO: task systemd-udevd:6773 blocked for more than 143 seconds.
[  292.001482][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.002233][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  292.003237][   T64] task:systemd-udevd   state:D stack:25968 pid:6773  ppid:4544   flags:0x00004006
[  292.004292][   T64] Call Trace:
[  292.004743][   T64]  <TASK>
[  292.005128][   T64]  __schedule+0xecb/0x59d0
[  292.005637][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.006304][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.007005][   T64]  ? io_schedule_timeout+0x150/0x150
[  292.007601][   T64]  ? __mutex_lock+0x902/0x12d0
[  292.008174][   T64]  schedule+0xe7/0x1b0
[  292.008648][   T64]  schedule_preempt_disabled+0x13/0x20
[  292.009257][   T64]  __mutex_lock+0x907/0x12d0
[  292.009793][   T64]  ? dev_ethtool+0x21d/0x53e0
[  292.010374][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  292.011105][   T64]  ? kasan_unpoison+0x44/0x70
[  292.011987][   T64]  ? __kmem_cache_alloc_node+0x19e/0x330
[  292.012994][   T64]  ? dev_ethtool+0x170/0x53e0
[  292.013757][   T64]  ? dev_ethtool+0x21d/0x53e0
[  292.014470][   T64]  ? rtnl_lock+0x9/0x20
[  292.015032][   T64]  dev_ethtool+0x21d/0x53e0
[  292.015633][   T64]  ? filter_irq_stacks+0x90/0x90
[  292.016269][   T64]  ? print_usage_bug.part.0+0x670/0x670
[  292.016941][   T64]  ? ethtool_get_module_info_call+0x1c0/0x1c0
[  292.017674][   T64]  ? kasan_set_track+0x25/0x30
[  292.018228][   T64]  ? ____kasan_slab_free+0x15e/0x1b0
[  292.018794][   T64]  ? slab_free_freelist_hook+0x10b/0x1e0
[  292.019287][   T64]  ? __kmem_cache_free+0xaf/0x2e0
[  292.019729][   T64]  ? tomoyo_path_number_perm+0x43a/0x530
[  292.020250][   T64]  ? security_file_ioctl+0x72/0xb0
[  292.020700][   T64]  ? __x64_sys_ioctl+0xbb/0x210
[  292.021128][   T64]  ? do_syscall_64+0x38/0xb0
[  292.021538][   T64]  ? mark_lock+0x105/0x1960
[  292.021950][   T64]  ? __lock_acquire+0xc42/0x5cf0
[  292.022399][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.022990][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.023514][   T64]  ? lockdep_hardirqs_on+0x7d/0x100
[  292.023995][   T64]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  292.024513][   T64]  ? find_held_lock+0x2d/0x110
[  292.024940][   T64]  ? dev_load+0x7d/0x240
[  292.025322][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.025801][   T64]  ? full_name_hash+0xbc/0x110
[  292.026220][   T64]  dev_ioctl+0x29e/0x1090
[  292.026601][   T64]  sock_do_ioctl+0x1be/0x250
[  292.027006][   T64]  ? get_user_ifreq+0x250/0x250
[  292.027433][   T64]  ? do_vfs_ioctl+0x34a/0x17f0
[  292.027870][   T64]  ? vfs_fileattr_set+0xbf0/0xbf0
[  292.028318][   T64]  sock_ioctl+0x205/0x6a0
[  292.028697][   T64]  ? br_ioctl_call+0xb0/0xb0
[  292.029145][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.029827][   T64]  ? handle_mm_fault+0x32c/0x9e0
[  292.030501][   T64]  ? bpf_lsm_file_ioctl+0x9/0x10
[  292.031054][   T64]  ? br_ioctl_call+0xb0/0xb0
[  292.031527][   T64]  __x64_sys_ioctl+0x189/0x210
[  292.032091][   T64]  do_syscall_64+0x38/0xb0
[  292.032609][   T64]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  292.033390][   T64] RIP: 0033:0x7f1de82319ef
[  292.033908][   T64] RSP: 002b:00007ffd8c5d32f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  292.034751][   T64] RAX: ffffffffffffffda RBX: 000055d997476d20 RCX: 00007f1de82319ef
[  292.035448][   T64] RDX: 00007ffd8c5d3430 RSI: 0000000000008946 RDI: 000000000000000b
[  292.036179][   T64] RBP: 000055d997488a50 R08: 0000000000000000 R09: 0000000000000006
[  292.036884][   T64] R10: 00007ffd8c5d3400 R11: 0000000000000246 R12: 000055d99732a230
[  292.037581][   T64] R13: 00007ffd8c5d3430 R14: 0000000000000000 R15: 000055d99732a238
[  292.038268][   T64]  </TASK>
[  292.038557][   T64] INFO: task repro:7652 blocked for more than 143 seconds.
[  292.039202][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.039776][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  292.040509][   T64] task:repro           state:D stack:25984 pid:7652  ppid:6753   flags:0x00004006
[  292.041289][   T64] Call Trace:
[  292.041575][   T64]  <TASK>
[  292.041848][   T64]  __schedule+0xecb/0x59d0
[  292.042297][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.043017][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.043711][   T64]  ? io_schedule_timeout+0x150/0x150
[  292.044360][   T64]  ? __mutex_lock+0x902/0x12d0
[  292.044845][   T64]  schedule+0xe7/0x1b0
[  292.045300][   T64]  schedule_preempt_disabled+0x13/0x20
[  292.045872][   T64]  __mutex_lock+0x907/0x12d0
[  292.046396][   T64]  ? fl_change+0x2102/0x51e0
[  292.046833][   T64]  ? tc_setup_cb_reoffload+0x100/0x100
[  292.047353][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  292.047915][   T64]  ? _raw_spin_unlock+0x28/0x40
[  292.048368][   T64]  ? tcf_exts_init_ex+0x1bc/0x610
[  292.048828][   T64]  ? fl_change+0x2102/0x51e0
[  292.049328][   T64]  ? rtnl_lock+0x9/0x20
[  292.049768][   T64]  fl_change+0x2102/0x51e0
[  292.050232][   T64]  ? find_held_lock+0x2d/0x110
[  292.050681][   T64]  ? __rhashtable_insert_fast.constprop.0.isra.0+0x14a0/0x14a0
[  292.051416][   T64]  ? fl_get+0x21c/0x3c0
[  292.051791][   T64]  ? fl_put+0x20/0x20
[  292.052170][   T64]  ? mini_qdisc_pair_swap+0x128/0x1f0
[  292.052655][   T64]  ? __rhashtable_insert_fast.constprop.0.isra.0+0x14a0/0x14a0
[  292.053395][   T64]  tc_new_tfilter+0x992/0x22b0
[  292.053858][   T64]  ? tc_del_tfilter+0x1570/0x1570
[  292.054419][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.055066][   T64]  ? kasan_quarantine_put+0x102/0x230
[  292.055660][   T64]  ? rtnetlink_rcv_msg+0x94a/0xd30
[  292.056224][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.056799][   T64]  ? bpf_lsm_capable+0x9/0x10
[  292.057311][   T64]  ? tc_del_tfilter+0x1570/0x1570
[  292.057843][   T64]  rtnetlink_rcv_msg+0x98a/0xd30
[  292.058381][   T64]  ? rtnl_getlink+0xb10/0xb10
[  292.058833][   T64]  ? __x64_sys_sendmmsg+0x9c/0x100
[  292.059305][   T64]  ? do_syscall_64+0x38/0xb0
[  292.059752][   T64]  ? netdev_core_pick_tx+0x390/0x390
[  292.060297][   T64]  netlink_rcv_skb+0x166/0x440
[  292.060733][   T64]  ? rtnl_getlink+0xb10/0xb10
[  292.061146][   T64]  ? netlink_ack+0x1370/0x1370
[  292.061566][   T64]  ? kasan_set_track+0x25/0x30
[  292.062003][   T64]  ? netlink_deliver_tap+0x1b1/0xd00
[  292.062468][   T64]  netlink_unicast+0x530/0x800
[  292.062955][   T64]  ? netlink_attachskb+0x880/0x880
[  292.063416][   T64]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  292.063958][   T64]  ? __phys_addr_symbol+0x30/0x70
[  292.064468][   T64]  ? __check_object_size+0x323/0x740
[  292.065009][   T64]  netlink_sendmsg+0x90b/0xe10
[  292.065517][   T64]  ? netlink_unicast+0x800/0x800
[  292.066010][   T64]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  292.066467][   T64]  ? netlink_unicast+0x800/0x800
[  292.066933][   T64]  sock_sendmsg+0xd9/0x180
[  292.067358][   T64]  ____sys_sendmsg+0x264/0x910
[  292.067794][   T64]  ? kernel_sendmsg+0x50/0x50
[  292.068252][   T64]  ? __copy_msghdr+0x460/0x460
[  292.068690][   T64]  ___sys_sendmsg+0x11d/0x1b0
[  292.069111][   T64]  ? do_recvmmsg+0x700/0x700
[  292.069551][   T64]  ? find_held_lock+0x2d/0x110
[  292.069972][   T64]  ? __might_fault+0xe5/0x190
[  292.070397][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.070862][   T64]  __sys_sendmmsg+0x18e/0x430
[  292.071272][   T64]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  292.071705][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.072204][   T64]  ? rcu_is_watching+0x12/0xb0
[  292.072620][   T64]  ? xfd_validate_state+0x5d/0x180
[  292.073099][   T64]  ? restore_fpregs_from_fpstate+0xc1/0x1d0
[  292.073611][   T64]  ? unlock_page_memcg+0x2d0/0x2d0
[  292.074062][   T64]  ? do_futex+0x350/0x350
[  292.074461][   T64]  __x64_sys_sendmmsg+0x9c/0x100
[  292.074893][   T64]  ? syscall_enter_from_user_mode+0x26/0x80
[  292.075494][   T64]  do_syscall_64+0x38/0xb0
[  292.075960][   T64]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  292.076473][   T64] RIP: 0033:0x7f3f4319289d
[  292.076850][   T64] RSP: 002b:00007f3f43056e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[  292.077557][   T64] RAX: ffffffffffffffda RBX: 0000558de6727248 RCX: 00007f3f4319289d
[  292.078234][   T64] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000006
[  292.078924][   T64] RBP: 0000000000000006 R08: 0000000000000000 R09: 0000000000000000
[  292.079586][   T64] R10: 0000000000000000 R11: 0000000000000246 R12: 0000558de672724c
[  292.080259][   T64] R13: 0000558de6723030 R14: 040000000000009f R15: 0000558de6727240
[  292.080943][   T64]  </TASK>
[  292.081221][   T64]
[  292.081221][   T64] Showing all locks held in the system:
[  292.081868][   T64] 1 lock held by rcu_tasks_kthre/14:
[  292.082316][   T64]  #0: ffffffff8c798430 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2c/0xe20
[  292.083266][   T64] 1 lock held by rcu_tasks_trace/15:
[  292.083721][   T64]  #0: ffffffff8c798130 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2c/0xe20
[  292.084702][   T64] 5 locks held by ksoftirqd/4/38:
[  292.085152][   T64] 3 locks held by kworker/4:0/39:
[  292.085598][   T64]  #0: ffff88811b78b938 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[  292.086704][   T64]  #1: ffffc9000139fda0 ((work_completion)(&(&net->ipv6.addr_chk_work)->work)){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[  292.088126][   T64]  #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: addrconf_verify_work+0x12/0x30
[  292.089159][   T64] 1 lock held by khungtaskd/64:
[  292.089656][   T64]  #0: ffffffff8c799040 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x55/0x340
[  292.090713][   T64] 3 locks held by kworker/1:2/4513:
[  292.091175][   T64]  #0: ffff888100079d38 ((wq_completion)events_power_efficient){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[  292.092274][   T64]  #1: ffffc90016fcfda0 ((reg_check_chans).work){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[  292.093319][   T64]  #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: reg_check_chans_work+0x7d/0xfe0
[  292.094269][   T64] 3 locks held by kworker/3:2/6525:
[  292.094801][   T64]  #0: ffff888100078d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[  292.095834][   T64]  #1: ffffc9000648fda0 ((linkwatch_work).work){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[  292.096754][   T64]  #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: linkwatch_event+0xf/0x70
[  292.097662][   T64] 1 lock held by systemd-udevd/6764:
[  292.098148][   T64]  #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x3e2/0xd30
[  292.098984][   T64] 1 lock held by systemd-udevd/6773:
[  292.099462][   T64]  #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: dev_ethtool+0x21d/0x53e0
[  292.100300][   T64] 1 lock held by repro/7652:
[  292.100700][   T64]  #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: fl_change+0x2102/0x51e0
[  292.101494][   T64] 2 locks held by repro/7653:
[  292.101905][   T64]
[  292.102135][   T64] =============================================
[  292.102135][   T64]
[  292.102905][   T64] NMI backtrace for cpu 5
[  292.103277][   T64] CPU: 5 PID: 64 Comm: khungtaskd Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.104054][   T64] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  292.104931][   T64] Call Trace:
[  292.105222][   T64]  <TASK>
[  292.105481][   T64]  dump_stack_lvl+0xd9/0x1b0
[  292.105889][   T64]  nmi_cpu_backtrace+0x277/0x380
[  292.106318][   T64]  ? lapic_can_unplug_cpu+0xa0/0xa0
[  292.106770][   T64]  nmi_trigger_cpumask_backtrace+0x2ac/0x310
[  292.107290][   T64]  watchdog+0xed1/0x1150
[  292.107715][   T64]  ? proc_dohung_task_timeout_secs+0x90/0x90
[  292.108330][   T64]  kthread+0x334/0x430
[  292.108736][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  292.109285][   T64]  ret_from_fork+0x1f/0x30
[  292.109720][   T64]  </TASK>
[  292.109991][   T64] Sending NMI from CPU 5 to CPUs 0-4,6-7:
[  292.110569][    C3] NMI backtrace for cpu 3
[  292.110577][    C3] CPU: 3 PID: 7653 Comm: repro Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.110586][    C3] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  292.110591][    C3] RIP: 0010:unwind_get_return_address+0x6e/0xa0
[  292.110604][    C3] Code: ea 03 80 3c 02 00 75 32 48 8b 7b 48 e8 3b 46 1c 00 85 c0 74 d3 48 b8 00 00 00 00 00 fc ff df 48 89 ea 48 c1 ea 03 80 3c 02 00 <75> 18 48 8b 43 48 5b 5d c3 e8 14 c5 9e 00 eb a8 48 89 ef e8 2a c5
[  292.110611][    C3] RSP: 0018:ffffc900095d6ed0 EFLAGS: 00000246
[  292.110618][    C3] RAX: dffffc0000000000 RBX: ffffc900095d6ee8 RCX: 0000000000000000
[  292.110623][    C3] RDX: 1ffff920012bade6 RSI: ffffc900095d72a0 RDI: ffffffff88370cd5
[  292.110628][    C3] RBP: ffffc900095d6f30 R08: ffffc900095d6f1c R09: ffffffff8f831a34
[  292.110633][    C3] R10: ffffc900095d6ee8 R11: 0000000000072d0d R12: ffffc900095d6fa0
[  292.110637][    C3] R13: 0000000000000000 R14: ffff888173271dc0 R15: ffff888171b140a0
[  292.110645][    C3] FS:  00007f3f430376c0(0000) GS:ffff888437c80000(0000) knlGS:0000000000000000
[  292.110653][    C3] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  292.110658][    C3] CR2: 0000558de83422d8 CR3: 0000000116ff7000 CR4: 0000000000350ee0
[  292.110664][    C3] Call Trace:
[  292.110667][    C3]  <TASK>
[  292.110670][    C3]  ? write_profile+0x440/0x440
[  292.110684][    C3]  arch_stack_walk+0x97/0xf0
[  292.110697][    C3]  ? ingress_init+0x165/0x220
[  292.110706][    C3]  stack_trace_save+0x96/0xd0
[  292.110716][    C3]  ? filter_irq_stacks+0x90/0x90
[  292.110727][    C3]  kasan_save_stack+0x20/0x40
[  292.110741][    C3]  ? kasan_save_stack+0x20/0x40
[  292.110753][    C3]  ? kasan_set_track+0x25/0x30
[  292.110766][    C3]  ? __kasan_kmalloc+0xa2/0xb0
[  292.110776][    C3]  ? tcf_block_get_ext+0x14f/0x1900
[  292.110785][    C3]  ? ingress_init+0x165/0x220
[  292.110792][    C3]  ? qdisc_create+0x4f7/0x1090
[  292.110799][    C3]  ? tc_modify_qdisc+0xd38/0x1b70
[  292.110806][    C3]  ? rtnetlink_rcv_msg+0x439/0xd30
[  292.110813][    C3]  ? netlink_rcv_skb+0x166/0x440
[  292.110824][    C3]  ? netlink_unicast+0x530/0x800
[  292.110835][    C3]  ? netlink_sendmsg+0x90b/0xe10
[  292.110847][    C3]  ? find_held_lock+0x2d/0x110
[  292.110857][    C3]  ? __kmem_cache_alloc_node+0x49/0x330
[  292.110867][    C3]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.110878][    C3]  ? kasan_unpoison+0x44/0x70
[  292.110885][    C3]  ? __kasan_slab_alloc+0x4e/0x90
[  292.110896][    C3]  ? __kmem_cache_alloc_node+0x19e/0x330
[  292.110906][    C3]  ? tcf_block_get_ext+0x14f/0x1900
[  292.110915][    C3]  kasan_set_track+0x25/0x30
[  292.110925][    C3]  __kasan_kmalloc+0xa2/0xb0
[  292.110935][    C3]  tcf_block_get_ext+0x14f/0x1900
[  292.110945][    C3]  ingress_init+0x165/0x220
[  292.110952][    C3]  ? clsact_init+0x330/0x330
[  292.110959][    C3]  qdisc_create+0x4f7/0x1090
[  292.110967][    C3]  ? tc_get_qdisc+0xb80/0xb80
[  292.110975][    C3]  tc_modify_qdisc+0xd38/0x1b70
[  292.110984][    C3]  ? qdisc_create+0x1090/0x1090
[  292.110994][    C3]  ? bpf_lsm_capable+0x9/0x10
[  292.111004][    C3]  ? qdisc_create+0x1090/0x1090
[  292.111012][    C3]  rtnetlink_rcv_msg+0x439/0xd30
[  292.111019][    C3]  ? rtnl_getlink+0xb10/0xb10
[  292.111025][    C3]  ? __x64_sys_sendmmsg+0x9c/0x100
[  292.111034][    C3]  ? do_syscall_64+0x38/0xb0
[  292.111043][    C3]  ? netdev_core_pick_tx+0x390/0x390
[  292.111056][    C3]  netlink_rcv_skb+0x166/0x440
[  292.111067][    C3]  ? rtnl_getlink+0xb10/0xb10
[  292.111074][    C3]  ? netlink_ack+0x1370/0x1370
[  292.111085][    C3]  ? kasan_set_track+0x25/0x30
[  292.111097][    C3]  ? netlink_deliver_tap+0x1b1/0xd00
[  292.111109][    C3]  netlink_unicast+0x530/0x800
[  292.111121][    C3]  ? netlink_attachskb+0x880/0x880
[  292.111132][    C3]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  292.111142][    C3]  ? __phys_addr_symbol+0x30/0x70
[  292.111153][    C3]  ? __check_object_size+0x323/0x740
[  292.111161][    C3]  netlink_sendmsg+0x90b/0xe10
[  292.111173][    C3]  ? netlink_unicast+0x800/0x800
[  292.111185][    C3]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  292.111193][    C3]  ? netlink_unicast+0x800/0x800
[  292.111204][    C3]  sock_sendmsg+0xd9/0x180
[  292.111216][    C3]  ____sys_sendmsg+0x264/0x910
[  292.111227][    C3]  ? kernel_sendmsg+0x50/0x50
[  292.111238][    C3]  ? __copy_msghdr+0x460/0x460
[  292.111246][    C3]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.111258][    C3]  ___sys_sendmsg+0x11d/0x1b0
[  292.111266][    C3]  ? do_recvmmsg+0x700/0x700
[  292.111274][    C3]  ? __fget_files+0x260/0x420
[  292.111284][    C3]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.111296][    C3]  ? __fget_files+0x282/0x420
[  292.111307][    C3]  ? __fget_light+0xe6/0x270
[  292.111318][    C3]  __sys_sendmmsg+0x18e/0x430
[  292.111327][    C3]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  292.111334][    C3]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.111346][    C3]  ? rcu_is_watching+0x12/0xb0
[  292.111358][    C3]  ? __rseq_handle_notify_resume+0x5c7/0xfe0
[  292.111368][    C3]  ? __do_sys_rseq+0x750/0x750
[  292.111376][    C3]  ? unlock_page_memcg+0x2d0/0x2d0
[  292.111387][    C3]  ? do_futex+0x350/0x350
[  292.111395][    C3]  __x64_sys_sendmmsg+0x9c/0x100
[  292.111403][    C3]  ? syscall_enter_from_user_mode+0x26/0x80
[  292.111414][    C3]  do_syscall_64+0x38/0xb0
[  292.111421][    C3]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  292.111433][    C3] RIP: 0033:0x7f3f4319289d
[  292.111440][    C3] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 89 01 48
[  292.111447][    C3] RSP: 002b:00007f3f43035e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[  292.111454][    C3] RAX: ffffffffffffffda RBX: 0000558de6727258 RCX: 00007f3f4319289d
[  292.111459][    C3] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
[  292.111463][    C3] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  292.111467][    C3] R10: 0000000000000000 R11: 0000000000000246 R12: 0000558de672725c
[  292.111472][    C3] R13: 0000558de6723030 R14: 040000000000009f R15: 0000558de6727250
[  292.111479][    C3]  </TASK>
[  292.111481][    C4] NMI backtrace for cpu 4 skipped: idling at default_idle+0xf/0x20
[  292.092844][    C7] NMI backtrace for cpu 7 skipped: idling at default_idle+0xf/0x20
[  292.092832][    C2] NMI backtrace for cpu 2 skipped: idling at default_idle+0xf/0x20
[  292.092832][    C2] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.014 msecs
[  292.092820][    C1] NMI backtrace for cpu 1 skipped: idling at default_idle+0xf/0x20
[  292.092820][    C1] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.033 msecs
[  292.102824][    C6] NMI backtrace for cpu 6 skipped: idling at default_idle+0xf/0x20
[  292.102824][    C6] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.054 msecs
[  292.102812][    C0] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20
[  292.102812][    C0] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.092 msecs
[  292.112559][   T64] Kernel panic - not syncing: hung_task: blocked tasks
[  292.112565][   T64] CPU: 5 PID: 64 Comm: khungtaskd Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.112575][   T64] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  292.112582][   T64] Call Trace:
[  292.112586][   T64]  <TASK>
[  292.112590][   T64]  dump_stack_lvl+0xd9/0x1b0
[  292.112603][   T64]  panic+0x689/0x730
[  292.112615][   T64]  ? panic_smp_self_stop+0xa0/0xa0
[  292.112631][   T64]  ? __irq_work_queue_local+0x132/0x3f0
[  292.112641][   T64]  ? irq_work_queue+0x2a/0x70
[  292.112649][   T64]  ? __wake_up_klogd.part.0+0x99/0xf0
[  292.112663][   T64]  ? watchdog+0xc89/0x1150
[  292.112678][   T64]  watchdog+0xc9a/0x1150
[  292.112698][   T64]  ? proc_dohung_task_timeout_secs+0x90/0x90
[  292.112713][   T64]  kthread+0x334/0x430
[  292.112723][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  292.112734][   T64]  ret_from_fork+0x1f/0x30
[  292.112750][   T64]  </TASK>
[  292.112866][   T64] Kernel Offset: disabled
[  292.112866][   T64] Rebooting in 86400 seconds..
Peilin Ye May 26, 2023, 11:09 p.m. UTC | #8
On Fri, May 26, 2023 at 05:21:34PM -0300, Pedro Tammela wrote:
> On 26/05/2023 16:47, Jamal Hadi Salim wrote:
> > [...] Peilin, Pedro will post the new repro.
> 
> We tweaked the reproducer to:
> ---
> r0 = socket$netlink(0x10, 0x3, 0x0)
> r1 = socket(0x10, 0x803, 0x0)
> sendmsg$nl_route_sched(r1, &(0x7f0000000300)={0x0, 0x0,
> &(0x7f0000000240)={&(0x7f0000000380)=ANY=[], 0x24}}, 0x0)
> getsockname$packet(r1, &(0x7f0000000200)={0x11, 0x0, <r2=>0x0, 0x1, 0x0,
> 0x6, @broadcast}, &(0x7f0000000440)=0x14)
> sendmsg$nl_route(r0, &(0x7f0000000040)={0x0, 0x0,
> &(0x7f0000000000)={&(0x7f0000000080)=ANY=[@ANYBLOB="480000001000050700"/20,
> @ANYRES32=r2, @ANYBLOB="0000000000000000280012000900010076657468"], 0x48}},
> 0x0)
> sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f00000002c0)={0x0, 0x0,
> &(0x7f0000000280)={&(0x7f0000000540)=@newqdisc={0x30, 0x24, 0xf0b, 0x0, 0x0,
> {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}},
> [@qdisc_kind_options=@q_ingress={0xc}]}, 0x30}}, 0x0)
> sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f0000000340)={0x0, 0x0,
> &(0x7f00000000c0)={&(0x7f0000000580)=@newtfilter={0x3c, 0x2c, 0xd27, 0x0,
> 0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}, {0xc}},
> [@filter_kind_options=@f_flower={{0xb}, {0xc, 0x2,
> [@TCA_FLOWER_CLASSID={0x8}]}}]}, 0x3c}}, 0x0)
> r4 = socket$netlink(0x10, 0x3, 0x0)
> sendmmsg(r4, &(0x7f00000002c0), 0x40000000000009f, 0x0)
> r5 = socket$netlink(0x10, 0x3, 0x0)
> sendmmsg(r5, &(0x7f00000002c0), 0x40000000000009f, 0x0)
> ---
> 
> We then generate the C program with:
> syz-prog2c -sandbox none -enable net_dev -threaded -repeat 0 -prog
> peilin.syz > repro.c
> 
> Now here comes a very important detail. The above will create a new net
> namespace to shoot the netlink messages. We are only able to reproduce the
> deadlock with your patches if we comment the creation of the new namespace
> out:
> ---
> diff --git a/repro.c b/repro.c
> index ee8eb0726..5cdbfb289 100644
> --- a/repro.c
> +++ b/repro.c
> @@ -1121,9 +1121,8 @@ static int do_sandbox_none(void)
>    sandbox_common();
>    drop_caps();
>    initialize_netdevices_init();
> -  if (unshare(CLONE_NEWNET)) {
> -  }
> +  // Doesn't seem to deadlock in a new netns
> +  // if (unshare(CLONE_NEWNET)) {
> +  // }
>    write_file("/proc/sys/net/ipv4/ping_group_range", "0 65535");
>    initialize_netdevices();
>    setup_binderfs();
> 
> ---
> 
> The reason we did this was to check on the event with 'tc mon'.
> The splat is quite big, see attached. It has all the indications of a
> deadlock on the rtnl_lock.

Thanks a lot, I'll get right on it.

Peilin Ye
Jakub Kicinski May 27, 2023, 2:33 a.m. UTC | #9
On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> Thanks a lot, I'll get right on it.

Any insights? Is it just a live-lock inherent to the retry scheme 
or do we actually forget to release the lock/refcnt?
Peilin Ye May 27, 2023, 8:23 a.m. UTC | #10
Hi Jakub and all,

On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> > Thanks a lot, I'll get right on it.
>
> Any insights? Is it just a live-lock inherent to the retry scheme
> or we actually forget to release the lock/refcnt?

I think it's just a thread holding the RTNL mutex for too long (replaying
too many times).  We could replay an arbitrary number of times in
tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
requests for the old Qdisc.
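The check driving those replays can be modeled in plain C. This is a hedged userspace sketch: it mirrors the compare-and-swap semantics of qdisc_refcount_dec_if_one() as used in the diff below, not the kernel's actual refcount_t implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of qdisc_refcount_dec_if_one(): drop the reference only if it
 * is the last one.  While any RTNL-unlocked filter request still holds
 * an extra reference, this fails, and the grafting path returns
 * -EAGAIN and replays the whole request. */
static bool dec_if_one(atomic_int *refcnt)
{
    int expected = 1;
    return atomic_compare_exchange_strong(refcnt, &expected, 0);
}
```

With a steady stream of filter requests each bumping the refcount, the grafting task can keep failing this check on every replay, which is the "replaying too many times" behavior described above.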

I tested the new reproducer Pedro posted, on:

1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported

2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
   any issues (in about 30 minutes).

3. All 6 v5 patches, plus this diff:

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 286b7c58f5b9..988718ba5abe 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
                         * RTNL-unlocked filter request(s).  This is the counterpart of that
                         * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
                         */
-                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
+                               rtnl_unlock();
+                               rtnl_lock();
                                return -EAGAIN;
+                       }
                }

                if (dev->flags & IFF_UP)

   Did not trigger any issues (in about 30 minutes) either.

What would you suggest?

Thanks,
Peilin Ye
Jamal Hadi Salim May 28, 2023, 6:54 p.m. UTC | #11
On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> Hi Jakub and all,
>
> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> > > Thanks a lot, I'll get right on it.
> >
> > Any insights? Is it just a live-lock inherent to the retry scheme
> > or we actually forget to release the lock/refcnt?
>
> I think it's just a thread holding the RTNL mutex for too long (replaying
> too many times).  We could replay for arbitrary times in
> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
> requests for the old Qdisc.
>
> I tested the new reproducer Pedro posted, on:
>
> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>
> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>    any issues (in about 30 minutes).
>
> 3. All 6 v5 patches, plus this diff:
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..988718ba5abe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>                          */
> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> +                               rtnl_unlock();
> +                               rtnl_lock();
>                                 return -EAGAIN;
> +                       }
>                 }
>
>                 if (dev->flags & IFF_UP)
>
>    Did not trigger any issues (in about 30 minutes) either.
>
> What would you suggest?


I am more worried it is a whack-a-mole situation. We fixed the first
reproducer with essentially patches 1-4 but we opened a new one which
the second reproducer catches. One thing the current reproducer does
is create a lot of rtnl contention in the beginning by creating all those
devices; after that it is just creating/deleting qdiscs and doing
updates with flower, where such contention is reduced, i.e. it may just
take longer for the mole to pop up.

Why don't we push the V1 patch in and then worry about getting clever
with EAGAIN after? Can you test the V1 version with the repro Pedro
posted? It shouldn't have these issues. Also it would be interesting to
see how performance of the parallel updates to flower is affected.

cheers,
jamal

> Thanks,
> Peilin Ye
>
Vlad Buslov May 29, 2023, 11:50 a.m. UTC | #12
On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>>
>> Hi Jakub and all,
>>
>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
>> > > Thanks a lot, I'll get right on it.
>> >
>> > Any insights? Is it just a live-lock inherent to the retry scheme
>> > or we actually forget to release the lock/refcnt?
>>
>> I think it's just a thread holding the RTNL mutex for too long (replaying
>> too many times).  We could replay for arbitrary times in
>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
>> requests for the old Qdisc.

After looking very carefully at the code I think I know what the issue
might be:

   Task 1 graft Qdisc   Task 2 new filter
           +                    +
           |                    |
           v                    v
        rtnl_lock()       take  q->refcnt
           +                    +
           |                    |
           v                    v
Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN

This will cause a real deadlock with the proposed patch. I'll try to
come up with a better approach. Sorry for not seeing it earlier.
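The circular wait in the diagram can be reproduced with a small userspace model. Everything here is a stand-in: a pthread mutex plays the RTNL mutex, an atomic int plays q->refcnt, and the function names are illustrative, not kernel APIs.

```c
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t rtnl = PTHREAD_MUTEX_INITIALIZER;
static atomic_int refcnt;

/* Task 2: new-filter path -- takes the Qdisc reference first, then
 * blocks on "RTNL", exactly the ordering in the diagram. */
static void *new_filter(void *arg)
{
    (void)arg;
    atomic_fetch_add(&refcnt, 1);
    pthread_mutex_lock(&rtnl);
    atomic_fetch_sub(&refcnt, 1);
    pthread_mutex_unlock(&rtnl);
    return NULL;
}

/* Task 1: graft path.  Spins under the lock waiting for refcnt to drop
 * to 1; returns the value it keeps observing.  It stays 2, because the
 * only holder of the extra reference is blocked on the mutex we own. */
int graft_observed_refcnt(void)
{
    pthread_t t;
    int seen, i;

    atomic_store(&refcnt, 1);
    pthread_mutex_lock(&rtnl);
    pthread_create(&t, NULL, new_filter, NULL);

    while (atomic_load(&refcnt) != 2)   /* wait for Task 2's reference */
        ;
    for (i = 0; i < 1000000; i++)       /* bounded stand-in for "replay" */
        if (atomic_load(&refcnt) == 1)
            break;
    seen = atomic_load(&refcnt);

    pthread_mutex_unlock(&rtnl);        /* only so the model can exit */
    pthread_join(t, NULL);
    return seen;
}
```

However long the bounded loop runs, refcnt never reaches 1 while the lock is held: no progress is possible in either task.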

>>
>> I tested the new reproducer Pedro posted, on:
>>
>> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>>
>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>>    any issues (in about 30 minutes).
>>
>> 3. All 6 v5 patches, plus this diff:
>>
>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>> index 286b7c58f5b9..988718ba5abe 100644
>> --- a/net/sched/sch_api.c
>> +++ b/net/sched/sch_api.c
>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>>                          */
>> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
>> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
>> +                               rtnl_unlock();
>> +                               rtnl_lock();
>>                                 return -EAGAIN;
>> +                       }
>>                 }
>>
>>                 if (dev->flags & IFF_UP)
>>
>>    Did not trigger any issues (in about 30 minutes) either.
>>
>> What would you suggest?
>
>
> I am more worried it is a whack-a-mole situation. We fixed the first
> reproducer with essentially patches 1-4 but we opened a new one which
> the second reproducer catches. One thing the current reproducer does
> is create a lot of rtnl contention in the beginning by creating all those
> devices and then after it is just creating/deleting qdisc and doing
> update with flower where such contention is reduced. i.e it may just
> take longer for the mole to pop up.
>
> Why don't we push the V1 patch in and then worry about getting clever
> with EAGAIN after? Can you test the V1 version with the repro Pedro
> posted? It shouldn't have these issues. Also it would be interesting to
> see how performance of the parallel updates to flower is affected.

This or at least push first 4 patches of this series. They target other
older commits and fix straightforward issues with the API.
Vlad Buslov May 29, 2023, 12:58 p.m. UTC | #13
On Mon 29 May 2023 at 14:50, Vlad Buslov <vladbu@nvidia.com> wrote:
> On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>>>
>>> Hi Jakub and all,
>>>
>>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
>>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
>>> > > Thanks a lot, I'll get right on it.
>>> >
>>> > Any insights? Is it just a live-lock inherent to the retry scheme
>>> > or we actually forget to release the lock/refcnt?
>>>
>>> I think it's just a thread holding the RTNL mutex for too long (replaying
>>> too many times).  We could replay for arbitrary times in
>>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
>>> requests for the old Qdisc.
>
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.
>

Followup: I considered two approaches for preventing the deadlock:

- Refactor cls_api to always obtain the lock before taking a reference
  to Qdisc. I started implementing PoC moving the rtnl_lock() call in
  tc_new_tfilter() before __tcf_qdisc_find() and decided it is not
  feasible because cls_api will still try to obtain rtnl_lock when
  offloading a filter to a device with non-unlocked driver or after
  releasing the lock when loading a classifier module.

- Account for such cls_api behavior in sch_api by dropping and
  re-taking the lock before replaying. This actually seems to be quite
  straightforward since the 'replay' functionality we are reusing for
  this is designed for similar behavior - it releases rtnl lock before
  loading a sch module, takes the lock again and safely replays the
  function by re-obtaining all the necessary data.

If livelock with concurrent filters insertion is an issue, then it can
be remedied by setting a new Qdisc->flags bit
"DELETED-REJECT-NEW-FILTERS" and checking for it together with
QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
insertion coming after the flag is set to synchronize on rtnl lock.

Thoughts?
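For the second bullet, a userspace model (hedged, as before: a pthread mutex stands in for the RTNL mutex, an atomic int for q->refcnt, and the names are illustrative) shows why dropping and re-taking the lock around the replay breaks the circular wait:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static pthread_mutex_t rtnl = PTHREAD_MUTEX_INITIALIZER;
static atomic_int refcnt;

static void *new_filter(void *arg)
{
    (void)arg;
    atomic_fetch_add(&refcnt, 1);   /* reference taken before the lock */
    pthread_mutex_lock(&rtnl);      /* succeeds once graft drops it */
    atomic_fetch_sub(&refcnt, 1);
    pthread_mutex_unlock(&rtnl);
    return NULL;
}

/* Graft with the proposed behavior: on each failed check, release and
 * re-take the lock before replaying, giving the reference holder a
 * chance to run. */
int graft_with_replay(void)
{
    pthread_t t;

    atomic_store(&refcnt, 1);
    pthread_mutex_lock(&rtnl);
    pthread_create(&t, NULL, new_filter, NULL);
    while (atomic_load(&refcnt) != 2)   /* wait for the extra reference */
        ;
    while (atomic_load(&refcnt) != 1) { /* the replay loop */
        pthread_mutex_unlock(&rtnl);    /* the crucial drop */
        sched_yield();
        pthread_mutex_lock(&rtnl);
    }
    pthread_mutex_unlock(&rtnl);
    pthread_join(t, NULL);
    return atomic_load(&refcnt);
}
```

Unlike the spin-under-lock version, this loop terminates: once the filter task gets the mutex it drops its reference, and a later check sees refcnt == 1.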

>>>
>>> I tested the new reproducer Pedro posted, on:
>>>
>>> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>>>
>>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>>>    any issues (in about 30 minutes).
>>>
>>> 3. All 6 v5 patches, plus this diff:
>>>
>>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>>> index 286b7c58f5b9..988718ba5abe 100644
>>> --- a/net/sched/sch_api.c
>>> +++ b/net/sched/sch_api.c
>>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>>>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>>>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>>>                          */
>>> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
>>> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
>>> +                               rtnl_unlock();
>>> +                               rtnl_lock();
>>>                                 return -EAGAIN;
>>> +                       }
>>>                 }
>>>
>>>                 if (dev->flags & IFF_UP)
>>>
>>>    Did not trigger any issues (in about 30 minutes) either.
>>>
>>> What would you suggest?
>>
>>
>> I am more worried it is a whack-a-mole situation. We fixed the first
>> reproducer with essentially patches 1-4 but we opened a new one which
>> the second reproducer catches. One thing the current reproducer does
>> is create a lot of rtnl contention in the beginning by creating all those
>> devices and then after it is just creating/deleting qdisc and doing
>> update with flower where such contention is reduced. i.e it may just
>> take longer for the mole to pop up.
>>
>> Why don't we push the V1 patch in and then worry about getting clever
>> with EAGAIN after? Can you test the V1 version with the repro Pedro
>> posted? It shouldn't have these issues. Also it would be interesting to
>> see how performance of the parallel updates to flower is affected.
>
> This or at least push first 4 patches of this series. They target other
> older commits and fix straightforward issues with the API.
Jamal Hadi Salim May 29, 2023, 1:55 p.m. UTC | #14
On Mon, May 29, 2023 at 8:06 AM Vlad Buslov <vladbu@nvidia.com> wrote:
>
> On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> >>
> >> Hi Jakub and all,
> >>
> >> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> >> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> >> > > Thanks a lot, I'll get right on it.
> >> >
> >> > Any insights? Is it just a live-lock inherent to the retry scheme
> >> > or we actually forget to release the lock/refcnt?
> >>
> >> I think it's just a thread holding the RTNL mutex for too long (replaying
> >> too many times).  We could replay for arbitrary times in
> >> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
> >> requests for the old Qdisc.
>
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.
>
> >>
> >> I tested the new reproducer Pedro posted, on:
> >>
> >> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
> >>
> >> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
> >>    any issues (in about 30 minutes).
> >>
> >> 3. All 6 v5 patches, plus this diff:
> >>
> >> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> >> index 286b7c58f5b9..988718ba5abe 100644
> >> --- a/net/sched/sch_api.c
> >> +++ b/net/sched/sch_api.c
> >> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> >>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
> >>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> >>                          */
> >> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> >> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> >> +                               rtnl_unlock();
> >> +                               rtnl_lock();
> >>                                 return -EAGAIN;
> >> +                       }
> >>                 }
> >>
> >>                 if (dev->flags & IFF_UP)
> >>
> >>    Did not trigger any issues (in about 30 minutes) either.
> >>
> >> What would you suggest?
> >
> >
> > I am more worried it is a whack-a-mole situation. We fixed the first
> > reproducer with essentially patches 1-4 but we opened a new one which
> > the second reproducer catches. One thing the current reproducer does
> > is create a lot of rtnl contention in the beginning by creating all those
> > devices and then after it is just creating/deleting qdisc and doing
> > update with flower where such contention is reduced. i.e it may just
> > take longer for the mole to pop up.
> >
> > Why don't we push the V1 patch in and then worry about getting clever
> > with EAGAIN after? Can you test the V1 version with the repro Pedro
> > posted? It shouldn't have these issues. Also it would be interesting to
> > see how performance of the parallel updates to flower is affected.
>
> This or at least push first 4 patches of this series. They target other
> older commits and fix straightforward issues with the API.


Yes, lets get patch 1-4 in first ...

cheers,
jamal
Peilin Ye May 29, 2023, 7:14 p.m. UTC | #15
On Mon, May 29, 2023 at 09:55:51AM -0400, Jamal Hadi Salim wrote:
> On Mon, May 29, 2023 at 8:06 AM Vlad Buslov <vladbu@nvidia.com> wrote:
> > On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > Why don't we push the V1 patch in and then worry about getting clever
> > > with EAGAIN after? Can you test the V1 version with the repro Pedro
> > > posted? It shouldn't have these issues. Also it would be interesting to
> > > see how performance of the parallel updates to flower is affected.
> >
> > This or at least push first 4 patches of this series. They target other
> > older commits and fix straightforward issues with the API.
> 
> Yes, lets get patch 1-4 in first ...

Sure, I'll send 1-4 as v6 first.

Thanks,
Peilin Ye
Jakub Kicinski May 30, 2023, 1:03 a.m. UTC | #16
On Mon, 29 May 2023 15:58:50 +0300 Vlad Buslov wrote:
> If livelock with concurrent filters insertion is an issue, then it can
> be remedied by setting a new Qdisc->flags bit
> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> insertion coming after the flag is set to synchronize on rtnl lock.

Sounds very nice, yes.
Peilin Ye May 30, 2023, 9:11 a.m. UTC | #17
On Mon, May 29, 2023 at 02:50:26PM +0300, Vlad Buslov wrote:
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.

Thanks a lot for pointing this out!  The reproducers add flower filters to
ingress Qdiscs so I didn't think of rtnl_lock()'ed filter requests...

On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
> - Account for such cls_api behavior in sch_api by dropping and
>   re-taking the lock before replaying. This actually seems to be quite
>   straightforward since the 'replay' functionality we are reusing for
>   this is designed for similar behavior - it releases rtnl lock before
>   loading a sch module, takes the lock again and safely replays the
>   function by re-obtaining all the necessary data.

Yes, I've tested this using that reproducer Pedro posted.

On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
> If livelock with concurrent filters insertion is an issue, then it can
> be remedied by setting a new Qdisc->flags bit
> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> insertion coming after the flag is set to synchronize on rtnl lock.

Thanks for the suggestion!  I'll try this approach.

Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
later than Qdisc is flagged as being-deleted) sync on RTNL lock without
(before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
even longer?).

Thanks,
Peilin Ye
Vlad Buslov May 30, 2023, 12:18 p.m. UTC | #18
On Tue 30 May 2023 at 02:11, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Mon, May 29, 2023 at 02:50:26PM +0300, Vlad Buslov wrote:
>> After looking very carefully at the code I think I know what the issue
>> might be:
>>
>>    Task 1 graft Qdisc   Task 2 new filter
>>            +                    +
>>            |                    |
>>            v                    v
>>         rtnl_lock()       take  q->refcnt
>>            +                    +
>>            |                    |
>>            v                    v
>> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>>
>> This will cause a real deadlock with the proposed patch. I'll try to
>> come up with a better approach. Sorry for not seeing it earlier.
>
> Thanks a lot for pointing this out!  The reproducers add flower filters to
> ingress Qdiscs so I didn't think of rtnl_lock()'ed filter requests...
>
> On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
>> - Account for such cls_api behavior in sch_api by dropping and
>>   re-taking the lock before replaying. This actually seems to be quite
>>   straightforward since the 'replay' functionality we are reusing for
>>   this is designed for similar behavior - it releases rtnl lock before
>>   loading a sch module, takes the lock again and safely replays the
>>   function by re-obtaining all the necessary data.
>
> Yes, I've tested this using that reproducer Pedro posted.
>
> On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
>> If livelock with concurrent filters insertion is an issue, then it can
>> be remedied by setting a new Qdisc->flags bit
>> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> insertion coming after the flag is set to synchronize on rtnl lock.
>
> Thanks for the suggestion!  I'll try this approach.
>
> Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> even longer?).

Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
already returns -EINVAL when q->refcnt is zero, so maybe returning
-EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flag is
set is also fine? Would be much easier to implement as opposed to moving
rtnl_lock there.
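The shape of that early reject, as a hedged sketch (the struct and function here are illustrative stand-ins, not the real struct Qdisc or __tcf_qdisc_find(); the TCQ_F_DESTROYING bit matches the flag the diff further down this thread introduces):

```c
#include <errno.h>

#define TCQ_F_DESTROYING 0x400  /* "deleted, reject new filter requests" */

struct qdisc_stub {
    unsigned int flags;
    unsigned int refcnt;
};

/* A dying Qdisc gets the same -EINVAL a zero-refcnt Qdisc already gets
 * today, so late filter requests fail fast instead of taking another
 * reference and forcing the grafting task to replay yet again. */
int filter_find_check(const struct qdisc_stub *q)
{
    if (q->refcnt == 0)
        return -EINVAL;
    if (q->flags & TCQ_F_DESTROYING)
        return -EINVAL;
    return 0;
}
```

This keeps the synchronization burden on the (rare) delete path rather than moving rtnl_lock around in the filter path.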
Peilin Ye May 31, 2023, 12:29 a.m. UTC | #19
On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
> >> If livelock with concurrent filters insertion is an issue, then it can
> >> be remedied by setting a new Qdisc->flags bit
> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> >> insertion coming after the flag is set to synchronize on rtnl lock.
> >
> > Thanks for the suggestion!  I'll try this approach.
> >
> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> > even longer?).
> 
> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
> already returns -EINVAL when q->refcnt is zero, so maybe returning
> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flag is
> set is also fine? Would be much easier to implement as opposed to moving
> rtnl_lock there.

Ah, I see, sure.

Thanks,
Peilin Ye
Peilin Ye June 1, 2023, 3:57 a.m. UTC | #20
Hi Vlad and all,

On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
> >> If livelock with concurrent filters insertion is an issue, then it can
> >> be remedied by setting a new Qdisc->flags bit
> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> >> insertion coming after the flag is set to synchronize on rtnl lock.
> >
> > Thanks for the suggestion!  I'll try this approach.
> >
> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> > even longer?).
> 
> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
> already returns -EINVAL when q->refcnt is zero, so maybe returning
> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flag is
> set is also fine? Would be much easier to implement as opposed to moving
> rtnl_lock there.

I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
"request A") while it has a lot of ongoing filter requests, and here's the
result:

                        #1         #2         #3         #4
  ----------------------------------------------------------
   a. refcnt            89         93        230        571
   b. replayed     167,568    196,450    336,291    878,027
   c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
           user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
            sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s

   a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
   b. is the number of times A has been replayed;
   c. is the time(1) output for A.

a. and b. are collected from printk() output.  This is better than before,
but A could still be replayed hundreds of thousands of times and hang
for a few seconds.

Is this okay?  If not, is it possible (or should we) to make A really
_wait_ on Qdisc refcnt, instead of "busy-replaying"?

Thanks,
Peilin Ye

[1] Diff against v5 patch 6 (printk() calls not included):

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 3e9cc43cbc90..de7b0538b309 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -94,6 +94,7 @@ struct Qdisc {
 #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
 #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
 #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
+#define TCQ_F_DESTROYING       0x400 /* destroying, reject filter requests */
        u32                     limit;
        const struct Qdisc_ops  *ops;
        struct qdisc_size_table __rcu *stab;
@@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
        return !READ_ONCE(qdisc->q.qlen);
 }

+static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
+{
+       return qdisc->flags & TCQ_F_DESTROYING;
+}
+
 /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with
  * the qdisc root lock acquired.
  */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 2621550bfddc..3e7f6f286ac0 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
                *parent = (*q)->handle;
        } else {
                *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
-               if (!*q) {
+               if (!*q || qdisc_is_destroying(*q)) {
                        NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
                        err = -EINVAL;
                        goto errout_rcu;
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 286b7c58f5b9..d6e47546c7fe 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
                                return -ENOENT;
                        }

-                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
-                        * RTNL-unlocked filter request(s).  This is the counterpart of that
-                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
+                        * accepting any more by marking it as "being destroyed", then tell the
+                        * caller to replay by returning -EAGAIN.
                         */
-                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+                       q = dev_queue->qdisc_sleeping;
+                       if (!qdisc_refcount_dec_if_one(q)) {
+                               q->flags |= TCQ_F_DESTROYING;
+                               rtnl_unlock();
+                               schedule();
+                               rtnl_lock();
                                return -EAGAIN;
+                       }
                }

                if (dev->flags & IFF_UP)
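For illustration, the -EAGAIN contract the diff above relies on can be sketched in plain userspace C. All names below are hypothetical stand-ins, not the kernel API: the grafting task keeps replaying while concurrent filter requests hold references, and the "destroying" flag stops new requests from taking references so the refcount can eventually drop to one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel objects involved. */
struct fake_qdisc {
	int refcnt;          /* models qdisc_refcount_*            */
	bool destroying;     /* models the proposed TCQ_F_DESTROYING */
};

/* Models qdisc_refcount_dec_if_one(): succeeds only when we hold
 * the last reference, i.e. no filter request is in flight. */
static bool dec_if_one(struct fake_qdisc *q)
{
	if (q->refcnt != 1)
		return false;
	q->refcnt = 0;
	return true;
}

/* Models the __tcf_qdisc_find() check in the diff: new filter
 * requests are rejected once the qdisc is flagged as being
 * destroyed, so they never take a reference. */
static bool filter_request_take_ref(struct fake_qdisc *q)
{
	if (q->destroying)
		return false;
	q->refcnt++;
	return true;
}

/* Models the qdisc_graft() replay loop: flag the qdisc, then keep
 * replaying (returning -EAGAIN in the kernel) while in-flight
 * requests drain; here each iteration retires one request. */
static int graft_with_replay(struct fake_qdisc *q)
{
	int replays = 0;

	while (!dec_if_one(q)) {
		q->destroying = true;   /* reject any new requests  */
		q->refcnt--;            /* one in-flight op finishes */
		replays++;
	}
	return replays;
}
```

The point of the sketch is that once the flag is set, the refcount can only go down, so the replay loop terminates in at most refcnt - 1 iterations.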
Vlad Buslov June 1, 2023, 6:20 a.m. UTC | #21
On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> Hi Vlad and all,
>
> On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
>> >> If livelock with concurrent filters insertion is an issue, then it can
>> >> be remedied by setting a new Qdisc->flags bit
>> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> >> insertion coming after the flag is set to synchronize on rtnl lock.
>> >
>> > Thanks for the suggestion!  I'll try this approach.
>> >
>> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>> > even longer?).
>> 
>> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
>> set is also fine? Would be much easier to implement as opposed to moving
>> rtnl_lock there.
>
> I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
> 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
> "request A") while it has a lot of ongoing filter requests, and here's the
> result:
>
>                         #1         #2         #3         #4
>   ----------------------------------------------------------
>    a. refcnt            89         93        230        571
>    b. replayed     167,568    196,450    336,291    878,027
>    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
>
>    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>    b. is the number of times A has been replayed;
>    c. is the time(1) output for A.
>
> a. and b. are collected from printk() output.  This is better than before,
> but A could still be replayed for hundreds of thousands of times and hang
> for a few seconds.

I don't get where the few seconds of waiting time come from. I'm probably
missing something obvious here, but the waiting time should be the
maximum filter op latency of a new/get/del filter request that is already
in-flight (i.e. has already passed the qdisc_is_destroying() check), and it
should take several orders of magnitude less time.

>
> Is this okay?  If not, is it possible (or should we) to make A really
> _wait_ on Qdisc refcnt, instead of "busy-replaying"?
>
> Thanks,
> Peilin Ye
>
> [1] Diff against v5 patch 6 (printk() calls not included):
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3e9cc43cbc90..de7b0538b309 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -94,6 +94,7 @@ struct Qdisc {
>  #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>  #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>  #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +#define TCQ_F_DESTROYING       0x400 /* destroying, reject filter requests */
>         u32                     limit;
>         const struct Qdisc_ops  *ops;
>         struct qdisc_size_table __rcu *stab;
> @@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>         return !READ_ONCE(qdisc->q.qlen);
>  }
>
> +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
> +{
> +       return qdisc->flags & TCQ_F_DESTROYING;

Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing
flags since they are now used in unlocked filter code path?

> +}
> +
>  /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with
>   * the qdisc root lock acquired.
>   */
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2621550bfddc..3e7f6f286ac0 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
>                 *parent = (*q)->handle;
>         } else {
>                 *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
> -               if (!*q) {
> +               if (!*q || qdisc_is_destroying(*q)) {
>                         NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
>                         err = -EINVAL;
>                         goto errout_rcu;
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..d6e47546c7fe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                                 return -ENOENT;
>                         }
>
> -                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
> -                        * RTNL-unlocked filter request(s).  This is the counterpart of that
> -                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
> +                        * accepting any more by marking it as "being destroyed", then tell the
> +                        * caller to replay by returning -EAGAIN.
>                          */
> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +                       q = dev_queue->qdisc_sleeping;
> +                       if (!qdisc_refcount_dec_if_one(q)) {
> +                               q->flags |= TCQ_F_DESTROYING;
> +                               rtnl_unlock();
> +                               schedule();
> +                               rtnl_lock();
>                                 return -EAGAIN;
> +                       }
>                 }
>
>                 if (dev->flags & IFF_UP)
Pedro Tammela June 1, 2023, 1:03 p.m. UTC | #22
On 01/06/2023 00:57, Peilin Ye wrote:
> Hi Vlad and all,
> 
> On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
>>>> If livelock with concurrent filters insertion is an issue, then it can
>>>> be remedied by setting a new Qdisc->flags bit
>>>> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>>>> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>>>> insertion coming after the flag is set to synchronize on rtnl lock.
>>>
>>> Thanks for the suggestion!  I'll try this approach.
>>>
>>> Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>>> the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>>> later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>>> (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>>> even longer?).
>>
>> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
>> set is also fine? Would be much easier to implement as opposed to moving
>> rtnl_lock there.
> 
> I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
> 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
> "request A") while it has a lot of ongoing filter requests, and here's the
> result:
> 
>                          #1         #2         #3         #4
>    ----------------------------------------------------------
>     a. refcnt            89         93        230        571
>     b. replayed     167,568    196,450    336,291    878,027
>     c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>             user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>              sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
> 
>     a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>     b. is the number of times A has been replayed;
>     c. is the time(1) output for A.
> 
> a. and b. are collected from printk() output.  This is better than before,
> but A could still be replayed for hundreds of thousands of times and hang
> for a few seconds.
> 
> Is this okay?  If not, is it possible (or should we) to make A really
> _wait_ on Qdisc refcnt, instead of "busy-replaying"?
> 
> Thanks,
> Peilin Ye
> 
> [1] Diff against v5 patch 6 (printk() calls not included):
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3e9cc43cbc90..de7b0538b309 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -94,6 +94,7 @@ struct Qdisc {
>   #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>   #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>   #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +#define TCQ_F_DESTROYING       0x400 /* destroying, reject filter requests */
>          u32                     limit;
>          const struct Qdisc_ops  *ops;
>          struct qdisc_size_table __rcu *stab;
> @@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>          return !READ_ONCE(qdisc->q.qlen);
>   }
> 
> +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
> +{
> +       return qdisc->flags & TCQ_F_DESTROYING;
> +}
> +
>   /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with
>    * the qdisc root lock acquired.
>    */
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2621550bfddc..3e7f6f286ac0 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
>                  *parent = (*q)->handle;
>          } else {
>                  *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
> -               if (!*q) {
> +               if (!*q || qdisc_is_destroying(*q)) {
>                          NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
>                          err = -EINVAL;
>                          goto errout_rcu;
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..d6e47546c7fe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                                  return -ENOENT;
>                          }
> 
> -                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
> -                        * RTNL-unlocked filter request(s).  This is the counterpart of that
> -                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
> +                        * accepting any more by marking it as "being destroyed", then tell the
> +                        * caller to replay by returning -EAGAIN.
>                           */
> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +                       q = dev_queue->qdisc_sleeping;
> +                       if (!qdisc_refcount_dec_if_one(q)) {
> +                               q->flags |= TCQ_F_DESTROYING;
> +                               rtnl_unlock();
> +                               schedule();
Was this intended, or is it just a leftover?
rtnl_lock() would reschedule if needed, as it's a mutex_lock().
> +                               rtnl_lock();
>                                  return -EAGAIN;
> +                       }
>                  }
> 
>                  if (dev->flags & IFF_UP)
>
Peilin Ye June 7, 2023, 12:57 a.m. UTC | #23
On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
> On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
> > +{
> > +       return qdisc->flags & TCQ_F_DESTROYING;
> 
> Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing
> flags since they are now used in unlocked filter code path?

Thanks, after taking another look at cls_api.c, I noticed this code in
tc_new_tfilter():

	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
			      flags, extack);
	if (err == 0) {
		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
			       RTM_NEWTFILTER, false, rtnl_held, extack);
		tfilter_put(tp, fh);
		/* q pointer is NULL for shared blocks */
		if (q)
			q->flags &= ~TCQ_F_CAN_BYPASS;
	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
isn't atomic [1].

We also have this:

  ->dequeue()
    htb_dequeue()
      htb_dequeue_tree()
        qdisc_warn_nonwc():

  void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
  {
          if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
                  pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
                          txt, qdisc->ops->id, qdisc->handle >> 16);
                  qdisc->flags |= TCQ_F_WARN_NONWC;
          }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  }
  EXPORT_SYMBOL(qdisc_warn_nonwc);

Also non-atomic; isn't it possible for the above 2 underlined statements to
race with each other?  If true, I think we need to change Qdisc::flags to
use atomic bitops, just like what we're doing for Qdisc::state and
::state2.  It feels like a separate TODO, however.
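The race described above is a classic lost update on a plain read-modify-write. A minimal userspace C11 sketch of the fix being proposed (not kernel code; the kernel would use set_bit()/clear_bit() on an unsigned long, and the flag names here are just illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits in the style of TCQ_F_*; the actual
 * values don't matter for the sketch. */
#define F_CAN_BYPASS  0x4
#define F_WARN_NONWC  0x8

/* With a plain "unsigned int flags", "flags |= bit" and
 * "flags &= ~bit" are separate load/modify/store sequences, so two
 * concurrent updates can lose one of the bits.  An atomic
 * fetch_or/fetch_and (like the kernel's set_bit()/clear_bit())
 * makes each update a single atomic RMW, so no update is lost. */
static _Atomic unsigned int atomic_flags;

static void set_warn_nonwc(void)
{
	atomic_fetch_or(&atomic_flags, F_WARN_NONWC);
}

static void clear_can_bypass(void)
{
	atomic_fetch_and(&atomic_flags, ~F_CAN_BYPASS);
}
```

With atomic RMWs, the two underlined statements above can interleave in either order and both bit changes survive.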

I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
::state2, but not sure if it's okay to extend it for our purpose.

Thanks,
Peilin Ye

[1] Compiled to this on my Intel 64:

   0x0000000000017788 <+6472>:	83 62 10 fb        	andl   $0xfffffffb,0x10(%rdx)
Peilin Ye June 7, 2023, 4:25 a.m. UTC | #24
Hi Pedro,

On Thu, Jun 01, 2023 at 10:03:22AM -0300, Pedro Tammela wrote:
> On 01/06/2023 00:57, Peilin Ye wrote:
> > -                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
> > -                        * RTNL-unlocked filter request(s).  This is the counterpart of that
> > -                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> > +                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
> > +                        * accepting any more by marking it as "being destroyed", then tell the
> > +                        * caller to replay by returning -EAGAIN.
> >                           */
> > -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> > +                       q = dev_queue->qdisc_sleeping;
> > +                       if (!qdisc_refcount_dec_if_one(q)) {
> > +                               q->flags |= TCQ_F_DESTROYING;
> > +                               rtnl_unlock();
> > +                               schedule();
>
> Was this intended or just a leftover?
> rtnl_lock() would reschedule if needed as it's a mutex_lock

In qdisc_create():

			rtnl_unlock();
			request_module("sch_%s", name);
			rtnl_lock();
			ops = qdisc_lookup_ops(kind);
			if (ops != NULL) {
				/* We will try again qdisc_lookup_ops,
				 * so don't keep a reference.
				 */
				module_put(ops->owner);
				err = -EAGAIN;
				goto err_out;

If qdisc_lookup_ops() returns !NULL, it means we've successfully loaded the
module, so we know that the next replay should (normally) succeed.

However, in the diff I posted, if qdisc_refcount_dec_if_one() returned
false, we know that we should step back and wait a bit, especially when
there are ongoing RTNL-locked filter requests: we want them to grab the RTNL
mutex before us.  That was my intention here: unconditionally give up
the CPU.  I haven't tested whether it really makes a difference, though.
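The intended unlock/yield/relock pattern can be approximated in userspace like this (a sketch only; fake_rtnl stands in for the RTNL mutex, sched_yield() for schedule(), and -11 for -EAGAIN):

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-in for the RTNL mutex. */
static pthread_mutex_t fake_rtnl = PTHREAD_MUTEX_INITIALIZER;

/* Models the back-off in the diff: drop the lock, explicitly yield
 * the CPU so lock waiters (e.g. RTNL-locked filter requests) get a
 * chance to run, then re-take the lock before returning -EAGAIN so
 * the caller replays the whole request. */
static int graft_backoff_and_replay(void)
{
	pthread_mutex_unlock(&fake_rtnl);
	sched_yield();                  /* models schedule() */
	pthread_mutex_lock(&fake_rtnl);
	return -11;                     /* models -EAGAIN */
}
```

Whether the explicit yield helps depends on the scheduler; as noted above, mutex_lock() alone may already be enough to let waiters in.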

Thanks,
Peilin Ye
Vlad Buslov June 7, 2023, 8:18 a.m. UTC | #25
On Tue 06 Jun 2023 at 17:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
>> On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
>> > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
>> > +{
>> > +       return qdisc->flags & TCQ_F_DESTROYING;
>> 
>> Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing
>> flags since they are now used in unlocked filter code path?
>
> Thanks, after taking another look at cls_api.c, I noticed this code in
> tc_new_tfilter():
>
> 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
> 			      flags, extack);
> 	if (err == 0) {
> 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> 			       RTM_NEWTFILTER, false, rtnl_held, extack);
> 		tfilter_put(tp, fh);
> 		/* q pointer is NULL for shared blocks */
> 		if (q)
> 			q->flags &= ~TCQ_F_CAN_BYPASS;
> 	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
> isn't atomic [1].

Yeah, I see we have already got such behavior in 3f05e6886a59
("net_sched: unset TCQ_F_CAN_BYPASS when adding filters").

>
> We also have this:
>
>   ->dequeue()
>     htb_dequeue()
>       htb_dequeue_tree()
>         qdisc_warn_nonwc():
>
>   void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
>   {
>           if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
>                   pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
>                           txt, qdisc->ops->id, qdisc->handle >> 16);
>                   qdisc->flags |= TCQ_F_WARN_NONWC;
>           }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   }
>   EXPORT_SYMBOL(qdisc_warn_nonwc);
>
> Also non-atomic; isn't it possible for the above 2 underlined statements to
> race with each other?  If true, I think we need to change Qdisc::flags to
> use atomic bitops, just like what we're doing for Qdisc::state and
> ::state2.  It feels like a separate TODO, however.

It looks like even though 3f05e6886a59 ("net_sched: unset
TCQ_F_CAN_BYPASS when adding filters") was introduced after the cls api
unlock, by now both are in exactly the same list of supported
kernels (5.4 LTS and newer). Considering this, the conversion to
atomic bitops can be done as a standalone fix for the cited commit, and after
it has been accepted and backported, the qdisc fix can just assume
that qdisc->flags is an atomic bitops field in all target kernels and
use it as-is. WDYT?

>
> I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
> ::state2, but not sure if it's okay to extend it for our purpose.

As you described above qdisc->flags is already used to interact with cls
api (including changing it dynamically), so I don't see why not.
Peilin Ye June 8, 2023, 12:39 a.m. UTC | #26
On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
> >> >> If livelock with concurrent filters insertion is an issue, then it can
> >> >> be remedied by setting a new Qdisc->flags bit
> >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> >> >> insertion coming after the flag is set to synchronize on rtnl lock.
> >> >
> >> > Thanks for the suggestion!  I'll try this approach.
> >> >
> >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> >> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> >> > even longer?).
> >> 
> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
> >> already returns -EINVAL when q->refcnt is zero, so maybe returning
> >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
> >> set is also fine? Would be much easier to implement as opposed to moving
> >> rtnl_lock there.
> >
> > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
> > 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
> > "request A") while it has a lot of ongoing filter requests, and here's the
> > result:
> >
> >                         #1         #2         #3         #4
> >   ----------------------------------------------------------
> >    a. refcnt            89         93        230        571
> >    b. replayed     167,568    196,450    336,291    878,027
> >    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
> >            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
> >             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
> >
> >    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
> >    b. is the number of times A has been replayed;
> >    c. is the time(1) output for A.
> >
> > a. and b. are collected from printk() output.  This is better than before,
> > but A could still be replayed for hundreds of thousands of times and hang
> > for a few seconds.
> 
> I don't get where does few seconds waiting time come from. I'm probably
> missing something obvious here, but the waiting time should be the
> maximum filter op latency of new/get/del filter request that is already
> in-flight (i.e. already passed qdisc_is_destroying() check) and it
> should take several orders of magnitude less time.

Yeah I agree, here's what I did:

In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic
loop:

  $ echo "1 1 32" > /sys/bus/netdevsim/new_device
  $ tc qdisc add dev eth1 ingress
  $ for (( i=1; i<=3000; i++ ))
  > do
  > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
  > done

When the loop is running, I delete the Qdisc in Terminal 2:

  $ time tc qdisc delete dev eth1 ingress

Which took seconds on average.  However, if I specify a unique "prio" when
adding filters in that loop, e.g.:

  $ for (( i=1; i<=3000; i++ ))
  > do
  > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
  > done				     ^^^^^^^

Then deleting the Qdisc in Terminal 2 becomes a lot faster:

  real  0m0.712s
  user  0m0.000s
  sys   0m0.152s 

In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did
yet another test [1], which looks a lot better.

When I didn't specify "prio", sometimes the
rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returned
-EEXIST.  Is it because concurrent add-filter requests auto-allocated
the same "prio" number and therefore collided with each other?  Do you think
this is related to why it's slow?

Thanks,
Peilin Ye

[1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances
in -batch mode that keeps adding a unique filter (with "prio" and "handle"
specified) then deletes it.  Again, when they are running I delete the
ingress Qdisc, and here's the result:

                         #1         #2         #3         #4
   ----------------------------------------------------------
    a. refcnt            64         63         64         64
    b. replayed         169      5,630        887      3,442
    c. time real   0m0.171s   0m0.147s   0m0.186s   0m0.111s
            user   0m0.000s   0m0.009s   0m0.001s   0m0.000s
             sys   0m0.112s   0m0.108s   0m0.115s   0m0.104s
Peilin Ye June 8, 2023, 1:08 a.m. UTC | #27
On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
> > I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
> > ::state2, but not sure if it's okay to extend it for our purpose.
>
> As you described above qdisc->flags is already used to interact with cls
> api (including changing it dynamically), so I don't see why not.

Sorry, I don't follow, I meant qdisc->state2:

  enum qdisc_state2_t {
          /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
           * Use qdisc_run_begin/end() or qdisc_is_running() instead.
           */
          __QDISC_STATE2_RUNNING,
  };

NVM, I think using qdisc->flags after making it atomic sounds better.

On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
> > 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
> > 			      flags, extack);
> > 	if (err == 0) {
> > 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> > 			       RTM_NEWTFILTER, false, rtnl_held, extack);
> > 		tfilter_put(tp, fh);
> > 		/* q pointer is NULL for shared blocks */
> > 		if (q)
> > 			q->flags &= ~TCQ_F_CAN_BYPASS;
> > 	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
> > isn't atomic [1].
> 
> Yeah, I see we have already got such behavior in 3f05e6886a59
> ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters").
> 
> > We also have this:
> >
> >   ->dequeue()
> >     htb_dequeue()
> >       htb_dequeue_tree()
> >         qdisc_warn_nonwc():
> >
> >   void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
> >   {
> >           if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
> >                   pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
> >                           txt, qdisc->ops->id, qdisc->handle >> 16);
> >                   qdisc->flags |= TCQ_F_WARN_NONWC;
> >           }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >   }
> >   EXPORT_SYMBOL(qdisc_warn_nonwc);
> >
> > Also non-atomic; isn't it possible for the above 2 underlined statements to
> > race with each other?  If true, I think we need to change Qdisc::flags to
> > use atomic bitops, just like what we're doing for Qdisc::state and
> > ::state2.  It feels like a separate TODO, however.
> 
> It looks like even though 3f05e6886a59 ("net_sched: unset
> TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api
> unlock by now we have these in exactly the same list of supported
> kernels (5.4 LTS and newer). Considering this, the conversion to the
> atomic bitops can be done as a standalone fix for cited commit and after
> it will have been accepted and backported the qdisc fix can just assume
> that qdisc->flags is an atomic bitops field in all target kernels and
> use it as-is. WDYT?

Sounds great, how about:

  1. I'll post the non-replay version of the fix (after updating the commit
     message), and we apply that first, as suggested by Jamal

  2. Make qdisc->flags atomic

  3. Make the fix better by replaying and using the (now atomic)
     IS-DESTROYING flag with test_and_set_bit() and friends

?
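Step 3 above could use a test-and-set on the (by then atomic) flag word, so only the first deleter does the flagging work. A userspace C11 approximation of the kernel's test_and_set_bit() semantics, with hypothetical names (the 0x400 value just mirrors the TCQ_F_DESTROYING bit proposed in the earlier diff):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define F_DESTROYING 0x400  /* stand-in for the proposed flag bit */

/* Approximates test_and_set_bit(): atomically set the bit and
 * report whether it was already set.  The first caller gets
 * "false" (bit was clear) and performs the one-time work; any
 * concurrent or later caller gets "true" and can skip it. */
static bool test_and_set_destroying(_Atomic unsigned int *flags)
{
	unsigned int old = atomic_fetch_or(flags, F_DESTROYING);

	return (old & F_DESTROYING) != 0;
}
```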

Thanks,
Peilin Ye
Vlad Buslov June 8, 2023, 7:48 a.m. UTC | #28
On Wed 07 Jun 2023 at 18:08, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
>> > I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
>> > ::state2, but not sure if it's okay to extend it for our purpose.
>>
>> As you described above qdisc->flags is already used to interact with cls
>> api (including changing it dynamically), so I don't see why not.
>
> Sorry, I don't follow, I meant qdisc->state2:
>
>   enum qdisc_state2_t {
>           /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
>            * Use qdisc_run_begin/end() or qdisc_is_running() instead.
>            */
>           __QDISC_STATE2_RUNNING,
>   };

Sorry, I misunderstood what you were suggesting. Got it now.

>
> NVM, I think using qdisc->flags after making it atomic sounds better.

Agree.

>
> On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
>> > 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
>> > 			      flags, extack);
>> > 	if (err == 0) {
>> > 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
>> > 			       RTM_NEWTFILTER, false, rtnl_held, extack);
>> > 		tfilter_put(tp, fh);
>> > 		/* q pointer is NULL for shared blocks */
>> > 		if (q)
>> > 			q->flags &= ~TCQ_F_CAN_BYPASS;
>> > 	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >
>> > TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
>> > isn't atomic [1].
>> 
>> Yeah, I see we have already got such behavior in 3f05e6886a59
>> ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters").
>> 
>> > We also have this:
>> >
>> >   ->dequeue()
>> >     htb_dequeue()
>> >       htb_dequeue_tree()
>> >         qdisc_warn_nonwc():
>> >
>> >   void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
>> >   {
>> >           if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
>> >                   pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
>> >                           txt, qdisc->ops->id, qdisc->handle >> 16);
>> >                   qdisc->flags |= TCQ_F_WARN_NONWC;
>> >           }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >   }
>> >   EXPORT_SYMBOL(qdisc_warn_nonwc);
>> >
>> > Also non-atomic; isn't it possible for the above 2 underlined statements to
>> > race with each other?  If true, I think we need to change Qdisc::flags to
>> > use atomic bitops, just like what we're doing for Qdisc::state and
>> > ::state2.  It feels like a separate TODO, however.
>> 
>> It looks like even though 3f05e6886a59 ("net_sched: unset
>> TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api
>> unlock by now we have these in exactly the same list of supported
>> kernels (5.4 LTS and newer). Considering this, the conversion to the
>> atomic bitops can be done as a standalone fix for cited commit and after
>> it will have been accepted and backported the qdisc fix can just assume
>> that qdisc->flags is an atomic bitops field in all target kernels and
>> use it as-is. WDYT?
>
> Sounds great, how about:
>
>   1. I'll post the non-replay version of the fix (after updating the commit
>      message), and we apply that first, as suggested by Jamal

From my side there are no objections to any of the proposed approaches,
since we have never had any users with a legitimate use-case where they
need to replace/delete a qdisc concurrently with a filter update, so
returning -EBUSY (or -EAGAIN) to the user in such a case would work as
either a temporary or the final fix. However, Jakub had reservations
about such an approach, so I don't know where we stand now on this.

>
>   2. Make qdisc->flags atomic
>
>   3. Make the fix better by replaying and using the (now atomic)
>      IS-DESTROYING flag with test_and_set_bit() and friends
>
> ?

Again, no objections from my side. Ping me if you need help with any of
these.
Vlad Buslov June 8, 2023, 9:17 a.m. UTC | #29
On Wed 07 Jun 2023 at 17:39, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
>> >> >> If livelock with concurrent filters insertion is an issue, then it can
>> >> >> be remedied by setting a new Qdisc->flags bit
>> >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> >> >> insertion coming after the flag is set to synchronize on rtnl lock.
>> >> >
>> >> > Thanks for the suggestion!  I'll try this approach.
>> >> >
>> >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>> >> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>> >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>> >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>> >> > even longer?).
>> >> 
>> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> >> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
>> >> set is also fine? Would be much easier to implement as opposed to moving
>> >> rtnl_lock there.
>> >
>> > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
>> > 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
>> > "request A") while it has a lot of ongoing filter requests, and here's the
>> > result:
>> >
>> >                         #1         #2         #3         #4
>> >   ----------------------------------------------------------
>> >    a. refcnt            89         93        230        571
>> >    b. replayed     167,568    196,450    336,291    878,027
>> >    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>> >            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>> >             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
>> >
>> >    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>> >    b. is the number of times A has been replayed;
>> >    c. is the time(1) output for A.
>> >
>> > a. and b. are collected from printk() output.  This is better than before,
>> > but A could still be replayed for hundreds of thousands of times and hang
>> > for a few seconds.
>> 
>> I don't get where the few seconds of waiting time come from. I'm probably
>> missing something obvious here, but the waiting time should be the
>> maximum filter op latency of a new/get/del filter request that is already
>> in-flight (i.e. has already passed the qdisc_is_destroying() check), and it
>> should take several orders of magnitude less time.
>
> Yeah I agree, here's what I did:
>
> In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic
> loop:
>
>   $ echo "1 1 32" > /sys/bus/netdevsim/new_device
>   $ tc qdisc add dev eth1 ingress
>   $ for (( i=1; i<=3000; i++ ))
>   > do
>   > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
>   > done
>
> When the loop is running, I delete the Qdisc in Terminal 2:
>
>   $ time tc qdisc delete dev eth1 ingress
>
> Which took seconds on average.  However, if I specify a unique "prio" when
> adding filters in that loop, e.g.:
>
>   $ for (( i=1; i<=3000; i++ ))
>   > do
>   > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
>   > done				     ^^^^^^^
>
> Then deleting the Qdisc in Terminal 2 becomes a lot faster:
>
>   real  0m0.712s
>   user  0m0.000s
>   sys   0m0.152s 
>
> In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did
> yet another test [1], which looks a lot better.

That makes sense, thanks for explaining.

>
> When I didn't specify "prio", sometimes that
> rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns
> -EEXIST.  Is it because concurrent add-filter requests auto-allocated
> the same "prio" number, so they collided with each other?  Do you think
> this is related to why it's slow?

It is slow because when creating a filter without providing priority you
are basically measuring the latency of creating a whole flower
classifier instance (multiple memory allocation, initialization of all
kinds of idrs, hash tables and locks, updating tp list in chain, etc.),
not just a single filter, so significantly higher latency is expected.

But my point still stands: with the latest version of your fix, the
maximum 'spinning' time in sch_api is bounded by the latency of a
concurrent tcf_{new|del|get}_tfilter op that has already obtained the
qdisc, and any concurrent filter API message arriving after the
qdisc->flags "DELETED-REJECT-NEW-FILTERS" bit has been set will fail and
cannot livelock the concurrent qdisc delete/replace.

>
> Thanks,
> Peilin Ye
>
> [1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances
> in -batch mode that keeps adding a unique filter (with "prio" and "handle"
> specified) then deletes it.  Again, when they are running I delete the
> ingress Qdisc, and here's the result:
>
>                          #1         #2         #3         #4
>    ----------------------------------------------------------
>     a. refcnt            64         63         64         64
>     b. replayed         169      5,630        887      3,442
>     c. time real   0m0.171s   0m0.147s   0m0.186s   0m0.111s
>             user   0m0.000s   0m0.009s   0m0.001s   0m0.000s
>              sys   0m0.112s   0m0.108s   0m0.115s   0m0.104s
Peilin Ye June 10, 2023, 12:20 a.m. UTC | #30
On Thu, Jun 08, 2023 at 12:17:27PM +0300, Vlad Buslov wrote:
> > When I didn't specify "prio", sometimes that
> > rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns
> > -EEXIST.  Is it because concurrent add-filter requests auto-allocated
> > the same "prio" number, so they collided with each other?  Do you think
> > this is related to why it's slow?
> 
> It is slow because when creating a filter without providing priority you
> are basically measuring the latency of creating a whole flower
> classifier instance (multiple memory allocation, initialization of all
> kinds of idrs, hash tables and locks, updating tp list in chain, etc.),
> not just a single filter, so significantly higher latency is expected.

Ah, I see.  Thanks for the explanation.

Thanks,
Peilin Ye
Peilin Ye June 11, 2023, 3:25 a.m. UTC | #31
On Thu, Jun 08, 2023 at 10:48:12AM +0300, Vlad Buslov wrote:
> >> It looks like even though 3f05e6886a59 ("net_sched: unset
> >> TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api
> >> unlock by now we have these in exactly the same list of supported
> >> kernels (5.4 LTS and newer). Considering this, the conversion to the
> >> atomic bitops can be done as a standalone fix for cited commit and after
> >> it will have been accepted and backported the qdisc fix can just assume
> >> that qdisc->flags is an atomic bitops field in all target kernels and
> >> use it as-is. WDYT?
> >
> > Sounds great, how about:
> >
> >   1. I'll post the non-replay version of the fix (after updating the commit
> >      message), and we apply that first, as suggested by Jamal
>
> From my side there are no objections to any of the proposed approaches
> since we have never had any users with legitimate use-case where they
> need to replace/delete a qdisc concurrently with a filter update, so
> returning -EBUSY (or -EAGAIN) to the user in such case would work as
> either temporary or the final fix.

I see, yeah.

> However, Jakub had reservations with such approach so don't know where we
> stand now regarding this.

Either way, I'd say applying that non-replay version first is better than
leaving the bug unfixed; it's been many days since the root cause of the
issue was figured out.  I'll post it, and start making qdisc->flags
atomic.

> >   2. Make qdisc->flags atomic
> >
> >   3. Make the fix better by replaying and using the (now atomic)
> >      IS-DESTROYING flag with test_and_set_bit() and friends
> >
> > ?
>
> Again, no objections from my side. Ping me if you need help with any of
> these.

Sure, thanks, Vlad!
Peilin Ye
diff mbox series

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@  static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@  void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..286b7c58f5b9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		if ((q && q->flags & TCQ_F_INGRESS) ||
 		    (new && new->flags & TCQ_F_INGRESS)) {
 			ingress = 1;
-			if (!dev_ingress_queue(dev)) {
+			dev_queue = dev_ingress_queue(dev);
+			if (!dev_queue) {
 				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 				return -ENOENT;
 			}
+
+			/* Replay if the current ingress (or clsact) Qdisc has ongoing
+			 * RTNL-unlocked filter request(s).  This is the counterpart of that
+			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+			 */
+			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+				return -EAGAIN;
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1135,6 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 			if (new && new->ops->attach)
 				new->ops->attach(new);
-		} else {
-			notify_and_destroy(net, skb, n, classid, old, new, extack);
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1450,19 +1464,22 @@  static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			struct netlink_ext_ack *extack)
 {
 	struct net *net = sock_net(skb->sk);
-	struct tcmsg *tcm = nlmsg_data(n);
 	struct nlattr *tca[TCA_MAX + 1];
 	struct net_device *dev;
+	struct Qdisc *q, *p;
+	struct tcmsg *tcm;
 	u32 clid;
-	struct Qdisc *q = NULL;
-	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
+	tcm = nlmsg_data(n);
+	q = p = NULL;
+
 	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
 	if (!dev)
 		return -ENODEV;
@@ -1515,8 +1532,11 @@  static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1724,8 @@  static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@  static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops  *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@  static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@  void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@  void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);