diff mbox series

[net] net/sched: Fix mirred deadlock on device recursion

Message ID 20240415210728.36949-1-victor@mojatatu.com (mailing list archive)
State Accepted
Commit 0f022d32c3eca477fbf79a205243a6123ed0fe11
Delegated to: Netdev Maintainers
Headers show
Series [net] net/sched: Fix mirred deadlock on device recursion | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2341 this patch: 2341
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers fail 4 blamed authors not CCed: daniel@iogearbox.net tom@herbertland.com fw@strlen.de ast@kernel.org; 4 maintainers not CCed: daniel@iogearbox.net tom@herbertland.com fw@strlen.de ast@kernel.org
netdev/build_clang success Errors and warnings before: 966 this patch: 966
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2409 this patch: 2409
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 33 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 79 this patch: 79
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-04-18--00-00 (tests: 961)

Commit Message

Victor Nogueira April 15, 2024, 9:07 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

When the mirred action is used on a classful egress qdisc and a packet is
mirrored or redirected to self we hit a qdisc lock deadlock.
See trace below.

[..... other info removed for brevity....]
[   82.890906]
[   82.890906] ============================================
[   82.890906] WARNING: possible recursive locking detected
[   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
[   82.890906] --------------------------------------------
[   82.890906] ping/418 is trying to acquire lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] but task is already holding lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] other info that might help us debug this:
[   82.890906]  Possible unsafe locking scenario:
[   82.890906]
[   82.890906]        CPU0
[   82.890906]        ----
[   82.890906]   lock(&sch->q.lock);
[   82.890906]   lock(&sch->q.lock);
[   82.890906]
[   82.890906]  *** DEADLOCK ***
[   82.890906]
[..... other info removed for brevity....]

Example setup (eth0->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
     action mirred egress redirect dev eth0

Another example(eth0->eth1->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
     action mirred egress redirect dev eth1

tc qdisc add dev eth1 root handle 1: htb default 30
tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
     action mirred egress redirect dev eth0

We fix this by adding an owner field (CPU id) to struct Qdisc set after
root qdisc is entered. When the softirq enters it a second time, if the
qdisc owner is the same CPU, the packet is dropped to break the loop.

Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
Fixes: 3bcb846ca4cf ("net: get rid of spin_trylock() in net_tx_action()")
Fixes: e578d9c02587 ("net: sched: use counter to break reclassify loops")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/sch_generic.h | 1 +
 net/core/dev.c            | 6 ++++++
 net/sched/sch_generic.c   | 1 +
 3 files changed, 8 insertions(+)

Comments

patchwork-bot+netdevbpf@kernel.org April 18, 2024, 1:40 a.m. UTC | #1
Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Apr 2024 18:07:28 -0300 you wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> When the mirred action is used on a classful egress qdisc and a packet is
> mirrored or redirected to self we hit a qdisc lock deadlock.
> See trace below.
> 
> [..... other info removed for brevity....]
> [   82.890906]
> [   82.890906] ============================================
> [   82.890906] WARNING: possible recursive locking detected
> [   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
> [   82.890906] --------------------------------------------
> [   82.890906] ping/418 is trying to acquire lock:
> [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> __dev_queue_xmit+0x1778/0x3550
> [   82.890906]
> [   82.890906] but task is already holding lock:
> [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> __dev_queue_xmit+0x1778/0x3550
> [   82.890906]
> [   82.890906] other info that might help us debug this:
> [   82.890906]  Possible unsafe locking scenario:
> [   82.890906]
> [   82.890906]        CPU0
> [   82.890906]        ----
> [   82.890906]   lock(&sch->q.lock);
> [   82.890906]   lock(&sch->q.lock);
> [   82.890906]
> [   82.890906]  *** DEADLOCK ***
> [   82.890906]
> [..... other info removed for brevity....]
> 
> [...]

Here is the summary with links:
  - [net] net/sched: Fix mirred deadlock on device recursion
    https://git.kernel.org/netdev/net/c/0f022d32c3ec

You are awesome, thank you!
Johannes Berg June 7, 2024, 2:40 p.m. UTC | #2
Hi all,

I noticed today that this causes a userspace visible change in behaviour
(and a regression in some of our tests) for transmitting to a device
when it has no carrier, when noop_qdisc is assigned to it. Instead of
silently dropping the packets, -ENOBUFS will be returned if the socket
opted in to RECVERR.

The reason for this is that the static noop_qdisc:

struct Qdisc noop_qdisc = {
        .enqueue        =       noop_enqueue,
        .dequeue        =       noop_dequeue,
        .flags          =       TCQ_F_BUILTIN,
        .ops            =       &noop_qdisc_ops,
        .q.lock         =       __SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
        .dev_queue      =       &noop_netdev_queue,
        .busylock       =       __SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
        .gso_skb = {
                .next = (struct sk_buff *)&noop_qdisc.gso_skb,
                .prev = (struct sk_buff *)&noop_qdisc.gso_skb,
                .qlen = 0,
                .lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.gso_skb.lock),
        },
        .skb_bad_txq = {
                .next = (struct sk_buff *)&noop_qdisc.skb_bad_txq,
                .prev = (struct sk_buff *)&noop_qdisc.skb_bad_txq,
                .qlen = 0,
                .lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.skb_bad_txq.lock),
        },
};

doesn't have an owner set, and it's obviously not allocated via
qdisc_alloc(). Thus, it defaults to 0, so if you get to it on CPU 0 (I
was using ARCH=um which isn't even SMP) then it will just always run
into the 

> +	if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
> +		kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
> +		return NET_XMIT_DROP;
> +	}

case.

I'm not sure I understand the busylock logic well enough, so almost
seems to me we shouldn't do this whole thing on the noop_qdisc at all,
e.g. via tagging owner with -2 to say don't do it:

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3865,9 +3865,11 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		qdisc_run_end(q);
 		rc = NET_XMIT_SUCCESS;
 	} else {
-		WRITE_ONCE(q->owner, smp_processor_id());
+		if (q->owner != -2)
+			WRITE_ONCE(q->owner, smp_processor_id());
 		rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
-		WRITE_ONCE(q->owner, -1);
+		if (q->owner != -2)
+			WRITE_ONCE(q->owner, -1);
 		if (qdisc_run_begin(q)) {
 			if (unlikely(contended)) {
 				spin_unlock(&q->busylock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2a637a17061b..e857e4638671 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -657,6 +657,7 @@ static struct netdev_queue noop_netdev_queue = {
 };
 
 struct Qdisc noop_qdisc = {
+	.owner		=	-2,
 	.enqueue	=	noop_enqueue,
 	.dequeue	=	noop_dequeue,
 	.flags		=	TCQ_F_BUILTIN,


(and yes, I believe it doesn't need to be READ_ONCE for the check
against -2 since that's mutually exclusive with all other values)

Or maybe simply ignoring the value for the noop_qdisc:

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3822,7 +3822,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		return rc;
 	}
 
-	if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
+	if (unlikely(q != &noop_qdisc && READ_ONCE(q->owner) == smp_processor_id())) {
 		kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
 		return NET_XMIT_DROP;
 	}

That's shorter, but I'm not sure if there might be other special
cases...

Or maybe someone can think of an even better fix?

Thanks,
johannes
Eric Dumazet June 7, 2024, 2:54 p.m. UTC | #3
On Fri, Jun 7, 2024 at 4:40 PM Johannes Berg <johannes@sipsolutions.net> wrote:
>
> Hi all,
>
> I noticed today that this causes a userspace visible change in behaviour
> (and a regression in some of our tests) for transmitting to a device
> when it has no carrier, when noop_qdisc is assigned to it. Instead of
> silently dropping the packets, -ENOBUFS will be returned if the socket
> opted in to RECVERR.
>
> The reason for this is that the static noop_qdisc:
>
> struct Qdisc noop_qdisc = {
>         .enqueue        =       noop_enqueue,
>         .dequeue        =       noop_dequeue,
>         .flags          =       TCQ_F_BUILTIN,
>         .ops            =       &noop_qdisc_ops,
>         .q.lock         =       __SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
>         .dev_queue      =       &noop_netdev_queue,
>         .busylock       =       __SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
>         .gso_skb = {
>                 .next = (struct sk_buff *)&noop_qdisc.gso_skb,
>                 .prev = (struct sk_buff *)&noop_qdisc.gso_skb,
>                 .qlen = 0,
>                 .lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.gso_skb.lock),
>         },
>         .skb_bad_txq = {
>                 .next = (struct sk_buff *)&noop_qdisc.skb_bad_txq,
>                 .prev = (struct sk_buff *)&noop_qdisc.skb_bad_txq,
>                 .qlen = 0,
>                 .lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.skb_bad_txq.lock),
>         },
> };
>
> doesn't have an owner set, and it's obviously not allocated via
> qdisc_alloc(). Thus, it defaults to 0, so if you get to it on CPU 0 (I
> was using ARCH=um which isn't even SMP) then it will just always run
> into the
>
> > +     if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
> > +             kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
> > +             return NET_XMIT_DROP;
> > +     }
>
> case.
>
> I'm not sure I understand the busylock logic well enough, so almost
> seems to me we shouldn't do this whole thing on the noop_qdisc at all,
> e.g. via tagging owner with -2 to say don't do it:
>
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3865,9 +3865,11 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>                 qdisc_run_end(q);
>                 rc = NET_XMIT_SUCCESS;
>         } else {
> -               WRITE_ONCE(q->owner, smp_processor_id());
> +               if (q->owner != -2)
> +                       WRITE_ONCE(q->owner, smp_processor_id());
>                 rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
> -               WRITE_ONCE(q->owner, -1);
> +               if (q->owner != -2)
> +                       WRITE_ONCE(q->owner, -1);
>                 if (qdisc_run_begin(q)) {
>                         if (unlikely(contended)) {
>                                 spin_unlock(&q->busylock);
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 2a637a17061b..e857e4638671 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -657,6 +657,7 @@ static struct netdev_queue noop_netdev_queue = {
>  };
>
>  struct Qdisc noop_qdisc = {
> +       .owner          =       -2,
>         .enqueue        =       noop_enqueue,
>         .dequeue        =       noop_dequeue,
>         .flags          =       TCQ_F_BUILTIN,
>
>
> (and yes, I believe it doesn't need to be READ_ONCE for the check
> against -2 since that's mutually exclusive with all other values)
>
> Or maybe simply ignoring the value for the noop_qdisc:
>
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3822,7 +3822,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>                 return rc;
>         }
>
> -       if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
> +       if (unlikely(q != &noop_qdisc && READ_ONCE(q->owner) == smp_processor_id())) {
>                 kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
>                 return NET_XMIT_DROP;
>         }
>
> That's shorter, but I'm not sure if there might be other special
> cases...
>
> Or maybe someone can think of an even better fix?

Why not simply initialize noop_qdisc.owner to -1 ?
Johannes Berg June 7, 2024, 2:56 p.m. UTC | #4
On Fri, 2024-06-07 at 16:54 +0200, Eric Dumazet wrote:
> > 
> > I'm not sure I understand the busylock logic well enough

> Why not simply initialize noop_qdisc.owner to -1 ?

I didn't understand the locking logic, so I was worried you could still
have it be used in parallel since it can be assigned to any number of
devices.

johannes
diff mbox series

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 76db6be16083..f561dfb79743 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -117,6 +117,7 @@  struct Qdisc {
 	struct qdisc_skb_head	q;
 	struct gnet_stats_basic_sync bstats;
 	struct gnet_stats_queue	qstats;
+	int                     owner;
 	unsigned long		state;
 	unsigned long		state2; /* must be written under qdisc spinlock */
 	struct Qdisc            *next_sched;
diff --git a/net/core/dev.c b/net/core/dev.c
index 854a3a28a8d8..f6c6e494f0a9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3808,6 +3808,10 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		return rc;
 	}
 
+	if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
+		return NET_XMIT_DROP;
+	}
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
 	 * separate lock before trying to get qdisc main lock.
@@ -3847,7 +3851,9 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		qdisc_run_end(q);
 		rc = NET_XMIT_SUCCESS;
 	} else {
+		WRITE_ONCE(q->owner, smp_processor_id());
 		rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
+		WRITE_ONCE(q->owner, -1);
 		if (qdisc_run_begin(q)) {
 			if (unlikely(contended)) {
 				spin_unlock(&q->busylock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ff5336493777..4a2c763e2d11 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -974,6 +974,7 @@  struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	sch->enqueue = ops->enqueue;
 	sch->dequeue = ops->dequeue;
 	sch->dev_queue = dev_queue;
+	sch->owner = -1;
 	netdev_hold(dev, &sch->dev_tracker, GFP_KERNEL);
 	refcount_set(&sch->refcnt, 1);