diff mbox series

[v5,net,6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Message ID 429357af094297abbc45f47b8e606f11206df049.1684887977.git.peilin.ye@bytedance.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series net/sched: Fixes for sch_ingress and sch_clsact | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1391 this patch: 1391
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang success Errors and warnings before: 149 this patch: 149
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1412 this patch: 1412
netdev/checkpatch warning CHECK: multiple assignments should be avoided WARNING: line length of 82 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Peilin Ye May 24, 2023, 1:20 a.m. UTC
From: Peilin Ye <peilin.ye@bytedance.com>

mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap().  Similarly for
clsact Qdiscs and miniq_egress.
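To make the indirection concrete, here is a minimal userspace C model (not kernel code; all names with a _model suffix are illustrative) of how a pair's double pointer aims at the per-device field, so that a swap by either Qdisc writes the same device-visible pointer:

```c
#include <assert.h>
#include <stddef.h>

struct mini_qdisc { int id; };

/* Model of the per-net_device field that the ingress datapath reads. */
struct net_device_model {
	struct mini_qdisc *miniq_ingress;
};

/* Model of mini_Qdisc_pair: two mini Qdiscs plus a pointer TO the
 * device's miniq_ingress pointer (the double pointer in question). */
struct mini_qdisc_pair_model {
	struct mini_qdisc miniq1, miniq2;
	struct mini_qdisc **p_miniq;
};

/* Analogous to ingress_init(): point the pair at the device field. */
static void pair_init(struct mini_qdisc_pair_model *p,
		      struct net_device_model *dev)
{
	p->p_miniq = &dev->miniq_ingress;
}

/* Analogous to mini_qdisc_pair_swap(): publish a mini Qdisc, or NULL
 * on teardown.  Both A and B write through the same device field. */
static void pair_swap(struct mini_qdisc_pair_model *p, int active)
{
	*p->p_miniq = active ? &p->miniq1 : NULL;
}
```

Running two pairs against one device reproduces the clobbering described below: a late teardown swap by the old pair nulls out what the new pair just published.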

Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
requests (thanks Hillf Danton for the hint), when replacing ingress or
clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
causing race conditions [1] including a use-after-free bug in
mini_qdisc_pair_swap() reported by syzbot:

 BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
  print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
  print_report mm/kasan/report.c:430 [inline]
  kasan_report+0x11c/0x130 mm/kasan/report.c:536
  mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
  tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
  tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
  tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
  tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
  tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

@old and @new should not affect each other.  In other words, @old should
never modify miniq_{in,e}gress after @new, and @new should not update
@old's RCU state.  Fixing without changing sch_api.c turned out to be
difficult (please refer to Closes: for discussions).  Instead, make sure that
@new's first mini_qdisc_pair_swap() call always happens after @old's last
call, in qdisc_destroy(), has finished:

In qdisc_graft(), return -EAGAIN and tell the caller to replay
(suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
requests, and call qdisc_destroy() for @old before grafting @new.
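The replay pattern can be sketched as follows in userspace C (hypothetical names; the real code re-parses the netlink message on each pass so that per-attempt state such as @q, @p and @tcm is reinitialized, as noted in the v5 changelog):

```c
#include <assert.h>
#include <errno.h>

static int attempts;

/* Stand-in for qdisc_graft(): pretend RTNL-unlocked filter requests on
 * the old Qdisc are still in flight for the first two attempts. */
static int fake_qdisc_graft(void)
{
	return ++attempts < 3 ? -EAGAIN : 0;
}

/* Stand-in for tc_modify_qdisc()/tc_get_qdisc(): on -EAGAIN, jump back
 * and redo the whole request from scratch. */
static int fake_tc_modify_qdisc(void)
{
	int err;
replay:
	/* reinitialize per-attempt state here, then look up dev, q, p... */
	err = fake_qdisc_graft();
	if (err == -EAGAIN)
		goto replay;
	return err;
}
```

This mirrors the existing replay convention in tc_new_tfilter(), which the patch follows.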

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.
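The semantics of the new helper can be modeled in plain C (a sketch with illustrative names and a made-up flag value; the kernel's refcount_dec_if_one() performs the check-and-decrement atomically):

```c
#include <assert.h>
#include <stdbool.h>

#define TCQ_F_BUILTIN_MODEL 0x1	/* illustrative value, not the kernel's */

struct qdisc_model {
	unsigned int flags;
	int refcnt;
};

/* Sketch of qdisc_refcount_dec_if_one(): succeed only when the caller
 * holds the sole reference.  An in-flight RTNL-unlocked filter request,
 * which took an extra reference via qdisc_refcount_inc_nz(), makes the
 * count > 1 and the helper fail, signalling "replay later". */
static bool qdisc_refcount_dec_if_one_model(struct qdisc_model *q)
{
	if (q->flags & TCQ_F_BUILTIN_MODEL)
		return true;
	if (q->refcnt == 1) {	/* done atomically in the real helper */
		q->refcnt = 0;
		return true;
	}
	return false;
}
```

Once the helper succeeds, the caller is the only owner and may destroy @old before grafting @new.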

Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
Qdiscs".

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:

  Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
  adds a flower filter X to A.

  Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
  b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                         RTM_NEWQDISC (B, RTNL-locked)
         RCU sync                2     qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
   tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
   qdisc_destroy(A)                    tcf_chain0_head_change(B)
   tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
   mini_qdisc_pair_swap(A) (3rd)                |
           ...                                 ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().

This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers.  The point is clear though: again, A should
never modify those per-net_device pointers after B, and B should not
update A's RCU state.

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Cc: Hillf Danton <hdanton@sina.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
change in v5:
  - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying,
    just like @flags in tc_new_tfilter()

change in v3, v4:
  - add in-body From: tag

changes in v2:
  - replay the request if the current Qdisc has any ongoing RTNL-unlocked
    filter requests (Vlad)
  - minor changes in code comments and commit log

 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 40 ++++++++++++++++++++++++++++++---------
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 50 insertions(+), 12 deletions(-)

Comments

Pedro Tammela May 24, 2023, 3:39 p.m. UTC | #1
On 23/05/2023 22:20, Peilin Ye wrote:
> From: Peilin Ye <peilin.ye@bytedance.com>
>
> [snip commit message]

Tested-by: Pedro Tammela <pctammela@mojatatu.com>

> ---
> change in v5:
>    - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying,
>      just like @flags in tc_new_tfilter()
> 
> change in v3, v4:
>    - add in-body From: tag
> 
> changes in v2:
>    - replay the request if the current Qdisc has any ongoing RTNL-unlocked
>      filter requests (Vlad)
>    - minor changes in code comments and commit log
> 
>   include/net/sch_generic.h |  8 ++++++++
>   net/sched/sch_api.c       | 40 ++++++++++++++++++++++++++++++---------
>   net/sched/sch_generic.c   | 14 +++++++++++---
>   3 files changed, 50 insertions(+), 12 deletions(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
>   	refcount_inc(&qdisc->refcnt);
>   }
>   
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return true;
> +	return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
>   /* Intended to be used by unlocked users, when concurrent qdisc release is
>    * possible.
>    */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
>   struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>   			      struct Qdisc *qdisc);
>   void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
>   void qdisc_put(struct Qdisc *qdisc);
>   void qdisc_put_unlocked(struct Qdisc *qdisc);
>   void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..286b7c58f5b9 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   		if ((q && q->flags & TCQ_F_INGRESS) ||
>   		    (new && new->flags & TCQ_F_INGRESS)) {
>   			ingress = 1;
> -			if (!dev_ingress_queue(dev)) {
> +			dev_queue = dev_ingress_queue(dev);
> +			if (!dev_queue) {
>   				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
>   				return -ENOENT;
>   			}
> +
> +			/* Replay if the current ingress (or clsact) Qdisc has ongoing
> +			 * RTNL-unlocked filter request(s).  This is the counterpart of that
> +			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +			 */
> +			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +				return -EAGAIN;
>   		}
>   
>   		if (dev->flags & IFF_UP)
> @@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   				qdisc_put(old);
>   			}
>   		} else {
> -			dev_queue = dev_ingress_queue(dev);
> -			old = dev_graft_qdisc(dev_queue, new);
> +			old = dev_graft_qdisc(dev_queue, NULL);
> +
> +			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
> +			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +			 * pointer(s) in mini_qdisc_pair_swap().
> +			 */
> +			qdisc_notify(net, skb, n, classid, old, new, extack);
> +			qdisc_destroy(old);
> +
> +			dev_graft_qdisc(dev_queue, new);
>   		}
>   
>   skip:
> @@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   
>   			if (new && new->ops->attach)
>   				new->ops->attach(new);
> -		} else {
> -			notify_and_destroy(net, skb, n, classid, old, new, extack);
>   		}
>   
>   		if (dev->flags & IFF_UP)
> @@ -1450,19 +1464,22 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   			struct netlink_ext_ack *extack)
>   {
>   	struct net *net = sock_net(skb->sk);
> -	struct tcmsg *tcm = nlmsg_data(n);
>   	struct nlattr *tca[TCA_MAX + 1];
>   	struct net_device *dev;
> +	struct Qdisc *q, *p;
> +	struct tcmsg *tcm;
>   	u32 clid;
> -	struct Qdisc *q = NULL;
> -	struct Qdisc *p = NULL;
>   	int err;
>   
> +replay:
>   	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
>   				     rtm_tca_policy, extack);
>   	if (err < 0)
>   		return err;
>   
> +	tcm = nlmsg_data(n);
> +	q = p = NULL;
> +
>   	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
>   	if (!dev)
>   		return -ENODEV;
> @@ -1515,8 +1532,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   			return -ENOENT;
>   		}
>   		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
> -		if (err != 0)
> +		if (err != 0) {
> +			if (err == -EAGAIN)
> +				goto replay;
>   			return err;
> +		}
>   	} else {
>   		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
>   	}
> @@ -1704,6 +1724,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   	if (err) {
>   		if (q)
>   			qdisc_put(q);
> +		if (err == -EAGAIN)
> +			goto replay;
>   		return err;
>   	}
>   
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
>   	qdisc_free(q);
>   }
>   
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
>   {
>   	const struct Qdisc_ops  *ops = qdisc->ops;
>   
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>   	call_rcu(&qdisc->rcu, qdisc_free_cb);
>   }
>   
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return;
> +
> +	__qdisc_destroy(qdisc);
> +}
> +
>   void qdisc_put(struct Qdisc *qdisc)
>   {
>   	if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
>   	    !refcount_dec_and_test(&qdisc->refcnt))
>   		return;
>   
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>   }
>   EXPORT_SYMBOL(qdisc_put);
>   
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
>   	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
>   		return;
>   
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>   	rtnl_unlock();
>   }
>   EXPORT_SYMBOL(qdisc_put_unlocked);
Jamal Hadi Salim May 24, 2023, 4:09 p.m. UTC | #2
Pedro,
When you have a moment - could you run tc monitor in parallel to the
reproducer and double check it generates the correct events...

cheers,
jamal

On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
>
> [snip]
Paolo Abeni May 25, 2023, 9:25 a.m. UTC | #3
On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> When you have a moment - could you run tc monitor in parallel to the
> reproducer and double check it generates the correct events...

FTR, I'll wait a bit to apply this series, to allow for the above
tests. Unless someone will scream very loudly very soon, it's not going
to enter today's PR. Since the addressed issue is an ancient one, it
should not be a problem, I hope.

Cheers,

Paolo
Jamal Hadi Salim May 26, 2023, 12:19 p.m. UTC | #4
On Thu, May 25, 2023 at 5:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> > When you have a moment - could you run tc monitor in parallel to the
> > reproducer and double check it generates the correct events...
>
> FTR, I'll wait a bit to apply this series, to allow for the above
> tests. Unless someone will scream very loudly very soon, it's not going
> to enter today's PR. Since the addressed issue is an ancient one, it
> should not a problem, I hope.

Sorry, I do not mean to hold this back - I think we should have applied
v1 and then worried about making things better.  We can worry about
events later.
So Acking this.

cheers,
jamal

> Cheers,
>
> Paolo
>
Jamal Hadi Salim May 26, 2023, 12:20 p.m. UTC | #5
On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
>
> On 23/05/2023 22:20, Peilin Ye wrote:
> > From: Peilin Ye <peilin.ye@bytedance.com>
> >
> > mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> > ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
> > access this per-net_device pointer in mini_qdisc_pair_swap().  Similar for
> > clsact Qdiscs and miniq_egress.
> >
> > Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
> > requests (thanks Hillf Danton for the hint), when replacing ingress or
> > clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
> > miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
> > causing race conditions [1] including a use-after-free bug in
> > mini_qdisc_pair_swap() reported by syzbot:
> >
> >   BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> >   Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> > ...
> >   Call Trace:
> >    <TASK>
> >    __dump_stack lib/dump_stack.c:88 [inline]
> >    dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
> >    print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
> >    print_report mm/kasan/report.c:430 [inline]
> >    kasan_report+0x11c/0x130 mm/kasan/report.c:536
> >    mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> >    tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
> >    tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
> >    tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
> >    tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
> >    tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> > ...
> >
> > @old and @new should not affect each other.  In other words, @old should
> > never modify miniq_{in,e}gress after @new, and @new should not update
> > @old's RCU state.  Fixing without changing sch_api.c turned out to be
> > difficult (please refer to Closes: for discussions).  Instead, make sure
> > @new's first call always happen after @old's last call, in
> > qdisc_destroy(), has finished:
> >
> > In qdisc_graft(), return -EAGAIN and tell the caller to replay
> > (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
> > requests, and call qdisc_destroy() for @old before grafting @new.
> >
> > Introduce qdisc_refcount_dec_if_one() as the counterpart of
> > qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
> > a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> > just like qdisc_put() etc.
> >
> > Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
> > Qdiscs".
> >
> > [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> > TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> > create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
> >
> >    Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> >    adds a flower filter X to A.
> >
> >    Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> >    b2) to replace A, then adds a flower filter Y to B.
> >
> >   Thread 1               A's refcnt   Thread 2
> >    RTM_NEWQDISC (A, RTNL-locked)
> >     qdisc_create(A)               1
> >     qdisc_graft(A)                9
> >
> >    RTM_NEWTFILTER (X, RTNL-unlocked)
> >     __tcf_qdisc_find(A)          10
> >     tcf_chain0_head_change(A)
> >     mini_qdisc_pair_swap(A) (1st)
> >              |
> >              |                         RTM_NEWQDISC (B, RTNL-locked)
> >           RCU sync                2     qdisc_graft(B)
> >              |                    1     notify_and_destroy(A)
> >              |
> >     tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
> >     qdisc_destroy(A)                    tcf_chain0_head_change(B)
> >     tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
> >     mini_qdisc_pair_swap(A) (3rd)                |
> >             ...                                 ...
> >
> > Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> > mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
> > ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> > on eth0 will not find filter Y in sch_handle_ingress().
> >
> > This is only one of the possible consequences of concurrently accessing
> > miniq_{in,e}gress pointers.  The point is clear though: again, A should
> > never modify those per-net_device pointers after B, and B should not
> > update A's RCU state.
> >
> > Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> > Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> > Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> > Cc: Hillf Danton <hdanton@sina.com>
> > Cc: Vlad Buslov <vladbu@mellanox.com>
> > Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
>
> Tested-by: Pedro Tammela <pctammela@mojatatu.com>


Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>


cheers,
jamal
Jamal Hadi Salim May 26, 2023, 7:47 p.m. UTC | #6
On Fri, May 26, 2023 at 8:20 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
> >
> > On 23/05/2023 22:20, Peilin Ye wrote:
> > > From: Peilin Ye <peilin.ye@bytedance.com>

> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

I apologize, I am going to take this back - let's please hold the series for now.

While chasing the effect on events, Pedro and I spent a few _hours_
on this - and regardless of the events, there are still challenges
with the concurrency issue.  The current reproducer unfortunately
can't cause damage after patch 2, so patch 6 was really not being
tested.  We hacked the repro to hit the codepath that patch 6 fixes.
We are not sure what the root cause is - but it is certainly due to
the series.  Peilin, Pedro will post the new repro.

cheers,
jamal
Pedro Tammela May 26, 2023, 8:21 p.m. UTC | #7
On 26/05/2023 16:47, Jamal Hadi Salim wrote:
> [...] Peilin, Pedro will post the new repro.

Hi!

We tweaked the reproducer to:
---
r0 = socket$netlink(0x10, 0x3, 0x0)
r1 = socket(0x10, 0x803, 0x0)
sendmsg$nl_route_sched(r1, &(0x7f0000000300)={0x0, 0x0, 
&(0x7f0000000240)={&(0x7f0000000380)=ANY=[], 0x24}}, 0x0)
getsockname$packet(r1, &(0x7f0000000200)={0x11, 0x0, <r2=>0x0, 0x1, 0x0, 
0x6, @broadcast}, &(0x7f0000000440)=0x14)
sendmsg$nl_route(r0, &(0x7f0000000040)={0x0, 0x0, 
&(0x7f0000000000)={&(0x7f0000000080)=ANY=[@ANYBLOB="480000001000050700"/20, 
@ANYRES32=r2, @ANYBLOB="0000000000000000280012000900010076657468"], 
0x48}}, 0x0)
sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f00000002c0)={0x0, 0x0, 
&(0x7f0000000280)={&(0x7f0000000540)=@newqdisc={0x30, 0x24, 0xf0b, 0x0, 
0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}}, 
[@qdisc_kind_options=@q_ingress={0xc}]}, 0x30}}, 0x0)
sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f0000000340)={0x0, 0x0, 
&(0x7f00000000c0)={&(0x7f0000000580)=@newtfilter={0x3c, 0x2c, 0xd27, 
0x0, 0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}, {0xc}}, 
[@filter_kind_options=@f_flower={{0xb}, {0xc, 0x2, 
[@TCA_FLOWER_CLASSID={0x8}]}}]}, 0x3c}}, 0x0)
r4 = socket$netlink(0x10, 0x3, 0x0)
sendmmsg(r4, &(0x7f00000002c0), 0x40000000000009f, 0x0)
r5 = socket$netlink(0x10, 0x3, 0x0)
sendmmsg(r5, &(0x7f00000002c0), 0x40000000000009f, 0x0)
---

We then generate the C program with:
syz-prog2c -sandbox none -enable net_dev -threaded -repeat 0 -prog 
peilin.syz > repro.c

Now here comes a very important detail.  The above will create a new net
namespace in which to send the netlink messages.  We are only able to
reproduce the deadlock with your patches if we comment out the creation
of the new namespace:
---
diff --git a/repro.c b/repro.c
index ee8eb0726..5cdbfb289 100644
--- a/repro.c
+++ b/repro.c
@@ -1121,9 +1121,8 @@ static int do_sandbox_none(void)
    sandbox_common();
    drop_caps();
    initialize_netdevices_init();
-  if (unshare(CLONE_NEWNET)) {
-  }
+  // Doesn't seem to deadlock in a new netns
+  // if (unshare(CLONE_NEWNET)) {
+  // }
    write_file("/proc/sys/net/ipv4/ping_group_range", "0 65535");
    initialize_netdevices();
    setup_binderfs();

---

The reason we did this was to check the events with 'tc mon'.
The splat is quite big, see attached.  It has all the indications of a
deadlock on the rtnl_lock.

Thanks,
Pedro
[  291.882988][   T64] INFO: task kworker/1:2:4513 blocked for more than 143 seconds.
[  291.885716][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  291.887658][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  291.890484][   T64] task:kworker/1:2     state:D stack:28272 pid:4513  ppid:2      flags:0x00004000
[  291.893628][   T64] Workqueue: events_power_efficient reg_check_chans_work
[  291.896167][   T64] Call Trace:
[  291.897406][   T64]  <TASK>
[  291.898476][   T64]  __schedule+0xecb/0x59d0
[  291.900101][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.902218][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.904563][   T64]  ? io_schedule_timeout+0x150/0x150
[  291.906437][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  291.908241][   T64]  ? _raw_spin_unlock_irq+0x23/0x50
[  291.909564][   T64]  ? lockdep_hardirqs_on+0x7d/0x100
[  291.910912][   T64]  schedule+0xe7/0x1b0
[  291.912060][   T64]  schedule_preempt_disabled+0x13/0x20
[  291.913608][   T64]  __mutex_lock+0x907/0x12d0
[  291.914850][   T64]  ? reg_check_chans_work+0x7d/0xfe0
[  291.916249][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  291.917689][   T64]  ? reg_check_chans_work+0x7d/0xfe0
[  291.919051][   T64]  ? rtnl_lock+0x9/0x20
[  291.920164][   T64]  reg_check_chans_work+0x7d/0xfe0
[  291.921451][   T64]  ? lock_sync+0x190/0x190
[  291.922567][   T64]  ? freq_reg_info+0x1d0/0x1d0
[  291.923802][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.924910][   T64]  process_one_work+0x9f9/0x15f0
[  291.926155][   T64]  ? pwq_dec_nr_in_flight+0x2a0/0x2a0
[  291.927544][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.928706][   T64]  worker_thread+0x687/0x1110
[  291.929867][   T64]  ? process_one_work+0x15f0/0x15f0
[  291.930968][   T64]  kthread+0x334/0x430
[  291.931869][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  291.933092][   T64]  ret_from_fork+0x1f/0x30
[  291.934037][   T64]  </TASK>
[  291.934675][   T64] INFO: task kworker/3:2:6525 blocked for more than 143 seconds.
[  291.936271][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  291.937672][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  291.939514][   T64] task:kworker/3:2     state:D stack:28272 pid:6525  ppid:2      flags:0x00004000
[  291.941718][   T64] Workqueue: events linkwatch_event
[  291.943042][   T64] Call Trace:
[  291.943834][   T64]  <TASK>
[  291.944526][   T64]  __schedule+0xecb/0x59d0
[  291.945582][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.946952][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.948389][   T64]  ? io_schedule_timeout+0x150/0x150
[  291.949615][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  291.950847][   T64]  ? _raw_spin_unlock_irq+0x23/0x50
[  291.952077][   T64]  ? lockdep_hardirqs_on+0x7d/0x100
[  291.953344][   T64]  schedule+0xe7/0x1b0
[  291.954286][   T64]  schedule_preempt_disabled+0x13/0x20
[  291.955512][   T64]  __mutex_lock+0x907/0x12d0
[  291.956603][   T64]  ? linkwatch_event+0xf/0x70
[  291.957680][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  291.958953][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.959996][   T64]  ? linkwatch_event+0xf/0x70
[  291.960738][   T64]  ? rtnl_lock+0x9/0x20
[  291.961249][   T64]  linkwatch_event+0xf/0x70
[  291.961770][   T64]  process_one_work+0x9f9/0x15f0
[  291.962306][   T64]  ? pwq_dec_nr_in_flight+0x2a0/0x2a0
[  291.962928][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.963409][   T64]  worker_thread+0x687/0x1110
[  291.963982][   T64]  ? __kthread_parkme+0x152/0x220
[  291.964567][   T64]  ? process_one_work+0x15f0/0x15f0
[  291.965163][   T64]  kthread+0x334/0x430
[  291.965641][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  291.966303][   T64]  ret_from_fork+0x1f/0x30
[  291.966836][   T64]  </TASK>
[  291.967189][   T64] INFO: task systemd-udevd:6764 blocked for more than 143 seconds.
[  291.968062][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  291.968813][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  291.969691][   T64] task:systemd-udevd   state:D stack:25968 pid:6764  ppid:4544   flags:0x00000006
[  291.970609][   T64] Call Trace:
[  291.970939][   T64]  <TASK>
[  291.971239][   T64]  __schedule+0xecb/0x59d0
[  291.971718][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.972341][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.972965][   T64]  ? io_schedule_timeout+0x150/0x150
[  291.973572][   T64]  ? __mutex_lock+0x902/0x12d0
[  291.974120][   T64]  schedule+0xe7/0x1b0
[  291.974579][   T64]  schedule_preempt_disabled+0x13/0x20
[  291.975174][   T64]  __mutex_lock+0x907/0x12d0
[  291.975702][   T64]  ? rtnetlink_rcv_msg+0x3e2/0xd30
[  291.976299][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  291.976938][   T64]  ? rtnetlink_rcv_msg+0x3e2/0xd30
[  291.977506][   T64]  rtnetlink_rcv_msg+0x3e2/0xd30
[  291.978042][   T64]  ? rtnl_getlink+0xb10/0xb10
[  291.978573][   T64]  ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  291.979295][   T64]  ? __module_address+0x55/0x3b0
[  291.979937][   T64]  ? netdev_core_pick_tx+0x390/0x390
[  291.980580][   T64]  netlink_rcv_skb+0x166/0x440
[  291.981164][   T64]  ? rtnl_getlink+0xb10/0xb10
[  291.981726][   T64]  ? netlink_ack+0x1370/0x1370
[  291.982310][   T64]  ? kasan_set_track+0x25/0x30
[  291.982973][   T64]  ? netlink_deliver_tap+0x1b1/0xd00
[  291.983579][   T64]  netlink_unicast+0x530/0x800
[  291.984132][   T64]  ? netlink_attachskb+0x880/0x880
[  291.984682][   T64]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  291.985330][   T64]  ? __phys_addr_symbol+0x30/0x70
[  291.985904][   T64]  ? __check_object_size+0x323/0x740
[  291.986488][   T64]  netlink_sendmsg+0x90b/0xe10
[  291.987023][   T64]  ? netlink_unicast+0x800/0x800
[  291.987584][   T64]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  291.988189][   T64]  ? netlink_unicast+0x800/0x800
[  291.988737][   T64]  sock_sendmsg+0xd9/0x180
[  291.989228][   T64]  __sys_sendto+0x201/0x2f0
[  291.989714][   T64]  ? __ia32_sys_getpeername+0xb0/0xb0
[  291.990296][   T64]  ? get_random_bytes+0x20/0x20
[  291.990833][   T64]  ? rcu_is_watching+0x12/0xb0
[  291.991368][   T64]  ? __rseq_handle_notify_resume+0x5c7/0xfe0
[  291.992062][   T64]  __x64_sys_sendto+0xe0/0x1b0
[  291.992600][   T64]  ? syscall_enter_from_user_mode+0x26/0x80
[  291.993324][   T64]  do_syscall_64+0x38/0xb0
[  291.993816][   T64]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  291.994436][   T64] RIP: 0033:0x7f1de823fcf7
[  291.994902][   T64] RSP: 002b:00007ffd8c5d3208 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[  291.995834][   T64] RAX: ffffffffffffffda RBX: 000055d9974d14e0 RCX: 00007f1de823fcf7
[  291.996711][   T64] RDX: 0000000000000020 RSI: 000055d997332930 RDI: 0000000000000005
[  291.997593][   T64] RBP: 00007ffd8c5d33b8 R08: 00007ffd8c5d3280 R09: 0000000000000080
[  291.998477][   T64] R10: 0000000000000000 R11: 0000000000000202 R12: 00000000043abe07
[  291.999360][   T64] R13: 0000000000000001 R14: 000055d9974845a0 R15: 0000000000000000
[  292.000254][   T64]  </TASK>
[  292.000609][   T64] INFO: task systemd-udevd:6773 blocked for more than 143 seconds.
[  292.001482][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.002233][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  292.003237][   T64] task:systemd-udevd   state:D stack:25968 pid:6773  ppid:4544   flags:0x00004006
[  292.004292][   T64] Call Trace:
[  292.004743][   T64]  <TASK>
[  292.005128][   T64]  __schedule+0xecb/0x59d0
[  292.005637][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.006304][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.007005][   T64]  ? io_schedule_timeout+0x150/0x150
[  292.007601][   T64]  ? __mutex_lock+0x902/0x12d0
[  292.008174][   T64]  schedule+0xe7/0x1b0
[  292.008648][   T64]  schedule_preempt_disabled+0x13/0x20
[  292.009257][   T64]  __mutex_lock+0x907/0x12d0
[  292.009793][   T64]  ? dev_ethtool+0x21d/0x53e0
[  292.010374][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  292.011105][   T64]  ? kasan_unpoison+0x44/0x70
[  292.011987][   T64]  ? __kmem_cache_alloc_node+0x19e/0x330
[  292.012994][   T64]  ? dev_ethtool+0x170/0x53e0
[  292.013757][   T64]  ? dev_ethtool+0x21d/0x53e0
[  292.014470][   T64]  ? rtnl_lock+0x9/0x20
[  292.015032][   T64]  dev_ethtool+0x21d/0x53e0
[  292.015633][   T64]  ? filter_irq_stacks+0x90/0x90
[  292.016269][   T64]  ? print_usage_bug.part.0+0x670/0x670
[  292.016941][   T64]  ? ethtool_get_module_info_call+0x1c0/0x1c0
[  292.017674][   T64]  ? kasan_set_track+0x25/0x30
[  292.018228][   T64]  ? ____kasan_slab_free+0x15e/0x1b0
[  292.018794][   T64]  ? slab_free_freelist_hook+0x10b/0x1e0
[  292.019287][   T64]  ? __kmem_cache_free+0xaf/0x2e0
[  292.019729][   T64]  ? tomoyo_path_number_perm+0x43a/0x530
[  292.020250][   T64]  ? security_file_ioctl+0x72/0xb0
[  292.020700][   T64]  ? __x64_sys_ioctl+0xbb/0x210
[  292.021128][   T64]  ? do_syscall_64+0x38/0xb0
[  292.021538][   T64]  ? mark_lock+0x105/0x1960
[  292.021950][   T64]  ? __lock_acquire+0xc42/0x5cf0
[  292.022399][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.022990][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.023514][   T64]  ? lockdep_hardirqs_on+0x7d/0x100
[  292.023995][   T64]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  292.024513][   T64]  ? find_held_lock+0x2d/0x110
[  292.024940][   T64]  ? dev_load+0x7d/0x240
[  292.025322][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.025801][   T64]  ? full_name_hash+0xbc/0x110
[  292.026220][   T64]  dev_ioctl+0x29e/0x1090
[  292.026601][   T64]  sock_do_ioctl+0x1be/0x250
[  292.027006][   T64]  ? get_user_ifreq+0x250/0x250
[  292.027433][   T64]  ? do_vfs_ioctl+0x34a/0x17f0
[  292.027870][   T64]  ? vfs_fileattr_set+0xbf0/0xbf0
[  292.028318][   T64]  sock_ioctl+0x205/0x6a0
[  292.028697][   T64]  ? br_ioctl_call+0xb0/0xb0
[  292.029145][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.029827][   T64]  ? handle_mm_fault+0x32c/0x9e0
[  292.030501][   T64]  ? bpf_lsm_file_ioctl+0x9/0x10
[  292.031054][   T64]  ? br_ioctl_call+0xb0/0xb0
[  292.031527][   T64]  __x64_sys_ioctl+0x189/0x210
[  292.032091][   T64]  do_syscall_64+0x38/0xb0
[  292.032609][   T64]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  292.033390][   T64] RIP: 0033:0x7f1de82319ef
[  292.033908][   T64] RSP: 002b:00007ffd8c5d32f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  292.034751][   T64] RAX: ffffffffffffffda RBX: 000055d997476d20 RCX: 00007f1de82319ef
[  292.035448][   T64] RDX: 00007ffd8c5d3430 RSI: 0000000000008946 RDI: 000000000000000b
[  292.036179][   T64] RBP: 000055d997488a50 R08: 0000000000000000 R09: 0000000000000006
[  292.036884][   T64] R10: 00007ffd8c5d3400 R11: 0000000000000246 R12: 000055d99732a230
[  292.037581][   T64] R13: 00007ffd8c5d3430 R14: 0000000000000000 R15: 000055d99732a238
[  292.038268][   T64]  </TASK>
[  292.038557][   T64] INFO: task repro:7652 blocked for more than 143 seconds.
[  292.039202][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.039776][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  292.040509][   T64] task:repro           state:D stack:25984 pid:7652  ppid:6753   flags:0x00004006
[  292.041289][   T64] Call Trace:
[  292.041575][   T64]  <TASK>
[  292.041848][   T64]  __schedule+0xecb/0x59d0
[  292.042297][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.043017][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.043711][   T64]  ? io_schedule_timeout+0x150/0x150
[  292.044360][   T64]  ? __mutex_lock+0x902/0x12d0
[  292.044845][   T64]  schedule+0xe7/0x1b0
[  292.045300][   T64]  schedule_preempt_disabled+0x13/0x20
[  292.045872][   T64]  __mutex_lock+0x907/0x12d0
[  292.046396][   T64]  ? fl_change+0x2102/0x51e0
[  292.046833][   T64]  ? tc_setup_cb_reoffload+0x100/0x100
[  292.047353][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  292.047915][   T64]  ? _raw_spin_unlock+0x28/0x40
[  292.048368][   T64]  ? tcf_exts_init_ex+0x1bc/0x610
[  292.048828][   T64]  ? fl_change+0x2102/0x51e0
[  292.049328][   T64]  ? rtnl_lock+0x9/0x20
[  292.049768][   T64]  fl_change+0x2102/0x51e0
[  292.050232][   T64]  ? find_held_lock+0x2d/0x110
[  292.050681][   T64]  ? __rhashtable_insert_fast.constprop.0.isra.0+0x14a0/0x14a0
[  292.051416][   T64]  ? fl_get+0x21c/0x3c0
[  292.051791][   T64]  ? fl_put+0x20/0x20
[  292.052170][   T64]  ? mini_qdisc_pair_swap+0x128/0x1f0
[  292.052655][   T64]  ? __rhashtable_insert_fast.constprop.0.isra.0+0x14a0/0x14a0
[  292.053395][   T64]  tc_new_tfilter+0x992/0x22b0
[  292.053858][   T64]  ? tc_del_tfilter+0x1570/0x1570
[  292.054419][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.055066][   T64]  ? kasan_quarantine_put+0x102/0x230
[  292.055660][   T64]  ? rtnetlink_rcv_msg+0x94a/0xd30
[  292.056224][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.056799][   T64]  ? bpf_lsm_capable+0x9/0x10
[  292.057311][   T64]  ? tc_del_tfilter+0x1570/0x1570
[  292.057843][   T64]  rtnetlink_rcv_msg+0x98a/0xd30
[  292.058381][   T64]  ? rtnl_getlink+0xb10/0xb10
[  292.058833][   T64]  ? __x64_sys_sendmmsg+0x9c/0x100
[  292.059305][   T64]  ? do_syscall_64+0x38/0xb0
[  292.059752][   T64]  ? netdev_core_pick_tx+0x390/0x390
[  292.060297][   T64]  netlink_rcv_skb+0x166/0x440
[  292.060733][   T64]  ? rtnl_getlink+0xb10/0xb10
[  292.061146][   T64]  ? netlink_ack+0x1370/0x1370
[  292.061566][   T64]  ? kasan_set_track+0x25/0x30
[  292.062003][   T64]  ? netlink_deliver_tap+0x1b1/0xd00
[  292.062468][   T64]  netlink_unicast+0x530/0x800
[  292.062955][   T64]  ? netlink_attachskb+0x880/0x880
[  292.063416][   T64]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  292.063958][   T64]  ? __phys_addr_symbol+0x30/0x70
[  292.064468][   T64]  ? __check_object_size+0x323/0x740
[  292.065009][   T64]  netlink_sendmsg+0x90b/0xe10
[  292.065517][   T64]  ? netlink_unicast+0x800/0x800
[  292.066010][   T64]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  292.066467][   T64]  ? netlink_unicast+0x800/0x800
[  292.066933][   T64]  sock_sendmsg+0xd9/0x180
[  292.067358][   T64]  ____sys_sendmsg+0x264/0x910
[  292.067794][   T64]  ? kernel_sendmsg+0x50/0x50
[  292.068252][   T64]  ? __copy_msghdr+0x460/0x460
[  292.068690][   T64]  ___sys_sendmsg+0x11d/0x1b0
[  292.069111][   T64]  ? do_recvmmsg+0x700/0x700
[  292.069551][   T64]  ? find_held_lock+0x2d/0x110
[  292.069972][   T64]  ? __might_fault+0xe5/0x190
[  292.070397][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.070862][   T64]  __sys_sendmmsg+0x18e/0x430
[  292.071272][   T64]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  292.071705][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.072204][   T64]  ? rcu_is_watching+0x12/0xb0
[  292.072620][   T64]  ? xfd_validate_state+0x5d/0x180
[  292.073099][   T64]  ? restore_fpregs_from_fpstate+0xc1/0x1d0
[  292.073611][   T64]  ? unlock_page_memcg+0x2d0/0x2d0
[  292.074062][   T64]  ? do_futex+0x350/0x350
[  292.074461][   T64]  __x64_sys_sendmmsg+0x9c/0x100
[  292.074893][   T64]  ? syscall_enter_from_user_mode+0x26/0x80
[  292.075494][   T64]  do_syscall_64+0x38/0xb0
[  292.075960][   T64]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  292.076473][   T64] RIP: 0033:0x7f3f4319289d
[  292.076850][   T64] RSP: 002b:00007f3f43056e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[  292.077557][   T64] RAX: ffffffffffffffda RBX: 0000558de6727248 RCX: 00007f3f4319289d
[  292.078234][   T64] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000006
[  292.078924][   T64] RBP: 0000000000000006 R08: 0000000000000000 R09: 0000000000000000
[  292.079586][   T64] R10: 0000000000000000 R11: 0000000000000246 R12: 0000558de672724c
[  292.080259][   T64] R13: 0000558de6723030 R14: 040000000000009f R15: 0000558de6727240
[  292.080943][   T64]  </TASK>
[  292.081221][   T64]
[  292.081221][   T64] Showing all locks held in the system:
[  292.081868][   T64] 1 lock held by rcu_tasks_kthre/14:
[  292.082316][   T64]  #0: ffffffff8c798430 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2c/0xe20
[  292.083266][   T64] 1 lock held by rcu_tasks_trace/15:
[  292.083721][   T64]  #0: ffffffff8c798130 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2c/0xe20
[  292.084702][   T64] 5 locks held by ksoftirqd/4/38:
[  292.085152][   T64] 3 locks held by kworker/4:0/39:
[  292.085598][   T64]  #0: ffff88811b78b938 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[  292.086704][   T64]  #1: ffffc9000139fda0 ((work_completion)(&(&net->ipv6.addr_chk_work)->work)){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[  292.088126][   T64]  #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: addrconf_verify_work+0x12/0x30
[  292.089159][   T64] 1 lock held by khungtaskd/64:
[  292.089656][   T64]  #0: ffffffff8c799040 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x55/0x340
[  292.090713][   T64] 3 locks held by kworker/1:2/4513:
[  292.091175][   T64]  #0: ffff888100079d38 ((wq_completion)events_power_efficient){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[  292.092274][   T64]  #1: ffffc90016fcfda0 ((reg_check_chans).work){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[  292.093319][   T64]  #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: reg_check_chans_work+0x7d/0xfe0
[  292.094269][   T64] 3 locks held by kworker/3:2/6525:
[  292.094801][   T64]  #0: ffff888100078d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[  292.095834][   T64]  #1: ffffc9000648fda0 ((linkwatch_work).work){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[  292.096754][   T64]  #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: linkwatch_event+0xf/0x70
[  292.097662][   T64] 1 lock held by systemd-udevd/6764:
[  292.098148][   T64]  #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x3e2/0xd30
[  292.098984][   T64] 1 lock held by systemd-udevd/6773:
[  292.099462][   T64]  #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: dev_ethtool+0x21d/0x53e0
[  292.100300][   T64] 1 lock held by repro/7652:
[  292.100700][   T64]  #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: fl_change+0x2102/0x51e0
[  292.101494][   T64] 2 locks held by repro/7653:
[  292.101905][   T64]
[  292.102135][   T64] =============================================
[  292.102135][   T64]
[  292.102905][   T64] NMI backtrace for cpu 5
[  292.103277][   T64] CPU: 5 PID: 64 Comm: khungtaskd Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.104054][   T64] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  292.104931][   T64] Call Trace:
[  292.105222][   T64]  <TASK>
[  292.105481][   T64]  dump_stack_lvl+0xd9/0x1b0
[  292.105889][   T64]  nmi_cpu_backtrace+0x277/0x380
[  292.106318][   T64]  ? lapic_can_unplug_cpu+0xa0/0xa0
[  292.106770][   T64]  nmi_trigger_cpumask_backtrace+0x2ac/0x310
[  292.107290][   T64]  watchdog+0xed1/0x1150
[  292.107715][   T64]  ? proc_dohung_task_timeout_secs+0x90/0x90
[  292.108330][   T64]  kthread+0x334/0x430
[  292.108736][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  292.109285][   T64]  ret_from_fork+0x1f/0x30
[  292.109720][   T64]  </TASK>
[  292.109991][   T64] Sending NMI from CPU 5 to CPUs 0-4,6-7:
[  292.110569][    C3] NMI backtrace for cpu 3
[  292.110577][    C3] CPU: 3 PID: 7653 Comm: repro Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.110586][    C3] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  292.110591][    C3] RIP: 0010:unwind_get_return_address+0x6e/0xa0
[  292.110604][    C3] Code: ea 03 80 3c 02 00 75 32 48 8b 7b 48 e8 3b 46 1c 00 85 c0 74 d3 48 b8 00 00 00 00 00 fc ff df 48 89 ea 48 c1 ea 03 80 3c 02 00 <75> 18 48 8b 43 48 5b 5d c3 e8 14 c5 9e 00 eb a8 48 89 ef e8 2a c5
[  292.110611][    C3] RSP: 0018:ffffc900095d6ed0 EFLAGS: 00000246
[  292.110618][    C3] RAX: dffffc0000000000 RBX: ffffc900095d6ee8 RCX: 0000000000000000
[  292.110623][    C3] RDX: 1ffff920012bade6 RSI: ffffc900095d72a0 RDI: ffffffff88370cd5
[  292.110628][    C3] RBP: ffffc900095d6f30 R08: ffffc900095d6f1c R09: ffffffff8f831a34
[  292.110633][    C3] R10: ffffc900095d6ee8 R11: 0000000000072d0d R12: ffffc900095d6fa0
[  292.110637][    C3] R13: 0000000000000000 R14: ffff888173271dc0 R15: ffff888171b140a0
[  292.110645][    C3] FS:  00007f3f430376c0(0000) GS:ffff888437c80000(0000) knlGS:0000000000000000
[  292.110653][    C3] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  292.110658][    C3] CR2: 0000558de83422d8 CR3: 0000000116ff7000 CR4: 0000000000350ee0
[  292.110664][    C3] Call Trace:
[  292.110667][    C3]  <TASK>
[  292.110670][    C3]  ? write_profile+0x440/0x440
[  292.110684][    C3]  arch_stack_walk+0x97/0xf0
[  292.110697][    C3]  ? ingress_init+0x165/0x220
[  292.110706][    C3]  stack_trace_save+0x96/0xd0
[  292.110716][    C3]  ? filter_irq_stacks+0x90/0x90
[  292.110727][    C3]  kasan_save_stack+0x20/0x40
[  292.110741][    C3]  ? kasan_save_stack+0x20/0x40
[  292.110753][    C3]  ? kasan_set_track+0x25/0x30
[  292.110766][    C3]  ? __kasan_kmalloc+0xa2/0xb0
[  292.110776][    C3]  ? tcf_block_get_ext+0x14f/0x1900
[  292.110785][    C3]  ? ingress_init+0x165/0x220
[  292.110792][    C3]  ? qdisc_create+0x4f7/0x1090
[  292.110799][    C3]  ? tc_modify_qdisc+0xd38/0x1b70
[  292.110806][    C3]  ? rtnetlink_rcv_msg+0x439/0xd30
[  292.110813][    C3]  ? netlink_rcv_skb+0x166/0x440
[  292.110824][    C3]  ? netlink_unicast+0x530/0x800
[  292.110835][    C3]  ? netlink_sendmsg+0x90b/0xe10
[  292.110847][    C3]  ? find_held_lock+0x2d/0x110
[  292.110857][    C3]  ? __kmem_cache_alloc_node+0x49/0x330
[  292.110867][    C3]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.110878][    C3]  ? kasan_unpoison+0x44/0x70
[  292.110885][    C3]  ? __kasan_slab_alloc+0x4e/0x90
[  292.110896][    C3]  ? __kmem_cache_alloc_node+0x19e/0x330
[  292.110906][    C3]  ? tcf_block_get_ext+0x14f/0x1900
[  292.110915][    C3]  kasan_set_track+0x25/0x30
[  292.110925][    C3]  __kasan_kmalloc+0xa2/0xb0
[  292.110935][    C3]  tcf_block_get_ext+0x14f/0x1900
[  292.110945][    C3]  ingress_init+0x165/0x220
[  292.110952][    C3]  ? clsact_init+0x330/0x330
[  292.110959][    C3]  qdisc_create+0x4f7/0x1090
[  292.110967][    C3]  ? tc_get_qdisc+0xb80/0xb80
[  292.110975][    C3]  tc_modify_qdisc+0xd38/0x1b70
[  292.110984][    C3]  ? qdisc_create+0x1090/0x1090
[  292.110994][    C3]  ? bpf_lsm_capable+0x9/0x10
[  292.111004][    C3]  ? qdisc_create+0x1090/0x1090
[  292.111012][    C3]  rtnetlink_rcv_msg+0x439/0xd30
[  292.111019][    C3]  ? rtnl_getlink+0xb10/0xb10
[  292.111025][    C3]  ? __x64_sys_sendmmsg+0x9c/0x100
[  292.111034][    C3]  ? do_syscall_64+0x38/0xb0
[  292.111043][    C3]  ? netdev_core_pick_tx+0x390/0x390
[  292.111056][    C3]  netlink_rcv_skb+0x166/0x440
[  292.111067][    C3]  ? rtnl_getlink+0xb10/0xb10
[  292.111074][    C3]  ? netlink_ack+0x1370/0x1370
[  292.111085][    C3]  ? kasan_set_track+0x25/0x30
[  292.111097][    C3]  ? netlink_deliver_tap+0x1b1/0xd00
[  292.111109][    C3]  netlink_unicast+0x530/0x800
[  292.111121][    C3]  ? netlink_attachskb+0x880/0x880
[  292.111132][    C3]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  292.111142][    C3]  ? __phys_addr_symbol+0x30/0x70
[  292.111153][    C3]  ? __check_object_size+0x323/0x740
[  292.111161][    C3]  netlink_sendmsg+0x90b/0xe10
[  292.111173][    C3]  ? netlink_unicast+0x800/0x800
[  292.111185][    C3]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  292.111193][    C3]  ? netlink_unicast+0x800/0x800
[  292.111204][    C3]  sock_sendmsg+0xd9/0x180
[  292.111216][    C3]  ____sys_sendmsg+0x264/0x910
[  292.111227][    C3]  ? kernel_sendmsg+0x50/0x50
[  292.111238][    C3]  ? __copy_msghdr+0x460/0x460
[  292.111246][    C3]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.111258][    C3]  ___sys_sendmsg+0x11d/0x1b0
[  292.111266][    C3]  ? do_recvmmsg+0x700/0x700
[  292.111274][    C3]  ? __fget_files+0x260/0x420
[  292.111284][    C3]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.111296][    C3]  ? __fget_files+0x282/0x420
[  292.111307][    C3]  ? __fget_light+0xe6/0x270
[  292.111318][    C3]  __sys_sendmmsg+0x18e/0x430
[  292.111327][    C3]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  292.111334][    C3]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.111346][    C3]  ? rcu_is_watching+0x12/0xb0
[  292.111358][    C3]  ? __rseq_handle_notify_resume+0x5c7/0xfe0
[  292.111368][    C3]  ? __do_sys_rseq+0x750/0x750
[  292.111376][    C3]  ? unlock_page_memcg+0x2d0/0x2d0
[  292.111387][    C3]  ? do_futex+0x350/0x350
[  292.111395][    C3]  __x64_sys_sendmmsg+0x9c/0x100
[  292.111403][    C3]  ? syscall_enter_from_user_mode+0x26/0x80
[  292.111414][    C3]  do_syscall_64+0x38/0xb0
[  292.111421][    C3]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  292.111433][    C3] RIP: 0033:0x7f3f4319289d
[  292.111440][    C3] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 89 01 48
[  292.111447][    C3] RSP: 002b:00007f3f43035e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[  292.111454][    C3] RAX: ffffffffffffffda RBX: 0000558de6727258 RCX: 00007f3f4319289d
[  292.111459][    C3] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
[  292.111463][    C3] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  292.111467][    C3] R10: 0000000000000000 R11: 0000000000000246 R12: 0000558de672725c
[  292.111472][    C3] R13: 0000558de6723030 R14: 040000000000009f R15: 0000558de6727250
[  292.111479][    C3]  </TASK>
[  292.111481][    C4] NMI backtrace for cpu 4 skipped: idling at default_idle+0xf/0x20
[  292.092844][    C7] NMI backtrace for cpu 7 skipped: idling at default_idle+0xf/0x20
[  292.092832][    C2] NMI backtrace for cpu 2 skipped: idling at default_idle+0xf/0x20
[  292.092832][    C2] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.014 msecs
[  292.092820][    C1] NMI backtrace for cpu 1 skipped: idling at default_idle+0xf/0x20
[  292.092820][    C1] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.033 msecs
[  292.102824][    C6] NMI backtrace for cpu 6 skipped: idling at default_idle+0xf/0x20
[  292.102824][    C6] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.054 msecs
[  292.102812][    C0] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20
[  292.102812][    C0] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.092 msecs
[  292.112559][   T64] Kernel panic - not syncing: hung_task: blocked tasks
[  292.112565][   T64] CPU: 5 PID: 64 Comm: khungtaskd Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.112575][   T64] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  292.112582][   T64] Call Trace:
[  292.112586][   T64]  <TASK>
[  292.112590][   T64]  dump_stack_lvl+0xd9/0x1b0
[  292.112603][   T64]  panic+0x689/0x730
[  292.112615][   T64]  ? panic_smp_self_stop+0xa0/0xa0
[  292.112631][   T64]  ? __irq_work_queue_local+0x132/0x3f0
[  292.112641][   T64]  ? irq_work_queue+0x2a/0x70
[  292.112649][   T64]  ? __wake_up_klogd.part.0+0x99/0xf0
[  292.112663][   T64]  ? watchdog+0xc89/0x1150
[  292.112678][   T64]  watchdog+0xc9a/0x1150
[  292.112698][   T64]  ? proc_dohung_task_timeout_secs+0x90/0x90
[  292.112713][   T64]  kthread+0x334/0x430
[  292.112723][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  292.112734][   T64]  ret_from_fork+0x1f/0x30
[  292.112750][   T64]  </TASK>
[  292.112866][   T64] Kernel Offset: disabled
[  292.112866][   T64] Rebooting in 86400 seconds..
Peilin Ye May 26, 2023, 11:09 p.m. UTC | #8
On Fri, May 26, 2023 at 05:21:34PM -0300, Pedro Tammela wrote:
> On 26/05/2023 16:47, Jamal Hadi Salim wrote:
> > [...] Peilin, Pedro will post the new repro.
> 
> We tweaked the reproducer to:
> ---
> r0 = socket$netlink(0x10, 0x3, 0x0)
> r1 = socket(0x10, 0x803, 0x0)
> sendmsg$nl_route_sched(r1, &(0x7f0000000300)={0x0, 0x0,
> &(0x7f0000000240)={&(0x7f0000000380)=ANY=[], 0x24}}, 0x0)
> getsockname$packet(r1, &(0x7f0000000200)={0x11, 0x0, <r2=>0x0, 0x1, 0x0,
> 0x6, @broadcast}, &(0x7f0000000440)=0x14)
> sendmsg$nl_route(r0, &(0x7f0000000040)={0x0, 0x0,
> &(0x7f0000000000)={&(0x7f0000000080)=ANY=[@ANYBLOB="480000001000050700"/20,
> @ANYRES32=r2, @ANYBLOB="0000000000000000280012000900010076657468"], 0x48}},
> 0x0)
> sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f00000002c0)={0x0, 0x0,
> &(0x7f0000000280)={&(0x7f0000000540)=@newqdisc={0x30, 0x24, 0xf0b, 0x0, 0x0,
> {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}},
> [@qdisc_kind_options=@q_ingress={0xc}]}, 0x30}}, 0x0)
> sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f0000000340)={0x0, 0x0,
> &(0x7f00000000c0)={&(0x7f0000000580)=@newtfilter={0x3c, 0x2c, 0xd27, 0x0,
> 0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}, {0xc}},
> [@filter_kind_options=@f_flower={{0xb}, {0xc, 0x2,
> [@TCA_FLOWER_CLASSID={0x8}]}}]}, 0x3c}}, 0x0)
> r4 = socket$netlink(0x10, 0x3, 0x0)
> sendmmsg(r4, &(0x7f00000002c0), 0x40000000000009f, 0x0)
> r5 = socket$netlink(0x10, 0x3, 0x0)
> sendmmsg(r5, &(0x7f00000002c0), 0x40000000000009f, 0x0)
> ---
> 
> We then generate the C program with:
> syz-prog2c -sandbox none -enable net_dev -threaded -repeat 0 -prog
> peilin.syz > repro.c
> 
> Now here comes a very important detail. The above will create a new net
> namespace to shoot the netlink messages. We are only able to reproduce the
> deadlock with your patches if we comment the creation of the new namespace
> out:
> ---
> diff --git a/repro.c b/repro.c
> index ee8eb0726..5cdbfb289 100644
> --- a/repro.c
> +++ b/repro.c
> @@ -1121,9 +1121,8 @@ static int do_sandbox_none(void)
>    sandbox_common();
>    drop_caps();
>    initialize_netdevices_init();
> -  if (unshare(CLONE_NEWNET)) {
> -  }
> +  // Doesn't seem to deadlock in a new netns
> +  // if (unshare(CLONE_NEWNET)) {
> +  // }
>    write_file("/proc/sys/net/ipv4/ping_group_range", "0 65535");
>    initialize_netdevices();
>    setup_binderfs();
> 
> ---
> 
> The reason we did this was to check on the event with 'tc mon'.
> The splat is quite big, see attached. It has all the indications of a
> deadlock on the rtnl_lock.

Thanks a lot, I'll get right on it.

Peilin Ye
Jakub Kicinski May 27, 2023, 2:33 a.m. UTC | #9
On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> Thanks a lot, I'll get right on it.

Any insights? Is it just a live-lock inherent to the retry scheme 
or do we actually forget to release the lock/refcnt?
Peilin Ye May 27, 2023, 8:23 a.m. UTC | #10
Hi Jakub and all,

On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> > Thanks a lot, I'll get right on it.
>
> Any insights? Is it just a live-lock inherent to the retry scheme
> or we actually forget to release the lock/refcnt?

I think it's just a thread holding the RTNL mutex for too long (replaying
too many times).  We could replay an arbitrary number of times in
tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
requests for the old Qdisc.
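The check driving those replays can be modeled in plain C. This is a hedged userspace sketch: it mirrors the compare-and-swap semantics of qdisc_refcount_dec_if_one() as used in the diff below, not the kernel's actual refcount_t implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of qdisc_refcount_dec_if_one(): drop the reference only if it
 * is the last one.  While any RTNL-unlocked filter request still holds
 * an extra reference, this fails, and the grafting path returns
 * -EAGAIN and replays the whole request. */
static bool dec_if_one(atomic_int *refcnt)
{
    int expected = 1;
    return atomic_compare_exchange_strong(refcnt, &expected, 0);
}
```

With a steady stream of filter requests each bumping the refcount, the grafting task can keep failing this check on every replay, which is the "replaying too many times" behavior described above.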

I tested the new reproducer Pedro posted, on:

1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported

2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
   any issues (in about 30 minutes).

3. All 6 v5 patches, plus this diff:

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 286b7c58f5b9..988718ba5abe 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
                         * RTNL-unlocked filter request(s).  This is the counterpart of that
                         * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
                         */
-                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
+                               rtnl_unlock();
+                               rtnl_lock();
                                return -EAGAIN;
+                       }
                }

                if (dev->flags & IFF_UP)

   Did not trigger any issues (in about 30 minutes) either.

What would you suggest?

Thanks,
Peilin Ye
Jamal Hadi Salim May 28, 2023, 6:54 p.m. UTC | #11
On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> Hi Jakub and all,
>
> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> > > Thanks a lot, I'll get right on it.
> >
> > Any insights? Is it just a live-lock inherent to the retry scheme
> > or we actually forget to release the lock/refcnt?
>
> I think it's just a thread holding the RTNL mutex for too long (replaying
> too many times).  We could replay for arbitrary times in
> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
> requests for the old Qdisc.
>
> I tested the new reproducer Pedro posted, on:
>
> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>
> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>    any issues (in about 30 minutes).
>
> 3. All 6 v5 patches, plus this diff:
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..988718ba5abe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>                          */
> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> +                               rtnl_unlock();
> +                               rtnl_lock();
>                                 return -EAGAIN;
> +                       }
>                 }
>
>                 if (dev->flags & IFF_UP)
>
>    Did not trigger any issues (in about 30 minutes) either.
>
> What would you suggest?


I am more worried it is a whack-a-mole situation. We fixed the first
reproducer with essentially patches 1-4 but we opened a new one which
the second reproducer catches. One thing the current reproducer does
is create a lot of rtnl contention in the beginning by creating all those
devices; after that it is just creating/deleting qdiscs and doing
updates with flower, where such contention is reduced, i.e. it may just
take longer for the mole to pop up.

Why don't we push the V1 patch in and then worry about getting clever
with EAGAIN after? Can you test the V1 version with the repro Pedro
posted? It shouldn't have these issues. Also it would be interesting to
see how performance of the parallel updates to flower is affected.

cheers,
jamal

> Thanks,
> Peilin Ye
>
Vlad Buslov May 29, 2023, 11:50 a.m. UTC | #12
On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>>
>> Hi Jakub and all,
>>
>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
>> > > Thanks a lot, I'll get right on it.
>> >
>> > Any insights? Is it just a live-lock inherent to the retry scheme
>> > or we actually forget to release the lock/refcnt?
>>
>> I think it's just a thread holding the RTNL mutex for too long (replaying
>> too many times).  We could replay for arbitrary times in
>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
>> requests for the old Qdisc.

After looking very carefully at the code I think I know what the issue
might be:

   Task 1 graft Qdisc   Task 2 new filter
           +                    +
           |                    |
           v                    v
        rtnl_lock()       take  q->refcnt
           +                    +
           |                    |
           v                    v
Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN

This will cause a real deadlock with the proposed patch. I'll try to
come up with a better approach. Sorry for not seeing it earlier.
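The circular wait in the diagram can be reproduced with a small userspace model. Everything here is a stand-in: a pthread mutex plays the RTNL mutex, an atomic int plays q->refcnt, and the function names are illustrative, not kernel APIs.

```c
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t rtnl = PTHREAD_MUTEX_INITIALIZER;
static atomic_int refcnt;

/* Task 2: new-filter path -- takes the Qdisc reference first, then
 * blocks on "RTNL", exactly the ordering in the diagram. */
static void *new_filter(void *arg)
{
    (void)arg;
    atomic_fetch_add(&refcnt, 1);
    pthread_mutex_lock(&rtnl);
    atomic_fetch_sub(&refcnt, 1);
    pthread_mutex_unlock(&rtnl);
    return NULL;
}

/* Task 1: graft path.  Spins under the lock waiting for refcnt to drop
 * to 1; returns the value it keeps observing.  It stays 2, because the
 * only holder of the extra reference is blocked on the mutex we own. */
int graft_observed_refcnt(void)
{
    pthread_t t;
    int seen, i;

    atomic_store(&refcnt, 1);
    pthread_mutex_lock(&rtnl);
    pthread_create(&t, NULL, new_filter, NULL);

    while (atomic_load(&refcnt) != 2)   /* wait for Task 2's reference */
        ;
    for (i = 0; i < 1000000; i++)       /* bounded stand-in for "replay" */
        if (atomic_load(&refcnt) == 1)
            break;
    seen = atomic_load(&refcnt);

    pthread_mutex_unlock(&rtnl);        /* only so the model can exit */
    pthread_join(t, NULL);
    return seen;
}
```

However long the bounded loop runs, refcnt never reaches 1 while the lock is held: no progress is possible in either task.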

>>
>> I tested the new reproducer Pedro posted, on:
>>
>> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>>
>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>>    any issues (in about 30 minutes).
>>
>> 3. All 6 v5 patches, plus this diff:
>>
>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>> index 286b7c58f5b9..988718ba5abe 100644
>> --- a/net/sched/sch_api.c
>> +++ b/net/sched/sch_api.c
>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>>                          */
>> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
>> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
>> +                               rtnl_unlock();
>> +                               rtnl_lock();
>>                                 return -EAGAIN;
>> +                       }
>>                 }
>>
>>                 if (dev->flags & IFF_UP)
>>
>>    Did not trigger any issues (in about 30 minutes) either.
>>
>> What would you suggest?
>
>
> I am more worried it is a whack-a-mole situation. We fixed the first
> reproducer with essentially patches 1-4 but we opened a new one which
> the second reproducer catches. One thing the current reproducer does
> is create a lot of rtnl contention in the beginning by creating all those
> devices and then after it is just creating/deleting qdisc and doing
> update with flower where such contention is reduced. i.e it may just
> take longer for the mole to pop up.
>
> Why don't we push the V1 patch in and then worry about getting clever
> with EAGAIN after? Can you test the V1 version with the repro Pedro
> posted? It shouldn't have these issues. Also it would be interesting to
> see how performance of the parallel updates to flower is affected.

This or at least push first 4 patches of this series. They target other
older commits and fix straightforward issues with the API.
Vlad Buslov May 29, 2023, 12:58 p.m. UTC | #13
On Mon 29 May 2023 at 14:50, Vlad Buslov <vladbu@nvidia.com> wrote:
> On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>>>
>>> Hi Jakub and all,
>>>
>>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
>>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
>>> > > Thanks a lot, I'll get right on it.
>>> >
>>> > Any insights? Is it just a live-lock inherent to the retry scheme
>>> > or we actually forget to release the lock/refcnt?
>>>
>>> I think it's just a thread holding the RTNL mutex for too long (replaying
>>> too many times).  We could replay for arbitrary times in
>>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
>>> requests for the old Qdisc.
>
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.
>

Followup: I considered two approaches for preventing the deadlock:

- Refactor cls_api to always obtain the lock before taking a reference
  to Qdisc. I started implementing PoC moving the rtnl_lock() call in
  tc_new_tfilter() before __tcf_qdisc_find() and decided it is not
  feasible because cls_api will still try to obtain rtnl_lock when
  offloading a filter to a device with non-unlocked driver or after
  releasing the lock when loading a classifier module.

- Account for such cls_api behavior in sch_api by dropping and
  re-taking the lock before replaying. This actually seems to be quite
  straightforward since the 'replay' functionality we are reusing for
  this is designed for similar behavior - it releases rtnl lock before
  loading a sch module, takes the lock again and safely replays the
  function by re-obtaining all the necessary data.

If livelock with concurrent filters insertion is an issue, then it can
be remedied by setting a new Qdisc->flags bit
"DELETED-REJECT-NEW-FILTERS" and checking for it together with
QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
insertion coming after the flag is set to synchronize on rtnl lock.

Thoughts?
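For the second bullet, a userspace model (hedged, as before: a pthread mutex stands in for the RTNL mutex, an atomic int for q->refcnt, and the names are illustrative) shows why dropping and re-taking the lock around the replay breaks the circular wait:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static pthread_mutex_t rtnl = PTHREAD_MUTEX_INITIALIZER;
static atomic_int refcnt;

static void *new_filter(void *arg)
{
    (void)arg;
    atomic_fetch_add(&refcnt, 1);   /* reference taken before the lock */
    pthread_mutex_lock(&rtnl);      /* succeeds once graft drops it */
    atomic_fetch_sub(&refcnt, 1);
    pthread_mutex_unlock(&rtnl);
    return NULL;
}

/* Graft with the proposed behavior: on each failed check, release and
 * re-take the lock before replaying, giving the reference holder a
 * chance to run. */
int graft_with_replay(void)
{
    pthread_t t;

    atomic_store(&refcnt, 1);
    pthread_mutex_lock(&rtnl);
    pthread_create(&t, NULL, new_filter, NULL);
    while (atomic_load(&refcnt) != 2)   /* wait for the extra reference */
        ;
    while (atomic_load(&refcnt) != 1) { /* the replay loop */
        pthread_mutex_unlock(&rtnl);    /* the crucial drop */
        sched_yield();
        pthread_mutex_lock(&rtnl);
    }
    pthread_mutex_unlock(&rtnl);
    pthread_join(t, NULL);
    return atomic_load(&refcnt);
}
```

Unlike the spin-under-lock version, this loop terminates: once the filter task gets the mutex it drops its reference, and a later check sees refcnt == 1.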

>>>
>>> I tested the new reproducer Pedro posted, on:
>>>
>>> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>>>
>>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>>>    any issues (in about 30 minutes).
>>>
>>> 3. All 6 v5 patches, plus this diff:
>>>
>>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>>> index 286b7c58f5b9..988718ba5abe 100644
>>> --- a/net/sched/sch_api.c
>>> +++ b/net/sched/sch_api.c
>>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>>>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>>>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>>>                          */
>>> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
>>> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
>>> +                               rtnl_unlock();
>>> +                               rtnl_lock();
>>>                                 return -EAGAIN;
>>> +                       }
>>>                 }
>>>
>>>                 if (dev->flags & IFF_UP)
>>>
>>>    Did not trigger any issues (in about 30 minutes) either.
>>>
>>> What would you suggest?
>>
>>
>> I am more worried it is a whack-a-mole situation. We fixed the first
>> reproducer with essentially patches 1-4 but we opened a new one which
>> the second reproducer catches. One thing the current reproducer does
>> is create a lot of rtnl contention in the beginning by creating all those
>> devices and then after it is just creating/deleting qdisc and doing
>> update with flower where such contention is reduced. i.e it may just
>> take longer for the mole to pop up.
>>
>> Why don't we push the V1 patch in and then worry about getting clever
>> with EAGAIN after? Can you test the V1 version with the repro Pedro
>> posted? It shouldn't have these issues. Also it would be interesting to
>> see how performance of the parallel updates to flower is affected.
>
> This or at least push first 4 patches of this series. They target other
> older commits and fix straightforward issues with the API.
Jamal Hadi Salim May 29, 2023, 1:55 p.m. UTC | #14
On Mon, May 29, 2023 at 8:06 AM Vlad Buslov <vladbu@nvidia.com> wrote:
>
> On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> >>
> >> Hi Jakub and all,
> >>
> >> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> >> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> >> > > Thanks a lot, I'll get right on it.
> >> >
> >> > Any insights? Is it just a live-lock inherent to the retry scheme
> >> > or we actually forget to release the lock/refcnt?
> >>
> >> I think it's just a thread holding the RTNL mutex for too long (replaying
> >> too many times).  We could replay for arbitrary times in
> >> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
> >> requests for the old Qdisc.
>
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.
>
> >>
> >> I tested the new reproducer Pedro posted, on:
> >>
> >> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
> >>
> >> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
> >>    any issues (in about 30 minutes).
> >>
> >> 3. All 6 v5 patches, plus this diff:
> >>
> >> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> >> index 286b7c58f5b9..988718ba5abe 100644
> >> --- a/net/sched/sch_api.c
> >> +++ b/net/sched/sch_api.c
> >> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> >>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
> >>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> >>                          */
> >> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> >> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> >> +                               rtnl_unlock();
> >> +                               rtnl_lock();
> >>                                 return -EAGAIN;
> >> +                       }
> >>                 }
> >>
> >>                 if (dev->flags & IFF_UP)
> >>
> >>    Did not trigger any issues (in about 30 minutes) either.
> >>
> >> What would you suggest?
> >
> >
> > I am more worried it is a whack-a-mole situation. We fixed the first
> > reproducer with essentially patches 1-4 but we opened a new one which
> > the second reproducer catches. One thing the current reproducer does
> > is create a lot of rtnl contention in the beginning by creating all those
> > devices and then after it is just creating/deleting qdisc and doing
> > update with flower where such contention is reduced. i.e it may just
> > take longer for the mole to pop up.
> >
> > Why don't we push the V1 patch in and then worry about getting clever
> > with EAGAIN after? Can you test the V1 version with the repro Pedro
> > posted? It shouldn't have these issues. Also it would be interesting to
> > see how performance of the parallel updates to flower is affected.
>
> This or at least push first 4 patches of this series. They target other
> older commits and fix straightforward issues with the API.


Yes, lets get patch 1-4 in first ...

cheers,
jamal
Peilin Ye May 29, 2023, 7:14 p.m. UTC | #15
On Mon, May 29, 2023 at 09:55:51AM -0400, Jamal Hadi Salim wrote:
> On Mon, May 29, 2023 at 8:06 AM Vlad Buslov <vladbu@nvidia.com> wrote:
> > On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > Why don't we push the V1 patch in and then worry about getting clever
> > > with EAGAIN after? Can you test the V1 version with the repro Pedro
> > > posted? It shouldn't have these issues. Also it would be interesting to
> > > see how performance of the parallel updates to flower is affected.
> >
> > This or at least push first 4 patches of this series. They target other
> > older commits and fix straightforward issues with the API.
> 
> Yes, lets get patch 1-4 in first ...

Sure, I'll send 1-4 as v6 first.

Thanks,
Peilin Ye
Jakub Kicinski May 30, 2023, 1:03 a.m. UTC | #16
On Mon, 29 May 2023 15:58:50 +0300 Vlad Buslov wrote:
> If livelock with concurrent filters insertion is an issue, then it can
> be remedied by setting a new Qdisc->flags bit
> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> insertion coming after the flag is set to synchronize on rtnl lock.

Sounds very nice, yes.
Peilin Ye May 30, 2023, 9:11 a.m. UTC | #17
On Mon, May 29, 2023 at 02:50:26PM +0300, Vlad Buslov wrote:
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.

Thanks a lot for pointing this out!  The reproducers add flower filters to
ingress Qdiscs so I didn't think of rtnl_lock()'ed filter requests...

On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
> - Account for such cls_api behavior in sch_api by dropping and
>   re-taking the lock before replaying. This actually seems to be quite
>   straightforward since the 'replay' functionality we are reusing for
>   this is designed for similar behavior - it releases rtnl lock before
>   loading a sch module, takes the lock again and safely replays the
>   function by re-obtaining all the necessary data.

Yes, I've tested this using that reproducer Pedro posted.

On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
> If livelock with concurrent filters insertion is an issue, then it can
> be remedied by setting a new Qdisc->flags bit
> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> insertion coming after the flag is set to synchronize on rtnl lock.

Thanks for the suggestion!  I'll try this approach.

Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
later than Qdisc is flagged as being-deleted) sync on RTNL lock without
(before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
even longer?).

Thanks,
Peilin Ye
Vlad Buslov May 30, 2023, 12:18 p.m. UTC | #18
On Tue 30 May 2023 at 02:11, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Mon, May 29, 2023 at 02:50:26PM +0300, Vlad Buslov wrote:
>> After looking very carefully at the code I think I know what the issue
>> might be:
>>
>>    Task 1 graft Qdisc   Task 2 new filter
>>            +                    +
>>            |                    |
>>            v                    v
>>         rtnl_lock()       take  q->refcnt
>>            +                    +
>>            |                    |
>>            v                    v
>> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>>
>> This will cause a real deadlock with the proposed patch. I'll try to
>> come up with a better approach. Sorry for not seeing it earlier.
>
> Thanks a lot for pointing this out!  The reproducers add flower filters to
> ingress Qdiscs so I didn't think of rtnl_lock()'ed filter requests...
>
> On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
>> - Account for such cls_api behavior in sch_api by dropping and
>>   re-taking the lock before replaying. This actually seems to be quite
>>   straightforward since the 'replay' functionality we are reusing for
>>   this is designed for similar behavior - it releases rtnl lock before
>>   loading a sch module, takes the lock again and safely replays the
>>   function by re-obtaining all the necessary data.
>
> Yes, I've tested this using that reproducer Pedro posted.
>
> On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
>> If livelock with concurrent filters insertion is an issue, then it can
>> be remedied by setting a new Qdisc->flags bit
>> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> insertion coming after the flag is set to synchronize on rtnl lock.
>
> Thanks for the suggestion!  I'll try this approach.
>
> Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> even longer?).

Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
already returns -EINVAL when q->refcnt is zero, so maybe returning
-EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flag is
set is also fine? Would be much easier to implement as opposed to moving
rtnl_lock there.
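The shape of that early reject, as a hedged sketch (the struct and function here are illustrative stand-ins, not the real struct Qdisc or __tcf_qdisc_find(); the TCQ_F_DESTROYING bit matches the flag the diff further down this thread introduces):

```c
#include <errno.h>

#define TCQ_F_DESTROYING 0x400  /* "deleted, reject new filter requests" */

struct qdisc_stub {
    unsigned int flags;
    unsigned int refcnt;
};

/* A dying Qdisc gets the same -EINVAL a zero-refcnt Qdisc already gets
 * today, so late filter requests fail fast instead of taking another
 * reference and forcing the grafting task to replay yet again. */
int filter_find_check(const struct qdisc_stub *q)
{
    if (q->refcnt == 0)
        return -EINVAL;
    if (q->flags & TCQ_F_DESTROYING)
        return -EINVAL;
    return 0;
}
```

This keeps the synchronization burden on the (rare) delete path rather than moving rtnl_lock around in the filter path.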
Peilin Ye May 31, 2023, 12:29 a.m. UTC | #19
On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
> >> If livelock with concurrent filters insertion is an issue, then it can
> >> be remedied by setting a new Qdisc->flags bit
> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> >> insertion coming after the flag is set to synchronize on rtnl lock.
> >
> > Thanks for the suggestion!  I'll try this approach.
> >
> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> > even longer?).
> 
> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
> already returns -EINVAL when q->refcnt is zero, so maybe returning
> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flag is
> set is also fine? Would be much easier to implement as opposed to moving
> rtnl_lock there.

Ah, I see, sure.

Thanks,
Peilin Ye
Peilin Ye June 1, 2023, 3:57 a.m. UTC | #20
Hi Vlad and all,

On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
> >> If livelock with concurrent filters insertion is an issue, then it can
> >> be remedied by setting a new Qdisc->flags bit
> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> >> insertion coming after the flag is set to synchronize on rtnl lock.
> >
> > Thanks for the suggestion!  I'll try this approach.
> >
> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> > even longer?).
> 
> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
> already returns -EINVAL when q->refcnt is zero, so maybe returning
> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flag is
> set is also fine? Would be much easier to implement as opposed to moving
> rtnl_lock there.

I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
"request A") while it has a lot of ongoing filter requests, and here's the
result:

                        #1         #2         #3         #4
  ----------------------------------------------------------
   a. refcnt            89         93        230        571
   b. replayed     167,568    196,450    336,291    878,027
   c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
           user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
            sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s

   a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
   b. is the number of times A has been replayed;
   c. is the time(1) output for A.

a. and b. are collected from printk() output.  This is better than before,
but A could still be replayed hundreds of thousands of times and hang
for a few seconds.

Is this okay?  If not, is it possible (or should we) to make A really
_wait_ on Qdisc refcnt, instead of "busy-replaying"?

Thanks,
Peilin Ye

[1] Diff against v5 patch 6 (printk() calls not included):

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 3e9cc43cbc90..de7b0538b309 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -94,6 +94,7 @@ struct Qdisc {
 #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
 #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
 #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
+#define TCQ_F_DESTROYING       0x400 /* destroying, reject filter requests */
        u32                     limit;
        const struct Qdisc_ops  *ops;
        struct qdisc_size_table __rcu *stab;
@@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
        return !READ_ONCE(qdisc->q.qlen);
 }

+static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
+{
+       return qdisc->flags & TCQ_F_DESTROYING;
+}
+
 /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with
  * the qdisc root lock acquired.
  */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 2621550bfddc..3e7f6f286ac0 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
                *parent = (*q)->handle;
        } else {
                *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
-               if (!*q) {
+               if (!*q || qdisc_is_destroying(*q)) {
                        NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
                        err = -EINVAL;
                        goto errout_rcu;
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 286b7c58f5b9..d6e47546c7fe 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
                                return -ENOENT;
                        }

-                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
-                        * RTNL-unlocked filter request(s).  This is the counterpart of that
-                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
+                        * accepting any more by marking it as "being destroyed", then tell the
+                        * caller to replay by returning -EAGAIN.
                         */
-                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+                       q = dev_queue->qdisc_sleeping;
+                       if (!qdisc_refcount_dec_if_one(q)) {
+                               q->flags |= TCQ_F_DESTROYING;
+                               rtnl_unlock();
+                               schedule();
+                               rtnl_lock();
                                return -EAGAIN;
+                       }
                }

                if (dev->flags & IFF_UP)
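For illustration, the -EAGAIN contract the diff above relies on can be sketched in plain userspace C. All names below are hypothetical stand-ins, not the kernel API: the grafting task keeps replaying while concurrent filter requests hold references, and the "destroying" flag stops new requests from taking references so the refcount can eventually drop to one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel objects involved. */
struct fake_qdisc {
	int refcnt;          /* models qdisc_refcount_*            */
	bool destroying;     /* models the proposed TCQ_F_DESTROYING */
};

/* Models qdisc_refcount_dec_if_one(): succeeds only when we hold
 * the last reference, i.e. no filter request is in flight. */
static bool dec_if_one(struct fake_qdisc *q)
{
	if (q->refcnt != 1)
		return false;
	q->refcnt = 0;
	return true;
}

/* Models the __tcf_qdisc_find() check in the diff: new filter
 * requests are rejected once the qdisc is flagged as being
 * destroyed, so they never take a reference. */
static bool filter_request_take_ref(struct fake_qdisc *q)
{
	if (q->destroying)
		return false;
	q->refcnt++;
	return true;
}

/* Models the qdisc_graft() replay loop: flag the qdisc, then keep
 * replaying (returning -EAGAIN in the kernel) while in-flight
 * requests drain; here each iteration retires one request. */
static int graft_with_replay(struct fake_qdisc *q)
{
	int replays = 0;

	while (!dec_if_one(q)) {
		q->destroying = true;   /* reject any new requests  */
		q->refcnt--;            /* one in-flight op finishes */
		replays++;
	}
	return replays;
}
```

The point of the sketch is that once the flag is set, the refcount can only go down, so the replay loop terminates in at most refcnt - 1 iterations.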
Vlad Buslov June 1, 2023, 6:20 a.m. UTC | #21
On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> Hi Vlad and all,
>
> On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
>> >> If livelock with concurrent filters insertion is an issue, then it can
>> >> be remedied by setting a new Qdisc->flags bit
>> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> >> insertion coming after the flag is set to synchronize on rtnl lock.
>> >
>> > Thanks for the suggestion!  I'll try this approach.
>> >
>> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>> > even longer?).
>> 
>> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
>> set is also fine? Would be much easier to implement as opposed to moving
>> rtnl_lock there.
>
> I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
> 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
> "request A") while it has a lot of ongoing filter requests, and here's the
> result:
>
>                         #1         #2         #3         #4
>   ----------------------------------------------------------
>    a. refcnt            89         93        230        571
>    b. replayed     167,568    196,450    336,291    878,027
>    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
>
>    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>    b. is the number of times A has been replayed;
>    c. is the time(1) output for A.
>
> a. and b. are collected from printk() output.  This is better than before,
> but A could still be replayed for hundreds of thousands of times and hang
> for a few seconds.

I don't get where the few seconds of waiting time come from. I'm probably
missing something obvious here, but the waiting time should be the
maximum filter op latency of a new/get/del filter request that is already
in-flight (i.e. has already passed the qdisc_is_destroying() check), and it
should take several orders of magnitude less time.

>
> Is this okay?  If not, is it possible (or should we) to make A really
> _wait_ on Qdisc refcnt, instead of "busy-replaying"?
>
> Thanks,
> Peilin Ye
>
> [1] Diff against v5 patch 6 (printk() calls not included):
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3e9cc43cbc90..de7b0538b309 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -94,6 +94,7 @@ struct Qdisc {
>  #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>  #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>  #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +#define TCQ_F_DESTROYING       0x400 /* destroying, reject filter requests */
>         u32                     limit;
>         const struct Qdisc_ops  *ops;
>         struct qdisc_size_table __rcu *stab;
> @@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>         return !READ_ONCE(qdisc->q.qlen);
>  }
>
> +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
> +{
> +       return qdisc->flags & TCQ_F_DESTROYING;

Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing
flags since they are now used in unlocked filter code path?

> +}
> +
>  /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with
>   * the qdisc root lock acquired.
>   */
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2621550bfddc..3e7f6f286ac0 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
>                 *parent = (*q)->handle;
>         } else {
>                 *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
> -               if (!*q) {
> +               if (!*q || qdisc_is_destroying(*q)) {
>                         NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
>                         err = -EINVAL;
>                         goto errout_rcu;
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..d6e47546c7fe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                                 return -ENOENT;
>                         }
>
> -                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
> -                        * RTNL-unlocked filter request(s).  This is the counterpart of that
> -                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
> +                        * accepting any more by marking it as "being destroyed", then tell the
> +                        * caller to replay by returning -EAGAIN.
>                          */
> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +                       q = dev_queue->qdisc_sleeping;
> +                       if (!qdisc_refcount_dec_if_one(q)) {
> +                               q->flags |= TCQ_F_DESTROYING;
> +                               rtnl_unlock();
> +                               schedule();
> +                               rtnl_lock();
>                                 return -EAGAIN;
> +                       }
>                 }
>
>                 if (dev->flags & IFF_UP)
Pedro Tammela June 1, 2023, 1:03 p.m. UTC | #22
On 01/06/2023 00:57, Peilin Ye wrote:
> Hi Vlad and all,
> 
> On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
>>>> If livelock with concurrent filters insertion is an issue, then it can
>>>> be remedied by setting a new Qdisc->flags bit
>>>> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>>>> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>>>> insertion coming after the flag is set to synchronize on rtnl lock.
>>>
>>> Thanks for the suggestion!  I'll try this approach.
>>>
>>> Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>>> the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>>> later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>>> (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>>> even longer?).
>>
>> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
>> set is also fine? Would be much easier to implement as opposed to moving
>> rtnl_lock there.
> 
> I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
> 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
> "request A") while it has a lot of ongoing filter requests, and here's the
> result:
> 
>                          #1         #2         #3         #4
>    ----------------------------------------------------------
>     a. refcnt            89         93        230        571
>     b. replayed     167,568    196,450    336,291    878,027
>     c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>             user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>              sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
> 
>     a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>     b. is the number of times A has been replayed;
>     c. is the time(1) output for A.
> 
> a. and b. are collected from printk() output.  This is better than before,
> but A could still be replayed for hundreds of thousands of times and hang
> for a few seconds.
> 
> Is this okay?  If not, is it possible (or should we) to make A really
> _wait_ on Qdisc refcnt, instead of "busy-replaying"?
> 
> Thanks,
> Peilin Ye
> 
> [1] Diff against v5 patch 6 (printk() calls not included):
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3e9cc43cbc90..de7b0538b309 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -94,6 +94,7 @@ struct Qdisc {
>   #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>   #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>   #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +#define TCQ_F_DESTROYING       0x400 /* destroying, reject filter requests */
>          u32                     limit;
>          const struct Qdisc_ops  *ops;
>          struct qdisc_size_table __rcu *stab;
> @@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>          return !READ_ONCE(qdisc->q.qlen);
>   }
> 
> +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
> +{
> +       return qdisc->flags & TCQ_F_DESTROYING;
> +}
> +
>   /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with
>    * the qdisc root lock acquired.
>    */
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2621550bfddc..3e7f6f286ac0 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
>                  *parent = (*q)->handle;
>          } else {
>                  *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
> -               if (!*q) {
> +               if (!*q || qdisc_is_destroying(*q)) {
>                          NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
>                          err = -EINVAL;
>                          goto errout_rcu;
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..d6e47546c7fe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                                  return -ENOENT;
>                          }
> 
> -                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
> -                        * RTNL-unlocked filter request(s).  This is the counterpart of that
> -                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
> +                        * accepting any more by marking it as "being destroyed", then tell the
> +                        * caller to replay by returning -EAGAIN.
>                           */
> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +                       q = dev_queue->qdisc_sleeping;
> +                       if (!qdisc_refcount_dec_if_one(q)) {
> +                               q->flags |= TCQ_F_DESTROYING;
> +                               rtnl_unlock();
> +                               schedule();
Was this intended, or is it just a leftover?
rtnl_lock() would reschedule if needed, as it's a mutex_lock().
> +                               rtnl_lock();
>                                  return -EAGAIN;
> +                       }
>                  }
> 
>                  if (dev->flags & IFF_UP)
>
Peilin Ye June 7, 2023, 12:57 a.m. UTC | #23
On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
> On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
> > +{
> > +       return qdisc->flags & TCQ_F_DESTROYING;
> 
> Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing
> flags since they are now used in unlocked filter code path?

Thanks, after taking another look at cls_api.c, I noticed this code in
tc_new_tfilter():

	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
			      flags, extack);
	if (err == 0) {
		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
			       RTM_NEWTFILTER, false, rtnl_held, extack);
		tfilter_put(tp, fh);
		/* q pointer is NULL for shared blocks */
		if (q)
			q->flags &= ~TCQ_F_CAN_BYPASS;
	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
isn't atomic [1].

We also have this:

  ->dequeue()
    htb_dequeue()
      htb_dequeue_tree()
        qdisc_warn_nonwc():

  void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
  {
          if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
                  pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
                          txt, qdisc->ops->id, qdisc->handle >> 16);
                  qdisc->flags |= TCQ_F_WARN_NONWC;
          }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  }
  EXPORT_SYMBOL(qdisc_warn_nonwc);

Also non-atomic; isn't it possible for the above 2 underlined statements to
race with each other?  If true, I think we need to change Qdisc::flags to
use atomic bitops, just like what we're doing for Qdisc::state and
::state2.  It feels like a separate TODO, however.
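The race described above is a classic lost update on a plain read-modify-write. A minimal userspace C11 sketch of the fix being proposed (not kernel code; the kernel would use set_bit()/clear_bit() on an unsigned long, and the flag names here are just illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits in the style of TCQ_F_*; the actual
 * values don't matter for the sketch. */
#define F_CAN_BYPASS  0x4
#define F_WARN_NONWC  0x8

/* With a plain "unsigned int flags", "flags |= bit" and
 * "flags &= ~bit" are separate load/modify/store sequences, so two
 * concurrent updates can lose one of the bits.  An atomic
 * fetch_or/fetch_and (like the kernel's set_bit()/clear_bit())
 * makes each update a single atomic RMW, so no update is lost. */
static _Atomic unsigned int atomic_flags;

static void set_warn_nonwc(void)
{
	atomic_fetch_or(&atomic_flags, F_WARN_NONWC);
}

static void clear_can_bypass(void)
{
	atomic_fetch_and(&atomic_flags, ~F_CAN_BYPASS);
}
```

With atomic RMWs, the two underlined statements above can interleave in either order and both bit changes survive.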

I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
::state2, but not sure if it's okay to extend it for our purpose.

Thanks,
Peilin Ye

[1] Compiled to this on my Intel 64:

   0x0000000000017788 <+6472>:	83 62 10 fb        	andl   $0xfffffffb,0x10(%rdx)
Peilin Ye June 7, 2023, 4:25 a.m. UTC | #24
Hi Pedro,

On Thu, Jun 01, 2023 at 10:03:22AM -0300, Pedro Tammela wrote:
> On 01/06/2023 00:57, Peilin Ye wrote:
> > -                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
> > -                        * RTNL-unlocked filter request(s).  This is the counterpart of that
> > -                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> > +                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
> > +                        * accepting any more by marking it as "being destroyed", then tell the
> > +                        * caller to replay by returning -EAGAIN.
> >                           */
> > -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> > +                       q = dev_queue->qdisc_sleeping;
> > +                       if (!qdisc_refcount_dec_if_one(q)) {
> > +                               q->flags |= TCQ_F_DESTROYING;
> > +                               rtnl_unlock();
> > +                               schedule();
>
> Was this intended or just a leftover?
> rtnl_lock() would reschedule if needed as it's a mutex_lock

In qdisc_create():

			rtnl_unlock();
			request_module("sch_%s", name);
			rtnl_lock();
			ops = qdisc_lookup_ops(kind);
			if (ops != NULL) {
				/* We will try again qdisc_lookup_ops,
				 * so don't keep a reference.
				 */
				module_put(ops->owner);
				err = -EAGAIN;
				goto err_out;

If qdisc_lookup_ops() returns !NULL, it means we've successfully loaded the
module, so we know that the next replay should (normally) succeed.

However, in the diff I posted, if qdisc_refcount_dec_if_one() returned
false, we know that we should step back and wait a bit, especially when
there are ongoing RTNL-locked filter requests: we want them to grab the RTNL
mutex before us.  That was my intention here: unconditionally give up
the CPU.  I haven't tested whether it really makes a difference, though.
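The intended unlock/yield/relock pattern can be approximated in userspace like this (a sketch only; fake_rtnl stands in for the RTNL mutex, sched_yield() for schedule(), and -11 for -EAGAIN):

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-in for the RTNL mutex. */
static pthread_mutex_t fake_rtnl = PTHREAD_MUTEX_INITIALIZER;

/* Models the back-off in the diff: drop the lock, explicitly yield
 * the CPU so lock waiters (e.g. RTNL-locked filter requests) get a
 * chance to run, then re-take the lock before returning -EAGAIN so
 * the caller replays the whole request. */
static int graft_backoff_and_replay(void)
{
	pthread_mutex_unlock(&fake_rtnl);
	sched_yield();                  /* models schedule() */
	pthread_mutex_lock(&fake_rtnl);
	return -11;                     /* models -EAGAIN */
}
```

Whether the explicit yield helps depends on the scheduler; as noted above, mutex_lock() alone may already be enough to let waiters in.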

Thanks,
Peilin Ye
Vlad Buslov June 7, 2023, 8:18 a.m. UTC | #25
On Tue 06 Jun 2023 at 17:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
>> On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
>> > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
>> > +{
>> > +       return qdisc->flags & TCQ_F_DESTROYING;
>> 
>> Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing
>> flags since they are now used in unlocked filter code path?
>
> Thanks, after taking another look at cls_api.c, I noticed this code in
> tc_new_tfilter():
>
> 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
> 			      flags, extack);
> 	if (err == 0) {
> 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> 			       RTM_NEWTFILTER, false, rtnl_held, extack);
> 		tfilter_put(tp, fh);
> 		/* q pointer is NULL for shared blocks */
> 		if (q)
> 			q->flags &= ~TCQ_F_CAN_BYPASS;
> 	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
> isn't atomic [1].

Yeah, I see we have already got such behavior in 3f05e6886a59
("net_sched: unset TCQ_F_CAN_BYPASS when adding filters").

>
> We also have this:
>
>   ->dequeue()
>     htb_dequeue()
>       htb_dequeue_tree()
>         qdisc_warn_nonwc():
>
>   void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
>   {
>           if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
>                   pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
>                           txt, qdisc->ops->id, qdisc->handle >> 16);
>                   qdisc->flags |= TCQ_F_WARN_NONWC;
>           }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   }
>   EXPORT_SYMBOL(qdisc_warn_nonwc);
>
> Also non-atomic; isn't it possible for the above 2 underlined statements to
> race with each other?  If true, I think we need to change Qdisc::flags to
> use atomic bitops, just like what we're doing for Qdisc::state and
> ::state2.  It feels like a separate TODO, however.

It looks like even though 3f05e6886a59 ("net_sched: unset
TCQ_F_CAN_BYPASS when adding filters") was introduced after the cls api
unlock, by now both are in exactly the same list of supported
kernels (5.4 LTS and newer). Considering this, the conversion to
atomic bitops can be done as a standalone fix for the cited commit, and after
it has been accepted and backported, the qdisc fix can just assume
that qdisc->flags is an atomic bitops field in all target kernels and
use it as-is. WDYT?

>
> I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
> ::state2, but not sure if it's okay to extend it for our purpose.

As you described above qdisc->flags is already used to interact with cls
api (including changing it dynamically), so I don't see why not.
Peilin Ye June 8, 2023, 12:39 a.m. UTC | #26
On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
> >> >> If livelock with concurrent filters insertion is an issue, then it can
> >> >> be remedied by setting a new Qdisc->flags bit
> >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> >> >> insertion coming after the flag is set to synchronize on rtnl lock.
> >> >
> >> > Thanks for the suggestion!  I'll try this approach.
> >> >
> >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> >> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> >> > even longer?).
> >> 
> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
> >> already returns -EINVAL when q->refcnt is zero, so maybe returning
> >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
> >> set is also fine? Would be much easier to implement as opposed to moving
> >> rtnl_lock there.
> >
> > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
> > 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
> > "request A") while it has a lot of ongoing filter requests, and here's the
> > result:
> >
> >                         #1         #2         #3         #4
> >   ----------------------------------------------------------
> >    a. refcnt            89         93        230        571
> >    b. replayed     167,568    196,450    336,291    878,027
> >    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
> >            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
> >             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
> >
> >    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
> >    b. is the number of times A has been replayed;
> >    c. is the time(1) output for A.
> >
> > a. and b. are collected from printk() output.  This is better than before,
> > but A could still be replayed for hundreds of thousands of times and hang
> > for a few seconds.
> 
> I don't get where does few seconds waiting time come from. I'm probably
> missing something obvious here, but the waiting time should be the
> maximum filter op latency of new/get/del filter request that is already
> in-flight (i.e. already passed qdisc_is_destroying() check) and it
> should take several orders of magnitude less time.

Yeah I agree, here's what I did:

In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic
loop:

  $ echo "1 1 32" > /sys/bus/netdevsim/new_device
  $ tc qdisc add dev eth1 ingress
  $ for (( i=1; i<=3000; i++ ))
  > do
  > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
  > done

When the loop is running, I delete the Qdisc in Terminal 2:

  $ time tc qdisc delete dev eth1 ingress

Which took seconds on average.  However, if I specify a unique "prio" when
adding filters in that loop, e.g.:

  $ for (( i=1; i<=3000; i++ ))
  > do
  > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
  > done				     ^^^^^^^

Then deleting the Qdisc in Terminal 2 becomes a lot faster:

  real  0m0.712s
  user  0m0.000s
  sys   0m0.152s 

In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did
yet another test [1], which looks a lot better.

When I didn't specify "prio", sometimes the
rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returned
-EEXIST.  Is it because concurrent add-filter requests auto-allocated
the same "prio" number and therefore collided with each other?  Do you think
this is related to why it's slow?

Thanks,
Peilin Ye

[1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances
in -batch mode that keeps adding a unique filter (with "prio" and "handle"
specified) then deletes it.  Again, when they are running I delete the
ingress Qdisc, and here's the result:

                         #1         #2         #3         #4
   ----------------------------------------------------------
    a. refcnt            64         63         64         64
    b. replayed         169      5,630        887      3,442
    c. time real   0m0.171s   0m0.147s   0m0.186s   0m0.111s
            user   0m0.000s   0m0.009s   0m0.001s   0m0.000s
             sys   0m0.112s   0m0.108s   0m0.115s   0m0.104s
Peilin Ye June 8, 2023, 1:08 a.m. UTC | #27
On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
> > I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
> > ::state2, but not sure if it's okay to extend it for our purpose.
>
> As you described above qdisc->flags is already used to interact with cls
> api (including changing it dynamically), so I don't see why not.

Sorry, I don't follow, I meant qdisc->state2:

  enum qdisc_state2_t {
          /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
           * Use qdisc_run_begin/end() or qdisc_is_running() instead.
           */
          __QDISC_STATE2_RUNNING,
  };

NVM, I think using qdisc->flags after making it atomic sounds better.

On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
> > 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
> > 			      flags, extack);
> > 	if (err == 0) {
> > 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> > 			       RTM_NEWTFILTER, false, rtnl_held, extack);
> > 		tfilter_put(tp, fh);
> > 		/* q pointer is NULL for shared blocks */
> > 		if (q)
> > 			q->flags &= ~TCQ_F_CAN_BYPASS;
> > 	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
> > isn't atomic [1].
> 
> Yeah, I see we have already got such behavior in 3f05e6886a59
> ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters").
> 
> > We also have this:
> >
> >   ->dequeue()
> >     htb_dequeue()
> >       htb_dequeue_tree()
> >         qdisc_warn_nonwc():
> >
> >   void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
> >   {
> >           if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
> >                   pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
> >                           txt, qdisc->ops->id, qdisc->handle >> 16);
> >                   qdisc->flags |= TCQ_F_WARN_NONWC;
> >           }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >   }
> >   EXPORT_SYMBOL(qdisc_warn_nonwc);
> >
> > Also non-atomic; isn't it possible for the above 2 underlined statements to
> > race with each other?  If true, I think we need to change Qdisc::flags to
> > use atomic bitops, just like what we're doing for Qdisc::state and
> > ::state2.  It feels like a separate TODO, however.
> 
> It looks like even though 3f05e6886a59 ("net_sched: unset
> TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api
> unlock by now we have these in exactly the same list of supported
> kernels (5.4 LTS and newer). Considering this, the conversion to the
> atomic bitops can be done as a standalone fix for cited commit and after
> it will have been accepted and backported the qdisc fix can just assume
> that qdisc->flags is an atomic bitops field in all target kernels and
> use it as-is. WDYT?

Sounds great, how about:

  1. I'll post the non-replay version of the fix (after updating the commit
     message), and we apply that first, as suggested by Jamal

  2. Make qdisc->flags atomic

  3. Make the fix better by replaying and using the (now atomic)
     IS-DESTROYING flag with test_and_set_bit() and friends

?
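Step 3 above could use a test-and-set on the (by then atomic) flag word, so only the first deleter does the flagging work. A userspace C11 approximation of the kernel's test_and_set_bit() semantics, with hypothetical names (the 0x400 value just mirrors the TCQ_F_DESTROYING bit proposed in the earlier diff):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define F_DESTROYING 0x400  /* stand-in for the proposed flag bit */

/* Approximates test_and_set_bit(): atomically set the bit and
 * report whether it was already set.  The first caller gets
 * "false" (bit was clear) and performs the one-time work; any
 * concurrent or later caller gets "true" and can skip it. */
static bool test_and_set_destroying(_Atomic unsigned int *flags)
{
	unsigned int old = atomic_fetch_or(flags, F_DESTROYING);

	return (old & F_DESTROYING) != 0;
}
```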

Thanks,
Peilin Ye
Vlad Buslov June 8, 2023, 7:48 a.m. UTC | #28
On Wed 07 Jun 2023 at 18:08, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
>> > I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
>> > ::state2, but not sure if it's okay to extend it for our purpose.
>>
>> As you described above qdisc->flags is already used to interact with cls
>> api (including changing it dynamically), so I don't see why not.
>
> Sorry, I don't follow, I meant qdisc->state2:
>
>   enum qdisc_state2_t {
>           /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
>            * Use qdisc_run_begin/end() or qdisc_is_running() instead.
>            */
>           __QDISC_STATE2_RUNNING,
>   };

Sorry, I misunderstood what you were suggesting. Got it now.

>
> NVM, I think using qdisc->flags after making it atomic sounds better.

Agree.

>
> On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
>> > 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
>> > 			      flags, extack);
>> > 	if (err == 0) {
>> > 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
>> > 			       RTM_NEWTFILTER, false, rtnl_held, extack);
>> > 		tfilter_put(tp, fh);
>> > 		/* q pointer is NULL for shared blocks */
>> > 		if (q)
>> > 			q->flags &= ~TCQ_F_CAN_BYPASS;
>> > 	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >
>> > TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
>> > isn't atomic [1].
>> 
>> Yeah, I see we have already got such behavior in 3f05e6886a59
>> ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters").
>> 
>> > We also have this:
>> >
>> >   ->dequeue()
>> >     htb_dequeue()
>> >       htb_dequeue_tree()
>> >         qdisc_warn_nonwc():
>> >
>> >   void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
>> >   {
>> >           if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
>> >                   pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
>> >                           txt, qdisc->ops->id, qdisc->handle >> 16);
>> >                   qdisc->flags |= TCQ_F_WARN_NONWC;
>> >           }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >   }
>> >   EXPORT_SYMBOL(qdisc_warn_nonwc);
>> >
>> > Also non-atomic; isn't it possible for the above 2 underlined statements to
>> > race with each other?  If true, I think we need to change Qdisc::flags to
>> > use atomic bitops, just like what we're doing for Qdisc::state and
>> > ::state2.  It feels like a separate TODO, however.
>> 
>> It looks like even though 3f05e6886a59 ("net_sched: unset
>> TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api
>> unlock by now we have these in exactly the same list of supported
>> kernels (5.4 LTS and newer). Considering this, the conversion to the
>> atomic bitops can be done as a standalone fix for cited commit and after
>> it will have been accepted and backported the qdisc fix can just assume
>> that qdisc->flags is an atomic bitops field in all target kernels and
>> use it as-is. WDYT?
>
> Sounds great, how about:
>
>   1. I'll post the non-replay version of the fix (after updating the commit
>      message), and we apply that first, as suggested by Jamal

From my side there are no objections to any of the proposed approaches,
since we have never had any users with a legitimate use-case where they
need to replace/delete a qdisc concurrently with a filter update, so
returning -EBUSY (or -EAGAIN) to the user in such a case would work as
either a temporary or the final fix. However, Jakub had reservations
about such an approach, so I don't know where we stand now on this.

>
>   2. Make qdisc->flags atomic
>
>   3. Make the fix better by replaying and using the (now atomic)
>      IS-DESTROYING flag with test_and_set_bit() and friends
>
> ?

Again, no objections from my side. Ping me if you need help with any of
these.
Vlad Buslov June 8, 2023, 9:17 a.m. UTC | #29
On Wed 07 Jun 2023 at 17:39, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
>> >> >> If livelock with concurrent filters insertion is an issue, then it can
>> >> >> be remedied by setting a new Qdisc->flags bit
>> >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> >> >> insertion coming after the flag is set to synchronize on rtnl lock.
>> >> >
>> >> > Thanks for the suggestion!  I'll try this approach.
>> >> >
>> >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>> >> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>> >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>> >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>> >> > even longer?).
>> >> 
>> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> >> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
>> >> set is also fine? Would be much easier to implement as opposed to moving
>> >> rtnl_lock there.
>> >
>> > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
>> > 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
>> > "request A") while it has a lot of ongoing filter requests, and here's the
>> > result:
>> >
>> >                         #1         #2         #3         #4
>> >   ----------------------------------------------------------
>> >    a. refcnt            89         93        230        571
>> >    b. replayed     167,568    196,450    336,291    878,027
>> >    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>> >            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>> >             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
>> >
>> >    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>> >    b. is the number of times A has been replayed;
>> >    c. is the time(1) output for A.
>> >
>> > a. and b. are collected from printk() output.  This is better than before,
>> > but A could still be replayed for hundreds of thousands of times and hang
>> > for a few seconds.
>> 
>> I don't get where the few seconds of waiting time come from. I'm probably
>> missing something obvious here, but the waiting time should be the
>> maximum filter op latency of a new/get/del filter request that is already
>> in-flight (i.e. has already passed the qdisc_is_destroying() check), and it
>> should take several orders of magnitude less time.
>
> Yeah I agree, here's what I did:
>
> In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic
> loop:
>
>   $ echo "1 1 32" > /sys/bus/netdevsim/new_device
>   $ tc qdisc add dev eth1 ingress
>   $ for (( i=1; i<=3000; i++ ))
>   > do
>   > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
>   > done
>
> When the loop is running, I delete the Qdisc in Terminal 2:
>
>   $ time tc qdisc delete dev eth1 ingress
>
> Which took seconds on average.  However, if I specify a unique "prio" when
> adding filters in that loop, e.g.:
>
>   $ for (( i=1; i<=3000; i++ ))
>   > do
>   > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
>   > done				     ^^^^^^^
>
> Then deleting the Qdisc in Terminal 2 becomes a lot faster:
>
>   real  0m0.712s
>   user  0m0.000s
>   sys   0m0.152s 
>
> In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did
> yet another test [1], which looks a lot better.

That makes sense, thanks for explaining.

>
> When I didn't specify "prio", sometimes that
> rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns
> -EEXIST.  Is it because concurrent add-filter requests auto-allocated
> the same "prio" number, so they collided with each other?  Do you think
> this is related to why it's slow?

It is slow because when creating a filter without providing priority you
are basically measuring the latency of creating a whole flower
classifier instance (multiple memory allocation, initialization of all
kinds of idrs, hash tables and locks, updating tp list in chain, etc.),
not just a single filter, so significantly higher latency is expected.

But my point still stands: with the latest version of your fix, the
maximum 'spinning' time in sch_api is bounded by the latency of a
concurrent tcf_{new|del|get}_tfilter op that has already obtained the
qdisc, and any concurrent filter API message arriving after the
qdisc->flags "DELETED-REJECT-NEW-FILTERS" bit has been set will fail and
cannot livelock the concurrent qdisc delete/replace.

>
> Thanks,
> Peilin Ye
>
> [1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances
> in -batch mode that keeps adding a unique filter (with "prio" and "handle"
> specified) then deletes it.  Again, when they are running I delete the
> ingress Qdisc, and here's the result:
>
>                          #1         #2         #3         #4
>    ----------------------------------------------------------
>     a. refcnt            64         63         64         64
>     b. replayed         169      5,630        887      3,442
>     c. time real   0m0.171s   0m0.147s   0m0.186s   0m0.111s
>             user   0m0.000s   0m0.009s   0m0.001s   0m0.000s
>              sys   0m0.112s   0m0.108s   0m0.115s   0m0.104s
Peilin Ye June 10, 2023, 12:20 a.m. UTC | #30
On Thu, Jun 08, 2023 at 12:17:27PM +0300, Vlad Buslov wrote:
> > When I didn't specify "prio", sometimes that
> > rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns
> > -EEXIST.  Is it because concurrent add-filter requests auto-allocated
> > the same "prio" number, so they collided with each other?  Do you think
> > this is related to why it's slow?
> 
> It is slow because when creating a filter without providing priority you
> are basically measuring the latency of creating a whole flower
> classifier instance (multiple memory allocation, initialization of all
> kinds of idrs, hash tables and locks, updating tp list in chain, etc.),
> not just a single filter, so significantly higher latency is expected.

Ah, I see.  Thanks for the explanation.

Thanks,
Peilin Ye
Peilin Ye June 11, 2023, 3:25 a.m. UTC | #31
On Thu, Jun 08, 2023 at 10:48:12AM +0300, Vlad Buslov wrote:
> >> It looks like even though 3f05e6886a59 ("net_sched: unset
> >> TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api
> >> unlock by now we have these in exactly the same list of supported
> >> kernels (5.4 LTS and newer). Considering this, the conversion to the
> >> atomic bitops can be done as a standalone fix for cited commit and after
> >> it will have been accepted and backported the qdisc fix can just assume
> >> that qdisc->flags is an atomic bitops field in all target kernels and
> >> use it as-is. WDYT?
> >
> > Sounds great, how about:
> >
> >   1. I'll post the non-replay version of the fix (after updating the commit
> >      message), and we apply that first, as suggested by Jamal
>
> From my side there are no objections to any of the proposed approaches
> since we have never had any users with legitimate use-case where they
> need to replace/delete a qdisc concurrently with a filter update, so
> returning -EBUSY (or -EAGAIN) to the user in such case would work as
> either temporary or the final fix.

I see, yeah.

> However, Jakub had reservations with such approach so don't know where we
> stand now regarding this.

Either way, I'd say applying that non-replay version first is better than
leaving the bug unfixed; it's been many days since the root cause of the
issue was figured out.  I'll post it, and start making qdisc->flags
atomic.

> >   2. Make qdisc->flags atomic
> >
> >   3. Make the fix better by replaying and using the (now atomic)
> >      IS-DESTROYING flag with test_and_set_bit() and friends
> >
> > ?
>
> Again, no objections from my side. Ping me if you need help with any of
> these.

Sure, thanks, Vlad!
Peilin Ye
diff mbox series

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@  static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@  void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..286b7c58f5b9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		if ((q && q->flags & TCQ_F_INGRESS) ||
 		    (new && new->flags & TCQ_F_INGRESS)) {
 			ingress = 1;
-			if (!dev_ingress_queue(dev)) {
+			dev_queue = dev_ingress_queue(dev);
+			if (!dev_queue) {
 				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 				return -ENOENT;
 			}
+
+			/* Replay if the current ingress (or clsact) Qdisc has ongoing
+			 * RTNL-unlocked filter request(s).  This is the counterpart of that
+			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+			 */
+			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+				return -EAGAIN;
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1135,6 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 			if (new && new->ops->attach)
 				new->ops->attach(new);
-		} else {
-			notify_and_destroy(net, skb, n, classid, old, new, extack);
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1450,19 +1464,22 @@  static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			struct netlink_ext_ack *extack)
 {
 	struct net *net = sock_net(skb->sk);
-	struct tcmsg *tcm = nlmsg_data(n);
 	struct nlattr *tca[TCA_MAX + 1];
 	struct net_device *dev;
+	struct Qdisc *q, *p;
+	struct tcmsg *tcm;
 	u32 clid;
-	struct Qdisc *q = NULL;
-	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
+	tcm = nlmsg_data(n);
+	q = p = NULL;
+
 	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
 	if (!dev)
 		return -ENODEV;
@@ -1515,8 +1532,11 @@  static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1724,8 @@  static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@  static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops  *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@  static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@  void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@  void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);