[bpf-next,v3,2/3] bpf, x64: Fix tailcall hierarchy

From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
handling in JIT"), the tailcall on x64 works better than before.

From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
for x64 JIT"), tailcall is able to run in BPF subprograms on x64.

How about:

1. More than 1 subprograms are called in a bpf program.
2. The tailcalls in the subprograms call the bpf program.

Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.

Let's take a look into an example:

\#include <linux/bpf.h>
\#include <bpf/bpf_helpers.h>
\#include "bpf_legacy.h"

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
	volatile int ret = 1;

	count++;
	subprog_tail(skb); /* subprog call1 */
	subprog_tail(skb); /* subprog call2 */

	return ret;
}

char __license[] SEC("license") = "GPL";

And the entry bpf prog is populated to the 0th slot of jmp_table. Then,
what happens when entry bpf prog runs? The CPU will be stalled because
of too many tailcalls, e.g. the test_progs failed to run on aarch64 and
s390x because of "rcu: INFO: rcu_sched self-detected stall on CPU".

So, if CPU does not stall because of too many tailcalls, how many
tailcalls will be there for this case? And why MAX_TAIL_CALL_CNT limit
does not work for this case?

Let's step into some running steps.

At the very first time when subprog_tail() is called, subprog_tail() does
tailcall the entry bpf prog. Then, subprog_taill() is called at second time
at the position subprog call1, and it tailcalls the entry bpf prog again.

Then, again and again. At the very first time when MAX_TAIL_CALL_CNT limit
works, subprog_tail() has been called for 34 times at the position subprog
call1. And at this time, the tail_call_cnt is 33 in subprog_tail().

Next, the 34th subprog_tail() returns to entry() because of
MAX_TAIL_CALL_CNT limit.

In entry(), the 34th entry(), at the time after the 34th subprog_tail() at
the position subprog call1 finishes and before the 1st subprog_tail() at
the position subprog call2 calls in entry(), what's the value of
tail_call_cnt in entry()? It's 33.

As we know, tail_all_cnt is pushed on the stack of entry(), and propagates
to subprog_tail() by %rax from stack.

Then, at the time when subprog_tail() at the position subprog call2 is
called for its first time, tail_call_cnt 33 propagates to subprog_tail()
by %rax. And the tailcall in subprog_tail() is aborted because of
tail_call_cnt >= MAX_TAIL_CALL_CNT too.

Then, subprog_tail() at the position subprog call2 ends, and the 34th
entry() ends. And it returns to the 33rd subprog_tail() called from the
position subprog call1. But wait, at this time, what's the value of
tail_call_cnt under the stack of subprog_tail()? It's 33.

Then, in the 33rd entry(), at the time after the 33th subprog_tail() at
the position subprog call1 finishes and before the 2nd subprog_tail() at
the position subprog call2 calls, what's the value of tail_call_cnt
in current entry()? It's *32*. Why not 33?

Before stepping into subprog_tail() at the position subprog call2 in 33rd
entry(), like stopping the time machine, let's have a look at the stack
memory:

  |  STACK  |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  +---------+ RSP  <-- current rsp
  |   rip   | STACK of 34rd entry()
  |   rbp   | reuse the STACK of 33rd subprog_tail() at the position
  |   ret   |                                        subprog call1
  |   tcc   | its value is 33
  +---------+ rsp
  |   rip   | STACK of 1st subprog_tail() at the position subprog call2
  |   rbp   |
  |   tcc   | its value is 33
  +---------+ rsp

Why not 33? It's because tail_call_cnt does not back-propagate from
subprog_tail() to entry().

Then, while stepping into subprog_tail() at the position subprog call2 in
33rd entry():

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 32; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Then, while pausing after tailcalling in 2nd subprog_tail() at the position
subprog call2:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 33; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Note: what happens to tail_call_cnt:
	/*
	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
	 *	goto out;
	 */
It's to check >= MAX_TAIL_CALL_CNT first and then increment tail_call_cnt.

So, current tailcall is allowed to run.

Then, entry() is tailcalled. And the stack memory status is:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 35th entry(); reuse STACK of subprog_tail() at the
  |   tcc   | its value is 33                   the position subprog call2
  +---------+ RSP  <-- current rsp

So, the tailcalls in the 35th entry() will be aborted.

And, ..., again and again.  :(

And, I hope you have understood the reason why MAX_TAIL_CALL_CNT limit
does not work for this case.

And, how many tailcalls are there for this case if CPU does not stall?

From top-down view, does it look like hierarchy layer and layer?

I think it is a hierarchy layer model with 2+4+8+...+2**33 tailcalls. As a
result, if CPU does not stall, there will be 2**34 - 2 = 17,179,869,182
tailcalls. That's the guy making CPU stalled.

What about there are N subprog_tail() in entry()? If CPU does not stall
because of too many tailcalls, there will be almost N**34 tailcalls.

As we learn about the issue, how does this patch resolve it?

In this patch, it stores tail_call_cnt at task_struct. When a tailcall
happens, the caller and callee bpf progs are in the same thread context,
which means the current task_struct won't change after a tailcall.

First, at the prologue of bpf prog, it initialise the tail_call_cnt at
task_struct.

Then, when a tailcall happens, it compares tail_call_cnt with
MAX_TAIL_CALL_CNT, and then increments it.

Additionally, in order to avoid touching other registers excluding %rax,
it uses asm to handle tail_call_cnt at task_struct.

As a result, the previous tailcall way can be removed totally, including

1. "push rax" at prologue.
2. load tail_call_cnt to rax before calling function.
3. "pop rax" before jumping to tailcallee when tailcall.
4. "push rax" and load tail_call_cnt to rax at trampoline.

Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 137 +++++++++++++++++++++---------------
 1 file changed, 81 insertions(+), 56 deletions(-)

Message ID	20240402152638.31377-3-hffilwlqm@gmail.com (mailing list archive)
State	Changes Requested
Delegated to:	BPF
Headers	show Received: from mail-pl1-f172.google.com (mail-pl1-f172.google.com [209.85.214.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B759885262 for <bpf@vger.kernel.org>; Tue, 2 Apr 2024 15:26:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.172 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712071621; cv=none; b=Yv/AKv8PS9mv7KgmjXkXMOMl9fwrPk+GB1LmZKCKDIhm61anckYPz9QxWqccfKjoXN0ZkaLeE6fqIVV3oRhUb9771g1MwNjNs/bQAMl+5nOv5J3ar4Ph/PZn4p0VxbeXIU6WPx2mKo4ySBtXNzyRKr4Hoyrg6LZ/iY2zS9RQnUU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712071621; c=relaxed/simple; bh=LGGFOWmrueBRuNaW4RzxW7NvMmBml0s8qrnhyu9q+X8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=olMVteN6mYIL12byn5Qof1gyvpYjE88EcHvvcmK3Eadh8YNcQNfa6quYan2uRfk46elBR06YElEU8917AIP5/+YqNzq2vCdfAzx9FhiJBAd8D6BaOOLpN/tudE8/8Z1+JnlBUEzY6sHfrkOK6k3X3XPd8C+iKzqluYNQTciDNJs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=kO9prndl; arc=none smtp.client-ip=209.85.214.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="kO9prndl" Received: by mail-pl1-f172.google.com with SMTP id d9443c01a7336-1e27fadbbe1so4559705ad.1 for <bpf@vger.kernel.org>; Tue, 02 Apr 2024 08:26:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1712071618; x=1712676418; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=njL6tMjKCKCYbdPsByoqYXdZydiTsQ0eL5nfl/OaP44=; b=kO9prndlbijN8pdnxFIOQYcbyTpXlH/nI7NloqpX8pFEEC/f00Wrw5uy0tH+iEw5Qh foWlkBq4X4zVm4bUDq9r85uFQhXcMosDTqlTX0qqcqHatEYAR2coMOVwKUXjcEssukPa nXDUDgdB8weeDgfTwjUEWX+w3VmvAuVTfW7OOM0vnEvkx/AYhupJjGKkFnoqL2a9Y+8Y IZzUG8UC5HIri3ZVcoBI6dR9AaQO7oJaisPuqYjacKbDXAZiv2E5xod6CAr6tmuRJHmx QAaNortfLDb8XvFi7Ph0fwVi/vqvek7tUWOYkPI0PlMjGG2KXw15VYAascRJZw6LNynj A1lQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1712071618; x=1712676418; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=njL6tMjKCKCYbdPsByoqYXdZydiTsQ0eL5nfl/OaP44=; b=Iq0z5pb0TGXQH89OPIRnYocUnWHMnK+NRW751I6LRsDG6tVAxdv4lpXUiRKDJUsftB hfjemj5dYr2IU4BBO2uqHdYeQK79meMTmXEiwUkujU72L2k0R6FLhnhlRFJufrCr6UD5 uSQP38LLHMGipl+oJ5rwyaXeHT2wO8ykn6VskIR/N5/doQgRdJHF+k5J8O+l/xBevSxQ No0oQKJJXptD6RxSSnXZ5jEZrdnTYTwKgX5CEKywQG+qVpFfSxtgEngYs7pjLWRGll7W QJqHoWDstwW6fIz2V1uJYRnQ5EZAGVp9KgmnhreA65j62qfWon9gRZAQAZnHA7VPTOjj Al4g== X-Gm-Message-State: AOJu0YxTgifHmQ9ox1X6vTE/79aDcibgwwXG9x68ESfiEvX59BIttyVf /g1waO6D+T0rncKBgHlZ8HIYijvD7u49LcitUaOP21i4YQp21wuMk9bB1Ka8 X-Google-Smtp-Source: AGHT+IFPT6IRzm+CfIpXLUd1aSjlni/xgTl/5XPi1d/AlF/3m9EkF3jXZ1wqYx/gUfVLH1EcCZ9oEg== X-Received: by 2002:a17:902:6548:b0:1e2:7e04:3ab1 with SMTP id d8-20020a170902654800b001e27e043ab1mr1240512pln.33.1712071618461; Tue, 02 Apr 2024 08:26:58 -0700 (PDT) Received: from localhost.localdomain (bb116-14-181-187.singnet.com.sg. [116.14.181.187]) by smtp.gmail.com with ESMTPSA id ju2-20020a170903428200b001e0a08bbe49sm1640505plb.140.2024.04.02.08.26.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 02 Apr 2024 08:26:57 -0700 (PDT) From: Leon Hwang <hffilwlqm@gmail.com> To: bpf@vger.kernel.org Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, maciej.fijalkowski@intel.com, jakub@cloudflare.com, pulehui@huawei.com, hengqi.chen@gmail.com, hffilwlqm@gmail.com, kernel-patches-bot@fb.com Subject: [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy Date: Tue, 2 Apr 2024 23:26:37 +0800 Message-ID: <20240402152638.31377-3-hffilwlqm@gmail.com> X-Mailer: git-send-email 2.44.0 In-Reply-To: <20240402152638.31377-1-hffilwlqm@gmail.com> References: <20240402152638.31377-1-hffilwlqm@gmail.com> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: <bpf.vger.kernel.org> List-Subscribe: <mailto:bpf+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:bpf+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Delegate: bpf@iogearbox.net
Series	bpf, x64: Fix tailcall hierarchy \| expand [bpf-next,v3,0/3] bpf, x64: Fix tailcall hierarchy [bpf-next,v3,1/3] bpf: Add bpf_tail_call_cnt to task_struct [bpf-next,v3,2/3] bpf, x64: Fix tailcall hierarchy [bpf-next,v3,3/3] selftests/bpf: Add testcases for tailcall hierarchy fixing

Context	Check	Description
bpf/vmtest-bpf-next-PR	fail	PR summary
netdev/series_format	success	Posting correctly formatted
netdev/tree_selection	success	Clearly marked for bpf-next, async
netdev/ynl	success	Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 8 this patch: 8
netdev/build_tools	success	No tools touched, skip
netdev/cc_maintainers	warning	17 maintainers not CCed: john.fastabend@gmail.com kpsingh@kernel.org mingo@redhat.com martin.lau@linux.dev tglx@linutronix.de dsahern@kernel.org eddyz87@gmail.com sdf@google.com bp@alien8.de netdev@vger.kernel.org x86@kernel.org yonghong.song@linux.dev dave.hansen@linux.intel.com hpa@zytor.com haoluo@google.com jolsa@kernel.org song@kernel.org
netdev/build_clang	success	Errors and warnings before: 955 this patch: 955
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	fail	Errors and warnings before: 956 this patch: 958
netdev/checkpatch	warning	CHECK: Consider using #include <linux/percpu.h> instead of <asm/percpu.h> CHECK: Lines should not end with a '(' WARNING: line length of 94 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0	success	Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1	success	Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2	success	Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-3	success	Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-5	success	Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-4	success	Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10	success	Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-12	success	Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-9	success	Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11	success	Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-34	success	Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-19	success	Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20	success	Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-29	success	Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-17	success	Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-28	success	Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-18	success	Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-6	success	Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7	fail	Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8	fail	Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-35	success	Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-36	success	Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-42	success	Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-26	success	Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-33	success	Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32	success	Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-39	success	Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-37	success	Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38	success	Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41	success	Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-27	success	Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-31	success	Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-22	success	Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21	success	Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24	success	Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23	success	Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25	success	Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-30	success	Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-40	success	Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-13	fail	Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16	fail	Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14	success	Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15	success	Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc

[bpf-next,v3,2/3] bpf, x64: Fix tailcall hierarchy

Checks

Commit Message

Comments

Patch