Message ID: 20240123152716.5975-4-laoar.shao@gmail.com (mailing list archive)
State: Superseded
Delegated to: BPF
Series: bpf: Add bpf_iter_cpumask
On Tue, Jan 23, 2024 at 11:27:16PM +0800, Yafang Shao wrote:
> Within the BPF program, we leverage the cgroup iterator to iterate through
> percpu runqueue data, specifically the 'nr_running' metric. Subsequently
> we expose this data to userspace by means of a sequence file.
>
> The CPU affinity for the cpumask is determined by the PID of a task:
>
> - PID of the init task (PID 1)
>   We typically don't set CPU affinity for the init task and thus we can
>   iterate across all possible CPUs using the init task. However, in
>   scenarios where you've set CPU affinity for the init task, you should
>   set your current task's CPU affinity to all possible CPUs and then
>   proceed to iterate through all possible CPUs using the current task.
> - PID of a task with defined CPU affinity
>   The aim here is to iterate through a specific cpumask. This scenario
>   aligns with tasks residing within a cpuset cgroup.
> - Invalid PID (e.g., PID -1)
>   No cpumask is available in this case.
>
> The results are as follows:
>
>   #65/1 cpumask_iter/init_pid:OK
>   #65/2 cpumask_iter/invalid_pid:OK
>   #65/3 cpumask_iter/self_pid_one_cpu:OK
>   #65/4 cpumask_iter/self_pid_multi_cpus:OK
>   #65 cpumask_iter:OK
>   Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
>
> CONFIG_PSI=y is required for this testcase.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  tools/testing/selftests/bpf/config            |   1 +
>  .../selftests/bpf/prog_tests/cpumask_iter.c   | 130 ++++++++++++++++++
>  .../selftests/bpf/progs/cpumask_common.h      |   3 +
>  .../selftests/bpf/progs/test_cpumask_iter.c   |  56 ++++++++
>  4 files changed, 190 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_cpumask_iter.c
>
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index c125c441abc7..9c42568ed376 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -78,6 +78,7 @@ CONFIG_NF_CONNTRACK_MARK=y
>  CONFIG_NF_DEFRAG_IPV4=y
>  CONFIG_NF_DEFRAG_IPV6=y
>  CONFIG_NF_NAT=y
> +CONFIG_PSI=y
>  CONFIG_RC_CORE=y
>  CONFIG_SECURITY=y
>  CONFIG_SECURITYFS=y
> diff --git a/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> new file mode 100644
> index 000000000000..1db4efc57c5f
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> @@ -0,0 +1,130 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
> +
> +#define _GNU_SOURCE
> +#include <sched.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#include <test_progs.h>
> +#include "cgroup_helpers.h"
> +#include "test_cpumask_iter.skel.h"
> +
> +static void verify_percpu_data(struct bpf_link *link, int nr_cpu_exp, int nr_running_exp)

In general it seems wrong for this to be void. If we fail to verify
something, why would we want to keep chugging along in the caller? That
seems like it would just add a bunch of test failure noise for all the
stuff that fails after.
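Something roughly like this is what I mean (an untested sketch, not the
only way to do it):

static int verify_percpu_data(struct bpf_link *link, int nr_cpu_exp, int nr_running_exp)
{
        int iter_fd;

        iter_fd = bpf_iter_create(bpf_link__fd(link));
        if (!ASSERT_GE(iter_fd, 0, "iter_fd"))
                return -1;

        /* ... read the iter and run the ASSERT_*() checks as before ... */

        close(iter_fd);
        return 0;
}

with callers doing e.g.:

        if (verify_percpu_data(link, 1, 1))
                goto free_link;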
> +{
> +        int iter_fd, len, item, nr_running, psi_running, nr_cpus;
> +        char buf[128];
> +        size_t left;
> +        char *p;
> +
> +        iter_fd = bpf_iter_create(bpf_link__fd(link));
> +        if (!ASSERT_GE(iter_fd, 0, "iter_fd"))
> +                return;
> +
> +        memset(buf, 0, sizeof(buf));
> +        left = ARRAY_SIZE(buf);
> +        p = buf;
> +        while ((len = read(iter_fd, p, left)) > 0) {
> +                p += len;
> +                left -= len;
> +        }
> +
> +        item = sscanf(buf, "nr_running %u nr_cpus %u psi_running %u\n",
> +                      &nr_running, &nr_cpus, &psi_running);

I don't think there's any reason we should need to do string parsing
like this. We can set all of these variables from BPF and then check
them in the skeleton from user space.

> +        if (nr_cpu_exp == -1) {
> +                ASSERT_EQ(item, -1, "seq_format");
> +                goto out;
> +        }
> +
> +        ASSERT_EQ(item, 3, "seq_format");
> +        ASSERT_GE(nr_running, nr_running_exp, "nr_running");
> +        ASSERT_GE(psi_running, nr_running_exp, "psi_running");
> +        ASSERT_EQ(nr_cpus, nr_cpu_exp, "nr_cpus");
> +
> +out:
> +        close(iter_fd);
> +}
> +
> +void test_cpumask_iter(void)

We should also have negative testcases that verify that we fail to load
with bogus / unsafe inputs. See e.g. [0].

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/cpumask_failure.c

> +{
> +        DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> +        int nr_possible, cgrp_fd, pid, err, cnt, i;
> +        struct test_cpumask_iter *skel;
> +        union bpf_iter_link_info linfo;
> +        int cpu_ids[] = {1, 3, 4, 5};
> +        struct bpf_link *link;
> +        cpu_set_t set;
> +
> +        skel = test_cpumask_iter__open_and_load();
> +        if (!ASSERT_OK_PTR(skel, "test_for_each_cpu__open_and_load"))
> +                return;
> +
> +        if (setup_cgroup_environment())
> +                goto destroy;
> +
> +        /* Utilize the cgroup iter */
> +        cgrp_fd = get_root_cgroup();
> +        if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
> +                goto cleanup;
> +
> +        memset(&linfo, 0, sizeof(linfo));
> +        linfo.cgroup.cgroup_fd = cgrp_fd;
> +        linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY;
> +        opts.link_info = &linfo;
> +        opts.link_info_len = sizeof(linfo);
> +
> +        link = bpf_program__attach_iter(skel->progs.cpu_cgroup, &opts);
> +        if (!ASSERT_OK_PTR(link, "attach_iter"))
> +                goto close_fd;
> +
> +        skel->bss->target_pid = 1;
> +        /* In case the init task has CPU affinity set */
> +        err = sched_getaffinity(1, sizeof(set), &set);
> +        if (!ASSERT_OK(err, "setaffinity"))
> +                goto free_link;
> +
> +        cnt = CPU_COUNT(&set);
> +        nr_possible = bpf_num_possible_cpus();
> +        if (test__start_subtest("init_pid"))

Can you instead please do what we do in [1] and have each testcase be
self-contained and separately invoked? Doing things the way you have it
in this patch has a few drawbacks:

1. The testcases are now all interdependent. If one fails we continue
   onto others even though we may or may not have any reason to believe
   that the system state is sane or what we expect. If the later
   testcases fail, what does it even mean? Or if they pass, is that
   expected? We should be setting up and tearing down whatever state we
   need between testcases to have confidence that the signal from each
   testcase is independent of whatever signal was in the previous one.

2. This is also confusing because we're doing a bunch of validation and
   testing even if test__start_subtest() returns false. It's unclear
   what the expectations are in such a case, and I think it's a lot more
   logical if you just don't do anything and skip the testcase.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/cpumask.c#n8
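To be concrete, the shape I'm suggesting is roughly the following (a
sketch of the pattern used in [1]; run_test() and the subtest table are
made-up names, and each run would do its own setup and teardown):

static const char * const subtests[] = {
        "init_pid",
        "invalid_pid",
        "self_pid_one_cpu",
        "self_pid_multi_cpus",
};

void test_cpumask_iter(void)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(subtests); i++) {
                if (!test__start_subtest(subtests[i]))
                        continue;
                /* Open the skeleton, attach the iter link, run exactly one
                 * verification, and tear everything down again so each
                 * subtest's signal is independent of the previous one.
                 */
                run_test(subtests[i]);
        }
}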
> +                /* current task is running. */
> +                verify_percpu_data(link, cnt, cnt == nr_possible ? 1 : 0);
> +
> +        skel->bss->target_pid = -1;
> +        if (test__start_subtest("invalid_pid"))
> +                verify_percpu_data(link, -1, -1);
> +
> +        pid = getpid();
> +        skel->bss->target_pid = pid;
> +        CPU_ZERO(&set);
> +        CPU_SET(0, &set);
> +        err = sched_setaffinity(pid, sizeof(set), &set);
> +        if (!ASSERT_OK(err, "setaffinity"))
> +                goto free_link;
> +
> +        if (test__start_subtest("self_pid_one_cpu"))
> +                verify_percpu_data(link, 1, 1);
> +
> +        /* Assume there are at least 8 CPUs on the testbed */
> +        if (nr_possible < 8)
> +                goto free_link;
> +
> +        CPU_ZERO(&set);
> +        /* Set the CPU affinity: 1,3-5 */
> +        for (i = 0; i < ARRAY_SIZE(cpu_ids); i++)
> +                CPU_SET(cpu_ids[i], &set);
> +        err = sched_setaffinity(pid, sizeof(set), &set);
> +        if (!ASSERT_OK(err, "setaffinity"))
> +                goto free_link;
> +
> +        if (test__start_subtest("self_pid_multi_cpus"))
> +                verify_percpu_data(link, ARRAY_SIZE(cpu_ids), 1);
> +
> +free_link:
> +        bpf_link__destroy(link);
> +close_fd:
> +        close(cgrp_fd);
> +cleanup:
> +        cleanup_cgroup_environment();
> +destroy:
> +        test_cpumask_iter__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/cpumask_common.h b/tools/testing/selftests/bpf/progs/cpumask_common.h
> index 0cd4aebb97cf..cdb9dc95e9d9 100644
> --- a/tools/testing/selftests/bpf/progs/cpumask_common.h
> +++ b/tools/testing/selftests/bpf/progs/cpumask_common.h
> @@ -55,6 +55,9 @@ void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym
>  u32 bpf_cpumask_any_distribute(const struct cpumask *src) __ksym;
>  u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, const struct cpumask *src2) __ksym;
>  u32 bpf_cpumask_weight(const struct cpumask *cpumask) __ksym;
> +int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask) __ksym;
> +int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it) __ksym;
> +void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it) __ksym;
>
>  void bpf_rcu_read_lock(void) __ksym;
>  void bpf_rcu_read_unlock(void) __ksym;
> diff --git a/tools/testing/selftests/bpf/progs/test_cpumask_iter.c b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> new file mode 100644
> index 000000000000..cb8b8359516b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +#include "task_kfunc_common.h"
> +#include "cpumask_common.h"
> +
> +extern const struct psi_group_cpu system_group_pcpu __ksym __weak;
> +extern const struct rq runqueues __ksym __weak;
> +
> +int target_pid;
> +
> +SEC("iter.s/cgroup")
> +int BPF_PROG(cpu_cgroup, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> +{
> +        u32 nr_running = 0, psi_nr_running = 0, nr_cpus = 0;
> +        struct psi_group_cpu *groupc;
> +        struct task_struct *p;
> +        struct rq *rq;
> +        int *cpu;
> +
> +        /* epilogue */
> +        if (cgrp == NULL)
> +                return 0;
> +
> +        bpf_rcu_read_lock();

Why is this required? p->cpus_ptr should be trusted given that @p is
trusted, no? Does this fail to load if you don't have all of this in an
RCU read region?
> +        p = bpf_task_from_pid(target_pid);
> +        if (!p) {
> +                bpf_rcu_read_unlock();
> +                return 1;
> +        }
> +
> +        bpf_for_each(cpumask, cpu, p->cpus_ptr) {
> +                rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, *cpu);
> +                if (!rq)
> +                        continue;

If this happens we should set an error code and check it in user space.

> +                nr_running += rq->nr_running;
> +                nr_cpus += 1;
> +
> +                groupc = (struct psi_group_cpu *)bpf_per_cpu_ptr(&system_group_pcpu, *cpu);
> +                if (!groupc)
> +                        continue;

Same here.

> +                psi_nr_running += groupc->tasks[NR_RUNNING];
> +        }
> +        BPF_SEQ_PRINTF(meta->seq, "nr_running %u nr_cpus %u psi_running %u\n",
> +                       nr_running, nr_cpus, psi_nr_running);
> +
> +        bpf_task_release(p);
> +        bpf_rcu_read_unlock();
> +        return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";

Can you please move this to the top of the file?

> --
> 2.39.1
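Regarding the error-code suggestion above, what I have in mind is
roughly this (a sketch only; the err global and the -ENOENT value are
illustrative, not part of the patch):

/* BPF prog side: record unexpected per-CPU lookup failures in a global. */
int err;

        /* ... inside the bpf_for_each() loop: */
        rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, *cpu);
        if (!rq) {
                err = -ENOENT;
                continue;
        }

with the prog_tests file then checking the global after reading the
iterator, e.g. ASSERT_EQ(skel->bss->err, 0, "percpu_ptr").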
On Wed, Jan 24, 2024 at 4:48 AM David Vernet <void@manifault.com> wrote:
>
> On Tue, Jan 23, 2024 at 11:27:16PM +0800, Yafang Shao wrote:
> > Within the BPF program, we leverage the cgroup iterator to iterate through
> > percpu runqueue data, specifically the 'nr_running' metric. Subsequently
> > we expose this data to userspace by means of a sequence file.

[ ... ]

> > +static void verify_percpu_data(struct bpf_link *link, int nr_cpu_exp, int nr_running_exp)
>
> In general it seems wrong for this to be void. If we fail to verify
> something, why would we want to keep chugging along in the caller? That
> seems like it would just add a bunch of test failure noise for all the
> stuff that fails after.

makes sense.
> > > +{ > > + int iter_fd, len, item, nr_running, psi_running, nr_cpus; > > + char buf[128]; > > + size_t left; > > + char *p; > > + > > + iter_fd = bpf_iter_create(bpf_link__fd(link)); > > + if (!ASSERT_GE(iter_fd, 0, "iter_fd")) > > + return; > > + > > + memset(buf, 0, sizeof(buf)); > > + left = ARRAY_SIZE(buf); > > + p = buf; > > + while ((len = read(iter_fd, p, left)) > 0) { > > + p += len; > > + left -= len; > > + } > > + > > + item = sscanf(buf, "nr_running %u nr_cpus %u psi_running %u\n", > > + &nr_running, &nr_cpus, &psi_running); > > I don't think there's any reason we should need to do string parsing > like this. We can set all of these variables from BPF and then check > them in the skeleton from user space. I agree with your point about the simplification achieved through skeleton variables. However, it's worth noting that a seq file in this context serves to demonstrate how we expose the data through a seq file. > > > + if (nr_cpu_exp == -1) { > > + ASSERT_EQ(item, -1, "seq_format"); > > + goto out; > > + } > > + > > + ASSERT_EQ(item, 3, "seq_format"); > > + ASSERT_GE(nr_running, nr_running_exp, "nr_running"); > > + ASSERT_GE(psi_running, nr_running_exp, "psi_running"); > > + ASSERT_EQ(nr_cpus, nr_cpu_exp, "nr_cpus"); > > + > > +out: > > + close(iter_fd); > > +} > > + > > +void test_cpumask_iter(void) > > We should also have negative testcases that verify that we fail to load > with bogus / unsafe inputs. See e.g. [0]. will add it. > > [0]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/cpumask_failure.c > > > +{ > > + DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); > > + int nr_possible, cgrp_fd, pid, err, cnt, i; > > + struct test_cpumask_iter *skel; > > + union bpf_iter_link_info linfo; > > + int cpu_ids[] = {1, 3, 4, 5}; > > + struct bpf_link *link; > > + cpu_set_t set; > > + > > + skel = test_cpumask_iter__open_and_load(); > > + if (!ASSERT_OK_PTR(skel, "test_for_each_cpu__open_and_load")) > > + return; > > + > > + if (setup_cgroup_environment()) > > + goto destroy; > > + > > + /* Utilize the cgroup iter */ > > + cgrp_fd = get_root_cgroup(); > > + if (!ASSERT_GE(cgrp_fd, 0, "create cgrp")) > > + goto cleanup; > > + > > + memset(&linfo, 0, sizeof(linfo)); > > + linfo.cgroup.cgroup_fd = cgrp_fd; > > + linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY; > > + opts.link_info = &linfo; > > + opts.link_info_len = sizeof(linfo); > > + > > + link = bpf_program__attach_iter(skel->progs.cpu_cgroup, &opts); > > + if (!ASSERT_OK_PTR(link, "attach_iter")) > > + goto close_fd; > > + > > + skel->bss->target_pid = 1; > > + /* In case init task is set CPU affinity */ > > + err = sched_getaffinity(1, sizeof(set), &set); > > + if (!ASSERT_OK(err, "setaffinity")) > > + goto free_link; > > + > > + cnt = CPU_COUNT(&set); > > + nr_possible = bpf_num_possible_cpus(); > > + if (test__start_subtest("init_pid")) > > Can you instead please do what we do in [1] and have each testcase be > self contained and separately invoked? Doing things the way you have it > in this patch has a few drawbacks: > > 1. The testcases are now all interdependent. If one fails we continue > onto others even though we may or may not have any reason to believe > that the system state is sane or what we expect. If the later > testcases fail, what does it even mean? Or if they pass, is that > expected? 
>    We should be setting up and tearing down whatever state we
>    need between testcases to have confidence that the signal from each
>    testcase is independent of whatever signal was in the previous one.
>
> 2. This is also confusing because we're doing a bunch of validation and
>    testing even if test__start_subtest() returns false. It's unclear
>    what the expectations are in such a case, and I think it's a lot more
>    logical if you just don't do anything and skip the testcase.
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/cpumask.c#n8

Will check it. Appreciate your insights. Thank you.

[ ... ]
> > +SEC("iter.s/cgroup")
> > +int BPF_PROG(cpu_cgroup, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> > +{

[ ... ]

> > +        bpf_rcu_read_lock();
>
> Why is this required? p->cpus_ptr should be trusted given that @p is
> trusted, no? Does this fail to load if you don't have all of this in an
> RCU read region?

It is a sleepable prog (iter.s), so we must explicitly take the RCU read
lock; otherwise the program fails to load.

> > +                rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, *cpu);
> > +                if (!rq)
> > +                        continue;
>
> If this happens we should set an error code and check it in user space.

sure, will do it.

> > +                groupc = (struct psi_group_cpu *)bpf_per_cpu_ptr(&system_group_pcpu, *cpu);
> > +                if (!groupc)
> > +                        continue;
>
> Same here.

will do it.

[ ... ]

> > +char _license[] SEC("license") = "GPL";
>
> Can you please move this to the top of the file?

will do it.
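To expand on the RCU exchange above: in a sleepable (iter.s) program the
verifier only accepts dereferences of RCU-protected fields such as
p->cpus_ptr inside an explicit RCU read-side section. A minimal sketch
of the constraint (assuming target_pid and the kfunc declarations from
cpumask_common.h; the prog name is made up):

SEC("iter.s/cgroup")
int BPF_PROG(rcu_demo, struct bpf_iter_meta *meta, struct cgroup *cgrp)
{
        struct task_struct *p;
        int *cpu;

        bpf_rcu_read_lock();
        p = bpf_task_from_pid(target_pid);
        if (!p) {
                bpf_rcu_read_unlock();
                return 1;
        }

        /* Without the surrounding bpf_rcu_read_lock()/unlock() pair, this
         * dereference of the RCU-protected p->cpus_ptr is rejected at load
         * time in a sleepable prog.
         */
        bpf_for_each(cpumask, cpu, p->cpus_ptr)
                ;

        bpf_task_release(p);
        bpf_rcu_read_unlock();
        return 0;
}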
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c125c441abc7..9c42568ed376 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -78,6 +78,7 @@ CONFIG_NF_CONNTRACK_MARK=y
 CONFIG_NF_DEFRAG_IPV4=y
 CONFIG_NF_DEFRAG_IPV6=y
 CONFIG_NF_NAT=y
+CONFIG_PSI=y
 CONFIG_RC_CORE=y
 CONFIG_SECURITY=y
 CONFIG_SECURITYFS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
new file mode 100644
index 000000000000..1db4efc57c5f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#include <test_progs.h>
+#include "cgroup_helpers.h"
+#include "test_cpumask_iter.skel.h"
+
+static void verify_percpu_data(struct bpf_link *link, int nr_cpu_exp, int nr_running_exp)
+{
+        int iter_fd, len, item, nr_running, psi_running, nr_cpus;
+        char buf[128];
+        size_t left;
+        char *p;
+
+        iter_fd = bpf_iter_create(bpf_link__fd(link));
+        if (!ASSERT_GE(iter_fd, 0, "iter_fd"))
+                return;
+
+        memset(buf, 0, sizeof(buf));
+        left = ARRAY_SIZE(buf);
+        p = buf;
+        while ((len = read(iter_fd, p, left)) > 0) {
+                p += len;
+                left -= len;
+        }
+
+        item = sscanf(buf, "nr_running %u nr_cpus %u psi_running %u\n",
+                      &nr_running, &nr_cpus, &psi_running);
+        if (nr_cpu_exp == -1) {
+                ASSERT_EQ(item, -1, "seq_format");
+                goto out;
+        }
+
+        ASSERT_EQ(item, 3, "seq_format");
+        ASSERT_GE(nr_running, nr_running_exp, "nr_running");
+        ASSERT_GE(psi_running, nr_running_exp, "psi_running");
+        ASSERT_EQ(nr_cpus, nr_cpu_exp, "nr_cpus");
+
+out:
+        close(iter_fd);
+}
+
+void test_cpumask_iter(void)
+{
+        DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+        int nr_possible, cgrp_fd, pid, err, cnt, i;
+        struct test_cpumask_iter *skel;
+        union bpf_iter_link_info linfo;
+        int cpu_ids[] = {1, 3, 4, 5};
+        struct bpf_link *link;
+        cpu_set_t set;
+
+        skel = test_cpumask_iter__open_and_load();
+        if (!ASSERT_OK_PTR(skel, "test_for_each_cpu__open_and_load"))
+                return;
+
+        if (setup_cgroup_environment())
+                goto destroy;
+
+        /* Utilize the cgroup iter */
+        cgrp_fd = get_root_cgroup();
+        if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
+                goto cleanup;
+
+        memset(&linfo, 0, sizeof(linfo));
+        linfo.cgroup.cgroup_fd = cgrp_fd;
+        linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY;
+        opts.link_info = &linfo;
+        opts.link_info_len = sizeof(linfo);
+
+        link = bpf_program__attach_iter(skel->progs.cpu_cgroup, &opts);
+        if (!ASSERT_OK_PTR(link, "attach_iter"))
+                goto close_fd;
+
+        skel->bss->target_pid = 1;
+        /* In case the init task has CPU affinity set */
+        err = sched_getaffinity(1, sizeof(set), &set);
+        if (!ASSERT_OK(err, "setaffinity"))
+                goto free_link;
+
+        cnt = CPU_COUNT(&set);
+        nr_possible = bpf_num_possible_cpus();
+        if (test__start_subtest("init_pid"))
+                /* current task is running. */
+                verify_percpu_data(link, cnt, cnt == nr_possible ? 1 : 0);
+
+        skel->bss->target_pid = -1;
+        if (test__start_subtest("invalid_pid"))
+                verify_percpu_data(link, -1, -1);
+
+        pid = getpid();
+        skel->bss->target_pid = pid;
+        CPU_ZERO(&set);
+        CPU_SET(0, &set);
+        err = sched_setaffinity(pid, sizeof(set), &set);
+        if (!ASSERT_OK(err, "setaffinity"))
+                goto free_link;
+
+        if (test__start_subtest("self_pid_one_cpu"))
+                verify_percpu_data(link, 1, 1);
+
+        /* Assume there are at least 8 CPUs on the testbed */
+        if (nr_possible < 8)
+                goto free_link;
+
+        CPU_ZERO(&set);
+        /* Set the CPU affinity: 1,3-5 */
+        for (i = 0; i < ARRAY_SIZE(cpu_ids); i++)
+                CPU_SET(cpu_ids[i], &set);
+        err = sched_setaffinity(pid, sizeof(set), &set);
+        if (!ASSERT_OK(err, "setaffinity"))
+                goto free_link;
+
+        if (test__start_subtest("self_pid_multi_cpus"))
+                verify_percpu_data(link, ARRAY_SIZE(cpu_ids), 1);
+
+free_link:
+        bpf_link__destroy(link);
+close_fd:
+        close(cgrp_fd);
+cleanup:
+        cleanup_cgroup_environment();
+destroy:
+        test_cpumask_iter__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/cpumask_common.h b/tools/testing/selftests/bpf/progs/cpumask_common.h
index 0cd4aebb97cf..cdb9dc95e9d9 100644
--- a/tools/testing/selftests/bpf/progs/cpumask_common.h
+++ b/tools/testing/selftests/bpf/progs/cpumask_common.h
@@ -55,6 +55,9 @@ void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym
 u32 bpf_cpumask_any_distribute(const struct cpumask *src) __ksym;
 u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, const struct cpumask *src2) __ksym;
 u32 bpf_cpumask_weight(const struct cpumask *cpumask) __ksym;
+int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask) __ksym;
+int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it) __ksym;
+void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it) __ksym;
 
 void bpf_rcu_read_lock(void) __ksym;
 void bpf_rcu_read_unlock(void) __ksym;
diff --git a/tools/testing/selftests/bpf/progs/test_cpumask_iter.c b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
new file mode 100644
index 000000000000..cb8b8359516b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "task_kfunc_common.h"
+#include "cpumask_common.h"
+
+extern const struct psi_group_cpu system_group_pcpu __ksym __weak;
+extern const struct rq runqueues __ksym __weak;
+
+int target_pid;
+
+SEC("iter.s/cgroup")
+int BPF_PROG(cpu_cgroup, struct bpf_iter_meta *meta, struct cgroup *cgrp)
+{
+        u32 nr_running = 0, psi_nr_running = 0, nr_cpus = 0;
+        struct psi_group_cpu *groupc;
+        struct task_struct *p;
+        struct rq *rq;
+        int *cpu;
+
+        /* epilogue */
+        if (cgrp == NULL)
+                return 0;
+
+        bpf_rcu_read_lock();
+        p = bpf_task_from_pid(target_pid);
+        if (!p) {
+                bpf_rcu_read_unlock();
+                return 1;
+        }
+
+        bpf_for_each(cpumask, cpu, p->cpus_ptr) {
+                rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, *cpu);
+                if (!rq)
+                        continue;
+                nr_running += rq->nr_running;
+                nr_cpus += 1;
+
+                groupc = (struct psi_group_cpu *)bpf_per_cpu_ptr(&system_group_pcpu, *cpu);
+                if (!groupc)
+                        continue;
+                psi_nr_running += groupc->tasks[NR_RUNNING];
+        }
+        BPF_SEQ_PRINTF(meta->seq, "nr_running %u nr_cpus %u psi_running %u\n",
+                       nr_running, nr_cpus, psi_nr_running);
+
+        bpf_task_release(p);
+        bpf_rcu_read_unlock();
+        return 0;
+}
+
+char _license[] SEC("license") = "GPL";
Within the BPF program, we leverage the cgroup iterator to iterate through
percpu runqueue data, specifically the 'nr_running' metric. Subsequently we
expose this data to userspace by means of a sequence file.

The CPU affinity for the cpumask is determined by the PID of a task:

- PID of the init task (PID 1)
  We typically don't set CPU affinity for the init task and thus we can
  iterate across all possible CPUs using the init task. However, in
  scenarios where you've set CPU affinity for the init task, you should set
  your current task's CPU affinity to all possible CPUs and then proceed to
  iterate through all possible CPUs using the current task.
- PID of a task with defined CPU affinity
  The aim here is to iterate through a specific cpumask. This scenario
  aligns with tasks residing within a cpuset cgroup.
- Invalid PID (e.g., PID -1)
  No cpumask is available in this case.

The results are as follows:

  #65/1 cpumask_iter/init_pid:OK
  #65/2 cpumask_iter/invalid_pid:OK
  #65/3 cpumask_iter/self_pid_one_cpu:OK
  #65/4 cpumask_iter/self_pid_multi_cpus:OK
  #65 cpumask_iter:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

CONFIG_PSI=y is required for this testcase.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/cpumask_iter.c   | 130 ++++++++++++++++++
 .../selftests/bpf/progs/cpumask_common.h      |   3 +
 .../selftests/bpf/progs/test_cpumask_iter.c   |  56 ++++++++
 4 files changed, 190 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cpumask_iter.c
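For reference, assuming the standard BPF selftests workflow, the new
testcase can be exercised with:

  $ cd tools/testing/selftests/bpf
  $ make
  $ sudo ./test_progs -t cpumask_iter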