Message ID | 20220709000439.243271-9-yosryahmed@google.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | bpf: rstat: cgroup hierarchical stats |
On 7/8/22 5:04 PM, Yosry Ahmed wrote:
> Add a selftest that tests the whole workflow for collecting,
> aggregating (flushing), and displaying cgroup hierarchical stats.
>
> TL;DR:
> - Userspace program creates a cgroup hierarchy and induces memcg reclaim
>   in parts of it.
> - Whenever reclaim happens, vmscan_start and vmscan_end update
>   per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
>   have updates.
> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
>   the stats, and outputs the stats in text format to userspace (similar
>   to cgroupfs stats).
> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
>   updates, vmscan_flush aggregates cpu readings and propagates updates
>   to parents.
> - Userspace program makes sure the stats are aggregated and read
>   correctly.
>
> Detailed explanation:
> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
>   measure the latency of cgroup reclaim. Per-cgroup readings are stored in
>   percpu maps for efficiency. When a cgroup reading is updated on a cpu,
>   cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
>   rstat updated tree on that cpu.
>
> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
>   each cgroup. Reading this file invokes the program, which calls
>   cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
>   cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
>   the stats are exposed to the user. vmscan_dump returns 1 to terminate
>   iteration early, so that we only expose stats for one cgroup per read.
>
> - An ftrace program, vmscan_flush, is also loaded and attached to
>   bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
>   once for each (cgroup, cpu) pair that has updates. cgroups are popped
>   from the rstat tree in a bottom-up fashion, so calls will always be
>   made for cgroups that have updates before their parents. The program
>   aggregates percpu readings to a total per-cgroup reading, and also
>   propagates them to the parent cgroup. After rstat flushing is over, all
>   cgroups will have correct updated hierarchical readings (including all
>   cpus and all their descendants).
>
> - Finally, the test creates a cgroup hierarchy and induces memcg reclaim
>   in parts of it, and makes sure that the stats collection, aggregation,
>   and reading workflow works as expected.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  .../prog_tests/cgroup_hierarchical_stats.c | 362 ++++++++++++++++++
>  .../bpf/progs/cgroup_hierarchical_stats.c | 235 ++++++++++++
>  2 files changed, 597 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
>  create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
>
[...]
> +
> +static unsigned long long get_cgroup_vmscan_delay(unsigned long long cgroup_id,
> +						  const char *file_name)
> +{
> +	char buf[128], path[128];
> +	unsigned long long vmscan = 0, id = 0;
> +	int err;
> +
> +	/* For every cgroup, read the file generated by cgroup_iter */
> +	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> +	err = read_from_file(path, buf, 128);
> +	if (!ASSERT_OK(err, "read cgroup_iter"))
> +		return 0;
> +
> +	/* Check the output file formatting */
> +	ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
> +			 &id, &vmscan), 2, "output format");
> +
> +	/* Check that the cgroup_id is displayed correctly */
> +	ASSERT_EQ(id, cgroup_id, "cgroup_id");
> +	/* Check that the vmscan reading is non-zero */
> +	ASSERT_GT(vmscan, 0, "vmscan_reading");
> +	return vmscan;
> +}
> +
> +static void check_vmscan_stats(void)
> +{
> +	int i;
> +	unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
> +
> +	for (i = 0; i < N_CGROUPS; i++)
> +		vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id,
> +							     cgroups[i].name);
> +
> +	/* Read stats for root too */
> +	vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
> +
> +	/* Check that child1 == child1_1 + child1_2 */
> +	ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
> +		  "child1_vmscan");
> +	/* Check that child2 == child2_1 + child2_2 */
> +	ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
> +		  "child2_vmscan");
> +	/* Check that test == child1 + child2 */
> +	ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
> +		  "test_vmscan");
> +	/* Check that root >= test */
> +	ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");

I still get a test failure with

  get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
  get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading: actual 0 <= expected 0
  check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual 0 != expected -2
  check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual 0 != expected -2
  check_vmscan_stats:PASS:test_vmscan 0 nsec
  check_vmscan_stats:PASS:root_vmscan 0 nsec

I added 'dump_stack()' in function try_to_free_mem_cgroup_pages()
and run this test (#33) and didn't get any stacktrace.
But I do get stacktraces due to other operations like

  try_to_free_mem_cgroup_pages+0x1fd [kernel]
  try_to_free_mem_cgroup_pages+0x1fd [kernel]
  memory_reclaim_write+0x88 [kernel]
  cgroup_file_write+0x88 [kernel]
  kernfs_fop_write_iter+0xd0 [kernel]
  vfs_write+0x2c4 [kernel]
  __x64_sys_write+0x60 [kernel]
  do_syscall_64+0x2d [kernel]
  entry_SYSCALL_64_after_hwframe+0x44 [kernel]

If you can show me the stacktrace about how
try_to_free_mem_cgroup_pages() is triggered in your setup, I can
help debug this problem in my environment.

> +}
> +
> +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj, int cgroup_fd,
[...]
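The flushing scheme described in the patch (per-cpu readings folded into a per-cgroup total, children processed before parents, each delta handed up to the parent) can be sketched in plain userspace C. This is only an illustration of the arithmetic, not the BPF programs from the patch: `struct cg_stat`, `cg_update`, `cg_flush_one`, `cg_flush_all` and `N_CPUS` are hypothetical names, and the bottom-up ordering that rstat provides in the kernel is assumed here to be supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

#define N_CPUS 2 /* hypothetical CPU count for this sketch */

/* Hypothetical userspace stand-in for one cgroup's stats. */
struct cg_stat {
	struct cg_stat *parent;            /* NULL for the root */
	unsigned long long percpu[N_CPUS]; /* unflushed per-CPU readings */
	unsigned long long pending;        /* deltas pushed up by children */
	unsigned long long total;          /* flushed hierarchical total */
};

/* Record a reading on one CPU (conceptually what vmscan_end does). */
static void cg_update(struct cg_stat *cg, int cpu, unsigned long long delta)
{
	assert(cpu >= 0 && cpu < N_CPUS);
	cg->percpu[cpu] += delta;
}

/*
 * Flush one (cgroup, cpu) pair: fold the per-CPU reading and whatever the
 * children pushed up into the total, then hand the same delta to the parent.
 */
static void cg_flush_one(struct cg_stat *cg, int cpu)
{
	unsigned long long delta = cg->percpu[cpu] + cg->pending;

	cg->percpu[cpu] = 0;
	cg->pending = 0;
	cg->total += delta;
	if (cg->parent)
		cg->parent->pending += delta;
}

/* Flush every CPU of every cgroup; the array must list children first. */
static void cg_flush_all(struct cg_stat **bottom_up, size_t n)
{
	for (size_t i = 0; i < n; i++)
		for (int cpu = 0; cpu < N_CPUS; cpu++)
			cg_flush_one(bottom_up[i], cpu);
}
```

With two leaf cgroups under one parent and no readings recorded on the parent itself, flushing in children-first order leaves the parent's total equal to the sum of its children, which is the invariant check_vmscan_stats() asserts.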
On 7/10/22 5:26 PM, Yonghong Song wrote: > > > On 7/8/22 5:04 PM, Yosry Ahmed wrote: >> Add a selftest that tests the whole workflow for collecting, >> aggregating (flushing), and displaying cgroup hierarchical stats. >> >> TL;DR: >> - Userspace program creates a cgroup hierarchy and induces memcg reclaim >> in parts of it. >> - Whenever reclaim happens, vmscan_start and vmscan_end update >> per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs >> have updates. >> - When userspace tries to read the stats, vmscan_dump calls rstat to >> flush >> the stats, and outputs the stats in text format to userspace (similar >> to cgroupfs stats). >> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has >> updates, vmscan_flush aggregates cpu readings and propagates updates >> to parents. >> - Userspace program makes sure the stats are aggregated and read >> correctly. >> >> Detailed explanation: >> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to >> measure the latency of cgroup reclaim. Per-cgroup readings are >> stored in >> percpu maps for efficiency. When a cgroup reading is updated on a cpu, >> cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the >> rstat updated tree on that cpu. >> >> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for >> each cgroup. Reading this file invokes the program, which calls >> cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates >> for all >> cpus and cgroups that have updates in this cgroup's subtree. >> Afterwards, >> the stats are exposed to the user. vmscan_dump returns 1 to terminate >> iteration early, so that we only expose stats for one cgroup per read. >> >> - An ftrace program, vmscan_flush, is also loaded and attached to >> bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is >> invoked >> once for each (cgroup, cpu) pair that has updates. 
cgroups are popped >> from the rstat tree in a bottom-up fashion, so calls will always be >> made for cgroups that have updates before their parents. The program >> aggregates percpu readings to a total per-cgroup reading, and also >> propagates them to the parent cgroup. After rstat flushing is over, >> all >> cgroups will have correct updated hierarchical readings (including all >> cpus and all their descendants). >> >> - Finally, the test creates a cgroup hierarchy and induces memcg reclaim >> in parts of it, and makes sure that the stats collection, aggregation, >> and reading workflow works as expected. >> >> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> >> --- >> .../prog_tests/cgroup_hierarchical_stats.c | 362 ++++++++++++++++++ >> .../bpf/progs/cgroup_hierarchical_stats.c | 235 ++++++++++++ >> 2 files changed, 597 insertions(+) >> create mode 100644 >> tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c >> create mode 100644 >> tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c >> > [...] 
>> + >> +static unsigned long long get_cgroup_vmscan_delay(unsigned long long >> cgroup_id, >> + const char *file_name) >> +{ >> + char buf[128], path[128]; >> + unsigned long long vmscan = 0, id = 0; >> + int err; >> + >> + /* For every cgroup, read the file generated by cgroup_iter */ >> + snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name); >> + err = read_from_file(path, buf, 128); >> + if (!ASSERT_OK(err, "read cgroup_iter")) >> + return 0; >> + >> + /* Check the output file formatting */ >> + ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n", >> + &id, &vmscan), 2, "output format"); >> + >> + /* Check that the cgroup_id is displayed correctly */ >> + ASSERT_EQ(id, cgroup_id, "cgroup_id"); >> + /* Check that the vmscan reading is non-zero */ >> + ASSERT_GT(vmscan, 0, "vmscan_reading"); >> + return vmscan; >> +} >> + >> +static void check_vmscan_stats(void) >> +{ >> + int i; >> + unsigned long long vmscan_readings[N_CGROUPS], vmscan_root; >> + >> + for (i = 0; i < N_CGROUPS; i++) >> + vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id, >> + cgroups[i].name); >> + >> + /* Read stats for root too */ >> + vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME); >> + >> + /* Check that child1 == child1_1 + child1_2 */ >> + ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + >> vmscan_readings[4], >> + "child1_vmscan"); >> + /* Check that child2 == child2_1 + child2_2 */ >> + ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + >> vmscan_readings[6], >> + "child2_vmscan"); >> + /* Check that test == child1 + child2 */ >> + ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + >> vmscan_readings[2], >> + "test_vmscan"); >> + /* Check that root >= test */ >> + ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan"); > > I still get a test failure with > > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec > get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading: > actual 0 <= expected 0 > check_vmscan_stats:FAIL:child1_vmscan 
unexpected child1_vmscan: actual 0
> != expected -2
> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual 0
> != expected -2
> check_vmscan_stats:PASS:test_vmscan 0 nsec
> check_vmscan_stats:PASS:root_vmscan 0 nsec
>
> I added 'dump_stack()' in function try_to_free_mem_cgroup_pages()
> and run this test (#33) and didn't get any stacktrace.
> But I do get stacktraces due to other operations like
>   try_to_free_mem_cgroup_pages+0x1fd [kernel]
>   try_to_free_mem_cgroup_pages+0x1fd [kernel]
>   memory_reclaim_write+0x88 [kernel]
>   cgroup_file_write+0x88 [kernel]
>   kernfs_fop_write_iter+0xd0 [kernel]
>   vfs_write+0x2c4 [kernel]
>   __x64_sys_write+0x60 [kernel]
>   do_syscall_64+0x2d [kernel]
>   entry_SYSCALL_64_after_hwframe+0x44 [kernel]
>
> If you can show me the stacktrace about how
> try_to_free_mem_cgroup_pages() is triggered in your setup, I can
> help debug this problem in my environment.

BTW, CI also reported the test failure.
https://github.com/kernel-patches/bpf/pull/3284

For example, with gcc built kernel,
https://github.com/kernel-patches/bpf/runs/7272407890?check_suite_focus=true

The error:

  get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
  get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
  check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual 28390910 != expected 28390909
  check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual 0 != expected -2
  check_vmscan_stats:PASS:test_vmscan 0 nsec
  check_vmscan_stats:PASS:root_vmscan 0 nsec

>
>> +}
>> +
>> +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj,
>> int cgroup_fd,
> [...]
On Sun, Jul 10, 2022 at 5:51 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/10/22 5:26 PM, Yonghong Song wrote:
[...]
>
> BTW, CI also reported the test failure.
> https://github.com/kernel-patches/bpf/pull/3284
>
> For example, with gcc built kernel,
> https://github.com/kernel-patches/bpf/runs/7272407890?check_suite_focus=true
>
> The error:
>
>    get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>    get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
>    check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan:
> actual 28390910 != expected 28390909
>    check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan:
> actual 0 != expected -2
>    check_vmscan_stats:PASS:test_vmscan 0 nsec
>    check_vmscan_stats:PASS:root_vmscan 0 nsec
>

Yonghong,

I noticed that the test only failed on test_progs-no_alu32, not
test_progs. test_progs passed. I believe Yosry and I have only tested
on test_progs. I tried building and running the no_alu32 version, but
so far, not able to run test_progs-no_alu32. Whenever I ran
test_progs-no_alu32, it exits without any message. Do you have any
clue what could be wrong?

>
> [...]
On 7/10/22 11:01 PM, Hao Luo wrote:
> On Sun, Jul 10, 2022 at 5:51 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 7/10/22 5:26 PM, Yonghong Song wrote:
> [...]
>>
>> BTW, CI also reported the test failure.
>> https://github.com/kernel-patches/bpf/pull/3284
>>
>> For example, with gcc built kernel,
>> https://github.com/kernel-patches/bpf/runs/7272407890?check_suite_focus=true
>>
>> The error:
>>
>>    get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>>    get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
>>    check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan:
>> actual 28390910 != expected 28390909
>>    check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan:
>> actual 0 != expected -2
>>    check_vmscan_stats:PASS:test_vmscan 0 nsec
>>    check_vmscan_stats:PASS:root_vmscan 0 nsec
>>
>
> Yonghong,
>
> I noticed that the test only failed on test_progs-no_alu32, not
> test_progs. test_progs passed. I believe Yosry and I have only tested

In my case, both test_progs and test_progs-no_alu32 failed the test.
I think the reason for the failure is the same.

> on test_progs. I tried building and running the no_alu32 version, but
> so far, not able to run test_progs-no_alu32. Whenever I ran
> test_progs-no_alu32, it exits without any message. Do you have any
> clue what could be wrong?

It works fine in my environment. test_progs should be very similar to
test_progs-no_alu32. The only difference is bpf programs with different
insn set. Some tests may not run with test_progs-no_alu32, e.g., newer
atomic insn tests.

I have no idea why test_progs-no_alu32 won't work for you, I guess you
may need to debug it a little bit.

>
>>>
> [...]
On Sun, Jul 10, 2022 at 11:19 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/10/22 11:01 PM, Hao Luo wrote:
> > On Sun, Jul 10, 2022 at 5:51 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 7/10/22 5:26 PM, Yonghong Song wrote:
> > [...]
> >>
> >> BTW, CI also reported the test failure.
> >> https://github.com/kernel-patches/bpf/pull/3284
> >>
> >> For example, with gcc built kernel,
> >> https://github.com/kernel-patches/bpf/runs/7272407890?check_suite_focus=true
> >>
> >> The error:
> >>
> >>    get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>    get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> >>    check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan:
> >> actual 28390910 != expected 28390909
> >>    check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan:
> >> actual 0 != expected -2
> >>    check_vmscan_stats:PASS:test_vmscan 0 nsec
> >>    check_vmscan_stats:PASS:root_vmscan 0 nsec
> >>
> >
> > Yonghong,
> >
> > I noticed that the test only failed on test_progs-no_alu32, not
> > test_progs. test_progs passed. I believe Yosry and I have only tested
>
> In my case, both test_progs and test_progs-no_alu32 failed the test.
> I think the reason for the failure is the same.
>
> > on test_progs. I tried building and running the no_alu32 version, but
> > so far, not able to run test_progs-no_alu32. Whenever I ran
> > test_progs-no_alu32, it exits without any message. Do you have any
> > clue what could be wrong?
>
> It works fine in my environment. test_progs should be very similar to
> test_progs-no_alu32. The only difference is bpf programs with different
> insn set. Some tests may not run with test_progs-no_alu32, e.g., newer
> atomic insn tests.
>
> I have no idea why test_progs-no_alu32 won't work for you, I guess you
> may need to debug it a little bit.
>

Yonghong,

I reproduced the failure using vmtest.sh now. Yosry and I are debugging
it. Once we have any result, we will report back. Thanks for taking a
look.

> >
> >>>
> > [...]
On Sun, Jul 10, 2022 at 5:51 PM Yonghong Song <yhs@fb.com> wrote: > > > > On 7/10/22 5:26 PM, Yonghong Song wrote: > > > > > > On 7/8/22 5:04 PM, Yosry Ahmed wrote: > >> Add a selftest that tests the whole workflow for collecting, > >> aggregating (flushing), and displaying cgroup hierarchical stats. > >> > >> TL;DR: > >> - Userspace program creates a cgroup hierarchy and induces memcg reclaim > >> in parts of it. > >> - Whenever reclaim happens, vmscan_start and vmscan_end update > >> per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs > >> have updates. > >> - When userspace tries to read the stats, vmscan_dump calls rstat to > >> flush > >> the stats, and outputs the stats in text format to userspace (similar > >> to cgroupfs stats). > >> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has > >> updates, vmscan_flush aggregates cpu readings and propagates updates > >> to parents. > >> - Userspace program makes sure the stats are aggregated and read > >> correctly. > >> > >> Detailed explanation: > >> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to > >> measure the latency of cgroup reclaim. Per-cgroup readings are > >> stored in > >> percpu maps for efficiency. When a cgroup reading is updated on a cpu, > >> cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the > >> rstat updated tree on that cpu. > >> > >> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for > >> each cgroup. Reading this file invokes the program, which calls > >> cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates > >> for all > >> cpus and cgroups that have updates in this cgroup's subtree. > >> Afterwards, > >> the stats are exposed to the user. vmscan_dump returns 1 to terminate > >> iteration early, so that we only expose stats for one cgroup per read. > >> > >> - An ftrace program, vmscan_flush, is also loaded and attached to > >> bpf_rstat_flush. 
When rstat flushing is ongoing, vmscan_flush is > >> invoked > >> once for each (cgroup, cpu) pair that has updates. cgroups are popped > >> from the rstat tree in a bottom-up fashion, so calls will always be > >> made for cgroups that have updates before their parents. The program > >> aggregates percpu readings to a total per-cgroup reading, and also > >> propagates them to the parent cgroup. After rstat flushing is over, > >> all > >> cgroups will have correct updated hierarchical readings (including all > >> cpus and all their descendants). > >> > >> - Finally, the test creates a cgroup hierarchy and induces memcg reclaim > >> in parts of it, and makes sure that the stats collection, aggregation, > >> and reading workflow works as expected. > >> > >> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > >> --- > >> .../prog_tests/cgroup_hierarchical_stats.c | 362 ++++++++++++++++++ > >> .../bpf/progs/cgroup_hierarchical_stats.c | 235 ++++++++++++ > >> 2 files changed, 597 insertions(+) > >> create mode 100644 > >> tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c > >> create mode 100644 > >> tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c > >> > > [...] 
> >> + > >> +static unsigned long long get_cgroup_vmscan_delay(unsigned long long > >> cgroup_id, > >> + const char *file_name) > >> +{ > >> + char buf[128], path[128]; > >> + unsigned long long vmscan = 0, id = 0; > >> + int err; > >> + > >> + /* For every cgroup, read the file generated by cgroup_iter */ > >> + snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name); > >> + err = read_from_file(path, buf, 128); > >> + if (!ASSERT_OK(err, "read cgroup_iter")) > >> + return 0; > >> + > >> + /* Check the output file formatting */ > >> + ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n", > >> + &id, &vmscan), 2, "output format"); > >> + > >> + /* Check that the cgroup_id is displayed correctly */ > >> + ASSERT_EQ(id, cgroup_id, "cgroup_id"); > >> + /* Check that the vmscan reading is non-zero */ > >> + ASSERT_GT(vmscan, 0, "vmscan_reading"); > >> + return vmscan; > >> +} > >> + > >> +static void check_vmscan_stats(void) > >> +{ > >> + int i; > >> + unsigned long long vmscan_readings[N_CGROUPS], vmscan_root; > >> + > >> + for (i = 0; i < N_CGROUPS; i++) > >> + vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id, > >> + cgroups[i].name); > >> + > >> + /* Read stats for root too */ > >> + vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME); > >> + > >> + /* Check that child1 == child1_1 + child1_2 */ > >> + ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + > >> vmscan_readings[4], > >> + "child1_vmscan"); > >> + /* Check that child2 == child2_1 + child2_2 */ > >> + ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + > >> vmscan_readings[6], > >> + "child2_vmscan"); > >> + /* Check that test == child1 + child2 */ > >> + ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + > >> vmscan_readings[2], > >> + "test_vmscan"); > >> + /* Check that root >= test */ > >> + ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan"); > > > > I still get a test failure with > > > > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec > > 
get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> > actual 0 <= expected 0
> > check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual 0
> > != expected -2
> > check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual 0
> > != expected -2
> > check_vmscan_stats:PASS:test_vmscan 0 nsec
> > check_vmscan_stats:PASS:root_vmscan 0 nsec
> >
> > I added 'dump_stack()' in function try_to_free_mem_cgroup_pages()
> > and run this test (#33) and didn't get any stacktrace.
> > But I do get stacktraces due to other operations like
> >   try_to_free_mem_cgroup_pages+0x1fd [kernel]
> >   try_to_free_mem_cgroup_pages+0x1fd [kernel]
> >   memory_reclaim_write+0x88 [kernel]
> >   cgroup_file_write+0x88 [kernel]
> >   kernfs_fop_write_iter+0xd0 [kernel]
> >   vfs_write+0x2c4 [kernel]
> >   __x64_sys_write+0x60 [kernel]
> >   do_syscall_64+0x2d [kernel]
> >   entry_SYSCALL_64_after_hwframe+0x44 [kernel]
> >
> > If you can show me the stacktrace about how
> > try_to_free_mem_cgroup_pages() is triggered in your setup, I can
> > help debug this problem in my environment.
>
> BTW, CI also reported the test failure.
> https://github.com/kernel-patches/bpf/pull/3284
>
> For example, with gcc built kernel,
> https://github.com/kernel-patches/bpf/runs/7272407890?check_suite_focus=true
>
> The error:
>
>    get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>    get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
>    check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan:
> actual 28390910 != expected 28390909
>    check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan:
> actual 0 != expected -2
>    check_vmscan_stats:PASS:test_vmscan 0 nsec
>    check_vmscan_stats:PASS:root_vmscan 0 nsec
>

Hey Yonghong,

Thanks for helping us debug this failure. I can reproduce the CI
failure in my environment, but this failure is actually different from
the failure in your environment. In your environment it looks like no
stats are gathered for all cgroups (either no reclaim happening or bpf
progs not being run). In the CI and in my environment, only one cgroup
observes this behavior.

The thing is, I was able to reproduce the problem only when I ran all
test_progs. When I run the selftest alone (test_progs -t
cgroup_hierarchical_stats), it consistently passes, which is
interesting.

Anyway, one failure at a time :) I am working on debugging the CI
failure (that occurs only when all tests are run), then we'll see if
fixing that fixes the problem in our environment as well.

If you have any pointers about why a test would consistently pass
alone and consistently fail with others that would be good. Otherwise,
I will keep you updated with any findings I reach.

Thanks again!

> >
> >
> >> +}
> >> +
> >> +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj,
> >> int cgroup_fd,
> > [...]
On Mon, Jul 11, 2022 at 8:55 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Sun, Jul 10, 2022 at 5:51 PM Yonghong Song <yhs@fb.com> wrote: > > > > > > > > On 7/10/22 5:26 PM, Yonghong Song wrote: > > > > > > > > > On 7/8/22 5:04 PM, Yosry Ahmed wrote: > > >> Add a selftest that tests the whole workflow for collecting, > > >> aggregating (flushing), and displaying cgroup hierarchical stats. > > >> > > >> TL;DR: > > >> - Userspace program creates a cgroup hierarchy and induces memcg reclaim > > >> in parts of it. > > >> - Whenever reclaim happens, vmscan_start and vmscan_end update > > >> per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs > > >> have updates. > > >> - When userspace tries to read the stats, vmscan_dump calls rstat to > > >> flush > > >> the stats, and outputs the stats in text format to userspace (similar > > >> to cgroupfs stats). > > >> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has > > >> updates, vmscan_flush aggregates cpu readings and propagates updates > > >> to parents. > > >> - Userspace program makes sure the stats are aggregated and read > > >> correctly. > > >> > > >> Detailed explanation: > > >> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to > > >> measure the latency of cgroup reclaim. Per-cgroup readings are > > >> stored in > > >> percpu maps for efficiency. When a cgroup reading is updated on a cpu, > > >> cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the > > >> rstat updated tree on that cpu. > > >> > > >> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for > > >> each cgroup. Reading this file invokes the program, which calls > > >> cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates > > >> for all > > >> cpus and cgroups that have updates in this cgroup's subtree. > > >> Afterwards, > > >> the stats are exposed to the user. 
vmscan_dump returns 1 to terminate > > >> iteration early, so that we only expose stats for one cgroup per read. > > >> > > >> - An ftrace program, vmscan_flush, is also loaded and attached to > > >> bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is > > >> invoked > > >> once for each (cgroup, cpu) pair that has updates. cgroups are popped > > >> from the rstat tree in a bottom-up fashion, so calls will always be > > >> made for cgroups that have updates before their parents. The program > > >> aggregates percpu readings to a total per-cgroup reading, and also > > >> propagates them to the parent cgroup. After rstat flushing is over, > > >> all > > >> cgroups will have correct updated hierarchical readings (including all > > >> cpus and all their descendants). > > >> > > >> - Finally, the test creates a cgroup hierarchy and induces memcg reclaim > > >> in parts of it, and makes sure that the stats collection, aggregation, > > >> and reading workflow works as expected. > > >> > > >> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > >> --- > > >> .../prog_tests/cgroup_hierarchical_stats.c | 362 ++++++++++++++++++ > > >> .../bpf/progs/cgroup_hierarchical_stats.c | 235 ++++++++++++ > > >> 2 files changed, 597 insertions(+) > > >> create mode 100644 > > >> tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c > > >> create mode 100644 > > >> tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c > > >> > > > [...] 
> > >> +
> > >> +static unsigned long long get_cgroup_vmscan_delay(unsigned long long
> > >> cgroup_id,
> > >> +						  const char *file_name)
> > >> +{
> > >> +	char buf[128], path[128];
> > >> +	unsigned long long vmscan = 0, id = 0;
> > >> +	int err;
> > >> +
> > >> +	/* For every cgroup, read the file generated by cgroup_iter */
> > >> +	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> > >> +	err = read_from_file(path, buf, 128);
> > >> +	if (!ASSERT_OK(err, "read cgroup_iter"))
> > >> +		return 0;
> > >> +
> > >> +	/* Check the output file formatting */
> > >> +	ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
> > >> +			 &id, &vmscan), 2, "output format");
> > >> +
> > >> +	/* Check that the cgroup_id is displayed correctly */
> > >> +	ASSERT_EQ(id, cgroup_id, "cgroup_id");
> > >> +	/* Check that the vmscan reading is non-zero */
> > >> +	ASSERT_GT(vmscan, 0, "vmscan_reading");
> > >> +	return vmscan;
> > >> +}
> > >> +
> > >> +static void check_vmscan_stats(void)
> > >> +{
> > >> +	int i;
> > >> +	unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
> > >> +
> > >> +	for (i = 0; i < N_CGROUPS; i++)
> > >> +		vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id,
> > >> +							     cgroups[i].name);
> > >> +
> > >> +	/* Read stats for root too */
> > >> +	vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
> > >> +
> > >> +	/* Check that child1 == child1_1 + child1_2 */
> > >> +	ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] +
> > >> vmscan_readings[4],
> > >> +		  "child1_vmscan");
> > >> +	/* Check that child2 == child2_1 + child2_2 */
> > >> +	ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] +
> > >> vmscan_readings[6],
> > >> +		  "child2_vmscan");
> > >> +	/* Check that test == child1 + child2 */
> > >> +	ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] +
> > >> vmscan_readings[2],
> > >> +		  "test_vmscan");
> > >> +	/* Check that root >= test */
> > >> +	ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");
> > >
> > > I still get a test failure with
> > >
> > > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> > > get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> > > actual 0 <= expected 0
> > > check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual 0
> > > != expected -2
> > > check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual 0
> > > != expected -2
> > > check_vmscan_stats:PASS:test_vmscan 0 nsec
> > > check_vmscan_stats:PASS:root_vmscan 0 nsec
> > >
> > > I added 'dump_stack()' in function try_to_free_mem_cgroup_pages()
> > > and run this test (#33) and didn't get any stacktrace.
> > > But I do get stacktraces due to other operations like
> > > try_to_free_mem_cgroup_pages+0x1fd [kernel]
> > > try_to_free_mem_cgroup_pages+0x1fd [kernel]
> > > memory_reclaim_write+0x88 [kernel]
> > > cgroup_file_write+0x88 [kernel]
> > > kernfs_fop_write_iter+0xd0 [kernel]
> > > vfs_write+0x2c4 [kernel]
> > > __x64_sys_write+0x60 [kernel]
> > > do_syscall_64+0x2d [kernel]
> > > entry_SYSCALL_64_after_hwframe+0x44 [kernel]
> > >
> > > If you can show me the stacktrace about how
> > > try_to_free_mem_cgroup_pages() is triggered in your setup, I can
> > > help debug this problem in my environment.
> >
> > BTW, CI also reported the test failure.
> > https://github.com/kernel-patches/bpf/pull/3284
> >
> > For example, with gcc built kernel,
> > https://github.com/kernel-patches/bpf/runs/7272407890?check_suite_focus=true
> >
> > The error:
> >
> > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> > get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> > check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan:
> > actual 28390910 != expected 28390909
> > check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan:
> > actual 0 != expected -2
> > check_vmscan_stats:PASS:test_vmscan 0 nsec
> > check_vmscan_stats:PASS:root_vmscan 0 nsec
> >
>
> Hey Yonghong,
>
> Thanks for helping us debug this failure. I can reproduce the CI
> failure in my environment, but this failure is actually different from
> the failure in your environment. In your environment it looks like no
> stats are gathered for all cgroups (either no reclaim happening or bpf
> progs not being run). In the CI and in my environment, only one cgroup
> observes this behavior.
>
> The thing is, I was able to reproduce the problem only when I ran all
> test_progs. When I run the selftest alone (test_progs -t
> cgroup_hierarchical_stats), it consistently passes, which is
> interesting.

I think I figured this one out (the CI failure). I set max_entries for
the maps in the test to 10, because I have 1 entry per-cgroup, and I
have less than 10 cgroups. When I run the test with other tests I
*think* there are other cgroups that are being created, so the number
exceeds 10, and some of the entries for the test cgroups cannot be
created. I saw a lot of "failed to create entry for cgroup.." messages
in the bpf trace produced by my test, and the error turned out to be
-E2BIG. I increased max_entries to 100 and it seems to be consistently
passing when run with all the other tests, using both test_progs and
test_progs-no_alu32.

Please find a diff attached fixing this problem and a few other nits:
- Return meaningful exit codes from the reclaimer() child process and
  check them in induce_vmscan().
- Make buf and path variables static in get_cgroup_vmscan_delay()
- Print error code in bpf trace when we fail to create a bpf map entry.
- Print 0 instead of -1 when we can't find a map entry, to avoid
  underflowing the unsigned counters in the test.

Let me know if this diff works or not, and if I need to send a new
version with the diff or not. Also let me know if this fixes the
failures that you have been seeing locally (which looked different
from the CI failures).

Thanks!

>
> Anyway, one failure at a time :) I am working on debugging the CI
> failure (that occurs only when all tests are run), then we'll see if
> fixing that fixes the problem in our environment as well.
>
> If you have any pointers about why a test would consistently pass
> alone and consistently fail with others that would be good. Otherwise,
> I will keep you updated with any findings I reach.
>
> Thanks again!
>
> > >
> > >> +}
> > >> +
> > >> +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj,
> > >> int cgroup_fd,
> > > [...]
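The max_entries explanation above can be illustrated outside the kernel. What follows is only a user-space sketch, not the selftest code: toy_map, TOY_MAX_ENTRIES and both helpers are hypothetical names made up for this illustration. The return codes are chosen to mirror what a BPF hash map update with BPF_NOEXIST reports for an existing key (-EEXIST) and for a full map (-E2BIG), the latter being the error seen in the bpf trace.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a BPF hash map created with max_entries = 10. */
#define TOY_MAX_ENTRIES 10

struct toy_map {
	unsigned long long keys[TOY_MAX_ENTRIES];
	size_t used;
};

/*
 * Create an entry keyed by cgroup id, loosely imitating
 * bpf_map_update_elem(..., BPF_NOEXIST):
 *   0        entry created
 *   -EEXIST  key already present
 *   -E2BIG   map already holds max_entries keys
 */
static int toy_map_create(struct toy_map *map, unsigned long long cg_id)
{
	size_t i;

	for (i = 0; i < map->used; i++)
		if (map->keys[i] == cg_id)
			return -EEXIST;
	if (map->used == TOY_MAX_ENTRIES)
		return -E2BIG;
	map->keys[map->used++] = cg_id;
	return 0;
}

/*
 * Try to create entries for n consecutive cgroup ids starting at first_id
 * and return how many of them were rejected with -E2BIG.
 */
static int toy_map_create_range(struct toy_map *map,
				unsigned long long first_id, int n)
{
	int i, rejected = 0;

	for (i = 0; i < n; i++)
		if (toy_map_create(map, first_id + i) == -E2BIG)
			rejected++;
	return rejected;
}
```

If, say, five unrelated cgroups (an invented number) get entries first, a later toy_map_create_range() call for eight test cgroups only fits five of them and the remaining three are rejected. That matches the symptom of the test passing alone and failing when run with the rest of test_progs.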
On Mon, Jul 18, 2022 at 12:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> [...]
>
> I think I figured this one out (the CI failure). I set max_entries for
> the maps in the test to 10, because I have 1 entry per-cgroup, and I
> have less than 10 cgroups. When I run the test with other tests I
> *think* there are other cgroups that are being created, so the number
> exceeds 10, and some of the entries for the test cgroups cannot be
> created.

Using hashmap to store per-cgroup data is only a short-term solution.
We should work on extending cgroup-local storage to tracing programs.
Maybe as a follow-up change once cgroup_iter is merged.

> in the bpf trace produced by my test, and the error turned out to be
> -E2BIG. I increased max_entries to 100 and it seems to be consistently
> passing when run with all the other tests, using both test_progs and
> test_progs-no_alu32.
>
> Please find a diff attached fixing this problem and a few other nits:
> - Return meaningful exit codes from the reclaimer() child process and
> check them in induce_vmscan().
> - Make buf and path variables static in get_cgroup_vmscan_delay()
> - Print error code in bpf trace when we fail to create a bpf map entry.
> - Print 0 instead of -1 when we can't find a map entry, to avoid
> underflowing the unsigned counters in the test.
>
> Let me know if this diff works or not, and if I need to send a new
> version with the diff or not. Also let me know if this fixes the
> failures that you have been seeing locally (which looked different
> from the CI failures).
>

Yosry,

I also need to address Yonghong's comments in the cgroup_iter patch, so
we need to send v4 anyway.

Hao

> Thanks!
> [...]
On 7/18/22 12:34 PM, Yosry Ahmed wrote:
> On Mon, Jul 11, 2022 at 8:55 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Sun, Jul 10, 2022 at 5:51 PM Yonghong Song <yhs@fb.com> wrote:
>>>
>>> [...]
>>>
>>
>> Hey Yonghong,
>>
>> Thanks for helping us debug this failure.
>> I can reproduce the CI failure in my environment, but this failure is
>> actually different from the failure in your environment. In your
>> environment it looks like no stats are gathered for all cgroups
>> (either no reclaim happening or bpf progs not being run). In the CI
>> and in my environment, only one cgroup observes this behavior.
>>
>> The thing is, I was able to reproduce the problem only when I ran all
>> test_progs. When I run the selftest alone (test_progs -t
>> cgroup_hierarchical_stats), it consistently passes, which is
>> interesting.
>
> I think I figured this one out (the CI failure). I set max_entries for
> the maps in the test to 10, because I have 1 entry per-cgroup, and I
> have less than 10 cgroups. When I run the test with other tests I
> *think* there are other cgroups that are being created, so the number
> exceeds 10, and some of the entries for the test cgroups cannot be
> created. I saw a lot of "failed to create entry for cgroup.." messages
> in the bpf trace produced by my test, and the error turned out to be
> -E2BIG. I increased max_entries to 100 and it seems to be consistently
> passing when run with all the other tests, using both test_progs and
> test_progs-no_alu32.
>
> Please find a diff attached fixing this problem and a few other nits:
> - Return meaningful exit codes from the reclaimer() child process and
> check them in induce_vmscan().
> - Make buf and path variables static in get_cgroup_vmscan_delay()
> - Print error code in bpf trace when we fail to create a bpf map entry.
> - Print 0 instead of -1 when we can't find a map entry, to avoid
> underflowing the unsigned counters in the test.
>
> Let me know if this diff works or not, and if I need to send a new
> version with the diff or not. Also let me know if this fixes the
> failures that you have been seeing locally (which looked different
> from the CI failures).

I tried this patch and the test passed in my local environment
so the diff sounds good to me.

>
> Thanks!
>
>>
>> Anyway, one failure at a time :) I am working on debugging the CI
>> failure (that occurs only when all tests are run), then we'll see if
>> fixing that fixes the problem in our environment as well.
>>
>> If you have any pointers about why a test would consistently pass
>> alone and consistently fail with others that would be good. Otherwise,
>> I will keep you updated with any findings I reach.
>>
>> Thanks again!
>>
>>>>
>>>>> +}
>>>>> +
>>>>> +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj,
>>>>> int cgroup_fd,
>>>> [...]
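The last nit in the list above (print 0 instead of -1 for a missing map entry) also explains the "expected -2" lines in the failure logs. The snippet below is a hedged user-space sketch of that arithmetic, assuming the dump program emitted a literal -1 for a missing entry and that the assertion macro printed its operands as signed values; parse_delay() and children_sum_as_signed() are hypothetical helpers, and only the format string is taken from the test.

```c
#include <stdio.h>

/*
 * Parse one line in the format the test expects from the dump program.
 * Returns the parsed delay, or 0 if the line does not match.
 *
 * Note: the %llu conversion accepts an optional sign, so "-1" is
 * converted like strtoull() would and comes back as ULLONG_MAX.
 */
static unsigned long long parse_delay(const char *line)
{
	unsigned long long id = 0, delay = 0;

	if (sscanf(line, "cg_id: %llu, total_vmscan_delay: %llu",
		   &id, &delay) != 2)
		return 0;
	return delay;
}

/*
 * Add two child readings as the test does, then reinterpret the unsigned
 * sum as signed, which is how it would look if printed with %lld
 * (assumes the usual two's complement conversion done by gcc and clang).
 */
static long long children_sum_as_signed(const char *child_a,
					const char *child_b)
{
	return (long long)(parse_delay(child_a) + parse_delay(child_b));
}
```

With two children that both report -1, the unsigned sum wraps around to ULLONG_MAX - 1, which reads as -2 when shown as a signed number. Printing 0 for a missing entry keeps the sum at 0 instead, so a parent reading of 0 no longer compares against a wrapped value.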
On Tue, Jul 19, 2022 at 9:17 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/18/22 12:34 PM, Yosry Ahmed wrote:
> > On Mon, Jul 11, 2022 at 8:55 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>
> >> On Sun, Jul 10, 2022 at 5:51 PM Yonghong Song <yhs@fb.com> wrote:
> >>>
> >>> [...]
> >>>
> >>> BTW, CI also reported the test failure.
> >>> https://github.com/kernel-patches/bpf/pull/3284
> >>>
> >>> For example, with gcc built kernel,
> >>> https://github.com/kernel-patches/bpf/runs/7272407890?check_suite_focus=true
> >>>
> >>> The error:
> >>>
> >>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> >>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan:
> >>> actual 28390910 != expected 28390909
> >>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan:
> >>> actual 0 != expected -2
> >>> check_vmscan_stats:PASS:test_vmscan 0 nsec
> >>> check_vmscan_stats:PASS:root_vmscan 0 nsec
> >>>
> >>
> >> Hey Yonghong,
> >>
> >> Thanks for helping us debug this failure. I can reproduce the CI
> >> failure in my environment, but this failure is actually different from
> >> the failure in your environment. In your environment it looks like no
> >> stats are gathered for all cgroups (either no reclaim happening or bpf
> >> progs not being run). In the CI and in my environment, only one cgroup
> >> observes this behavior.
> >>
> >> The thing is, I was able to reproduce the problem only when I ran all
> >> test_progs. When I run the selftest alone (test_progs -t
> >> cgroup_hierarchical_stats), it consistently passes, which is
> >> interesting.
> >
> > I think I figured this one out (the CI failure). I set max_entries for
> > the maps in the test to 10, because I have 1 entry per-cgroup, and I
> > have less than 10 cgroups. When I run the test with other tests I
> > *think* there are other cgroups that are being created, so the number
> > exceeds 10, and some of the entries for the test cgroups cannot be
> > created. I saw a lot of "failed to create entry for cgroup.." messages
> > in the bpf trace produced by my test, and the error turned out to be
> > -E2BIG. I increased max_entries to 100 and it seems to be consistently
> > passing when run with all the other tests, using both test_progs and
> > test_progs-no_alu32.
> >
> > Please find a diff attached fixing this problem and a few other nits:
> > - Return meaningful exit codes from the reclaimer() child process and
> > check them in induce_vmscan().
> > - Make buf and path variables static in get_cgroup_vmscan_delay()
> > - Print error code in bpf trace when we fail to create a bpf map entry.
> > - Print 0 instead of -1 when we can't find a map entry, to avoid
> > underflowing the unsigned counters in the test.
> >
> > Let me know if this diff works or not, and if I need to send a new
> > version with the diff or not. Also let me know if this fixes the
> > failures that you have been seeing locally (which looked different
> > from the CI failures).
>
> I tried this patch and the test passed in my local environment
> so the diff sounds good to me.
>

Awesome! Thanks so much for helping debug this! I will bundle this
diff with Hao's cgroup_iter changes and send a v4 soon.

> >
> > Thanks!
> >
> >>
> >> Anyway, one failure at a time :) I am working on debugging the CI
> >> failure (that occurs only when all tests are run), then we'll see if
> >> fixing that fixes the problem in our environment as well.
> >>
> >> If you have any pointers about why a test would consistently pass
> >> alone and consistently fail with others that would be good. Otherwise,
> >> I will keep you updated with any findings I reach.
> >>
> >> Thanks again!
> >>
> >>>>
> >>>>> +}
> >>>>> +
> >>>>> +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj,
> >>>>> int cgroup_fd,
> >>>> [...]
diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
new file mode 100644
index 0000000000000..5d0a8bb110a44
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
@@ -0,0 +1,362 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "cgroup_hierarchical_stats.skel.h"
+
+#define PAGE_SIZE 4096
+#define MB(x) (x << 20)
+
+#define BPFFS_ROOT "/sys/fs/bpf/"
+#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
+
+#define CG_ROOT_NAME "root"
+#define CG_ROOT_ID 1
+
+#define CGROUP_PATH(p, n) {.path = #p"/"#n, .name = #n}
+
+static struct {
+	const char *path, *name;
+	unsigned long long id;
+	int fd;
+} cgroups[] = {
+	CGROUP_PATH(/, test),
+	CGROUP_PATH(/test, child1),
+	CGROUP_PATH(/test, child2),
+	CGROUP_PATH(/test/child1, child1_1),
+	CGROUP_PATH(/test/child1, child1_2),
+	CGROUP_PATH(/test/child2, child2_1),
+	CGROUP_PATH(/test/child2, child2_2),
+};
+
+#define N_CGROUPS ARRAY_SIZE(cgroups)
+#define N_NON_LEAF_CGROUPS 3
+
+int root_cgroup_fd;
+bool mounted_bpffs;
+
+static int read_from_file(const char *path, char *buf, size_t size)
+{
+	int fd, len;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		log_err("Open %s", path);
+		return 1;
+	}
+	len = read(fd, buf, size);
+	if (len < 0)
+		log_err("Read %s", path);
+	else
+		buf[len] = 0;
+	close(fd);
+	return len < 0;
+}
+
+static int setup_bpffs(void)
+{
+	int err;
+
+	/* Mount bpffs */
+	err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
+	mounted_bpffs = !err;
+	if (!ASSERT_OK(err && errno != EBUSY, "mount bpffs"))
+		return err;
+
+	/* Create a directory to contain stat files in bpffs */
+	err = mkdir(BPFFS_VMSCAN, 0755);
+	ASSERT_OK(err, "mkdir bpffs");
+	return err;
+}
+
+static void cleanup_bpffs(void)
+{
+	/* Remove created directory in bpffs */
+	ASSERT_OK(rmdir(BPFFS_VMSCAN), "rmdir "BPFFS_VMSCAN);
+
+	/* Unmount bpffs, if it wasn't already mounted when we started */
+	if (mounted_bpffs)
+		return;
+	ASSERT_OK(umount(BPFFS_ROOT), "unmount bpffs");
+}
+
+static int setup_cgroups(void)
+{
+	int i, fd, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		return err;
+
+	root_cgroup_fd = get_root_cgroup();
+	if (!ASSERT_GE(root_cgroup_fd, 0, "get_root_cgroup"))
+		return root_cgroup_fd;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
+			return fd;
+
+		cgroups[i].fd = fd;
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+
+		/*
+		 * Enable memcg controller for the entire hierarchy.
+		 * Note that stats are collected for all cgroups in a hierarchy
+		 * with memcg enabled anyway, but are only exposed for cgroups
+		 * that have memcg enabled.
+		 */
+		if (i < N_NON_LEAF_CGROUPS) {
+			err = enable_controllers(cgroups[i].path, "memory");
+			if (!ASSERT_OK(err, "enable_controllers"))
+				return err;
+		}
+	}
+	return 0;
+}
+
+static void cleanup_cgroups(void)
+{
+	close(root_cgroup_fd);
+	for (int i = 0; i < N_CGROUPS; i++)
+		close(cgroups[i].fd);
+	cleanup_cgroup_environment();
+}
+
+
+static int setup_hierarchy(void)
+{
+	return setup_bpffs() || setup_cgroups();
+}
+
+static void destroy_hierarchy(void)
+{
+	cleanup_cgroups();
+	cleanup_bpffs();
+}
+
+static void reclaimer(const char *cgroup_path, size_t size)
+{
+	char *buf, *ptr;
+	char size_buf[128];
+	int err;
+
+	/* Join cgroup in the parent process workdir */
+	join_parent_cgroup(cgroup_path);
+
+	/* Allocate memory */
+	buf = malloc(size);
+	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+		*ptr = 1;
+
+	/*
+	 * Try to reclaim memory.
+	 * memory.reclaim can return EAGAIN if the amount is not
+	 * fully reclaimed.
+	 */
+	snprintf(size_buf, 128, "%lu", size);
+	err = write_cgroup_file_parent(cgroup_path, "memory.reclaim", size_buf);
+
+	free(buf);
+	exit(err && errno != EAGAIN);
+}
+
+static int induce_vmscan(void)
+{
+	int i, status, err = 0;
+
+	/*
+	 * In every leaf cgroup, run a child process that allocates some memory
+	 * and attempts to reclaim some of it.
+	 */
+	for (i = N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
+		pid_t pid;
+
+		/* Create reclaimer child */
+		pid = fork();
+		if (pid == 0)
+			reclaimer(cgroups[i].path, MB(5));
+		if (!ASSERT_GT(pid, 0, "fork reclaimer child"))
+			return pid;
+
+		/* Cleanup reclaimer child */
+		waitpid(pid, &status, 0);
+		err = !WIFEXITED(status) || WEXITSTATUS(status);
+		ASSERT_OK(err, "reclaimer child exit status");
+	}
+	return 0;
+}
+
+static unsigned long long get_cgroup_vmscan_delay(unsigned long long cgroup_id,
+						  const char *file_name)
+{
+	char buf[128], path[128];
+	unsigned long long vmscan = 0, id = 0;
+	int err;
+
+	/* For every cgroup, read the file generated by cgroup_iter */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err = read_from_file(path, buf, 128);
+	if (!ASSERT_OK(err, "read cgroup_iter"))
+		return 0;
+
+	/* Check the output file formatting */
+	ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
+			 &id, &vmscan), 2, "output format");
+
+	/* Check that the cgroup_id is displayed correctly */
+	ASSERT_EQ(id, cgroup_id, "cgroup_id");
+	/* Check that the vmscan reading is non-zero */
+	ASSERT_GT(vmscan, 0, "vmscan_reading");
+	return vmscan;
+}
+
+static void check_vmscan_stats(void)
+{
+	int i;
+	unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
+
+	for (i = 0; i < N_CGROUPS; i++)
+		vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id,
+							     cgroups[i].name);
+
+	/* Read stats for root too */
+	vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
+
+	/* Check that child1 == child1_1 + child1_2 */
+	ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
+		  "child1_vmscan");
+	/* Check that child2 == child2_1 + child2_2 */
+	ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
+		  "child2_vmscan");
+	/* Check that test == child1 + child2 */
+	ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
+		  "test_vmscan");
+	/* Check that root >= test */
+	ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");
+}
+
+static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj, int cgroup_fd,
+			     const char *file_name)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo = {};
+	struct bpf_link *link;
+	char path[128];
+	int err;
+
+	/*
+	 * Create an iter link, parameterized by cgroup_fd.
+	 * We only want to traverse one cgroup, so set the traversal order to
+	 * "pre", and return 1 from dump_vmscan to stop iteration after the
+	 * first cgroup.
+	 */
+	linfo.cgroup.cgroup_fd = cgroup_fd;
+	linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+	link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
+	if (!ASSERT_OK_PTR(link, "attach iter"))
+		return libbpf_get_error(link);
+
+	/* Pin the link to a bpffs file */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err = bpf_link__pin(link, path);
+	if (!ASSERT_OK(err, "pin cgroup_iter"))
+		return err;
+
+	/* Remove the link, leaving only the ref held by the pinned file */
+	err = bpf_link__destroy(link);
+	ASSERT_OK(err, "destroy cgroup_iter link");
+	return err;
+}
+
+static int setup_progs(struct cgroup_hierarchical_stats **skel)
+{
+	int i, err;
+	struct bpf_link *link;
+	struct cgroup_hierarchical_stats *obj;
+
+	obj = cgroup_hierarchical_stats__open_and_load();
+	if (!ASSERT_OK_PTR(obj, "open_and_load"))
+		return libbpf_get_error(obj);
+
+	/* Attach cgroup_iter program that will dump the stats to cgroups */
+	for (i = 0; i < N_CGROUPS; i++) {
+		err = setup_cgroup_iter(obj, cgroups[i].fd, cgroups[i].name);
+		if (!ASSERT_OK(err, "setup_cgroup_iter"))
+			return err;
+	}
+	/* Also dump stats for root */
+	err = setup_cgroup_iter(obj, root_cgroup_fd, CG_ROOT_NAME);
+	if (!ASSERT_OK(err, "setup_cgroup_iter"))
+		return err;
+
+	/* Attach rstat flusher */
+	link = bpf_program__attach(obj->progs.vmscan_flush);
+	if (!ASSERT_OK_PTR(link, "attach rstat"))
+		return libbpf_get_error(link);
+	obj->links.vmscan_flush = link;
+
+	/* Attach tracing programs that will calculate vmscan delays */
+	link = bpf_program__attach(obj->progs.vmscan_start);
+	if (!ASSERT_OK_PTR(obj, "attach raw_tracepoint"))
+		return libbpf_get_error(link);
+	obj->links.vmscan_start = link;
+
+	link = bpf_program__attach(obj->progs.vmscan_end);
+	if (!ASSERT_OK_PTR(obj, "attach raw_tracepoint"))
+		return libbpf_get_error(link);
+	obj->links.vmscan_end = link;
+
+	*skel = obj;
+	return 0;
+}
+
+void destroy_progs(struct cgroup_hierarchical_stats *skel)
+{
+	char path[128];
+	int i;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		/* Delete files in bpffs that cgroup_iters are pinned in */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroups[i].name);
+		ASSERT_OK(remove(path), "remove cgroup_iter pin");
+	}
+
+	/* Delete root file in bpffs */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
+	ASSERT_OK(remove(path), "remove cgroup_iter root pin");
+	cgroup_hierarchical_stats__destroy(skel);
+}
+
+void test_cgroup_hierarchical_stats(void)
+{
+	struct cgroup_hierarchical_stats *skel = NULL;
+
+	if (setup_hierarchy())
+		goto hierarchy_cleanup;
+	if (setup_progs(&skel))
+		goto cleanup;
+	if (induce_vmscan())
+		goto cleanup;
+	check_vmscan_stats();
+cleanup:
+	destroy_progs(skel);
+hierarchy_cleanup:
+	destroy_hierarchy();
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
new file mode 100644
index 0000000000000..0a1a3bebdf4cb
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * Start times are stored per-task, not per-cgroup, as multiple tasks in one
+ * cgroup can perform reclaim concurrently.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, __u64);
+} vmscan_start_time SEC(".maps");
+
+struct vmscan_percpu {
+	/* Previous percpu state, to figure out if we have new updates */
+	__u64 prev;
+	/* Current percpu state */
+	__u64 state;
+};
+
+struct vmscan {
+	/* State propagated through children, pending aggregation */
+	__u64 pending;
+	/* Total state, including all cpus and all children */
+	__u64 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan_percpu);
+} pcpu_cgroup_vmscan_elapsed SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan);
+} cgroup_vmscan_elapsed SEC(".maps");
+
+extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
+extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
+
+static inline struct cgroup *task_memcg(struct task_struct *task)
+{
+	return task->cgroups->subsys[memory_cgrp_id]->cgroup;
+}
+
+static inline uint64_t cgroup_id(struct cgroup *cgrp)
+{
+	return cgrp->kn->id;
+}
+
+static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
+{
+	struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
+
+	if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
+				&pcpu_init, BPF_NOEXIST)) {
+		bpf_printk("failed to create pcpu entry for cgroup %llu\n"
+ , cg_id); + return 1; + } + return 0; +} + +static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending) +{ + struct vmscan init = {.state = state, .pending = pending}; + + if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id, + &init, BPF_NOEXIST)) { + bpf_printk("failed to create entry for cgroup %llu\n" + , cg_id); + return 1; + } + return 0; +} + +SEC("tp_btf/mm_vmscan_memcg_reclaim_begin") +int BPF_PROG(vmscan_start, int order, gfp_t gfp_flags) +{ + struct task_struct *task = bpf_get_current_task_btf(); + __u64 *start_time_ptr; + + start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!start_time_ptr) { + bpf_printk("error retrieving storage\n"); + return 0; + } + + *start_time_ptr = bpf_ktime_get_ns(); + return 0; +} + +SEC("tp_btf/mm_vmscan_memcg_reclaim_end") +int BPF_PROG(vmscan_end, unsigned long nr_reclaimed) +{ + struct vmscan_percpu *pcpu_stat; + struct task_struct *current = bpf_get_current_task_btf(); + struct cgroup *cgrp; + __u64 *start_time_ptr; + __u64 current_elapsed, cg_id; + __u64 end_time = bpf_ktime_get_ns(); + + /* + * cgrp is the first parent cgroup of current that has memcg enabled in + * its subtree_control, or NULL if memcg is disabled in the entire tree. + * In a cgroup hierarchy like this: + * a + * / \ + * b c + * If "a" has memcg enabled, while "b" doesn't, then processes in "b" + * will accumulate their stats directly to "a". This makes sure that no + * stats are lost from processes in leaf cgroups that don't have memcg + * enabled, but only exposes stats for cgroups that have memcg enabled. 
+ */ + cgrp = task_memcg(current); + if (!cgrp) + return 0; + + cg_id = cgroup_id(cgrp); + start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!start_time_ptr) { + bpf_printk("error retrieving storage local storage\n"); + return 0; + } + + current_elapsed = end_time - *start_time_ptr; + pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed, + &cg_id); + if (pcpu_stat) + pcpu_stat->state += current_elapsed; + else if (create_vmscan_percpu_elem(cg_id, current_elapsed)) + return 0; + + cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id()); + return 0; +} + +SEC("fentry/bpf_rstat_flush") +int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu) +{ + struct vmscan_percpu *pcpu_stat; + struct vmscan *total_stat, *parent_stat; + __u64 cg_id = cgroup_id(cgrp); + __u64 parent_cg_id = parent ? cgroup_id(parent) : 0; + __u64 *pcpu_vmscan; + __u64 state; + __u64 delta = 0; + + /* Add CPU changes on this level since the last flush */ + pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed, + &cg_id, cpu); + if (pcpu_stat) { + state = pcpu_stat->state; + delta += state - pcpu_stat->prev; + pcpu_stat->prev = state; + } + + total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id); + if (!total_stat) { + if (create_vmscan_elem(cg_id, delta, 0)) + return 0; + goto update_parent; + } + + /* Collect pending stats from subtree */ + if (total_stat->pending) { + delta += total_stat->pending; + total_stat->pending = 0; + } + + /* Propagate changes to this cgroup's total */ + total_stat->state += delta; + +update_parent: + /* Skip if there are no changes to propagate, or no parent */ + if (!delta || !parent_cg_id) + return 0; + + /* Propagate changes to cgroup's parent */ + parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, + &parent_cg_id); + if (parent_stat) + parent_stat->pending += delta; + else + create_vmscan_elem(parent_cg_id, 0, delta); + + return 0; +} + 
+SEC("iter.s/cgroup") +int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp) +{ + struct seq_file *seq = meta->seq; + struct vmscan *total_stat; + __u64 cg_id = cgrp ? cgroup_id(cgrp) : 0; + + /* Do nothing for the terminal call */ + if (!cg_id) + return 1; + + /* Flush the stats to make sure we get the most updated numbers */ + cgroup_rstat_flush(cgrp); + + total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id); + if (!total_stat) { + bpf_printk("error finding stats for cgroup %llu\n", cg_id); + BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: -1\n", + cg_id); + return 1; + } + BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n", + cg_id, total_stat->state); + + /* + * We only dump stats for one cgroup here, so return 1 to stop + * iteration after the first cgroup. + */ + return 1; +}
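For illustration only, userspace could consume the text emitted by dump_vmscan roughly as sketched below. This is not the selftest's actual reader: parse_vmscan_line is a hypothetical helper name, and the sketch assumes only the two output formats visible in the BPF program above ("cg_id: %llu, total_vmscan_delay: %llu", with -1 printed as the delay when no stats entry exists for the cgroup).

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): parse one line of
 * dump_vmscan output. Returns 0 and fills *cg_id and *delay on success,
 * or -1 if the line is malformed or carries the "-1" sentinel that
 * dump_vmscan prints when it finds no stats for the cgroup.
 */
static int parse_vmscan_line(const char *line, unsigned long long *cg_id,
			     unsigned long long *delay)
{
	char delay_str[32];

	/* Read the delay as a token first so the sentinel can be detected */
	if (sscanf(line, "cg_id: %llu, total_vmscan_delay: %31s",
		   cg_id, delay_str) != 2)
		return -1;

	/* "-1" means the BPF side could not find an entry */
	if (delay_str[0] == '-')
		return -1;

	return sscanf(delay_str, "%llu", delay) == 1 ? 0 : -1;
}
```

A caller would read the pinned bpffs file for a cgroup into a buffer and pass that buffer to the helper. Reading the delay as a string before converting keeps the -1 sentinel from being silently turned into a huge unsigned value.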
Add a selftest that tests the whole workflow for collecting,
aggregating (flushing), and displaying cgroup hierarchical stats.

TL;DR:
- Userspace program creates a cgroup hierarchy and induces memcg reclaim
  in parts of it.
- Whenever reclaim happens, vmscan_start and vmscan_end update
  per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
  have updates.
- When userspace tries to read the stats, dump_vmscan calls rstat to flush
  the stats, and outputs the stats in text format to userspace (similar
  to cgroupfs stats).
- rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
  updates; vmscan_flush aggregates cpu readings and propagates updates
  to parents.
- Userspace program makes sure the stats are aggregated and read
  correctly.

Detailed explanation:
- The test loads tracing bpf programs, vmscan_start and vmscan_end, to
  measure the latency of cgroup reclaim. Per-cgroup readings are stored in
  percpu maps for efficiency. When a cgroup reading is updated on a cpu,
  cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
  rstat updated tree on that cpu.

- A cgroup_iter program, dump_vmscan, is loaded and pinned to a file, for
  each cgroup. Reading this file invokes the program, which calls
  cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
  cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
  the stats are exposed to the user. dump_vmscan returns 1 to terminate
  iteration early, so that we only expose stats for one cgroup per read.

- An fentry program, vmscan_flush, is also loaded and attached to
  bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
  once for each (cgroup, cpu) pair that has updates. cgroups are popped
  from the rstat tree in a bottom-up fashion, so calls will always be
  made for cgroups that have updates before their parents. The program
  aggregates percpu readings to a total per-cgroup reading, and also
  propagates them to the parent cgroup. After rstat flushing is over, all
  cgroups will have correct updated hierarchical readings (including all
  cpus and all their descendants).

- Finally, the test creates a cgroup hierarchy and induces memcg reclaim
  in parts of it, and makes sure that the stats collection, aggregation,
  and reading workflow works as expected.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 .../prog_tests/cgroup_hierarchical_stats.c    | 362 ++++++++++++++++++
 .../bpf/progs/cgroup_hierarchical_stats.c     | 235 ++++++++++++
 2 files changed, 597 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
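The bottom-up aggregation described in the commit message can be modelled in plain userspace C, which may help when reasoning about the expected readings. This is a simplified sketch, not the kernel or BPF implementation: all names here (record, flush_one, flush_all, the pcpu and total arrays) are hypothetical, the node layout is assumed to mirror the test hierarchy (0 = test, 1 = child1, 2 = child2, 3..6 = the leaf cgroups), and unlike rstat it visits every (node, cpu) pair instead of only those with updates.

```c
#include <stdint.h>

#define N_NODES 7
#define N_CPUS  2

struct percpu_stat { uint64_t prev, state; };	/* like struct vmscan_percpu */
struct total_stat  { uint64_t pending, state; };	/* like struct vmscan */

/* Assumed layout: parent index per node, -1 for the top node */
static const int parent[N_NODES] = { -1, 0, 0, 1, 1, 2, 2 };
static struct percpu_stat pcpu[N_NODES][N_CPUS];
static struct total_stat total[N_NODES];

/* Model of vmscan_end: account elapsed time to a node on one cpu */
static void record(int node, int cpu, uint64_t elapsed)
{
	pcpu[node][cpu].state += elapsed;
}

/* Model of vmscan_flush for a single (node, cpu) pair */
static void flush_one(int node, int cpu)
{
	/* Per-cpu change since the last flush */
	uint64_t delta = pcpu[node][cpu].state - pcpu[node][cpu].prev;

	pcpu[node][cpu].prev = pcpu[node][cpu].state;

	/* Collect what the children propagated, then update the total */
	delta += total[node].pending;
	total[node].pending = 0;
	total[node].state += delta;

	/* Hand the change to the parent, to be collected on its flush */
	if (delta && parent[node] >= 0)
		total[parent[node]].pending += delta;
}

/* Children have higher indices, so iterating downwards is bottom-up */
static void flush_all(void)
{
	for (int node = N_NODES - 1; node >= 0; node--)
		for (int cpu = 0; cpu < N_CPUS; cpu++)
			flush_one(node, cpu);
}
```

With updates recorded only on the leaves, a call to flush_all() leaves each inner node's total equal to the sum of its children's totals, which is the invariant check_vmscan_stats() asserts. The prev field is what makes repeated flushes safe: only the difference since the previous flush is propagated, so nothing is counted twice.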