Message ID: 20231116022411.2250072-6-yosryahmed@google.com (mailing list archive)
State:      New
Series:     mm: memcg: subtree stats flushing and thresholds
Hello,

kernel test robot noticed a -3.7% regression of aim7.jobs-per-min on:

commit: f6eccb430010201d3c155b73035f3bf755fe7697 ("[PATCH v3 5/5] mm: memcg: restore subtree stats flushing")
url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231116-103300
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20231116022411.2250072-6-yosryahmed@google.com/
patch subject: [PATCH v3 5/5] mm: memcg: restore subtree stats flushing

testcase: aim7
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
parameters:

	disk: 1BRD_48G
	fs: ext4
	test: disk_rr
	load: 3000
	cpufreq_governor: performance

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311221505.65236274-oliver.sang@intel.com

Details are as below:
-------------------------------------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231122/202311221505.65236274-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
  gcc-12/performance/1BRD_48G/ext4/x86_64-rhel-8.3/3000/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp2/disk_rr/aim7

commit:
  4c86da8ea2 ("mm: workingset: move the stats flush into workingset_test_recent()")
  f6eccb4300 ("mm: memcg: restore subtree stats flushing")

4c86da8ea2d2f784 f6eccb430010201d3c155b73035
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
     15513 ± 14%     +17.4%      18206 ±  7%  numa-vmstat.node1.nr_mapped
    616938            -3.7%     593885        aim7.jobs-per-min
    149804 ±  4%     +17.6%     176189 ±  6%  aim7.time.involuntary_context_switches
      2310            +6.3%       2455        aim7.time.system_time
  24960256 ±  9%     -14.1%   21429987 ±  7%  perf-stat.i.branch-misses
   1357010 ± 14%     -22.6%    1050646 ± 10%  perf-stat.i.dTLB-load-misses
      0.20 ±  8%      -0.0        0.16 ±  7%  perf-stat.overall.branch-miss-rate%
      2.80            +5.7%       2.96        perf-stat.overall.cpi
      1506            +7.9%       1624 ±  2%  perf-stat.overall.cycles-between-cache-misses
      0.36            -5.4%       0.34        perf-stat.overall.ipc
  24383919 ±  8%     -14.5%   20853721 ±  7%  perf-stat.ps.branch-misses
      0.00 ±223%   +2700.0%       0.01 ± 10%  perf-sched.sch_delay.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.00 ± 35%   +1454.2%       0.06 ± 54%  perf-sched.sch_delay.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
      0.01 ± 13%   +3233.3%       0.18 ± 41%  perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.01 ± 30%   +5900.0%       0.31 ± 47%  perf-sched.sch_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      0.00 ±141%    +337.5%       0.01 ±  6%  perf-sched.sch_delay.avg.ms.schedule_timeout.__wait_for_common.__flush_work.isra.0
      0.00 ±  9%   +2843.5%       0.11 ±116%  perf-sched.sch_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      0.00 ±223%    +660.0%       0.01 ± 16%  perf-sched.sch_delay.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.01 ±  9%     -41.3%       0.00 ± 11%  perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
      0.20 ±206%   +3311.9%       6.66 ± 72%  perf-sched.sch_delay.max.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
      0.02 ± 41%   +1.8e+05%     28.67 ± 53%  perf-sched.sch_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.01 ± 52%  +41275.8%       2.28 ± 72%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      0.01 ± 23%   +2.8e+05%     20.56 ± 65%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      0.01 ± 11%    +142.9%       0.01 ± 76%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
      0.00 ±141%    +412.5%       0.01 ± 15%  perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.__flush_work.isra.0
      0.01 ± 42%    +177.3%       0.02 ± 66%  perf-sched.sch_delay.max.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
      0.01 ± 20%   +1.3e+05%     12.95 ±105%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      0.07 ±131%    +289.2%       0.27 ± 55%  perf-sched.total_sch_delay.average.ms
      0.39 ±  5%    +307.4%       1.58 ± 22%  perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.33 ± 46%   +5674.0%      18.79 ± 73%  perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      0.83 ±223%  +41660.0%     348.00 ± 74%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
     11.25 ± 64%    +225.6%      36.62 ± 45%  perf-sched.wait_and_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.81 ± 44%   +1.1e+05%    912.56 ± 92%  perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      0.61 ±223%  +11430.9%      69.86 ± 55%  perf-sched.wait_time.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1.44 ± 50%   +1120.7%      17.58 ± 49%  perf-sched.wait_time.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
      0.06 ±204%   +6992.9%       4.16 ± 91%  perf-sched.wait_time.avg.ms.do_task_dead.do_exit.do_group_exit.get_signal.arch_do_signal_or_restart
      0.38 ±  5%    +265.2%       1.40 ± 21%  perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.57 ±141%   +1413.0%       8.59 ±110%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
      0.00 ±223%   +3.8e+06%     25.42 ±143%  perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.__flush_work.isra.0
      0.35 ± 24%   +5215.2%      18.72 ± 73%  perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      1.03 ± 70%   +1610.0%      17.59 ± 49%  perf-sched.wait_time.avg.ms.syslog_print.do_syslog.kmsg_read.vfs_read
      2.82 ±223%   +6949.3%     198.44 ± 60%  perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      2.69 ± 45%   +4345.1%     119.46 ± 71%  perf-sched.wait_time.max.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
      0.10 ±212%  +10364.1%      10.59 ±106%  perf-sched.wait_time.max.ms.do_task_dead.do_exit.do_group_exit.get_signal.arch_do_signal_or_restart
      1.14 ±141%   +6549.1%      75.53 ±137%  perf-sched.wait_time.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
      0.00 ±223%   +6.5e+06%     76.30 ±141%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.__flush_work.isra.0
      0.91 ± 15%     +1e+05%    912.19 ± 92%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      2.06 ± 70%   +5708.3%     119.46 ± 71%  perf-sched.wait_time.max.ms.syslog_print.do_syslog.kmsg_read.vfs_read
      2.59            -0.1        2.45        perf-profile.calltrace.cycles-pp.ext4_block_write_begin.ext4_da_write_begin.generic_perform_write.ext4_buffered_write_iter.vfs_write
      2.10            -0.1        1.99        perf-profile.calltrace.cycles-pp.ext4_da_do_write_end.generic_perform_write.ext4_buffered_write_iter.vfs_write.ksys_write
      0.70 ±  2%      -0.1        0.59        perf-profile.calltrace.cycles-pp.workingset_activation.folio_mark_accessed.filemap_read.vfs_read.ksys_read
      1.75            -0.1        1.65        perf-profile.calltrace.cycles-pp.copy_page_to_iter.filemap_read.vfs_read.ksys_read.do_syscall_64
      1.41            -0.1        1.32        perf-profile.calltrace.cycles-pp.llseek
      1.66            -0.1        1.57        perf-profile.calltrace.cycles-pp._copy_to_iter.copy_page_to_iter.filemap_read.vfs_read.ksys_read
      1.75            -0.1        1.67        perf-profile.calltrace.cycles-pp.block_write_end.ext4_da_do_write_end.generic_perform_write.ext4_buffered_write_iter.vfs_write
      1.66            -0.1        1.58        perf-profile.calltrace.cycles-pp.__block_commit_write.block_write_end.ext4_da_do_write_end.generic_perform_write.ext4_buffered_write_iter
      0.84            -0.1        0.78        perf-profile.calltrace.cycles-pp.ext4_da_map_blocks.ext4_da_get_block_prep.ext4_block_write_begin.ext4_da_write_begin.generic_perform_write
      0.86            -0.1        0.80        perf-profile.calltrace.cycles-pp.ext4_da_get_block_prep.ext4_block_write_begin.ext4_da_write_begin.generic_perform_write.ext4_buffered_write_iter
      0.94            -0.1        0.89        perf-profile.calltrace.cycles-pp.zero_user_segments.ext4_block_write_begin.ext4_da_write_begin.generic_perform_write.ext4_buffered_write_iter
      0.92            -0.1        0.86        perf-profile.calltrace.cycles-pp.memset_orig.zero_user_segments.ext4_block_write_begin.ext4_da_write_begin.generic_perform_write
      0.71            -0.1        0.66 ±  2%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.llseek
      0.86            -0.0        0.81        perf-profile.calltrace.cycles-pp.copy_page_from_iter_atomic.generic_perform_write.ext4_buffered_write_iter.vfs_write.ksys_write
      0.60            -0.0        0.56        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.llseek
      0.94            -0.0        0.90        perf-profile.calltrace.cycles-pp.mark_buffer_dirty.__block_commit_write.block_write_end.ext4_da_do_write_end.generic_perform_write
      0.85            -0.0        0.82        perf-profile.calltrace.cycles-pp.filemap_get_pages.filemap_read.vfs_read.ksys_read.do_syscall_64
      0.71            -0.0        0.69        perf-profile.calltrace.cycles-pp.filemap_get_read_batch.filemap_get_pages.filemap_read.vfs_read.ksys_read
      0.94            -0.0        0.91        perf-profile.calltrace.cycles-pp.balance_dirty_pages_ratelimited_flags.generic_perform_write.ext4_buffered_write_iter.vfs_write.ksys_write
      1.08            -0.0        1.05        perf-profile.calltrace.cycles-pp.try_to_free_buffers.truncate_cleanup_folio.truncate_inode_pages_range.ext4_evict_inode.evict
      0.70            -0.0        0.68        perf-profile.calltrace.cycles-pp.__folio_mark_dirty.mark_buffer_dirty.__block_commit_write.block_write_end.ext4_da_do_write_end
      1.35            -0.0        1.34        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.__folio_batch_release
      1.39            -0.0        1.37        perf-profile.calltrace.cycles-pp.folio_batch_move_lru.__folio_batch_release.truncate_inode_pages_range.ext4_evict_inode.evict
      1.35            -0.0        1.34        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.__folio_batch_release.truncate_inode_pages_range
      1.35            -0.0        1.34        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.__folio_batch_release.truncate_inode_pages_range.ext4_evict_inode
      0.53            -0.0        0.51        perf-profile.calltrace.cycles-pp.folio_alloc.__filemap_get_folio.ext4_da_write_begin.generic_perform_write.ext4_buffered_write_iter
     28.25            +0.2       28.47        perf-profile.calltrace.cycles-pp.__folio_batch_release.truncate_inode_pages_range.ext4_evict_inode.evict.__dentry_kill
     25.49            +0.2       25.73        perf-profile.calltrace.cycles-pp.release_pages.__folio_batch_release.truncate_inode_pages_range.ext4_evict_inode.evict
     24.68            +0.3       24.94        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release
     24.70            +0.3       24.96        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.truncate_inode_pages_range
     24.70            +0.3       24.97        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.truncate_inode_pages_range.ext4_evict_inode
     33.66            +0.3       33.95        perf-profile.calltrace.cycles-pp.ext4_buffered_write_iter.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
     23.80            +0.3       24.11        perf-profile.calltrace.cycles-pp.folio_mark_accessed.filemap_read.vfs_read.ksys_read.do_syscall_64
     32.63            +0.3       32.97        perf-profile.calltrace.cycles-pp.generic_perform_write.ext4_buffered_write_iter.vfs_write.ksys_write.do_syscall_64
     22.93            +0.4       23.35        perf-profile.calltrace.cycles-pp.folio_activate.folio_mark_accessed.filemap_read.vfs_read.ksys_read
     22.08            +0.4       22.50        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_activate.folio_mark_accessed.filemap_read
     22.07            +0.4       22.49        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_activate.folio_mark_accessed
     22.06            +0.4       22.48        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_activate
     22.88            +0.4       23.31        perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_activate.folio_mark_accessed.filemap_read.vfs_read
     27.90            +0.6       28.49        perf-profile.calltrace.cycles-pp.ext4_da_write_begin.generic_perform_write.ext4_buffered_write_iter.vfs_write.ksys_write
     25.00            +0.8       25.76        perf-profile.calltrace.cycles-pp.__filemap_get_folio.ext4_da_write_begin.generic_perform_write.ext4_buffered_write_iter.vfs_write
     23.72            +0.8       24.54        perf-profile.calltrace.cycles-pp.filemap_add_folio.__filemap_get_folio.ext4_da_write_begin.generic_perform_write.ext4_buffered_write_iter
     22.56            +0.8       23.39        perf-profile.calltrace.cycles-pp.folio_add_lru.filemap_add_folio.__filemap_get_folio.ext4_da_write_begin.generic_perform_write
     22.52            +0.8       23.34        perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru.filemap_add_folio.__filemap_get_folio.ext4_da_write_begin
     21.97            +0.8       22.81        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.filemap_add_folio.__filemap_get_folio
     21.94            +0.8       22.79        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
     21.96            +0.8       22.80        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.filemap_add_folio
      0.41            -0.2        0.24 ±  2%  perf-profile.children.cycles-pp.mem_cgroup_css_rstat_flush
      0.54            -0.2        0.37 ±  2%  perf-profile.children.cycles-pp.cgroup_rstat_flush_locked
      0.55            -0.2        0.38        perf-profile.children.cycles-pp.cgroup_rstat_flush
      2.60            -0.1        2.46        perf-profile.children.cycles-pp.ext4_block_write_begin
      1.66            -0.1        1.56        perf-profile.children.cycles-pp.llseek
      0.70 ±  2%      -0.1        0.59        perf-profile.children.cycles-pp.workingset_activation
      2.12            -0.1        2.02        perf-profile.children.cycles-pp.ext4_da_do_write_end
      0.52 ±  3%      -0.1        0.42        perf-profile.children.cycles-pp.workingset_age_nonresident
      1.76            -0.1        1.66        perf-profile.children.cycles-pp.copy_page_to_iter
      1.67            -0.1        1.58        perf-profile.children.cycles-pp._copy_to_iter
      1.78            -0.1        1.69        perf-profile.children.cycles-pp.block_write_end
      1.67            -0.1        1.59        perf-profile.children.cycles-pp.__block_commit_write
      1.00            -0.1        0.94        perf-profile.children.cycles-pp.__entry_text_start
      0.86            -0.1        0.81        perf-profile.children.cycles-pp.ext4_da_get_block_prep
      0.60            -0.1        0.54 ±  2%  perf-profile.children.cycles-pp.__fdget_pos
      0.79            -0.1        0.73 ±  2%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.87            -0.1        0.82        perf-profile.children.cycles-pp.copy_page_from_iter_atomic
      0.95            -0.1        0.89        perf-profile.children.cycles-pp.zero_user_segments
      0.85            -0.1        0.80        perf-profile.children.cycles-pp.ext4_da_map_blocks
      0.95            -0.1        0.90        perf-profile.children.cycles-pp.memset_orig
      0.43            -0.0        0.38 ±  2%  perf-profile.children.cycles-pp.__fget_light
      0.50            -0.0        0.46        perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.47 ±  2%      -0.0        0.42        perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      0.41 ±  3%      -0.0        0.36 ±  2%  perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      0.40 ±  3%      -0.0        0.36 ±  2%  perf-profile.children.cycles-pp.hrtimer_interrupt
      0.37 ±  2%      -0.0        0.33 ±  2%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.64            -0.0        0.60        perf-profile.children.cycles-pp.xas_load
      0.74            -0.0        0.70        perf-profile.children.cycles-pp.filemap_get_read_batch
      0.95            -0.0        0.92        perf-profile.children.cycles-pp.mark_buffer_dirty
      0.44            -0.0        0.41        perf-profile.children.cycles-pp.file_modified
      0.98            -0.0        0.94        perf-profile.children.cycles-pp.balance_dirty_pages_ratelimited_flags
      0.87            -0.0        0.84        perf-profile.children.cycles-pp.filemap_get_pages
      0.43            -0.0        0.40        perf-profile.children.cycles-pp.fault_in_iov_iter_readable
      0.31 ±  6%      -0.0        0.28        perf-profile.children.cycles-pp.disk_rr
      0.41            -0.0        0.38        perf-profile.children.cycles-pp.touch_atime
      0.38            -0.0        0.35        perf-profile.children.cycles-pp.fault_in_readable
      0.32 ±  2%      -0.0        0.30        perf-profile.children.cycles-pp.xas_descend
      0.37            -0.0        0.34 ±  3%  perf-profile.children.cycles-pp.ksys_lseek
      0.34            -0.0        0.32        perf-profile.children.cycles-pp.atime_needs_update
      1.08            -0.0        1.06        perf-profile.children.cycles-pp.try_to_free_buffers
      0.20 ±  2%      -0.0        0.17 ±  2%  perf-profile.children.cycles-pp.syscall_enter_from_user_mode
      0.22 ±  2%      -0.0        0.20 ±  2%  perf-profile.children.cycles-pp.ext4_es_insert_delayed_block
      0.34 ±  2%      -0.0        0.32        perf-profile.children.cycles-pp.__cond_resched
      0.44            -0.0        0.42        perf-profile.children.cycles-pp.filemap_get_entry
      0.23 ±  2%      -0.0        0.21        perf-profile.children.cycles-pp.inode_needs_update_time
      0.71            -0.0        0.69        perf-profile.children.cycles-pp.__folio_mark_dirty
      0.37            -0.0        0.36        perf-profile.children.cycles-pp.__mem_cgroup_charge
      0.24 ±  2%      -0.0        0.22 ±  2%  perf-profile.children.cycles-pp._raw_spin_lock
      0.24            -0.0        0.22        perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.40            -0.0        0.38        perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.14            -0.0        0.13 ±  2%  perf-profile.children.cycles-pp.up_write
      0.50            -0.0        0.49        perf-profile.children.cycles-pp.alloc_pages_mpol
      0.14            -0.0        0.13        perf-profile.children.cycles-pp.current_time
      0.10            -0.0        0.09        perf-profile.children.cycles-pp.__es_insert_extent
      0.25 ±  3%      +0.0        0.27 ±  3%  perf-profile.children.cycles-pp.__mod_lruvec_state
      0.19 ±  3%      +0.0        0.21 ±  3%  perf-profile.children.cycles-pp.__mod_node_page_state
      1.12            +0.1        1.20        perf-profile.children.cycles-pp.__mod_lruvec_page_state
      0.99            +0.1        1.09 ±  2%  perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
      0.00            +0.1        0.13 ±  3%  perf-profile.children.cycles-pp.mutex_spin_on_owner
     30.58            +0.1       30.72        perf-profile.children.cycles-pp.dput
      0.64            +0.1        0.79 ±  4%  perf-profile.children.cycles-pp.cgroup_rstat_updated
     30.44            +0.2       30.60        perf-profile.children.cycles-pp.truncate_inode_pages_range
      0.00            +0.2        0.18 ±  3%  perf-profile.children.cycles-pp.__mutex_lock
     97.33            +0.2       97.51        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     97.11            +0.2       97.31        perf-profile.children.cycles-pp.do_syscall_64
     28.25            +0.2       28.47        perf-profile.children.cycles-pp.__folio_batch_release
     25.74            +0.2       25.96        perf-profile.children.cycles-pp.release_pages
     33.71            +0.3       33.99        perf-profile.children.cycles-pp.ext4_buffered_write_iter
     23.82            +0.3       24.12        perf-profile.children.cycles-pp.folio_mark_accessed
     32.74            +0.3       33.09        perf-profile.children.cycles-pp.generic_perform_write
     22.94            +0.4       23.36        perf-profile.children.cycles-pp.folio_activate
     27.94            +0.6       28.53        perf-profile.children.cycles-pp.ext4_da_write_begin
     25.04            +0.8       25.80        perf-profile.children.cycles-pp.__filemap_get_folio
     23.73            +0.8       24.54        perf-profile.children.cycles-pp.filemap_add_folio
     22.61            +0.8       23.44        perf-profile.children.cycles-pp.folio_add_lru
     48.23            +1.2       49.47        perf-profile.children.cycles-pp.folio_batch_move_lru
     71.67            +1.5       73.13        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     71.80            +1.5       73.29        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     71.64            +1.5       73.14        perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
      0.40 ±  2%      -0.2        0.23 ±  2%  perf-profile.self.cycles-pp.mem_cgroup_css_rstat_flush
      0.52 ±  2%      -0.1        0.42 ±  2%  perf-profile.self.cycles-pp.workingset_age_nonresident
      1.65            -0.1        1.56        perf-profile.self.cycles-pp._copy_to_iter
      0.86            -0.1        0.81        perf-profile.self.cycles-pp.copy_page_from_iter_atomic
      0.94            -0.1        0.89        perf-profile.self.cycles-pp.memset_orig
      0.76            -0.0        0.71 ±  2%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.52 ±  4%      -0.0        0.47        perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
      0.40            -0.0        0.36 ±  3%  perf-profile.self.cycles-pp.__fget_light
      0.53 ±  2%      -0.0        0.50        perf-profile.self.cycles-pp.vfs_write
      0.63            -0.0        0.59        perf-profile.self.cycles-pp.filemap_read
      0.66            -0.0        0.62        perf-profile.self.cycles-pp.__block_commit_write
      0.37            -0.0        0.34 ±  2%  perf-profile.self.cycles-pp.fault_in_readable
      0.43            -0.0        0.41        perf-profile.self.cycles-pp.vfs_read
      0.26 ±  4%      -0.0        0.24 ±  2%  perf-profile.self.cycles-pp.balance_dirty_pages_ratelimited_flags
      0.28            -0.0        0.26 ±  2%  perf-profile.self.cycles-pp.xas_descend
      0.28            -0.0        0.26        perf-profile.self.cycles-pp.read
      0.28            -0.0        0.25        perf-profile.self.cycles-pp.__filemap_get_folio
      0.17            -0.0        0.15 ±  2%  perf-profile.self.cycles-pp.syscall_enter_from_user_mode
      0.27            -0.0        0.25        perf-profile.self.cycles-pp.do_syscall_64
      0.46            -0.0        0.44        perf-profile.self.cycles-pp.filemap_get_read_batch
      0.22 ±  2%      -0.0        0.20 ±  4%  perf-profile.self.cycles-pp.ext4_da_write_begin
      0.26            -0.0        0.25 ±  2%  perf-profile.self.cycles-pp.__entry_text_start
      0.24            -0.0        0.22        perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.21 ±  2%      -0.0        0.19 ±  2%  perf-profile.self.cycles-pp.filemap_get_entry
      0.13            -0.0        0.12 ±  3%  perf-profile.self.cycles-pp.down_write
      0.22 ±  2%      -0.0        0.21 ±  2%  perf-profile.self.cycles-pp.ext4_da_do_write_end
      0.20 ±  2%      -0.0        0.19 ±  2%  perf-profile.self.cycles-pp.__cond_resched
      0.17            -0.0        0.16        perf-profile.self.cycles-pp.folio_mark_accessed
      0.10            -0.0        0.09        perf-profile.self.cycles-pp.ksys_write
      0.09            -0.0        0.08        perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
      0.12            -0.0        0.11        perf-profile.self.cycles-pp.find_lock_entries
      0.18 ±  2%      +0.0        0.20 ±  2%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.16            +0.0        0.19 ±  3%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.00            +0.1        0.13 ±  3%  perf-profile.self.cycles-pp.mutex_spin_on_owner
      0.54 ±  2%      +0.2        0.72 ±  4%  perf-profile.self.cycles-pp.cgroup_rstat_updated
     71.67            +1.5       73.13        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
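A note on where the cycles go in the profile above: flush-side cost drops
(mem_cgroup_css_rstat_flush, cgroup_rstat_flush_locked), update-side cost rises
(cgroup_rstat_updated), and mutex_spin_on_owner/__mutex_lock appear for the
first time, consistent with flushers now serializing on the
memcg_stats_flush_mutex added in the patch below. For illustration only, here
is a minimal user-space C sketch (not kernel code; FLUSH_THRESHOLD and the
pending_updates counter are stand-ins) of the double-checked, threshold-gated
flush pattern the patch introduces in mem_cgroup_flush_stats():

/* Build: gcc -std=c11 -pthread sketch.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define FLUSH_THRESHOLD 1024	/* stand-in for the kernel's per-memcg threshold */

static atomic_long pending_updates;	/* stat updates since the last flush */
static pthread_mutex_t flush_mutex = PTHREAD_MUTEX_INITIALIZER;

static int should_flush(void)
{
	return atomic_load(&pending_updates) > FLUSH_THRESHOLD;
}

static void do_flush(void)
{
	/* stand-in for cgroup_rstat_flush(): the expensive, serialized work */
	atomic_store(&pending_updates, 0);
}

static void flush_stats(void)
{
	if (!should_flush())
		return;		/* cheap exit: updates below threshold, skip */

	pthread_mutex_lock(&flush_mutex);
	/* re-check: a concurrent flusher may have already done the work */
	if (should_flush())
		do_flush();
	pthread_mutex_unlock(&flush_mutex);
}

int main(void)
{
	atomic_fetch_add(&pending_updates, 2048);	/* simulate stat updates */
	flush_stats();
	printf("pending after flush: %ld\n", atomic_load(&pending_updates));
	return 0;
}

Compared with the removed stats_flush_ongoing scheme, in which a concurrent
flusher simply skipped, callers that pass the threshold now wait on the mutex
for an in-flight flush to finish; that wait is presumably the source of the new
mutex_spin_on_owner cycles.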
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7bdcf3020d7a3..6edd3ec4d8d54 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1046,8 +1046,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_ratelimited(void);
+void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
+void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1548,11 +1548,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 {
 }
 
-static inline void mem_cgroup_flush_stats_ratelimited(void)
+static inline void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74db05237775d..2baa9349d1590 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -669,7 +669,6 @@ struct memcg_vmstats {
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
 static u64 flush_last_time;
 
 #define FLUSH_TIME (2UL*HZ)
@@ -730,35 +729,47 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(void)
+static void do_flush_stats(struct mem_cgroup *memcg)
 {
-	/*
-	 * We always flush the entire tree, so concurrent flushers can just
-	 * skip. This avoids a thundering herd problem on the rstat global lock
-	 * from memcg flushers (e.g. reclaim, refault, etc).
-	 */
-	if (atomic_read(&stats_flush_ongoing) ||
-	    atomic_xchg(&stats_flush_ongoing, 1))
-		return;
-
-	WRITE_ONCE(flush_last_time, jiffies_64);
-
-	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+	if (mem_cgroup_is_root(memcg))
+		WRITE_ONCE(flush_last_time, jiffies_64);
 
-	atomic_set(&stats_flush_ongoing, 0);
+	cgroup_rstat_flush(memcg->css.cgroup);
 }
 
-void mem_cgroup_flush_stats(void)
+/*
+ * mem_cgroup_flush_stats - flush the stats of a memory cgroup subtree
+ * @memcg: root of the subtree to flush
+ *
+ * Flushing is serialized by the underlying global rstat lock. There is also a
+ * minimum amount of work to be done even if there are no stat updates to flush.
+ * Hence, we only flush the stats if the updates delta exceeds a threshold. This
+ * avoids unnecessary work and contention on the underlying lock.
+ */
+void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 {
-	if (memcg_should_flush_stats(root_mem_cgroup))
-		do_flush_stats();
+	static DEFINE_MUTEX(memcg_stats_flush_mutex);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
+	if (memcg_should_flush_stats(memcg)) {
+		mutex_lock(&memcg_stats_flush_mutex);
+		/* Check again after locking, another flush may have occurred */
+		if (memcg_should_flush_stats(memcg))
+			do_flush_stats(memcg);
+		mutex_unlock(&memcg_stats_flush_mutex);
+	}
 }
 
-void mem_cgroup_flush_stats_ratelimited(void)
+void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
 {
 	/* Only flush if the periodic flusher is one full cycle late */
 	if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME))
-		mem_cgroup_flush_stats();
+		mem_cgroup_flush_stats(memcg);
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
@@ -767,7 +778,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Deliberately ignore memcg_should_flush_stats() here so that flushing
 	 * in latency-sensitive paths is as cheap as possible.
 	 */
-	do_flush_stats();
+	do_flush_stats(root_mem_cgroup);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1642,7 +1653,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4191,7 +4202,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4272,7 +4283,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4768,7 +4779,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6857,7 +6868,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
@@ -8088,7 +8099,11 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 			break;
 		}
 
-		cgroup_rstat_flush(memcg->css.cgroup);
+		/*
+		 * mem_cgroup_flush_stats() ignores small changes. Use
+		 * do_flush_stats() directly to get accurate stats for charging.
+		 */
+		do_flush_stats(memcg);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
@@ -8153,8 +8168,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
 			      struct cftype *cft)
 {
-	cgroup_rstat_flush(css->cgroup);
-	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	mem_cgroup_flush_stats(memcg);
+	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
 }
 
 static int zswap_max_show(struct seq_file *m, void *v)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 506f8220c5fe5..f93c989d7b387 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2222,7 +2222,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(sc->target_mem_cgroup);
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index a573be6c59fd9..11045febc3838 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -464,8 +464,12 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 
 	rcu_read_unlock();
 
-	/* Flush stats (and potentially sleep) outside the RCU read section */
-	mem_cgroup_flush_stats_ratelimited();
+	/*
+	 * Flush stats (and potentially sleep) outside the RCU read section.
+	 * XXX: With per-memcg flushing and thresholding, is ratelimiting
+	 * still needed here?
+	 */
+	mem_cgroup_flush_stats_ratelimited(eviction_memcg);
 
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
@@ -676,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	struct lruvec *lruvec;
 	int i;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats(sc->memcg);
 
 	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 		pages += lruvec_page_state_local(lruvec,