[RFC,V1,5/6] sched/numa: Allow recently accessed VMAs to be scanned

Message ID 109ca1ea59b9dd6f2daf7b7fbc74e83ae074fbdf.1693287931.git.raghavendra.kt@amd.com (mailing list archive)
State: New
Series: sched/numa: Enhance disjoint VMA scanning

Commit Message

Raghavendra K T Aug. 29, 2023, 6:06 a.m. UTC
This ensures hot VMAs get scanned with priority, irrespective of
whether the current task has accessed them.

Suggested-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 kernel/sched/fair.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

kernel test robot Sept. 10, 2023, 3:29 p.m. UTC | #1
Hello,

kernel test robot noticed a -33.6% improvement of autonuma-benchmark.numa02.seconds on:


commit: af46f3c9ca2d16485912f8b9c896ef48bbfe1388 ("[RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned")
url: https://github.com/intel-lab-lkp/linux/commits/Raghavendra-K-T/sched-numa-Move-up-the-access-pid-reset-logic/20230829-141007
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2f88c8e802c8b128a155976631f4eb2ce4f3c805
patch link: https://lore.kernel.org/all/109ca1ea59b9dd6f2daf7b7fbc74e83ae074fbdf.1693287931.git.raghavendra.kt@amd.com/
patch subject: [RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned
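
(The -33.6% figure is the relative change shown in the comparison table below: numa02 runtime drops from 21.17 s to 14.05 s, i.e. (14.05 - 21.17) / 21.17 ≈ -33.6%.)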

testcase: autonuma-benchmark
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
parameters:

	iterations: 4x
	test: numa01_THREAD_ALLOC
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230910/202309102311.84b42068-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
  gcc-12/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/numa01_THREAD_ALLOC/autonuma-benchmark

commit: 
  167773d1dd ("sched/numa: Increase tasks' access history")
  af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")

167773d1ddb5ffdd af46f3c9ca2d16485912f8b9c89 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
 2.534e+10 ± 10%     -13.0%  2.204e+10 ±  7%  cpuidle..time
  26431366 ± 10%     -13.2%   22948978 ±  7%  cpuidle..usage
      0.15 ±  4%      -0.0        0.12 ±  3%  mpstat.cpu.all.soft%
      2.92 ±  3%      +0.4        3.32 ±  4%  mpstat.cpu.all.sys%
      2243 ±  2%     -12.7%       1957 ±  3%  uptime.boot
     29811 ±  8%     -11.1%      26507 ±  6%  uptime.idle
      5.32 ± 79%     -64.2%       1.91 ± 60%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
      2.70 ± 18%     +37.8%       3.72 ±  9%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
      0.64 ±137%  +26644.2%     169.91 ±220%  perf-sched.wait_time.avg.ms.__cond_resched.task_work_run.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode
      0.08 ± 20%      +0.0        0.12 ± 10%  perf-profile.children.cycles-pp.terminate_walk
      0.10 ± 25%      +0.0        0.14 ± 10%  perf-profile.children.cycles-pp.wake_up_q
      0.06 ± 50%      +0.0        0.10 ± 10%  perf-profile.children.cycles-pp.vfs_readlink
      0.15 ± 36%      +0.1        0.22 ± 13%  perf-profile.children.cycles-pp.readlink
      1.31 ± 19%      +0.4        1.69 ± 12%  perf-profile.children.cycles-pp.unmap_vmas
      2.46 ± 19%      +0.5        2.99 ±  4%  perf-profile.children.cycles-pp.exit_mmap
    311653 ± 10%     -23.7%     237884 ±  9%  turbostat.C1E
  26018024 ± 10%     -13.1%   22597563 ±  7%  turbostat.C6
      6.41 ±  9%     -13.6%       5.54 ±  8%  turbostat.CPU%c1
      2.47 ± 11%     +36.0%       3.36 ±  6%  turbostat.CPU%c6
 2.881e+08 ±  2%     -12.8%  2.513e+08 ±  3%  turbostat.IRQ
    212.86            +2.8%     218.84        turbostat.RAMWatt
    341.49            -4.1%     327.42 ±  2%  autonuma-benchmark.numa01.seconds
    186.67 ±  6%     -27.1%     136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
     21.17 ±  7%     -33.6%      14.05        autonuma-benchmark.numa02.seconds
      2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time
      2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time.max
   1159380 ±  2%     -12.0%    1019969 ±  3%  autonuma-benchmark.time.involuntary_context_switches
   3363550            -5.0%    3194802        autonuma-benchmark.time.minor_page_faults
    243046 ±  2%     -13.3%     210725 ±  3%  autonuma-benchmark.time.user_time
   7494239            -6.8%    6984234        proc-vmstat.numa_hit
    118829 ±  6%     +13.7%     135136 ±  6%  proc-vmstat.numa_huge_pte_updates
   6207618            -8.4%    5686795 ±  2%  proc-vmstat.numa_local
   8834573 ±  3%     +20.2%   10616944 ±  4%  proc-vmstat.numa_pages_migrated
  61094857 ±  6%     +13.6%   69409875 ±  6%  proc-vmstat.numa_pte_updates
   8602789            -9.0%    7827793 ±  2%  proc-vmstat.pgfault
   8834573 ±  3%     +20.2%   10616944 ±  4%  proc-vmstat.pgmigrate_success
    371818           -10.1%     334391 ±  2%  proc-vmstat.pgreuse
     17200 ±  3%     +20.3%      20686 ±  4%  proc-vmstat.thp_migration_success
  16401792 ±  2%     -12.7%   14322816 ±  3%  proc-vmstat.unevictable_pgs_scanned
 1.606e+08 ±  2%     -13.8%  1.385e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.avg
 1.666e+08 ±  2%     -14.0%  1.433e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.max
 1.364e+08 ±  2%     -11.7%  1.204e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.min
   4795327 ±  7%     -17.5%    3956991 ±  7%  sched_debug.cfs_rq:/.avg_vruntime.stddev
 1.606e+08 ±  2%     -13.8%  1.385e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.avg
 1.666e+08 ±  2%     -14.0%  1.433e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.max
 1.364e+08 ±  2%     -11.7%  1.204e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.min
   4795327 ±  7%     -17.5%    3956991 ±  7%  sched_debug.cfs_rq:/.min_vruntime.stddev
    364.96 ±  6%     +16.6%     425.70 ±  5%  sched_debug.cfs_rq:/.util_est_enqueued.avg
   1099114           -13.0%     956021 ±  2%  sched_debug.cpu.clock.avg
   1099477           -13.0%     956344 ±  2%  sched_debug.cpu.clock.max
   1098702           -13.0%     955643 ±  2%  sched_debug.cpu.clock.min
   1080712           -13.0%     940415 ±  2%  sched_debug.cpu.clock_task.avg
   1085309           -13.1%     943557 ±  2%  sched_debug.cpu.clock_task.max
   1064613           -13.0%     925993 ±  2%  sched_debug.cpu.clock_task.min
     28890 ±  3%     -11.7%      25504 ±  3%  sched_debug.cpu.curr->pid.avg
     35200           -11.0%      31344        sched_debug.cpu.curr->pid.max
    862245 ±  3%      -8.7%     786984        sched_debug.cpu.max_idle_balance_cost.max
     74019 ±  9%     -28.2%      53158 ±  7%  sched_debug.cpu.max_idle_balance_cost.stddev
     15507           -11.9%      13667 ±  2%  sched_debug.cpu.nr_switches.avg
     57616 ±  6%     -19.0%      46642 ±  8%  sched_debug.cpu.nr_switches.max
      8460 ±  6%     -12.9%       7368 ±  5%  sched_debug.cpu.nr_switches.stddev
   1098689           -13.0%     955631 ±  2%  sched_debug.cpu_clk
   1097964           -13.0%     954907 ±  2%  sched_debug.ktime
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.avg
      0.03           +15.0%       0.03 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.max
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.stddev
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_running.avg
      0.03           +15.0%       0.03 ±  2%  sched_debug.rt_rq:.rt_nr_running.max
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_running.stddev
   1099511           -13.0%     956501 ±  2%  sched_debug.sched_clk
      1162 ±  2%     +15.2%       1339 ±  3%  perf-stat.i.MPKI
 1.656e+08            +3.6%  1.716e+08        perf-stat.i.branch-instructions
      0.95 ±  4%      +0.1        1.03        perf-stat.i.branch-miss-rate%
   1538367 ±  6%     +11.0%    1707146 ±  2%  perf-stat.i.branch-misses
 6.327e+08 ±  3%     +18.7%  7.513e+08 ±  4%  perf-stat.i.cache-misses
 8.282e+08 ±  2%     +15.2%  9.542e+08 ±  3%  perf-stat.i.cache-references
    658.12 ±  3%     -11.4%     582.98 ±  6%  perf-stat.i.cycles-between-cache-misses
 2.201e+08            +2.8%  2.263e+08        perf-stat.i.dTLB-loads
    579771            +0.9%     584915        perf-stat.i.dTLB-store-misses
 1.122e+08            +1.4%  1.138e+08        perf-stat.i.dTLB-stores
 8.278e+08            +3.1%  8.538e+08        perf-stat.i.instructions
     13.98 ±  2%     +14.3%      15.98 ±  3%  perf-stat.i.metric.M/sec
      3797            +4.3%       3958        perf-stat.i.minor-faults
    258749            +8.0%     279391 ±  2%  perf-stat.i.node-load-misses
    261169 ±  2%      +7.4%     280417 ±  5%  perf-stat.i.node-loads
     40.91 ±  3%      -3.0       37.89 ±  3%  perf-stat.i.node-store-miss-rate%
 3.841e+08 ±  6%     +27.6%  4.902e+08 ±  7%  perf-stat.i.node-stores
      3797            +4.3%       3958        perf-stat.i.page-faults
    998.24 ±  2%     +11.8%       1116 ±  2%  perf-stat.overall.MPKI
    463.91            -3.2%     448.99        perf-stat.overall.cpi
    604.23 ±  3%     -15.9%     508.08 ±  4%  perf-stat.overall.cycles-between-cache-misses
      0.00            +3.3%       0.00        perf-stat.overall.ipc
     39.20 ±  5%      -4.5       34.70 ±  6%  perf-stat.overall.node-store-miss-rate%
 1.636e+08            +3.8%  1.698e+08        perf-stat.ps.branch-instructions
   1499760 ±  6%     +11.1%    1665855 ±  2%  perf-stat.ps.branch-misses
 6.296e+08 ±  3%     +19.0%  7.489e+08 ±  4%  perf-stat.ps.cache-misses
 8.178e+08 ±  2%     +15.5%  9.447e+08 ±  3%  perf-stat.ps.cache-references
  2.18e+08            +2.9%  2.244e+08        perf-stat.ps.dTLB-loads
    578148            +0.9%     583328        perf-stat.ps.dTLB-store-misses
 1.117e+08            +1.4%  1.132e+08        perf-stat.ps.dTLB-stores
 8.192e+08            +3.3%   8.46e+08        perf-stat.ps.instructions
      3744            +4.3%       3906        perf-stat.ps.minor-faults
    255974            +8.2%     276924 ±  2%  perf-stat.ps.node-load-misses
    263796 ±  2%      +7.7%     284110 ±  5%  perf-stat.ps.node-loads
  3.82e+08 ±  6%     +27.7%  4.879e+08 ±  7%  perf-stat.ps.node-stores
      3744            +4.3%       3906        perf-stat.ps.page-faults
 1.805e+12 ±  2%     -10.1%  1.622e+12 ±  2%  perf-stat.total.instructions




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Raghavendra K T Sept. 11, 2023, 11:25 a.m. UTC | #2
On 9/10/2023 8:59 PM, kernel test robot wrote:
>    341.49            -4.1%     327.42 ±  2%  autonuma-benchmark.numa01.seconds
>      186.67 ±  6%     -27.1%     136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
>       21.17 ±  7%     -33.6%      14.05        autonuma-benchmark.numa02.seconds
>        2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time

Hello Oliver/Kernel test robot,
Thank you very much for testing.

The results are impressive. Can I take this result as
positive for the whole series too?

Mel/PeterZ,

Whenever time permits can you please let us know your comments/concerns
on the series?

Thanks and Regards
- Raghu
kernel test robot Sept. 12, 2023, 2:22 a.m. UTC | #3
hi, Raghu,

On Mon, Sep 11, 2023 at 04:55:56PM +0530, Raghavendra K T wrote:
> On 9/10/2023 8:59 PM, kernel test robot wrote:
> >    341.49            -4.1%     327.42 ±  2%  autonuma-benchmark.numa01.seconds
> >      186.67 ±  6%     -27.1%     136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
> >       21.17 ±  7%     -33.6%      14.05        autonuma-benchmark.numa02.seconds
> >        2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time
> 
> Hello Oliver/Kernel test robot,
> Thank you very much for testing.
> 
> The results are impressive. Can I take this result as
> positive for the whole series too?

FYI, we applied your patch set as below:

68cfe9439a1ba (linux-review/Raghavendra-K-T/sched-numa-Move-up-the-access-pid-reset-logic/20230829-141007) sched/numa: Allow scanning of shared VMAs
af46f3c9ca2d1 sched/numa: Allow recently accessed VMAs to be scanned
167773d1ddb5f sched/numa: Increase tasks' access history
fc769221b2306 sched/numa: Remove unconditional scan logic using mm numa_scan_seq
1ef5cbb92bdb3 sched/numa: Add disjoint vma unconditional scan logic
2a806eab1c2e1 sched/numa: Move up the access pid reset logic
2f88c8e802c8b (tip/sched/core) sched/eevdf/doc: Modify the documented knob to base_slice_ns as well

In our tests we also tested 68cfe9439a1ba; comparing it to af46f3c9ca2d1:

=========================================================================================
compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
  gcc-12/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/numa01_THREAD_ALLOC/autonuma-benchmark

commit:
  af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")
  68cfe9439a ("sched/numa: Allow scanning of shared VMA")

af46f3c9ca2d1648 68cfe9439a1baa642e05883fa64
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    327.42 ±  2%      -1.1%     323.83 ±  3%  autonuma-benchmark.numa01.seconds
    136.12 ±  7%     -25.1%     101.90 ±  2%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
     14.05            +1.5%      14.26        autonuma-benchmark.numa02.seconds
      1913 ±  3%      -7.9%       1763 ±  2%  autonuma-benchmark.time.elapsed_time


below is the full comparison FYI.


af46f3c9ca2d1648 68cfe9439a1baa642e05883fa64
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
     36437 ±  9%     +20.4%      43867 ± 10%  meminfo.Mapped
      0.02 ± 17%      +0.0        0.03 ±  8%  mpstat.cpu.all.iowait%
     71.00 ±  2%      +6.3%      75.50        turbostat.PkgTmp
   3956991 ±  7%     -15.0%    3361998 ±  5%  sched_debug.cfs_rq:/.avg_vruntime.stddev
   3956991 ±  7%     -15.0%    3361997 ±  5%  sched_debug.cfs_rq:/.min_vruntime.stddev
    -30.18           +27.8%     -38.56        sched_debug.cpu.nr_uninterruptible.min
      1913 ±  3%      -7.9%       1763 ±  2%  time.elapsed_time
      1913 ±  3%      -7.9%       1763 ±  2%  time.elapsed_time.max
   3194802            -2.4%    3117907        time.minor_page_faults
    210725 ±  3%      -8.7%     192483 ±  3%  time.user_time
    327.42 ±  2%      -1.1%     323.83 ±  3%  autonuma-benchmark.numa01.seconds
    136.12 ±  7%     -25.1%     101.90 ±  2%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
     14.05            +1.5%      14.26        autonuma-benchmark.numa02.seconds
      1913 ±  3%      -7.9%       1763 ±  2%  autonuma-benchmark.time.elapsed_time
      1913 ±  3%      -7.9%       1763 ±  2%  autonuma-benchmark.time.elapsed_time.max
   3194802            -2.4%    3117907        autonuma-benchmark.time.minor_page_faults
    210725 ±  3%      -8.7%     192483 ±  3%  autonuma-benchmark.time.user_time
      1.33 ± 91%     -88.0%       0.16 ± 14%  perf-sched.sch_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
      0.09 ±194%   +3204.2%       3.03 ± 66%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi
      3.72 ±  9%     -24.8%       2.80 ± 21%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
     41.00 ±147%   +2060.2%     885.67 ±105%  perf-sched.wait_and_delay.count.io_schedule.migration_entry_wait_on_locked.__handle_mm_fault.handle_mm_fault
     18.61 ± 18%     -28.5%      13.30 ± 21%  perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      7.84 ±100%    +354.6%      35.66 ± 89%  perf-sched.wait_time.max.ms.__cond_resched.__wait_for_common.wait_for_completion_state.kernel_clone.__x64_sys_vfork
      9285 ±  8%     +20.1%      11152 ± 10%  proc-vmstat.nr_mapped
   6984234            -4.0%    6706018        proc-vmstat.numa_hit
   5686795 ±  2%      -5.2%    5390176        proc-vmstat.numa_local
  10616944 ±  4%     +15.7%   12279801 ±  3%  proc-vmstat.numa_pages_migrated
   7827793 ±  2%      -5.2%    7421440 ±  2%  proc-vmstat.pgfault
  10616944 ±  4%     +15.7%   12279801 ±  3%  proc-vmstat.pgmigrate_success
    334391 ±  2%      -8.6%     305628 ±  2%  proc-vmstat.pgreuse
     20686 ±  4%     +15.7%      23939 ±  3%  proc-vmstat.thp_migration_success
  14322816 ±  3%      -8.2%   13147392 ±  2%  proc-vmstat.unevictable_pgs_scanned
      1339 ±  3%      +8.6%       1454 ±  2%  perf-stat.i.MPKI
 1.716e+08            +2.8%  1.764e+08        perf-stat.i.branch-instructions
      1.03            +0.1        1.11 ±  3%  perf-stat.i.branch-miss-rate%
   1707146 ±  2%      +9.5%    1869960 ±  4%  perf-stat.i.branch-misses
 7.513e+08 ±  4%     +11.1%  8.351e+08 ±  3%  perf-stat.i.cache-misses
 9.542e+08 ±  3%      +8.9%   1.04e+09 ±  3%  perf-stat.i.cache-references
    534.57            -1.5%     526.34        perf-stat.i.cpi
    158.57            +1.6%     161.11        perf-stat.i.cpu-migrations
    582.98 ±  6%     -11.4%     516.40 ±  3%  perf-stat.i.cycles-between-cache-misses
 2.263e+08            +2.2%  2.312e+08        perf-stat.i.dTLB-loads
 8.538e+08            +2.5%  8.753e+08        perf-stat.i.instructions
     15.98 ±  3%      +8.9%      17.40 ±  3%  perf-stat.i.metric.M/sec
      3958            +3.0%       4075        perf-stat.i.minor-faults
     37.89 ±  3%      -3.6       34.28 ±  5%  perf-stat.i.node-store-miss-rate%
 2.585e+08 ±  4%      -7.7%  2.385e+08 ±  3%  perf-stat.i.node-store-misses
 4.902e+08 ±  7%     +21.1%  5.937e+08 ±  7%  perf-stat.i.node-stores
      3958            +2.9%       4075        perf-stat.i.page-faults
      1116 ±  2%      +6.2%       1186 ±  2%  perf-stat.overall.MPKI
      0.98            +0.1        1.04 ±  3%  perf-stat.overall.branch-miss-rate%
    448.99            -2.8%     436.60        perf-stat.overall.cpi
    508.08 ±  4%     -10.1%     456.56 ±  4%  perf-stat.overall.cycles-between-cache-misses
      0.00            +2.8%       0.00        perf-stat.overall.ipc
     34.70 ±  6%      -5.7       29.02 ±  7%  perf-stat.overall.node-store-miss-rate%
 1.698e+08            +2.8%  1.746e+08        perf-stat.ps.branch-instructions
   1665855 ±  2%      +9.5%    1824511 ±  3%  perf-stat.ps.branch-misses
 7.489e+08 ±  4%     +10.9%  8.306e+08 ±  4%  perf-stat.ps.cache-misses
 9.447e+08 ±  3%      +8.9%  1.029e+09 ±  3%  perf-stat.ps.cache-references
    158.05            +1.4%     160.31        perf-stat.ps.cpu-migrations
 2.244e+08            +2.1%  2.292e+08        perf-stat.ps.dTLB-loads
  8.46e+08            +2.5%  8.672e+08        perf-stat.ps.instructions
      3906            +2.9%       4020        perf-stat.ps.minor-faults
    284110 ±  5%     +12.0%     318166 ±  2%  perf-stat.ps.node-loads
 2.584e+08 ±  3%      -7.3%  2.395e+08 ±  3%  perf-stat.ps.node-store-misses
 4.879e+08 ±  7%     +20.6%  5.883e+08 ±  7%  perf-stat.ps.node-stores
      3906            +2.9%       4020        perf-stat.ps.page-faults
 1.622e+12 ±  2%      -5.7%   1.53e+12 ±  2%  perf-stat.total.instructions
      6.29 ± 13%      -2.2        4.11 ± 24%  perf-profile.calltrace.cycles-pp.read
      6.22 ± 13%      -2.2        4.05 ± 24%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
      6.21 ± 13%      -2.2        4.04 ± 24%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
      6.04 ± 13%      -2.1        3.90 ± 24%  perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
      6.09 ± 13%      -2.1        3.96 ± 24%  perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
      3.68 ± 17%      -1.4        2.25 ± 36%  perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      3.22 ± 16%      -1.4        1.79 ± 27%  perf-profile.calltrace.cycles-pp.open64
      3.66 ± 16%      -1.4        2.24 ± 36%  perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
      3.88 ± 13%      -1.4        2.49 ± 20%  perf-profile.calltrace.cycles-pp.seq_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
      3.83 ± 13%      -1.4        2.48 ± 19%  perf-profile.calltrace.cycles-pp.seq_read_iter.seq_read.vfs_read.ksys_read.do_syscall_64
      3.03 ± 17%      -1.3        1.71 ± 26%  perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
      3.09 ± 17%      -1.3        1.77 ± 27%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.open64
      3.08 ± 17%      -1.3        1.76 ± 27%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
      3.04 ± 17%      -1.3        1.73 ± 26%  perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
      2.61 ± 14%      -1.0        1.60 ± 20%  perf-profile.calltrace.cycles-pp.proc_single_show.seq_read_iter.seq_read.vfs_read.ksys_read
      2.58 ± 13%      -1.0        1.58 ± 21%  perf-profile.calltrace.cycles-pp.do_task_stat.proc_single_show.seq_read_iter.seq_read.vfs_read
      0.99 ± 17%      -0.5        0.46 ± 75%  perf-profile.calltrace.cycles-pp.__xstat64
      0.97 ± 18%      -0.5        0.46 ± 75%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__xstat64
      0.96 ± 18%      -0.5        0.46 ± 75%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64
      0.95 ± 18%      -0.5        0.45 ± 75%  perf-profile.calltrace.cycles-pp.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64
      0.92 ± 19%      -0.5        0.45 ± 75%  perf-profile.calltrace.cycles-pp.vfs_fstatat.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64
      0.72 ± 12%      -0.3        0.40 ± 71%  perf-profile.calltrace.cycles-pp.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      7.12 ± 13%      -2.4        4.73 ± 22%  perf-profile.children.cycles-pp.ksys_read
      6.91 ± 12%      -2.3        4.57 ± 23%  perf-profile.children.cycles-pp.vfs_read
      6.30 ± 13%      -2.2        4.12 ± 24%  perf-profile.children.cycles-pp.read
      5.34 ± 12%      -1.9        3.46 ± 25%  perf-profile.children.cycles-pp.seq_read_iter
      4.65 ± 13%      -1.7        2.98 ± 31%  perf-profile.children.cycles-pp.do_sys_openat2
      4.67 ± 13%      -1.7        3.01 ± 30%  perf-profile.children.cycles-pp.__x64_sys_openat
      4.43 ± 13%      -1.6        2.86 ± 29%  perf-profile.children.cycles-pp.do_filp_open
      4.41 ± 13%      -1.6        2.85 ± 29%  perf-profile.children.cycles-pp.path_openat
      3.23 ± 16%      -1.4        1.80 ± 27%  perf-profile.children.cycles-pp.open64
      3.89 ± 13%      -1.4        2.49 ± 20%  perf-profile.children.cycles-pp.seq_read
      2.61 ± 14%      -1.0        1.60 ± 20%  perf-profile.children.cycles-pp.proc_single_show
      2.59 ± 13%      -1.0        1.58 ± 21%  perf-profile.children.cycles-pp.do_task_stat
      1.66 ± 12%      -0.7        0.96 ± 36%  perf-profile.children.cycles-pp.lookup_fast
      1.43 ± 16%      -0.6        0.86 ± 29%  perf-profile.children.cycles-pp.walk_component
      1.50 ± 14%      -0.5        0.96 ± 30%  perf-profile.children.cycles-pp.link_path_walk
      1.24 ± 10%      -0.5        0.77 ± 32%  perf-profile.children.cycles-pp.do_open
      1.53 ±  7%      -0.4        1.08 ± 19%  perf-profile.children.cycles-pp.sched_setaffinity
      1.02 ± 15%      -0.4        0.64 ± 33%  perf-profile.children.cycles-pp.__xstat64
      1.10 ± 18%      -0.4        0.72 ± 31%  perf-profile.children.cycles-pp.__do_sys_newstat
      1.09 ± 18%      -0.4        0.73 ± 30%  perf-profile.children.cycles-pp.path_lookupat
      1.10 ± 18%      -0.4        0.74 ± 29%  perf-profile.children.cycles-pp.filename_lookup
      1.07 ± 19%      -0.4        0.72 ± 32%  perf-profile.children.cycles-pp.vfs_fstatat
      0.97 ±  9%      -0.4        0.62 ± 34%  perf-profile.children.cycles-pp.do_dentry_open
      0.82 ± 19%      -0.4        0.48 ± 34%  perf-profile.children.cycles-pp.__d_lookup_rcu
      0.94 ± 18%      -0.3        0.61 ± 35%  perf-profile.children.cycles-pp.vfs_statx
      0.61 ± 11%      -0.3        0.33 ± 32%  perf-profile.children.cycles-pp.pid_revalidate
      0.78 ± 14%      -0.3        0.50 ± 29%  perf-profile.children.cycles-pp.tlb_finish_mmu
      0.64 ± 15%      -0.3        0.37 ± 29%  perf-profile.children.cycles-pp.getdents64
      0.62 ± 16%      -0.3        0.35 ± 28%  perf-profile.children.cycles-pp.proc_pid_readdir
      0.64 ± 15%      -0.3        0.37 ± 29%  perf-profile.children.cycles-pp.__x64_sys_getdents64
      0.64 ± 15%      -0.3        0.37 ± 29%  perf-profile.children.cycles-pp.iterate_dir
      0.61 ± 15%      -0.3        0.35 ± 24%  perf-profile.children.cycles-pp.__percpu_counter_init
      0.96 ±  8%      -0.3        0.71 ± 20%  perf-profile.children.cycles-pp.evlist_cpu_iterator__next
      1.03 ± 12%      -0.2        0.78 ± 15%  perf-profile.children.cycles-pp.__libc_read
      0.75 ±  8%      -0.2        0.53 ± 17%  perf-profile.children.cycles-pp.__x64_sys_sched_setaffinity
      0.39 ± 13%      -0.2        0.19 ± 24%  perf-profile.children.cycles-pp.__entry_text_start
      0.40 ± 18%      -0.2        0.22 ± 25%  perf-profile.children.cycles-pp.ptrace_may_access
      0.62 ±  7%      -0.2        0.45 ± 17%  perf-profile.children.cycles-pp.__sched_setaffinity
      0.36 ± 16%      -0.2        0.20 ± 25%  perf-profile.children.cycles-pp.proc_fill_cache
      0.57 ±  6%      -0.2        0.40 ± 20%  perf-profile.children.cycles-pp.__set_cpus_allowed_ptr
      0.42 ± 21%      -0.2        0.27 ± 38%  perf-profile.children.cycles-pp.inode_permission
      0.36 ± 20%      -0.1        0.22 ± 25%  perf-profile.children.cycles-pp._find_next_bit
      0.39 ± 14%      -0.1        0.25 ± 22%  perf-profile.children.cycles-pp.__kmem_cache_alloc_node
      0.44 ± 12%      -0.1        0.30 ± 26%  perf-profile.children.cycles-pp.pick_link
      0.25 ± 18%      -0.1        0.12 ± 19%  perf-profile.children.cycles-pp.security_ptrace_access_check
      0.32 ± 15%      -0.1        0.19 ± 22%  perf-profile.children.cycles-pp.__x64_sys_readlink
      0.22 ± 13%      -0.1        0.11 ± 33%  perf-profile.children.cycles-pp.readlink
      0.31 ± 14%      -0.1        0.19 ± 22%  perf-profile.children.cycles-pp.do_readlinkat
      0.32 ± 11%      -0.1        0.22 ± 30%  perf-profile.children.cycles-pp.vfs_fstat
      0.26 ± 19%      -0.1        0.15 ± 26%  perf-profile.children.cycles-pp.load_elf_interp
      0.22 ± 17%      -0.1        0.12 ± 32%  perf-profile.children.cycles-pp.d_hash_and_lookup
      0.21 ± 31%      -0.1        0.12 ± 31%  perf-profile.children.cycles-pp.may_open
      0.30 ± 14%      -0.1        0.21 ± 18%  perf-profile.children.cycles-pp.copy_strings
      0.24 ± 18%      -0.1        0.14 ± 32%  perf-profile.children.cycles-pp.unlink_anon_vmas
      0.19 ± 19%      -0.1        0.10 ± 32%  perf-profile.children.cycles-pp.__kmalloc_node
      0.29 ±  8%      -0.1        0.21 ± 10%  perf-profile.children.cycles-pp.affine_move_task
      0.24 ± 21%      -0.1        0.16 ± 24%  perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
      0.22 ± 10%      -0.1        0.14 ± 28%  perf-profile.children.cycles-pp.mas_preallocate
      0.24 ± 12%      -0.1        0.16 ± 30%  perf-profile.children.cycles-pp.mas_alloc_nodes
      0.21 ± 14%      -0.1        0.14 ± 20%  perf-profile.children.cycles-pp.__d_alloc
      0.10 ± 19%      -0.1        0.03 ±100%  perf-profile.children.cycles-pp.pid_task
      0.14 ± 24%      -0.1        0.06 ± 50%  perf-profile.children.cycles-pp.single_open
      0.20 ± 11%      -0.1        0.12 ± 12%  perf-profile.children.cycles-pp.cpu_stop_queue_work
      0.18 ± 16%      -0.1        0.11 ± 25%  perf-profile.children.cycles-pp.generic_fillattr
      0.14 ± 19%      -0.1        0.07 ± 29%  perf-profile.children.cycles-pp.apparmor_ptrace_access_check
      0.14 ± 23%      -0.1        0.08 ± 30%  perf-profile.children.cycles-pp.native_flush_tlb_one_user
      0.10 ± 10%      -0.1        0.04 ± 71%  perf-profile.children.cycles-pp.vfs_readlink
      0.09 ± 19%      -0.1        0.03 ±100%  perf-profile.children.cycles-pp.aa_get_task_label
      0.14 ± 25%      -0.1        0.08 ± 23%  perf-profile.children.cycles-pp.proc_pid_get_link
      0.16 ± 21%      -0.1        0.10 ± 28%  perf-profile.children.cycles-pp.thread_group_cputime_adjusted
      0.19 ± 15%      -0.1        0.13 ± 27%  perf-profile.children.cycles-pp.strnlen_user
      0.18 ± 27%      -0.1        0.11 ± 21%  perf-profile.children.cycles-pp.wq_worker_comm
      0.18 ± 13%      -0.1        0.11 ± 36%  perf-profile.children.cycles-pp.vfs_getattr_nosec
      0.17 ± 16%      -0.1        0.11 ± 24%  perf-profile.children.cycles-pp.proc_pid_cmdline_read
      0.12 ± 10%      -0.1        0.06 ± 48%  perf-profile.children.cycles-pp.terminate_walk
      0.14 ± 18%      -0.1        0.09 ± 27%  perf-profile.children.cycles-pp.thread_group_cputime
      0.13 ± 21%      -0.0        0.08 ± 27%  perf-profile.children.cycles-pp.get_obj_cgroup_from_current
      0.14 ± 18%      -0.0        0.10 ± 26%  perf-profile.children.cycles-pp.get_mm_cmdline
      0.14 ± 10%      -0.0        0.10 ± 17%  perf-profile.children.cycles-pp.wake_up_q
      1.37 ± 16%      -0.6        0.81 ± 23%  perf-profile.self.cycles-pp.do_task_stat
      0.80 ± 18%      -0.3        0.46 ± 34%  perf-profile.self.cycles-pp.__d_lookup_rcu
      0.39 ± 15%      -0.2        0.19 ± 33%  perf-profile.self.cycles-pp.pid_revalidate
      0.37 ± 11%      -0.2        0.18 ± 22%  perf-profile.self.cycles-pp.__entry_text_start
      0.36 ± 14%      -0.2        0.21 ± 37%  perf-profile.self.cycles-pp.do_dentry_open
      0.44 ± 17%      -0.1        0.31 ± 24%  perf-profile.self.cycles-pp.gather_pte_stats
      0.23 ± 15%      -0.1        0.14 ± 14%  perf-profile.self.cycles-pp.__kmem_cache_alloc_node
      0.10 ± 18%      -0.1        0.03 ±100%  perf-profile.self.cycles-pp.pid_task
      0.21 ± 17%      -0.1        0.14 ± 25%  perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
      0.14 ± 23%      -0.1        0.08 ± 30%  perf-profile.self.cycles-pp.native_flush_tlb_one_user
      0.16 ± 23%      -0.1        0.09 ± 26%  perf-profile.self.cycles-pp.generic_fillattr
      0.09 ± 20%      -0.1        0.03 ±101%  perf-profile.self.cycles-pp.unlink_anon_vmas
      0.10 ± 25%      -0.1        0.04 ± 76%  perf-profile.self.cycles-pp.proc_fill_cache
      0.12 ± 20%      -0.1        0.06 ± 58%  perf-profile.self.cycles-pp.lookup_fast



> 
> Mel/PeterZ,
> 
> Whenever time permits can you please let us know your comments/concerns
> on the series?
> 
> Thanks and Regards
> - Raghu
>
Raghavendra K T Sept. 12, 2023, 6:43 a.m. UTC | #4
On 9/12/2023 7:52 AM, Oliver Sang wrote:
> hi, Raghu,
> 
> On Mon, Sep 11, 2023 at 04:55:56PM +0530, Raghavendra K T wrote:
>> On 9/10/2023 8:59 PM, kernel test robot wrote:
>>>     341.49            -4.1%     327.42 ±  2%  autonuma-benchmark.numa01.seconds
>>>       186.67 ±  6%     -27.1%     136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
>>>        21.17 ±  7%     -33.6%      14.05        autonuma-benchmark.numa02.seconds
>>>         2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time
>>
>> Hello Oliver/Kernel test robot,
>> Thank you very much for testing.
>>
>> The results are impressive. Can I take this result as
>> positive for the whole series too?
> 
> FYI, we applied your patch set as below:
> 
> 68cfe9439a1ba (linux-review/Raghavendra-K-T/sched-numa-Move-up-the-access-pid-reset-logic/20230829-141007) sched/numa: Allow scanning of shared VMAs
> af46f3c9ca2d1 sched/numa: Allow recently accessed VMAs to be scanned
> 167773d1ddb5f sched/numa: Increase tasks' access history
> fc769221b2306 sched/numa: Remove unconditional scan logic using mm numa_scan_seq
> 1ef5cbb92bdb3 sched/numa: Add disjoint vma unconditional scan logic
> 2a806eab1c2e1 sched/numa: Move up the access pid reset logic
> 2f88c8e802c8b (tip/sched/core) sched/eevdf/doc: Modify the documented knob to base_slice_ns as well
> 
> In our tests we also tested 68cfe9439a1ba; comparing it to af46f3c9ca2d1:
> 
> =========================================================================================
> compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
>    gcc-12/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/numa01_THREAD_ALLOC/autonuma-benchmark
> 
> commit:
>    af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")
>    68cfe9439a ("sched/numa: Allow scanning of shared VMA")
> 
> af46f3c9ca2d1648 68cfe9439a1baa642e05883fa64
> ---------------- ---------------------------
>           %stddev     %change         %stddev
>               \          |                \
>      327.42 ±  2%      -1.1%     323.83 ±  3%  autonuma-benchmark.numa01.seconds
>      136.12 ±  7%     -25.1%     101.90 ±  2%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
>       14.05            +1.5%      14.26        autonuma-benchmark.numa02.seconds
>        1913 ±  3%      -7.9%       1763 ±  2%  autonuma-benchmark.time.elapsed_time
> 
> 
> below is the full comparison FYI.
> 

Thanks a lot for the further run and details.

Combining this result with the previous one, we have a very good
result overall for LKP.

  167773d1dd ("sched/numa: Increase tasks' access history")
  af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")

167773d1ddb5ffdd af46f3c9ca2d16485912f8b9c89
---------------- ---------------------------
         %stddev     %change         %stddev
    341.49            -4.1%     327.42 ±  2%  autonuma-benchmark.numa01.seconds
    186.67 ±  6%     -27.1%     136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
     21.17 ±  7%     -33.6%      14.05        autonuma-benchmark.numa02.seconds
      2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time

Thanks and Regards
- Raghu




> 
> 
> 
>>
>> Mel/PeterZ,
>>
>> Whenever time permits can you please let us know your comments/concerns
>> on the series?
>>
>> Thanks and Regards
>> - Raghu
>>

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ae2a1a3ef5c..6529da7f370a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2971,8 +2971,22 @@ static inline bool vma_test_access_pid_history(struct vm_area_struct *vma)
 	return test_bit(pid_bit, &pids);
 }
 
+static inline bool vma_accessed_recent(struct vm_area_struct *vma)
+{
+	unsigned long *pids, pid_idx;
+
+	pid_idx = vma->numab_state->access_pid_idx;
+	pids = vma->numab_state->access_pids + pid_idx;
+
+	return (bitmap_weight(pids, BITS_PER_LONG) >= 1);
+}
+
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
+	/* Check whether at least one task has accessed the VMA recently. */
+	if (vma_accessed_recent(vma))
+		return true;
+
 	/* Check if the current task had historically accessed VMA. */
 	if (vma_test_access_pid_history(vma))
 		return true;
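
For illustration, here is a minimal userspace sketch of how the new
vma_accessed_recent() check composes with the existing per-task history
check. The types and constants below (numab_state_sketch, NR_PID_SLOTS)
are hypothetical simplifications; the real code operates on
vma->numab_state in kernel/sched/fair.c and uses
bitmap_weight(pids, BITS_PER_LONG) >= 1, where the sketch simply tests
the word for non-zero (equivalent for a single unsigned long).

#include <stdbool.h>
#include <stdio.h>

#define NR_PID_SLOTS  4                       /* assumed history depth */
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct numab_state_sketch {
	unsigned long access_pids[NR_PID_SLOTS];  /* per-window PID-hash bitmaps */
	unsigned long access_pid_idx;             /* slot for the current window */
};

/* New check: some task set a bit in the current window. */
static bool vma_accessed_recent(const struct numab_state_sketch *ns)
{
	return ns->access_pids[ns->access_pid_idx] != 0;
}

/* Existing check: this task's hashed PID bit anywhere in the history. */
static bool vma_test_access_pid_history(const struct numab_state_sketch *ns,
					unsigned int pid)
{
	unsigned long pids = 0;
	int i;

	for (i = 0; i < NR_PID_SLOTS; i++)
		pids |= ns->access_pids[i];
	return pids & (1UL << (pid % BITS_PER_LONG));
}

static bool vma_is_accessed(const struct numab_state_sketch *ns,
			    unsigned int pid)
{
	/* A hot VMA is scanned regardless of which task touched it. */
	if (vma_accessed_recent(ns))
		return true;
	return vma_test_access_pid_history(ns, pid);
}

int main(void)
{
	struct numab_state_sketch ns = { { 0 }, 0 };

	/* Another task (pid 1234) touched the VMA in the current window... */
	ns.access_pids[0] |= 1UL << (1234 % BITS_PER_LONG);
	/* ...so a scan is allowed even for an unrelated pid. Prints 1. */
	printf("scan for pid 5678? %d\n", vma_is_accessed(&ns, 5678));
	return 0;
}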