
[2/2] /proc/stat: Add sysctl parameter to control irq counts latency

Message ID 1546873978-27797-3-git-send-email-longman@redhat.com (mailing list archive)
State New, archived
Series /proc/stat: Reduce irqs counting performance overhead

Commit Message

Waiman Long Jan. 7, 2019, 3:12 p.m. UTC
Reading /proc/stat can be slow especially if there are many irqs and on
systems with many CPUs as summation of all the percpu counts for each
of the irqs is required. On some newer systems, there can be more than
1000 irqs per socket.

Applications that need to read /proc/stat many times per second will
easily hit a bottleneck. In reality, the irq counts are seldom looked
at. Even those applications that read them don't really need up-to-date
information. One way to reduce the performance impact of irq counts
computation is to do it less frequently.

A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
control the maximum latency in milliseconds allowed between the time
when the computation was done and when the values are reported. Setting
this parameter to an appropriate value will allow us to reduce the
performance impact of reading /proc/stat repetitively. If /proc/stat
is read once in a while, the irq counts will be accurate. Reading
/proc/stat repetitively, however, may make the counts somewhat stale.

On a 4-socket 96-core Broadwell system (HT off) with 2824 irqs,
the times for reading /proc/stat 10,000 times with various values of
proc-stat-irqs-latency-ms were:

  proc-stat-irqs-latency-ms   elapsed time   sys time
  -------------------------   ------------   --------
              0                 11.041s       9.452s
              1                 12.983s      10.314s
             10                  8.452s       5.466s
            100                  8.003s       4.882s
           1000                  8.000s       4.740s

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/sysctl/fs.txt | 16 +++++++++++++++
 fs/proc/stat.c              | 48 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c             | 12 ++++++++++++
 3 files changed, 76 insertions(+)

Comments

Matthew Wilcox Jan. 7, 2019, 3:58 p.m. UTC | #1
On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
> Reading /proc/stat can be slow especially if there are many irqs and on
> systems with many CPUs as summation of all the percpu counts for each
> of the irqs is required. On some newer systems, there can be more than
> 1000 irqs per socket.
> 
> Applications that need to read /proc/stat many times per seconds will
> easily hit a bottleneck. In reality, the irq counts are seldom looked
> at. Even those applications that read them don't really need up-to-date
> information. One way to reduce the performance impact of irq counts
> computation is to do it less frequently.
> 
> A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to

No.  No, no, no, no, no.  No.

Stop adding new sysctls for this kind of thing.  It's just a way to shift
blame from us (who allegedly know what we're doing) to the sysadmin
(who probably has better things to be doing than keeping up with the
intricacies of development of every single piece of software running on
their system).

Let's figure out what this _should_ be.  As a strawman, I propose we
update these stats once a second.  That's easy to document and understand.
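
In terms of this patch, that would mean keeping the cached buffer but
hard-coding the refresh interval instead of reading a sysctl, roughly
(untested sketch, reusing the variables from the hunk quoted below):

    /* Sketch only: always cache, refresh at most once per second (HZ). */
    mutex_lock(&irqs_mutex);
    if (irqs_buf && time_before(jiffies, last_jiffies + HZ))
        goto print_out;
    /* ... otherwise recompute irqs_buf and set last_jiffies = jiffies ... */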

> @@ -98,7 +105,48 @@ static u64 compute_stat_irqs_sum(void)
>  static void show_stat_irqs(struct seq_file *p)
>  {
>  	int i;
> +#ifdef CONFIG_PROC_SYSCTL
> +	static char *irqs_buf;		   /* Buffer for irqs values */
> +	static int buflen;
> +	static unsigned long last_jiffies; /* Last buffer update jiffies */
> +	static DEFINE_MUTEX(irqs_mutex);
> +	unsigned int latency = proc_stat_irqs_latency_ms;
> +
> +	if (latency) {
> +		char *ptr;
> +
> +		latency = _msecs_to_jiffies(latency);
> +
> +		mutex_lock(&irqs_mutex);
> +		if (irqs_buf && time_before(jiffies, last_jiffies + latency))
> +			goto print_out;
> +
> +		/*
> +		 * Each irq value may require up to 11 bytes.
> +		 */
> +		if (!irqs_buf) {
> +			irqs_buf = kmalloc(nr_irqs * 11 + 32,
> +					   GFP_KERNEL | __GFP_ZERO);

Why are you caching the _output_ of calling sprintf(), rather than caching the
values of each interrupt?
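
i.e. cache the summed counts and keep doing the formatting on every
read, something like this untested sketch (the variable names here are
made up, not from the posted patch; i, p and latency are the ones
already in show_stat_irqs()):

    static u64 *irqs_sums;          /* cached per-irq sums */
    static u64 irqs_total;          /* cached "intr" total */
    static unsigned long last_update;

    /* under irqs_mutex, as in the patch */
    if (!irqs_sums)
        irqs_sums = kcalloc(nr_irqs, sizeof(*irqs_sums), GFP_KERNEL);
    if (irqs_sums &&
        (!last_update || time_after(jiffies, last_update + latency))) {
        irqs_total = compute_stat_irqs_sum();
        for_each_irq_nr(i)
            irqs_sums[i] = kstat_irqs_usr(i);
        last_update = jiffies;
    }
    seq_put_decimal_ull(p, "intr ", irqs_sums ? irqs_total
                                              : compute_stat_irqs_sum());
    for_each_irq_nr(i)
        seq_put_decimal_ull(p, " ", irqs_sums ? irqs_sums[i]
                                              : kstat_irqs_usr(i));
    seq_putc(p, '\n');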
Waiman Long Jan. 7, 2019, 4:07 p.m. UTC | #2
On 01/07/2019 10:58 AM, Matthew Wilcox wrote:
> On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
>> Reading /proc/stat can be slow especially if there are many irqs and on
>> systems with many CPUs as summation of all the percpu counts for each
>> of the irqs is required. On some newer systems, there can be more than
>> 1000 irqs per socket.
>>
>> Applications that need to read /proc/stat many times per seconds will
>> easily hit a bottleneck. In reality, the irq counts are seldom looked
>> at. Even those applications that read them don't really need up-to-date
>> information. One way to reduce the performance impact of irq counts
>> computation is to do it less frequently.
>>
>> A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
> No.  No, no, no, no, no.  No.
>
> Stop adding new sysctls for this kind of thing.  It's just a way to shift
> blame from us (who allegedly know what we're doing) to the sysadmin
> (who probably has better things to be doing than keeping up with the
> intricacies of development of every single piece of software running on
> their system).
>
> Let's figure out what this _should_ be.  As a strawman, I propose we
> update these stats once a second.  That's easy to document and understand.

I am fine with having a fixed value (like 1s) for reporting purposes. It
is just that people may have many different opinions on what the right
value should be. That is why I opted for flexibility in the initial patch.

>
>> @@ -98,7 +105,48 @@ static u64 compute_stat_irqs_sum(void)
>>  static void show_stat_irqs(struct seq_file *p)
>>  {
>>  	int i;
>> +#ifdef CONFIG_PROC_SYSCTL
>> +	static char *irqs_buf;		   /* Buffer for irqs values */
>> +	static int buflen;
>> +	static unsigned long last_jiffies; /* Last buffer update jiffies */
>> +	static DEFINE_MUTEX(irqs_mutex);
>> +	unsigned int latency = proc_stat_irqs_latency_ms;
>> +
>> +	if (latency) {
>> +		char *ptr;
>> +
>> +		latency = _msecs_to_jiffies(latency);
>> +
>> +		mutex_lock(&irqs_mutex);
>> +		if (irqs_buf && time_before(jiffies, last_jiffies + latency))
>> +			goto print_out;
>> +
>> +		/*
>> +		 * Each irq value may require up to 11 bytes.
>> +		 */
>> +		if (!irqs_buf) {
>> +			irqs_buf = kmalloc(nr_irqs * 11 + 32,
>> +					   GFP_KERNEL | __GFP_ZERO);
> Why are you caching the _output_ of calling sprintf(), rather than caching the
> values of each interrupt?
>
It is just faster to dump the whole string buffer than to redo the
number formatting each time when the values don't change. I can cache
the individual sums instead if that is what most prefer.

Cheers,
Longman
Matthew Wilcox Jan. 7, 2019, 4:14 p.m. UTC | #3
On Mon, Jan 07, 2019 at 11:07:47AM -0500, Waiman Long wrote:
> > Why are you caching the _output_ of calling sprintf(), rather than caching the
> > values of each interrupt?
> >
> It is just faster to dump the whole string buffer than redoing the
> number formatting each time when the values don't change. I can cache
> the individual sums instead if it is the preferred by most.

But it also consumes more memory.  Can you gather some measurements to
find out what the performance difference is if you cache the values
instead of the strings?
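
E.g. something as simple as timing a tight read loop against a kernel
built with each variant would do (sketch only; the iteration count just
mirrors the numbers in the commit message):

    /* time 10,000 full reads of /proc/stat; run under time(1) per variant */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[1 << 20];
        int fd = open("/proc/stat", O_RDONLY);

        if (fd < 0)
            return 1;
        for (int i = 0; i < 10000; i++) {
            lseek(fd, 0, SEEK_SET);
            while (read(fd, buf, sizeof(buf)) > 0)
                ;
        }
        close(fd);
        return 0;
    }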
Waiman Long Jan. 7, 2019, 4:19 p.m. UTC | #4
On 01/07/2019 11:14 AM, Matthew Wilcox wrote:
> On Mon, Jan 07, 2019 at 11:07:47AM -0500, Waiman Long wrote:
>>> Why are you caching the _output_ of calling sprintf(), rather than caching the
>>> values of each interrupt?
>>>
>> It is just faster to dump the whole string buffer than redoing the
>> number formatting each time when the values don't change. I can cache
>> the individual sums instead if it is the preferred by most.
> But it also consumes more memory.  Can you gather some measurements to
> find out what the performance difference is if you cache the values
> instead of the strings?

I allocate 11 bytes (10 digits + space) for each unsigned int IRQ value.
It needs 4 bytes for the unsigned value itself. So it is a saving of 7
bytes per value. With 4k irqs, that is about 28k (7 x 4096 = 28,672 bytes).

I will run some measurements of caching the values versus saving the
string later this week.

Thanks,
Longman
Alexey Dobriyan Jan. 7, 2019, 4:33 p.m. UTC | #5
On Mon, Jan 07, 2019 at 07:58:40AM -0800, Matthew Wilcox wrote:
> On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:

> > A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
> 
> No.  No, no, no, no, no.  No.
> 
> Stop adding new sysctls for this kind of thing.  It's just a way to shift
> blame from us (who allegedly know what we're doing) to the sysadmin
> (who probably has better things to be doing than keeping up with the
> intricacies of development of every single piece of software running on
> their system).
> 
> Let's figure out what this _should_ be.

Yeah, let's start by interrogating which values specifically this
super sekret application wants.

I assume CPU stats, so a system call which returns CPU statistics in binary form.

> Why are you caching the _output_ of calling sprintf(), rather than caching the
> values of each interrupt?

For output caching, the string is better, but I'm not defending the patch.
Waiman Long Jan. 7, 2019, 4:59 p.m. UTC | #6
On 01/07/2019 11:33 AM, Alexey Dobriyan wrote:
> On Mon, Jan 07, 2019 at 07:58:40AM -0800, Matthew Wilcox wrote:
>> On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
>>> A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
>> No.  No, no, no, no, no.  No.
>>
>> Stop adding new sysctls for this kind of thing.  It's just a way to shift
>> blame from us (who allegedly know what we're doing) to the sysadmin
>> (who probably has better things to be doing than keeping up with the
>> intricacies of development of every single piece of software running on
>> their system).
>>
>> Let's figure out what this _should_ be.
> Yeah, let's start interrogating about which values specifically this
> super sekret applications wants.
>
> I assume CPU stats, so system call which returns CPU statistics in binary form.

The /proc/stat file contains a bunch of statistics for different system
parameters. I doubt the irq values are what most applications that read
/proc/stat are looking for. The problem some customers saw was that
their application performance dropped quite significantly when moving
to a newer system with more CPUs and more irqs. The issue here is that
their applications read /proc/stat at a relatively high frequency. The
irq counting overhead can weigh in quite significantly and slow their
applications down. Of course, they can modify their applications to try
not to read /proc/stat that often.

One instance I was notified about recently was the Oracle database,
which used lseek() to move the current file pointer of /proc/stat,
causing the system to recompute the stats every time the file pointer
was moved. Of course, that is not the right way to read a procfs file
and they are probably going to change that. My point is that customers
are going to hit problems like that when they move to larger and newer
systems.
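
For comparison, the usual pattern is to rewind and re-read the whole
file once per sample and parse all the fields of interest out of that
one buffer, rather than seeking around for each individual field. A
rough userspace sketch (buffer size and helper name are made up for
illustration):

    #include <fcntl.h>
    #include <unistd.h>

    static char buf[1 << 20];       /* size is only illustrative */

    /* fd was opened once with open("/proc/stat", O_RDONLY) */
    static void sample_proc_stat(int fd)
    {
        ssize_t n, total = 0;

        lseek(fd, 0, SEEK_SET);     /* one rewind per sample */
        while ((n = read(fd, buf + total,
                         sizeof(buf) - 1 - total)) > 0)
            total += n;
        buf[total] = '\0';
        /* parse the needed fields from buf[] here */
    }

That way the stats are regenerated once per sample, no matter how many
fields the application looks at.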

Thanks,
Longman
Kees Cook Jan. 21, 2019, 8:02 p.m. UTC | #7
On Fri, Jan 18, 2019 at 9:44 PM kernel test robot <rong.a.chen@intel.com> wrote:
>
> Greeting,
>
> FYI, we noticed a -4.8% regression of reaim.jobs_per_min due to commit:
>
>
> commit: 3047027b34b8c6404b509903058b89836093acc7 ("[PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency")
> url: https://github.com/0day-ci/linux/commits/Waiman-Long/proc-stat-Reduce-irqs-counting-performance-overhead/20190108-104818

Is this expected? (And it seems like other things in the report below
are faster? I don't understand why this particular regression was
called out?)

-Kees

>
>
> in testcase: reaim
> on test machine: 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory
> with following parameters:
>
>         runtime: 300s
>         nr_task: 5000
>         test: shared_memory
>         cpufreq_governor: performance
>         ucode: 0x3d
>
> test-description: REAIM is an updated and improved version of AIM 7 benchmark.
> test-url: https://sourceforge.net/projects/re-aim-7/
>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run     job.yaml
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase/ucode:
>   gcc-7/performance/x86_64-rhel-7.2/5000/debian-x86_64-2018-04-03.cgz/300s/lkp-hsw-ep5/shared_memory/reaim/0x3d
>
> commit:
>   51e8bce392 ("/proc/stat: Extract irqs counting code into show_stat_irqs()")
>   3047027b34 ("/proc/stat: Add sysctl parameter to control irq counts latency")
>
> 51e8bce392dd2cc9 3047027b34b8c6404b50990305
> ---------------- --------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>           1:4          -25%            :4     kmsg.igb#:#:#:exceed_max#second
>          %stddev     %change         %stddev
>              \          |                \
>     101.96            +7.5%     109.60        reaim.child_systime
>      33.32            -1.8%      32.73        reaim.child_utime
>    5534451            -4.8%    5271308        reaim.jobs_per_min
>       1106            -4.8%       1054        reaim.jobs_per_min_child
>    5800927            -4.9%    5517884        reaim.max_jobs_per_min
>       5.42            +5.0%       5.69        reaim.parent_time
>       1.51            +5.3%       1.59        reaim.std_dev_time
>   29374932            -2.8%   28558608        reaim.time.minor_page_faults
>       1681            +1.6%       1708        reaim.time.percent_of_cpu_this_job_got
>       3841            +4.5%       4012        reaim.time.system_time
>       1234            -4.4%       1180        reaim.time.user_time
>   18500000            -2.7%   18000000        reaim.workload
>    5495296 ±  9%      -9.5%    4970496        meminfo.DirectMap2M
>       5142 ± 18%     -43.2%       2920 ± 46%  numa-vmstat.node0.nr_shmem
>      29.00 ± 32%     +56.9%      45.50 ± 10%  vmstat.procs.r
>      67175 ± 37%     +66.6%     111910 ± 20%  numa-meminfo.node0.AnonHugePages
>      20591 ± 18%     -43.2%      11691 ± 46%  numa-meminfo.node0.Shmem
>      64688 ±  6%     -36.8%      40906 ± 19%  slabinfo.kmalloc-8.active_objs
>      64691 ±  6%     -36.8%      40908 ± 19%  slabinfo.kmalloc-8.num_objs
>      37.36 ±  7%     +11.1%      41.53 ±  4%  boot-time.boot
>      29.15 ±  6%     +14.3%      33.31 ±  3%  boot-time.dhcp
>     847.73 ±  9%     +12.9%     957.09 ±  4%  boot-time.idle
>     202.50 ±100%    +101.7%     408.50        proc-vmstat.nr_mlock
>       8018 ±  9%     -12.3%       7034 ±  2%  proc-vmstat.nr_shmem
>   29175944            -2.8%   28369676        proc-vmstat.numa_hit
>   29170351            -2.8%   28364111        proc-vmstat.numa_local
>       5439 ±  5%     -18.7%       4423 ±  7%  proc-vmstat.pgactivate
>   30220220            -2.8%   29374906        proc-vmstat.pgalloc_normal
>   30182224            -2.7%   29368266        proc-vmstat.pgfault
>   30186671            -2.8%   29341792        proc-vmstat.pgfree
>      69510 ± 12%     -34.2%      45759 ± 33%  sched_debug.cfs_rq:/.load.avg
>      30.21 ± 24%     -33.6%      20.05 ± 20%  sched_debug.cfs_rq:/.runnable_load_avg.avg
>      66447 ± 12%     -37.6%      41460 ± 37%  sched_debug.cfs_rq:/.runnable_weight.avg
>      12.35 ±  4%     +88.0%      23.22 ± 15%  sched_debug.cpu.clock.stddev
>      12.35 ±  4%     +88.0%      23.22 ± 15%  sched_debug.cpu.clock_task.stddev
>      30.06 ± 12%     -26.5%      22.10 ± 13%  sched_debug.cpu.cpu_load[0].avg
>      29.37 ±  9%     -22.6%      22.72 ± 13%  sched_debug.cpu.cpu_load[1].avg
>      28.71 ±  6%     -21.1%      22.66 ± 16%  sched_debug.cpu.cpu_load[2].avg
>      17985           -12.0%      15823 ±  2%  sched_debug.cpu.curr->pid.max
>      67478 ±  6%     -32.5%      45531 ± 24%  sched_debug.cpu.load.avg
>      10369 ± 49%    -100.0%       0.00        sched_debug.cpu.load.min
>       0.21 ± 34%    -100.0%       0.00        sched_debug.cpu.nr_running.min
>      12.98           -16.3%      10.86 ± 11%  perf-stat.i.MPKI
>  5.712e+09            -3.8%  5.492e+09        perf-stat.i.branch-instructions
>  1.024e+08            -3.7%   98557208        perf-stat.i.branch-misses
>       8.17            +0.4        8.58 ±  2%  perf-stat.i.cache-miss-rate%
>    7839589           +10.5%    8659798        perf-stat.i.cache-misses
>   86324420            +3.8%   89595898 ±  2%  perf-stat.i.cache-references
>       1.55 ±  2%      -4.0%       1.49        perf-stat.i.cpi
>       2290            -1.9%       2246        perf-stat.i.cpu-migrations
>       4667           -10.9%       4160        perf-stat.i.cycles-between-cache-misses
>  8.749e+09            -3.9%  8.409e+09        perf-stat.i.dTLB-loads
>     527660 ±  3%     -15.0%     448539 ±  5%  perf-stat.i.dTLB-store-misses
>  5.747e+09            -4.3%  5.499e+09        perf-stat.i.dTLB-stores
>   53047071            -3.5%   51190942        perf-stat.i.iTLB-load-misses
>   20576112           -13.7%   17759009 ±  2%  perf-stat.i.iTLB-loads
>  3.207e+10            -3.9%  3.083e+10        perf-stat.i.instructions
>       0.77            -2.3%       0.75        perf-stat.i.ipc
>      99933            -3.8%      96127        perf-stat.i.minor-faults
>    4325719            +5.6%    4568226        perf-stat.i.node-load-misses
>      52.39            -2.0       50.36        perf-stat.i.node-store-miss-rate%
>    1411700           +20.9%    1706321        perf-stat.i.node-store-misses
>     883790           +34.1%    1184836        perf-stat.i.node-stores
>      99933            -3.8%      96127        perf-stat.i.page-faults
>       2.69            +7.9%       2.91        perf-stat.overall.MPKI
>       9.08            +0.6        9.67 ±  2%  perf-stat.overall.cache-miss-rate%
>       1.13            +5.0%       1.19        perf-stat.overall.cpi
>       4633            -8.6%       4233        perf-stat.overall.cycles-between-cache-misses
>       0.01 ±  2%      -0.0        0.01 ±  4%  perf-stat.overall.dTLB-store-miss-rate%
>      72.05            +2.2       74.24        perf-stat.overall.iTLB-load-miss-rate%
>       0.88            -4.8%       0.84        perf-stat.overall.ipc
>      78.12            +1.4       79.52        perf-stat.overall.node-load-miss-rate%
>      61.49            -2.5       59.01        perf-stat.overall.node-store-miss-rate%
>  5.688e+09            -3.8%  5.471e+09        perf-stat.ps.branch-instructions
>   1.02e+08            -3.7%   98177745        perf-stat.ps.branch-misses
>    7807912           +10.5%    8626353        perf-stat.ps.cache-misses
>   85999504            +3.8%   89266688 ±  2%  perf-stat.ps.cache-references
>       2282            -1.9%       2239        perf-stat.ps.cpu-migrations
>  8.713e+09            -3.9%  8.376e+09        perf-stat.ps.dTLB-loads
>     525761 ±  3%     -15.0%     446967 ±  5%  perf-stat.ps.dTLB-store-misses
>  5.723e+09            -4.3%  5.478e+09        perf-stat.ps.dTLB-stores
>   52823322            -3.5%   50990190        perf-stat.ps.iTLB-load-misses
>   20490261           -13.7%   17689923 ±  2%  perf-stat.ps.iTLB-loads
>  3.193e+10            -3.8%  3.071e+10        perf-stat.ps.instructions
>      99560            -3.8%      95786        perf-stat.ps.minor-faults
>    4308031            +5.6%    4550453        perf-stat.ps.node-load-misses
>    1405805           +20.9%    1699670        perf-stat.ps.node-store-misses
>     880319           +34.1%    1180378        perf-stat.ps.node-stores
>      99560            -3.8%      95786        perf-stat.ps.page-faults
>  9.664e+12            -2.8%  9.397e+12        perf-stat.total.instructions
>      33.09 ±  7%     -25.0        8.10 ± 45%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
>      32.21 ±  7%     -24.3        7.92 ± 45%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      22.36 ±  7%     -17.1        5.24 ± 49%  perf-profile.calltrace.cycles-pp.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      10.57 ±  5%      -8.3        2.31 ± 48%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64
>       9.60 ±  4%      -7.6        2.03 ± 49%  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret
>       3.68 ±  8%      -3.3        0.43 ±102%  perf-profile.calltrace.cycles-pp.security_ipc_permission.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       4.37 ±  9%      -2.8        1.60 ± 50%  perf-profile.calltrace.cycles-pp.ipc_obtain_object_check.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       2.30 ±  2%      -1.9        0.44 ±102%  perf-profile.calltrace.cycles-pp.security_sem_semop.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       5.10 ±  3%      -1.6        3.46 ± 10%  perf-profile.calltrace.cycles-pp.ipc_has_perm.security_ipc_permission.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       3.20            -1.3        1.95 ± 23%  perf-profile.calltrace.cycles-pp.avc_has_perm.ipc_has_perm.security_ipc_permission.do_semtimedop.do_syscall_64
>       2.12 ±  3%      -0.4        1.76 ±  4%  perf-profile.calltrace.cycles-pp.avc_has_perm.ipc_has_perm.security_sem_semop.do_semtimedop.do_syscall_64
>       1.03 ±  2%      -0.1        0.94 ±  3%  perf-profile.calltrace.cycles-pp.__do_page_fault.do_page_fault.page_fault
>       1.06 ±  2%      -0.1        0.97 ±  3%  perf-profile.calltrace.cycles-pp.do_page_fault.page_fault
>       1.06 ±  2%      -0.1        0.98 ±  3%  perf-profile.calltrace.cycles-pp.page_fault
>       0.91 ±  3%      -0.1        0.82 ±  3%  perf-profile.calltrace.cycles-pp.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
>       0.86 ±  3%      -0.1        0.78 ±  3%  perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
>       0.61 ± 12%      +0.2        0.79 ± 12%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry
>       0.61 ± 10%      +0.2        0.79 ± 10%  perf-profile.calltrace.cycles-pp.shm_close.remove_vma.__do_munmap.ksys_shmdt.do_syscall_64
>       0.67 ± 11%      +0.2        0.86 ± 10%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
>       0.40 ± 57%      +0.2        0.62 ±  8%  perf-profile.calltrace.cycles-pp.call_rwsem_down_write_failed.down_write.do_shmat.__x64_sys_shmat.do_syscall_64
>       0.40 ± 57%      +0.2        0.62 ±  7%  perf-profile.calltrace.cycles-pp.rwsem_down_write_failed.call_rwsem_down_write_failed.down_write.do_shmat.__x64_sys_shmat
>       0.43 ± 57%      +0.2        0.66 ±  7%  perf-profile.calltrace.cycles-pp.down_write.do_shmat.__x64_sys_shmat.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.41 ± 57%      +0.3        0.67 ±  9%  perf-profile.calltrace.cycles-pp.call_rwsem_down_write_failed.down_write.shm_close.remove_vma.__do_munmap
>       0.40 ± 57%      +0.3        0.66 ±  7%  perf-profile.calltrace.cycles-pp.down_write.ipcget.__x64_sys_shmget.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.41 ± 57%      +0.3        0.67 ±  9%  perf-profile.calltrace.cycles-pp.rwsem_down_write_failed.call_rwsem_down_write_failed.down_write.shm_close.remove_vma
>       0.44 ± 57%      +0.3        0.71 ±  9%  perf-profile.calltrace.cycles-pp.down_write.shm_close.remove_vma.__do_munmap.ksys_shmdt
>       0.26 ±100%      +0.4        0.67 ±  8%  perf-profile.calltrace.cycles-pp.dput.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64
>       0.28 ±100%      +0.4        0.71 ±  7%  perf-profile.calltrace.cycles-pp.__might_fault._copy_from_user.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.58 ± 10%      +0.4        1.03 ±  6%  perf-profile.calltrace.cycles-pp.shmctl_down.ksys_shmctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.59 ± 10%      +0.5        1.05 ±  6%  perf-profile.calltrace.cycles-pp.ksys_shmctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.13 ±173%      +0.5        0.62 ±  7%  perf-profile.calltrace.cycles-pp.call_rwsem_down_write_failed.down_write.ipcget.__x64_sys_shmget.do_syscall_64
>       0.13 ±173%      +0.5        0.62 ±  7%  perf-profile.calltrace.cycles-pp.rwsem_down_write_failed.call_rwsem_down_write_failed.down_write.ipcget.__x64_sys_shmget
>       0.12 ±173%      +0.5        0.66 ±  8%  perf-profile.calltrace.cycles-pp.dentry_kill.dput.__fput.task_work_run.exit_to_usermode_loop
>       0.00            +0.6        0.58 ±  7%  perf-profile.calltrace.cycles-pp.copy_user_generic_unrolled._copy_from_user.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.00            +0.6        0.60 ±  7%  perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.ksys_shmdt.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.00            +0.6        0.61 ±  7%  perf-profile.calltrace.cycles-pp.wait_consider_task.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
>       0.00            +0.6        0.62 ±  4%  perf-profile.calltrace.cycles-pp.semctl_down.ksys_semctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.semctl
>       0.75 ±  8%      +0.6        1.38 ±  8%  perf-profile.calltrace.cycles-pp.__x64_sys_semop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       0.00            +0.6        0.65 ±  9%  perf-profile.calltrace.cycles-pp.ipcperms.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       0.00            +0.6        0.65 ±  8%  perf-profile.calltrace.cycles-pp.__dentry_kill.dentry_kill.dput.__fput.task_work_run
>       0.89 ± 11%      +0.7        1.57 ±  8%  perf-profile.calltrace.cycles-pp.do_shmat.__x64_sys_shmat.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmat
>       0.90 ± 11%      +0.7        1.58 ±  8%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmat
>       0.90 ± 10%      +0.7        1.58 ±  8%  perf-profile.calltrace.cycles-pp.__x64_sys_shmat.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmat
>       0.00            +0.7        0.68 ±  5%  perf-profile.calltrace.cycles-pp.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.00            +0.7        0.69 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.wait
>       0.00            +0.7        0.69 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.wait
>       0.00            +0.7        0.69 ±  6%  perf-profile.calltrace.cycles-pp.__do_sys_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe.wait
>       0.00            +0.7        0.69 ±  6%  perf-profile.calltrace.cycles-pp.kernel_wait4.__do_sys_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe.wait
>       0.90 ± 11%      +0.7        1.59 ±  8%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmat
>       0.00            +0.7        0.69 ±  5%  perf-profile.calltrace.cycles-pp.wait
>       0.27 ±100%      +0.7        0.99 ±  7%  perf-profile.calltrace.cycles-pp.perform_atomic_semop.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       0.95 ± 11%      +0.7        1.67 ±  8%  perf-profile.calltrace.cycles-pp.shmat
>       0.00            +0.7        0.73 ±  6%  perf-profile.calltrace.cycles-pp.ipcget.ksys_semget.do_syscall_64.entry_SYSCALL_64_after_hwframe.semget
>       0.00            +0.7        0.74 ±  8%  perf-profile.calltrace.cycles-pp.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.72 ±  9%      +0.7        1.46 ±  7%  perf-profile.calltrace.cycles-pp._raw_spin_lock.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       0.00            +0.8        0.75 ±  5%  perf-profile.calltrace.cycles-pp.ksys_semget.do_syscall_64.entry_SYSCALL_64_after_hwframe.semget
>       0.00            +0.8        0.76 ±  8%  perf-profile.calltrace.cycles-pp.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.89 ±  9%      +0.8        1.65 ±  9%  perf-profile.calltrace.cycles-pp.__do_munmap.ksys_shmdt.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmdt
>       0.00            +0.8        0.76 ±  5%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.semget
>       0.00            +0.8        0.76 ±  5%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.semget
>       0.90 ±  9%      +0.8        1.68 ±  9%  perf-profile.calltrace.cycles-pp.ksys_shmdt.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmdt
>       0.00            +0.8        0.78 ±  6%  perf-profile.calltrace.cycles-pp.ksys_semctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.semctl
>       0.00            +0.8        0.80 ±  5%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.semctl
>       0.00            +0.8        0.80 ±  5%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.semctl
>       0.00            +0.8        0.85 ±  5%  perf-profile.calltrace.cycles-pp.semget
>       0.97 ±  9%      +0.8        1.82 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.97 ±  9%      +0.8        1.82 ±  7%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmctl
>       0.94 ± 10%      +0.9        1.79 ±  6%  perf-profile.calltrace.cycles-pp._copy_from_user.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       1.04 ±  9%      +0.9        1.91 ±  6%  perf-profile.calltrace.cycles-pp.ipcget.__x64_sys_shmget.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmget
>       1.05 ±  9%      +0.9        1.92 ±  5%  perf-profile.calltrace.cycles-pp.__x64_sys_shmget.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmget
>       1.05 ±  9%      +0.9        1.93 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmget
>       1.05 ±  9%      +0.9        1.93 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmget
>       1.02 ±  9%      +0.9        1.92 ±  7%  perf-profile.calltrace.cycles-pp.shmctl
>       1.00 ±  9%      +0.9        1.93 ±  9%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmdt
>       1.00 ±  8%      +0.9        1.94 ±  9%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmdt
>       1.13 ±  8%      +0.9        2.08 ±  6%  perf-profile.calltrace.cycles-pp.shmget
>       0.00            +1.0        0.97 ±  5%  perf-profile.calltrace.cycles-pp.semctl
>       1.03 ±  9%      +1.0        1.99 ±  5%  perf-profile.calltrace.cycles-pp.do_smart_update.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       1.04 ±  8%      +1.0        2.03 ±  9%  perf-profile.calltrace.cycles-pp.shmdt
>       2.61 ±  9%      +1.3        3.89 ± 14%  perf-profile.calltrace.cycles-pp.security_ipc_permission.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       1.65 ± 13%      +2.0        3.65 ±  7%  perf-profile.calltrace.cycles-pp.security_sem_semop.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       1.61 ±  6%      +5.3        6.89 ±  2%  perf-profile.calltrace.cycles-pp.idr_find.ipc_obtain_object_check.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       6.45 ±  9%      +6.2       12.63 ±  6%  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.semop
>       2.92 ±  5%      +8.0       10.88 ±  7%  perf-profile.calltrace.cycles-pp.ipc_obtain_object_check.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       9.07 ±  8%      +8.8       17.85 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.semop
>      15.56 ±  8%     +18.8       34.38 ±  6%  perf-profile.calltrace.cycles-pp.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>      17.40 ±  8%     +20.4       37.82 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>      17.99 ±  8%     +21.0       39.00 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.semop
>      33.85 ±  8%     +36.1       69.98 ±  6%  perf-profile.calltrace.cycles-pp.semop
>
>
>
>                       reaim.time.percent_of_cpu_this_job_got
>
>   1800 +-+------------------------------------------------------------------+
>        |         O                                                          |
>   1780 +-OO O O     O                                                       |
>   1760 O-+     O   O  OO                                                    |
>        |                                                                    |
>   1740 +-+                                                                  |
>        |                                                                    |
>   1720 +-+                  O                       .+.                     |
>        |                   O  O                    +   +                    |
>   1700 +-+               O                       .+    :                    |
>   1680 +-+      .+      .+.    .+ .+.+         .+       :         +.  .+.++.|
>        |.++.+.++  :  .++   ++.+  +    +.+.++.++         +.+.++.+.+  ++      |
>   1660 +-+        : +                                                       |
>        |           +                                                        |
>   1640 +-+------------------------------------------------------------------+
>
>
> [*] bisect-good sample
> [O] bisect-bad  sample
>
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> Thanks,
> Rong Chen
Alexey Dobriyan Jan. 21, 2019, 9:25 p.m. UTC | #8
On Tue, Jan 22, 2019 at 09:02:53AM +1300, Kees Cook wrote:
> On Fri, Jan 18, 2019 at 9:44 PM kernel test robot <rong.a.chen@intel.com> wrote:
> >
> > Greeting,
> >
> > FYI, we noticed a -4.8% regression of reaim.jobs_per_min due to commit:
> >
> >
> > commit: 3047027b34b8c6404b509903058b89836093acc7 ("[PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency")
> > url: https://github.com/0day-ci/linux/commits/Waiman-Long/proc-stat-Reduce-irqs-counting-performance-overhead/20190108-104818
> 
> Is this expected? (And it seems like other things in the report below
> are faster? I don't understand why this particular regression was
> called out?)

No, but the sysctl has been dropped, so the point is moot.

Patch

diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
index 819caf8..603d1b5 100644
--- a/Documentation/sysctl/fs.txt
+++ b/Documentation/sysctl/fs.txt
@@ -34,6 +34,7 @@  Currently, these files are in /proc/sys/fs:
 - overflowgid
 - pipe-user-pages-hard
 - pipe-user-pages-soft
+- proc-stat-irqs-latency-ms
 - protected_fifos
 - protected_hardlinks
 - protected_regular
@@ -184,6 +185,21 @@  applied.
 
 ==============================================================
 
+proc-stat-irqs-latency-ms:
+
+The maximum latency (in milliseconds) between the time when the IRQ counts
+in the "intr" line of /proc/stat were computed and the time when they
+are reported.
+
+The default is 0 which means the counts are computed every time
+/proc/stat is read. As computing the IRQ counts can be the most time
+consuming part of accessing /proc/stat, setting a high enough value
+will shorten the time to read it in most cases.
+
+The actual maximum latency is rounded up to the next multiple of jiffies.
+
+==============================================================
+
 protected_fifos:
 
 The intent of this protection is to avoid unintentional writes to
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 4b06f1b..52f5845 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -13,6 +13,7 @@ 
 #include <linux/irqnr.h>
 #include <linux/sched/cputime.h>
 #include <linux/tick.h>
+#include <linux/jiffies.h>
 
 #ifndef arch_irq_stat_cpu
 #define arch_irq_stat_cpu(cpu) 0
@@ -21,6 +22,12 @@ 
 #define arch_irq_stat() 0
 #endif
 
+/*
+ * Maximum latency (in ms) of the irq values reported in the "intr" line.
+ * This is converted internally to multiple of jiffies.
+ */
+unsigned int proc_stat_irqs_latency_ms;
+
 #ifdef arch_idle_time
 
 static u64 get_idle_time(int cpu)
@@ -98,7 +105,48 @@  static u64 compute_stat_irqs_sum(void)
 static void show_stat_irqs(struct seq_file *p)
 {
 	int i;
+#ifdef CONFIG_PROC_SYSCTL
+	static char *irqs_buf;		   /* Buffer for irqs values */
+	static int buflen;
+	static unsigned long last_jiffies; /* Last buffer update jiffies */
+	static DEFINE_MUTEX(irqs_mutex);
+	unsigned int latency = proc_stat_irqs_latency_ms;
+
+	if (latency) {
+		char *ptr;
+
+		latency = _msecs_to_jiffies(latency);
+
+		mutex_lock(&irqs_mutex);
+		if (irqs_buf && time_before(jiffies, last_jiffies + latency))
+			goto print_out;
+
+		/*
+		 * Each irq value may require up to 11 bytes.
+		 */
+		if (!irqs_buf) {
+			irqs_buf = kmalloc(nr_irqs * 11 + 32,
+					   GFP_KERNEL | __GFP_ZERO);
+			if (!irqs_buf) {
+				mutex_unlock(&irqs_mutex);
+				goto fallback;
+			}
+		}
 
+		ptr = irqs_buf;
+		ptr += sprintf(ptr, "intr %llu", compute_stat_irqs_sum());
+		for_each_irq_nr(i)
+			ptr += sprintf(ptr, " %u", kstat_irqs_usr(i));
+		*ptr++ = '\n';
+		buflen = ptr - irqs_buf;
+		last_jiffies = jiffies;
+print_out:
+		seq_write(p, irqs_buf, buflen);
+		mutex_unlock(&irqs_mutex);
+		return;
+	}
+fallback:
+#endif
 	seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
 	for_each_irq_nr(i)
 		seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1825f71..07010c9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -114,6 +114,9 @@ 
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+#ifdef CONFIG_PROC_FS
+extern unsigned int proc_stat_irqs_latency_ms;
+#endif
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -1890,6 +1893,15 @@  static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+#ifdef CONFIG_PROC_FS
+	{
+		.procname	= "proc-stat-irqs-latency-ms",
+		.data		= &proc_stat_irqs_latency_ms,
+		.maxlen		= sizeof(proc_stat_irqs_latency_ms),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec,
+	},
+#endif
 	{ }
 };