[V2,1/2] genirq: Avoid summation loops for /proc/stat

Message ID 20190208135020.925487496@linutronix.de (mailing list archive)
State New, archived
Series genirq, proc: Speedup /proc/stat interrupt statistics

Commit Message

Thomas Gleixner Feb. 8, 2019, 1:48 p.m. UTC
Waiman reported that on large systems with a large number of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem, but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.

The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.

This can be largely avoided for interrupts which are not marked as
'PER_CPU' interrupts by simply adding a per interrupt summation counter
which is incremented along with the per interrupt per cpu counter.

The PER_CPU interrupts need to avoid that and use only per cpu accounting
because they share the interrupt number and the interrupt descriptor and
concurrent updates would conflict or require unwanted synchronization.

Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

8<-------------

v2: Undo the unintentional layout change of struct irq_desc.

 include/linux/irqdesc.h |    1 +
 kernel/irq/chip.c       |   12 ++++++++++--
 kernel/irq/internals.h  |    8 +++++++-
 kernel/irq/irqdesc.c    |    7 ++++++-
 4 files changed, 24 insertions(+), 4 deletions(-)
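
For orientation, here is a minimal userspace-style C sketch of the idea
described above: one summation counter bumped alongside the per-CPU counter,
so the reader no longer loops over all CPUs. The names (irq_stat, account_irq,
read_total) and the NR_CPUS value are made up for illustration; the real
change is the patch at the bottom of this page.

#include <stdint.h>

#define NR_CPUS	4	/* illustrative only */

struct irq_stat {
	uint32_t per_cpu[NR_CPUS];	/* what /proc/stat had to sum before */
	uint32_t tot_count;		/* running total for non-PER_CPU irqs */
};

/*
 * Handler path: one extra increment. No extra synchronization is needed
 * because a non-PER_CPU interrupt is not handled concurrently on several
 * CPUs for the same descriptor; PER_CPU interrupts skip tot_count entirely.
 */
static void account_irq(struct irq_stat *s, int cpu)
{
	s->per_cpu[cpu]++;
	s->tot_count++;
}

/* Reader path: O(1) instead of a loop over all possible CPUs. */
static uint32_t read_total(const struct irq_stat *s)
{
	return s->tot_count;
}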

Comments

Andrew Morton Feb. 8, 2019, 10:32 p.m. UTC | #1
On Fri, 08 Feb 2019 14:48:03 +0100 Thomas Gleixner <tglx@linutronix.de> wrote:

> Waiman reported that on large systems with a large amount of interrupts the
> readout of /proc/stat takes a long time to sum up the interrupt
> statistics. In principle this is not a problem. but for unknown reasons
> some enterprise quality software reads /proc/stat with a high frequency.
> 
> The reason for this is that interrupt statistics are accounted per cpu. So
> the /proc/stat logic has to sum up the interrupt stats for each interrupt.
> 
> This can be largely avoided for interrupts which are not marked as
> 'PER_CPU' interrupts by simply adding a per interrupt summation counter
> which is incremented along with the per interrupt per cpu counter.
> 
> The PER_CPU interrupts need to avoid that and use only per cpu accounting
> because they share the interrupt number and the interrupt descriptor and
> concurrent updates would conflict or require unwanted synchronization.
> 
> ...
>
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -65,6 +65,7 @@ struct irq_desc {
>  	unsigned int		core_internal_state__do_not_mess_with_it;
>  	unsigned int		depth;		/* nested irq disables */
>  	unsigned int		wake_depth;	/* nested wake enables */
> +	unsigned int		tot_count;

Confused.  Isn't this going to quickly overflow?
Waiman Long Feb. 8, 2019, 10:46 p.m. UTC | #2
On 02/08/2019 05:32 PM, Andrew Morton wrote:
> On Fri, 08 Feb 2019 14:48:03 +0100 Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> Waiman reported that on large systems with a large amount of interrupts the
>> readout of /proc/stat takes a long time to sum up the interrupt
>> statistics. In principle this is not a problem. but for unknown reasons
>> some enterprise quality software reads /proc/stat with a high frequency.
>>
>> The reason for this is that interrupt statistics are accounted per cpu. So
>> the /proc/stat logic has to sum up the interrupt stats for each interrupt.
>>
>> This can be largely avoided for interrupts which are not marked as
>> 'PER_CPU' interrupts by simply adding a per interrupt summation counter
>> which is incremented along with the per interrupt per cpu counter.
>>
>> The PER_CPU interrupts need to avoid that and use only per cpu accounting
>> because they share the interrupt number and the interrupt descriptor and
>> concurrent updates would conflict or require unwanted synchronization.
>>
>> ...
>>
>> --- a/include/linux/irqdesc.h
>> +++ b/include/linux/irqdesc.h
>> @@ -65,6 +65,7 @@ struct irq_desc {
>>  	unsigned int		core_internal_state__do_not_mess_with_it;
>>  	unsigned int		depth;		/* nested irq disables */
>>  	unsigned int		wake_depth;	/* nested wake enables */
>> +	unsigned int		tot_count;
> Confused.  Isn't this going to quickly overflow?
>
>
All the current irq count computations for each individual irq are
using the unsigned int type. Only the sum of all the irqs is u64. Yes, it is
possible for an individual irq count to exceed 32 bits given sufficient
uptime.  My PC has an uptime of 36 days and the highest irq count value
is 79,227,699. Given the current rate, the overflow will happen after
about 5 years. A larger server system may see an overflow in a much
shorter period. So maybe we should consider changing all the irq counts
to unsigned long then.

Cheers,
Longman
Andrew Morton Feb. 8, 2019, 11:21 p.m. UTC | #3
On Fri, 8 Feb 2019 17:46:39 -0500 Waiman Long <longman@redhat.com> wrote:

> On 02/08/2019 05:32 PM, Andrew Morton wrote:
> > On Fri, 08 Feb 2019 14:48:03 +0100 Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> >> Waiman reported that on large systems with a large amount of interrupts the
> >> readout of /proc/stat takes a long time to sum up the interrupt
> >> statistics. In principle this is not a problem. but for unknown reasons
> >> some enterprise quality software reads /proc/stat with a high frequency.
> >>
> >> The reason for this is that interrupt statistics are accounted per cpu. So
> >> the /proc/stat logic has to sum up the interrupt stats for each interrupt.
> >>
> >> This can be largely avoided for interrupts which are not marked as
> >> 'PER_CPU' interrupts by simply adding a per interrupt summation counter
> >> which is incremented along with the per interrupt per cpu counter.
> >>
> >> The PER_CPU interrupts need to avoid that and use only per cpu accounting
> >> because they share the interrupt number and the interrupt descriptor and
> >> concurrent updates would conflict or require unwanted synchronization.
> >>
> >> ...
> >>
> >> --- a/include/linux/irqdesc.h
> >> +++ b/include/linux/irqdesc.h
> >> @@ -65,6 +65,7 @@ struct irq_desc {
> >>  	unsigned int		core_internal_state__do_not_mess_with_it;
> >>  	unsigned int		depth;		/* nested irq disables */
> >>  	unsigned int		wake_depth;	/* nested wake enables */
> >> +	unsigned int		tot_count;
> > Confused.  Isn't this going to quickly overflow?
> >
> >
> All the current irq count computations for each individual irqs are
> using unsigned int type. Only the sum of all the irqs is u64. Yes, it is
> possible for an individual irq count to exceed 32 bits given sufficient
> uptime.  My PC has an uptime of 36 days and the highest irq count value
> is 79,227,699. Given the current rate, the overflow will happen after
> about 5 years. A larger server system may have an overflow in much
> shorter period. So maybe we should consider changing all the irq counts
> to unsigned long then.

It sounds like it.  A 10khz interrupt will overflow in 4 days...
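
For reference, the back-of-the-envelope wrap times implied by the numbers in
this thread can be reproduced with a few lines of standalone C (illustration
only, not kernel code): roughly five days at 10 kHz, and a bit over five
years at the rate Waiman observed on his desktop.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/*
	 * Rates taken from the thread: Waiman's desktop (79,227,699
	 * interrupts over 36 days of uptime) and Andrew's 10 kHz example.
	 */
	double rates[] = { 79227699.0 / (36.0 * 86400.0), 10000.0 };
	int i;

	for (i = 0; i < 2; i++) {
		double secs = ((double)UINT32_MAX + 1.0) / rates[i];

		printf("%8.1f irq/s -> 32-bit counter wraps after ~%.1f days\n",
		       rates[i], secs / 86400.0);
	}
	return 0;
}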
Matthew Wilcox Feb. 9, 2019, 3:41 a.m. UTC | #4
On Fri, Feb 08, 2019 at 03:21:51PM -0800, Andrew Morton wrote:
> It sounds like it.  A 10khz interrupt will overflow in 4 days...

If you've got a 10kHz interrupt, you have a bigger problem.  Anything
happening 10,000 times a second is going to need interrupt mitigation
to perform acceptably.

More importantly, userspace can (and must) cope with wrapping.  This isn't
anything new from Thomas' patch.  As long as userspace is polling more
often than once a day, it's going to see a wrapped value before it wraps
again, so it won't miss 4 billion interrupts.
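
A minimal sketch of the wrap handling described above, assuming the reader
samples a plain 32-bit counter; the function name is made up for
illustration and not taken from any real tool:

#include <stdint.h>

/*
 * With unsigned 32-bit arithmetic the difference between two samples is
 * correct modulo 2^32, so a reader that polls more often than the wrap
 * period never loses counts, even across a wrap.
 */
static uint32_t irq_delta(uint32_t prev, uint32_t curr)
{
	/* e.g. prev = 0xfffffff0, curr = 0x00000010 -> delta = 0x20 */
	return curr - prev;
}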
David Laight Feb. 13, 2019, 3:55 p.m. UTC | #5
From: Matthew Wilcox
> Sent: 09 February 2019 03:41
> 
> On Fri, Feb 08, 2019 at 03:21:51PM -0800, Andrew Morton wrote:
> > It sounds like it.  A 10khz interrupt will overflow in 4 days...
> 
> If you've got a 10kHz interrupt, you have a bigger problem.  Anything
> happening 10,000 times a second is going to need interrupt mitigation
> to perform acceptably.

Not necessarily - you may want the immediate interrupt for each
received ethernet packet.

> More importantly, userspace can (and must) cope with wrapping.  This isn't
> anything new from Thomas' patch.  As long as userspace is polling more
> often than once a day, it's going to see a wrapped value before it wraps
> again, so it won't miss 4 billion interrupts.

If userspace is expected to detect wraps, making the sum 64-bit is
pointless, confusing and stupid.
The code would have to mask off the high bits before determining
that the value has decreased and then adding in 2^32.

	David
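
To make the objection concrete, this is roughly the masking David describes
for a value exported as 64-bit whose low 32 bits are what actually wrap
(hypothetical reader-side code, not taken from any real tool; compare with
the plain 32-bit delta sketch above, which needs none of this):

#include <stdint.h>

/*
 * Mask off the high bits, detect the apparent decrease, then add back
 * 2^32 on the assumption that exactly one wrap occurred since the
 * previous sample.
 */
static uint64_t fixup_wrapped_sum(uint64_t prev, uint64_t curr)
{
	if ((uint32_t)curr < (uint32_t)prev)
		curr += UINT64_C(1) << 32;
	return curr;
}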


Patch

--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -65,6 +65,7 @@  struct irq_desc {
 	unsigned int		core_internal_state__do_not_mess_with_it;
 	unsigned int		depth;		/* nested irq disables */
 	unsigned int		wake_depth;	/* nested wake enables */
+	unsigned int		tot_count;
 	unsigned int		irq_count;	/* For detecting broken IRQs */
 	unsigned long		last_unhandled;	/* Aging timer for unhandled count */
 	unsigned int		irqs_unhandled;
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -855,7 +855,11 @@  void handle_percpu_irq(struct irq_desc *
 {
 	struct irq_chip *chip = irq_desc_get_chip(desc);
 
-	kstat_incr_irqs_this_cpu(desc);
+	/*
+	 * PER CPU interrupts are not serialized. Do not touch
+	 * desc->tot_count.
+	 */
+	__kstat_incr_irqs_this_cpu(desc);
 
 	if (chip->irq_ack)
 		chip->irq_ack(&desc->irq_data);
@@ -884,7 +888,11 @@  void handle_percpu_devid_irq(struct irq_
 	unsigned int irq = irq_desc_get_irq(desc);
 	irqreturn_t res;
 
-	kstat_incr_irqs_this_cpu(desc);
+	/*
+	 * PER CPU interrupts are not serialized. Do not touch
+	 * desc->tot_count.
+	 */
+	__kstat_incr_irqs_this_cpu(desc);
 
 	if (chip->irq_ack)
 		chip->irq_ack(&desc->irq_data);
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -242,12 +242,18 @@  static inline void irq_state_set_masked(
 
 #undef __irqd_to_state
 
-static inline void kstat_incr_irqs_this_cpu(struct irq_desc *desc)
+static inline void __kstat_incr_irqs_this_cpu(struct irq_desc *desc)
 {
 	__this_cpu_inc(*desc->kstat_irqs);
 	__this_cpu_inc(kstat.irqs_sum);
 }
 
+static inline void kstat_incr_irqs_this_cpu(struct irq_desc *desc)
+{
+	__kstat_incr_irqs_this_cpu(desc);
+	desc->tot_count++;
+}
+
 static inline int irq_desc_get_node(struct irq_desc *desc)
 {
 	return irq_common_data_get_node(&desc->irq_common_data);
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -119,6 +119,7 @@  static void desc_set_defaults(unsigned i
 	desc->depth = 1;
 	desc->irq_count = 0;
 	desc->irqs_unhandled = 0;
+	desc->tot_count = 0;
 	desc->name = NULL;
 	desc->owner = owner;
 	for_each_possible_cpu(cpu)
@@ -919,11 +920,15 @@  unsigned int kstat_irqs_cpu(unsigned int
 unsigned int kstat_irqs(unsigned int irq)
 {
 	struct irq_desc *desc = irq_to_desc(irq);
-	int cpu;
 	unsigned int sum = 0;
+	int cpu;
 
 	if (!desc || !desc->kstat_irqs)
 		return 0;
+	if (!irq_settings_is_per_cpu_devid(desc) &&
+	    !irq_settings_is_per_cpu(desc))
+	    return desc->tot_count;
+
 	for_each_possible_cpu(cpu)
 		sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
 	return sum;