diff mbox series

irq/timings: Fix model validity

Message ID 20181107094624.GB9828@hirez.programming.kicks-ass.net (mailing list archive)
State Not Applicable, archived
Headers show
Series irq/timings: Fix model validity | expand

Commit Message

Peter Zijlstra Nov. 7, 2018, 9:46 a.m. UTC
On Wed, Nov 07, 2018 at 09:59:36AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 07, 2018 at 12:39:31AM +0100, Rafael J. Wysocki wrote:

> > In general, however, I need to be convinced that interrupts that
> > didn't wake up the CPU from idle are relevant for next wakeup
> > prediction.  I see that this may be the case, but to what extent is
> > rather unclear to me and it looks like calling
> > irq_timings_next_event() would add considerable overhead.
> 
> How about we add a (debug) knob so that people can play with it for now?
> If it turns out to be useful, we'll learn.

That said; Daniel, I think there is a problem with how irqs_update()
sets irqs->valid. We seem to set valid even when we're still training.

---
Subject: irq/timings: Fix model validity

The per IRQ timing predictor will produce a 'valid' prediction even if
the model is still training. This should not happen.

Fix this by moving the actual training (online stddev algorithm) up a
bit and returning early (before predicting) when we've not yet reached
the sample threshold.

A direct concequence is that the predictor will only ever run with at
least that many samples, which means we can remove one branch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/irq/timings.c | 66 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 37 insertions(+), 29 deletions(-)

Comments

Daniel Lezcano Nov. 7, 2018, 10:52 a.m. UTC | #1
On 07/11/2018 10:46, Peter Zijlstra wrote:
> On Wed, Nov 07, 2018 at 09:59:36AM +0100, Peter Zijlstra wrote:
>> On Wed, Nov 07, 2018 at 12:39:31AM +0100, Rafael J. Wysocki wrote:
> 
>>> In general, however, I need to be convinced that interrupts that
>>> didn't wake up the CPU from idle are relevant for next wakeup
>>> prediction.  I see that this may be the case, but to what extent is
>>> rather unclear to me and it looks like calling
>>> irq_timings_next_event() would add considerable overhead.
>>
>> How about we add a (debug) knob so that people can play with it for now?
>> If it turns out to be useful, we'll learn.
> 
> That said; Daniel, I think there is a problem with how irqs_update()
> sets irqs->valid. We seem to set valid even when we're still training.

Yes, the fix seems right.

Thanks for fixing it.

  -- Daniel

> ---
> Subject: irq/timings: Fix model validity
> 
> The per IRQ timing predictor will produce a 'valid' prediction even if
> the model is still training. This should not happen.
> 
> Fix this by moving the actual training (online stddev algorithm) up a
> bit and returning early (before predicting) when we've not yet reached
> the sample threshold.
> 
> A direct concequence is that the predictor will only ever run with at
> least that many samples, which means we can remove one branch.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/irq/timings.c | 66 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 37 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/irq/timings.c b/kernel/irq/timings.c
> index 1e4cb63a5c82..5d22fd5facd5 100644
> --- a/kernel/irq/timings.c
> +++ b/kernel/irq/timings.c
> @@ -28,6 +28,13 @@ struct irqt_stat {
>  	int	valid;
>  };
>  
> +/*
> + * The rule of thumb in statistics for the normal distribution
> + * is having at least 30 samples in order to have the model to
> + * apply.
> + */
> +#define SAMPLE_THRESHOLD	30
> +
>  static DEFINE_IDR(irqt_stats);
>  
>  void irq_timings_enable(void)
> @@ -101,7 +108,6 @@ void irq_timings_disable(void)
>   * distribution appears when the number of samples is 30 (it is the
>   * rule of thumb in statistics, cf. "30 samples" on Internet). When
>   * there are three consecutive anomalies, the statistics are resetted.
> - *
>   */
>  static void irqs_update(struct irqt_stat *irqs, u64 ts)
>  {
> @@ -146,11 +152,38 @@ static void irqs_update(struct irqt_stat *irqs, u64 ts)
>  	 */
>  	diff = interval - irqs->avg;
>  
> +	/*
> +	 * Online average algorithm:
> +	 *
> +	 *  new_average = average + ((value - average) / count)
> +	 *
> +	 * The variance computation depends on the new average
> +	 * to be computed here first.
> +	 *
> +	 */
> +	irqs->avg = irqs->avg + (diff >> IRQ_TIMINGS_SHIFT);
> +
> +	/*
> +	 * Online variance algorithm:
> +	 *
> +	 *  new_variance = variance + (value - average) x (value - new_average)
> +	 *
> +	 * Warning: irqs->avg is updated with the line above, hence
> +	 * 'interval - irqs->avg' is no longer equal to 'diff'
> +	 */
> +	irqs->variance = irqs->variance + (diff * (interval - irqs->avg));
> +
>  	/*
>  	 * Increment the number of samples.
>  	 */
>  	irqs->nr_samples++;
>  
> +	/*
> +	 * If we're still training the model, we can't make any predictions yet.
> +	 */
> +	if (irqs->nr_samples < SAMPLE_THRESHOLD)
> +		return;
> +
>  	/*
>  	 * Online variance divided by the number of elements if there
>  	 * is more than one sample.  Normally the formula is division
> @@ -158,16 +191,12 @@ static void irqs_update(struct irqt_stat *irqs, u64 ts)
>  	 * more than 32 and dividing by 32 instead of 31 is enough
>  	 * precise.
>  	 */
> -	if (likely(irqs->nr_samples > 1))
> -		variance = irqs->variance >> IRQ_TIMINGS_SHIFT;
> +	variance = irqs->variance >> IRQ_TIMINGS_SHIFT;
>  
>  	/*
> -	 * The rule of thumb in statistics for the normal distribution
> -	 * is having at least 30 samples in order to have the model to
> -	 * apply. Values outside the interval are considered as an
> -	 * anomaly.
> +	 * Values outside the interval are considered as an anomaly.
>  	 */
> -	if ((irqs->nr_samples >= 30) && ((diff * diff) > (9 * variance))) {
> +	if ((diff * diff) > (9 * variance)) {
>  		/*
>  		 * After three consecutive anomalies, we reset the
>  		 * stats as it is no longer stable enough.
> @@ -191,27 +220,6 @@ static void irqs_update(struct irqt_stat *irqs, u64 ts)
>  	 */
>  	irqs->valid = 1;
>  
> -	/*
> -	 * Online average algorithm:
> -	 *
> -	 *  new_average = average + ((value - average) / count)
> -	 *
> -	 * The variance computation depends on the new average
> -	 * to be computed here first.
> -	 *
> -	 */
> -	irqs->avg = irqs->avg + (diff >> IRQ_TIMINGS_SHIFT);
> -
> -	/*
> -	 * Online variance algorithm:
> -	 *
> -	 *  new_variance = variance + (value - average) x (value - new_average)
> -	 *
> -	 * Warning: irqs->avg is updated with the line above, hence
> -	 * 'interval - irqs->avg' is no longer equal to 'diff'
> -	 */
> -	irqs->variance = irqs->variance + (diff * (interval - irqs->avg));
> -
>  	/*
>  	 * Update the next event
>  	 */
>
Peter Zijlstra Nov. 7, 2018, 1:05 p.m. UTC | #2
On Wed, Nov 07, 2018 at 11:52:31AM +0100, Daniel Lezcano wrote:
> > @@ -146,11 +152,38 @@ static void irqs_update(struct irqt_stat *irqs, u64 ts)
> >  	 */
> >  	diff = interval - irqs->avg;
> >  
> > +	/*
> > +	 * Online average algorithm:
> > +	 *
> > +	 *  new_average = average + ((value - average) / count)
> > +	 *
> > +	 * The variance computation depends on the new average
> > +	 * to be computed here first.
> > +	 *
> > +	 */
> > +	irqs->avg = irqs->avg + (diff >> IRQ_TIMINGS_SHIFT);
> > +
> > +	/*
> > +	 * Online variance algorithm:
> > +	 *
> > +	 *  new_variance = variance + (value - average) x (value - new_average)
> > +	 *
> > +	 * Warning: irqs->avg is updated with the line above, hence
> > +	 * 'interval - irqs->avg' is no longer equal to 'diff'
> > +	 */
> > +	irqs->variance = irqs->variance + (diff * (interval - irqs->avg));
> > +
> >  	/*
> >  	 * Increment the number of samples.
> >  	 */
> >  	irqs->nr_samples++;

FWIW, I'm confused on this. The normal (Welford's) online algorithm
does:

	count++;
	delta = value - mean;
	mean += delta / count;
	M2 += delta * (value - mean);

But the above uses:

	mean += delta / 32;

Which, for count >> 32, over-estimates the mean adjustment. But worse,
it significantly under-estimates the mean during training.

How is the computed variance still correct with this? I can not find any
comments that clarifies this. I'm thinking that since the mean will
slowly wander towards it's actual location (assuming an actual standard
distribution input) the resulting variance will be far too large, since
the (value - mean) term will be much larger than 'expected'.

> > @@ -158,16 +191,12 @@ static void irqs_update(struct irqt_stat *irqs, u64 ts)
> >  	 * more than 32 and dividing by 32 instead of 31 is enough
> >  	 * precise.
> >  	 */
> > +	variance = irqs->variance >> IRQ_TIMINGS_SHIFT;

Worse; variance is actually (as the comment states):

	s^2 = M2 / (count -1)

But instead you compute:

	s^2 = M2 / 32;

Which is again much larger than the actual result; assuming count >> 32.

So you compute a variance that is inflated in two different ways.


I'm not seeing how this thing works reliably.
Daniel Lezcano Nov. 8, 2018, 8:10 a.m. UTC | #3
On 07/11/2018 14:05, Peter Zijlstra wrote:
> On Wed, Nov 07, 2018 at 11:52:31AM +0100, Daniel Lezcano wrote:
>>> @@ -146,11 +152,38 @@ static void irqs_update(struct irqt_stat *irqs, u64 ts)
>>>  	 */
>>>  	diff = interval - irqs->avg;
>>>  
>>> +	/*
>>> +	 * Online average algorithm:
>>> +	 *
>>> +	 *  new_average = average + ((value - average) / count)
>>> +	 *
>>> +	 * The variance computation depends on the new average
>>> +	 * to be computed here first.
>>> +	 *
>>> +	 */
>>> +	irqs->avg = irqs->avg + (diff >> IRQ_TIMINGS_SHIFT);
>>> +
>>> +	/*
>>> +	 * Online variance algorithm:
>>> +	 *
>>> +	 *  new_variance = variance + (value - average) x (value - new_average)
>>> +	 *
>>> +	 * Warning: irqs->avg is updated with the line above, hence
>>> +	 * 'interval - irqs->avg' is no longer equal to 'diff'
>>> +	 */
>>> +	irqs->variance = irqs->variance + (diff * (interval - irqs->avg));
>>> +
>>>  	/*
>>>  	 * Increment the number of samples.
>>>  	 */
>>>  	irqs->nr_samples++;
> 
> FWIW, I'm confused on this. The normal (Welford's) online algorithm
> does:
> 
> 	count++;
> 	delta = value - mean;
> 	mean += delta / count;
> 	M2 += delta * (value - mean);
> 
> But the above uses:
> 
> 	mean += delta / 32;
> 
> Which, for count >> 32, over-estimates the mean adjustment. But worse,
> it significantly under-estimates the mean during training.
> 
> How is the computed variance still correct with this? I can not find any
> comments that clarifies this. I'm thinking that since the mean will
> slowly wander towards it's actual location (assuming an actual standard
> distribution input) the resulting variance will be far too large, since
> the (value - mean) term will be much larger than 'expected'.

You are right, initially it was divided by min(count, 32) but for
optimization reason, we decided to change that by a power of two
constant assuming the number of samples will reach quickly 32 and the
compiler will replace that by a shift.

https://lkml.org/lkml/2017/3/23/696

>>> @@ -158,16 +191,12 @@ static void irqs_update(struct irqt_stat *irqs, u64 ts)
>>>  	 * more than 32 and dividing by 32 instead of 31 is enough
>>>  	 * precise.
>>>  	 */
>>> +	variance = irqs->variance >> IRQ_TIMINGS_SHIFT;
> 
> Worse; variance is actually (as the comment states):
> 
> 	s^2 = M2 / (count -1)
> 
> But instead you compute:
> 
> 	s^2 = M2 / 32;
> 
> Which is again much larger than the actual result; assuming count >> 32.
> 
> So you compute a variance that is inflated in two different ways.
> 
> 
> I'm not seeing how this thing works reliably.

I have to revisit this part of code soon, I will double check that.
diff mbox series

Patch

diff --git a/kernel/irq/timings.c b/kernel/irq/timings.c
index 1e4cb63a5c82..5d22fd5facd5 100644
--- a/kernel/irq/timings.c
+++ b/kernel/irq/timings.c
@@ -28,6 +28,13 @@  struct irqt_stat {
 	int	valid;
 };
 
+/*
+ * The rule of thumb in statistics for the normal distribution
+ * is having at least 30 samples in order to have the model to
+ * apply.
+ */
+#define SAMPLE_THRESHOLD	30
+
 static DEFINE_IDR(irqt_stats);
 
 void irq_timings_enable(void)
@@ -101,7 +108,6 @@  void irq_timings_disable(void)
  * distribution appears when the number of samples is 30 (it is the
  * rule of thumb in statistics, cf. "30 samples" on Internet). When
  * there are three consecutive anomalies, the statistics are resetted.
- *
  */
 static void irqs_update(struct irqt_stat *irqs, u64 ts)
 {
@@ -146,11 +152,38 @@  static void irqs_update(struct irqt_stat *irqs, u64 ts)
 	 */
 	diff = interval - irqs->avg;
 
+	/*
+	 * Online average algorithm:
+	 *
+	 *  new_average = average + ((value - average) / count)
+	 *
+	 * The variance computation depends on the new average
+	 * to be computed here first.
+	 *
+	 */
+	irqs->avg = irqs->avg + (diff >> IRQ_TIMINGS_SHIFT);
+
+	/*
+	 * Online variance algorithm:
+	 *
+	 *  new_variance = variance + (value - average) x (value - new_average)
+	 *
+	 * Warning: irqs->avg is updated with the line above, hence
+	 * 'interval - irqs->avg' is no longer equal to 'diff'
+	 */
+	irqs->variance = irqs->variance + (diff * (interval - irqs->avg));
+
 	/*
 	 * Increment the number of samples.
 	 */
 	irqs->nr_samples++;
 
+	/*
+	 * If we're still training the model, we can't make any predictions yet.
+	 */
+	if (irqs->nr_samples < SAMPLE_THRESHOLD)
+		return;
+
 	/*
 	 * Online variance divided by the number of elements if there
 	 * is more than one sample.  Normally the formula is division
@@ -158,16 +191,12 @@  static void irqs_update(struct irqt_stat *irqs, u64 ts)
 	 * more than 32 and dividing by 32 instead of 31 is enough
 	 * precise.
 	 */
-	if (likely(irqs->nr_samples > 1))
-		variance = irqs->variance >> IRQ_TIMINGS_SHIFT;
+	variance = irqs->variance >> IRQ_TIMINGS_SHIFT;
 
 	/*
-	 * The rule of thumb in statistics for the normal distribution
-	 * is having at least 30 samples in order to have the model to
-	 * apply. Values outside the interval are considered as an
-	 * anomaly.
+	 * Values outside the interval are considered as an anomaly.
 	 */
-	if ((irqs->nr_samples >= 30) && ((diff * diff) > (9 * variance))) {
+	if ((diff * diff) > (9 * variance)) {
 		/*
 		 * After three consecutive anomalies, we reset the
 		 * stats as it is no longer stable enough.
@@ -191,27 +220,6 @@  static void irqs_update(struct irqt_stat *irqs, u64 ts)
 	 */
 	irqs->valid = 1;
 
-	/*
-	 * Online average algorithm:
-	 *
-	 *  new_average = average + ((value - average) / count)
-	 *
-	 * The variance computation depends on the new average
-	 * to be computed here first.
-	 *
-	 */
-	irqs->avg = irqs->avg + (diff >> IRQ_TIMINGS_SHIFT);
-
-	/*
-	 * Online variance algorithm:
-	 *
-	 *  new_variance = variance + (value - average) x (value - new_average)
-	 *
-	 * Warning: irqs->avg is updated with the line above, hence
-	 * 'interval - irqs->avg' is no longer equal to 'diff'
-	 */
-	irqs->variance = irqs->variance + (diff * (interval - irqs->avg));
-
 	/*
 	 * Update the next event
 	 */