Message ID | 150307948527.29525.18126889783757078160.stgit@Solace.fritz.box
---|---
State | New, archived
On 08/18/2017 07:04 PM, Dario Faggioli wrote:
> Idea is: the more CPUs are still active in a grace period,
> the more we can wait to check whether it's time to invoke
> the callbacks (on those CPUs that have already quiesced,
> are idle, and have callbacks queued).
>
> What we're trying to avoid is one of those idle CPUs waking
> up, only to discover that the grace period is still
> running, and that it hence could have slept longer
> (saving more power).

But this will only happen:
1. If the cpu in question had just been running
2. Until the grace period is actually finished, at which point it will
   stop entirely again (since idle vcpus are filtered out now).

So I think we're only talking about one or two extra wakeups per cpu
maximum -- if this even happens at all.

Wouldn't it be better to first add a performance counter, and check to
see if and how often this situation happens?

> This patch implements a heuristic aimed at achieving
> that, at the price of having to call cpumask_weight() on
> the 'entering idle' path, on CPUs with queued callbacks.

The heuristic seems a bit strange to me too: why would each cpu increase
the grace period in a linear fashion? I haven't looked at the grace
period code, but I would have expected each one to be independent; and
so there'd be a "diminishing returns" effect when adding more cpus.

If we have to have something like this (which I'm not at all convinced
we do), I would think having a single number which self-adjusted so that
the number of 'misses' was around 1% would be a good balance. What about
a heuristic like this:

1. If the timer goes off and the grace period isn't over, add 10ms to
   the timer period
2. If the timer goes off and the grace period *is* over, subtract 100us
   from the timer period

The above should self-adjust so that around 1% of timers miss (a rough
sketch of this rule follows below).

Thoughts?

 -George
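As a rough sketch of that rule (illustration only: idle_timer_period and
rcu_idle_timer_adjust() are made-up names, not existing rcupdate.c code;
s_time_t, MILLISECS() and MICROSECS() are Xen's usual time helpers):

    /* Hypothetical self-adjusting period for the RCU idle timer. */
    static s_time_t idle_timer_period = MILLISECS(10);

    /*
     * Hypothetical hook, called when the idle timer fires, after checking
     * whether the grace period the CPU was waiting for has completed.
     */
    static void rcu_idle_timer_adjust(bool grace_period_over)
    {
        if ( grace_period_over )
            /* We slept long enough: shrink the period a little. */
            idle_timer_period -= MICROSECS(100);
        else
            /* A 'miss': we woke up for nothing, so back off more strongly. */
            idle_timer_period += MILLISECS(10);
    }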
On Tue, 2017-08-29 at 17:30 +0100, George Dunlap wrote:
> On 08/18/2017 07:04 PM, Dario Faggioli wrote:
> >
> > What we're trying to avoid is one of those idle CPUs waking
> > up, only to discover that the grace period is still
> > running, and that it hence could have slept longer
> > (saving more power).
>
> So I think we're only talking about one or two extra wakeups per cpu
> maximum -- if this even happens at all.
>
Yep, indeed.

> Wouldn't it be better to first add a performance counter, and check to
> see if and how often this situation happens?
>
The counter is there already. It's rcu_idle_timer ("RCU: idle_timer").

> > This patch implements a heuristic aimed at achieving
> > that, at the price of having to call cpumask_weight() on
> > the 'entering idle' path, on CPUs with queued callbacks.
>
> The heuristic seems a bit strange to me too: why would each cpu increase
> the grace period in a linear fashion? I haven't looked at the grace
> period code, but I would have expected each one to be independent; and
> so there'd be a "diminishing returns" effect when adding more cpus.
>
I like the idea, in general. Let me just double check whether I'm
understanding what you're suggesting properly.

First of all, what do you mean by "adding more cpus"? Adding to what?
The timer is useful when a CPU which is participating in the grace
period goes idle while the grace period is not finished. From that
point on, the number of CPUs participating in _that_ grace period will
monotonically decrease, until it reaches zero. So what does 'adding'
mean in this context?

> If we have to have something like this (which I'm not at all convinced
> we do), I would think having a single number which self-adjusted so that
> the number of 'misses' was around 1% would be a good balance. What about
> a heuristic like this:
>
> 1. If the timer goes off and the grace period isn't over, add 10ms to
>    the timer period
> 2. If the timer goes off and the grace period *is* over, subtract 100us
>    from the timer period
>
So, let's say we start with a period of 1ms. The first time RCU is used,
the timer fires twice: the first time it finds the grace period is still
ongoing --and hence the period becomes 11ms-- while the second finds it
over --so the period is now 10.9ms.

Next time RCU is used, if the timer is necessary, we use 10.9ms.

Am I getting your proposal right?

If yes, do we allow the period to become smaller than the initial value
(1ms, in the example above)? I'd say we better not (or that we at least
set a lower bound), or, given enough occurrences in a row of the timer
firing and finding the grace period to be over, the period can become 0!

Regards,
Dario
On 09/06/2017 05:51 PM, Dario Faggioli wrote:
> On Tue, 2017-08-29 at 17:30 +0100, George Dunlap wrote:
>> On 08/18/2017 07:04 PM, Dario Faggioli wrote:
>>>
>>> What we're trying to avoid is one of those idle CPUs waking
>>> up, only to discover that the grace period is still
>>> running, and that it hence could have slept longer
>>> (saving more power).
>>
>> So I think we're only talking about one or two extra wakeups per cpu
>> maximum -- if this even happens at all.
>>
> Yep, indeed.
>
>> Wouldn't it be better to first add a performance counter, and check to
>> see if and how often this situation happens?
>>
> The counter is there already. It's rcu_idle_timer ("RCU: idle_timer").

And do you (or maybe Julien or Stefano) have an idea how often this
happens with a fixed timer like patch 5?

>>> This patch implements a heuristic aimed at achieving
>>> that, at the price of having to call cpumask_weight() on
>>> the 'entering idle' path, on CPUs with queued callbacks.
>>
>> The heuristic seems a bit strange to me too: why would each cpu increase
>> the grace period in a linear fashion? I haven't looked at the grace
>> period code, but I would have expected each one to be independent; and
>> so there'd be a "diminishing returns" effect when adding more cpus.
>>
> I like the idea, in general. Let me just double check whether I'm
> understanding what you're suggesting properly.
>
> First of all, what do you mean by "adding more cpus"? Adding to what?
> The timer is useful when a CPU which is participating in the grace
> period goes idle while the grace period is not finished. From that
> point on, the number of CPUs participating in _that_ grace period will
> monotonically decrease, until it reaches zero. So what does 'adding'
> mean in this context?

'Adding' means 'adding more pcpus to the rcu operation'.

The formula in this patch for the wait time was K * N, where N is the
number of cpus involved in the operation. Thus, the calculated wait time
will scale linearly with the number of cpus: an RCU operation with 2
pcpus will be 2K, while an operation with 4 cpus will be 4K.

This implies that the *time you think it will take to complete an RCU
transaction* will scale linearly with the number of cpus -- that, on
average, an operation with 4 cpus will take twice as long for all cpus
to complete their grace periods as an operation with 2 cpus, and an
operation with 6 cpus will take three times as long as an operation
with 2 cpus.

This seems unlikely to me: it seems to me like, statistically, the
amount of time taken will converge to some limit as the number of pcpus
involved in the operation grows.

>> If we have to have something like this (which I'm not at all convinced
>> we do), I would think having a single number which self-adjusted so that
>> the number of 'misses' was around 1% would be a good balance. What about
>> a heuristic like this:
>>
>> 1. If the timer goes off and the grace period isn't over, add 10ms to
>>    the timer period
>> 2. If the timer goes off and the grace period *is* over, subtract 100us
>>    from the timer period
>>
> So, let's say we start with a period of 1ms. The first time RCU is used,
> the timer fires twice: the first time it finds the grace period is still
> ongoing --and hence the period becomes 11ms-- while the second finds it
> over --so the period is now 10.9ms.
>
> Next time RCU is used, if the timer is necessary, we use 10.9ms.
>
> Am I getting your proposal right?
>
> If yes, do we allow the period to become smaller than the initial value
> (1ms, in the example above)? I'd say we better not (or that we at least
> set a lower bound), or, given enough occurrences in a row of the timer
> firing and finding the grace period to be over, the period can become 0!

Yes, just about -- I was thinking of just starting at 10ms rather than
at 1ms, but everything else is pretty accurate.

And yes, of course we probably want an upper and lower bound on the
timeout.

 -George
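Continuing the earlier sketch, those bounds would simply clamp the
(hypothetical) idle_timer_period after each adjustment; the 1ms/100ms
values below are placeholders, not figures agreed in this thread:

    /* Keep the period from collapsing to 0 or growing without limit. */
    if ( idle_timer_period < MILLISECS(1) )
        idle_timer_period = MILLISECS(1);
    else if ( idle_timer_period > MILLISECS(100) )
        idle_timer_period = MILLISECS(100);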
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 871936f..5a3cb1d 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -110,10 +110,17 @@ struct rcu_data {
  * About how far in the future the timer should be programmed each time,
  * it's hard to tell (guess!!). Since this mimics Linux's periodic timer
  * tick, take values used there as an indication. In Linux 2.6.21, tick
- * period can be 10ms, 4ms, 3.33ms or 1ms. Let's use 10ms, to enable
- * at least some power saving on the CPU that is going idle.
+ * period can be 10ms, 4ms, 3.33ms or 1ms.
+ *
+ * That being said, we can assume that, the more CPUs are still active in
+ * the current grace period, the longer it will take for it to come to its
+ * end. We wait 10ms for each active CPU, as minimizing the wakeups enables
+ * more effective power saving, on the CPU that has gone idle. But we also
+ * never wait more than 100ms, to avoid delaying recognising the end of a
+ * grace period (and the invocation of the callbacks) by too much.
  */
-#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)
+#define RCU_IDLE_TIMER_CPU_DELAY MILLISECS(10)
+#define RCU_IDLE_TIMER_PERIOD_MAX MILLISECS(100)

 static DEFINE_PER_CPU(struct rcu_data, rcu_data);

@@ -444,6 +451,7 @@ int rcu_needs_cpu(int cpu)
 void rcu_idle_timer_start()
 {
     struct rcu_data *rdp = &this_cpu(rcu_data);
+    s_time_t next;

     /*
      * Note that we don't check rcu_pending() here. In fact, we don't want
@@ -453,7 +461,9 @@ void rcu_idle_timer_start()
     if (likely(!rdp->curlist))
         return;

-    set_timer(&rdp->idle_timer, NOW() + RCU_IDLE_TIMER_PERIOD);
+    next = min_t(s_time_t, RCU_IDLE_TIMER_PERIOD_MAX,
+                 cpumask_weight(&rcu_ctrlblk.cpumask) * RCU_IDLE_TIMER_CPU_DELAY);
+    set_timer(&rdp->idle_timer, NOW() + next);
     rdp->idle_timer_active = true;
 }

Idea is: the more CPUs are still active in a grace period,
the more we can wait to check whether it's time to invoke
the callbacks (on those CPUs that have already quiesced,
are idle, and have callbacks queued).

What we're trying to avoid is one of those idle CPUs waking
up, only to discover that the grace period is still
running, and that it hence could have slept longer
(saving more power).

This patch implements a heuristic aimed at achieving
that, at the price of having to call cpumask_weight() on
the 'entering idle' path, on CPUs with queued callbacks.

Of course, at the same time, we don't want to delay
recognising that we can invoke the callbacks by too much,
so we also set a maximum.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
---
 xen/common/rcupdate.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
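For reference, the delay programmed by this patch grows linearly with the
number of CPUs still active in the grace period and saturates at the
maximum. A standalone sketch of that computation (rcu_idle_timer_delay()
is a made-up name; the two constants are the ones introduced above):

    /* 10ms per CPU still active in the grace period, capped at 100ms. */
    static s_time_t rcu_idle_timer_delay(unsigned int active_cpus)
    {
        s_time_t next = active_cpus * RCU_IDLE_TIMER_CPU_DELAY;

        return next > RCU_IDLE_TIMER_PERIOD_MAX ? RCU_IDLE_TIMER_PERIOD_MAX
                                                : next;
    }

    /* E.g.: 1 active CPU -> 10ms, 4 CPUs -> 40ms, 16 CPUs -> capped at 100ms. */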