
[03/11] sched: Extend scheduler's asym packing

Message ID 1471559812-19967-4-git-send-email-srinivas.pandruvada@linux.intel.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

srinivas pandruvada Aug. 18, 2016, 10:36 p.m. UTC
From: Tim Chen <tim.c.chen@linux.intel.com>

We generalize the scheduler's asym packing to provide an
ordering of the cpus beyond just the cpu number.  This allows
the use of the ASYM_PACKING scheduler machinery to move
loads to the preferred CPUs in a sched domain, based on a preference
defined by the sched_asym_prefer() function.

We also record the most preferred cpu in a sched group when
we build the group's capacity, for fast lookup of the preferred cpu
during load balancing.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 kernel/sched/core.c  | 18 ++++++++++++++++++
 kernel/sched/fair.c  | 25 ++++++++++++++-----------
 kernel/sched/sched.h | 17 +++++++++++++++++
 3 files changed, 49 insertions(+), 11 deletions(-)

Comments

Morten Rasmussen Aug. 25, 2016, 11:22 a.m. UTC | #1
On Thu, Aug 18, 2016 at 03:36:44PM -0700, Srinivas Pandruvada wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> 
> We generalize the scheduler's asym packing to provide an
> ordering of the cpu beyond just the cpu number.  This allows
> the use of the ASYM_PACKING scheduler machinery to move
> loads to prefered CPU in a sched domain based on a preference
> defined by sched_asym_prefer function.
> 
> We also record the most preferred cpu in a sched group when
> we build the cpu's capacity for fast lookup of preferred cpu
> during load balancing.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

[...]

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c64fc51..75e1002 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -532,6 +532,22 @@ struct dl_rq {
>  
>  #ifdef CONFIG_SMP
>  
> +#ifndef sched_asym_prefer
> +
> +/* For default ASYM_PACKING, lower numbered cpu is preferred */
> +static inline bool sched_asym_prefer(int a, int b)
> +{
> +	return a < b;
> +}
> +
> +#endif /* sched_asym_prefer */

Isn't this a very significant change in the interface between
architecture and the scheduler?

If I'm not mistaken, our current interface is quite strict when it comes
to information passed from the architecture into the scheduler. We allow
'topology' flags, but not behavioural flags, to be set by the
architecture, and the architecture can expose current and max cpu
capacities through the arch_scale_*() functions. For NUMA, we can expose
'distance' between nodes (and more?).

These are meant to describe the system topology to the scheduler, so it
can make better decisions on its own. sched_asym_prefer() is not only
affecting scheduler behaviour, it is handing off scheduling decisions to
architecture code, in essence allowing logic to be plugged into the
scheduler, although with a somewhat limited scope of impact.

Should this be seen as a sign that the architecture/scheduler interface is
up for revision, and that we will start allowing architecture code to plug
functions into the scheduler to affect scheduling behaviour?

I haven't reviewed the entire patch set in detail, but why can't the cpu
priority list be handed to the scheduler instead of moving scheduling
decisions out of the scheduler?

Isn't it possible to express the cpu 'priority' as different cpu
capacities instead? Without understanding the details of ITMT, it seems
to me that what you really have is different cpu compute capacities, and
that is what we have cpu capacity for.

Is the intention long term to change the cpu priority order on the fly?
Otherwise I don't see why you would put the logic in architecture code.

Finally, the existing callback functions from the scheduler into
architecture code are prefixed with arch_; I think this one should do
the same to make it clear that this function may be implemented by
architecture code.

Thanks,
Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Aug. 25, 2016, 11:45 a.m. UTC | #2
On Thu, Aug 25, 2016 at 12:22:52PM +0100, Morten Rasmussen wrote:
> I haven't reviewed the entire patch set in detail, but why can't the cpu
> priority list be handed to the scheduler instead of moving scheduling
> decisions out of the scheduler?

It basically does that. All that we allow here is the architecture to
override the default order of what is considered priority.

The default (as per Power7) is the naked cpu number, with lower cpu numbers
having higher priority than higher numbers.

This patch set allows the architecture to provide a less_than operator
(and through that a custom order).
Morten Rasmussen Aug. 25, 2016, 1:18 p.m. UTC | #3
On Thu, Aug 25, 2016 at 01:45:22PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 25, 2016 at 12:22:52PM +0100, Morten Rasmussen wrote:
> > I haven't reviewed the entire patch set in detail, but why can't the cpu
> > priority list be handed to the scheduler instead of moving scheduling
> > decisions out of the scheduler?
> 
> It basically does that. All that we allow here is the architecture to
> override the default order of what is considered priority.
> 
> The default (as per Power7) is naked cpu number, with lower cpu numbers
> having higher priority to higher numbers.
> 
> This patch set allows the architecture to provide a less_than operator
> (and through that a custom order).

But why not just pass the customized list into the scheduler? Seems
simpler?

A custom less_than operator opens up for potential exploitation that is
much more complicated than what is shown in this patch set. Paired up
with the utilization signals, which are now exposed outside the
scheduler through cpufreq, you can start making complex load-balancing
decisions outside the scheduler using this new interface.
Peter Zijlstra Aug. 25, 2016, 1:45 p.m. UTC | #4
On Thu, Aug 25, 2016 at 02:18:37PM +0100, Morten Rasmussen wrote:

> But why not just pass the customized list into the scheduler? Seems
> simpler?

Mostly because I didn't want to regress Power I suppose. The ITMT stuff
needs an extra load, whereas the Power stuff can use the CPU number we
already have.

Also, since we need an interface to pass in this custom list, I don't
see the distinction, you can do the same manipulation by constantly
updating the prio list.

But none of this stuff should be EXPORT'ed, so it's only available to the
core kernel, which greatly limits the potential for abuse. We can see
arch code just fine.

And if you spin a custom kernel, you can already wreck the load
balancer.
Morten Rasmussen Aug. 26, 2016, 10:39 a.m. UTC | #5
On Thu, Aug 25, 2016 at 03:45:03PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 25, 2016 at 02:18:37PM +0100, Morten Rasmussen wrote:
> 
> > But why not just pass the customized list into the scheduler? Seems
> > simpler?
> 
> Mostly because I didn't want to regress Power I suppose. The ITMT stuff
> needs an extra load, whereas the Power stuff can use the CPU number we
> already have.

The customized list wouldn't have to be mandatory. You could easily
create a default list that would match current behaviour for Power.

To pass in a custom list of priorities you could either extend struct
sched_domain_topology_level to have another function pointer that
returns the cpu priority, or introduce an arch_cpu_priority() function.
Either of them could be used in the sched_domain hierarchy to set the
sched_group priority cpu, and if you add an rq->cpu_priority, the
asymmetric packing comparison would be a simple comparison between the
rq->cpu_priority of the two cpus in question.

What is the 'extra load' needed for ITMT? Isn't it just a priority list,
or does the absolute priority value have a meaning? I only saw it used
for less_than comparison, maybe I missed it.

If you need to express the difference in compute capability, why not use
capacity?

> Also, since we need an interface to pass in this custom list, I don't
> see the distinction, you can do the same manipulation by constantly
> updating the prio list.

Sure, but the overhead of rebuilding the sched_domain hierarchy is huge
compared to just tweaking the result of the less_than operator that gets
called from the scheduler frequently. However, updating
group_priority_cpu() would require a rebuild too in this patch set.

> But not of this stuff should be EXPORT'ed, so its only available to the
> core kernel, which greatly limits the potential for abuse. We can see
> arch code just fine.

I don't see why it can't be wired up to be controlled by entities
outside arch code, e.g. cpufreq or the thermal framework, or even code
outside the kernel (firmware).

> And if you spin a custom kernel, you can already wreck the load
> balancer.

You can wreck any software where you have the source code and a compiler
:)
Peter Zijlstra Aug. 26, 2016, 12:42 p.m. UTC | #6
On Fri, Aug 26, 2016 at 11:39:46AM +0100, Morten Rasmussen wrote:
> On Thu, Aug 25, 2016 at 03:45:03PM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 25, 2016 at 02:18:37PM +0100, Morten Rasmussen wrote:
> > 
> > > But why not just pass the customized list into the scheduler? Seems
> > > simpler?
> > 
> > Mostly because I didn't want to regress Power I suppose. The ITMT stuff
> > needs an extra load, whereas the Power stuff can use the CPU number we
> > already have.
> 
> The customized list wouldn't have to be mandatory. You could easily
> create a default list that would match current behaviour for Power.

Sure, but then you have the extra load.. probably not an issue but
still.

> What is the 'extra load' needed for ITMT? Isn't it just a priority list,
> or does the absolute priority value have a meaning? I only saw it used
> for less_than comparison, maybe I missed it.

LOAD as in a memop, we need to go fetch the priority from wherever we
put it in memory, be it rq->cpu_priority or a percpu variable on its
own.

> If you need to express the difference in compute capability, why not use
> capacity?

Doesn't work; capacity is actually equal for these things.

Think of one core having more turbo range when thermals allow it. But
the moment you run multiple cores the thermal head-room dissipates and
they all end up running at more or less the same (lower) frequency.

All of this asym/prio stuff only matters when cores (Power) / packages
(Intel) are mostly idle.

On Power SMT0 can go faster than SMT7 when all other siblings are idle;
with ITMT some cores can go faster than others when the rest are idle.

I suppose we _could_ model it with a dynamic capacity value, but last
time I looked at that it made my head hurt.

> > Also, since we need an interface to pass in this custom list, I don't
> > see the distinction, you can do the same manipulation by constantly
> > updating the prio list.
> 
> Sure, but the overhead of rebuilding the sched_domain hierarchy is huge
> compared to just tweaking the result of the less_than operator that get
> called from the scheduler frequently. However, updating
> group_priority_cpu() would require a rebuild too in this patch set.

You don't actually need to rebuild the domains to change the priorities.
We only need to rebuild the domains when we add/remove SD_ASYM_PACKING.

Yes, the sched_group::asym_prefer_cpu thing is tedious, but you could
actually update that without a rebuild if one wanted.

Note that there's actually a semi useful use case for dynamically
updating the cpu priorities: core hopping.

  https://www.researchgate.net/publication/279915789_Evaluation_of_Core_Hopping_on_POWER7

Again, that's something only relevant to mostly idle packages.

> > But not of this stuff should be EXPORT'ed, so its only available to the
> > core kernel, which greatly limits the potential for abuse. We can see
> > arch code just fine.
> 
> I don't see why it can't be wired up to be controlled by entities
> outside arch code, e.g. cpufreq or the thermal framework, or even code
> outside the kernel (firmware).

I suppose an arch could do that, but then we'd see that, wouldn't we?

The firmware and kernel would need to co-ordinate where the prio value
lives, which is not something trivially done. And even if the value
lives in rq->cpu_priority, it _could_ do that.


In any case, I don't feel too strongly about this, if you want to stick
the value in rq->cpu_priority and have Power use that we can do that I
suppose.
Tim Chen Aug. 26, 2016, 5:25 p.m. UTC | #7
On Fri, 2016-08-26 at 14:42 +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 11:39:46AM +0100, Morten Rasmussen wrote:
> > 
> > On Thu, Aug 25, 2016 at 03:45:03PM +0200, Peter Zijlstra wrote:
> > > 
> > > On Thu, Aug 25, 2016 at 02:18:37PM +0100, Morten Rasmussen wrote:
> > > 
> > > > 
> > > > But why not just pass the customized list into the scheduler? Seems
> > > > simpler?
> > > Mostly because I didn't want to regress Power I suppose. The ITMT stuff
> > > needs an extra load, whereas the Power stuff can use the CPU number we
> > > already have.
> > The customized list wouldn't have to be mandatory. You could easily
> > create a default list that would match current behaviour for Power.
> Sure, but then you have the extra load.. probably not an issue but
> still.
> 
> > 
> > What is the 'extra load' needed for ITMT? Isn't it just a priority list,
> > or does the absolute priority value have a meaning? I only saw it used
> > for less_than comparison, maybe I missed it.
> LOAD as in a memop, we need to go fetch the priority from wherever we
> put it in memory, be it rq->cpu_priority or a percpu variable on its
> own.
> 
> > 
> > If you need to express the difference in compute capability, why not use
> > capacity?
> Doesn't work, capacity is actually equal with these things.
> 
> Think of one core having more turbo range when thermals allow it. But
> the moment you run multiple cores the thermal head-room dissipates and
> they all end up running at more or less the same (lower) frequency.
> 
> All of this asym/prio stuff only matters when cores (Power) / packages
> (Intel) are mostly idle.
> 
> On Power SMT0 can go faster than SMT7 when all other siblings are idle,
> with ITMT some core can go faster than other when the rest is idle.
> 
> I suppose we _could_ model it with a dynamic capacity value, but last
> time I looked at that it made my head hurt.
> 
> > 
> > > 
> > > Also, since we need an interface to pass in this custom list, I don't
> > > see the distinction, you can do the same manipulation by constantly
> > > updating the prio list.
> > Sure, but the overhead of rebuilding the sched_domain hierarchy is huge
> > compared to just tweaking the result of the less_than operator that get
> > called from the scheduler frequently. However, updating
> > group_priority_cpu() would require a rebuild too in this patch set.
> You don't actually need to rebuild the domains to change the priorities.
> We only need to rebuild the domains when we add/remove SD_ASYM_PACKING.
> 
> Yes, the sched_group::asym_prefer_cpu thing is tedious, but you could
> actually update that without a rebuild if one wanted.
> 
> Note that there's actually a semi useful use case for dynamically
> updating the cpu priorities: core hopping.
> 
>   https://www.researchgate.net/publication/279915789_Evaluation_of_Core_Hopping_on_POWER7
> 
> Again, that's something only relevant to mostly idle packages.
> 
> > 
> > > 
> > > But not of this stuff should be EXPORT'ed, so its only available to the
> > > core kernel, which greatly limits the potential for abuse. We can see
> > > arch code just fine.
> > I don't see why it can't be wired up to be controlled by entities
> > outside arch code, e.g. cpufreq or the thermal framework, or even code
> > outside the kernel (firmware).
> I suppose an arch could do that, but then we'd see that, wouldn't we?
> 
> The firmware and kernel would need to co-ordinate where the prio value
> lives, which is not something trivially done. And even if the value
> lives in rq->cpu_priority, it _could_ do that.
> 
> 
> In any case, I don't feel too strongly about this, if you want to stick
> the value in rq->cpu_priority and have Power use that we can do that I
> suppose.

This will mean increasing the size of the rq structure for PowerPC.

I guess we need some compile flag to decide if this cpu_priority field
should be in rq. Something like
CONFIG_SCHED_ITMT || ((CONFIG_PPC64 || CONFIG_PPC32) && CONFIG_SCHED_SMT)?

And I will need code for PowerPC to instantiate rq->cpu_priority at boot.

This gets somewhat ugly.

I prefer the other alternative Morten suggested of
having an arch_cpu_asym_priority() function. It is cleaner,
and does not increase the size of the rq structure.

I can define a default where a lower cpu number means higher priority:

int __weak arch_cpu_asym_priority(int cpu)
{
        return -cpu;
}

and then define it appropriately for x86 when ITMT is used.

Tim



Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 342eca9..2ca99a1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6237,7 +6237,25 @@  static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
 	WARN_ON(!sg);
 
 	do {
+		int cpu, max_cpu = -1, prev_cpu = -1;
+
 		sg->group_weight = cpumask_weight(sched_group_cpus(sg));
+
+		if (!(sd->flags & SD_ASYM_PACKING))
+			goto next;
+
+		for_each_cpu(cpu, sched_group_cpus(sg)) {
+			if (prev_cpu < 0) {
+				prev_cpu = cpu;
+				max_cpu = cpu;
+			} else {
+				if (sched_asym_prefer(cpu, max_cpu))
+					max_cpu = cpu;
+			}
+		}
+		sg->asym_prefer_cpu = max_cpu;
+
+next:
 		sg = sg->next;
 	} while (sg != sd->groups);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34..37a30d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6862,16 +6862,18 @@  static bool update_sd_pick_busiest(struct lb_env *env,
 	if (env->idle == CPU_NOT_IDLE)
 		return true;
 	/*
-	 * ASYM_PACKING needs to move all the work to the lowest
-	 * numbered CPUs in the group, therefore mark all groups
-	 * higher than ourself as busy.
+	 * ASYM_PACKING needs to move all the work to the highest
+	 * priority CPUs in the group, therefore mark all groups
+	 * of lower priority than ourself as busy.
 	 */
-	if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
+	if (sgs->sum_nr_running &&
+	    sched_asym_prefer(env->dst_cpu, group_priority_cpu(sg))) {
 		if (!sds->busiest)
 			return true;
 
-		/* Prefer to move from highest possible cpu's work */
-		if (group_first_cpu(sds->busiest) < group_first_cpu(sg))
+		/* Prefer to move from lowest priority cpu's work */
+		if (sched_asym_prefer(group_priority_cpu(sds->busiest),
+				      group_priority_cpu(sg)))
 			return true;
 	}
 
@@ -7023,8 +7025,8 @@  static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
 	if (!sds->busiest)
 		return 0;
 
-	busiest_cpu = group_first_cpu(sds->busiest);
-	if (env->dst_cpu > busiest_cpu)
+	busiest_cpu = group_priority_cpu(sds->busiest);
+	if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
 		return 0;
 
 	env->imbalance = DIV_ROUND_CLOSEST(
@@ -7365,10 +7367,11 @@  static int need_active_balance(struct lb_env *env)
 
 		/*
 		 * ASYM_PACKING needs to force migrate tasks from busy but
-		 * higher numbered CPUs in order to pack all tasks in the
-		 * lowest numbered CPUs.
+		 * lower priority CPUs in order to pack all tasks in the
+		 * highest priority CPUs.
 		 */
-		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
+		if ((sd->flags & SD_ASYM_PACKING) &&
+		    sched_asym_prefer(env->dst_cpu, env->src_cpu))
 			return 1;
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c64fc51..75e1002 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -532,6 +532,22 @@  struct dl_rq {
 
 #ifdef CONFIG_SMP
 
+#ifndef sched_asym_prefer
+
+/* For default ASYM_PACKING, lower numbered cpu is preferred */
+static inline bool sched_asym_prefer(int a, int b)
+{
+	return a < b;
+}
+
+#endif /* sched_asym_prefer */
+
+/*
+ * Return lowest numbered cpu in the group as the most preferred cpu
+ * for ASYM_PACKING for default case.
+ */
+#define group_priority_cpu(group) group->asym_prefer_cpu
+
 /*
  * We add the notion of a root-domain which will be used to define per-domain
  * variables. Each exclusive cpuset essentially defines an island domain by
@@ -884,6 +900,7 @@  struct sched_group {
 
 	unsigned int group_weight;
 	struct sched_group_capacity *sgc;
+	int asym_prefer_cpu;		/* cpu of highest priority in group */
 
 	/*
 	 * The CPUs this group covers.