[v6,5/9] x86/sysctl: Add sysctl for ITMT scheduling feature

Message ID b3f648c2c4cd36b6a043239bee8437a2060c0ac4.1477000078.git.tim.c.chen@linux.intel.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Tim Chen Oct. 20, 2016, 9:59 p.m. UTC
The Intel Turbo Boost Max Technology 3.0 (ITMT)
feature allows some cores to be boosted to a
higher turbo frequency than others.

Add /proc/sys/kernel/sched_itmt_enabled so the
operator can enable/disable scheduling of tasks
that favor cores with higher turbo boost frequency
potential.

By default, this feature is turned on for systems
that are ITMT capable and single socket, since
such systems are more likely to be lightly loaded
and to operate in the turbo range.

When a change in the desired ITMT scheduling
operation occurs, a rebuild of the sched domains
is initiated so the scheduler can set them up with
the appropriate flag to enable/disable ITMT
scheduling.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 arch/x86/include/asm/topology.h |   7 ++-
 arch/x86/kernel/itmt.c          | 110 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 114 insertions(+), 3 deletions(-)
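For readers following along, the knob this patch adds can be exercised as sketched below. This is a hedged usage example, not part of the patch: it requires root, a kernel carrying this patch, and ITMT-capable hardware; on anything else the file simply does not exist.

```shell
KNOB=/proc/sys/kernel/sched_itmt_enabled
if [ -f "$KNOB" ]; then
    cat "$KNOB"                  # current state: 0 (disabled) or 1 (enabled)
    echo 0 | sudo tee "$KNOB"    # disable ITMT scheduling; the handler then
                                 # triggers a sched domain rebuild
else
    echo "sched_itmt_enabled not present on this kernel"
fi
```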

Comments

Thomas Gleixner Oct. 26, 2016, 10:49 a.m. UTC | #1
On Thu, 20 Oct 2016, Tim Chen wrote:
> +static int sched_itmt_update_handler(struct ctl_table *table, int write,
> +			      void __user *buffer, size_t *lenp, loff_t *ppos)

Please align the arguments properly

static int
sched_itmt_update_handler(struct ctl_table *table, int write,
			  void __user *buffer, size_t *lenp, loff_t *ppos)

> +{
> +	int ret;
> +	unsigned int old_sysctl;

	unsigned int old_sysctl;
	int ret;

Please. It's way simpler to read.

> -void sched_set_itmt_support(void)
> +int sched_set_itmt_support(void)
>  {
>  	mutex_lock(&itmt_update_mutex);
>  
> +	if (sched_itmt_capable) {
> +		mutex_unlock(&itmt_update_mutex);
> +		return 0;
> +	}
> +
> +	itmt_sysctl_header = register_sysctl_table(itmt_root_table);
> +	if (!itmt_sysctl_header) {
> +		mutex_unlock(&itmt_update_mutex);
> +		return -ENOMEM;
> +	}
> +
>  	sched_itmt_capable = true;
>  
> +	/*
> +	 * ITMT capability automatically enables ITMT
> +	 * scheduling for small systems (single node).
> +	 */
> +	if (topology_num_packages() == 1)
> +		sysctl_sched_itmt_enabled = 1;

I really hate this. This is policy and the kernel should not impose
policy. Why would I like to have this enforced on my single socket XEON
server?

> +	if (sysctl_sched_itmt_enabled) {

Why would sysctl_sched_itmt_enabled be true at this point, aside from the
above policy imposition?

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner Oct. 26, 2016, 10:52 a.m. UTC | #2
On Thu, 20 Oct 2016, Tim Chen wrote:
>  
> +	if (itmt_sysctl_header)
> +		unregister_sysctl_table(itmt_sysctl_header);

What sets itmt_sysctl_header to NULL?

Thanks,

	tglx
Thomas Gleixner Oct. 26, 2016, 11:24 a.m. UTC | #3
On Wed, 26 Oct 2016, Peter Zijlstra wrote:
> On Wed, Oct 26, 2016 at 12:49:36PM +0200, Thomas Gleixner wrote:
> 
> > > +	/*
> > > +	 * ITMT capability automatically enables ITMT
> > > +	 * scheduling for small systems (single node).
> > > +	 */
> > > +	if (topology_num_packages() == 1)
> > > +		sysctl_sched_itmt_enabled = 1;
> > 
> > I really hate this. This is policy and the kernel should not impose
> > policy. Why would I like to have this enforced on my single socket XEON
> > server?
> 
> So this really wants to be enabled by default; otherwise nobody will use
> this, and it really does help single threaded workloads.

Fair enough. Then this wants to be documented.
 
> There were reservations on the multi-socket case of ITMT, maybe it would
> help to spell those out in great detail here. That is, have the comment
> explain the policy instead of simply stating what the code does (which
> is always bad comment policy, you can read the code just fine).

What is the objection for multi sockets? If it improves the behaviour then
why would this be a bad thing for multi sockets?

Thanks,

	tglx

Peter Zijlstra Oct. 26, 2016, 11:25 a.m. UTC | #4
On Wed, Oct 26, 2016 at 12:49:36PM +0200, Thomas Gleixner wrote:

> > +	/*
> > +	 * ITMT capability automatically enables ITMT
> > +	 * scheduling for small systems (single node).
> > +	 */
> > +	if (topology_num_packages() == 1)
> > +		sysctl_sched_itmt_enabled = 1;
> 
> I really hate this. This is policy and the kernel should not impose
> policy. Why would I like to have this enforced on my single socket XEON
> server?

So this really wants to be enabled by default; otherwise nobody will use
this, and it really does help single threaded workloads.

There were reservations on the multi-socket case of ITMT, maybe it would
help to spell those out in great detail here. That is, have the comment
explain the policy instead of simply stating what the code does (which
is always bad comment policy, you can read the code just fine).
Tim Chen Oct. 26, 2016, 5:23 p.m. UTC | #5
On Wed, 2016-10-26 at 13:24 +0200, Thomas Gleixner wrote:
> On Wed, 26 Oct 2016, Peter Zijlstra wrote:
> > 
> > On Wed, Oct 26, 2016 at 12:49:36PM +0200, Thomas Gleixner wrote:
> > 
> > > 
> > > > 
> > > > +	/*
> > > > +	 * ITMT capability automatically enables ITMT
> > > > +	 * scheduling for small systems (single node).
> > > > +	 */
> > > > +	if (topology_num_packages() == 1)
> > > > +		sysctl_sched_itmt_enabled = 1;
> > > I really hate this. This is policy and the kernel should not impose
> > > policy. Why would I like to have this enforced on my single socket XEON
> > > server?
> > So this really wants to be enabled by default; otherwise nobody will use
> > this, and it really does help single threaded workloads.
> Fair enough. Then this wants to be documented.
>  
> > 
> > There were reservations on the multi-socket case of ITMT, maybe it would
> > help to spell those out in great detail here. That is, have the comment
> > explain the policy instead of simply stating what the code does (which
> > is always bad comment policy, you can read the code just fine).
> What is the objection for multi sockets? If it improves the behaviour then
> why would this be a bad thing for multi sockets?

For multi-socket server systems, it is much more likely that multiple
cpus in a socket are busy and not running in turbo mode. So the extra
work of migrating the workload to the cpu with extra headroom will not
benefit from that headroom in this scenario.  I will update the comment
to reflect this policy.

See also our previous discussions: http://lkml.iu.edu/hypermail/linux/kernel/1609.1/03381.html

Tim


Tim Chen Oct. 26, 2016, 5:59 p.m. UTC | #6
On Wed, 2016-10-26 at 12:49 +0200, Thomas Gleixner wrote:
> On Thu, 20 Oct 2016, Tim Chen wrote:
> > 
> > +static int sched_itmt_update_handler(struct ctl_table *table, int write,
> > +			      void __user *buffer, size_t *lenp, loff_t *ppos)
> Please align the arguments properly
> 
> static int
> sched_itmt_update_handler(struct ctl_table *table, int write,
> 			  void __user *buffer, size_t *lenp, loff_t *ppos)
> 

Okay.

> > 
> > +{
> > +	int ret;
> > +	unsigned int old_sysctl;
> 	unsigned int old_sysctl;
> 	int ret;
> 
> Please. It's way simpler to read.

Sure.

> 
> > 
> > -void sched_set_itmt_support(void)
> > +int sched_set_itmt_support(void)
> >  {
> >  	mutex_lock(&itmt_update_mutex);
> >  
> > +	if (sched_itmt_capable) {
> > +		mutex_unlock(&itmt_update_mutex);
> > +		return 0;
> > +	}
> > +
> > +	itmt_sysctl_header = register_sysctl_table(itmt_root_table);
> > +	if (!itmt_sysctl_header) {
> > +		mutex_unlock(&itmt_update_mutex);
> > +		return -ENOMEM;
> > +	}
> > +
> >  	sched_itmt_capable = true;
> >  
> > +	/*
> > +	 * ITMT capability automatically enables ITMT
> > +	 * scheduling for small systems (single node).
> > +	 */
> > +	if (topology_num_packages() == 1)
> > +		sysctl_sched_itmt_enabled = 1;
> I really hate this. This is policy and the kernel should not impose
> policy. Why would I like to have this enforced on my single socket XEON
> server?
> 
> > 
> > +	if (sysctl_sched_itmt_enabled) {
> Why would sysctl_sched_itmt_enabled be true at this point, aside from the
> above policy imposition?

That's true, it will only be enabled for the above case.  I can merge
it into the if check above.


Tim
Tim Chen Oct. 26, 2016, 6:03 p.m. UTC | #7
On Wed, 2016-10-26 at 12:52 +0200, Thomas Gleixner wrote:
> On Thu, 20 Oct 2016, Tim Chen wrote:
> > 
> >  
> > +	if (itmt_sysctl_header)
> > +		unregister_sysctl_table(itmt_sysctl_header);
> What sets itmt_sysctl_header to NULL?
> 

If the registration of the itmt sysctl table has failed, it will
be NULL.

Tim
Thomas Gleixner Oct. 26, 2016, 6:09 p.m. UTC | #8
On Wed, 26 Oct 2016, Tim Chen wrote:
> On Wed, 2016-10-26 at 13:24 +0200, Thomas Gleixner wrote:
> > > There were reservations on the multi-socket case of ITMT, maybe it would
> > > help to spell those out in great detail here. That is, have the comment
> > > explain the policy instead of simply stating what the code does (which
> > > is always bad comment policy, you can read the code just fine).
> > What is the objection for multi sockets? If it improves the behaviour then
> > why would this be a bad thing for multi sockets?
> 
> For multi-socket server systems, it is much more likely that multiple
> cpus in a socket are busy and not running in turbo mode. So the extra
> work of migrating the workload to the cpu with extra headroom will not
> benefit from that headroom in this scenario.  I will update the comment
> to reflect this policy.

So on a single socket server system the extra work does not matter, right?
Don't tell me that single socket server systems are irrelevant. Intel is
actively promoting single socket CPUs, like XEON D, for high density
servers...

Instead of handwaving arguments I prefer a proper analysis of what the
overhead is and why it is not a good thing for loaded servers in general.

Then instead of slapping half-baked heuristics into the code, we should sit
down and think a bit harder about it.

Thanks,

	tglx
Thomas Gleixner Oct. 26, 2016, 6:11 p.m. UTC | #9
On Wed, 26 Oct 2016, Tim Chen wrote:

> On Wed, 2016-10-26 at 12:52 +0200, Thomas Gleixner wrote:
> > On Thu, 20 Oct 2016, Tim Chen wrote:
> > > 
> > >  
> > > +	if (itmt_sysctl_header)
> > > +		unregister_sysctl_table(itmt_sysctl_header);
> > What sets itmt_sysctl_header to NULL?
> > 
> 
> If the registration of the itmt sysctl table has failed, it will
> be NULL.

And what clears it _AFTER_ the deregistration? Nothing, AFAICT.
Tim Chen Oct. 26, 2016, 7:38 p.m. UTC | #10
On Wed, 2016-10-26 at 20:11 +0200, Thomas Gleixner wrote:
> On Wed, 26 Oct 2016, Tim Chen wrote:
> 
> > 
> > On Wed, 2016-10-26 at 12:52 +0200, Thomas Gleixner wrote:
> > > 
> > > On Thu, 20 Oct 2016, Tim Chen wrote:
> > > > 
> > > > 
> > > >  
> > > > +	if (itmt_sysctl_header)
> > > > +		unregister_sysctl_table(itmt_sysctl_header);
> > > What sets itmt_sysctl_header to NULL?
> > > 
> > If the registration of the itmt sysctl table has failed, it will
> > be NULL.
> And what clears it _AFTER_ the deregistration? Nothing, AFAICT.

Ok. I'll clear itmt_sysctl_header here.

Thanks.

Tim
Tim Chen Oct. 27, 2016, 7:32 p.m. UTC | #11
On Wed, 2016-10-26 at 20:09 +0200, Thomas Gleixner wrote:
> On Wed, 26 Oct 2016, Tim Chen wrote:
> > 
> > On Wed, 2016-10-26 at 13:24 +0200, Thomas Gleixner wrote:
> > > 
> > > > 
> > > > There were reservations on the multi-socket case of ITMT, maybe it would
> > > > help to spell those out in great detail here. That is, have the comment
> > > > explain the policy instead of simply stating what the code does (which
> > > > is always bad comment policy, you can read the code just fine).
> > > What is the objection for multi sockets? If it improves the behaviour then
> > > why would this be a bad thing for multi sockets?
> > For multi-socket server systems, it is much more likely that multiple
> > cpus in a socket are busy and not running in turbo mode. So the extra
> > work of migrating the workload to the cpu with extra headroom will not
> > benefit from that headroom in this scenario.  I will update the comment
> > to reflect this policy.
> So on a single socket server system the extra work does not matter, right?
> Don't tell me that single socket server systems are irrelevant. Intel is
> actively promoting single socket CPUs, like XEON D, for high densitiy
> servers...
> 
> Instead of handwaving arguments I prefer a proper analysis of what the
> overhead is and why it is not a good thing for loaded servers in general.
> 
> Then instead of slapping half baken heuristics into the code, we should sit
> down and think a bit harder about it.
> 

The ITMT scheduling overhead should be small.  It is mostly a small
number of cycles spent initially to idle-balance tasks towards an idle
favored core, plus the cycles to refill hot data in the mid-level cache
for the migrated task.  Those should be a very small percentage of the
cycles the task spends running on the favored core, so any extra boost
in frequency should more than compensate, making this a good trade-off.

After some internal discussions, we think we should enable the ITMT feature by 
default for all systems supporting ITMT.  I will remove the single socket
restriction.

Thanks.

Tim

Patch

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index a73fb80..46ebdd1 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -155,23 +155,26 @@  extern bool x86_topology_update;
 #include <asm/percpu.h>
 
 DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
+extern unsigned int __read_mostly sysctl_sched_itmt_enabled;
 
 /* Interface to set priority of a cpu */
 void sched_set_itmt_core_prio(int prio, int core_cpu);
 
 /* Interface to notify scheduler that system supports ITMT */
-void sched_set_itmt_support(void);
+int sched_set_itmt_support(void);
 
 /* Interface to notify scheduler that system revokes ITMT support */
 void sched_clear_itmt_support(void);
 
 #else /* CONFIG_SCHED_ITMT */
 
+#define sysctl_sched_itmt_enabled	0
 static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
 {
 }
-static inline void sched_set_itmt_support(void)
+static inline int sched_set_itmt_support(void)
 {
+	return 0;
 }
 static inline void sched_clear_itmt_support(void)
 {
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 63c9b3e..e999e6e 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -34,6 +34,67 @@  DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
 /* Boolean to track if system has ITMT capabilities */
 static bool __read_mostly sched_itmt_capable;
 
+/*
+ * Boolean to control whether we want to move processes to cpu capable
+ * of higher turbo frequency for cpus supporting Intel Turbo Boost Max
+ * Technology 3.0.
+ *
+ * It can be set via /proc/sys/kernel/sched_itmt_enabled
+ */
+unsigned int __read_mostly sysctl_sched_itmt_enabled;
+
+static int sched_itmt_update_handler(struct ctl_table *table, int write,
+			      void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	unsigned int old_sysctl;
+
+	mutex_lock(&itmt_update_mutex);
+
+	if (!sched_itmt_capable) {
+		mutex_unlock(&itmt_update_mutex);
+		return -EINVAL;
+	}
+
+	old_sysctl = sysctl_sched_itmt_enabled;
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
+		x86_topology_update = true;
+		rebuild_sched_domains();
+	}
+
+	mutex_unlock(&itmt_update_mutex);
+
+	return ret;
+}
+
+static unsigned int zero;
+static unsigned int one = 1;
+static struct ctl_table itmt_kern_table[] = {
+	{
+		.procname	= "sched_itmt_enabled",
+		.data		= &sysctl_sched_itmt_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_itmt_update_handler,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{}
+};
+
+static struct ctl_table itmt_root_table[] = {
+	{
+		.procname	= "kernel",
+		.mode		= 0555,
+		.child		= itmt_kern_table,
+	},
+	{}
+};
+
+static struct ctl_table_header *itmt_sysctl_header;
+
 /**
  * sched_set_itmt_support() - Indicate platform supports ITMT
  *
@@ -45,14 +106,44 @@  static bool __read_mostly sched_itmt_capable;
  *
  * This must be done only after sched_set_itmt_core_prio
  * has been called to set the cpus' priorities.
+ * It must not be called with cpu hot plug lock
+ * held as we need to acquire the lock to rebuild sched domains
+ * later.
+ *
+ * Return: 0 on success
  */
-void sched_set_itmt_support(void)
+int sched_set_itmt_support(void)
 {
 	mutex_lock(&itmt_update_mutex);
 
+	if (sched_itmt_capable) {
+		mutex_unlock(&itmt_update_mutex);
+		return 0;
+	}
+
+	itmt_sysctl_header = register_sysctl_table(itmt_root_table);
+	if (!itmt_sysctl_header) {
+		mutex_unlock(&itmt_update_mutex);
+		return -ENOMEM;
+	}
+
 	sched_itmt_capable = true;
 
+	/*
+	 * ITMT capability automatically enables ITMT
+	 * scheduling for small systems (single node).
+	 */
+	if (topology_num_packages() == 1)
+		sysctl_sched_itmt_enabled = 1;
+
+	if (sysctl_sched_itmt_enabled) {
+		x86_topology_update = true;
+		rebuild_sched_domains();
+	}
+
 	mutex_unlock(&itmt_update_mutex);
+
+	return 0;
 }
 
 /**
@@ -61,13 +152,30 @@  void sched_set_itmt_support(void)
  * This function is used by the OS to indicate that it has
  * revoked the platform's support of ITMT feature.
  *
+ * It must not be called with cpu hot plug lock
+ * held as we need to acquire the lock to rebuild sched domains
+ * later.
  */
 void sched_clear_itmt_support(void)
 {
 	mutex_lock(&itmt_update_mutex);
 
+	if (!sched_itmt_capable) {
+		mutex_unlock(&itmt_update_mutex);
+		return;
+	}
 	sched_itmt_capable = false;
 
+	if (itmt_sysctl_header)
+		unregister_sysctl_table(itmt_sysctl_header);
+
+	if (sysctl_sched_itmt_enabled) {
+		/* disable sched_itmt if we are no longer ITMT capable */
+		sysctl_sched_itmt_enabled = 0;
+		x86_topology_update = true;
+		rebuild_sched_domains();
+	}
+
 	mutex_unlock(&itmt_update_mutex);
 }