
[V3,RESEND,RFC,1/2] sched: Bail out of yield_to when source and target runqueue has one task

Message ID 20130122073913.24731.65118.sendpatchset@codeblue.in.ibm.com (mailing list archive)
State New, archived

Commit Message

Raghavendra K T Jan. 22, 2013, 7:39 a.m. UTC
From: Peter Zijlstra <peterz@infradead.org>

In undercommitted scenarios, especially with large guests, the
yield_to overhead is significantly high. When the run queue length of
both source and target is one, take the opportunity to bail out and
return -ESRCH. This return value can then be exploited to come out of
the PLE handler quickly.

(History: Raghavendra initially worked on breaking out of the KVM PLE
 handler upon seeing a source run queue length of 1, but that required
 exporting the rq length. Peter came up with the elegant idea of
 returning -ESRCH from the scheduler core.)

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Raghavendra: added the check on the target vcpu's rq length. (Thanks, Avi.)
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Andrew Jones <drjones@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---

 kernel/sched/core.c |   25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)
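
For context, a rough sketch of how the -ESRCH return is meant to be
consumed on the KVM side. The companion 2/2 patch is not included in
this message, so the snippet below (one iteration of the candidate-vcpu
loop in kvm_vcpu_on_spin(), calling kvm_vcpu_yield_to()) is only an
assumed shape, with assumed identifiers, not the posted code:

	/*
	 * Illustrative only (assumed shape of the 2/2 KVM patch): one
	 * iteration of the candidate loop in kvm_vcpu_on_spin().  The
	 * identifiers (kvm_vcpu_yield_to(), last_boosted_vcpu, try) are
	 * assumptions based on virt/kvm/kvm_main.c, not the posted patch.
	 */
	yielded = kvm_vcpu_yield_to(vcpu);	/* calls yield_to() on the vcpu's task */
	if (yielded > 0) {
		kvm->last_boosted_vcpu = i;	/* boost worked: stop scanning */
		break;
	} else if (yielded < 0) {
		try--;				/* -ESRCH: give up sooner */
		if (!try)
			break;
	}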



Comments

Ingo Molnar Jan. 24, 2013, 10:32 a.m. UTC | #1
* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> In case of undercomitted scenarios, especially in large guests
> yield_to overhead is significantly high. when run queue length of
> source and target is one, take an opportunity to bail out and return
> -ESRCH. This return condition can be further exploited to quickly come
> out of PLE handler.
> 
> (History: Raghavendra initially worked on break out of kvm ple handler upon
>  seeing source runqueue length = 1, but it had to export rq length).
>  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> Acked-by: Andrew Jones <drjones@redhat.com>
> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
> 
>  kernel/sched/core.c |   25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d8927f..fc219a5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>   * It's the caller's job to ensure that the target task struct
>   * can't go away on us before we can do any checks.
>   *
> - * Returns true if we indeed boosted the target task.
> + * Returns:
> + *	true (>0) if we indeed boosted the target task.
> + *	false (0) if we failed to boost the target.
> + *	-ESRCH if there's no task to yield to.
>   */
>  bool __sched yield_to(struct task_struct *p, bool preempt)
>  {
> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  
>  again:
>  	p_rq = task_rq(p);
> +	/*
> +	 * If we're the only runnable task on the rq and target rq also
> +	 * has only one task, there's absolutely no point in yielding.
> +	 */
> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> +		yielded = -ESRCH;
> +		goto out_irq;
> +	}

Looks good to me in principle.

Would be nice to get more consistent benchmark numbers. Once 
those are unambiguously showing that this is a win:

  Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo
Raghavendra K T Jan. 25, 2013, 10:40 a.m. UTC | #2
* Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:

> 
> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> 
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > In case of undercomitted scenarios, especially in large guests
> > yield_to overhead is significantly high. when run queue length of
> > source and target is one, take an opportunity to bail out and return
> > -ESRCH. This return condition can be further exploited to quickly come
> > out of PLE handler.
> > 
> > (History: Raghavendra initially worked on break out of kvm ple handler upon
> >  seeing source runqueue length = 1, but it had to export rq length).
> >  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> > 
> > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> > Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > Acked-by: Andrew Jones <drjones@redhat.com>
> > Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> > ---
> > 
> >  kernel/sched/core.c |   25 +++++++++++++++++++------
> >  1 file changed, 19 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2d8927f..fc219a5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> >   * It's the caller's job to ensure that the target task struct
> >   * can't go away on us before we can do any checks.
> >   *
> > - * Returns true if we indeed boosted the target task.
> > + * Returns:
> > + *	true (>0) if we indeed boosted the target task.
> > + *	false (0) if we failed to boost the target.
> > + *	-ESRCH if there's no task to yield to.
> >   */
> >  bool __sched yield_to(struct task_struct *p, bool preempt)
> >  {
> > @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> >  
> >  again:
> >  	p_rq = task_rq(p);
> > +	/*
> > +	 * If we're the only runnable task on the rq and target rq also
> > +	 * has only one task, there's absolutely no point in yielding.
> > +	 */
> > +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> > +		yielded = -ESRCH;
> > +		goto out_irq;
> > +	}
> 
> Looks good to me in principle.
> 
> Would be nice to get more consistent benchmark numbers. Once 
> those are unambiguously showing that this is a win:
> 
>   Acked-by: Ingo Molnar <mingo@kernel.org>
>

I ran the tests with kernbench and sysbench again on a 32-core mx3850
machine with 32-vcpu guests. The results show definite improvements.

ebizzy and dbench show similar improvements for the 1x overcommit case
(note that the stdev for 1x dbench is smaller now; the improvement is
now seen at only 20%).

[ All the numbers are averages over 8 runs. ]

The patches benefit large-guest undercommit scenarios, so I believe the
performance improvement is even more significant with large guests.
[ Chegu Vinod's results show performance close to the no-PLE case. ]
Unfortunately I do not have a machine to test larger guests (>32 vcpus).

Ingo, please let me know if this is okay with you.

base kernel = 3.8.0-rc4

+-----------+-----------+-----------+------------+-----------+
                kernbench  (time in sec lower is better)
+-----------+-----------+-----------+------------+-----------+
    base        stdev        patched    stdev      %improve
+-----------+-----------+-----------+------------+-----------+
1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
               sysbench (time in sec lower is better) 
+-----------+-----------+-----------+------------+-----------+
    base        stdev        patched    stdev      %improve
+-----------+-----------+-----------+------------+-----------+
1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
                ebizzy  (records/sec higher is better)
+-----------+-----------+-----------+------------+-----------+
    base        stdev        patched    stdev      %improve
+-----------+-----------+-----------+------------+-----------+
1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
                dbench (throughput MB/sec  higher is better)
+-----------+-----------+-----------+------------+-----------+
    base        stdev        patched    stdev      %improve
+-----------+-----------+-----------+------------+-----------+
1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
+-----------+-----------+-----------+------------+-----------+
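
[ %improve is computed from the means: (base - patched)/base * 100 for
  the time-based benchmarks, e.g. kernbench 1x: (46.6028 - 42.4494) /
  46.6028 * 100 = 8.91, and (patched - base)/base * 100 for ebizzy and
  dbench. ]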

Ingo Molnar Jan. 25, 2013, 10:47 a.m. UTC | #3
* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
> 
> > 
> > * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> > 
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > In case of undercomitted scenarios, especially in large guests
> > > yield_to overhead is significantly high. when run queue length of
> > > source and target is one, take an opportunity to bail out and return
> > > -ESRCH. This return condition can be further exploited to quickly come
> > > out of PLE handler.
> > > 
> > > (History: Raghavendra initially worked on break out of kvm ple handler upon
> > >  seeing source runqueue length = 1, but it had to export rq length).
> > >  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> > > 
> > > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > > Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> > > Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > > Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > > Acked-by: Andrew Jones <drjones@redhat.com>
> > > Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> > > ---
> > > 
> > >  kernel/sched/core.c |   25 +++++++++++++++++++------
> > >  1 file changed, 19 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 2d8927f..fc219a5 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> > >   * It's the caller's job to ensure that the target task struct
> > >   * can't go away on us before we can do any checks.
> > >   *
> > > - * Returns true if we indeed boosted the target task.
> > > + * Returns:
> > > + *	true (>0) if we indeed boosted the target task.
> > > + *	false (0) if we failed to boost the target.
> > > + *	-ESRCH if there's no task to yield to.
> > >   */
> > >  bool __sched yield_to(struct task_struct *p, bool preempt)
> > >  {
> > > @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> > >  
> > >  again:
> > >  	p_rq = task_rq(p);
> > > +	/*
> > > +	 * If we're the only runnable task on the rq and target rq also
> > > +	 * has only one task, there's absolutely no point in yielding.
> > > +	 */
> > > +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> > > +		yielded = -ESRCH;
> > > +		goto out_irq;
> > > +	}
> > 
> > Looks good to me in principle.
> > 
> > Would be nice to get more consistent benchmark numbers. Once 
> > those are unambiguously showing that this is a win:
> > 
> >   Acked-by: Ingo Molnar <mingo@kernel.org>
> >
> 
> I ran the test with kernbench and sysbench again on 32 core mx3850
> machine with 32 vcpu guests. Results shows definite improvements.
> 
> ebizzy and dbench show similar improvement for 1x overcommit
> (note that stdev for 1x in dbench is lesser improvemet is now seen at
> only 20%)
> 
> [ all the experiments are taken out of 8 run averages ].
> 
> The patches benefit large guest undercommit scenarios, so I believe
> with large guest performance improvemnt is even significant. [ Chegu
> Vinod results show performance near to no ple cases ]. Unfortunately I
> do not have a machine to test larger guest (>32).
> 
> Ingo, Please let me know if this is okay to you.
> 
> base kernel = 3.8.0-rc4
> 
> +-----------+-----------+-----------+------------+-----------+
>                 kernbench  (time in sec lower is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
> 2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
>                sysbench (time in sec lower is better) 
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
> 2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
> +-----------+-----------+-----------+------------+-----------+
> 
> +-----------+-----------+-----------+------------+-----------+
>                 ebizzy  (records/sec higher is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
> +-----------+-----------+-----------+------------+-----------+
> 
> +-----------+-----------+-----------+------------+-----------+
>                 dbench (throughput MB/sec  higher is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
> +-----------+-----------+-----------+------------+-----------+

The numbers look pretty convincing, thanks. The workloads were 
CPU bound most of the time, right?

Thanks,

	Ingo
Andrew Jones Jan. 25, 2013, 11:05 a.m. UTC | #4
On Fri, Jan 25, 2013 at 04:10:25PM +0530, Raghavendra K T wrote:
> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
> 
> > 
> > * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> > 
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > In case of undercomitted scenarios, especially in large guests
> > > yield_to overhead is significantly high. when run queue length of
> > > source and target is one, take an opportunity to bail out and return
> > > -ESRCH. This return condition can be further exploited to quickly come
> > > out of PLE handler.
> > > 
> > > (History: Raghavendra initially worked on break out of kvm ple handler upon
> > >  seeing source runqueue length = 1, but it had to export rq length).
> > >  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> > > 
> > > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > > Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> > > Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > > Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > > Acked-by: Andrew Jones <drjones@redhat.com>
> > > Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> > > ---
> > > 
> > >  kernel/sched/core.c |   25 +++++++++++++++++++------
> > >  1 file changed, 19 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 2d8927f..fc219a5 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> > >   * It's the caller's job to ensure that the target task struct
> > >   * can't go away on us before we can do any checks.
> > >   *
> > > - * Returns true if we indeed boosted the target task.
> > > + * Returns:
> > > + *	true (>0) if we indeed boosted the target task.
> > > + *	false (0) if we failed to boost the target.
> > > + *	-ESRCH if there's no task to yield to.
> > >   */
> > >  bool __sched yield_to(struct task_struct *p, bool preempt)
> > >  {
> > > @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> > >  
> > >  again:
> > >  	p_rq = task_rq(p);
> > > +	/*
> > > +	 * If we're the only runnable task on the rq and target rq also
> > > +	 * has only one task, there's absolutely no point in yielding.
> > > +	 */
> > > +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> > > +		yielded = -ESRCH;
> > > +		goto out_irq;
> > > +	}
> > 
> > Looks good to me in principle.
> > 
> > Would be nice to get more consistent benchmark numbers. Once 
> > those are unambiguously showing that this is a win:
> > 
> >   Acked-by: Ingo Molnar <mingo@kernel.org>
> >
> 
> I ran the test with kernbench and sysbench again on 32 core mx3850
> machine with 32 vcpu guests. Results shows definite improvements.
> 
> ebizzy and dbench show similar improvement for 1x overcommit
> (note that stdev for 1x in dbench is lesser improvemet is now seen at
> only 20%)
> 
> [ all the experiments are taken out of 8 run averages ].
> 
> The patches benefit large guest undercommit scenarios, so I believe
> with large guest performance improvemnt is even significant. [ Chegu
> Vinod results show performance near to no ple cases ].

The last results you posted for dbench for the patched 1x case were
showing much better throughput than the no-ple 1x case, which is what
was strange. Is that still happening? You don't have the no-ple 1x
data here this time. The percent errors look a lot better.

> Unfortunately I
> do not have a machine to test larger guest (>32).
> 
> Ingo, Please let me know if this is okay to you.
> 
> base kernel = 3.8.0-rc4
> 
> +-----------+-----------+-----------+------------+-----------+
>                 kernbench  (time in sec lower is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
> 2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
>                sysbench (time in sec lower is better) 
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
> 2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
> +-----------+-----------+-----------+------------+-----------+
> 
> +-----------+-----------+-----------+------------+-----------+
>                 ebizzy  (records/sec higher is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
> +-----------+-----------+-----------+------------+-----------+
> 
> +-----------+-----------+-----------+------------+-----------+
>                 dbench (throughput MB/sec  higher is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
> +-----------+-----------+-----------+------------+-----------+
> 
Raghavendra K T Jan. 25, 2013, 3:54 p.m. UTC | #5
On 01/25/2013 04:17 PM, Ingo Molnar wrote:
>
> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>
>> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
>>
>>>
>>> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> In case of undercomitted scenarios, especially in large guests
>>>> yield_to overhead is significantly high. when run queue length of
>>>> source and target is one, take an opportunity to bail out and return
>>>> -ESRCH. This return condition can be further exploited to quickly come
>>>> out of PLE handler.
>>>>
>>>> (History: Raghavendra initially worked on break out of kvm ple handler upon
>>>>   seeing source runqueue length = 1, but it had to export rq length).
>>>>   Peter came up with the elegant idea of return -ESRCH in scheduler core.
>>>>
>>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>>> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
>>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>> Acked-by: Andrew Jones <drjones@redhat.com>
>>>> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
>>>> ---
>>>>
>>>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 2d8927f..fc219a5 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>>    * It's the caller's job to ensure that the target task struct
>>>>    * can't go away on us before we can do any checks.
>>>>    *
>>>> - * Returns true if we indeed boosted the target task.
>>>> + * Returns:
>>>> + *	true (>0) if we indeed boosted the target task.
>>>> + *	false (0) if we failed to boost the target.
>>>> + *	-ESRCH if there's no task to yield to.
>>>>    */
>>>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>   {
>>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>
>>>>   again:
>>>>   	p_rq = task_rq(p);
>>>> +	/*
>>>> +	 * If we're the only runnable task on the rq and target rq also
>>>> +	 * has only one task, there's absolutely no point in yielding.
>>>> +	 */
>>>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>> +		yielded = -ESRCH;
>>>> +		goto out_irq;
>>>> +	}
>>>
>>> Looks good to me in principle.
>>>
>>> Would be nice to get more consistent benchmark numbers. Once
>>> those are unambiguously showing that this is a win:
>>>
>>>    Acked-by: Ingo Molnar <mingo@kernel.org>
>>>
>>
>> I ran the test with kernbench and sysbench again on 32 core mx3850
>> machine with 32 vcpu guests. Results shows definite improvements.
>>
>> ebizzy and dbench show similar improvement for 1x overcommit
>> (note that stdev for 1x in dbench is lesser improvemet is now seen at
>> only 20%)
>>
>> [ all the experiments are taken out of 8 run averages ].
>>
>> The patches benefit large guest undercommit scenarios, so I believe
>> with large guest performance improvemnt is even significant. [ Chegu
>> Vinod results show performance near to no ple cases ]. Unfortunately I
>> do not have a machine to test larger guest (>32).
>>
>> Ingo, Please let me know if this is okay to you.
>>
>> base kernel = 3.8.0-rc4
>>
>> +-----------+-----------+-----------+------------+-----------+
>>                  kernbench  (time in sec lower is better)
>> +-----------+-----------+-----------+------------+-----------+
>>      base        stdev        patched    stdev      %improve
>> +-----------+-----------+-----------+------------+-----------+
>> 1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
>> 2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
>> +-----------+-----------+-----------+------------+-----------+
>> +-----------+-----------+-----------+------------+-----------+
>>                 sysbench (time in sec lower is better)
>> +-----------+-----------+-----------+------------+-----------+
>>      base        stdev        patched    stdev      %improve
>> +-----------+-----------+-----------+------------+-----------+
>> 1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
>> 2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
>> +-----------+-----------+-----------+------------+-----------+
>>
>> +-----------+-----------+-----------+------------+-----------+
>>                  ebizzy  (records/sec higher is better)
>> +-----------+-----------+-----------+------------+-----------+
>>      base        stdev        patched    stdev      %improve
>> +-----------+-----------+-----------+------------+-----------+
>> 1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
>> +-----------+-----------+-----------+------------+-----------+
>>
>> +-----------+-----------+-----------+------------+-----------+
>>                  dbench (throughput MB/sec  higher is better)
>> +-----------+-----------+-----------+------------+-----------+
>>      base        stdev        patched    stdev      %improve
>> +-----------+-----------+-----------+------------+-----------+
>> 1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
>> +-----------+-----------+-----------+------------+-----------+
>
> The numbers look pretty convincing, thanks. The workloads were
> CPU bound most of the time, right?

Yes, CPU bound most of the time. I also used tmpfs to reduce I/O
overhead (for dbench).

Raghavendra K T Jan. 25, 2013, 3:58 p.m. UTC | #6
On 01/25/2013 04:35 PM, Andrew Jones wrote:
> On Fri, Jan 25, 2013 at 04:10:25PM +0530, Raghavendra K T wrote:
>> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
>>
>>>
>>> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> In case of undercomitted scenarios, especially in large guests
>>>> yield_to overhead is significantly high. when run queue length of
>>>> source and target is one, take an opportunity to bail out and return
>>>> -ESRCH. This return condition can be further exploited to quickly come
>>>> out of PLE handler.
>>>>
>>>> (History: Raghavendra initially worked on break out of kvm ple handler upon
>>>>   seeing source runqueue length = 1, but it had to export rq length).
>>>>   Peter came up with the elegant idea of return -ESRCH in scheduler core.
>>>>
>>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>>> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
>>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>> Acked-by: Andrew Jones <drjones@redhat.com>
>>>> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
>>>> ---
>>>>
>>>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 2d8927f..fc219a5 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>>    * It's the caller's job to ensure that the target task struct
>>>>    * can't go away on us before we can do any checks.
>>>>    *
>>>> - * Returns true if we indeed boosted the target task.
>>>> + * Returns:
>>>> + *	true (>0) if we indeed boosted the target task.
>>>> + *	false (0) if we failed to boost the target.
>>>> + *	-ESRCH if there's no task to yield to.
>>>>    */
>>>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>   {
>>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>
>>>>   again:
>>>>   	p_rq = task_rq(p);
>>>> +	/*
>>>> +	 * If we're the only runnable task on the rq and target rq also
>>>> +	 * has only one task, there's absolutely no point in yielding.
>>>> +	 */
>>>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>> +		yielded = -ESRCH;
>>>> +		goto out_irq;
>>>> +	}
>>>
>>> Looks good to me in principle.
>>>
>>> Would be nice to get more consistent benchmark numbers. Once
>>> those are unambiguously showing that this is a win:
>>>
>>>    Acked-by: Ingo Molnar <mingo@kernel.org>
>>>
>>
>> I ran the test with kernbench and sysbench again on 32 core mx3850
>> machine with 32 vcpu guests. Results shows definite improvements.
>>
>> ebizzy and dbench show similar improvement for 1x overcommit
>> (note that stdev for 1x in dbench is lesser improvemet is now seen at
>> only 20%)
>>
>> [ all the experiments are taken out of 8 run averages ].
>>
>> The patches benefit large guest undercommit scenarios, so I believe
>> with large guest performance improvemnt is even significant. [ Chegu
>> Vinod results show performance near to no ple cases ].
>
> The last results you posted for dbench for the patched 1x case were
> showing much better throughput than the no-ple 1x case, which is what
> was strange. Is that still happening? You don't have the no-ple 1x
> data here this time. The percent errors look a lot better.

I re-ran the experiment and got almost 4% less throughput for the
no-PLE case compared to the patched case (13500 vs 14100). (I believe
this variation may be due to having 4 guests with 3 of them idle, as
no-PLE is very sensitive beyond 1x.)



Ingo Molnar Jan. 25, 2013, 6:49 p.m. UTC | #7
* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> On 01/25/2013 04:17 PM, Ingo Molnar wrote:
> >
> >* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> >
> >>* Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
> >>
> >>>
> >>>* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> >>>
> >>>>From: Peter Zijlstra <peterz@infradead.org>
> >>>>
> >>>>In case of undercomitted scenarios, especially in large guests
> >>>>yield_to overhead is significantly high. when run queue length of
> >>>>source and target is one, take an opportunity to bail out and return
> >>>>-ESRCH. This return condition can be further exploited to quickly come
> >>>>out of PLE handler.
> >>>>
> >>>>(History: Raghavendra initially worked on break out of kvm ple handler upon
> >>>>  seeing source runqueue length = 1, but it had to export rq length).
> >>>>  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> >>>>
> >>>>Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> >>>>Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> >>>>Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >>>>Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> >>>>Acked-by: Andrew Jones <drjones@redhat.com>
> >>>>Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> >>>>---
> >>>>
> >>>>  kernel/sched/core.c |   25 +++++++++++++++++++------
> >>>>  1 file changed, 19 insertions(+), 6 deletions(-)
> >>>>
> >>>>diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>>>index 2d8927f..fc219a5 100644
> >>>>--- a/kernel/sched/core.c
> >>>>+++ b/kernel/sched/core.c
> >>>>@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> >>>>   * It's the caller's job to ensure that the target task struct
> >>>>   * can't go away on us before we can do any checks.
> >>>>   *
> >>>>- * Returns true if we indeed boosted the target task.
> >>>>+ * Returns:
> >>>>+ *	true (>0) if we indeed boosted the target task.
> >>>>+ *	false (0) if we failed to boost the target.
> >>>>+ *	-ESRCH if there's no task to yield to.
> >>>>   */
> >>>>  bool __sched yield_to(struct task_struct *p, bool preempt)
> >>>>  {
> >>>>@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> >>>>
> >>>>  again:
> >>>>  	p_rq = task_rq(p);
> >>>>+	/*
> >>>>+	 * If we're the only runnable task on the rq and target rq also
> >>>>+	 * has only one task, there's absolutely no point in yielding.
> >>>>+	 */
> >>>>+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> >>>>+		yielded = -ESRCH;
> >>>>+		goto out_irq;
> >>>>+	}
> >>>
> >>>Looks good to me in principle.
> >>>
> >>>Would be nice to get more consistent benchmark numbers. Once
> >>>those are unambiguously showing that this is a win:
> >>>
> >>>   Acked-by: Ingo Molnar <mingo@kernel.org>
> >>>
> >>
> >>I ran the test with kernbench and sysbench again on 32 core mx3850
> >>machine with 32 vcpu guests. Results shows definite improvements.
> >>
> >>ebizzy and dbench show similar improvement for 1x overcommit
> >>(note that stdev for 1x in dbench is lesser improvemet is now seen at
> >>only 20%)
> >>
> >>[ all the experiments are taken out of 8 run averages ].
> >>
> >>The patches benefit large guest undercommit scenarios, so I believe
> >>with large guest performance improvemnt is even significant. [ Chegu
> >>Vinod results show performance near to no ple cases ]. Unfortunately I
> >>do not have a machine to test larger guest (>32).
> >>
> >>Ingo, Please let me know if this is okay to you.
> >>
> >>base kernel = 3.8.0-rc4
> >>
> >>+-----------+-----------+-----------+------------+-----------+
> >>                 kernbench  (time in sec lower is better)
> >>+-----------+-----------+-----------+------------+-----------+
> >>     base        stdev        patched    stdev      %improve
> >>+-----------+-----------+-----------+------------+-----------+
> >>1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
> >>2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
> >>+-----------+-----------+-----------+------------+-----------+
> >>+-----------+-----------+-----------+------------+-----------+
> >>                sysbench (time in sec lower is better)
> >>+-----------+-----------+-----------+------------+-----------+
> >>     base        stdev        patched    stdev      %improve
> >>+-----------+-----------+-----------+------------+-----------+
> >>1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
> >>2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
> >>+-----------+-----------+-----------+------------+-----------+
> >>
> >>+-----------+-----------+-----------+------------+-----------+
> >>                 ebizzy  (records/sec higher is better)
> >>+-----------+-----------+-----------+------------+-----------+
> >>     base        stdev        patched    stdev      %improve
> >>+-----------+-----------+-----------+------------+-----------+
> >>1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
> >>+-----------+-----------+-----------+------------+-----------+
> >>
> >>+-----------+-----------+-----------+------------+-----------+
> >>                 dbench (throughput MB/sec  higher is better)
> >>+-----------+-----------+-----------+------------+-----------+
> >>     base        stdev        patched    stdev      %improve
> >>+-----------+-----------+-----------+------------+-----------+
> >>1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
> >>+-----------+-----------+-----------+------------+-----------+
> >
> >The numbers look pretty convincing, thanks. The workloads were
> >CPU bound most of the time, right?
> 
> Yes. CPU bound most of the time. I also used tmpfs to reduce 
> io overhead (for dbbench).

Ok, cool.

Which tree will this be upstreamed through - the KVM tree? I'd 
suggest the KVM tree because KVM will be the one exposed to the 
effects of this change.

Thanks,

	Ingo
Raghavendra K T Jan. 27, 2013, 4:58 p.m. UTC | #8
On 01/26/2013 12:19 AM, Ingo Molnar wrote:
>
> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>
>> On 01/25/2013 04:17 PM, Ingo Molnar wrote:
>>>
>>> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>
>>>> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
>>>>
>>>>>
>>>>> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>>>
>>>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>>>
>>>>>> In case of undercomitted scenarios, especially in large guests
>>>>>> yield_to overhead is significantly high. when run queue length of
>>>>>> source and target is one, take an opportunity to bail out and return
>>>>>> -ESRCH. This return condition can be further exploited to quickly come
>>>>>> out of PLE handler.
>>>>>>
>>>>>> (History: Raghavendra initially worked on break out of kvm ple handler upon
>>>>>>   seeing source runqueue length = 1, but it had to export rq length).
>>>>>>   Peter came up with the elegant idea of return -ESRCH in scheduler core.
>>>>>>
>>>>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>>>>> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
>>>>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>>>> Acked-by: Andrew Jones <drjones@redhat.com>
>>>>>> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
>>>>>> ---
>>>>>>
>>>>>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>>>>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>>>> index 2d8927f..fc219a5 100644
>>>>>> --- a/kernel/sched/core.c
>>>>>> +++ b/kernel/sched/core.c
>>>>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>>>>    * It's the caller's job to ensure that the target task struct
>>>>>>    * can't go away on us before we can do any checks.
>>>>>>    *
>>>>>> - * Returns true if we indeed boosted the target task.
>>>>>> + * Returns:
>>>>>> + *	true (>0) if we indeed boosted the target task.
>>>>>> + *	false (0) if we failed to boost the target.
>>>>>> + *	-ESRCH if there's no task to yield to.
>>>>>>    */
>>>>>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>>>   {
>>>>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>>>
>>>>>>   again:
>>>>>>   	p_rq = task_rq(p);
>>>>>> +	/*
>>>>>> +	 * If we're the only runnable task on the rq and target rq also
>>>>>> +	 * has only one task, there's absolutely no point in yielding.
>>>>>> +	 */
>>>>>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>>>> +		yielded = -ESRCH;
>>>>>> +		goto out_irq;
>>>>>> +	}
>>>>>
>>>>> Looks good to me in principle.
>>>>>
>>>>> Would be nice to get more consistent benchmark numbers. Once
>>>>> those are unambiguously showing that this is a win:
>>>>>
>>>>>    Acked-by: Ingo Molnar <mingo@kernel.org>
>>>>>
>>>>
>>>> I ran the test with kernbench and sysbench again on 32 core mx3850
>>>> machine with 32 vcpu guests. Results shows definite improvements.
>>>>
>>>> ebizzy and dbench show similar improvement for 1x overcommit
>>>> (note that stdev for 1x in dbench is lesser improvemet is now seen at
>>>> only 20%)
>>>>
>>>> [ all the experiments are taken out of 8 run averages ].
>>>>
>>>> The patches benefit large guest undercommit scenarios, so I believe
>>>> with large guest performance improvemnt is even significant. [ Chegu
>>>> Vinod results show performance near to no ple cases ]. Unfortunately I
>>>> do not have a machine to test larger guest (>32).
>>>>
>>>> Ingo, Please let me know if this is okay to you.
>>>>
>>>> base kernel = 3.8.0-rc4
>>>>
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>                  kernbench  (time in sec lower is better)
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>      base        stdev        patched    stdev      %improve
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> 1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
>>>> 2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>                 sysbench (time in sec lower is better)
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>      base        stdev        patched    stdev      %improve
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> 1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
>>>> 2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>                  ebizzy  (records/sec higher is better)
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>      base        stdev        patched    stdev      %improve
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> 1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>                  dbench (throughput MB/sec  higher is better)
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>      base        stdev        patched    stdev      %improve
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> 1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
>>>> +-----------+-----------+-----------+------------+-----------+
>>>
>>> The numbers look pretty convincing, thanks. The workloads were
>>> CPU bound most of the time, right?
>>
>> Yes. CPU bound most of the time. I also used tmpfs to reduce
>> io overhead (for dbbench).
>
> Ok, cool.
>
> Which tree will this be upstreamed through - the KVM tree? I'd
> suggest the KVM tree because KVM will be the one exposed to the
> effects of this change.

Thanks, Ingo.

Marcelo, could you please take this into the KVM tree?



Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@  EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4303,6 +4306,15 @@  bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+	/*
+	 * If we're the only runnable task on the rq and target rq also
+	 * has only one task, there's absolutely no point in yielding.
+	 */
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@  again:
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4329,11 +4341,12 @@  again:
 			resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;
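
For callers, the net effect of the final hunk is that the return value
is effectively tri-state rather than boolean. A minimal, illustrative
use (assuming the negative value is propagated to the caller intact):

	int ret = yield_to(task, false);

	if (ret > 0) {
		/* Target was boosted; yield_to() already called schedule(). */
	} else if (ret == 0) {
		/* Could not boost (wrong sched class, target running, ...). */
	} else if (ret == -ESRCH) {
		/* Both rqs had a single runnable task: no point retrying. */
	}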