
[RFC,00/49] xen: add core scheduling support

Message ID 20190329150934.17694-1-jgross@suse.com (mailing list archive)
Series xen: add core scheduling support

Message

Jürgen Groß March 29, 2019, 3:08 p.m. UTC
This series is very RFC!!!!

Add support for core- and socket-scheduling in the Xen hypervisor.

Via boot parameter sched_granularity=core (or sched_granularity=socket)
it is possible to change the scheduling granularity from thread (the
default) to either whole cores or even sockets.
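
For illustration only, a hypothetical boot entry could look like the
following (the xen.gz path and the other options are just placeholders,
not part of this series):

    multiboot2 /boot/xen.gz dom0_mem=2048M sched_granularity=core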

All logical cpus (threads) of the core or socket are always scheduled
together. This means that on a core always vcpus of the same domain
will be active, and those vcpus will always be scheduled at the same
time.

This is achieved by switching the scheduler to no longer see vcpus as
the primary object to schedule, but "schedule items". Each schedule
item consists of as many vcpus as each core has threads on the current
system. The vcpu->item relation is fixed.
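
As a purely illustrative sketch (assuming Xen's usual types; the field
names loosely follow the patch titles below, but this is not the actual
code), a schedule item is basically a small container for the vcpus
sharing one core (or socket):

    /* Sketch only -- not the struct sched_item from the patches. */
    struct sched_item {
        struct domain     *domain;       /* all vcpus of an item belong to one domain */
        unsigned int       item_id;
        struct vcpu       *vcpu_list;    /* the granularity-many vcpus forming this item */
        struct sched_item *next_in_list; /* linked list of the domain's items */
        void              *priv;         /* per-item scheduler private data */
        bool               is_running;   /* running state is tracked per item */
    };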

I have done some very basic performance testing: on a 4 cpu system
(2 cores with 2 threads each) I did a "make -j 4" for building the Xen
hypervisor. This test has been run on dom0, once with no other
guest active and once with another guest with 4 vcpus running the same
test. The results are (always elapsed time, system time, user time):

sched_granularity=thread, no other guest: 116.10 177.65 207.84
sched_granularity=core,   no other guest: 114.04 175.47 207.45
sched_granularity=thread, other guest:    202.30 334.21 384.63
sched_granularity=core,   other guest:    207.24 293.04 371.37

All tests have been performed with credit2, the other schedulers are
untested up to now.

Cpupools are not yet working, as moving cpus between cpupools needs
more work.

HVM domains do not work yet; there is a double fault in Xen at the
end of SeaBIOS. I'm currently investigating this issue.

This is x86-only for the moment. ARM doesn't even build with the
series applied. For full ARM support I might need some help with the
ARM specific context switch handling.

The first 7 patches have been sent to xen-devel already; I'm just
adding them here for convenience as they are prerequisites.

I'm especially looking for feedback regarding the overall idea and
design.


Juergen Gross (49):
  xen/sched: call cpu_disable_scheduler() via cpu notifier
  xen: add helper for calling notifier_call_chain() to common/cpu.c
  xen: add new cpu notifier action CPU_RESUME_FAILED
  xen: don't free percpu areas during suspend
  xen/cpupool: simplify suspend/resume handling
  xen/sched: don't disable scheduler on cpus during suspend
  xen/sched: fix credit2 smt idle handling
  xen/sched: use new sched_item instead of vcpu in scheduler interfaces
  xen/sched: alloc struct sched_item for each vcpu
  xen/sched: move per-vcpu scheduler private data pointer to sched_item
  xen/sched: build a linked list of struct sched_item
  xen/sched: introduce struct sched_resource
  xen/sched: let pick_cpu return a scheduler resource
  xen/sched: switch schedule_data.curr to point at sched_item
  xen/sched: move per cpu scheduler private data into struct
    sched_resource
  xen/sched: switch vcpu_schedule_lock to item_schedule_lock
  xen/sched: move some per-vcpu items to struct sched_item
  xen/sched: add scheduler helpers hiding vcpu
  xen/sched: add domain pointer to struct sched_item
  xen/sched: add id to struct sched_item
  xen/sched: rename scheduler related perf counters
  xen/sched: switch struct task_slice from vcpu to sched_item
  xen/sched: move is_running indicator to struct sched_item
  xen/sched: make null scheduler vcpu agnostic.
  xen/sched: make rt scheduler vcpu agnostic.
  xen/sched: make credit scheduler vcpu agnostic.
  xen/sched: make credit2 scheduler vcpu agnostic.
  xen/sched: make arinc653 scheduler vcpu agnostic.
  xen: add sched_item_pause_nosync() and sched_item_unpause()
  xen: let vcpu_create() select processor
  xen/sched: use sched_resource cpu instead smp_processor_id in
    schedulers
  xen/sched: switch schedule() from vcpus to sched_items
  xen/sched: switch sched_move_irqs() to take sched_item as parameter
  xen: switch from for_each_vcpu() to for_each_sched_item()
  xen/sched: add runstate counters to struct sched_item
  xen/sched: rework and rename vcpu_force_reschedule()
  xen/sched: Change vcpu_migrate_*() to operate on schedule item
  xen/sched: move struct task_slice into struct sched_item
  xen/sched: add code to sync scheduling of all vcpus of a sched item
  xen/sched: add support for multiple vcpus per sched item where missing
  x86: make loading of GDT at context switch more modular
  xen/sched: add support for guest vcpu idle
  xen/sched: modify cpupool_domain_cpumask() to be an item mask
  xen: round up max vcpus to scheduling granularity
  xen/sched: support allocating multiple vcpus into one sched item
  xen/sched: add a scheduler_percpu_init() function
  xen/sched: support core scheduling in continue_running()
  xen/sched: make vcpu_wake() core scheduling aware
  xen/sched: add scheduling granularity enum

 xen/arch/arm/domain.c                |   14 +
 xen/arch/arm/domain_build.c          |   13 +-
 xen/arch/arm/smpboot.c               |    6 +-
 xen/arch/x86/dom0_build.c            |   11 +-
 xen/arch/x86/domain.c                |  243 +++++--
 xen/arch/x86/hvm/dom0_build.c        |    9 +-
 xen/arch/x86/hvm/hvm.c               |    7 +-
 xen/arch/x86/hvm/viridian/viridian.c |    1 +
 xen/arch/x86/hvm/vlapic.c            |    1 +
 xen/arch/x86/hvm/vmx/vmcs.c          |    6 +-
 xen/arch/x86/hvm/vmx/vmx.c           |    5 +-
 xen/arch/x86/mm.c                    |   10 +-
 xen/arch/x86/percpu.c                |    3 +-
 xen/arch/x86/pv/descriptor-tables.c  |    6 +-
 xen/arch/x86/pv/dom0_build.c         |   10 +-
 xen/arch/x86/pv/domain.c             |   19 +
 xen/arch/x86/pv/emul-priv-op.c       |    2 +
 xen/arch/x86/pv/shim.c               |    4 +-
 xen/arch/x86/pv/traps.c              |    6 +-
 xen/arch/x86/setup.c                 |    2 +
 xen/arch/x86/smpboot.c               |    5 +-
 xen/arch/x86/traps.c                 |   10 +-
 xen/common/cpu.c                     |   61 +-
 xen/common/cpupool.c                 |  161 ++---
 xen/common/domain.c                  |   36 +-
 xen/common/domctl.c                  |   28 +-
 xen/common/keyhandler.c              |    7 +-
 xen/common/sched_arinc653.c          |  258 ++++---
 xen/common/sched_credit.c            |  704 +++++++++---------
 xen/common/sched_credit2.c           | 1143 +++++++++++++++---------------
 xen/common/sched_null.c              |  423 +++++------
 xen/common/sched_rt.c                |  538 +++++++-------
 xen/common/schedule.c                | 1292 ++++++++++++++++++++++++----------
 xen/common/softirq.c                 |    6 +-
 xen/common/wait.c                    |    5 +-
 xen/include/asm-x86/cpuidle.h        |    2 +-
 xen/include/asm-x86/dom0_build.h     |    3 +-
 xen/include/asm-x86/domain.h         |    3 +
 xen/include/xen/cpu.h                |   29 +-
 xen/include/xen/domain.h             |    3 +-
 xen/include/xen/perfc_defn.h         |   32 +-
 xen/include/xen/sched-if.h           |  276 ++++++--
 xen/include/xen/sched.h              |   40 +-
 xen/include/xen/softirq.h            |    1 +
 44 files changed, 3175 insertions(+), 2269 deletions(-)

Comments

Jürgen Groß March 29, 2019, 3:37 p.m. UTC | #1
On 29/03/2019 16:08, Juergen Gross wrote:
> This series is very RFC!!!!
> 
> Add support for core- and socket-scheduling in the Xen hypervisor.
> 
> Via boot parameter sched_granularity=core (or sched_granularity=socket)
> it is possible to change the scheduling granularity from thread (the
> default) to either whole cores or even sockets.
> 
> All logical cpus (threads) of the core or socket are always scheduled
> together. This means that on a core always vcpus of the same domain
> will be active, and those vcpus will always be scheduled at the same
> time.
> 
> This is achieved by switching the scheduler to no longer see vcpus as
> the primary object to schedule, but "schedule items". Each schedule
> item consists of as many vcpus as each core has threads on the current
> system. The vcpu->item relation is fixed.
> 
> I have done some very basic performance testing: on a 4 cpu system
> (2 cores with 2 threads each) I did a "make -j 4" for building the Xen
> hypervisor. This test has been run on dom0, once with no other
> guest active and once with another guest with 4 vcpus running the same
> test. The results are (always elapsed time, system time, user time):
> 
> sched_granularity=thread, no other guest: 116.10 177.65 207.84
> sched_granularity=core,   no other guest: 114.04 175.47 207.45
> sched_granularity=thread, other guest:    202.30 334.21 384.63
> sched_granularity=core,   other guest:    207.24 293.04 371.37
> 
> All tests have been performed with credit2, the other schedulers are
> untested up to now.
> 
> Cpupools are not yet working, as moving cpus between cpupools needs
> more work.
> 
> HVM domains do not work yet; there is a double fault in Xen at the
> end of SeaBIOS. I'm currently investigating this issue.
> 
> This is x86-only for the moment. ARM doesn't even build with the
> series applied. For full ARM support I might need some help with the
> ARM specific context switch handling.
> 
> The first 7 patches have been sent to xen-devel already; I'm just
> adding them here for convenience as they are prerequisites.
> 
> I'm especially looking for feedback regarding the overall idea and
> design.

I have put the patches in a repository:

github.com/jgross1/xen.git sched-rfc


Juergen
Jan Beulich March 29, 2019, 3:39 p.m. UTC | #2
>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
> Via boot parameter sched_granularity=core (or sched_granularity=socket)
> it is possible to change the scheduling granularity from thread (the
> default) to either whole cores or even sockets.
> 
> All logical cpus (threads) of the core or socket are always scheduled
> together. This means that on a core always vcpus of the same domain
> will be active, and those vcpus will always be scheduled at the same
> time.
> 
> This is achieved by switching the scheduler to no longer see vcpus as
> the primary object to schedule, but "schedule items". Each schedule
> item consists of as many vcpus as each core has threads on the current
> system. The vcpu->item relation is fixed.

Hmm, I find this surprising: A typical guest would have more vCPU-s
than there are threads per core. So if two of them want to run, but
each is associated with a different core, you'd need two cores instead
of one to actually fulfill the request? I could see this necessarily being
the case if you arranged vCPU-s into virtual threads, cores, sockets,
and nodes, but at least from the patch titles it doesn't look as if you
did in this series. Are there other reasons to make this a fixed
relationship?

As a minor cosmetic request visible from this cover letter right away:
Could the command line option please become "sched-granularity="
or even "sched-gran="?

Jan
Jürgen Groß March 29, 2019, 3:46 p.m. UTC | #3
On 29/03/2019 16:39, Jan Beulich wrote:
>>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
>> Via boot parameter sched_granularity=core (or sched_granularity=socket)
>> it is possible to change the scheduling granularity from thread (the
>> default) to either whole cores or even sockets.
>>
>> All logical cpus (threads) of the core or socket are always scheduled
>> together. This means that on a core always vcpus of the same domain
>> will be active, and those vcpus will always be scheduled at the same
>> time.
>>
>> This is achieved by switching the scheduler to no longer see vcpus as
>> the primary object to schedule, but "schedule items". Each schedule
>> item consists of as many vcpus as each core has threads on the current
>> system. The vcpu->item relation is fixed.
> 
> Hmm, I find this surprising: A typical guest would have more vCPU-s
> than there are threads per core. So if two of them want to run, but
> each is associated with a different core, you'd need two cores instead
> of one to actually fulfill the request? I could see this necessarily being

Correct.

> the case if you arranged vCPU-s into virtual threads, cores, sockets,
> and nodes, but at least from the patch titles it doesn't look as if you
> did in this series. Are there other reasons to make this a fixed
> relationship?

In fact I'm doing it, but only implicitly and without adapting the
cpuid related information. The idea is to pass the topology information
at least below the scheduling granularity to the guest later.

Not having the fixed relationship would result in something like the
co-scheduling series Dario already sent, which would need more than
mechanical changes in each scheduler.
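
Just to illustrate what "fixed" means (a sketch only; the helpers below
are invented for this mail and are not the series' actual interface):
with a granularity of 2 threads per core, vcpus 0 and 1 of a domain
always form item 0, vcpus 2 and 3 form item 1, and so on:

    /* Illustrative only. */
    static inline unsigned int vcpu_to_item_id(unsigned int vcpu_id,
                                               unsigned int granularity)
    {
        return vcpu_id / granularity;  /* which schedule item the vcpu belongs to */
    }

    static inline unsigned int vcpu_to_item_pos(unsigned int vcpu_id,
                                                unsigned int granularity)
    {
        return vcpu_id % granularity;  /* which thread slot inside that item */
    }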

> As a minor cosmetic request visible from this cover letter right away:
> Could the command line option please become "sched-granularity="
> or even "sched-gran="?

Of course!


Juergen
Dario Faggioli March 29, 2019, 4:56 p.m. UTC | #4
On Fri, 2019-03-29 at 16:46 +0100, Juergen Gross wrote:
> On 29/03/2019 16:39, Jan Beulich wrote:
> > > > > On 29.03.19 at 16:08, <jgross@suse.com> wrote:
> > > This is achieved by switching the scheduler to no longer see
> > > vcpus as
> > > the primary object to schedule, but "schedule items". Each
> > > schedule
> > > item consists of as many vcpus as each core has threads on the
> > > current
> > > system. The vcpu->item relation is fixed.
> > 
> > the case if you arranged vCPU-s into virtual threads, cores,
> > sockets,
> > and nodes, but at least from the patch titles it doesn't look as if
> > you
> > did in this series. Are there other reasons to make this a fixed
> > relationship?
> 
> In fact I'm doing it, but only implicitly and without adapting the
> cpuid related information. The idea is to pass the topology
> information
> at least below the scheduling granularity to the guest later.
> 
> Not having the fixed relationship would result in something like the
> co-scheduling series Dario already sent, which would need more than
> mechanical changes in each scheduler.
> 
Yep. So, just for the records, those series are, this one for Credit1:
https://lists.xenproject.org/archives/html/xen-devel/2018-08/msg02164.html

And this one for Credit2:
https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg01113.html

Both are RFC, but the Credit2 one was much, much better (more complete,
more tested, more stable, achieving better fairness, etc).

In these series, the "relationship" being discussed here is not fixed.
Not right now, at least, but it can become so (I didn't do it as we
currently lack the info for doing that properly).

It is/was, IMO, a good thing that everything works both with and
without topology enlightenment (even once we have it, in case one, for
whatever reason, doesn't care).

As said by Juergen, the two approaches (and hence the structure of the
series) are quite different. This series is more generic, acts on the
common scheduler code and logic. It's quite intrusive, as we can see
:-D, but enables the feature for all the schedulers all at once (well,
they all need changes, but mostly mechanical).

My series, OTOH, act on each scheduler specifically (and in fact there
is one for Credit and one for Credit2, and there would need to be one
for RTDS, if wanted, etc). They're much more self contained, but less
generic; and the changes necessary within each scheduler are specific
to the scheduler itself, and non-mechanical.

Regards,
Dario
Jürgen Groß March 29, 2019, 5 p.m. UTC | #5
On 29/03/2019 17:56, Dario Faggioli wrote:
> On Fri, 2019-03-29 at 16:46 +0100, Juergen Gross wrote:
>> On 29/03/2019 16:39, Jan Beulich wrote:
>>>>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
>>>> This is achieved by switching the scheduler to no longer see
>>>> vcpus as
>>>> the primary object to schedule, but "schedule items". Each
>>>> schedule
>>>> item consists of as many vcpus as each core has threads on the
>>>> current
>>>> system. The vcpu->item relation is fixed.
>>>
>>> the case if you arranged vCPU-s into virtual threads, cores,
>>> sockets,
>>> and nodes, but at least from the patch titles it doesn't look as if
>>> you
>>> did in this series. Are there other reasons to make this a fixed
>>> relationship?
>>
>> In fact I'm doing it, but only implicitly and without adapting the
>> cpuid related information. The idea is to pass the topology
>> information
>> at least below the scheduling granularity to the guest later.
>>
>> Not having the fixed relationship would result in something like the
>> co-scheduling series Dario already sent, which would need more than
>> mechanical changes in each scheduler.
>>
> Yep. So, just for the records, those series are, this one for Credit1:
> https://lists.xenproject.org/archives/html/xen-devel/2018-08/msg02164.html
> 
> And this one for Credit2:
> https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg01113.html
> 
> Both are RFC, but the Credit2 one was much, much better (more complete,
> more tested, more stable, achieving better fairness, etc).
> 
> In these series, the "relationship" being discussed here is not fixed.
> Not right now, at least, but it can become so (I didn't do it as we
> currently lack the info for doing that properly).
> 
> It is/was, IMO, a good thing that everything works both with and
> without topology enlightenment (even once we have it, in case one, for
> whatever reason, doesn't care).
> 
> As said by Juergen, the two approaches (and hence the structure of the
> series) are quite different. This series is more generic, acts on the
> common scheduler code and logic. It's quite intrusive, as we can see
> :-D, but enables the feature for all the schedulers all at once (well,
> they all need changes, but mostly mechanical).
> 
> My series, OTOH, act on each scheduler specifically (and in fact there
> is one for Credit and one for Credit2, and there would need to be one
> for RTDS, if wanted, etc). They're much more self contained, but less
> generic; and the changes necessary within each scheduler are specific
> to the scheduler itself, and non-mechanical.

Another line of thought: in case we want core scheduling for security
reasons (to ensure always vcpus of the same guest are sharing a core)
the same might apply to the guest itself: it might want to ensure
only threads of the same process are sharing a core. This would be
quite easy with my series, but impossible for Dario's solution without
the fixed relationship between guest siblings.


Juergen
Dario Faggioli March 29, 2019, 5:29 p.m. UTC | #6
On Fri, 2019-03-29 at 18:00 +0100, Juergen Gross wrote:
> On 29/03/2019 17:56, Dario Faggioli wrote:
> > As said by Juergen, the two approaches (and hence the structure of
> > the
> > series) are quite different. This series is more generic, acts on
> > the
> > common scheduler code and logic. It's quite intrusive, as we can
> > see
> > :-D, but enables the feature for all the schedulers all at once
> > (well,
> > they all need changes, but mostly mechanical).
> > 
> > My series, OTOH, act on each scheduler specifically (and in fact
> > there
> > is one for Credit and one for Credit2, and there would need to be
> > one
> > for RTDS, if wanted, etc). They're much more self contained, but
> > less
> > generic; and the changes necessary within each scheduler are
> > specific
> > to the scheduler itself, and non-mechanical.
> 
> Another line of thought: in case we want core scheduling for security
> reasons (to ensure always vcpus of the same guest are sharing a core)
> the same might apply to the guest itself: it might want to ensure
> only threads of the same process are sharing a core.
>
Sure, as soon as we manage to "passthrough" the necessary topology
information to it.

> This would be
> quite easy with my series, but impossible for Dario's solution
> without
> the fixed relationship between guest siblings.
>
Well, not "impossible". :-)

As said above, that's not there, but it can be added/implemented.

Anyway... Lemme go back looking at the patches, and preparing for
running benchmarks. :-D :-D

Dario
Rian Quinn March 29, 2019, 5:39 p.m. UTC | #7
Out of curiosity, has there been any research done on whether or not
it makes more sense to just disable CPU threading with respect to
overall performance? In some of the testing that we did with OpenXT,
we noticed in some of our tests a performance increase when
hyperthreading was disabled. I would be curious what other research
has been done in this regard.

Either way, if threading is enabled, grouping up threads makes a lot
of sense WRT some of the recent security issues that have come up with
Intel CPUs.


On Fri, Mar 29, 2019 at 11:03 AM Juergen Gross <jgross@suse.com> wrote:
>
> On 29/03/2019 17:56, Dario Faggioli wrote:
> > On Fri, 2019-03-29 at 16:46 +0100, Juergen Gross wrote:
> >> On 29/03/2019 16:39, Jan Beulich wrote:
> >>>>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
> >>>> This is achieved by switching the scheduler to no longer see
> >>>> vcpus as
> >>>> the primary object to schedule, but "schedule items". Each
> >>>> schedule
> >>>> item consists of as many vcpus as each core has threads on the
> >>>> current
> >>>> system. The vcpu->item relation is fixed.
> >>>
> >>> the case if you arranged vCPU-s into virtual threads, cores,
> >>> sockets,
> >>> and nodes, but at least from the patch titles it doesn't look as if
> >>> you
> >>> did in this series. Are there other reasons to make this a fixed
> >>> relationship?
> >>
> >> In fact I'm doing it, but only implicitly and without adapting the
> >> cpuid related information. The idea is to pass the topology
> >> information
> >> at least below the scheduling granularity to the guest later.
> >>
> >> Not having the fixed relationship would result in something like the
> >> co-scheduling series Dario already sent, which would need more than
> >> mechanical changes in each scheduler.
> >>
> > Yep. So, just for the records, those series are, this one for Credit1:
> > https://lists.xenproject.org/archives/html/xen-devel/2018-08/msg02164.html
> >
> > And this one for Credit2:
> > https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg01113.html
> >
> > Both are RFC, but the Credit2 one was much, much better (more complete,
> > more tested, more stable, achieving better fairness, etc).
> >
> > In these series, the "relationship" being discussed here is not fixed.
> > Not right now, at least, but it can become so (I didn't do it as we
> > currently lack the info for doing that properly).
> >
> > It is/was, IMO, a good thing that everything works both with and
> > without topology enlightenment (even once we have it, in case one, for
> > whatever reason, doesn't care).
> >
> > As said by Juergen, the two approaches (and hence the structure of the
> > series) are quite different. This series is more generic, acts on the
> > common scheduler code and logic. It's quite intrusive, as we can see
> > :-D, but enables the feature for all the schedulers all at once (well,
> > they all need changes, but mostly mechanical).
> >
> > My series, OTOH, act on each scheduler specifically (and in fact there
> > is one for Credit and one for Credit2, and there would need to be one
> > for RTDS, if wanted, etc). They're much more self contained, but less
> > generic; and the changes necessary within each scheduler are specific
> > to the scheduler itself, and non-mechanical.
>
> Another line of thought: in case we want core scheduling for security
> reasons (to ensure always vcpus of the same guest are sharing a core)
> the same might apply to the guest itself: it might want to ensure
> only threads of the same process are sharing a core. This would be
> quite easy with my series, but impossible for Dario's solution without
> the fixed relationship between guest siblings.
>
>
> Juergen
>
Andrew Cooper March 29, 2019, 5:48 p.m. UTC | #8
On 29/03/2019 17:39, Rian Quinn wrote:
> Out of curiosity, has there been any research done on whether or not
> it makes more sense to just disable CPU threading with respect to
> overall performance? In some of the testing that we did with OpenXT,
> we noticed in some of our tests a performance increase when
> hyperthreading was disabled. I would be curious what other research
> has been done in this regard.
>
> Either way, if threading is enabled, grouping up threads makes a lot
> of sense WRT some of the recent security issues that have come up with
> Intel CPUs.

There has been plenty of academic research done, and there are real
usecases where disabling HT improves performance.

However, there are plenty when it doesn't.  During L1TF testing,
XenServer measured one typical usecase (aggregate small packet IO
throughput, which is representative of a load of webserver VMs) which
took a 60% perf hit.

10% of this was the raw L1D_FLUSH hit, while 50% of it was actually due
to the increased IO latency of halving the number of vcpus which could
be run concurrently.

As for core aware scheduling, even if nothing else, grouping things up
will get you better cache sharing from the VM's point of view.

As you can probably tell, the answer is far too workload dependent to
come up with a general rule, but at least having the options available
will let people experiment.

~Andrew
Dario Faggioli March 29, 2019, 6:16 p.m. UTC | #9
Even if I've only skimmed through it... cool series! :-D

On Fri, 2019-03-29 at 16:08 +0100, Juergen Gross wrote:
> 
> I have done some very basic performance testing: on a 4 cpu system
> (2 cores with 2 threads each) I did a "make -j 4" for building the
> Xen
> hypervisor. This test has been run on dom0, once with no other
> guest active and once with another guest with 4 vcpus running the
> same
> test. The results are (always elapsed time, system time, user time):
> 
> sched_granularity=thread, no other guest: 116.10 177.65 207.84
> sched_granularity=core,   no other guest: 114.04 175.47 207.45
> sched_granularity=thread, other guest:    202.30 334.21 384.63
> sched_granularity=core,   other guest:    207.24 293.04 371.37
> 
So, just to be sure I'm reading this properly,
"sched_granularity=thread" means no co-scheduling of any sort is in
effect, right? Basically the patch series is applied, but "not used",
correct?

If yes, these are interesting, and promising, numbers. :-)

> All tests have been performed with credit2, the other schedulers are
> untested up to now.
> 
Just as a heads-up for people (as Juergen knows this already :-D), I'm
planning to run some performance evaluation of these patches.

I've got an 8 CPUs system (4 cores, 2 threads each, no-NUMA) and a 16
CPUs system (2 sockets/NUMA nodes, 4 cores each, 2 threads each) on
which I should be able to get some bench suite running relatively easy
and (hopefully) quick.

I'm planning to evaluate:
- vanilla (i.e., without this series), SMT enabled in BIOS
- vanilla (i.e., without this series), SMT disabled in BIOS
- patched (i.e., with this series), granularity=thread
- patched (i.e., with this series), granularity=core

I'll start with no overcommitment, and then move to 2x
overcommitment (as you did above).

And I'll also be focusing on Credit2 only.

Everyone else who also want to do some stress and performance testing
and share the results, that's very much appreciated. :-)

Regards,
Dario
Rian Quinn March 29, 2019, 6:35 p.m. UTC | #10
Makes sense. The reason I ask is that we currently have to disable HT
due to L1TF until a scheduler change is made to address the issue, and
the #1 question everyone asks is what that will do to performance. So
any info on that topic, and on how a patch like this will address the
L1TF issue, is most helpful.

On Fri, Mar 29, 2019 at 11:49 AM Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> On 29/03/2019 17:39, Rian Quinn wrote:
> > Out of curiosity, has there been any research done on whether or not
> > it makes more sense to just disable CPU threading with respect to
> > overall performance? In some of the testing that we did with OpenXT,
> > we noticed in some of our tests a performance increase when
> > hyperthreading was disabled. I would be curious what other research
> > has been done in this regard.
> >
> > Either way, if threading is enabled, grouping up threads makes a lot
> > of sense WRT some of the recent security issues that have come up with
> > Intel CPUs.
>
> There has been plenty of academic research done, and there are real
> usecases where disabling HT improves performance.
>
> However, there are plenty when it doesn't.  During L1TF testing,
> XenServer measured one typical usecase (aggregate small packet IO
> throughput, which is representative of a load of webserver VMs) which
> took a 60% perf hit.
>
> 10% of this was the raw L1D_FLUSH hit, while 50% of it was actually due
> to the increased IO latency of halving the number of vcpus which could
> be run concurrently.
>
> As for core aware scheduling, even if nothing else, grouping things up
> will get you better cache sharing from the VM's point of view.
>
> As you can probably tell, the answer is far too workload dependent to
> come up with a general rule, but at least having the options available
> will let people experiment.
>
> ~Andrew
Jürgen Groß March 30, 2019, 9:55 a.m. UTC | #11
On 29/03/2019 19:16, Dario Faggioli wrote:
> Even if I've only skimmed through it... cool series! :-D
> 
> On Fri, 2019-03-29 at 16:08 +0100, Juergen Gross wrote:
>>
>> I have done some very basic performance testing: on a 4 cpu system
>> (2 cores with 2 threads each) I did a "make -j 4" for building the
>> Xen
>> hypervisor. This test has been run on dom0, once with no other
>> guest active and once with another guest with 4 vcpus running the
>> same
>> test. The results are (always elapsed time, system time, user time):
>>
>> sched_granularity=thread, no other guest: 116.10 177.65 207.84
>> sched_granularity=core,   no other guest: 114.04 175.47 207.45
>> sched_granularity=thread, other guest:    202.30 334.21 384.63
>> sched_granularity=core,   other guest:    207.24 293.04 371.37
>>
> So, just to be sure I'm reading this properly,
> "sched_granularity=thread" means no co-scheduling of any sort is in
> effect, right? Basically the patch series is applied, but "not used",
> correct?

Yes.

> If yes, these are interesting, and promising, numbers. :-)
> 
>> All tests have been performed with credit2, the other schedulers are
>> untested up to now.
>>
> Just as a heads-up for people (as Juergen knows this already :-D), I'm
> planning to run some performance evaluation of these patches.
> 
> I've got an 8 CPUs system (4 cores, 2 threads each, no-NUMA) and a 16
> CPUs system (2 sockets/NUMA nodes, 4 cores each, 2 threads each) on
> which I should be able to get some bench suite running relatively easy
> and (hopefully) quick.
> 
> I'm planning to evaluate:
> - vanilla (i.e., without this series), SMT enabled in BIOS
> - vanilla (i.e., without this series), SMT disabled in BIOS
> - patched (i.e., with this series), granularity=thread
> - patched (i.e., with this series), granularity=core
> 
> I'll start with no overcommitment, and then move to 2x
> overcommitment (as you did above).

Thanks, I appreciate that!


Juergen
Jan Beulich April 1, 2019, 6:41 a.m. UTC | #12
>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
> Via boot parameter sched_granularity=core (or sched_granularity=socket)
> it is possible to change the scheduling granularity from thread (the
> default) to either whole cores or even sockets.

One further general question came to mind: How about also having
"sched-granularity=thread" (or "...=none") to retain current
behavior, at least to have an easy way to compare effects if
wanted? But perhaps also to allow to deal with potential resources
wasting configurations like having mostly VMs with e.g. an odd
number of vCPU-s.

The other question of course is whether the terms thread, core,
and socket are generic enough to be used in architecture
independent code. Even on x86 it already leaves out / unclear
where / how e.g. AMD's compute units would be classified. I
don't have any good suggestion for abstraction, so possibly
the terms used may want to become arch-specific.

Jan
Jürgen Groß April 1, 2019, 6:49 a.m. UTC | #13
On 01/04/2019 08:41, Jan Beulich wrote:
>>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
>> Via boot parameter sched_granularity=core (or sched_granularity=socket)
>> it is possible to change the scheduling granularity from thread (the
>> default) to either whole cores or even sockets.
> 
> One further general question came to mind: How about also having
> "sched-granularity=thread" (or "...=none") to retain current
> behavior, at least to have an easy way to compare effects if
> wanted? But perhaps also to allow to deal with potential resources
> wasting configurations like having mostly VMs with e.g. an odd
> number of vCPU-s.

Fine with me.

> The other question of course is whether the terms thread, core,
> and socket are generic enough to be used in architecture
> independent code. Even on x86 it already leaves out / unclear
> where / how e.g. AMD's compute units would be classified. I
> don't have any good suggestion for abstraction, so possibly
> the terms used may want to become arch-specific.

I followed the already known terms from the credit2_runqueue
parameter. I think they should match. Which would call for
"sched-granularity=cpu" instead of "thread".


Juergen
Dario Faggioli April 1, 2019, 7:10 a.m. UTC | #14
On Mon, 2019-04-01 at 08:49 +0200, Juergen Gross wrote:
> On 01/04/2019 08:41, Jan Beulich wrote:
> > One further general question came to mind: How about also having
> > "sched-granularity=thread" (or "...=none") to retain current
> > behavior, at least to have an easy way to compare effects if
> > wanted? But perhaps also to allow to deal with potential resources
> > wasting configurations like having mostly VMs with e.g. an odd
> > number of vCPU-s.
> 
> Fine with me.
> 
Mmm... I'm still in the process of looking at the patches, so there
might be something I'm missing, but, from the descriptions and from
talking to you (Juergen), I was assuming that to be the case already...
isn't it so?

> > The other question of course is whether the terms thread, core,
> > and socket are generic enough to be used in architecture
> > independent code. Even on x86 it already leaves out / unclear
> > where / how e.g. AMD's compute units would be classified. I
> > don't have any good suggestion for abstraction, so possibly
> > the terms used may want to become arch-specific.
> 
> I followed the already known terms from the credit2_runqueue
> parameter. I think they should match. Which would call for
> "sched-granularity=cpu" instead of "thread".
> 
Yep, I'd go for cpu. Both for, as you said, consistency and also
because I can envision "granularity=thread" being mistaken/interpreted
as a form of "thread aware co-scheduling" (i.e., what
"granularity=core" actually does! :-O)

Regards,
Dario
Jan Beulich April 1, 2019, 7:13 a.m. UTC | #15
>>> On 01.04.19 at 08:49, <jgross@suse.com> wrote:
> On 01/04/2019 08:41, Jan Beulich wrote:
>>>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
>>> Via boot parameter sched_granularity=core (or sched_granularity=socket)
>>> it is possible to change the scheduling granularity from thread (the
>>> default) to either whole cores or even sockets.
>> 
>> One further general question came to mind: How about also having
>> "sched-granularity=thread" (or "...=none") to retain current
>> behavior, at least to have an easy way to compare effects if
>> wanted? But perhaps also to allow to deal with potential resources
>> wasting configurations like having mostly VMs with e.g. an odd
>> number of vCPU-s.
> 
> Fine with me.
> 
>> The other question of course is whether the terms thread, core,
>> and socket are generic enough to be used in architecture
>> independent code. Even on x86 it already leaves out / unclear
>> where / how e.g. AMD's compute units would be classified. I
>> don't have any good suggestion for abstraction, so possibly
>> the terms used may want to become arch-specific.
> 
> I followed the already known terms from the credit2_runqueue
> parameter. I think they should match. Which would call for
> "sched-granularity=cpu" instead of "thread".

"cpu" is fine of course. I wonder though whether the other two
were a good choice for "credit2_runqueue". Stefano, Julien -
is this terminology at least half way suitable for Arm?

Jan
Jürgen Groß April 1, 2019, 7:15 a.m. UTC | #16
On 01/04/2019 09:10, Dario Faggioli wrote:
> On Mon, 2019-04-01 at 08:49 +0200, Juergen Gross wrote:
>> On 01/04/2019 08:41, Jan Beulich wrote:
>>> One further general question came to mind: How about also having
>>> "sched-granularity=thread" (or "...=none") to retain current
>>> behavior, at least to have an easy way to compare effects if
>>> wanted? But perhaps also to allow to deal with potential resources
>>> wasting configurations like having mostly VMs with e.g. an odd
>>> number of vCPU-s.
>>
>> Fine with me.
>>
> Mmm... I'm still in the process of looking at the patches, so there
> might be something I'm missing, but, from the descriptions and from
> talking to you (Juergen), I was assuming that to be the case already...
> isn't it so?

Yes, it is.

I understood Jan to ask for a special parameter value for that.


Juergen
Dario Faggioli April 11, 2019, 12:34 a.m. UTC | #17
On Fri, 2019-03-29 at 19:16 +0100, Dario Faggioli wrote:
> On Fri, 2019-03-29 at 16:08 +0100, Juergen Gross wrote:
> > I have done some very basic performance testing: on a 4 cpu system
> > (2 cores with 2 threads each) I did a "make -j 4" for building the
> > Xen
> > hypervisor. This test has been run on dom0, once with no other
> > guest active and once with another guest with 4 vcpus running the
> > same
> > test.
> Just as a heads-up for people (as Juergen knows this already :-D),
> I'm
> planning to run some performance evaluation of these patches.
> 
> I've got an 8 CPUs system (4 cores, 2 threads each, no-NUMA) and a
> 16
> CPUs system (2 sockets/NUMA nodes, 4 cores each, 2 threads each) on
> which I should be able to get some bench suite running relatively
> easy
> and (hopefully) quick.
> 
> I'm planning to evaluate:
> - vanilla (i.e., without this series), SMT enabled in BIOS
> - vanilla (i.e., without this series), SMT disabled in BIOS
> - patched (i.e., with this series), granularity=thread
> - patched (i.e., with this series), granularity=core
> 
> I'll start with no overcommitment, and then move to 2x
> overcommitment (as you did above).
> 
I've got the first set of results. It's fewer than I wanted/expected to
have at this point in time, but still...

Also, it's Phoronix again. I don't especially love it, but I'm still
working on convincing our own internal automated benchmarking tool
(which I like a lot more :-) ) to be a good friend of Xen. :-P

It's a not too big set of tests, done in the following conditions:
- hardware: Intel Xeon E5620; 2 NUMA nodes, 4 cores and 2 threads each
- slow disk (old rotational HDD)
- benchmarks run in dom0
- CPU, memory and some disk IO benchmarks
- all Spec&Melt mitigations disabled both at Xen and dom0 kernel level
- cpufreq governor = performance, max_cstate = C1
- *non* debug hypervisor

In just one sentence, what I'd say is "So far so good" :-D

https://openbenchmarking.org/result/1904105-SP-1904100DA38

1) 'Xen dom0, SMT On, vanilla' is staging *without* this series even 
    applied
2) 'Xen dom0, SMT on, patched, sched_granularity=thread' is with this 
    series applied, but scheduler behavior as right now
3) 'Xen dom0, SMT on, patched, sched_granularity=core' is with this 
    series applied, and core-scheduling enabled
4) 'Xen dom0, SMT Off, vanilla' is staging *without* this series 
    applied, and SMT turned off in BIOS (i.e., we only have 8 CPUs)

So, comparing 1 and 4, we see, for each specific benchmark, what is the
cost of disabling SMT (or vice-versa, the gain of using SMT).

Comparing 1 and 2, we see the overhead introduced by this series, when
it is not used to achieve core-scheduling.

Compating 1 and 3, we see the differences with what we have right now,
and what we'll have with core-scheduling enabled, as it is implemented
in this series.

Some of the things we can see from the results:
- disabling SMT (i.e., 1 vs 4) is not always bad, but it is bad 
  overall, i.e., if you look at how many tests are better and at how 
  many are slower, with SMT off (and also, by how much). Of course, 
  this can be considered true for these specific benchmarks, on this 
  specific hardware and with this configuration
- the overhead introduced by this series is, overall, pretty small, 
  apart from not more than a couple of exceptions (e.g., Stream Triad 
  or zstd compression). OTOH, there seem to be cases where this series 
  improves performance (e.g., Stress-NG Socket Activity)
- the performance we achieve with core-scheduling is more than
  acceptable
- between core-scheduling and disabling SMT, core-scheduling wins and
  I wouldn't even call it a match :-P

Of course, other thoughts, comments, alternative analysis are welcome.

As said above, this is less than what I wanted to have, and in fact I'm
running more stuff.

I have a much more comprehensive set of benchmarks running these
days. It being "much more comprehensive", however, also means it takes
more time.

I have a newer and faster (both CPU and disk) machine, but I need to
re-purpose it for benchmarking purposes.

At least now that the old Xeon NUMA box is done with this first round,
I can use it for:
- running the tests inside a "regular" PV domain
- running the tests inside more than one PV domain, i.e. with some 
  degree of overcommitment

I'll push out results as soon as I have them.

Regards
Jürgen Groß April 11, 2019, 7:16 a.m. UTC | #18
On 11/04/2019 02:34, Dario Faggioli wrote:
> On Fri, 2019-03-29 at 19:16 +0100, Dario Faggioli wrote:
>> On Fri, 2019-03-29 at 16:08 +0100, Juergen Gross wrote:
>>> I have done some very basic performance testing: on a 4 cpu system
>>> (2 cores with 2 threads each) I did a "make -j 4" for building the
>>> Xen
>>> hypervisor. This test has been run on dom0, once with no other
>>> guest active and once with another guest with 4 vcpus running the
>>> same
>>> test.
>> Just as a heads-up for people (as Juergen knows this already :-D),
>> I'm
>> planning to run some performance evaluation of these patches.
>>
>> I've got an 8 CPUs system (4 cores, 2 threads each, no-NUMA) and a
>> 16
>> CPUs system (2 sockets/NUMA nodes, 4 cores each, 2 threads each) on
>> which I should be able to get some bench suite running relatively
>> easy
>> and (hopefully) quick.
>>
>> I'm planning to evaluate:
>> - vanilla (i.e., without this series), SMT enabled in BIOS
>> - vanilla (i.e., without this series), SMT disabled in BIOS
>> - patched (i.e., with this series), granularity=thread
>> - patched (i.e., with this series), granularity=core
>>
>> I'll start with no overcommitment, and then move to 2x
>> overcommitment (as you did above).
>>
> I've got the first set of results. It's fewer than I wanted/expected to
> have at this point in time, but still...
> 
> Also, it's Phoronix again. I don't especially love it, but I'm still
> working on convincing our own internal automated benchmarking tool
> (which I like a lot more :-) ) to be a good friend of Xen. :-P

I think the Phoronix tests as such are not that bad, it's the way they
are used by Phoronix which is completely idiotic.

> It's a not too big set of tests, done in the following conditions:
> - hardware: Intel Xeon E5620; 2 NUMA nodes, 4 cores and 2 threads each
> - slow disk (old rotational HDD)
> - benchmarks run in dom0
> - CPU, memory and some disk IO benchmarks
> - all Spec&Melt mitigations disabled both at Xen and dom0 kernel level
> - cpufreq governor = performance, max_cstate = C1
> - *non* debug hypervisor
> 
> In just one sentence, what I'd say is "So far so good" :-D
> 
> https://openbenchmarking.org/result/1904105-SP-1904100DA38

Thanks for doing that!


Juergen
Dario Faggioli April 11, 2019, 1:28 p.m. UTC | #19
On Thu, 2019-04-11 at 09:16 +0200, Juergen Gross wrote:
> On 11/04/2019 02:34, Dario Faggioli wrote:
> > Also, it's Phoronix again. I don't especially love it, but I'm
> > still
> > working on convincing our own internal automated benchmarking tool
> > (which I like a lot more :-) ) to be a good friend of Xen. :-P
> 
> I think the Phoronix tests as such are not that bad, it's the way they
> are used by Phoronix which is completely idiotic.
> 
Sure, that is the main problem.

About the suite itself, the fact that it is kind of a black box can be
a very good thing, but also a not so good one.

Opaqueness is, AFAIUI, among its design goals, so I can't possibly
complain about that. And in fact, that is what makes it so easy and
quick to play with it. :-)

If you want to tweak the configuration of a benchmark, or change how
they're run, besides the config options that are pre-defined for each
benchmark (e.g., do stuff like adding `numactl blabla` "in front" of
some), that is a lot less obvious or easy. And yes, this is somewhat
the case for most, if not all, benchmarking suites, but I find
Phoronix makes this _particularly_ tricky.

Anyway... :-D :-D

Regards