diff mbox series

[3/3] x86/smt: Support for enabling/disabling SMT at runtime

Message ID 1554235048-3373-4-git-send-email-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Series x86/smt: Runtime SMT controls

Commit Message

Andrew Cooper April 2, 2019, 7:57 p.m. UTC
Currently, a user can combine the output of `xl info -n`, the ACPI tables,
and some manual CPUID data to figure out which CPU numbers to feed into
`xen-hptool cpu-offline` to effectively disable SMT at runtime.

A more convenient option is to teach Xen how to perform this action.

First of all, extend XEN_SYSCTL_cpu_hotplug with two new operations.
Introduce new smt_{up,down}_helper() functions which wrap the
cpu_{up,down}_helper() helpers with logic which understands siblings based on
their APIC_ID.

Add libxc stubs, and extend xen-hptool with smt-{enable,disable} options.
These are intended to be shorthands for a loop over cpu-{online,offline}.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

Slightly RFC.  I'm not very happy with the continuation situation, but -EBUSY
is the preexisting style and it seems like it is the only option from tasklet
context.

Is it intentional that we can actually online and offline processors beyond
maxcpu?  This is a consequence of the cpu parking logic.
---
 tools/libxc/include/xenctrl.h   |  2 +
 tools/libxc/xc_cpu_hotplug.c    | 26 +++++++++++
 tools/misc/xen-hptool.c         | 56 ++++++++++++++++++++++++
 xen/arch/x86/setup.c            |  2 +-
 xen/arch/x86/sysctl.c           | 96 +++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/processor.h |  1 +
 xen/include/public/sysctl.h     |  9 ++++
 7 files changed, 191 insertions(+), 1 deletion(-)

Comments

Jan Beulich April 3, 2019, 9:33 a.m. UTC | #1
>>> On 02.04.19 at 21:57, <andrew.cooper3@citrix.com> wrote:
> Currently, a user can combine the output of `xl info -n`, the ACPI tables,
> and some manual CPUID data to figure out which CPU numbers to feed into
> `xen-hptool cpu-offline` to effectively disable SMT at runtime.
> 
> A more convenient option is to teach Xen how to perform this action.
> 
> First of all, extend XEN_SYSCTL_cpu_hotplug with two new operations.
> Introduce new smt_{up,down}_helper() functions which wrap the
> cpu_{up,down}_helper() helpers with logic which understands siblings based on
> their APIC_ID.
> 
> Add libxc stubs, and extend xen-hptool with smt-{enable,disable} options.
> These are intended to be shorthands for a loop over cpu-{online,offline}.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Wei Liu <wei.liu2@citrix.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> 
> Slightly RFC.  I'm not very happy with the continuation situation, but -EBUSY
> is the preexisting style and it seems like it is the only option from tasklet
> context.

Well, offloading the re-invocation to the caller isn't really nice.
Looking at the code, is there any reason why we couldn't use
the usual -ERESTART / hypercall_create_continuation()? This
would require a little bit of re-work, in particular to allow
passing the vCPU into hypercall_create_continuation(), but
beyond that I can't see any immediate obstacles. Though
clearly I wouldn't make this a prereq requirement for the work
here.

> Is it intentional that we can actually online and offline processors beyond
> maxcpu?  This is a consequence of the cpu parking logic.

I think so, yes. That's meant to be a boot time limit only imo.
The runtime limit is nr_cpu_ids.

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -60,7 +60,7 @@ static bool __initdata opt_nosmp;
>  boolean_param("nosmp", opt_nosmp);
>  
>  /* maxcpus: maximum number of CPUs to activate. */
> -static unsigned int __initdata max_cpus;
> +unsigned int max_cpus;
>  integer_param("maxcpus", max_cpus);

As per above I don't think this change should be needed or
wanted, but if so for whatever reason, wouldn't the variable
better be __read_mostly?

> --- a/xen/arch/x86/sysctl.c
> +++ b/xen/arch/x86/sysctl.c
> @@ -114,6 +114,92 @@ long cpu_down_helper(void *data)
>      return ret;
>  }
>  
> +static long smt_up_helper(void *data)
> +{
> +    unsigned int cpu, sibling_mask =
> +        (1u << (boot_cpu_data.x86_num_siblings - 1)) - 1;

I don't think this is quite right for higher than 2-thread configurations.
In detect_extended_topology() terms, don't you simply mean
(1u << ht_mask_width) - 1 here, i.e. just
boot_cpu_data.x86_num_siblings - 1 (without any shifting)?

> +    int ret = 0;
> +
> +    if ( !cpu_has_htt || !sibling_mask )
> +        return -EOPNOTSUPP;

Why not put the first part of the check right into the sysctl
handler?

> +    opt_smt = true;

Perhaps also bail early when the variable already has the
designated value? And again perhaps right in the sysctl
handler?

> +    for_each_present_cpu ( cpu )
> +    {
> +        if ( cpu == 0 )
> +            continue;

Is this special case really needed? If so, perhaps worth a brief
comment?

> +        if ( cpu >= max_cpus )
> +            break;
> +
> +        if ( x86_cpu_to_apicid[cpu] & sibling_mask )
> +            ret = cpu_up_helper(_p(cpu));

Shouldn't this be restricted to CPUs a sibling of which is already
online? And widened at the same time, to also online thread 0
if one of the other threads is already online?

Also any reason you use _p() here but not in patch 2?

> +static long smt_down_helper(void *data)
> +{
> +    unsigned int cpu, sibling_mask =
> +        (1u << (boot_cpu_data.x86_num_siblings - 1)) - 1;
> +    int ret = 0;
> +
> +    if ( !cpu_has_htt || !sibling_mask )
> +        return -EOPNOTSUPP;
> +
> +    opt_smt = false;
> +
> +    for_each_present_cpu ( cpu )
> +    {
> +        if ( cpu == 0 )
> +            continue;
> +        if ( cpu >= max_cpus )
> +            break;
> +
> +        if ( x86_cpu_to_apicid[cpu] & sibling_mask )
> +            ret = cpu_down_helper(_p(cpu));

Similarly here, wouldn't you better skip this if it would offline
the last thread of a core?

I also notice that the two functions are extremely similar, and
hence it might be worthwhile considering to fold them, with the
caller controlling the behavior via the so far unused function
parameter (at which point the related remark of mine on patch
2 would become inapplicable).

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -246,8 +246,17 @@ struct xen_sysctl_get_pmstat {
>  struct xen_sysctl_cpu_hotplug {
>      /* IN variables */
>      uint32_t cpu;   /* Physical cpu. */
> +
> +    /* Single CPU enable/disable. */
>  #define XEN_SYSCTL_CPU_HOTPLUG_ONLINE  0
>  #define XEN_SYSCTL_CPU_HOTPLUG_OFFLINE 1
> +
> +    /*
> +     * SMT enable/disable. Caller must zero the 'cpu' field to begin, and
> +     * ignore it on completion.
> +     */
> +#define XEN_SYSCTL_CPU_HOTPLUG_SMT_ENABLE  2
> +#define XEN_SYSCTL_CPU_HOTPLUG_SMT_DISABLE 3

Is the "cpu" field constraint mentioned in the comment just a
precaution? I can't see you encode anything into that field, or
use it upon getting re-invoked. I assume that's because of the
expectation that only actual onlining/offlining would potentially
take long, while iterating over all present CPUs without further
action ought to be fast enough.

Jan
Andrew Cooper April 3, 2019, 10:17 a.m. UTC | #2
On 03/04/2019 10:33, Jan Beulich wrote:
>>>> On 02.04.19 at 21:57, <andrew.cooper3@citrix.com> wrote:
>> Currently, a user can combine the output of `xl info -n`, the ACPI tables,
>> and some manual CPUID data to figure out which CPU numbers to feed into
>> `xen-hptool cpu-offline` to effectively disable SMT at runtime.
>>
>> A more convenient option is to teach Xen how to perform this action.
>>
>> First of all, extend XEN_SYSCTL_cpu_hotplug with two new operations.
>> Introduce new smt_{up,down}_helper() functions which wrap the
>> cpu_{up,down}_helper() helpers with logic which understands siblings based on
>> their APIC_ID.
>>
>> Add libxc stubs, and extend xen-hptool with smt-{enable,disable} options.
>> These are intended to be shorthands for a loop over cpu-{online,offline}.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Wei Liu <wei.liu2@citrix.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>
>> Slightly RFC.  I'm not very happy with the continuation situation, but -EBUSY
>> is the preexisting style and it seems like it is the only option from tasklet
>> context.
> Well, offloading the re-invocation to the caller isn't really nice.
> Looking at the code, is there any reason why we couldn't use
> the usual -ERESTART / hypercall_create_continuation()? This
> would require a little bit of re-work, in particular to allow
> passing the vCPU into hypercall_create_continuation(), but
> beyond that I can't see any immediate obstacles. Though
> clearly I wouldn't make this a prereq requirement for the work
> here.

The problem isn't really the ERESTART.  We could do some plumbing and
make it work, but the real problem is that I can't stash the current cpu
index in the sysctl data block across the continuation point.

At the moment, the loop depends on, once all CPUs are in the correct
state, getting through the for_each_present_cpu() loop without taking a
further continuation.

>
>> Is it intentional that we can actually online and offline processors beyond
>> maxcpu?  This is a consequence of the cpu parking logic.
> I think so, yes. That's meant to be a boot time limit only imo.
> The runtime limit is nr_cpu_ids.
>
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -60,7 +60,7 @@ static bool __initdata opt_nosmp;
>>  boolean_param("nosmp", opt_nosmp);
>>  
>>  /* maxcpus: maximum number of CPUs to activate. */
>> -static unsigned int __initdata max_cpus;
>> +unsigned int max_cpus;
>>  integer_param("maxcpus", max_cpus);
> As per above I don't think this change should be needed or
> wanted, but if so for whatever reason, wouldn't the variable
> better be __read_mostly?

__read_mostly, yes, but as to whether the change is needed, that
entirely depends on whether the change in semantics to maxcpus= was
accidental or intentional.

>
>> --- a/xen/arch/x86/sysctl.c
>> +++ b/xen/arch/x86/sysctl.c
>> @@ -114,6 +114,92 @@ long cpu_down_helper(void *data)
>>      return ret;
>>  }
>>  
>> +static long smt_up_helper(void *data)
>> +{
>> +    unsigned int cpu, sibling_mask =
>> +        (1u << (boot_cpu_data.x86_num_siblings - 1)) - 1;
> I don't think this is quite right for higher than 2-thread configurations.
> In detect_extended_topology() terms, don't you simply mean
> (1u << ht_mask_width) - 1 here, i.e. just
> boot_cpu_data.x86_num_siblings - 1 (without any shifting)?

Good point, yes.

>
>> +    int ret = 0;
>> +
>> +    if ( !cpu_has_htt || !sibling_mask )
>> +        return -EOPNOTSUPP;
> Why not put the first part of the check right into the sysctl
> handler?

Can do.  I think this is a side effect of how it developed.

>
>> +    opt_smt = true;
> Perhaps also bail early when the variable already has the
> designated value? And again perhaps right in the sysctl
> handler?

That is not safe across continuations.

While it would be a very silly thing to do, there could be two callers
which are fighting over whether SMT is disabled or enabled.

>
>> +    for_each_present_cpu ( cpu )
>> +    {
>> +        if ( cpu == 0 )
>> +            continue;
> Is this special case really needed? If so, perhaps worth a brief
> comment?

Trying to down cpu 0 is a hard -EINVAL.

>
>> +        if ( cpu >= max_cpus )
>> +            break;
>> +
>> +        if ( x86_cpu_to_apicid[cpu] & sibling_mask )
>> +            ret = cpu_up_helper(_p(cpu));
> Shouldn't this be restricted to CPUs a sibling of which is already
> online? And widened at the same time, to also online thread 0
> if one of the other threads is already online?

Unfortunately, that turns into a rat's nest very, very quickly, which is
why I gave up and simplified the semantics to strictly "this shall
{on,off}line the nonzero sibling threads".

This is a convenience for people wanting to do a one-time
reconfiguration of the system, and indeed, it has multiple end user
requests behind its coming into existence.  Users who are already
hotplugging aren't going to be interested in this functionality.

As the usecases don't overlap, I went for the most simple logic.

> Also any reason you use _p() here but not in patch 2?

I thought I'd fixed patch 2 up, but clearly I hadn't.

> I also notice that the two functions are extremely similar, and
> hence it might be worthwhile considering to fold them, with the
> caller controlling the behavior via the so far unused function
> parameter (at which point the related remark of mine on patch
> 2 would become inapplicable).

By passing the plug boolean in via data?  Yes I suppose they are rather
more similar than they started out.

>
>> --- a/xen/include/public/sysctl.h
>> +++ b/xen/include/public/sysctl.h
>> @@ -246,8 +246,17 @@ struct xen_sysctl_get_pmstat {
>>  struct xen_sysctl_cpu_hotplug {
>>      /* IN variables */
>>      uint32_t cpu;   /* Physical cpu. */
>> +
>> +    /* Single CPU enable/disable. */
>>  #define XEN_SYSCTL_CPU_HOTPLUG_ONLINE  0
>>  #define XEN_SYSCTL_CPU_HOTPLUG_OFFLINE 1
>> +
>> +    /*
>> +     * SMT enable/disable. Caller must zero the 'cpu' field to begin, and
>> +     * ignore it on completion.
>> +     */
>> +#define XEN_SYSCTL_CPU_HOTPLUG_SMT_ENABLE  2
>> +#define XEN_SYSCTL_CPU_HOTPLUG_SMT_DISABLE 3
> Is the "cpu" field constraint mentioned in the comment just a
> precaution? I can't see you encode anything into that field, or
> use it upon getting re-invoked. I assume that's because of the
> expectation that only actual onlining/offlining would potentially
> take long, while iterating over all present CPUs without further
> action ought to be fast enough.

Ah - that was stale from before I encountered the "fun" of continuations
from tasklet context.

I would prefer to find a better way, but short of doing a full vcpu
context switch, I don't see an option.

~Andrew
Jan Beulich April 3, 2019, 10:44 a.m. UTC | #3
>>> On 03.04.19 at 12:17, <andrew.cooper3@citrix.com> wrote:
> On 03/04/2019 10:33, Jan Beulich wrote:
>>>>> On 02.04.19 at 21:57, <andrew.cooper3@citrix.com> wrote:
>>> Slightly RFC.  I'm not very happy with the continuation situation, but -EBUSY
>>> is the preexisting style and it seems like it is the only option from tasklet
>>> context.
>> Well, offloading the re-invocation to the caller isn't really nice.
>> Looking at the code, is there any reason why we couldn't use
>> the usual -ERESTART / hypercall_create_continuation()? This
>> would require a little bit of re-work, in particular to allow
>> passing the vCPU into hypercall_create_continuation(), but
>> beyond that I can't see any immediate obstacles. Though
>> clearly I wouldn't make this a prereq requirement for the work
>> here.
> 
> The problem isn't really the ERESTART.  We could do some plumbing and
> make it work, but the real problem is that I can't stash the current cpu
> index in the sysctl data block across the continuation point.
> 
> At the moment, the loop depends on, once all CPUs are in the correct
> state, getting through the for_each_present_cpu() loop without taking a
> further continuation.

But these are two orthogonal things: One is how to invoke the
continuation, and the other is where the continuation is to
resume from. I think the former is more important to address,
as it affects how the tools side code needs to look like.

>>> Is it intentional that we can actually online and offline processors beyond
>>> maxcpu?  This is a consequence of the cpu parking logic.
>> I think so, yes. That's meant to be a boot time limit only imo.
>> The runtime limit is nr_cpu_ids.
>>
>>> --- a/xen/arch/x86/setup.c
>>> +++ b/xen/arch/x86/setup.c
>>> @@ -60,7 +60,7 @@ static bool __initdata opt_nosmp;
>>>  boolean_param("nosmp", opt_nosmp);
>>>  
>>>  /* maxcpus: maximum number of CPUs to activate. */
>>> -static unsigned int __initdata max_cpus;
>>> +unsigned int max_cpus;
>>>  integer_param("maxcpus", max_cpus);
>> As per above I don't think this change should be needed or
>> wanted, but if so for whatever reason, wouldn't the variable
>> better be __read_mostly?
> 
> __read_mostly, yes, but as to whether the change is needed, that
> entirely depends on whether the change in semantics to maxcpus= was
> accidental or intentional.

Well, as said, I did consider this while putting together the
parking series, and I therefore consider it intentional.

>>> +    opt_smt = true;
>> Perhaps also bail early when the variable already has the
>> designated value? And again perhaps right in the sysctl
>> handler?
> 
> That is not safe across continuations.
> 
> While it would be a very silly thing to do, there could be two callers
> which are fighting over whether SMT is disabled or enabled.

Oh, and actually not just that: The continuation then wouldn't
do anything anymore (unless you first reverted the setting,
which in turn wouldn't be right in case any other CPU activity
would occur in parallel, while the continuation is still pending).

>>> +    for_each_present_cpu ( cpu )
>>> +    {
>>> +        if ( cpu == 0 )
>>> +            continue;
>> Is this special case really needed? If so, perhaps worth a brief
>> comment?
> 
> Trying to down cpu 0 is a hard -EINVAL.

But here we're on the CPU-up path. Plus, for eventually supporting
the offlining of CPU 0, it would feel slightly better if you used
smp_processor_id() here.

>>> +        if ( cpu >= max_cpus )
>>> +            break;
>>> +
>>> +        if ( x86_cpu_to_apicid[cpu] & sibling_mask )
>>> +            ret = cpu_up_helper(_p(cpu));
>> Shouldn't this be restricted to CPUs a sibling of which is already
>> online? And widened at the same time, to also online thread 0
>> if one of the other threads is already online?
> 
> Unfortunately, that turns into a rats nest very very quickly, which is
> why I gave up and simplified the semantics to strictly "this shall
> {of,off}line the nonzero siblings threads".

Okay, if that's the intention, then I can certainly live with this.
But it needs to be called out at the very least in the public header.
(It might be worthwhile setting up a flag right away for "full"
behavior, but leave acting upon it unimplemented). It also wouldn't
hurt if the patch description already set expectations accordingly.

Then again, considering your "maxcpus=" related question,
it would certainly be odd for people to see non-zero threads
come online here when they've intentionally left entire cores
or nodes offline for whatever reason. Arguably that's not
something to expect people would commonly do, and hence it
may not be worth wasting meaningful extra effort on. But as
above, any such "oddities" should be spelled out, such that it
can be recognized that they're not oversights.

>> I also notice that the two functions are extremely similar, and
>> hence it might be worthwhile considering to fold them, with the
>> caller controlling the behavior via the so far unused function
>> parameter (at which point the related remark of mine on patch
>> 2 would become inapplicable).
> 
> By passing the plug boolean in via data?

Yes.

>>> --- a/xen/include/public/sysctl.h
>>> +++ b/xen/include/public/sysctl.h
>>> @@ -246,8 +246,17 @@ struct xen_sysctl_get_pmstat {
>>>  struct xen_sysctl_cpu_hotplug {
>>>      /* IN variables */
>>>      uint32_t cpu;   /* Physical cpu. */
>>> +
>>> +    /* Single CPU enable/disable. */
>>>  #define XEN_SYSCTL_CPU_HOTPLUG_ONLINE  0
>>>  #define XEN_SYSCTL_CPU_HOTPLUG_OFFLINE 1
>>> +
>>> +    /*
>>> +     * SMT enable/disable. Caller must zero the 'cpu' field to begin, and
>>> +     * ignore it on completion.
>>> +     */
>>> +#define XEN_SYSCTL_CPU_HOTPLUG_SMT_ENABLE  2
>>> +#define XEN_SYSCTL_CPU_HOTPLUG_SMT_DISABLE 3
>> Is the "cpu" field constraint mentioned in the comment just a
>> precaution? I can't see you encode anything into that field, or
>> use it upon getting re-invoked. I assume that's because of the
>> expectation that only actual onlining/offlining would potentially
>> take long, while iterating over all present CPUs without further
>> action ought to be fast enough.
> 
> Ah - that was stale from before I encountered the "fun" of continuations
> from tasklet context.
> 
> I would prefer to find a better way, but short of doing a full vcpu
> context switch, I don't see an option.

And I don't think there's a strong need. It should just be made
clear (again in the description) that the remark here is just a
precaution at this time, unless you want to drop it altogether.

One thing you may want to do though:

        /* Tolerate already-online siblings. */
        if ( ret == -EEXIST )
        {
            ret = 0;
            continue;
        }

to bypass the general_preempt_check() in that case, such
that you can guarantee making forward progress.

Jan
Andrew Cooper April 3, 2019, 11:33 a.m. UTC | #4
On 03/04/2019 11:44, Jan Beulich wrote:
>>>> On 03.04.19 at 12:17, <andrew.cooper3@citrix.com> wrote:
>> On 03/04/2019 10:33, Jan Beulich wrote:
>>>>>> On 02.04.19 at 21:57, <andrew.cooper3@citrix.com> wrote:
>>>> Slightly RFC.  I'm not very happy with the continuation situation, but -EBUSY
>>>> is the preexisting style and it seems like it is the only option from tasklet
>>>> context.
>>> Well, offloading the re-invocation to the caller isn't really nice.
>>> Looking at the code, is there any reason why we couldn't use
>>> the usual -ERESTART / hypercall_create_continuation()? This
>>> would require a little bit of re-work, in particular to allow
>>> passing the vCPU into hypercall_create_continuation(), but
>>> beyond that I can't see any immediate obstacles. Though
>>> clearly I wouldn't make this a prereq requirement for the work
>>> here.
>> The problem isn't really the ERESTART.  We could do some plumbing and
>> make it work, but the real problem is that I can't stash the current cpu
>> index in the sysctl data block across the continuation point.
>>
>> At the moment, the loop depends on, once all CPUs are in the correct
>> state, getting through the for_each_present_cpu() loop without taking a
>> further continuation.
> But these are two orthogonal things: One is how to invoke the
> continuation, and the other is where the continuation is to
> resume from. I think the former is more important to address,
> as it affects how the tools side code needs to look like.

Right, but -EBUSY is consistent with how the single online/offline ops
function at the moment, which is why I reused it here.

>
>>>> +    for_each_present_cpu ( cpu )
>>>> +    {
>>>> +        if ( cpu == 0 )
>>>> +            continue;
>>> Is this special case really needed? If so, perhaps worth a brief
>>> comment?
>> Trying to down cpu 0 is a hard -EINVAL.
> But here we're on the CPU-up path. Plus, for eventually supporting
> the offlining of CPU 0, it would feel slightly better if you used
> smp_processor_id() here.

Are there any processors where you can actually take CPU 0 offline?  It's
certainly not possible on any Intel or AMD CPUs.

While I can appreciate the theoretical end goal, it isn't a reality and
I see no signs of that changing.  Xen very definitely cannot take CPU 0
offline, nor can hardware, and I don't see any value in jumping through
hoops for an end goal which doesn't exist.

>>>> +        if ( cpu >= max_cpus )
>>>> +            break;
>>>> +
>>>> +        if ( x86_cpu_to_apicid[cpu] & sibling_mask )
>>>> +            ret = cpu_up_helper(_p(cpu));
>>> Shouldn't this be restricted to CPUs a sibling of which is already
>>> online? And widened at the same time, to also online thread 0
>>> if one of the other threads is already online?
>> Unfortunately, that turns into a rat's nest very, very quickly, which is
>> why I gave up and simplified the semantics to strictly "this shall
>> {on,off}line the nonzero sibling threads".
> Okay, if that's the intention, then I can certainly live with this.
> But it needs to be called out at the very least in the public header.
> (It might be worthwhile setting up a flag right away for "full"
> behavior, but leave acting upon it unimplemented). It also wouldn't
> hurt if the patch description already set expectations accordingly.
>
> Then again, considering your "maxcpus=" related question,
> it would certainly be odd for people to see non-zero threads
> come online here when they've intentionally left entire cores
> or nodes offline for whatever reason. Arguably that's not
> something to expect people would commonly do, and hence it
> may not be worth wasting meaningful extra effort on. But as
> above, any such "oddities" should be spelled out, such that it
> can be recognized that they're not oversights.

And we come back to Xen's perennial problem of having no documentation. 
I'll see if I can find some time to put some Sphinx/RST together for this.

As for the maxcpus behaviour, I think that is sufficiently niche (restricted
to debugging circumstances) that perhaps we can ignore it.  I certainly
don't expect to see maxcpus= used in production.

~Andrew
Jan Beulich April 3, 2019, 12:10 p.m. UTC | #5
>>> On 03.04.19 at 13:33, <andrew.cooper3@citrix.com> wrote:
> On 03/04/2019 11:44, Jan Beulich wrote:
>>>>> On 03.04.19 at 12:17, <andrew.cooper3@citrix.com> wrote:
>>> On 03/04/2019 10:33, Jan Beulich wrote:
>>>>>>> On 02.04.19 at 21:57, <andrew.cooper3@citrix.com> wrote:
>>>>> Slightly RFC.  I'm not very happy with the continuation situation, but -EBUSY
>>>>> is the preexisting style and it seems like it is the only option from tasklet
>>>>> context.
>>>> Well, offloading the re-invocation to the caller isn't really nice.
>>>> Looking at the code, is there any reason why couldn't use
>>>> the usual -ERESTART / hypercall_create_continuation()? This
>>>> would require a little bit of re-work, in particular to allow
>>>> passing the vCPU into hypercall_create_continuation(), but
>>>> beyond that I can't see any immediate obstacles. Though
>>>> clearly I wouldn't make this a prereq requirement for the work
>>>> here.
>>> The problem isn't really the ERESTART.  We could do some plumbing and
>>> make it work, but the real problem is that I can't stash the current cpu
>>> index in the sysctl data block across the continuation point.
>>>
>>> At the moment, the loop depends on, once all CPUs are in the correct
>>> state, getting through the for_each_present_cpu() loop without taking a
>>> further continuation.
>> But these are two orthogonal things: One is how to invoke the
>> continuation, and the other is where the continuation is to
>> resume from. I think the former is more important to address,
>> as it affects how the tools side code needs to look like.
> 
> Right, but -EBUSY is consistent with how the single online/offline ops
> function at the moment, which is why I reused it here.

Right, and which also is why I don't think this has to be taken care
of right now.

>>>>> +    for_each_present_cpu ( cpu )
>>>>> +    {
>>>>> +        if ( cpu == 0 )
>>>>> +            continue;
>>>> Is this special case really needed? If so, perhaps worth a brief
>>>> comment?
>>> Trying to down cpu 0 is a hard -EINVAL.
>> But here we're on the CPU-up path. Plus, for eventually supporting
>> the offlining of CPU 0, it would feel slightly better if you used
>> smp_processor_id() here.
> 
> Are there any processors where you can actually take CPU 0 offline?  It's
> certainly not possible on any Intel or AMD CPUs.
> 
> While I can appreciate the theoretical end goal, it isn't a reality and
> I see no signs of that changing.  Xen very definitely cannot take CPU 0
> offline, nor can hardware, and I don't see any value in jumping through
> hoops for an end goal which doesn't exist.

Interesting. Why was it then that x86 Linux took quite some
steps to make it possible (see BOOTPARAM_HOTPLUG_CPU0
Kconfig option as a possible anchor to locate pieces, which
even has a description of conditions that need to be met in
order for this to be possible)? IOW I'd appreciate clarification
on what it is exactly that you think prevents offlining CPU0
(from a pure hardware / firmware perspective), besides the
PIC and suspend/resume aspects (neither of which has to be
in use) mentioned there.

Jan
Jan Beulich April 11, 2019, 8:16 a.m. UTC | #6
>>> On 03.04.19 at 12:17, <andrew.cooper3@citrix.com> wrote:
> On 03/04/2019 10:33, Jan Beulich wrote:
>>>>> On 02.04.19 at 21:57, <andrew.cooper3@citrix.com> wrote:
>>> +        if ( x86_cpu_to_apicid[cpu] & sibling_mask )
>>> +            ret = cpu_up_helper(_p(cpu));
>> Shouldn't this be restricted to CPUs a sibling of which is already
>> online? And widened at the same time, to also online thread 0
>> if one of the other threads is already online?
> 
> Unfortunately, that turns into a rat's nest very, very quickly, which is
> why I gave up and simplified the semantics to strictly "this shall
> {on,off}line the nonzero sibling threads".
> 
> This is a convenience for people wanting to do a one-time
> reconfiguration of the system, and indeed, it has multiple end user
> requests behind its coming into existence.  Users who are already
> hotplugging aren't going to be interested in this functionality.

I'd like to come back to this: I assume by "hotplugging" you
really mean hot {on,off}lining. Thinking about the actual physical
hotplug case, the flow of events is (if I'm not mistaken) for the
Dom0 kernel to issue XENPF_cpu_hotadd (in response to respective
ACPI events), which then would still need to be followed by
xen-hptool invocations to actually bring up the new cores/threads.

While we invoke remove_siblinginfo() from __cpu_disable(), I don't
see why we shouldn't be able to clone cpu_sibling_mask into an
instance not getting altered when a CPU gets parked. This could
then be used here, albeit the context of me thinking about this
again is my intended attempt to make core parking work with
opt_smt set to false, seeing that Jürgen had pointed out this case
as not working.

I further wonder whether these new sysctl-s should be constrained
to the park_offline_cpus == true case. I don't think it was the
intention to bring up/down secondary compute units of AMD Fam15
CPUs, and I also don't think fiddling with AMD Fam17 secondary
threads is really intended/necessary here. I wouldn't be worried
much about the other, opt_mce, dependency: People shouldn't use
"mce=0" on production systems anyway; we should perhaps name
this an unsupported configuration.

Jan

Patch

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index a3628e5..49a6b2a 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1854,6 +1854,8 @@  int xc_pm_reset_cxstat(xc_interface *xch, int cpuid);
 
 int xc_cpu_online(xc_interface *xch, int cpu);
 int xc_cpu_offline(xc_interface *xch, int cpu);
+int xc_smt_enable(xc_interface *xch);
+int xc_smt_disable(xc_interface *xch);
 
 /* 
  * cpufreq para name of this structure named 
diff --git a/tools/libxc/xc_cpu_hotplug.c b/tools/libxc/xc_cpu_hotplug.c
index 58c2a0f..2ea9825 100644
--- a/tools/libxc/xc_cpu_hotplug.c
+++ b/tools/libxc/xc_cpu_hotplug.c
@@ -46,3 +46,29 @@  int xc_cpu_offline(xc_interface *xch, int cpu)
     return ret;
 }
 
+int xc_smt_enable(xc_interface *xch)
+{
+    DECLARE_SYSCTL;
+    int ret;
+
+    sysctl.cmd = XEN_SYSCTL_cpu_hotplug;
+    sysctl.u.cpu_hotplug.cpu = 0;
+    sysctl.u.cpu_hotplug.op = XEN_SYSCTL_CPU_HOTPLUG_SMT_ENABLE;
+    ret = xc_sysctl(xch, &sysctl);
+
+    return ret;
+}
+
+int xc_smt_disable(xc_interface *xch)
+{
+    DECLARE_SYSCTL;
+    int ret;
+
+    sysctl.cmd = XEN_SYSCTL_cpu_hotplug;
+    sysctl.u.cpu_hotplug.cpu = 0;
+    sysctl.u.cpu_hotplug.op = XEN_SYSCTL_CPU_HOTPLUG_SMT_DISABLE;
+    ret = xc_sysctl(xch, &sysctl);
+
+    return ret;
+}
+
diff --git a/tools/misc/xen-hptool.c b/tools/misc/xen-hptool.c
index 40cd966..6e27d9c 100644
--- a/tools/misc/xen-hptool.c
+++ b/tools/misc/xen-hptool.c
@@ -19,6 +19,8 @@  void show_help(void)
             "  mem-online    <mfn>      online MEMORY <mfn>\n"
             "  mem-offline   <mfn>      offline MEMORY <mfn>\n"
             "  mem-status    <mfn>      query Memory status<mfn>\n"
+            "  smt-enable               onlines all SMT threads\n"
+            "  smt-disable              offlines all SMT threads\n"
            );
 }
 
@@ -304,6 +306,58 @@  static int hp_cpu_offline_func(int argc, char *argv[])
     return ret;
 }
 
+static int main_smt_enable(int argc, char *argv[])
+{
+    int ret;
+
+    if ( argc )
+    {
+        show_help();
+        return -1;
+    }
+
+    for ( ;; )
+    {
+        ret = xc_smt_enable(xch);
+        if ( (ret >= 0) || (errno != EBUSY) )
+            break;
+    }
+
+    if ( ret < 0 )
+        fprintf(stderr, "Unable to enable SMT: errno %d, %s\n",
+                errno, strerror(errno));
+    else
+        printf("Enabled SMT\n");
+
+    return ret;
+}
+
+static int main_smt_disable(int argc, char *argv[])
+{
+    int ret;
+
+    if ( argc )
+    {
+        show_help();
+        return -1;
+    }
+
+    for ( ;; )
+    {
+        ret = xc_smt_disable(xch);
+        if ( (ret >= 0) || (errno != EBUSY) )
+            break;
+    }
+
+    if ( ret < 0 )
+        fprintf(stderr, "Unable to disable SMT: errno %d, %s\n",
+                errno, strerror(errno));
+    else
+        printf("Disabled SMT\n");
+
+    return ret;
+}
+
 struct {
     const char *name;
     int (*function)(int argc, char *argv[]);
@@ -314,6 +368,8 @@  struct {
     { "mem-status", hp_mem_query_func},
     { "mem-online", hp_mem_online_func},
     { "mem-offline", hp_mem_offline_func},
+    { "smt-enable", main_smt_enable },
+    { "smt-disable", main_smt_disable },
 };
 
 
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 3440794..af245eb 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -60,7 +60,7 @@  static bool __initdata opt_nosmp;
 boolean_param("nosmp", opt_nosmp);
 
 /* maxcpus: maximum number of CPUs to activate. */
-static unsigned int __initdata max_cpus;
+unsigned int max_cpus;
 integer_param("maxcpus", max_cpus);
 
 int8_t __read_mostly opt_smt = -1;
diff --git a/xen/arch/x86/sysctl.c b/xen/arch/x86/sysctl.c
index b3cc4b5..0e2e409 100644
--- a/xen/arch/x86/sysctl.c
+++ b/xen/arch/x86/sysctl.c
@@ -114,6 +114,92 @@  long cpu_down_helper(void *data)
     return ret;
 }
 
+static long smt_up_helper(void *data)
+{
+    unsigned int cpu, sibling_mask =
+        (1u << (boot_cpu_data.x86_num_siblings - 1)) - 1;
+    int ret = 0;
+
+    if ( !cpu_has_htt || !sibling_mask )
+        return -EOPNOTSUPP;
+
+    opt_smt = true;
+
+    for_each_present_cpu ( cpu )
+    {
+        if ( cpu == 0 )
+            continue;
+        if ( cpu >= max_cpus )
+            break;
+
+        if ( x86_cpu_to_apicid[cpu] & sibling_mask )
+            ret = cpu_up_helper(_p(cpu));
+
+        /* Tolerate already-online siblings. */
+        if ( ret == -EEXIST )
+            ret = 0;
+
+        if ( ret )
+            break;
+
+        if ( general_preempt_check() )
+        {
+            /* In tasklet context - can't create a continuation. */
+            ret = -EBUSY;
+            break;
+        }
+    }
+
+    if ( !ret )
+        printk(XENLOG_INFO "SMT enabled - online CPUs {%*pbl}\n",
+               nr_cpu_ids, cpumask_bits(&cpu_online_map));
+
+    return ret;
+}
+
+static long smt_down_helper(void *data)
+{
+    unsigned int cpu, sibling_mask =
+        (1u << (boot_cpu_data.x86_num_siblings - 1)) - 1;
+    int ret = 0;
+
+    if ( !cpu_has_htt || !sibling_mask )
+        return -EOPNOTSUPP;
+
+    opt_smt = false;
+
+    for_each_present_cpu ( cpu )
+    {
+        if ( cpu == 0 )
+            continue;
+        if ( cpu >= max_cpus )
+            break;
+
+        if ( x86_cpu_to_apicid[cpu] & sibling_mask )
+            ret = cpu_down_helper(_p(cpu));
+
+        /* Tolerate already-offline siblings. */
+        if ( ret == -EEXIST )
+            ret = 0;
+
+        if ( ret )
+            break;
+
+        if ( general_preempt_check() )
+        {
+            /* In tasklet context - can't create a continuation. */
+            ret = -EBUSY;
+            break;
+        }
+    }
+
+    if ( !ret )
+        printk(XENLOG_INFO "SMT disabled - online CPUs {%*pbl}\n",
+               nr_cpu_ids, cpumask_bits(&cpu_online_map));
+
+    return ret;
+}
+
 void arch_do_physinfo(struct xen_sysctl_physinfo *pi)
 {
     memcpy(pi->hw_cap, boot_cpu_data.x86_capability,
@@ -155,6 +241,16 @@  long arch_do_sysctl(
             hcpu = (void *)(unsigned long)cpu;
             break;
 
+        case XEN_SYSCTL_CPU_HOTPLUG_SMT_ENABLE:
+            plug = true;
+            fn = smt_up_helper;
+            break;
+
+        case XEN_SYSCTL_CPU_HOTPLUG_SMT_DISABLE:
+            plug = false;
+            fn = smt_down_helper;
+            break;
+
         default:
             ret = -EOPNOTSUPP;
             break;
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index cef3ffb..8ffcffe 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -154,6 +154,7 @@  extern void (*ctxt_switch_masking)(const struct vcpu *next);
 extern bool_t opt_cpu_info;
 extern u32 cpuid_ext_features;
 extern u64 trampoline_misc_enable_off;
+extern unsigned int max_cpus;
 
 /* Maximum width of physical addresses supported by the hardware. */
 extern unsigned int paddr_bits;
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index c49b4dc..5c43aed 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -246,8 +246,17 @@  struct xen_sysctl_get_pmstat {
 struct xen_sysctl_cpu_hotplug {
     /* IN variables */
     uint32_t cpu;   /* Physical cpu. */
+
+    /* Single CPU enable/disable. */
 #define XEN_SYSCTL_CPU_HOTPLUG_ONLINE  0
 #define XEN_SYSCTL_CPU_HOTPLUG_OFFLINE 1
+
+    /*
+     * SMT enable/disable. Caller must zero the 'cpu' field to begin, and
+     * ignore it on completion.
+     */
+#define XEN_SYSCTL_CPU_HOTPLUG_SMT_ENABLE  2
+#define XEN_SYSCTL_CPU_HOTPLUG_SMT_DISABLE 3
     uint32_t op;    /* hotplug opcode */
 };