diff mbox series

xen/sched: Remove d->is_pinned

Message ID 1554113345-15068-1-git-send-email-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Series xen/sched: Remove d->is_pinned

Commit Message

Andrew Cooper April 1, 2019, 10:09 a.m. UTC
The is_pinned field is rather odd.  It can only be activated with the
"dom0_vcpus_pin" command line option, and causes dom0 (or the late hwdom) to
have its vcpus identity pinned to pcpus.

Having dom0_vcpus_pin active disallows the use of vcpu_set_hard_affinity().
However, when a pcpu is hot unplugged, or moved between cpupools, the affinity
is broken and reverts to cpumask_all.  This can result in vcpus which are no
longer pinned, and cannot be adjusted.

A related bit of functionality is the is_pinned_vcpu() predicate.  This is
only used by x86 code, and permits the use of VCPUOP_get_physid and writeable
access to some extra MSRs.

The implementation however returns true for is_pinned (which will include
unpinned vcpus from the above scenario), *or* if the hard affinity mask only
has a single bit set (which is redundant with the intended effect of
is_pinned, but also includes other domains).

Rework the behaviour of "dom0_vcpus_pin" to be only an initial pinning
configuration, and permit full adjustment.  This allows the user to
reconfigure dom0 after the fact or fix up from the fallout of cpu hot unplug
and cpupool manipulation.

An unprivileged domain has no business using VCPUOP_get_physid, and shouldn't
be able to just because it happens to be pinned by admin choice.  All uses of
is_pinned_vcpu() should be restricted to the hardware domain, so rename it to
is_hwdom_pinned_vcpu() to avoid future misuse.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien.grall@arm.com>
CC: George Dunlap <george.dunlap@eu.citrix.com>
CC: Dario Faggioli <dfaggioli@suse.com>
CC: Juergen Gross <jgross@suse.com>

Other oddities I've noticed, which should probably be fixed in due course:

1) Permitting real writes to MSR_IA32_UCODE_REV is pointless.  The only
   legitimate value to occur here is 0, and that is already properly accounted
   for in the read handler.

2) dom0_setup_vcpu() results in 2 or 3 calls to sched_set_affinity(), which is
   1 or 2 too many.

3) cpumask_weight() is a surprisingly expensive function, and we use it all
   over the place.  It will almost certainly benefit from the use of popcnt.
---
 xen/arch/x86/dom0_build.c      |  2 +-
 xen/arch/x86/domain.c          |  2 +-
 xen/arch/x86/pv/emul-priv-op.c |  9 ++++-----
 xen/common/domain.c            |  1 -
 xen/common/schedule.c          |  5 +----
 xen/include/xen/sched.h        | 10 ++++++----
 6 files changed, 13 insertions(+), 16 deletions(-)

Comments

Jan Beulich April 1, 2019, 12:21 p.m. UTC | #1
>>> On 01.04.19 at 12:09, <andrew.cooper3@citrix.com> wrote:
> 3) cpumask_weight() is a surprisingly expensive function, and we use it all
>    over the place.  It will almost certainly benefit from the use of popcnt.

FTR this was something I was meaning to possibly do once the BMI2
alternatives patching series would have been accepted. It having
got rejected (and also the fsgsbase patching one, but that's moot
now anyway due to selection of insn having become dependent on a
runtime property), I've stopped all considerations towards further
alternatives. While there wouldn't be as many helper functions
needed here as they were needed there, our basic disagreement
as to whether to use per-register helper functions would continue
to exist.

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1229,7 +1229,7 @@ arch_do_vcpu_op(
>          struct vcpu_get_physid cpu_id;
>  
>          rc = -EINVAL;
> -        if ( !is_pinned_vcpu(v) )
> +        if ( !is_hwdom_pinned_vcpu(v) )
>              break;

I think the idea here was to allow eventual use of this in a highly
disaggregated environment by other than Dom0. But this is not
to say that I would mind this change; in fact
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan
Andrew Cooper April 1, 2019, 12:35 p.m. UTC | #2
On 01/04/2019 13:21, Jan Beulich wrote:
>>>> On 01.04.19 at 12:09, <andrew.cooper3@citrix.com> wrote:
>> 3) cpumask_weight() is a surprisingly expensive function, and we use it all
>>    over the place.  It will almost certainly benefit from the use of popcnt.
> FTR this was something I was meaning to possibly do once the BMI2
> alternatives patching series would have been accepted. It having
> got rejected (and also the fsgsbase patching one, but that's moot
> now anyway due to selection of insn having become dependent on a
> runtime property), I've stopped all considerations towards further
> alternatives. While there wouldn't be as many helper functions
> needed here as they were needed there, our basic disagreement
> as to whether to use per-register helper functions would continue
> to exist.

The Linux way of fixing this is actually very neat.

Depending on the CPU, there may genuinely be a call to a software
routine, or a single popcnt instruction.  Therefore, the end result is:

asm (ALTERNATIVE("call __sw_hweight64",
                 "popcnt %[in], %[out]", X86_FEATURE_POPCNT)
                 : [out] "=a" (res)
                 : [in] "D" (w));


etc.  I also note that Linux now has a more optimised software version
for x86 and Arm64, due to the architecture selecting
CONFIG_ARCH_HAS_FAST_MULTIPLIER.

>
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1229,7 +1229,7 @@ arch_do_vcpu_op(
>>          struct vcpu_get_physid cpu_id;
>>  
>>          rc = -EINVAL;
>> -        if ( !is_pinned_vcpu(v) )
>> +        if ( !is_hwdom_pinned_vcpu(v) )
>>              break;
> I think the idea here was to allow eventual use of this in a highly
> disaggregated environment by other than Dom0. But this is not
> to say that I would mind this change; in fact

I suppose that usecase is fine in principle, but should come with some
suitable XSM.

> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks,

~Andrew
Dario Faggioli April 1, 2019, 1:37 p.m. UTC | #3
On Mon, 2019-04-01 at 11:09 +0100, Andrew Cooper wrote:
> The is_pinned field is rather odd.  
>
Indeed. And I'm very, very happy to see it going away.

> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

Regards,
Dario
Jan Beulich April 1, 2019, 1:52 p.m. UTC | #4
>>> On 01.04.19 at 14:35, <andrew.cooper3@citrix.com> wrote:
> On 01/04/2019 13:21, Jan Beulich wrote:
>>>>> On 01.04.19 at 12:09, <andrew.cooper3@citrix.com> wrote:
>>> 3) cpumask_weight() is a surprisingly expensive function, and we use it all
>>>    over the place.  It will almost certainly benefit from the use of popcnt.
>> FTR this was something I was meaning to possibly do once the BMI2
>> alternatives patching series would have been accepted. It having
>> got rejected (and also the fsgsbase patching one, but that's moot
>> now anyway due to selection of insn having become dependent on a
>> runtime property), I've stopped all considerations towards further
>> alternatives. While there wouldn't be as many helper functions
>> needed here as they were needed there, our basic disagreement
>> as to whether to use per-register helper functions would continue
>> to exist.
> 
> The Linux way of fixing this is actually very neat.
> 
> Depending on the CPU, there may genuinely be a call to a software
> routine, or a single popcnt instruction.  Therefore, the end result is:
> 
> asm (ALTERNATIVE("call __sw_hweight64",
>                  "popcnt %[in], %[out]", X86_FEATURE_POPCNT)
>                  : [out] "=a" (res)
>                  : [in] "D" (w));

But that's precisely the reason for our disagreement on the BMI2
series: I dislike taking away all control from the compiler as to
which register to choose.

> etc.  I also note that Linux now has a more optimised software version
> for x86 and Arm64, due to the architecture selecting
> CONFIG_ARCH_HAS_FAST_MULTIPLIER

This may indeed be worthwhile to clone.

Jan

Patch

diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
index 6ebe367..73f5407 100644
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -213,7 +213,7 @@  struct vcpu *__init dom0_setup_vcpu(struct domain *d,
         }
         else
         {
-            if ( !d->is_pinned && !dom0_affinity_relaxed )
+            if ( !opt_dom0_vcpus_pin && !dom0_affinity_relaxed )
                 sched_set_affinity(v, &dom0_cpus, NULL);
             sched_set_affinity(v, NULL, &dom0_cpus);
         }
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 8d579e2..20b86fd 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1229,7 +1229,7 @@  arch_do_vcpu_op(
         struct vcpu_get_physid cpu_id;
 
         rc = -EINVAL;
-        if ( !is_pinned_vcpu(v) )
+        if ( !is_hwdom_pinned_vcpu(v) )
             break;
 
         cpu_id.phys_id =
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 3746e2a..84ce67c 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -1017,7 +1017,7 @@  static int write_msr(unsigned int reg, uint64_t val,
         if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
              boot_cpu_data.x86 < 0x10 || boot_cpu_data.x86 > 0x17 )
             break;
-        if ( !is_hardware_domain(currd) || !is_pinned_vcpu(curr) )
+        if ( !is_hwdom_pinned_vcpu(curr) )
             return X86EMUL_OKAY;
         if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) ||
              ((val ^ temp) & ~(1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)) )
@@ -1030,7 +1030,7 @@  static int write_msr(unsigned int reg, uint64_t val,
         if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
              boot_cpu_data.x86 < 0x10 || boot_cpu_data.x86 > 0x17 )
             break;
-        if ( !is_hardware_domain(currd) || !is_pinned_vcpu(curr) )
+        if ( !is_hwdom_pinned_vcpu(curr) )
             return X86EMUL_OKAY;
         if ( rdmsr_safe(MSR_FAM10H_MMIO_CONF_BASE, temp) != 0 )
             break;
@@ -1050,7 +1050,7 @@  static int write_msr(unsigned int reg, uint64_t val,
     case MSR_IA32_UCODE_REV:
         if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL )
             break;
-        if ( !is_hardware_domain(currd) || !is_pinned_vcpu(curr) )
+        if ( !is_hwdom_pinned_vcpu(curr) )
             return X86EMUL_OKAY;
         if ( rdmsr_safe(reg, temp) )
             break;
@@ -1086,8 +1086,7 @@  static int write_msr(unsigned int reg, uint64_t val,
     case MSR_IA32_ENERGY_PERF_BIAS:
         if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL )
             break;
-        if ( !is_hardware_domain(currd) || !is_pinned_vcpu(curr) ||
-             wrmsr_safe(reg, val) == 0 )
+        if ( !is_hwdom_pinned_vcpu(curr) || wrmsr_safe(reg, val) == 0 )
             return X86EMUL_OKAY;
         break;
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 3b18f11..a1f8bb4 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -342,7 +342,6 @@  struct domain *domain_create(domid_t domid,
         if ( hardware_domid < 0 || hardware_domid >= DOMID_FIRST_RESERVED )
             panic("The value of hardware_dom must be a valid domain ID\n");
 
-        d->is_pinned = opt_dom0_vcpus_pin;
         d->disable_migrate = 1;
         old_hwdom = hardware_domain;
         hardware_domain = d;
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 60755a6..76d6078 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -276,7 +276,7 @@  int sched_init_vcpu(struct vcpu *v, unsigned int processor)
      * Initialize affinity settings. The idler, and potentially
      * domain-0 VCPUs, are pinned onto their respective physical CPUs.
      */
-    if ( is_idle_domain(d) || d->is_pinned )
+    if ( is_idle_domain(d) || (is_hardware_domain(d) && opt_dom0_vcpus_pin) )
         sched_set_affinity(v, cpumask_of(processor), &cpumask_all);
     else
         sched_set_affinity(v, &cpumask_all, &cpumask_all);
@@ -958,9 +958,6 @@  int vcpu_set_hard_affinity(struct vcpu *v, const cpumask_t *affinity)
     cpumask_t online_affinity;
     cpumask_t *online;
 
-    if ( v->domain->is_pinned )
-        return -EINVAL;
-
     online = VCPU2ONLINE(v);
     cpumask_and(&online_affinity, affinity, online);
     if ( cpumask_empty(&online_affinity) )
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index edee52d..6d23b6d 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -381,8 +381,6 @@  struct domain
     bool             is_console;
     /* Is this a xenstore domain (not dom0)? */
     bool             is_xenstore;
-    /* Domain's VCPUs are pinned 1:1 to physical CPUs? */
-    bool             is_pinned;
     /* Non-migratable and non-restoreable? */
     bool             disable_migrate;
     /* Is this guest being debugged by dom0? */
@@ -961,8 +959,12 @@  static inline bool is_hvm_vcpu(const struct vcpu *v)
     return is_hvm_domain(v->domain);
 }
 
-#define is_pinned_vcpu(v) ((v)->domain->is_pinned || \
-                           cpumask_weight((v)->cpu_hard_affinity) == 1)
+static inline bool is_hwdom_pinned_vcpu(const struct vcpu *v)
+{
+    return (is_hardware_domain(v->domain) &&
+            cpumask_weight(v->cpu_hard_affinity) == 1);
+}
+
 #ifdef CONFIG_HAS_PASSTHROUGH
 #define has_iommu_pt(d) (dom_iommu(d)->status != IOMMU_STATUS_disabled)
 #define need_iommu_pt_sync(d) (dom_iommu(d)->need_sync)