diff mbox

[v3,2/2] x86/Intel: virtualize support for cpuid faulting

Message ID 20161014194736.5913-3-khuey@kylehuey.com (mailing list archive)
State New, archived
Headers show

Commit Message

Kyle Huey Oct. 14, 2016, 7:47 p.m. UTC
On HVM guests, the cpuid triggers a vm exit, so we can check the emulated
faulting state in vmx_do_cpuid and inject a GP(0) if CPL > 0. Notably no
hardware support for faulting on cpuid is necessary to emulate support with an
HVM guest.

On PV guests, hardware support is required so that userspace cpuid will trap
to xen. Xen already enables cpuid faulting on supported CPUs for pv guests (that
aren't the control domain, see the comment in intel_ctxt_switch_levelling).
Every PV guest cpuid will trap via a GP(0) to emulate_privileged_op (via
do_general_protection). Once there we simply decline to emulate cpuid if the
CPL > 0 and faulting is enabled, leaving the GP(0) for the guest kernel to
handle.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
---
 xen/arch/x86/hvm/vmx/vmx.c   | 24 ++++++++++++++++++++++--
 xen/arch/x86/traps.c         | 34 ++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h |  3 +++
 3 files changed, 59 insertions(+), 2 deletions(-)

Comments

Wei Liu Oct. 17, 2016, 12:32 p.m. UTC | #1
On Fri, Oct 14, 2016 at 12:47:36PM -0700, Kyle Huey wrote:
> On HVM guests, the cpuid triggers a vm exit, so we can check the emulated
> faulting state in vmx_do_cpuid and inject a GP(0) if CPL > 0. Notably no
> hardware support for faulting on cpuid is necessary to emulate support with an
> HVM guest.
> 
> On PV guests, hardware support is required so that userspace cpuid will trap
> to xen. Xen already enables cpuid faulting on supported CPUs for pv guests (that
> aren't the control domain, see the comment in intel_ctxt_switch_levelling).
> Every PV guest cpuid will trap via a GP(0) to emulate_privileged_op (via
> do_general_protection). Once there we simply decline to emulate cpuid if the
> CPL > 0 and faulting is enabled, leaving the GP(0) for the guest kernel to
> handle.
> 
> Signed-off-by: Kyle Huey <khuey@kylehuey.com>

Andrew expressed the desire of taking this patch into 4.8. After reading
the description and code in detail, I think this patch falls into the
"nice-to-have" category.

The main risk here is this patch doesn't have architecturally correct
behaviour. I would like to see an ack or review from VT maintainers to
make this patch eligible for acceptance.

Another thing to consider is timing. We plan to cut RC3 before Friday
this week, so if this patch can be acked and becomes part of RC3 I'm
fine with applying it. If not, we shall revisit the situation when it is
acked.

Wei.
Andrew Cooper Oct. 17, 2016, 12:49 p.m. UTC | #2
On 14/10/16 20:47, Kyle Huey wrote:
> On HVM guests, the cpuid triggers a vm exit, so we can check the emulated
> faulting state in vmx_do_cpuid and inject a GP(0) if CPL > 0. Notably no
> hardware support for faulting on cpuid is necessary to emulate support with an
> HVM guest.
>
> On PV guests, hardware support is required so that userspace cpuid will trap
> to xen. Xen already enables cpuid faulting on supported CPUs for pv guests (that

to Xen.

> aren't the control domain, see the comment in intel_ctxt_switch_levelling).
> Every PV guest cpuid will trap via a GP(0) to emulate_privileged_op (via
> do_general_protection). Once there we simply decline to emulate cpuid if the
> CPL > 0 and faulting is enabled, leaving the GP(0) for the guest kernel to
> handle.
>
> Signed-off-by: Kyle Huey <khuey@kylehuey.com>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c   | 24 ++++++++++++++++++++++--
>  xen/arch/x86/traps.c         | 34 ++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/domain.h |  3 +++
>  3 files changed, 59 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index b9102ce..c038393 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -2427,16 +2427,25 @@ static void vmx_cpuid_intercept(
>  
>      HVMTRACE_5D (CPUID, input, *eax, *ebx, *ecx, *edx);
>  }
>  
>  static int vmx_do_cpuid(struct cpu_user_regs *regs)
>  {
>      unsigned int eax, ebx, ecx, edx;
>      unsigned int leaf, subleaf;
> +    struct segment_register sreg;
> +    struct vcpu *v = current;
> +
> +    hvm_get_segment_register(v, x86_seg_ss, &sreg);
> +    if ( v->arch.cpuid_fault && sreg.attr.fields.dpl > 0 )
> +    {
> +        hvm_inject_hw_exception(TRAP_gp_fault, 0);
> +        return 1; /* Don't advance the guest IP! */
> +    }

Thinking about it, the segment register query can be skipped in the
likely case that faulting isn't enabled.  Could this be re-arranged to:

if ( v->arch.cpuid_fault )
{
    struct segment_register sreg;

    hvm_get_segment_register(v, x86_seg_ss, &sreg);
    if ( sreg.attr.fields.dpl > 0 )
    {
        hvm_inject_hw_exception(TRAP_gp_fault, 0);
        return 1; /* Don't advance the guest IP! */
    }
}

With these two minor issues taken care of, Reviewed-by: Andrew Cooper
<andrew.cooper3@citrix.com>
Kyle Huey Oct. 20, 2016, 5:10 a.m. UTC | #3
On Mon, Oct 17, 2016 at 5:32 AM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Fri, Oct 14, 2016 at 12:47:36PM -0700, Kyle Huey wrote:
>> On HVM guests, the cpuid triggers a vm exit, so we can check the emulated
>> faulting state in vmx_do_cpuid and inject a GP(0) if CPL > 0. Notably no
>> hardware support for faulting on cpuid is necessary to emulate support with an
>> HVM guest.
>>
>> On PV guests, hardware support is required so that userspace cpuid will trap
>> to xen. Xen already enables cpuid faulting on supported CPUs for pv guests (that
>> aren't the control domain, see the comment in intel_ctxt_switch_levelling).
>> Every PV guest cpuid will trap via a GP(0) to emulate_privileged_op (via
>> do_general_protection). Once there we simply decline to emulate cpuid if the
>> CPL > 0 and faulting is enabled, leaving the GP(0) for the guest kernel to
>> handle.
>>
>> Signed-off-by: Kyle Huey <khuey@kylehuey.com>
>
> Andrew expressed the desire of taking this patch into 4.8. After reading
> the description and code in detail, I think this patch falls into the
> "nice-to-have" category.
>
> The main risk here is this patch doesn't have architecturally correct
> behaviour. I would like to see an ack or review from VT maintainers to
> make this patch eligible for acceptance.
>
> Another thing to consider is timing. We plan to cut RC3 before Friday
> this week, so if this patch can be acked and becomes part of RC3 I'm
> fine with applying it. If not, we shall revisit the situation when it is
> acked.

Kevin Tian reviewed the patch yesterday, so I think we're just waiting
for a final review from Andrew here.

That said, rr currently does not work in Xen guests due to some PMU
issues that we haven't tracked down yet.  So for us it's not a big
deal if this feature does not make it into 4.8.  I won't be
disappointed if you cut it from 4.8 to reduce technical risk.

- Kyle
Andrew Cooper Oct. 20, 2016, 7:56 a.m. UTC | #4
On 20/10/2016 06:10, Kyle Huey wrote:
> On Mon, Oct 17, 2016 at 5:32 AM, Wei Liu <wei.liu2@citrix.com> wrote:
>> On Fri, Oct 14, 2016 at 12:47:36PM -0700, Kyle Huey wrote:
>>> On HVM guests, the cpuid triggers a vm exit, so we can check the emulated
>>> faulting state in vmx_do_cpuid and inject a GP(0) if CPL > 0. Notably no
>>> hardware support for faulting on cpuid is necessary to emulate support with an
>>> HVM guest.
>>>
>>> On PV guests, hardware support is required so that userspace cpuid will trap
>>> to xen. Xen already enables cpuid faulting on supported CPUs for pv guests (that
>>> aren't the control domain, see the comment in intel_ctxt_switch_levelling).
>>> Every PV guest cpuid will trap via a GP(0) to emulate_privileged_op (via
>>> do_general_protection). Once there we simply decline to emulate cpuid if the
>>> CPL > 0 and faulting is enabled, leaving the GP(0) for the guest kernel to
>>> handle.
>>>
>>> Signed-off-by: Kyle Huey <khuey@kylehuey.com>
>> Andrew expressed the desire of taking this patch into 4.8. After reading
>> the description and code in detail, I think this patch falls into the
>> "nice-to-have" category.
>>
>> The main risk here is this patch doesn't have architecturally correct
>> behaviour. I would like to see an ack or review from VT maintainers to
>> make this patch eligible for acceptance.
>>
>> Another thing to consider is timing. We plan to cut RC3 before Friday
>> this week, so if this patch can be acked and becomes part of RC3 I'm
>> fine with applying it. If not, we shall revisit the situation when it is
>> acked.
> Kevin Tian reviewed the patch yesterday, so I think we're just waiting
> for a final review from Andrew here.

Ah - I am just waiting for your final respin with the comments so far
addressed.

>
> That said, rr currently does not work in Xen guests due to some PMU
> issues that we haven't tracked down yet.

Is this RR trying to use vPMU and it not functioning, or not
specifically trying to use PMU facilities and getting stuck anyway?

>   So for us it's not a big
> deal if this feature does not make it into 4.8.  I won't be
> disappointed if you cut it from 4.8 to reduce technical risk.

From my point of view, its a small feature with working code and a
comprehensive test case ready to go straight into regression testing. 
This makes it the least risky feature going.

~Andrew
Kyle Huey Oct. 20, 2016, 1:55 p.m. UTC | #5
On Thu, Oct 20, 2016 at 12:56 AM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 20/10/2016 06:10, Kyle Huey wrote:
>> On Mon, Oct 17, 2016 at 5:32 AM, Wei Liu <wei.liu2@citrix.com> wrote:
>>> On Fri, Oct 14, 2016 at 12:47:36PM -0700, Kyle Huey wrote:
>>>> On HVM guests, the cpuid triggers a vm exit, so we can check the emulated
>>>> faulting state in vmx_do_cpuid and inject a GP(0) if CPL > 0. Notably no
>>>> hardware support for faulting on cpuid is necessary to emulate support with an
>>>> HVM guest.
>>>>
>>>> On PV guests, hardware support is required so that userspace cpuid will trap
>>>> to xen. Xen already enables cpuid faulting on supported CPUs for pv guests (that
>>>> aren't the control domain, see the comment in intel_ctxt_switch_levelling).
>>>> Every PV guest cpuid will trap via a GP(0) to emulate_privileged_op (via
>>>> do_general_protection). Once there we simply decline to emulate cpuid if the
>>>> CPL > 0 and faulting is enabled, leaving the GP(0) for the guest kernel to
>>>> handle.
>>>>
>>>> Signed-off-by: Kyle Huey <khuey@kylehuey.com>
>>> Andrew expressed the desire of taking this patch into 4.8. After reading
>>> the description and code in detail, I think this patch falls into the
>>> "nice-to-have" category.
>>>
>>> The main risk here is this patch doesn't have architecturally correct
>>> behaviour. I would like to see an ack or review from VT maintainers to
>>> make this patch eligible for acceptance.
>>>
>>> Another thing to consider is timing. We plan to cut RC3 before Friday
>>> this week, so if this patch can be acked and becomes part of RC3 I'm
>>> fine with applying it. If not, we shall revisit the situation when it is
>>> acked.
>> Kevin Tian reviewed the patch yesterday, so I think we're just waiting
>> for a final review from Andrew here.
>
> Ah - I am just waiting for your final respin with the comments so far
> addressed.

Oh.  I thought I already did that ... though apparently I didn't.  I
must have forgotten to remove --dry-run or something.  Anyways, it's
sent now.

>> That said, rr currently does not work in Xen guests due to some PMU
>> issues that we haven't tracked down yet.
>
> Is this RR trying to use vPMU and it not functioning, or not
> specifically trying to use PMU facilities and getting stuck anyway?

The latter.  rr relies on the values returned by the PMU (the retired
conditional branches counter in particular) being exactly the same
during the recording and replay phases.  This is true when running on
bare metal, and when running inside a KVM guest, but when running in a
Xen HVM guest we see values that are off by a branch or two on a small
fraction of our tests.  Since it works in KVM I suspect this is some
sort of issue with how Xen multiplexes the real PMU and events are
"leaking" between guests (or perhaps from Xen itself, though I don't
think the Xen kernel executes any ring 3 code).  Even if that's
correct we're a long way from tracking it down and patching it though.

>>   So for us it's not a big
>> deal if this feature does not make it into 4.8.  I won't be
>> disappointed if you cut it from 4.8 to reduce technical risk.
>
> From my point of view, its a small feature with working code and a
> comprehensive test case ready to go straight into regression testing.
> This makes it the least risky feature going.

- Kyle
Andrew Cooper Oct. 20, 2016, 2:11 p.m. UTC | #6
On 20/10/16 14:55, Kyle Huey wrote:
>
>>> That said, rr currently does not work in Xen guests due to some PMU
>>> issues that we haven't tracked down yet.
>> Is this RR trying to use vPMU and it not functioning, or not
>> specifically trying to use PMU facilities and getting stuck anyway?
> The latter.  rr relies on the values returned by the PMU (the retired
> conditional branches counter in particular) being exactly the same
> during the recording and replay phases.  This is true when running on
> bare metal, and when running inside a KVM guest, but when running in a
> Xen HVM guest we see values that are off by a branch or two on a small
> fraction of our tests.  Since it works in KVM I suspect this is some
> sort of issue with how Xen multiplexes the real PMU and events are
> "leaking" between guests (or perhaps from Xen itself, though I don't
> think the Xen kernel executes any ring 3 code).  Even if that's
> correct we're a long way from tracking it down and patching it though.

Hmm.  That is unfortunate, and does point towards a bug in Xen.  Are
these tests which notice the problem easy to run?

Boris (CC'd) is the maintainer of that code.  It has undergone quite a
few changes recently.

~Andrew
Boris Ostrovsky Oct. 20, 2016, 2:40 p.m. UTC | #7
On 10/20/2016 10:11 AM, Andrew Cooper wrote:
> On 20/10/16 14:55, Kyle Huey wrote:
>>>> That said, rr currently does not work in Xen guests due to some PMU
>>>> issues that we haven't tracked down yet.
>>> Is this RR trying to use vPMU and it not functioning, or not
>>> specifically trying to use PMU facilities and getting stuck anyway?
>> The latter.  rr relies on the values returned by the PMU (the retired
>> conditional branches counter in particular) being exactly the same
>> during the recording and replay phases.  This is true when running on
>> bare metal, and when running inside a KVM guest, but when running in a
>> Xen HVM guest we see values that are off by a branch or two on a small
>> fraction of our tests.  Since it works in KVM I suspect this is some
>> sort of issue with how Xen multiplexes the real PMU and events are
>> "leaking" between guests (or perhaps from Xen itself, though I don't
>> think the Xen kernel executes any ring 3 code).  Even if that's
>> correct we're a long way from tracking it down and patching it though.
> Hmm.  That is unfortunate, and does point towards a bug in Xen.  Are
> these tests which notice the problem easy to run?
>
> Boris (CC'd) is the maintainer of that code.  It has undergone quite a
> few changes recently.

I am actually not the maintainer, I just break this code more often than
others.

But yes, having a test case would make it much easier to understand what
and why is not working.

Would something like

    wrmsr(PERFCTR,0);
    wrmsr(EVNTSEL, XXX); //enable counter
    // do something simple, with branches
    wrmsr(EVTSEL,YYY); // disable counter
   
demonstrate the problem? (I assume we are talking about HVM guest)

-boris
Kyle Huey Oct. 21, 2016, 3:52 p.m. UTC | #8
On Thu, Oct 20, 2016 at 7:40 AM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 10/20/2016 10:11 AM, Andrew Cooper wrote:
>> On 20/10/16 14:55, Kyle Huey wrote:
>>>>> That said, rr currently does not work in Xen guests due to some PMU
>>>>> issues that we haven't tracked down yet.
>>>> Is this RR trying to use vPMU and it not functioning, or not
>>>> specifically trying to use PMU facilities and getting stuck anyway?
>>> The latter.  rr relies on the values returned by the PMU (the retired
>>> conditional branches counter in particular) being exactly the same
>>> during the recording and replay phases.  This is true when running on
>>> bare metal, and when running inside a KVM guest, but when running in a
>>> Xen HVM guest we see values that are off by a branch or two on a small
>>> fraction of our tests.  Since it works in KVM I suspect this is some
>>> sort of issue with how Xen multiplexes the real PMU and events are
>>> "leaking" between guests (or perhaps from Xen itself, though I don't
>>> think the Xen kernel executes any ring 3 code).  Even if that's
>>> correct we're a long way from tracking it down and patching it though.
>> Hmm.  That is unfortunate, and does point towards a bug in Xen.  Are
>> these tests which notice the problem easy to run?
>>
>> Boris (CC'd) is the maintainer of that code.  It has undergone quite a
>> few changes recently.
>
> I am actually not the maintainer, I just break this code more often than
> others.
>
> But yes, having a test case would make it much easier to understand what
> and why is not working.
>
> Would something like
>
>     wrmsr(PERFCTR,0);
>     wrmsr(EVNTSEL, XXX); //enable counter
>     // do something simple, with branches
>     wrmsr(EVTSEL,YYY); // disable counter
>
> demonstrate the problem? (I assume we are talking about HVM guest)
>
> -boris
>

That is a good question.  I'll see if I can reduce the problem down
from "run Linux and run our tests inside it".

- Kyle
Kyle Huey Oct. 24, 2016, 4:18 a.m. UTC | #9
On Fri, Oct 21, 2016 at 8:52 AM, Kyle Huey <me@kylehuey.com> wrote:
> On Thu, Oct 20, 2016 at 7:40 AM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 10/20/2016 10:11 AM, Andrew Cooper wrote:
>>> On 20/10/16 14:55, Kyle Huey wrote:
>>>>>> That said, rr currently does not work in Xen guests due to some PMU
>>>>>> issues that we haven't tracked down yet.
>>>>> Is this RR trying to use vPMU and it not functioning, or not
>>>>> specifically trying to use PMU facilities and getting stuck anyway?
>>>> The latter.  rr relies on the values returned by the PMU (the retired
>>>> conditional branches counter in particular) being exactly the same
>>>> during the recording and replay phases.  This is true when running on
>>>> bare metal, and when running inside a KVM guest, but when running in a
>>>> Xen HVM guest we see values that are off by a branch or two on a small
>>>> fraction of our tests.  Since it works in KVM I suspect this is some
>>>> sort of issue with how Xen multiplexes the real PMU and events are
>>>> "leaking" between guests (or perhaps from Xen itself, though I don't
>>>> think the Xen kernel executes any ring 3 code).  Even if that's
>>>> correct we're a long way from tracking it down and patching it though.
>>> Hmm.  That is unfortunate, and does point towards a bug in Xen.  Are
>>> these tests which notice the problem easy to run?
>>>
>>> Boris (CC'd) is the maintainer of that code.  It has undergone quite a
>>> few changes recently.
>>
>> I am actually not the maintainer, I just break this code more often than
>> others.
>>
>> But yes, having a test case would make it much easier to understand what
>> and why is not working.
>>
>> Would something like
>>
>>     wrmsr(PERFCTR,0);
>>     wrmsr(EVNTSEL, XXX); //enable counter
>>     // do something simple, with branches
>>     wrmsr(EVTSEL,YYY); // disable counter
>>
>> demonstrate the problem? (I assume we are talking about HVM guest)
>>
>> -boris
>>
>
> That is a good question.  I'll see if I can reduce the problem down
> from "run Linux and run our tests inside it".

The anomalies we see appear to be related to, or at least triggerable
by, the performance monitoring interrupt.  The following program runs
a loop of roughly 2^25 conditional branches.  It takes one argument,
the number of conditional branches to program the PMI to trigger on.
The default is 50,000, and if you run the program with that it'll
produce the same value every time.  If you drop it to 5000 or so
you'll probably see occasional off-by-one discrepancies.  If you drop
it to 500 the performance counter values fluctuate wildly.

I'm not yet sure if this is specifically related to the PMI, or if it
can be caused by any interrupt and it's only how frequently the
interrupts occur that matters.

- Kyle

#define _GNU_SOURCE 1

#include <assert.h>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static struct perf_event_attr rcb_attr;
static uint64_t period;
static int fd;

void counter_on(uint64_t ticks)
{
  int ret = ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  assert(!ret);
  ret = ioctl(fd, PERF_EVENT_IOC_PERIOD, &ticks);
  assert(!ret);
  ret = ioctl(fd, PERF_EVENT_IOC_ENABLE, 1);
  assert(!ret);
}

void counter_off()
{
  int ret = ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  assert(!ret);
}

int64_t read_counter()
{
  int64_t val;
  ssize_t nread = read(fd, &val, sizeof(val));
  assert(nread == sizeof(val));
  return val;
}

void do_test()
{
  int64_t counts;
  int i, dummy;

  counter_on(period);
  for (i = 0; i < (1 << 25); i++) {
    dummy += i % (1 << 10);
    dummy += i % (79 * (1 << 10));
  }

  counter_off();
  counts = read_counter();
  printf("Counted %ld conditional branches\n", counts);
}

int main(int argc, const char* argv[])
{
  memset(&rcb_attr, 0, sizeof(rcb_attr));
  rcb_attr.size = sizeof(rcb_attr);
  rcb_attr.type = PERF_TYPE_RAW;
  /* Intel retired conditional branches counter, ring 3 only */
  rcb_attr.config = 0x5101c4;
  rcb_attr.exclude_kernel = 1;
  rcb_attr.exclude_guest = 1;
  /* We'll change this later */
  rcb_attr.sample_period = 0xffffffff;

  /* start the counter */
  fd = syscall(__NR_perf_event_open, &rcb_attr, 0, -1, -1, 0);
  if (fd < 0) {
    printf("Failed to initialize counter\n");
    return -1;
  }

  signal(SIGALRM, SIG_IGN);

  if (fcntl(fd, F_SETFL, O_ASYNC) || fcntl(fd, F_SETSIG, SIGALRM)) {
    printf("Failed to make counter async\n");
    return -1;
  }

  counter_off();

  period = 50000;
  if (argc > 1) {
    sscanf(argv[1], "%ld", &period);
  }

  printf("Period is %ld\n", period);

  do_test();

  return 0;
}
Boris Ostrovsky Oct. 24, 2016, 3:05 p.m. UTC | #10
On 10/24/2016 12:18 AM, Kyle Huey wrote:
>
> The anomalies we see appear to be related to, or at least triggerable
> by, the performance monitoring interrupt.  The following program runs
> a loop of roughly 2^25 conditional branches.  It takes one argument,
> the number of conditional branches to program the PMI to trigger on.
> The default is 50,000, and if you run the program with that it'll
> produce the same value every time.  If you drop it to 5000 or so
> you'll probably see occasional off-by-one discrepancies.  If you drop
> it to 500 the performance counter values fluctuate wildly.

Yes, it does change but I also see the difference on baremetal (although
not as big as it is in an HVM guest):
ostr@workbase> ./pmu 500
Period is 500
Counted 5950003 conditional branches
ostr@workbase> ./pmu 500
Period is 500
Counted 5850003 conditional branches
ostr@workbase> ./pmu 500
Period is 500
Counted 7530107 conditional branches
ostr@workbase>



>
> I'm not yet sure if this is specifically related to the PMI, or if it
> can be caused by any interrupt and it's only how frequently the
> interrupts occur that matters.

I have never used file interface to performance counters, but what are
we reporting here (in read_counter()) --- total number of events or
number of events since last sample? It is also curious to me that the
counter in non-zero after  PERF_EVENT_IOC_RESET (but again, I don't have
any experience with these interfaces).

Also, exclude_guest doesn't appear to make any difference, I don't know
if there are any bits in Intel counters that allow you to distinguish
guest from host (unlike AMD, where there is a bit for that).


-boris
Kyle Huey Oct. 24, 2016, 7:22 p.m. UTC | #11
On Mon, Oct 24, 2016 at 8:05 AM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 10/24/2016 12:18 AM, Kyle Huey wrote:
>>
>> The anomalies we see appear to be related to, or at least triggerable
>> by, the performance monitoring interrupt.  The following program runs
>> a loop of roughly 2^25 conditional branches.  It takes one argument,
>> the number of conditional branches to program the PMI to trigger on.
>> The default is 50,000, and if you run the program with that it'll
>> produce the same value every time.  If you drop it to 5000 or so
>> you'll probably see occasional off-by-one discrepancies.  If you drop
>> it to 500 the performance counter values fluctuate wildly.
>
> Yes, it does change but I also see the difference on baremetal (although
> not as big as it is in an HVM guest):
> ostr@workbase> ./pmu 500
> Period is 500
> Counted 5950003 conditional branches
> ostr@workbase> ./pmu 500
> Period is 500
> Counted 5850003 conditional branches
> ostr@workbase> ./pmu 500
> Period is 500
> Counted 7530107 conditional branches
> ostr@workbase>

Yeah, you're right.  I simplified the testcase too far.  I have
included a better one.  This testcase is stable on bare metal (down to
an interrupt every 10 branches, I didn't try below that) and more
accurately represents what our software actually does.  rr acts as a
ptrace supervisor to the process being recorded, and it seems that
context switching between the supervisor and tracee processes
stabilizes the performance counter values somehow.

>> I'm not yet sure if this is specifically related to the PMI, or if it
>> can be caused by any interrupt and it's only how frequently the
>> interrupts occur that matters.
>
> I have never used file interface to performance counters, but what are
> we reporting here (in read_counter()) --- total number of events or
> number of events since last sample? It is also curious to me that the
> counter in non-zero after  PERF_EVENT_IOC_RESET (but again, I don't have
> any experience with these interfaces).

It should be number of events since the last time the counter was
reset (or overflowed, I guess).  On my machine the counter value is
zero both before and after the PERF_EVENT_IOC_RESET ioctl.

> Also, exclude_guest doesn't appear to make any difference, I don't know
> if there are any bits in Intel counters that allow you to distinguish
> guest from host (unlike AMD, where there is a bit for that).

exclude_guest is a Linux specific thing for excluding KVM guests.
There is no hardware support involved; it's handled entirely in the
perf events infrastructure in the kernel.

- Kyle

#define _GNU_SOURCE 1

#include <assert.h>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static struct perf_event_attr rcb_attr;
static uint64_t period;
static int fd;

void counter_on(uint64_t ticks)
{
  int ret = ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  assert(!ret);
  ret = ioctl(fd, PERF_EVENT_IOC_PERIOD, &ticks);
  assert(!ret);
  ret = ioctl(fd, PERF_EVENT_IOC_ENABLE, 1);
  assert(!ret);
}

void counter_off()
{
  int ret = ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  assert(!ret);
}

int64_t read_counter()
{
  int64_t val;
  ssize_t nread = read(fd, &val, sizeof(val));
  assert(nread == sizeof(val));
  return val;
}

void do_test()
{
  int i, dummy;

  for (i = 0; i < (1 << 25); i++) {
    dummy += i % (1 << 10);
    dummy += i % (79 * (1 << 10));
  }
}

int main(int argc, const char* argv[])
{
  int pid;
  memset(&rcb_attr, 0, sizeof(rcb_attr));
  rcb_attr.size = sizeof(rcb_attr);
  rcb_attr.type = PERF_TYPE_RAW;
  /* Intel retired conditional branches counter, ring 3 only */
  rcb_attr.config = 0x5101c4;
  rcb_attr.exclude_kernel = 1;
  rcb_attr.exclude_guest = 1;
  /* We'll change this later */
  rcb_attr.sample_period = 0xffffffff;

  signal(SIGALRM, SIG_IGN);
  pid = fork();
  if (pid == 0) {
    /* Wait for the parent */
    kill(getpid(), SIGSTOP);
    do_test();
    return 0;
  }

  /* start the counter */
  fd = syscall(__NR_perf_event_open, &rcb_attr, pid, -1, -1, 0);
  if (fd < 0) {
    printf("Failed to initialize counter\n");
    return -1;
  }

  counter_off();

  struct f_owner_ex own;
  own.type = F_OWNER_PID;
  own.pid = pid;
  if (fcntl(fd, F_SETOWN_EX, &own) ||
      fcntl(fd, F_SETFL, O_ASYNC) ||
      fcntl(fd, F_SETSIG, SIGALRM)) {
    printf("Failed to make counter async\n");
    return -1;
  }

  period = 50000;
  if (argc > 1) {
    sscanf(argv[1], "%ld", &period);
  }

  printf("Period is %ld\n", period);

  counter_on(period);
  ptrace(PTRACE_SEIZE, pid, NULL, 0);
  ptrace(PTRACE_CONT, pid, NULL, SIGCONT);

  int status = 0;
  while (1) {
    waitpid(pid, &status, 0);
    if (WIFEXITED(status)) {
      break;
    }
    if (WIFSIGNALED(status)) {
      assert(0);
      continue;
    }
    if (WIFSTOPPED(status)) {
      if (WSTOPSIG(status) == SIGALRM ||
      WSTOPSIG(status) == SIGSTOP) {
    ptrace(PTRACE_CONT, pid, NULL, WSTOPSIG(status));
    continue;
      }
    }
    assert(0 && "unhandled ptrace event!");
  }

  counter_off();
  int64_t counts = read_counter();
  printf("Counted %ld conditional branches\n", counts);

  return 0;
}
Boris Ostrovsky Oct. 24, 2016, 9:15 p.m. UTC | #12
On 10/24/2016 03:22 PM, Kyle Huey wrote:
> On Mon, Oct 24, 2016 at 8:05 AM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 10/24/2016 12:18 AM, Kyle Huey wrote:
>>> The anomalies we see appear to be related to, or at least triggerable
>>> by, the performance monitoring interrupt.  The following program runs
>>> a loop of roughly 2^25 conditional branches.  It takes one argument,
>>> the number of conditional branches to program the PMI to trigger on.
>>> The default is 50,000, and if you run the program with that it'll
>>> produce the same value every time.  If you drop it to 5000 or so
>>> you'll probably see occasional off-by-one discrepancies.  If you drop
>>> it to 500 the performance counter values fluctuate wildly.
>> Yes, it does change but I also see the difference on baremetal (although
>> not as big as it is in an HVM guest):
>> ostr@workbase> ./pmu 500
>> Period is 500
>> Counted 5950003 conditional branches
>> ostr@workbase> ./pmu 500
>> Period is 500
>> Counted 5850003 conditional branches
>> ostr@workbase> ./pmu 500
>> Period is 500
>> Counted 7530107 conditional branches
>> ostr@workbase>
> Yeah, you're right.  I simplified the testcase too far.  I have
> included a better one.  This testcase is stable on bare metal (down to
> an interrupt every 10 branches, I didn't try below that) and more
> accurately represents what our software actually does. 

When I run this program in a loop the first iteration is always off:

ostr@workbase> while [ true ]; do taskset -c 0 ./pmu 500|grep -v Period
; done
Counted 33554556 conditional branches
Counted 33554729 conditional branches
Counted 33554729 conditional branches
...

but then it indeed is stable. Could it be "priming" the branch
predictor? Does this counter count mis-predicted branches? Probably not
since the first number is smaller than the rest.


>  rr acts as a
> ptrace supervisor to the process being recorded, and it seems that
> context switching between the supervisor and tracee processes
> stabilizes the performance counter values somehow.

Not sure I understand what you mean by this. The supervising thread is
presumably sitting in kernel (in waitpid()) so nothing should be counted
for it.

-boris
diff mbox

Patch

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index b9102ce..c038393 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2427,16 +2427,25 @@  static void vmx_cpuid_intercept(
 
     HVMTRACE_5D (CPUID, input, *eax, *ebx, *ecx, *edx);
 }
 
 static int vmx_do_cpuid(struct cpu_user_regs *regs)
 {
     unsigned int eax, ebx, ecx, edx;
     unsigned int leaf, subleaf;
+    struct segment_register sreg;
+    struct vcpu *v = current;
+
+    hvm_get_segment_register(v, x86_seg_ss, &sreg);
+    if ( v->arch.cpuid_fault && sreg.attr.fields.dpl > 0 )
+    {
+        hvm_inject_hw_exception(TRAP_gp_fault, 0);
+        return 1; /* Don't advance the guest IP! */
+    }
 
     eax = regs->eax;
     ebx = regs->ebx;
     ecx = regs->ecx;
     edx = regs->edx;
 
     leaf = regs->eax;
     subleaf = regs->ecx;
@@ -2694,19 +2703,23 @@  static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
     case MSR_CORE_PERF_FIXED_CTR_CTRL...MSR_CORE_PERF_GLOBAL_OVF_CTRL:
     case MSR_IA32_PEBS_ENABLE:
     case MSR_IA32_DS_AREA:
         if ( vpmu_do_rdmsr(msr, msr_content) )
             goto gp_fault;
         break;
 
     case MSR_INTEL_PLATFORM_INFO:
-        if ( rdmsr_safe(MSR_INTEL_PLATFORM_INFO, *msr_content) )
-            goto gp_fault;
+        *msr_content = MSR_PLATFORM_INFO_CPUID_FAULTING;
+        break;
+
+    case MSR_INTEL_MISC_FEATURES_ENABLES:
         *msr_content = 0;
+        if ( current->arch.cpuid_fault )
+            *msr_content |= MSR_MISC_FEATURES_CPUID_FAULTING;
         break;
 
     default:
         if ( passive_domain_do_rdmsr(msr, msr_content) )
             goto done;
         switch ( long_mode_do_msr_read(msr, msr_content) )
         {
             case HNDL_unhandled:
@@ -2925,16 +2938,23 @@  static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content)
         break;
 
     case MSR_INTEL_PLATFORM_INFO:
         if ( msr_content ||
              rdmsr_safe(MSR_INTEL_PLATFORM_INFO, msr_content) )
             goto gp_fault;
         break;
 
+    case MSR_INTEL_MISC_FEATURES_ENABLES:
+        if ( msr_content & ~MSR_MISC_FEATURES_CPUID_FAULTING )
+            goto gp_fault;
+        v->arch.cpuid_fault =
+            !!(msr_content & MSR_MISC_FEATURES_CPUID_FAULTING);
+        break;
+
     default:
         if ( passive_domain_do_wrmsr(msr, msr_content) )
             return X86EMUL_OKAY;
 
         if ( wrmsr_viridian_regs(msr, msr_content) ) 
             break;
 
         switch ( long_mode_do_msr_write(msr, msr_content) )
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 293ff8d..12322bd 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1315,16 +1315,24 @@  static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
     /* We only emulate CPUID. */
     if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
     {
         propagate_page_fault(eip + sizeof(instr) - rc, 0);
         return EXCRET_fault_fixed;
     }
     if ( memcmp(instr, "\xf\xa2", sizeof(instr)) )
         return 0;
+
+    /* If cpuid faulting is enabled and CPL>0 inject a #GP in place of #UD. */
+    if ( current->arch.cpuid_fault && !guest_kernel_mode(current, regs) ) {
+        regs->eip = eip;
+        do_guest_trap(TRAP_gp_fault, regs);
+        return EXCRET_fault_fixed;
+    }
+
     eip += sizeof(instr);
 
     pv_cpuid(regs);
 
     instruction_done(regs, eip, 0);
 
     trace_trap_one_addr(TRC_PV_FORCED_INVALID_OP, regs->eip);
 
@@ -2474,16 +2482,27 @@  static int priv_op_read_msr(unsigned int reg, uint64_t *val,
         *val = 0;
         return X86EMUL_OKAY;
 
     case MSR_INTEL_PLATFORM_INFO:
         if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
              rdmsr_safe(MSR_INTEL_PLATFORM_INFO, *val) )
             break;
         *val = 0;
+        if ( this_cpu(cpuid_faulting_enabled) )
+            *val |= MSR_PLATFORM_INFO_CPUID_FAULTING;
+        return X86EMUL_OKAY;
+
+    case MSR_INTEL_MISC_FEATURES_ENABLES:
+        if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+             rdmsr_safe(MSR_INTEL_MISC_FEATURES_ENABLES, *val) )
+            break;
+        *val = 0;
+        if ( curr->arch.cpuid_fault )
+            *val |= MSR_MISC_FEATURES_CPUID_FAULTING;
         return X86EMUL_OKAY;
 
     case MSR_P6_PERFCTR(0)...MSR_P6_PERFCTR(7):
     case MSR_P6_EVNTSEL(0)...MSR_P6_EVNTSEL(3):
     case MSR_CORE_PERF_FIXED_CTR0...MSR_CORE_PERF_FIXED_CTR2:
     case MSR_CORE_PERF_FIXED_CTR_CTRL...MSR_CORE_PERF_GLOBAL_OVF_CTRL:
         if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
         {
@@ -2677,16 +2696,27 @@  static int priv_op_write_msr(unsigned int reg, uint64_t val,
         return X86EMUL_OKAY;
 
     case MSR_INTEL_PLATFORM_INFO:
         if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
              val || rdmsr_safe(MSR_INTEL_PLATFORM_INFO, val) )
             break;
         return X86EMUL_OKAY;
 
+    case MSR_INTEL_MISC_FEATURES_ENABLES:
+        if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+             (val & ~MSR_MISC_FEATURES_CPUID_FAULTING) ||
+             rdmsr_safe(MSR_INTEL_MISC_FEATURES_ENABLES, temp) )
+            break;
+        if ( (val & MSR_MISC_FEATURES_CPUID_FAULTING) &&
+             !this_cpu(cpuid_faulting_enabled) )
+            break;
+        curr->arch.cpuid_fault = !!(val & MSR_MISC_FEATURES_CPUID_FAULTING);
+        return X86EMUL_OKAY;
+
     case MSR_P6_PERFCTR(0)...MSR_P6_PERFCTR(7):
     case MSR_P6_EVNTSEL(0)...MSR_P6_EVNTSEL(3):
     case MSR_CORE_PERF_FIXED_CTR0...MSR_CORE_PERF_FIXED_CTR2:
     case MSR_CORE_PERF_FIXED_CTR_CTRL...MSR_CORE_PERF_GLOBAL_OVF_CTRL:
         if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
         {
             vpmu_msr = true;
     case MSR_AMD_FAM15H_EVNTSEL0...MSR_AMD_FAM15H_PERFCTR5:
@@ -3186,16 +3216,20 @@  static int emulate_privileged_op(struct cpu_user_regs *regs)
         if ( priv_op_read_msr(regs->_ecx, &val, NULL) != X86EMUL_OKAY )
             goto fail;
  rdmsr_writeback:
         regs->eax = (uint32_t)val;
         regs->edx = (uint32_t)(val >> 32);
         break;
 
     case 0xa2: /* CPUID */
+        /* If cpuid faulting is enabled and CPL>0 leave the #GP untouched. */
+        if ( v->arch.cpuid_fault && !guest_kernel_mode(v, regs) )
+            goto fail;
+
         pv_cpuid(regs);
         break;
 
     default:
         goto fail;
     }
 
 #undef wr_ad
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 5807a1f..27c20cc 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -552,16 +552,19 @@  struct arch_vcpu
      * However, processor should not be able to touch eXtended states before
      * it explicitly enables it via xcr0.
      */
     uint64_t xcr0_accum;
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
 
+    /* Has the guest enabled CPUID faulting? */
+    bool cpuid_fault;
+
     /*
      * The SMAP check policy when updating runstate_guest(v) and the
      * secondary system time.
      */
     smap_check_policy_t smap_check_policy;
 
     struct vmce vmce;