[v3,00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

Message ID 20210104131542.495413-1-like.xu@linux.intel.com (mailing list archive)

Message

Like Xu Jan. 4, 2021, 1:15 p.m. UTC
The Precise Event Based Sampling (PEBS) facility on Intel Ice Lake Server
platforms can provide the architectural state of the instruction executed
right after the guest instruction that caused the event. This patch set
enables the PEBS-via-DS feature for KVM guests on Ice Lake Server.

The PEBS feature can then be used in a Linux guest just as on a native host:

  # perf record -e instructions:ppp ./br_instr a
  # perf record -c 100000 -e instructions:pp ./br_instr a

Guest PEBS is deliberately disabled when the host is using PEBS; by
default, KVM does not allow guest PEBS and host PEBS to coexist.

The patch set can be divided into three parts. The first two parts
enable the basic PEBS-via-DS feature and could be considered for
merging; no regression in host perf is expected.

Compared to the first version, an important change here is the removal
of the forced 1:1 mapping of virtual to physical PMCs; the cross-mapping
issue is handled carefully in part 3, which may address the artificial
competition concern from PeterZ.

In general, there are 2 code paths to emulate guest PEBS facility.

1) Fast path (part 2, patch 0004-0012)

This is when the host-assigned physical PMC has the same index as the
virtual PMC (e.g. using physical PMC0 to emulate virtual PMC0). It works
like the 1:1 mapping used in the first version.
2) Slow path (part 3, patch 0012-0017)

This is when the host-assigned physical PMC has a different index from
the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0). In
this case, KVM needs to rewrite the PEBS records to change the
applicable counter indexes to the virtual PMC indexes (they would
otherwise contain the physical counter indexes written by the PEBS
facility), and to switch the counter reset values to the offsets
corresponding to the physical counter indexes in the DS data structure.

In the slow path, KVM disables large PEBS by rewriting the
pebs_interrupt_threshold field in the DS area to allow only one record.
This is because a guest may implicitly drain the PEBS buffer, e.g. on a
context switch, without KVM getting a chance to update the buffer, so
the physical PMC index would confuse the guest. The difficulty grows
when multiple events get rescheduled inside the guest. Hence disabling
large PEBS in this case is an easy and safe way to keep things correct
as an initial step.

We don't expect this change to break any guest code, which can
generally tolerate earlier PMIs. In the fast path with the 1:1 mapping
this is not needed.

The rewriting work is performed before delivering a vPMI to the guest
to notify it to read the record (i.e. before entering the guest, where
interrupts are disabled so no counter reschedule can happen on the host
at that point).

For DS area virtualization, the PEBS hardware is programmed with the
guest virtual address (gva) of the guest DS memory. In the past, the
difficulty was that the host needed to pin the guest DS memory, because
a page fault taken by the PEBS hardware could not be fixed up. Starting
with Ice Lake, hardware support makes this unnecessary.

KVM's rewriting of the guest DS area requires walking the guest page
tables to translate the gva to a host virtual address (hva).

To reduce the translation overhead, we cache the translation at the
first DS memory rewrite. The cached translation remains valid for KVM
to use until the guest disables PEBS (a vm-exit to KVM), since the
guest may then reallocate the PEBS buffer and KVM needs to re-walk the
guest page tables to update the cached translation.

In summary, this patch set enables guest PEBS to retrieve correct
information from its own PEBS records on Ice Lake server platforms,
provided the host is not using the PEBS facility at the same time. We
expect it to keep working when migrating to another Ice Lake server.

Here are the perf results for the same workload on the guest and on the host:

perf report on guest:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1473377250
# Overhead  Command   Shared Object      Symbol
  57.74%  br_instr  br_instr           [.] lfsr_cond
  41.40%  br_instr  br_instr           [.] cmp_end
   0.21%  br_instr  [kernel.kallsyms]  [k] __lock_acquire

perf report on host:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1462721386
# Overhead  Command   Shared Object     Symbol
  57.90%  br_instr  br_instr          [.] lfsr_cond
  41.95%  br_instr  br_instr          [.] cmp_end
   0.05%  br_instr  [kernel.vmlinux]  [k] lock_acquire
   
Conclusion: the profiling results on the guest are similar to that on the host.

Please check more details in each commit and feel free to comment.

v2->v3 Changelog:
- drop the counter_freezing check and disable guest PEBS when host uses PEBS;
- use kvm_read/write_guest_[offset]_cached() to reduce memory rewrite overhead;
- use GLOBAL_STATUS_BUFFER_OVF_BIT instead of 62;
- make intel_pmu_handle_event() static;
- rebased to kvm-queue d45f89f7437d;

Previous:
https://lore.kernel.org/kvm/20201109021254.79755-1-like.xu@linux.intel.com/

Like Xu (17):
  KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
  KVM: vmx/pmu: Use IA32_PERF_CAPABILITIES to adjust features visibility
  KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
  perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest
  KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter
  KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
  KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
  KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
  KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
  KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
  KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
  KVM: x86/pmu: Disable guest PEBS when counters are cross-mapped
  KVM: x86/pmu: Add hook to emulate pebs for cross-mapped counters
  KVM: vmx/pmu: Limit pebs_interrupt_threshold in the guest DS area
  KVM: vmx/pmu: Rewrite applicable_counters field in guest PEBS records
  KVM: x86/pmu: Save guest pebs reset values when pebs is configured
  KVM: x86/pmu: Adjust guest pebs reset values for cross-mapped counters

 arch/x86/events/intel/core.c     |  45 +++++
 arch/x86/events/intel/ds.c       |  62 +++++++
 arch/x86/include/asm/kvm_host.h  |  18 ++
 arch/x86/include/asm/msr-index.h |   6 +
 arch/x86/kvm/pmu.c               |  92 +++++++--
 arch/x86/kvm/pmu.h               |  20 ++
 arch/x86/kvm/vmx/capabilities.h  |  17 +-
 arch/x86/kvm/vmx/pmu_intel.c     | 310 ++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.c           |  29 +++
 arch/x86/kvm/x86.c               |  12 +-
 10 files changed, 592 insertions(+), 19 deletions(-)

Comments

Sean Christopherson Jan. 14, 2021, 7:10 p.m. UTC | #1
On Mon, Jan 04, 2021, Like Xu wrote:
> 2) Slow path (part 3, patch 0012-0017)
> 
> This is when the host assigned physical PMC has a different index
> from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0)
> In this case, KVM needs to rewrite the PEBS records to change the
> applicable counter indexes to the virtual PMC indexes, which would
> otherwise contain the physical counter index written by PEBS facility,
> and switch the counter reset values to the offset corresponding to
> the physical counter indexes in the DS data structure. 
> 
> Large PEBS needs to be disabled by KVM rewriting the
> pebs_interrupt_threshold filed in DS to only one record in
> the slow path.  This is because a guest may implicitly drain PEBS buffer,
> e.g., context switch. KVM doesn't get a chance to update the PEBS buffer.

Are the PEBS record write, PEBS index update, and subsequent PMI atomic with
respect to instruction execution?  If not, doesn't this approach still leave a
window where the guest could see the wrong counter?

The virtualization hole is also visible if the guest is reading the PEBS records
from a different vCPU, though I assume no sane kernel does that?

> The physical PMC index will confuse the guest. The difficulty comes
> when multiple events get rescheduled inside the guest. Hence disabling
> large PEBS in this case might be an easy and safe way to keep it corrects
> as an initial step here.
Xu, Like Jan. 15, 2021, 2:02 a.m. UTC | #2
Hi Sean,

Thanks for your comments !

On 2021/1/15 3:10, Sean Christopherson wrote:
> On Mon, Jan 04, 2021, Like Xu wrote:
>> 2) Slow path (part 3, patch 0012-0017)
>>
>> This is when the host assigned physical PMC has a different index
>> from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0)
>> In this case, KVM needs to rewrite the PEBS records to change the
>> applicable counter indexes to the virtual PMC indexes, which would
>> otherwise contain the physical counter index written by PEBS facility,
>> and switch the counter reset values to the offset corresponding to
>> the physical counter indexes in the DS data structure.
>>
>> Large PEBS needs to be disabled by KVM rewriting the
>> pebs_interrupt_threshold filed in DS to only one record in
>> the slow path.  This is because a guest may implicitly drain PEBS buffer,
>> e.g., context switch. KVM doesn't get a chance to update the PEBS buffer.
> Are the PEBS record write, PEBS index update, and subsequent PMI atomic with
> respect to instruction execution?  If not, doesn't this approach still leave a
> window where the guest could see the wrong counter?

First, KVM limits/rewrites the guest DS pebs_interrupt_threshold to one
record before vm-entry (see [PATCH v3 14/17] KVM: vmx/pmu: Limit
pebs_interrupt_threshold in the guest DS area), which means that once a
PEBS record is written into the guest PEBS buffer, a PEBS PMI is
generated immediately and thus a vm-exit.

Second, KVM completes the PEBS record rewriting, the PEBS index update,
and the vPMI injection before the next vm-entry (these are handled
separately in patches 15-17 for easier review).

After the updated PEBS record(s) are (atomically?) prepared, the guest
is notified via the PMI, and due to the vm-exit there is no window for
a vcpu to observe an unrewritten PEBS record.

> The virtualization hole is also visible if the guest is reading the PEBS records
> from a different vCPU, though I assume no sane kernel does that?

I have checked the guest PEBS driver behavior for Linux and Windows, and 
they're sane.

Theoretically, it's true for busy-poll PEBS buffer readers on other
vCPUs. To fix that, forcing all vCPUs to vm-exit would be onerous for a
large guest, and I don't think you would accept it; or do we have a
better idea?

In fact, we don't think it's a hole or vulnerability, because the
motivation for correcting the counter index(es) is to help the guest
PEBS reader understand its PEBS records correctly and to provide the
same sampling accuracy as the non-cross-mapped case, not to provide a
new attack surface from guest to host.

PeterZ commented on v1 and insisted that host perf allows the guest
counter to be assigned a cross-mapped back-end counter. In this case,
the slow path patches (13-17) are introduced to ensure that, from the
guest counter perspective, the PEBS records are also correct. We do not
want these records to be invalid and ignored, which would undermine the
accuracy of PEBS.

In practical use, the slow path rarely occurs. As a first step, we
propose to upstream the fast path patches (01-12) with your help. Guest
PEBS would be disabled temporarily when guest PEBS counters are
cross-mapped, until we figure out a satisfactory cross-mapping solution.

---
thx,likexu

>
>> The physical PMC index will confuse the guest. The difficulty comes
>> when multiple events get rescheduled inside the guest. Hence disabling
>> large PEBS in this case might be an easy and safe way to keep it corrects
>> as an initial step here.
Sean Christopherson Jan. 15, 2021, 5:57 p.m. UTC | #3
On Fri, Jan 15, 2021, Xu, Like wrote:
> Hi Sean,
> 
> Thanks for your comments !
> 
> On 2021/1/15 3:10, Sean Christopherson wrote:
> > On Mon, Jan 04, 2021, Like Xu wrote:
> > > 2) Slow path (part 3, patch 0012-0017)
> > > 
> > > This is when the host assigned physical PMC has a different index
> > > from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0)
> > > In this case, KVM needs to rewrite the PEBS records to change the
> > > applicable counter indexes to the virtual PMC indexes, which would
> > > otherwise contain the physical counter index written by PEBS facility,
> > > and switch the counter reset values to the offset corresponding to
> > > the physical counter indexes in the DS data structure.
> > > 
> > > Large PEBS needs to be disabled by KVM rewriting the
> > > pebs_interrupt_threshold filed in DS to only one record in
> > > the slow path.  This is because a guest may implicitly drain PEBS buffer,
> > > e.g., context switch. KVM doesn't get a chance to update the PEBS buffer.
> > Are the PEBS record write, PEBS index update, and subsequent PMI atomic with
> > respect to instruction execution?  If not, doesn't this approach still leave a
> > window where the guest could see the wrong counter?
> 
> First, KVM would limit/rewrite guest DS pebs_interrupt_threshold to one
> record before vm-entry,
> (see patch [PATCH v3 14/17] KVM: vmx/pmu: Limit pebs_interrupt_threshold in
> the guest DS area)
> which means once a PEBS record is written into the guest pebs buffer,
> a PEBS PMI will be generated immediately and thus vm-exit.

I'm asking about ucode/hardware.  Is the "guest pebs buffer write -> PEBS PMI"
guaranteed to be atomic?

In practice, under what scenarios will guest counters get cross-mapped?  And,
how does this support affect guest accuracy?  I.e. how bad do things get for the
guest if we simply disable guest counters if they can't have a 1:1 association
with their physical counter?
Andi Kleen Jan. 15, 2021, 6:27 p.m. UTC | #4
> I'm asking about ucode/hardare.  Is the "guest pebs buffer write -> PEBS PMI"
> guaranteed to be atomic?

Of course not.
> 
> In practice, under what scenarios will guest counters get cross-mapped?  And,
> how does this support affect guest accuracy?  I.e. how bad do things get for the
> guest if we simply disable guest counters if they can't have a 1:1 association
> with their physical counter?

This would completely break perfmon for the guest, likely with no way to
recover.

-Andi
Sean Christopherson Jan. 15, 2021, 6:51 p.m. UTC | #5
On Fri, Jan 15, 2021, Andi Kleen wrote:
> > I'm asking about ucode/hardare.  Is the "guest pebs buffer write -> PEBS PMI"
> > guaranteed to be atomic?
> 
> Of course not.

So there's still a window where the guest could observe the bad counter index,
correct?
Andi Kleen Jan. 15, 2021, 7:11 p.m. UTC | #6
On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, Andi Kleen wrote:
> > > I'm asking about ucode/hardare.  Is the "guest pebs buffer write -> PEBS PMI"
> > > guaranteed to be atomic?
> > 
> > Of course not.
> 
> So there's still a window where the guest could observe the bad counter index,
> correct?

Yes.

But with single record PEBS it doesn't really matter with normal perfmon
drivers.

-Andi
Peter Zijlstra Jan. 22, 2021, 9:56 a.m. UTC | #7
On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, Andi Kleen wrote:
> > > I'm asking about ucode/hardare.  Is the "guest pebs buffer write -> PEBS PMI"
> > > guaranteed to be atomic?
> > 
> > Of course not.
> 
> So there's still a window where the guest could observe the bad counter index,
> correct?

Guest could do a hypercall to fix up the DS area before it tries to read
it I suppose. Or the HV could expose the index mapping and have the
guest fix it up.

Adding a little virt crud on top shouldn't be too hard.
Like Xu Jan. 25, 2021, 8:08 a.m. UTC | #8
Hi Peter,

On 2021/1/22 17:56, Peter Zijlstra wrote:
> On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
>> On Fri, Jan 15, 2021, Andi Kleen wrote:
>>>> I'm asking about ucode/hardare.  Is the "guest pebs buffer write -> PEBS PMI"
>>>> guaranteed to be atomic?
>>>
>>> Of course not.
>>
>> So there's still a window where the guest could observe the bad counter index,
>> correct?
> 
> Guest could do a hypercall to fix up the DS area before it tries to read
> it I suppose. Or the HV could expose the index mapping and have the
> guest fix up it.

A weird (malicious) guest could read unmodified PEBS records in the
guest PEBS buffer from other vCPUs, without needing a hypercall or an
index mapping from the HV.

Do you see any security issue in this host-index leak window?

> 
> Adding a little virt crud on top shouldn't be too hard.
> 

Patches 13-17 in this version modify the guest PEBS buffer to correct
the index mapping information in the guest PEBS records.

---
thx,likexu
Peter Zijlstra Jan. 25, 2021, 11:13 a.m. UTC | #9
On Mon, Jan 25, 2021 at 04:08:22PM +0800, Like Xu wrote:
> Hi Peter,
> 
> On 2021/1/22 17:56, Peter Zijlstra wrote:
> > On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
> > > On Fri, Jan 15, 2021, Andi Kleen wrote:
> > > > > I'm asking about ucode/hardare.  Is the "guest pebs buffer write -> PEBS PMI"
> > > > > guaranteed to be atomic?
> > > > 
> > > > Of course not.
> > > 
> > > So there's still a window where the guest could observe the bad counter index,
> > > correct?
> > 
> > Guest could do a hypercall to fix up the DS area before it tries to read
> > it I suppose. Or the HV could expose the index mapping and have the
> > guest fix up it.
> 
> A weird (malicious) guest would read unmodified PEBS records in the
> guest PEBS buffer from other vCPUs without the need for hypercall or
> index mapping from HV.
> 
> Do you see any security issues on this host index leak window?
> 
> > 
> > Adding a little virt crud on top shouldn't be too hard.
> > 
> 
> The patches 13-17 in this version has modified the guest PEBS buffer
> to correct the index mapping information in the guest PEBS records.

Right, but given there is no atomicity between writing the DS area and
triggering the PMI (as already established earlier in this thread), a
malicious guest can already access this information, no?
Xu, Like Jan. 25, 2021, 12:07 p.m. UTC | #10
On 2021/1/25 19:13, Peter Zijlstra wrote:
> On Mon, Jan 25, 2021 at 04:08:22PM +0800, Like Xu wrote:
>> Hi Peter,
>>
>> On 2021/1/22 17:56, Peter Zijlstra wrote:
>>> On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
>>>> On Fri, Jan 15, 2021, Andi Kleen wrote:
>>>>>> I'm asking about ucode/hardare.  Is the "guest pebs buffer write -> PEBS PMI"
>>>>>> guaranteed to be atomic?
>>>>> Of course not.
>>>> So there's still a window where the guest could observe the bad counter index,
>>>> correct?
>>> Guest could do a hypercall to fix up the DS area before it tries to read
>>> it I suppose. Or the HV could expose the index mapping and have the
>>> guest fix up it.
>> A weird (malicious) guest would read unmodified PEBS records in the
>> guest PEBS buffer from other vCPUs without the need for hypercall or
>> index mapping from HV.
>>
>> Do you see any security issues on this host index leak window?
>>
>>> Adding a little virt crud on top shouldn't be too hard.
>>>
>> The patches 13-17 in this version has modified the guest PEBS buffer
>> to correct the index mapping information in the guest PEBS records.
> Right, but given there is no atomicity between writing the DS area and
> triggering the PMI (as already established earlier in this thread), a
> malicious guest can already access this information, no?
>

So under the premise that counter cross-mapping is allowed,
how can a hypercall help fix it?

Personally, I think it is acceptable at the moment; no security issue
based on this has been defined or found.
Peter Zijlstra Jan. 25, 2021, 12:18 p.m. UTC | #11
On Mon, Jan 25, 2021 at 08:07:06PM +0800, Xu, Like wrote:

> So under the premise that counter cross-mapping is allowed,
> how can hypercall help fix it ?

A hypercall, or otherwise exposing the mapping, will let the guest fix
it up when it already touches the data. That avoids the host having to
access the guest memory, and is faster, no?
Xu, Like Jan. 25, 2021, 12:53 p.m. UTC | #12
On 2021/1/25 20:18, Peter Zijlstra wrote:
> On Mon, Jan 25, 2021 at 08:07:06PM +0800, Xu, Like wrote:
>
>> So under the premise that counter cross-mapping is allowed,
>> how can hypercall help fix it ?
> Hypercall or otherwise exposing the mapping, will let the guest fix it
> up when it already touches the data. Which avoids the host from having
> to access the guest memory and is faster, no?
- as you may know, the mapping table changes rapidly between the time
the records are rewritten and the time they are read;

- the patches modify the records before the guest is notified via PMI,
which means the rewrite is transparent to normal guests (including
Windows);

- a malicious guest would ignore the exposed mapping and the hypercall,
so I don't think they can solve the leakage issue at all;

- making the guest aware of the hypercall or mapping requires more code
changes on the guest side; doing it on the KVM side works today, the
cross-mapping case rarely happens, and the overhead is acceptable based
on our tests;

Please let me know if you or Sean are not going to buy into the PEBS
record rewrite proposal in patches 13-17.

---
thx,likexu