diff mbox series

[v2,2/3] x86/svm: Always intercept ICEBP

Message ID 20191126120357.13398-3-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Headers show
Series x86/hvm: Multiple corrections to task switch handling | expand

Commit Message

Andrew Cooper Nov. 26, 2019, 12:03 p.m. UTC
ICEBP isn't handled well by SVM.

The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it.  As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.

To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.

We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).

Unconditionally intercept ICEBP.  This does have a trap semantics for the
intercept, and allows us to move %rip forwards appropriately before the
TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
switches consistent however the ICEBP #DB came about, and avoids special cases
in the TASK_SWITCH intercept.

This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
will now always be submitted for monitoring checks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Razvan Cojocaru <rcojocaru@bitdefender.com>
CC: Tamas K Lengyel <tamas@tklengyel.com>
CC: Alexandru Isaila <aisaila@bitdefender.com>
CC: Petre Pircalabu <ppircalabu@bitdefender.com>
CC: Juergen Gross <jgross@suse.com>

v2:
 * New
---
 xen/arch/x86/hvm/svm/svm.c    | 19 -------------------
 xen/arch/x86/hvm/svm/vmcb.c   |  2 +-
 xen/arch/x86/monitor.c        |  3 ---
 xen/include/asm-x86/hvm/hvm.h | 11 -----------
 4 files changed, 1 insertion(+), 34 deletions(-)

Comments

Alexandru Stefan ISAILA Nov. 26, 2019, 12:28 p.m. UTC | #1
On 26.11.2019 14:03, Andrew Cooper wrote:
> ICEBP isn't handled well by SVM.
> 
> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
> appropriate instruction boundary (fault or trap, as appropriate), except for
> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
> rather than after it.  As ICEBP isn't distinguished in the vectoring event
> type, the state is ambiguous.
> 
> To add to the confusion, an ICEBP which occurs due to Introspection
> intercepting the instruction, or from x86_emulate() will have %rip updated as
> a consequence of partial emulation required to inject an ICEBP event in the
> first place.
> 
> We could in principle spot the non-injected case in the TASK_SWITCH handler,
> but this still results in complexity if the ICEBP instruction also has an
> Instruction Breakpoint active on it (which genuinely has fault semantics).
> 
> Unconditionally intercept ICEBP.  This does have a trap semantics for the
> intercept, and allows us to move %rip forwards appropriately before the
> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
> switches consistent however the ICEBP #DB came about, and avoids special cases
> in the TASK_SWITCH intercept.
> 
> This in turn allows for the removal of the conditional
> hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
> will now always be submitted for monitoring checks.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

LGTM.

Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Petre Ovidiu PIRCALABU Nov. 26, 2019, 3:15 p.m. UTC | #2
On Tue, 2019-11-26 at 12:03 +0000, Andrew Cooper wrote:
> ICEBP isn't handled well by SVM.
> 
> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to
> the
> appropriate instruction boundary (fault or trap, as appropriate),
> except for
> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP
> instruction
> rather than after it.  As ICEBP isn't distinguished in the vectoring
> event
> type, the state is ambiguous.
> 
> To add to the confusion, an ICEBP which occurs due to Introspection
> intercepting the instruction, or from x86_emulate() will have %rip
> updated as
> a consequence of partial emulation required to inject an ICEBP event
> in the
> first place.
> 
> We could in principle spot the non-injected case in the TASK_SWITCH
> handler,
> but this still results in complexity if the ICEBP instruction also
> has an
> Instruction Breakpoint active on it (which genuinely has fault
> semantics).
> 
> Unconditionally intercept ICEBP.  This does have a trap semantics for
> the
> intercept, and allows us to move %rip forwards appropriately before
> the
> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-
> vectored
> switches consistent however the ICEBP #DB came about, and avoids
> special cases
> in the TASK_SWITCH intercept.
> 
> This in turn allows for the removal of the conditional
> hvm_set_icebp_interception() logic used by the monitor subsystem, as
> ICEBP's
> will now always be submitted for monitoring checks.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Jan Beulich Nov. 26, 2019, 3:32 p.m. UTC | #3
On 26.11.2019 13:03, Andrew Cooper wrote:
> ICEBP isn't handled well by SVM.
> 
> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
> appropriate instruction boundary (fault or trap, as appropriate), except for
> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
> rather than after it.  As ICEBP isn't distinguished in the vectoring event
> type, the state is ambiguous.
> 
> To add to the confusion, an ICEBP which occurs due to Introspection
> intercepting the instruction, or from x86_emulate() will have %rip updated as
> a consequence of partial emulation required to inject an ICEBP event in the
> first place.
> 
> We could in principle spot the non-injected case in the TASK_SWITCH handler,
> but this still results in complexity if the ICEBP instruction also has an
> Instruction Breakpoint active on it (which genuinely has fault semantics).
> 
> Unconditionally intercept ICEBP.  This does have a trap semantics for the
> intercept, and allows us to move %rip forwards appropriately before the
> TASK_SWITCH intercept is hit.

Both because of you mentioning the moving forwards of %rip and with the
irc discussion in mind that we had no irc, don't you mean "fault
semantics" here? If so
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Otherwise I guess I'm still missing something.

> ---
>  xen/arch/x86/hvm/svm/svm.c    | 19 -------------------
>  xen/arch/x86/hvm/svm/vmcb.c   |  2 +-
>  xen/arch/x86/monitor.c        |  3 ---
>  xen/include/asm-x86/hvm/hvm.h | 11 -----------
>  4 files changed, 1 insertion(+), 34 deletions(-)

This, of course, is pretty nice in any event.

Jan
Roger Pau Monné Nov. 26, 2019, 3:34 p.m. UTC | #4
On Tue, Nov 26, 2019 at 12:03:56PM +0000, Andrew Cooper wrote:
> ICEBP isn't handled well by SVM.
> 
> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
> appropriate instruction boundary (fault or trap, as appropriate), except for
> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
> rather than after it.  As ICEBP isn't distinguished in the vectoring event
> type, the state is ambiguous.
> 
> To add to the confusion, an ICEBP which occurs due to Introspection
> intercepting the instruction, or from x86_emulate() will have %rip updated as
> a consequence of partial emulation required to inject an ICEBP event in the
> first place.
> 
> We could in principle spot the non-injected case in the TASK_SWITCH handler,
> but this still results in complexity if the ICEBP instruction also has an
> Instruction Breakpoint active on it (which genuinely has fault semantics).
> 
> Unconditionally intercept ICEBP.  This does have a trap semantics for the
> intercept, and allows us to move %rip forwards appropriately before the
> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
> switches consistent however the ICEBP #DB came about, and avoids special cases
> in the TASK_SWITCH intercept.
> 
> This in turn allows for the removal of the conditional
> hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
> will now always be submitted for monitoring checks.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

AFAICT this brings AMD implementation inline with Intel that also will
unconditionally vmexit on icebp?

Thanks, Roger.
Andrew Cooper Nov. 26, 2019, 3:59 p.m. UTC | #5
On 26/11/2019 15:32, Jan Beulich wrote:
> On 26.11.2019 13:03, Andrew Cooper wrote:
>> ICEBP isn't handled well by SVM.
>>
>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>> appropriate instruction boundary (fault or trap, as appropriate), except for
>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>> type, the state is ambiguous.
>>
>> To add to the confusion, an ICEBP which occurs due to Introspection
>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>> a consequence of partial emulation required to inject an ICEBP event in the
>> first place.
>>
>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>> but this still results in complexity if the ICEBP instruction also has an
>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>
>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>> intercept, and allows us to move %rip forwards appropriately before the
>> TASK_SWITCH intercept is hit.
> Both because of you mentioning the moving forwards of %rip and with the
> irc discussion in mind that we had no irc, don't you mean "fault
> semantics" here?

ICEBP really is too broken under SVM to handle architecturally.

The ICEBP intercept has nRIP decode support, because it is an
instruction intercept.  We emulate the injection (because it is ICEBP),
which means we re-enter the guest with %rip moved forward, and #DB
(HW_EXCEPTION) pending for injection.  This means that...

>  If so
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
the ICEBP instruction, rather than at it, making it consistent with
every other #DB-vectored TASK_SWITCH.

This does means that an early task-switch fault for ICEBP will reliably
be delivered with the wrong (i.e. trap) semantics, but this is less bad
than mixed fault/trap semantics depending on whether the source of the
ICEBP was introspection/emulation or native execution.

We could restore proper fault behaviour by extending
svm_emul_swint_injection() to figure out that a task switch is needed,
and invoke hvm_task_switch() directly, but I don't have enough TUITS
right now.

> Otherwise I guess I'm still missing something.

I hope this clears it up.

~Andrew
Jan Beulich Nov. 26, 2019, 4:05 p.m. UTC | #6
On 26.11.2019 16:59, Andrew Cooper wrote:
> On 26/11/2019 15:32, Jan Beulich wrote:
>> On 26.11.2019 13:03, Andrew Cooper wrote:
>>> ICEBP isn't handled well by SVM.
>>>
>>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>>> appropriate instruction boundary (fault or trap, as appropriate), except for
>>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>>> type, the state is ambiguous.
>>>
>>> To add to the confusion, an ICEBP which occurs due to Introspection
>>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>>> a consequence of partial emulation required to inject an ICEBP event in the
>>> first place.
>>>
>>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>>> but this still results in complexity if the ICEBP instruction also has an
>>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>>
>>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>>> intercept, and allows us to move %rip forwards appropriately before the
>>> TASK_SWITCH intercept is hit.
>> Both because of you mentioning the moving forwards of %rip and with the
>> irc discussion in mind that we had no irc, don't you mean "fault
>> semantics" here?
> 
> ICEBP really is too broken under SVM to handle architecturally.
> 
> The ICEBP intercept has nRIP decode support, because it is an
> instruction intercept.  We emulate the injection (because it is ICEBP),
> which means we re-enter the guest with %rip moved forward, and #DB
> (HW_EXCEPTION) pending for injection.  This means that...
> 
>>  If so
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> ... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
> the ICEBP instruction, rather than at it, making it consistent with
> every other #DB-vectored TASK_SWITCH.
> 
> This does means that an early task-switch fault for ICEBP will reliably
> be delivered with the wrong (i.e. trap) semantics, but this is less bad
> than mixed fault/trap semantics depending on whether the source of the
> ICEBP was introspection/emulation or native execution.
> 
> We could restore proper fault behaviour by extending
> svm_emul_swint_injection() to figure out that a task switch is needed,
> and invoke hvm_task_switch() directly, but I don't have enough TUITS
> right now.
> 
>> Otherwise I guess I'm still missing something.
> 
> I hope this clears it up.

Well, it helps, but you don't really answer the question: Is "trap"
in that sentence of the description really correct? I.e. don't you
instead mean "fault" there?

Jan
Andrew Cooper Nov. 26, 2019, 4:09 p.m. UTC | #7
On 26/11/2019 15:34, Roger Pau Monné wrote:
> On Tue, Nov 26, 2019 at 12:03:56PM +0000, Andrew Cooper wrote:
>> ICEBP isn't handled well by SVM.
>>
>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>> appropriate instruction boundary (fault or trap, as appropriate), except for
>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>> type, the state is ambiguous.
>>
>> To add to the confusion, an ICEBP which occurs due to Introspection
>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>> a consequence of partial emulation required to inject an ICEBP event in the
>> first place.
>>
>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>> but this still results in complexity if the ICEBP instruction also has an
>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>
>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>> intercept, and allows us to move %rip forwards appropriately before the
>> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
>> switches consistent however the ICEBP #DB came about, and avoids special cases
>> in the TASK_SWITCH intercept.
>>
>> This in turn allows for the removal of the conditional
>> hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
>> will now always be submitted for monitoring checks.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
>
> AFAICT this brings AMD implementation inline with Intel that also will
> unconditionally vmexit on icebp?

VT-x and SVM handle things quite differently.

VT-x has no instruction intercept for ICEBP, but the #DB intercept will
triggered by an ICEBP instruction.  ICEBP has its own event type
(Privileged Software Exception, which is an amusing name considering it
is an unprivleged instruction, bypasses privilege checks, and sets the
External bit in an error code).

SVM does have an instruction intercept for ICEBP, but the #DB from
ICEBP's don't trigger the normal #DB intercept.  However, secondary
#DB's generated by ICEBP's unintercepted #DB do trigger the #DB intercept.

For safety reasons we must intercept #DB to prevent CPU deadlocks.  This
means that ICEBP are in practice always intercepted on Intel due to
their #DB side effect, but they weren't intercepted on AMD, which is why
the monitor subsystem had a way of turning interception on.

So yes, the overall effect is that ICEBPs will now unconditionally
vmexit on both Intel and AMD, but underlying mechanism for why they
vmexit is still vendor-specific.

~Andrew
Andrew Cooper Nov. 26, 2019, 4:11 p.m. UTC | #8
On 26/11/2019 16:05, Jan Beulich wrote:
> On 26.11.2019 16:59, Andrew Cooper wrote:
>> On 26/11/2019 15:32, Jan Beulich wrote:
>>> On 26.11.2019 13:03, Andrew Cooper wrote:
>>>> ICEBP isn't handled well by SVM.
>>>>
>>>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>>>> appropriate instruction boundary (fault or trap, as appropriate), except for
>>>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>>>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>>>> type, the state is ambiguous.
>>>>
>>>> To add to the confusion, an ICEBP which occurs due to Introspection
>>>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>>>> a consequence of partial emulation required to inject an ICEBP event in the
>>>> first place.
>>>>
>>>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>>>> but this still results in complexity if the ICEBP instruction also has an
>>>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>>>
>>>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>>>> intercept, and allows us to move %rip forwards appropriately before the
>>>> TASK_SWITCH intercept is hit.
>>> Both because of you mentioning the moving forwards of %rip and with the
>>> irc discussion in mind that we had no irc, don't you mean "fault
>>> semantics" here?
>> ICEBP really is too broken under SVM to handle architecturally.
>>
>> The ICEBP intercept has nRIP decode support, because it is an
>> instruction intercept.  We emulate the injection (because it is ICEBP),
>> which means we re-enter the guest with %rip moved forward, and #DB
>> (HW_EXCEPTION) pending for injection.  This means that...
>>
>>>  If so
>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>> ... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
>> the ICEBP instruction, rather than at it, making it consistent with
>> every other #DB-vectored TASK_SWITCH.
>>
>> This does means that an early task-switch fault for ICEBP will reliably
>> be delivered with the wrong (i.e. trap) semantics, but this is less bad
>> than mixed fault/trap semantics depending on whether the source of the
>> ICEBP was introspection/emulation or native execution.
>>
>> We could restore proper fault behaviour by extending
>> svm_emul_swint_injection() to figure out that a task switch is needed,
>> and invoke hvm_task_switch() directly, but I don't have enough TUITS
>> right now.
>>
>>> Otherwise I guess I'm still missing something.
>> I hope this clears it up.
> Well, it helps, but you don't really answer the question: Is "trap"
> in that sentence of the description really correct? I.e. don't you
> instead mean "fault" there?

I've reworded that bit to:

Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
instruction intercept, which allows us allows us to move %rip forwards
appropriately before the TASK_SWITCH intercept is hit.  This allows...

Any better?

~Andrew
Jan Beulich Nov. 26, 2019, 4:14 p.m. UTC | #9
On 26.11.2019 17:11, Andrew Cooper wrote:
> On 26/11/2019 16:05, Jan Beulich wrote:
>> On 26.11.2019 16:59, Andrew Cooper wrote:
>>> On 26/11/2019 15:32, Jan Beulich wrote:
>>>> On 26.11.2019 13:03, Andrew Cooper wrote:
>>>>> ICEBP isn't handled well by SVM.
>>>>>
>>>>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>>>>> appropriate instruction boundary (fault or trap, as appropriate), except for
>>>>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>>>>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>>>>> type, the state is ambiguous.
>>>>>
>>>>> To add to the confusion, an ICEBP which occurs due to Introspection
>>>>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>>>>> a consequence of partial emulation required to inject an ICEBP event in the
>>>>> first place.
>>>>>
>>>>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>>>>> but this still results in complexity if the ICEBP instruction also has an
>>>>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>>>>
>>>>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>>>>> intercept, and allows us to move %rip forwards appropriately before the
>>>>> TASK_SWITCH intercept is hit.
>>>> Both because of you mentioning the moving forwards of %rip and with the
>>>> irc discussion in mind that we had no irc, don't you mean "fault
>>>> semantics" here?
>>> ICEBP really is too broken under SVM to handle architecturally.
>>>
>>> The ICEBP intercept has nRIP decode support, because it is an
>>> instruction intercept.  We emulate the injection (because it is ICEBP),
>>> which means we re-enter the guest with %rip moved forward, and #DB
>>> (HW_EXCEPTION) pending for injection.  This means that...
>>>
>>>>  If so
>>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>> ... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
>>> the ICEBP instruction, rather than at it, making it consistent with
>>> every other #DB-vectored TASK_SWITCH.
>>>
>>> This does means that an early task-switch fault for ICEBP will reliably
>>> be delivered with the wrong (i.e. trap) semantics, but this is less bad
>>> than mixed fault/trap semantics depending on whether the source of the
>>> ICEBP was introspection/emulation or native execution.
>>>
>>> We could restore proper fault behaviour by extending
>>> svm_emul_swint_injection() to figure out that a task switch is needed,
>>> and invoke hvm_task_switch() directly, but I don't have enough TUITS
>>> right now.
>>>
>>>> Otherwise I guess I'm still missing something.
>>> I hope this clears it up.
>> Well, it helps, but you don't really answer the question: Is "trap"
>> in that sentence of the description really correct? I.e. don't you
>> instead mean "fault" there?
> 
> I've reworded that bit to:
> 
> Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
> instruction intercept, which allows us allows us to move %rip forwards
> appropriately before the TASK_SWITCH intercept is hit.  This allows...
> 
> Any better?

Ah yes, thanks. (But please drop one of the two "allows us".)

Jan
Andrew Cooper Nov. 26, 2019, 4:16 p.m. UTC | #10
On 26/11/2019 16:14, Jan Beulich wrote:
> On 26.11.2019 17:11, Andrew Cooper wrote:
>> On 26/11/2019 16:05, Jan Beulich wrote:
>>> On 26.11.2019 16:59, Andrew Cooper wrote:
>>>> On 26/11/2019 15:32, Jan Beulich wrote:
>>>>> On 26.11.2019 13:03, Andrew Cooper wrote:
>>>>>> ICEBP isn't handled well by SVM.
>>>>>>
>>>>>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>>>>>> appropriate instruction boundary (fault or trap, as appropriate), except for
>>>>>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>>>>>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>>>>>> type, the state is ambiguous.
>>>>>>
>>>>>> To add to the confusion, an ICEBP which occurs due to Introspection
>>>>>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>>>>>> a consequence of partial emulation required to inject an ICEBP event in the
>>>>>> first place.
>>>>>>
>>>>>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>>>>>> but this still results in complexity if the ICEBP instruction also has an
>>>>>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>>>>>
>>>>>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>>>>>> intercept, and allows us to move %rip forwards appropriately before the
>>>>>> TASK_SWITCH intercept is hit.
>>>>> Both because of you mentioning the moving forwards of %rip and with the
>>>>> irc discussion in mind that we had no irc, don't you mean "fault
>>>>> semantics" here?
>>>> ICEBP really is too broken under SVM to handle architecturally.
>>>>
>>>> The ICEBP intercept has nRIP decode support, because it is an
>>>> instruction intercept.  We emulate the injection (because it is ICEBP),
>>>> which means we re-enter the guest with %rip moved forward, and #DB
>>>> (HW_EXCEPTION) pending for injection.  This means that...
>>>>
>>>>>  If so
>>>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>> ... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
>>>> the ICEBP instruction, rather than at it, making it consistent with
>>>> every other #DB-vectored TASK_SWITCH.
>>>>
>>>> This does means that an early task-switch fault for ICEBP will reliably
>>>> be delivered with the wrong (i.e. trap) semantics, but this is less bad
>>>> than mixed fault/trap semantics depending on whether the source of the
>>>> ICEBP was introspection/emulation or native execution.
>>>>
>>>> We could restore proper fault behaviour by extending
>>>> svm_emul_swint_injection() to figure out that a task switch is needed,
>>>> and invoke hvm_task_switch() directly, but I don't have enough TUITS
>>>> right now.
>>>>
>>>>> Otherwise I guess I'm still missing something.
>>>> I hope this clears it up.
>>> Well, it helps, but you don't really answer the question: Is "trap"
>>> in that sentence of the description really correct? I.e. don't you
>>> instead mean "fault" there?
>> I've reworded that bit to:
>>
>> Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
>> instruction intercept, which allows us allows us to move %rip forwards
>> appropriately before the TASK_SWITCH intercept is hit.  This allows...
>>
>> Any better?
> Ah yes, thanks. (But please drop one of the two "allows us".)

Oops yes.  Irritatingly, that causes #DB-vectored to move onto a new
line, and trigger Git's comment syntax.  I'll tweak a little bit more.

~Andrew
Roger Pau Monné Nov. 27, 2019, 8:55 a.m. UTC | #11
On Tue, Nov 26, 2019 at 04:09:08PM +0000, Andrew Cooper wrote:
> On 26/11/2019 15:34, Roger Pau Monné wrote:
> > On Tue, Nov 26, 2019 at 12:03:56PM +0000, Andrew Cooper wrote:
> >> ICEBP isn't handled well by SVM.
> >>
> >> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
> >> appropriate instruction boundary (fault or trap, as appropriate), except for
> >> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
> >> rather than after it.  As ICEBP isn't distinguished in the vectoring event
> >> type, the state is ambiguous.
> >>
> >> To add to the confusion, an ICEBP which occurs due to Introspection
> >> intercepting the instruction, or from x86_emulate() will have %rip updated as
> >> a consequence of partial emulation required to inject an ICEBP event in the
> >> first place.
> >>
> >> We could in principle spot the non-injected case in the TASK_SWITCH handler,
> >> but this still results in complexity if the ICEBP instruction also has an
> >> Instruction Breakpoint active on it (which genuinely has fault semantics).
> >>
> >> Unconditionally intercept ICEBP.  This does have a trap semantics for the
> >> intercept, and allows us to move %rip forwards appropriately before the
> >> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
> >> switches consistent however the ICEBP #DB came about, and avoids special cases
> >> in the TASK_SWITCH intercept.
> >>
> >> This in turn allows for the removal of the conditional
> >> hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
> >> will now always be submitted for monitoring checks.
> >>
> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> >
> > AFAICT this brings AMD implementation inline with Intel that also will
> > unconditionally vmexit on icebp?
> 
> VT-x and SVM handle things quite differently.
> 
> VT-x has no instruction intercept for ICEBP, but the #DB intercept will
> triggered by an ICEBP instruction.  ICEBP has its own event type
> (Privileged Software Exception, which is an amusing name considering it
> is an unprivleged instruction, bypasses privilege checks, and sets the
> External bit in an error code).
> 
> SVM does have an instruction intercept for ICEBP, but the #DB from
> ICEBP's don't trigger the normal #DB intercept.  However, secondary
> #DB's generated by ICEBP's unintercepted #DB do trigger the #DB intercept.
> 
> For safety reasons we must intercept #DB to prevent CPU deadlocks.  This
> means that ICEBP are in practice always intercepted on Intel due to
> their #DB side effect, but they weren't intercepted on AMD, which is why
> the monitor subsystem had a way of turning interception on.
> 
> So yes, the overall effect is that ICEBPs will now unconditionally
> vmexit on both Intel and AMD, but underlying mechanism for why they
> vmexit is still vendor-specific.

Thanks for the detailed explanation, I realized vmx didn't have a
ICEBP specific VMEXIT reason, but I assumed the #DB would be trapped
as that's how the monitor system intercepts those.

Roger.
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 049b800e20..a7a79fcef7 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -173,24 +173,6 @@  static void svm_enable_msr_interception(struct domain *d, uint32_t msr)
         svm_intercept_msr(v, msr, MSR_INTERCEPT_WRITE);
 }
 
-static void svm_set_icebp_interception(struct domain *d, bool enable)
-{
-    const struct vcpu *v;
-
-    for_each_vcpu ( d, v )
-    {
-        struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb;
-        uint32_t intercepts = vmcb_get_general2_intercepts(vmcb);
-
-        if ( enable )
-            intercepts |= GENERAL2_INTERCEPT_ICEBP;
-        else
-            intercepts &= ~GENERAL2_INTERCEPT_ICEBP;
-
-        vmcb_set_general2_intercepts(vmcb, intercepts);
-    }
-}
-
 static void svm_save_dr(struct vcpu *v)
 {
     struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb;
@@ -2474,7 +2456,6 @@  static struct hvm_function_table __initdata svm_function_table = {
     .msr_read_intercept   = svm_msr_read_intercept,
     .msr_write_intercept  = svm_msr_write_intercept,
     .enable_msr_interception = svm_enable_msr_interception,
-    .set_icebp_interception = svm_set_icebp_interception,
     .set_rdtsc_exiting    = svm_set_rdtsc_exiting,
     .set_descriptor_access_exiting = svm_set_descriptor_access_exiting,
     .get_insn_bytes       = svm_get_insn_bytes,
diff --git a/xen/arch/x86/hvm/svm/vmcb.c b/xen/arch/x86/hvm/svm/vmcb.c
index 71ee7102f7..1fef0da22c 100644
--- a/xen/arch/x86/hvm/svm/vmcb.c
+++ b/xen/arch/x86/hvm/svm/vmcb.c
@@ -73,7 +73,7 @@  static int construct_vmcb(struct vcpu *v)
         GENERAL2_INTERCEPT_STGI        | GENERAL2_INTERCEPT_CLGI        |
         GENERAL2_INTERCEPT_SKINIT      | GENERAL2_INTERCEPT_MWAIT       |
         GENERAL2_INTERCEPT_WBINVD      | GENERAL2_INTERCEPT_MONITOR     |
-        GENERAL2_INTERCEPT_XSETBV;
+        GENERAL2_INTERCEPT_XSETBV      | GENERAL2_INTERCEPT_ICEBP;
 
     /* Intercept all debug-register writes. */
     vmcb->_dr_intercepts = ~0u;
diff --git a/xen/arch/x86/monitor.c b/xen/arch/x86/monitor.c
index 3c42e21906..bbcb7536c7 100644
--- a/xen/arch/x86/monitor.c
+++ b/xen/arch/x86/monitor.c
@@ -301,9 +301,6 @@  int arch_monitor_domctl_event(struct domain *d,
         ad->monitor.debug_exception_sync = requested_status ?
                                             mop->u.debug_exception.sync :
                                             0;
-
-        hvm_set_icebp_interception(d, requested_status);
-
         domain_unpause(d);
         break;
     }
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 4cce59bb31..17fb7efa6e 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -206,7 +206,6 @@  struct hvm_function_table {
                                 bool_t access_w, bool_t access_x);
 
     void (*enable_msr_interception)(struct domain *d, uint32_t msr);
-    void (*set_icebp_interception)(struct domain *d, bool enable);
     bool_t (*is_singlestep_supported)(void);
 
     /* Alternate p2m */
@@ -615,16 +614,6 @@  static inline bool_t hvm_enable_msr_interception(struct domain *d, uint32_t msr)
     return 0;
 }
 
-static inline bool hvm_set_icebp_interception(struct domain *d, bool enable)
-{
-    if ( hvm_funcs.set_icebp_interception )
-    {
-        hvm_funcs.set_icebp_interception(d, enable);
-        return true;
-    }
-    return false;
-}
-
 static inline bool_t hvm_is_singlestep_supported(void)
 {
     return (hvm_funcs.is_singlestep_supported &&