[v2,2/3] x86/svm: Always intercept ICEBP

Message ID	20191126120357.13398-3-andrew.cooper3@citrix.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=lwom=ZS=lists.xenproject.org=xen-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 661DB2073F Received-SPF: None (esa5.hc3370-68.iphmx.com: no sender authenticity information available from domain of andrew.cooper3@citrix.com) identity=pra; client-ip=162.221.158.21; receiver=esa5.hc3370-68.iphmx.com; envelope-from="Andrew.Cooper3@citrix.com"; x-sender="andrew.cooper3@citrix.com"; x-conformance=sidf_compatible Received-SPF: Pass (esa5.hc3370-68.iphmx.com: domain of Andrew.Cooper3@citrix.com designates 162.221.158.21 as permitted sender) identity=mailfrom; client-ip=162.221.158.21; receiver=esa5.hc3370-68.iphmx.com; envelope-from="Andrew.Cooper3@citrix.com"; x-sender="Andrew.Cooper3@citrix.com"; x-conformance=sidf_compatible; x-record-type="v=spf1"; x-record-text="v=spf1 ip4:209.167.231.154 ip4:178.63.86.133 ip4:195.66.111.40/30 ip4:85.115.9.32/28 ip4:199.102.83.4 ip4:192.28.146.160 ip4:192.28.146.107 ip4:216.52.6.88 ip4:216.52.6.188 ip4:162.221.158.21 ip4:162.221.156.83 ip4:168.245.78.127 ~all" Received-SPF: None (esa5.hc3370-68.iphmx.com: no sender authenticity information available from domain of postmaster@mail.citrix.com) identity=helo; client-ip=162.221.158.21; receiver=esa5.hc3370-68.iphmx.com; envelope-from="Andrew.Cooper3@citrix.com"; x-sender="postmaster@mail.citrix.com"; x-conformance=sidf_compatible IronPort-SDR: rE46jArJ2zZZIEwA7r2xQFkEcMre88cuRmGe7rdUShV8MG0WYORm6Ne3P6HcB4UrMDhIdl8LBP wOeuL50M78tatGqKNDxRdo6yDHo2Vltsz4GBk8rVkOMKcCNIHgUUpy/PbNfV12VXsjy9owkD6N EvofXvJB6mWxdoltFp5Nnn5Zx8gjSeCqwN2Rj94m4xIFgRkzF0R+yJsxRsBLHiGFJECCL0dIsu xpu6PQQ9wJG1IOWn6ix4uN4UZXFRDdO28PBM4UnEhMgAOYvOrmjS9ynWYkP4KWl+A2NCc9n9Rn pZw= From: Andrew Cooper <andrew.cooper3@citrix.com> To: Xen-devel <xen-devel@lists.xenproject.org> Date: Tue, 26 Nov 2019 12:03:56 +0000 Message-ID: <20191126120357.13398-3-andrew.cooper3@citrix.com> In-Reply-To: <20191126120357.13398-1-andrew.cooper3@citrix.com> References: <20191126120357.13398-1-andrew.cooper3@citrix.com> MIME-Version: 1.0 Subject: [Xen-devel] [PATCH v2 2/3] x86/svm: Always intercept ICEBP Precedence: list Cc: Petre Pircalabu <ppircalabu@bitdefender.com>, Juergen Gross <jgross@suse.com>, Tamas K Lengyel <tamas@tklengyel.com>, Wei Liu <wl@xen.org>, Razvan Cojocaru <rcojocaru@bitdefender.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Jan Beulich <JBeulich@suse.com>, Alexandru Isaila <aisaila@bitdefender.com>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: xen-devel-bounces@lists.xenproject.org Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
Series	x86/hvm: Multiple corrections to task switch handling \| expand [for-4.13,v2,0/3] x86/hvm: Multiple corrections to task switch handling [v2,1/3] x86/vtx: Fix fault semantics for early task switch failures [v2,2/3] x86/svm: Always intercept ICEBP [v2,3/3] x86/svm: Write the correct %eip into the outgoing task

Andrew Cooper Nov. 26, 2019, 12:03 p.m. UTC

ICEBP isn't handled well by SVM.

The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it.  As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.

To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.

We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).

Unconditionally intercept ICEBP.  This does have a trap semantics for the
intercept, and allows us to move %rip forwards appropriately before the
TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
switches consistent however the ICEBP #DB came about, and avoids special cases
in the TASK_SWITCH intercept.

This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
will now always be submitted for monitoring checks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Razvan Cojocaru <rcojocaru@bitdefender.com>
CC: Tamas K Lengyel <tamas@tklengyel.com>
CC: Alexandru Isaila <aisaila@bitdefender.com>
CC: Petre Pircalabu <ppircalabu@bitdefender.com>
CC: Juergen Gross <jgross@suse.com>

v2:
 * New
---
 xen/arch/x86/hvm/svm/svm.c    | 19 -------------------
 xen/arch/x86/hvm/svm/vmcb.c   |  2 +-
 xen/arch/x86/monitor.c        |  3 ---
 xen/include/asm-x86/hvm/hvm.h | 11 -----------
 4 files changed, 1 insertion(+), 34 deletions(-)

Alexandru Stefan ISAILA Nov. 26, 2019, 12:28 p.m. UTC | #1

On 26.11.2019 14:03, Andrew Cooper wrote:
> ICEBP isn't handled well by SVM.
> 
> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
> appropriate instruction boundary (fault or trap, as appropriate), except for
> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
> rather than after it.  As ICEBP isn't distinguished in the vectoring event
> type, the state is ambiguous.
> 
> To add to the confusion, an ICEBP which occurs due to Introspection
> intercepting the instruction, or from x86_emulate() will have %rip updated as
> a consequence of partial emulation required to inject an ICEBP event in the
> first place.
> 
> We could in principle spot the non-injected case in the TASK_SWITCH handler,
> but this still results in complexity if the ICEBP instruction also has an
> Instruction Breakpoint active on it (which genuinely has fault semantics).
> 
> Unconditionally intercept ICEBP.  This does have a trap semantics for the
> intercept, and allows us to move %rip forwards appropriately before the
> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
> switches consistent however the ICEBP #DB came about, and avoids special cases
> in the TASK_SWITCH intercept.
> 
> This in turn allows for the removal of the conditional
> hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
> will now always be submitted for monitoring checks.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

LGTM.

Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>

Petre Ovidiu PIRCALABU Nov. 26, 2019, 3:15 p.m. UTC | #2

On Tue, 2019-11-26 at 12:03 +0000, Andrew Cooper wrote:
> ICEBP isn't handled well by SVM.
> 
> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to
> the
> appropriate instruction boundary (fault or trap, as appropriate),
> except for
> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP
> instruction
> rather than after it.  As ICEBP isn't distinguished in the vectoring
> event
> type, the state is ambiguous.
> 
> To add to the confusion, an ICEBP which occurs due to Introspection
> intercepting the instruction, or from x86_emulate() will have %rip
> updated as
> a consequence of partial emulation required to inject an ICEBP event
> in the
> first place.
> 
> We could in principle spot the non-injected case in the TASK_SWITCH
> handler,
> but this still results in complexity if the ICEBP instruction also
> has an
> Instruction Breakpoint active on it (which genuinely has fault
> semantics).
> 
> Unconditionally intercept ICEBP.  This does have a trap semantics for
> the
> intercept, and allows us to move %rip forwards appropriately before
> the
> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-
> vectored
> switches consistent however the ICEBP #DB came about, and avoids
> special cases
> in the TASK_SWITCH intercept.
> 
> This in turn allows for the removal of the conditional
> hvm_set_icebp_interception() logic used by the monitor subsystem, as
> ICEBP's
> will now always be submitted for monitoring checks.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>

Jan Beulich Nov. 26, 2019, 3:32 p.m. UTC | #3

On 26.11.2019 13:03, Andrew Cooper wrote:
> ICEBP isn't handled well by SVM.
> 
> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
> appropriate instruction boundary (fault or trap, as appropriate), except for
> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
> rather than after it.  As ICEBP isn't distinguished in the vectoring event
> type, the state is ambiguous.
> 
> To add to the confusion, an ICEBP which occurs due to Introspection
> intercepting the instruction, or from x86_emulate() will have %rip updated as
> a consequence of partial emulation required to inject an ICEBP event in the
> first place.
> 
> We could in principle spot the non-injected case in the TASK_SWITCH handler,
> but this still results in complexity if the ICEBP instruction also has an
> Instruction Breakpoint active on it (which genuinely has fault semantics).
> 
> Unconditionally intercept ICEBP.  This does have a trap semantics for the
> intercept, and allows us to move %rip forwards appropriately before the
> TASK_SWITCH intercept is hit.

Both because of you mentioning the moving forwards of %rip and with the
irc discussion in mind that we had no irc, don't you mean "fault
semantics" here? If so
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Otherwise I guess I'm still missing something.

> ---
>  xen/arch/x86/hvm/svm/svm.c    | 19 -------------------
>  xen/arch/x86/hvm/svm/vmcb.c   |  2 +-
>  xen/arch/x86/monitor.c        |  3 ---
>  xen/include/asm-x86/hvm/hvm.h | 11 -----------
>  4 files changed, 1 insertion(+), 34 deletions(-)

This, of course, is pretty nice in any event.

Jan

Roger Pau Monné Nov. 26, 2019, 3:34 p.m. UTC | #4

On Tue, Nov 26, 2019 at 12:03:56PM +0000, Andrew Cooper wrote:
> ICEBP isn't handled well by SVM.
> 
> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
> appropriate instruction boundary (fault or trap, as appropriate), except for
> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
> rather than after it.  As ICEBP isn't distinguished in the vectoring event
> type, the state is ambiguous.
> 
> To add to the confusion, an ICEBP which occurs due to Introspection
> intercepting the instruction, or from x86_emulate() will have %rip updated as
> a consequence of partial emulation required to inject an ICEBP event in the
> first place.
> 
> We could in principle spot the non-injected case in the TASK_SWITCH handler,
> but this still results in complexity if the ICEBP instruction also has an
> Instruction Breakpoint active on it (which genuinely has fault semantics).
> 
> Unconditionally intercept ICEBP.  This does have a trap semantics for the
> intercept, and allows us to move %rip forwards appropriately before the
> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
> switches consistent however the ICEBP #DB came about, and avoids special cases
> in the TASK_SWITCH intercept.
> 
> This in turn allows for the removal of the conditional
> hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
> will now always be submitted for monitoring checks.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

AFAICT this brings AMD implementation inline with Intel that also will
unconditionally vmexit on icebp?

Thanks, Roger.

Andrew Cooper Nov. 26, 2019, 3:59 p.m. UTC | #5

On 26/11/2019 15:32, Jan Beulich wrote:
> On 26.11.2019 13:03, Andrew Cooper wrote:
>> ICEBP isn't handled well by SVM.
>>
>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>> appropriate instruction boundary (fault or trap, as appropriate), except for
>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>> type, the state is ambiguous.
>>
>> To add to the confusion, an ICEBP which occurs due to Introspection
>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>> a consequence of partial emulation required to inject an ICEBP event in the
>> first place.
>>
>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>> but this still results in complexity if the ICEBP instruction also has an
>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>
>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>> intercept, and allows us to move %rip forwards appropriately before the
>> TASK_SWITCH intercept is hit.
> Both because of you mentioning the moving forwards of %rip and with the
> irc discussion in mind that we had no irc, don't you mean "fault
> semantics" here?

ICEBP really is too broken under SVM to handle architecturally.

The ICEBP intercept has nRIP decode support, because it is an
instruction intercept.  We emulate the injection (because it is ICEBP),
which means we re-enter the guest with %rip moved forward, and #DB
(HW_EXCEPTION) pending for injection.  This means that...

>  If so
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
the ICEBP instruction, rather than at it, making it consistent with
every other #DB-vectored TASK_SWITCH.

This does means that an early task-switch fault for ICEBP will reliably
be delivered with the wrong (i.e. trap) semantics, but this is less bad
than mixed fault/trap semantics depending on whether the source of the
ICEBP was introspection/emulation or native execution.

We could restore proper fault behaviour by extending
svm_emul_swint_injection() to figure out that a task switch is needed,
and invoke hvm_task_switch() directly, but I don't have enough TUITS
right now.

> Otherwise I guess I'm still missing something.

I hope this clears it up.

~Andrew

Jan Beulich Nov. 26, 2019, 4:05 p.m. UTC | #6

On 26.11.2019 16:59, Andrew Cooper wrote:
> On 26/11/2019 15:32, Jan Beulich wrote:
>> On 26.11.2019 13:03, Andrew Cooper wrote:
>>> ICEBP isn't handled well by SVM.
>>>
>>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>>> appropriate instruction boundary (fault or trap, as appropriate), except for
>>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>>> type, the state is ambiguous.
>>>
>>> To add to the confusion, an ICEBP which occurs due to Introspection
>>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>>> a consequence of partial emulation required to inject an ICEBP event in the
>>> first place.
>>>
>>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>>> but this still results in complexity if the ICEBP instruction also has an
>>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>>
>>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>>> intercept, and allows us to move %rip forwards appropriately before the
>>> TASK_SWITCH intercept is hit.
>> Both because of you mentioning the moving forwards of %rip and with the
>> irc discussion in mind that we had no irc, don't you mean "fault
>> semantics" here?
> 
> ICEBP really is too broken under SVM to handle architecturally.
> 
> The ICEBP intercept has nRIP decode support, because it is an
> instruction intercept.  We emulate the injection (because it is ICEBP),
> which means we re-enter the guest with %rip moved forward, and #DB
> (HW_EXCEPTION) pending for injection.  This means that...
> 
>>  If so
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> ... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
> the ICEBP instruction, rather than at it, making it consistent with
> every other #DB-vectored TASK_SWITCH.
> 
> This does means that an early task-switch fault for ICEBP will reliably
> be delivered with the wrong (i.e. trap) semantics, but this is less bad
> than mixed fault/trap semantics depending on whether the source of the
> ICEBP was introspection/emulation or native execution.
> 
> We could restore proper fault behaviour by extending
> svm_emul_swint_injection() to figure out that a task switch is needed,
> and invoke hvm_task_switch() directly, but I don't have enough TUITS
> right now.
> 
>> Otherwise I guess I'm still missing something.
> 
> I hope this clears it up.

Well, it helps, but you don't really answer the question: Is "trap"
in that sentence of the description really correct? I.e. don't you
instead mean "fault" there?

Jan

Andrew Cooper Nov. 26, 2019, 4:09 p.m. UTC | #7

On 26/11/2019 15:34, Roger Pau Monné wrote:
> On Tue, Nov 26, 2019 at 12:03:56PM +0000, Andrew Cooper wrote:
>> ICEBP isn't handled well by SVM.
>>
>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>> appropriate instruction boundary (fault or trap, as appropriate), except for
>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>> type, the state is ambiguous.
>>
>> To add to the confusion, an ICEBP which occurs due to Introspection
>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>> a consequence of partial emulation required to inject an ICEBP event in the
>> first place.
>>
>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>> but this still results in complexity if the ICEBP instruction also has an
>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>
>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>> intercept, and allows us to move %rip forwards appropriately before the
>> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
>> switches consistent however the ICEBP #DB came about, and avoids special cases
>> in the TASK_SWITCH intercept.
>>
>> This in turn allows for the removal of the conditional
>> hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
>> will now always be submitted for monitoring checks.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
>
> AFAICT this brings AMD implementation inline with Intel that also will
> unconditionally vmexit on icebp?

VT-x and SVM handle things quite differently.

VT-x has no instruction intercept for ICEBP, but the #DB intercept will
triggered by an ICEBP instruction.  ICEBP has its own event type
(Privileged Software Exception, which is an amusing name considering it
is an unprivleged instruction, bypasses privilege checks, and sets the
External bit in an error code).

SVM does have an instruction intercept for ICEBP, but the #DB from
ICEBP's don't trigger the normal #DB intercept.  However, secondary
#DB's generated by ICEBP's unintercepted #DB do trigger the #DB intercept.

For safety reasons we must intercept #DB to prevent CPU deadlocks.  This
means that ICEBP are in practice always intercepted on Intel due to
their #DB side effect, but they weren't intercepted on AMD, which is why
the monitor subsystem had a way of turning interception on.

So yes, the overall effect is that ICEBPs will now unconditionally
vmexit on both Intel and AMD, but underlying mechanism for why they
vmexit is still vendor-specific.

~Andrew

Andrew Cooper Nov. 26, 2019, 4:11 p.m. UTC | #8

On 26/11/2019 16:05, Jan Beulich wrote:
> On 26.11.2019 16:59, Andrew Cooper wrote:
>> On 26/11/2019 15:32, Jan Beulich wrote:
>>> On 26.11.2019 13:03, Andrew Cooper wrote:
>>>> ICEBP isn't handled well by SVM.
>>>>
>>>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>>>> appropriate instruction boundary (fault or trap, as appropriate), except for
>>>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>>>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>>>> type, the state is ambiguous.
>>>>
>>>> To add to the confusion, an ICEBP which occurs due to Introspection
>>>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>>>> a consequence of partial emulation required to inject an ICEBP event in the
>>>> first place.
>>>>
>>>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>>>> but this still results in complexity if the ICEBP instruction also has an
>>>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>>>
>>>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>>>> intercept, and allows us to move %rip forwards appropriately before the
>>>> TASK_SWITCH intercept is hit.
>>> Both because of you mentioning the moving forwards of %rip and with the
>>> irc discussion in mind that we had no irc, don't you mean "fault
>>> semantics" here?
>> ICEBP really is too broken under SVM to handle architecturally.
>>
>> The ICEBP intercept has nRIP decode support, because it is an
>> instruction intercept.  We emulate the injection (because it is ICEBP),
>> which means we re-enter the guest with %rip moved forward, and #DB
>> (HW_EXCEPTION) pending for injection.  This means that...
>>
>>>  If so
>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>> ... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
>> the ICEBP instruction, rather than at it, making it consistent with
>> every other #DB-vectored TASK_SWITCH.
>>
>> This does means that an early task-switch fault for ICEBP will reliably
>> be delivered with the wrong (i.e. trap) semantics, but this is less bad
>> than mixed fault/trap semantics depending on whether the source of the
>> ICEBP was introspection/emulation or native execution.
>>
>> We could restore proper fault behaviour by extending
>> svm_emul_swint_injection() to figure out that a task switch is needed,
>> and invoke hvm_task_switch() directly, but I don't have enough TUITS
>> right now.
>>
>>> Otherwise I guess I'm still missing something.
>> I hope this clears it up.
> Well, it helps, but you don't really answer the question: Is "trap"
> in that sentence of the description really correct? I.e. don't you
> instead mean "fault" there?

I've reworded that bit to:

Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
instruction intercept, which allows us allows us to move %rip forwards
appropriately before the TASK_SWITCH intercept is hit.  This allows...

Any better?

~Andrew

Jan Beulich Nov. 26, 2019, 4:14 p.m. UTC | #9

On 26.11.2019 17:11, Andrew Cooper wrote:
> On 26/11/2019 16:05, Jan Beulich wrote:
>> On 26.11.2019 16:59, Andrew Cooper wrote:
>>> On 26/11/2019 15:32, Jan Beulich wrote:
>>>> On 26.11.2019 13:03, Andrew Cooper wrote:
>>>>> ICEBP isn't handled well by SVM.
>>>>>
>>>>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>>>>> appropriate instruction boundary (fault or trap, as appropriate), except for
>>>>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>>>>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>>>>> type, the state is ambiguous.
>>>>>
>>>>> To add to the confusion, an ICEBP which occurs due to Introspection
>>>>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>>>>> a consequence of partial emulation required to inject an ICEBP event in the
>>>>> first place.
>>>>>
>>>>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>>>>> but this still results in complexity if the ICEBP instruction also has an
>>>>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>>>>
>>>>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>>>>> intercept, and allows us to move %rip forwards appropriately before the
>>>>> TASK_SWITCH intercept is hit.
>>>> Both because of you mentioning the moving forwards of %rip and with the
>>>> irc discussion in mind that we had no irc, don't you mean "fault
>>>> semantics" here?
>>> ICEBP really is too broken under SVM to handle architecturally.
>>>
>>> The ICEBP intercept has nRIP decode support, because it is an
>>> instruction intercept.  We emulate the injection (because it is ICEBP),
>>> which means we re-enter the guest with %rip moved forward, and #DB
>>> (HW_EXCEPTION) pending for injection.  This means that...
>>>
>>>>  If so
>>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>> ... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
>>> the ICEBP instruction, rather than at it, making it consistent with
>>> every other #DB-vectored TASK_SWITCH.
>>>
>>> This does means that an early task-switch fault for ICEBP will reliably
>>> be delivered with the wrong (i.e. trap) semantics, but this is less bad
>>> than mixed fault/trap semantics depending on whether the source of the
>>> ICEBP was introspection/emulation or native execution.
>>>
>>> We could restore proper fault behaviour by extending
>>> svm_emul_swint_injection() to figure out that a task switch is needed,
>>> and invoke hvm_task_switch() directly, but I don't have enough TUITS
>>> right now.
>>>
>>>> Otherwise I guess I'm still missing something.
>>> I hope this clears it up.
>> Well, it helps, but you don't really answer the question: Is "trap"
>> in that sentence of the description really correct? I.e. don't you
>> instead mean "fault" there?
> 
> I've reworded that bit to:
> 
> Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
> instruction intercept, which allows us allows us to move %rip forwards
> appropriately before the TASK_SWITCH intercept is hit.  This allows...
> 
> Any better?

Ah yes, thanks. (But please drop one of the two "allows us".)

Jan

Andrew Cooper Nov. 26, 2019, 4:16 p.m. UTC | #10

On 26/11/2019 16:14, Jan Beulich wrote:
> On 26.11.2019 17:11, Andrew Cooper wrote:
>> On 26/11/2019 16:05, Jan Beulich wrote:
>>> On 26.11.2019 16:59, Andrew Cooper wrote:
>>>> On 26/11/2019 15:32, Jan Beulich wrote:
>>>>> On 26.11.2019 13:03, Andrew Cooper wrote:
>>>>>> ICEBP isn't handled well by SVM.
>>>>>>
>>>>>> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
>>>>>> appropriate instruction boundary (fault or trap, as appropriate), except for
>>>>>> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
>>>>>> rather than after it.  As ICEBP isn't distinguished in the vectoring event
>>>>>> type, the state is ambiguous.
>>>>>>
>>>>>> To add to the confusion, an ICEBP which occurs due to Introspection
>>>>>> intercepting the instruction, or from x86_emulate() will have %rip updated as
>>>>>> a consequence of partial emulation required to inject an ICEBP event in the
>>>>>> first place.
>>>>>>
>>>>>> We could in principle spot the non-injected case in the TASK_SWITCH handler,
>>>>>> but this still results in complexity if the ICEBP instruction also has an
>>>>>> Instruction Breakpoint active on it (which genuinely has fault semantics).
>>>>>>
>>>>>> Unconditionally intercept ICEBP.  This does have a trap semantics for the
>>>>>> intercept, and allows us to move %rip forwards appropriately before the
>>>>>> TASK_SWITCH intercept is hit.
>>>>> Both because of you mentioning the moving forwards of %rip and with the
>>>>> irc discussion in mind that we had no irc, don't you mean "fault
>>>>> semantics" here?
>>>> ICEBP really is too broken under SVM to handle architecturally.
>>>>
>>>> The ICEBP intercept has nRIP decode support, because it is an
>>>> instruction intercept.  We emulate the injection (because it is ICEBP),
>>>> which means we re-enter the guest with %rip moved forward, and #DB
>>>> (HW_EXCEPTION) pending for injection.  This means that...
>>>>
>>>>>  If so
>>>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>> ... the ICEBP-#DB-vectored TASK_SWITCH will now find %rip pointing after
>>>> the ICEBP instruction, rather than at it, making it consistent with
>>>> every other #DB-vectored TASK_SWITCH.
>>>>
>>>> This does means that an early task-switch fault for ICEBP will reliably
>>>> be delivered with the wrong (i.e. trap) semantics, but this is less bad
>>>> than mixed fault/trap semantics depending on whether the source of the
>>>> ICEBP was introspection/emulation or native execution.
>>>>
>>>> We could restore proper fault behaviour by extending
>>>> svm_emul_swint_injection() to figure out that a task switch is needed,
>>>> and invoke hvm_task_switch() directly, but I don't have enough TUITS
>>>> right now.
>>>>
>>>>> Otherwise I guess I'm still missing something.
>>>> I hope this clears it up.
>>> Well, it helps, but you don't really answer the question: Is "trap"
>>> in that sentence of the description really correct? I.e. don't you
>>> instead mean "fault" there?
>> I've reworded that bit to:
>>
>> Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
>> instruction intercept, which allows us allows us to move %rip forwards
>> appropriately before the TASK_SWITCH intercept is hit.  This allows...
>>
>> Any better?
> Ah yes, thanks. (But please drop one of the two "allows us".)

Oops yes.  Irritatingly, that causes #DB-vectored to move onto a new
line, and trigger Git's comment syntax.  I'll tweak a little bit more.

~Andrew

Roger Pau Monné Nov. 27, 2019, 8:55 a.m. UTC | #11

On Tue, Nov 26, 2019 at 04:09:08PM +0000, Andrew Cooper wrote:
> On 26/11/2019 15:34, Roger Pau Monné wrote:
> > On Tue, Nov 26, 2019 at 12:03:56PM +0000, Andrew Cooper wrote:
> >> ICEBP isn't handled well by SVM.
> >>
> >> The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
> >> appropriate instruction boundary (fault or trap, as appropriate), except for
> >> an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
> >> rather than after it.  As ICEBP isn't distinguished in the vectoring event
> >> type, the state is ambiguous.
> >>
> >> To add to the confusion, an ICEBP which occurs due to Introspection
> >> intercepting the instruction, or from x86_emulate() will have %rip updated as
> >> a consequence of partial emulation required to inject an ICEBP event in the
> >> first place.
> >>
> >> We could in principle spot the non-injected case in the TASK_SWITCH handler,
> >> but this still results in complexity if the ICEBP instruction also has an
> >> Instruction Breakpoint active on it (which genuinely has fault semantics).
> >>
> >> Unconditionally intercept ICEBP.  This does have a trap semantics for the
> >> intercept, and allows us to move %rip forwards appropriately before the
> >> TASK_SWITCH intercept is hit.  This makes the behaviour of #DB-vectored
> >> switches consistent however the ICEBP #DB came about, and avoids special cases
> >> in the TASK_SWITCH intercept.
> >>
> >> This in turn allows for the removal of the conditional
> >> hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
> >> will now always be submitted for monitoring checks.
> >>
> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> >
> > AFAICT this brings AMD implementation inline with Intel that also will
> > unconditionally vmexit on icebp?
> 
> VT-x and SVM handle things quite differently.
> 
> VT-x has no instruction intercept for ICEBP, but the #DB intercept will
> triggered by an ICEBP instruction.  ICEBP has its own event type
> (Privileged Software Exception, which is an amusing name considering it
> is an unprivleged instruction, bypasses privilege checks, and sets the
> External bit in an error code).
> 
> SVM does have an instruction intercept for ICEBP, but the #DB from
> ICEBP's don't trigger the normal #DB intercept.  However, secondary
> #DB's generated by ICEBP's unintercepted #DB do trigger the #DB intercept.
> 
> For safety reasons we must intercept #DB to prevent CPU deadlocks.  This
> means that ICEBP are in practice always intercepted on Intel due to
> their #DB side effect, but they weren't intercepted on AMD, which is why
> the monitor subsystem had a way of turning interception on.
> 
> So yes, the overall effect is that ICEBPs will now unconditionally
> vmexit on both Intel and AMD, but underlying mechanism for why they
> vmexit is still vendor-specific.

Thanks for the detailed explanation, I realized vmx didn't have a
ICEBP specific VMEXIT reason, but I assumed the #DB would be trapped
as that's how the monitor system intercepts those.

Roger.

[v2,2/3] x86/svm: Always intercept ICEBP

Commit Message

Comments

Patch