diff mbox series

[3/3] x86/pv: Don't use IST for NMI/#MC/#DB in !CONFIG_PV builds

Message ID 20200420145911.5708-4-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Headers show
Series x86: IST cleanup | expand

Commit Message

Andrew Cooper April 20, 2020, 2:59 p.m. UTC
ISTs are used to force a stack switch on CPL0=>0 interrupts/exceptions.  They
however come with a nasty corner case in the case of reentrancy where the
outer exception frame gets clobbered.

When the SYSCALL/SYSRET instructions aren't used, there is no need to use IST
for anything other than #DF, which reduces the number of corner cases.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/cpu/common.c       |  8 +++++---
 xen/include/asm-x86/processor.h | 12 +++++++++++-
 2 files changed, 16 insertions(+), 4 deletions(-)

Comments

Jan Beulich April 21, 2020, 7:48 a.m. UTC | #1
On 20.04.2020 16:59, Andrew Cooper wrote:
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -441,12 +441,18 @@ struct tss_page {
>  };
>  DECLARE_PER_CPU(struct tss_page, tss_page);
>  
> +/*
> + * Interrupt Stack Tables.  Used to force a stack switch on a CPL0=>0
> + * interrupt/exception.  #DF uses IST all the time to detect stack overflows
> + * cleanly.  NMI/#MC/#DB only need IST to cover the SYSCALL gap, and therefore
> + * only necessary with PV guests.
> + */

Is it really only the SYSCALL gap that we mean to cover? In particular
for #MC I'd suspect it is a good idea to switch stacks as well, to get
onto a distinct memory page in case the #MC was stack related. With
NMI it might as well be better to switch; I agree we don't need any
switching for #DB.

I also think that the comment at the top of current.h would want
updating with these adjustments (which I notice lacks the #DB part
anyway).

Jan
Andrew Cooper April 23, 2020, 6:49 p.m. UTC | #2
On 21/04/2020 08:48, Jan Beulich wrote:
> On 20.04.2020 16:59, Andrew Cooper wrote:
>> --- a/xen/include/asm-x86/processor.h
>> +++ b/xen/include/asm-x86/processor.h
>> @@ -441,12 +441,18 @@ struct tss_page {
>>  };
>>  DECLARE_PER_CPU(struct tss_page, tss_page);
>>  
>> +/*
>> + * Interrupt Stack Tables.  Used to force a stack switch on a CPL0=>0
>> + * interrupt/exception.  #DF uses IST all the time to detect stack overflows
>> + * cleanly.  NMI/#MC/#DB only need IST to cover the SYSCALL gap, and therefore
>> + * only necessary with PV guests.
>> + */
> Is it really only the SYSCALL gap that we mean to cover? In particular
> for #MC I'd suspect it is a good idea to switch stacks as well, to get
> onto a distinct memory page in case the #MC was stack related.

If #MC occurs on your stack, you have already lost.  The chances of only
taking a single #MC are tiny because the next-line prefetcher will be
doing its job (or it hits when the lines (plural) leave L3, which will
be in guest context at some point in the future.)

The very best you can hope for is to cleanly print something and crash -
even if you manage to run the handler, you've got no idea if the
interrupted context had a spinlock held, and Xen has no support for
changing to a different pcpu stack.

> With NMI it might as well be better to switch;

Why?  There is no benefit (with no SYSCALL in the picture), and a
downside which causes state loss.

>  I agree we don't need any
> switching for #DB.
>
> I also think that the comment at the top of current.h would want
> updating with these adjustments (which I notice lacks the #DB part
> anyway).

Oops - I totally forgot that for the XSA-260 fix.

~Andrew
Jan Beulich April 24, 2020, 6:21 a.m. UTC | #3
On 23.04.2020 20:49, Andrew Cooper wrote:
> On 21/04/2020 08:48, Jan Beulich wrote:
>> On 20.04.2020 16:59, Andrew Cooper wrote:
>>> --- a/xen/include/asm-x86/processor.h
>>> +++ b/xen/include/asm-x86/processor.h
>>> @@ -441,12 +441,18 @@ struct tss_page {
>>>  };
>>>  DECLARE_PER_CPU(struct tss_page, tss_page);
>>>  
>>> +/*
>>> + * Interrupt Stack Tables.  Used to force a stack switch on a CPL0=>0
>>> + * interrupt/exception.  #DF uses IST all the time to detect stack overflows
>>> + * cleanly.  NMI/#MC/#DB only need IST to cover the SYSCALL gap, and therefore
>>> + * only necessary with PV guests.
>>> + */
>> Is it really only the SYSCALL gap that we mean to cover? In particular
>> for #MC I'd suspect it is a good idea to switch stacks as well, to get
>> onto a distinct memory page in case the #MC was stack related.
> 
> If #MC occurs on your stack, you have already lost.  The chances of only
> taking a single #MC are tiny because the next-line prefetcher will be
> doing its job (or it hits when the lines (plural) leave L3, which will
> be in guest context at some point in the future.)

It would seem fishy behavior to me for the hardware to continue
issuing prefetches against a page that has just been noticed to
cause issues when accessed. Surely hardware at least _could_
internally mark such a page #UC or some such, to prevent further
prefetches to actually hit the bus.

> The very best you can hope for is to cleanly print something and crash -
> even if you manage to run the handler, you've got no idea if the
> interrupted context had a spinlock held, and Xen has no support for
> changing to a different pcpu stack.

But getting something printed is magnitudes better that hanging
or rebooting entirely silently.

Nevertheless, as long as you extend the description somewhat to
express this reasoning in some way
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(a little hesitantly though).

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 7b093cb421..d45495c701 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -732,15 +732,17 @@  void load_system_tables(void)
 		.rsp2 = 0x8600111111111111ul,
 
 		/*
-		 * MCE, NMI and Double Fault handlers get their own stacks.
+		 * #DF always uses a separate stack. NMI/#MC/#DB only need a
+		 * separate stacks when PV guests are used.
 		 * All others poisoned.
 		 */
 		.ist = {
-			[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
 			[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE,
+#ifdef CONFIG_PV
 			[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE,
+			[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
 			[IST_DB  - 1] = stack_top + IST_DB  * PAGE_SIZE,
-
+#endif
 			[IST_MAX ... ARRAY_SIZE(tss->ist) - 1] =
 				0x8600111111111111ul,
 		},
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index ea6e5497f4..33f2052c8e 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -441,12 +441,18 @@  struct tss_page {
 };
 DECLARE_PER_CPU(struct tss_page, tss_page);
 
+/*
+ * Interrupt Stack Tables.  Used to force a stack switch on a CPL0=>0
+ * interrupt/exception.  #DF uses IST all the time to detect stack overflows
+ * cleanly.  NMI/#MC/#DB only need IST to cover the SYSCALL gap, and therefore
+ * only necessary with PV guests.
+ */
 #define IST_NONE 0UL
 #define IST_DF   1UL
 #define IST_NMI  2UL
 #define IST_MCE  3UL
 #define IST_DB   4UL
-#define IST_MAX  4UL
+#define IST_MAX  (IS_ENABLED(CONFIG_PV) ? 4ul : 1ul)
 
 /* Set the Interrupt Stack Table used by a particular IDT entry. */
 static inline void set_ist(idt_entry_t *idt, unsigned int ist)
@@ -461,6 +467,8 @@  static inline void set_ist(idt_entry_t *idt, unsigned int ist)
 static inline void enable_each_ist(idt_entry_t *idt)
 {
     set_ist(&idt[TRAP_double_fault],  IST_DF);
+    if ( !IS_ENABLED(CONFIG_PV) )
+        return;
     set_ist(&idt[TRAP_nmi],           IST_NMI);
     set_ist(&idt[TRAP_machine_check], IST_MCE);
     set_ist(&idt[TRAP_debug],         IST_DB);
@@ -469,6 +477,8 @@  static inline void enable_each_ist(idt_entry_t *idt)
 static inline void disable_each_ist(idt_entry_t *idt)
 {
     set_ist(&idt[TRAP_double_fault],  IST_NONE);
+    if ( !IS_ENABLED(CONFIG_PV) )
+        return;
     set_ist(&idt[TRAP_nmi],           IST_NONE);
     set_ist(&idt[TRAP_machine_check], IST_NONE);
     set_ist(&idt[TRAP_debug],         IST_NONE);