diff mbox series

[2/2,4.15?] x86/shadow: encode full GFN in magic MMIO entries

Message ID ccf12da3-b3df-7be1-1898-992ec994b78f@suse.com (mailing list archive)
State New, archived
Series x86/shadow: further refinements to "fast fault path" suppression

Commit Message

Jan Beulich March 5, 2021, 3:37 p.m. UTC
Since we don't need to encode all of the PTE flags, we have enough bits
in the shadow entry to store the full GFN. Don't use literal numbers -
instead derive the involved values. Or, where derivation would become
too ugly, sanity-check the result (invoking #error to identify failure).

This then allows dropping from sh_l1e_mmio() again the guarding against
too large GFNs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I wonder if the respective check in sh_audit_l1_table() is actually
useful to retain with these changes.

Comments

Andrew Cooper March 5, 2021, 4:32 p.m. UTC | #1
On 05/03/2021 15:37, Jan Beulich wrote:
> Since we don't need to encode all of the PTE flags, we have enough bits
> in the shadow entry to store the full GFN. Don't use literal numbers -
> instead derive the involved values. Or, where derivation would become
> too ugly, sanity-check the result (invoking #error to identify failure).
>
> This then allows dropping from sh_l1e_mmio() again the guarding against
> too large GFNs.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I wonder if the respective check in sh_audit_l1_table() is actually
> useful to retain with these changes.
>
> --- a/xen/arch/x86/mm/shadow/types.h
> +++ b/xen/arch/x86/mm/shadow/types.h
> @@ -283,9 +283,17 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
>   * This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
>   * have reserved bits that we can use for this.  And even there it can only
>   * be used if we can be certain the processor doesn't use all 52 address bits.
> + *
> + * For the MMIO encoding (see below) we need the bottom 4 bits for
> + * identifying the kind of entry and a full GFN's worth of bits to encode
> + * the originating frame number.  Set all remaining bits to trigger
> + * reserved bit faults, if (see above) the hardware permits triggering such.
>   */
>  
> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
> +#define SH_L1E_MAGIC_NR_META_BITS 4
> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
> +                                      SH_L1E_MAGIC_NR_META_BITS)) | \
> +                           _PAGE_PRESENT)
>  
>  static inline bool sh_have_pte_rsvd_bits(void)
>  {
> @@ -294,7 +302,8 @@ static inline bool sh_have_pte_rsvd_bits
>  
>  static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
>  {
> -    return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
> +    BUILD_BUG_ON(!(PADDR_MASK & SH_L1E_MAGIC_MASK));
> +    return (sl1e.l1 & SH_L1E_MAGIC_MASK) == SH_L1E_MAGIC_MASK;
>  }
>  
>  /* Guest not present: a single magic value */
> @@ -320,20 +329,26 @@ static inline bool sh_l1e_is_gnp(shadow_
>  
>  /*
>   * MMIO: an invalid PTE that contains the GFN of the equivalent guest l1e.
> - * We store 28 bits of GFN in bits 4:32 of the entry.
> + * We store the GFN in bits 4:43 of the entry.
>   * The present bit is set, and the U/S and R/W bits are taken from the guest.
>   * Bit 3 is always 0, to differentiate from gnp above.
>   */
> -#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
> -#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
> -#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
> +#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
> +#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
> +#if SH_L1E_MMIO_MAGIC_BIT & (SH_L1E_MMIO_MAGIC_BIT - 1)
> +# error SH_L1E_MMIO_MAGIC_BIT needs to be a power of 2
> +#endif
> +#if SH_L1E_MMIO_MAGIC_BIT >> SH_L1E_MAGIC_NR_META_BITS
> +# error SH_L1E_MMIO_MAGIC_BIT and SH_L1E_MAGIC_NR_META_BITS are out of sync
> +#endif
> +#define SH_L1E_MMIO_MAGIC_MASK  (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
> +#define SH_L1E_MMIO_GFN_MASK    ~(SH_L1E_MMIO_MAGIC_MASK | _PAGE_RW | _PAGE_USER)

In practice, it is 4:36, because we have to limit shadow guests to 32
bits of gfn for XSA-173 (width of the superpage backpointer IIRC).

Also, this property is important for L1TF.  The more guest-controllable
bits we permit in here, the greater the chance of being vulnerable to
L1TF on massive machines.

(I'm a little concerned that I can't spot an L1TF comment which has been
made stale by these changes...  Probably the fault of whichever numpty
prepared the L1TF patches, because I'm certain we discussed this at the
time)

Going from 32 to 36 bits moves the upper safety barrier from TOP-4G to
TOP-64G, but I recall us agreeing that that was OK, especially as the main
safety guesstimate is "no RAM in the top quarter of the address space".

However, I don't think we want to accidentally creep beyond bit 36, so
I'd suggest that the easy fix here is just adjusting a nibble in the
MMIO masks.

~Andrew
Tim Deegan March 8, 2021, 9:39 a.m. UTC | #2
At 16:37 +0100 on 05 Mar (1614962265), Jan Beulich wrote:
> Since we don't need to encode all of the PTE flags, we have enough bits
> in the shadow entry to store the full GFN. Don't use literal numbers -
> instead derive the involved values. Or, where derivation would become
> too ugly, sanity-check the result (invoking #error to identify failure).
> 
> This then allows dropping from sh_l1e_mmio() again the guarding against
> too large GFNs.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Tim Deegan <tim@xen.org>

> I wonder if the respective check in sh_audit_l1_table() is actually
> useful to retain with these changes.

Yes, I think so.  We care about these PTEs being bogus for any reason,
not just the ones that this could address.

> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
> +#define SH_L1E_MAGIC_NR_META_BITS 4
> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
> +                                      SH_L1E_MAGIC_NR_META_BITS)) | \
> +                           _PAGE_PRESENT)

I don't think this makes the code any more readable, TBH, but if you
prefer it that's OK.  I'd be happier with it if you added a
BUILD_BUG_ON that checks that 1ULL << (PADDR_BITS - 1) is set in the
mask, since that's the main thing we care about.

> -#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
> -#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
> -#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
> +#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
> +#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)

IMO this would be more readable as a straight 0x8 (or even _PAGE_PWT).
The ack stands either way.

Cheers,

Tim.
Jan Beulich March 8, 2021, 12:05 p.m. UTC | #3
On 08.03.2021 10:39, Tim Deegan wrote:
> At 16:37 +0100 on 05 Mar (1614962265), Jan Beulich wrote:
>> Since we don't need to encode all of the PTE flags, we have enough bits
>> in the shadow entry to store the full GFN. Don't use literal numbers -
>> instead derive the involved values. Or, where derivation would become
>> too ugly, sanity-check the result (invoking #error to identify failure).
>>
>> This then allows dropping from sh_l1e_mmio() again the guarding against
>> too large GFNs.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Tim Deegan <tim@xen.org>

Thanks.

>> I wonder if the respective check in sh_audit_l1_table() is actually
>> useful to retain with these changes.
> 
> Yes, I think so.  We care about these PTEs being bogus for any reason,
> not just the ones that this could address.
> 
>> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
>> +#define SH_L1E_MAGIC_NR_META_BITS 4
>> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
>> +                                      SH_L1E_MAGIC_NR_META_BITS)) | \
>> +                           _PAGE_PRESENT)
> 
> I don't think this makes the code any more readable, TBH, but if you
> prefer it that's OK.  I'd be happier with it if you added a
> BUILD_BUG_ON that checks that 1ULL << (PADDR_BITS - 1) is set in the
> mask, since that's the main thing we care about.
> 
>> -#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
>> -#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
>> -#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
>> +#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
>> +#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
> 
> IMO this would be more readable as a straight 0x8 (or even _PAGE_PWT).
> The ack stands either way.

For both of your remarks, I agree that readability suffers by doing
it this way, so I'll undo some of it since you'd prefer it to be
more readable. My goal wasn't so much readability, though, as ease
of change down the road (no need to adjust multiple related
definitions) and documentation of why these values actually are the
way they are.

But let's first see anyway what further changes are needed in
response to Andrew's remarks.

Jan
Jan Beulich March 8, 2021, 12:42 p.m. UTC | #4
On 05.03.2021 17:32, Andrew Cooper wrote:
> On 05/03/2021 15:37, Jan Beulich wrote:
>> Since we don't need to encode all of the PTE flags, we have enough bits
>> in the shadow entry to store the full GFN. Don't use literal numbers -
>> instead derive the involved values. Or, where derivation would become
>> too ugly, sanity-check the result (invoking #error to identify failure).
>>
>> This then allows dropping from sh_l1e_mmio() again the guarding against
>> too large GFNs.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I wonder if the respective check in sh_audit_l1_table() is actually
>> useful to retain with these changes.
>>
>> --- a/xen/arch/x86/mm/shadow/types.h
>> +++ b/xen/arch/x86/mm/shadow/types.h
>> @@ -283,9 +283,17 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
>>   * This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
>>   * have reserved bits that we can use for this.  And even there it can only
>>   * be used if we can be certain the processor doesn't use all 52 address bits.
>> + *
>> + * For the MMIO encoding (see below) we need the bottom 4 bits for
>> + * identifying the kind of entry and a full GFN's worth of bits to encode
>> + * the originating frame number.  Set all remaining bits to trigger
>> + * reserved bit faults, if (see above) the hardware permits triggering such.
>>   */
>>  
>> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
>> +#define SH_L1E_MAGIC_NR_META_BITS 4
>> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
>> +                                      SH_L1E_MAGIC_NR_META_BITS)) | \
>> +                           _PAGE_PRESENT)
>>  
>>  static inline bool sh_have_pte_rsvd_bits(void)
>>  {
>> @@ -294,7 +302,8 @@ static inline bool sh_have_pte_rsvd_bits
>>  
>>  static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
>>  {
>> -    return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
>> +    BUILD_BUG_ON(!(PADDR_MASK & SH_L1E_MAGIC_MASK));
>> +    return (sl1e.l1 & SH_L1E_MAGIC_MASK) == SH_L1E_MAGIC_MASK;
>>  }
>>  
>>  /* Guest not present: a single magic value */
>> @@ -320,20 +329,26 @@ static inline bool sh_l1e_is_gnp(shadow_
>>  
>>  /*
>>   * MMIO: an invalid PTE that contains the GFN of the equivalent guest l1e.
>> - * We store 28 bits of GFN in bits 4:32 of the entry.
>> + * We store the GFN in bits 4:43 of the entry.
>>   * The present bit is set, and the U/S and R/W bits are taken from the guest.
>>   * Bit 3 is always 0, to differentiate from gnp above.
>>   */
>> -#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
>> -#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
>> -#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
>> +#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
>> +#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
>> +#if SH_L1E_MMIO_MAGIC_BIT & (SH_L1E_MMIO_MAGIC_BIT - 1)
>> +# error SH_L1E_MMIO_MAGIC_BIT needs to be a power of 2
>> +#endif
>> +#if SH_L1E_MMIO_MAGIC_BIT >> SH_L1E_MAGIC_NR_META_BITS
>> +# error SH_L1E_MMIO_MAGIC_BIT and SH_L1E_MAGIC_NR_META_BITS are out of sync
>> +#endif
>> +#define SH_L1E_MMIO_MAGIC_MASK  (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
>> +#define SH_L1E_MMIO_GFN_MASK    ~(SH_L1E_MMIO_MAGIC_MASK | _PAGE_RW | _PAGE_USER)
> 
> In practice, it is 4:36, because we have to limit shadow guests to 32
> bits of gfn for XSA-173 (width of the superpage backpointer IIRC).

When !BIGMEM - yes.

> Also, this property is important for L1TF.  The more guest-controllable
> bits we permit in here, the greater the chance of being vulnerable to
> L1TF on massive machines.
> 
> (I'm a little concerned that I can't spot an L1TF comment which has been
> made stale by these changes...  Probably the fault of whichever numpty
> prepared the L1TF patches, because I'm certain we discussed this at the
> time)
> 
> Going from 32 to 36 bits moves the upper safety barrier from TOP-4G to
> TOP-64G, but I recall us agreeing that that was OK, especially as the main
> safety guesstimate is "no RAM in the top quarter of the address space".
> 
> However, I don't think we want to accidentally creep beyond bit 36, so
> I'd suggest that the easy fix here is just adjusting a nibble in the
> MMIO masks.

With BIGMEM I'm not sure we want to be this strict. Nor do we need
to, as long as we only need 4 bits at the bottom - we only go up to
bit 43 with what we allow guests control over. IOW we will need to
be careful on old hardware when l1d_maxphysaddr == 44, but on
anything newer we're still far enough away, I would think. So I
guess instead of outright dropping the GFN check from sh_l1e_mmio()
I want to replace it with a use of is_l1tf_safe_maddr() (on the
produced shadow_l1e_t).

Jan

Patch

--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -283,9 +283,17 @@  shadow_put_page_from_l1e(shadow_l1e_t sl
  * This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
  * have reserved bits that we can use for this.  And even there it can only
  * be used if we can be certain the processor doesn't use all 52 address bits.
+ *
+ * For the MMIO encoding (see below) we need the bottom 4 bits for
+ * identifying the kind of entry and a full GFN's worth of bits to encode
+ * the originating frame number.  Set all remaining bits to trigger
+ * reserved bit faults, if (see above) the hardware permits triggering such.
  */
 
-#define SH_L1E_MAGIC 0xffffffff00000001ULL
+#define SH_L1E_MAGIC_NR_META_BITS 4
+#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
+                                      SH_L1E_MAGIC_NR_META_BITS)) | \
+                           _PAGE_PRESENT)
 
 static inline bool sh_have_pte_rsvd_bits(void)
 {
@@ -294,7 +302,8 @@  static inline bool sh_have_pte_rsvd_bits
 
 static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
 {
-    return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
+    BUILD_BUG_ON(!(PADDR_MASK & SH_L1E_MAGIC_MASK));
+    return (sl1e.l1 & SH_L1E_MAGIC_MASK) == SH_L1E_MAGIC_MASK;
 }
 
 /* Guest not present: a single magic value */
@@ -320,20 +329,26 @@  static inline bool sh_l1e_is_gnp(shadow_
 
 /*
  * MMIO: an invalid PTE that contains the GFN of the equivalent guest l1e.
- * We store 28 bits of GFN in bits 4:32 of the entry.
+ * We store the GFN in bits 4:43 of the entry.
  * The present bit is set, and the U/S and R/W bits are taken from the guest.
  * Bit 3 is always 0, to differentiate from gnp above.
  */
-#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
-#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
-#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
+#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
+#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
+#if SH_L1E_MMIO_MAGIC_BIT & (SH_L1E_MMIO_MAGIC_BIT - 1)
+# error SH_L1E_MMIO_MAGIC_BIT needs to be a power of 2
+#endif
+#if SH_L1E_MMIO_MAGIC_BIT >> SH_L1E_MAGIC_NR_META_BITS
+# error SH_L1E_MMIO_MAGIC_BIT and SH_L1E_MAGIC_NR_META_BITS are out of sync
+#endif
+#define SH_L1E_MMIO_MAGIC_MASK  (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
+#define SH_L1E_MMIO_GFN_MASK    ~(SH_L1E_MMIO_MAGIC_MASK | _PAGE_RW | _PAGE_USER)
 
 static inline shadow_l1e_t sh_l1e_mmio(gfn_t gfn, u32 gflags)
 {
     unsigned long gfn_val = MASK_INSR(gfn_x(gfn), SH_L1E_MMIO_GFN_MASK);
 
-    if ( !sh_have_pte_rsvd_bits() ||
-         gfn_x(gfn) != MASK_EXTR(gfn_val, SH_L1E_MMIO_GFN_MASK) )
+    if ( !sh_have_pte_rsvd_bits() )
         return shadow_l1e_empty();
 
     return (shadow_l1e_t) { (SH_L1E_MMIO_MAGIC | gfn_val |